Wget can recursively download data or web pages. This is a key feature that Wget has and cURL lacks: while cURL is a library with a command-line front end, Wget is purely a command-line tool. Since recursive download requires several Wget options, it is perhaps best shown by example.
Example
The command below downloads ALL files from the targeted directory to the directory of your choice in a single invocation:

wget -e robots=off -r --no-parent <URL>

Breakdown of the command: -e robots=off ignores the site's robots.txt, -r downloads recursively, and --no-parent keeps Wget from climbing above the targeted directory, so the entire directory is fetched at once instead of one file at a time.

By default, this downloads the files to whatever directory you ran the command in. To use Wget to recursively download over FTP, simply change https:// to ftp:// and use the FTP directory's URL.
A few related options are worth knowing. -e robots=off makes Wget ignore the robots.txt file while crawling through pages; it is helpful if you are not getting all of the files. --mirror essentially mirrors the directory structure for the given URL. The -r switch tells Wget to recursively download every file on the page, and the -A.pdf switch tells it to download only PDF files; you could switch pdf to mp3, for instance, to download all MP3 files at the specified URL. Recursing over an entire site is not a big problem, but downloading only a specified directory can cause headaches when dealing with the different options.

GNU Wget is a command-line utility for downloading files from the web. With Wget, you can download files using the HTTP, HTTPS, and FTP protocols. Wget provides a number of options allowing you to download multiple files, resume downloads, limit the bandwidth, download recursively, download in the background, mirror a website, and much more.
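As a sketch of the accept-list idea, the following assembles such a command as a shell variable (the URL here is a hypothetical placeholder, not one from this page):

```shell
# Build a recursive, PDF-only download command.
# -r     : recurse into linked pages
# -A.pdf : accept only files ending in .pdf (swap for .mp3, .nc, etc.)
CMD="wget -r -A.pdf https://example.com/papers/"
echo "$CMD"
```

Running the echoed line in a terminal performs the actual download.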
Wget recursive download options
- --recursive: download recursively (and recreate the directory structure on your PC)
- --recursive --level=1: recurse, but don't go below the specified directory
- -Q 1g: the --quota option sets a total overall download limit, for example to stop downloading after 1 GB has been downloaded altogether
- -np: never get parent directories (sometimes a site will link upwards)
- -nc: no clobber; don't re-download files you already have
- -nd: no directory structure on download (put all files in one directory, chosen with -P)
- -nH: don't put vestigial site-name directories on your PC
- -A: only accept files matching a globbed pattern
- --cut-dirs=4: don't put a vestigial hierarchy of directories above the desired directory on your PC. Set the number equal to the number of directories on the server (here aaa/bbb/ccc/ddd is four)
- -e robots=off: many sites will block robots from mindlessly consuming huge amounts of data. Here we override this setting, telling Apache that we're (somewhat) human
- --random-wait: to avoid excessive download requests (which can get you auto-banned from downloading), we politely wait in between file downloads
- --wait 1: makes the random wait time average about 1 second before starting to download the next file. This helps avoid anti-leeching measures
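Put together, the options above might combine into a single command like the following sketch (the URL and its four-directory depth are hypothetical placeholders):

```shell
# Recursively download one directory, politely, with a 1 GB overall cap.
# -np keeps wget inside the target directory; -nH and --cut-dirs=4 strip
# the site name and the four parent directories from the saved paths.
CMD="wget -r -np -nc -nH --cut-dirs=4 -e robots=off --random-wait --wait 1 -Q 1g https://example.com/aaa/bbb/ccc/ddd/"
echo "$CMD"
```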
When you request a dataset download from the Data Portal, there are many ways to work with the results. Sometimes, rather than accessing the data through THREDDS (such as via .ncml or the subset service), you just want to download all of the files to work with on your own machine.
There are several methods you can use to download your delivered files from the server en masse, including:
- shell – curl or wget
- python – urllib2
- java – java.net.URL
Below, we detail how you can use wget or python to do this.
It’s important to note that the email notification you receive from the system will contain two different web links. They look very similar, but the directories they point to differ slightly.
First Link: https://opendap.oceanobservatories.org/thredds/catalog/ooi/sage-marine-rutgers/20171012T172409-CE02SHSM-SBD11-06-METBKA000-telemetered-metbk_a_dcl_instrument/catalog.html
The first link (which includes thredds/catalog/ooi) will point to your dataset on a THREDDS server. THREDDS provides additional capabilities to aggregate or subset the data files if you use a THREDDS- or OpenDAP-compatible client, like ncread in Matlab or pydap in Python.
Second Link: https://opendap.oceanobservatories.org/async_results/sage-marine-rutgers/20171012T172409-CE02SHSM-SBD11-06-METBKA000-telemetered-metbk_a_dcl_instrument
The second link points to a traditional Apache web directory. From here, you can download files directly to your machine by simply clicking on them.
Using wget
First you need to make sure you have wget installed on your machine. If you are on a Mac and have the Homebrew package manager installed, you can type the following in the terminal:

brew install wget
Alternatively, you can grab wget from GitHub: https://github.com/jay/wget
Once wget is installed, you can recursively download an entire directory of data using the following command (make sure you use the second (Apache) web link (URL) provided by the system when using this command):

wget -r -l1 -nd -nc -np -e robots=off -A.nc --no-check-certificate https://opendap.oceanobservatories.org/async_results/sage-marine-rutgers/20171012T172409-CE02SHSM-SBD11-06-METBKA000-telemetered-metbk_a_dcl_instrument

A simpler version that omits some of the optional flags may also work.
Here is an explanation of the specified flags.
- -r signifies that wget should recursively download data in any subdirectories it finds.
- -l1 sets the maximum recursion to 1 level of subfolders.
- -nd copies all matching files to current directory. If two files have identical names it appends an extension.
- -nc does not download a file if it already exists.
- -np prevents files from parent directories from being downloaded.
- -e robots=off tells wget to ignore the robots.txt file. If this command is left out, the robots.txt file tells wget that it does not like web crawlers and this will prevent wget from working.
- -A.nc restricts downloading to the specified file types (with .nc suffix in this case)
- --no-check-certificate disregards the SSL certificate check. This is useful if the SSL certificate is set up incorrectly, but make sure you only do this on servers you trust.
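Taken together, the flags above form the full command. Here it is assembled step by step in the shell, with the example Apache link from earlier on this page standing in for your own dataset URL:

```shell
# The example Apache directory link from earlier on this page;
# substitute the URL from your own notification email.
URL="https://opendap.oceanobservatories.org/async_results/sage-marine-rutgers/20171012T172409-CE02SHSM-SBD11-06-METBKA000-telemetered-metbk_a_dcl_instrument"

# -r -l1 : recurse, but only one level of subfolders
# -nd    : no directory structure; put all files in the current directory
# -nc -np: skip files you already have; never ascend to parent directories
# -A.nc  : accept only .nc files
CMD="wget -r -l1 -nd -nc -np -e robots=off -A.nc --no-check-certificate $URL"
echo "$CMD"
```

Paste the echoed line into a terminal to start the download.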
Using python
wget is rather blunt and will download all the files it finds in a directory, though as noted you can restrict it to a specific file extension.
If you want to be more granular about which files you download, you can use Python to parse through the data file links it finds and have it download only the files you really want. This is especially useful when your download request results in a lot of large data files, or if the request includes files from many different instruments that you may not need.

Here is an example script that uses the THREDDS service to find all .nc files included in the download request. Under the hood, THREDDS provides a catalog.xml file which we can use to extract the links to the available data files. This XML file is much easier to parse than raw HTML.
The first part of the main() function creates an array of all of the files we would like to download (in this case, only ones ending in .nc), and the second part actually downloads them using urllib.urlretrieve(). If you want to download only files from particular instruments, or within specific date ranges, you can customize the code to filter out just the files you want (e.g. using regex).
Don’t forget to update the server_url and request_url variables before running the code. You may also need to install the required libraries if you don’t already have them on your machine.
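Although the script itself is Python, the extraction step it performs can be sketched in shell. In the sketch below the catalog.xml is an inline sample so the snippet is self-contained; the dataset names and urlPath values are made up for illustration, and the fileServer URL pattern in the final comment is the usual THREDDS convention, not something specific to this request:

```shell
# A real catalog would be fetched first, e.g.:
#   wget -qO catalog.xml "$server_url/thredds/catalog/.../catalog.xml"
# Here we write a small sample instead.
cat > catalog.xml <<'EOF'
<catalog>
  <dataset name="deployment0004_metbk.nc" urlPath="ooi/deployment0004_metbk.nc"/>
  <dataset name="catalog.html" urlPath="ooi/catalog.html"/>
  <dataset name="deployment0005_metbk.nc" urlPath="ooi/deployment0005_metbk.nc"/>
</catalog>
EOF

# Keep only the .nc entries, as the Python script's filter does.
nc_files=$(grep -o 'urlPath="[^"]*\.nc"' catalog.xml | sed 's/^urlPath="//; s/"$//')
echo "$nc_files"

# Each entry could then be downloaded (THREDDS convention, hypothetical host):
#   for f in $nc_files; do wget "$server_url/thredds/fileServer/$f"; done
```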
— Last revised on May 31, 2018 —