Wget: downloading files selectively and recursively?

Question about wget, subfolder, and index.html.

Let’s say I am looking at the “travels/” folder on “website.com”: “website.com/travels/”.

The “travels/” folder contains many files and subfolders: “website.com/travels/list.doc”, “website.com/travels/cover.png”, “website.com/travels/[1990] America/”, “website.com/travels/[1994] Japan/”, and so on.

How can I download only the “.mov” and “.jpg” files that reside in the subfolders? I don’t want to pick up files located directly in “travels/” (e.g. not “website.com/travels/list.doc”).

I found a wget command (on Unix & Linux Stack Exchange; I don’t remember which discussion) that downloads only the “index.html” of each subfolder and none of the other contents. Why does it download only the index files?

Answer

This command will download only images and movies from a given website:

wget -nd -r -P /save/location -A jpeg,jpg,bmp,gif,png,mov "http://www.somedomain.com"

According to the wget man page:

-nd prevents the creation of a directory hierarchy (i.e. no directories).

-r enables recursive retrieval. See Recursive Download for more information.

-P sets the directory prefix where all files and directories are saved to.

-A sets a whitelist for retrieving only certain file types. Strings and patterns are accepted, and both can be used in a comma separated list (as seen above). See Types of Files for more information.
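To make the role of each flag easier to see, here is a minimal sketch that assembles the same command from its parts and prints it instead of running it. The function name media_wget_cmd is hypothetical and not part of wget; it only joins the accepted extensions into the comma-separated list that -A expects.

```shell
# Hypothetical helper (not part of wget): prints the command, does not run it.
# $1 = save location for -P, $2 = start URL, remaining args = accepted extensions.
media_wget_cmd() {
  dest=$1
  url=$2
  shift 2
  # Join the remaining arguments with commas to form the -A accept list.
  accept=$(IFS=,; printf '%s' "$*")
  printf 'wget -nd -r -P %s -A %s "%s"\n' "$dest" "$accept" "$url"
}
```

For example, media_wget_cmd /save/location "http://www.somedomain.com" jpg mov prints a command that accepts only .jpg and .mov files.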

If you would like to limit the download to a given folder and its subfolders, add the flag --no-parent, as in this command:

wget -r -l1 --no-parent -P /save/location -A jpeg,jpg,bmp,gif,png,mov "http://www.somedomain.com"

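-r: enables recursive retrieval
-l1: sets the maximum recursion depth to 1, so wget follows only the links found on the starting page; files inside subfolders are one level deeper and need -l2 or higher
--no-parent: never ascends to the parent directory; only the specified directory and everything below it is downloaded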
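For the case in the question (media from the subfolders of “travels/” but not from “travels/” itself), one possible approach is to download recursively and then delete whatever landed at the top level. The sketch below assumes a depth of 2 is enough, since files inside a subfolder are two links away from the starting page, and it assumes the save location is a dedicated folder. Both function names are hypothetical, and the prune step is an addition of this sketch, not something wget does.

```shell
# Sketch only: the function names and the prune step are assumptions,
# not part of the original answer.

# Delete regular files that sit directly in "$1", leaving subfolders intact.
prune_top_level_files() {
  for f in "$1"/*; do
    if [ -f "$f" ]; then
      rm -f -- "$f"
    fi
  done
}

# Fetch .mov and .jpg files below URL "$2" into "$1", then drop top-level ones.
# -nH omits the host directory and --cut-dirs=1 omits "travels/", so the
# subfolders land directly under "$1".
fetch_subfolder_media() {
  wget -r -l2 --no-parent -nH --cut-dirs=1 -P "$1" -A mov,jpg "$2" &&
    prune_top_level_files "$1"
}
```

It could be called as fetch_subfolder_media /save/location "http://website.com/travels/". Because the prune step removes every regular file directly inside the save location, that folder should not hold anything else.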

Regarding index.html: once -A is given, wget keeps only files whose names match the accept list. During a recursive download wget still fetches HTML pages in order to extract the links they contain, but if html is not in the accept list it deletes each page afterwards and prints a message like the following in the terminal:

Removing /save/location/default.htm since it should be rejected.
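To see which files were fetched and then discarded, the log can be filtered for that message. This is a minimal sketch that assumes the message wording shown above (an English locale) and a log written with wget's -o option; the function name list_rejected is hypothetical.

```shell
# Hypothetical helper: reads a wget log on stdin and prints the path of
# every file that wget removed because it did not match the accept list.
list_rejected() {
  sed -n 's/^Removing \(.*\) since it should be rejected\.$/\1/p'
}
```

For example, after running wget with -o wget.log, the rejected paths can be listed with list_rejected < wget.log.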

wget can download specific file types (e.g. jpg, jpeg, png, mov, avi, mpeg, etc.) when those files are reachable from the URL given to wget. For example:

Let’s say we would like to download .zip and .chd files from this website

The page at this link lists folders and .zip files (scroll to the end). Now, let’s say we run this command:

wget -r --no-parent -P /save/location -A chd,zip "https://archive.org/download/MAME0.139_MAME2010_Reference_Set_ROMs_CHDs_Samples/roms/"

This command will download the .zip files and, at the same time, create empty folders for the .chd files.

In order to download the .chd files, we need to extract the names of the empty folders and convert those folder names to their actual URLs. Then we put all the URLs of interest in a text file, file.txt, and finally feed this text file to wget, as follows:

wget -r --no-parent -P /save/location -A chd,zip -i file.txt

The previous command will download all the .chd files.
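The folder-to-URL step described above can be sketched as follows. It assumes the first run kept wget's default layout, where each file is saved under the save location in a folder named after the host followed by the URL path, and that the site is served over https. The function name empty_dirs_to_urls is hypothetical, find's -mindepth and -empty tests are GNU/BSD extensions, and folder names that contain spaces or other special characters would still need URL-encoding, which this sketch does not do.

```shell
# Hypothetical helper: prints one URL per empty folder found under "$1".
# Assumes paths of the form <save location>/<host>/<url path>/ and https.
empty_dirs_to_urls() {
  root=${1%/}
  find "$root" -mindepth 1 -type d -empty | sort | while IFS= read -r dir; do
    # Strip the save location prefix; what remains is host plus URL path.
    printf 'https://%s/\n' "${dir#"$root"/}"
  done
}
```

Running empty_dirs_to_urls /save/location > file.txt would then produce the list that is passed to wget with -i.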

Attribution
Source : Link , Question Author : T. Caio , Answer Author : Community
