Programing Excavation: January 2014

Hi to all,
I recently ran into a problem while trying to download the contents of an external Yum repository of Linux RPMs for offline use (the Cloudera Hadoop repository).
It's nice to just right-click and download when we're talking about a couple of files,
but what do you do when we're talking about hundreds of files that take up to 19 GB?

Yes, my home PC is running Windows, so the most convenient way I've found to download the whole repository is via "Wget". You probably know the command from Linux, but there is a Windows implementation as well.
It can be found at Wget 32 bit.
Just download it and let's start working:

First, you'll need to find the source directory you want to download,
and in the script, replace the **Wanted Source URL** placeholder with the actual source you want to download.

Second, put the "wget.exe" file in the directory you want to download into, and put the batch file shown below in the same place.

Here's the script:

Then just run the batch file by double-clicking it or from the Windows command line.
The process might take a while, but eventually all the files will be downloaded.

Note that you won't need all of the files you download, because Wget really does take everything,
so I've added some flags to Wget that filter out some of the unneeded files. You can adjust the command with more options, which you can read about in the Wget usage.
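Before kicking off a download this large, it can help to preview what a reject list would filter. The snippet below is only a rough sketch in Python of how I understand Wget treats a `-R` list: an entry containing a wildcard (`*`, `?`, `[`, `]`) is matched as a pattern, anything else as a file name suffix. The function name `is_rejected`, the sample file names and the reject list `index.html*,*.gif` are all made up for illustration; they are not taken from my batch file, and this is not Wget's actual matching code.

```python
from fnmatch import fnmatchcase

# Characters that make Wget treat a list entry as a pattern instead of a suffix.
WILDCARDS = set("*?[]")


def is_rejected(filename: str, reject_list: str) -> bool:
    """Approximate whether a file name matches a comma-separated -R list."""
    for entry in reject_list.split(","):
        entry = entry.strip()
        if not entry:
            continue
        if WILDCARDS & set(entry):
            # Entry contains a wildcard: match it as a shell-style pattern.
            if fnmatchcase(filename, entry):
                return True
        elif filename.endswith(entry):
            # No wildcard: match it as a plain file name suffix.
            return True
    return False


# Hypothetical file names and reject list, for illustration only.
names = [
    "index.html",
    "index.html?C=N;O=D",
    "hadoop-2.0.0.x86_64.rpm",
    "repomd.xml",
]
kept = [n for n in names if not is_rejected(n, "index.html*,*.gif")]
print(kept)
```

With this made-up list, the directory index pages are filtered out and the RPM and metadata files are kept, which is the kind of result you'd want from a repository mirror.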

But I'll briefly explain the flags I've used (taking the URL http://hostname/aaa/bbb/ccc/ddd/ as an example):

wget: recursively download all files from a given directory

--execute="robots = off": Specify whether the norobots convention (robots.txt) is respected by Wget; "off" tells Wget to ignore it

--mirror: Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings

--convert-links: After the download is complete, convert the links in the document to make them suitable for local viewing

--no-parent: Do not ever ascend to the parent directory when retrieving recursively (with the example URL, nothing above /aaa/bbb/ccc/ddd/ is fetched)

--wait=5: Wait the specified number of seconds (here, 5) between the retrievals

-R: Specify a comma-separated list of file name suffixes or patterns to reject (its counterpart, -A, specifies what to accept)
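To show how the flags above fit together in one command, here is a minimal sketch in Python rather than as a batch file. It assumes "wget.exe" sits in the current directory, uses the example URL from above, and uses a placeholder reject list (`index.html*`) that you should swap for whatever you actually want to filter. The helper names `build_wget_command` and `download` are my own for this example.

```python
import subprocess
from typing import List


def build_wget_command(source_url: str, reject: str,
                       wget_path: str = "wget.exe") -> List[str]:
    """Assemble the Wget argument list from the flags described above."""
    return [
        wget_path,
        "--execute=robots=off",  # ignore robots.txt
        "--mirror",              # recursion, time-stamping, infinite depth
        "--convert-links",       # rewrite links for local viewing
        "--no-parent",           # never climb above the source directory
        "--wait=5",              # pause 5 seconds between retrievals
        "-R", reject,            # comma-separated suffixes/patterns to reject
        source_url,
    ]


def download(source_url: str, reject: str) -> None:
    """Run Wget and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_wget_command(source_url, reject), check=True)


# Placeholder values: replace the URL and the reject list with your own.
command = build_wget_command("http://hostname/aaa/bbb/ccc/ddd/", "index.html*")
print(" ".join(command))
```

The snippet only prints the command so you can check it first; calling `download(...)` with the same arguments would actually start the mirror in the current directory.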

So, in conclusion, you'll end up with all of the files on your local machine,
ready to be transferred for offline use.

Hope this helps you avoid the hard labor of downloading all of your files manually. (And to tell you the truth, it might be more efficient than my prior post on downloading an Eclipse update site.)

Demi Ben-Ari

Friday, January 10, 2014

How to download the content of a whole File Server in Windows