Having ditched Windows entirely at the start of the year and made the switch to Linux (the Ubuntu flavour), I’ve discovered a range of tools that would have come in very handy in my days as a Windows user. One of those tools is wget, a piece of software that downloads files from web servers via the command line. That in itself doesn’t sound very exciting, but once you start wielding some of its options you can do interesting things with it. To showcase some of the things wget can do, here is a collection of one-liners that you might find interesting or useful. I haven’t come up with them all myself; most are collected from around the net, from forums and places such as Commandlinefu:
[To understand the options that follow the wget command, you’ll need to refer to the wget documentation]
Download a single file
Start with an easy one!
Download an entire website
If you don’t want to be courteous, you can drop the --random-wait switch, provided you don’t mind running the risk of getting banned. If you only want to download the site to a certain depth, use the -l switch followed by a number, e.g. “-l 2” for a depth of 2. Downloading an entire website can also be achieved using the --mirror option.
wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com
Download an entire ftp directory using wget
Handy if you don’t want to spark up an FTP client.
wget -r ftp://username:password@example.org/ or wget -r --ftp-user=username --ftp-password=password ftp://example.org/
Check for broken links using wget as a spider
This is the command line equivalent of a Windows-based tool called Xenu’s Link Sleuth. It will spider an entire website, ignoring robots.txt, and generate a log file of all broken links. Handy.
wget --spider -o wget.log -e robots=off --wait 1 -r -p http://www.example.com
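Once the spider run finishes, the log can be grepped for failures. A quick sketch, using a fabricated two-line sample purely for illustration (in a real run, wget.log comes from the -o option above and the URL appears shortly before its status line):

```shell
# Fabricated sample of the sort of thing the spider writes to wget.log.
printf -- '--2020-01-01--  http://www.example.com/missing.html\nHTTP request sent, awaiting response... 404 Not Found\n' > wget.log

# grep -B 1 pulls in the preceding line, which names the URL that failed.
grep -B 1 '404 Not Found' wget.log
```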
Get server information
wget -S -q -O /dev/null http://www.example.com 2>&1 | grep Server
Diff remote webpages using wget
diff <(wget -q -O - http://www.example.com) <(wget -q -O - http://www.example2.com)
Schedule a download
If you've got a big file to download, you can schedule it for a quiet hour like this. Couple the command with a cron job instead and you could feasibly create snapshots of a website at regular intervals.
echo 'wget http://www.example.com' | at 01:00
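For the recurring-snapshot idea, a crontab entry along these lines would do it (the path and schedule are placeholders of my own; note that % has to be escaped in a crontab):

```
# Hypothetical crontab line: at 01:00 daily, save the page under a dated name.
0 1 * * * wget -q -O /var/backups/example-$(date +\%F).html http://www.example.com
```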