Wget Command Usage and Examples

Author: Sarath Pillai
Download files using wget

Most of us are used to download managers when downloading from the internet on Windows. There are a couple of nice graphical download managers for Windows, such as IDM and FDM. But what if you work in an environment where your primary interface is a Linux console without a graphical X Window session?

There is a good number of console-based download managers for Linux. The topic of this post is one of them: wget.

Wget is a wonderful tool for downloading files from the internet. It is a very old tool that Linux system admins use in their day-to-day tasks. Wget is more than just a download manager, so let's look at some of its capabilities.

Wget's main strength is recursive downloading. Just as we do things recursively in directories, wget can follow hyperlinks recursively. Wget is also very useful in scripts, because it is command based and runs without user interaction.

Linux users usually don't need to worry about installing wget: it's a pretty old tool and most distributions include it out of the box.

Let's see how wget works.

[root@slashroot ~]# wget
wget: missing URL
Usage: wget [OPTION]... [URL]...
Try `wget --help' for more options.

Simply typing the command won't give you anything, as shown above, because it requires at least one URL as an argument.

For example, let's download the dhcp package from a CentOS mirror, as shown below.

[root@myvm1 ~]# wget http://mirror.centos.org/centos/5/os/i386/CentOS/dhcp-3.0.5-31.el5.i386.rpm
--11:56:14--  http://mirror.centos.org/centos/5/os/i386/CentOS/dhcp-3.0.5-31.el5.i386.rpm
Resolving mirror.centos.org... 64.251.25.238
Connecting to mirror.centos.org|64.251.25.238|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 893723 (873K) [application/x-rpm]
Saving to: `dhcp-3.0.5-31.el5.i386.rpm.3'
100%[=======================================>] 893,723      140K/s   in 6.6s
11:56:22 (133 KB/s) - `dhcp-3.0.5-31.el5.i386.rpm.3' saved [893723/893723]

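The file is downloaded and saved in the directory where you ran the wget command. In the above case the dhcp package was saved in root's home directory. Note the .3 suffix in the output: files with that name already existed in the directory, so wget added a numeric suffix instead of overwriting them.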
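
If you would rather not save into the current directory, wget's -P (--directory-prefix) option sets the directory to save into. The sketch below wraps it in a small helper. The function name save_to and the ~/downloads path in the usage comment are placeholders for illustration, not something wget provides.

```shell
# Hypothetical helper: download URL ($2) into directory ($1).
# -P (--directory-prefix) makes wget save files under that directory.
save_to() {
    dir=$1
    url=$2
    mkdir -p "$dir" && wget -P "$dir" "$url"
}

# Usage (not run here):
#   save_to ~/downloads http://mirror.centos.org/centos/5/os/i386/CentOS/dhcp-3.0.5-31.el5.i386.rpm
```

To choose the file name rather than the directory, wget also has -O, which writes the download to the file you name.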

Mirror a Whole website using Wget

Wget has an option to mirror a remote website, which means you can download the entire site into a folder of your choice and then browse it locally.

--mirror

-m (short form of --mirror)

You can use either of the above options to mirror a website.

[root@slashroot ~]# wget -mrk in.yahoo.com
--20:26:55--  http://in.yahoo.com/
Resolving in.yahoo.com... 72.30.38.140, 2001:4998:c:401::c:9101, 2001:4998:c:401::c:9102
Connecting to in.yahoo.com|72.30.38.140|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Last-modified header missing -- time-stamps turned off.
--20:26:56--  http://in.yahoo.com/
Connecting to in.yahoo.com|72.30.38.140|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `in.yahoo.com/index.html'
 [              <=>                       ] 74,971      15.5K/s

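It will take a long time, depending on the size of the site.

I have used the -mrk options in the above command.

-m is the mirror option you already know. It turns on recursion and time-stamping and removes the depth limit.

-r fetches links recursively. -m already implies it, so it is redundant here but harmless.

-k converts the links in the downloaded pages so that they point to the local copies.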
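
The same mirror command can be written with long options, which are easier to read in scripts. This is a minimal sketch: mirror_site is a made-up helper name, and --page-requisites and --no-parent are additions that the command above does not use.

```shell
# Hypothetical helper: mirror a site for offline browsing.
#   --mirror           recursion + time-stamping, unlimited depth (same as -m)
#   --convert-links    rewrite links to point at the local copies (same as -k)
#   --page-requisites  also fetch the images and stylesheets the pages need
#   --no-parent        never climb above the starting directory
mirror_site() {
    wget --mirror --convert-links --page-requisites --no-parent "$1"
}

# Usage (not run here; a large site can take a long time):
#   mirror_site http://in.yahoo.com/
```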

Ignore robots.txt using Wget

robots.txt is a file that tells search engine crawlers which areas of a website they may visit. In the above scenario we mirrored a whole website. The robots.txt file tells spiders and crawlers to ignore the files or paths listed in it. When downloading recursively, wget behaves like a crawler and honors robots.txt, so it will skip anything the file disallows.

You can override this by telling wget to ignore robots.txt, as shown below.

[root@slashroot ~]# wget -e robots=off -r -nc yahoo.com
--20:36:26--  http://yahoo.com/
Resolving yahoo.com... 72.30.38.140, 98.138.253.109, 98.139.183.24
Connecting to yahoo.com|72.30.38.140|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.yahoo.com/ [following]

The -e option runs a .wgetrc-style command, and robots=off tells wget to ignore robots.txt files.

The -nc (no-clobber) option skips files that already exist in the directory instead of downloading them again.
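
A crawl that ignores robots.txt may put extra load on the server, so it can be worth pausing between requests. The sketch below adds wget's --wait and --random-wait options to the command above. The helper name polite_fetch and the two-second default are assumptions made for this example.

```shell
# Hypothetical helper: recursive fetch that ignores robots.txt
# but pauses between requests.
#   $1  URL to fetch
#   $2  base pause in seconds (optional, defaults to 2)
polite_fetch() {
    url=$1
    delay=${2:-2}
    # --wait sets the pause; --random-wait varies it from request to request
    wget -e robots=off -r -nc --wait="$delay" --random-wait "$url"
}

# Usage (not run here):
#   polite_fetch yahoo.com 5
```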

Specify filetypes that can be downloaded using Wget

You can also restrict the file types that wget downloads from a particular website URL. This can be done as shown below.

[root@slashroot ~]# wget -e robots=off -r --level=0 -nc --accept jpg,gif,bmp yahoo.com
--20:40:56--  http://yahoo.com/
Resolving yahoo.com... 72.30.38.140, 98.138.253.109, 98.139.183.24
Connecting to yahoo.com|72.30.38.140|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.yahoo.com/ [following]

The --level=0 option sets the recursion depth. A value of 0 means unlimited depth, so wget downloads everything it finds along the way.

--accept takes a comma-separated list of file name suffixes that are allowed to be downloaded.
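
Unlimited depth can pull in far more than you expect, so a finite level is often a safer starting point. A minimal sketch, assuming a depth of 2 is enough for your case; fetch_types is a hypothetical helper name.

```shell
# Hypothetical helper: fetch only the given file types from a site,
# following links at most two levels deep.
#   $1  URL to start from
#   $2  comma-separated list of suffixes, e.g. jpg,gif,bmp
fetch_types() {
    wget -r --level=2 -nc --accept "$2" "$1"
}

# Usage (not run here):
#   fetch_types yahoo.com jpg,gif,bmp
```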

How to download multiple files from URLs using Wget

Downloading multiple files is the same as downloading a single file. You just need to give the required URLs separated by spaces.

[root@slashroot ~]# wget http://mirror.centos.org/centos/5/os/i386/CentOS/dhcp-3.0.5-31.el5.i386.rpm http://www.google.com

Resume Downloads using Wget

The first thing to understand is that the ability to resume a download depends on the server you are downloading from.

It does not matter whether your download manager supports resuming. If the server does not support it (for HTTP, that means honoring range requests), the resume will fail.

A cancelled download can be resumed with the -c option of wget.

[root@slashroot ~]# wget -c yahoo.com

Replace yahoo.com with the URL you were downloading the file from, and run the command in the directory that holds the partial file.
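
On a flaky connection you may want to call wget -c again whenever it gives up. The loop below is one way to sketch that. The helper name resume_download and the default of five attempts are assumptions, and the server still has to support resuming for -c to help.

```shell
# Hypothetical helper: keep resuming a download until wget succeeds.
#   $1  URL to download
#   $2  maximum number of attempts (optional, defaults to 5)
resume_download() {
    url=$1
    attempts=${2:-5}
    n=1
    while [ "$n" -le "$attempts" ]; do
        # -c continues a partial file in the current directory, if one exists
        if wget -c "$url"; then
            return 0
        fi
        n=$((n + 1))
    done
    # Every attempt failed
    return 1
}

# Usage (not run here):
#   resume_download http://mirror.centos.org/centos/5/os/i386/CentOS/dhcp-3.0.5-31.el5.i386.rpm
```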

Limit Download Speed with Wget

Limiting the download speed comes in handy when other applications on the server need the bandwidth.

We can limit the speed of a download as shown below. Here 100k limits the download to 100 kilobytes per second; an m suffix would mean megabytes.

[root@slashroot ~]# wget --limit-rate=100k yahoo.com
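
In a script it can be convenient to keep the limit in one place. A small sketch, where limited_get is a made-up helper and 100k is just the default carried over from the example above:

```shell
# Hypothetical helper: download with a bandwidth cap.
#   $1  URL to download
#   $2  rate limit (optional, defaults to 100k; use an m suffix for megabytes)
limited_get() {
    url=$1
    rate=${2:-100k}
    wget --limit-rate="$rate" "$url"
}

# Usage (not run here):
#   limited_get yahoo.com 1m
```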

Specify password for URLs in wget

You can supply credentials for URLs that require authentication while downloading. This is done with the following options.

--http-user and --http-password

For example, suppose the site xyz.com requires a password for downloading. That can be done as follows.

[root@slashroot ~]# wget --http-user=sarath --http-password=123456 xyz.com
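
A password typed on the command line can be seen by other users in the process list and also ends up in your shell history. Newer wget releases have an --ask-password option that prompts for it instead. The sketch below assumes your wget is recent enough to have that option; auth_get is a hypothetical helper name.

```shell
# Hypothetical helper: download with a user name, prompting for the password.
#   $1  user name
#   $2  URL to download
auth_get() {
    user=$1
    url=$2
    # --ask-password makes wget prompt, so the password never appears in argv
    wget --user="$user" --ask-password "$url"
}

# Usage (not run here):
#   auth_get sarath xyz.com
```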

Download a number of files from different URLs using a file in Wget

Giving URLs on the command line separated by spaces is not advisable when you have a very large number of URLs.

You can create a file with any name you like and add all the URLs to it, one per line, as shown below.

[root@slashroot ~]# cat mydownloadlist
First URL
Second URL
Third URL
[root@slashroot ~]# wget -i mydownloadlist
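
If the URLs come from a script rather than from a hand-edited file, the list can be written and consumed in one step. A minimal sketch; fetch_all is a made-up helper, and the URLs in the usage comment are the ones used earlier in this post.

```shell
# Hypothetical helper: write the given URLs to a list file, one per line,
# then pass that file to wget -i.
#   $1       path of the list file to create (overwritten if it exists)
#   $2...    URLs to download
fetch_all() {
    list=$1
    shift
    printf '%s\n' "$@" > "$list"
    wget -i "$list"
}

# Usage (not run here):
#   fetch_all mydownloadlist \
#       http://mirror.centos.org/centos/5/os/i386/CentOS/dhcp-3.0.5-31.el5.i386.rpm \
#       http://www.google.com
```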

Comments

Hi,

As per the steps, I created a new user and started apache, but the first and last lines still show it running as the root user. Is this OK?

root 15511 1 0 19:43 ? 00:00:00 /usr/local/apache/bin/httpd -k start
apache 15512 15511 0 19:43 ? 00:00:00 /usr/local/apache/bin/httpd -k start
apache 15513 15511 0 19:43 ? 00:00:00 /usr/local/apache/bin/httpd -k start
apache 15514 15511 0 19:43 ? 00:00:00 /usr/local/apache/bin/httpd -k start
apache 15516 15511 0 19:43 ? 00:00:00 /usr/local/apache/bin/httpd -k start
root 15599 22931 0 19:43 pts/1 00:00:00 grep httpd

Sarath Pillai

Where does apache come into picture here?
