Common wget Examples


  • 1 COMMON COMMANDS
    • 1-1 Download a single file from the Internet
    • 1-2 Download a file but save it locally under a different name
    • 1-3 Download a file and save it in a specific folder
    • 1-4 Resume an interrupted download previously started by wget itself
    • 1-5 Download a file, but only if the version on the server is newer than your local copy
    • 1-6 Download multiple URLs with wget. Put the list of URLs in a text file, one per line, and pass it to wget
    • 1-7 Download a list of sequentially numbered files from a server
    • 1-8 Download a web page with all assets, like stylesheets and inline images, that are required to properly display the page offline
  • 2 MIRROR WEBSITES WITH WGET
    • 2-1 Download an entire website including all the linked pages and files
    • 2-2 Download all the MP3 files from a subdirectory
    • 2-3 Download all images from a website into a common folder
    • 2-4 Download the PDF documents from a website through recursion but stay within specific domains
    • 2-5 Download all files from a website but exclude a few directories
  • 3 WGET FOR DOWNLOADING RESTRICTED CONTENT
    • 3-1 Download files from websites that check the User Agent and the HTTP Referer
    • 3-2 Download files from a password-protected site
    • 3-3 Fetch pages that are behind a login page. Replace user and password with the actual form fields; the URL should point to the form's submit (action) page
    • 3-4 Find the size of a file without downloading it (look for Content-Length in the response; the size is in bytes)
    • 3-5 Download a file and display the content on screen without saving it locally
    • 3-6 Know the last-modified date of a web page (check the Last-Modified header in the HTTP response)
    • 3-7 Check the links on your website to ensure that they are working (the spider option will not save the pages locally)
  • 4 WGET: HOW TO BE NICE TO THE SERVER

1) COMMON COMMANDS

1-1) Download a single file from the Internet

wget http://example.com/file.iso

1-2) Download a file but save it locally under a different name

wget --output-document=filename.html example.com

1-3) Download a file and save it in a specific folder

wget --directory-prefix=folder/subfolder example.com

1-4) Resume an interrupted download previously started by wget itself

wget --continue example.com/big.file.iso

1-5) Download a file, but only if the version on the server is newer than your local copy

wget --continue --timestamping wordpress.org/latest.zip

1-6) Download multiple URLs with wget. Put the list of URLs in a text file, one per line, and pass it to wget.

wget --input-file=list-of-file-urls.txt
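
For illustration, list-of-file-urls.txt is just a plain-text file with one URL per line, for example (placeholder URLs):

http://example.com/disc1.iso
http://example.com/disc2.iso
http://example.com/disc3.iso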

1-7) Download a list of sequentially numbered files from a server

wget http://example.com/images/{1..20}.jpg
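
Note that the {1..20} range is expanded by the shell (e.g. Bash), not by wget itself. If your shell lacks brace expansion, an equivalent loop works just as well (a minimal sketch using the same placeholder URL):

for i in $(seq 1 20); do wget "http://example.com/images/$i.jpg"; done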

1-8) Download a web page with all assets, like stylesheets and inline images, that are required to properly display the web page offline.

wget --page-requisites --span-hosts --convert-links --adjust-extension http://example.com/dir/file

2) MIRROR WEBSITES WITH WGET

2-1) Download an entire website including all the linked pages and files

wget --execute robots=off --recursive --no-parent --continue --no-clobber http://example.com/

2-2) Download all the MP3 files from a subdirectory

wget --level=1 --recursive --no-parent --accept mp3,MP3 http://example.com/mp3/

2-3) Download all images from a website into a common folder

wget --directory-prefix=files/pictures --no-directories --recursive --no-clobber --accept jpg,gif,png,jpeg http://example.com/images/

2-4) Download the PDF documents from a website through recursion, but stay within specific domains.

wget --mirror --domains=abc.com,files.abc.com,docs.abc.com --accept=pdf http://abc.com/

2-5) Download all files from a website but exclude a few directories.

wget --recursive --no-clobber --no-parent --exclude-directories /forums,/support http://example.com

3) WGET FOR DOWNLOADING RESTRICTED CONTENT

Wget can download content from sites that sit behind a login screen, as well as from sites that check the HTTP Referer and User Agent strings of the client to prevent screen scraping.

3-1) Download files from websites that check the User Agent and the HTTP Referer

wget --referer=http://google.com --user-agent="Mozilla/5.0 Firefox/4.0.1" http://nytimes.com

3-2) Download files from a password-protected site

wget --http-user=labnol --http-password=hello123 http://example.com/secret/file.zip
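
If you would rather not type the password on the command line (where it can end up in shell history and process listings), recent versions of wget can prompt for it instead; a sketch with the same placeholder credentials:

wget --http-user=labnol --ask-password http://example.com/secret/file.zip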

3-3) Fetch pages that are behind a login page. Replace user and password with the actual form field names, and point the URL at the form's submit (action) page.

wget --save-cookies cookies.txt --keep-session-cookies --post-data 'user=labnol&password=123' http://example.com/login.php
wget --load-cookies cookies.txt --keep-session-cookies http://example.com/paywall

RETRIEVE FILE DETAILS WITH WGET

3-4) Find the size of a file without downloading it (look for Content-Length in the response; the size is in bytes)

wget --spider --server-response http://example.com/file.iso

3-5) Download a file and display the content on screen without saving it locally.

wget --output-document=- --quiet google.com/humans.txt

3-6) Know the last-modified date of a web page (check the Last-Modified header in the HTTP response).

wget --server-response --spider http://www.labnol.org/

3-7) Check the links on your website to ensure that they are working (the spider option will not save the pages locally)

wget --output-file=logfile.txt --recursive --spider http://example.com


4) WGET – HOW TO BE NICE TO THE SERVER?

The wget tool is essentially a spider that scrapes/leeches web pages, but some web hosts may block such spiders via their robots.txt file. Also, wget will not follow links on web pages that use the rel=nofollow attribute.

You can, however, force wget to ignore the robots.txt and the nofollow directives by adding the switch --execute robots=off to all your wget commands. If a web host is blocking wget requests by looking at the User Agent string, you can always fake that with the --user-agent=Mozilla switch.
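
Putting those two switches together, a crawl that ignores robots.txt and masquerades as a browser might look like the following sketch (the site and the User Agent string are placeholders):

wget --execute robots=off --user-agent="Mozilla/5.0" --recursive --no-parent http://example.com/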

The wget command will put additional strain on the site’s server because it will continuously traverse the links and download files. A good scraper would therefore limit the retrieval rate and also include a wait period between consecutive fetch requests to reduce the server load.

wget --limit-rate=20k --wait=60 --random-wait --mirror example.com

In the above example, the download bandwidth is limited to 20 KB/s and, because --random-wait varies the pause between 0.5 and 1.5 times the --wait value, wget will wait anywhere between 30 and 90 seconds before retrieving the next resource.

Finally, a little quiz. What do you think this wget command will do?

wget --span-hosts --level=inf --recursive dmoz.org

Reference:
1. https://www.labnol.org/software/wget-command-examples/28750/
