Common wget Examples
- 1 COMMON COMMANDS
- 1-1 Download a single file from the Internet
- 1-2 Download a file but save it locally under a different name
- 1-3 Download a file and save it in a specific folder
- 1-4 Resume an interrupted download previously started by wget itself
- 1-5 Download a file but only if the version on server is newer than your local copy
- 1-6 Download multiple URLs with wget: put the list of URLs in a text file, one per line, and pass it to wget
- 1-7 Download a list of sequentially numbered files from a server
- 1-8 Download a web page with all assets like stylesheets and inline images that are required to properly display the web page offline
- 2 MIRROR WEBSITES WITH WGET
- 2-1 Download an entire website including all the linked pages and files
- 2-2 Download all the MP3 files from a subdirectory
- 2-3 Download all images from a website in a common folder
- 2-4 Download the PDF documents from a website through recursion but stay within specific domains
- 2-5 Download all files from a website but exclude a few directories
- 3 WGET FOR DOWNLOADING RESTRICTED CONTENT
- 3-1 Download files from websites that check the User Agent and the HTTP Referer
- 3-2 Download files from password-protected sites
- 3-3 Fetch pages that are behind a login page; replace user and password with the actual form field names, and point the URL at the form's submit (action) page
- 3-4 Find the size of a file without downloading it (look for Content-Length in the response; the size is in bytes)
- 3-5 Download a file and display the content on screen without saving it locally
- 3-6 Know the last modified date of a web page (check the Last-Modified tag in the HTTP header)
- 3-7 Check the links on your website to ensure that they are working; the spider option will not save the pages locally
- 4 WGET HOW TO BE NICE TO THE SERVER
1) COMMON COMMANDS
1-1) Download a single file from the Internet
wget http://example.com/file.iso
1-2) Download a file but save it locally under a different name
wget --output-document=filename.html example.com
1-3) Download a file and save it in a specific folder
wget --directory-prefix=folder/subfolder example.com
1-4) Resume an interrupted download previously started by wget itself
wget --continue example.com/big.file.iso
1-5) Download a file but only if the version on the server is newer than your local copy
wget --continue --timestamping wordpress.org/latest.zip
1-6) Download multiple URLs with wget. Put the list of URLs in a text file, one per line, and pass it to wget.
wget --input-file=list-of-file-urls.txt
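The list file is plain text with one URL per line. A minimal list-of-file-urls.txt, with hypothetical URLs:
http://example.com/file1.iso
http://example.com/file2.iso
http://example.com/file3.iso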
1-7) Download a list of sequentially numbered files from a server
wget http://example.com/images/{1..20}.jpg
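Note that the {1..20} range is expanded by the shell (bash or zsh), not by wget itself. If the server zero-pads the filenames, a padded range (bash 4+) works the same way, again with a hypothetical URL:
wget http://example.com/images/{01..20}.jpg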
1-8) Download a web page with all assets, like stylesheets and inline images, that are required to properly display the web page offline.
wget --page-requisites --span-hosts --convert-links --adjust-extension http://example.com/dir/file
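The same command using wget's short options:
wget -p -H -k -E http://example.com/dir/file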
2) MIRROR WEBSITES WITH WGET
2-1) Download an entire website including all the linked pages and files
wget --execute robots=off --recursive --no-parent --continue --no-clobber http://example.com/
2-2) Download all the MP3 files from a subdirectory
wget --level=1 --recursive --no-parent --accept mp3,MP3 http://example.com/mp3/
2-3) Download all images from a website in a common folder
wget --directory-prefix=files/pictures --no-directories --recursive --no-clobber --accept jpg,gif,png,jpeg http://example.com/images/
2-4) Download the PDF documents from a website through recursion but stay within specific domains.
wget --mirror --domains=abc.com,files.abc.com,docs.abc.com --accept=pdf http://abc.com/
2-5) Download all files from a website but exclude a few directories.
wget --recursive --no-clobber --no-parent --exclude-directories /forums,/support http://example.com
3) WGET FOR DOWNLOADING RESTRICTED CONTENT
Wget can be used to download content from sites that are behind a login screen, or from sites that check the HTTP Referer and User Agent strings of the bot to prevent screen scraping.
3-1) Download files from websites that check the User Agent and the HTTP Referer
wget --referer=http://google.com --user-agent="Mozilla/5.0 Firefox/4.0.1" http://nytimes.com
3-2) Download files from password-protected sites
wget --http-user=labnol --http-password=hello123 http://example.com/secret/file.zip
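To keep the password out of your shell history, wget can instead prompt for it interactively with --ask-password:
wget --http-user=labnol --ask-password http://example.com/secret/file.zip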
3-3) Fetch pages that are behind a login page. Replace user and password with the actual form field names, and point the URL at the form's submit (action) page.
wget --cookies=on --save-cookies cookies.txt --keep-session-cookies --post-data 'user=labnol&password=123' http://example.com/login.php
wget --cookies=on --load-cookies cookies.txt --keep-session-cookies http://example.com/paywall
The first command logs in and saves the session cookies to cookies.txt; the second loads them to fetch the protected page.
RETRIEVE FILE DETAILS WITH WGET
3-4) Find the size of a file without downloading it (look for Content-Length in the response; the size is in bytes)
wget --spider --server-response http://example.com/file.iso
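Because wget prints the server response to stderr, you can filter out the relevant header on a Unix shell:
wget --spider --server-response http://example.com/file.iso 2>&1 | grep -i Content-Length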
3-5) Download a file and display the content on screen without saving it locally.
wget --output-document=- --quiet google.com/humans.txt
3-6) Know the last modified date of a web page (check the Last-Modified tag in the HTTP header).
wget --server-response --spider http://www.labnol.org/
3-7) Check the links on your website to ensure that they are working. The spider option will not save the pages locally.
wget --output-file=logfile.txt --recursive --spider http://example.com
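Recent wget versions summarize broken links at the end of a spider run, so you can filter the log afterwards (assuming English-language output):
grep -i 'broken link' logfile.txt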
4) WGET – HOW TO BE NICE TO THE SERVER?
The wget tool is essentially a spider that scrapes / leeches web pages, but some web hosts may block these spiders via their robots.txt file. Also, wget will not follow links on web pages that use the rel=nofollow attribute.
You can however force wget to ignore the robots.txt and nofollow directives by adding the switch --execute robots=off to all your wget commands. If a web host is blocking wget requests by looking at the User Agent string, you can always fake that with the --user-agent=Mozilla switch.
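For example, a recursive fetch that ignores robots.txt and presents a browser-like User Agent might look like this (hypothetical site):
wget --execute robots=off --user-agent="Mozilla/5.0" --recursive --no-parent http://example.com/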
The wget command will put additional strain on the site’s server because it will continuously traverse the links and download files. A good scraper would therefore limit the retrieval rate and also include a wait period between consecutive fetch requests to reduce the server load.
wget --limit-rate=20k --wait=60 --random-wait --mirror example.com
In the above example, we have limited the download bandwidth to 20 KB/s, and the wget utility will wait anywhere between 30 and 90 seconds (0.5 to 1.5 times the --wait value, thanks to --random-wait) before retrieving the next resource.
Finally, a little quiz. What do you think this wget command will do?
wget --span-hosts --level=inf --recursive dmoz.org
References:
1. https://www.labnol.org/software/wget-command-examples/28750/