Using wget To Download Entire Websites
Reposted from: http://jamsubuntu.blogspot.com/2009/02/using-wget-to-download-entire-websites.html
Basic wget Commands:
To download a file from the Internet, type:
wget http://www.example.com/downloads.zip
If you are downloading a large file, for example an ISO image, this could take some time. If your Internet connection goes down, you would normally have to start the download again from scratch, which is very annoying if it is a 700MB ISO image on a slow connection. To get around this problem, you can use the -c parameter, which resumes the download after any interruption, e.g.:
wget -c http://www.example.com/linux.iso
I have come across some websites that do not allow you to download any files using a download manager. To get around this, change the user agent that wget reports:
wget -U mozilla http://www.example.com/image.jpg
This will pass wget off as a Mozilla web browser.
Downloading Entire Sites:
Wget can also download an entire website. Because this can put a heavy load on the server, wget obeys the site's robots.txt file by default.
wget -r -p http://www.example.com
The -p parameter tells wget to download all the files needed to display each page properly, such as images and stylesheets, so the saved HTML pages look the way they should.
So what if you don't want wget to obey the robots.txt file? You can simply add -e robots=off to the command, like this:
wget -r -p -e robots=off http://www.example.com
Many sites will not let you download the entire site; they check your browser's identity and block tools like wget. To get around this, use -U mozilla as explained above:
wget -r -p -e robots=off -U mozilla http://www.example.com
Many website owners will not like the fact that you are downloading their entire site. If the server sees you downloading a large number of files, it may automatically add you to its blacklist. The way around this is to wait a few seconds between downloads, which wget supports via --wait=X (where X is the number of seconds).
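For example, a mirror command with a 10-second pause between requests (the 10-second value and the URL are only placeholders) might look like this:
wget --wait=10 -r -p -e robots=off -U mozilla http://www.example.com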
You can also use the --random-wait parameter to let wget choose a random number of seconds to wait. To include this in the command:
wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com
Other Useful wget Parameters:
--limit-rate=20k : Limits the download rate to 20 KB/s.
-b : Runs wget in the background, so the download continues after you log out. Very useful if you are connected to your home PC via SSH.
-o $HOME/wget_log.txt : Writes wget's output to a log file in your home directory. Useful when running wget in the background, since you can check the log for any errors. A combined example is shown after this list.
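As a sketch of how these options might fit together for a long-running, unattended mirror job (the log path, rate limit, and URL are just illustrative values), the command could look like this:
wget -b -o $HOME/wget_log.txt --limit-rate=20k --random-wait -r -p -e robots=off -U mozilla http://www.example.com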