Using wget to get around robots.txt download restrictions

Source: Internet | Editor: 程序博客网 | Time: 2024/04/30 11:41

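If a site places a robots.txt file in its root directory and disallows wget in it, then wget, which honors robots.txt during recursive retrieval, will by default not download the full contents of the site.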

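For example, suppose you run wget -r http://www.example.com and www.example.com serves a robots.txt containing a record for wget (a User-agent: wget line followed by Disallow: /). In that case wget will, by default, not download the site.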

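To make wget ignore the rules in robots.txt, add -e robots=off. Keep in mind that site administrators often add robots.txt out of concern for server load, so remember to add --wait 1 as well, which makes wget pause one second between requests and reduces the burden your download puts on the server.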
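The two options described above can be combined with a recursive download roughly as follows. This is a minimal sketch: the host is the placeholder www.example.com from the example above, and the variable names URL and WGET_CMD are only illustrative. The script prints the command rather than running it, so it can be reviewed first.

```shell
#!/bin/sh
# Placeholder host taken from the example in the text.
URL="http://www.example.com"

# -r             follow links recursively
# -e robots=off  execute the wgetrc command "robots = off", so robots.txt rules are ignored
# --wait 1       pause 1 second between retrievals to limit server load
WGET_CMD="wget -r -e robots=off --wait 1 $URL"

# Print the command for review; to actually run it, use: eval "$WGET_CMD"
echo "$WGET_CMD"
```

Increasing the --wait value is the simplest way to be gentler on a server that is slow or heavily loaded.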

References:

http://addictivecode.org/FrequentlyAskedQuestions#How_can_I_make_Wget_ignore_the_robots.txt_file.2BAC8-no-follow_attribute.3F
http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/: 15 typical wget usage scenarios

A commonly used combination of wget options for saving an entire website:

wget --mirror --no-parent --page-requisites --convert-links \
     --no-host-directories --cut-dirs=2 \
     --load-cookies cookies.txt --directory-prefix=. \
     http://www.example.com/wiki/index.php/Main_Page