Fetch robots.txt for a Site (chilkat/python study notes, part 3): finding robots.txt

Source: Internet. Editor: 程序博客网. Time: 2024/04/30 08:13

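This topic is fairly interesting. Anyone familiar with web crawlers should already know robots.txt, so I won't go over what it does.
chilkat provides a function for fetching robots.txt, so let's see how it works. I picked this list of sites:
url1 = "www.google.cn"
url2 = "www.baidu.com"
url3 = "www.sina.com.cn"
url4 = "www.sohu.com"
url5 = "www.tom.com"
Google's robots.txt turns out to be quite interesting.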

Code:
  import chilkat

  spider = chilkat.CkSpider()
  url1 = "www.google.cn"
  url2 = "www.baidu.com"
  url3 = "www.sina.com.cn"
  url4 = "www.sohu.com"
  url5 = "www.tom.com"
  # Point the spider at the site to inspect (only url1 is used here)
  spider.Initialize(url1)
  # Fetch that site's robots.txt as a string
  robotsText = spider.fetchRobotsText()
  print(robotsText)
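The snippet above only fetches the first site in the list. As a rough sketch of how all five could be covered, the following uses Python's standard library rather than chilkat, so it is not the chilkat API. The helper names `robots_url`, `fetch_robots_text`, and `fetch_all` are made up for illustration, and the sketch assumes each site serves robots.txt over plain http at the conventional `/robots.txt` path.

```python
import urllib.error
import urllib.request

# The five hosts listed in the post.
HOSTS = [
    "www.google.cn",
    "www.baidu.com",
    "www.sina.com.cn",
    "www.sohu.com",
    "www.tom.com",
]


def robots_url(host, scheme="http"):
    """Build the conventional robots.txt URL for a bare host name (hypothetical helper)."""
    return "%s://%s/robots.txt" % (scheme, host.strip("/"))


def fetch_robots_text(host, timeout=10):
    """Return the robots.txt body as text, or None if it cannot be fetched."""
    try:
        with urllib.request.urlopen(robots_url(host), timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Unreachable host, HTTP error, or timeout: report "no text" instead of crashing.
        return None


def fetch_all(hosts=HOSTS):
    """Map each host to its robots.txt text (None on failure). Performs network requests."""
    return {host: fetch_robots_text(host) for host in hosts}
```

Calling `fetch_all()` would return a dictionary keyed by host name. Some of these sites may no longer respond, in which case the value for that host is `None`.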
Here is the output for Google's robots.txt. Of course, you can also try fetching the robots.txt of any site you are interested in.

User-agent: *

Allow: /searchhistory/

Disallow: /news?output=xhtml&

Allow: /news?output=xhtml

Disallow: /search

Disallow: /groups

Disallow: /images

Disallow: /catalogs

Disallow: /catalogues

Disallow: /news

Disallow: /nwshp

Allow: /news?btcid=

Disallow: /news?btcid=*&

Allow: /news?btaid=

Disallow: /news?btaid=*&

Disallow: /setnewsprefs?

Disallow: /index.html?

Disallow: /?

Disallow: /addurl/image?

Disallow: /pagead/

Disallow: /relpage/

Disallow: /relcontent

Disallow: /sorry/

Disallow: /imgres

Disallow: /keyword/

Disallow: /u/

Disallow: /univ/

Disallow: /cobrand

Disallow: /custom

Disallow: /advanced_group_search

Disallow: /googlesite

Disallow: /preferences

Disallow: /setprefs

Disallow: /swr

Disallow: /url

Disallow: /default

Disallow: /m?

Disallow: /m/?

Disallow: /m/lcb

Disallow: /m/news?

Disallow: /m/setnewsprefs?

Disallow: /m/search?

Disallow: /wml?

Disallow: /wml/?

Disallow: /wml/search?

Disallow: /xhtml?

Disallow: /xhtml/?

Disallow: /xhtml/search?

Disallow: /xml?

Disallow: /imode?

Disallow: /imode/?

Disallow: /imode/search?

Disallow: /jsky?

Disallow: /jsky/?

Disallow: /jsky/search?

Disallow: /pda?

Disallow: /pda/?

Disallow: /pda/search?

Disallow: /sprint_xhtml

Disallow: /sprint_wml

Disallow: /pqa

Disallow: /palm

Disallow: /gwt/

Disallow: /purchases

Disallow: /hws

Disallow: /bsd?

Disallow: /linux?

Disallow: /mac?

Disallow: /microsoft?

Disallow: /unclesam?

Disallow: /answers/search?q=

Disallow: /local?

Disallow: /local_url

Disallow: /froogle?

Disallow: /products?

Disallow: /froogle_

Disallow: /product_

Disallow: /products_

Disallow: /print

Disallow: /books

Disallow: /patents?

Disallow: /scholar?

Disallow: /complete

Disallow: /sponsoredlinks

Disallow: /videosearch?

Disallow: /videopreview?

Disallow: /videoprograminfo?

Disallow: /maps?

Disallow: /mapstt?

Disallow: /mapslt?

Disallow: /maps/stk/

Disallow: /maps/br?

Disallow: /mapabcpoi?

Disallow: /translate?

Disallow: /center

Disallow: /ie?

Disallow: /sms/demo?

Disallow: /katrina?

Disallow: /blogsearch?

Disallow: /blogsearch/

Disallow: /blogsearch_feeds

Disallow: /advanced_blog_search

Disallow: /reader/

Disallow: /uds/

Disallow: /chart?

Disallow: /transit?

Disallow: /mbd?

Disallow: /extern_js/

Disallow: /calendar/feeds/

Disallow: /calendar/ical/

Disallow: /cl2/feeds/

Disallow: /cl2/ical/

Disallow: /coop/directory

Disallow: /coop/manage

Disallow: /trends?

Disallow: /trends/music?

Disallow: /notebook/search?

Disallow: /music

Disallow: /musica

Disallow: /musicad

Disallow: /musicas

Disallow: /musicl

Disallow: /musics

Disallow: /musicsearch

Disallow: /musicsp

Disallow: /musiclp

Disallow: /browsersync

Disallow: /call

Disallow: /archivesearch?

Disallow: /archivesearch/url

Disallow: /archivesearch/advanced_search

Disallow: /base/search?

Disallow: /base/reportbadoffer

Disallow: /base/s2

Disallow: /urchin_test/

Disallow: /movies?

Disallow: /codesearch?

Disallow: /codesearch/feeds/search?

Disallow: /wapsearch?

Disallow: /safebrowsing

Disallow: /reviews/search?

Disallow: /orkut/albums

Disallow: /jsapi

Disallow: /views?

Disallow: /c/

Disallow: /cbk

Disallow: /recharge/dashboard/car

Disallow: /recharge/dashboard/static/

Disallow: /translate_c?

Disallow: /s2/profiles/me

Allow: /s2/profiles

Disallow: /s2

Disallow: /transconsole/portal/

Disallow: /gcc/

Disallow: /aclk

Disallow: /cse?

Disallow: /tbproxy/

Disallow: /MerchantSearchBeta/

Disallow: /ime/

Disallow: /websites?

Disallow: /shenghuo/search?

Disallow: /support/forum/search?

Disallow: /reviews/polls/

 

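The Allow lines above (highlighted in yellow in the original post) are the paths crawlers are permitted to fetch. Google may be the godfather of crawlers, but it looks rather stingy toward everyone else's.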
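To see which paths such rules actually permit, they can be checked programmatically. The sketch below uses `urllib.robotparser` from Python's standard library (not chilkat) on a short excerpt copied from the output, written single-spaced so the rules stay in one group. The names `EXCERPT`, `build_parser`, and `allowed` are made up for illustration. Note that Python's parser applies the first rule that matches in file order, which may not be exactly how Google's own crawler resolves overlapping Allow and Disallow lines.

```python
from urllib.robotparser import RobotFileParser

# A few rules copied from the Google output, without blank lines in between.
EXCERPT = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
Disallow: /groups
Disallow: /s2/profiles/me
Allow: /s2/profiles
Disallow: /s2
"""


def build_parser(text):
    """Parse robots.txt text into a RobotFileParser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(EXCERPT)


def allowed(path, agent="*"):
    """Return True if the given user agent may fetch the path under EXCERPT."""
    return parser.can_fetch(agent, path)


# Print a small allowed/blocked report for some sample paths.
for path in ["/searchhistory/", "/search", "/s2/profiles/alice", "/s2/profiles/me", "/about"]:
    print(path, "allowed" if allowed(path) else "blocked")
```

Paths that match no rule at all, such as `/about` here, are treated as allowed, which is why the long Disallow list matters more than the handful of Allow lines.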



