Fetch robots.txt for a Site (chilkat/python study notes, part 3): finding robots.txt
Source: Internet  Editor: 程序博客网  Time: 2024/04/30 08:13
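This exercise is fairly interesting. Anyone who knows crawlers has met robots.txt, so I won't explain what it is for.
chilkat provides a function for fetching robots.txt, so let's see how it works. I picked these sites:
url1 = "www.google.cn"
url2 = "www.baidu.com"
url3 = "www.sina.com.cn"
url4 = "www.sohu.com"
url5 = "www.tom.com"
Google's robots.txt is quite interesting to read.
Code: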
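import chilkat

spider = chilkat.CkSpider()
# Candidate sites; only url1 is fetched below
url1 = "www.google.cn"
url2 = "www.baidu.com"
url3 = "www.sina.com.cn"
url4 = "www.sohu.com"
url5 = "www.tom.com"
# Initialize the spider with the domain to crawl
spider.Initialize(url1)
# Fetch the site's robots.txt as a string
robotsText = spider.fetchRobotsText()
print(robotsText)
Output: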
User-agent: *
Allow: /searchhistory/
Disallow: /news?output=xhtml&
Allow: /news?output=xhtml
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalogues
Disallow: /news
Disallow: /nwshp
Allow: /news?btcid=
Disallow: /news?btcid=*&
Allow: /news?btaid=
Disallow: /news?btaid=*&
Disallow: /setnewsprefs?
Disallow: /index.html?
Disallow: /?
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /relcontent
Disallow: /sorry/
Disallow: /imgres
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /cobrand
Disallow: /custom
Disallow: /advanced_group_search
Disallow: /googlesite
Disallow: /preferences
Disallow: /setprefs
Disallow: /swr
Disallow: /url
Disallow: /default
Disallow: /m?
Disallow: /m/?
Disallow: /m/lcb
Disallow: /m/news?
Disallow: /m/setnewsprefs?
Disallow: /m/search?
Disallow: /wml?
Disallow: /wml/?
Disallow: /wml/search?
Disallow: /xhtml?
Disallow: /xhtml/?
Disallow: /xhtml/search?
Disallow: /xml?
Disallow: /imode?
Disallow: /imode/?
Disallow: /imode/search?
Disallow: /jsky?
Disallow: /jsky/?
Disallow: /jsky/search?
Disallow: /pda?
Disallow: /pda/?
Disallow: /pda/search?
Disallow: /sprint_xhtml
Disallow: /sprint_wml
Disallow: /pqa
Disallow: /palm
Disallow: /gwt/
Disallow: /purchases
Disallow: /hws
Disallow: /bsd?
Disallow: /linux?
Disallow: /mac?
Disallow: /microsoft?
Disallow: /unclesam?
Disallow: /answers/search?q=
Disallow: /local?
Disallow: /local_url
Disallow: /froogle?
Disallow: /products?
Disallow: /froogle_
Disallow: /product_
Disallow: /products_
Disallow: /print
Disallow: /books
Disallow: /patents?
Disallow: /scholar?
Disallow: /complete
Disallow: /sponsoredlinks
Disallow: /videosearch?
Disallow: /videopreview?
Disallow: /videoprograminfo?
Disallow: /maps?
Disallow: /mapstt?
Disallow: /mapslt?
Disallow: /maps/stk/
Disallow: /maps/br?
Disallow: /mapabcpoi?
Disallow: /translate?
Disallow: /center
Disallow: /ie?
Disallow: /sms/demo?
Disallow: /katrina?
Disallow: /blogsearch?
Disallow: /blogsearch/
Disallow: /blogsearch_feeds
Disallow: /advanced_blog_search
Disallow: /reader/
Disallow: /uds/
Disallow: /chart?
Disallow: /transit?
Disallow: /mbd?
Disallow: /extern_js/
Disallow: /calendar/feeds/
Disallow: /calendar/ical/
Disallow: /cl2/feeds/
Disallow: /cl2/ical/
Disallow: /coop/directory
Disallow: /coop/manage
Disallow: /trends?
Disallow: /trends/music?
Disallow: /notebook/search?
Disallow: /music
Disallow: /musica
Disallow: /musicad
Disallow: /musicas
Disallow: /musicl
Disallow: /musics
Disallow: /musicsearch
Disallow: /musicsp
Disallow: /musiclp
Disallow: /browsersync
Disallow: /call
Disallow: /archivesearch?
Disallow: /archivesearch/url
Disallow: /archivesearch/advanced_search
Disallow: /base/search?
Disallow: /base/reportbadoffer
Disallow: /base/s2
Disallow: /urchin_test/
Disallow: /movies?
Disallow: /codesearch?
Disallow: /codesearch/feeds/search?
Disallow: /wapsearch?
Disallow: /safebrowsing
Disallow: /reviews/search?
Disallow: /orkut/albums
Disallow: /jsapi
Disallow: /views?
Disallow: /c/
Disallow: /cbk
Disallow: /recharge/dashboard/car
Disallow: /recharge/dashboard/static/
Disallow: /translate_c?
Disallow: /s2/profiles/me
Allow: /s2/profiles
Disallow: /s2
Disallow: /transconsole/portal/
Disallow: /gcc/
Disallow: /aclk
Disallow: /cse?
Disallow: /tbproxy/
Disallow: /MerchantSearchBeta/
Disallow: /ime/
Disallow: /websites?
Disallow: /shenghuo/search?
Disallow: /support/forum/search?
Disallow: /reviews/polls/
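
Once the text is in hand, the next step is usually to ask whether a given URL may be crawled. The sketch below is not part of the chilkat example: it swaps in Python's standard-library urllib.robotparser and feeds it a short excerpt of the rules printed above. The names ROBOTS_EXCERPT, build_parser and SAMPLE_URLS, and the sample URLs themselves, are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A short excerpt of the rules printed above (not the full file).
ROBOTS_EXCERPT = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
Disallow: /groups
Disallow: /images
"""

# Hypothetical URLs, chosen only to exercise the rules in the excerpt.
SAMPLE_URLS = (
    "http://www.google.cn/searchhistory/",
    "http://www.google.cn/search",
    "http://www.google.cn/about",
)


def build_parser(robots_text):
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_EXCERPT)

for url in SAMPLE_URLS:
    # "*" asks about the rules that apply to any user agent.
    print(url, parser.can_fetch("*", url))
```

The standard-library parser applies the first rule that matches, in file order. That is why /searchhistory/ is reported as allowed even though it also begins with /search: its Allow line comes first. Other crawlers may resolve such overlaps differently, so treat the result as one reading of the file rather than the only one. In the chilkat script, the robotsText string could be passed to build_parser in place of the excerpt.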