A Casual Chat About the Robots Protocol

Source: Internet | Editor: 程序博客网 | Time: 2024/04/30 03:28

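To be honest, I came to search-engine topics fairly late. I first heard of the robots protocol through the 2012 "3B War", the dispute between 360 and Baidu.

In 2012, 360 launched its own search engine, 360 Search. Not long after release it jumped to second place among Chinese search engines, overtaking Sogou and trailing only Baidu.

Baidu then pointed out that its robots.txt did not permit 360's crawler, yet the crawler kept fetching content from Baidu sites such as Baidu Zhidao (百度知道) and Baidu Baike (百度百科), which Baidu said violated the internationally accepted Robots protocol.

For background on the dispute, see: http://baike.baidu.com/view/9230864.htm  That was how I first learned about the Robots protocol.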

 

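A quick Baidu search turned up the following explanation:

        The robots protocol (also called the crawler protocol, crawler rules, or robots exclusion protocol) is implemented as robots.txt. Through it, a website tells search engines which pages may be crawled and which may not. The robots protocol is a code of conduct generally accepted across the international internet community; its purpose is to protect site data and sensitive information and to keep users' personal information and privacy from being violated. Because it is not a command, search engines have to comply voluntarily. Malicious programs (malware) often ignore the robots protocol in order to get at a site's back-end data and personal information.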

       

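           robots.txt is a plain text file; any common text editor, such as Notepad on Windows, can create and edit it. robots.txt is a convention, not a command. It is the first file a search engine looks at when it visits a site, and it tells the spider which files on the server may be viewed.
When a search spider visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the robot uses the file's contents to determine what it may access; if it does not, every spider may access every page on the site that is not password protected. Baidu's official advice is to use robots.txt only when your site contains content you do not want search engines to index. If you want search engines to index everything, do not create a robots.txt file.
If you think of a website as a hotel, robots.txt is the "Do Not Disturb" or "Please Clean" sign hung on each room's door. It tells visiting search engines which rooms they may enter and look around, and which rooms are closed to them because they hold valuables or may involve the privacy of residents and guests. But robots.txt is neither a command nor a firewall, just as a doorman cannot stop a burglar or any other malicious intruder.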

 

 

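That is Baidu's explanation. Robots is only a convention: if a crawler chooses not to follow it, nothing technical stops it, and the only recourse is to go to court.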
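To make the "check robots.txt first" step concrete, here is a minimal sketch of how a well-behaved crawler might decide whether to fetch a page, using Python's standard urllib.robotparser module. The function name may_fetch, the sample rules, and the www.example.com URLs are made up for illustration; passing None stands in for a site that has no robots.txt at all.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: Optional[str], user_agent: str, url: str) -> bool:
    """Decide whether a polite crawler should fetch `url`.

    `robots_txt` is the body of the site's robots.txt, or None when the
    site has no such file (hypothetical convention for this sketch).
    """
    if robots_txt is None:
        # No robots.txt: nothing has been declared off limits.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for a hypothetical site.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    base = "http://www.example.com"
    print(may_fetch(EXAMPLE_RULES, "ExampleBot", base + "/private/a.html"))
    print(may_fetch(EXAMPLE_RULES, "ExampleBot", base + "/public/a.html"))
    print(may_fetch(None, "ExampleBot", base + "/private/a.html"))
```

A real crawler would download the file rather than receive it as a string; RobotFileParser offers set_url() and read() for that. Note that nothing here is enforced: a crawler that never calls may_fetch simply ignores the rules.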

 

Let's take a look at the robots.txt files of a few major sites.

 

www.baidu.com/robots.txt

 

User-agent: Baiduspider
Disallow: /w?

User-agent: Googlebot
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: MSNBot
Allow: /

User-agent: Baiduspider-image
Disallow: /w?

User-agent: YoudaoBot
Allow: /

User-agent: Sogou web spider
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: Sogou inst spider
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: Sogou spider2
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: Sogou blog
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: Sogou News Spider
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: Sogou Orion spider
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: JikeSpider
Allow: /

User-agent: Sosospider
Allow: /

User-agent: YYspider
Allow: /

User-agent: PangusoSpider
Allow: /

User-agent: yisouspider
Allow: /

User-agent: EasouSpider
Allow: /

User-agent: *
Disallow: /


 

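The meaning is fairly self-explanatory: the name after User-agent is the name of a web crawler, and the Disallow and Allow lines that follow list the paths that crawler may not or may fetch.

Just as Baidu said, 360Spider is indeed not allowed to crawl. It has no entry of its own, so it falls under the final "User-agent: *" entry, whose "Disallow: /" closes the whole site.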
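As a quick sanity check of that reading, the sketch below feeds an excerpt of the Baidu rules listed above to Python's standard urllib.robotparser and asks what a few crawlers may fetch. Only a few entries are copied, the helper name check is made up, and the paths tested are arbitrary examples. Python's parser is one implementation among many, so a real search engine may resolve edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the Baidu rules shown above (not the full file).
BAIDU_EXCERPT = """\
User-agent: Baiduspider
Disallow: /w?

User-agent: Googlebot
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: MSNBot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(BAIDU_EXCERPT.splitlines())


def check(user_agent: str, path: str) -> bool:
    """Return True if `user_agent` may fetch `path` under the excerpt."""
    return parser.can_fetch(user_agent, "http://www.baidu.com" + path)


if __name__ == "__main__":
    for agent, path in [
        ("360Spider", "/index.html"),   # no own entry, falls under "*"
        ("Googlebot", "/index.html"),   # has its own entry
        ("Googlebot", "/update"),       # explicitly disallowed
        ("MSNBot", "/history"),         # "Allow: /" opens everything
    ]:
        print(agent, path, check(agent, path))
```

The point to notice is the fallback: a crawler that is not named anywhere is judged by the "User-agent: *" entry.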

 

 

www.google.com/robots.txt

 

User-agent: *
Disallow: /search
Disallow: /sdch
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Allow: /catalogs/about
Allow: /catalogs/p?
Disallow: /catalogues
Disallow: /news
Allow: /news/directory
Disallow: /nwshp
Disallow: /setnewsprefs?
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /relcontent
Disallow: /imgres
Disallow: /imglanding
Disallow: /sbd
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /cobrand
Disallow: /custom
Disallow: /advanced_group_search
Disallow: /googlesite
Disallow: /preferences
Disallow: /setprefs
Disallow: /swr
Disallow: /url
Disallow: /default
Disallow: /m?
Disallow: /m/
Disallow: /wml?
Disallow: /wml/?
Disallow: /wml/search?
Disallow: /xhtml?
Disallow: /xhtml/?
Disallow: /xhtml/search?
Disallow: /xml?
Disallow: /imode?
Disallow: /imode/?
Disallow: /imode/search?
Disallow: /jsky?
Disallow: /jsky/?
Disallow: /jsky/search?
Disallow: /pda?
Disallow: /pda/?
Disallow: /pda/search?
Disallow: /sprint_xhtml
Disallow: /sprint_wml
Disallow: /pqa
Disallow: /palm
Disallow: /gwt/
Disallow: /purchases
Disallow: /hws
Disallow: /bsd?
Disallow: /linux?
Disallow: /mac?
Disallow: /microsoft?
Disallow: /unclesam?
Disallow: /answers/search?q=
Disallow: /local?
Disallow: /local_url
Disallow: /shihui?
Disallow: /shihui/
Disallow: /froogle?
Disallow: /products?
Disallow: /products/
Disallow: /froogle_
Disallow: /product_
Disallow: /products_
Disallow: /products;
Disallow: /print
Disallow: /books/
Disallow: /bkshp?*q=*
Disallow: /books?*q=*
Disallow: /books?*output=*
Disallow: /books?*pg=*
Disallow: /books?*jtp=*
Disallow: /books?*jscmd=*
Disallow: /books?*buy=*
Disallow: /books?*zoom=*
Allow: /books?*q=related:*
Allow: /books?*q=editions:*
Allow: /books?*q=subject:*
Allow: /books/about
Allow: /booksrightsholders
Allow: /books?*zoom=1*
Allow: /books?*zoom=5*
Disallow: /ebooks/
Disallow: /ebooks?*q=*
Disallow: /ebooks?*output=*
Disallow: /ebooks?*pg=*
Disallow: /ebooks?*jscmd=*
Disallow: /ebooks?*buy=*
Disallow: /ebooks?*zoom=*
Allow: /ebooks?*q=related:*
Allow: /ebooks?*q=editions:*
Allow: /ebooks?*q=subject:*
Allow: /ebooks?*zoom=1*
Allow: /ebooks?*zoom=5*
Disallow: /patents?
Disallow: /patents/download/
Disallow: /patents/pdf/
Disallow: /patents/related/
Disallow: /scholar
Disallow: /citations?
Allow: /citations?user=
Allow: /citations?view_op=new_profile
Allow: /citations?view_op=top_venues
Disallow: /complete
Disallow: /s?
Disallow: /sponsoredlinks
Disallow: /videosearch?
Disallow: /videopreview?
Disallow: /videoprograminfo?
Allow: /maps?hq=http://maps.google.com/help/maps/directions/biking/mapleft.kml&ie=UTF8&ll=37.687624,-122.319717&spn=0.346132,0.727158&z=11&lci=bike&dirflg=b&f=d
Allow: /maps/api/js?
Disallow: /maps?
Disallow: /mapstt?
Disallow: /mapslt?
Disallow: /maps/stk/
Disallow: /maps/br?
Disallow: /mapabcpoi?
Disallow: /maphp?
Disallow: /mapprint?
Disallow: /maps/api/js/
Disallow: /maps/api/staticmap?
Disallow: /mld?
Disallow: /staticmap?
Disallow: /places/
Allow: /places/$
Disallow: /maps/preview
Disallow: /maps/place
Disallow: /help/maps/streetview/partners/welcome/
Disallow: /help/maps/indoormaps/partners/
Disallow: /lochp?
Disallow: /center
Disallow: /ie?
Disallow: /sms/demo?
Disallow: /katrina?
Disallow: /blogsearch?
Disallow: /blogsearch/
Disallow: /blogsearch_feeds
Disallow: /advanced_blog_search
Disallow: /uds/
Disallow: /chart?
Disallow: /transit?
Disallow: /mbd?
Disallow: /extern_js/
Disallow: /xjs/
Disallow: /calendar/feeds/
Disallow: /calendar/ical/
Disallow: /cl2/feeds/
Disallow: /cl2/ical/
Disallow: /coop/directory
Disallow: /coop/manage
Disallow: /trends?
Disallow: /trends/music?
Disallow: /trends/hottrends?
Disallow: /trends/viz?
Disallow: /notebook/search?
Disallow: /musica
Disallow: /musicad
Disallow: /musicas
Disallow: /musicl
Disallow: /musics
Disallow: /musicsearch
Disallow: /musicsp
Disallow: /musiclp
Disallow: /browsersync
Disallow: /call
Disallow: /archivesearch?
Disallow: /archivesearch/url
Disallow: /archivesearch/advanced_search
Disallow: /base/reportbadoffer
Disallow: /urchin_test/
Disallow: /movies?
Disallow: /codesearch?
Disallow: /codesearch/feeds/search?
Disallow: /wapsearch?
Disallow: /safebrowsing
Allow: /safebrowsing/diagnostic
Allow: /safebrowsing/report_badware/
Allow: /safebrowsing/report_error/
Allow: /safebrowsing/report_phish/
Disallow: /reviews/search?
Disallow: /orkut/albums
Allow: /jsapi
Disallow: /views?
Disallow: /c/
Disallow: /cbk
Allow: /cbk?output=tile&cb_client=maps_sv
Disallow: /recharge/dashboard/car
Disallow: /recharge/dashboard/static/
Disallow: /translate_a/
Disallow: /translate_c
Disallow: /translate_f
Disallow: /translate_static/
Disallow: /translate_suggestion
Disallow: /profiles/me
Allow: /profiles
Disallow: /s2/profiles/me
Allow: /s2/profiles
Allow: /s2/photos
Allow: /s2/static
Disallow: /s2
Allow: /s2/search/social
Disallow: /transconsole/portal/
Disallow: /gcc/
Disallow: /aclk
Disallow: /cse?
Disallow: /cse/home
Disallow: /cse/panel
Disallow: /cse/manage
Disallow: /tbproxy/
Disallow: /imesync/
Disallow: /shenghuo/search?
Disallow: /support/forum/search?
Disallow: /reviews/polls/
Disallow: /hosted/images/
Disallow: /ppob/?
Disallow: /ppob?
Disallow: /ig/add?
Disallow: /adwordsresellers
Disallow: /accounts/o8
Allow: /accounts/o8/id
Disallow: /topicsearch?q=
Disallow: /xfx7/
Disallow: /squared/api
Disallow: /squared/search
Disallow: /squared/table
Disallow: /toolkit/
Allow: /toolkit/*.html
Disallow: /globalmarketfinder/
Allow: /globalmarketfinder/*.html
Disallow: /qnasearch?
Disallow: /app/updates
Disallow: /sidewiki/entry/
Disallow: /quality_form?
Disallow: /labs/popgadget/search
Disallow: /buzz/post
Disallow: /compressiontest/
Disallow: /analytics/reporting/
Disallow: /analytics/admin/
Disallow: /analytics/web/
Disallow: /analytics/feeds/
Disallow: /analytics/settings/
Disallow: /alerts/
Disallow: /ads/search
Disallow: /phone/compare/?
Allow: /alerts/manage
Allow: /alerts/remove
Disallow: /travel/clk
Disallow: /hotelfinder/rpc
Disallow: /hotels/rpc
Disallow: /flights/rpc
Disallow: /commercesearch/services/
Disallow: /evaluation/
Disallow: /chrome/browser/mobile/tour
Disallow: /compare/*/apply*
Disallow: /forms/perks/
Disallow: /baraza/*/search
Disallow: /baraza/*/report
Disallow: /shopping/suppliers/search
Disallow: /ct/
Disallow: /edu/cs4hs/
Disallow: /trustedstores/s/
Disallow: /trustedstores/tm2
Disallow: /trustedstores/verify
Disallow: /adwords/proposal
Disallow: /shopping/product/
Disallow: /shopping/seller
Disallow: /shopping/reviewer
Sitemap: http://www.gstatic.com/culturalinstitute/sitemaps/www_google_com_culturalinstitute/sitemap-index.xml
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
Sitemap: http://www.google.com/sitemaps_webmasters.xml
Sitemap: http://www.gstatic.com/sitemaps/websearch_hreflang/sitemap_index.xml
Sitemap: http://www.google.com/ventures/sitemap_ventures.xml
Sitemap: http://www.gstatic.com/dictionary/static/sitemaps/sitemap_index.xml
Sitemap: http://www.gstatic.com/earth/gallery/sitemaps/sitemap.xml
Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml
Sitemap: http://www.gstatic.com/trends/websites/sitemaps/sitemapindex.xml


 

Huh? Why is there an extra Sitemap field here?

 

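           Crawlers discover new pages by following the links inside pages they already know. But what about pages that nothing links to, or dynamic pages generated from user input? Can a site administrator tell search engines which crawlable pages the site has? That is what a sitemap is for. The simplest form of Sitemap is an XML file listing the site's URLs together with extra data about each one (when it was last updated, how often it changes, how important it is relative to the site's other URLs, and so on). Search engines can use this information to crawl the site more intelligently.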
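To show what such an XML file looks like and how a crawler might read it, here is a minimal sketch using Python's standard xml.etree.ElementTree. The sample document and its www.example.com URLs are invented; the element names (urlset, url, loc, lastmod, changefreq, priority) and the namespace come from the sitemaps.org protocol, and parse_sitemap is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Invented sample sitemap for a hypothetical site.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2013-01-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/about.html</loc>
    <lastmod>2012-12-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry; missing fields come back as None."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        })
    return entries


if __name__ == "__main__":
    for entry in parse_sitemap(SITEMAP_XML):
        print(entry)
```

Only loc is required by the protocol; the other three fields are optional hints, which is why the sketch tolerates their absence.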

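Sitemaps are a topic of their own, big enough for a separate article, so I won't go into detail here; interested readers can read up on sitemaps separately.

That raises a new question: how does a crawler know whether a site provides a sitemap at all? And if the administrator has generated one (possibly split across several files), how does the crawler know where they are?

Since the location of robots.txt is fixed, people hit on the idea of putting the sitemap locations inside robots.txt. That is how Sitemap became a new member of robots.txt.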
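The lookup this implies is simple enough to sketch by hand. The function below scans the text of a robots.txt and collects the URL from every Sitemap line; find_sitemaps is a made-up name, and the sample is an excerpt of the Amazon file shown later in this post. It is a simplified reading of the format rather than a full parser.

```python
def find_sitemaps(robots_txt):
    """Collect the URLs declared on 'Sitemap:' lines of a robots.txt body."""
    sitemaps = []
    for raw_line in robots_txt.splitlines():
        # Drop comments (everything after '#') and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Split on the first colon only, since the URL contains colons too.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps


# Excerpt of http://www.amazon.cn/robots.txt as listed in this post.
AMAZON_EXCERPT = """\
User-agent: *
Disallow: /gp/redirect.html
# Sitemap files
Sitemap: http://www.amazon.cn/sitemap_feed_index1.xml
Sitemap: http://www.amazon.cn/sitemaps.f3053414d236e84.SitemapIndex_0.xml.gz
"""

if __name__ == "__main__":
    for url in find_sitemaps(AMAZON_EXCERPT):
        print(url)
```

Sitemap lines are not tied to any User-agent entry, so the sketch ignores which entry they appear under. On Python 3.8 and later, urllib.robotparser.RobotFileParser also exposes a site_maps() method that does the same job.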

 

 

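The Sitemap entries above all point to XML files, and you can open them to have a look. A Sitemap entry can also point to a plain txt file.

It can even be a compressed archive. Let's look at Amazon's: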

 

 

http://www.amazon.cn/robots.txt

 

 

User-agent: *
Disallow: /buycar
Disallow: /cart
Disallow: /checkout
Disallow: /class
Disallow: /com
Disallow: /common
Disallow: /css
Disallow: /dll
Disallow: /doc
Disallow: /dp/e-mail-friend/
Disallow: /dp/manual-submit/
Disallow: /dp/product-availability/
Disallow: /dp/rate-this-item/
Disallow: /dp/shipping/
Disallow: /dp/twister-update/
Disallow: /gp/aws/ssop
Disallow: /gp/cart
Disallow: /gp/css/homepage.html
Disallow: /gp/customer-reviews/common/du
Disallow: /gp/flex
Disallow: /gp/gfix
Disallow: /gp/history
Disallow: /gp/item-dispatch
Disallow: /gp/music/clipserve
Disallow: /gp/music/wma-pop-up
Disallow: /gp/offer-listing
Disallow: /gp/product/e-mail-friend
Disallow: /gp/product/product-availability
Disallow: /gp/product/rate-this-item
Disallow: /gp/recsradio
Disallow: /gp/slredirect
Disallow: /gp/twitter/
Disallow: /gp/vote
Disallow: /gp/voting/
Disallow: /gp/yourstore
Disallow: /inc
Disallow: /js
Disallow: /lib
Disallow: /mn/bookLookInsideApp
Disallow: /mn/checkInitApp
Disallow: /mn/checkoutAlertMsgApp
Disallow: /mn/checkoutredirectApp
Disallow: /mn/giftCardApp
Disallow: /mn/loginApplication
Disallow: /mn/loyaltyApp
Disallow: /mn/orderAddrApp
Disallow: /mn/orderCfmApp
Disallow: /mn/orderDetailApp
Disallow: /mn/orderFailApp
Disallow: /mn/orderHistoryApp
Disallow: /mn/orderModifyApp
Disallow: /mn/orderSummaryApp
Disallow: /mn/paymentRedriveApp
Disallow: /mn/recommendReviewApp
Disallow: /mn/releaseReviewApp
Disallow: /mn/reviewVoteApplication
Disallow: /mn/selectPaymentMethodApp
Disallow: /mn/selectShippingOpptionApplication
Disallow: /mn/shipmentTraceApp
Disallow: /mn/shoppingCartApplication
Disallow: /mn/tellFriend
Disallow: /mn/thankYouApplication
Disallow: /mn/virtualAccountApp
Disallow: /mn/yourAccountApp
Disallow: /paper
Disallow: /xml
Disallow: /youraccount
Disallow: /ap/signin
Disallow: /gp/registry/wishlist/
Disallow: /wishlist/
Allow: /wishlist/universal*
Allow: /wishlist/vendor-button*
Allow: /wishlist/get-button*
Disallow: /gp/wishlist/
Allow: /gp/wishlist/universal*
Allow: /gp/wishlist/vendor-button*
Allow: /gp/wishlist/ipad-install*
Disallow: /registry/wishlist/
Disallow: /gp/help/customer/display.html*nodeId=200843370
Disallow: /gp/help/customer/display.html*nodeId=200877580
Disallow: /gp/help/customer/display.html*nodeId=200877590
Disallow: /gp/help/customer/display.html*nodeId=200879080
Disallow: /gp/help/customer/display.html*nodeId=200879100
Disallow: /gp/help/customer/display.html*nodeId=200879120
Disallow: /gp/help/customer/display.html*nodeId=200879160
Disallow: /gp/help/customer/display.html*nodeId=200879140
Disallow: /gp/help/customer/display.html*nodeId=200877610
Disallow: /gp/help/customer/display.html*nodeId=200878960
Disallow: /gp/help/customer/display.html*nodeId=200878980
Disallow: /gp/help/customer/display.html*nodeId=200879000
Disallow: /gp/help/customer/display.html*nodeId=200879040
Disallow: /gp/help/customer/display.html*nodeId=200879020
Disallow: /gp/help/customer/display.html*nodeId=200877630
Disallow: /gp/help/customer/display.html*nodeId=200879200
Disallow: /gp/help/customer/display.html*nodeId=200879220
Disallow: /gp/help/customer/display.html*nodeId=200879240
Disallow: /gp/help/customer/display.html*nodeId=200879280
Disallow: /gp/help/customer/display.html*nodeId=200879260
Disallow: /gp/help/customer/display.html*nodeId=200877650
Disallow: /gp/help/customer/display.html*nodeId=200879320
Disallow: /gp/help/customer/display.html*nodeId=200879340
Disallow: /gp/help/customer/display.html*nodeId=200879360
Disallow: /gp/help/customer/display.html*nodeId=200879400
Disallow: /gp/help/customer/display.html*nodeId=200879380
Disallow: /gp/help/customer/display.html*nodeId=200877560
Disallow: /gp/help/customer/display.html*nodeId=200843460
Disallow: /gp/help/customer/display.html*nodeId=200843440
Disallow: /gp/help/customer/display.html*nodeId=200899270
Disallow: /gp/help/customer/display.html*nodeId=200879440
Disallow: /gp/help/customer/display.html*nodeId=200899330
Disallow: /gp/help/customer/display.html*nodeId=200899350
Disallow: /gp/help/customer/display.html*nodeId=200899390
Disallow: /gp/help/customer/display.html*nodeId=200899410
Disallow: /gp/help/customer/display.html*nodeId=200899430
Disallow: /gp/help/customer/display.html*nodeId=200899220
Disallow: /gp/help/customer/display.html*nodeId=200899450
Disallow: /gp/help/customer/display.html*nodeId=200899670
Disallow: /gp/help/customer/display.html*nodeId=200899530
Disallow: /gp/help/customer/display.html*nodeId=200899470
Disallow: /gp/help/customer/display.html*nodeId=200899550
Disallow: /gp/help/customer/display.html*nodeId=200899570
Disallow: /gp/help/customer/display.html*nodeId=200899590
Disallow: /gp/help/customer/display.html*nodeId=200899490
Disallow: /gp/help/customer/display.html*nodeId=200899510
Disallow: /gp/help/customer/display.html*nodeId=200899610
Disallow: /gp/help/customer/display.html*nodeId=200899630
Disallow: /gp/help/customer/display.html*nodeId=200899650
Disallow: /gp/help/customer/display.html*nodeId=200879180
Disallow: /gp/help/customer/display.html*nodeId=200879060
Disallow: /gp/help/customer/display.html*nodeId=200879300
Disallow: /gp/help/customer/display.html*nodeId=200879420
Disallow: /gp/help/customer/display.html*nodeId=200899290
Disallow: /gp/help/customer/display.html*nodeId=200899310
Disallow: /gp/help/customer/display.html*nodeId=200843380
Disallow: /gp/help/customer/display.html*nodeId=200843420
Disallow: /gp/help/customer/display.html*nodeId=200899230
Disallow: /gp/help/customer/display.html*nodeId=200899250
Disallow: /gp/help/customer/display.html*nodeId=200899370
Disallow: /gp/help/contact-us/general-questions.html*?type&email&skip=true
Disallow: /gp/help/customer/accessibility?ie=UTF8&initialIssue=forgotpw&skip=true
Disallow: /gp/registry/search.html
Disallow: /gp/orc/rml/
Disallow: /gp/digital/fiona/manage
Disallow: /gp/entity-alert/external
Disallow: /gp/customer-reviews/dynamic/sims-box
Disallow: /review/dynamic/sims-box
Disallow: /gp/redirect.html
# Sitemap files
Sitemap: http://www.amazon.cn/sitemap_feed_index1.xml
Sitemap: http://www.amazon.cn/sitemaps.f3053414d236e84.SitemapIndex_0.xml.gz
Sitemap: http://www.amazon.cn/sitemaps.1946f6b8171de60.SitemapIndex_0.xml.gz
Sitemap: http://www.amazon.cn/sitemaps.c21f969b5f03d33.SitemapIndex_0.xml.gz


 

 If we download one of the .gz archives and unpack it, we find that it contains an XML file.
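The unpacking step can be done in a few lines with Python's standard gzip module. The sketch below builds a small gzip archive in memory so that it runs without network access. The sample content is invented: it assumes the archive holds a sitemap index (a sitemapindex document, as defined by the sitemaps.org protocol, which lists further sitemap files), which the file names suggest but which I have not verified against Amazon's real files.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_gzipped_sitemap_index(gz_bytes):
    """Decompress a .xml.gz sitemap index and return the sitemap URLs in it."""
    xml_bytes = gzip.decompress(gz_bytes)
    root = ET.fromstring(xml_bytes)
    return [loc.text.strip() for loc in root.findall("sm:sitemap/sm:loc", NS)]


# Invented stand-in for a downloaded archive (hypothetical URLs).
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://www.example.com/sitemap-1.xml.gz</loc></sitemap>
  <sitemap><loc>http://www.example.com/sitemap-2.xml.gz</loc></sitemap>
</sitemapindex>
"""
SAMPLE_GZ = gzip.compress(SAMPLE_XML)

if __name__ == "__main__":
    for url in read_gzipped_sitemap_index(SAMPLE_GZ):
        print(url)
```

For a real file, the bytes would come from an HTTP download or from open(path, "rb").read() instead of SAMPLE_GZ.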

 

 

 

Let's look at one more:

 

 

 

Hmm? What kind of crawler is ia_archiver? I had never seen it before.

 

Time for another Baidu search.

 

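ia_archiver is Alexa's crawler. According to the explanation I found, it is used to detect whether a site is cheating on its Alexa ranking.
The ia_archiver program reportedly crawls the internet automatically; in particular, when a site's traffic exceeds a threshold set by Alexa, ia_archiver visits that site's server to judge whether the traffic looks normal or whether cheating is involved.

Inviting ia_archiver to visit

Just register on the Alexa website.

Blocking ia_archiver

ia_archiver is a moderately aggressive crawler. If you feel it uses too much of your server's resources and you don't care about your site's Alexa ranking, you can block it. To do so, create a robots.txt in the site's root directory on the server with the following content:
User-agent: ia_archiver
Disallow: /
This bars ia_archiver from the entire site. Alternatively, bar it from a single directory:
User-agent: ia_archiver
Disallow: /somedir/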
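As a rough way to try out the two variants above before deploying them, the sketch below generates each robots.txt and checks it with Python's standard urllib.robotparser. The helper names block_crawler and is_allowed and the www.example.com URLs are made up for illustration, and the check only shows how Python's parser reads the rules, not how ia_archiver itself behaves.

```python
from urllib.robotparser import RobotFileParser


def block_crawler(user_agent, paths=("/",)):
    """Build a robots.txt body that bars `user_agent` from `paths`."""
    lines = ["User-agent: " + user_agent]
    lines += ["Disallow: " + path for path in paths]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, user_agent, url):
    """Ask Python's robots.txt parser whether `user_agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


WHOLE_SITE = block_crawler("ia_archiver")
ONE_DIR = block_crawler("ia_archiver", ["/somedir/"])

if __name__ == "__main__":
    base = "http://www.example.com"
    print(WHOLE_SITE)
    print(is_allowed(WHOLE_SITE, "ia_archiver", base + "/index.html"))
    print(is_allowed(WHOLE_SITE, "Googlebot", base + "/index.html"))
    print(ONE_DIR)
    print(is_allowed(ONE_DIR, "ia_archiver", base + "/somedir/page.html"))
    print(is_allowed(ONE_DIR, "ia_archiver", base + "/other/page.html"))
```

Because neither file contains a "User-agent: *" entry, crawlers other than ia_archiver are unaffected by these rules.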

 

That's basically it.

 

There are also some fun robots.txt examples out there; see: http://lusongsong.com/reed/732.html

 

That's all for the robots protocol!