Collection of Robots.txt Files
A well-built robots.txt file matters for search engine optimization. There is plenty of advice around the Internet on creating such files (if you are looking for an introduction to the topic, read “Create a robots.txt file”), but what if, instead of looking at what people say, we looked at what people do?
That is what I did: I collected the robots.txt files from a wide range of blogs and websites. You will find them below.
Key Takeaways
- Only 2 of the 30 websites I checked were not using a robots.txt file
- So even if you have no specific requirements for the search bots, you should probably publish a simple robots.txt file
- Most people stick to the “User-agent: *” line, which covers all agents
- The most commonly disallowed item is the RSS feed
- Google itself uses a mix of paths closed with a trailing slash (e.g., /searchhistory/) and open ones without it (e.g., /search), which probably means they are treated differently. Rules match by path prefix, so /search covers every path that begins with those characters, while /searchhistory/ covers only the contents of that folder
- Only a minority of the sites listed the sitemap URL in the robots.txt file
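The trailing-slash point is easier to see in code. Below is a minimal sketch using Python's standard urllib.robotparser module, which does plain prefix matching. The two rules are copied from Google's file at the end of this post; the helper name is_allowed, the agent name ExampleBot, and the test paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Two rules from Google's file: one without and one with a trailing slash.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /sorry/
"""

def is_allowed(path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under ROBOTS_TXT (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)

for path in ["/search", "/searchhistory/", "/sorry/index.html", "/sorry"]:
    print(path, "->", "allowed" if is_allowed(path) else "blocked")
```

Under these two rules alone, /searchhistory/ is blocked because it starts with /search, while /sorry (no trailing slash) stays allowed. That is presumably why Google's file lists an explicit Allow: /searchhistory/ line.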
The Minimalistic Guys
Problogger.net
User-agent: *
Disallow:
Marketing Pilgrim
User-agent: *
Disallow:
Search Engine Journal
User-agent: *
Disallow:
Matt Cutts
User-agent: *
Allow:
User-agent: *
Disallow: /files/
Pronet Advertising
User-agent: *
Disallow: /mt
Disallow: /*.cgi$
TechCrunch
User-agent: *
Disallow: /*/feed/
Disallow: /*/trackback/
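An empty Disallow value, as in the first three files of this group, blocks nothing. It is easy to confuse with "Disallow: /", which blocks the whole site. The sketch below checks both with Python's urllib.robotparser; the helper can_crawl, the agent name ExampleBot, and the sample path are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_txt, path, agent="ExampleBot"):
    """Parse robots_txt and report whether agent may fetch path (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

ALLOW_ALL = "User-agent: *\nDisallow:\n"    # the file used by Problogger and others
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # one extra slash closes the whole site

print(can_crawl(ALLOW_ALL, "/2007/05/some-post/"))  # True
print(can_crawl(BLOCK_ALL, "/2007/05/some-post/"))  # False
```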
The Structured Ones
Online Marketing Blog
User-agent: Googlebot
Disallow: */feed/

User-agent: *
Disallow: /Blogger/
Disallow: /wp-admin/
Disallow: /stats/
Disallow: /cgi-bin/
Disallow: /2005x/
Shoemoney
User-Agent: Googlebot
Disallow: /link.php
Disallow: /gallery2
Disallow: /gallery2/
Disallow: /category/
Disallow: /page/
Disallow: /pages/
Disallow: /feed/
Disallow: /feed
Scoreboard Media
User-agent: *
Disallow: /cgi-bin/

User-agent: Googlebot
Disallow: /category/
Disallow: /page/
Disallow: */feed/
Disallow: /2007/
Disallow: /2006/
Disallow: /wp-*
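Files like Scoreboard Media's define one group for all agents and another for Googlebot. A crawler that finds a group naming it follows only that group and ignores the * group, so Googlebot here is not bound by the /cgi-bin/ rule. A minimal sketch with Python's urllib.robotparser follows. It keeps only the non-wildcard rules from the file above, because the standard-library parser does plain prefix matching, and the agent name OtherBot and the sample paths are made up.

```python
from urllib.robotparser import RobotFileParser

# Abridged copy of the Scoreboard Media file (wildcard rules left out).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/

User-agent: Googlebot
Disallow: /category/
Disallow: /page/
Disallow: /2007/
Disallow: /2006/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Each agent is checked against its own group only.
for agent in ["Googlebot", "OtherBot"]:
    for path in ["/cgi-bin/script", "/category/seo/"]:
        verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
        print(f"{agent:10} {path:18} {verdict}")
```

If the intent is to keep Googlebot out of /cgi-bin/ as well, the rule has to be repeated inside the Googlebot group.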
SEOMoz.org
User-agent: *
Disallow: /blogdetail.php?ID=537
Disallow: /blog?page
Disallow: /blog/author/
Disallow: /blog/category/
Disallow: /tracker
Disallow: /ugc?page
Disallow: /ugc/author/
Disallow: /ugc/category/
Wolf-Howl
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /noindex/
Disallow: /privacy-policy/
Disallow: /about/
Disallow: /company-biographies/
Disallow: /press-media-room/
Disallow: /newsletter/
Disallow: /contact-us/
Disallow: /terms-of-service/
Disallow: /terms-of-service/
Disallow: /information/comment-policy/
Disallow: /faq/
Disallow: /contact-form/
Disallow: /advertising/
Disallow: /information/licensing-information/
Disallow: /2005/
Disallow: /2006/
Disallow: /2007/
Disallow: /2008/
Disallow: /2009/
Disallow: /2004/
Disallow: /*?*
Disallow: /page/
Disallow: /iframes/
John Chow
sitemap: http://www.johnchow.com/sitemap.xml
User-agent: *
Disallow: /cgi-bin/
Disallow: /go/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /author/
Disallow: /page/
Disallow: /category/
Disallow: /wp-images/
Disallow: /images/
Disallow: /backup/
Disallow: /banners/
Disallow: /archives/
Disallow: /trackback/
Disallow: /feed/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: Mediapartners-Google
Allow: /

User-agent: duggmirror
Disallow: /
Smashing Magazine
Sitemap: http://www.smashingmagazine.com/sitemap.xml
User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /styles/
Disallow: /inc/
Disallow: /tag/
Disallow: /cc/
Disallow: /category/

User-agent: MSIECrawler
Disallow: /

User-agent: psbot
Disallow: /

User-agent: Fasterfox
Disallow: /

User-agent: Slurp
Crawl-delay: 200
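Besides Allow and Disallow, the Smashing Magazine file uses Sitemap and Crawl-delay lines. As a rough sketch of how a crawler might read them, Python's urllib.robotparser exposes both through site_maps() (Python 3.8 and later) and crawl_delay(). The file below is an abridged copy of the one above, and ExampleBot is a made-up agent name.

```python
from urllib.robotparser import RobotFileParser

# Abridged copy of the Smashing Magazine file shown above.
ROBOTS_TXT = """\
Sitemap: http://www.smashingmagazine.com/sitemap.xml

User-agent: *
Disallow: /styles/
Disallow: /tag/

User-agent: psbot
Disallow: /

User-agent: Slurp
Crawl-delay: 200
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

sitemaps = parser.site_maps()                   # list of Sitemap URLs, or None
slurp_delay = parser.crawl_delay("Slurp")       # value of Crawl-delay for Slurp
other_delay = parser.crawl_delay("ExampleBot")  # None: the * group sets no delay

print(sitemaps, slurp_delay, other_delay)
```

Crawl-delay is not honored by every crawler; where it is, the value is conventionally read as seconds between requests.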
Gizmodo
User-Agent: Googlebot
Disallow: /index.xml$
Disallow: /excerpts.xml$
Allow: /sitemap.xml$
Disallow: /*view=rss$
Disallow: /*?view=rss$
Disallow: /*format=rss$
Disallow: /*?format=rss$
Sitemap: http://gizmodo.com/sitemap.xml
Lifehacker
User-Agent: Googlebot
Disallow: /index.xml$
Disallow: /excerpts.xml$
Allow: /sitemap.xml$
Disallow: /*view=rss$
Disallow: /*?view=rss$
Disallow: /*format=rss$
Disallow: /*?format=rss$
Sitemap: http://lifehacker.com/sitemap.xml
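Gizmodo and Lifehacker rely on two pattern characters: * for any run of characters and a trailing $ to anchor the end of the URL. Python's urllib.robotparser does not interpret these, so the sketch below translates a rule into a regular expression by hand to show how such rules are commonly read by the major search engines. The function name rule_matches and the sample paths are my own illustration, not part of any library.

```python
import re

def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern matches `path`.

    `*` matches any run of characters; a trailing `$` anchors the end.
    Without `$`, the pattern only has to match a prefix of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" in place of each "*".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if not anchored:
        regex += ".*"
    return re.fullmatch(regex, path) is not None

print(rule_matches("/index.xml$", "/index.xml"))                  # True
print(rule_matches("/index.xml$", "/index.xml?page=2"))           # False
print(rule_matches("/*view=rss$", "/tag/apple?view=rss"))         # True
print(rule_matches("/*view=rss$", "/tag/apple?view=rss&page=2"))  # False
print(rule_matches("/*/feed/", "/2007/05/post/feed/"))            # True
print(rule_matches("/*/feed/", "/feed/"))                         # False
```

By this reading the ?view=rss and ?format=rss rules are redundant, since anything they match is already matched by the shorter /*view=rss$ and /*format=rss$ patterns. The last line also shows that TechCrunch's /*/feed/ rule does not cover a top-level /feed/ URL.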
The Mainstream Media
Wall Street Journal
User-agent: *
Disallow: /article_email/
Disallow: /article_print/
Disallow: /PA2VJBNA4R/
Sitemap: http://online.wsj.com/sitemap.xml
ZDNet
User-agent: *
Disallow: /Ads/
Disallow: /redir/
# Disallow: /i/ is removed per 190723
Disallow: /av/
Disallow: /css/
Disallow: /error/
Disallow: /clear/
Disallow: /mac-ad
Disallow: /adlog/
# URS per bug 239819, these were expanded
Disallow: /1300-
Disallow: /1301-
Disallow: /1302-
Disallow: /1303-
Disallow: /1304-
Disallow: /1305-
Disallow: /1306-
Disallow: /1307-
Disallow: /1308-
Disallow: /1309-
Disallow: /1310-
Disallow: /1311-
Disallow: /1312-
Disallow: /1313-
Disallow: /1314-
Disallow: /1315-
Disallow: /1316-
Disallow: /1317-
NY Times
# robots.txt, www.nytimes.com 6/29/2006
#
User-agent: *
Disallow: /pages/college/
Disallow: /college/
Disallow: /library/
Disallow: /learning/
Disallow: /aponline/
Disallow: /reuters/
Disallow: /cnet/
Disallow: /partners/
Disallow: /archives/
Disallow: /indexes/
Disallow: /thestreet/
Disallow: /nytimes-partners/
Disallow: /financialtimes/
Allow: /pages/
Allow: /2003/
Allow: /2004/
Allow: /2005/
Allow: /top/
Allow: /ref/
Allow: /services/xml/

User-agent: Mediapartners-Google*
Disallow:
YouTube
# robots.txt file for YouTube
User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /profile
Disallow: /results
Disallow: /browse
Disallow: /t/terms
Disallow: /t/privacy
Disallow: /login
Disallow: /watch_ajax
Disallow: /watch_queue_ajax
Bonus
Google
User-agent: *
Allow: /searchhistory/
Disallow: /news?output=xhtml&
Allow: /news?output=xhtml
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalogues
Disallow: /news
Disallow: /nwshp
Disallow: /?
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /relcontent
Disallow: /sorry/
Disallow: /imgres
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /cobrand
Disallow: /custom
Disallow: /advanced_group_search
Disallow: /advanced_search
Disallow: /googlesite
Disallow: /preferences
Disallow: /setprefs
Disallow: /swr
Disallow: /url
Disallow: /default
Disallow: /m?
Disallow: /m/search?
Disallow: /wml?
Disallow: /wml/search?
Disallow: /xhtml?
Disallow: /xhtml/search?
Disallow: /xml?
Disallow: /imode?
Disallow: /imode/search?
Disallow: /jsky?
Disallow: /jsky/search?
Disallow: /pda?
Disallow: /pda/search?