Using robots.txt To Control Search Engine Spiders

Reposted from http://www.activewebhosting.com/faq/web-robots.html

What are robots and spiders?

Search engines such as Google and Yahoo! use programs called 'robots' or 'spiders' to visit pages on the internet and automatically add them to their search databases. Many people submit their sites manually rather than wait for a robot or spider to visit. When you put a web page on your web server, it can take some time for the page to show up in a search engine. Once the page is in the database, however, it can also take a long time for it to be removed if the page moves or is taken off the server. For more information on how the major search engine spiders work, please see the pages below:

Yahoo!'s Web Crawler
Google Information for Webmasters
MSNBot Troubleshooting

However, there may be times when you have information that you do not want to share with everyone or have the search engines put in their databases. You may even have a whole directory you wish to keep secret.

One way to keep search engines from adding your pages to their databases is to put a file called robots.txt in the root directory of your server and list in it the pages or directories you wish to protect. This is not a fool-proof way to protect your pages, since a robot is free to ignore the file and anyone can read it, but it may at the very least help keep the pages from showing up in most search engine databases.

How do I create a robots.txt file?

You can create a robots.txt file in any Linux text editor or any text editor that saves plain text in Unix format. Saving with Unix style line breaks is the safest choice, although most robots accept other line endings as well. Please see Text Editors You Can Use To Create CGI Scripts for more information. Note that the robots.txt file must be in the root directory (and not in a subdirectory) of your CGI or web server. You can have one on each server if you want. Put a robots.txt file in the root directory of your CGI server to control spidering of files on only that server. Put a robots.txt file in the root directory of your web server to control spidering of files on your web server only. Your robots.txt file will only affect your own server(s) and not anyone else's.
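Because the file always lives at the root of a server, the robots.txt location that applies to any page can be derived from the page's own URL. The following is a minimal Python sketch of that idea; the function name robots_url and the host cgi.yourdomain.com are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL on the same server as page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path with /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A page deep inside a subdirectory is still governed by the root file.
print(robots_url("http://yourdomain.com/images/photos/cat.html"))
# A separate (hypothetical) CGI server has its own robots.txt.
print(robots_url("http://cgi.yourdomain.com/scripts/form.cgi?id=1"))
```

The two calls print http://yourdomain.com/robots.txt and http://cgi.yourdomain.com/robots.txt, which illustrates why each server needs its own file.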

The robots.txt file usually needs only two fields: User-agent and Disallow. Here are a few examples you can put in your robots.txt file. You can add more than one User-agent or Disallow field to your robots.txt file.

Allow all robots:

User-agent: *
Disallow:

This will allow all robots to visit all pages on the server. Note that nothing was entered for Disallow even though the field was included in the robots.txt file.

Specify rules for a certain search engine:

User-agent: googlebot
Disallow:

This specifies Disallow rules to be followed only by Google robots that may visit your site. Note that nothing was entered for Disallow even though the field was included in the robots.txt file. This means all files can be added to Google's search database.

Keep all robots out of all directories:

User-agent: *
Disallow: /

This keeps all robots from adding any of the pages on the server where the robots.txt file is placed. Note that the slash / in the Disallow field means all files.
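The difference between an empty Disallow and Disallow: / is easy to check with the robots.txt parser in Python's standard library. This is only a sketch of how one well-behaved client reads the rules; the helper name allowed and the robot name examplebot are invented for the example, and real search engine robots may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty Disallow: nothing is blocked
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # slash: everything is blocked

page = "http://yourdomain.com/index.html"
print(allowed(ALLOW_ALL, "examplebot", page))  # True
print(allowed(BLOCK_ALL, "examplebot", page))  # False
```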

Ban a certain search engine from all directories:

User-agent: googlebot
Disallow: /

This would keep Google from adding any pages on your server to its search engine database.
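To see that a rule addressed to one robot leaves the others alone, the same rules can be fed to Python's standard urllib.robotparser module. Treat this as an illustrative sketch: examplebot is a made-up robot name, and the name matching shown is that of the Python parser, not necessarily that of any particular search engine.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: googlebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

page = "http://yourdomain.com/index.html"
# The googlebot group applies to googlebot, so the page is off limits.
googlebot_allowed = parser.can_fetch("googlebot", page)
# No group names examplebot and there is no "*" group, so it is allowed.
other_allowed = parser.can_fetch("examplebot", page)

print(googlebot_allowed, other_allowed)  # False True
```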

Protecting only certain files:

User-agent: *
Disallow: /images/
Disallow: /email.html

This keeps all robots from adding the files in the images directory and the email.html file to their search databases. Note that using Disallow: /images/ covers the subdirectories as well, so there is no need to add another line for each subdirectory of the images directory. Spiders will not go into the images directory at all, nor visit any of the directories or files inside it.
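Disallow values are matched as path prefixes counted from the root of the site, which is why a single /images/ line covers every subdirectory. The sketch below checks a few paths with Python's standard urllib.robotparser; the file names are invented, and both Disallow values are written with a leading slash because the parser compares them against paths that start at the site root.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /images/
Disallow: /email.html
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

paths = [
    "/images/logo.gif",              # directly inside /images/
    "/images/icons/small/star.gif",  # nested subdirectory, same prefix
    "/email.html",                   # the single protected file
    "/images.html",                  # similar name, but not under /images/
    "/index.html",                   # not mentioned at all
]

# Map each path to True (may be fetched) or False (blocked).
results = {p: parser.can_fetch("*", "http://yourdomain.com" + p) for p in paths}
for path, ok in results.items():
    print(path, "allowed" if ok else "blocked")
```

Only /images.html and /index.html come out as allowed; the other three paths are blocked.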

We recommend you take a look at an example robots.txt file from PimpSoft. You may want to copy this file and adjust it to your site's needs. This file helps to keep certain harmful robots (spiders) off your site and control how these robots spider your site. In this way, your pages can be indexed most efficiently.

Once you have constructed and saved your robots.txt file, use your FTP program to upload it to the root directory of the web server you wish to protect.

Checking robots.txt Validity

Once you've uploaded the robots.txt file, it's usually a good idea to check the validity of the file and be sure there are no problems. You can do this using one of the following robots.txt validators. Please be sure your robots.txt file is uploaded to your web site and provide the proper URL to the file, such as http://yourdomain.com/robots.txt.
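For a rough idea of what such a check involves, here is a small Python sketch that looks for a few common mistakes. It is not a substitute for the validators listed here: the function check_robots_txt is hypothetical, and it knows only the two fields covered in this article, so any other field is reported as unknown.

```python
KNOWN_FIELDS = {"user-agent", "disallow"}  # only the fields covered in this article


def check_robots_txt(text: str) -> list:
    """Return a list of human-readable problems found in robots.txt text."""
    problems = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        field, sep, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if not sep:
            problems.append(f"line {number}: missing ':' separator")
        elif field not in KNOWN_FIELDS:
            problems.append(f"line {number}: unknown field '{field}'")
        elif field == "user-agent":
            seen_agent = True
            if not value:
                problems.append(f"line {number}: User-agent has no value")
        else:  # field == "disallow"
            if not seen_agent:
                problems.append(f"line {number}: Disallow appears before any User-agent")
            if value and not value.startswith("/"):
                problems.append(f"line {number}: Disallow path '{value}' should start with '/'")
    return problems


SAMPLE = "User-agent: *\nDisallow: /images/\nDisallow: email.html\n"
for problem in check_robots_txt(SAMPLE):
    print(problem)  # line 3: Disallow path 'email.html' should start with '/'
```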

New Robots.txt Syntax Checker
Robots.txt Validator

Specifying Robot Rules in HTML Meta Tags

Alternatively (or even additionally) you can specify the rules in your HTML file itself, within a meta tag. This tag appears inside the head tag. Here is an example:

<head>
<meta name="robots" content="noindex,nofollow">
<title>My Page</title>
</head>

In the content= area within the quotes you have a few choices. For the first word, before the comma, you can use either index, meaning the robot will add the page to the search engine database, or noindex, meaning the robot will not add the page to the search engine database.

For the second word, after the comma, you also have two choices. You can use follow, meaning the robot will also visit the pages linked from that page and catalog them (provided no robots meta tag on those pages prevents it, in which case it will skip over them), or nofollow, meaning the robot will act on only that page and not follow the links on it.
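A robot that honors this tag has to pull the content value out of the page and split it at the commas. The Python sketch below shows one way that could look, using the standard html.parser module. The names RobotsMetaParser and robots_meta are invented for the example, and treating a page without the tag as index,follow is an assumption of this sketch.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Split "noindex,nofollow" at the commas and normalize each word.
            for word in content.split(","):
                if word.strip():
                    self.directives.add(word.strip().lower())


def robots_meta(html: str) -> dict:
    """Return whether a robot may index the page and follow its links."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {
        "index": "noindex" not in parser.directives,
        "follow": "nofollow" not in parser.directives,
    }


PAGE = '<head> <meta name="robots" content="noindex,nofollow"> <title>My Page</title> </head>'
print(robots_meta(PAGE))  # {'index': False, 'follow': False}
```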

Which do I use, robots.txt or a meta tag?

The robots.txt method is best if you want to keep robots from indexing a whole directory or even protect certain files. It also lets you change things from one file rather than from each .html file you have. This is good for keeping your pages from being added to search engines.

The robots meta tag is best if you want search engines to add your pages to their databases while deciding, page by page, whether a page is indexed and whether its links are followed.

Do remember, though, that spiders will only find content on your pages and on pages that are linked to. If any of your pages aren't linked to, then spiders may not find and index those pages.
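The reason is that a spider discovers pages by following links outward from pages it already knows. The Python sketch below imitates that on a tiny made-up site held in a dictionary; the page names and the function reachable_pages are hypothetical, and a real spider would of course fetch pages over the network instead.

```python
from collections import deque


def reachable_pages(links: dict, start: str) -> set:
    """Return every page that can be reached by following links from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Each key is a page; its list holds the pages it links to.
SITE = {
    "/index.html": ["/about.html", "/products.html"],
    "/about.html": ["/index.html"],
    "/products.html": ["/products/widget.html"],
    "/products/widget.html": [],
    "/orphan.html": ["/index.html"],  # links out, but nothing links to it
}

found = reachable_pages(SITE, "/index.html")
print(sorted(set(SITE) - found))  # ['/orphan.html']
```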