OWASP-IG-001

Source: Internet. Editor: 程序博客网. Time: 2024/04/27 20:46

Information Gathering

IG stands for information gathering. In a penetration test, gathering information is always the first step.

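IG-001: Testing Spiders, Robots, and Crawlers

This section describes how to test the robots.txt file.

A website can publish a robots.txt file in its web root. The file declares which parts of the site robots should not visit, so that part or all of the site's content stays out of search engine indexes, or so that search engines index only the content that is specified.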

Example: http://www.google.com/robots.txt
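As a rough sketch of the first step of the test, the snippet below derives the robots.txt location from any URL on a site and downloads the file with the Python standard library. The helper names `robots_url` and `fetch_robots` are illustrative and not part of the OWASP text; the network call is left commented out so the snippet runs offline.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # robots.txt lives at the root of the scheme + host (+ port).
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots(page_url: str, timeout: float = 10.0) -> str:
    """Download robots.txt and return its text (raises on HTTP errors)."""
    with urlopen(robots_url(page_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


print(robots_url("http://www.google.com/search?q=test"))
# -> http://www.google.com/robots.txt

# Network access is needed for the real download, so it is commented out:
# print(fetch_robots("http://www.google.com/"))
```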

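A “robots.txt” file contains one or more records, built from the following fields:

User-agent:

The value of this field names the search engine robot that the record applies to. If a “robots.txt” file contains several User-agent records, several robots are restricted by the protocol. The file must contain at least one User-agent record. If the value is set to *, the record applies to every robot; a “robots.txt” file may contain only one “User-agent: *” record.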
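To illustrate how a robot picks the record that applies to it, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the robot name `SomeOtherBot`, and the paths are made up for the example; the Disallow field it relies on is explained in the next part of this section.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one record for Googlebot, one for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """True if the named robot may fetch the path under ROBOTS_TXT."""
    return parser.can_fetch(agent, path)


# Googlebot matches its own record, so only /private/ is off limits.
print(allowed("Googlebot", "/index.html"))         # True
print(allowed("Googlebot", "/private/data.html"))  # False
# Any other robot falls back to the "User-agent: *" record.
print(allowed("SomeOtherBot", "/index.html"))      # False
```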

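Disallow:

The value of this field describes a URL that should not be visited. It can be a complete path or a path prefix: robots will not visit any URL that begins with the Disallow value. For example:

“Disallow: /help” blocks search engines from both /help.html and /help/index.html, whereas “Disallow: /help/” lets robots visit /help.html but not /help/index.html.

A Disallow record with an empty value means that every part of the site may be visited. The “/robots.txt” file must contain at least one Disallow record. If “/robots.txt” is an empty file, the site is open to all search engine robots.

However, web spiders, robots, and crawlers can deliberately ignore the URLs that robots.txt marks as disallowed, so the file should not be treated as an access control mechanism.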
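The prefix behaviour in this example can be checked with Python's standard `urllib.robotparser`. This is only a sketch of the matching rule; the helper name `can_fetch` is an assumption made for the example.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(disallow_value: str, path: str) -> bool:
    """Check a path against a one-record robots.txt with the given Disallow."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    return parser.can_fetch("*", path)


# "Disallow: /help" is a plain prefix, so it blocks both URLs.
print(can_fetch("/help", "/help.html"))         # False
print(can_fetch("/help", "/help/index.html"))   # False

# "Disallow: /help/" only matches URLs under the /help/ directory.
print(can_fetch("/help/", "/help.html"))        # True
print(can_fetch("/help/", "/help/index.html"))  # False
```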
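The two "open" cases (an empty Disallow value and an empty file) can be compared with a blocking record in a short sketch, again using the standard `urllib.robotparser`. The helper `is_allowed` and the test path are illustrative.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, path="/any/page.html"):
    """Parse the given robots.txt lines and test one path for any robot."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch("*", path)


empty_disallow = ["User-agent: *", "Disallow:"]  # empty value: allow everything
empty_file = []                                  # no records at all
block_all = ["User-agent: *", "Disallow: /"]     # for contrast: block everything

print(is_allowed(empty_disallow))  # True
print(is_allowed(empty_file))      # True
print(is_allowed(block_all))       # False
```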
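Because nothing enforces the file, a tester can simply read the Disallow entries and review each listed path by hand. The following sketch collects them from robots.txt text; the function name `disallowed_paths` and the sample content are assumptions made for illustration, not something taken from the OWASP guide.

```python
def disallowed_paths(robots_txt: str) -> list:
    """Return the unique, non-empty Disallow values in order of appearance."""
    paths = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        # Field names are case-insensitive; empty values mean "allow all".
        if field.strip().lower() == "disallow" and value and value not in paths:
            paths.append(value)
    return paths


# Hypothetical robots.txt content used only for this example.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /backup/   # old dumps
Disallow:

User-agent: Googlebot
Disallow: /admin/
"""

print(disallowed_paths(SAMPLE))  # ['/admin/', '/backup/']
```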

So how is this used in practice?

First, a tool from Google.

Link: https://www.google.com/accounts/ServiceLogin?service=sitemaps&passive=true&nui=1&continue=https%3A%2F%2Fwww.google.com%2Fwebmasters%2Ftools%2F&followup=https%3A%2F%2Fwww.google.com%2Fwebmasters%2Ftools%2F&hl=zh-CN

Google provides a tool that can analyze a robots.txt file. The steps are as follows:

1. Sign into Google Webmaster Tools with your Google Account (create an account first if you do not have one).

2. On the Dashboard, click the URL for the site you want to test. A prompt is then displayed.

3. Click Tools, and then click Analyze robots.txt to start the analysis.

Some links

Whitepapers

[1] "The Web Robots Pages" - http://www.robotstxt.org/
[2] "How do I block or allow Googlebot?" - http://www.google.com/support/webmasters/bin/answer.py?answer=40364&query=googlebot&topic=&type=
[3] "(ISC)2 Blog: The Attack of the Spiders from the Clouds" - http://blog.isc2.org/isc2_blog/2008/07/the-attack-of-t.html
[4] "How do I check that my robots.txt file is working as expected?" - http://www.google.com/support/webmasters/bin/answer.py?answer=35237