robots.txt parsing in larbin

Source: Internet. Editor: 程序博客网. Time: 2024/04/28 04:10

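robots.txt was proposed by Martijn Koster in 1994.
Non-standard extensions include Crawl-delay (the wait between two consecutive fetches; it looks useful, though I am not sure how widely it is honored in practice), Sitemap, and Allow. The default implementation is first-matching-rule-wins. Google's implementation evaluates Allow patterns before Disallow patterns (an Allow whose path is at least as long as a matching Disallow wins), while Bing checks which rule is more specific.
The proposed extended standard adds Visit-time, Request-rate, and others.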
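To make the first-matching-rule-wins default concrete, here is a minimal sketch. The function name `first_match_allows` and the `(field, pattern)` rule tuples are assumptions made for illustration, and matching is plain prefix matching with no wildcard support.

```python
def first_match_allows(path, rules):
    """Return True if `path` may be fetched under first-match-wins.

    `rules` is a list of (field, pattern) tuples in file order, where
    field is "allow" or "disallow" (a hypothetical representation).
    """
    for field, pattern in rules:
        # An empty pattern matches nothing; otherwise use prefix matching.
        if pattern and path.startswith(pattern):
            return field == "allow"
    # No rule matched, so the path is not restricted.
    return True
```

Under this policy the outcome depends on rule order: with `Allow: /` listed before `Disallow: /ds/`, the path `/ds/x.html` is allowed, and with the two lines swapped it is blocked.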
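The robots parser in larbin does not support the Allow field. It also does not parse line by line; it scans for tokens instead, so an Allow and the path after it are both treated as paths belonging to the preceding Disallow. For example: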
User-Agent: *
Disallow: /ds/
Disallow: /oceano/
Allow: /

(This example is taken from http://www.china-designer.com/robots.txt)
After parsing, the disallow list is /ds/,/oceano/,/Allow,/
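The difference between token scanning and line-by-line parsing can be sketched as follows. This is not larbin's actual code: `parse_by_tokens` is a hypothetical, simplified imitation of a scanner that only knows User-Agent and Disallow, so the exact strings it collects differ slightly from the list above. `parse_by_lines` is an equally hypothetical line-based alternative, and it ignores User-Agent grouping for brevity.

```python
ROBOTS_TXT = """\
User-Agent: *
Disallow: /ds/
Disallow: /oceano/
Allow: /
"""

def parse_by_tokens(text):
    """Token scanner that only recognizes User-Agent and Disallow."""
    disallow = []
    in_disallow = False
    for token in text.split():
        key = token.lower()
        if key == "disallow:":
            in_disallow = True
        elif key == "user-agent:":
            in_disallow = False
        elif in_disallow:
            # Unknown fields such as "Allow:" end up here by mistake.
            disallow.append(token)
    return disallow

def parse_by_lines(text):
    """Line-based parser: one 'field: value' pair per line."""
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field = field.strip().lower()
        if field in ("allow", "disallow"):
            rules.append((field, value.strip()))
    return rules

print(parse_by_tokens(ROBOTS_TXT))
print(parse_by_lines(ROBOTS_TXT))
```

The token scanner swallows the Allow field and its path into the disallow list, while the line-based version keeps Allow as a separate rule.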
We will follow Google's approach.
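One way to read Google's approach, sketched under assumptions: among all matching rules the longest pattern wins, and Allow wins a tie. The function name `is_allowed` and the `(field, pattern)` tuples are made up for illustration, and wildcards (`*`, `$`) are left out.

```python
def is_allowed(path, rules):
    """Return True if `path` may be fetched.

    `rules` is a list of (field, pattern) tuples, with field being
    "allow" or "disallow" (a hypothetical representation).
    """
    best_len = -1
    allowed = True  # no matching rule means no restriction
    for field, pattern in rules:
        # An empty pattern matches nothing; otherwise use prefix matching.
        if not pattern or not path.startswith(pattern):
            continue
        n = len(pattern)
        # Longer pattern wins; on equal length, Allow wins.
        if n > best_len or (n == best_len and field == "allow"):
            best_len = n
            allowed = (field == "allow")
    return allowed

rules = [("disallow", "/ds/"), ("disallow", "/oceano/"), ("allow", "/")]
print(is_allowed("/ds/page.html", rules))
print(is_allowed("/index.html", rules))
```

With the example file above, `/ds/page.html` is blocked because `/ds/` is longer than `/`, while `/index.html` only matches `Allow: /` and is therefore allowed.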
