[Nutch 2.2.1 Basics Tutorial, Part 3] Nutch 2.2.1 Configuration Files
Source: Internet | Editor: 程序博客网 | Date: 2024/05/16 07:09
nutch-site.xml
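Nutch 2.2.1 has two configuration files: nutch-default.xml and nutch-site.xml.
The former holds the default properties that ship with Nutch and should generally not be modified.
To change a default, add a property with the same name to nutch-site.xml and set its value there. Values in nutch-site.xml override those in nutch-default.xml.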
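As a minimal sketch of the override mechanism just described, a nutch-site.xml that overrides a single default might look like the following. The property shown, db.ignore.external.links, is one of those covered in this article. The <configuration> root element is the Hadoop-style configuration format that Nutch uses. Treat the file as an illustration, not a recommended configuration.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides (illustrative sketch) -->
<configuration>
  <!-- Same name as the property in nutch-default.xml; this value takes precedence. -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
```

Only the properties that differ from the defaults need to appear in this file; everything else continues to come from nutch-default.xml.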
1. db.ignore.external.links
If true, only pages on the initially injected hosts are crawled, and outlinks to external hosts are ignored.
The same effect can be achieved by adding filters to regex-urlfilter.txt, but a large number of filters (several thousand, say) will significantly degrade Nutch's performance.
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
2. fetcher.parse
Can pages be parsed while they are being fetched? Yes, but this is not recommended.
<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content. NOTE: previous releases would
  default to true. Since 2.0 this is set to false as a safer default.</description>
</property>
Official explanation:
N.B. In a parsing fetcher, outlinks are processed in the reduce phase (at least when outlinks are followed). If a fetcher's reducer stalls you may run out of memory or disk space, usually after a very long reduce job. Behaviour typical to this is usually observed in this situation.
In summary, if it is possible, users are advised not to use a parsing fetcher as it is heavy on IO and often leads to the above outcome.
3. db.max.outlinks.per.page
By default, Nutch processes at most 100 outlinks per page, so on link-heavy pages some links are never fetched. To change this, override this property.
<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
The official explanation, from http://wiki.apache.org/nutch/FAQ/, is as follows:
Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on?
The crawl tool has a default limitation of 100 outlinks of one page that are being fetched. To overcome this limitation change the db.max.outlinks.per.page property to a higher value or simply -1 (unlimited).
file: conf/nutch-default.xml
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
see also: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg08665.html
4. file.content.limit, http.content.limit, ftp.content.limit
By default, Nutch keeps only the first 65536 bytes of a fetched page and discards the rest.
On some large sites, however, the home page is far larger than 65536 bytes, and the first 65536 bytes may contain nothing but layout markup, with no hyperlinks at all.
The defaults are therefore changed as follows:
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  Caution: classical ftp RFCs never defines partial transfer and, in fact,
  some ftp servers out there do not handle client side forced close-down very well.
  Our implementation tries its best to handle such situations smoothly.
  </description>
</property>