Installing and Using Nutch

Source: Internet. Editor: 程序博客网. Time: 2024/06/10 20:25

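This guide uses Nutch 1.1. Nutch requires a JDK; Nutch 0.9 and later need JDK 1.5 or later. Tomcat 5 or later must also be installed. On Windows, install Cygwin or another tool that can run shell scripts. The installation steps are as follows: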

1. Unpack the Nutch 1.1 archive to get the apache-nutch-1.1-bin directory.

2. Edit the configuration file conf/nutch-site.xml:
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch-1.0</value>
<description>HTTP 'User-Agent'</description>
</property>
<property>
<name>searcher.dir</name>
<value>/home/nutch/apache-nutch-1.1-bin/crawl</value>
<description>Path to root of crawl.</description>
</property>
</configuration>
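Instead of configuring nutch-site.xml, you can set these properties in nutch-default.xml. If you do, make sure nutch-site.xml has not been modified, i.e. it is still an empty configuration, because properties in nutch-site.xml override those in nutch-default.xml.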
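As a quick sanity check on a file in this format, the following minimal sketch parses the property entries and prints the value of searcher.dir. It uses only the JDK's built-in DOM parser; the class name SiteConfigReader and its methods are hypothetical helpers for illustration and are not part of Nutch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: reads <property> entries from a
// configuration document shaped like nutch-site.xml.
public class SiteConfigReader {

    // Returns name -> value pairs in document order.
    public static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = textOf(property, "name");
                if (name != null) {
                    props.put(name, textOf(property, "value"));
                }
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    // Text of the first child element with the given tag, or null if absent.
    private static String textOf(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml =
            "<configuration>"
          + "<property><name>http.agent.name</name><value>nutch-1.0</value></property>"
          + "<property><name>searcher.dir</name>"
          + "<value>/home/nutch/apache-nutch-1.1-bin/crawl</value></property>"
          + "</configuration>";
        Map<String, String> props = readProperties(xml);
        System.out.println("searcher.dir = " + props.get("searcher.dir"));
    }
}
```

An empty `<configuration></configuration>` document yields an empty map, which is what "unmodified" means for nutch-site.xml here.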

3. In the apache-nutch-1.1-bin directory, create a URL file named multiurls.txt with the following content:
http://www.chinaunix.net/
http://www.163.com/
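Note that each URL must end with "/". In version 1.1, if the URL file contains only one line, the crawl finds nothing; this problem does not occur after switching to version 1.2.
Test command: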
cd apache-nutch-1.1-bin
./bin/nutch crawl multiurls.txt -dir crawl -depth 3 -threads 4 >& crawl.log
Alternatively, omit the trailing >& crawl.log to see the crawl output directly on the console.
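Since a missing trailing "/" is easy to overlook, a small pre-flight check on the seed list can help before starting a crawl. This is only a sketch: the class SeedUrlCheck is a hypothetical helper, and it checks nothing beyond the trailing-slash rule stated above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: flags seed URLs without a trailing "/".
public class SeedUrlCheck {

    // Returns the non-blank lines that do not end with "/".
    public static List<String> missingTrailingSlash(List<String> lines) {
        List<String> bad = new ArrayList<>();
        for (String line : lines) {
            String url = line.trim();
            if (!url.isEmpty() && !url.endsWith("/")) {
                bad.add(url);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // The second entry deliberately lacks the trailing slash.
        List<String> seeds = Arrays.asList(
                "http://www.chinaunix.net/",
                "http://www.163.com");
        System.out.println(missingTrailingSlash(seeds)); // prints [http://www.163.com]
    }
}
```

To check a real file, the lines could be read with `java.nio.file.Files.readAllLines` and passed to the same method.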

4. Copy the nutch-1.1.war file from the apache-nutch-1.1-bin directory to Tomcat's webapps directory.

5. Edit webapps\nutch-1.1\WEB-INF\classes\nutch-site.xml under the Tomcat installation:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
......
<configuration>
<property>
<name>searcher.dir</name>
<value>/home/nutch/apache-nutch-1.1-bin/crawl</value>
</property>
</configuration>
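Some sources say that from Nutch 1.0 onward, the XSL stylesheet reference in the web application's nutch-site.xml needs to be changed, otherwise many problems occur. After the change, the file looks like this: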
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<nutch-conf>
<property>
<name>searcher.dir</name>
<value>/home/nutch/apache-nutch-1.1-bin/crawl</value>
</property>
</nutch-conf>

6. Restart Tomcat. Open http://localhost:8080/nutch-1.1/ in a browser and enter a search keyword.

Problem 1: Search keywords appear garbled.
Solution: edit conf/server.xml in the Tomcat directory and find the following:
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443" />
Add the URIEncoding="UTF-8" and useBodyEncodingForURI="true" attributes, as follows:
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443"
URIEncoding="UTF-8" useBodyEncodingForURI="true"/>
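To see why the attribute matters, the sketch below reproduces the garbling outside Tomcat. A browser sends a Chinese keyword as percent-encoded UTF-8 bytes; if the server decodes those bytes as ISO-8859-1 (the default URI encoding in older Tomcat versions), each byte becomes a separate wrong character. The class QueryDecodingDemo is a hypothetical illustration built on the JDK's URLEncoder and URLDecoder, not Tomcat code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class: shows how the decoding charset changes the result.
public class QueryDecodingDemo {

    // Percent-encodes text as UTF-8, as a browser would for a query parameter.
    public static String encodeUtf8(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Decodes a percent-encoded string using the given charset name.
    public static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String keyword = "\u4e2d\u6587";          // the two characters of "Chinese"
        String encoded = encodeUtf8(keyword);     // %E4%B8%AD%E6%96%87
        System.out.println(encoded);

        // Correct: the six bytes are read back as two UTF-8 characters.
        System.out.println(decode(encoded, "UTF-8").equals(keyword));   // true

        // Wrong charset: the six bytes become six unrelated characters.
        System.out.println(decode(encoded, "ISO-8859-1").length());     // 6
    }
}
```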

Problem 2: In the search results, cached pages that contain Chinese appear garbled.
Solution: edit cached.jsp and find the following:
else
content = new String(bean.getContent(details));
Change it to:
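else {
    // Take the charset from the Content-Type value; default to UTF-8 when none is declared.
    int index = contentType.indexOf("charset=");
    String charEncoding = "utf-8";
    if (index >= 0) {
        // 8 is the length of "charset="
        charEncoding = contentType.substring(index + 8).trim();
    }
    // Decode the cached bytes with the page's own encoding instead of the platform default.
    content = new String(bean.getContent(details), charEncoding);
}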
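The substring approach above assumes that charset= is the last thing in the Content-Type value and that the name is valid. A slightly more defensive variant, sketched here as a standalone class, also handles quotes, trailing parameters, and unknown charset names. CharsetFromContentType and charsetOf are hypothetical names; this is not code from Nutch's cached.jsp.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper: extracts a usable charset name from a Content-Type value.
public class CharsetFromContentType {

    // Returns the declared charset, or the fallback if none is declared
    // or the declared name is not supported by this JVM.
    public static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        int index = contentType.toLowerCase(Locale.ROOT).indexOf("charset=");
        if (index < 0) {
            return fallback;
        }
        String value = contentType.substring(index + "charset=".length());
        int semicolon = value.indexOf(';');
        if (semicolon >= 0) {
            value = value.substring(0, semicolon);   // drop trailing parameters
        }
        value = value.trim().replace("\"", "");      // drop optional quotes
        try {
            return (!value.isEmpty() && Charset.isSupported(value)) ? value : fallback;
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", "utf-8")); // ISO-8859-1
        System.out.println(charsetOf("text/html", "utf-8"));                     // utf-8
    }
}
```

In the JSP, the result could be passed as the second argument of `new String(bytes, charsetName)` in place of charEncoding.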

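Problem 3: The first line of the URL file is ignored during the crawl; that is, the first URL in multiurls.txt, http://www.chinaunix.net/, yields no crawled content.
Solution: in Nutch 0.9 this is a bug that requires a patch, NUTCH-503:
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
Reference: http://hi.baidu.com/phpasp/blog/item/f3b96209f5948bcb3ac76351.html
However, the patch page says this bug only affects version 0.9 and is already fixed in 1.0. The cause of the problem in version 1.1 is still unclear. In addition, as more URLs were added to the txt file, content from the newly added pages could no longer be crawled. After switching to version 1.2 the problem disappeared, so it does appear to be a bug.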

Nutch documentation: http://wiki.apache.org/nutch/NutchTutorial

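Addendum
Load order of the Nutch configuration files:
hadoop-default.xml > hadoop-site.xml > nutch-default.xml > nutch-site.xml > crawl-tool.xml
Properties in a configuration file loaded later override the same properties from files loaded earlier.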
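The override rule can be illustrated with plain maps. This is a minimal sketch of "last loaded wins", not the actual Hadoop or Nutch configuration loader; the class ConfigOverrideDemo and the property values below are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo: merges configuration resources in load order.
public class ConfigOverrideDemo {

    // Later maps in the list overwrite keys set by earlier ones.
    public static Map<String, String> merge(List<Map<String, String>> resourcesInLoadOrder) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> resource : resourcesInLoadOrder) {
            merged.putAll(resource);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Illustrative values standing in for nutch-default.xml and nutch-site.xml.
        Map<String, String> nutchDefault =
                Collections.singletonMap("searcher.dir", "default-dir");
        Map<String, String> nutchSite =
                Collections.singletonMap("searcher.dir", "/home/nutch/apache-nutch-1.1-bin/crawl");

        Map<String, String> effective = merge(Arrays.asList(nutchDefault, nutchSite));
        // nutch-site.xml is loaded after nutch-default.xml, so its value wins.
        System.out.println(effective.get("searcher.dir"));
    }
}
```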

0 0