Nutch 2.3 + HBase 0.94 + Solr 4.10.3: single-machine integrated installation and configuration


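Nutch grew out of the Apache Lucene project. It is an extensible, scalable open-source web crawler, maintained as two code bases:
1. Nutch 1.x: a mature, production-ready crawler. The 1.x line relies on Apache Hadoop data structures and offers fine-grained configuration. Hadoop is well suited to batch processing.
2. Nutch 2.x: an emerging alternative directly inspired by 1.x. It differs from 1.x in one key area, storage: by using Apache Gora™ to handle object-to-persistence mapping, it decouples storage from any specific underlying data store.
This makes it possible to implement a very flexible model for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, and so on) and to plug it into many NoSQL storage solutions.
3. The main difference between the two lines is therefore the underlying storage. 1.x is built on the Hadoop architecture and stores its data in HDFS, whereas 2.x uses Apache Gora to access back ends such as HBase, Accumulo, Cassandra, MySQL, DataFileAvroStore, and AvroStore.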

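I. Installation environment
Hardware: virtual machine
Operating system: CentOS 6.4, 64-bit
IP: 10.51.121.10
Hostname: datanode-4
Install user: root
JDK: JDK 1.7 or later is required. This guide uses jdk1.7.0_75, with the environment variables already configured.
HBase: the official Nutch 2.3 documentation names HBase 0.94.14 as the matching version. For HBase 0.94.14 installation notes, see: http://blog.csdn.net/freedomboy319/article/details/44102347
Solr: this guide integrates solr-4.10.3.
Ant: Ant must be installed, with its environment variables configured, before installing Nutch 2.3. This guide uses apache-ant-1.9.4.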

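II. Installing Nutch 2.3
1. Download: http://nutch.apache.org/downloads.html. This guide uses apache-nutch-2.3-src.tar.gz.
2. Extract: run #tar -zxvf apache-nutch-2.3-src.tar.gz. Here the archive is extracted under /root/nutch, so the Nutch installation directory is /root/nutch/apache-nutch-2.3; $NUTCH_HOME below refers to that directory.
3. Add the following property to $NUTCH_HOME/conf/nutch-site.xml, inside the <configuration> element: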

<property> <name>storage.data.store.class</name> <value>org.apache.gora.hbase.store.HBaseStore</value> <description>Default class for storing data</description></property>

4. In $NUTCH_HOME/ivy/ivy.xml, find the following dependency:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />

Make sure this dependency is active: if it is commented out, remove the comment markers.
Note: rev=0.5 corresponds to HBase 0.94.14; rev=0.3 corresponds to HBase 0.90.4.

5. Add the following setting to $NUTCH_HOME/conf/gora.properties:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

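6. In $NUTCH_HOME, run #ant runtime to compile.
If the build ends with BUILD SUCCESSFUL, compilation succeeded. If it ends with BUILD FAILED, compilation failed, and the cause has to be tracked down from the build output.
After a successful build, two new directories appear: build and runtime.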
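
For step 4 above, the check for a commented-out dependency can be scripted before running the build. The following is only a sketch: the helper names strip_xml_comments and gora_hbase_enabled are made up for illustration, and it assumes the dependency is disabled with ordinary XML comments, possibly spanning several lines.

```shell
# strip_xml_comments FILE: print FILE with <!-- ... --> comments removed,
# including comments that span several lines.
strip_xml_comments() {
  awk '
    {
      line = $0; out = ""
      while (length(line) > 0) {
        if (in_comment) {
          i = index(line, "-->")
          if (i == 0) { line = "" } else { line = substr(line, i + 3); in_comment = 0 }
        } else {
          i = index(line, "<!--")
          if (i == 0) { out = out line; line = "" } else {
            out = out substr(line, 1, i - 1)
            line = substr(line, i + 4)
            in_comment = 1
          }
        }
      }
      print out
    }' "$1"
}

# gora_hbase_enabled FILE: succeed if an uncommented gora-hbase dependency exists.
gora_hbase_enabled() {
  strip_xml_comments "$1" | grep -q 'name="gora-hbase"'
}

# Runs only when NUTCH_HOME points at a source tree containing ivy/ivy.xml.
ivy_file="${NUTCH_HOME:-}/ivy/ivy.xml"
if [ -f "$ivy_file" ]; then
  if gora_hbase_enabled "$ivy_file"; then
    echo "gora-hbase dependency is active"
  else
    echo "gora-hbase dependency is missing or commented out"
  fi
fi
```

Stripping comments first matters because a plain grep for the dependency line would also match a line that sits between comment markers on separate lines.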

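III. Installing Solr
1. Download the matching Solr release from http://archive.apache.org/dist/lucene/solr/. This guide uses solr-4.10.3.tgz.
Note: use a Solr 4 release; the author could not get the integration working with Solr 5.
2. Extract: run #tar -zxvf solr-4.10.3.tgz under /root/nutch. The Solr installation directory is then /root/nutch/solr-4.10.3; ${SOLR_HOME} below refers to that directory.
3. Change to /root/nutch/solr-4.10.3/example and run #java -jar start.jar to start Solr.
4. Open http://localhost:8983/solr/ in a browser.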
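
On a headless machine the browser check in step 4 can be replaced by polling the same URL. This is a rough sketch, assuming curl is installed and Solr was started in another terminal or in the background; wait_for_solr is a hypothetical helper name, not part of Solr.

```shell
# wait_for_solr URL [TRIES] [DELAY_SECONDS]
# Polls URL until it answers with a successful HTTP status, or gives up.
wait_for_solr() {
  url=$1
  tries=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl fail on HTTP errors; the response body is discarded.
    if curl -fsS -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example call once Solr has been started:
#   wait_for_solr "http://localhost:8983/solr/" 30 2 && echo "Solr is up"
```

Jetty may take a few seconds to load the example core, which is why the sketch retries rather than probing once.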

IV. Integrating Solr
1. Back up the schema.xml of the Solr example first.

#mv ${SOLR_HOME}/example/solr/collection1/conf/schema.xml ${SOLR_HOME}/example/solr/collection1/conf/schema.xml.bak

2. Copy the schema.xml from the Nutch runtime directory into the Solr example directory. Here ${NUTCH_RUNTIME_HOME} refers to /root/nutch/apache-nutch-2.3/runtime/local.

#cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml  ${SOLR_HOME}/example/solr/collection1/conf/

3. With Solr 4.10.3 the author did not modify schema.xml at all. Integrating an older Solr release may require some changes; for details see: http://wiki.apache.org/nutch/NutchTutorial

4. Change to /root/nutch/solr-4.10.3/example and run #java -jar start.jar to restart Solr.

V. Configuring the crawler
1. Configure the agent. Add the agent name to /root/nutch/apache-nutch-2.3/runtime/local/conf/nutch-site.xml:

<property> <name>http.agent.name</name> <value>JustinNutchAgent</value></property>

2. Configure the indexing plugins. Add the following to /root/nutch/apache-nutch-2.3/runtime/local/conf/nutch-site.xml:

<property> <name>plugin.includes</name> <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value></property>

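3. Configure the URLs to crawl. Under /root/nutch/apache-nutch-2.3/runtime/local, create a directory named myUrls containing a file named seed.txt, and list the seed URLs in it, one per line:

http://www.10jqka.com.cn/
http://www.cnblogs.com/

This crawls the Tonghuashun (10jqka) and Cnblogs sites.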
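
The seed directory from step 3 can also be created from the shell. A minimal sketch, assuming it is run from the runtime/local directory; the SEED_DIR variable is introduced here only for illustration and defaults to the myUrls directory used in this guide.

```shell
# Directory holding the seed list; defaults to ./myUrls as used in this guide.
seed_dir=${SEED_DIR:-./myUrls}
mkdir -p "$seed_dir"

# One seed URL per line.
cat > "$seed_dir/seed.txt" <<'EOF'
http://www.10jqka.com.cn/
http://www.cnblogs.com/
EOF

# Show what was written.
cat "$seed_dir/seed.txt"
```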

VI. Starting the crawl
1. Start HBase first: change to /root/hadoop/hbase-0.94.14/ and run #./bin/start-hbase.sh.
2. Start Solr: change to /root/nutch/solr-4.10.3/example and run #java -jar start.jar.
3. Start Nutch to begin the crawl.
Change to /root/nutch/apache-nutch-2.3/runtime/local and run the ./bin/crawl script:

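Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
<seedDir>: directory containing the seed files
<crawlID>: ID of the crawl job
<solrUrl>: address of the Solr instance used for indexing and search
<numberOfRounds>: number of iterations, i.e. the crawl depth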

./bin/crawl ./myUrls/ mycrawl1 http://localhost:8983/solr/ 2
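When the crawl finishes, open the Solr query page at http://10.51.121.10:8983/solr/#/collection1/query; it shows the following:
[Screenshot: query results in the Solr admin UI]
This confirms that the pages were fetched and indexed in Solr, and that the crawled content can be searched there.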
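
The indexed document count can also be checked from the command line. A minimal sketch, assuming the default collection1 core of the Solr example and its standard /select handler; solr_count_url is a hypothetical helper name.

```shell
# solr_count_url BASE: build a query URL that matches all documents but
# returns no rows, so the response carries only the total hit count.
# Assumes the example core name collection1.
solr_count_url() {
  printf '%s/collection1/select?q=*:*&rows=0&wt=json\n' "${1%/}"
}

# Print the URL for the Solr address used in this guide.
solr_count_url "http://localhost:8983/solr/"

# With Solr running, fetch it; the numFound field holds the document count:
#   curl -fsS "$(solr_count_url http://localhost:8983/solr/)"
```

A numFound of 0 after a crawl usually means the indexing step did not run or no index writer was active.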

VII. Common errors
1. The fetch job fails with the following error:

# ./bin/nutch fetch -all -crawlId mycrawl1 -threads 5

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/nutch/apache-nutch-2.3/runtime/local/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/nutch/apache-nutch-2.3/runtime/local/lib/slf4j-log4j12-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
FetcherJob: starting at 2015-03-09 17:05:53
FetcherJob: fetching all
Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
    at org.apache.nutch.fetcher.FetcherJob.checkConfiguration(FetcherJob.java:273)
    at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:159)
    at org.apache.nutch.fetcher.FetcherJob.fetch(FetcherJob.java:254)
    at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:317)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.FetcherJob.main(FetcherJob.java:324)

Cause: http.agent.name is not configured in /root/nutch/apache-nutch-2.3/runtime/local/conf/nutch-site.xml. Add the following property to that file:

<property> <name>http.agent.name</name> <value>JustinNutchAgent</value></property>

2. Running # ./bin/nutch index solr.server.url=http://localhost:8983/solr -all -crawlId mycrawl1 produces the following output:

IndexingJob: starting
No IndexWriters activated - check your configuration
IndexingJob: done.

Cause: the indexing plugin has to be enabled in nutch-site.xml. Add the following property to /root/nutch/apache-nutch-2.3/runtime/local/conf/nutch-site.xml:

<property> <name>plugin.includes</name> <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value></property>
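
The plugin.includes value is a regular expression over plugin ids, so a candidate value can be tried against individual ids before restarting a crawl. The sketch below uses grep -E as a stand-in for the Java regex engine that Nutch uses, which should agree for simple alternations like these; plugin_enabled is a hypothetical helper and the pattern is a shortened, illustrative one, not the full value from nutch-site.xml.

```shell
# plugin_enabled PATTERN PLUGIN_ID
# Succeeds when the whole plugin id matches the pattern (grep -x = whole line).
plugin_enabled() {
  printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# Shortened, illustrative pattern.
includes='protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|more)|indexer-solr'

for id in indexer-solr parse-tika index-anchor; do
  if plugin_enabled "$includes" "$id"; then
    echo "$id: enabled"
  else
    echo "$id: not matched"
  fi
done
```

The same helper makes separator mistakes visible: with a `|` missing between two alternatives, for example 'urlnormalizer-(pass|regex|basic)protocol-http', neither urlnormalizer-basic nor protocol-http matches, and the plugins are skipped without an error.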

0 0