Configuring and Running Nutch 1.0

Source: Internet; Published by: 小鸣单车知乎; Edited by: 程序博客网; Time: 2024/05/18 03:23

http://blog.csdn.net/shirdrn/article/details/5922087

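I haven't followed Nutch for a while, so I used the recent holiday to try it again. Later I may dig deeper into Nutch and related projects such as Hadoop; for now, the goal is simply to get Nutch running with the simplest possible configuration.

This time I downloaded Nutch 1.0, whose configuration appears to differ slightly from earlier releases.

Because Nutch is built on Hadoop, the configuration that Hadoop needs must be in place first. To describe the actual configuration and run in detail, I go through it step by step. I simply use the root account; the Linux system is RHEL 5.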

1. Passwordless SSH login on Linux

Open a terminal and run:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa   
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys   

Then run:

$ ssh localhost   

If you are logged in without being asked for a password, passwordless authentication is set up correctly.
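The same check can be done without an interactive login. The following is only a sketch: the function name passwordless_ssh_ok and the SSH_BIN variable are made up for illustration, while BatchMode and ConnectTimeout are standard OpenSSH client options. With BatchMode=yes, ssh fails instead of prompting for a password.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if ssh can log in to the host without any prompt.
# SSH_BIN and passwordless_ssh_ok are illustrative names, not part of Nutch or Hadoop.
SSH_BIN=${SSH_BIN:-ssh}

passwordless_ssh_ok() {
  local host=${1:-localhost}
  # BatchMode=yes disables password prompts, so a missing key gives a non-zero exit.
  "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "$host" true
}
```

A possible use is `passwordless_ssh_ok && echo "passwordless login works"`. If it fails although the key is in place, it may help to run `ssh localhost` once by hand to accept the host key, and to check that ~/.ssh and ~/.ssh/authorized_keys are not writable by other users.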

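2. Configure Hadoop

(1) Set JAVA_HOME in conf/hadoop-env.sh to the JDK installation directory.

(2) Edit conf/hadoop-site.xml as follows (a single-node setup, with HDFS and the JobTracker both on localhost and a replication factor of 1):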

<?xml version="1.0"?>  
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>  
  
<!-- Put site-specific property overrides in this file. -->  
  
<configuration>      
  <property>      
    <name>fs.default.name</name>      
    <value>hdfs://localhost:9000</value>      
  </property>     
  <property>      
    <name>mapred.job.tracker</name>      
    <value>localhost:9001</value>      
  </property>  
  <property>      
    <name>dfs.replication</name>      
    <value>1</value>      
  </property>   
</configuration>  

3. Configure Nutch

(1) Edit conf/nutch-site.xml as follows:

<?xml version="1.0"?>  
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>  
  
<!-- Put site-specific property overrides in this file. -->  
  
<configuration>  
        <property>  
                <name>http.agent.name</name>  
                <value>z-hacker</value>  
        </property>  
        <property>  
                <name>http.agent.description</name>  
                <value>I'm z-hacker.</value>  
        </property>  
        <property>  
                <name>http.agent.url</name>  
                <value>www.z-hacker.com</value>  
        </property>  
        <property>  
                <name>http.agent.email</name>  
                <value>crawler@z-hacker.com</value>  
        </property>  
</configuration>  

Of these, http.agent.name is required; the others are optional.
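Since a missing http.agent.name is easy to overlook, a quick check before starting the crawl can help. The sketch below is a line-oriented reader, not a real XML parser: it assumes each <name> and <value> sits on a single line, as in the file above. The function names get_prop and require_prop are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: read a property from a Hadoop/Nutch-style *-site.xml file.
# Assumes <name>...</name> and <value>...</value> each fit on one line.
get_prop() {
  awk -v name="$2" '
    index($0, "<name>" name "</name>") { want = 1 }
    want && match($0, /<value>.*<\/value>/) {
      # Strip the 7-character <value> and the 8-character </value>.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
    /<\/property>/ { want = 0 }
  ' "$1"
}

# Print the value, or fail if the property is missing or empty.
require_prop() {
  local value
  value=$(get_prop "$1" "$2")
  if [ -z "$value" ]; then
    echo "missing or empty property: $2 in $1" >&2
    return 1
  fi
  printf '%s\n' "$value"
}
```

For example, `require_prop conf/nutch-site.xml http.agent.name` would print z-hacker for the file shown above and return a non-zero status if the property were absent.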

(2) Edit conf/crawl-urlfilter.txt; only the following line needs to change:

# accept hosts in MY.DOMAIN.NAME  
+^http://([a-z0-9]*\.)*sina.com.cn/

This setting restricts the crawl to pages under sina.com.cn (Sina).
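To see which URLs such a rule lets through, the pattern can be tried out with grep. This is only an approximation: Nutch evaluates the filter with Java regular expressions, whereas the sketch uses grep -E, which should behave the same for a pattern this simple. The leading + in the filter file means "accept" and is not part of the expression. The function name url_accepted is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: test URLs against the accept pattern from conf/crawl-urlfilter.txt.
# The leading "+" of the filter line is dropped; it only marks the rule as "accept".
PATTERN='^http://([a-z0-9]*\.)*sina.com.cn/'

url_accepted() {
  printf '%s\n' "$1" | grep -Eq "$PATTERN"
}
```

With this, `url_accepted 'http://news.sina.com.cn/china/'` succeeds, while an https URL or a URL on another domain fails, because the pattern is anchored to http:// and requires the host to end in sina.com.cn.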

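4. Prepare the seed URL file for the Nutch crawler

Create a directory in the current directory, for example urls. Under urls, create one or more text files, each containing one valid URL per line.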
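A minimal sketch of this step follows. The helper name make_seed_dir and the file name seed.txt are arbitrary choices, and the seed URL in the usage line is only an example that matches the filter from step 3.

```shell
#!/usr/bin/env bash
# Sketch: create a seed directory holding one text file with one URL per line.
# make_seed_dir and seed.txt are illustrative names.
make_seed_dir() {
  # Require a directory plus at least one URL.
  [ "$#" -ge 2 ] || { echo "usage: make_seed_dir DIR URL..." >&2; return 1; }
  local dir=$1
  shift
  mkdir -p "$dir" || return 1
  # Write each remaining argument as one line of the seed file.
  printf '%s\n' "$@" > "$dir/seed.txt"
}
```

For instance, `make_seed_dir urls 'http://www.sina.com.cn/'` creates urls/seed.txt with a single entry.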

5. Format the HDFS file system

[root@localhost nutch-1.0]# bin/hadoop namenode -format   

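If this is the first time you run it on this machine, it should be fine as long as no error is reported. If the file system has been formatted before, you may be asked whether to re-format it: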

Re-format filesystem in /tmp/hadoop-root/dfs/name ? (Y or N) Y

Enter Y to re-format.

6. Start the five Hadoop daemons

[root@localhost nutch-1.0]# bin/start-all.sh

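If no error is reported, the start succeeded and five background processes are now running. Use the jps command to check that the following five processes are present: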

[root@localhost nutch-1.0]# jps
15559 Jps
15407 JobTracker
15243 DataNode
15150 NameNode
15512 TaskTracker
15349 SecondaryNameNode

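If any of them is missing, the configuration still has a problem and you need to check the logs.

For example, my log directory is logs. If jps shows no NameNode process, the NameNode daemon failed to start; the log file logs/hadoop-root-namenode-localhost.log will show why.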
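The check against the jps listing can also be scripted. The sketch below reads jps output on standard input and prints the name of every expected daemon that is absent; the function name missing_daemons is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: read `jps` output on stdin and print each expected Hadoop daemon that is absent.
missing_daemons() {
  local out name
  out=$(cat)
  for name in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
    # Compare the second column exactly, so NameNode does not match SecondaryNameNode.
    printf '%s\n' "$out" | awk -v n="$name" '$2 == n { found = 1 } END { exit !found }' \
      || printf '%s\n' "$name"
  done
}
```

Running `jps | missing_daemons` prints nothing when all five daemons are up; any name it prints tells you which daemon's log file to look at.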

7. Upload the seed URL directory

The local seed directory (for example urls, containing the URL file) must be uploaded to HDFS. Run:

[root@localhost nutch-1.0]# bin/hadoop fs -put urls/ urls

If no exception is thrown, the upload succeeded.
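Rather than relying on the absence of an exception, the upload can be followed by a listing of the target. The wrapper below is a sketch: upload_seeds and HADOOP_BIN are illustrative names, and it assumes it is run from the nutch-1.0 directory so that bin/hadoop resolves.

```shell
#!/usr/bin/env bash
# Sketch: upload a local seed directory to HDFS and list the result.
# HADOOP_BIN and upload_seeds are illustrative names; run from the nutch-1.0 directory.
HADOOP_BIN=${HADOOP_BIN:-bin/hadoop}

upload_seeds() {
  local src=$1 dest=$2
  if [ ! -d "$src" ]; then
    echo "local directory not found: $src" >&2
    return 1
  fi
  # A relative destination resolves under the user's HDFS home, e.g. /user/root/urls.
  "$HADOOP_BIN" fs -put "$src" "$dest" && "$HADOOP_BIN" fs -ls "$dest"
}
```

A call such as `upload_seeds urls urls` should end with a listing that contains the uploaded URL file.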

8. Start Nutch in simple mode

Run the following command to start Nutch:

[root@localhost nutch-1.0]# bin/nutch crawl urls -dir storeDir -depth 10 -threads 5 -topN 50 >&./logs/nutch.log   

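Here, urls is the uploaded seed URL directory and storeDir is the directory where the crawled data is stored. -depth is the maximum number of crawl rounds (link depth from the seeds), -threads is the number of fetcher threads, and -topN caps the number of URLs fetched per round. The >& redirection sends both standard output and standard error to logs/nutch.log.

If the log shows the crawl going through the expected stages after you run the command, the configuration works. You can follow the log with: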

[root@localhost nutch-1.0]# tail -100f logs/nutch.log  
crawl started in: storeDir  
rootUrlDir = urls  
threads = 5  
depth = 10  
topN = 50  
Injector: starting  
Injector: crawlDb: storeDir/crawldb  
Injector: urlDir: urls  
Injector: Converting injected urls to crawl db entries.  
Injector: Merging injected urls into crawl db.  
Injector: done  
Generator: Selecting best-scoring urls due for fetch.  
Generator: starting  
Generator: segment: storeDir/segments/20101004091724  
Generator: filtering: true  
Generator: topN: 50  
Generator: Partitioning selected urls by host, for politeness.  
Generator: done.  
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.  
Fetcher: starting  
Fetcher: segment: storeDir/segments/20101004091724  
Fetcher: done  
CrawlDb update: starting  
CrawlDb update: db: storeDir/crawldb  
CrawlDb update: segments: [storeDir/segments/20101004091724]  
CrawlDb update: additions allowed: true  
CrawlDb update: URL normalizing: true  
CrawlDb update: URL filtering: true  
CrawlDb update: Merging segment data into db.  
CrawlDb update: done  
Generator: Selecting best-scoring urls due for fetch.  
Generator: starting  
Generator: segment: storeDir/segments/20101004091902  
Generator: filtering: true  
Generator: topN: 50  
Generator: Partitioning selected urls by host, for politeness.  
Generator: done.  
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.  
Fetcher: starting  
Fetcher: segment: storeDir/segments/20101004091902  
Fetcher: done  
CrawlDb update: starting  
CrawlDb update: db: storeDir/crawldb  
CrawlDb update: segments: [storeDir/segments/20101004091902]  
CrawlDb update: additions allowed: true  
CrawlDb update: URL normalizing: true  
CrawlDb update: URL filtering: true  
CrawlDb update: Merging segment data into db.  
CrawlDb update: done  
Generator: Selecting best-scoring urls due for fetch.  
Generator: starting  
Generator: segment: storeDir/segments/20101004092053  
Generator: filtering: true  
Generator: topN: 50  
Generator: Partitioning selected urls by host, for politeness.  
Generator: done.  
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.  
Fetcher: starting  
Fetcher: segment: storeDir/segments/20101004092053  
Fetcher: done  
CrawlDb update: starting  
CrawlDb update: db: storeDir/crawldb  
CrawlDb update: segments: [storeDir/segments/20101004092053]  
CrawlDb update: additions allowed: true  
CrawlDb update: URL normalizing: true  
CrawlDb update: URL filtering: true  
CrawlDb update: Merging segment data into db.  
CrawlDb update: done  
Generator: Selecting best-scoring urls due for fetch.  
Generator: starting  
Generator: segment: storeDir/segments/20101004092711  
Generator: filtering: true  
Generator: topN: 50  
Generator: Partitioning selected urls by host, for politeness.  
Generator: done.  
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.  
Fetcher: starting  
Fetcher: segment: storeDir/segments/20101004092711  
Fetcher: done  
CrawlDb update: starting  
CrawlDb update: db: storeDir/crawldb  
CrawlDb update: segments: [storeDir/segments/20101004092711]  
CrawlDb update: additions allowed: true  
CrawlDb update: URL normalizing: true  
CrawlDb update: URL filtering: true  
CrawlDb update: Merging segment data into db.  
CrawlDb update: done  
Generator: Selecting best-scoring urls due for fetch.  
Generator: starting  
Generator: segment: storeDir/segments/20101004092833  
Generator: filtering: true  
Generator: topN: 50  
Generator: 0 records selected for fetching, exiting ...  
Stopping at depth=4 - no more URLs to fetch.  
LinkDb: starting  
LinkDb: linkdb: storeDir/linkdb  
LinkDb: URL normalize: true  
LinkDb: URL filter: true  
LinkDb: adding segment: hdfs://localhost:9000/user/root/storeDir/segments/20101004091724  
LinkDb: adding segment: hdfs://localhost:9000/user/root/storeDir/segments/20101004091902  
LinkDb: adding segment: hdfs://localhost:9000/user/root/storeDir/segments/20101004092053  
LinkDb: adding segment: hdfs://localhost:9000/user/root/storeDir/segments/20101004092711  
LinkDb: done  
Indexer: starting  
Indexer: done  
Dedup: starting  
Dedup: adding indexes in: storeDir/indexes  
Dedup: done  
merging indexes to: storeDir/index  
Adding hdfs://localhost:9000/user/root/storeDir/indexes/part-00000  
done merging  
crawl finished: storeDir  
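The log is long, so a short summary can be handy. The sketch below counts the fetched segments, reports an early stop, and checks for the final "crawl finished" line; the function name summarize_crawl_log is made up for illustration, and the patterns are taken from the log lines shown above.

```shell
#!/usr/bin/env bash
# Sketch: summarize a Nutch crawl log such as logs/nutch.log.
summarize_crawl_log() {
  awk '
    { sub(/[ \t\r]+$/, "") }                # drop trailing whitespace
    /^Fetcher: segment:/  { segments++ }    # one such line per fetched segment
    /^Stopping at depth=/ { stop = $0 }     # early stop: nothing left to fetch
    /^crawl finished:/    { finished = 1 }
    END {
      printf "segments fetched: %d\n", segments
      if (stop != "") print stop
      status = finished ? "status: finished" : "status: not finished"
      print status
    }
  ' "$1"
}
```

For the log above, `summarize_crawl_log logs/nutch.log` would report four fetched segments and the "Stopping at depth=4" line: although -depth 10 was requested, the crawl ended after four rounds because the generator found no more URLs due for fetching.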

9. Inspect the Nutch results

Use the storage directory specified above to see what was written to HDFS:

[root@localhost nutch-1.0]# bin/hadoop fs -ls /user/root/storeDir  
Found 5 items  
drwxr-xr-x   - root supergroup          0 2010-10-04 09:28 /user/root/storeDir/crawldb  
drwxr-xr-x   - root supergroup          0 2010-10-04 09:31 /user/root/storeDir/index  
drwxr-xr-x   - root supergroup          0 2010-10-04 09:30 /user/root/storeDir/indexes  
drwxr-xr-x   - root supergroup          0 2010-10-04 09:29 /user/root/storeDir/linkdb  
drwxr-xr-x   - root supergroup          0 2010-10-04 09:27 /user/root/storeDir/segments