Integrating Nutch 1.3 with Hadoop 0.20.203.0


1. Installing Hadoop

                http://blog.csdn.net/deqingguo/article/details/6907372
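
Before moving on, it is worth confirming that the Hadoop 0.20.203.0 daemons are up and that HDFS answers. A minimal check, assuming a pseudo-distributed setup with HADOOP_HOME pointing at the Hadoop install directory:

                $HADOOP_HOME/bin/start-all.sh      # starts NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker
                jps                                # the five daemons above should be listed
                $HADOOP_HOME/bin/hadoop fs -ls /   # HDFS should respond without errors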

2. Downloading and installing Nutch 1.3

                svn co http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/  ~/nutch

                You can also download it directly from http://labs.renren.com/apache-mirror//nutch/; I used version 1.3.
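
If you go the mirror route instead of svn, the install is just unpacking the archive; a rough sketch, where the archive and directory names are assumptions and may differ from what the mirror actually serves:

                tar -xzf apache-nutch-1.3-bin.tar.gz    # archive name is an assumption; use the file you downloaded
                mv apache-nutch-1.3 ~/nutch             # keep the same ~/nutch path used in the rest of this post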

3. Edit nutch-site.xml under conf/

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>HD nutch agent</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty -
    please set this to a single word uniquely related to your organization.
    </description>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>HD nutch agent</value>
    <description>The agent strings we'll look for in robots.txt files,
    comma-separated, in decreasing order of precedence. You should
    put the value of http.agent.name as the first agent name, and keep the
    default * at the end of the list. E.g.: BlurflDev,Blurfl,*
    </description>
  </property>
</configuration>


4. Copy all files from Hadoop's conf/ directory into Nutch's conf/ directory.
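
A minimal way to do this, assuming Hadoop lives under $HADOOP_HOME and Nutch was checked out to ~/nutch as above:

                cp $HADOOP_HOME/conf/* ~/nutch/conf/    # core-site.xml, hdfs-site.xml, mapred-site.xml, masters, slaves, ...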

5. Rebuild Nutch with ant. If ant is not installed, you can install it directly with apt-get install ant.
            Note: without this recompilation the changes to nutch-site.xml take no effect, and the crawl fails with the error "Fetcher: No agents listed in 'http.agent.name' property".
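
A short sketch of that rebuild, assuming the checkout is at ~/nutch (the default ant build is what populates the runtime/local and runtime/deploy directories used below):

             sudo apt-get install ant    # only if ant is missing
             cd ~/nutch
             ant                         # rebuild so the nutch-site.xml changes are picked up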
6. Go into runtime/deploy/bin and run:
             ./nutch crawl hdfs://localhost:9000/user/fzuir/urls.txt -dir hdfs://localhost:9000/user/fzuir/crawled -depth 3 -topN 10
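
The crawl expects the seed list to already exist in HDFS. A minimal sketch of preparing it, reusing the hdfs://localhost:9000/user/fzuir paths from the command above (the seed URL is just an example):

             echo "http://www.apache.org/" > urls.txt        # one seed URL per line
             hadoop fs -mkdir /user/fzuir                    # only if the directory does not exist yet
             hadoop fs -put urls.txt /user/fzuir/urls.txt
             hadoop fs -cat /user/fzuir/urls.txt             # sanity check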
        At this point, another error is still reported:
             NullPointerException at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs.

       This is caused by a bug in Nutch 1.3. The Nutch website notes that it has been fixed for the 1.4 release, but 1.4 has not been published yet, so following the hints on the official site I modified the two Java files below myself and then recompiled:

      The first file to modify is src/java/org/apache/nutch/parse/ParseOutputFormat.java (if InvalidJobConfException is not already imported there, add the same org.apache.hadoop.mapred.InvalidJobConfException import as in the second patch below):

 public void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException {
-    Path out = FileOutputFormat.getOutputPath(job);
-    if (fs.exists(new Path(out, CrawlDatum.PARSE_DIR_NAME)))
-      throw new IOException("Segment already parsed!");
+    Path out = FileOutputFormat.getOutputPath(job);
+    if ((out == null) && (job.getNumReduceTasks() != 0)) {
+      throw new InvalidJobConfException(
+          "Output directory not set in JobConf.");
+    }
+    if (fs == null) {
+      fs = out.getFileSystem(job);
+    }
+    if (fs.exists(new Path(out, CrawlDatum.PARSE_DIR_NAME)))
+      throw new IOException("Segment already parsed!");
 }


    The second file to modify is src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java:

 import org.apache.hadoop.io.SequenceFile.CompressionType;
 import org.apache.hadoop.mapred.FileOutputFormat;
+import org.apache.hadoop.mapred.InvalidJobConfException;
 import org.apache.hadoop.mapred.OutputFormat;
 import org.apache.hadoop.mapred.RecordWriter;
 import org.apache.hadoop.mapred.JobConf;
@@ -46,8 +47,15 @@
   public void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException {
     Path out = FileOutputFormat.getOutputPath(job);
+    if ((out == null) && (job.getNumReduceTasks() != 0)) {
+      throw new InvalidJobConfException(
+          "Output directory not set in JobConf.");
+    }
+    if (fs == null) {
+      fs = out.getFileSystem(job);
+    }
     if (fs.exists(new Path(out, CrawlDatum.FETCH_DIR_NAME)))
-      throw new IOException("Segment already fetched!");
+      throw new IOException("Segment already fetched!");
   }
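
After patching the two files, the build and crawl just need to be repeated; a short sketch with the same paths as before (the -rmr step is only needed if the failed run left a partial output directory behind):

             cd ~/nutch
             ant                                     # recompile so the patched classes go into the deployed job
             hadoop fs -rmr /user/fzuir/crawled      # clear leftovers from the failed attempt, if any
             cd runtime/deploy/bin
             ./nutch crawl hdfs://localhost:9000/user/fzuir/urls.txt -dir hdfs://localhost:9000/user/fzuir/crawled -depth 3 -topN 10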


7. Problem solved!