Integrating Nutch 1.3 with Hadoop 0.20.203.0
1. Installing Hadoop
See: http://blog.csdn.net/deqingguo/article/details/6907372
2. Downloading and installing Nutch 1.3
svn co http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ ~/nutch
Alternatively, download it directly from http://labs.renren.com/apache-mirror//nutch/ ; I used version 1.3.
3. Edit conf/nutch-site.xml
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>HD nutch agent</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty -
    please set this to a single word uniquely related to your organization.
    </description>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>HD nutch agent</value>
    <description>The agent strings we'll look for in robots.txt files,
    comma-separated, in decreasing order of precedence. You should put the
    value of http.agent.name as the first agent name, and keep the default *
    at the end of the list. E.g.: BlurflDev,Blurfl,*
    </description>
  </property>
</configuration>
4. Copy all of the files from Hadoop's conf/ directory into Nutch's conf/ directory.
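With $HADOOP_HOME and $NUTCH_HOME standing in for the two install directories (the /tmp paths below are placeholders for illustration only; substitute your real locations), step 4 is a single copy:

```shell
# Placeholder install locations -- replace with your real Hadoop/Nutch paths.
HADOOP_HOME=/tmp/nutch13-demo/hadoop
NUTCH_HOME=/tmp/nutch13-demo/nutch

# Demo scaffolding only: fake the two conf/ directories so the copy can run.
mkdir -p "$HADOOP_HOME/conf" "$NUTCH_HOME/conf"
touch "$HADOOP_HOME/conf/core-site.xml" \
      "$HADOOP_HOME/conf/hdfs-site.xml" \
      "$HADOOP_HOME/conf/mapred-site.xml"

# The actual step: overlay Hadoop's configuration onto Nutch's conf/.
cp "$HADOOP_HOME"/conf/* "$NUTCH_HOME/conf/"
ls "$NUTCH_HOME/conf"
```

This puts Hadoop's core-site.xml, hdfs-site.xml and mapred-site.xml next to nutch-site.xml, so the Nutch jobs pick up the same cluster settings.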
5. Rebuild Nutch with ant (if ant is not installed, apt-get install ant will install it).
Note: without a rebuild, the changes to nutch-site.xml have no effect, and you will get the error: Fetcher: No agents listed in 'http.agent.name' property.
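The rebuild itself is just running ant from the checkout root (~/nutch from step 2); a sketch, guarded so it does nothing if ant or the checkout is missing:

```shell
# Rebuild Nutch so the edited conf/nutch-site.xml is baked into the job jar.
if command -v ant >/dev/null 2>&1 && [ -d ~/nutch ]; then
  cd ~/nutch
  ant                      # builds runtime/local and runtime/deploy
  ls runtime/deploy/bin    # the deploy scripts used in step 6
else
  echo "ant or ~/nutch not found; install ant and check out Nutch first"
fi
```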
6. Change into runtime/deploy/bin and run:
./nutch crawl hdfs://localhost:9000/user/fzuir/urls.txt -dir hdfs://localhost:9000/user/fzuir/crawled -depth 3 -topN 10
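Before this command can run, the seed list must exist in HDFS. A sketch of preparing it (the /user/fzuir paths follow the crawl command above; the hadoop and nutch steps assume a running pseudo-distributed cluster, so they are guarded and skipped when the hadoop CLI is absent):

```shell
# Build a local seed file, one URL per line.
mkdir -p /tmp/nutch13-demo
cat > /tmp/nutch13-demo/urls.txt <<'EOF'
http://nutch.apache.org/
EOF

# Upload the seed list and launch the crawl -- requires a running cluster.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -put /tmp/nutch13-demo/urls.txt /user/fzuir/urls.txt
  ./nutch crawl hdfs://localhost:9000/user/fzuir/urls.txt \
      -dir hdfs://localhost:9000/user/fzuir/crawled -depth 3 -topN 10
fi
```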
At this point, another error appears:
NullPointerException at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs.
This is caused by a bug in Nutch 1.3. The Nutch website notes that it is fixed in 1.4, but 1.4 has not been released yet, so, following the hints on the site, I modified the following two Java files myself and recompiled:
The first file to modify is src/java/org/apache/nutch/parse/ParseOutputFormat.java (this change also needs an import of org.apache.hadoop.mapred.InvalidJobConfException, as added in the second file):
 public void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException {
-  Path out = FileOutputFormat.getOutputPath(job);
-  if (fs.exists(new Path(out, CrawlDatum.PARSE_DIR_NAME)))
-    throw new IOException("Segment already parsed!");
+  Path out = FileOutputFormat.getOutputPath(job);
+  if ((out == null) && (job.getNumReduceTasks() != 0)) {
+    throw new InvalidJobConfException("Output directory not set in JobConf.");
+  }
+  if (fs == null) {
+    fs = out.getFileSystem(job);
+  }
+  if (fs.exists(new Path(out, CrawlDatum.PARSE_DIR_NAME)))
+    throw new IOException("Segment already parsed!");
 }
The second file is src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java:
 import org.apache.hadoop.io.SequenceFile.CompressionType;
 import org.apache.hadoop.mapred.FileOutputFormat;
+import org.apache.hadoop.mapred.InvalidJobConfException;
 import org.apache.hadoop.mapred.OutputFormat;
 import org.apache.hadoop.mapred.RecordWriter;
 import org.apache.hadoop.mapred.JobConf;
@@ -46,8 +47,15 @@
 public void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException {
   Path out = FileOutputFormat.getOutputPath(job);
+  if ((out == null) && (job.getNumReduceTasks() != 0)) {
+    throw new InvalidJobConfException("Output directory not set in JobConf.");
+  }
+  if (fs == null) {
+    fs = out.getFileSystem(job);
+  }
   if (fs.exists(new Path(out, CrawlDatum.FETCH_DIR_NAME)))
-    throw new IOException("Segment already fetched!");
+    throw new IOException("Segment already fetched!");
 }
7. After recompiling, the problem is solved and the crawl runs through.