Nutch Secondary Development

Source: Internet | Published by: 百度人工智能发布会 | Editor: 程序博客网 | Time: 2024/05/29 14:21

http://www.cnblogs.com/editice/archive/2012/06/19/2554462.html

http://blog.csdn.net/jiutao_tang/article/category/774126

http://wiki.apache.org/nutch/FrontPage

http://www.cnblogs.com/streamhope/category/310177.html

http://blog.csdn.net/amuseme_lu/article/category/330217

http://hi.baidu.com/0jiaqi/blog/item/fd9be5a6ff7bfc9ed0435870.html

http://nhy520.iteye.com/category/64782

http://deepfuture.iteye.com/category/93496

Fixing errors when importing Nutch 1.0 into Eclipse

http://www.iteye.com/topic/934862

http://wiki.apache.org/nutch/RunNutchInEclipse1.0

About Exception in thread "main" java.io.IOException: Job failed! when crawling with nutch-1.2

http://owwlo.com/blog/?p=36

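While running, Hadoop calls some Cygwin commands (such as df) and parses their output. The parsing does not take into account that the output format differs between language environments, which causes the error.

Switch the Cygwin language environment to English.

Edit the .bashrc file
and change or add the following lines:
export LANG="en_US.GBK"
export OUTPUT_CHARSET="GBK"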
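
To confirm that the new settings are actually picked up, a small Java program started from the Cygwin shell can print the LANG value and the default locale the JVM sees. This is only a sketch: the class name LangCheck and the helper looksEnglish are hypothetical, and the program merely inspects the environment; it does not reproduce how Hadoop parses command output.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of Nutch or Hadoop.
class LangCheck {
    // True when a LANG value such as "en_US.GBK" selects an English locale.
    static boolean looksEnglish(String lang) {
        return lang != null && lang.toLowerCase(Locale.ROOT).startsWith("en");
    }

    public static void main(String[] args) {
        String lang = System.getenv("LANG");
        System.out.println("LANG = " + lang);
        System.out.println("Default locale = " + Locale.getDefault());
        if (!looksEnglish(lang)) {
            System.out.println("LANG is not English; check .bashrc and restart the shell.");
        }
    }
}
```

Run it from a freshly opened Cygwin shell, since changes to .bashrc only apply to shells started after the edit.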

1.1.1 Hadoop error message during crawl

After Nutch has been configured, running Nutch's crawl command in Cygwin produces:

[Fatal Error] hadoop-site.xml:15:7: The content of elements must consist of well-formed character data or markup.

Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException: The content of elements must consist of well-formed character data or markup.

Solution:

hadoop-site.xml: one of the </property> tags has an extra angle bracket in front of it.
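
A stray bracket like this can be located without starting a crawl by parsing the file with the JDK's own XML parser. The sketch below is one possible way to do it, assuming the file path is passed as the first argument; the class name XmlWellFormedCheck and the method firstError are hypothetical.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper for illustration; not part of Nutch or Hadoop.
class XmlWellFormedCheck {
    // Returns null when the stream is well-formed XML,
    // otherwise a "line:column: message" description of the first error.
    static String firstError(InputStream in) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
            return null;
        } catch (SAXParseException e) {
            return e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage();
        } catch (SAXException e) {
            return String.valueOf(e.getMessage());
        } catch (IOException e) {
            return "I/O error: " + e.getMessage();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the config file to check, e.g. conf/hadoop-site.xml
        InputStream in = new FileInputStream(args[0]);
        try {
            String error = firstError(in);
            System.out.println(error == null ? "well-formed" : error);
        } finally {
            in.close();
        }
    }
}
```

The reported line and column correspond to the 15:7 position in the error message above, so the offending tag can be found directly in an editor.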

1.1.2 Job failed error when running crawl

Exception in thread "main" java.io.IOException: Job failed!

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)

at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)

at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

Solution:

This is usually caused by an incorrect change to MY.DOMAIN.NAME in crawl-urlfilter.txt.
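
Whether an edited rule really lets the seed URL through can be tested in isolation with java.util.regex. The sketch below makes several assumptions: the rule line follows the stock template form +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/, a '+' rule accepts a URL when its regex is found anywhere in it, and apache.org is only an illustrative domain. The class UrlFilterCheck is hypothetical and is not Nutch's own filter implementation.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not Nutch's own URL filter.
class UrlFilterCheck {
    // Returns true only for a '+' rule whose regex is found in the URL.
    static boolean accepts(String rule, String url) {
        if (rule == null || rule.length() < 2 || rule.charAt(0) != '+') {
            return false;
        }
        return Pattern.compile(rule.substring(1)).matcher(url).find();
    }

    public static void main(String[] args) {
        String template = "+^http://([a-z0-9]*\\.)*MY.DOMAIN.NAME/"; // placeholder left unedited
        String edited = "+^http://([a-z0-9]*\\.)*apache.org/";       // placeholder replaced
        String seed = "http://nutch.apache.org/";

        System.out.println(accepts(template, seed)); // false
        System.out.println(accepts(edited, seed));   // true
    }
}
```

Note that the pattern ends with a slash, so under these assumptions a seed written as http://nutch.apache.org without the trailing slash would not match either.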

1.1.3 Another Job failed

Exception in thread "main" java.io.IOException: Job failed!

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)

at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)

at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

Solution:

This is usually caused by an incorrect change to MY.DOMAIN.NAME in crawl-urlfilter.txt.

1.1.4 Running Nutch in Eclipse: Job failed

Exception in thread "main" java.io.IOException: Job failed!

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)

at org.apache.nutch.crawl.Injector.inject(Injector.java:162)

at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)

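Solution:

This problem is caused by the Java version setting in Eclipse. To fix it:

If the project was using Java 1.4, change it to 1.6.

Project -> Properties -> Java Compiler

On the right, under JDK Compliance,

set Compiler compliance level to 6.0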
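
The compliance level controls how the project is compiled. It can also help to confirm which Java version the Eclipse launch actually runs on. The sketch below prints the runtime version from standard system properties; the class JavaVersionCheck and its major helper are hypothetical, and the threshold of 6 simply mirrors the 1.6 requirement stated above.

```java
// Hypothetical helper for illustration; not part of Nutch.
class JavaVersionCheck {
    // Converts a specification version to its major number: "1.6" -> 6, "11" -> 11.
    static int major(String specVersion) {
        String[] parts = specVersion.split("\\.");
        int first = Integer.parseInt(parts[0]);
        return (first == 1 && parts.length > 1) ? Integer.parseInt(parts[1]) : first;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("major version = " + major(spec));
        if (major(spec) < 6) {
            System.out.println("This runtime is older than Java 6; change the JRE used by the launch.");
        }
    }
}
```

Running this class with the same Eclipse run configuration as the crawl shows the JRE that the crawl itself would use.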

Nutch research: errors encountered and their solutions

http://blog.csdn.net/nxh_love/article/details/6609389

Exception in thread "main" java.io.IOException: Job failed! when crawling with nutch-1.2

Setting up and testing a Nutch 1.0 environment on Windows
