Getting Started with Nutch 1.4 + Solr 3.5
1. Download Nutch 1.4 & Solr 3.5
http://mirror.bjtu.edu.cn/apache/lucene/solr/3.5.0/apache-solr-3.5.0.zip
http://mirror.bjtu.edu.cn/apache/nutch/apache-nutch-1.4-bin.zip
2. Extract and Install
2.1 Directory Layout
Nutch install directory: /home/xcloud/spider/nutch
Solr install directory: /home/xcloud/spider/solr
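The steps below refer to the install locations as $NUTCH_HOME and ${SOLR_HOME}. A minimal sketch for setting them up, assuming the layout above (the variable names are only a convention used in this post):
# create the parent directory and export the paths used in later commands
mkdir -p /home/xcloud/spider
export NUTCH_HOME=/home/xcloud/spider/nutch
export SOLR_HOME=/home/xcloud/spider/solr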
2.2 Extract the Archives
Extract Nutch:
unzip apache-nutch-1.4-bin.zip
mv apache-nutch-1.4 nutch
chmod 775 -R nutch
Extract Solr:
unzip apache-solr-3.5.0.zip
mv apache-solr-3.5.0 solr
chmod 775 -R solr
3. Nutch Configuration and Test Run
3.1 Nutch Directory Notes
Nutch run directory: $NUTCH_HOME/runtime/local/bin
Nutch config directory: $NUTCH_HOME/runtime/local/conf
3.2 Nutch Test
Run $NUTCH_HOME/runtime/local/bin/nutch with no arguments:
xcloud@xcloud:~/spider/nutch/runtime/local/bin$ ./nutch
Usage: nutch [-core] COMMAND
where COMMAND is one of:
crawl one-step crawler for intranets
readdb read / dump crawl db
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
generate generate new segments to fetch from crawl db
freegen generate new segments to fetch from text files
fetch fetch a segment's pages
parse parse a segment's pages
readseg read / dump segment data
mergesegs merge several segments, with optional filtering and slicing
updatedb update crawl db from segments after fetching
invertlinks create a linkdb from parsed segments
mergelinkdb merge linkdb-s, with optional filtering
solrindex run the solr indexer on parsed segments and linkdb
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
domainstats calculate domain statistics from crawldb
webgraph generate a web graph from existing segments
linkrank run a link analysis program on the generated web graph
scoreupdater updates the crawldb with linkrank scores
nodedumper dumps the web graph's node scores
plugin load a plugin and run one of its classes main()
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
Expert: -core option is for developers only. It avoids building the job jar,
instead it simply includes classes compiled with ant compile-core.
NOTE: this works only for jobs executed in 'local' mode
xcloud@xcloud:~/spider/nutch/runtime/local/bin$
Seeing this usage output means the installation succeeded.
3.3 Modify the Nutch Configuration
Edit $NUTCH_HOME/runtime/local/conf/nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>My Nutch Agent</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value></value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value></value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value></value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
</configuration>
3.4 Create the Seed URL Directory
mkdir $NUTCH_HOME/runtime/local/bin/urls
echo http://nutch.apache.org/ >> $NUTCH_HOME/runtime/local/bin/urls/seed.txt
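The seed file takes one URL per line, so additional start URLs can simply be appended later. A quick sanity check, assuming the paths above:
# confirm the seed list was written
cat $NUTCH_HOME/runtime/local/bin/urls/seed.txt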
3.5 Modify conf/regex-urlfilter.txt
sudo gedit $NUTCH_HOME/runtime/local/conf/regex-urlfilter.txt
Change the default accept-everything rule
# accept anything else
+.
to the following, so that only pages under nutch.apache.org are accepted:
+^http://([a-z0-9]*\.)*nutch.apache.org/
3.6 Crawl
./nutch crawl urls -dir crawl -depth 3 -topN 5
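When the crawl finishes, the directory given via -dir contains everything the later indexing step needs. A rough sketch of the expected layout (the subdirectory names are Nutch's own; the listing itself is illustrative):
# crawl/crawldb    URL states maintained across rounds (injected, fetched, unfetched)
# crawl/linkdb     inverted link database built by invertlinks
# crawl/segments/  one timestamped segment per fetch round
ls crawl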
3.7 solrindex
./nutch solrindex http://127.0.0.1:8983/solr/ mycrawl/crawldb -linkdb mycrawl/linkdb mycrawl/segments/*
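The paths passed to solrindex must point at the directory produced by the crawl; the command above assumes it was named mycrawl, the name adopted in section 5.2 below. A minimal end-to-end sketch under that assumption, run from $NUTCH_HOME/runtime/local/bin with Solr already running (section 4.1):
# crawl into mycrawl, then push the parsed segments into Solr
mkdir -p mycrawl
./nutch crawl urls -dir mycrawl -depth 3 -topN 5
./nutch solrindex http://127.0.0.1:8983/solr/ mycrawl/crawldb -linkdb mycrawl/linkdb mycrawl/segments/*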
4. Solr Configuration and Test Run
4.1 Start Solr
From ${SOLR_HOME}/example, run:
java -jar start.jar
4.2 Verify
Open the following pages in a browser:
http://localhost:8983/solr/admin/
http://localhost:8983/solr/admin/stats.jsp
4.3 Nutch & Solr Integration
Copy the Nutch schema into the Solr example core and restart Solr:
cp $NUTCH_HOME/runtime/local/conf/schema.xml ${SOLR_HOME}/example/solr/conf/
java -jar start.jar
Then enter nutch on the admin query page; the results come back in XML format.
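The same query can also be issued directly against Solr's standard select handler, bypassing the admin page. A minimal check (nutch is just the example query term from above; wt=xml is Solr 3.x's default response format):
# query the example core and print the XML response
curl 'http://localhost:8983/solr/select?q=nutch&wt=xml'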
5. Problems Encountered
5.1 Nutch Fetcher: No agents listed in 'http.agent.name' property
Fix: edit $NUTCH_HOME/runtime/local/conf/nutch-site.xml and set http.agent.name as shown in section 3.3.
5.2 InvalidInputException: Input path does not exist
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/xcloud/spider/nutch/runtime/local/bin/crawl/segments/20120405100520/parse_data
Input path does not exist: file:/home/xcloud/spider/nutch/runtime/local/bin/crawl/segments/20120405101222/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Original command: bin/nutch crawl urls -dir crawl -depth 3 -topN 5
Changed to:
mkdir mycrawl
chmod 777 mycrawl
bin/nutch crawl urls -dir mycrawl -depth 3 -topN 5
5.3 regex-urlfilter.txt regex matching problem
xcloud@xcloud:~/nutch$ nutch crawl urls -dir db -depth 2 -topN 2
solrUrl is not set, indexing will be skipped...
crawl started in: db
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=null
topN = 2
Injector: starting at 2012-04-09 14:56:20
Injector: crawlDb: db/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Check logs/hadoop.log:
Caused by: java.io.IOException: Invalid first character: http://XXX/product/*
at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:200)
at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
... 21 more
Original rule:
http://XXX/product/*
Changed to:
+^http://XXX/product/*
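As the "Invalid first character" message indicates, every non-comment rule in regex-urlfilter.txt must begin with + (accept) or - (reject). The patterns below are purely illustrative:
# reject common image URLs, accept everything under the product path
-\.(gif|jpg|png)$
+^http://XXX/product/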
5.4 timeout exception
2012-04-09 16:35:52,067 ERROR http.Http - java.net.SocketTimeoutException: Read timed out
2012-04-09 16:35:52,067 ERROR http.Http - at java.net.SocketInputStream.socketRead0(Native Method)
2012-04-09 16:35:52,067 ERROR http.Http - at java.net.SocketInputStream.read(SocketInputStream.java:129)
2012-04-09 16:35:52,067 ERROR http.Http - at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
2012-04-09 16:35:52,067 ERROR http.Http - at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
2012-04-09 16:35:52,067 ERROR http.Http - at java.io.FilterInputStream.read(FilterInputStream.java:116)
2012-04-09 16:35:52,068 ERROR http.Http - at java.io.PushbackInputStream.read(PushbackInputStream.java:169)
2012-04-09 16:35:52,068 ERROR http.Http - at java.io.FilterInputStream.read(FilterInputStream.java:90)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:229)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:158)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)
Fix: change http.timeout in conf/nutch-default.xml to 200000 (milliseconds).
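A sketch of the property block; the same override can also be placed in nutch-site.xml rather than editing nutch-default.xml directly (200000 ms is simply the value used here, not a recommended default):
<property>
  <name>http.timeout</name>
  <value>200000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>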
2012-04-09 15:28:20,865 INFO api.RobotRulesParser - Couldn't get robots.txt for http://xxx.html: java.net.SocketTimeoutException: Read timed out
Fix: modify the lib-http plugin source, org/apache/nutch/protocol/http/api/RobotRulesParser.java, then rebuild and redeploy the plugin jar.
Source file: /home/xcloud/iworkspace/nutch/plugin/lib-http/org/apache/nutch/protocol/http/api/RobotRulesParser.java
Compiled classes: /home/xcloud/iworkspace/nutch/bin/org/apache/nutch/protocol/http/api
Plugin jar location: /home/xcloud/nutch/runtime/local/plugins/lib-http