Getting Started with Nutch 1.4 + Solr 3.5


1. Download Nutch 1.4 & Solr 3.5

http://mirror.bjtu.edu.cn/apache/lucene/solr/3.5.0/apache-solr-3.5.0.zip
http://mirror.bjtu.edu.cn/apache/nutch/apache-nutch-1.4-bin.zip
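
From a shell, the same archives can be fetched directly (wget shown here; any HTTP client will do):

wget http://mirror.bjtu.edu.cn/apache/lucene/solr/3.5.0/apache-solr-3.5.0.zip
wget http://mirror.bjtu.edu.cn/apache/nutch/apache-nutch-1.4-bin.zip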


2. Extract and Install

2.1 Directory Layout

Nutch install directory: /home/xcloud/spider/nutch
Solr install directory: /home/xcloud/spider/solr
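
Later steps refer to these locations as $NUTCH_HOME and ${SOLR_HOME}, which this guide never defines explicitly. A minimal setup matching the layout above:

mkdir -p /home/xcloud/spider
export NUTCH_HOME=/home/xcloud/spider/nutch
export SOLR_HOME=/home/xcloud/spider/solr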


2.2 Extracting the Archives

Extract Nutch:
unzip apache-nutch-1.4-bin.zip
mv apache-nutch-1.4 nutch
chmod -R 775 nutch


Extract Solr:
unzip apache-solr-3.5.0.zip
mv apache-solr-3.5.0 solr
chmod -R 775 solr


3. Nutch Configuration and Test Run

3.1 Nutch Directory Overview

Nutch executables: $NUTCH_HOME/runtime/local/bin
Nutch configuration: $NUTCH_HOME/runtime/local/conf


3.2 Testing Nutch

Running the nutch script with no arguments prints its usage:

$NUTCH_HOME/runtime/local/bin/nutch


xcloud@xcloud:~/spider/nutch/runtime/local/bin$ ./nutch
Usage: nutch [-core] COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.


Expert: -core option is for developers only. It avoids building the job jar, 
        instead it simply includes classes compiled with ant compile-core. 
        NOTE: this works only for jobs executed in 'local' mode
xcloud@xcloud:~/spider/nutch/runtime/local/bin$


The usage listing above confirms the installation works.


3.3 Editing the Nutch Configuration

Nutch refuses to fetch until http.agent.name is set (see problem 5.1 below), so edit $NUTCH_HOME/runtime/local/conf/nutch-site.xml:


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<!-- Put site-specific property overrides in this file. -->


<configuration>
  <property>
        <name>http.agent.name</name>
        <value>My Nutch Agent</value>
        <description>HTTP 'User-Agent' request header. MUST NOT be empty -
          please set this to a single word uniquely related to your organization.
          NOTE: You should also check other related properties:
                http.robots.agents
                http.agent.description
                http.agent.url
                http.agent.email
                http.agent.version
                and set their values appropriately.
        </description>
  </property>
  <property>
        <name>http.agent.description</name>
        <value></value>
        <description>Further description of our bot- this text is used in
        the User-Agent header. It appears in parenthesis after the agent name.
        </description>
  </property>
  <property>
        <name>http.agent.url</name>
        <value></value>
        <description>A URL to advertise in the User-Agent header. This will
        appear in parenthesis after the agent name. Custom dictates that this
        should be a URL of a page explaining the purpose and behavior of this crawler.
        </description>
  </property>
  <property>
        <name>http.agent.email</name>
        <value></value>
        <description>An email address to advertise in the HTTP 'From' request
        header and User-Agent header. A good practice is to mangle this
        address (e.g. 'info at example dot com') to avoid spamming.
        </description>
  </property>
</configuration>
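
With these values (and the optional fields left empty), the fetcher should identify itself with a User-Agent header roughly like the line below. This is a sketch of how Nutch's HttpBase assembles the string from http.agent.name and the default http.agent.version, not captured traffic:

User-Agent: My Nutch Agent/Nutch-1.4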


3.4 Creating the Seed Directory

mkdir $NUTCH_HOME/runtime/local/bin/urls
echo http://nutch.apache.org/ >> $NUTCH_HOME/runtime/local/bin/urls/seed.txt
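
The seed file is a plain list with one URL per line, so further start pages can be appended the same way:

cat $NUTCH_HOME/runtime/local/bin/urls/seed.txt
http://nutch.apache.org/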


3.5 Editing conf/regex-urlfilter.txt

sudo gedit $NUTCH_HOME/runtime/local/conf/regex-urlfilter.txt

The final catch-all rule accepts every URL:

# accept anything else
+.

Change it so the crawl stays within the seed domain:

+^http://([a-z0-9]*\.)*nutch.apache.org/
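
Before crawling, the filter can be sanity-checked with Nutch's plugin runner, which loads the filter class and echoes each URL read from stdin prefixed with + (accepted) or - (rejected). The class name below is the urlfilter-regex plugin's entry point in Nutch 1.4; treat the exact invocation as an assumption to verify against your build:

echo "http://nutch.apache.org/" | ./nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter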




3.6 Crawling

./nutch crawl urls -dir crawl -depth 3 -topN 5

Here -depth bounds how many link levels are followed from the seeds, and -topN caps the number of pages fetched per level. (Section 5.2 below later switches -dir crawl to a fresh mycrawl directory after a permissions problem.)
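
A successful run leaves behind the three structures that solrindex consumes in the next step:

crawl/
  crawldb/    current state of every known URL
  linkdb/     inverted link graph
  segments/   one timestamped subdirectory per fetch round (e.g. 20120405100520)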


3.7 Indexing into Solr (solrindex)

With Solr running (section 4), push the crawl data into the index; the paths here assume the mycrawl output directory from section 5.2:

./nutch solrindex http://127.0.0.1:8983/solr/ mycrawl/crawldb -linkdb mycrawl/linkdb mycrawl/segments/*


4. Solr Configuration and Test Run

4.1 Starting Solr

Run from ${SOLR_HOME}/example:

java -jar start.jar


4.2 Verification

http://localhost:8983/solr/admin/
http://localhost:8983/solr/admin/stats.jsp


4.3 Integrating Nutch with Solr

Copy Nutch's Solr schema into the example core and restart Solr:

cp $NUTCH_HOME/runtime/local/conf/schema.xml ${SOLR_HOME}/example/solr/conf/

java -jar start.jar

Then search for "nutch" in the admin page; the results come back as XML.
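
Equivalently, Solr's select handler can be queried over HTTP and returns the same XML (q is the query term, rows caps the result count):

curl "http://localhost:8983/solr/select?q=nutch&wt=xml&rows=5"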




5. Problems Encountered

5.1 Nutch Fetcher: No agents listed in 'http.agent.name' property

Solution: set http.agent.name in $NUTCH_HOME/runtime/local/conf/nutch-site.xml, as shown in section 3.3 above.


5.2 org.apache.hadoop.mapred.InvalidInputException: Input path does not exist


Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/xcloud/spider/nutch/runtime/local/bin/crawl/segments/20120405100520/parse_data
Input path does not exist: file:/home/xcloud/spider/nutch/runtime/local/bin/crawl/segments/20120405101222/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)


The missing parse_data means an earlier fetch/parse round never wrote its output, apparently because the crawl directory under bin/ was not writable. Original command:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

Fix: create a fresh, writable output directory:

mkdir mycrawl
chmod 777 mycrawl

bin/nutch crawl urls -dir mycrawl -depth 3 -topN 5


5.3 Regex Matching Problems in regex-urlfilter.txt

xcloud@xcloud:~/nutch$ nutch crawl urls -dir db -depth 2 -topN 2
solrUrl is not set, indexing will be skipped...
crawl started in: db
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=null
topN = 2
Injector: starting at 2012-04-09 14:56:20
Injector: crawlDb: db/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Check logs/hadoop.log:

Caused by: java.io.IOException: Invalid first character: http://XXX/product/*
at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:200)
at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
... 21 more
Every non-comment rule in regex-urlfilter.txt must start with + (include) or - (exclude), hence the "Invalid first character" error. The offending line:

http://XXX/product/*

Change it to:

+^http://XXX/product/*

(Strictly speaking, /* in a regex means "zero or more slashes"; since the regex filter only needs to find a match, +^http://XXX/product/ already covers everything under that path. The + prefix is the actual fix here.)


5.4 Timeout Exceptions

2012-04-09 16:35:52,067 ERROR http.Http - java.net.SocketTimeoutException: Read timed out
2012-04-09 16:35:52,067 ERROR http.Http - at java.net.SocketInputStream.socketRead0(Native Method)
2012-04-09 16:35:52,067 ERROR http.Http - at java.net.SocketInputStream.read(SocketInputStream.java:129)
2012-04-09 16:35:52,067 ERROR http.Http - at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
2012-04-09 16:35:52,067 ERROR http.Http - at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
2012-04-09 16:35:52,067 ERROR http.Http - at java.io.FilterInputStream.read(FilterInputStream.java:116)
2012-04-09 16:35:52,068 ERROR http.Http - at java.io.PushbackInputStream.read(PushbackInputStream.java:169)
2012-04-09 16:35:52,068 ERROR http.Http - at java.io.FilterInputStream.read(FilterInputStream.java:90)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:229)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:158)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)

Fix: raise http.timeout to 200000 in conf/nutch-default.xml. (Overriding it in nutch-site.xml instead is the cleaner practice, since nutch-default.xml is meant to stay pristine.)
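
The override itself is an ordinary property block; a sketch for nutch-site.xml, where the name and description come from nutch-default.xml and the value is the one chosen above:

<property>
  <name>http.timeout</name>
  <value>200000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>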





2012-04-09 15:28:20,865 INFO  api.RobotRulesParser - Couldn't get robots.txt for http://xxx.html: java.net.SocketTimeoutException: Read timed out

For robots.txt read timeouts like the one above, the author's workaround was to patch the RobotRulesParser.java source in the lib-http plugin and redeploy the rebuilt class into the plugin jar:

source: /home/xcloud/iworkspace/nutch/plugin/lib-http/org/apache/nutch/protocol/http/api/RobotRulesParser.java
compiled class: /home/xcloud/iworkspace/nutch/bin/org/apache/nutch/protocol/http/api
plugin jar path: /home/xcloud/nutch/runtime/local/plugins/lib-http

