Nutch Full-Text Search Study Notes

Nutch 1.3 Study Notes, Part 1
--------------------

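1. What is Nutch?

Nutch is an open-source web crawler. It collects web pages, analyzes them, builds an index, and exposes an interface for querying the collected page data. Underneath, it uses Hadoop for distributed computation and storage. Indexing is handled by Solr, an open-source full-text search server; as of Nutch 1.3, indexing and search are delegated to Solr.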


2. Where can I download the latest Nutch?

The Nutch 1.3 binary package and source code can be downloaded from:
http://mirror.bjtu.edu.cn/apache//nutch/


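3. How do I configure Nutch?

   3.1 Extract the downloaded archive, then cd $HOME/nutch-1.3/runtime/local

   3.2 Make the bin/nutch script executable: chmod +x bin/nutch

   3.3 Set JAVA_HOME to the installation directory of your Java runtime (not to $PATH): export JAVA_HOME=<path to your Java installation>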
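
JAVA_HOME has to name the directory that contains bin/java. The following is a minimal sketch for deriving it from the java binary found on PATH. It assumes a Linux layout where java is a symlink into the real installation; the helper name java_home_from_binary is hypothetical and not part of Nutch.

```shell
# Sketch: meant to be sourced (". ./java-home.sh") so the export reaches
# the current shell. The file name and helper name are illustrative.

# Strip the trailing /bin/java from a binary path to get the directory
# that JAVA_HOME should point at.
java_home_from_binary() {
  local java_bin="$1"
  printf '%s\n' "${java_bin%/bin/java}"
}

# Locate java on PATH; the variable stays empty if java is not installed.
java_path="$(command -v java || true)"

if [ -n "$java_path" ]; then
  # readlink -f resolves symlinks such as /usr/bin/java; fall back to the
  # unresolved path on systems where the option is unavailable.
  resolved="$(readlink -f "$java_path" 2>/dev/null || printf '%s\n' "$java_path")"
  JAVA_HOME="$(java_home_from_binary "$resolved")"
  export JAVA_HOME
  echo "JAVA_HOME=$JAVA_HOME"
else
  echo "java not found on PATH; set JAVA_HOME by hand" >&2
fi
```

If the derived directory looks wrong for your system, set JAVA_HOME by hand instead.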

4. What needs to be prepared before crawling?

4.1 Set the http.agent.name property in conf/nutch-site.xml, as follows:

    <property>
        <name>http.agent.name</name>
        <value>My Nutch Spider</value>
    </property>

4.2 Create a directory for the seed URLs: mkdir -p urls

   In this directory, create a text file and list some URLs in it, one per line, for example:

    http://nutch.apache.org/

4.3 Then run the following command:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 5

Note that this command does not build an index. To index the crawled data as well, pass the Solr URL instead:

    bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
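
The preparation and the one-shot crawl can be put into a small script. This is only a sketch: the SEED_DIR variable and the file name seed.txt are assumptions made for illustration, and the crawl is started only when bin/nutch is found in the current directory.

```shell
#!/usr/bin/env bash
# Sketch: prepare a seed directory and start the one-shot crawl.
# SEED_DIR and seed.txt are illustrative names, not Nutch requirements.
set -euo pipefail

SEED_DIR="${SEED_DIR:-urls}"

# Create the seed directory and write one URL per line.
mkdir -p "$SEED_DIR"
printf '%s\n' "http://nutch.apache.org/" > "$SEED_DIR/seed.txt"

# Count non-empty lines so an empty seed list is noticed before crawling.
seed_count=$(grep -c . "$SEED_DIR/seed.txt")
echo "seed URLs: $seed_count"

# Run the crawl only when the Nutch launcher exists in the current directory.
if [ -x bin/nutch ]; then
  bin/nutch crawl "$SEED_DIR" -dir crawl -depth 3 -topN 5
else
  echo "bin/nutch not found; run this script from runtime/local"
fi
```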

5. What does the Nutch crawl workflow look like?

5.1 Initialize the crawldb and inject the seed URLs

    bin/nutch inject
    Usage: Injector <crawldb> <url_dir>

Output from running this command on my machine:

    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch inject db/crawldb urls/
    Injector: starting at 2011-08-22 10:50:01
    Injector: crawlDb: db/crawldb
    Injector: urlDir: urls
    Injector: Converting injected urls to crawl db entries.
    Injector: Merging injected urls into crawl db.
    Injector: finished at 2011-08-22 10:50:05, elapsed: 00:00:03

5.2 Generate a new list of URLs to fetch

    bin/nutch generate
    Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm][-maxNumSegments num]

Output on my machine:

    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch generate db/crawldb/ db/segments
    Generator: starting at 2011-08-22 10:52:41
    Generator: Selecting best-scoring urls due for fetch.
    Generator: filtering: true
    Generator: normalizing: true
    Generator: jobtracker is 'local', generating exactly one partition.
    Generator: Partitioning selected urls for politeness.
    Generator: segment: db/segments/20110822105243   // a new segment is created here
    Generator: finished at 2011-08-22 10:52:44, elapsed: 00:00:03

5.3 Fetch the URLs generated above

    bin/nutch fetch
    Usage: Fetcher <segment> [-threads n] [-noParsing]

Output on my machine:

    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch fetch db/segments/20110822105243/
    Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
    Fetcher: starting at 2011-08-22 10:56:07
    Fetcher: segment: db/segments/20110822105243
    Fetcher: threads: 10
    QueueFeeder finished: total 1 records + hit by time limit :0
    fetching http://www.baidu.com/
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=0
    -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
    -activeThreads=0
    Fetcher: finished at 2011-08-22 10:56:09, elapsed: 00:00:02

Let's look at the directory structure of the segment at this point:

    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ ls db/segments/20110822105243/
    content  crawl_fetch  crawl_generate

5.4 Parse the fetched content

    bin/nutch parse
    Usage: ParseSegment segment

Output on my machine:

    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch parse db/segments/20110822105243/
    ParseSegment: starting at 2011-08-22 10:58:19
    ParseSegment: segment: db/segments/20110822105243
    ParseSegment: finished at 2011-08-22 10:58:22, elapsed: 00:00:02

Now look at the directory structure again after parsing:

    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ ls db/segments/20110822105243/
    content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text

Parsing has added three directories: crawl_parse, parse_data and parse_text.
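
A fetched and parsed segment should therefore contain six subdirectories. The sketch below reports which of them are missing from a segment, which is a quick way to tell whether the fetch or parse step has been skipped. It assumes the directory names shown in the listings above; the function name missing_segment_parts is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report which expected subdirectories a segment still lacks.
set -euo pipefail

# Print one missing subdirectory name per line; print nothing if complete.
missing_segment_parts() {
  local segment="$1" part
  for part in content crawl_fetch crawl_generate crawl_parse parse_data parse_text; do
    [ -d "$segment/$part" ] || echo "$part"
  done
}

# Check the segment passed as the first argument, if one was given.
if [ "$#" -gt 0 ]; then
  missing_segment_parts "$1"
fi
```

Running it against a segment that has been fetched but not parsed would list crawl_parse, parse_data and parse_text.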


5.5 Update the crawldb with the newly discovered links

    bin/nutch updatedb
    Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]

Output on my machine:

    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch updatedb db/crawldb/ -dir db/segments/
    CrawlDb update: starting at 2011-08-22 11:00:09
    CrawlDb update: db: db/crawldb
    CrawlDb update: segments: [file:/home/lemo/Workspace/java/Apache/Nutch/nutch-1.3/db/segments/20110822105243]
    CrawlDb update: additions allowed: true
    CrawlDb update: URL normalizing: false
    CrawlDb update: URL filtering: false
    CrawlDb update: Merging segment data into db.
    CrawlDb update: finished at 2011-08-22 11:00:10, elapsed: 00:00:01

This step updates the crawldb, Nutch's link database, which here is stored on the file system. By comparison, the link database of Taobao's crawler is kept in Redis, a key-value NoSQL database.

5.6 Compute the inverted links

    bin/nutch invertlinks
    Usage: LinkDb <linkdb> (-dir <segmentsDir> | <seg1> <seg2> ...) [-force] [-noNormalize] [-noFilter]

Output on my machine:

    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch invertlinks db/linkdb -dir db/segments/
    LinkDb: starting at 2011-08-22 11:02:49
    LinkDb: linkdb: db/linkdb
    LinkDb: URL normalize: true
    LinkDb: URL filter: true
    LinkDb: adding segment: file:/home/lemo/Workspace/java/Apache/Nutch/nutch-1.3/db/segments/20110822105243
    LinkDb: finished at 2011-08-22 11:02:50, elapsed: 00:00:01

5.7 Index the crawled content with Solr

    bin/nutch solrindex
    Usage: SolrIndexer <solr url> <crawldb> <linkdb> (<segment> ... | -dir <segments>)

Output on the Nutch side:

    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch solrindex http://127.0.0.1:8983/solr/ db/crawldb/ db/linkdb/ db/segments/*
    SolrIndexer: starting at 2011-08-22 11:05:33
    SolrIndexer: finished at 2011-08-22 11:05:35, elapsed: 00:00:02

Part of the output on the Solr side:

    INFO: SolrDeletionPolicy.onInit: commits:num=1
           commit{dir=/home/lemo/Workspace/java/Apache/Solr/apache-solr-3.3.0/example/solr/data/index,segFN=segments_1,version=1314024228223,generation=1,filenames=[segments_1]
    Aug 22, 2011 11:05:35 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
    INFO: newest commit = 1314024228223
    Aug 22, 2011 11:05:35 AM org.apache.solr.update.processor.LogUpdateProcessor finish
    INFO: {add=[http://www.baidu.com/]} 0 183
    Aug 22, 2011 11:05:35 AM org.apache.solr.core.SolrCore execute
    INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=0 QTime=183
    Aug 22, 2011 11:05:35 AM org.apache.solr.update.DirectUpdateHandler2 commit
    INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
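
Steps 5.1 to 5.7 can be chained into a single crawl round. The script below is a sketch rather than an official Nutch script: the variables NUTCH, SOLR_URL and DRY_RUN and the helper functions are assumptions made for illustration, while the sub-commands and paths are the ones used in the steps above. It defaults to a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: one crawl round (inject, generate, fetch, parse, updatedb,
# invertlinks, solrindex). NUTCH, SOLR_URL and DRY_RUN are illustrative.
set -euo pipefail

NUTCH="${NUTCH:-bin/nutch}"
SOLR_URL="${SOLR_URL:-http://127.0.0.1:8983/solr/}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print a command and, unless in dry-run mode, execute it.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Segment names are timestamps, so the last entry in sorted glob order
# is the newest segment. Prints an empty line if there is none.
latest_segment() {
  local newest="" dir
  for dir in "$1"/*/; do
    [ -d "$dir" ] || continue
    newest="${dir%/}"
  done
  printf '%s\n' "$newest"
}

crawl_round() {
  run "$NUTCH" inject db/crawldb urls
  run "$NUTCH" generate db/crawldb db/segments
  local segment
  segment="$(latest_segment db/segments)"
  # In a dry run no segment has been generated, so show a placeholder.
  segment="${segment:-db/segments/<new-segment>}"
  run "$NUTCH" fetch "$segment"
  run "$NUTCH" parse "$segment"
  run "$NUTCH" updatedb db/crawldb "$segment"
  run "$NUTCH" invertlinks db/linkdb -dir db/segments
  run "$NUTCH" solrindex "$SOLR_URL" db/crawldb db/linkdb "$segment"
}

crawl_round
```

Setting DRY_RUN=0 executes the commands for real; calling crawl_round several times would correspond roughly to the -depth option of the one-shot crawl command.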

5.8 Query from the Solr admin page

Open the following address in a browser:

    http://localhost:8983/solr/admin/

Search for baidu.
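
The same query can be sent from the command line instead of the browser. This is a sketch assuming Solr's standard /select handler at the address used above; the variable SOLR_BASE and the helper solr_query_url are hypothetical, and the query term is assumed to be a plain ASCII word that needs no URL encoding.

```shell
#!/usr/bin/env bash
# Sketch: build the query URL for Solr's select handler.
# SOLR_BASE and solr_query_url are illustrative names.
set -euo pipefail

SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# Arguments: query term, optional number of rows (defaults to 10).
solr_query_url() {
  local query="$1" rows="${2:-10}"
  printf '%s/select?q=%s&start=0&rows=%s&indent=on\n' "$SOLR_BASE" "$query" "$rows"
}

url="$(solr_query_url baidu)"
echo "$url"

# To send the request (requires a running Solr instance):
#   curl -s "$url"
```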

The XML response looks like this:

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">0</int>
        <lst name="params">
          <str name="indent">on</str>
          <str name="start">0</str>
          <str name="q">baidu</str>
          <str name="version">2.2</str>
          <str name="rows">10</str>
        </lst>
      </lst>
      <result name="response" numFound="1" start="0">
        <doc>
          <float name="boost">1.0660036</float>
          <str name="digest">7be5cfd6da4a058001300b21d7d96b0f</str>
          <str name="id">http://www.baidu.com/</str>
          <str name="segment">20110822105243</str>
          <str name="title">百度一下,你就知道</str>
          <date name="tstamp">2011-08-22T14:56:09.194Z</date>
          <str name="url">http://www.baidu.com/</str>
        </doc>
      </result>
    </response>

To have the page content shown in the results as well, change the content field to the following. Field definitions live in Solr's schema.xml rather than in solrconfig.xml:

    <field name="content" type="text" stored="true" indexed="true"/>

6. References

http://wiki.apache.org/nutch/RunningNutchAndSolr
