Nutch Full-Text Search Study Notes
Source: Internet  Published by: 软件注册码大全  Editor: 程序博客网  Date: 2024/06/03 18:53
Nutch 1.3 Study Notes, Part 1
--------------------
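1. What is Nutch?
Nutch is an open-source web crawler. It collects web pages, analyzes them, builds an index, and exposes interfaces for querying the collected page data. Under the hood it uses Hadoop for distributed computation and storage, and Solr, an open-source full-text indexing framework, for distributed indexing. Nutch has integrated Solr as its indexing architecture since version 1.3.
2. Where can I download the latest Nutch?
At the time of writing, the latest release is Nutch 1.3. Its binary package and source code can be downloaded from:
http://mirror.bjtu.edu.cn/apache//nutch/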
3. How do I configure Nutch?
3.1 Unpack the downloaded archive, then cd $HOME/nutch-1.3/runtime/local
3.2 Make bin/nutch executable: chmod +x bin/nutch
3.3 Set JAVA_HOME to the installation directory of your JDK: export JAVA_HOME=<path to your JDK>
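Before running bin/nutch it can help to confirm that JAVA_HOME points at a real Java installation. The following is a minimal sketch; check_java_home is a hypothetical helper name, and the only assumption is that a usable installation has an executable at bin/java under that directory.

```shell
# Hypothetical helper: succeeds only if the given directory looks like a Java home.
# $1 = candidate directory (typically "$JAVA_HOME").
check_java_home() {
    dir=$1
    if [ -z "$dir" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$dir/bin/java" ]; then
        echo "no executable found at $dir/bin/java" >&2
        return 1
    fi
    return 0
}
```

A possible use is check_java_home "$JAVA_HOME" && bin/nutch, which only starts Nutch when the check passes.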
4. What needs to be prepared before crawling?
4.1 Set the http.agent.name property as follows:
    <property>
      <name>http.agent.name</name>
      <value>My Nutch Spider</value>
    </property>
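The note does not say which file the property goes into. In a typical Nutch 1.x setup it is conf/nutch-site.xml under runtime/local, so treat that path as an assumption. The sketch below writes a minimal configuration file containing only this property; write_agent_config is a hypothetical helper name.

```shell
# Hypothetical helper: write a minimal site configuration that sets http.agent.name.
# $1 = target file (assumed to be conf/nutch-site.xml), $2 = agent name.
# This overwrites the target file, so back up any existing configuration first.
# The agent name is inserted verbatim; XML special characters are not escaped.
write_agent_config() {
    target=$1
    agent=$2
    cat > "$target" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$agent</value>
  </property>
</configuration>
EOF
}
```

For example, write_agent_config conf/nutch-site.xml "My Nutch Spider" would produce a file with the property shown in 4.1.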
4.2 Create a directory for the seed URLs: mkdir -p urls
In this directory, create a URL file and list some URLs in it, for example:
    http://nutch.apache.org/
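Creating the seed directory and file can be scripted. This is a small sketch, not part of Nutch itself; make_seed_list is a hypothetical helper and seed.txt is an arbitrary file name chosen for illustration.

```shell
# Hypothetical helper: create a seed directory and write one URL per line.
# $1 = seed directory, remaining arguments = URLs (pass at least one).
make_seed_list() {
    seed_dir=$1
    shift
    mkdir -p "$seed_dir"
    # seed.txt is an arbitrary file name chosen for this sketch.
    printf '%s\n' "$@" > "$seed_dir/seed.txt"
}
```

Calling make_seed_list urls http://nutch.apache.org/ reproduces the setup described in 4.2.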
4.3 Then run the following command:
    bin/nutch crawl urls -dir crawl -depth 3 -topN 5
Note that this run does not build an index. To index the crawled data as well, run:
    bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
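The two variants of the crawl command differ only in whether a Solr URL is supplied. A minimal wrapper sketch follows; one_step_crawl and the NUTCH variable are hypothetical names introduced here, and setting NUTCH="echo bin/nutch" gives a dry run that only prints the command.

```shell
# NUTCH can be overridden, e.g. NUTCH="echo bin/nutch" for a dry run.
NUTCH=${NUTCH:-bin/nutch}

# Hypothetical helper: run the one-step crawl, with or without Solr indexing.
# $1 = seed directory, $2 = depth, $3 = topN, $4 = optional Solr URL.
one_step_crawl() {
    if [ -n "${4:-}" ]; then
        # With a Solr URL: crawl and index, as in the second command above.
        $NUTCH crawl "$1" -solr "$4" -depth "$2" -topN "$3"
    else
        # Without a Solr URL: crawl only, storing data under ./crawl.
        $NUTCH crawl "$1" -dir crawl -depth "$2" -topN "$3"
    fi
}
```

For example, one_step_crawl urls 3 5 runs the crawl without indexing, and adding http://localhost:8983/solr/ as a fourth argument enables indexing.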
5. What does the Nutch crawl workflow look like?
5.1 Initialize the crawldb and inject the seed URLs
    bin/nutch inject
    Usage: Injector <crawldb> <url_dir>
Output from running this command on my machine:
    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch inject db/crawldb urls/
    Injector: starting at 2011-08-22 10:50:01
    Injector: crawlDb: db/crawldb
    Injector: urlDir: urls
    Injector: Converting injected urls to crawl db entries.
    Injector: Merging injected urls into crawl db.
    Injector: finished at 2011-08-22 10:50:05, elapsed: 00:00:03
5.2 Generate a new list of URLs to fetch
    bin/nutch generate
    Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm][-maxNumSegments num]
Output on my machine:
    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch generate db/crawldb/ db/segments
    Generator: starting at 2011-08-22 10:52:41
    Generator: Selecting best-scoring urls due for fetch.
    Generator: filtering: true
    Generator: normalizing: true
    Generator: jobtracker is 'local', generating exactly one partition.
    Generator: Partitioning selected urls for politeness.
    Generator: segment: db/segments/20110822105243 // a new segment is created here
    Generator: finished at 2011-08-22 10:52:44, elapsed: 00:00:03
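The later steps need the path of the segment that generate just created. Since segment names such as 20110822105243 are timestamps, the newest one is the last in sorted order. The sketch below relies on that observation; latest_segment is a hypothetical helper name.

```shell
# Hypothetical helper: print the newest segment under a segments directory.
# Relies on segment names being timestamps, so the last name in sorted order is the newest.
# $1 = segments directory (a trailing slash is tolerated).
latest_segment() {
    segments_dir=${1%/}
    for seg in "$segments_dir"/*; do
        if [ -d "$seg" ]; then
            printf '%s\n' "$seg"
        fi
    done | sort | tail -n 1
}
```

A typical use would be segment=$(latest_segment db/segments) followed by bin/nutch fetch "$segment". If the directory holds no segments, nothing is printed.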
5.3 Fetch the URLs generated above
    bin/nutch fetch
    Usage: Fetcher <segment> [-threads n] [-noParsing]
Output on my machine:
    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch fetch db/segments/20110822105243/
    Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
    Fetcher: starting at 2011-08-22 10:56:07
    Fetcher: segment: db/segments/20110822105243
    Fetcher: threads: 10
    QueueFeeder finished: total 1 records + hit by time limit :0
    fetching http://www.baidu.com/
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=1
    -finishing thread FetcherThread, activeThreads=0
    -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
    -activeThreads=0
    Fetcher: finished at 2011-08-22 10:56:09, elapsed: 00:00:02
Let's look at the segment directory structure at this point:
    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ ls db/segments/20110822105243/
    content crawl_fetch crawl_generate
5.4 Parse the fetched results
    bin/nutch parse
    Usage: ParseSegment segment
Output on my machine:
    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch parse db/segments/20110822105243/
    ParseSegment: starting at 2011-08-22 10:58:19
    ParseSegment: segment: db/segments/20110822105243
    ParseSegment: finished at 2011-08-22 10:58:22, elapsed: 00:00:02
Now let's look at the directory structure after parsing:
    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ ls db/segments/20110822105243/
    content crawl_fetch crawl_generate crawl_parse parse_data parse_text
Parsing has added three directories: crawl_parse, parse_data, and parse_text.
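The two directory listings suggest a simple way to tell how far a segment has progressed: check which of the six sub-directories exist. The following sketch does that; missing_segment_parts is a hypothetical helper, and the directory names are taken from the listings in 5.3 and 5.4.

```shell
# Hypothetical helper: print the expected sub-directories that a segment is missing.
# $1 = segment directory (a trailing slash is tolerated).
# Prints nothing when the segment has been generated, fetched, and parsed.
missing_segment_parts() {
    segment=${1%/}
    for part in content crawl_fetch crawl_generate crawl_parse parse_data parse_text; do
        if [ ! -d "$segment/$part" ]; then
            printf '%s\n' "$part"
        fi
    done
}
```

Run against a segment that has been fetched but not parsed, this would list crawl_parse, parse_data, and parse_text.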
5.5 Update the link database (crawldb)
    bin/nutch updatedb
    Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]
Output on my machine:
    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch updatedb db/crawldb/ -dir db/segments/
    CrawlDb update: starting at 2011-08-22 11:00:09
    CrawlDb update: db: db/crawldb
    CrawlDb update: segments: [file:/home/lemo/Workspace/java/Apache/Nutch/nutch-1.3/db/segments/20110822105243]
    CrawlDb update: additions allowed: true
    CrawlDb update: URL normalizing: false
    CrawlDb update: URL filtering: false
    CrawlDb update: Merging segment data into db.
    CrawlDb update: finished at 2011-08-22 11:00:10, elapsed: 00:00:01
This step updates the crawldb link database. Here the database is stored on the file system; by comparison, Taobao's crawler keeps its link database in Redis, a key-value NoSQL database.
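Steps 5.2 through 5.5 form one crawl round, and repeating the round deepens the crawl. The sketch below chains the four commands using only the argument forms shown in the usage lines. crawl_round and the NUTCH variable are hypothetical names; with NUTCH="echo bin/nutch" the function only prints the commands, in which case a segment directory must already exist.

```shell
# NUTCH can be overridden, e.g. NUTCH="echo bin/nutch" for a dry run.
NUTCH=${NUTCH:-bin/nutch}

# Hypothetical helper: run one generate/fetch/parse/updatedb round.
# $1 = crawldb, $2 = segments directory.
crawl_round() {
    crawldb=$1
    segments=${2%/}
    $NUTCH generate "$crawldb" "$segments" || return 1
    # Newest segment = last name in sorted order (segment names are timestamps).
    segment=$(ls -d "$segments"/*/ 2>/dev/null | sort | tail -n 1)
    if [ -z "$segment" ]; then
        echo "no segment found under $segments" >&2
        return 1
    fi
    $NUTCH fetch "$segment" || return 1
    $NUTCH parse "$segment" || return 1
    # Pass the single segment rather than -dir, so only this round is merged.
    $NUTCH updatedb "$crawldb" "$segment" || return 1
}
```

Calling crawl_round db/crawldb db/segments several times in a row would roughly correspond to increasing the -depth of the one-step crawl command.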
5.6 Compute inverted links
    bin/nutch invertlinks
    Usage: LinkDb <linkdb> (-dir <segmentsDir> | <seg1> <seg2> ...) [-force] [-noNormalize] [-noFilter]
Output on my machine:
    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch invertlinks db/linkdb -dir db/segments/
    LinkDb: starting at 2011-08-22 11:02:49
    LinkDb: linkdb: db/linkdb
    LinkDb: URL normalize: true
    LinkDb: URL filter: true
    LinkDb: adding segment: file:/home/lemo/Workspace/java/Apache/Nutch/nutch-1.3/db/segments/20110822105243
    LinkDb: finished at 2011-08-22 11:02:50, elapsed: 00:00:01
5.7 Index the crawled content with Solr
    bin/nutch solrindex
    Usage: SolrIndexer <solr url> <crawldb> <linkdb> (<segment> ... | -dir <segments>
Output on the Nutch side:
    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch solrindex http://127.0.0.1:8983/solr/ db/crawldb/ db/linkdb/ db/segments/*
    SolrIndexer: starting at 2011-08-22 11:05:33
    SolrIndexer: finished at 2011-08-22 11:05:35, elapsed: 00:00:02
Part of the output on the Solr side:
    INFO: SolrDeletionPolicy.onInit: commits:num=1
        commit{dir=/home/lemo/Workspace/java/Apache/Solr/apache-solr-3.3.0/example/solr/data/index,segFN=segments_1,version=1314024228223,generation=1,filenames=[segments_1]
    Aug 22, 2011 11:05:35 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
    INFO: newest commit = 1314024228223
    Aug 22, 2011 11:05:35 AM org.apache.solr.update.processor.LogUpdateProcessor finish
    INFO: {add=[http://www.baidu.com/]} 0 183
    Aug 22, 2011 11:05:35 AM org.apache.solr.core.SolrCore execute
    INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=0 QTime=183
    Aug 22, 2011 11:05:35 AM org.apache.solr.update.DirectUpdateHandler2 commit
    INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
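Steps 5.6 and 5.7 are usually run back to back once crawling is done. A possible sketch that chains them is shown here, using the -dir form that appears in both usage lines instead of the shell glob db/segments/*. index_all_segments and the NUTCH variable are hypothetical names, and NUTCH="echo bin/nutch" turns the call into a dry run.

```shell
# NUTCH can be overridden, e.g. NUTCH="echo bin/nutch" for a dry run.
NUTCH=${NUTCH:-bin/nutch}

# Hypothetical helper: build the linkdb, then send all segments to Solr.
# $1 = Solr URL, $2 = crawldb, $3 = linkdb, $4 = segments directory.
index_all_segments() {
    # Inverted links must exist before indexing, since solrindex reads the linkdb.
    $NUTCH invertlinks "$3" -dir "$4" || return 1
    $NUTCH solrindex "$1" "$2" "$3" -dir "$4" || return 1
}
```

For the layout used in these notes the call would be index_all_segments http://127.0.0.1:8983/solr/ db/crawldb db/linkdb db/segments.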
5.8 Query from the Solr client
In a browser, open:
    http://localhost:8983/solr/admin/
Use baidu as the query string.
The XML output is:
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">0</int>
        <lst name="params">
          <str name="indent">on</str>
          <str name="start">0</str>
          <str name="q">baidu</str>
          <str name="version">2.2</str>
          <str name="rows">10</str>
        </lst>
      </lst>
      <result name="response" numFound="1" start="0">
        <doc>
          <float name="boost">1.0660036</float>
          <str name="digest">7be5cfd6da4a058001300b21d7d96b0f</str>
          <str name="id">http://www.baidu.com/</str>
          <str name="segment">20110822105243</str>
          <str name="title">百度一下,你就知道</str>
          <date name="tstamp">2011-08-22T14:56:09.194Z</date>
          <str name="url">http://www.baidu.com/</str>
        </doc>
      </result>
    </response>
If you want the results to show the raw HTML content indexed by Solr, change the definition of the content field in Solr's schema.xml to:
    <field name="content" type="text" stored="true" indexed="true"/>
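The same query can be issued without the admin page by requesting Solr's select handler directly. The sketch below only builds the URL, using the parameters echoed in the responseHeader above (q, version, start, rows, indent). solr_select_url is a hypothetical helper, and the /select path is assumed to be the default handler of the Solr 3.x example setup.

```shell
# Hypothetical helper: build a Solr select URL for a single-term query.
# $1 = Solr base URL (a trailing slash is tolerated).
# $2 = query term; it must already be URL-safe, since no encoding is done here.
solr_select_url() {
    base=${1%/}
    printf '%s/select?q=%s&version=2.2&start=0&rows=10&indent=on\n' "$base" "$2"
}
```

With curl installed, curl "$(solr_select_url http://localhost:8983/solr baidu)" would fetch the XML response on the command line.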
6 References
http://wiki.apache.org/nutch/RunningNutchAndSolr