Integrating the Nutch open-source search engine with the Paoding Chinese tokenizer as a plugin [Repost]
Source: Internet  Editor: 程序博客网  Date: 2024/05/01 15:28
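This article collects what I learned while integrating the Paoding Chinese tokenizer, written up as a separate post so it can be covered in detail. The files you need to understand are: a) the plugin directories and the plugin files build.xml and plugin.xml; b) nutch-0.9/src/plugin/build.xml; c) WEB-INF/classes/nutch-site.xml.
Configure them as described below, then run ant package. Ant handles the whole compile-and-deploy process here.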
1) Add the analysis-zh and lib-paoding-analyzers directories under src/plugin. For reference, see:
E:/workspace/searchengine/nutch-0.9/src/plugin/analysis-zh
E:/workspace/searchengine/nutch-0.9/src/plugin/lib-paoding-analyzers
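Below is the source code in analysis-zh, a thin wrapper around Paoding. The code itself is very simple; the real work is getting the configuration files and the Ant scripts right.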
/**
 * Paoding Chinese analyzer
 */
package org.apache.nutch.analysis.zh;

// JDK imports
import java.io.Reader;

// Lucene imports
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

// Nutch imports
import org.apache.nutch.analysis.NutchAnalyzer;

/**
 * A simple Chinese analyzer that wraps Paoding's Lucene analyzer.
 * @author kevin tu
 */
public class ChineseAnalyzer extends NutchAnalyzer {

    // One shared PaodingAnalyzer instance; every call is delegated to it.
    private final static Analyzer ANALYZER =
        new net.paoding.analysis.analyzer.PaodingAnalyzer();

    /** Creates a new instance of ChineseAnalyzer */
    public ChineseAnalyzer() { }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return ANALYZER.tokenStream(fieldName, reader);
    }
}
2) Modify build.xml under src/plugin
<target name="deploy">
  <ant dir="analysis-zh" target="deploy"/><!--kevin 20080903 add-->
  <ant dir="lib-paoding-analyzers" target="deploy"/><!--kevin 20080903 add-->
  ...
</target>
<target name="clean">
  <ant dir="analysis-zh" target="clean"/><!--kevin 20080903 add-->
  <ant dir="lib-paoding-analyzers" target="clean"/><!--kevin 20080903 add-->
  ...
</target>
3) Modify nutch-site.xml and add |analysis-(zh)| to plugin.includes. This is very important: otherwise Nutch loads only the default plugins and will not load the Paoding jar or your own analysis-zh jar.
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|analysis-(zh)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>
  </description>
</property>
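The plugin.includes value is a regular expression over plugin names, so it is easy to check which names it selects before rebuilding. The following is a minimal sketch, not Nutch's own code: the class name PluginIncludesCheck and the helper isIncluded are hypothetical, and it assumes a plugin is included when its whole id matches the expression.

```java
import java.util.regex.Pattern;

class PluginIncludesCheck {
    // The plugin.includes value from the nutch-site.xml snippet in step 3.
    static final String PLUGIN_INCLUDES =
        "protocol-http|urlfilter-regex|parse-(text|html|js)|analysis-(zh)"
        + "|index-basic|query-(basic|site|url)|summary-basic|scoring-opic"
        + "|urlnormalizer-(pass|regex|basic)";

    // Assumption: a plugin id is selected only if the entire id matches the regex.
    static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"analysis-zh", "analysis-fr", "parse-html", "lib-paoding-analyzers"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(PLUGIN_INCLUDES, id));
        }
    }
}
```

Under this assumption analysis-zh is selected and analysis-fr is not. lib-paoding-analyzers does not match the expression by name; how that library plugin is pulled in (for example through a dependency declared in plugin.xml) is outside the scope of this sketch.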
4) Repackage: ant package
5) Configure Tomcat: edit webapps/cse/WEB-INF/classes/nutch-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>local</value>
  </property>
  <property><!-- local index directory -->
    <name>searcher.dir</name>
    <value>/nutch/local/crawled</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|analysis-(zh)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>
    </description>
  </property>
</configuration>
6) Configure Paoding's runtime parameters by adding paoding-analysis.properties
paoding.imports=\
ifexists:classpath:paoding-analysis-default.properties;\
ifexists:classpath:paoding-analysis-user.properties;\
ifexists:classpath:paoding-knives-user.properties
Then set the dictionary directory: export PAODING_DIC_HOME=/nutch/dic
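Java .properties files continue a value onto the next line with a trailing backslash, so paoding.imports is read as a single semicolon-separated value. The sketch below is only an illustration built on java.util.Properties, not Paoding's own loader: the class name PaodingConfigCheck and both helpers are hypothetical. It parses the same text and checks that the directory named by PAODING_DIC_HOME exists.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

class PaodingConfigCheck {
    // Same content as paoding-analysis.properties, using '\' line continuations.
    static final String PROPERTIES_TEXT =
        "paoding.imports=\\\n"
        + "    ifexists:classpath:paoding-analysis-default.properties;\\\n"
        + "    ifexists:classpath:paoding-analysis-user.properties;\\\n"
        + "    ifexists:classpath:paoding-knives-user.properties\n";

    // Parses the text as a properties file and splits paoding.imports on ';'.
    static String[] readImports(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("paoding.imports", "").split(";");
    }

    // True when the given dictionary home is set and is an existing directory.
    static boolean isUsableDicHome(String dicHome) {
        return dicHome != null && !dicHome.isEmpty()
            && Files.isDirectory(Paths.get(dicHome));
    }

    public static void main(String[] args) {
        for (String entry : readImports(PROPERTIES_TEXT)) {
            System.out.println(entry);
        }
        String dicHome = System.getenv("PAODING_DIC_HOME");
        System.out.println("PAODING_DIC_HOME=" + dicHome
            + " usable=" + isUsableDicHome(dicHome));
    }
}
```

Running this in the same environment as Tomcat, before starting it, is a cheap way to catch a missing export or a mistyped dictionary path.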