solr配置中文分词器

来源：互联网发布：人工智能的出现证明编辑：程序博客网时间：2024/04/29 21:01

一、smartcn（转自http://www.07net01.com/linux/solr4_3zhipeizhizhongwenfencismartcn_455940_1372147112.html）

1、将自带的jar包拷贝到tomcat下

文件：solr-5.1.0/contrib/analysis-extras/lucene-libs/lucene-analyzers-smartcn-5.1.0.jar

拷贝到： /usr/local/tomcat/webapps/solr/WEB-INF/lib/

2、修改schema.xml，添加

<fieldType name="text_smart" class="solr.TextField" positionIncrementGap="100">    <analyzer type="index">        <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>        <filter class="solr.SmartChineseWordTokenFilterFactory"/>    </analyzer>    <analyzer type="query">        <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>        <filter class="solr.SmartChineseWordTokenFilterFactory"/>    </analyzer></fieldType>

3、测试

二、mmseg4j

1、和smartcn一样将jar包（mmseg4j-core-1.10.0.jar、mmseg4j-solr-2.3.0.jar）复制到tomcat/webapps/solr/WEB-INF/lib/

2、修改schema.xml，添加以下内容后重启tomcat

<fieldType name="text_zh_wangtest" class="solr.TextField" positionIncrementGap="100"><analyzer type="index"><tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="dic" /><filter class="solr.StopFilterFactory" words="stopwords.txt" /></analyzer></fieldType>

3、测试

三、索引字段加入中文分词

这里新建了张表articles

CREATE TABLE `articles` (  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,  `title` varchar(200) NOT NULL,  `author` varchar(50) NOT NULL,  `last_update_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,  PRIMARY KEY (`id`)) ENGINE=InnoDB DEFAULT CHARSET=utf8;

网上找点数据塞进去

INSERT INTO `articles` VALUES ('1', '3.6亿曼城又被这卧底坑死！无底黑洞让1亿白花', '梓泉', '2016-01-07 09:43:17');INSERT INTO `articles` VALUES ('2', '瓜帅版曼城豪阵曝光！组最奢华中轴 梅西领4天王', '小九', '2016-01-07 09:43:41');INSERT INTO `articles` VALUES ('3', '联赛杯-弃将助小魔兽绝杀 曼城丢争议球1-2负 ', '斯科', '2016-01-07 09:43:50');INSERT INTO `articles` VALUES ('4', '瓜迪奥拉去曼城已定！英媒:他与前巴萨2巨头重逢 ', '兰尼斯特', '2016-01-07 09:44:06');INSERT INTO `articles` VALUES ('5', '曼城主帅讽皇马:同情贝帅 皇马干出这种事不奇怪 ', '马尔科', '2016-01-07 09:44:18');

solr后台新建core_articles。

修改data-config.xml

<dataConfig>    <dataSource name="solrDB" type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/wangtest" user="root" password="123456"  />    <document>      <entity dataSource="solrDB" name="articles"              query="select * from articles"              deltaImportQuery="select * from articles where id= '${dih.delta.id}'"              deltaQuery="select id from articles where last_update_time > '${dataimporter.last_index_time}'">        <field column="id" name="id" />        <field column="title" name="title" />        <field column="author" name="author" />        <field column="last_update_time" name="last_update_time" />      </entity>    </document>  </dataConfig>

修改schema.xml，添加filedType

<fieldType name="text_mmseg4j" class="solr.TextField" positionIncrementGap="100">    <analyzer type="index">      <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="dic" />      <filter class="solr.StopFilterFactory" words="stopwords.txt" />    </analyzer>  </fieldType>

再添加相关索引字段

<field name="title" type="text_mmseg4j" indexed="true" stored="true" required="true" /><field name="author" type="text_mmseg4j" indexed="true" stored="true" required="true" /><field name="last_update_time" type="date" indexed="true" stored="true" />

重启tomcat后全量导入。

在Query里搜索的结果

对title字段进行关键字搜索（如：巴萨）结果

0 0