Solr Chinese Word Segmentation (mmseg4j)

Source: Internet | Editor: 程序博客网 | Date: 2024/05/04 09:02

1. Download mmseg4j from http://code.google.com/p/mmseg4j/

Unpack the mmseg4j-1.8.4 archive.

2. Under $SOLR_HOME, create two directories, lib and dic. Copy mmseg4j-all-1.8.4.jar into lib, and copy the .dic files from data into dic.
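Step 2 can be sketched as a small shell helper. The function name install_mmseg4j is made up for illustration, and the layout of the unpacked archive (jar at the top level, dictionaries under data/) is an assumption, so adjust the paths if your copy differs.

```shell
# Hypothetical helper for step 2 (not part of mmseg4j or Solr).
# $1: Solr home, e.g. /opt/solr/example/solr (the parent of the dicPath used in schema.xml)
# $2: directory where mmseg4j-1.8.4 was unpacked
#     (assumed layout: mmseg4j-all-1.8.4.jar at the top level, dictionaries in data/)
install_mmseg4j() {
  solr_home=$1
  mmseg4j_dir=$2

  # Create both target directories; -p makes this safe to re-run.
  mkdir -p "$solr_home/lib" "$solr_home/dic" || return 1

  # The jar goes into lib so Solr can load the tokenizer factory.
  cp "$mmseg4j_dir/mmseg4j-all-1.8.4.jar" "$solr_home/lib/" || return 1

  # The dictionary files go into dic, the directory dicPath points at.
  cp "$mmseg4j_dir"/data/*.dic "$solr_home/dic/" || return 1
}
```

For example: install_mmseg4j /opt/solr/example/solr ./mmseg4j-1.8.4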

3. Edit schema.xml

Add the fieldType definitions:

XML code

<types>
  <!-- complex mode: forward maximum matching plus mmseg4j's four disambiguation rules -->
  <fieldType name="textComplex" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="/opt/solr/example/solr/dic"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <!-- max-word mode: builds on complex mode and splits the result into as many words as possible -->
  <fieldType name="textMaxWord" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="/opt/solr/example/solr/dic"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <!-- simple mode: plain forward maximum matching -->
  <fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="/opt/solr/example/solr/dic"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  ..
</types>

Add the fields:

XML code

<field name="simple" type="textSimple" indexed="true" stored="true" multiValued="true"/>
<field name="complex" type="textComplex" indexed="true" stored="true" multiValued="true"/>
<field name="maxword" type="textMaxWord" indexed="true" stored="true" multiValued="true"/>


Restart Tomcat.

Then open http://yourhost:8080/solr-example/admin/analysis.jsp

And that's it: Chinese word segmentation is up and running.

Let's submit some Chinese text to Solr and then query it.

XML code

chinese.xml
<add>
  <doc>
    <field name="id">1</field>
    <field name="title">夜晚和白天不同，如果相机设置不准确的话，照片拍出来就会发糊。那么本期佳能单反课堂就带您详细了解夜景拍摄的参数设置，同时为您讲解什么叫做“安全快门”。除此之外，还有更多新奇有趣的特殊拍摄手法，还等什么？马上进入本期的节目吧！</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="title">冰动娱乐自主研发的虚幻3即时回合制网络游戏！UnrealEngine3倾力打造、最终幻想式的创新玩法以及天马行空般的幻想三国题材将带给你耳目一新的全新感受。</field>
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="title">solr是基于Lucene Java搜索库的企业级全文搜索引擎，目前是apache的一个项目。</field>
  </doc>
  <doc>
    <field name="id">4</field>
    <field name="title">中国人民银行是中华人民共和国的中央银行。</field>
  </doc>
</add>

Submit the file with curl:

Command line

curl 'http://localhost:8080/solr-example/update/?commit=true' -H "Content-Type: text/xml" --data-binary @chinese.xml

Next, let's try a query:
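The query behind the result shown next can also be sent from the command line. The sketch below is one way to do it, not necessarily how the original request was made: the helper name solr_select and the SOLR_URL variable are made up for illustration, the parameters are copied from the responseHeader of that result, and /select is Solr's standard query path. It also assumes the title field is mapped to one of the mmseg4j field types, which the schema excerpt does not show.

```shell
# Assumed base URL, matching the host and webapp name used for the update request.
SOLR_URL="${SOLR_URL:-http://localhost:8080/solr-example}"

# Hypothetical helper: run a select query with the parameters seen in the response header.
# $1: the query string, e.g. 'title:单反'
solr_select() {
  # -G turns the --data-urlencode pairs into a GET query string.
  # --data-urlencode percent-encodes each value byte by byte, so the
  # Chinese text goes out as UTF-8 when the shell itself uses UTF-8.
  curl -G "$SOLR_URL/select" \
    --data-urlencode "q=$1" \
    --data-urlencode "indent=on" \
    --data-urlencode "start=0" \
    --data-urlencode "rows=10" \
    --data-urlencode "version=2.2"
}
```

For example: solr_select 'title:单反'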

Query result

XML code

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">title:单反</str>
      <str name="rows">10</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
      <arr name="title"><str>夜晚和白天不同，如果相机设置不准确的话，照片拍出来就会发糊。那么本期佳能单反课堂就带您详细了解夜景拍摄的参数设置，同时为您讲解什么叫做“安全快门”。除此之外，还有更多新奇有趣的特殊拍摄手法，还等什么？马上进入本期的节目吧！</str></arr>
    </doc>
  </result>
</response>

Problems you may run into:

1. Chinese text entered in the Query String field gets garbled, so the query returns no results.
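To see what a correctly encoded query looks like on the wire, it can help to print the UTF-8 percent-encoding of the search term. The helper below is a sketch with a made-up name (percent_encode); it encodes every byte, ASCII included, which is more than strictly necessary but still valid, and it assumes the script and terminal are UTF-8. On older Tomcat versions the connector decodes the query string as ISO-8859-1 unless configured otherwise, so a term that arrives as these bytes can still end up garbled on the server side.

```shell
# Hypothetical helper: percent-encode every byte of the argument.
# Assumes the argument reaches the shell as UTF-8.
percent_encode() {
  printf '%s' "$1" |
    od -An -v -tx1 |          # dump the raw bytes as two-digit hex values
    tr -d ' \n' |             # join them into one unbroken hex string
    sed 's/\(..\)/%\1/g' |    # put a % in front of each byte
    tr 'a-f' 'A-F'            # upper-case the hex digits
}

percent_encode '单反'
```

If a request carrying the term in this form still returns nothing, the server is most likely decoding the URI with a different charset.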

Fix: edit Tomcat's server.xml

XML code