Installing Chinese Analyzers on Elasticsearch 1.4.2

Source: Internet. Published by: 糖豆软件下载. Edited by: 程序博客网. Time: 2024/06/14 17:00

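Installing the IK analyzer

1 Download the source:

https://github.com/medcl/elasticsearch-analysis-ik

The page has a "Download ZIP" link; click it to download the source.

2 Compile the source

The source is managed with Maven, so it can be imported into an existing Eclipse project through its pom file. Compiling it produces elasticsearch-analysis-ik-1.2.9.jar.

Create an analysis-ik directory under the plugins directory and copy elasticsearch-analysis-ik-1.2.9.jar into it.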
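
The copy step can also be scripted. The following is a minimal sketch, assuming the jar has already been built; the function name `install_plugin_jar` and the example paths are illustrative and not part of the IK project.

```python
import shutil
from pathlib import Path


def install_plugin_jar(jar_path, es_home, plugin_name="analysis-ik"):
    """Copy a built plugin jar into <es_home>/plugins/<plugin_name>/."""
    plugin_dir = Path(es_home) / "plugins" / plugin_name
    # Create plugins/analysis-ik if it does not exist yet.
    plugin_dir.mkdir(parents=True, exist_ok=True)
    target = plugin_dir / Path(jar_path).name
    shutil.copy2(str(jar_path), str(target))
    return target


# Example call (paths are assumptions about the local layout):
# install_plugin_jar("target/elasticsearch-analysis-ik-1.2.9.jar", "elasticsearch-1.4.2")
```

Elasticsearch needs a restart after the jar is in place so that the plugin is picked up.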

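3 config directory

The source contains a config directory, which corresponds to the config directory under elasticsearch-1.4.2.

First, copy the ik directory from the source's config directory to elasticsearch-1.4.2\config.

The source's config directory also contains an elasticsearch.yml. Copy the changes at the very bottom of that file into the elasticsearch.yml under elasticsearch-1.4.2\config. The logging.yml in the source needs no handling.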

index:
  analysis:
    analyzer:
      ik:
        alias: [news_analyzer_ik,ik_analyzer]
        type: org.elasticsearch.index.analysis.IkAnalyzerProvider

index.analysis.analyzer.default.type : "ik"
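
To check that the analyzer is actually loaded after a restart, one option is to call the `_analyze` API. The sketch below is only an illustration: the index name `test` and the address `http://localhost:9200` are assumptions (the index must already exist), and the helper names are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_analyze_url(text, analyzer="ik", index="test", base="http://localhost:9200"):
    """Build an _analyze URL; index and base are assumptions about the local setup."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return "{}/{}/_analyze?{}".format(base, index, query)


def analyze(text, **kwargs):
    """Send the request and return the parsed JSON response (needs a running node)."""
    with urlopen(build_analyze_url(text, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call against a running node:
# print(analyze("中华人民共和国国歌"))
```

If the plugin or the settings are missing, the request should fail with an error about the analyzer instead of returning a token list.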

4 lib directory

Search the .m2 directory (the local Maven repository) for "httpclient-4.3.5.jar" and "httpcore-4.3.2.jar", and copy both into elasticsearch-1.4.2\lib.
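
Searching by hand works, but the lookup can be sketched in a few lines. This assumes the local Maven repository is in its default location, `~/.m2/repository`; the function name `find_jars` is illustrative.

```python
from pathlib import Path


def find_jars(repo_root, names):
    """Return {jar name: first matching path, or None} under repo_root."""
    found = {name: None for name in names}
    for path in Path(repo_root).rglob("*.jar"):
        # Keep only the first hit for each wanted file name.
        if path.name in found and found[path.name] is None:
            found[path.name] = path
    return found


# Example call (default repository location is an assumption):
# jars = find_jars(Path.home() / ".m2" / "repository",
#                  ["httpclient-4.3.5.jar", "httpcore-4.3.2.jar"])
```

Any entry that comes back as `None` means the jar has not been downloaded yet, which usually means the Maven build has not been run on this machine.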

---------------------------

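Synonyms:

I did not see a synonym dictionary in the config directory. Does IK not support synonyms?

After some searching, the link below explains how to support synonyms through a token filter, although it is for version 1.3. I will look at it when I have time.

http://www.elastic.co/guide/en/elasticsearch/reference/1.3/analysis-synonym-tokenfilter.html#_solr_synonyms

The Definitive Guide also covers synonyms, although I did not see an external synonym file mentioned there.

http://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms.html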
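
As a rough idea of what the filter approach could look like, the sketch below builds index settings that chain a `synonym` token filter after a tokenizer. This is untested against IK: the tokenizer name `ik` is an assumption about what the plugin registers, and the names `my_synonyms` and `ik_with_synonyms` as well as the sample synonym line are made up for illustration.

```python
import json


def synonym_index_settings(synonyms, tokenizer="ik"):
    """Build index settings with a custom analyzer: tokenizer + synonym filter.

    'tokenizer' defaults to "ik", which is an assumption about the plugin.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Inline synonyms in Solr format, one rule per string.
                    "my_synonyms": {"type": "synonym", "synonyms": list(synonyms)}
                },
                "analyzer": {
                    "ik_with_synonyms": {
                        "type": "custom",
                        "tokenizer": tokenizer,
                        "filter": ["my_synonyms"],
                    }
                },
            }
        }
    }


# Hypothetical synonym rule, shown only to illustrate the format.
body = synonym_index_settings(["番茄, 西红柿"])
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The resulting JSON would be sent as the body when creating an index. The reference page linked above also describes a `synonyms_path` option for loading the rules from a file instead of listing them inline.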

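Installing the mmseg analyzer

The process is similar to installing the IK analyzer. The differences:

1 The directory created under plugins is mmseg;

2 After adding the settings shipped with the source to elasticsearch.yml, startup reports:

"java.lang.ClassNotFoundException: org.elasticsearch.index.analysis.htmlstrip.HtmlStripTokenFilterFactory"

Searching Google did not turn up the cause; my guess is that it is related to html_strip. After commenting out the char_filter lines in elasticsearch.yml, it worked.

Still, writing char_filter this way is what the tutorial shows: http://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html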

------------------------

mmseg's tokenization results. Analyzing "中华人民共和国国歌" yields 4 fewer tokens than IK does; mmseg's dictionary is incomplete.

{
   "tokens": [
      {
         "token": "中华",
         "start_offset": 2,
         "end_offset": 4,
         "type": "word",
         "position": 1
      },
      {
         "token": "华人",
         "start_offset": 3,
         "end_offset": 5,
         "type": "word",
         "position": 2
      },
      {
         "token": "人民",
         "start_offset": 4,
         "end_offset": 6,
         "type": "word",
         "position": 3
      },
      {
         "token": "共和",
         "start_offset": 6,
         "end_offset": 8,
         "type": "word",
         "position": 4
      },
      {
         "token": "国",
         "start_offset": 8,
         "end_offset": 9,
         "type": "word",
         "position": 5
      },
      {
         "token": "国歌",
         "start_offset": 9,
         "end_offset": 11,
         "type": "word",
         "position": 6
      }
   ]
}
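
To compare two analyzers more systematically than by eye, the token strings can be pulled out of each `_analyze` response and diffed. This is a small sketch; the function names are illustrative, and the responses are assumed to have the `tokens` layout shown above.

```python
import json


def token_texts(analyze_response):
    """Return the token strings from an _analyze response (JSON text or dict)."""
    if isinstance(analyze_response, str):
        analyze_response = json.loads(analyze_response)
    return [t["token"] for t in analyze_response["tokens"]]


def missing_tokens(reference, candidate):
    """Tokens that appear in reference but not in candidate, in reference order."""
    seen = set(candidate)
    return [t for t in reference if t not in seen]


# Example, given two saved responses ik_json and mmseg_json:
# print(missing_tokens(token_texts(ik_json), token_texts(mmseg_json)))
```

Running the same text through both analyzers and diffing the lists shows exactly which words one dictionary has and the other lacks.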

0 0