Installing and using the Elasticsearch IK and pinyin analyzers

Source: Internet  Editor: 程序博客网  Time: 2024/05/18 00:43

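Search is the core of Elasticsearch (ES),

so any discussion of ES has to cover how it searches,

and any discussion of how it searches has to cover the analyzer configured in an index's mapping.

During setup, we create an index by sending a PUT request to ip:9200/index.

At that point the index has ES's default mapping, which uses the built-in standard analyzer.

When we then insert a document with a request such as ip:9200/index/type/1 -d '{ "name": "zhang san", "desc": " a beautiful boy " }',

the mapping automatically discovers the two fields name and desc,

detects both as strings (in ES 5.x they are mapped as text with a keyword sub-field), and stores them.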
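
The requests described above can be written out as a short shell sketch. It assumes a test node on localhost:9200 and reuses the index, type, and document from the text; ES_URL, DOC, and the function names are made up for illustration, and nothing is sent until a function is called.

```shell
# Assumption: a test Elasticsearch 5.x node; override ES_URL if yours differs.
ES_URL="${ES_URL:-http://localhost:9200}"

# The sample document from the text, written as valid JSON.
DOC='{"name":"zhang san","desc":" a beautiful boy "}'

# Create the index; ES applies its default settings and mapping.
create_index() {
  curl -s -XPUT "$ES_URL/index"
}

# Index the document under type "type" with id 1.
# Dynamic mapping adds the name and desc fields when it first sees them.
index_doc() {
  curl -s -XPUT "$ES_URL/index/type/1" -d "$DOC"
}

# Show the mapping that ES generated for the index.
show_mapping() {
  curl -s "$ES_URL/index/_mapping?pretty"
}
```

Calling create_index, index_doc, and show_mapping in that order should list both fields in the generated mapping.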

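What is a mapping?

An ES mapping is much like a data type in a statically typed language:

once a variable is declared as an int, it can only hold int values. Likewise, a field mapped as a numeric type can only store numeric values.

The mapping also defines how ES indexes the data and whether the data is searchable.

When a query does not return the data you expect, the mapping is a likely culprit. When in doubt, check the mapping directly:

ip:9200/index/_mapping?pretty

For creating, deleting, and modifying index mappings, see http://www.cnblogs.com/zhaijunming5/p/6426940.html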

By default, search uses the analyzer built into the mapping, standard. Let's test it:

/index/_analyze?analyzer=standard&text=我爱你我的家&pretty returns

{
  "tokens" : [
    { "token" : "我", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0 },
    { "token" : "爱", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "你", "start_offset" : 2, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "我", "start_offset" : 3, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 3 },
    { "token" : "的", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 4 },
    { "token" : "家", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 5 }
  ]
}

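Spitting out one character at a time is painful: the index holds no token for the word 我爱你, so an exact term lookup for it finds nothing and a match query can only match the individual characters. That is poor support for Chinese.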
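
The same analysis can also be requested with a JSON body, which avoids URL-encoding the Chinese text in the query string. This is a sketch that assumes the ES 5.x _analyze API on a local test node; ES_URL, analyze_body, and analyze are illustrative names.

```shell
# Assumption: a test Elasticsearch 5.x node on localhost:9200.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the _analyze request body for an analyzer name and a text.
# Assumes neither argument contains a double quote or a backslash.
analyze_body() {
  printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
}

# Send the body to the index's _analyze endpoint (not run automatically).
analyze() {
  analyze_body "$1" "$2" | curl -s -XPOST "$ES_URL/index/_analyze?pretty" -d @-
}
```

For example, `analyze standard 我爱你我的家` should return the per-character tokens listed above.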

So we need a different analyzer: IK.

First, here is what IK returns for the same text:

{
  "tokens" : [
    { "token" : "我爱你", "start_offset" : 0, "end_offset" : 3, "type" : "CN_WORD", "position" : 0 },
    { "token" : "爱你", "start_offset" : 1, "end_offset" : 3, "type" : "CN_WORD", "position" : 1 },
    { "token" : "你我", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 2 },
    { "token" : "的", "start_offset" : 4, "end_offset" : 5, "type" : "CN_CHAR", "position" : 3 },
    { "token" : "家", "start_offset" : 5, "end_offset" : 6, "type" : "CN_CHAR", "position" : 4 }
  ]
}

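IK's segmentation is clearly much friendlier.

IK ships with two analyzers; the result here comes from ik_max_word.
ik_max_word: splits the text at the finest granularity, producing as many words as possible
ik_smart: splits the text at the coarsest granularity; text already assigned to one word is not reused by another word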
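
To compare the two analyzers side by side, it helps to reduce an _analyze response to its token strings. The filter below is a rough sketch that assumes the pretty-printed layout (?pretty), where each "token" entry sits on its own line; tokens is an illustrative name.

```shell
# Read a pretty-printed _analyze response on stdin and print the token
# strings on one line, separated by spaces.
tokens() {
  sed -n 's/.*"token" *: *"\([^"]*\)".*/\1/p' | paste -s -d ' ' -
}
```

Piping the ik_max_word and ik_smart responses for the same text through tokens gives two short lines that are easy to compare by eye.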

Next, the IK installation process for 5.6.1.

See the original README on GitHub: https://github.com/medcl/elasticsearch-analysis-ik

Install

1.download or compile

option 1 - download the pre-built package from here: https://github.com/medcl/elasticsearch-analysis-ik/releases
unzip the plugin to the folder your-es-root/plugins/
option 2 - use elasticsearch-plugin to install ( version > v5.5.1 ):
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.6.1/elasticsearch-analysis-ik-5.6.1.zip

2.restart elasticsearch

Method 1: download the zip package and unzip it under your-es-home/plugins.

Method 2: install with the ES plugin tool; this requires a version greater than 5.5.1.

Last step: restart ES.
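
Method 2 can be wrapped in a small helper so the version appears only once. This is a sketch that follows the release URL quoted above; ES_VERSION, ik_zip_url, and install_ik are illustrative names, and the install step is only run when the function is called.

```shell
# The plugin version has to match the Elasticsearch version exactly.
ES_VERSION="${ES_VERSION:-5.6.1}"

# Build the release URL for a given IK version, following the URL above.
ik_zip_url() {
  printf 'https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v%s/elasticsearch-analysis-ik-%s.zip' "$1" "$1"
}

# Run from the Elasticsearch home directory, then restart the node.
install_ik() {
  ./bin/elasticsearch-plugin install "$(ik_zip_url "$ES_VERSION")"
}
```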

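That's all there is to it.

An existing mapping cannot be deleted (and the analyzer of an existing field cannot be changed), so we create a new index for testing.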

1.create a index
curl -XPUT http://localhost:9200/index

2.create a mapping
curl -XPOST http://localhost:9200/index/fulltext/_mapping -d'
{
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word"
            }
        }
}'

3.index some docs
curl -XPOST http://localhost:9200/index/fulltext/1 -d'
{"content":"美国留给伊拉克的是个烂摊子吗"}
'
curl -XPOST http://localhost:9200/index/fulltext/2 -d'
{"content":"公安部：各地校车将享最高路权"}
'
curl -XPOST http://localhost:9200/index/fulltext/3 -d'
{"content":"中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"}
'
curl -XPOST http://localhost:9200/index/fulltext/4 -d'
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}
'

4.query with highlighting
curl -XPOST http://localhost:9200/index/fulltext/_search  -d'
{
    "query" : { "match" : { "content" : "中国" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}'

Result
{
    "took": 14,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 2,
        "hits": [
            {
                "_index": "index",
                "_type": "fulltext",
                "_id": "4",
                "_score": 2,
                "_source": {
                    "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
                },
                "highlight": {
                    "content": [
                        "<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首 "
                    ]
                }
            },
            {
                "_index": "index",
                "_type": "fulltext",
                "_id": "3",
                "_score": 2,
                "_source": {
                    "content": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
                },
                "highlight": {
                    "content": [
                        "均每天扣1艘<tag1>中国</tag1>渔船 "
                    ]
                }
            }
        ]
    }
}

Searching then works as expected.

Next, the pinyin analyzer.
For usage, see the original README: https://github.com/medcl/elasticsearch-analysis-pinyin

First, here is how the pinyin analyzer tokenizes 我爱你我的家:

{
  "tokens" : [
    { "token" : "wo", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0 },
    { "token" : "ai", "start_offset" : 1, "end_offset" : 2, "type" : "word", "position" : 1 },
    { "token" : "ni", "start_offset" : 2, "end_offset" : 3, "type" : "word", "position" : 2 },
    { "token" : "wo", "start_offset" : 3, "end_offset" : 4, "type" : "word", "position" : 3 },
    { "token" : "de", "start_offset" : 4, "end_offset" : 5, "type" : "word", "position" : 4 },
    { "token" : "jia", "start_offset" : 5, "end_offset" : 6, "type" : "word", "position" : 5 },
    { "token" : "wanwdj", "start_offset" : 0, "end_offset" : 6, "type" : "word", "position" : 5 }
  ]
}

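OK, now let's install and use it.

The pinyin analyzer is installed the same way as IK: download the zip, unzip it into a pinyin folder under plugins, and restart.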
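
After the restart it is worth confirming that the node actually loaded both plugins. The sketch below uses the _cat/plugins endpoint; it assumes a local test node and that the plugins register under the component names analysis-ik and analysis-pinyin, so adjust the pattern if your listing shows something else. ES_URL, list_plugins, and has_plugin are illustrative names.

```shell
# Assumption: a test Elasticsearch 5.x node on localhost:9200.
ES_URL="${ES_URL:-http://localhost:9200}"

# List the plugins installed on each node (not run automatically).
list_plugins() {
  curl -s "$ES_URL/_cat/plugins"
}

# Succeed if the plugin listing on stdin mentions the given plugin.
# Assumption: components are named analysis-ik and analysis-pinyin.
has_plugin() {
  grep -q "analysis-$1"
}
```

For example, `list_plugins | has_plugin pinyin && echo "pinyin is loaded"`.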

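Following the test cases in the official git repository, we put together one index that supports both pinyin and Chinese search.

First, we create an index, index2, and define a custom analyzer in its settings. This analyzer tokenizes with ik_smart and passes the tokens through a pinyin filter.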

curl -XPUT "http://localhost:9200/index2/" -d'
{
    "index": {
        "analysis": {
            "analyzer": {
                "ik_pinyin_analyzer": {
                    "type": "custom",
                    "tokenizer": "ik_smart",
                    "filter": ["my_pinyin", "word_delimiter"]
                }
            },
            "filter": {
                "my_pinyin": {
                    "type": "pinyin",
                    "first_letter": "prefix",
                    "padding_char": " "
                }
            }
        }
    }
}'

Create the type message and set its mapping.

It defines two fields, name and desc:

name is set up for pinyin search, and desc for IK search.

curl -XPOST http://localhost:9200/index2/message/_mapping -d'
{
    "message": {
        "properties": {
            "name": {
                "type": "text",
                "store": "no",
                "term_vector": "with_positions_offsets",
                "analyzer": "ik_pinyin_analyzer",
                "boost": 10
            },
            "desc": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word"
            }
        }
    }
}'

Insert the data. Here I indexed 6 documents; a search over the index returns them all:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : 6,
    "max_score" : 1.0,
    "hits" : [
      { "_index" : "index2", "_type" : "message", "_id" : "5", "_score" : 1.0, "_source" : { "name" : "我的宝贝", "desc" : "骑车在沿途的风景树" } },
      { "_index" : "index2", "_type" : "message", "_id" : "2", "_score" : 1.0, "_source" : { "name" : "zhangsan", "desc" : "阿斯顿发放" } },
      { "_index" : "index2", "_type" : "message", "_id" : "4", "_score" : 1.0, "_source" : { "name" : "张三", "desc" : "从淘汰率的赫尔" } },
      { "_index" : "index2", "_type" : "message", "_id" : "6", "_score" : 1.0, "_source" : { "name" : "丁雪峰", "desc" : "依然爱你我的梦" } },
      { "_index" : "index2", "_type" : "message", "_id" : "1", "_score" : 1.0, "_source" : { "name" : "李连杰", "desc" : "谢谢我吧我爱你" } },
      { "_index" : "index2", "_type" : "message", "_id" : "3", "_score" : 1.0, "_source" : { "name" : "刘德华", "desc" : "我爱你我的家" } }
    ]
  }
}
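
The insert commands themselves are not shown, so here is one way the six documents listed above could be indexed. It is a sketch, not the original commands: it assumes a local test node, and ES_URL, DOCS, doc_json, and index_all are illustrative names. The ids, names, and descriptions are copied from the search result.

```shell
# Assumption: a test Elasticsearch 5.x node on localhost:9200.
ES_URL="${ES_URL:-http://localhost:9200}"

# id|name|desc rows copied from the search result above.
DOCS='1|李连杰|谢谢我吧我爱你
2|zhangsan|阿斯顿发放
3|刘德华|我爱你我的家
4|张三|从淘汰率的赫尔
5|我的宝贝|骑车在沿途的风景树
6|丁雪峰|依然爱你我的梦'

# Build one document body; assumes no quotes or backslashes in the values.
doc_json() {
  printf '{"name":"%s","desc":"%s"}' "$1" "$2"
}

# Index every row into index2/message under its id (not run automatically).
index_all() {
  printf '%s\n' "$DOCS" | while IFS='|' read -r id name desc; do
    doc_json "$name" "$desc" | curl -s -XPUT "$ES_URL/index2/message/$id" -d @-
    echo
  done
}
```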

Test result: pinyin search on name works, and IK search on desc works!
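
The test queries are not shown either; a match query per field is the simplest way to try both analyzers. This is a sketch assuming a local test node, with ES_URL, match_body, and search as illustrative names. Which pinyin terms actually match depends on the options of the pinyin filter.

```shell
# Assumption: a test Elasticsearch 5.x node on localhost:9200.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build a match query for one field; assumes no quotes in the arguments.
match_body() {
  printf '{"query":{"match":{"%s":"%s"}}}' "$1" "$2"
}

# Run the query against index2/message (not run automatically).
search() {
  match_body "$1" "$2" | curl -s -XPOST "$ES_URL/index2/message/_search?pretty" -d @-
}
```

For example, `search name liu` exercises the pinyin analyzer on name, and `search desc 我爱你` exercises IK on desc.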
