Solr 6.6 Tokenization

Source: Internet | Published by: 路由器怎么没有网络啊 | Editor: 程序博客网 | Time: 2024/06/06 10:48

1. The built-in tokenizer StandardTokenizerFactory

StandardTokenizerFactory is a tokenizer that ships with Solr. You can find it in use at around line 380 of the managed-schema file.

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymGraphFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

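The field type that wraps this tokenizer is named text_general.
Besides the tokenizer, the field type adds several filters, and it declares an analyzer for both index time and query time, each built on StandardTokenizerFactory. Some fields in Solr already use this field type, for example _text_: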

 <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>

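Next, a small experiment: give the content_core field in mycore the text_general field type so that it is tokenized.

2. Experiment

2.1 Modify managed-schema

Continuing the earlier example, edit the conf/managed-schema file in mycore so that it reads as follows: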

<!-- custom fieldType definitions -->
<fieldType name="mycore_string" class="solr.StrField" sortMissingLast="true" docValues="true" />
<fieldType name="mycore_int" class="solr.TrieIntField" docValues="true" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="mycore_date" class="solr.TrieDateField" docValues="true" precisionStep="0" positionIncrementGap="0"/>
<!-- the id in column is the database id; the id in name is the id in managed-schema; id is required and must be unique -->
<field name="vip" type="mycore_string" indexed="true" stored="true" />
<field name="point" type="mycore_int" indexed="false" stored="true" />
<field name="content_core" type="text_general" indexed="true" stored="true"/>
<field name="add_time" type="mycore_date" indexed="true" stored="true"/>

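As you can see, the type of content_core is changed to text_general, which makes the field tokenized.

Then, as before, open the admin UI at http://localhost:8983/solr/#/ to reload the configuration and re-import the data:

Reload the configuration: Core Admin -> mycore -> Reload

Re-import the data: mycore -> Dataimport -> Entity -> mycore_test -> Execute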
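
The same two admin steps can also be triggered over HTTP. The following is a minimal sketch, assuming the CoreAdmin API is reachable at the base URL used in this article and that the DataImportHandler is registered at /dataimport in solrconfig.xml (the handler path is an assumption, as are the helper names reload_url, full_import_url, and call):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR = "http://localhost:8983/solr"  # base URL taken from the article


def reload_url(core):
    """Build the CoreAdmin RELOAD URL for a core."""
    params = {"action": "RELOAD", "core": core, "wt": "json"}
    return f"{SOLR}/admin/cores?" + urlencode(params)


def full_import_url(core, entity):
    """Build a DataImportHandler full-import URL.

    Assumes the handler is mapped to /dataimport (hypothetical for your setup).
    """
    params = {
        "command": "full-import",
        "entity": entity,
        "clean": "true",   # delete existing documents before importing
        "commit": "true",  # commit once the import finishes
        "wt": "json",
    }
    return f"{SOLR}/{core}/dataimport?" + urlencode(params)


def call(url):
    """Send a GET request and return the response body as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


# Usage (requires a running Solr instance):
#   print(call(reload_url("mycore")))
#   print(call(full_import_url("mycore", "mycore_test")))
```

The import runs asynchronously, so the response to the second call only confirms that the import has started.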

2.2 What tokenization does

Follow the steps below to see the effect of tokenization.

Select mycore and open Analysis.
Enter the text "aa张三".
Set Analyse Fieldname / FieldType to content_core.
Click the Analyse Values button on the far right.
The text is split into "aa", "张", "三".
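
The Analysis screen can also be reproduced with a plain HTTP request. This is a sketch only, assuming the core exposes the field analysis handler at /analysis/field; the helper name analysis_url is made up for illustration:

```python
from urllib.parse import urlencode


def analysis_url(core, field, text, base="http://localhost:8983/solr"):
    """Build a field-analysis URL for the given field and sample text.

    Assumes the field analysis handler is available at /analysis/field.
    """
    params = {
        "analysis.fieldname": field,   # analyze with this field's analyzer
        "analysis.fieldvalue": text,   # text to run through the index analyzer
        "wt": "json",
    }
    return f"{base}/{core}/analysis/field?" + urlencode(params)


url = analysis_url("mycore", "content_core", "aa张三")
print(url)
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return the token stream produced at each stage of the analyzer chain.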

So what is tokenization actually good for? Run a few queries and look at the results.
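
For reference, the queries tried here can be written as select URLs. A minimal sketch, assuming the default /select handler and the base URL used in this article (select_url is a hypothetical helper):

```python
from urllib.parse import urlencode


def select_url(core, field, keyword, base="http://localhost:8983/solr"):
    """Build a /select URL that searches one field for one keyword."""
    params = {"q": f"{field}:{keyword}", "wt": "json"}
    return f"{base}/{core}/select?" + urlencode(params)


# The three keywords tried in this section.
urls = [select_url("mycore", "content_core", kw) for kw in ("a", "aa", "aa张")]
for u in urls:
    print(u)
```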

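The results: the keyword "a" finds nothing, while the keyword "aa" finds aa张三. The keyword "aa张", however, finds both aa张三 and bb张三. Why?

When data is imported, the index runs it through the analyzer of the field's type, text_general, and stores the resulting tokens. Suppose only two records were imported: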

aa张三
bb张三

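For "aa张三", the Analysis screen showed that the text is split into "aa", "张", "三".
Trying the same thing with "bb张三" shows that it is split into "bb", "张", "三". Treat each token list as a set:

Set a: [aa, 张, 三]
Set b: [bb, 张, 三]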

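Steps:

Searching for a: the keyword a cannot be split any further, so Solr looks for a token equal to a in sets a and b. Neither set contains one (aa is not equal to a), so nothing is found.
Searching for aa: the keyword aa is not split either (you can confirm this by entering aa in Analysis). Set a [aa, 张, 三] contains the token aa, which matches the keyword, so the document with content_core:aa张三 is returned.
Searching for aa张: this keyword is tokenized (again, you can check in Analysis) into two tokens, aa and 张. Both tokens are found in set a [aa, 张, 三], and set b [bb, 张, 三] also contains the token 张. A document matches as soon as any one query token matches, so the content_core values of both sets are returned.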
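
The set model in the steps above can be sketched in a few lines of Python. This is only an illustration, not Solr's implementation: the regular expression is a simplified stand-in for StandardTokenizerFactory plus LowerCaseFilterFactory (a run of ASCII letters or digits becomes one token, each Chinese character becomes its own token), the stop-word and synonym filters are left out, and query tokens are combined with OR as described above.

```python
import re

# Simplified tokenizer: ASCII alphanumeric runs, or single CJK ideographs.
TOKEN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def analyze(text):
    """Split text into lowercase tokens (stand-in for the text_general analyzer)."""
    return [tok.lower() for tok in TOKEN.findall(text)]


def build_index(docs):
    """Build an inverted index: token -> set of document positions."""
    index = {}
    for doc_id, text in enumerate(docs):
        for tok in analyze(text):
            index.setdefault(tok, set()).add(doc_id)
    return index


def search(index, docs, keyword):
    """Return documents containing at least one token of the keyword (OR)."""
    hits = set()
    for tok in analyze(keyword):
        hits |= index.get(tok, set())
    return [docs[i] for i in sorted(hits)]


docs = ["aa张三", "bb张三"]
index = build_index(docs)

for kw in ("a", "aa", "aa张"):
    print(kw, "->", search(index, docs, kw))
```

The lookup compares whole tokens, which is why "a" finds nothing even though "aa" starts with "a".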

This is the role tokenization plays in search.
