Setting up the Ansj analyzer for ElasticSearch and fixing the Ansj stop-word bug (QQ group: 189040279)

Recently, for business-specific reasons, our company switched its search engine from Solr to ElasticSearch. The teammate who previously owned search had been running ElasticSearch 2.2.1. Instead of reusing that version, I decided to build on the latest 2.x release (ElasticSearch 2.4.5), for no reason other than that a newer release presumably fixes problems found in older ones, even problems I had not yet run into myself. In practice I still hit plenty of issues, so I am writing them all down here to share. To get better search quality we built the cluster on ElasticSearch plus the Ansj analyzer, and to improve Ansj's segmentation we ran keyword recognition over several million of our own records and added the resulting millions of domain-specific terms to the dictionary. Before adopting ElasticSearch I had also skimmed a few related references and gone through the ElasticSearch usage and configuration material once or twice.
After a few hours of work the data model ElasticSearch needed for indexing was written. I plugged the model into the indexing program, configured the database and dictionary connections, set up the downloaded Ansj plugin and Ansj analyzer on the ElasticSearch side, and started loading data. The whole load went smoothly; within a few hours tens of millions of rows had been pushed from MySQL into ElasticSearch.
While testing search quality against ElasticSearch we found the following problems:
(1) Keywords containing punctuation return no results, e.g. searching for 万科,王石
(2) Keywords containing a space return no results, e.g. searching for 万科 王石
(3) Keywords wrapped in quotes return no results, e.g. searching for "万科王石"
(4) The search engine throws an error when highlighting results; the exception is as follows:

RemoteTransportException[[13.35][127.0.0.1:9300][indices:data/read/search[phase/fetch/id]]]; nested: FetchPhaseExecutionException[Fetch Failed [Failed to highlight field [contents]]]; nested: StringIndexOutOfBoundsException[String index out of range: -5];
Caused by: FetchPhaseExecutionException[Fetch Failed [Failed to highlight field [contents]]]; nested: StringIndexOutOfBoundsException[String index out of range: -5];
    at org.elasticsearch.search.highlight.FastVectorHighlighter.highlight(FastVectorHighlighter.java:169)
    at org.elasticsearch.search.highlight.HighlightPhase.hitExecute(HighlightPhase.java:140)
    at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:188)
    at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:605)
    at org.elasticsearch.search.action.SearchServiceTransportAction$FetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:408)
    at org.elasticsearch.search.action.SearchServiceTransportAction$FetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:405)
    at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:77)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:378)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -5
    at java.lang.String.substring(String.java:1967)
    at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.makeFragment(BaseFragmentsBuilder.java:179)
    at org.elasticsearch.search.highlight.vectorhighlight.SimpleFragmentsBuilder.makeFragment(SimpleFragmentsBuilder.java:43)
    at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.createFragments(BaseFragmentsBuilder.java:144)
    at org.apache.lucene.search.vectorhighlight.FastVectorHighlighter.getBestFragments(FastVectorHighlighter.java:186)
    at org.elasticsearch.search.highlight.FastVectorHighlighter.highlight(FastVectorHighlighter.java:146)
    ... 12 more
We analyzed these problems. Issues 1, 2 and 3 share the same root cause: stop words are not taking effect. After reading the source of the Ansj analyzer and of the ansj-plugin, we found bugs in both components: in Ansj, a space cannot be used as a stop word; in the ansj-plugin, the stop-word list is never applied at all.


The code changes to ansj-plugin and Ansj that fix the stop-word bugs are as follows.

[The three screenshots of the ansj and ansj-plugin patches from the original post are not reproduced here.]
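
Since the screenshots are unavailable, here is a minimal, hypothetical sketch of where stop-word filtering plugs into a Lucene analyzer chain. It is not the author's actual patch: it uses Lucene 5.x classes (the Lucene line shipped with ElasticSearch 2.4.x) and a WhitespaceTokenizer as a stand-in for the Ansj tokenizer; the stop-word list is made up for illustration.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.util.CharArraySet;

import java.util.Arrays;

/**
 * Sketch only: an analyzer whose token stream is wrapped in a StopFilter, so terms
 * configured as stop words (e.g. ",") never reach the index. A real setup would put
 * the Ansj tokenizer here instead of WhitespaceTokenizer and load the stop-word list
 * from the ansj/plugin configuration.
 */
public class StopWordAnalyzerSketch extends Analyzer {

  // Hypothetical stop-word list, for illustration only.
  private static final CharArraySet STOP_WORDS =
      new CharArraySet(Arrays.asList(",", ",", "的"), /*ignoreCase=*/ true);

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new WhitespaceTokenizer();              // stand-in for the Ansj tokenizer
    TokenStream filtered = new StopFilter(tokenizer, STOP_WORDS); // drop configured stop words
    return new TokenStreamComponents(tokenizer, filtered);
  }
}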


After these changes stop words work as expected: searching for 万科 王石 or for 万科,王石 both return results.

The bug where a long keyword wrapped in quotes ("") returns no results
When a keyword is quoted, Lucene first tokenizes the text inside the quotes, finds the documents that contain all of the resulting terms, and then checks that the position distance between the matched terms in the document is 1; only documents satisfying both conditions become candidates. My first suspicion was therefore that a Lucene upgrade had broken this behavior, so I tested several Lucene versions, and several analyzers against a single version. The tests showed the problem comes from the Ansj analyzer. Reading the Ansj code again, I found that this version sorts the terms produced by the index-style segmentation by their start offset. Lucene, however, assigns each term a position by iterating over the token stream in emission order and adding the position increment (Ansj sets the increment to 1).
For example, segmenting the sentence 对交易者而言波段为王 produces:
Ansj native order: [对/p, 交易者/nz, 而言/u, 波段/n, 为王/n, 交/v, 交易/vn, 易/ad, 易者/n, 者/nr, 而/cc, 言/vg, 波/n, 段/q, 为/p, 王/nr]
Ansj after sorting: [对/p, 交易者/nz, 交易/vn, 交/v, 易者/n, 易/ad, 者/nr, 而言/u, 而/cc, 言/vg, 波段/n, 波/n, 段/q, 为王/n, 为/p, 王/nr]
The positions Lucene records are therefore:
Positions from the native order: [对 1, 交易者 2, 而言 3, 波段 4, 为王 5, 交 6, 交易 7, 易 8, 易者 9, 者 10, 而 11, 言 12, 波 13, 段 14, 为 15, 王 16]
Positions from the sorted order: [对 1, 交易者 2, 交易 3, 交 4, 易者 5, 易 6, 者 7, 而言 8, 而 9, 言 10, 波段 11, 波 12, 段 13, 为王 14, 为 15, 王 16]
With the sorted order, a quoted search for "对交易者" still matches, but a quoted search for "波段为王" does not, because the term 为王 sits at a position distance of 3 from the term 波段 when issued as a phrase query. This is why quoting a long keyword often returns no results.
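
As an illustration (not code from the original post), the following Lucene 5.x sketch shows the adjacency requirement a quoted search boils down to; with the positions listed above, 波段 (11) and 为王 (14) are not adjacent, so the phrase cannot match at slop 0. The field name "contents" is assumed.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class PhraseQuerySketch {
  public static void main(String[] args) {
    // A quoted search for 波段为王 effectively becomes a phrase query over its terms.
    PhraseQuery.Builder builder = new PhraseQuery.Builder();
    builder.add(new Term("contents", "波段"));
    builder.add(new Term("contents", "为王"));
    builder.setSlop(0); // terms must occupy consecutive positions in the index
    PhraseQuery query = builder.build();
    // With positions 11 and 14 (sorted token order) the gap exceeds the slop, so no match;
    // with positions 4 and 5 (native token order) the phrase matches.
    System.out.println(query);
  }
}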
The fix is simple: comment out the sorting of the index-style segmentation result, as shown below:
[Screenshot of the Ansj patch (the commented-out sort) from the original post is not reproduced here.]

The ES highlighting error: StringIndexOutOfBoundsException[String index out of range: -5]
When ElasticSearch highlights a hit, the default highlighter is chosen from how the field is stored according to the mapping. The term-vector based highlighter (fvh, FastVectorHighlighter) has a requirement: the query side must also rely on term vectors (driving it with index-style segmentation leads to overlapping highlights and to exceptions). In practice we found that multi-valued fields are especially prone to highlight array-out-of-bounds errors and misplaced highlight tags. After reading the source of the Lucene highlighting module, of FastVectorHighlighter, and of lucene-core, highlighting proceeds in the following steps:
1. Parse the highlight query and extract the target terms.
2. Read the term vectors of the target docid and pull out the vectors of the target terms.
3. Compute the combinations of matching term vectors and keep the top N groups (5 by default).
4. Load the field content of the document (for a multi-valued field, the values are joined into one string with a space between them).
5. Walk the selected groups and, using the offsets of the terms in each group, cut a fragment out of the text (the fragment is extended forwards and backwards until a length limit is exceeded or an ASCII punctuation character is found).
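
For reference (illustrative, not from the original post), the Lucene-level entry point behind these steps looks roughly like this; the field name "contents" is assumed, and fragCharSize / maxNumFragments correspond to the fragment length limit and the "top 5 groups" default mentioned above.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
import org.apache.lucene.search.vectorhighlight.FieldQuery;

public class FvhSketch {
  // Returns up to maxNumFragments highlighted snippets of the "contents" field of docId.
  static String[] highlight(IndexReader reader, Query query, int docId) throws Exception {
    FastVectorHighlighter fvh = new FastVectorHighlighter();
    FieldQuery fieldQuery = fvh.getFieldQuery(query);       // step 1: extract target terms
    return fvh.getBestFragments(fieldQuery, reader, docId,  // steps 2-5: read vectors, pick
        "contents", /*fragCharSize=*/ 100,                  // groups, cut fragments from the
        /*maxNumFragments=*/ 5);                            // stored field content
  }
}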
Highlighting therefore depends on the term offsets recorded at index time. In the lucene-core source, the offset recorded for a term is:
term offset = accumulated offset of all previous values + offset of the term within the current value
accumulated offset of the previous values = accumulated offset of the values before them + end offset of the last term of the previous value
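
A tiny worked example of this bookkeeping (the values and the offset gap are made up for illustration):

public class OffsetAccumulationSketch {
  public static void main(String[] args) {
    // Two values of a multi-valued field; the highlighter later joins them with a space.
    String[] values = { "万科企业股份有限公司", "王石是万科的创始人" };
    int offsetGap = 1;      // what Analyzer.getOffsetGap() typically returns
    int accumulated = 0;    // "accumulated offset of all previous values"
    for (String value : values) {
      // A term starting at offset p inside this value is recorded at accumulated + p,
      // e.g. 王石 at p = 0 inside the second value gets global offset 11 here.
      System.out.println("value \"" + value + "\" starts at global offset " + accumulated);
      accumulated += value.length() + offsetGap;
    }
  }
}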


   /** Inverts one field for one document; first is true
     *  if this is the first time we are seeing this field
     *  name in this document. */
    public void invert(IndexableField field, boolean first) throws IOException, AbortingException {
      if (first) {
        // First time we're seeing this field (indexed) in
        // this document:
        invertState.reset();
      }

      IndexableFieldType fieldType = field.fieldType();

      IndexOptions indexOptions = fieldType.indexOptions();
      fieldInfo.setIndexOptions(indexOptions);

      if (fieldType.omitNorms()) {
        fieldInfo.setOmitsNorms();
      }

      final boolean analyzed = fieldType.tokenized() && docState.analyzer != null;
        
      // only bother checking offsets if something will consume them.
      // TODO: after we fix analyzers, also check if termVectorOffsets will be indexed.
      final boolean checkOffsets = indexOptions == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;

      /*
       * To assist people in tracking down problems in analysis components, we wish to write the field name to the infostream
       * when we fail. We expect some caller to eventually deal with the real exception, so we don't want any 'catch' clauses,
       * but rather a finally that takes note of the problem.
       */
      boolean succeededInProcessingField = false;
     
      try (TokenStream stream = tokenStream = field.tokenStream(docState.analyzer, tokenStream)) {
        // reset the TokenStream to the first token
        stream.reset();
        invertState.setAttributeSource(stream);
        termsHashPerField.start(field, first);

        while (stream.incrementToken()) {
        
          int posIncr = invertState.posIncrAttribute.getPositionIncrement();
          invertState.position += posIncr;
          if (invertState.position < invertState.lastPosition) {
            if (posIncr == 0) {
              throw new IllegalArgumentException("first position increment must be > 0 (got 0) for field '" + field.name() + "'");
            } else {
              throw new IllegalArgumentException("position increments (and gaps) must be >= 0 (got " + posIncr + ") for field '" + field.name() + "'");
            }
          } else if (invertState.position > IndexWriter.MAX_POSITION) {
            throw new IllegalArgumentException("position " + invertState.position + " is too large for field '" + field.name() + "': max allowed position is " + IndexWriter.MAX_POSITION);
          }
          invertState.lastPosition = invertState.position;
          if (posIncr == 0) {
            invertState.numOverlap++;
          }
              
          if (checkOffsets) {
            int startOffset = invertState.offset + invertState.offsetAttribute.startOffset();
            int endOffset = invertState.offset + invertState.offsetAttribute.endOffset();
            if (startOffset < invertState.lastStartOffset || endOffset < startOffset) {
              throw new IllegalArgumentException("startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards "
                                                 + "startOffset=" + startOffset + ",endOffset=" + endOffset + ",lastStartOffset=" + invertState.lastStartOffset + " for field '" + field.name() + "'");
            }
            invertState.lastStartOffset = startOffset;
          }

          invertState.length++;
          if (invertState.length < 0) {
            throw new IllegalArgumentException("too many tokens in field '" + field.name() + "'");
          }
          //System.out.println("  term=" + invertState.termAttribute);

          // If we hit an exception in here, we abort
          // all buffered documents since the last
          // flush, on the likelihood that the
          // internal state of the terms hash is now
          // corrupt and should not be flushed to a
          // new segment:
          try {
            termsHashPerField.add();
          } catch (MaxBytesLengthExceededException e) {
            byte[] prefix = new byte[30];
            BytesRef bigTerm = invertState.termAttribute.getBytesRef();
            System.arraycopy(bigTerm.bytes, bigTerm.offset, prefix, 0, 30);
            String msg = "Document contains at least one immense term in field=\"" + fieldInfo.name + "\" (whose UTF8 encoding is longer than the max length " + DocumentsWriterPerThread.MAX_TERM_LENGTH_UTF8 + "), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '" + Arrays.toString(prefix) + "...', original message: " + e.getMessage();
            if (docState.infoStream.isEnabled("IW")) {
              docState.infoStream.message("IW", "ERROR: " + msg);
            }
            // Document will be deleted above:
            throw new IllegalArgumentException(msg, e);
          } catch (Throwable th) {
            throw AbortingException.wrap(th);
          }
        }

        stream.end();
        // TODO: maybe add some safety? then again, it's already checked
        // when we come back around to the field...
        // add by jkuang: accumulate the offset by the raw length of this value instead of
        // relying on offsetAttribute.endOffset(), which is not reliable after stream.end()
        String value = field.stringValue();
        invertState.offset += value==null?0:value.length();
        invertState.position += invertState.posIncrAttribute.getPositionIncrement();
        /* if there is an exception coming through, we won't set this to true here:*/
        succeededInProcessingField = true;
      } finally {
        if (!succeededInProcessingField && docState.infoStream.isEnabled("DW")) {
          docState.infoStream.message("DW", "An exception was thrown while processing field " + fieldInfo.name);
        }
      }

      if (analyzed) {
        invertState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
        invertState.offset += docState.analyzer.getOffsetGap(fieldInfo.name);
      }

      invertState.boost *= field.boost();
    }
  }

With the original approach, however, stream.end() has already been executed at this point, so the value of invertState.offsetAttribute.endOffset() comes back cleared; moreover, for a multi-valued field the offsets also go wrong as soon as the last character of any of the first n-1 values is a stop character. The patch marked "add by jkuang" above therefore accumulates the raw string length of each value instead.
[Screenshot of the patch from the original post is not reproduced here.]

After this change, highlighted searches against ElasticSearch still occasionally failed with StringIndexOutOfBoundsException[String index out of range: -5]. The cause lies in step 3 above ("compute the combinations of matching term vectors and keep the top N groups, 5 by default"): the terms inside a group are never sorted, and neither are the groups themselves, yet when Lucene builds the highlighted fragment it always continues copying from the end of the previous highlight. For example, when highlighting 万科,王石: after highlighting 万科 the recorded copy position is, say, 10, so the highlight for 王石 can only start from 10; but if 王石 actually starts before 10, say at 5, the result is

StringIndexOutOfBoundsException[String index out of range: -7]
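
A minimal reproduction of that arithmetic (the offsets are illustrative; with Java 8's String.substring the exception message matches the one above):

public class NegativeSubstringDemo {
  public static void main(String[] args) {
    String joinedFieldValue = "万科的董事长是王石,王石创办了万科。";
    int srcIndex = 10;      // the fragment builder has already copied up to offset 10
    int nextTermStart = 5;  // the next highlighted term actually starts earlier, at offset 5
    // Same shape as the copy in BaseFragmentsBuilder.makeFragment(...): copying from the
    // current position up to the start of the next term yields a negative-length substring.
    String copied = joinedFieldValue.substring(srcIndex, nextTermStart);
    System.out.println(copied); // never reached: StringIndexOutOfBoundsException: -5
  }
}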

The fix is to make the Lucene highlighting module sort the groups, specifically by modifying the constructors of org.apache.lucene.search.vectorhighlight.FieldFragList.WeightedFragInfo and of its inner SubInfo class.

 public WeightedFragInfo( int startOffset, int endOffset, List<SubInfo> subInfos, float totalBoost ){
      this.startOffset = startOffset;
      this.endOffset = endOffset;
      this.totalBoost = totalBoost;
      this.subInfos = subInfos;
      // Added: sort the SubInfos by the start offset of their first term so that the
      // fragment is highlighted in left-to-right order.
      Collections.sort(this.subInfos, new Comparator<SubInfo>() {
        @Override
        public int compare(SubInfo o1, SubInfo o2) {
          if (o1.getTermsOffsets().size() == 0) {
            return -1;
          } else if (o2.getTermsOffsets().size() == 0) {
            return 1;
          } else {
            return o1.getTermsOffsets().get(0).getStartOffset() - o2.getTermsOffsets().get(0).getStartOffset();
          }
        }
      });
      // Added: widen the fragment boundaries so they cover every term offset in the group.
      for (SubInfo info : subInfos) {
        for (Toffs ot : info.getTermsOffsets()) {
          if (this.startOffset > ot.getStartOffset()) {
            this.startOffset = ot.getStartOffset();
          }
          if (this.endOffset < ot.getEndOffset()) {
            this.endOffset = ot.getEndOffset();
          }
        }
      }
    }
    
    public List<SubInfo> getSubInfos(){
      return subInfos;
    }
    
    public float getTotalBoost(){
      return totalBoost;
    }
    
    public int getStartOffset(){
      return startOffset;
    }
    
    public int getEndOffset(){
      return endOffset;
    }
    
    @Override
    public String toString(){
      StringBuilder sb = new StringBuilder();
      sb.append( "subInfos=(" );
      for( SubInfo si : subInfos )
        sb.append( si.toString() );
      sb.append( ")/" ).append( totalBoost ).append( '(' ).append( startOffset ).append( ',' ).append( endOffset ).append( ')' );
      return sb.toString();
    }
    
    /**
     * Represents the list of term offsets for some text
     */
    public static class SubInfo {
      private final String text;  // unnecessary member, just exists for debugging purpose
      private final List<Toffs> termsOffsets;   // usually termsOffsets.size() == 1,
                              // but if position-gap > 1 and slop > 0 then size() could be greater than 1
      private final int seqnum;
      private final float boost; // used for scoring split WeightedPhraseInfos.

      public SubInfo( String text, List<Toffs> termsOffsets, int seqnum, float boost ){
        this.text = text;
        this.termsOffsets = termsOffsets;
        this.seqnum = seqnum;
        this.boost = boost;
        // Added: keep the term offsets of this SubInfo ordered by start offset.
        Collections.sort(this.termsOffsets, new Comparator<Toffs>() {
            @Override
            public int compare(Toffs o1, Toffs o2) {
                return o1.getStartOffset() - o2.getStartOffset();
            }
        });

      }
      
      public List<Toffs> getTermsOffsets(){
        return termsOffsets;
      }
      
      public int getSeqnum(){
        return seqnum;
      }

      public String getText(){
        return text;
      }

      public float getBoost(){
        return boost;
      }

      @Override
      public String toString(){
        StringBuilder sb = new StringBuilder();
        sb.append( text ).append( '(' );
        for( Toffs to : termsOffsets )
          sb.append( to.toString() );
        sb.append( ')' );
        return sb.toString();
      }
    }
  }
}

At the same time, add sorting to the code that produces the per-field highlight groups (BaseFragmentsBuilder).

  protected List<WeightedFragInfo> discreteMultiValueHighlighting(List<WeightedFragInfo> fragInfos, Field[] fields) {
    Map<String, List<WeightedFragInfo>> fieldNameToFragInfos = new HashMap<>();
    for (Field field : fields) {
      fieldNameToFragInfos.put(field.name(), new ArrayList<WeightedFragInfo>());
    }

    fragInfos: for (WeightedFragInfo fragInfo : fragInfos) {
      int fieldStart;
      int fieldEnd = 0;
      for (Field field : fields) {
        if (field.stringValue().isEmpty()) {
          fieldEnd++;
          continue;
        }
        fieldStart = fieldEnd;
        fieldEnd += field.stringValue().length() + 1; // + 1 for going to next field with same name.

        if (fragInfo.getStartOffset() >= fieldStart && fragInfo.getEndOffset() >= fieldStart &&
            fragInfo.getStartOffset() <= fieldEnd && fragInfo.getEndOffset() <= fieldEnd) {
          fieldNameToFragInfos.get(field.name()).add(fragInfo);
          continue fragInfos;
        }

        if (fragInfo.getSubInfos().isEmpty()) {
          continue fragInfos;
        }

        Toffs firstToffs = fragInfo.getSubInfos().get(0).getTermsOffsets().get(0);
        if (fragInfo.getStartOffset() >= fieldEnd || firstToffs.getStartOffset() >= fieldEnd) {
          continue;
        }

        int fragStart = fieldStart;
        if (fragInfo.getStartOffset() > fieldStart && fragInfo.getStartOffset() < fieldEnd) {
          fragStart = fragInfo.getStartOffset();
        }

        int fragEnd = fieldEnd;
        if (fragInfo.getEndOffset() > fieldStart && fragInfo.getEndOffset() < fieldEnd) {
          fragEnd = fragInfo.getEndOffset();
        }


        List<SubInfo> subInfos = new ArrayList<>();
        Iterator<SubInfo> subInfoIterator = fragInfo.getSubInfos().iterator();
        float boost = 0.0f;  //  The boost of the new info will be the sum of the boosts of its SubInfos
        while (subInfoIterator.hasNext()) {
          SubInfo subInfo = subInfoIterator.next();
          List<Toffs> toffsList = new ArrayList<>();
          Iterator<Toffs> toffsIterator = subInfo.getTermsOffsets().iterator();
          while (toffsIterator.hasNext()) {
            Toffs toffs = toffsIterator.next();
            if (toffs.getStartOffset() >= fieldEnd) {
              // We've gone past this value so its not worth iterating any more.
              break;
            }
            boolean startsAfterField = toffs.getStartOffset() >= fieldStart;
            boolean endsBeforeField = toffs.getEndOffset() < fieldEnd;
            if (startsAfterField && endsBeforeField) {
              // The Toff is entirely within this value.
              toffsList.add(toffs);
              toffsIterator.remove();
            } else if (startsAfterField) {
              /*
               * The Toffs starts within this value but ends after this value
               * so we clamp the returned Toffs to this value and leave the
               * Toffs in the iterator for the next value of this field.
               */
              toffsList.add(new Toffs(toffs.getStartOffset(), fieldEnd - 1));
            } else if (endsBeforeField) {
              /*
               * The Toffs starts before this value but ends in this value
               * which means we're really continuing from where we left off
               * above. Since we use the remainder of the offset we can remove
               * it from the iterator.
               */
              toffsList.add(new Toffs(fieldStart, toffs.getEndOffset()));
              toffsIterator.remove();
            } else {
              /*
               * The Toffs spans the whole value so we clamp on both sides.
               * This is basically a combination of both arms of the loop
               * above.
               */
              toffsList.add(new Toffs(fieldStart, fieldEnd - 1));
            }
          }
          if (!toffsList.isEmpty()) {
            subInfos.add(new SubInfo(subInfo.getText(), toffsList, subInfo.getSeqnum(), subInfo.getBoost()));
            boost += subInfo.getBoost();
          }

          if (subInfo.getTermsOffsets().isEmpty()) {
            subInfoIterator.remove();
          }
        }
        WeightedFragInfo weightedFragInfo = new WeightedFragInfo(fragStart, fragEnd, subInfos, boost);
        fieldNameToFragInfos.get(field.name()).add(weightedFragInfo);
      }
    }

    List<WeightedFragInfo> result = new ArrayList<>();
    for (List<WeightedFragInfo> weightedFragInfos : fieldNameToFragInfos.values()) {
      result.addAll(weightedFragInfos);
    }
    // Added: return the fragments ordered by start offset so highlighting never moves backwards.
    Collections.sort(result, new Comparator<WeightedFragInfo>() {
      @Override
      public int compare(FieldFragList.WeightedFragInfo info1, FieldFragList.WeightedFragInfo info2) {
        return info1.getStartOffset() - info2.getStartOffset();
      }
    });


    return result;
  }
With these changes, all of the Ansj + ElasticSearch + fvh highlighting bugs described above are fixed.