lucene-2.9.0 Indexing Process (4): The Merge Process


 lucene-2.9.0

This version uses a logarithmic merge policy.


Earlier Lucene releases drove index merging by document count, using an immediate-merge policy.
For example, with a merge factor mergeFactor (a small simulation of this cascade follows the list):
1. When the number of buffered documents reaches mergeFactor, the in-memory index is flushed to disk;
   the new segment holds mergeFactor documents.
2. The initial merge unit is mergeDocs = mergeFactor.
3. When the disk holds mergeFactor segments of mergeDocs documents each,
   they are merged into one new segment of mergeFactor*mergeDocs documents.
4. Update mergeDocs = mergeFactor*mergeDocs; if condition 3 still holds,
   merge recursively until nothing is left to merge.
5. Calling optimize merges the entire index; in that case the mergeFactor condition does not have to be met.
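
The cascade in steps 3 and 4 can be made concrete with a small, self-contained simulation. This is not Lucene code; it only reproduces the arithmetic of the old document-count-driven policy (the segment list and helper methods are made up for illustration):

import java.util.ArrayList;
import java.util.List;

/** Toy simulation of the old document-count-driven merge cascade. */
public class OldMergeCascade {
  public static void main(String[] args) {
    final int mergeFactor = 10;                         // example merge factor
    List<Integer> segments = new ArrayList<Integer>();  // doc count of each on-disk segment

    for (int doc = 1; doc <= 1000; doc++) {
      // step 1: every mergeFactor buffered documents are flushed as a new segment
      if (doc % mergeFactor == 0) {
        segments.add(mergeFactor);
        // steps 3 and 4: while mergeFactor segments of equal size exist, merge them
        int mergeDocs = mergeFactor;
        while (count(segments, mergeDocs) == mergeFactor) {
          remove(segments, mergeDocs, mergeFactor);
          segments.add(mergeFactor * mergeDocs);
          System.out.println("doc " + doc + ": merged " + mergeFactor + " segments of "
              + mergeDocs + " docs into one of " + (mergeFactor * mergeDocs));
          mergeDocs *= mergeFactor;                     // move up one merge level
        }
      }
    }
    System.out.println("final segment sizes: " + segments);
  }

  private static int count(List<Integer> segs, int size) {
    int n = 0;
    for (int s : segs) if (s == size) n++;
    return n;
  }

  private static void remove(List<Integer> segs, int size, int howMany) {
    for (int i = segs.size() - 1; i >= 0 && howMany > 0; i--) {
      if (segs.get(i) == size) { segs.remove(i); howMany--; }
    }
  }
}

With mergeFactor = 10 this prints a merge into a 100-document segment at document 100, 200, ..., and a merge into a 1000-document segment at document 1000.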

 

lucene-2.9.0 instead uses memory consumption to trigger this process.
That is, a RAM buffer size is configured (IndexWriter.DEFAULT_RAM_BUFFER_SIZE_MB by default);
when that budget is exhausted (a configuration sketch follows the list):
1. the in-memory index is flushed to disk;
2. a merge is triggered if one is needed;
3. the merge policy is logarithmic.
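
A minimal configuration sketch against the Lucene 2.9 API (the index path, buffer size, and field contents are placeholders):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class RamDrivenFlushDemo {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File("/tmp/demo-index"));  // placeholder path
    IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_29),
        true, IndexWriter.MaxFieldLength.UNLIMITED);

    // Flushes are driven by buffered RAM, not by a document count:
    writer.setRAMBufferSizeMB(32.0);  // flush roughly every 32 MB of buffered documents
    writer.setMergeFactor(10);        // log-merge factor used by the LogMergePolicy

    Document doc = new Document();
    doc.add(new Field("body", "hello merge", Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);          // may internally trigger flush() and maybeMerge()

    writer.close();
  }
}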

 

////////////////////////////////////////////
The index merge process

The merge is carried out by a separate thread. The indexing thread reaches the merge policy through the following call path:


IndexWriter.addDocument(Document) line: 2428
IndexWriter.addDocument(Document, Analyzer) line: 2475 
IndexWriter.flush(boolean, boolean, boolean) line: 4167 
IndexWriter.maybeMerge() line: 2990
IndexWriter.maybeMerge(boolean) line: 2994
IndexWriter.maybeMerge(int, boolean) line: 2998 
IndexWriter.updatePendingMerges(int, boolean) line: 3028
LogByteSizeMergePolicy(LogMergePolicy).findMerges(SegmentInfos) line: 444 
 

    // if the policy returned merges to perform, register each one
    if (spec != null) {
      final int numMerges = spec.merges.size();
      for(int i=0;i<numMerges;i++)
        registerMerge((MergePolicy.OneMerge) spec.merges.get(i));
    }


While one thread keeps indexing documents, another thread performs the merge:
SegmentMerger.merge(boolean) line: 153 
IndexWriter.mergeMiddle(MergePolicy$OneMerge) line: 5012 
IndexWriter.merge(MergePolicy$OneMerge) line: 4597 
ConcurrentMergeScheduler.doMerge(MergePolicy$OneMerge) line: 235 
ConcurrentMergeScheduler$MergeThread.run() line: 291

 


Thread [main] (Running)  // the main thread continues indexing

// if a merge is pending, a new thread is spawned; its call stack is as follows:
ConcurrentMergeScheduler$MergeThread.run() line: 291
ConcurrentMergeScheduler.doMerge(MergePolicy$OneMerge) line: 235
IndexWriter.merge(MergePolicy$OneMerge) line: 4597
IndexWriter.mergeMiddle(MergePolicy$OneMerge) line: 5012
SegmentMerger.merge(boolean) line: 153  
Thread [Lucene Merge Thread #0] (Suspended (breakpoint at line 153 in SegmentMerger))
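
The concurrent behaviour above comes from ConcurrentMergeScheduler, the default scheduler in 2.9. Here is a sketch of configuring it explicitly, assuming an open IndexWriter as in the earlier example (the thread count is an arbitrary choice):

import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;

public class MergeSchedulerConfig {
  /** Assumes 'writer' is an open IndexWriter. */
  public static void useConcurrentMerges(IndexWriter writer) {
    ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
    cms.setMaxThreadCount(2);        // allow up to two background merge threads
    writer.setMergeScheduler(cms);   // merges then run on "Lucene Merge Thread #N"
    // A SerialMergeScheduler would instead run merges on the indexing thread itself.
  }
}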

 

/////////////////////////////////////////////////////////////////

This method decides which segments get merged.
It computes a level for each segment based on that segment's size; segments whose levels fall within the same range (the same level interval) are merged once at least mergeFactor of them are available.


   public MergeSpecification findMerges(SegmentInfos infos) throws IOException {

    final int numSegments = infos.size();

    // Compute levels, which is just log (base mergeFactor)
    // of the size of each segment
    float[] levels = new float[numSegments];
    final float norm = (float) Math.log(mergeFactor);

    for(int i=0;i<numSegments;i++) {
      final SegmentInfo info = infos.info(i);
      long size = size(info);

      // compute the level from the size of the segment's data files (.cfs / .fdt / .fdx)

      // Floor tiny segments
      if (size < 1)
        size = 1;

      levels[i] = (float) Math.log(size)/norm;
    }

    final float levelFloor;
   
    if (minMergeSize <= 0) // minMergeSize is a preconfigured floor value
      levelFloor = (float) 0.0;
    else
      levelFloor = (float) (Math.log(minMergeSize)/norm);

    // Now, we quantize the log values into levels.  The
    // first level is any segment whose log size is within
    // LEVEL_LOG_SPAN of the max size, or, who has such as
    // segment "to the right".  Then, we find the max of all
    // other segments and use that to define the next level
    // segment, etc.

    MergeSpecification spec = null;

    int start = 0;

    // walk over all segments, merging those that fall into the same level

    while(start < numSegments) {

      // Find max level of all segments not already
      // quantized.

      float maxLevel = levels[start];

      // Note: after a merge the original sub-segments are removed, so the newly
      // merged (larger) segment ends up at a smaller index in the segment list
     
      for(int i=1+start;i<numSegments;i++) {
        final float level = levels[i];
        if (level > maxLevel)
          maxLevel = level;
      }

      // Now search backwards for the rightmost segment that
      // falls into this level:
      float levelBottom;
      if (maxLevel < levelFloor) // the max level is below the configured floor
        // All remaining segments fall into the min level
        levelBottom = -1.0F;
      else {
        levelBottom = (float) (maxLevel - LEVEL_LOG_SPAN); // LEVEL_LOG_SPAN = 0.75: step down 0.75 from the max level to form the merge level range

        // Force a boundary at the level floor
        if (levelBottom < levelFloor && maxLevel >= levelFloor)
          levelBottom = levelFloor;
      }

      int upto = numSegments-1;

      // find the rightmost segment (the upper bound) that still falls inside this level range

      while(upto >= start) {
        if (levels[upto] >= levelBottom) {
          break;
        }
        upto--;
      }
     
      if (verbose())
        message("  level " + levelBottom + " to " + maxLevel + ": " + (1+upto-start) + " segments");
     
      int loop = 0; // debug counter (author's instrumentation) for merge windows recorded at this level

      // Finally, record all merges that are viable at this level:
      int end = start + mergeFactor;
     
      // mergeFactor is the merge factor: record a merge for every full window of
      // mergeFactor segments that lies inside this level range
      while(end <= 1+upto) {
        boolean anyTooLarge = false;
       
        for(int i=start;i<end;i++)
        {
          final SegmentInfo info = infos.info(i);
          anyTooLarge |= (size(info) >= maxMergeSize || sizeDocs(info) >= maxMergeDocs);
        }

        if (!anyTooLarge) {
          if (spec == null)
            spec = new MergeSpecification();
          if (verbose())
            message("    " + start + " to " + end + ": add this merge");

         
          //for(int i=start;i<end;i++)
          //{
          //  final SegmentInfo info = infos.info(i);
          //  System.out.println("the " + i + ":" + "segment name :" + info.name + " docCount = " + info.docCount);
          //}
          System.out.println("loop = " + loop + " adding: start = " + start  + " end = " + end);
         
          spec.add(new OneMerge(infos.range(start, end), useCompoundFile));
        } else if (verbose())
          message("    " + start + " to " + end + ": contains segment over maxMergeSize or maxMergeDocs; skipping");

        start = end;
        end = start + mergeFactor;
      }

      start = 1+upto;
    }

    return spec;
  }
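
To see the level quantization in isolation, here is a small standalone sketch that reproduces only the level formula from findMerges on some made-up segment sizes (the sizes and mergeFactor are arbitrary; this is not Lucene code):

public class MergeLevelDemo {
  public static void main(String[] args) {
    final int mergeFactor = 10;
    final double LEVEL_LOG_SPAN = 0.75;                  // same constant as LogMergePolicy
    long[] sizes = { 120L << 20, 90L << 20, 3L << 20,    // made-up segment sizes in bytes
                     2L << 20, 1L << 20, 800L << 10 };

    double norm = Math.log(mergeFactor);
    double[] levels = new double[sizes.length];
    double maxLevel = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < sizes.length; i++) {
      levels[i] = Math.log(Math.max(1, sizes[i])) / norm;  // level = log_mergeFactor(size)
      maxLevel = Math.max(maxLevel, levels[i]);
    }

    // Segments whose level lies within LEVEL_LOG_SPAN of the max form the top window;
    // findMerges would record a merge only once mergeFactor of them are available.
    double levelBottom = maxLevel - LEVEL_LOG_SPAN;
    for (int i = 0; i < sizes.length; i++) {
      System.out.printf("segment %d: size=%d bytes, level=%.2f, inTopWindow=%b%n",
          i, sizes[i], levels[i], levels[i] >= levelBottom);
    }
  }
}

With these numbers only the two largest segments land in the top window, so no merge would be recorded yet; the smaller segments are considered in the next iteration of the outer while loop.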

 

There are other index merge algorithms as well, such as geometric merging and dynamic Huffman merging (implemented in firtex).

A more detailed summary, together with experimental results, will follow later.
