Lucene Source Code Analysis (9)

Source: Internet. Editor: 程序博客网. Time: 2024/04/19 07:24

Lucene Source Code Analysis: Writing the Inverted Index

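This chapter covers how the inverted index is written; the next chapter covers how it is read. As in the previous chapters, all code shown here is lightly rewritten from the original for readability, and some unimportant parts are omitted.
Lucene writes the inverted index into the .tim and .tip files, and this code is among the most central parts of Lucene. Writing the inverted index starts in the write function of BlockTreeTermsWriter.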

BlockTreeTermsWriter::write

  public void write(Fields fields) throws IOException {
    String lastField = null;
    for (String field : fields) {
      lastField = field;
      Terms terms = fields.terms(field);
      if (terms == null) {
        continue;
      }
      List<PrefixTerm> prefixTerms = null;
      TermsEnum termsEnum = terms.iterator();
      TermsWriter termsWriter = new TermsWriter(fieldInfos.fieldInfo(field));
      int prefixTermUpto = 0;
      while (true) {
        BytesRef term = termsEnum.next();
        if (term == null) {
          // No more terms in this field; without this check the loop never ends.
          break;
        }
        termsWriter.write(term, termsEnum, null);
      }
      termsWriter.finish();
    }
  }

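write iterates over every field. It first calls terms with the field name to get a FreqProxTerms, which holds all Terms of that field. Next, fieldInfo returns the field's metadata by name, and that metadata is used to create a TermsWriter, the main class for writing the inverted index. Each term in the FreqProxTerms is then taken in turn and passed to TermsWriter's write function, which writes it to the .tim file and builds the corresponding index information. Finally, TermsWriter's finish function writes the index information to the .tip file. Each step is examined below.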

BlockTreeTermsWriter::write->TermsWriter::write

    public void write(BytesRef text, TermsEnum termsEnum, PrefixTerm prefixTerm) throws IOException {
      BlockTermState state = postingsWriter.writeTerm(text, termsEnum, docsSeen);
      if (state != null) {
        pushTerm(text);
        PendingTerm term = new PendingTerm(text, state, prefixTerm);
        pending.add(term);
        if (prefixTerm == null) {
          sumDocFreq += state.docFreq;
          sumTotalTermFreq += state.totalTermFreq;
          numTerms++;
          if (firstPendingTerm == null) {
            firstPendingTerm = term;
          }
          lastPendingTerm = term;
        }
      }
    }

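TermsWriter's write function handles one Term per call. postingsWriter is a Lucene50PostingsWriter. write first calls Lucene50PostingsWriter's writeTerm to record the Term together with the information of the documents that contain it.
The member variable pending is a list of PendingEntry objects. A PendingEntry holds either a Term or a Block, so pending holds the entries that are still waiting to be written.
pushTerm is the core function inside write and does the actual per-Term processing; it is examined in detail later. At the end, write adds the term's document frequency and term frequency to the member variables sumDocFreq and sumTotalTermFreq.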

BlockTreeTermsWriter::write->TermsWriter::write->Lucene50PostingsWriter::writeTerm

  public final BlockTermState writeTerm(BytesRef term, TermsEnum termsEnum, FixedBitSet docsSeen) throws IOException {
    startTerm();
    postingsEnum = termsEnum.postings(postingsEnum, enumFlags);
    int docFreq = 0;
    long totalTermFreq = 0;
    while (true) {
      int docID = postingsEnum.nextDoc();
      if (docID == PostingsEnum.NO_MORE_DOCS) {
        break;
      }
      docFreq++;
      docsSeen.set(docID);
      int freq;
      if (writeFreqs) {
        freq = postingsEnum.freq();
        totalTermFreq += freq;
      } else {
        freq = -1;
      }
      startDoc(docID, freq);
      if (writePositions) {
        for (int i=0;i<freq;i++) {
          int pos = postingsEnum.nextPosition();
          BytesRef payload = writePayloads ? postingsEnum.getPayload() : null;
          int startOffset;
          int endOffset;
          if (writeOffsets) {
            startOffset = postingsEnum.startOffset();
            endOffset = postingsEnum.endOffset();
          } else {
            startOffset = -1;
            endOffset = -1;
          }
          addPosition(pos, payload, startOffset, endOffset);
        }
      }
      finishDoc();
    }
    if (docFreq == 0) {
      return null;
    } else {
      BlockTermState state = newTermState();
      state.docFreq = docFreq;
      state.totalTermFreq = writeFreqs ? totalTermFreq : -1;
      finishTerm(state);
      return state;
    }
  }

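startTerm sets up the file pointers for the .doc, .pos and .pay files. postings creates a FreqProxPostingsEnum or a FreqProxDocsEnum, which wraps a FreqProxTermsWriterPerField, that is, the termsHashPerField member of each PerField from Chapter 5. termsHashPerField holds all the Term information of its Field.
writeTerm then calls nextDoc to get the next document ID, reads the term frequency freq, and adds it to totalTermFreq (the total term frequency). It calls startDoc to record the document's information. addPosition records the position, offsets and payload of each occurrence and writes them to file when necessary. finishDoc records file pointers and similar bookkeeping. After the loop, a BlockTermState is created, filled with the term frequency and document frequency, and eventually returned.
Just before returning, writeTerm calls finishTerm, which writes the document information to the .doc file and the position information to the .pos file.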
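The statistics that writeTerm accumulates can be illustrated in isolation. The following is only a sketch: the class name TermStatsSketch and the int[][] postings layout are hypothetical stand-ins for PostingsEnum, not Lucene types.

```java
// Toy model of the docFreq / totalTermFreq accumulation in writeTerm.
// Each posting is a hypothetical {docID, freq} pair, not a Lucene structure.
final class TermStatsSketch {

    // Returns {docFreq, totalTermFreq}; totalTermFreq is -1 when
    // frequencies are not indexed, mirroring the writeFreqs == false branch.
    static long[] accumulate(int[][] postings, boolean writeFreqs) {
        int docFreq = 0;
        long totalTermFreq = 0;
        for (int[] posting : postings) {
            docFreq++;                      // one more document contains the term
            if (writeFreqs) {
                totalTermFreq += posting[1]; // occurrences inside that document
            }
        }
        return new long[] {docFreq, writeFreqs ? totalTermFreq : -1};
    }

    public static void main(String[] args) {
        int[][] postings = {{1, 2}, {4, 1}, {9, 3}};
        long[] stats = accumulate(postings, true);
        System.out.println("docFreq=" + stats[0] + " totalTermFreq=" + stats[1]);
        // prints: docFreq=3 totalTermFreq=6
    }
}
```

A term that occurs in no document yields docFreq 0, which is the case where writeTerm returns null and the term is skipped.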

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm

    private void pushTerm(BytesRef text) throws IOException {
      int limit = Math.min(lastTerm.length(), text.length);
      int pos = 0;
      while (pos < limit && lastTerm.byteAt(pos) == text.bytes[text.offset+pos]) {
        pos++;
      }
      for (int i=lastTerm.length()-1;i>=pos;i--) {
        int prefixTopSize = pending.size() - prefixStarts[i];
        if (prefixTopSize >= minItemsInBlock) {
          writeBlocks(i+1, prefixTopSize);
          prefixStarts[i] -= prefixTopSize-1;
        }
      }
      if (prefixStarts.length < text.length) {
        prefixStarts = ArrayUtil.grow(prefixStarts, text.length);
      }
      for (int i=pos;i<text.length;i++) {
        prefixStarts[i] = pending.size();
      }
      lastTerm.copyBytes(text);
    }

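lastTerm holds the Term processed by the previous call. pushTerm first computes pos, the length of the prefix shared by lastTerm and the new Term. Then, for every prefix of lastTerm that the new Term no longer shares (index i from the end of lastTerm down to pos), it computes how many pending entries start with that prefix, namely pending.size() - prefixStarts[i]. When that count reaches minItemsInBlock, those entries must be saved as a block, and writeBlocks is called to do so. Finally prefixStarts is updated for the new Term and lastTerm is replaced by it.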
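The bookkeeping in pushTerm is easier to follow with a small simulation. This is a sketch under stated assumptions: PushTermSketch is a hypothetical class, pending is reduced to a counter, writeBlocks is replaced by recording its two arguments, and minItemsInBlock is set to 2 only to keep the example small.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy simulation of pushTerm: only sizes are tracked, nothing is written.
final class PushTermSketch {
    private final int minItemsInBlock;
    private byte[] lastTerm = new byte[0];
    private int[] prefixStarts = new int[8];
    private int pendingSize = 0;
    // Each element is {prefixLength, count}, the arguments writeBlocks would get.
    final List<int[]> blocks = new ArrayList<>();

    PushTermSketch(int minItemsInBlock) {
        this.minItemsInBlock = minItemsInBlock;
    }

    void add(String term) {
        byte[] text = term.getBytes(StandardCharsets.UTF_8);
        // Length of the prefix shared with the previous term.
        int limit = Math.min(lastTerm.length, text.length);
        int pos = 0;
        while (pos < limit && lastTerm[pos] == text[pos]) {
            pos++;
        }
        // Close every prefix of lastTerm that the new term does not share.
        for (int i = lastTerm.length - 1; i >= pos; i--) {
            int prefixTopSize = pendingSize - prefixStarts[i];
            if (prefixTopSize >= minItemsInBlock) {
                blocks.add(new int[] {i + 1, prefixTopSize});
                pendingSize -= prefixTopSize - 1; // entries collapse into one block
                prefixStarts[i] -= prefixTopSize - 1;
            }
        }
        if (prefixStarts.length < text.length) {
            prefixStarts = Arrays.copyOf(prefixStarts, text.length);
        }
        for (int i = pos; i < text.length; i++) {
            prefixStarts[i] = pendingSize;
        }
        lastTerm = text;
        pendingSize++; // write() appends the term to pending after pushTerm returns
    }

    public static void main(String[] args) {
        PushTermSketch sketch = new PushTermSketch(2);
        for (String term : new String[] {"aa", "ab", "b"}) {
            sketch.add(term);
        }
        for (int[] block : sketch.blocks) {
            System.out.println("prefixLength=" + block[0] + " count=" + block[1]);
        }
        // prints: prefixLength=1 count=2
    }
}
```

When "b" arrives, the prefix "a" is no longer shared, and the two pending entries "aa" and "ab" are enough to form a block of prefix length 1.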

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks

    void writeBlocks(int prefixLength, int count) throws IOException {
      int lastSuffixLeadLabel = -1;
      boolean hasTerms = false;
      boolean hasPrefixTerms = false;
      boolean hasSubBlocks = false;
      int start = pending.size()-count;
      int end = pending.size();
      int nextBlockStart = start;
      int nextFloorLeadLabel = -1;
      for (int i=start; i<end; i++) {
        PendingEntry ent = pending.get(i);
        int suffixLeadLabel;
        if (ent.isTerm) {
          PendingTerm term = (PendingTerm) ent;
          if (term.termBytes.length == prefixLength) {
            suffixLeadLabel = -1;
          } else {
            suffixLeadLabel = term.termBytes[prefixLength] & 0xff;
          }
        } else {
          PendingBlock block = (PendingBlock) ent;
          suffixLeadLabel = block.prefix.bytes[block.prefix.offset + prefixLength] & 0xff;
        }
        if (suffixLeadLabel != lastSuffixLeadLabel) {
          int itemsInBlock = i - nextBlockStart;
          if (itemsInBlock >= minItemsInBlock && end-nextBlockStart > maxItemsInBlock) {
            boolean isFloor = itemsInBlock < count;
            newBlocks.add(writeBlock(prefixLength, isFloor, nextFloorLeadLabel, nextBlockStart, i, hasTerms, hasPrefixTerms, hasSubBlocks));
            hasTerms = false;
            hasSubBlocks = false;
            hasPrefixTerms = false;
            nextFloorLeadLabel = suffixLeadLabel;
            nextBlockStart = i;
          }
          lastSuffixLeadLabel = suffixLeadLabel;
        }
        if (ent.isTerm) {
          hasTerms = true;
          hasPrefixTerms |= ((PendingTerm) ent).prefixTerm != null;
        } else {
          hasSubBlocks = true;
        }
      }
      if (nextBlockStart < end) {
        int itemsInBlock = end - nextBlockStart;
        boolean isFloor = itemsInBlock < count;
        newBlocks.add(writeBlock(prefixLength, isFloor, nextFloorLeadLabel, nextBlockStart, end, hasTerms, hasPrefixTerms, hasSubBlocks));
      }
      PendingBlock firstBlock = newBlocks.get(0);
      firstBlock.compileIndex(newBlocks, scratchBytes, scratchIntsRef);
      pending.subList(pending.size()-count, pending.size()).clear();
      pending.add(firstBlock);
      newBlocks.clear();
    }

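hasTerms indicates whether the entries about to be merged contain any Term (in special cases they contain only sub-blocks).
hasPrefixTerms indicates whether any entry is a prefix term; it is assumed to be false throughout this chapter.
hasSubBlocks is the counterpart of hasTerms and indicates whether the entries about to be merged contain any sub-block.
start and end delimit the range, within the pending list, of the Terms or Blocks to be merged.
writeBlocks then iterates over each pending Term or Block in that range. suffixLeadLabel is the first byte that follows the shared prefix in the current entry (or -1 when the Term is exactly the prefix), and lastSuffixLeadLabel is the value seen for the previous entry, so a difference between the two marks the start of a new group. Along the way the loop records in hasTerms and hasSubBlocks whether Terms and sub-blocks were seen. When a group boundary is reached, the entries collected so far number at least minItemsInBlock, and more than maxItemsInBlock entries remain from nextBlockStart to end, writeBlock is called to write them as one block. Whatever is left after the loop is written by one final call to writeBlock.
writeBlocks then calls compileIndex to store the block information in an FST (kept in the block's member variable index). FST stands for finite state transducer; in effect it stores a tree inside its own structure, and that tree is formed from the individual bytes of all the Terms, as shown later.
Finally, writeBlocks removes the entries it has just saved from the pending list and adds the newly created block to it.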

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->writeBlock
Case 1

    private PendingBlock writeBlock(int prefixLength, boolean isFloor, int floorLeadLabel, int start, int end, boolean hasTerms, boolean hasPrefixTerms, boolean hasSubBlocks) throws IOException {
      long startFP = termsOut.getFilePointer();
      boolean hasFloorLeadLabel = isFloor && floorLeadLabel != -1;
      final BytesRef prefix = new BytesRef(prefixLength + (hasFloorLeadLabel ? 1 : 0));
      System.arraycopy(lastTerm.get().bytes, 0, prefix.bytes, 0, prefixLength);
      prefix.length = prefixLength;
      int numEntries = end - start;
      int code = numEntries << 1;
      if (end == pending.size()) {
        code |= 1;
      }
      termsOut.writeVInt(code);
      boolean isLeafBlock = hasSubBlocks == false && hasPrefixTerms == false;
      final List<FST<BytesRef>> subIndices;
      boolean absolute = true;
      if (isLeafBlock) {
        subIndices = null;
        for (int i=start;i<end;i++) {
          PendingEntry ent = pending.get(i);
          PendingTerm term = (PendingTerm) ent;
          BlockTermState state = term.state;
          final int suffix = term.termBytes.length - prefixLength;
          suffixWriter.writeVInt(suffix);
          suffixWriter.writeBytes(term.termBytes, prefixLength, suffix);
          statsWriter.writeVInt(state.docFreq);
          if (fieldInfo.getIndexOptions() != IndexOptions.DOCS) {
            statsWriter.writeVLong(state.totalTermFreq - state.docFreq);
          }
          postingsWriter.encodeTerm(longs, bytesWriter, fieldInfo, state, absolute);
          for (int pos = 0; pos < longsSize; pos++) {
            metaWriter.writeVLong(longs[pos]);
          }
          bytesWriter.writeTo(metaWriter);
          bytesWriter.reset();
          absolute = false;
        }
      } else {
        ...
      }
      termsOut.writeVInt((int) (suffixWriter.getFilePointer() << 1) | (isLeafBlock ? 1:0));
      suffixWriter.writeTo(termsOut);
      suffixWriter.reset();
      termsOut.writeVInt((int) statsWriter.getFilePointer());
      statsWriter.writeTo(termsOut);
      statsWriter.reset();
      termsOut.writeVInt((int) metaWriter.getFilePointer());
      metaWriter.writeTo(termsOut);
      metaWriter.reset();
      if (hasFloorLeadLabel) {
        prefix.bytes[prefix.length++] = (byte) floorLeadLabel;
      }
      return new PendingBlock(prefix, startFP, hasTerms, isFloor, floorLeadLabel, subIndices);
    }

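termsOut wraps the output stream of the .tim file and is in fact an FSIndexOutput. Its getFilePointer function returns the position at which the next write to the file will land, and that position is saved in startFP.
writeBlock first extracts the shared prefix. For example, if the Terms to be written as one block are aaa, aab and aac, the shared prefix is aa. It is stored in prefix, a BytesRef, which is a wrapper around a byte array.
numEntries is the number of Terms or Blocks written by this call. code holds numEntries shifted left by one, and its lowest bit is set when this block extends to the end of the pending list, that is, when it is the last block written for this prefix. code is then written to the .tim file.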
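The header value and its variable-length encoding can be sketched on their own. BlockHeaderSketch and its method names are hypothetical; writeVInt below is a stand-alone reimplementation of the usual 7-bits-per-byte scheme (low-order group first, high bit set on every byte except the last), written here as an illustration rather than copied from Lucene.

```java
import java.io.ByteArrayOutputStream;

// Sketch of the block header: entry count in the upper bits, one flag bit below.
final class BlockHeaderSketch {

    static int encodeHeader(int numEntries, boolean isLastBlock) {
        return (numEntries << 1) | (isLastBlock ? 1 : 0);
    }

    static int numEntries(int code) {
        return code >>> 1;
    }

    static boolean isLastBlock(int code) {
        return (code & 1) != 0;
    }

    // Variable-length int: 7 payload bits per byte, 0x80 marks "more bytes follow".
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int code = encodeHeader(100, true);
        byte[] bytes = writeVInt(code);
        System.out.println("code=" + code + " bytes=" + bytes.length);
        // prints: code=201 bytes=2
    }
}
```

Small blocks therefore cost a single header byte: any code below 128, i.e. fewer than 64 entries, fits in one byte.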
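isLeafBlock indicates whether the block is a leaf, meaning it contains no sub-blocks and no prefix terms. bytesWriter, suffixWriter, statsWriter and metaWriter are in-memory buffers that behave like files.
writeBlock then iterates over the Terms or Blocks to be written. suffix is the length of what remains of each Term after the shared prefix is removed; for aaa, aab and aac it is 1. The length suffix is written first, followed by the remaining bytes, so suffixWriter ends up holding a, b and c. After that, statsWriter receives the document frequency and, unless the field indexes DOCS only, the difference between total term frequency and document frequency.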
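The aaa, aab, aac example can be reproduced with a few lines. This is a minimal sketch: SuffixSketch is a hypothetical helper, and strings stand in for the term bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Strips a shared prefix of known length from each term, as writeBlock does
// before handing the remainder to suffixWriter.
final class SuffixSketch {

    static List<String> suffixes(int prefixLength, String... terms) {
        List<String> result = new ArrayList<>();
        for (String term : terms) {
            byte[] termBytes = term.getBytes(StandardCharsets.UTF_8);
            int suffix = termBytes.length - prefixLength; // length written first
            result.add(new String(termBytes, prefixLength, suffix, StandardCharsets.UTF_8));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(suffixes(2, "aaa", "aab", "aac"));
        // prints: [a, b, c]
    }
}
```

Storing the prefix once per block and only the suffixes per term is what keeps blocks of similar terms compact.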
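Next, postingsWriter (the Lucene50PostingsWriter) runs encodeTerm, which stores the file pointer offsets into .doc, .pos and .pay in longs and puts the remaining information, such as singletonDocID, lastPosBlockOffset and skipOffset, into bytesWriter. The values in longs are then written to metaWriter, and the content of bytesWriter is appended to metaWriter after them.
After the loop, the in-memory content of suffixWriter, statsWriter and metaWriter is copied into the .tim file with writeTo, each part preceded by its length.
Finally, writeBlock creates and returns a PendingBlock, which describes the Terms or sub-blocks written by this call.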

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->writeBlock
Case 2

    private PendingBlock writeBlock(int prefixLength, boolean isFloor, int floorLeadLabel, int start, int end, boolean hasTerms, boolean hasPrefixTerms, boolean hasSubBlocks) throws IOException {
      long startFP = termsOut.getFilePointer();
      boolean hasFloorLeadLabel = isFloor && floorLeadLabel != -1;
      final BytesRef prefix = new BytesRef(prefixLength + (hasFloorLeadLabel ? 1 : 0));
      System.arraycopy(lastTerm.get().bytes, 0, prefix.bytes, 0, prefixLength);
      prefix.length = prefixLength;
      int numEntries = end - start;
      int code = numEntries << 1;
      if (end == pending.size()) {
        code |= 1;
      }
      termsOut.writeVInt(code);
      boolean isLeafBlock = hasSubBlocks == false && hasPrefixTerms == false;
      final List<FST<BytesRef>> subIndices;
      boolean absolute = true;
      if (isLeafBlock) {
        ...
      } else {
        subIndices = new ArrayList<>();
        boolean sawAutoPrefixTerm = false;
        for (int i=start;i<end;i++) {
          PendingEntry ent = pending.get(i);
          if (ent.isTerm) {
            PendingTerm term = (PendingTerm) ent;
            BlockTermState state = term.state;
            final int suffix = term.termBytes.length - prefixLength;
            if (minItemsInAutoPrefix == 0) {
              suffixWriter.writeVInt(suffix << 1);
              suffixWriter.writeBytes(term.termBytes, prefixLength, suffix);
            } else {
              code = suffix<<2;
              int floorLeadEnd = -1;
              if (term.prefixTerm != null) {
                sawAutoPrefixTerm = true;
                PrefixTerm prefixTerm = term.prefixTerm;
                floorLeadEnd = prefixTerm.floorLeadEnd;
                if (prefixTerm.floorLeadStart == -2) {
                  code |= 2;
                } else {
                  code |= 3;
                }
              }
              suffixWriter.writeVInt(code);
              suffixWriter.writeBytes(term.termBytes, prefixLength, suffix);
              if (floorLeadEnd != -1) {
                suffixWriter.writeByte((byte) floorLeadEnd);
              }
            }
            statsWriter.writeVInt(state.docFreq);
            if (fieldInfo.getIndexOptions() != IndexOptions.DOCS) {
              statsWriter.writeVLong(state.totalTermFreq - state.docFreq);
            }
            postingsWriter.encodeTerm(longs, bytesWriter, fieldInfo, state, absolute);
            for (int pos = 0; pos < longsSize; pos++) {
              metaWriter.writeVLong(longs[pos]);
            }
            bytesWriter.writeTo(metaWriter);
            bytesWriter.reset();
            absolute = false;
          } else {
            PendingBlock block = (PendingBlock) ent;
            final int suffix = block.prefix.length - prefixLength;
            if (minItemsInAutoPrefix == 0) {
              suffixWriter.writeVInt((suffix<<1)|1);
            } else {
              suffixWriter.writeVInt((suffix<<2)|1);
            }
            suffixWriter.writeBytes(block.prefix.bytes, prefixLength, suffix);
            suffixWriter.writeVLong(startFP - block.fp);
            subIndices.add(block.index);
          }
        }
      }
      termsOut.writeVInt((int) (suffixWriter.getFilePointer() << 1) | (isLeafBlock ? 1:0));
      suffixWriter.writeTo(termsOut);
      suffixWriter.reset();
      termsOut.writeVInt((int) statsWriter.getFilePointer());
      statsWriter.writeTo(termsOut);
      statsWriter.reset();
      termsOut.writeVInt((int) metaWriter.getFilePointer());
      metaWriter.writeTo(termsOut);
      metaWriter.reset();
      if (hasFloorLeadLabel) {
        prefix.bytes[prefix.length++] = (byte) floorLeadLabel;
      }
      return new PendingBlock(prefix, startFP, hasTerms, isFloor, floorLeadLabel, subIndices);
    }

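Case 2 is the one where the block being written is not a leaf. An entry that is a Term is handled much as in Case 1, except that the suffix length is shifted left to make room for flag bits that distinguish Terms from sub-blocks. An entry that is a sub-block gets its own information written: the suffix of its prefix and the distance from this block's start pointer back to the sub-block's file pointer. The PendingBlock created at the end must carry the FST of every sub-block, which is what subIndices holds.

Once writeBlocks has called writeBlock, the Terms or Blocks from the pending list are in the .tim file. The next step is PendingBlock's compileIndex function, which builds index information for the Terms just written to .tim. That information is eventually written to the .tip file and used for lookups.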

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex

    public void compileIndex(List<PendingBlock> blocks, RAMOutputStream scratchBytes, IntsRefBuilder scratchIntsRef) throws IOException {
      scratchBytes.writeVLong(encodeOutput(fp, hasTerms, isFloor));
      if (isFloor) {
        scratchBytes.writeVInt(blocks.size()-1);
        for (int i=1;i<blocks.size();i++) {
          PendingBlock sub = blocks.get(i);
          scratchBytes.writeByte((byte) sub.floorLeadByte);
          scratchBytes.writeVLong((sub.fp - fp) << 1 | (sub.hasTerms ? 1 : 0));
        }
      }
      final ByteSequenceOutputs outputs = ByteSequenceOutputs.getSingleton();
      final Builder<BytesRef> indexBuilder = new Builder<>(FST.INPUT_TYPE.BYTE1,
                                                           0, 0, true, false, Integer.MAX_VALUE,
                                                           outputs, false,
                                                           PackedInts.COMPACT, true, 15);
      final byte[] bytes = new byte[(int) scratchBytes.getFilePointer()];
      scratchBytes.writeTo(bytes, 0);
      indexBuilder.add(Util.toIntsRef(prefix, scratchIntsRef), new BytesRef(bytes, 0, bytes.length));
      scratchBytes.reset();
      for (PendingBlock block : blocks) {
        if (block.subIndices != null) {
          for (FST<BytesRef> subIndex : block.subIndices) {
            append(indexBuilder, subIndex, scratchIntsRef);
          }
          block.subIndices = null;
        }
      }
      index = indexBuilder.finish();
    }

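fp is the block's pointer into the .tim file. encodeOutput packs fp, hasTerms and isFloor into a single long, which is then written to scratchBytes. compileIndex next creates a Builder, which is used to construct the index tree, and then copies the data in scratchBytes into the byte array bytes.
The heart of compileIndex is Builder's add function, which adds the block's prefix (and, through append, the entries of the sub-block indexes) to a tree maintained by the frontier array and from there to the FST. Finally, Builder's finish function writes the tree built by add into the FST's in-memory buffer, from which it is later written to the .tip file.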
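How three values fit into one long can be sketched as follows. The exact bit layout is an assumption made for illustration (flags in the two lowest bits, fp above them); the text above only states that encodeOutput packs fp, hasTerms and isFloor together. EncodeOutputSketch and its constants are hypothetical names.

```java
// Sketch of packing a file pointer and two flags into one long.
// Assumed layout: bit 0 = isFloor, bit 1 = hasTerms, remaining bits = fp.
final class EncodeOutputSketch {
    static final int FLAG_IS_FLOOR = 0x1;   // assumption, see lead-in
    static final int FLAG_HAS_TERMS = 0x2;  // assumption, see lead-in
    static final int NUM_FLAG_BITS = 2;

    static long encode(long fp, boolean hasTerms, boolean isFloor) {
        return (fp << NUM_FLAG_BITS)
            | (hasTerms ? FLAG_HAS_TERMS : 0)
            | (isFloor ? FLAG_IS_FLOOR : 0);
    }

    static long fp(long code) {
        return code >>> NUM_FLAG_BITS;
    }

    static boolean hasTerms(long code) {
        return (code & FLAG_HAS_TERMS) != 0;
    }

    static boolean isFloor(long code) {
        return (code & FLAG_IS_FLOOR) != 0;
    }

    public static void main(String[] args) {
        long code = encode(1000L, true, false);
        System.out.println(code + " -> fp=" + fp(code)
            + " hasTerms=" + hasTerms(code) + " isFloor=" + isFloor(code));
        // prints: 4002 -> fp=1000 hasTerms=true isFloor=false
    }
}
```

Packing the flags into the low bits lets the whole value be written with a single writeVLong, which stays short as long as fp is small.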

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::Builder

  public Builder(FST.INPUT_TYPE inputType, int minSuffixCount1, int minSuffixCount2, boolean doShareSuffix, boolean doShareNonSingletonNodes, int shareMaxTailLength, Outputs<T> outputs, boolean doPackFST, float acceptableOverheadRatio, boolean allowArrayArcs, int bytesPageBits) {
    this.minSuffixCount1 = minSuffixCount1;
    this.minSuffixCount2 = minSuffixCount2;
    this.doShareNonSingletonNodes = doShareNonSingletonNodes;
    this.shareMaxTailLength = shareMaxTailLength;
    this.doPackFST = doPackFST;
    this.acceptableOverheadRatio = acceptableOverheadRatio;
    this.allowArrayArcs = allowArrayArcs;
    fst = new FST<>(inputType, outputs, doPackFST, acceptableOverheadRatio, bytesPageBits);
    bytes = fst.bytes;
    if (doShareSuffix) {
      dedupHash = new NodeHash<>(fst, bytes.getReverseReader(false));
    } else {
      dedupHash = null;
    }
    NO_OUTPUT = outputs.getNoOutput();
    final UnCompiledNode<T>[] f = (UnCompiledNode<T>[]) new UnCompiledNode[10];
    frontier = f;
    for (int idx=0;idx<frontier.length;idx++) {
      frontier[idx] = new UnCompiledNode<>(this, idx);
    }
  }

The Builder constructor mainly creates an FST and initializes the frontier array. Each element of frontier is an UnCompiledNode and represents one node of the tree.

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::add

  public void add(IntsRef input, T output) throws IOException {
    ...
    int pos1 = 0;
    int pos2 = input.offset;
    final int pos1Stop = Math.min(lastInput.length(), input.length);
    while (true) {
      frontier[pos1].inputCount++;
      if (pos1 >= pos1Stop || lastInput.intAt(pos1) != input.ints[pos2]) {
        break;
      }
      pos1++;
      pos2++;
    }
    final int prefixLenPlus1 = pos1+1;
    if (frontier.length < input.length+1) {
      final UnCompiledNode<T>[] next = ArrayUtil.grow(frontier, input.length+1);
      for (int idx=frontier.length;idx<next.length;idx++) {
        next[idx] = new UnCompiledNode<>(this, idx);
      }
      frontier = next;
    }
    freezeTail(prefixLenPlus1);
    for (int idx=prefixLenPlus1;idx<=input.length;idx++) {
      frontier[idx-1].addArc(input.ints[input.offset + idx - 1],
                             frontier[idx]);
      frontier[idx].inputCount++;
    }
    final UnCompiledNode<T> lastNode = frontier[input.length];
    if (lastInput.length() != input.length || prefixLenPlus1 != input.length + 1) {
      lastNode.isFinal = true;
      lastNode.output = NO_OUTPUT;
    }
    for (int idx=1;idx<prefixLenPlus1;idx++) {
      final UnCompiledNode<T> node = frontier[idx];
      final UnCompiledNode<T> parentNode = frontier[idx-1];
      final T lastOutput = parentNode.getLastOutput(input.ints[input.offset + idx - 1]);
      final T commonOutputPrefix;
      final T wordSuffix;
      if (lastOutput != NO_OUTPUT) {
        commonOutputPrefix = fst.outputs.common(output, lastOutput);
        wordSuffix = fst.outputs.subtract(lastOutput, commonOutputPrefix);
        parentNode.setLastOutput(input.ints[input.offset + idx - 1], commonOutputPrefix);
        node.prependOutput(wordSuffix);
      } else {
        commonOutputPrefix = wordSuffix = NO_OUTPUT;
      }
      output = fst.outputs.subtract(output, commonOutputPrefix);
    }
    if (lastInput.length() == input.length && prefixLenPlus1 == 1+input.length) {
      lastNode.output = fst.outputs.merge(lastNode.output, output);
    } else {
      frontier[prefixLenPlus1-1].setLastOutput(input.ints[input.offset + prefixLenPlus1-1], output);
    }
    lastInput.copyInts(input);
  }

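add first computes the prefix shared with the previous input; prefixLenPlus1 is the length of that shared prefix plus one, and the nodes along the shared prefix are reused rather than created again. After freezeTail has dealt with the tail of the previous input, a for loop calls addArc to add each remaining byte of input (the Term) to frontier, extending the FST tree that the frontier array maintains. The last UnCompiledNode of the new path is then marked by setting isFinal to true. Next, the outputs already stored along the shared prefix are adjusted so that only the part common to the old and new output stays on the shared arcs. Finally, add stores what is left of output (the file pointer and related information) on the first arc that is new for this input, that is, the last arc of frontier[prefixLenPlus1-1], and sets lastInput to the current input.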

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::add->freezeTail

  private void freezeTail(int prefixLenPlus1) throws IOException {
    final int downTo = Math.max(1, prefixLenPlus1);
    for (int idx=lastInput.length(); idx >= downTo; idx--) {
      boolean doPrune = false;
      boolean doCompile = false;
      final UnCompiledNode<T> node = frontier[idx];
      final UnCompiledNode<T> parent = frontier[idx-1];
      if (node.inputCount < minSuffixCount1) {
        doPrune = true;
        doCompile = true;
      } else if (idx > prefixLenPlus1) {
        if (parent.inputCount < minSuffixCount2 || (minSuffixCount2 == 1 && parent.inputCount == 1 && idx > 1)) {
          doPrune = true;
        } else {
          doPrune = false;
        }
        doCompile = true;
      } else {
        doCompile = minSuffixCount2 == 0;
      }
      if (node.inputCount < minSuffixCount2 || (minSuffixCount2 == 1 && node.inputCount == 1 && idx > 1)) {
        for (int arcIdx=0;arcIdx<node.numArcs;arcIdx++) {
          final UnCompiledNode<T> target = (UnCompiledNode<T>) node.arcs[arcIdx].target;
          target.clear();
        }
        node.numArcs = 0;
      }
      if (doPrune) {
        node.clear();
        parent.deleteLast(lastInput.intAt(idx-1), node);
      } else {
        if (minSuffixCount2 != 0) {
          compileAllTargets(node, lastInput.length()-idx);
        }
        final T nextFinalOutput = node.output;
        final boolean isFinal = node.isFinal || node.numArcs == 0;
        if (doCompile) {
          parent.replaceLast(lastInput.intAt(idx-1),
                             compileNode(node, 1+lastInput.length()-idx),
                             nextFinalOutput,
                             isFinal);
        } else {
          parent.replaceLast(lastInput.intAt(idx-1),
                             node,
                             nextFinalOutput,
                             isFinal);
          frontier[idx] = new UnCompiledNode<>(this, idx);
        }
      }
    }
  }

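The core job of freezeTail is to take the nodes that can no longer change and add them to the FST through compileNode.
replaceLast updates the parent node's last arc, for example target (the position of the child node in bytes) and isFinal (whether the arc ends an input).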
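Which frontier nodes the loop in freezeTail visits depends only on the previous input and the shared prefix. The sketch below assumes minSuffixCount1 and minSuffixCount2 are both 0, as in the Builder that compileIndex creates, so every visited node is compiled and none is pruned. FreezeTailSketch is a hypothetical class, and plain strings stand in for IntsRef.

```java
import java.util.ArrayList;
import java.util.List;

// Lists the frontier indices that freezeTail would freeze when `input`
// is added right after `lastInput` (inputs are assumed to arrive sorted).
final class FreezeTailSketch {

    static int sharedPrefix(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int pos = 0;
        while (pos < limit && a.charAt(pos) == b.charAt(pos)) {
            pos++;
        }
        return pos;
    }

    static List<Integer> frozenIndices(String lastInput, String input) {
        int prefixLenPlus1 = sharedPrefix(lastInput, input) + 1;
        int downTo = Math.max(1, prefixLenPlus1);
        List<Integer> frozen = new ArrayList<>();
        // Same bounds as the loop in freezeTail: deepest node first.
        for (int idx = lastInput.length(); idx >= downTo; idx--) {
            frozen.add(idx);
        }
        return frozen;
    }

    public static void main(String[] args) {
        System.out.println(frozenIndices("cat", "cow"));  // prints: [3, 2]
        System.out.println(frozenIndices("cat", "cats")); // prints: []
    }
}
```

After "cat", adding "cow" shares only "c", so the nodes reached by "a" and "t" can never change again and are frozen. Adding "cats" extends the previous path, so nothing is frozen yet.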

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::add->freezeTail->compileNode

  private CompiledNode compileNode(UnCompiledNode<T> nodeIn, int tailLength) throws IOException {
    final long node;
    long bytesPosStart = bytes.getPosition();
    if (dedupHash != null && (doShareNonSingletonNodes || nodeIn.numArcs <= 1) && tailLength <= shareMaxTailLength) {
      if (nodeIn.numArcs == 0) {
        node = fst.addNode(this, nodeIn);
        lastFrozenNode = node;
      } else {
        node = dedupHash.add(this, nodeIn);
      }
    } else {
      node = fst.addNode(this, nodeIn);
    }
    long bytesPosEnd = bytes.getPosition();
    if (bytesPosEnd != bytesPosStart) {
      lastFrozenNode = node;
    }
    nodeIn.clear();
    final CompiledNode fn = new CompiledNode();
    fn.node = node;
    return fn;
  }

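The core of compileNode is the call to FST's addNode function, which adds the node. dedupHash is a hash-based cache used to share identical nodes; it is only touched on briefly below. If bytesPosEnd differs from bytesPosStart, a node has been written into bytes, so lastFrozenNode is set to the current node (which is really a position within the bytes buffer). Finally, compileNode creates a CompiledNode, sets its node field, and returns it.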

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::add->freezeTail->compileNode->FST::addNode

  long addNode(Builder<T> builder, Builder.UnCompiledNode<T> nodeIn) throws IOException {
    T NO_OUTPUT = outputs.getNoOutput();
    if (nodeIn.numArcs == 0) {
      if (nodeIn.isFinal) {
        return FINAL_END_NODE;
      } else {
        return NON_FINAL_END_NODE;
      }
    }
    final long startAddress = builder.bytes.getPosition();
    final boolean doFixedArray = shouldExpand(builder, nodeIn);
    if (doFixedArray) {
      if (builder.reusedBytesPerArc.length < nodeIn.numArcs) {
        builder.reusedBytesPerArc = new int[ArrayUtil.oversize(nodeIn.numArcs, 1)];
      }
    }
    builder.arcCount += nodeIn.numArcs;
    final int lastArc = nodeIn.numArcs-1;
    long lastArcStart = builder.bytes.getPosition();
    int maxBytesPerArc = 0;
    for (int arcIdx=0;arcIdx<nodeIn.numArcs;arcIdx++) {
      final Builder.Arc<T> arc = nodeIn.arcs[arcIdx];
      final Builder.CompiledNode target = (Builder.CompiledNode) arc.target;
      int flags = 0;
      if (arcIdx == lastArc) {
        flags += BIT_LAST_ARC;
      }
      if (builder.lastFrozenNode == target.node && !doFixedArray) {
        flags += BIT_TARGET_NEXT;
      }
      if (arc.isFinal) {
        flags += BIT_FINAL_ARC;
        if (arc.nextFinalOutput != NO_OUTPUT) {
          flags += BIT_ARC_HAS_FINAL_OUTPUT;
        }
      }
      boolean targetHasArcs = target.node > 0;
      if (!targetHasArcs) {
        flags += BIT_STOP_NODE;
      } else if (inCounts != null) {
        inCounts.set((int) target.node, inCounts.get((int) target.node) + 1);
      }
      if (arc.output != NO_OUTPUT) {
        flags += BIT_ARC_HAS_OUTPUT;
      }
      builder.bytes.writeByte((byte) flags);
      writeLabel(builder.bytes, arc.label);
      if (arc.output != NO_OUTPUT) {
        outputs.write(arc.output, builder.bytes);
      }
      if (arc.nextFinalOutput != NO_OUTPUT) {
        outputs.writeFinalOutput(arc.nextFinalOutput, builder.bytes);
      }
      if (targetHasArcs && (flags & BIT_TARGET_NEXT) == 0) {
        builder.bytes.writeVLong(target.node);
      }
    }
    final long thisNodeAddress = builder.bytes.getPosition()-1;
    builder.bytes.reverse(startAddress, thisNodeAddress);
    builder.nodeCount++;
    final long node;
    node = thisNodeAddress;
    return node;
  }

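addNode first checks whether the node has no outgoing arcs; if so, it is an end node and FINAL_END_NODE or NON_FINAL_END_NODE is returned directly. Otherwise numArcs is added to arcCount, which counts the total number of arcs. addNode then computes the flags for each arc and writes flags and label into bytes, followed by the arc's output, final output and target address where they apply. label is one letter or byte of a Term. The bytes just written for the node are then reversed in place, and addNode returns the node's position in bytes (the BytesStore).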
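The flag computation can be isolated into a small function. The numeric bit values below are assumptions chosen for illustration; only the flag names and the conditions under which each flag is set come from the listing above. ArcFlagsSketch is a hypothetical class.

```java
// Sketch of how addNode combines per-arc properties into one flags byte.
// Bit positions are assumed here, not taken from the source text.
final class ArcFlagsSketch {
    static final int BIT_FINAL_ARC = 1 << 0;
    static final int BIT_LAST_ARC = 1 << 1;
    static final int BIT_TARGET_NEXT = 1 << 2;
    static final int BIT_STOP_NODE = 1 << 3;
    static final int BIT_ARC_HAS_OUTPUT = 1 << 4;
    static final int BIT_ARC_HAS_FINAL_OUTPUT = 1 << 5;

    static int flags(boolean isLastArc, boolean targetIsNext, boolean isFinal,
                     boolean hasFinalOutput, boolean targetHasArcs, boolean hasOutput) {
        int flags = 0;
        if (isLastArc) {
            flags |= BIT_LAST_ARC;          // no more arcs follow in this node
        }
        if (targetIsNext) {
            flags |= BIT_TARGET_NEXT;       // target was the node frozen just before
        }
        if (isFinal) {
            flags |= BIT_FINAL_ARC;         // an input ends on this arc
            if (hasFinalOutput) {
                flags |= BIT_ARC_HAS_FINAL_OUTPUT;
            }
        }
        if (!targetHasArcs) {
            flags |= BIT_STOP_NODE;         // target is an end node
        }
        if (hasOutput) {
            flags |= BIT_ARC_HAS_OUTPUT;
        }
        return flags;
    }

    public static void main(String[] args) {
        // Last arc of a node, final, pointing at an end node, carrying an output.
        System.out.println(flags(true, false, true, false, false, true));
        // prints: 27
    }
}
```

A reader of the arc can test these bits to know which optional fields (output, final output, target address) follow the label, which is why absent fields cost no bytes.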

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::add->freezeTail->compileNode->NodeHash::add

  public long add(Builder<T> builder, Builder.UnCompiledNode<T> nodeIn) throws IOException {
    final long h = hash(nodeIn);
    long pos = h & mask;
    int c = 0;
    while (true) {
      final long v = table.get(pos);
      if (v == 0) {
        final long node = fst.addNode(builder, nodeIn);
        count++;
        table.set(pos, node);
        if (count > 2*table.size()/3) {
          rehash();
        }
        return node;
      } else if (nodesEqual(nodeIn, v)) {
        return v;
      }
      pos = (pos + (++c)) & mask;
    }
  }

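The add function of dedupHash first calls hash to get the node's hash value, which is computed by iterating over every arc in the node. It then probes the table: an empty slot means the node is new, while a slot holding an equal node means the existing node can be reused and its address is returned.
For a new node, the function also uses FST's addNode to add it, and when the table becomes more than two thirds full it calls rehash to enlarge the hash array.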
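The probing pattern in add, where the step grows by one on every collision, can be tried out on a plain set of longs. This is only a sketch: ProbingSetSketch is a hypothetical class, the keys stand in for node addresses, and the multiplicative hash is an arbitrary choice made for the example.

```java
// Open-addressing set of non-zero longs using the same probe sequence as the
// listing above: pos = (pos + (++c)) & mask, with 0 meaning "empty slot".
final class ProbingSetSketch {
    private long[] table = new long[16]; // length must stay a power of two
    private int mask = table.length - 1;
    private int count;

    private static long hash(long key) {
        return key * 31; // arbitrary hash, chosen only for this sketch
    }

    // Returns true if the key was inserted, false if it was already present.
    boolean add(long key) {
        int pos = (int) (hash(key) & mask);
        int c = 0;
        while (true) {
            long v = table[pos];
            if (v == 0) {
                table[pos] = key;
                count++;
                if (count > 2 * table.length / 3) {
                    rehash();
                }
                return true;
            } else if (v == key) {
                return false;
            }
            pos = (pos + (++c)) & mask;
        }
    }

    // Doubles the table and reinserts every key with the same probe sequence.
    private void rehash() {
        long[] old = table;
        table = new long[2 * old.length];
        mask = table.length - 1;
        for (long key : old) {
            if (key != 0) {
                int pos = (int) (hash(key) & mask);
                int c = 0;
                while (table[pos] != 0) {
                    pos = (pos + (++c)) & mask;
                }
                table[pos] = key;
            }
        }
    }

    int size() {
        return count;
    }

    public static void main(String[] args) {
        ProbingSetSketch set = new ProbingSetSketch();
        for (long key = 1; key <= 20; key++) {
            set.add(key);
        }
        System.out.println(set.add(7) + " " + set.size());
        // prints: false 20
    }
}
```

Keeping the table at most two thirds full guarantees that a probe always finds an empty slot, so the while (true) loop terminates.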

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::add->UnCompiledNode::addArc

    public void addArc(int label, Node target) {
      if (numArcs == arcs.length) {
        final Arc<T>[] newArcs = ArrayUtil.grow(arcs, numArcs+1);
        for (int arcIdx=numArcs;arcIdx<newArcs.length;arcIdx++) {
          newArcs[arcIdx] = new Arc<>();
        }
        arcs = newArcs;
      }
      final Arc<T> arc = arcs[numArcs++];
      arc.label = label;
      arc.target = target;
      arc.output = arc.nextFinalOutput = owner.NO_OUTPUT;
      arc.isFinal = false;
    }

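addArc adds one letter or byte of a Term to the arcs array of this UnCompiledNode. The if statement at the top grows the arcs array when it is full. The next free Arc in the array is then taken and given the label, and the target parameter points it at the next UnCompiledNode.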

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::finish

  public FST<T> finish() throws IOException {
    final UnCompiledNode<T> root = frontier[0];
    freezeTail(0);
    if (root.inputCount < minSuffixCount1 || root.inputCount < minSuffixCount2 || root.numArcs == 0) {
      if (fst.emptyOutput == null) {
        return null;
      } else if (minSuffixCount1 > 0 || minSuffixCount2 > 0) {
        return null;
      }
    } else {
      if (minSuffixCount2 != 0) {
        compileAllTargets(root, lastInput.length());
      }
    }
    fst.finish(compileNode(root, lastInput.length()).node);
    if (doPackFST) {
      return fst.pack(this, 3, Math.max(10, (int) (getNodeCount()/4)), acceptableOverheadRatio);
    } else {
      return fst;
    }
  }

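The freezeTail call at the start of finish is passed 0, which means every node still held by the frontier array is processed. compileNode then writes the root node into bytes. The final call, FST's finish, stores the FST data in the member variable blocks of the byte store; blocks is a list of byte arrays.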

BlockTreeTermsWriter::write->TermsWriter::write->pushTerm->writeBlocks->PendingBlock::compileIndex->Builder::finish->FST::finish

  void finish(long newStartNode) throws IOException {
    startNode = newStartNode;
    bytes.finish();
    cacheRootArcs();
  }

  // BytesStore::finish, reached through bytes.finish() above:
  // trims the last buffer to the number of bytes actually written.
  public void finish() {
    if (current != null) {
      byte[] lastBuffer = new byte[nextWrite];
      System.arraycopy(current, 0, lastBuffer, 0, nextWrite);
      blocks.set(blocks.size()-1, lastBuffer);
      current = null;
    }
  }

Back in BlockTreeTermsWriter's write function, the next step is TermsWriter's finish function, which writes the information in the FST to the .tip file.

BlockTreeTermsWriter::write->TermsWriter::finish

    public void finish() throws IOException {
      if (numTerms > 0) {
        pushTerm(new BytesRef());
        pushTerm(new BytesRef());
        writeBlocks(0, pending.size());
        final PendingBlock root = (PendingBlock) pending.get(0);
        indexStartFP = indexOut.getFilePointer();
        root.index.save(indexOut);
        BytesRef minTerm = new BytesRef(firstPendingTerm.termBytes);
        BytesRef maxTerm = new BytesRef(lastPendingTerm.termBytes);
        fields.add(new FieldMetaData(fieldInfo,
                                     ((PendingBlock) pending.get(0)).index.getEmptyOutput(),
                                     numTerms,
                                     indexStartFP,
                                     sumTotalTermFreq,
                                     sumDocFreq,
                                     docsSeen.cardinality(),
                                     longsSize,
                                     minTerm, maxTerm));
      }
    }

root.index.save(indexOut) is the call that writes the information to the .tip file.

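Summary

To summarize the overall flow of this chapter:
BlockTreeTermsWriter calls TermsWriter's write function for every Term of every field, and then calls finish to write the index information to the .tip file.
For each Term, TermsWriter's write function first records the postings through Lucene50PostingsWriter's writeTerm, then calls pushTerm, which is where Term information ends up in the .tim file and the FST, and finally adds the Term to the pending list.
pushTerm works out when the time is right and then calls writeBlocks to write several pending Terms as one Block.
writeBlocks selects the relevant Terms or sub-blocks from the pending list, calls writeBlock to write their information, calls compileIndex to build the index, and finally removes the processed Terms or Blocks from the pending list.
writeBlock writes the information of each Term or Block to the .tim file. The .doc, .pos and .pay files are written earlier, by Lucene50PostingsWriter's writeTerm.
compileIndex uses Builder's add function to add nodes (one per letter or byte) to the frontier array. The frontier array holds UnCompiledNode objects that together form a tree, and inside add, freezeTail takes the nodes of the tree that will no longer change and writes them into the FST through compileNode.
Finally, in its finish function, TermsWriter writes the information in the FST to the .tip file.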
