Lucene source code analysis --- 6


Lucene source code analysis: creating an IndexReader

This chapter begins the analysis of Lucene's query process. Let's start with a typical piece of query code written against the Lucene 6 API:

        String indexPath;   // path to the index directory (placeholder)
        String words;       // raw query string to parse (placeholder)
        IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)));
        IndexSearcher searcher = new IndexSearcher(reader);
        ScoreDoc[] hits = null;
        Query query = null;
        Analyzer analyzer = new SimpleAnalyzer();
        try {
            QueryParser qp = new QueryParser("body", analyzer);
            query = qp.parse(words);
        } catch (ParseException e) {
            return null;
        }
        if (searcher != null) {
            TopDocs results = searcher.search(query, 20);
            hits = results.scoreDocs;
            Document document = null;
            for (int i = 0; i < hits.length; i++) {
                document = searcher.doc(hits[i].doc);
            }
            reader.close();
        }

indexPath is the path to the index directory. FSDirectory's open function was analyzed in earlier chapters; depending on the platform it returns one of MMapDirectory, SimpleFSDirectory, or NIOFSDirectory. For the rest of this chapter we assume NIOFSDirectory.
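To guarantee that assumption instead of relying on FSDirectory.open's platform-dependent choice (on most 64-bit JVMs it prefers MMapDirectory), the directory can be constructed explicitly. A minimal sketch, reusing indexPath from the example above:

  import java.nio.file.Paths;

  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.store.NIOFSDirectory;

  // Bypass FSDirectory.open's platform-dependent choice and force the
  // NIOFSDirectory implementation assumed throughout this chapter.
  NIOFSDirectory dir = new NIOFSDirectory(Paths.get(indexPath));
  IndexReader reader = DirectoryReader.open(dir);

The example then calls DirectoryReader's open function to create an IndexReader, as shown below: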
DirectoryReader::open

  public static DirectoryReader open(final Directory directory) throws IOException {
    return StandardDirectoryReader.open(directory, null);
  }

  static DirectoryReader open(final Directory directory, final IndexCommit commit) throws IOException {
    return new SegmentInfos.FindSegmentsFile<DirectoryReader>(directory) {
        ...
    }.run(commit);
  }

DirectoryReader's open function calls StandardDirectoryReader's open function, which in turn calls FindSegmentsFile's run function; the end result is a StandardDirectoryReader.
DirectoryReader::open->FindSegmentsFile::run

    public T run() throws IOException {
      return run(null);
    }

    public T run(IndexCommit commit) throws IOException {
      if (commit != null) {
        ...
      }
      long lastGen = -1;
      long gen = -1;
      IOException exc = null;
      for (;;) {
        lastGen = gen;
        String files[] = directory.listAll();
        String files2[] = directory.listAll();
        Arrays.sort(files);
        Arrays.sort(files2);
        if (!Arrays.equals(files, files2)) {
          continue;
        }
        gen = getLastCommitGeneration(files);
        if (gen == -1) {
          throw new IndexNotFoundException();
        } else if (gen > lastGen) {
          String segmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS, "", gen);
          try {
            T t = doBody(segmentFileName);
            return t;
          } catch (IOException err) {
          }
        } else {
          throw exc;
        }
      }
    }

Suppose the index directory contains the files segments_0, segments_1, and segments.gen. Then getLastCommitGeneration in the code above returns 1, i.e. the largest generation number at the end of the file names beginning with "segments", and fileNameFromGeneration returns segments_1.
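The generation suffix in these file names is encoded in radix 36 (Character.MAX_RADIX), which is why the code above and readCommit below pass that radix to Long.toString and Long.parseLong. A minimal sketch of the round trip; the helper names here are illustrative, not Lucene's:

  // Illustrative round trip of the generation encoding in segments_N
  // file names; Lucene writes N in radix 36 (Character.MAX_RADIX).
  static String fileNameFromGen(long gen) {
    return "segments_" + Long.toString(gen, Character.MAX_RADIX);
  }

  static long genFromFileName(String fileName) {
    // Only handles the "segments_N" form; the bare "segments" file name
    // is ignored for simplicity.
    return Long.parseLong(fileName.substring("segments_".length()), Character.MAX_RADIX);
  }

  // fileNameFromGen(35) returns "segments_z"; genFromFileName("segments_z") returns 35.

The most important part, however, is the doBody function, which loads the segment and field information from the files into in-memory structures. doBody is overridden inside DirectoryReader's open and is defined as follows: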
DirectoryReader::open->FindSegmentsFile::run->doBody

      protected DirectoryReader doBody(String segmentFileName) throws IOException {
        SegmentInfos sis = SegmentInfos.readCommit(directory, segmentFileName);
        final SegmentReader[] readers = new SegmentReader[sis.size()];
        boolean success = false;
        try {
          for (int i = sis.size()-1; i >= 0; i--) {
            readers[i] = new SegmentReader(sis.info(i), IOContext.READ);
          }
          DirectoryReader reader = new StandardDirectoryReader(directory, readers, null, sis, false, false);
          success = true;
          return reader;
        } finally {
        }
      }

doBody first reads the segment information into a SegmentInfos via SegmentInfos.readCommit, and then creates a SegmentReader for each segment based on that information; the SegmentReader constructor reads each segment's field information and stores it in the SegmentReader's member variables. Let's look at SegmentInfos.readCommit first:
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentInfos::readCommit

  public static final SegmentInfos readCommit(Directory directory, String segmentFileName) throws IOException {
    long generation = generationFromSegmentsFileName(segmentFileName);
    try (ChecksumIndexInput input = directory.openChecksumInput(segmentFileName, IOContext.READ)) {
      return readCommit(directory, input, generation);
    }
  }

  public ChecksumIndexInput openChecksumInput(String name, IOContext context) throws IOException {
    return new BufferedChecksumIndexInput(openInput(name, context));
  }

  public IndexInput openInput(String name, IOContext context) throws IOException {
    ensureOpen();
    ensureCanRead(name);
    Path path = getDirectory().resolve(name);
    FileChannel fc = FileChannel.open(path, StandardOpenOption.READ);
    return new NIOFSIndexInput("NIOFSIndexInput(path=\"" + path + "\")", fc, context);
  }

Assuming the segment file name passed in is segments_1, generationFromSegmentsFileName above returns 1. readCommit first creates a BufferedChecksumIndexInput through openChecksumInput, representing the input stream over the file; the openInput call inside it creates the NIOFSIndexInput.
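BufferedChecksumIndexInput simply maintains a running CRC32 over every byte read, so that CodecUtil.checkFooter can later compare the accumulated value against the checksum stored in the file's footer. A conceptual sketch of the same idea over a plain InputStream (this mirrors the mechanism only, not Lucene's actual class layout):

  import java.io.IOException;
  import java.io.InputStream;
  import java.util.zip.CRC32;

  // Read a stream while maintaining a running CRC32 -- the same idea
  // BufferedChecksumIndexInput uses, so the footer checksum can be
  // verified once the whole file has been consumed.
  static long checksumWhileReading(InputStream in) throws IOException {
    CRC32 crc = new CRC32();
    byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) != -1) {
      crc.update(buf, 0, n); // accumulate the checksum over the bytes read
    }
    return crc.getValue();   // compare against the stored footer value
  }

The overloaded readCommit then reads the actual file contents from this input stream: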
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentInfos::readCommit

  public static final SegmentInfos readCommit(Directory directory, ChecksumIndexInput input, long generation) throws IOException {
    int magic = input.readInt();
    if (magic != CodecUtil.CODEC_MAGIC) {
      throw new IndexFormatTooOldException();
    }
    int format = CodecUtil.checkHeaderNoMagic(input, "segments", VERSION_50, VERSION_CURRENT);
    byte id[] = new byte[StringHelper.ID_LENGTH];
    input.readBytes(id, 0, id.length);
    CodecUtil.checkIndexHeaderSuffix(input, Long.toString(generation, Character.MAX_RADIX));
    SegmentInfos infos = new SegmentInfos();
    infos.id = id;
    infos.generation = generation;
    infos.lastGeneration = generation;
    if (format >= VERSION_53) {
      infos.luceneVersion = Version.fromBits(input.readVInt(), input.readVInt(), input.readVInt());
    } else {
    }
    infos.version = input.readLong();
    infos.counter = input.readInt();
    int numSegments = input.readInt();
    if (format >= VERSION_53) {
      if (numSegments > 0) {
        infos.minSegmentLuceneVersion = Version.fromBits(input.readVInt(), input.readVInt(), input.readVInt());
      } else {
      }
    } else {
    }
    long totalDocs = 0;
    for (int seg = 0; seg < numSegments; seg++) {
      String segName = input.readString();
      final byte segmentID[];
      byte hasID = input.readByte();
      if (hasID == 1) {
        segmentID = new byte[StringHelper.ID_LENGTH];
        input.readBytes(segmentID, 0, segmentID.length);
      } else if (hasID == 0) {
      } else {
      }
      Codec codec = readCodec(input, format < VERSION_53);
      SegmentInfo info = codec.segmentInfoFormat().read(directory, segName, segmentID, IOContext.READ);
      info.setCodec(codec);
      totalDocs += info.maxDoc();
      long delGen = input.readLong();
      int delCount = input.readInt();
      long fieldInfosGen = input.readLong();
      long dvGen = input.readLong();
      SegmentCommitInfo siPerCommit = new SegmentCommitInfo(info, delCount, delGen, fieldInfosGen, dvGen);
      if (format >= VERSION_51) {
        siPerCommit.setFieldInfosFiles(input.readSetOfStrings());
      } else {
        siPerCommit.setFieldInfosFiles(Collections.unmodifiableSet(input.readStringSet()));
      }
      final Map<Integer,Set<String>> dvUpdateFiles;
      final int numDVFields = input.readInt();
      if (numDVFields == 0) {
        dvUpdateFiles = Collections.emptyMap();
      } else {
        Map<Integer,Set<String>> map = new HashMap<>(numDVFields);
        for (int i = 0; i < numDVFields; i++) {
          if (format >= VERSION_51) {
            map.put(input.readInt(), input.readSetOfStrings());
          } else {
            map.put(input.readInt(), Collections.unmodifiableSet(input.readStringSet()));
          }
        }
        dvUpdateFiles = Collections.unmodifiableMap(map);
      }
      siPerCommit.setDocValuesUpdatesFiles(dvUpdateFiles);
      infos.add(siPerCommit);
      Version segmentVersion = info.getVersion();
      if (format < VERSION_53) {
        if (infos.minSegmentLuceneVersion == null || segmentVersion.onOrAfter(infos.minSegmentLuceneVersion) == false) {
          infos.minSegmentLuceneVersion = segmentVersion;
        }
      }
    }
    if (format >= VERSION_51) {
      infos.userData = input.readMapOfStrings();
    } else {
      infos.userData = Collections.unmodifiableMap(input.readStringStringMap());
    }
    CodecUtil.checkFooter(input);
    return infos;
  }

readCommit is long, but it boils down to this: for the commit as a whole it reads and sets id, generation, lastGeneration, luceneVersion, version, counter, minSegmentLuceneVersion, userData, and so on; and for each segment it reads or sets the segment name, the segment ID, the number of deleted documents in the segment, the generation of the deletions, the generation of the field infos, the generation of the doc-values updates, the segment's field-info file names, and the segment's update file names. Finally it wraps all of this in a SegmentInfos and returns it.
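To inspect this metadata without stepping through the reader, the SegmentInfos API can also be called directly. A small sketch, reusing indexPath from the opening example:

  import java.nio.file.Paths;

  import org.apache.lucene.index.SegmentCommitInfo;
  import org.apache.lucene.index.SegmentInfos;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  // Read the newest segments_N file and dump the per-segment metadata
  // that readCommit decoded.
  try (Directory dir = FSDirectory.open(Paths.get(indexPath))) {
    SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
    System.out.println("generation=" + infos.getGeneration() + " version=" + infos.getVersion());
    for (SegmentCommitInfo sci : infos) {
      System.out.println(sci.info.name
          + " maxDoc=" + sci.info.maxDoc()
          + " delCount=" + sci.getDelCount()
          + " delGen=" + sci.getDelGen());
    }
  }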
Within that loop, for each segment, the segmentInfoFormat function returns a Lucene50SegmentInfoFormat, whose read function reads the per-segment details and wraps them in a SegmentInfo. The code is as follows:
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentInfos::readCommit->Lucene50SegmentInfoFormat::read

  public SegmentInfo read(Directory dir, String segment, byte[] segmentID, IOContext context) throws IOException {
    final String fileName = IndexFileNames.segmentFileName(segment, "", Lucene50SegmentInfoFormat.SI_EXTENSION);
    try (ChecksumIndexInput input = dir.openChecksumInput(fileName, context)) {
      Throwable priorE = null;
      SegmentInfo si = null;
      try {
        int format = CodecUtil.checkIndexHeader(input, Lucene50SegmentInfoFormat.CODEC_NAME,
                                                Lucene50SegmentInfoFormat.VERSION_START,
                                                Lucene50SegmentInfoFormat.VERSION_CURRENT,
                                                segmentID, "");
        final Version version = Version.fromBits(input.readInt(), input.readInt(), input.readInt());
        final int docCount = input.readInt();
        final boolean isCompoundFile = input.readByte() == SegmentInfo.YES;
        final Map<String,String> diagnostics;
        final Set<String> files;
        final Map<String,String> attributes;
        if (format >= VERSION_SAFE_MAPS) {
          diagnostics = input.readMapOfStrings();
          files = input.readSetOfStrings();
          attributes = input.readMapOfStrings();
        } else {
          diagnostics = Collections.unmodifiableMap(input.readStringStringMap());
          files = Collections.unmodifiableSet(input.readStringSet());
          attributes = Collections.unmodifiableMap(input.readStringStringMap());
        }
        si = new SegmentInfo(dir, version, segment, docCount, isCompoundFile, null, diagnostics, segmentID, attributes);
        si.setFiles(files);
      } catch (Throwable exception) {
        priorE = exception;
      } finally {
        CodecUtil.checkFooter(input, priorE);
      }
      return si;
    }
  }

This read function opens the .si file and reads the version, docCount, isCompoundFile, diagnostics, attributes, and files information from it, then creates a SegmentInfo wrapping this information and returns it.
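The .si file name is derived from the segment name via IndexFileNames.segmentFileName; a tiny example:

  import org.apache.lucene.index.IndexFileNames;

  // Lucene50SegmentInfoFormat.SI_EXTENSION is "si", so for segment "_0"
  // this resolves to the file name "_0.si".
  String siFileName = IndexFileNames.segmentFileName("_0", "", "si");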

Back in FindSegmentsFile's doBody: once readCommit has wrapped all segment information from the file into a SegmentInfos, a SegmentReader is created for each segment, and its constructor reads the field information.
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentReader::SegmentReader

  public SegmentReader(SegmentCommitInfo si, IOContext context) throws IOException {
    this.si = si;
    core = new SegmentCoreReaders(si.info.dir, si, context);
    segDocValues = new SegmentDocValues();
    boolean success = false;
    final Codec codec = si.info.getCodec();
    try {
      if (si.hasDeletions()) {
        liveDocs = codec.liveDocsFormat().readLiveDocs(directory(), si, IOContext.READONCE);
      } else {
        liveDocs = null;
      }
      numDocs = si.info.maxDoc() - si.getDelCount();
      fieldInfos = initFieldInfos();
      docValuesProducer = initDocValuesProducer();
      success = true;
    } finally {
    }
  }

si.info.dir is the directory containing the index files. Let's look first at the constructor of SegmentCoreReaders, which reads the field information:
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentReader::SegmentReader->SegmentCoreReaders::SegmentCoreReaders

  SegmentCoreReaders(Directory dir, SegmentCommitInfo si, IOContext context) throws IOException {
    final Codec codec = si.info.getCodec();
    final Directory cfsDir;
    boolean success = false;
    try {
      if (si.info.getUseCompoundFile()) {
        cfsDir = cfsReader = codec.compoundFormat().getCompoundReader(dir, si.info, context);
      } else {
        cfsReader = null;
        cfsDir = dir;
      }
      coreFieldInfos = codec.fieldInfosFormat().read(cfsDir, si.info, "", context);
      final SegmentReadState segmentReadState = new SegmentReadState(cfsDir, si.info, coreFieldInfos, context);
      final PostingsFormat format = codec.postingsFormat();
      fields = format.fieldsProducer(segmentReadState);
      if (coreFieldInfos.hasNorms()) {
        normsProducer = codec.normsFormat().normsProducer(segmentReadState);
        assert normsProducer != null;
      } else {
        normsProducer = null;
      }
      fieldsReaderOrig = si.info.getCodec().storedFieldsFormat().fieldsReader(cfsDir, si.info, coreFieldInfos, context);
      if (coreFieldInfos.hasVectors()) {
        termVectorsReaderOrig = si.info.getCodec().termVectorsFormat().vectorsReader(cfsDir, si.info, coreFieldInfos, context);
      } else {
        termVectorsReaderOrig = null;
      }
      if (coreFieldInfos.hasPointValues()) {
        pointsReader = codec.pointsFormat().fieldsReader(segmentReadState);
      } else {
        pointsReader = null;
      }
      success = true;
    } finally {
    }
  }

getUseCompoundFile indicates whether the segment's files are packed into .cfs and .cfe compound files. If so, the compoundFormat function returns a Lucene50CompoundFormat, whose getCompoundReader function is then called:
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentReader::SegmentReader->SegmentCoreReaders::SegmentCoreReaders->Lucene50CompoundFormat::getCompoundReader

  public Directory getCompoundReader(Directory dir, SegmentInfo si, IOContext context) throws IOException {
    return new Lucene50CompoundReader(dir, si, context);
  }

  public Lucene50CompoundReader(Directory directory, SegmentInfo si, IOContext context) throws IOException {
    this.directory = directory;
    this.segmentName = si.name;
    String dataFileName = IndexFileNames.segmentFileName(segmentName, "", Lucene50CompoundFormat.DATA_EXTENSION);
    String entriesFileName = IndexFileNames.segmentFileName(segmentName, "", Lucene50CompoundFormat.ENTRIES_EXTENSION);
    this.entries = readEntries(si.getId(), directory, entriesFileName);
    boolean success = false;
    long expectedLength = CodecUtil.indexHeaderLength(Lucene50CompoundFormat.DATA_CODEC, "");
    for(Map.Entry<String,FileEntry> ent : entries.entrySet()) {
      expectedLength += ent.getValue().length;
    }
    expectedLength += CodecUtil.footerLength();
    handle = directory.openInput(dataFileName, context);
    try {
      CodecUtil.checkIndexHeader(handle, Lucene50CompoundFormat.DATA_CODEC, version, version, si.getId(), "");
      CodecUtil.retrieveChecksum(handle);
      success = true;
    } finally {
      if (!success) {
        IOUtils.closeWhileHandlingException(handle);
      }
    }
  }

getCompoundReader creates a Lucene50CompoundReader. The Lucene50CompoundReader constructor opens the .cfs and .cfe files and uses readEntries to read the table of files contained inside, storing it in entries.
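Whether a segment uses a compound file at all is decided at index time. A hedged sketch of the Lucene 6 write-side switches (flushed segments are controlled by IndexWriterConfig, merged segments by the merge policy):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.index.TieredMergePolicy;

  // These settings decide whether getUseCompoundFile() will be true for
  // newly written segments.
  IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
  iwc.setUseCompoundFile(false);  // flushed segments: keep separate files

  TieredMergePolicy mp = new TieredMergePolicy();
  mp.setNoCFSRatio(0.0);          // merged segments: never pack into .cfs
  iwc.setMergePolicy(mp);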

Back in the SegmentCoreReaders constructor: fieldInfosFormat returns a Lucene60FieldInfosFormat, whose read function reads the field information:
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentReader::SegmentReader->SegmentCoreReaders::SegmentCoreReaders->Lucene60FieldInfosFormat::read

  public FieldInfos read(Directory directory, SegmentInfo segmentInfo, String segmentSuffix, IOContext context) throws IOException {
    final String fileName = IndexFileNames.segmentFileName(segmentInfo.name, segmentSuffix, EXTENSION);
    try (ChecksumIndexInput input = directory.openChecksumInput(fileName, context)) {
      Throwable priorE = null;
      FieldInfo infos[] = null;
      try {
        CodecUtil.checkIndexHeader(input,
                                   Lucene60FieldInfosFormat.CODEC_NAME,
                                   Lucene60FieldInfosFormat.FORMAT_START,
                                   Lucene60FieldInfosFormat.FORMAT_CURRENT,
                                   segmentInfo.getId(), segmentSuffix);
        final int size = input.readVInt();
        infos = new FieldInfo[size];
        Map<String,String> lastAttributes = Collections.emptyMap();
        for (int i = 0; i < size; i++) {
          String name = input.readString();
          final int fieldNumber = input.readVInt();
          byte bits = input.readByte();
          boolean storeTermVector = (bits & STORE_TERMVECTOR) != 0;
          boolean omitNorms = (bits & OMIT_NORMS) != 0;
          boolean storePayloads = (bits & STORE_PAYLOADS) != 0;
          final IndexOptions indexOptions = getIndexOptions(input, input.readByte());
          final DocValuesType docValuesType = getDocValuesType(input, input.readByte());
          final long dvGen = input.readLong();
          Map<String,String> attributes = input.readMapOfStrings();
          if (attributes.equals(lastAttributes)) {
            attributes = lastAttributes;
          }
          lastAttributes = attributes;
          int pointDimensionCount = input.readVInt();
          int pointNumBytes;
          if (pointDimensionCount != 0) {
            pointNumBytes = input.readVInt();
          } else {
            pointNumBytes = 0;
          }
          try {
            infos[i] = new FieldInfo(name, fieldNumber, storeTermVector, omitNorms, storePayloads,
                                     indexOptions, docValuesType, dvGen, attributes,
                                     pointDimensionCount, pointNumBytes);
            infos[i].checkConsistency();
          } catch (IllegalStateException e) {
          }
        }
      } catch (Throwable exception) {
        priorE = exception;
      } finally {
        CodecUtil.checkFooter(input, priorE);
      }
      return new FieldInfos(infos);
    }
  }

This read function opens the .fnm file, which holds the basic information about each field. It then iterates over all fields, reading name (the field name), fieldNumber (the field's internal number), storeTermVector (whether term vectors are stored), omitNorms (whether norms are omitted), storePayloads (whether payloads are stored), indexOptions (how the field is indexed), docValuesType (the doc-values type), the doc-values generation, attributes, pointDimensionCount, and pointNumBytes, wrapping each field into a FieldInfo and the whole array into a FieldInfos.
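The decoded flags can be inspected from any open reader. A small sketch using the Lucene 6 MultiFields helper and the reader from the opening example:

  import org.apache.lucene.index.FieldInfo;
  import org.apache.lucene.index.FieldInfos;
  import org.apache.lucene.index.MultiFields;

  // Dump the per-field flags that Lucene60FieldInfosFormat.read decoded
  // from the .fnm files, merged across all segments of the reader.
  FieldInfos fieldInfos = MultiFields.getMergedFieldInfos(reader);
  for (FieldInfo fi : fieldInfos) {
    System.out.println(fi.name
        + " number=" + fi.number
        + " indexOptions=" + fi.getIndexOptions()
        + " docValues=" + fi.getDocValuesType()
        + " vectors=" + fi.hasVectors()
        + " omitNorms=" + fi.omitsNorms());
  }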

Back in the SegmentCoreReaders constructor: the postingsFormat function returns a PerFieldPostingsFormat, whose fieldsProducer function ultimately sets fields to a FieldsReader.
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentReader::SegmentReader->SegmentCoreReaders::SegmentCoreReaders->PerFieldPostingsFormat::fieldsProducer

  public final FieldsProducer fieldsProducer(SegmentReadState state)
      throws IOException {
    return new FieldsReader(state);
  }
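FieldsReader exists because Lucene allows a different postings format for each field; the per-field choice is made on the write side by overriding the codec. A hedged sketch of that mechanism in Lucene 6.0 (the "Memory" format requires the lucene-codecs module, and routing the "id" field to it is only an example):

  import org.apache.lucene.codecs.Codec;
  import org.apache.lucene.codecs.PostingsFormat;
  import org.apache.lucene.codecs.lucene60.Lucene60Codec;

  // Why fieldsProducer returns a per-field FieldsReader: each field may
  // have been written with its own PostingsFormat, chosen like this and
  // installed via IndexWriterConfig.setCodec(codec).
  Codec codec = new Lucene60Codec() {
    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
      if ("id".equals(field)) {
        return PostingsFormat.forName("Memory");
      }
      return super.getPostingsFormatForField(field);
    }
  };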

The normsFormat function returns a Lucene53NormsFormat, whose normsProducer function returns a Lucene53NormsProducer that is assigned to normsProducer.

  public NormsProducer normsProducer(SegmentReadState state) throws IOException {
    return new Lucene53NormsProducer(state, DATA_CODEC, DATA_EXTENSION, METADATA_CODEC, METADATA_EXTENSION);
  }

Continuing in order: fieldsReaderOrig ends up being a CompressingStoredFieldsReader, termVectorsReaderOrig a CompressingTermVectorsReader, and pointsReader a Lucene60PointsReader.

Back in the SegmentReader constructor: at this point all segment and field information has been read. Next, if the segment has deletions, the liveDocsFormat function returns a Lucene50LiveDocsFormat and its readLiveDocs function is called:
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentReader::SegmentReader->Lucene50LiveDocsFormat::readLiveDocs

  public Bits readLiveDocs(Directory dir, SegmentCommitInfo info, IOContext context) throws IOException {
    long gen = info.getDelGen();
    String name = IndexFileNames.fileNameFromGeneration(info.info.name, EXTENSION, gen);
    final int length = info.info.maxDoc();
    try (ChecksumIndexInput input = dir.openChecksumInput(name, context)) {
      Throwable priorE = null;
      try {
        CodecUtil.checkIndexHeader(input, CODEC_NAME, VERSION_START, VERSION_CURRENT,
                                   info.info.getId(), Long.toString(gen, Character.MAX_RADIX));
        long data[] = new long[FixedBitSet.bits2words(length)];
        for (int i = 0; i < data.length; i++) {
          data[i] = input.readLong();
        }
        FixedBitSet fbs = new FixedBitSet(data, length);
        return fbs;
      } catch (Throwable exception) {
        priorE = exception;
      } finally {
        CodecUtil.checkFooter(input, priorE);
      }
    }
  }

readLiveDocs opens the .liv file, creates an input stream over it, and reads its contents into a FixedBitSet that marks which documents have been deleted.
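The bit set is exposed to callers as a Bits in which a cleared bit means "deleted". A small usage sketch with the reader from the opening example:

  import org.apache.lucene.index.MultiFields;
  import org.apache.lucene.util.Bits;

  // liveDocs is null when the reader has no deletions at all; otherwise
  // a cleared bit marks a deleted document that must be skipped.
  Bits liveDocs = MultiFields.getLiveDocs(reader);
  for (int docID = 0; docID < reader.maxDoc(); docID++) {
    if (liveDocs != null && !liveDocs.get(docID)) {
      continue; // document docID has been deleted
    }
    // process the live document ...
  }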

Back in the SegmentReader constructor: the initFieldInfos function assigns coreFieldInfos from SegmentCoreReaders to fieldInfos, re-reading it if the segment has field-info updates; the initDocValuesProducer function ultimately returns a FieldsReader.

Back in FindSegmentsFile's doBody: finally the StandardDirectoryReader is created and returned. The StandardDirectoryReader constructor itself is fairly simple; what deserves attention lies in its parent class CompositeReader, which we will meet below.

Back to the example: the next step creates the IndexSearcher and the QueryParser. Neither constructor contains anything essential, so we will not drill into them here.
What is worth noting is that the IndexSearcher constructor calls StandardDirectoryReader's getContext function, which in turn calls the leaves function. First, getContext, defined in StandardDirectoryReader's parent class CompositeReader:
StandardDirectoryReader::getContext

  public final CompositeReaderContext getContext() {
    ensureOpen();
    if (readerContext == null) {
      readerContext = CompositeReaderContext.create(this);
    }
    return readerContext;
  }

ensureOpen makes sure the IndexReader has not been closed; the create function then builds the CompositeReaderContext:
CompositeReaderContext::create

    static CompositeReaderContext create(CompositeReader reader) {
      return new Builder(reader).build();
    }

    public CompositeReaderContext build() {
      return (CompositeReaderContext) build(null, reader, 0, 0);
    }

    private IndexReaderContext build(CompositeReaderContext parent, IndexReader reader, int ord, int docBase) {
      if (reader instanceof LeafReader) {
        final LeafReader ar = (LeafReader) reader;
        final LeafReaderContext atomic = new LeafReaderContext(parent, ar, ord, docBase, leaves.size(), leafDocBase);
        leaves.add(atomic);
        leafDocBase += reader.maxDoc();
        return atomic;
      } else {
        final CompositeReader cr = (CompositeReader) reader;
        final List<? extends IndexReader> sequentialSubReaders = cr.getSequentialSubReaders();
        final List<IndexReaderContext> children = Arrays.asList(new IndexReaderContext[sequentialSubReaders.size()]);
        final CompositeReaderContext newParent;
        if (parent == null) {
          newParent = new CompositeReaderContext(cr, children, leaves);
        } else {
          newParent = new CompositeReaderContext(parent, cr, ord, docBase, children);
        }
        int newDocBase = 0;
        for (int i = 0, c = sequentialSubReaders.size(); i < c; i++) {
          final IndexReader r = sequentialSubReaders.get(i);
          children.set(i, build(newParent, r, i, newDocBase));
          newDocBase += r.maxDoc();
        }
        assert newDocBase == cr.maxDoc();
        return newParent;
      }
    }

First, getSequentialSubReaders returns exactly the list of SegmentReaders created for each segment in FindSegmentsFile's doBody. A CompositeReaderContext is created, and build is then called recursively for each SegmentReader, with the results set into children. Since SegmentReader extends LeafReader, each recursive build call wraps the SegmentReader in a LeafReaderContext and adds it to the leaves list.

The leaves function therefore ultimately returns the list of LeafReaderContexts wrapping the SegmentReaders.
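A small sketch of how this list is typically consumed; each LeafReaderContext pairs one SegmentReader with its docBase offset into the composite document ID space:

  import org.apache.lucene.index.LeafReaderContext;

  // Each leaf corresponds to one SegmentReader; ctx.docBase maps its
  // local document IDs into the composite reader's global ID space.
  for (LeafReaderContext ctx : reader.leaves()) {
    System.out.println("ord=" + ctx.ord
        + " docBase=" + ctx.docBase
        + " maxDoc=" + ctx.reader().maxDoc());
  }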

The next chapter begins the analysis of QueryParser's parse function.
