Lucene source code analysis (2)


Lucene source code analysis: preparation for index creation

For ease of reference, here again is the index-creation example from the previous chapter:

  String filePath = ...   // directory holding the files to index
  String indexPath = ...  // directory where the index will be written
  File fileDir = new File(filePath);
  Directory dir = FSDirectory.open(Paths.get(indexPath));
  Analyzer luceneAnalyzer = new SimpleAnalyzer();
  IndexWriterConfig iwc = new IndexWriterConfig(luceneAnalyzer);
  iwc.setOpenMode(OpenMode.CREATE);
  IndexWriter indexWriter = new IndexWriter(dir, iwc);
  File[] textFiles = fileDir.listFiles();
  for (int i = 0; i < textFiles.length; i++) {
    if (textFiles[i].isFile()) {
      String temp = FileReaderAll(textFiles[i].getCanonicalPath(), "GBK");
      Document document = new Document();
      Field FieldPath = new StringField("path", textFiles[i].getPath(), Field.Store.YES);
      Field FieldBody = new TextField("body", temp, Field.Store.YES);
      document.add(FieldPath);
      document.add(FieldBody);
      indexWriter.addDocument(document);
    }
  }
  indexWriter.close();
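FileReaderAll is a helper defined in the previous chapter that reads an entire file into a String using the named character set (here GBK). As a reminder, a minimal sketch of such a helper could look like the following; the name and parameters are taken from the call above, but the body is an assumption, not the original code:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FileReaderAllSketch {
    // Decode raw bytes with the named charset, e.g. "GBK" or "UTF-8".
    public static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    // Read the whole file at fileName and decode it with the given charset.
    public static String fileReaderAll(String fileName, String charsetName) throws IOException {
        return decode(Files.readAllBytes(Paths.get(fileName)), charsetName);
    }
}
```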

First, FSDirectory's open function opens the index directory that will hold the index files generated later. Its code is as follows:

  public static FSDirectory open(Path path) throws IOException {
    return open(path, FSLockFactory.getDefault());
  }

  public static FSDirectory open(Path path, LockFactory lockFactory) throws IOException {
    if (Constants.JRE_IS_64BIT && MMapDirectory.UNMAP_SUPPORTED) {
      return new MMapDirectory(path, lockFactory);
    } else if (Constants.WINDOWS) {
      return new SimpleFSDirectory(path, lockFactory);
    } else {
      return new NIOFSDirectory(path, lockFactory);
    }
  }
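The platform branching in open can be mirrored as a small decision function. This is a sketch for illustration only; the constants here stand in for Constants.JRE_IS_64BIT, MMapDirectory.UNMAP_SUPPORTED, and Constants.WINDOWS:

```java
public class DirectoryChooser {
    // Mirrors FSDirectory.open's selection: memory-mapped I/O on 64-bit JREs
    // (when unmapping the mapped buffer is supported), simple file I/O on
    // Windows, and NIO-based I/O everywhere else.
    public static String pick(boolean is64Bit, boolean unmapSupported, boolean isWindows) {
        if (is64Bit && unmapSupported) {
            return "MMapDirectory";
        } else if (isWindows) {
            return "SimpleFSDirectory";
        } else {
            return "NIOFSDirectory";
        }
    }
}
```

For example, a 32-bit JRE on Linux falls through to the last branch and gets a NIOFSDirectory, which is the case this chapter assumes.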

The default LockFactory obtained from FSLockFactory is NativeFSLockFactory, a factory that produces the file lock NativeFSLock; we will look at that code in detail if the analysis reaches it later. Assume here that FSDirectory's open function creates a NIOFSDirectory. NIOFSDirectory extends FSDirectory and directly invokes the constructor of its parent class FSDirectory:

  protected FSDirectory(Path path, LockFactory lockFactory) throws IOException {
    super(lockFactory);
    if (!Files.isDirectory(path)) {
      Files.createDirectories(path);
    }
    directory = path.toRealPath();
  }

The FSDirectory constructor creates the directory for the given Path if it does not exist yet, and saves the resolved real path. FSDirectory extends BaseDirectory, whose constructor simply stores the LockFactory, so there is no need to go deeper here.

Back to the example at the top: next a SimpleAnalyzer is constructed, and from it an IndexWriterConfig is created. The IndexWriterConfig constructor directly calls the constructor of its parent class LiveIndexWriterConfig:

  LiveIndexWriterConfig(Analyzer analyzer) {
    this.analyzer = analyzer;
    ramBufferSizeMB = IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB;
    maxBufferedDocs = IndexWriterConfig.DEFAULT_MAX_BUFFERED_DOCS;
    maxBufferedDeleteTerms = IndexWriterConfig.DEFAULT_MAX_BUFFERED_DELETE_TERMS;
    mergedSegmentWarmer = null;
    delPolicy = new KeepOnlyLastCommitDeletionPolicy();
    commit = null;
    useCompoundFile = IndexWriterConfig.DEFAULT_USE_COMPOUND_FILE_SYSTEM;
    openMode = OpenMode.CREATE_OR_APPEND;
    similarity = IndexSearcher.getDefaultSimilarity();
    mergeScheduler = new ConcurrentMergeScheduler();
    indexingChain = DocumentsWriterPerThread.defaultIndexingChain;
    codec = Codec.getDefault();
    infoStream = InfoStream.getDefault();
    mergePolicy = new TieredMergePolicy();
    flushPolicy = new FlushByRamOrCountsPolicy();
    readerPooling = IndexWriterConfig.DEFAULT_READER_POOLING;
    indexerThreadPool = new DocumentsWriterPerThreadPool();
    perThreadHardLimitMB = IndexWriterConfig.DEFAULT_RAM_PER_THREAD_HARD_LIMIT_MB;
  }

The LiveIndexWriterConfig constructor creates and stores a series of components; each will be analyzed as it comes up later in the code, so we stop here.

Back to the Lucene example: next, an IndexWriter is created from the IndexWriterConfig just built. IndexWriter is the most central class in Lucene's index creation. Its constructor is fairly long, so let us walk through it piece by piece:

  public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException {
    if (d instanceof FSDirectory && ((FSDirectory) d).checkPendingDeletions()) {
      throw new IllegalArgumentException();
    }
    conf.setIndexWriter(this);
    config = conf;
    infoStream = config.getInfoStream();
    writeLock = d.obtainLock(WRITE_LOCK_NAME);
    boolean success = false;
    try {
      directoryOrig = d;
      directory = new LockValidatingDirectoryWrapper(d, writeLock);
      mergeDirectory = addMergeRateLimiters(directory);
      analyzer = config.getAnalyzer();
      mergeScheduler = config.getMergeScheduler();
      mergeScheduler.setInfoStream(infoStream);
      codec = config.getCodec();
      bufferedUpdatesStream = new BufferedUpdatesStream(infoStream);
      poolReaders = config.getReaderPooling();
      OpenMode mode = config.getOpenMode();
      boolean create;
      if (mode == OpenMode.CREATE) {
        create = true;
      } else if (mode == OpenMode.APPEND) {
        create = false;
      } else {
        create = !DirectoryReader.indexExists(directory);
      }
      boolean initialIndexExists = true;
      String[] files = directory.listAll();
      IndexCommit commit = config.getIndexCommit();
      StandardDirectoryReader reader;
      if (commit == null) {
        reader = null;
      } else {
        reader = commit.getReader();
      }
      if (create) {
        if (config.getIndexCommit() != null) {
          if (mode == OpenMode.CREATE) {
            throw new IllegalArgumentException();
          } else {
            throw new IllegalArgumentException();
          }
        }
        SegmentInfos sis = null;
        try {
          sis = SegmentInfos.readLatestCommit(directory);
          sis.clear();
        } catch (IOException e) {
          initialIndexExists = false;
          sis = new SegmentInfos();
        }
        segmentInfos = sis;
        rollbackSegments = segmentInfos.createBackupSegmentInfos();
        changed();
      } else if (reader != null) {
        ...
      } else {
        ...
      }
      pendingNumDocs.set(segmentInfos.totalMaxDoc());
      globalFieldNumberMap = getFieldNumberMap();
      config.getFlushPolicy().init(config);
      docWriter = new DocumentsWriter(this, config, directoryOrig, directory);
      eventQueue = docWriter.eventQueue();
      synchronized(this) {
        deleter = new IndexFileDeleter(files, directoryOrig, directory,
                                       config.getIndexDeletionPolicy(),
                                       segmentInfos, infoStream, this,
                                       initialIndexExists, reader != null);
        assert create || filesExist(segmentInfos);
      }
      if (deleter.startingCommitDeleted) {
        changed();
      }
      if (reader != null) {
        ...
      }
      success = true;
    } finally {
      if (!success) {
        IOUtils.closeWhileHandlingException(writeLock);
        writeLock = null;
      }
    }
  }

The IndexWriter constructor first calls checkPendingDeletions, which tries to delete files previously marked for deletion; if any still cannot be deleted, the constructor refuses to proceed. checkPendingDeletions is defined in FSDirectory as follows:

  public boolean checkPendingDeletions() throws IOException {
    deletePendingFiles();
    return pendingDeletes.isEmpty() == false;
  }

  public synchronized void deletePendingFiles() throws IOException {
    if (pendingDeletes.isEmpty() == false) {
      for (String name : new HashSet<>(pendingDeletes)) {
        privateDeleteFile(name, true);
      }
    }
  }

  private void privateDeleteFile(String name, boolean isPendingDelete) throws IOException {
    try {
      Files.delete(directory.resolve(name));
      pendingDeletes.remove(name);
    } catch (NoSuchFileException | FileNotFoundException e) {
    } catch (IOException ioe) {
    }
  }

checkPendingDeletions ultimately calls Files.delete to remove the files recorded in pendingDeletes, and returns whether any of them are still pending.
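The bookkeeping can be sketched with a plain Set: each delete attempt removes the file's name on success, and checkPendingDeletions reports whether anything remains pending. This is a simplified in-memory sketch, not the actual FSDirectory code, and the `files` set stands in for the file system:

```java
import java.util.HashSet;
import java.util.Set;

public class PendingDeletes {
    private final Set<String> pendingDeletes = new HashSet<>();
    private final Set<String> files; // stands in for the real file system

    public PendingDeletes(Set<String> files) { this.files = files; }

    // Mark a file for deletion later (e.g. it was still open on first attempt).
    public void markPending(String name) { pendingDeletes.add(name); }

    // Try to delete every pending file; names that succeed leave the set.
    public void deletePendingFiles() {
        for (String name : new HashSet<>(pendingDeletes)) {
            if (files.remove(name)) {       // "delete" succeeded
                pendingDeletes.remove(name);
            }
        }
    }

    // Mirrors checkPendingDeletions: retry the deletes, then report leftovers.
    public boolean checkPendingDeletions() {
        deletePendingFiles();
        return pendingDeletes.isEmpty() == false;
    }
}
```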

Back in the IndexWriter constructor, infoStream is obtained next; by default this is the NoOutput instance created in the LiveIndexWriterConfig constructor, used for diagnostic output. Then FSDirectory's obtainLock function is called to acquire the index directory's write lock; we will not go further into that here.

Back in the IndexWriter constructor again, a series of creations and assignments follows. Assume create is true, meaning the index is being created for the first time or recreated from scratch; the constructor then reads the existing segment information through SegmentInfos.readLatestCommit:

  public static final SegmentInfos readLatestCommit(Directory directory) throws IOException {
    return new FindSegmentsFile<SegmentInfos>(directory) {
      @Override
      protected SegmentInfos doBody(String segmentFileName) throws IOException {
        return readCommit(directory, segmentFileName);
      }
    }.run();
  }

readLatestCommit creates a FindSegmentsFile and calls its run function, defined as follows:

    public T run() throws IOException {
      return run(null);
    }

    public T run(IndexCommit commit) throws IOException {
      long lastGen = -1;
      long gen = -1;
      IOException exc = null;
      for (;;) {
        lastGen = gen;
        String files[] = directory.listAll();
        String files2[] = directory.listAll();
        Arrays.sort(files);
        Arrays.sort(files2);
        if (!Arrays.equals(files, files2)) {
          continue;
        }
        gen = getLastCommitGeneration(files);
        if (gen == -1) {
          throw new IndexNotFoundException();
        } else if (gen > lastGen) {
          String segmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS, "", gen);
          try {
            T t = doBody(segmentFileName);
            return t;
          } catch (IOException err) {
          }
        } else {
          throw exc;
        }
      }
    }

Here the type parameter T is SegmentInfos. run lists the directory twice and retries until the two listings agree, guarding against reading while another process is modifying the index. It then calls getLastCommitGeneration to determine the latest commit generation: if the index directory contains a file named segments_6, getLastCommitGeneration returns 6 into gen. If gen is greater than lastGen, the segment information is newer than what was last tried, so doBody is invoked to read that segments_6 file and return a SegmentInfos.
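The core of getLastCommitGeneration can be sketched as follows. Note that in the real IndexFileNames the generation suffix is encoded in base 36 (Character.MAX_RADIX), so for single-digit generations like segments_6 the result is simply 6; this is a simplified sketch, not the original code:

```java
public class SegmentsGen {
    // Parse the generation out of a "segments_N" file name;
    // a bare "segments" file is generation 0.
    public static long generationFromSegmentsFileName(String fileName) {
        if (fileName.equals("segments")) {
            return 0;
        }
        // Lucene encodes the generation in base 36.
        return Long.parseLong(fileName.substring("segments_".length()), Character.MAX_RADIX);
    }

    // Scan a directory listing and return the highest commit generation,
    // or -1 if no segments file exists at all.
    public static long getLastCommitGeneration(String[] files) {
        long max = -1;
        for (String file : files) {
            if (file.equals("segments") || file.startsWith("segments_")) {
                long gen = generationFromSegmentsFileName(file);
                if (gen > max) {
                    max = gen;
                }
            }
        }
        return max;
    }
}
```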
From the readLatestCommit code above, doBody ends up calling readCommit, defined in SegmentInfos as follows:

  public static final SegmentInfos readCommit(Directory directory, String segmentFileName) throws IOException {
    long generation = generationFromSegmentsFileName(segmentFileName);
    try (ChecksumIndexInput input = directory.openChecksumInput(segmentFileName, IOContext.READ)) {
      return readCommit(directory, input, generation);
    }
  }

readCommit first opens a ChecksumIndexInput, then reads the segment information through the overloaded readCommit and returns a SegmentInfos. That inner readCommit is tied to the concrete on-disk format of the segments_* file, so we do not go into it here; the returned SegmentInfos holds the segment information.
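ChecksumIndexInput maintains a running checksum while reading, so that corruption of the segments file can be detected; Lucene's index files carry a CRC-32 checksum in their footer. The idea can be sketched with java.util.zip.CRC32 (a simplified illustration of the principle, not the actual codec footer format):

```java
import java.util.zip.CRC32;

public class ChecksumSketch {
    // Compute a CRC-32 over the body bytes, as a stand-in for the running
    // checksum that ChecksumIndexInput accumulates while reading.
    public static long checksum(byte[] body) {
        CRC32 crc = new CRC32();
        crc.update(body, 0, body.length);
        return crc.getValue();
    }

    // Verify that the stored checksum matches what we recompute from the body.
    public static boolean verify(byte[] body, long storedChecksum) {
        return checksum(body) == storedChecksum;
    }
}
```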

Back in the IndexWriter constructor: if readLatestCommit returned a SegmentInfos, its clear function empties it; if the index is being created for the first time (readLatestCommit threw an IOException), a fresh SegmentInfos is constructed, whose constructor is an empty function. Next, SegmentInfos.createBackupSegmentInfos backs up the list of SegmentCommitInfo, mainly so that a rollback operation can restore it. IndexWriter then calls changed to mark the segment information as modified.

Continuing through the IndexWriter constructor: pendingNumDocs records the total number of documents in the index; globalFieldNumberMap records information about the fields in the segments; getFlushPolicy returns the FlushByRamOrCountsPolicy created in the LiveIndexWriterConfig constructor, whose init function just performs simple assignments. Further down, a DocumentsWriter is created and its event queue is saved into eventQueue. Finally, the constructor creates an IndexFileDeleter, which manages index files, for example by reference counting them so that operations on index files stay consistent in a multi-threaded environment.
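IndexFileDeleter's reference counting can be sketched as a map from file name to count: a file becomes safe to delete physically only when its count drops to zero. This is a minimal sketch of the idea; the real class additionally tracks commit points and consults the IndexDeletionPolicy:

```java
import java.util.HashMap;
import java.util.Map;

public class RefCounts {
    private final Map<String, Integer> refCounts = new HashMap<>();

    // A segment or commit point that uses the file increments its count.
    public void incRef(String file) {
        refCounts.merge(file, 1, Integer::sum);
    }

    // Decrement the count; when it reaches zero the caller may delete the file.
    public boolean decRef(String file) {
        int count = refCounts.merge(file, -1, Integer::sum);
        if (count <= 0) {
            refCounts.remove(file);
            return true; // no commit references the file any more
        }
        return false;
    }
}
```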

The next chapter continues the analysis of the Lucene index-creation example.
