Lucene Index Creation Process, Part 2

Source: Internet. Editor: 程序博客网. Date: 2024/06/05 05:35

5 DocumentsWriterPerThread.updateDocument in detail

Once the update of the Document has been handed to a DocumentsWriterPerThread, let's keep following the call.

public void updateDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer, Term delTerm) throws IOException, AbortingException {
  testPoint("DocumentsWriterPerThread addDocument start");
  assert deleteQueue != null;
  reserveOneDoc();
  docState.doc = doc;
  docState.analyzer = analyzer;
  docState.docID = numDocsInRAM;
  if (INFO_VERBOSE && infoStream.isEnabled("DWPT")) {
    infoStream.message("DWPT", Thread.currentThread().getName() + " update delTerm=" + delTerm + " docID=" + docState.docID + " seg=" + segmentInfo.name);
  }
  // Even on exception, the document is still added (but marked
  // deleted), so we don't need to un-reserve at that point.
  // Aborting exceptions will actually "lose" more than one
  // document, so the counter will be "wrong" in that case, but
  // it's very hard to fix (we can't easily distinguish aborting
  // vs non-aborting exceptions):
  boolean success = false;
  try {
    try {
      consumer.processDocument();
    } finally {
      docState.clear();
    }
    success = true;
  } finally {
    if (!success) {
      // mark document as deleted
      deleteDocID(docState.docID);
      numDocsInRAM++;
    }
  }
  finishDocument(delTerm);
}

In this method we only care about one line of code:

consumer.processDocument();

From here things become clear: in the end, the Document is handed to a DocConsumer for processing. The DocConsumer is obtained as follows:

abstract DocConsumer getChain(DocumentsWriterPerThread documentsWriterPerThread) throws IOException;

Lucene ships a default DocConsumer, DefaultIndexingChain. So the next step is simply to look at how this DocConsumer processes the Document.
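The delegation described above, together with the failure handling visible in the updateDocument listing, can be sketched in isolation. This is a minimal sketch, not Lucene's code: SketchDocConsumer and SketchPerThreadWriter are hypothetical stand-ins, a document is reduced to a list of strings, and the counter handling is simplified.

```java
import java.util.BitSet;
import java.util.List;

/** Hypothetical stand-in for DocConsumer: the object that actually processes a document. */
interface SketchDocConsumer {
    void processDocument(List<String> fields) throws Exception;
}

/** Hypothetical stand-in for DocumentsWriterPerThread; an assumption for illustration only. */
class SketchPerThreadWriter {
    private final SketchDocConsumer consumer;
    private final BitSet deletedDocs = new BitSet();
    private int numDocsInRAM = 0;

    SketchPerThreadWriter(SketchDocConsumer consumer) {
        this.consumer = consumer;
    }

    void updateDocument(List<String> fields) throws Exception {
        int docID = numDocsInRAM;
        boolean success = false;
        try {
            // All real work is delegated to the consumer.
            consumer.processDocument(fields);
            success = true;
        } finally {
            if (!success) {
                // The failed document still occupies its docID, so mark it deleted.
                deletedDocs.set(docID);
            }
            // Simplification: the counter is bumped here on both paths, whereas the
            // listing above does it in finishDocument on the success path.
            numDocsInRAM++;
        }
    }

    int docCount() {
        return numDocsInRAM;
    }

    boolean isDeleted(int docID) {
        return deletedDocs.get(docID);
    }
}
```

The point of the success flag is that an exception from the consumer still propagates to the caller, while the docID it consumed is never reused.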

6 DefaultIndexingChain.processDocument in detail

@Override
public void processDocument() throws IOException, AbortingException {
  
  // How many indexed field names we've seen (collapses
  // multiple field instances by the same name):
  int fieldCount = 0;
  
  long fieldGen = nextFieldGen++;
  
  // NOTE: we need two passes here, in case there are
  // multi-valued fields, because we must process all
  // instances of a given field at once, since the
  // analyzer is free to reuse TokenStream across fields
  // (i.e., we cannot have more than one TokenStream
  // running "at once"):
  
  termsHash.startDocument();
  
  fillStoredFields(docState.docID);
  startStoredFields();
  
  boolean aborting = false;
  try {
    for (IndexableField field : docState.doc) { // walk the Fields one by one and process each; this is where the real work finally shows up
      fieldCount = processField(field, fieldGen, fieldCount);
    }
  } catch (AbortingException ae) {
    aborting = true;
    throw ae;
  } finally {
    if (aborting == false) {
      // Finish each indexed field name seen in the document:
      for (int i=0;i<fieldCount;i++) {
        fields[i].finish();
      }
      finishStoredFields();
    }
  }
  
  try {
    termsHash.finishDocument();
  } catch (Throwable th) {
    // Must abort, on the possibility that on-disk term
    // vectors are now corrupt:
    throw AbortingException.wrap(th);
  }
}

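Seeing the code above made me smile: it keeps getting clearer. Processing the Document boils down to iterating over its Fields and processing each one. How an individual Field is processed is out of scope for this wiki page and is covered in depth in a separate one (see: Document Storage Details).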
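The two passes mentioned in the listing's comment (visit every field instance, then finish each distinct field name once) can be sketched as follows. This is a simplified illustration under stated assumptions: FieldPassSketch, Field, and PerField are hypothetical names, and "processing" a field is reduced to counting its values. Only the use of a per-document generation number to detect the first occurrence of a field name mirrors the fieldGen variable in the listing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FieldPassSketch {
    /** Hypothetical field instance: a name plus one value. */
    static final class Field {
        final String name;
        final String value;
        Field(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** Hypothetical per-field-name state that is reused across documents. */
    private static final class PerField {
        final String name;
        long fieldGen = -1;  // generation of the last document that touched this field
        int valueCount = 0;  // values seen in the current document
        PerField(String name) {
            this.name = name;
        }
    }

    private final Map<String, PerField> byName = new HashMap<>();
    private long nextFieldGen = 0;

    /** Returns "name=count" for each distinct field name, in first-seen order. */
    List<String> processDocument(List<Field> doc) {
        long fieldGen = nextFieldGen++;
        List<PerField> seen = new ArrayList<>();

        // Pass 1: visit every field instance; multi-valued fields share one PerField.
        for (Field field : doc) {
            PerField pf = byName.computeIfAbsent(field.name, PerField::new);
            if (pf.fieldGen != fieldGen) {
                // First time this name appears in the current document.
                pf.fieldGen = fieldGen;
                pf.valueCount = 0;
                seen.add(pf);
            }
            pf.valueCount++;
        }

        // Pass 2: finish each distinct field name exactly once.
        List<String> summary = new ArrayList<>();
        for (PerField pf : seen) {
            summary.add(pf.name + "=" + pf.valueCount);
        }
        return summary;
    }
}
```

Comparing generation numbers avoids clearing a "seen" flag on every PerField after each document, which is presumably why the listing carries fieldGen into processField.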

V. Commit Document

indexWriter.commit();

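A commit does the following:

All pending changes are committed to the index, including newly added documents, documents to be deleted, and segment merges.
The operation calls Directory.sync, which flushes the file system cache to disk. This is relatively expensive (it blocks while syncing), but once the data is on disk, a VM crash (or a power loss) no longer affects these pending updates.

The following passage explains the sync operation in more detail: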

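Traditional UNIX implementations have a buffer cache or page cache in the kernel, and most disk I/O passes through it. When data is written to a file, the kernel normally first copies the data into one of its buffers. If the buffer is not yet full, it is not queued for output; instead the kernel waits until it fills up, or until the buffer must be reused for another disk block, and only then queues it for output. The actual I/O happens when the buffer reaches the head of the queue. This style of output is called delayed write (Bach [1986], Chapter 3, discusses the buffer cache in detail).
Delayed write reduces the number of disk reads and writes, but it slows down updates to file contents: data destined for a file may not reach the disk for some time. If the system fails, this delay can cause updates to be lost. To keep the file system on disk consistent with the contents of the buffer cache, UNIX systems provide three functions: sync, fsync, and fdatasync.
The sync function only queues all modified block buffers for writing and then returns; it does not wait for the disk writes to complete.
A system daemon, usually called update, calls sync periodically (typically every 30 seconds). This guarantees that the kernel's block buffers are flushed regularly. The sync(1) command also calls the sync function.
The fsync function applies only to the single file specified by the file descriptor filedes, and it waits for the disk writes to complete before returning. fsync is intended for applications such as databases that need to be sure modified blocks have been written to disk.
The fdatasync function is similar to fsync, but it affects only the data portion of a file. With fsync, the file's attributes are also updated synchronously.
For a database that supports transactions, a commit is considered successful and reported to the application only after the transaction log (containing all of the transaction's modifications plus a commit record) has been completely written to disk.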

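After reading this passage it is clear that the sync operation flushes data cached by the file system (and the kernel) to disk, which keeps the data safe: it is not lost if the OS crashes or the power fails.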
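In Java, the closest counterpart to the fsync call described in the passage is FileChannel.force. The sketch below shows a write that returns only after the operating system has been asked to push the data to the storage device. DurableWrite and writeAndSync are hypothetical names chosen for illustration; this is not Lucene's Directory.sync implementation.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class DurableWrite {
    /** Writes data to path and returns only after the channel has been forced to storage. */
    static void writeAndSync(Path path, byte[] data) throws IOException {
        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buffer = ByteBuffer.wrap(data);
            while (buffer.hasRemaining()) {
                // write() may accept fewer bytes than offered, so loop until drained.
                channel.write(buffer);
            }
            // force(true) requests that content and metadata both reach storage,
            // roughly analogous to fsync; force(false) only requires the content,
            // which is closer to fdatasync.
            channel.force(true);
        }
    }
}
```

Without the force call, the write would normally still succeed, but the bytes could sit in the kernel's cache for a while, which is exactly the delayed-write window the passage describes.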

So what exactly does Lucene do?

private final void commitInternal(MergePolicy mergePolicy) throws IOException {
  
  if (infoStream.isEnabled("IW")) {
    infoStream.message("IW", "commit: start");
  }
  
  synchronized(commitLock) {
    ensureOpen(false);
  
    if (infoStream.isEnabled("IW")) {
      infoStream.message("IW", "commit: enter lock");
    }
  
    if (pendingCommit == null) {
      if (infoStream.isEnabled("IW")) {
        infoStream.message("IW", "commit: now prepare");
      }
      prepareCommitInternal(mergePolicy); // the key line
    } else {
      if (infoStream.isEnabled("IW")) {
        infoStream.message("IW", "commit: already prepared");
      }
    }
  
    finishCommit();
  }
}

prepareCommitInternal is where the detailed flush work happens; index flushing is covered in a separate wiki page.
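The control flow of commitInternal (prepare only if nothing is pending, then finish) is a small two-phase commit. The sketch below reproduces just that shape. TwoPhaseCommitSketch and its members are hypothetical, documents are plain strings, and "preparing" merely snapshots an in-memory buffer instead of flushing segments.

```java
import java.util.ArrayList;
import java.util.List;

class TwoPhaseCommitSketch {
    private final Object commitLock = new Object();
    private final List<String> buffered = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();
    private List<String> pendingCommit; // null when nothing has been prepared
    private int prepareCalls = 0;

    void add(String doc) {
        synchronized (commitLock) {
            buffered.add(doc);
        }
    }

    /** Phase 1: snapshot the buffered changes so they can be published later. */
    void prepareCommit() {
        synchronized (commitLock) {
            if (pendingCommit != null) {
                throw new IllegalStateException("a commit is already prepared");
            }
            pendingCommit = new ArrayList<>(buffered);
            buffered.clear();
            prepareCalls++;
        }
    }

    /** Phase 2: prepare if the caller has not done so, then publish the snapshot. */
    void commit() {
        synchronized (commitLock) {
            if (pendingCommit == null) {
                prepareCommit(); // Java monitors are reentrant, so nesting the lock is safe
            }
            committed.addAll(pendingCommit);
            pendingCommit = null;
        }
    }

    List<String> committedDocs() {
        return new ArrayList<>(committed);
    }

    int prepareCalls() {
        return prepareCalls;
    }
}
```

Splitting the work this way lets a caller run the expensive phase early and keep the final, cheap step for the moment the change must become visible. A document added after prepareCommit is not part of that commit and waits for the next one.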

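VI. Closing the IndexWriter

Closing flushes the data and releases resources. Digging in, there is quite a lot of logic. We will come back to this part once flush has been covered in detail.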

0 0