Whoosh User Manual (Index), Part 4

Source: Internet. Published by: 中国联通网络解锁助手. Editor: 程序博客网. Date: 2024/04/27 14:19

How to Index documents

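Creating an Index object

You can create an Index object with the index.create_in() function:

import os, os.path
from whoosh import index

# "schema" is assumed to be a whoosh.fields.Schema you have already defined
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

ix = index.create_in("indexdir", schema)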

To open an existing index stored in a directory, use index.open_dir():

import whoosh.index as index

ix = index.open_dir("indexdir")

These are convenience functions for the following:

from whoosh.filedb.filestore import FileStorage

storage = FileStorage("indexdir")
# Create an index
ix = storage.create_index(schema)
# Open an existing index
ix = storage.open_index()

The schema you create the index with is pickled and stored with the index.
You can keep multiple indexes in the same directory by giving each one a name with the indexname keyword argument:

# Using the convenience functions
ix = index.create_in("indexdir", schema=schema, indexname="usages")
ix = index.open_dir("indexdir", indexname="usages")
# Using the Storage object
ix = storage.create_index(schema, indexname="usages")
ix = storage.open_index(indexname="usages")

Clearing the index

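Calling index.create_in() on a directory that already contains an index clears the current contents of that index.
To test whether a directory currently contains a valid index, use index.exists_in():

exists = index.exists_in("indexdir")
usages_exists = index.exists_in("indexdir", indexname="usages")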

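(Alternatively, you can simply delete the index's files from the directory. For example, if the directory contains only that one index, use shutil.rmtree() to remove the directory and then recreate the index.)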
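The delete-and-recreate approach can be sketched with the standard library alone. The helper name reset_index_dir and the file name stale_file are hypothetical, and the sketch assumes the directory holds nothing but the index. After the reset you would call index.create_in() on the empty directory as shown earlier.

```python
import os
import shutil
import tempfile


def reset_index_dir(dirname):
    """Remove dirname and everything in it, then recreate it empty.

    Hypothetical helper: only safe when the directory is dedicated
    to a single index and contains nothing else worth keeping.
    """
    if os.path.isdir(dirname):
        shutil.rmtree(dirname)
    os.makedirs(dirname)
    return dirname


# Demo in a throwaway temporary directory
base = tempfile.mkdtemp()
index_dir = os.path.join(base, "indexdir")
os.makedirs(index_dir)
# Stand-in for leftover index files
open(os.path.join(index_dir, "stale_file"), "w").close()

reset_index_dir(index_dir)
remaining = os.listdir(index_dir)
```

Compared with calling index.create_in() on the existing directory, this also removes any files that do not belong to the index, so it should only be used on a dedicated directory.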

Indexing documents

Once you have created an Index object, you can add documents to the index with an IndexWriter object. Call Index.writer() to get one:

ix = index.open_dir("index")
writer = ix.writer()

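Creating a writer locks the index for writing, so only one thread or process at a time can have a writer open.

Note: Because opening a writer acquires a lock, trying to open a second writer while one is already open in another thread or process may raise an exception (whoosh.store.LockError). Whoosh includes a couple of example implementations of ways to work around the write lock (whoosh.writing.AsyncWriter and whoosh.writing.BufferedWriter).
Note: While a writer is open and during the commit, the index is still available for reading. Existing readers are unaffected, and new readers can open the current index normally. Once the commit finishes, existing readers continue to see only the previous version of the index, while newly created readers see the latest committed contents.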

The IndexWriter's add_document(**kwargs) method accepts keyword arguments that map field names to values:

writer = ix.writer()
writer.add_document(title=u"My document", content=u"This is my document!",
                    path=u"/a", tags=u"first short", icon=u"/icons/star.png")
writer.add_document(title=u"Second try", content=u"This is the second example.",
                    path=u"/b", tags=u"second short", icon=u"/icons/sheep.png")
writer.add_document(title=u"Third time's the charm", content=u"Examples are many.",
                    path=u"/c", tags=u"short", icon=u"/icons/book.png")
writer.commit()

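You don't have to supply a value for every field; Whoosh doesn't care if you leave a field out of a document. Indexed fields must be passed a unicode string, while fields that are stored but not indexed can be passed any picklable object. Whoosh also lets you add documents with identical field values:

writer.add_document(path=u"/a", title=u"A", content=u"Hello there")
writer.add_document(path=u"/a", title=u"A", content=u"Deja vu!")

This adds two documents to the index with identical path and title fields. See the update_document method below, which uses a unique field to replace an old document instead of appending a new one.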

Indexing and storing different values for the same field

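If you have a field that is both indexed and stored, you can index a unicode value but store a different object if necessary (this is usually not needed, but is sometimes very useful) by using the special keyword argument _stored_<fieldname>. The normal value is analyzed and indexed, while the stored value is what shows up in search results: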

writer.add_document(title=u"Title to be indexed", _stored_title=u"Stored title")

Finishing adding documents

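An IndexWriter object works like a database transaction. You specify a series of changes to the index and then commit them all at once. Calling commit() saves the writer's changes to the index: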

writer.commit()

Once your documents are in the index, you can search for them. If you want to close the writer without saving any of the changes you made, call cancel() instead:

writer.cancel()

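Keep in mind that while a writer is open, no other thread or process can get a writer or modify the index. A writer also usually holds several files open, so always call commit() or cancel() when you are done with it.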
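To make the locking and commit/cancel semantics concrete, here is a minimal in-memory model. It is only a sketch of the behavior described above, not Whoosh's implementation: ToyIndex, ToyWriter, and ToyLockError are hypothetical names, and a committed "index" is just a tuple of dicts.

```python
import threading


class ToyLockError(Exception):
    """Raised when a second writer is requested while one is open."""


class ToyIndex:
    def __init__(self):
        self._docs = ()                # committed documents (immutable)
        self._lock = threading.Lock()  # the single write lock

    def writer(self):
        # Non-blocking acquire: fail immediately if a writer is open
        if not self._lock.acquire(blocking=False):
            raise ToyLockError("index is locked by another writer")
        return ToyWriter(self)

    def reader(self):
        # A reader keeps the snapshot that was committed when it opened
        return self._docs


class ToyWriter:
    def __init__(self, index):
        self._index = index
        self._pending = []

    def add_document(self, **fields):
        self._pending.append(fields)

    def commit(self):
        # Publish all pending changes at once, then release the lock
        self._index._docs = self._index._docs + tuple(self._pending)
        self._index._lock.release()

    def cancel(self):
        # Throw the pending changes away, then release the lock
        self._pending = []
        self._index._lock.release()


ix = ToyIndex()
old_view = ix.reader()            # opened before any commit

w = ix.writer()
w.add_document(path="/a")
try:
    ix.writer()                   # a second writer is refused
    second_writer_blocked = False
except ToyLockError:
    second_writer_blocked = True
w.commit()
new_view = ix.reader()            # opened after the commit

w2 = ix.writer()                  # the lock is free again
w2.add_document(path="/b")
w2.cancel()                       # nothing from w2 is saved
after_cancel = ix.reader()
```

The model shows why forgetting to call commit() or cancel() is a problem: the lock is only released by one of those two calls, so an abandoned writer blocks every later one.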

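Merging segments

A Whoosh filedb index is actually made up of one or more "sub-indexes" called segments. When you add documents to an index, Whoosh does not integrate them with the existing documents (which could be very expensive, because it involves re-sorting all the indexed terms on disk). Instead it creates a new segment next to the existing ones. When you search the index, Whoosh searches the segments individually and merges the results so that they appear to come from a single index. (This design is copied from Lucene.)
Adding a segment for new documents is therefore much more efficient than rewriting the whole index each time. However, searching multiple segments does slow searching down somewhat, and the more segments there are, the slower it gets. So when you commit, Whoosh runs an algorithm that merges small segments into fewer, larger ones.
To prevent Whoosh from merging segments during a commit, set the merge keyword argument to False: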

writer.commit(merge = False)

To merge all segments together, optimizing the index into a single segment, use the optimize keyword argument:

writer.commit(optimize = True)

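Because optimizing rewrites all the information in the index, it can be slow on a large index. It is generally better to rely on Whoosh's own merging algorithm than to optimize all the time.

(The Index object also has an optimize() method that optimizes the index. It simply creates a writer and calls commit(optimize=True) on it.)

For more control over merging, you can write your own merge policy function and pass it as an argument to the commit() method. See the implementations of the NO_MERGE, MERGE_SMALL, and OPTIMIZE functions in the whoosh.filedb.filewriting module.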
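The trade-off between cheap commits and fast searches can be illustrated with a toy model. This is a sketch under simplifying assumptions, not Whoosh's merge algorithm: a segment is just a sorted list of (term, doc_id) postings, and the size threshold "small" is a hypothetical parameter invented for the illustration.

```python
def make_segment(docs):
    """Build one segment: a sorted list of unique (term, doc_id) postings."""
    return sorted({(term, doc_id)
                   for doc_id, text in docs.items()
                   for term in text.lower().split()})


class ToySegmentedIndex:
    def __init__(self):
        self.segments = []

    def commit(self, docs, merge=True, optimize=False, small=4):
        # New documents always go into a fresh segment;
        # existing segments are not rewritten at this point.
        self.segments.append(make_segment(docs))
        if optimize:
            self._merge(lambda seg: True)                # merge everything
        elif merge:
            self._merge(lambda seg: len(seg) <= small)   # merge small ones

    def _merge(self, should_merge):
        picked = [s for s in self.segments if should_merge(s)]
        if len(picked) < 2:
            return  # nothing to combine
        kept = [s for s in self.segments if not should_merge(s)]
        # Merging means re-sorting the combined postings (the costly part)
        merged = sorted(p for seg in picked for p in seg)
        self.segments = kept + [merged]

    def search(self, term):
        # Every segment is consulted and the results are combined
        return sorted({doc_id
                       for seg in self.segments
                       for t, doc_id in seg if t == term})


ix = ToySegmentedIndex()
ix.commit({1: "hello world"}, merge=False)
ix.commit({2: "hello again"}, merge=False)
segments_before = len(ix.segments)
hits_before = ix.search("hello")

ix.commit({3: "more words"}, optimize=True)
segments_after = len(ix.segments)
hits_after = ix.search("hello")
```

Search results are the same before and after optimizing. Only the number of segments that each search has to visit changes.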

Delete Document

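You can delete documents with the following IndexWriter methods, then call commit() to save the deletions to disk.
delete_document(docnum)
A low-level method that deletes a document by its internal document number.
is_deleted(docnum)
Returns True if the document with the given internal number is deleted.
delete_by_term(fieldname, termtext)
Deletes any documents where the given (indexed) field contains the given term.
delete_by_query(query)
Deletes any documents that match the given query.

# Delete a document by its path -- this field must be indexed
writer = ix.writer()
writer.delete_by_term("path", u"/a/b/c")
# Save the deletion to disk
writer.commit()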

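In the filedb backend, "deleting" a document simply adds its document number to a list of deleted documents stored with the index. Searches do not return documents on the deletion list, but the documents' contents remain in the index and the index statistics are not updated until the segments are merged. (This is because removing the information immediately would require rewriting the index on disk, which would be very inefficient.)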
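The deletion list can be pictured with a small in-memory model. This is an illustrative sketch only: ToySegment and its methods are hypothetical and merely mirror the behavior described above (deleted documents are hidden from searches but physically remain until a merge rewrites the segment).

```python
class ToySegment:
    def __init__(self, docs):
        self.docs = list(docs)   # stored documents, addressed by position
        self.deleted = set()     # the "deletion list" of document numbers

    def delete_by_term(self, fieldname, text):
        # Marking is cheap: nothing is rewritten, only numbers are recorded
        for docnum, doc in enumerate(self.docs):
            if doc.get(fieldname) == text:
                self.deleted.add(docnum)

    def is_deleted(self, docnum):
        return docnum in self.deleted

    def search(self, fieldname, text):
        # Documents on the deletion list are filtered out of results
        return [doc for docnum, doc in enumerate(self.docs)
                if docnum not in self.deleted and doc.get(fieldname) == text]

    def merged(self):
        """Rewrite the segment without the deleted documents."""
        return ToySegment(doc for docnum, doc in enumerate(self.docs)
                          if docnum not in self.deleted)


seg = ToySegment([{"path": "/a"}, {"path": "/b"}])
seg.delete_by_term("path", "/a")

visible = seg.search("path", "/a")   # hidden from searches
still_stored = len(seg.docs)         # but still physically present

compacted = seg.merged()             # a merge finally drops it
stored_after_merge = len(compacted.docs)
```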

Updating document

If you want to replace (re-index) an existing document, you can delete it and then add the new version. You can also do both in one step with the IndexWriter's update_document method.
For update_document to work, at least one field in the schema must be marked as unique. Whoosh uses the contents of that field to find the documents to delete:

from whoosh import index
from whoosh.fields import Schema, ID, TEXT

schema = Schema(path=ID(unique=True), content=TEXT)

# The "index" directory is assumed to exist already
ix = index.create_in("index", schema)
writer = ix.writer()
writer.add_document(path=u"/a", content=u"The first document")
writer.add_document(path=u"/b", content=u"The second document")
writer.commit()

writer = ix.writer()
# Because "path" is marked as unique, calling update_document with path=u"/a"
# will delete any existing documents where the path field contains "/a"
writer.update_document(path=u"/a", content=u"Replacement for the first document")
writer.commit()

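The "unique" field must be indexed.
If no existing document matches the value of the unique field, the method behaves just like add_document.

"Unique" fields and update_document are simply conveniences for deleting and then adding. Whoosh has no inherent concept of a unique identifier, and has no way to enforce uniqueness when you use add_document.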
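The difference between the two methods can be shown with plain lists and dicts. This is a toy sketch of the "delete, then add" behavior described above, not Whoosh code: the functions below are hypothetical stand-ins that operate on a list instead of an index.

```python
def add_document(docs, **fields):
    # Appending never checks for duplicates
    docs.append(fields)


def update_document(docs, unique_fields, **fields):
    # Delete every document that matches on any supplied unique field...
    keys = [name for name in unique_fields if name in fields]
    docs[:] = [doc for doc in docs
               if not any(doc.get(k) == fields[k] for k in keys)]
    # ...then add the replacement
    docs.append(fields)


docs = []
add_document(docs, path="/a", content="first version")
add_document(docs, path="/a", content="accidental duplicate")
count_after_adds = len(docs)      # both copies are kept

update_document(docs, ["path"], path="/a", content="replacement")
count_after_update = len(docs)    # both old copies are replaced by one

update_document(docs, ["path"], path="/z", content="brand new")
count_after_new = len(docs)       # no match, so it acts like add_document
```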

Incremental indexing
When you index a collection of documents, there are two ways to go about it: re-index everything from scratch, or update only the documents that have changed.
Indexing everything from scratch is easy:

import os.path
from whoosh import index
from whoosh.fields import Schema, ID, TEXT

def clean_index(dirname):
    # Always create the index from scratch
    ix = index.create_in(dirname, schema=get_schema())
    writer = ix.writer()
    # Assume we have a function that gathers the filenames of the
    # documents to be indexed
    for path in my_docs():
        add_doc(writer, path)
    writer.commit()

def get_schema():
    return Schema(path=ID(unique=True, stored=True), content=TEXT)

def add_doc(writer, path):
    # Read the file as text (UTF-8 is assumed) so the indexed value is unicode
    with open(path, "r", encoding="utf-8") as fileobj:
        content = fileobj.read()
    writer.add_document(path=path, content=content)

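For a small collection of documents, re-indexing everything each time may be fast enough. For a large collection, you will probably want to re-index only the documents that have changed.
To do this we need to store each document's last modification time so we can check whether the file has changed. In this example we use the file's mtime for simplicity:

import os.path
from whoosh.fields import Schema, ID, TEXT, STORED

def get_schema():
    return Schema(path=ID(unique=True, stored=True), time=STORED, content=TEXT)

def add_doc(writer, path):
    # Read the file as text (UTF-8 is assumed) so the indexed value is unicode
    with open(path, "r", encoding="utf-8") as fileobj:
        content = fileobj.read()
    modtime = os.path.getmtime(path)
    writer.add_document(path=path, content=content, time=modtime)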

Now we can write the code that chooses between a clean index and an incremental index:

def index_my_docs(dirname, clean=False):
    if clean:
        clean_index(dirname)
    else:
        incremental_index(dirname)

def incremental_index(dirname):
    ix = index.open_dir(dirname)
    # The set of all paths in the index
    indexed_paths = set()
    # The set of all paths we need to re-index
    to_index = set()

    with ix.searcher() as searcher:
        writer = ix.writer()
        # Loop over the stored fields in the index
        for fields in searcher.all_stored_fields():
            indexed_path = fields['path']
            indexed_paths.add(indexed_path)
            if not os.path.exists(indexed_path):
                # This file was deleted since it was indexed
                writer.delete_by_term('path', indexed_path)
            else:
                # Check if this file was changed since it was indexed
                indexed_time = fields['time']
                mtime = os.path.getmtime(indexed_path)
                if mtime > indexed_time:
                    # The file has changed, delete it and add it to the
                    # list of files to reindex
                    writer.delete_by_term('path', indexed_path)
                    to_index.add(indexed_path)

        # Loop over the files in the filesystem
        # Assume we have a function that gathers the filenames of the
        # documents to be indexed
        for path in my_docs():
            if path in to_index or path not in indexed_paths:
                # This is either a file that's changed, or a new file
                # that wasn't indexed before. So index it!
                add_doc(writer, path)

        writer.commit()

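The incremental_index function:

1. Loops through all the paths that are currently indexed.
a. If a file no longer exists, deletes the corresponding document from the index.
b. If the file still exists but has been modified, adds it to the set of paths to re-index.
c. If the file exists, whether or not it has been modified, adds it to the set of indexed paths.

2. Loops through all the files on disk.
a. If a path is not in the set of indexed paths, the file is new and needs to be indexed.
b. If a path is in the set of paths to re-index, it also needs to be indexed.
c. Otherwise, the file is skipped.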
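The two loops above boil down to a pure decision over two mappings, which is easy to test in isolation. The sketch below is an assumption-laden simplification: plan_incremental is a hypothetical helper, and both the stored index times and the on-disk mtimes are passed in as plain dicts instead of being read from a searcher and the filesystem.

```python
def plan_incremental(indexed, on_disk):
    """Decide what an incremental run has to do.

    indexed: {path: mtime stored in the index}
    on_disk: {path: current mtime of the file}
    Returns (to_delete, to_index) as sets of paths.
    """
    to_delete = set()
    to_index = set()

    # Step 1: walk everything that is currently indexed
    for path, indexed_time in indexed.items():
        if path not in on_disk:
            to_delete.add(path)       # file is gone: drop its document
        elif on_disk[path] > indexed_time:
            to_delete.add(path)       # stale: drop the old version...
            to_index.add(path)        # ...and index the file again

    # Step 2: walk everything on disk
    for path in on_disk:
        if path not in indexed:
            to_index.add(path)        # new file, never indexed before

    return to_delete, to_index


indexed = {"/a": 10.0, "/b": 10.0, "/c": 10.0}
on_disk = {"/a": 10.0, "/b": 12.5, "/d": 11.0}
to_delete, to_index = plan_incremental(indexed, on_disk)
```

Here "/a" is unchanged and skipped, "/b" was modified, "/c" was removed from disk, and "/d" is new. Separating the decision from the I/O like this is a design choice of the sketch, not something the original function does.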