Lucene learning, part 2


As in the previous post, this still follows the same video tutorial, but instead of indexing the txt files under a folder, it indexes content built inside the program itself. This post is mainly about creating, deleting, merging, updating and recovering (undeleting) an index. The key points to take away: creating, deleting and updating the index all go through IndexWriter, while inspecting the index and recovering deleted documents use IndexReader; also note that updating an index is in essence a delete followed by a re-add.
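Before the file-format details, here is a compressed, self-contained sketch of that division of labour. It uses the same Lucene 3.5 calls as the full program later in this post; the index path is only an example and the class name is just for illustration.

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class WriterReaderRoles {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("d:/lucene/index02"));
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));

        Document doc = new Document();
        doc.add(new Field("id", "2", Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.deleteDocuments(new Term("id", "1"));     // mark doc id=1 as deleted (still recoverable)
        writer.updateDocument(new Term("id", "2"), doc); // update = delete matching docs, then add the new doc
        writer.close();

        IndexReader reader = IndexReader.open(dir, false); // readOnly=false is required for undeleteAll()
        System.out.println((reader.maxDoc() - reader.numDocs()) + " docs marked deleted");
        reader.undeleteAll();                              // bring the marked-deleted docs back
        reader.close();
    }
}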

Before looking at the source code, let's first understand the file format of the index:

 

----------------------
_2_1.del
_2.fdt
_2.fdx
_2.fnm
_2.frq
_2.nrm
_2.prx
_2.tii
_2.tis
segments_4
segments.gen
----------------------
1. index
  a. *.fnm file
  This file records the attributes of each field (the sketch after this subsection shows how these flags are set from the indexing API):
  name(STR), isIndexed, omitNorms, storePayloads, omitTermFreqAndPositions,
  storeTermVector, storePositionWithTermVector, storeOffsetWithTermVector (packed into one BYTE)
  The last three attributes are mainly used for highlighting, similar to our DI highlighting: at index time they record, for every term cut out of every field of every DOC, the related occurrence data: frequency, term-based POS, and string-based OFF (character offsets).
  isIndexed - whether the field is indexed; only indexed fields can be searched
  omitNorms - drop the scoring (norm) factor
    .nrm file
    If a field has omitNorms=false, the scoring factor of each DOC for that field is saved; the file format is
    [N, R, M, -1][bytes size=doc num][bytes size=doc num][bytes size=doc num]
    i.e. a header, followed by the per-document scoring factors of the three fields that have omitNorms=false.
    At index time a score (boost) can be set on a DOC, and each field inside the DOC can also be given a score:
    norm[doc][field] = score(doc) * score(field), encoded float -> byte, so precision suffers when this is used carelessly.
    At search time the final score is multiplied by norm[doc][field].
  storePayloads - payloads store attributes of each occurrence of a TERM in a DOC; they can be factors used later for scoring, or any other attribute, and at search time you can choose whether and how to use this data. The key point is that every occurrence of a TERM in a DOC can carry its own payload, whose value is set by the analysis component while tokenizing. Lucene leaves this part very flexible.
  omitTermFreqAndPositions - some fields do not need the FREQ/POS information of a TERM inside a DOC at search time; being able to locate the DOC through the TERM is enough. For such fields it is best to set omitTermFreqAndPositions to true. Such fields normally have no proximity-style queries between TERMs, and no payloads either (payloads ride along with POS).
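As a rough illustration of how these *.fnm flags are driven from the Lucene 3.x indexing API (field names and values here are just examples, imports are from org.apache.lucene.document as in the full program at the end of this post, and the setOmitTermFreqAndPositions call is my assumption about the 3.x setter for that flag, so treat this as a sketch rather than the canonical way):

Document doc = new Document();

// omitNorms: NOT_ANALYZED_NO_NORMS indexes the value but writes nothing for it into *.nrm
Field idField = new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
// a field that only needs to locate DOCs by TERM can also drop FREQ/POS data
idField.setOmitTermFreqAndPositions(true); // assumed 3.x setter for this flag
doc.add(idField);

// an analyzed field that keeps norms, so boosts flow into norm[doc][field]
Field contentField = new Field("content", "i like book", Field.Store.NO, Field.Index.ANALYZED);
contentField.setBoost(2.0f); // score(field)
doc.add(contentField);

doc.setBoost(1.5f); // score(doc); norm[doc][field] = score(doc) * score(field), squeezed into one byte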
  b. *.tii & *.tis file
  *.tis records all TERMs, similar to our termsort file; its format is
  [HEAD related][term bytes, docfreq, freq pointer, prox pointer][…]
  *.tii is the term-dictionary index built on top of *.tis, similar to our tindex* files; its format is
  [HEAD related][term bytes, docfreq, freq pointer, prox pointer, index pointer][…]
  The skip interval defaults to 128: after *.tis has been written at index time, one TERM out of every 128 in *.tis is copied into *.tii, together with its current offset in the *.tis file.
  At search time *.tii is loaded into memory; the principle is the same as ours: when a TERM is not found in *.tii, its index pointer is followed into *.tis to continue the lookup.
  . docfreq - length of the doclist
  . freq pointer - offset in *.frq of the TERM's doclist information
  . prox pointer - offset in *.prx of the POS information of the TERM inside each DOC of its doclist
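At the API level this term dictionary is what IndexReader exposes as a TermEnum. A minimal sketch (Lucene 3.5, reusing the d:/lucene/index02 index built by the program below; the class name is only for illustration) that seeks to the "content" field and walks its terms in sorted order:

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class DumpTerms {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("d:/lucene/index02")));
        // seek into the dictionary at the first term of the "content" field, then walk it
        // in sorted order - the same order in which *.tis stores the terms
        TermEnum terms = reader.terms(new Term("content", ""));
        do {
            Term t = terms.term();
            if (t == null || !"content".equals(t.field())) {
                break; // ran past the end of the "content" field
            }
            // docFreq() is the doclist length stored next to the term in *.tis/*.tii
            System.out.println(t.text() + " -> docfreq=" + terms.docFreq());
        } while (terms.next());
        terms.close();
        reader.close();
    }
}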
  c. *.frq & *.prx
  *.frq stores the DOC stream of each TERM; *.prx stores the POS information of the TERM inside each DOC of its doclist.
  After the TERM's docfreq, freq pointer and prox pointer have been read from *.tii & *.tis:
  1). Ordinary searches that do not need POS information to filter results, e.g. a search for "德国欧洲杯" ("Germany European Cup"), tokenized into TERM:"德国" and TERM:"欧洲杯", only need the TERMDOC stream: locate docfreq and freq pointer for TERM:"德国", seek *.frq to freq pointer, then read docfreq pairs of DOCID & FREQ (omitTermFreqAndPositions=true -> FREQ=1); handle TERM:"欧洲杯" the same way to get another doclist, then take the union | intersection of the two doclists (which one depends on the query settings; in practice the two doclists are not read out and merged wholesale - TERMDOC supports skipping, so the two streams are rolled forward against each other to produce the result).
  2). Some searches use positions to improve relevance, e.g. "德国欧洲杯 ~2", tokenized into TERM:"德国" and TERM:"欧洲杯". This needs the TERMPOSITION stream, which wraps the TERMDOC stream and adds, on top of it, the POS information of the TERM inside each DOC of its doclist.
  TERM:"德国" -> docfreq, freq pointer, prox pointer
  TERM:"欧洲杯" -> docfreq, freq pointer, prox pointer
  freq pointer - *.frq -> doclist for term
  prox pointer - *.prx -> positions per doc
  The two doclists are compared; whenever the same DOCID appears in both, the POS streams for that DOCID are scanned to count how many position pairs from the two streams differ by at most 2. The DOCID is accepted only if there is at least one such pair, and the more matching pairs, the higher the score; as before, this score also builds on the scores of the individual TERMs.
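These two streams are what IndexReader hands out as TermDocs (backed by *.frq) and TermPositions (backed by *.frq plus *.prx). A minimal sketch under the same assumptions as above (Lucene 3.5, the example index built by the program below, in which the term "like" occurs in the content field):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.store.FSDirectory;

public class DocAndPosStreams {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("d:/lucene/index02")));

        // 1) TERMDOC stream: DOCID & FREQ pairs read from *.frq
        //    (TermDocs also supports skipTo(docId), which is what doclist intersection relies on)
        TermDocs docs = reader.termDocs(new Term("content", "like"));
        while (docs.next()) {
            System.out.println("doc=" + docs.doc() + " freq=" + docs.freq());
        }
        docs.close();

        // 2) TERMPOSITION stream: the same doclist plus, per doc, the POS values read from *.prx
        TermPositions pos = reader.termPositions(new Term("content", "like"));
        while (pos.next()) {
            System.out.print("doc=" + pos.doc() + " positions=");
            for (int i = 0; i < pos.freq(); i++) {
                System.out.print(pos.nextPosition() + " ");
            }
            System.out.println();
        }
        pos.close();
        reader.close();
    }
}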
  d. *.del file
  *.del records the DOCIDs that are marked as deleted. Apart from two ints in the header, each bit of the remaining bytes stands for one DOC; a bit set to 1 means that DOC has been deleted, and it is filtered out of the TERMDOC stream.
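The effect of *.del is visible from the API through the document counters and IndexReader.isDeleted(); a small sketch, again against the example index built below:

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class ShowDeleted {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("d:/lucene/index02")));
        // maxDoc() counts all doc slots, numDocs() only the live ones; the difference is what *.del marks
        System.out.println("deleted docs: " + (reader.maxDoc() - reader.numDocs()));
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) {
                System.out.println("doc " + i + " is marked deleted"); // skipped by the TERMDOC stream
            }
        }
        reader.close();
    }
}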
  e. *.fdx & *.fdt
  *.fdx & *.fdt hold the STOREd data: *.fdx is the index into it, *.fdt holds the stored data itself.
  The *.fdx format is
  [HEAD][fdt file pointer(LONG)][…]
  The last step of a search reads some stored information back to the caller: given a DOCID, *.fdx yields the offset of its stored data inside *.fdt, and *.fdt is then read at that offset. The data can be binary or STR; binary data is compressed, and STR data may optionally be compressed.
  At the offset in *.fdt, a VINT is read first giving the number of stored fields; then each field is read in turn: first its storage attributes (binary, compressed), then the data length, and finally the data itself.

The file-format notes above (shown in colour in the original post) are quoted from http://www.cnblogs.com/mandela/archive/2012/06/11/2545254.html
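To tie point (e) back to the API: reading the stored data back is a single call per DOCID. The search() method in the program below does it through IndexSearcher.doc(); a minimal sketch (Lucene 3.5) doing the same thing directly through IndexReader:

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class ReadStoredFields {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("d:/lucene/index02")));
        int docId = 0; // any live DOCID
        // the DOCID is looked up in *.fdx, which points at the record in *.fdt
        Document doc = reader.document(docId);
        for (Fieldable f : doc.getFields()) {
            // only fields indexed with Field.Store.YES come back here
            System.out.println(f.name() + " = " + f.stringValue());
        }
        reader.close();
    }
}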


The experimental program uses the packages lucene-core-3.5.0.jar, junit-4.7.jar and commons-io-2.4.jar.

Program source code:

package cn.edu.hit.lx;

import java.io.File;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;

public class indexutil {

    private String[] sds = { "1", "2", "3", "4", "5", "6" };
    private String[] emails = { "aa@mtlab.org", "bb@hit.org", "cc@lx.org",
            "dd@hit.org", "ee@hit.org", "ff@hit.org" };
    private String[] content = { "welcome to visit the space,i like book",
            "hello boy,i like book", "my name is cc,i like game",
            "i like football", "i like football and i like basketball too",
            "i like movie and swimming" };
    private int[] attachs = { 1, 2, 3, 4, 5, 5 };
    private Date[] dates = {};
    private String[] names = { "zhangshan", "lisi", "jahan", "jetts",
            "michael", "jack" };
    private Directory directory = null;
    private Map<String, Float> scores = new HashMap<String, Float>();
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");

    public indexutil() {
        try {
            setdates();
            scores.put("mtlab.org", 5.5f);
            scores.put("lx.org", 7.5f);
            directory = FSDirectory.open(new File("d:/lucene/index02"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void setdates() {
        dates = new Date[sds.length];
        try {
            dates[0] = sdf.parse("2012-12-15");
            dates[1] = sdf.parse("2012-12-14");
            dates[2] = sdf.parse("2012-12-13");
            dates[3] = sdf.parse("2012-12-12");
            dates[4] = sdf.parse("2012-12-11");
            dates[5] = sdf.parse("2012-12-10");
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }

    public void query() {
        try {
            IndexReader reader = IndexReader.open(directory);
            // the reader is the cheap way to get at the document counts
            System.out.println("numdocs:" + reader.numDocs());
            System.out.println("maxdocs:" + reader.maxDoc());
            System.out.println("deletedocs:" + reader.numDeletedDocs());
            reader.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void delete() {
        IndexWriter writer = null;
        try {
            writer = new IndexWriter(directory, new IndexWriterConfig(
                    Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
            // the argument can be a Query or a Term; a Term means an exact match
            // the documents are not physically removed yet - they sit in a "recycle bin" and can still be recovered
            writer.deleteDocuments(new Term("id", "1"));
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (writer != null)
                    writer.close();
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public void undelete() {
        // recover documents that were only marked as deleted
        // the IndexReader must be opened with readOnly=false
        try {
            IndexReader reader = IndexReader.open(directory, false);
            reader.undeleteAll();
            reader.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void forcedelete() {
        IndexWriter writer = null;
        try {
            writer = new IndexWriter(directory, new IndexWriterConfig(
                    Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
            // physically purge the documents that were only marked as deleted; after this they cannot be recovered
            writer.forceMergeDeletes();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (writer != null)
                    writer.close();
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public void merge() {
        IndexWriter writer = null;
        try {
            writer = new IndexWriter(directory, new IndexWriterConfig(
                    Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
            // remove every document from the index first
            writer.deleteAll();
            // merge the whole index down to one segment; data marked deleted is purged during the merge
            // discouraged from 3.5 on, because a full merge is very expensive
            writer.forceMerge(1);
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (writer != null)
                    writer.close();
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public void update() {
        IndexWriter writer = null;
        try {
            writer = new IndexWriter(directory, new IndexWriterConfig(
                    Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
            Document doc = new Document();
            doc.add(new Field("id", sds[0], Field.Store.YES,
                    Field.Index.NOT_ANALYZED_NO_NORMS));
            doc.add(new Field("email", emails[0], Field.Store.YES,
                    Field.Index.NOT_ANALYZED));
            doc.add(new Field("content", content[0], Field.Store.NO,
                    Field.Index.ANALYZED));
            doc.add(new Field("name", names[0], Field.Store.YES,
                    Field.Index.NOT_ANALYZED));
            /*
             * lucene has no real update operation; this "update" is the union
             * of two operations: delete first, then add
             */
            writer.updateDocument(new Term("id", "1"), doc);
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (writer != null)
                    writer.close();
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public void index() {
        IndexWriter writer = null;
        try {
            writer = new IndexWriter(directory, new IndexWriterConfig(
                    Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
            Document doc = null;
            for (int i = 0; i < sds.length; i++) {
                doc = new Document();
                doc.add(new Field("id", sds[i], Field.Store.YES,
                        Field.Index.NOT_ANALYZED_NO_NORMS));
                doc.add(new Field("email", emails[i], Field.Store.YES,
                        Field.Index.NOT_ANALYZED));
                doc.add(new Field("content", content[i], Field.Store.NO,
                        Field.Index.ANALYZED));
                doc.add(new Field("name", names[i], Field.Store.YES,
                        Field.Index.NOT_ANALYZED));
                // store numeric values
                doc.add(new NumericField("attach", Field.Store.YES, true)
                        .setIntValue(attachs[i]));
                doc.add(new NumericField("date", Field.Store.YES, true)
                        .setLongValue(dates[i].getTime()));
                // boost the document according to the domain of its email address
                String at = emails[i].substring(emails[i].lastIndexOf("@") + 1);
                System.out.println(at);
                if (scores.containsKey(at)) {
                    doc.setBoost(scores.get(at));
                } else {
                    doc.setBoost(0.5f);
                }
                writer.addDocument(doc);
            }
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (writer != null)
                    writer.close();
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public void search() {
        try {
            IndexReader reader = IndexReader.open(directory);
            IndexSearcher searcher = new IndexSearcher(reader);
            TermQuery query = new TermQuery(new Term("content", "like"));
            TopDocs tds = searcher.search(query, 10);
            for (ScoreDoc sd : tds.scoreDocs) {
                Document document = searcher.doc(sd.doc);
                Date dt = new Date(Long.parseLong(document.get("date")));
                String dString = sdf.format(dt);
                // getBoost() here is called on a freshly reconstructed Document,
                // so it does not return the boost that was set at index time
                System.out.println(document.getBoost() + document.get("name")
                        + "{" + document.get("email") + "|-->"
                        + document.get("id") + "|-->" + document.get("attach")
                        + "|-->" + "}" + dString + "!-->" + sd.score);
            }
            searcher.close();
            reader.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}


Test code:

package cn.edu.hit.lx;

import org.junit.Test;

public class testIndex {

    @Test
    public void testIndex() {
        indexutil ix = new indexutil();
        ix.index();
    }

    @Test
    public void testquery() {
        indexutil ix = new indexutil();
        ix.query();
    }

    @Test
    public void testdelete() {
        indexutil ix = new indexutil();
        ix.delete();
    }

    @Test
    public void testundel() {
        indexutil ix = new indexutil();
        ix.undelete();
    }

    @Test
    public void testforcedel() {
        indexutil ix = new indexutil();
        ix.forcedelete();
    }

    @Test
    public void testmerge() {
        indexutil ix = new indexutil();
        ix.merge();
    }

    @Test
    public void testupdate() {
        indexutil ix = new indexutil();
        ix.update();
    }

    @Test
    public void testsearch() {
        indexutil ix = new indexutil();
        ix.search();
    }
}



