Lucene learning, part 2


As in the previous post, this still follows the same video tutorial, but instead of indexing the txt files under a folder, it indexes content built inside the program itself. This post is mainly about creating, deleting, merging, updating and recovering (undeleting) an index. The key points to take away: creating, deleting and updating the index all go through IndexWriter, while inspecting the index and recovering deleted documents use IndexReader; also note that updating an index is in essence a delete followed by a re-add.
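Before the file-format details, here is a compressed, self-contained sketch of that division of labour. It uses the same Lucene 3.5 calls as the full program later in this post; the index path is only an example and the class name is just for illustration.

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class WriterReaderRoles {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("d:/lucene/index02"));
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));

        Document doc = new Document();
        doc.add(new Field("id", "2", Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.deleteDocuments(new Term("id", "1"));     // mark doc id=1 as deleted (still recoverable)
        writer.updateDocument(new Term("id", "2"), doc); // update = delete matching docs, then add the new doc
        writer.close();

        IndexReader reader = IndexReader.open(dir, false); // readOnly=false is required for undeleteAll()
        System.out.println((reader.maxDoc() - reader.numDocs()) + " docs marked deleted");
        reader.undeleteAll();                              // bring the marked-deleted docs back
        reader.close();
    }
}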

Before looking at the source code, let's first understand the file format of the index:

 

----------------------
_2_1.del
_2.fdt
_2.fdx
_2.fnm
_2.frq
_2.nrm
_2.prx
_2.tii
_2.tis
segments_4
segments.gen
----------------------
1. index
  a. *.fnm file
  This file records the attributes of each field (the sketch after this subsection shows how these flags are set from the indexing API):
  name(STR), isIndexed, omitNorms, storePayloads, omitTermFreqAndPositions,
  storeTermVector, storePositionWithTermVector, storeOffsetWithTermVector (packed into one BYTE)
  The last three attributes are mainly used for highlighting, similar to our DI highlighting: at index time they record, for every term cut out of every field of every DOC, the related occurrence data: frequency, term-based POS, and string-based OFF (character offsets).
  isIndexed - whether the field is indexed; only indexed fields can be searched
  omitNorms - drop the scoring (norm) factor
    .nrm file
    If a field has omitNorms=false, the scoring factor of each DOC for that field is saved; the file format is
    [N, R, M, -1][bytes size=doc num][bytes size=doc num][bytes size=doc num]
    i.e. a header, followed by the per-document scoring factors of the three fields that have omitNorms=false.
    At index time a score (boost) can be set on a DOC, and each field inside the DOC can also be given a score:
    norm[doc][field] = score(doc) * score(field), encoded float -> byte, so precision suffers when this is used carelessly.
    At search time the final score is multiplied by norm[doc][field].
  storePayloads - payloads store attributes of each occurrence of a TERM in a DOC; they can be factors used later for scoring, or any other attribute, and at search time you can choose whether and how to use this data. The key point is that every occurrence of a TERM in a DOC can carry its own payload, whose value is set by the analysis component while tokenizing. Lucene leaves this part very flexible.
  omitTermFreqAndPositions - some fields do not need the FREQ/POS information of a TERM inside a DOC at search time; being able to locate the DOC through the TERM is enough. For such fields it is best to set omitTermFreqAndPositions to true. Such fields normally have no proximity-style queries between TERMs, and no payloads either (payloads ride along with POS).
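As a rough illustration of how these *.fnm flags are driven from the Lucene 3.x indexing API (field names and values here are just examples, imports are from org.apache.lucene.document as in the full program at the end of this post, and the setOmitTermFreqAndPositions call is my assumption about the 3.x setter for that flag, so treat this as a sketch rather than the canonical way):

Document doc = new Document();

// omitNorms: NOT_ANALYZED_NO_NORMS indexes the value but writes nothing for it into *.nrm
Field idField = new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
// a field that only needs to locate DOCs by TERM can also drop FREQ/POS data
idField.setOmitTermFreqAndPositions(true); // assumed 3.x setter for this flag
doc.add(idField);

// an analyzed field that keeps norms, so boosts flow into norm[doc][field]
Field contentField = new Field("content", "i like book", Field.Store.NO, Field.Index.ANALYZED);
contentField.setBoost(2.0f); // score(field)
doc.add(contentField);

doc.setBoost(1.5f); // score(doc); norm[doc][field] = score(doc) * score(field), squeezed into one byte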
  b. *.tii & *.tis file
  *.tis records all TERMs, similar to our termsort file; its format is
  [HEAD related][term bytes, docfreq, freq pointer, prox pointer][…]
  *.tii is the term-dictionary index built on top of *.tis, similar to our tindex* files; its format is
  [HEAD related][term bytes, docfreq, freq pointer, prox pointer, index pointer][…]
  The skip interval defaults to 128: after *.tis has been written at index time, one TERM out of every 128 in *.tis is copied into *.tii, together with its current offset in the *.tis file.
  At search time *.tii is loaded into memory; the principle is the same as ours: when a TERM is not found in *.tii, its index pointer is followed into *.tis to continue the lookup.
  . docfreq - length of the doclist
  . freq pointer - offset in *.frq of the TERM's doclist information
  . prox pointer - offset in *.prx of the POS information of the TERM inside each DOC of its doclist
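At the API level this term dictionary is what IndexReader exposes as a TermEnum. A minimal sketch (Lucene 3.5, reusing the d:/lucene/index02 index built by the program below; the class name is only for illustration) that seeks to the "content" field and walks its terms in sorted order:

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class DumpTerms {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("d:/lucene/index02")));
        // seek into the dictionary at the first term of the "content" field, then walk it
        // in sorted order - the same order in which *.tis stores the terms
        TermEnum terms = reader.terms(new Term("content", ""));
        do {
            Term t = terms.term();
            if (t == null || !"content".equals(t.field())) {
                break; // ran past the end of the "content" field
            }
            // docFreq() is the doclist length stored next to the term in *.tis/*.tii
            System.out.println(t.text() + " -> docfreq=" + terms.docFreq());
        } while (terms.next());
        terms.close();
        reader.close();
    }
}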
  c. *.frq & *.prx
  *.frq stores the DOC stream of each TERM; *.prx stores the POS information of the TERM inside each DOC of its doclist.
  After the TERM's docfreq, freq pointer and prox pointer have been read from *.tii & *.tis:
  1). Ordinary searches that do not need POS information to filter results, e.g. a search for "德国欧洲杯" ("Germany European Cup"), tokenized into TERM:"德国" and TERM:"欧洲杯", only need the TERMDOC stream: locate docfreq and freq pointer for TERM:"德国", seek *.frq to freq pointer, then read docfreq pairs of DOCID & FREQ (omitTermFreqAndPositions=true -> FREQ=1); handle TERM:"欧洲杯" the same way to get another doclist, then take the union | intersection of the two doclists (which one depends on the query settings; in practice the two doclists are not read out and merged wholesale - TERMDOC supports skipping, so the two streams are rolled forward against each other to produce the result).
  2). Some searches use positions to improve relevance, e.g. "德国欧洲杯 ~2", tokenized into TERM:"德国" and TERM:"欧洲杯". This needs the TERMPOSITION stream, which wraps the TERMDOC stream and adds, on top of it, the POS information of the TERM inside each DOC of its doclist.
  TERM:"德国" -> docfreq, freq pointer, prox pointer
  TERM:"欧洲杯" -> docfreq, freq pointer, prox pointer
  freq pointer - *.frq -> doclist for term
  prox pointer - *.prx -> positions per doc
  The two doclists are compared; whenever the same DOCID appears in both, the POS streams for that DOCID are scanned to count how many position pairs from the two streams differ by at most 2. The DOCID is accepted only if there is at least one such pair, and the more matching pairs, the higher the score; as before, this score also builds on the scores of the individual TERMs.
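These two streams are what IndexReader hands out as TermDocs (backed by *.frq) and TermPositions (backed by *.frq plus *.prx). A minimal sketch under the same assumptions as above (Lucene 3.5, the example index built by the program below, in which the term "like" occurs in the content field):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.store.FSDirectory;

public class DocAndPosStreams {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("d:/lucene/index02")));

        // 1) TERMDOC stream: DOCID & FREQ pairs read from *.frq
        //    (TermDocs also supports skipTo(docId), which is what doclist intersection relies on)
        TermDocs docs = reader.termDocs(new Term("content", "like"));
        while (docs.next()) {
            System.out.println("doc=" + docs.doc() + " freq=" + docs.freq());
        }
        docs.close();

        // 2) TERMPOSITION stream: the same doclist plus, per doc, the POS values read from *.prx
        TermPositions pos = reader.termPositions(new Term("content", "like"));
        while (pos.next()) {
            System.out.print("doc=" + pos.doc() + " positions=");
            for (int i = 0; i < pos.freq(); i++) {
                System.out.print(pos.nextPosition() + " ");
            }
            System.out.println();
        }
        pos.close();
        reader.close();
    }
}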
  d. *.del file
  *.del records the DOCIDs that are marked as deleted. Apart from two ints in the header, each bit of the remaining bytes stands for one DOC; a bit set to 1 means that DOC has been deleted, and it is filtered out of the TERMDOC stream.
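The effect of *.del is visible from the API through the document counters and IndexReader.isDeleted(); a small sketch, again against the example index built below:

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class ShowDeleted {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("d:/lucene/index02")));
        // maxDoc() counts all doc slots, numDocs() only the live ones; the difference is what *.del marks
        System.out.println("deleted docs: " + (reader.maxDoc() - reader.numDocs()));
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) {
                System.out.println("doc " + i + " is marked deleted"); // skipped by the TERMDOC stream
            }
        }
        reader.close();
    }
}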
  e. *.fdx & *.fdt
  *.fdx & *.fdt hold the STOREd data: *.fdx is the index into it, *.fdt holds the stored data itself.
  The *.fdx format is
  [HEAD][fdt file pointer(LONG)][…]
  The last step of a search reads some stored information back to the caller: given a DOCID, *.fdx yields the offset of its stored data inside *.fdt, and *.fdt is then read at that offset. The data can be binary or STR; binary data is compressed, and STR data may optionally be compressed.
  At the offset in *.fdt, a VINT is read first giving the number of stored fields; then each field is read in turn: first its storage attributes (binary, compressed), then the data length, and finally the data itself.

The file-format notes above (shown in colour in the original post) are quoted from http://www.cnblogs.com/mandela/archive/2012/06/11/2545254.html
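To tie point (e) back to the API: reading the stored data back is a single call per DOCID. The search() method in the program below does it through IndexSearcher.doc(); a minimal sketch (Lucene 3.5) doing the same thing directly through IndexReader:

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class ReadStoredFields {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("d:/lucene/index02")));
        int docId = 0; // any live DOCID
        // the DOCID is looked up in *.fdx, which points at the record in *.fdt
        Document doc = reader.document(docId);
        for (Fieldable f : doc.getFields()) {
            // only fields indexed with Field.Store.YES come back here
            System.out.println(f.name() + " = " + f.stringValue());
        }
        reader.close();
    }
}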


The experimental program uses the packages lucene-core-3.5.0.jar, junit-4.7.jar and commons-io-2.4.jar.

Program source code:

package cn.edu.hit.lx;

import java.io.File;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;

public class indexutil {

    private String[] sds = { "1", "2", "3", "4", "5", "6" };
    private String[] emails = { "aa@mtlab.org", "bb@hit.org", "cc@lx.org",
            "dd@hit.org", "ee@hit.org", "ff@hit.org" };
    private String[] content = { "welcome to visit the space,i like book",
            "hello boy,i like book", "my name is cc,i like game",
            "i like football", "i like football and i like basketball too",
            "i like movie and swimming" };
    private int[] attachs = { 1, 2, 3, 4, 5, 5 };
    private Date[] dates = {};
    private String[] names = { "zhangshan", "lisi", "jahan", "jetts",
            "michael", "jack" };
    private Directory directory = null;
    private Map<String, Float> scores = new HashMap<String, Float>();
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");

    public indexutil() {
        try {
            setdates();
            scores.put("mtlab.org", 5.5f);
            scores.put("lx.org", 7.5f);
            directory = FSDirectory.open(new File("d:/lucene/index02"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void setdates() {
        dates = new Date[sds.length];
        try {
            dates[0] = sdf.parse("2012-12-15");
            dates[1] = sdf.parse("2012-12-14");
            dates[2] = sdf.parse("2012-12-13");
            dates[3] = sdf.parse("2012-12-12");
            dates[4] = sdf.parse("2012-12-11");
            dates[5] = sdf.parse("2012-12-10");
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }

    public void query() {
        try {
            IndexReader reader = IndexReader.open(directory);
            // the reader is the cheap way to get at the document counts
            System.out.println("numdocs:" + reader.numDocs());
            System.out.println("maxdocs:" + reader.maxDoc());
            System.out.println("deletedocs:" + reader.numDeletedDocs());
            reader.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void delete() {
        IndexWriter writer = null;
        try {
            writer = new IndexWriter(directory, new IndexWriterConfig(
                    Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
            // the argument can be a Query or a Term; a Term means an exact match
            // the documents are not physically removed yet - they sit in a "recycle bin" and can still be recovered
            writer.deleteDocuments(new Term("id", "1"));
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (writer != null)
                    writer.close();
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public void undelete() {
        // recover documents that were only marked as deleted
        // the IndexReader must be opened with readOnly=false
        try {
            IndexReader reader = IndexReader.open(directory, false);
            reader.undeleteAll();
            reader.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void forcedelete() {
        IndexWriter writer = null;
        try {
            writer = new IndexWriter(directory, new IndexWriterConfig(
                    Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
            // physically purge the documents that were only marked as deleted; after this they cannot be recovered
            writer.forceMergeDeletes();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (writer != null)
                    writer.close();
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public void merge() {
        IndexWriter writer = null;
        try {
            writer = new IndexWriter(directory, new IndexWriterConfig(
                    Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
            // remove every document from the index first
            writer.deleteAll();
            // merge the whole index down to one segment; data marked deleted is purged during the merge
            // discouraged from 3.5 on, because a full merge is very expensive
            writer.forceMerge(1);
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (writer != null)
                    writer.close();
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public void update() {
        IndexWriter writer = null;
        try {
            writer = new IndexWriter(directory, new IndexWriterConfig(
                    Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
            Document doc = new Document();
            doc.add(new Field("id", sds[0], Field.Store.YES,
                    Field.Index.NOT_ANALYZED_NO_NORMS));
            doc.add(new Field("email", emails[0], Field.Store.YES,
                    Field.Index.NOT_ANALYZED));
            doc.add(new Field("content", content[0], Field.Store.NO,
                    Field.Index.ANALYZED));
            doc.add(new Field("name", names[0], Field.Store.YES,
                    Field.Index.NOT_ANALYZED));
            /*
             * lucene has no real update operation; this "update" is the union
             * of two operations: delete first, then add
             */
            writer.updateDocument(new Term("id", "1"), doc);
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (writer != null)
                    writer.close();
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public void index() {
        IndexWriter writer = null;
        try {
            writer = new IndexWriter(directory, new IndexWriterConfig(
                    Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
            Document doc = null;
            for (int i = 0; i < sds.length; i++) {
                doc = new Document();
                doc.add(new Field("id", sds[i], Field.Store.YES,
                        Field.Index.NOT_ANALYZED_NO_NORMS));
                doc.add(new Field("email", emails[i], Field.Store.YES,
                        Field.Index.NOT_ANALYZED));
                doc.add(new Field("content", content[i], Field.Store.NO,
                        Field.Index.ANALYZED));
                doc.add(new Field("name", names[i], Field.Store.YES,
                        Field.Index.NOT_ANALYZED));
                // store numeric values
                doc.add(new NumericField("attach", Field.Store.YES, true)
                        .setIntValue(attachs[i]));
                doc.add(new NumericField("date", Field.Store.YES, true)
                        .setLongValue(dates[i].getTime()));
                // boost the document according to the domain of its email address
                String at = emails[i].substring(emails[i].lastIndexOf("@") + 1);
                System.out.println(at);
                if (scores.containsKey(at)) {
                    doc.setBoost(scores.get(at));
                } else {
                    doc.setBoost(0.5f);
                }
                writer.addDocument(doc);
            }
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (writer != null)
                    writer.close();
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public void search() {
        try {
            IndexReader reader = IndexReader.open(directory);
            IndexSearcher searcher = new IndexSearcher(reader);
            TermQuery query = new TermQuery(new Term("content", "like"));
            TopDocs tds = searcher.search(query, 10);
            for (ScoreDoc sd : tds.scoreDocs) {
                Document document = searcher.doc(sd.doc);
                Date dt = new Date(Long.parseLong(document.get("date")));
                String dString = sdf.format(dt);
                // getBoost() here is called on a freshly reconstructed Document,
                // so it does not return the boost that was set at index time
                System.out.println(document.getBoost() + document.get("name")
                        + "{" + document.get("email") + "|-->"
                        + document.get("id") + "|-->" + document.get("attach")
                        + "|-->" + "}" + dString + "!-->" + sd.score);
            }
            searcher.close();
            reader.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}


Test code:

package cn.edu.hit.lx;

import org.junit.Test;

public class testIndex {

    @Test
    public void testIndex() {
        indexutil ix = new indexutil();
        ix.index();
    }

    @Test
    public void testquery() {
        indexutil ix = new indexutil();
        ix.query();
    }

    @Test
    public void testdelete() {
        indexutil ix = new indexutil();
        ix.delete();
    }

    @Test
    public void testundel() {
        indexutil ix = new indexutil();
        ix.undelete();
    }

    @Test
    public void testforcedel() {
        indexutil ix = new indexutil();
        ix.forcedelete();
    }

    @Test
    public void testmerge() {
        indexutil ix = new indexutil();
        ix.merge();
    }

    @Test
    public void testupdate() {
        indexutil ix = new indexutil();
        ix.update();
    }

    @Test
    public void testsearch() {
        indexutil ix = new indexutil();
        ix.search();
    }
}



