Lucene学习--索引的建立

来源:互联网 发布:淘宝如何发布宝贝详情 编辑:程序博客网 时间:2024/05/16 07:28

本篇目录:

1.索引建立的步骤

2.域的存储选项和索引选项

3.索引的增删查改

4.IndexReader的单例模式

5.域为数字和日期时的处理

6.生命周期

 

1.索引建立的步骤

建立索引就是为了在检索时从索引文件中进行快速查找。

各种类型的文件都需要先转换为文本,然后再通过适当的分词器,建立索引文件。
 

先创建Directory和IndexWriter,然后创建文档Document,之后为文档添加域Field,域的存储选项Field.Store和索引选项Field.Index均需要设置;

文档相当于表中的每一条记录,域相当于表中的每一个字段。

最后通过IndexWriter将文档添加到索引中;

 

2.域的存储选项和索引选项

 

被存储———便于对文本进行还原;当Field.Store设为YES时可调用docment.get方法获取该域的完整内容;
被索引———便于通过该域的值检索到所在的文档
 
在被索引的基础上,又有两个子选项:是否分词、是否加权
是否分词——考量一个域值是否为原子单位,是否允许通过拆分后的片段来检索它;
是否加权——考量是否允许该域值来影响检索到的结果集的顺序;
 
索引域选项:Field.Index
Field.Index的值
是否索引
是否分词
是否加权
NO
X
 

NOT_ANALYZED_NO_NORMS
V
X
X
ANALYZED
V
V
V
ANALYZED_NO_NORMS
V
V
X
NOT_ANALYZED
V
X
V

这里存在有最佳实践。
 

3.索引的增删查改

 

参加网络视频,好像授课者是一位大学老师。直接粘在这里,以备查用。

0、查询索引的基本信息:

通过IndexReader加载索引文件就可以直接获取文档的数量了(numDoc和maxDoc)

1、删除
2、恢复删除

3、强制删除

4、优化和合并索引

5、更新索引

package org.itat.index;import java.io.IOException;import java.text.ParseException;import java.text.SimpleDateFormat;import java.util.Date;import java.util.HashMap;import java.util.Map;import org.apache.lucene.analysis.standard.StandardAnalyzer;import org.apache.lucene.document.Document;import org.apache.lucene.document.Field;import org.apache.lucene.document.NumericField;import org.apache.lucene.index.CorruptIndexException;import org.apache.lucene.index.IndexReader;import org.apache.lucene.index.IndexWriter;import org.apache.lucene.index.IndexWriterConfig;import org.apache.lucene.index.StaleReaderException;import org.apache.lucene.index.Term;import org.apache.lucene.search.IndexSearcher;import org.apache.lucene.search.ScoreDoc;import org.apache.lucene.search.TermQuery;import org.apache.lucene.search.TopDocs;import org.apache.lucene.store.Directory;import org.apache.lucene.store.LockObtainFailedException;import org.apache.lucene.store.RAMDirectory;import org.apache.lucene.util.Version;public class IndexUtil {private String[] ids = {"1","2","3","4","5","6"};private String[] emails = {"aa@itat.org","bb@itat.org","cc@cc.org","dd@sina.org","ee@zttc.edu","ff@itat.org"};private String[] contents = {"welcome to visited the space,I like book","hello boy, I like pingpeng ball","my name is cc I like game","I like football","I like football and I like basketball too","I like movie and swim"};private Date[] dates = null;private int[] attachs = {2,3,1,4,5,5};private String[] names = {"zhangsan","lisi","john","jetty","mike","jake"};private Directory directory = null;private Map<String,Float> scores = new HashMap<String,Float>();private static IndexReader reader = null;public IndexUtil() {try {setDates();scores.put("itat.org",2.0f);scores.put("zttc.edu", 1.5f);//directory = FSDirectory.open(new File("d:/lucene/index02"));directory = new RAMDirectory();index();reader = IndexReader.open(directory,false);} catch (IOException e) {e.printStackTrace();}}public IndexSearcher getSearcher() {try {if(reader==null) {reader = IndexReader.open(directory,false);} else {IndexReader tr = IndexReader.openIfChanged(reader);if(tr!=null) {reader.close();reader = tr;}}return new IndexSearcher(reader);} catch (CorruptIndexException e) {e.printStackTrace();} catch (IOException e) {e.printStackTrace();}return null;}private void setDates() {SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");try {dates = new Date[ids.length];dates[0] = sdf.parse("2010-02-19");dates[1] = sdf.parse("2012-01-11");dates[2] = sdf.parse("2011-09-19");dates[3] = sdf.parse("2010-12-22");dates[4] = sdf.parse("2012-01-01");dates[5] = sdf.parse("2011-05-19");} catch (ParseException e) {e.printStackTrace();}}public void undelete() {//使用IndexReader进行恢复try {IndexReader reader = IndexReader.open(directory,false);//恢复时,必须把IndexReader的只读(readOnly)设置为falsereader.undeleteAll();reader.close();} catch (CorruptIndexException e) {e.printStackTrace();} catch (StaleReaderException e) {e.printStackTrace();} catch (LockObtainFailedException e) {e.printStackTrace();} catch (IOException e) {e.printStackTrace();}}public void merge() {IndexWriter writer = null;try {writer = new IndexWriter(directory,new IndexWriterConfig(Version.LUCENE_35,new StandardAnalyzer(Version.LUCENE_35)));//会将索引合并为2段,这两段中的被删除的数据会被清空//特别注意:此处Lucene在3.5之后不建议使用,因为会消耗大量的开销,//Lucene会根据情况自动处理的writer.forceMerge(2);} catch (CorruptIndexException e) {e.printStackTrace();} catch (LockObtainFailedException e) {e.printStackTrace();} catch (IOException e) {e.printStackTrace();} finally {try {if(writer!=null) writer.close();} catch (CorruptIndexException e) {e.printStackTrace();} catch (IOException e) {e.printStackTrace();}}}public void forceDelete() {IndexWriter writer = null;try {writer = new IndexWriter(directory,new IndexWriterConfig(Version.LUCENE_35,new StandardAnalyzer(Version.LUCENE_35)));writer.forceMergeDeletes();} catch (CorruptIndexException e) {e.printStackTrace();} catch (LockObtainFailedException e) {e.printStackTrace();} catch (IOException e) {e.printStackTrace();} finally {try {if(writer!=null) writer.close();} catch (CorruptIndexException e) {e.printStackTrace();} catch (IOException e) {e.printStackTrace();}}}public void delete() {IndexWriter writer = null;try {writer = new IndexWriter(directory,new IndexWriterConfig(Version.LUCENE_35,new StandardAnalyzer(Version.LUCENE_35)));//参数是一个选项,可以是一个Query,也可以是一个term,term是一个精确查找的值//此时删除的文档并不会被完全删除,而是存储在一个回收站中的,可以恢复writer.deleteDocuments(new Term("id","1"));writer.commit();} catch (CorruptIndexException e) {e.printStackTrace();} catch (LockObtainFailedException e) {e.printStackTrace();} catch (IOException e) {e.printStackTrace();} finally {try {if(writer!=null) writer.close();} catch (CorruptIndexException e) {e.printStackTrace();} catch (IOException e) {e.printStackTrace();}}}public void delete02() {try {reader.deleteDocuments(new Term("id","1"));} catch (CorruptIndexException e) {e.printStackTrace();} catch (LockObtainFailedException e) {e.printStackTrace();} catch (IOException e) {e.printStackTrace();}}public void update() {IndexWriter writer = null;try {writer = new IndexWriter(directory,new IndexWriterConfig(Version.LUCENE_35,new StandardAnalyzer(Version.LUCENE_35)));/* * Lucene并没有提供更新,这里的更新操作其实是如下两个操作的合集 * 先删除之后再添加 */Document doc = new Document();doc.add(new Field("id","11",Field.Store.YES,Field.Index.NOT_ANALYZED_NO_NORMS));doc.add(new Field("email",emails[0],Field.Store.YES,Field.Index.NOT_ANALYZED));doc.add(new Field("content",contents[0],Field.Store.NO,Field.Index.ANALYZED));doc.add(new Field("name",names[0],Field.Store.YES,Field.Index.NOT_ANALYZED_NO_NORMS));writer.updateDocument(new Term("id","1"), doc);} catch (CorruptIndexException e) {e.printStackTrace();} catch (LockObtainFailedException e) {e.printStackTrace();} catch (IOException e) {e.printStackTrace();} finally {try {if(writer!=null) writer.close();} catch (CorruptIndexException e) {e.printStackTrace();} catch (IOException e) {e.printStackTrace();}}}public void query() {try {IndexReader reader = IndexReader.open(directory);//通过reader可以有效的获取到文档的数量System.out.println("numDocs:"+reader.numDocs());System.out.println("maxDocs:"+reader.maxDoc());System.out.println("deleteDocs:"+reader.numDeletedDocs());reader.close();} catch (CorruptIndexException e) {e.printStackTrace();} catch (IOException e) {e.printStackTrace();}}public void index() {IndexWriter writer = null;try {writer = new IndexWriter(directory, new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));writer.deleteAll();Document doc = null;for(int i=0;i<ids.length;i++) {doc = new Document();doc.add(new Field("id",ids[i],Field.Store.YES,Field.Index.NOT_ANALYZED_NO_NORMS));doc.add(new Field("email",emails[i],Field.Store.YES,Field.Index.NOT_ANALYZED));doc.add(new Field("email","test"+i+"@test.com",Field.Store.YES,Field.Index.NOT_ANALYZED));doc.add(new Field("content",contents[i],Field.Store.NO,Field.Index.ANALYZED));doc.add(new Field("name",names[i],Field.Store.YES,Field.Index.NOT_ANALYZED_NO_NORMS));//存储数字doc.add(new NumericField("attach",Field.Store.YES,true).setIntValue(attachs[i]));//存储日期doc.add(new NumericField("date",Field.Store.YES,true).setLongValue(dates[i].getTime()));String et = emails[i].substring(emails[i].lastIndexOf("@")+1);System.out.println(et);if(scores.containsKey(et)) {doc.setBoost(scores.get(et));} else {doc.setBoost(0.5f);}writer.addDocument(doc);}} catch (CorruptIndexException e) {e.printStackTrace();} catch (LockObtainFailedException e) {e.printStackTrace();} catch (IOException e) {e.printStackTrace();} finally {try {if(writer!=null)writer.close();} catch (CorruptIndexException e) {e.printStackTrace();} catch (IOException e) {e.printStackTrace();}}}public void search01() {try {IndexReader reader = IndexReader.open(directory);IndexSearcher searcher = new IndexSearcher(reader);TermQuery query = new TermQuery(new Term("email","test0@test.com"));TopDocs tds = searcher.search(query, 10);for(ScoreDoc sd:tds.scoreDocs) {Document doc = searcher.doc(sd.doc);System.out.println("("+sd.doc+"-"+doc.getBoost()+"-"+sd.score+")"+doc.get("name")+"["+doc.get("email")+"]-->"+doc.get("id")+","+doc.get("attach")+","+doc.get("date")+","+doc.getValues("email")[1]);}reader.close();} catch (CorruptIndexException e) {e.printStackTrace();} catch (IOException e) {e.printStackTrace();}}public void search02() {try {IndexSearcher searcher = getSearcher();TermQuery query = new TermQuery(new Term("content","like"));TopDocs tds = searcher.search(query, 10);for(ScoreDoc sd:tds.scoreDocs) {Document doc = searcher.doc(sd.doc);System.out.println(doc.get("id")+"---->"+doc.get("name")+"["+doc.get("email")+"]-->"+doc.get("id")+","+doc.get("attach")+","+doc.get("date")+","+doc.getValues("email")[1]);}searcher.close();} catch (CorruptIndexException e) {e.printStackTrace();} catch (IOException e) {e.printStackTrace();}}}


4.IndexReader的单例模式

public IndexSearcher getSearcher() {try {if(reader==null) {reader = IndexReader.open(directory,false);} else {IndexReader tr = IndexReader.openIfChanged(reader);if(tr!=null) {reader.close();reader = tr;}}return new IndexSearcher(reader);} catch (CorruptIndexException e) {e.printStackTrace();} catch (IOException e) {e.printStackTrace();}return null;}


 

5.域为数字和日期时的处理

 

//存储数字doc.add(new NumericField("attach",Field.Store.YES,true).setIntValue(attachs[i]));//存储日期doc.add(new NumericField("date",Field.Store.YES,true).setLongValue(dates[i].getTime()));


 

6.IndexReader的生命周期

对于IndexReader而言,之所以使用单例模式,是因为反复使用IndexReader.open打开会有很大的开销,所以一般在整个程序的生命周期中只会打开一个IndexReader,通过这个IndexReader来创建不同的IndexSearcher。

 

但如果使用单例模式,可能出现的问题有:
1、如果IndexWriter在创建完成之后,没有关闭,需要进行commit操作之后才能提交,才会更新索引文件;
2、当使用IndexWriter修改了索引之后,IndexReader不会自动更新它加载过的索引信息,所以需要使用IndexReader.openIfChange方法操作。

原创粉丝点击