Lucene In Depth (1) -- Introduction
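Abstract
Lucene is a subproject of the Apache Software Foundation's Jakarta project. It is an open-source full-text search toolkit: not a complete full-text search engine, but a framework for building one. It provides a complete query engine and index engine, plus part of a text analysis engine (for two Western languages, English and German). Lucene's goal is to give software developers a simple, easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it.
Lucene's original author is Doug Cutting, a veteran expert in full-text indexing and retrieval. He was a principal developer of the V-Twin search engine, later worked at Excite as a senior system architect, and is currently doing research on low-level Internet infrastructure. Lucene was first published on the author's own site, http://www.lucene.com/, then moved to SourceForge, and at the end of 2001 became a subproject of the Apache Software Foundation's Jakarta project: http://jakarta.apache.org/lucene/.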
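Preface
Lucene is a well-known full-text search system in the open-source community and has spread widely. Material about it is everywhere online, and many other full-text search systems borrow from its design to a greater or lesser degree, for example FirTex, a project that recently appeared in China. Because I have always trusted Apache and follow open-source projects fairly closely, I must have known about Lucene for several years, yet it always felt distant from me. I once tried to study it in depth, but the attempt came to nothing.
That changed recently. I had to design a system that uses full-text search, so I picked Lucene up again and decided to study it properly. I searched the Internet, and most of what I found explains how to use Lucene; very little explains it in depth. Even the popular "Lucene In Action" disappointed me greatly. What I want is a deep understanding of Lucene's design: the data structures of its index files, how indexing and searching work, and why it can be so efficient.
It seems I will have to chew through it slowly on my own. I knew the system was complex, but I was still startled when I saw the source code: more than 40,000 lines. I had studied the ICTCLAS source code before, but that is only about 10,000 lines, and it still took me a long time to work out every detail. This looks like another tough battle!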
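First Look
Although Lucene's source code is large and complex, it is very simple for users: a few lines of code are enough to implement the whole process of creating an index and searching it. Let's look at two examples: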
// Search the index
package org.apache.lucene.demo2;

import java.io.File;
import java.util.Date;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Searcher {
    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            File indexDir = new File("test2/index");
            String q = "mscomctl.ocx";
            if (!indexDir.exists() || !indexDir.isDirectory()) {
                throw new Exception(indexDir + " does not exist or is not a directory.");
            }
            search(indexDir, q);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void search(File indexDir, String q) throws Exception {
        // open the existing index (false = do not create a new one)
        Directory fsDir = FSDirectory.getDirectory(indexDir, false);
        IndexSearcher is = new IndexSearcher(fsDir);
        // parse the query string against the "contents" field
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        Query query = parser.parse(q);
        long start = new Date().getTime();
        Hits hits = is.search(query);
        long end = new Date().getTime();
        System.err.println("Found " + hits.length() + " document(s) (in " + (end - start) + " milliseconds) that matched query '" + q + "':");
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(doc.get("filename"));
        }
        // release the index files once the hits have been read
        is.close();
    }
}
// Create the index
package org.apache.lucene.demo2;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Date;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class Indexer {
    public static void main(String[] args) {
        try {
            File dataDir = new File("test2/data");
            File indexDir = new File("test2/index");
            long start = new Date().getTime();
            int numIndexed = index(indexDir, dataDir);
            long end = new Date().getTime();
            System.out.println("Indexing " + numIndexed + " files took " + (end - start) + " milliseconds");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // open an index and start file directory traversal
    public static int index(File indexDir, File dataDir) throws IOException {
        if (!dataDir.exists() || !dataDir.isDirectory()) {
            throw new IOException(dataDir + " does not exist or is not a directory");
        }
        // true = create a new index, overwriting any existing one
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        writer.setUseCompoundFile(false);
        indexDirectory(writer, dataDir);
        int numIndexed = writer.docCount();
        writer.optimize();
        writer.close();
        return numIndexed;
    }

    // recursive method that calls itself when it finds a directory
    private static void indexDirectory(IndexWriter writer, File dir) throws IOException {
        File[] files = dir.listFiles();
        for (int i = 0; i < files.length; i++) {
            File f = files[i];
            if (f.isDirectory()) {
                indexDirectory(writer, f);
            } else if (f.getName().endsWith(".txt")) {
                indexFile(writer, f);
            }
        }
    }

    // method to actually index a file using Lucene
    private static void indexFile(IndexWriter writer, File f) throws IOException {
        if (f.isHidden() || !f.exists() || !f.canRead()) {
            return;
        }
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = new Document();
        // "contents" is tokenized and indexed from the file, but not stored
        doc.add(new Field("contents", new FileReader(f)));
        // "filename" is stored for display, but not searchable
        doc.add(new Field("filename", f.getCanonicalPath(), Field.Store.YES, Field.Index.NO));
        // System.out.println(doc);
        writer.addDocument(doc);
    }
}
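The two programs above exercise the parts named in the abstract: text analysis, the index engine, and the query engine. As a rough mental model of how those parts fit together, here is a toy inverted index in plain Java. It is only a sketch under heavy assumptions: `ToyIndex`, `analyze`, `add`, and `search` are hypothetical names invented for illustration, the tokenizer merely lower-cases and splits on non-alphanumeric characters, and nothing here reflects Lucene's actual classes, scoring, or index file format.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model only: NOT Lucene's implementation or file format.
class ToyIndex {
    // Inverted index: term -> names of the documents that contain it.
    private final Map<String, Set<String>> postings = new TreeMap<>();

    // "Analysis" step: lower-case the text and split on non-alphanumeric characters.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // "Indexing" step: record the document under every term it contains.
    void add(String docName, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(docName);
        }
    }

    // "Search" step: documents that contain every term of the query (AND semantics).
    List<String> search(String query) {
        Set<String> result = null;
        for (String term : analyze(query)) {
            Set<String> docs = postings.getOrDefault(term, Set.of());
            if (result == null) {
                result = new LinkedHashSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? List.of() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("a.txt", "Lucene is a full-text search toolkit");
        index.add("b.txt", "An index makes text search fast");
        System.out.println(index.search("text search")); // prints [a.txt, b.txt]
        System.out.println(index.search("lucene"));      // prints [a.txt]
    }
}
```

In this toy, a query costs a few map lookups instead of a scan over every document, which is the general idea behind an inverted index. Lucene's real structures are far more elaborate than this.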
References
1. FirTex: www.firtex.org
2. Lucene China: www.lucene.com.cn