Lucene In Depth (1): Introduction

Abstract

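Lucene is a subproject of the Apache Software Foundation's Jakarta project. It is an open-source full-text search toolkit: not a complete full-text search engine in itself, but a framework for building one. It provides a complete query engine and indexing engine, plus a partial text analysis engine (covering two Western languages, English and German). Lucene's goal is to give software developers a simple, easy-to-use toolkit with which they can add full-text search to a target system, or build a complete full-text search engine on top of it.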

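Lucene's original author is Doug Cutting, a veteran full-text indexing and retrieval expert. He was a principal developer of the V-Twin search engine, later served as a senior system architect at Excite, and now works on research into low-level Internet infrastructure. Lucene was first published on the author's own site, http://www.lucene.com/, then moved to SourceForge, and at the end of 2001 became a subproject of the Apache Software Foundation's Jakarta project: http://jakarta.apache.org/lucene/.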

Preface

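Lucene is a celebrated full-text search system in the open-source community and is widely used. Material about it is everywhere on the web, and many other full-text search systems borrow from its design to a greater or lesser degree, for example FirTex, a project that recently appeared in China. Because I have always trusted Apache and follow open-source projects fairly closely, I must have known about Lucene for several years, yet it always felt remote to me. I did try to study it in depth once, but nothing came of it.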

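That changed recently: I need to design a system that uses full-text search, so I picked Lucene up again and decided to study it properly. I searched the Internet, but most of what I found explains how to use Lucene. Little of it explains Lucene in depth, and even the popular "Lucene in Action" left me disappointed. What I want is a deep understanding of Lucene's design, the data structures of its index files, how indexing and searching proceed, and why it manages to be so efficient.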

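It seems I will have to chew through it on my own. I knew the system was complex, but I was still startled when I saw the source code: more than 40,000 lines. I have studied the ICTCLAS source code before, but that was only about 10,000 lines, and it still took me a long time to work out all the details. This looks like another hard campaign!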

A First Look

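Although Lucene's source code is large and complex, using it is very simple: a few lines of code are enough to build an index and run a search from end to end. Here are two examples: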

// Searching the index

package org.apache.lucene.demo2;

import java.io.File;
import java.util.Date;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Searcher {

    /**
     * @param args unused; the index directory and the query are hard-coded below
     */
    public static void main(String[] args) {
        try {
            File indexDir = new File("test2/index");
            String q = "mscomctl.ocx";
            if (!indexDir.exists() || !indexDir.isDirectory()) {
                throw new Exception(indexDir + " does not exist or is not a directory.");
            }
            search(indexDir, q);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // open the index, parse the query string, and print the file name of every hit
    public static void search(File indexDir, String q) throws Exception {
        Directory fsDir = FSDirectory.getDirectory(indexDir, false);
        IndexSearcher is = new IndexSearcher(fsDir);

        // the query is parsed against the "contents" field written by the indexer
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        Query query = parser.parse(q);

        long start = new Date().getTime();
        Hits hits = is.search(query);
        long end = new Date().getTime();

        System.err.println("Found " + hits.length() + " document(s) (in " + (end - start) + " milliseconds) that matched query '" + q + "':");

        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(doc.get("filename"));
        }

        is.close();
    }
}

// Building the index

package org.apache.lucene.demo2;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Date;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class Indexer {

    public static void main(String[] args) {
        try {
            File dataDir = new File("test2/data");
            File indexDir = new File("test2/index");

            long start = new Date().getTime();
            int numIndexed = index(indexDir, dataDir);
            long end = new Date().getTime();

            System.out.println("Indexing " + numIndexed + " files took " + (end - start) + " milliseconds");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // open an index and start file directory traversal
    public static int index(File indexDir, File dataDir) throws IOException {
        if (!dataDir.exists() || !dataDir.isDirectory()) {
            throw new IOException(dataDir + " does not exist or is not a directory");
        }

        // the third argument (true) creates a new index, replacing any existing one
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        writer.setUseCompoundFile(false);

        indexDirectory(writer, dataDir);

        int numIndexed = writer.docCount();
        writer.optimize();
        writer.close();
        return numIndexed;
    }

    // recursive method that calls itself when it finds a directory
    private static void indexDirectory(IndexWriter writer, File dir) throws IOException {
        File[] files = dir.listFiles();
        if (files == null) {
            return; // not a readable directory
        }
        for (int i = 0; i < files.length; i++) {
            File f = files[i];
            if (f.isDirectory()) {
                indexDirectory(writer, f);
            } else if (f.getName().endsWith(".txt")) {
                indexFile(writer, f);
            }
        }
    }

    // method to actually index a file using Lucene
    private static void indexFile(IndexWriter writer, File f) throws IOException {
        if (f.isHidden() || !f.exists() || !f.canRead()) {
            return;
        }

        System.out.println("Indexing " + f.getCanonicalPath());

        Document doc = new Document();
        // file body: tokenized and indexed, but not stored
        doc.add(new Field("contents", new FileReader(f)));
        // file path: stored so that hits can print it, but not indexed
        doc.add(new Field("filename", f.getCanonicalPath(), Field.Store.YES, Field.Index.NO));
        // System.out.println(doc);
        writer.addDocument(doc);
    }
}
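The two listings above show how to use Lucene, but not what happens underneath. Like other full-text engines, Lucene is built around an inverted index: a mapping from each term to the documents that contain it, so a query looks terms up instead of scanning every file. The sketch below is only a toy illustration of that idea using nothing but the JDK. The class name TinyInvertedIndex, its tokenizer, and its AND-only query semantics are assumptions made for this example; it does not reproduce Lucene's API, its analyzers, or its on-disk file format.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each term to the sorted set of documents containing it.
// Illustration only; this is NOT Lucene's API or index file format.
class TinyInvertedIndex {

    private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();

    // Lower-case the text and split on anything that is not a letter or digit.
    // A crude stand-in for an analyzer (hypothetical, far simpler than StandardAnalyzer).
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // "Indexing": record, for every term in the text, that docName contains it.
    void add(String docName, String text) {
        for (String term : tokenize(text)) {
            Set<String> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<String>();
                postings.put(term, docs);
            }
            docs.add(docName);
        }
    }

    // "Searching": return the documents that contain every term of the query.
    // AND semantics are a simplification chosen for this sketch.
    Set<String> search(String query) {
        Set<String> result = null;
        for (String term : tokenize(query)) {
            Set<String> docs = postings.get(term);
            if (docs == null) {
                return new TreeSet<String>(); // an unknown term matches nothing
            }
            if (result == null) {
                result = new TreeSet<String>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("a.txt", "Register mscomctl.ocx before use");
        index.add("b.txt", "Lucene is a full-text search toolkit");
        System.out.println(index.search("mscomctl.ocx"));
        System.out.println(index.search("full text"));
    }
}
```

The cost of a lookup here depends on the number of query terms and the size of their posting sets, not on the total amount of text indexed, which is the basic reason an index-based search can stay fast as the collection grows.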

References

1. FirTex: www.firtex.org

2. Lucene China: www.lucene.com.cn