Keyword Search over File Contents with Lucene


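1. Overview

For a detailed introduction to Lucene, please look it up online (for example, in the official Apache Lucene documentation). The code in this article targets Lucene 3.6, as indicated by its use of Version.LUCENE_36.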

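2. Worked Example

Before implementing anything, create the directories and files you need.

For example, I created the following paths and files (two source text files and an empty directory for the index):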

D:/tools/LearningByMyself/lucene/source/demo1.txt

D:/tools/LearningByMyself/lucene/source/demo2.txt

D:/tools/LearningByMyself/lucene/index
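If you would rather not create these by hand, the same layout can be set up from code. The following is a minimal sketch using only the JDK (Java 7 or later, java.nio.file). The class name SampleLayout, the method create, the default root "lucene-demo", and the placeholder file contents are all assumptions made for illustration; they are not part of the original example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class SampleLayout {

    /**
     * Creates root/source/demo1.txt, root/source/demo2.txt and root/index.
     * Files that already exist are left untouched.
     */
    static void create(Path root) throws IOException {
        Path source = root.resolve("source");
        Path index = root.resolve("index");
        Files.createDirectories(source);
        Files.createDirectories(index);

        String[] names = {"demo1.txt", "demo2.txt"};
        for (String name : names) {
            Path file = source.resolve(name);
            if (!Files.exists(file)) {
                // Placeholder text (hypothetical); replace it with the text you want to search.
                String text = "sample text for " + name;
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the root directory as the first argument,
        // e.g. D:/tools/LearningByMyself/lucene; "lucene-demo" is an assumed default.
        Path root = Paths.get(args.length > 0 ? args[0] : "lucene-demo");
        create(root);
        System.out.println("Created layout under " + root.toAbsolutePath());
    }
}
```

The files are written as UTF-8 because the indexing code reads them with the "UTF-8" charset; files saved in another encoding would be indexed as garbled text.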

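Step 1: build the index. The methods in this article belong to a class named LuceneIndex (the name the test class in Step 3 uses to call them). The listing starts with the imports it needs:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * @param sourceFile directory containing the files to add to the index
 * @param indexFile  directory in which the index is stored
 * @throws Exception
 */
public static void textFileIndexer(String sourceFile, String indexFile) throws Exception {
    File sourceDir = new File(sourceFile);
    File indexDir = new File(indexFile);

    File[] textFiles = sourceDir.listFiles();
    if (textFiles == null) {
        // listFiles() returns null when the path is not a readable directory
        throw new IOException(sourceFile + " is not a readable directory");
    }

    Directory dir = FSDirectory.open(indexDir);
    Analyzer luceneAnalyzer = new StandardAnalyzer(Version.LUCENE_36);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36, luceneAnalyzer);
    iwc.setOpenMode(OpenMode.CREATE); // overwrite any existing index
    IndexWriter indexWriter = new IndexWriter(dir, iwc);

    long startTime = System.currentTimeMillis();
    try {
        for (File textFile : textFiles) {
            if (textFile.isFile() && textFile.getName().endsWith(".txt")) {
                String path = textFile.getCanonicalPath();
                System.out.println("File--->" + path + " is being indexed.....");
                String content = fileReaderAll(path, "UTF-8");
                System.out.println("File content: " + content);

                Document document = new Document();
                // The path is stored but not indexed: retrievable, not searchable
                document.add(new Field("path", path, Field.Store.YES, Field.Index.NO));
                // The body is stored, analyzed, and indexed with term vectors
                document.add(new Field("body", content, Field.Store.YES,
                        Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
                indexWriter.addDocument(document);
            }
        }
    } finally {
        indexWriter.close(); // commits the changes and releases the write lock
    }
    long endTime = System.currentTimeMillis();
    System.out.println("Took " + (endTime - startTime) + " ms to add the files in "
            + sourceDir.getPath() + " to the index.....");
}

/** Reads the whole file into a String using the given charset. */
private static String fileReaderAll(String filename, String charset) throws IOException {
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(filename), charset));
    StringBuilder content = new StringBuilder();
    try {
        String line;
        while ((line = reader.readLine()) != null) {
            // readLine() strips the line terminator; put a separator back so
            // words on adjacent lines are not fused together
            if (content.length() > 0) {
                content.append('\n');
            }
            content.append(line);
        }
    } finally {
        reader.close();
    }
    return content.toString();
}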
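Because readLine() drops line terminators, a line-by-line reader has to add a separator back itself, or the last word of one line fuses with the first word of the next. On Java 7 or later, an alternative is to read the whole file in one call, which keeps the original line breaks. This is only a sketch of that alternative; the names FileText and readAll are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

class FileText {

    /** Reads the whole file and decodes it with the named charset, e.g. "UTF-8". */
    static String readAll(String filename, String charset) throws IOException {
        // readAllBytes loads the entire file into memory, so this suits small text files
        byte[] bytes = Files.readAllBytes(Paths.get(filename));
        return new String(bytes, Charset.forName(charset));
    }
}
```

A call such as FileText.readAll(path, "UTF-8") could stand in for fileReaderAll(path, "UTF-8") in the indexing loop. For very large files, a streaming reader is still the safer choice.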

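Step 2: search the index for the keywords. As before, the listing starts with the imports this method needs:

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * @param indexFile directory containing the index
 * @param keyWords  keywords to search for
 * @throws IOException
 * @throws ParseException
 */
public static void queryKeyWords(String indexFile, String keyWords)
        throws IOException, ParseException {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(indexFile)));
    IndexSearcher index_search = new IndexSearcher(reader);
    try {
        // Use the same analyzer as at index time so query terms match indexed terms
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        QueryParser query_parser = new QueryParser(Version.LUCENE_36, "body", analyzer);
        Query query = query_parser.parse(keyWords);

        TopDocs result = index_search.search(query, 10); // return at most 10 hits
        ScoreDoc[] hits = result.scoreDocs;
        if (hits.length > 0) {
            System.out.println("Keyword: " + keyWords + ", in " + indexFile
                    + ": " + hits.length + " hit(s) found...");
        }
    } finally {
        index_search.close();
        reader.close(); // the searcher does not close a reader passed to its constructor
    }
}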
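To make the idea behind the two steps more concrete: an index maps each term to the documents that contain it, and a search looks the term up in that map instead of scanning every file. The sketch below is a toy in-memory version of that idea using only the JDK. It uses no Lucene API and is not how Lucene is implemented; the class ToyIndex, its methods, and the whitespace tokenizer are simplifying assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ToyIndex {
    // term -> sorted set of names of the documents containing that term
    private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();

    /** Splits the body on whitespace and records the document under each term. */
    void add(String docName, String body) {
        for (String term : body.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue; // split() yields an empty string for leading whitespace
            }
            Set<String> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<String>();
                postings.put(term, docs);
            }
            docs.add(docName);
        }
    }

    /** Returns the names of the documents containing the keyword (case-insensitive). */
    Set<String> search(String keyword) {
        Set<String> docs = postings.get(keyword.toLowerCase(Locale.ROOT));
        if (docs == null) {
            return Collections.<String>emptySet();
        }
        return Collections.unmodifiableSet(docs);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("demo1.txt", "all servers require a relay login");
        index.add("demo2.txt", "relay is the jump host");
        System.out.println(index.search("relay"));   // both documents
        System.out.println(index.search("servers")); // only demo1.txt
    }
}
```

Splitting on whitespace is not enough for text such as Chinese, which has no spaces between words. That is one reason the real example hands tokenization to an analyzer, and uses the same one for indexing and for querying.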

Step 3: write a test class to exercise the two methods above. For example, mine looks like this:

public class LuceneTest {
    public static void main(String[] args) throws Exception {
        String sourcePath = "D:/tools/LearningByMyself/lucene/source";
        String indexPath = "D:/tools/LearningByMyself/lucene/index";
        String key_words = "服务器"; // "server"

        LuceneIndex.textFileIndexer(sourcePath, indexPath);
        LuceneIndex.queryKeyWords(indexPath, key_words);
    }
}

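Step 4: check the result on the console. For example, my test run printed the following (the sample files contain Chinese text, which is left as-is because the keyword 服务器, "server", has to match it):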

File--->D:\tools\LearningByMyself\lucene\source\demo1.txt is being indexed.....
File content: 为了保证机房的网络安全，IDC内所有服务器不被允许从办公网直接ssh登录，必须通过跳板机进行间接登录。用户通过跳板机执行的所有命令（包括通过跳板机登录的其他机器后的命令）都会被保存并审计。
File--->D:\tools\LearningByMyself\lucene\source\demo2.txt is being indexed.....
File content: Relay是我们登录IDC服务器的跳板机，在Relay上用户只能执行ssh、passwd等简单命令，Relay只做ssh跳板机儿不做日常工具机使用。
Took 235 ms to add the files in D:\tools\LearningByMyself\lucene\source to the index.....
Keyword: 服务器, in D:/tools/LearningByMyself/lucene/index: 2 hit(s) found...

0 0