Lucene course project: searching .doc, .pdf, .html, Excel (.xls), and .txt files
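It took roughly another week, but I have finally finished my information retrieval course project. Exams are coming up and I haven't started reviewing; I have spent whole days on the lab work for this one course...
Using the open-source Lucene library, I implemented search over common file formats such as .doc, .pdf, .html, Excel (.xls), and .txt. The search results list the path of each matching file.
The project uses the following packages: lucene2.4.0, apache-tomcat-6.0.16, poi-bin-3.2-FINAL-20081019 (for parsing Office files such as .doc and .xls), and PDFBox-0.7.3 (for parsing .pdf files). The development environment is Eclipse Version: 3.4.1.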
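Note that every jar you use must be placed in the lib directory under WebContent/WEB-INF; otherwise you get a NoClassDefFoundError. In addition, after copying the jars in, right-click an empty area of Eclipse's Project Explorer and click Refresh; otherwise the NoClassDefFoundError persists even though the jars are in lib. This "tiny" mistake tormented me for a day and a half. I searched the web frantically and tried every fix I could find, to no avail, and was close to breaking down. In the end, staring blankly at the screen, I happened to notice that the docs folder (which holds the files to be searched), whose contents I had changed earlier, still showed the old files in Project Explorer. That was the cause. One gentle click on Refresh fixed it, and I was left both thrilled and speechless that so small a slip had cost so much.
I had hardly used this kind of IDE before (VC6.0 aside). How to sum it up? These tools really do make development much more efficient and offer plenty of hints, but setting up the environment is tedious: the slightest mistake breaks things, and such mistakes are usually hard to track down. There are pros and cons. (What I like most is that they handle code formatting automatically, unlike the tiresome manual formatting when writing code in a plain .txt editor.)
The code follows: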
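One way to narrow down a NoClassDefFoundError like the one described above is to ask the class loader directly whether each library class can be found. The sketch below is only an illustration: `ClasspathCheck` and its `missing` method are hypothetical names that are not part of the project, and the class names you pass in are whichever classes you expect the jars in WEB-INF/lib to provide.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the project): reports which class names cannot be loaded. */
final class ClasspathCheck {
    private ClasspathCheck() {}

    /** Returns the subset of classNames that the current class loader cannot find or link. */
    static List<String> missing(List<String> classNames) {
        List<String> notFound = new ArrayList<String>();
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClasspathCheck.class.getClassLoader();
        }
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run its static initializers
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException e) {
                notFound.add(name);
            } catch (LinkageError e) {
                // The class file exists, but something it depends on could not be linked.
                notFound.add(name);
            }
        }
        return notFound;
    }
}
```

Called from inside the web application with names such as `org.apache.lucene.index.IndexWriter` or `org.pdfbox.pdmodel.PDDocument`, a non-empty result suggests the corresponding jar has not actually been deployed, whatever the lib folder looks like on disk.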
configuration.jsp
- <%@ page language="java" contentType="text/html; charset=GB18030"
- pageEncoding="GB18030"%>
- <html>
- <head>
- <title>
- Welcome to LuceneWeb - Configuration Page
- </title>
- </head>
- <body background="D:/Program Files/MIR Design/eclise workplace/LuceneWebApplication/bg.jpg">
- <center>
- <font face="Monotype Corsiva" size="20" color="#9966ff">
- <b>
- Welcome to LuceneWeb
- </b>
- </font>
- </center>
- <h2 align="right">
- <b>
- <font face="Monotype Corsiva" color="#9966ff">
- Configuration Page
- </font>
- </b>
- </h2>
- <center>
- <form name="Configuration" action="header.jsp" method="get">
- <%for(int i=0;i<7;i++){%>
- <br>
- <%} %>
- <p>
- <font face="楷体_GB2312" size="5">
- Document path:
- </font>
- <input type="text" name="DocumentDirectory" size="40"/>
- </p>
- <p>
- <br>
- <font face="楷体_GB2312" size="5">
- Index path:
- </font>
- <input type="text" name="IndexDirectory" size="40"/>
- </p>
- <br>
- <center>
- <input type="submit" value="Build index"/>
- </center>
- </form>
- </center>
- </body>
- </html>
header.jsp
- <%@page language="java" contentType="text/html; charset=GB18030"
- pageEncoding="GB18030"%>
- <%@page import="Index.CreateIndex" %>
- <%@page import ="java.io.*,org.apache.poi.hwpf.extractor.*,org.apache.lucene.analysis.*,org.apache.lucene.analysis.standard.StandardAnalyzer,org.apache.lucene.document.*,org.apache.lucene.index.*, org.apache.lucene.search.*,org.apache.lucene.queryParser.*,org.apache.lucene.demo.*,org.apache.lucene.demo.html.Entities,java.net.URLEncoder" %>
- <head>
- <title>Welcome to LuceneWeb - Search Page</title>
- </head>
- <%
- CreateIndex create=new CreateIndex();
- String index;
- String document_path,index_path;
- document_path=request.getParameter("DocumentDirectory");
- index_path=request.getParameter("IndexDirectory");
- // The two if statements below set default document and index paths, used when the configuration page leaves them blank.
- // A blank text field is submitted as an empty string rather than null, so both cases are checked.
- if(document_path==null||document_path.trim().length()==0)
- {
- document_path="D://Program Files//MIR Design//eclise workplace//LuceneWebApplication//docs";
- }
- if(index_path==null||index_path.trim().length()==0)
- {
- index_path="D://Program Files//MIR Design//eclise workplace//LuceneWebApplication//index";
- }
- create.create_index(document_path,index_path);
- String indexPath=create.get_index();
- %>
- <body background="D:/Program Files/MIR Design/eclise workplace/LuceneWebApplication/luceneweb.jpg">
- <center>
- <font face="Monotype Corsiva" color="#00ffff" size="150">
- <b>
- LuceneWeb
- </b>
- </font>
- </center>
- <center>
- <%
- for(int i=0;i<10;i++)
- {
- %>
- <br>
- <%
- }
- %>
- <form name="Search" action="index.jsp" method="get">
- <p>
- <font face="楷体_GB2312" size="5">
- Search keywords:
- </font>
- <input name="QueryInput" size="50"/>
- <input type="hidden" name="indexPath" value="<%=indexPath %>"/>
- <br>
- <br>
- <br>
- </p>
- <center>
- <input type="submit" value="Search"/>
- </center>
- </form>
- </center>
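The default-path handling in header.jsp depends on a detail of the servlet API: `request.getParameter` returns null when the parameter is absent from the request, but an empty string when a text field is submitted blank, which is why a bare `== null` test did not behave as expected. A minimal sketch of a helper that covers both cases follows; `RequestDefaults` and `orDefault` are hypothetical names introduced here for illustration.

```java
/** Hypothetical helper mirroring the default-path logic in header.jsp. */
final class RequestDefaults {
    private RequestDefaults() {}

    /** Returns value with surrounding whitespace removed, or fallback when value is null or blank. */
    static String orDefault(String value, String fallback) {
        if (value == null) {
            return fallback; // parameter missing from the request
        }
        String trimmed = value.trim();
        // An empty string means the form field was submitted without any text.
        return trimmed.length() == 0 ? fallback : trimmed;
    }
}
```

In the JSP this would be used as `RequestDefaults.orDefault(request.getParameter("DocumentDirectory"), defaultDocs)`, where `defaultDocs` stands for whatever default directory you choose.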
index.jsp
- <%@ page import = " javax.servlet.*, javax.servlet.http.*, java.io.*, org.apache.lucene.analysis.*, org.apache.lucene.analysis.standard.StandardAnalyzer, org.apache.lucene.document.*, org.apache.lucene.index.*, org.apache.lucene.search.*, org.apache.lucene.queryParser.*, org.apache.lucene.demo.*, org.apache.lucene.demo.html.Entities, java.net.URLEncoder" %>
- <%@page language="java" contentType="text/html; charset=GB18030"
- pageEncoding="GB18030"%>
- <%@include file="header_frame.jsp" %>
- <%
- boolean error = false;
- String indexName=request.getParameter("indexPath");
- IndexSearcher searcher = null;
- Query query = null;
- Hits hits = null;
- int startindex =0;
- int maxpage =10;
- String queryString = null;
- String startVal ="0";
- String maxresults ="10";
- int thispage = 0;
- try {
- searcher = new IndexSearcher(indexName);
- } catch (Exception e) {
- %>
- <p>Notice:error opening the Index</p>
- <% error = true;
- }
- %>
- <%
- if (error == false) {
- queryString = request.getParameter("QueryInput");
- startVal =request.getParameter("startat");
- maxresults =request.getParameter("maxresults");
- try {
- maxpage = Integer.parseInt(maxresults);
- startindex = Integer.parseInt(startVal);
- } catch (Exception e) { }
- if (queryString == null || queryString.trim().length() == 0)
- {
- // No query was submitted: show a prompt and skip parsing and searching.
- error = true;
- %>
- <h3 align="left">
- <font face="Monotype Corsiva" color="#9966ff">
- <b>
- Please enter a search keyword...
- </b>
- </font>
- </h3>
- <%
- }
- else
- {
- Analyzer analyzer = new StandardAnalyzer();
- try {
- QueryParser qp = new QueryParser("contents", analyzer);
- query = qp.parse(queryString);
- } catch (ParseException e) {
- %>
- <p>Notice: the query could not be parsed</p>
- <%
- error = true;
- }
- }
- }
- %>
- <%
- if (error == false && searcher != null) {
- thispage = maxpage;
- hits = searcher.search(query);
- if (hits.length() == 0) {
- %>
- <p>Sorry, no results matched your query...</p>
- <%
- error = true;
- }
- }
- if (error == false && searcher != null) {
- %>
- <h3 align="left">
- <font face="Monotype Corsiva" color="#9966ff">
- <b>
- <%=hits.length()%> results found in total...
- </b>
- </font>
- </h3>
- <table>
- <%
- if ((startindex + maxpage) > hits.length()) {
- thispage = hits.length() - startindex;
- }
- %>
- <%
- for (int i = startindex; i < (thispage + startindex); i++) {
- %>
- <tr>
- <%
- Document doc = hits.doc(i);
- String doctitle = doc.get("title");
- String url = doc.get("path");
- if (url != null && url.startsWith("../webapps/"))
- {
- url = url.substring(10);
- }
- // CreateIndex stores no "title" field, so fall back to the file path.
- if ((doctitle == null) || doctitle.equals(""))
- doctitle = url;
- %>
- <td><a href="<%=url%>"><%=doctitle%></a></td>
- </tr>
- <%
- }
- %>
- </table>
- <%
- for(int i=0;i<7;i++)
- {
- %>
- <br>
- <%
- }
- %>
- <p align="left">
- <%
- String first_page="index.jsp?QueryInput="+URLEncoder.encode(queryString,"GB18030")+"&maxresults="+maxpage+"&startat="+"0"+"&indexPath="+URLEncoder.encode(indexName,"GB18030");
- %>
- <% if(startindex>maxpage-1)
- {
- %>
- <a href="<%=first_page%>">First</a>
- <%
- }
- else
- {
- %>
- First
- <%
- }
- %>
- <% if (startindex>=maxpage)
- {
- String former_page="index.jsp?QueryInput="+URLEncoder.encode(queryString,"GB18030")+"&maxresults="+maxpage+"&startat="+(startindex-maxpage)+"&indexPath="+URLEncoder.encode(indexName,"GB18030");
- %>
- <a href="<%=former_page%>">Previous</a>
- <%
- }else{
- %>
- Previous
- <%
- }
- %>
- <% if ( (startindex + maxpage) < hits.length()) {
- String next_page="index.jsp?QueryInput="+URLEncoder.encode(queryString,"GB18030")+"&maxresults="+maxpage+"&startat="+(startindex+maxpage)+"&indexPath="+URLEncoder.encode(indexName,"GB18030");
- %>
- <a href="<%=next_page%>">Next</a>
- <%
- }else{
- %>
- Next
- <%
- }
- %>
- <%
- String end_page="index.jsp?QueryInput="+URLEncoder.encode(queryString,"GB18030")+"&maxresults="+maxpage+"&startat="+maxpage*((hits.length()-1)/maxpage)+"&indexPath="+URLEncoder.encode(indexName,"GB18030");
- %>
- <% if((startindex + maxpage) < hits.length())
- {
- %>
- <a href="<%=end_page%>">Last</a>
- <%
- }else{
- %>
- Last
- <%
- }
- %>
- </p>
- <% }
- if (searcher != null)
- searcher.close();
- %>
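The paging logic in index.jsp comes down to three small calculations: how many hits the current page shows, where the last page starts, and how to build a link whose parameter values survive spaces and special characters. The sketch below isolates them in plain Java. `Paging` and its methods are hypothetical names, and the charset passed to `link` is an assumption that has to match however your server decodes query strings.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper for the paging arithmetic and links of a results page. */
final class Paging {
    private Paging() {}

    /** Number of hits to show on the page that starts at offset start. */
    static int pageLength(int start, int pageSize, int totalHits) {
        if (start < 0 || start >= totalHits) {
            return 0;
        }
        return Math.min(pageSize, totalHits - start);
    }

    /** Start offset of the last non-empty page (0 when there are no hits). */
    static int lastPageStart(int pageSize, int totalHits) {
        if (totalHits <= 0) {
            return 0;
        }
        // Using totalHits - 1 avoids an empty last page when totalHits is a multiple of pageSize.
        return pageSize * ((totalHits - 1) / pageSize);
    }

    /** Builds a link to index.jsp with "&" between parameters and URL-encoded values. */
    static String link(String query, int pageSize, int start, String indexPath, String charset)
            throws UnsupportedEncodingException {
        return "index.jsp?QueryInput=" + URLEncoder.encode(query, charset)
                + "&maxresults=" + pageSize
                + "&startat=" + start
                + "&indexPath=" + URLEncoder.encode(indexPath, charset);
    }
}
```

With 20 hits and 10 per page, `lastPageStart` gives 10 rather than 20, so the last-page link never leads to a page with no rows.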
header_frame.jsp
- <head>
- <title>Welcome to LuceneWeb - Results Page</title>
- </head>
- <body background="D:/Program Files/MIR Design/eclise workplace/LuceneWebApplication/luceneweb.jpg">
- <center>
- <font face="Monotype Corsiva" color="#00ffff" size="100">
- <b>
- LuceneWeb
- </b>
- </font>
- </center>
- <h1 align="left">
- <font face="Monotype Corsiva" color="#9966ff">
- <b>
- Search Result:
- </b>
- </font>
- </h1>
CreateIndex.java (this class builds the index for the various file formats)
- package Index;
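- // Builds the index and returns its location. Robustness is seriously lacking because exceptions are not handled; this is only a demo that assumes everything goes normally. It needs improvement, but time is too short right now.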
- import java.io.BufferedInputStream;
- import java.io.File;
- import java.io.FileInputStream;
- import java.io.FileNotFoundException;
- import java.io.FileReader;
- import java.io.IOException;
- import java.io.Reader;
- import org.apache.lucene.analysis.standard.StandardAnalyzer;
- import org.apache.lucene.demo.html.HTMLParser;
- import org.apache.lucene.document.Document;
- import org.apache.lucene.document.Field;
- import org.apache.lucene.index.IndexWriter;
- import org.apache.poi.hslf.HSLFSlideShow;
- import org.apache.poi.hslf.extractor.PowerPointExtractor;
- import org.apache.poi.hslf.model.TextRun;
- import org.apache.poi.hslf.usermodel.SlideShow;
- import org.apache.poi.hssf.model.Workbook;
- import org.apache.poi.hwpf.extractor.WordExtractor;
- import org.pdfbox.cos.COSDocument;
- import org.pdfbox.pdfparser.PDFParser;
- import org.pdfbox.pdmodel.PDDocument;
- import org.pdfbox.searchengine.lucene.LucenePDFDocument;
- import org.pdfbox.util.PDFTextStripper;
- import org.apache.poi.hssf.extractor.ExcelExtractor;
- import org.apache.poi.hssf.usermodel.HSSFWorkbook;
- public class CreateIndex{
- private String IndexPath="begin1";
- String index_directory=null;
- String document_directory=null;
- String execu=null;
- // Creates the index for .txt, .pdf, .doc, .html, .ppt, and Excel (.xls) files (the .ppt branch is commented out below)
- public String get_index()
- {
- return IndexPath;
- }
- public void create_index(String document_directory,String index_directory) throws FileNotFoundException,IOException
- {
- this.document_directory=document_directory;
- this.index_directory=index_directory;
- File documentDir=new File(this.document_directory);
- File indexDir=new File(this.index_directory);
- StandardAnalyzer luceneAnalyzer=new StandardAnalyzer();
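- File datafiles[]=documentDir.listFiles();
- // listFiles() returns null when the path is not a readable directory.
- if(datafiles==null) throw new FileNotFoundException("Not a readable directory: "+document_directory);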
- IndexWriter indexWriter;
- indexWriter=new IndexWriter(indexDir,luceneAnalyzer,true);
- for(int i=0;i<datafiles.length;i++)
- {
- // Index .txt files
- if(datafiles[i].isFile()&&datafiles[i].getName().endsWith(".txt"))
- {
- Document document=new Document();
- Reader txtReader=new FileReader(datafiles[i]);
- document.add(new Field("path",datafiles[i].getCanonicalPath(),Field.Store.YES,Field.Index.TOKENIZED));
- document.add(new Field("contents",txtReader));
- indexWriter.addDocument(document);
- }
- // Index .html files
- if(datafiles[i].isFile()&&datafiles[i].getName().endsWith(".html"))
- {
- Document document = new Document();
- FileInputStream file_input_stream = new FileInputStream(datafiles[i]);
- HTMLParser parser = new HTMLParser(file_input_stream);
- document.add(new Field("path",datafiles[i].getCanonicalPath(),Field.Store.YES,Field.Index.TOKENIZED));
- document.add(new Field("contents", parser.getReader()));
- indexWriter.addDocument(document);
- }
- // Index .pdf files
- if(datafiles[i].isFile()&&datafiles[i].getName().endsWith(".pdf"))
- {
- Document document = new Document();
- FileInputStream file_input_stream = new FileInputStream(datafiles[i]);
- PDFParser parser=new PDFParser(file_input_stream);
- parser.parse();
- COSDocument cosdoc=parser.getDocument();
- PDFTextStripper stripper=new PDFTextStripper();
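- PDDocument pdDocument=new PDDocument(cosdoc);
- String docText=stripper.getText(pdDocument);
- pdDocument.close(); // release the resources held by the parsed PDF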
- document.add(new Field("path",datafiles[i].getCanonicalPath(),Field.Store.YES,Field.Index.TOKENIZED));
- document.add(new Field("contents",docText,Field.Store.YES,Field.Index.TOKENIZED));
- //document = LucenePDFDocument.getDocument(datafiles[i]);
- indexWriter.addDocument(document);
- }
- // Index .doc files
- if(datafiles[i].isFile()&&datafiles[i].getName().endsWith(".doc"))
- {
- Document document = new Document();
- FileInputStream file_input_stream = new FileInputStream(datafiles[i]);
- BufferedInputStream input_stream_buffer = new BufferedInputStream(file_input_stream);
- WordExtractor doc_extractor = new WordExtractor(input_stream_buffer);
- document.add(new Field("path",datafiles[i].getCanonicalPath(),Field.Store.YES,Field.Index.TOKENIZED));
- document.add(new Field("contents",doc_extractor.getText(),Field.Store.YES,Field.Index.TOKENIZED));
- indexWriter.addDocument(document);
- }
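- // Index .ppt files. This part never worked: it kept failing with 'no such entry: "PowerPoint Document"'.
- // Searching online suggested an environment problem. Parsing DOC, Excel, and PPT should all work the same way,
- // using the .doc/.ppt/.xls extractors that ship with POI, yet for some reason only PPT failed.
- // Caution: the if condition must not be followed by a semicolon, or the block runs for every file
- // (a .doc or .xls file opened as a presentation can fail this way). Enabling this block also needs import java.io.InputStream.
- /*if(datafiles[i].isFile()&&datafiles[i].getName().endsWith(".ppt"))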
- {
- Document document=new Document();
- InputStream file_input_stream = new FileInputStream(datafiles[i]);
- BufferedInputStream input_stream_buffer = new BufferedInputStream(file_input_stream);
- //HSLFSlideShow slide_show=new HSLFSlideShow(input_stream_buffer);
- PowerPointExtractor ppt_extractor = new PowerPointExtractor(input_stream_buffer);
- document.add(new Field("path",datafiles[i].getCanonicalPath(),Field.Store.YES,Field.Index.TOKENIZED));
- document.add(new Field("contents",ppt_extractor.getText(),Field.Store.NO,Field.Index.TOKENIZED));
- indexWriter.addDocument(document);
- }*/
- if(datafiles[i].isFile()&&datafiles[i].getName().endsWith(".xls"))
- {
- Document document=new Document();
- FileInputStream file_input_stream = new FileInputStream(datafiles[i]);
- BufferedInputStream input_stream_buffer = new BufferedInputStream(file_input_stream);
- HSSFWorkbook hssf=new HSSFWorkbook(input_stream_buffer);
- ExcelExtractor xls_extractor=new ExcelExtractor(hssf);
- document.add(new Field("path",datafiles[i].getCanonicalPath(),Field.Store.YES,Field.Index.TOKENIZED));
- document.add(new Field("contents",xls_extractor.getText(),Field.Store.NO,Field.Index.TOKENIZED));
- indexWriter.addDocument(document);
- }
- }
- IndexPath=index_directory;
- indexWriter.optimize();
- indexWriter.close();
- }
- }
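Every branch in `create_index` selects files with `getName().endsWith(".pdf")` and similar tests, which are case-sensitive, so a file named `REPORT.PDF` is silently skipped. A minimal sketch of a case-insensitive extension check is shown below; `FileKinds` and its methods are hypothetical names, and the list of extensions simply mirrors the branches that are active in the class above.

```java
import java.util.Locale;

/** Hypothetical helper: classifies a file name by its extension. */
final class FileKinds {
    private FileKinds() {}

    /** Lower-cased extension without the dot, or "" when the name has none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** True for the formats the indexer handles: txt, html, pdf, doc, xls. */
    static boolean isIndexable(String fileName) {
        String ext = extensionOf(fileName);
        return ext.equals("txt") || ext.equals("html") || ext.equals("pdf")
                || ext.equals("doc") || ext.equals("xls");
    }
}
```

Each `endsWith` test could then be replaced by a comparison such as `"pdf".equals(FileKinds.extensionOf(datafiles[i].getName()))`.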
One last note: this program is not robust, because it handles almost no exceptions. There was no time to think about all that...
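As a possible first step toward the missing robustness, each file could be processed inside its own try/catch so that one corrupt document does not abort the whole indexing run. The sketch below shows only that pattern in plain Java: `SafeBatch`, `Task`, and `runAll` are hypothetical names, and the task body stands in for "extract the text and add it to the index".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: run one task per file and collect failures instead of aborting. */
final class SafeBatch {
    private SafeBatch() {}

    /** A unit of work that may fail, for example parsing one file and indexing it. */
    interface Task {
        void run(String fileName) throws Exception;
    }

    /** Runs task for each name and returns the names whose task threw an exception. */
    static List<String> runAll(List<String> fileNames, Task task) {
        List<String> failed = new ArrayList<String>();
        for (String name : fileNames) {
            try {
                task.run(name);
            } catch (Exception e) {
                // One unreadable or corrupt file should not stop the remaining files.
                failed.add(name);
            }
        }
        return failed;
    }
}
```

The returned list can be logged or shown on the configuration page, so the user learns which files were skipped.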