Lucene Course Project: Searching .doc, .pdf, .html, .xls (Excel), and .txt Files

Source: Internet | Published by: 淘宝不清洗第二次排查 | Editor: 程序博客网 | Time: 2024/04/29 14:32

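It took roughly another week, but I have finally finished the course project for my information retrieval class. Exams are coming up and I haven't started reviewing; I've been spending whole days on the lab work for this one course...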

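Using the open-source Lucene library, I implemented search over common file formats such as .doc, .pdf, .html, .xls (Excel), and .txt. Each search result gives the path of the matching file.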

The packages used in the project are: lucene2.4.0, apache-tomcat-6.0.16, poi-bin-3.2-FINAL-20081019 (for parsing Office files such as .doc and .xls), and PDFBox-0.7.3 (for parsing .pdf files). The development environment is Eclipse Version: 3.4.1.

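Note that all the jars must be placed in the lib directory under WebContent/WEB-INF; otherwise a NoClassDefFoundError is thrown. Second, right-click an empty area of the Project Explorer in Eclipse and click Refresh; otherwise the NoClassDefFoundError persists even with the jars in the lib directory. This "tiny" mistake tormented me for a day and a half. I searched the web frantically for the cause and tried every fix I could find, and nothing worked; I was close to breaking down. In the end, while staring blankly at the screen, I happened to notice that the docs folder (which holds the files to be searched), whose contents I had changed earlier, still showed the old files in Eclipse's Project Explorer. That was the cause. One gentle click on Refresh finally fixed it, which left me both thrilled and speechless.

I had hardly used this kind of IDE before (VC 6.0 excepted). How to sum it up? IDEs like this really do make development much more efficient and give you plenty of hints, but configuring the environment is tedious: the slightest mistake breaks things, and such mistakes are usually hard to track down. There are pros and cons. (What I like most about these tools is that they handle code formatting automatically, unlike the annoying manual formatting when writing in a plain text editor.)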
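When chasing a NoClassDefFoundError like this one, it can help to ask the class loader directly which classes it can see, instead of guessing from the IDE view. The following is only a minimal sketch: `ClasspathCheck` and `isLoadable` are hypothetical names, not part of the project, and the three class names in `main` are simply the Lucene, POI, and PDFBox classes imported by the code in this post.

```java
// Hypothetical diagnostic helper; not part of the original project.
class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // NoClassDefFoundError is a LinkageError (e.g. a dependency of the class is missing)
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.lucene.index.IndexWriter",
            "org.apache.poi.hwpf.extractor.WordExtractor",
            "org.pdfbox.util.PDFTextStripper"
        };
        for (String name : names) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "MISSING"));
        }
    }
}
```

Run inside the web application (for example from a scratch JSP), a check like this reflects what Tomcat actually deployed, which is the thing a stale Project Explorer view can hide.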

The code follows:

configuration.jsp

<%@ page language="java" contentType="text/html; charset=GB18030"
    pageEncoding="GB18030"%>
<html>
    <head>
        <title>
            Welcome to LuceneWeb - Configuration Page
        </title>
    </head>
    <body background="D:/Program Files/MIR Design/eclise workplace/LuceneWebApplication/bg.jpg">
        <center>
            <font face="Monotype Corsiva"  size="20" color="#9966ff">
                <b>
                    Welcome to LuceneWeb
                </b>
            </font>
        </center>
    <h2 align="right">
        <b>
            <font face="Monotype Corsiva" color="#9966ff">
                Configuration Page
            </font>
        </b>
    </h2>
    <center>
        <form name="Configuration" action="header.jsp" method="get">
            <%for(int i=0;i<7;i++){%>
                <br>
            <%} %>
        <p>
            <font face="楷体_GB2312" size="5">
                Document path:
            </font>
            <input type="text" name="DocumentDirectory" size="40"/>
        </p>
        <p>
            <br>
            <font face="楷体_GB2312" size="5">
                Index path:
            </font>
            <input type="text" name="IndexDirectory" size="40"/>
        </p>
            <br>
            <center>
                <input type="submit" value="Build Index"/>
            </center>
        </form>
        </center>
    </body>
</html>

header.jsp

<%@page language="java" contentType="text/html; charset=GB18030"
    pageEncoding="GB18030"%>
<%@page import="Index.CreateIndex" %>  
<%@page import ="java.io.*,org.apache.poi.hwpf.extractor.*,org.apache.lucene.analysis.*,org.apache.lucene.analysis.standard.StandardAnalyzer,org.apache.lucene.document.*,org.apache.lucene.index.*, org.apache.lucene.search.*,org.apache.lucene.queryParser.*,org.apache.lucene.demo.*,org.apache.lucene.demo.html.Entities,java.net.URLEncoder" %>
<head>
    <title>Welcome to LuceneWeb - Search Page</title>
</head>
<%
        CreateIndex create=new CreateIndex();
        String index;
        String document_path,index_path;
        document_path=request.getParameter("DocumentDirectory");
        index_path=request.getParameter("IndexDirectory");
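        // The two if statements below set default document and index paths, used when the configuration page leaves them blank.
        // An empty text field is submitted as "" rather than null, so a bare null check is not enough; test for both.
        if(document_path==null||document_path.trim().length()==0)
        {
            document_path="D://Program Files//MIR Design//eclise workplace//LuceneWebApplication//docs";
        }
        if(index_path==null||index_path.trim().length()==0)
        {
            index_path="D://Program Files//MIR Design//eclise workplace//LuceneWebApplication//index";
        }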
        create.create_index(document_path,index_path);
        String indexPath=create.get_index();
%>
<body background="D:/Program Files/MIR Design/eclise workplace/LuceneWebApplication/luceneweb.jpg">
    <center>
        <font face="Monotype Corsiva" color="#00ffff" size="150">
            <b>
                LuceneWeb
            </b>
        </font>
    </center>
    <center> 
    <% 
        for(int i=0;i<10;i++)
        {
    %>
        <br>
    <% 
        }
    %>
        <form name="Search" action="index.jsp" method="get">
            <p>                     
                <font face="楷体_GB2312" size="5">
                    Search keywords:
                </font>
                <input name="QueryInput" size="50"/>
                <input type="hidden" name="indexPath" value="<%=indexPath %>"/>
                <br>
                <br>
                <br>
            </p>
                <center>
                    <input type="submit" value="Search"/>
                </center>
     </form>
    </center>
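
The default-path handling in header.jsp comes down to one question: was the form field really filled in? `request.getParameter` returns null only when the parameter is absent from the request; a text field that is present but left empty arrives as an empty string. A small helper covering both cases could look like the sketch below. `PathDefaults` and `orDefault` are hypothetical names, and the paths in `main` are made-up examples.

```java
// Hypothetical helper; not part of the original project.
class PathDefaults {

    /**
     * Returns the trimmed value, or the fallback when the value is
     * null (parameter absent) or blank (field submitted empty).
     */
    static String orDefault(String value, String fallback) {
        if (value == null || value.trim().length() == 0) {
            return fallback;
        }
        return value.trim();
    }

    public static void main(String[] args) {
        // Example paths only; substitute the real defaults.
        System.out.println(orDefault(null, "D:/example/docs"));
        System.out.println(orDefault("   ", "D:/example/docs"));
        System.out.println(orDefault(" E:/my/docs ", "D:/example/docs"));
    }
}
```

In the JSP this would be called once per parameter, e.g. `orDefault(request.getParameter("DocumentDirectory"), defaultDocs)`, where `defaultDocs` stands for whatever default the page should use.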

index.jsp

<%@ page import = "  javax.servlet.*, javax.servlet.http.*, java.io.*, org.apache.lucene.analysis.*, org.apache.lucene.analysis.standard.StandardAnalyzer, org.apache.lucene.document.*, org.apache.lucene.index.*, org.apache.lucene.search.*, org.apache.lucene.queryParser.*, org.apache.lucene.demo.*, org.apache.lucene.demo.html.Entities, java.net.URLEncoder" %>
<%@page language="java" contentType="text/html; charset=GB18030"
    pageEncoding="GB18030"%>
<%@include file="header_frame.jsp" %>
    <% 
        boolean error = false;  
        String indexName=request.getParameter("indexPath");
        IndexSearcher searcher = null;   
        Query query = null;                    
        Hits hits = null;                      
        int startindex =0;                     
        int maxpage =10;                    
        String queryString = null;              
        String startVal ="0";             
        String maxresults ="10";             
        int thispage = 0;    
  
        try {
          searcher = new IndexSearcher(indexName);                                                    
        } catch (Exception e) {                                                                             
    %>
                <p>Notice:error opening the Index</p>
    <%                error = true;                                  
        }
    %>
    <%
       if (error == false) {                                          
                queryString = request.getParameter("QueryInput");           
                startVal    =request.getParameter("startat");       
                maxresults  =request.getParameter("maxresults"); 
                try {
                        maxpage    = Integer.parseInt(maxresults);    
                        startindex = Integer.parseInt(startVal);      
                } catch (Exception e) { } 
                if (queryString == null || queryString.trim().length() == 0)
                {
                        //throw new ServletException("no query "+"specified");
                        error = true;   // nothing to parse, so skip the search below
        %>
                <h3 align="left">
                    <font face="Monotype Corsiva"  color="#9966ff">
                        <b>
                            Please enter a search keyword...
                        </b>
                    </font>
                </h3>
        <%
                } 
        %>
        <%
                Analyzer analyzer = new StandardAnalyzer();       
                try {
                        if (error == false) {
                                QueryParser qp = new QueryParser("contents", analyzer);
                                query = qp.parse(queryString); 
                        }
                } catch (ParseException e) {                         
    %>
    <%
                        error = true;   // the query string could not be parsed
                }
        }
    %>
    <%
        if (error == false && searcher != null) {                   
                                                                     
                                                                    
                thispage = maxpage;                                  
                hits = searcher.search(query);                      
                if (hits.length() == 0) {                             
    %>
                <p>Sorry, no results matched your query...</p>
    <%
                error = true;                                      
                                                                   
                }
        }
        if (error == false && searcher != null) {  
    %>
    <h3 align="left">
        <font face="Monotype Corsiva"  color="#9966ff">
            <b>
                <%=hits.length()%> results found in total...
            </b>
        </font>
    </h3>       
                <table>
    <%
                    if ((startindex + maxpage) > hits.length()) {
                            thispage = hits.length() - startindex;     
                    }       
    %>
    <%
                    for (int i = startindex; i < (thispage + startindex); i++) { 
    %>
                    <tr>
    <%                      
                            Document doc = hits.doc(i);                   
                            String doctitle = doc.get("title");            
                            String url = doc.get("path");                  
                            if (url != null && url.startsWith("../webapps/")) 
                                { 
                                    url = url.substring(10);   // strip the leading "../webapps"
                                }
                            if ((doctitle == null) || doctitle.equals("")) 
                                    doctitle = url;   // no title available, so show the path instead
                                                                         
    %>
                            <td><a href="<%=url%>"><%=doctitle%></a></td>
                    </tr>
    <%
                    }
    %>                  
                </table>
    <%
                    for(int i=0;i<7;i++)
                    {
    %>
                    <br>
    <%
                    }
    %>
                    <p align="left">
    <%
                     String first_page="index.jsp?QueryInput="+queryString+"&maxresults="+maxpage+"&startat="+"0"+"&indexPath="+indexName;
    %>
    <%               if(startindex>maxpage-1)
                     {
    %>
                      <a href="<%=first_page%>">First</a>
    <% 
                     }
                    else
                    {
    %>
                    First
    <%
                    }
    %>
     <%                if (startindex>=maxpage) 
                                {                                                                   
                                    String former_page="index.jsp?QueryInput="+queryString+"&maxresults="+maxpage+"&startat="+(startindex-maxpage)+"&indexPath="+indexName;
    %>
                   
                            <a href="<%=former_page%>">Previous</a>
                        
                  
    <%
                    }else{
    %>
                        Previous
    <%
                    }
    %>
    <%                if ( (startindex + maxpage) < hits.length()) {                                                                   
                            String next_page="index.jsp?QueryInput="+queryString+"&maxresults="+maxpage+"&startat="+(startindex+maxpage)+"&indexPath="+indexName;
    %>
                   
                            <a href="<%=next_page%>">Next</a>
                        
                  
    <%
                    }else{
    %>
                        Next
    <%
                    }
    %>                  
     <%
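                     // (hits.length()-1)/maxpage keeps the last page non-empty when the hit count is an exact multiple of maxpage
                     String end_page="index.jsp?QueryInput="+queryString+"&maxresults="+maxpage+"&startat="+maxpage*((hits.length()-1)/maxpage)+"&indexPath="+indexName;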
     %>
     <%              if((startindex + maxpage) < hits.length())
                        {
    %>
                            <a href="<%=end_page%>">Last</a>
    <% 
                             }else{
    %>
                            Last
    <%
                         }
    %>     
     </p>
    <%       }  
             if (searcher != null)
                    searcher.close();
    %>
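
The paging links in index.jsp are plain string concatenation. Pulled out into ordinary Java, the same logic might look like the sketch below. It is an illustration under assumptions, not the project's code: `PageLinks`, `lastPageStart`, and `link` are hypothetical names, and encoding the parameters as UTF-8 is an assumption (the encoding has to match whatever the container uses to decode the query string).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; not part of the original project.
class PageLinks {

    /** Start offset of the last page; stays on a non-empty page when totalHits is a multiple of pageSize. */
    static int lastPageStart(int totalHits, int pageSize) {
        if (totalHits <= 0) {
            return 0;
        }
        return ((totalHits - 1) / pageSize) * pageSize;
    }

    /** Builds an index.jsp link with URL-encoded values and '&' between the parameters. */
    static String link(String query, int pageSize, int startAt, String indexPath) {
        try {
            return "index.jsp?QueryInput=" + URLEncoder.encode(query, "UTF-8")
                    + "&maxresults=" + pageSize
                    + "&startat=" + startAt
                    + "&indexPath=" + URLEncoder.encode(indexPath, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int total = 25;
        int pageSize = 10;
        System.out.println(link("lucene index", pageSize, lastPageStart(total, pageSize), "D:/idx"));
    }
}
```

Encoding the values matters as soon as a query contains spaces, `&`, or Chinese characters, and an index path on Windows contains `:` and `/` or `\`.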
     

header_frame.jsp

<head>
    <title>Welcome to LuceneWeb - Results Page</title>
</head>
<body background="D:/Program Files/MIR Design/eclise workplace/LuceneWebApplication/luceneweb.jpg">
    <center>
        <font face="Monotype Corsiva" color="#00ffff" size="100">
            <b>
                LuceneWeb
            </b>
        </font>
    </center>
    <h1 align="left">
        <font face="Monotype Corsiva"  color="#9966ff">
            <b>
                Search Result:
            </b>
        </font>
    </h1>

CreateIndex.java (this class builds the index for the various file formats)

package Index;
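// Builds the index and returns its location. Robustness is seriously lacking here because exceptions are not handled; this is only a demo that assumes everything goes normally. It needs improvement, but time is too tight right now.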
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.demo.html.HTMLParser;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.poi.hslf.HSLFSlideShow;
import org.apache.poi.hslf.extractor.PowerPointExtractor;
import org.apache.poi.hslf.model.TextRun;
import org.apache.poi.hslf.usermodel.SlideShow;
import org.apache.poi.hssf.model.Workbook;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;
import org.pdfbox.util.PDFTextStripper;
import org.apache.poi.hssf.extractor.ExcelExtractor;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
public class CreateIndex{
    private String IndexPath="begin1";
    String index_directory=null;
    String document_directory=null;
    String execu=null;
    // Used to create the index for .txt, .pdf, .doc, .html, .ppt, and .xls files
    public String get_index()
    {
        return IndexPath;
    }
    public void create_index(String document_directory,String index_directory) throws FileNotFoundException,IOException
    {
        this.document_directory=document_directory;
        this.index_directory=index_directory;
        File documentDir=new File(this.document_directory);
        File indexDir=new File(this.index_directory);
        StandardAnalyzer luceneAnalyzer=new StandardAnalyzer();
        File datafiles[]=documentDir.listFiles();
        IndexWriter indexWriter;
        indexWriter=new IndexWriter(indexDir,luceneAnalyzer,true);
        for(int i=0;i<datafiles.length;i++)
        {
            // Index .txt files
            if(datafiles[i].isFile()&&datafiles[i].getName().endsWith(".txt"))
            {
                Document document=new Document();
                Reader txtReader=new FileReader(datafiles[i]);
                document.add(new Field("path",datafiles[i].getCanonicalPath(),Field.Store.YES,Field.Index.TOKENIZED));
                document.add(new Field("contents",txtReader));
                indexWriter.addDocument(document);
            }
            // Index .html files
            if(datafiles[i].isFile()&&datafiles[i].getName().endsWith(".html"))
            {
                Document document = new Document();
                FileInputStream file_input_stream = new FileInputStream(datafiles[i]);
                HTMLParser parser = new HTMLParser(file_input_stream);
                document.add(new Field("path",datafiles[i].getCanonicalPath(),Field.Store.YES,Field.Index.TOKENIZED));
                document.add(new Field("contents", parser.getReader()));
                indexWriter.addDocument(document);
            }
            // Index .pdf files
            if(datafiles[i].isFile()&&datafiles[i].getName().endsWith(".pdf"))
            {
                Document document = new Document();
                FileInputStream file_input_stream = new FileInputStream(datafiles[i]);
                PDFParser parser=new PDFParser(file_input_stream);
                parser.parse();
                COSDocument cosdoc=parser.getDocument();
                PDFTextStripper stripper=new PDFTextStripper();
                String docText=stripper.getText(new PDDocument(cosdoc));
                document.add(new Field("path",datafiles[i].getCanonicalPath(),Field.Store.YES,Field.Index.TOKENIZED));
                document.add(new Field("contents",docText,Field.Store.YES,Field.Index.TOKENIZED));
                //document = LucenePDFDocument.getDocument(datafiles[i]); 
                indexWriter.addDocument(document);  
            }
            // Index .doc files
            if(datafiles[i].isFile()&&datafiles[i].getName().endsWith(".doc"))
            {
                Document document = new Document();
                FileInputStream file_input_stream = new FileInputStream(datafiles[i]);
                BufferedInputStream input_stream_buffer = new BufferedInputStream(file_input_stream);
                WordExtractor doc_extractor = new WordExtractor(input_stream_buffer);  
                document.add(new Field("path",datafiles[i].getCanonicalPath(),Field.Store.YES,Field.Index.TOKENIZED));
                document.add(new Field("contents",doc_extractor.getText(),Field.Store.YES,Field.Index.TOKENIZED));
                indexWriter.addDocument(document);
            }
            // Index .ppt files. This part kept failing with the message: no such entry: "PowerPoint Document".
            // From what I found online it is probably an environment problem. Parsing DOC, Excel, and PPT should all work the same way,
            // using the .doc/.ppt/.xls extractors bundled with POI, yet for some reason only PPT fails.
            // Caution: a ';' directly after the if condition turns the block below into an unconditional one that runs for every file;
            // feeding a .doc or .xls file to PowerPointExtractor can produce exactly this "no such entry" error.
            /*if(datafiles[i].isFile()&&datafiles[i].getName().endsWith(".ppt"))
            {
               Document document=new Document();
               FileInputStream file_input_stream = new FileInputStream(datafiles[i]);
               BufferedInputStream input_stream_buffer = new BufferedInputStream(file_input_stream);
               //HSLFSlideShow slide_show=new HSLFSlideShow(input_stream_buffer);
               PowerPointExtractor ppt_extractor = new PowerPointExtractor(input_stream_buffer);  
               document.add(new Field("path",datafiles[i].getCanonicalPath(),Field.Store.YES,Field.Index.TOKENIZED));
               document.add(new Field("contents",ppt_extractor.getText(),Field.Store.NO,Field.Index.TOKENIZED));
               indexWriter.addDocument(document);
            }*/
            // Index .xls (Excel) files
            if(datafiles[i].isFile()&&datafiles[i].getName().endsWith(".xls"))
            {
                Document document=new Document();
                FileInputStream file_input_stream = new FileInputStream(datafiles[i]);
                BufferedInputStream input_stream_buffer = new BufferedInputStream(file_input_stream);
                HSSFWorkbook hssf=new HSSFWorkbook(input_stream_buffer);
                ExcelExtractor xls_extractor=new ExcelExtractor(hssf);
                document.add(new Field("path",datafiles[i].getCanonicalPath(),Field.Store.YES,Field.Index.TOKENIZED));
                document.add(new Field("contents",xls_extractor.getText(),Field.Store.NO,Field.Index.TOKENIZED));
                indexWriter.addDocument(document);
            }   
        }   
        IndexPath=index_directory;
        indexWriter.optimize();
        indexWriter.close();
    }   
}   

One last note: this program is not robust, because it does almost no exception handling. I had no time to think that far...
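
One cheap step toward robustness would be to isolate failures per file, so that a single corrupt document does not abort the whole indexing run. The sketch below shows only that control flow, using the standard library. `SafeIndexingSketch`, `TextExtractor`, `extensionOf`, and `indexAll` are hypothetical names; `TextExtractor` is a stand-in for the format-specific POI, PDFBox, and HTML parsers, and adding a name to the `indexed` list stands in for `indexWriter.addDocument(...)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; not part of the original project.
class SafeIndexingSketch {

    /** Stand-in for a format-specific parser (Word, Excel, PDF, ...). */
    interface TextExtractor {
        String extract(String fileName) throws Exception;
    }

    /** Lower-cased extension including the dot, or "" if the name has none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot).toLowerCase(Locale.ROOT);
    }

    /**
     * Runs the matching extractor for each file. Successfully processed names are
     * appended to 'indexed'; files whose extractor throws are collected and returned,
     * and the loop carries on with the next file.
     */
    static List<String> indexAll(List<String> fileNames,
                                 Map<String, TextExtractor> extractors,
                                 List<String> indexed) {
        List<String> failed = new ArrayList<String>();
        for (String name : fileNames) {
            TextExtractor extractor = extractors.get(extensionOf(name));
            if (extractor == null) {
                continue; // unsupported format: skip silently
            }
            try {
                extractor.extract(name);
                indexed.add(name); // real code would build a Document and add it to the index here
            } catch (Exception e) {
                failed.add(name);  // remember the failure, keep indexing the rest
            }
        }
        return failed;
    }
}
```

Looking up the extractor by lower-cased extension also means files named `REPORT.PDF` or `notes.TXT` are picked up, which a case-sensitive `endsWith` check would miss.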
