Full-Text Search with Lucene (5): Queries


I. Brief overview:

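Analyzers and the index (the index store, the index table, the data) all exist to serve the final search. I find this diagram especially important: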

    This diagram shows the very core of how Lucene works: building the index store (made up of the index table and the data).

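    Index table: an index over the document collection. The index is built with an analyzer, and the same analyzer must be used again at search time. Inside the document collection, Lucene automatically maintains an internal document number, comparable to the auto-increment primary key of a Hibernate-mapped entity (although, unlike a primary key, this number can change when index segments are merged).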

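    The index table is built and maintained through the analyzer. When maintaining the index, an update is usually done as a delete followed by an add, which sidesteps the performance cost of updating in place.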
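To make the index table idea concrete, here is a minimal sketch in plain Java. It is not Lucene's actual data structure: `TinyIndex`, its whitespace-and-lowercase "analyzer" and all method names are assumptions made up for illustration. It shows a term-to-document-number table, the same analysis step used for both indexing and searching, and an update done as delete-then-add.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy inverted index (hypothetical): term -> sorted set of internal document numbers. */
final class TinyIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private final Map<Integer, String> documents = new HashMap<>();
    private int nextDocNumber = 0; // internal number, assigned automatically

    /** Stand-in analyzer: lowercase, split on whitespace. Used for indexing AND searching. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    /** Stores the document and records each of its terms in the index table. */
    int add(String text) {
        int docNumber = nextDocNumber++;
        documents.put(docNumber, text);
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docNumber);
        }
        return docNumber;
    }

    /** Removes the document and every index-table entry that points at it. */
    void delete(int docNumber) {
        String text = documents.remove(docNumber);
        if (text == null) return;
        for (String term : analyze(text)) {
            TreeSet<Integer> docs = postings.get(term);
            if (docs != null) {
                docs.remove(docNumber);
                if (docs.isEmpty()) postings.remove(term);
            }
        }
    }

    /** "Update" = delete the old document, then add the new one (it gets a new number). */
    int update(int docNumber, String newText) {
        delete(docNumber);
        return add(newText);
    }

    /** Looks up a single keyword, analyzed the same way as the indexed text. */
    List<Integer> search(String keyword) {
        List<String> terms = analyze(keyword);
        if (terms.isEmpty()) return new ArrayList<>();
        TreeSet<Integer> docs = postings.get(terms.get(0));
        return docs == null ? new ArrayList<>() : new ArrayList<>(docs);
    }
}
```

Because the keyword goes through the same `analyze` step as the indexed text, searching for "ROOM" finds documents indexed as "Room". That is the reason for using the same analyzer on both sides.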


II. About queries:

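    By analogy with Hibernate, Lucene offers two ways to query: query strings and query objects. Even when a query string is used, the program normally goes on to turn the query conditions into a query object.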

1. Query strings:

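    A QueryParser takes the query text and the field(s) to search and parses them into a Query object; an IndexSearcher then runs the search with that Query object as its argument.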

    The earlier articles all used this kind of query:

/**
 * Search
 *
 * IndexSearcher is used to run queries against the index store.
 */
@Test
public void search() throws Exception {
    // String queryString = "document";
    String queryString = "adddocument";

    // 1. Parse the search text into a Query
    String[] fields = { "name", "content" };
    QueryParser queryParser = new MultiFieldQueryParser(fields, analyzer);
    Query query = queryParser.parse(queryString);

    // 2. Run the query
    IndexSearcher indexSearcher = new IndexSearcher(indexPath);
    Filter filter = null;
    TopDocs topDocs = indexSearcher.search(query, filter, 10000);
    System.out.println("Total of [" + topDocs.totalHits + "] matching results");

    // 3. Print the results
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        int docSn = scoreDoc.doc; // internal document number
        Document doc = indexSearcher.doc(docSn); // fetch the document by its number
        File2DocumentUtils.printDocumentInfo(doc); // print the document info
    }
    indexSearcher.close(); // release the index files
}

Below is a string query that adds per-field weights (boosts) and paging:

public QueryResult search(String queryString, int firstResult, int maxResults) {
    try {
        // 1. Parse the search text into a Query
        String[] fields = { "name", "content" };
        // Boosts: a keyword found in the title should score higher than one found in the content
        Map<String, Float> boosts = new HashMap<String, Float>();
        boosts.put("name", 3f);
        // boosts.put("content", 1.0f); // 1.0f is the default
        QueryParser queryParser = new MultiFieldQueryParser(fields, analyzer, boosts);
        Query query = queryParser.parse(queryString);
        return search(query, firstResult, maxResults);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}
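To see why a boost of 3 on "name" changes the ranking, here is a deliberately simplified sketch. It is not Lucene's real scoring formula (which also considers term rarity, field length and more): the score here is just boost times the number of occurrences, summed over fields. `BoostSketch` and its methods are hypothetical names.

```java
import java.util.Map;

/** Simplified illustration of per-field boosts (hypothetical, not Lucene's scoring). */
final class BoostSketch {
    /** Counts non-overlapping occurrences of keyword in text. */
    static int count(String text, String keyword) {
        if (keyword.isEmpty()) return 0;
        int n = 0;
        for (int i = text.indexOf(keyword); i >= 0; i = text.indexOf(keyword, i + keyword.length())) {
            n++;
        }
        return n;
    }

    /** score = sum over fields of (boost * occurrences); fields without a boost use 1.0f. */
    static float score(Map<String, String> doc, String keyword, Map<String, Float> boosts) {
        float total = 0f;
        for (Map.Entry<String, String> field : doc.entrySet()) {
            float boost = boosts.getOrDefault(field.getKey(), 1.0f);
            total += boost * count(field.getValue(), keyword);
        }
        return total;
    }
}
```

With no boosts, a document that mentions the keyword twice in its content outranks one that has it once in its name; with "name" boosted to 3, the order flips, which is the intent of the `boosts` map in the method above.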


For example, a call to the wrapper above:

public void testQueryString() {
    // String queryString = "+content:\"绅士 饭店\"~2 -size:[000000000000dw TO 000000000000rs]";
    // String queryString = "content:\"绅士 饭店\"~2 AND size:[000000000000dw TO 000000000000rs]";
    // String queryString = "content:\"绅士 饭店\"~2 OR size:[000000000000dw TO 000000000000rs]";
    // String queryString = "(content:\"绅士 饭店\"~2 NOT size:[000000000000dw TO 000000000000rs])";
    // String queryString = "-content:\"绅士 饭店\"~2 AND -size:[000000000000dw TO 000000000000rs]";
    // String queryString = "-content:\"绅士 饭店\"~2 OR -size:[000000000000dw TO 000000000000rs]";
    String queryString = "-content:\"绅士 饭店\"~2 NOT -size:[000000000000dw TO 000000000000rs]";

    QueryResult qr = indexDao.search(queryString, 0, 10);
    System.out.println("Total of [" + qr.getRecordCount() + "] matching results");
    for (Document doc : qr.getRecordList()) {
        File2DocumentUtils.printDocumentInfo(doc);
    }
}


2. Query objects:

    Lucene's query objects include TermQuery, RangeQuery, WildcardQuery, PhraseQuery and BooleanQuery (term query, range query, wildcard query, phrase query, and combined multi-condition query).

    The wrapper for object queries:

public QueryResult search(Query query, int firstResult, int maxResults) {
    IndexSearcher indexSearcher = null;
    try {
        // 2. Run the query
        indexSearcher = new IndexSearcher(indexPath);
        Filter filter = new RangeFilter("size", NumberTools.longToString(200),
                NumberTools.longToString(1000), true, true);

        // ========== Sorting
        Sort sort = new Sort();
        sort.setSort(new SortField("size")); // ascending by default
        // sort.setSort(new SortField("size", true)); // reversed (descending)
        // ==========

        TopDocs topDocs = indexSearcher.search(query, filter, 10000, sort);
        int recordCount = topDocs.totalHits;
        List<Document> recordList = new ArrayList<Document>();

        // ============== Prepare the highlighter
        Formatter formatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");
        Scorer scorer = new QueryScorer(query);
        Highlighter highlighter = new Highlighter(formatter, scorer);
        Fragmenter fragmenter = new SimpleFragmenter(50);
        highlighter.setTextFragmenter(fragmenter);
        // ==============

        // 3. Collect the documents for the current page.
        // scoreDocs holds at most the 10000 hits requested, which can be fewer than totalHits.
        int end = Math.min(firstResult + maxResults, topDocs.scoreDocs.length);
        for (int i = firstResult; i < end; i++) {
            ScoreDoc scoreDoc = topDocs.scoreDocs[i];
            int docSn = scoreDoc.doc; // internal document number
            Document doc = indexSearcher.doc(docSn); // fetch the document by its number

            // =========== Highlighting
            // Returns the highlighted fragment, or null if the keyword does not occur in this field value
            String hc = highlighter.getBestFragment(analyzer, "content", doc.get("content"));
            if (hc == null) {
                String content = doc.get("content");
                int endIndex = Math.min(50, content.length());
                hc = content.substring(0, endIndex); // at most the first 50 characters
            }
            doc.getField("content").setValue(hc);
            // ===========

            recordList.add(doc);
        }

        // Return the result
        return new QueryResult(recordCount, recordList);
    } catch (Exception e) {
        throw new RuntimeException(e);
    } finally {
        if (indexSearcher != null) { // the constructor itself may have thrown
            try {
                indexSearcher.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
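The paging step in the wrapper boils down to cutting a window out of the hit list. Below is a minimal sketch of that window calculation in plain Java; `PagingSketch.page` is a hypothetical helper, not part of Lucene or of the code above. It clamps both ends so that a page past the last hit comes back empty instead of throwing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: cut one page out of an already ranked hit list. */
final class PagingSketch {
    /** Returns hits[firstResult, firstResult + maxResults), clamped to the hits actually available. */
    static <T> List<T> page(List<T> hits, int firstResult, int maxResults) {
        int start = Math.max(0, Math.min(firstResult, hits.size()));
        // long arithmetic avoids overflow when maxResults is very large
        int end = (int) Math.min((long) hits.size(), (long) start + Math.max(0, maxResults));
        return new ArrayList<>(hits.subList(start, end));
    }
}
```

For example, with 25 hits, `page(hits, 20, 10)` returns the last five and `page(hits, 30, 10)` returns an empty list.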



Highlighting, paging and sorting will be covered in later posts.

1) TermQuery, term (keyword) query:

/**
 * Term query
 *
 * name:room
 */
@Test
public void testTermQuery() {
    // Term term = new Term("name", "房间");
    // Term term = new Term("name", "Room"); // English terms are all lowercase in the index
    Term term = new Term("name", "room");
    Query query = new TermQuery(term);
    queryAndPrintResult(query);
}

2) RangeQuery, range query: with or without the boundaries

/**
 * Range query
 *
 * Boundaries included: size:[0000000000001e TO 000000000000rs]
 *
 * Boundaries excluded: size:{0000000000001e TO 000000000000rs}
 */
@Test
public void testRangeQuery() {
    Term lowerTerm = new Term("size", NumberTools.longToString(50));
    Term upperTerm = new Term("size", NumberTools.longToString(1000));
    Query query = new RangeQuery(lowerTerm, upperTerm, false); // false: boundaries excluded
    queryAndPrintResult(query);
}
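The range is compared as text, which is why the numbers are first turned into fixed-width strings such as 0000000000001e (50) and 000000000000rs (1000). The sketch below is a stand-in written for this explanation, not the source of NumberTools: the base 36 and the width of 14 are assumptions inferred from those strings, and it handles non-negative values only.

```java
/** Hypothetical stand-in for a sortable long-to-string encoding (non-negative values only). */
final class SortableLong {
    private static final int WIDTH = 14; // assumed from the 14-character strings in the article

    /** Zero-padded base-36 form, so that string order equals numeric order. */
    static String encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("this sketch handles non-negative values only");
        }
        String digits = Long.toString(value, 36);
        StringBuilder sb = new StringBuilder(WIDTH);
        for (int i = digits.length(); i < WIDTH; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    /** Inverse of encode: parses the padded base-36 text back into a long. */
    static long decode(String text) {
        return Long.parseLong(text, 36);
    }
}
```

Compared as plain decimal text, "50" sorts after "1000"; after padding, the encoded form of 50 sorts before that of 1000, so a textual range behaves like a numeric one.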


3) WildcardQuery, wildcard query: '?' stands for exactly one character, '*' for zero or more characters

/**
 * Wildcard query
 *
 * '?' stands for exactly one character, '*' for zero or more characters
 * name:房*
 * name:*o*
 * name:roo?
 */
@Test
public void testWildcardQuery() {
    Term term = new Term("name", "roo?");
    // Term term = new Term("name", "ro*"); // prefix query, see PrefixQuery
    // Term term = new Term("name", "*o*");
    // Term term = new Term("name", "房*");
    Query query = new WildcardQuery(term);
    queryAndPrintResult(query);
}
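The two wildcard rules can be checked without an index. The sketch below translates a wildcard pattern into a regular expression and tests it against a single term; `WildcardSketch` is a hypothetical helper for illustration, and it says nothing about how Lucene itself walks the term list.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: '?' = exactly one character, '*' = zero or more characters. */
final class WildcardSketch {
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?' || c == '*') {
                if (literal.length() > 0) {
                    // quote literal text so characters like '.' are not treated as regex syntax
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '?' ? "." : ".*");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** True when the whole term matches the wildcard pattern. */
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }
}
```

So `roo?` matches "room" but neither "roo" nor "rooms", while `ro*` also matches "ro" itself.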

4) PhraseQuery, phrase query:

/**
 * Phrase query
 *
 * content:"? 绅士 ? ? 饭店"
 *
 * content:"绅士 饭店"~2
 */
@Test
public void testPhraseQuery() {
    PhraseQuery phraseQuery = new PhraseQuery();
    // Explicit positions:
    // phraseQuery.add(new Term("content", "绅士"), 1);
    // phraseQuery.add(new Term("content", "饭店"), 4);
    phraseQuery.add(new Term("content", "绅士"));
    phraseQuery.add(new Term("content", "饭店"));
    phraseQuery.setSlop(2); // allow up to 2 other terms in between
    queryAndPrintResult(phraseQuery);
}
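The slop value can be read as "how many other terms may sit between the two words". The sketch below covers only that simple case: two terms, in the given order, over a list of tokens. It is an assumption-laden simplification; Lucene's real slop is an edit distance over positions and can also match reordered terms at a higher cost, which this sketch ignores. `SlopSketch` is a hypothetical name.

```java
import java.util.List;

/** Hypothetical two-term, in-order approximation of a sloppy phrase match. */
final class SlopSketch {
    /** True if `second` follows `first` with at most `slop` tokens in between. */
    static boolean matchesInOrder(List<String> tokens, String first, String second, int slop) {
        int lastFirst = -1; // position of the most recent occurrence of `first`
        for (int pos = 0; pos < tokens.size(); pos++) {
            String token = tokens.get(pos);
            if (token.equals(second) && lastFirst >= 0 && pos - lastFirst - 1 <= slop) {
                return true;
            }
            if (token.equals(first)) {
                lastFirst = pos;
            }
        }
        return false;
    }
}
```

With "绅士" at position 1 and "饭店" at position 4 there are two tokens in between, so a slop of 2 matches and a slop of 1 does not. This mirrors the two equivalent forms in the comment above.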

5) BooleanQuery, combined query:

/**
 * +content:"绅士 饭店"~2 -size:[000000000000dw TO 000000000000rs]
 * +content:"绅士 饭店"~2 +size:[000000000000dw TO 000000000000rs]
 * content:"绅士 饭店"~2 size:[000000000000dw TO 000000000000rs]
 * +content:"绅士 饭店"~2 size:[000000000000dw TO 000000000000rs]
 */
@Test
public void testBooleanQuery() {
    // Condition 1
    PhraseQuery query1 = new PhraseQuery();
    query1.add(new Term("content", "绅士"));
    query1.add(new Term("content", "饭店"));
    query1.setSlop(2);

    // Condition 2
    Term lowerTerm = new Term("size", NumberTools.longToString(500));
    Term upperTerm = new Term("size", NumberTools.longToString(1000));
    // true means the boundaries are included
    Query query2 = new RangeQuery(lowerTerm, upperTerm, true);

    // Combine
    BooleanQuery boolQuery = new BooleanQuery();
    boolQuery.add(query1, Occur.MUST);
    boolQuery.add(query2, Occur.SHOULD);
    queryAndPrintResult(boolQuery);
}
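The four query strings in the comment differ only in which clauses are required (+), optional (no prefix) or prohibited (-). The sketch below models which documents end up in the result using plain sets of document numbers. It is a simplified model: `BooleanSketch.combine` is a hypothetical helper, and scoring (where optional clauses matter even next to required ones) is left out.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical set-based model of MUST / SHOULD / MUST_NOT clause combination. */
final class BooleanSketch {
    /** Each list element is the set of document numbers matched by one clause of that kind. */
    static Set<Integer> combine(List<Set<Integer>> must,
                                List<Set<Integer>> should,
                                List<Set<Integer>> mustNot) {
        Set<Integer> result = new TreeSet<>();
        if (!must.isEmpty()) {
            // every MUST clause has to match; SHOULD clauses then only influence ranking
            result.addAll(must.get(0));
            for (Set<Integer> clause : must) {
                result.retainAll(clause);
            }
        } else {
            // with no MUST clause, at least one SHOULD clause has to match
            for (Set<Integer> clause : should) {
                result.addAll(clause);
            }
        }
        for (Set<Integer> clause : mustNot) {
            result.removeAll(clause); // prohibited clauses only ever remove documents
        }
        return result;
    }
}
```

One consequence of this model is that a query made only of prohibited clauses selects nothing, because there is no positive clause to supply candidate documents in the first place.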

PS: queryAndPrintResult wraps running the query and printing the results:

public void queryAndPrintResult(Query query) {
    System.out.println("Equivalent query string: " + query);
    QueryResult qr = indexDao.search(query, 0, 100);
    System.out.println("Total of [" + qr.getRecordCount() + "] matching results");
    for (Document doc : qr.getRecordList()) {
        File2DocumentUtils.printDocumentInfo(doc);
    }
}


Summary:

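    Querying is the core of Lucene. The earlier articles wrapped adding, deleting and updating index entries; this one focuses on Lucene's index-based queries and on the code for the different query strategies.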

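    The focus here is on queries, so highlighting, sorting, boosts, paging and the conversion between String and Long are not explained in depth; later posts will fill in those gaps. If anything here is wrong, corrections are welcome.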
