10. Querying the Index, Part 4: Advanced Search Techniques in Lucene

Source: Internet  Published: webpack node env  Editor: 程序博客网  Time: 2024/05/29 23:45

Advanced Search Techniques in Lucene

We start with SpanTermQuery. It is used much like TermQuery; the only difference is that SpanTermQuery also exposes the span (position) information of the matched Term. Usage:

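// Imports assume the pre-9.0 Lucene package layout (span queries in org.apache.lucene.search.spans).
import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

@Test
public void testSpanTermQuery() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex"));
    // Create an IndexReader
    IndexReader indexReader = DirectoryReader.open(directory);
    // Create an IndexSearcher
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Match the term "new" in the "text" field
    SpanQuery query = new SpanTermQuery(new Term("text", "new"));
    // Run the query
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Total hits: " + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Fetch the Document
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.println(document.get("text"));
    }
    indexSearcher.getIndexReader().close();
}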

SpanNearQuery matches on the distance (span) between two Terms. Usage:

@Test
public void testSpanNearQuery() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex"));
    // Create an IndexReader
    IndexReader indexReader = DirectoryReader.open(directory);
    // Create an IndexSearcher
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    SpanQuery queryStart = new SpanTermQuery(new Term("text", "there"));
    SpanQuery queryEnd = new SpanTermQuery(new Term("text", "contrib"));
    /*
     * Source text: there is a new QueryParser in contrib, which matches the same syntax as this class,
     * but is more modular, enabling substantial customization to how a query is created.
     *
     * SpanNearQuery matches on the distance between Terms, i.e. how many positions lie between one
     * Term and the other. slop is the upper limit on that distance: two Terms separated by an
     * enormous number of positions should not count as "near", so slop caps the gap.
     *
     * inOrder controls whether reversed matches are allowed. A match from TermA to TermB does not
     * have to run left to right; right to left is the reversed order. inOrder = true means order
     * matters and the clauses must match in the order given; false allows either order.
     * The author notes that stop words are not counted toward slop; whether that holds depends on
     * how the analyzer handles position increments for removed stop words.
     *
     * slop: as explained earlier, slop is the size of the gap. In this example slop = 4 does not
     * match; only slop >= 5 matches.
     */
    SpanNearQuery query = new SpanNearQuery(new SpanQuery[]{queryStart, queryEnd}, 5, true);
    // Run the query
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Total hits: " + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Fetch the Document
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.println(document.get("text"));
    }
    indexSearcher.getIndexReader().close();
}
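To make the slop and inOrder rules concrete, here is a minimal plain-Java sketch that models them on a list of tokens. It does not use Lucene; `SpanNearModel` and `matches` are hypothetical names, and the model assumes one token per position with no stop-word removal. For two single terms, the gap is the number of positions strictly between them.

```java
final class SpanNearModel {
    /**
     * Returns true if some occurrence of `first` and some occurrence of `second`
     * have at most `slop` positions between them. With inOrder = true, `first`
     * must appear before `second`.
     */
    static boolean matches(String[] tokens, String first, String second,
                           int slop, boolean inOrder) {
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) continue;
            for (int j = 0; j < tokens.length; j++) {
                if (i == j || !tokens[j].equals(second)) continue;
                if (inOrder && j < i) continue;   // reversed order not allowed
                int gap = Math.abs(j - i) - 1;    // positions strictly between
                if (gap <= slop) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Positions: there=0 is=1 a=2 new=3 queryparser=4 in=5 contrib=6
        String[] tokens = "there is a new queryparser in contrib".split(" ");
        System.out.println(matches(tokens, "there", "contrib", 5, true));  // true
        System.out.println(matches(tokens, "there", "contrib", 4, true));  // false
        System.out.println(matches(tokens, "contrib", "there", 5, true));  // false
        System.out.println(matches(tokens, "contrib", "there", 5, false)); // true
    }
}
```

Five tokens sit between "there" and "contrib" in this model, which agrees with the observation that slop 4 fails and slop 5 succeeds.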

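SpanNotQuery: when you use SpanNearQuery, there can be several ways to get from TermA to TermB, because TermA or TermB may occur more than once in the indexed text. SpanNotQuery requires that TermC does not occur between TermA and TermB, which rules out some of those cases and gives finer control. Usage: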

@Test
public void testSpanNotQuery() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex2"));
    // Create an IndexReader
    IndexReader indexReader = DirectoryReader.open(directory);
    SpanQuery queryStart = new SpanTermQuery(new Term("text", "there"));
    SpanQuery queryEnd = new SpanTermQuery(new Term("text", "contrib"));
    SpanQuery excludeQuery = new SpanTermQuery(new Term("text", "new"));
    /*
     * Source text: there is a new QueryParser in contrib, which matches the same syntax as this class,
     * but is more modular, enabling substantial customization to how a query is created.
     */
    SpanNearQuery spanquery = new SpanNearQuery(new SpanQuery[]{queryStart, queryEnd}, 5, true);
    // First argument: the spans to include; second argument: the spans to exclude
    SpanQuery query = new SpanNotQuery(spanquery, excludeQuery);
    // Create an IndexSearcher
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Run the query
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Total hits: " + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Fetch the Document
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.println(document.get("text"));
    }
    indexSearcher.getIndexReader().close();
}
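The exclusion rule can be sketched the same way in plain Java, without Lucene. `SpanNotModel` and its methods are hypothetical names; the model assumes one token per position and treats an ordered near-match from `first` to `second` as the span covering both terms and everything between them. A span survives only if the excluded term does not occur inside it.

```java
final class SpanNotModel {
    /** True if `term` occurs at any position in the half-open range [start, end). */
    static boolean containsTerm(String[] tokens, int start, int end, String term) {
        for (int pos = start; pos < end && pos < tokens.length; pos++) {
            if (tokens[pos].equals(term)) return true;
        }
        return false;
    }

    /**
     * True if some ordered first..second span with at most `slop` positions in
     * between does not contain the excluded term.
     */
    static boolean matches(String[] tokens, String first, String second,
                           int slop, String excluded) {
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) continue;
            for (int j = i + 1; j < tokens.length; j++) {
                if (!tokens[j].equals(second)) continue;
                if (j - i - 1 > slop) continue;                  // too far apart
                if (!containsTerm(tokens, i, j + 1, excluded)) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] tokens =
            "there is a new queryparser in contrib which matches the same syntax as this class"
                .split(" ");
        // "new" sits between "there" and "contrib", so the only span is rejected.
        System.out.println(matches(tokens, "there", "contrib", 5, "new"));   // false
        // "class" lies outside the span, so the span is kept.
        System.out.println(matches(tokens, "there", "contrib", 5, "class")); // true
    }
}
```

Under this model, excluding "new" from the there..contrib span of the sample sentence leaves no match, because "new" falls inside the only candidate span.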

SpanOrQuery, as the name suggests, joins several SpanQuery clauses with OR. You could use BooleanQuery instead, but SpanOrQuery additionally returns the span (position) information. Usage:

@Test
public void testSpanOrQuery() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex2"));
    // Create an IndexReader
    IndexReader indexReader = DirectoryReader.open(directory);
    SpanQuery queryStart = new SpanTermQuery(new Term("text", "there"));
    SpanQuery queryEnd = new SpanTermQuery(new Term("text", "contrib"));
    SpanQuery newTermQuery = new SpanTermQuery(new Term("text", "new"));
    /*
     * Source text: there is a new QueryParser in contrib, which matches the same syntax as this class,
     * but is more modular, enabling substantial customization to how a query is created.
     *
     * SpanOrQuery joins several SpanQuery clauses with OR. BooleanQuery could be used instead,
     * but SpanOrQuery additionally returns the span information.
     */
    SpanNearQuery spanquery = new SpanNearQuery(new SpanQuery[]{queryStart, queryEnd}, 5, true);
    // A document matches if either clause matches
    SpanOrQuery query = new SpanOrQuery(spanquery, newTermQuery);
    // Create an IndexSearcher
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Run the query
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Total hits: " + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Fetch the Document
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.println(document.get("text"));
    }
    indexSearcher.getIndexReader().close();
}

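SpanPositionRangeQuery restricts matches to those whose span lies within the position range from start to end (the span must start at or after start and end at or before end). Positions are counted from zero. Usage: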

@Test
public void testSpanPositionRangeQuery() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex2"));
    // Create an IndexReader
    IndexReader indexReader = DirectoryReader.open(directory);
    FuzzyQuery fQuery = new FuzzyQuery(new Term("text", "conerib"));
    SpanQuery startEnd = new SpanMultiTermQueryWrapper<FuzzyQuery>(fQuery);
    /*
     * Source text: there is a new QueryParser in contrib, which matches the same syntax as this class,
     * but is more modular, enabling substantial customization to how a query is created.
     *
     * First, the FuzzyQuery on "conerib" finds documents containing words similar to "contrib".
     * Then SpanMultiTermQueryWrapper turns the FuzzyQuery into a SpanQuery, and
     * SpanPositionRangeQuery restricts where the matches may fall: words similar to "conerib"
     * must lie within the position range 3 to 10.
     */
    Query query = new SpanPositionRangeQuery(startEnd, 3, 10);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Run the query
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Total hits: " + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Fetch the Document
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.println(document.get("text"));
    }
    indexSearcher.getIndexReader().close();
}

SpanFirstQuery restricts matches to the beginning of the field. Usage:

@Test
public void testSpanFirstQuery() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex2"));
    // Create an IndexReader
    IndexReader indexReader = DirectoryReader.open(directory);
    FuzzyQuery fQuery = new FuzzyQuery(new Term("text", "conerib"));
    SpanQuery startEnd = new SpanMultiTermQueryWrapper<FuzzyQuery>(fQuery);
    /*
     * Source text: there is a new QueryParser in contrib, which matches the same syntax as this class,
     * but is more modular, enabling substantial customization to how a query is created.
     *
     * The principle is the same as SpanPositionRangeQuery; it just takes one argument fewer.
     * Its constructor shows why: it sets start to 0.
     */
    Query query = new SpanFirstQuery(startEnd, 10);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Run the query
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Total hits: " + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Fetch the Document
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.println(document.get("text"));
    }
    indexSearcher.getIndexReader().close();
}
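The relationship between the position-range check and the "first" check can be illustrated with a small plain-Java model (no Lucene involved). `PositionRangeModel`, `inRange`, and `first` are hypothetical names, and the model assumes single-token spans at zero-based positions: a term at position p occupies the span from p to p + 1, and it is accepted when p >= start and p + 1 <= end.

```java
final class PositionRangeModel {
    /** True if `term` occurs at a position p with p >= start and p + 1 <= end. */
    static boolean inRange(String[] tokens, String term, int start, int end) {
        for (int p = 0; p < tokens.length; p++) {
            if (tokens[p].equals(term) && p >= start && p + 1 <= end) {
                return true;
            }
        }
        return false;
    }

    /** The "first" check is the range check with start fixed at 0. */
    static boolean first(String[] tokens, String term, int end) {
        return inRange(tokens, term, 0, end);
    }

    public static void main(String[] args) {
        // Positions: there=0 is=1 a=2 new=3 queryparser=4 in=5 contrib=6
        String[] tokens = "there is a new queryparser in contrib".split(" ");
        System.out.println(inRange(tokens, "contrib", 3, 10)); // true
        System.out.println(first(tokens, "contrib", 10));      // true
        System.out.println(first(tokens, "contrib", 6));       // false: span ends at 7
        System.out.println(inRange(tokens, "there", 3, 10));   // false: starts before 3
    }
}
```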

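FieldMaskingSpanQuery is used to query across multiple fields: it makes a span query on one field appear to be on another, so that the clauses look as if they all belong to the same field. It exists because Lucene by default applies a condition to a single field only; span queries cannot be combined across fields and must all be on the same one.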

@Test
public void testFieldMaskingSpanQuery() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex2"));
    // Create an IndexReader
    IndexReader indexReader = DirectoryReader.open(directory);
    SpanQuery queryStart = new SpanTermQuery(new Term("text", "there"));
    SpanQuery queryEnd = new SpanTermQuery(new Term("text", "new"));
    // Present queryEnd as if it were a query on the "text" field.
    // Note: queryEnd is already on "text" here, so the mask has no visible effect in this
    // example; in practice the wrapped query would target a different field.
    SpanQuery startEnd = new FieldMaskingSpanQuery(queryEnd, "text");
    /*
     * Source text: there is a new QueryParser in contrib, which matches the same syntax as this class,
     * but is more modular, enabling substantial customization to how a query is created.
     *
     * FieldMaskingSpanQuery lets one field be treated as another so that the clauses appear to
     * be on the same field, since span queries cannot otherwise be combined across fields.
     */
    Query query = new SpanNearQuery(new SpanQuery[]{queryStart, startEnd}, 5, false);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Run the query
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Total hits: " + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Fetch the Document
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.println(document.get("text"));
    }
    indexSearcher.getIndexReader().close();
}

1.6 Disabling fuzzy and wildcard queries

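To disable fuzzy queries, write a custom QueryParser subclass. The class below disables both fuzzy and wildcard queries. To disable other query types, override the corresponding getXXXQuery method in the same way.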

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

/**
 * Created by kangz on 2016/12/15.
 */
public class CustomQueryParser extends QueryParser {

    public CustomQueryParser(String f, Analyzer a) {
        super(f, a);
    }

    // Reject fuzzy queries such as "term~"
    @Override
    protected Query getFuzzyQuery(String field, String termStr, float minSimilarity) throws ParseException {
        throw new ParseException("Fuzzy queries not allowed!");
    }

    // Reject wildcard queries such as "te*m"
    @Override
    protected Query getWildcardQuery(String field, String termStr) throws ParseException {
        throw new ParseException("Wildcard search is disabled for performance reasons; please enter a more precise query ^_^ ^_^");
    }
}

1.7 Searching across multiple indexes

// Combined query over multiple indexes
@Test
public void testMultiReader() throws IOException {
    Directory directory1 = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex"));
    Directory directory2 = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex2"));
    IndexReader aIndexReader = DirectoryReader.open(directory1);
    IndexReader bIndexReader = DirectoryReader.open(directory2);
    // Present both indexes as a single logical reader
    MultiReader multiReader = new MultiReader(aIndexReader, bIndexReader);
    IndexSearcher indexSearcher = new IndexSearcher(multiReader);
    // Match every "text" term between "a" and "z", both bounds inclusive
    TopDocs animal = indexSearcher.search(
            new TermRangeQuery("text", new BytesRef("a"), new BytesRef("z"), true, true), 10);
    ScoreDoc[] scoreDocs = animal.scoreDocs;
    for (ScoreDoc sd : scoreDocs) {
        System.out.println(indexSearcher.doc(sd.doc));
    }
    // Closing the MultiReader also closes the sub-readers passed to this constructor
    multiReader.close();
}
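When several indexes are searched through one combined reader, the document IDs in the results are global: the sub-readers are numbered one after another, so each sub-reader's documents are offset by the total size of the readers before it. The plain-Java sketch below models that mapping without using Lucene. `MultiReaderDocIdModel`, `docBases`, and `resolve` are hypothetical names, and the index sizes in `main` are made up for illustration.

```java
final class MultiReaderDocIdModel {
    /**
     * bases[i] is the first global doc ID of sub-reader i;
     * the final entry is the total number of documents.
     */
    static int[] docBases(int[] maxDocs) {
        int[] bases = new int[maxDocs.length + 1];
        for (int i = 0; i < maxDocs.length; i++) {
            bases[i + 1] = bases[i] + maxDocs[i];
        }
        return bases;
    }

    /** Maps a global doc ID to {subReaderIndex, localDocId}. */
    static int[] resolve(int[] bases, int globalDoc) {
        int total = bases[bases.length - 1];
        if (globalDoc < 0 || globalDoc >= total) {
            throw new IllegalArgumentException("doc ID out of range: " + globalDoc);
        }
        int reader = 0;
        // Advance until globalDoc falls below the next reader's base.
        while (globalDoc >= bases[reader + 1]) {
            reader++;
        }
        return new int[]{reader, globalDoc - bases[reader]};
    }

    public static void main(String[] args) {
        // Hypothetical sizes: first index holds 3 documents, second holds 2.
        int[] bases = docBases(new int[]{3, 2});
        int[] hit = resolve(bases, 4);
        System.out.println("reader " + hit[0] + ", local doc " + hit[1]); // reader 1, local doc 1
    }
}
```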

0 0