10. Querying the Index, Part 4: Advanced Search Techniques in Lucene
Source: Internet | Editor: 程序博客网 | Time: 2024/05/29 23:45
Advanced Search Techniques in Lucene
First up is SpanTermQuery. It is used much like TermQuery; the only difference is that SpanTermQuery also exposes the span (position) information of the matched Term. Usage:
@Test
public void testSpanTermQuery() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex"));
    // Create an IndexReader
    IndexReader indexReader = DirectoryReader.open(directory);
    // Create an IndexSearcher
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    SpanQuery query = new SpanTermQuery(new Term("text", "new"));
    // Run the query
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Total hits: " + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Fetch the Document
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.println(document.get("text"));
    }
    indexSearcher.getIndexReader().close();
}
SpanNearQuery matches on the distance (span) between Terms. Usage:
@Test
public void testSpanNearQuery() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex"));
    // Create an IndexReader
    IndexReader indexReader = DirectoryReader.open(directory);
    // Create an IndexSearcher
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    SpanQuery queryStart = new SpanTermQuery(new Term("text", "there"));
    SpanQuery queryEnd = new SpanTermQuery(new Term("text", "contrib"));
    /*
     * Source text: there is a new QueryParser in contrib, which matches the same syntax as this class,
     * but is more modular, enabling substantial customization to how a query is created.
     *
     * SpanNearQuery matches on the distance between Terms, i.e. how many positions you cross to get
     * from one Term to the other. slop caps that distance: two Terms separated by a huge number of
     * positions should not count as "near", so slop sets the limit.
     *
     * inOrder controls whether a reversed match is allowed. Going from TermA to TermB does not have to
     * be left to right; right to left is the reversed order. inOrder = true means order matters and
     * only the forward direction matches; false allows either direction.
     *
     * Stop words: whether they count toward slop depends on how the analyzer handles positions. In this
     * example the five tokens between "there" and "contrib" (is, a, new, QueryParser, in) all count,
     * so slop = 4 does not match; the query matches only when slop is 5 or more.
     */
    SpanNearQuery query = new SpanNearQuery(new SpanQuery[]{queryStart, queryEnd}, 5, true);
    // Run the query
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Total hits: " + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Fetch the Document
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.println(document.get("text"));
    }
    indexSearcher.getIndexReader().close();
}
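To make the slop and inOrder rules concrete, here is a minimal plain-Java sketch that models them on the example sentence. It does not call Lucene: the class SlopSketch, its tokenizer, and the near method are hypothetical names made up for this illustration, and the model assumes every token (stop words included) occupies one position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Plain-Java model of a two-term slop check. It does not use Lucene. */
final class SlopSketch {

    /** Splits text into lowercase alphanumeric tokens; a token's list index is its position. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** True if first and second occur with at most slop tokens strictly between them. */
    static boolean near(String text, String first, String second, int slop, boolean inOrder) {
        List<String> tokens = tokenize(text);
        for (int a = 0; a < tokens.size(); a++) {
            if (!tokens.get(a).equals(first)) continue;
            for (int b = 0; b < tokens.size(); b++) {
                if (a == b || !tokens.get(b).equals(second)) continue;
                if (inOrder && b < a) continue;    // reversed order is not allowed
                int between = Math.abs(b - a) - 1; // tokens strictly between the two terms
                if (between <= slop) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "there is a new QueryParser in contrib, which matches the same syntax as this class";
        System.out.println(near(text, "there", "contrib", 4, true));  // false
        System.out.println(near(text, "there", "contrib", 5, true));  // true
        System.out.println(near(text, "contrib", "there", 5, true));  // false
        System.out.println(near(text, "contrib", "there", 5, false)); // true
    }
}
```

"there" sits at position 0 and "contrib" at position 6, so five tokens lie between them, which is why 5 is the smallest slop that matches in this model.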
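SpanNotQuery: when you use SpanNearQuery and TermA or TermB occurs more than once in the indexed text, there can be several ways to get from TermA to TermB. SpanNotQuery lets you require that TermC does not occur between TermA and TermB, which rules out some of those matches and gives finer control. Usage: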
@Test
public void testSpanNotQuery() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex2"));
    // Create an IndexReader
    IndexReader indexReader = DirectoryReader.open(directory);
    SpanQuery queryStart = new SpanTermQuery(new Term("text", "there"));
    SpanQuery queryEnd = new SpanTermQuery(new Term("text", "contrib"));
    SpanQuery excludeQuery = new SpanTermQuery(new Term("text", "new"));
    /*
     * Source text: there is a new QueryParser in contrib, which matches the same syntax as this class,
     * but is more modular, enabling substantial customization to how a query is created.
     */
    SpanNearQuery spanquery = new SpanNearQuery(new SpanQuery[]{queryStart, queryEnd}, 5, true);
    // First argument: the spans to include; second argument: the spans to exclude
    SpanQuery query = new SpanNotQuery(spanquery, excludeQuery);
    // Create an IndexSearcher
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Run the query
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Total hits: " + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Fetch the Document
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.println(document.get("text"));
    }
    indexSearcher.getIndexReader().close();
}
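The exclusion rule can be modelled in a few lines of plain Java. This is only a sketch of the idea, not Lucene's implementation: SpanNotSketch and its two methods are hypothetical, a span is represented as {start, endExclusive}, and the near search simply tries every ordered pair of positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of "near match that must not contain an excluded term". */
final class SpanNotSketch {

    /** Ordered near spans as {start, endExclusive}: first, then second, at most slop tokens between. */
    static List<int[]> nearSpans(List<String> tokens, String first, String second, int slop) {
        List<int[]> spans = new ArrayList<>();
        for (int a = 0; a < tokens.size(); a++) {
            if (!tokens.get(a).equals(first)) continue;
            for (int b = a + 1; b < tokens.size(); b++) {
                if (tokens.get(b).equals(second) && b - a - 1 <= slop) {
                    spans.add(new int[] {a, b + 1});
                }
            }
        }
        return spans;
    }

    /** Keeps only the spans that contain no occurrence of the excluded term. */
    static List<int[]> without(List<String> tokens, List<int[]> spans, String excluded) {
        List<int[]> kept = new ArrayList<>();
        for (int[] span : spans) {
            boolean overlaps = tokens.subList(span[0], span[1]).contains(excluded);
            if (!overlaps) kept.add(span);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("there", "is", "a", "new", "queryparser", "in", "contrib");
        List<int[]> spans = nearSpans(tokens, "there", "contrib", 5);
        System.out.println(spans.size());                           // 1
        System.out.println(without(tokens, spans, "new").size());   // 0
        System.out.println(without(tokens, spans, "class").size()); // 1
    }
}
```

In this model the only near span runs from "there" to "contrib" and contains "new", so excluding "new" leaves nothing; for the sample sentence, the SpanNotQuery above would likewise be expected to reject that match.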
SpanOrQuery, as the name suggests, joins several SpanQuery clauses with OR. You could use a BooleanQuery instead, but SpanOrQuery additionally returns span information. Usage:
@Test
public void testSpanOrQuery() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex2"));
    // Create an IndexReader
    IndexReader indexReader = DirectoryReader.open(directory);
    SpanQuery queryStart = new SpanTermQuery(new Term("text", "there"));
    SpanQuery queryEnd = new SpanTermQuery(new Term("text", "contrib"));
    SpanQuery queryNew = new SpanTermQuery(new Term("text", "new"));
    /*
     * Source text: there is a new QueryParser in contrib, which matches the same syntax as this class,
     * but is more modular, enabling substantial customization to how a query is created.
     *
     * SpanOrQuery joins several SpanQuery clauses with OR. A BooleanQuery could be used instead,
     * but SpanOrQuery additionally returns span information.
     */
    SpanNearQuery spanquery = new SpanNearQuery(new SpanQuery[]{queryStart, queryEnd}, 5, true);
    // Each argument is one OR clause; a span matching any clause is a match
    SpanOrQuery query = new SpanOrQuery(spanquery, queryNew);
    // Create an IndexSearcher
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Run the query
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Total hits: " + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Fetch the Document
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.println(document.get("text"));
    }
    indexSearcher.getIndexReader().close();
}
SpanPositionRangeQuery restricts matches to those whose positions fall within the range between start and end; positions are counted from zero. Usage:
@Test
public void testSpanPositionRangeQuery() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex2"));
    // Create an IndexReader
    IndexReader indexReader = DirectoryReader.open(directory);
    FuzzyQuery fQuery = new FuzzyQuery(new Term("text", "conerib"));
    SpanQuery startEnd = new SpanMultiTermQueryWrapper<FuzzyQuery>(fQuery);
    /*
     * Source text: there is a new QueryParser in contrib, which matches the same syntax as this class,
     * but is more modular, enabling substantial customization to how a query is created.
     *
     * First, the FuzzyQuery on "conerib" finds documents containing words similar to "contrib".
     * Then SpanMultiTermQueryWrapper turns the FuzzyQuery into a SpanQuery, and SpanPositionRangeQuery
     * restricts where the matches may sit: words similar to "conerib" must fall within positions 3 to 10.
     */
    // Arguments: the span query to restrict, the start position, the end position
    Query query = new SpanPositionRangeQuery(startEnd, 3, 10);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Run the query
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Total hits: " + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Fetch the Document
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.println(document.get("text"));
    }
    indexSearcher.getIndexReader().close();
}
SpanFirstQuery usage:
@Test
public void testSpanFirstQuery() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex2"));
    // Create an IndexReader
    IndexReader indexReader = DirectoryReader.open(directory);
    FuzzyQuery fQuery = new FuzzyQuery(new Term("text", "conerib"));
    SpanQuery startEnd = new SpanMultiTermQueryWrapper<FuzzyQuery>(fQuery);
    /*
     * Source text: there is a new QueryParser in contrib, which matches the same syntax as this class,
     * but is more modular, enabling substantial customization to how a query is created.
     *
     * It works the same way as SpanPositionRangeQuery, just with one parameter fewer:
     * if you step into its constructor you can see that it passes 0 as start.
     */
    // Arguments: the span query to restrict, the end position (start is fixed at 0)
    Query query = new SpanFirstQuery(startEnd, 10);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Run the query
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Total hits: " + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Fetch the Document
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.println(document.get("text"));
    }
    indexSearcher.getIndexReader().close();
}
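The relationship between SpanPositionRangeQuery and SpanFirstQuery can be sketched without Lucene. The class and method names below are hypothetical, and the sketch assumes the range check treats a single-term match at position p as the span [p, p + 1) and accepts it when the span starts at or after start and ends no later than end.

```java
/** Plain-Java model of a position-range check; SpanFirst is the same check with start = 0. */
final class PositionRangeSketch {

    /** A single-term match at position pos occupies the span [pos, pos + 1). */
    static boolean inRange(int pos, int start, int end) {
        int spanStart = pos;
        int spanEnd = pos + 1; // exclusive
        return spanStart >= start && spanEnd <= end;
    }

    /** Same check with start fixed at 0. */
    static boolean first(int pos, int end) {
        return inRange(pos, 0, end);
    }

    public static void main(String[] args) {
        int contrib = 6; // position of "contrib" in "there is a new QueryParser in contrib"
        System.out.println(inRange(contrib, 3, 10)); // true
        System.out.println(inRange(contrib, 7, 10)); // false
        System.out.println(first(contrib, 10));      // true
        System.out.println(first(contrib, 6));       // false
    }
}
```

Under that assumption, "contrib" at position 6 satisfies both the range 3 to 10 and the first-10 restriction used in the two tests above.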
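FieldMaskingSpanQuery is for span queries across multiple fields: it makes a clause on one field look as if it were on another field, so the combined query appears to run within a single field. By default a Lucene span condition can only apply to one field and cross-field matching is not supported, which is why FieldMaskingSpanQuery exists. Usage: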
@Test
public void testFieldMaskingSpanQuery() throws Exception {
    Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex2"));
    // Create an IndexReader
    IndexReader indexReader = DirectoryReader.open(directory);
    SpanQuery queryStart = new SpanTermQuery(new Term("text", "there"));
    SpanQuery queryEnd = new SpanTermQuery(new Term("text", "new"));
    // Here both clauses are already on "text", so the mask changes nothing;
    // normally the wrapped query would be on a different field.
    SpanQuery startEnd = new FieldMaskingSpanQuery(queryEnd, "text");
    /*
     * Source text: there is a new QueryParser in contrib, which matches the same syntax as this class,
     * but is more modular, enabling substantial customization to how a query is created.
     *
     * FieldMaskingSpanQuery makes a clause on one field look as if it were on another, so that
     * span clauses on different fields can be combined as though they were on the same field.
     */
    Query query = new SpanNearQuery(new SpanQuery[]{queryStart, startEnd}, 5, false);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Run the query
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Total hits: " + topDocs.totalHits);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Fetch the Document
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.println(document.get("text"));
    }
    indexSearcher.getIndexReader().close();
}
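What cross-field span matching buys you is easier to see with two fields whose positions line up. The sketch below is plain Java with made-up data: the two lists stand in for hypothetical firstName and surname fields of one document, where position i in each list belongs to the same person. It only models the idea of comparing positions across fields and is not how Lucene implements the mask.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of matching two fields at the same position. */
final class FieldMaskSketch {

    /** True if firstNames and surnames hold the given values at the same position. */
    static boolean samePosition(List<String> firstNames, String first,
                                List<String> surnames, String surname) {
        int n = Math.min(firstNames.size(), surnames.size());
        for (int i = 0; i < n; i++) {
            if (firstNames.get(i).equals(first) && surnames.get(i).equals(surname)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Document 1: one person, "james jones"
        List<String> first1 = Arrays.asList("james");
        List<String> sur1 = Arrays.asList("jones");
        // Document 2: "james smith" and "sally jones"
        List<String> first2 = Arrays.asList("james", "sally");
        List<String> sur2 = Arrays.asList("smith", "jones");

        System.out.println(samePosition(first1, "james", sur1, "jones")); // true
        System.out.println(samePosition(first2, "james", sur2, "jones")); // false
    }
}
```

A plain AND of firstName:james and surname:jones would accept both documents; comparing positions across the two fields is what separates them, and that comparison is what masking one field as the other makes expressible with span queries.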
1.6 Disabling fuzzy and wildcard queries
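To disable fuzzy queries you need a custom QueryParser subclass. The class below disables both fuzzy and wildcard queries; to disable any other query type, override the corresponding getXXXQuery method in the same way.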
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

/**
 * Created by kangz on 2016/12/15.
 */
public class CustomQueryParser extends QueryParser {

    public CustomQueryParser(String f, Analyzer a) {
        super(f, a);
    }

    // Reject every fuzzy query (e.g. "term~") at parse time
    @Override
    protected Query getFuzzyQuery(String field, String termStr, float minSimilarity) throws ParseException {
        throw new ParseException("Fuzzy queries not allowed!");
    }

    // Reject every wildcard query (e.g. "te*m") at parse time
    @Override
    protected Query getWildcardQuery(String field, String termStr) throws ParseException {
        throw new ParseException("Wildcard search is disabled for performance reasons; please enter a more precise query ^_^ ^_^");
    }
}
1.7 Searching across multiple indexes
// Combined query over multiple indexes
@Test
public void testMultiReader() throws IOException {
    Directory directory1 = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex"));
    Directory directory2 = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex2"));
    IndexReader aIndexReader = DirectoryReader.open(directory1);
    IndexReader bIndexReader = DirectoryReader.open(directory2);
    // Present both readers as a single logical index
    MultiReader multiReader = new MultiReader(aIndexReader, bIndexReader);
    IndexSearcher indexSearcher = new IndexSearcher(multiReader);
    TopDocs animal = indexSearcher.search(
            new TermRangeQuery("text", new BytesRef("a"), new BytesRef("z"), true, true), 10);
    ScoreDoc[] scoreDocs = animal.scoreDocs;
    for (ScoreDoc sd : scoreDocs) {
        System.out.println(indexSearcher.doc(sd.doc));
    }
    // Closing the MultiReader also closes the two sub-readers
    multiReader.close();
}
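The doc IDs returned when searching a MultiReader are global: the sub-readers are concatenated, so IDs from the second index are offset by the document count of the first. A rough plain-Java sketch of that mapping follows; DocBaseSketch and its methods are hypothetical names, and the sizes 3 and 5 are made-up example values rather than anything read from the two indexes above.

```java
import java.util.Arrays;

/** Plain-Java model of how global doc IDs map onto concatenated sub-readers. */
final class DocBaseSketch {

    /** starts[i] is the first global doc ID of sub-reader i; the last entry is the total. */
    static int[] starts(int[] maxDocs) {
        int[] starts = new int[maxDocs.length + 1];
        for (int i = 0; i < maxDocs.length; i++) {
            starts[i + 1] = starts[i] + maxDocs[i];
        }
        return starts;
    }

    /** Index of the sub-reader that owns the given global doc ID. */
    static int readerIndex(int docId, int[] starts) {
        if (docId < 0 || docId >= starts[starts.length - 1]) {
            throw new IllegalArgumentException("doc ID out of range: " + docId);
        }
        int i = 0;
        while (docId >= starts[i + 1]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        int[] starts = starts(new int[] {3, 5}); // assumed sizes of the two indexes
        System.out.println(Arrays.toString(starts)); // [0, 3, 8]
        int global = 4;
        int reader = readerIndex(global, starts);
        int local = global - starts[reader];
        System.out.println(reader + " " + local); // 1 1
    }
}
```

With these assumed sizes, global doc ID 4 belongs to the second index and is its local document 1, which is the kind of translation needed whenever a hit has to be traced back to the index it came from.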