Implementing Low-Level Query Parsing in Solr (QParser)
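Built on Lucene, Solr provides convenient query parsing and search-server functionality. Query parsers are integrated as plugins, so it is easy to add a parsing scheme of your own. Solr also ships with several built-in QParsers. Applications without special requirements can use these components as they are, without modification, and only need to understand the basic parameters they expose (Local Params) to get powerful search functionality.
Solr's design goal is to hide as much of the complexity of the underlying Lucene as possible and to offer full-text search through configuration instead. "Low Level" in the title means using Lucene query syntax directly inside Solr to construct exactly the query you need, for example: +(title:solr) +(+(title:lucene content:hadoop) (title:search)). This requires you to know the Lucene query syntax. In real applications, the QParsers bundled with Solr may not be enough. Suppose, for example, that you index data using dictionary-based word segmentation, and that some keywords in the dictionary are tied to the interaction design (say, when the user searches for a keyword, you recommend a set of related keywords). The front end then has to combine certain keywords in some way and submit them to the back end for parsing and searching. On the back end there is a dedicated query parsing component (called a QParser in Solr, and extensible) that translates the request into the "language" Lucene understands, runs the search against the index, and returns the results.
Here is a simple example:
A user searches for "北京" (Beijing), and I need to supply a group of synonymous keywords: "北平", "首都", "京城", "京都". There is also a group of keywords related to "北京": "首都博物馆", "故宫", "天坛", "八达岭长城", where "首博" is a synonym of "首都博物馆". What we want is this: when the user searches for "北京", run a synonym-expanded search (Solr can do this directly with a synonym Analyzer); when the user then clicks one of the related keywords, expand the search further. For example, clicking "首都博物馆" should produce the following expanded query, in a form Lucene can parse: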
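+((title:北京 content:北京) (title:北平 content:北平) (title:首都 content:首都) (title:京城 content:京城) (title:京都 content:京都)) +((title:首都博物馆 content:首都博物馆) (title:首博 content:首博))

In fact, this would probably be much easier with Lucene directly: you only need to build the Query you want from the Terms in the segmentation dictionary (which exist in the index). In Solr, however, the logic for constructing the query moves into a QParser. Building on QParserPlugin lets you make good use of Solr's basic and add-on components, and such custom components are configured through solrconfig.xml, which is fairly flexible.
Solr already provides a QParserPlugin for this purpose (LuceneQParserPlugin), whose core query parsing is implemented in LuceneQParser, a relatively low-level component. You only need to configure a corresponding requestHandler in solrconfig.xml, for example: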
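The expanded query above has a regular shape: every term is searched in every field, the synonyms of one concept are optional alternatives inside a group, and each group as a whole is required. A minimal sketch of assembling such a string with plain Java follows. The class and method names (`ExpandedQueryBuilder`, `termClause`, `requiredGroup`, `build`) are hypothetical and not part of Solr or Lucene, and the sketch assumes the terms contain no characters that are special in Lucene query syntax.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExpandedQueryBuilder {

    /** Builds "(f1:term f2:term ...)": the term may match in any of the fields. */
    static String termClause(String term, List<String> fields) {
        List<String> parts = new ArrayList<>();
        for (String field : fields) {
            parts.add(field + ":" + term);
        }
        return "(" + String.join(" ", parts) + ")";
    }

    /** Builds "+(clause clause ...)": at least one synonym of the group must match. */
    static String requiredGroup(List<String> synonyms, List<String> fields) {
        List<String> clauses = new ArrayList<>();
        for (String term : synonyms) {
            clauses.add(termClause(term, fields));
        }
        return "+(" + String.join(" ", clauses) + ")";
    }

    /** Joins all required groups into one Lucene-style query string. */
    static String build(List<List<String>> groups, List<String> fields) {
        List<String> required = new ArrayList<>();
        for (List<String> group : groups) {
            required.add(requiredGroup(group, fields));
        }
        return String.join(" ", required);
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("title", "content");
        List<List<String>> groups = Arrays.asList(
                Arrays.asList("北京", "北平", "首都", "京城", "京都"),
                Arrays.asList("首都博物馆", "首博"));
        System.out.println(build(groups, fields));
    }
}
```

With the two groups shown in `main`, the printed string has the same structure as the query in the text. Terms that contain Lucene operators or whitespace would need escaping first, which this sketch leaves out.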
<queryParser name="lucene" class="org.apache.solr.search.LuceneQParserPlugin"/>
<requestHandler name="/lucene" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">lucene</str>
    <str name="bf">recip(ms(NOW,publishDate),3.16e-13,1,1)</str>
    <str name="qf">title^1.50 content</str>
    <bool name="hl">true</bool>
    <str name="hl.fl">title content</str>
    <int name="hl.fragsize">100</int>
    <int name="hl.snippets">3</int>
    <str name="fl">*,score</str>
    <str name="qt">standard</str>
    <str name="wt">standard</str>
    <str name="version">2.2</str>
    <str name="echoParams">explicit</str>
    <str name="indent">true</str>
    <str name="debugQuery">on</str>
    <str name="explainOther">on</str>
  </lst>
</requestHandler>

Start the Solr search server (for example, deployed in a Tomcat container). If you submit the Lucene-style query string above directly:
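http://192.168.0.181:8080/solr/core3/lucene/?q=+((title:北京 content:北京) (title:北平 content:北平) (title:首都 content:首都) (title:京城 content:京城) (title:京都 content:京都)) +((title:首都博物馆 content:首都博物馆) (title:首博 content:首博))&start=0&rows=10

the keywords in the query are all parsed as OR operations, which is not what we intended. If necessary, you can modify LuceneQParser so that "+" is parsed as MUST; only then does the search behave as required.
The following describes another approach: extending Solr's QParserPlugin directly.
First, agree on a unified interface with the front-end design: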
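One thing worth checking before changing any parser, offered here as an assumption that the original text does not investigate: in a URL query string, a literal "+" is conventionally decoded as a space, so the "+" operators may never reach the query parser unless they are percent-encoded as %2B. The sketch below uses only the JDK's `URLEncoder` and `URLDecoder` to show the effect. The class name `QueryParamEncoding` and its helper methods are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class QueryParamEncoding {

    /** Percent-encodes a raw query so it can be used as the value of the q parameter. */
    static String encode(String rawQuery) {
        try {
            return URLEncoder.encode(rawQuery, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    /** Decodes a parameter value using form-style rules ("+" becomes a space). */
    static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        String raw = "+(title:solr) +(content:lucene)";
        // Sent unencoded, the "+" signs are read back as spaces.
        System.out.println("[" + decode(raw) + "]");
        // Encoded, "+" travels as %2B and survives decoding.
        System.out.println(encode(raw));
        System.out.println(decode(encode(raw)).equals(raw));
    }
}
```

If this is what happened, sending the encoded form as the value of q would be enough to keep the MUST semantics with the stock parser. Whether it applies to the setup described here depends on how the request was actually issued.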
北京OR北平OR首都OR京城OR京都AND首都博物馆OR首博 <=> +((title:北京 content:北京) (title:北平 content:北平) (title:首都 content:首都) (title:京城 content:京城) (title:京都 content:京都)) +((title:首都博物馆 content:首都博物馆) (title:首博 content:首博))
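The compact format on the left-hand side can be read as AND-separated groups of OR-separated terms. As a rough illustration of that grammar only, independent of Solr, the sketch below splits such a string into nested lists. The class name `CompactQuerySplitter` and the method `split` are hypothetical, and unlike the parser described in this article the sketch accepts any number of AND groups.

```java
import java.util.ArrayList;
import java.util.List;

public class CompactQuerySplitter {

    /** Splits "aORbANDcORd" into [[a, b], [c, d]]: outer list = AND groups, inner list = OR terms. */
    static List<List<String>> split(String compact) {
        List<List<String>> groups = new ArrayList<>();
        for (String andPart : compact.split("\\s*AND\\s*")) {
            List<String> terms = new ArrayList<>();
            for (String term : andPart.split("\\s*OR\\s*")) {
                String trimmed = term.trim();
                if (!trimmed.isEmpty()) {
                    terms.add(trimmed);
                }
            }
            if (!terms.isEmpty()) {
                groups.add(terms);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(split("北京OR北平OR首都OR京城OR京都AND首都博物馆OR首博"));
        // Caveat: the operators are matched as plain substrings,
        // so a term such as "ORACLE" loses its leading "OR".
        System.out.println(split("ORACLE"));
    }
}
```

Because AND and OR are matched as bare substrings, the format is only safe when keywords cannot contain those letter sequences, as with the Chinese keywords in this example. A delimiter that cannot occur inside a keyword would be more robust for mixed-language dictionaries.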
We parse this format in the extended QParser. The code is as follows:
package org.shirdrn.solr.search;

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.DefaultSolrParams;
import org.apache.solr.common.params.DisMaxParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.DisMaxQParser;
import org.apache.solr.util.SolrPluginUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Customized solr QParser of the plugin
 *
 * @author shirdrn 2011/11/03
 */
public class SimpleQParser extends DisMaxQParser {

    private final Logger LOG = LoggerFactory.getLogger(SimpleQParser.class);

    // using low level Term query? For internal search usage.
    private boolean useLowLevelTermQuery = false;
    private float tiebreaker = 0f;
    // Per-request boosts. The original listing declared these as static fields
    // and never assigned them; see the note in parse().
    private float mainBoost = 1.0f;
    private float frontBoost = 1.0f;
    private float rearBoost = 1.0f;
    private String userQuery = "";

    public SimpleQParser(String qstr, SolrParams localParams,
            SolrParams params, SolrQueryRequest req) {
        super(qstr, localParams, params, req);
    }

    @Override
    public Query parse() throws ParseException {
        SolrParams solrParams = localParams == null ? params : new DefaultSolrParams(localParams, params);
        queryFields = SolrPluginUtils.parseFieldBoosts(solrParams.getParams(DisMaxParams.QF));
        if (0 == queryFields.size()) {
            queryFields.put(req.getSchema().getDefaultSearchFieldName(), 1.0f);
        }

        // Assumption: the boosts are read from the mainBoost / frontBoost / rearBoost
        // parameters of the request handler. The original listing omitted this step,
        // although the debug output in this article shows the configured values applied.
        mainBoost = solrParams.getFloat("mainBoost", 1.0f);
        frontBoost = solrParams.getFloat("frontBoost", 1.0f);
        rearBoost = solrParams.getFloat("rearBoost", 1.0f);

        /* the main query we will execute. we disable the coord because
         * this query is an artificial construct
         */
        BooleanQuery query = new BooleanQuery(true);
        addMainQuery(query, solrParams);

        // rewrite q parameter for highlighting
        if (useLowLevelTermQuery) {
            query = new BooleanQuery(true);
            rewriteAndOrQuery(userQuery, query, solrParams);
        }
        addBoostQuery(query, solrParams);
        addBoostFunctions(query, solrParams);
        return query;
    }

    protected void addMainQuery(BooleanQuery query, SolrParams solrParams)
            throws ParseException {
        tiebreaker = solrParams.getFloat(DisMaxParams.TIE, 0.0f);
        // get the comma separated list of fields used for payload
        /*
         * a parser for dealing with user input, which will convert things to
         * DisjunctionMaxQueries
         */
        SolrPluginUtils.DisjunctionMaxQueryParser up = getParser(queryFields,
                DisMaxParams.QS, solrParams, tiebreaker);

        /* * * Main User Query * * */
        parsedUserQuery = null;
        userQuery = getString();
        altUserQuery = null;
        if (userQuery == null || userQuery.trim().length() < 1) {
            // If no query is specified, we may have an alternate
            altUserQuery = getAlternateUserQuery(solrParams);
            query.add(altUserQuery, BooleanClause.Occur.MUST);
        } else {
            // There is a valid query string
            userQuery = SolrPluginUtils.partialEscape(SolrPluginUtils.stripUnbalancedQuotes(userQuery)).toString();
            userQuery = SolrPluginUtils.stripIllegalOperators(userQuery).toString();

            // use low level Term for constructing TermQuery or BooleanQuery.
            // warning: for internal AND, OR query, in order to integrate with Solr for obtaining highlight
            String luceneQueryText = userQuery;
            String q = solrParams.get(CommonParams.Q);
            if (q != null && (q.indexOf("AND") != -1 || q.indexOf("OR") != -1)) {
                addBasicAndOrQuery(luceneQueryText, query, solrParams);
                luceneQueryText = query.toString();
                useLowLevelTermQuery = true;
            }
            LOG.debug("userQuery=" + luceneQueryText);
            parsedUserQuery = getUserQuery(luceneQueryText, up, solrParams);
            BooleanQuery rewritedQuery = rewriteQueries(parsedUserQuery);
            query.add(rewritedQuery, BooleanClause.Occur.MUST);
        }
    }

    protected void rewriteAndOrQuery(String userQuery, BooleanQuery query, SolrParams solrParams)
            throws ParseException {
        addBasicAndOrQuery(userQuery, query, solrParams);
    }

    /**
     * Parse mixing MUST and SHOULD query defined by us,
     * e.g. 首都OR北京OR北平AND首博OR首都博物馆
     * @param userQuery
     * @param query
     * @param solrParams
     * @throws ParseException
     */
    protected void addBasicAndOrQuery(String userQuery, BooleanQuery query, SolrParams solrParams)
            throws ParseException {
        userQuery = SolrPluginUtils.partialEscape(SolrPluginUtils.stripUnbalancedQuotes(userQuery)).toString();
        userQuery = SolrPluginUtils.stripIllegalOperators(userQuery).toString();
        LOG.debug("userQuery=" + userQuery);
        BooleanQuery parsedUserQuery = new BooleanQuery(true);
        String[] a = userQuery.split("\\s*AND\\s*");
        String q = "";
        if (a.length == 0) {
            createTermQuery(parsedUserQuery, userQuery);
        }
        if (a.length >= 3) {
            // Three or more AND parts are only handled when no OR is present.
            if (userQuery.indexOf("OR") == -1) { // e.g. 首都AND北京AND北平
                BooleanQuery andBooleanQuery = parseAndQuery(a);
                parsedUserQuery.add(andBooleanQuery, BooleanClause.Occur.MUST);
            }
        } else {
            if (a.length > 0) {
                q = a[0].trim();
                if (q.indexOf("OR") != -1 || q.length() > 0) {
                    parsedUserQuery.add(parseOrQuery(q, frontBoost), BooleanClause.Occur.MUST);
                }
            }
            if (a.length == 2) {
                q = a[1].trim();
                if (q.indexOf("OR") != -1 || q.length() > 0) {
                    parsedUserQuery.add(parseOrQuery(q, rearBoost), BooleanClause.Occur.MUST);
                }
            }
        }
        parsedUserQuery.setBoost(mainBoost);
        BooleanQuery rewritedQuery = rewriteQueries(parsedUserQuery);
        query.add(rewritedQuery, BooleanClause.Occur.MUST);
    }

    /**
     * Parse SHOULD query, e.g. 北京OR北平OR首都
     * @param ors
     * @param boost
     * @return
     */
    private BooleanQuery parseOrQuery(String ors, Float boost) {
        BooleanQuery bq = new BooleanQuery(true);
        for (String or : ors.split("\\s*OR\\s*")) {
            if (!or.isEmpty()) {
                createTermQuery(bq, or.trim());
            }
        }
        bq.setBoost(boost);
        return bq;
    }

    /**
     * Create TermQuery for some term text, query fields.
     * @param bq
     * @param qsr
     */
    private void createTermQuery(BooleanQuery bq, String qsr) {
        for (String field : queryFields.keySet()) {
            TermQuery tq = new TermQuery(new Term(field, qsr));
            if (queryFields.get(field) != null) {
                tq.setBoost(queryFields.get(field));
            }
            bq.add(tq, BooleanClause.Occur.SHOULD);
        }
    }

    /**
     * Parse MUST query, e.g. 首都AND北京AND北平
     * @param ands
     * @return
     */
    private BooleanQuery parseAndQuery(String[] ands) {
        BooleanQuery andBooleanQuery = new BooleanQuery(true);
        for (String and : ands) {
            if (!and.isEmpty()) {
                BooleanQuery bq = new BooleanQuery(true);
                createTermQuery(bq, and);
                andBooleanQuery.add(bq, BooleanClause.Occur.MUST);
            }
        }
        return andBooleanQuery;
    }

    /**
     * Rewrite a query, especially a {@link BooleanQuery}, whose
     * subclauses maybe include {@link BooleanQuery}s, {@link DisjunctionMaxQuery}s,
     * {@link TermQuery}s, {@link PhraseQuery}s, {@link PayloadQuery}s, etc.
     * @param input
     * @return
     */
    private BooleanQuery rewriteQueries(Query input) {
        BooleanQuery output = new BooleanQuery(true);
        if (input instanceof BooleanQuery) {
            BooleanQuery bq = (BooleanQuery) input;
            for (BooleanClause clause : bq.clauses()) {
                if (clause.getQuery() instanceof DisjunctionMaxQuery) {
                    BooleanClause.Occur occur = clause.getOccur();
                    output.add(rewriteDisjunctionMaxQueries((DisjunctionMaxQuery) clause.getQuery()), occur); // BooleanClause.Occur.SHOULD
                } else {
                    output.add(clause.getQuery(), clause.getOccur());
                }
            }
        } else if (input instanceof DisjunctionMaxQuery) {
            output.add(rewriteDisjunctionMaxQueries((DisjunctionMaxQuery) input), BooleanClause.Occur.SHOULD); // BooleanClause.Occur.SHOULD
        }
        output.setBoost(input.getBoost()); // boost main clause
        return output;
    }

    /**
     * Rewrite the {@link DisjunctionMaxQuery}, because of default parsing
     * query string to {@link PhraseQuery}s which are not what we want.
     * @param input
     * @return
     */
    private BooleanQuery rewriteDisjunctionMaxQueries(DisjunctionMaxQuery input) {
        // input e.g. (content:"吉林 长白山 内蒙古 九寨沟" | title:"吉林 长白山 内蒙古 九寨沟"^1.5)~1.0
        Map<String, BooleanQuery> m = new HashMap<String, BooleanQuery>();
        Iterator<Query> iter = input.iterator();
        while (iter.hasNext()) {
            Query query = iter.next();
            if (query instanceof PhraseQuery) {
                PhraseQuery pq = (PhraseQuery) query; // e.g. content:"吉林 长白山 内蒙古 九寨沟"
                for (Term term : pq.getTerms()) {
                    BooleanQuery fieldsQuery = m.get(term.text());
                    if (fieldsQuery == null) {
                        fieldsQuery = new BooleanQuery(true);
                        m.put(term.text(), fieldsQuery);
                    }
                    fieldsQuery.setBoost(pq.getBoost());
                    fieldsQuery.add(new TermQuery(term), BooleanClause.Occur.SHOULD);
                }
            } else if (query instanceof TermQuery) {
                TermQuery termQuery = (TermQuery) query;
                BooleanQuery fieldsQuery = m.get(termQuery.getTerm().text());
                if (fieldsQuery == null) {
                    fieldsQuery = new BooleanQuery(true);
                    m.put(termQuery.getTerm().text(), fieldsQuery);
                }
                fieldsQuery.setBoost(termQuery.getBoost());
                fieldsQuery.add(termQuery, BooleanClause.Occur.SHOULD);
            }
        }
        // Only the distinct term texts (the map keys) are used below:
        // each one becomes a required group searched across all query fields.
        Iterator<Entry<String, BooleanQuery>> it = m.entrySet().iterator();
        BooleanQuery mustBooleanQuery = new BooleanQuery(true);
        while (it.hasNext()) {
            Entry<String, BooleanQuery> entry = it.next();
            BooleanQuery shouldBooleanQuery = new BooleanQuery(true);
            createTermQuery(shouldBooleanQuery, entry.getKey());
            mustBooleanQuery.add(shouldBooleanQuery, BooleanClause.Occur.MUST);
        }
        return mustBooleanQuery;
    }
}
Next, the QParser plugin only needs to use the SimpleQParser implemented above, which is very easy:
package org.shirdrn.solr.search;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

/**
 * Simple query parser plugin.
 * e.g. search "Tokyo AND food"
 *
 * @author shirdrn
 * @date 2011-11-03
 */
public class SimpleQParserPlugin extends QParserPlugin {

    @SuppressWarnings("rawtypes")
    @Override
    public void init(NamedList args) {
    }

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
            SolrParams params, SolrQueryRequest req) {
        return new SimpleQParser(qstr, localParams, params, req);
    }
}

Finally, configure the corresponding requestHandler in Solr's solrconfig.xml. A sample configuration fragment:
<queryParser name="simple" class="org.shirdrn.solr.search.SimpleQParserPlugin" />
<requestHandler name="/simple" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">simple</str>
    <str name="qf">title^1.5 content</str>
    <str name="bf">recip(ms(NOW,publishDate),3.16e-13,1,1)^1.68</str>
    <str name="mainBoost">1.555</str>
    <str name="frontBoost">1.333</str>
    <str name="rearBoost">1.222</str>
    <str name="fl">*,score</str>
    <str name="qt">standard</str>
    <str name="wt">standard</str>
    <str name="version">2.2</str>
    <str name="echoParams">explicit</str>
    <bool name="hl">true</bool>
    <str name="hl.fl">title content</str>
    <int name="hl.snippets">3</int>
    <str name="indent">true</str>
    <str name="debugQuery">on</str>
    <str name="explainOther">on</str>
  </lst>
</requestHandler>

Now start the Solr search server and search with:
http://192.168.0.181:8080/solr/core/simple/?q=北京OR北平OR首都OR京城OR京都AND首都博物馆OR首博&start=0&rows=10
This achieves our goal. The XML response of the search looks like this:
<result name="response" numFound="710" start="0" maxScore="2.5198267">
... ...
<lst name="debug">
  <str name="rawquerystring">北京OR北平OR首都OR京城OR京都AND首都博物馆OR首博</str>
  <str name="querystring">北京OR北平OR首都OR京城OR京都AND首都博物馆OR首博</str>
  <str name="parsedquery">+((+((content:北京 title:北京^1.5 content:北平 title:北平^1.5 content:首都 title:首都^1.5 content:京城 title:京城^1.5 content:京都 title:京都^1.5)^1.333) +((content:首都博物馆 title:首都博物馆^1.5 content:首博 title:首博^1.5)^1.222))^1.555) FunctionQuery(1.0/(3.16E-13*float(ms(const(1320330543420),date(publishDate)))+1.0))</str>
  <str name="parsedquery_toString">+((+((content:北京 title:北京^1.5 content:北平 title:北平^1.5 content:首都 title:首都^1.5 content:京城 title:京城^1.5 content:京都 title:京都^1.5)^1.333) +((content:首都博物馆 title:首都博物馆^1.5 content:首博 title:首博^1.5)^1.222))^1.555) 1.0/(3.16E-13*float(ms(const(1320330543420),date(publishDate)))+1.0)</str>

In addition, if necessary, you can extend SolrDispatchFilter to control the HTTP request parameters more precisely and implement a more flexible way of requesting searches.
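The FunctionQuery clause in the parsed query above comes from the bf parameter, recip(ms(NOW,publishDate),3.16e-13,1,1). As the debug output spells out, it evaluates 1.0/(3.16E-13*x+1.0), where x is the document's age in milliseconds. The short sketch below evaluates that formula for a few ages to give a feel for how strongly this setting favors recent documents. The class name `RecipBoost` and the sample ages are made up for illustration.

```java
public class RecipBoost {

    /** The recip(x, m, a, b) form shown in the debug output: a / (m * x + b). */
    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    public static void main(String[] args) {
        double msPerDay = 24 * 60 * 60 * 1000.0;
        double m = 3.16e-13; // the multiplier used in the bf parameter above
        for (int days : new int[] {0, 30, 365, 3650}) {
            double ageMs = days * msPerDay;
            System.out.printf("%5d days old -> %.6f%n", days, recip(ageMs, m, 1, 1));
        }
    }
}
```

With this multiplier the value falls very slowly: a document published a year ago still gets a value of about 0.99, so the date function acts as a mild tie-breaker here and does not dominate the score. A larger multiplier would make the decay steeper.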