Notes on Solr stopwords
Source: Internet · Editor: 程序博客网 · Time: 2024/06/08 04:06
In Solr, we’re normally used to stopwords just kind of magically working. If you enter a stopword in a query, it’s silently stripped out and ignored (unlike my legacy OPAC, which gives you zero results whenever you include a stopword!). If you include a stopword in a phrase search, it does even better: “kill a mockingbird” basically turns into “kill * mockingbird”, that is, kill and mockingbird separated by one word, and it successfully matches documents containing “kill a mockingbird” (along with any other “kill * mockingbird”).
Great! So normally we don’t have to think about it too much.
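To make the “kill * mockingbird” behaviour concrete, here is a toy Python model of it. It is only a sketch of the idea (stopwords are dropped but leave a gap in the token positions), not Lucene’s actual implementation; the stopword list and the function names are made up for illustration, and real behaviour depends on your Solr version and analyzer settings.

```python
# Toy model only: stopwords are removed, but the surviving tokens keep
# their original positions, so a phrase query still "remembers" the gap.
STOPWORDS = frozenset({"a", "an", "the", "of", "to"})  # assumed list, for illustration


def analyze_with_positions(text, stopwords=STOPWORDS):
    """Return (position, token) pairs for every non-stopword in the text."""
    return [
        (position, word)
        for position, word in enumerate(text.lower().split())
        if word not in stopwords
    ]


def phrase_matches(query, document, stopwords=STOPWORDS):
    """True if the query's surviving tokens appear in the document
    at the same relative positions (gaps included)."""
    query_tokens = analyze_with_positions(query, stopwords)
    doc_tokens = analyze_with_positions(document, stopwords)
    if not query_tokens:
        return False
    doc_lookup = set(doc_tokens)
    first_position, first_word = query_tokens[0]
    for position, word in doc_tokens:
        if word != first_word:
            continue
        shift = position - first_position
        if all((p + shift, w) in doc_lookup for p, w in query_tokens):
            return True
    return False


if __name__ == "__main__":
    print(analyze_with_positions("kill a mockingbird"))
    print(phrase_matches("kill a mockingbird", "To Kill a Mockingbird"))
    print(phrase_matches("kill a mockingbird", "kill one mockingbird"))
    print(phrase_matches("kill a mockingbird", "kill mockingbird"))
```

In this model the phrase matches any document where kill and mockingbird are exactly one position apart, whatever sits between them, and does not match when they are adjacent.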
An exception is when you throw dismax into the mix. Dismax lets you search multiple Solr fields at once (the qf parameter). It also lets you search with a multi-clause query where, depending on your “mm” settings, only SOME of those clauses have to match for a document to be included in the hitlist.
So you have multiple Solr fields involved. As long as each of those fields is configured for stopwords (and the same stopwords), everything Just Works the way you’d expect. But if one of those fields does not have stopwords configured, then (depending on your mm settings) you can easily end up with zero hits for any query that contains a stopword as a (non-phrase) clause. This kind of makes sense when you think about it: since at least one field didn’t have stopwords, a clause was generated for the stopword you entered. That clause can’t possibly match in any of your stopword fields, where the word was never indexed, so it can only match if the word happens to occur in a field without stopwords. Depending on your mm (and the contents of all your fields, phew), that results in no hits.
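Here is a rough Python simulation of that failure mode. It is a deliberately simplified model of dismax, not the real query parser: it only handles single-word clauses, it assumes mm is effectively 100% (every surviving clause must match), and the field names, stopword list and `dismax_hits` function are all invented for the example.

```python
# Simplified dismax model: each query word becomes one clause that may match
# in any field whose analysis keeps the word. If every field drops the word
# as a stopword, the clause disappears entirely. Assumes mm = 100%.
STOPWORDS = frozenset({"a", "an", "the", "of", "to"})  # assumed list, for illustration


def analyze(text, stopwords):
    """Lowercase, split on whitespace, drop this field's stopwords."""
    return [word for word in text.lower().split() if word not in stopwords]


def dismax_hits(query, docs, field_stopwords):
    """Return the indexes of docs matching every surviving clause.

    field_stopwords maps each qf field to its stopword set
    (an empty set means the field has no stopword filter).
    """
    indexed = [
        {field: set(analyze(doc.get(field, ""), stops))
         for field, stops in field_stopwords.items()}
        for doc in docs
    ]
    clauses = []
    for word in query.lower().split():
        fields = [f for f, stops in field_stopwords.items() if word not in stops]
        if fields:  # the clause survives if at least one field keeps the word
            clauses.append((word, fields))
    if not clauses:
        return []
    return [
        doc_id
        for doc_id, terms in enumerate(indexed)
        if all(any(word in terms[f] for f in fields) for word, fields in clauses)
    ]


if __name__ == "__main__":
    docs = [{"title": "To Kill a Mockingbird", "author": "Harper Lee"}]
    mixed = {"title": STOPWORDS, "author": frozenset()}
    consistent = {"title": STOPWORDS, "author": STOPWORDS}
    no_stopwords = {"title": frozenset(), "author": frozenset()}
    for label, config in [("mixed", mixed), ("consistent", consistent),
                          ("no stopwords", no_stopwords)]:
        print(label, dismax_hits("a mockingbird", docs, config))
```

With the mixed configuration, the clause for “a” survives only because of the author field, can never match the title field, and so the query finds nothing. With stopwords on both fields, or on neither, the same query finds the document.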
A bit more information in this solr listserv thread.
If all the fields included in a dismax qf have stopwords configured, but with different stopword lists, the results could be even more confusing.
The solution?
If you are using dismax, make sure all fields included in a qf have exactly the same stopwords settings. Either they all need to have stopwords configured with the same stopwords file, or they all need to have stopwords not configured.
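One way to check this is to read the stopword settings straight out of the schema. The Python sketch below does that for a small inline sample schema; the field and type names are hypothetical, and it simplifies by lumping together every `solr.StopFilterFactory` found under a field type, so it does not distinguish separate index and query analyzers. Treat it as a starting point, not a complete validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample schema; the field and type names are invented.
SAMPLE_SCHEMA = """
<schema name="example" version="1.5">
  <types>
    <fieldType name="text_stop" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="text_plain" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="title" type="text_stop" indexed="true" stored="true"/>
    <field name="subject" type="text_stop" indexed="true" stored="true"/>
    <field name="author" type="text_plain" indexed="true" stored="true"/>
  </fields>
</schema>
"""


def stopword_settings(schema_xml):
    """Map each field name to a tuple of (words file, ignoreCase) pairs,
    one pair per StopFilterFactory on the field's type; () means none."""
    root = ET.fromstring(schema_xml.strip())
    by_type = {}
    for field_type in root.iter("fieldType"):
        by_type[field_type.get("name")] = tuple(sorted(
            (f.get("words", ""), f.get("ignoreCase", "false"))
            for f in field_type.iter("filter")
            if f.get("class") == "solr.StopFilterFactory"
        ))
    return {f.get("name"): by_type.get(f.get("type"), ())
            for f in root.iter("field")}


def inconsistent_qf(schema_xml, qf_fields):
    """Group the qf fields by stopword settings.
    Returns {} when they all agree, otherwise the groups that differ."""
    settings = stopword_settings(schema_xml)
    groups = {}
    for name in qf_fields:
        groups.setdefault(settings[name], []).append(name)
    return groups if len(groups) > 1 else {}


if __name__ == "__main__":
    print(inconsistent_qf(SAMPLE_SCHEMA, ["title", "subject"]))
    print(inconsistent_qf(SAMPLE_SCHEMA, ["title", "subject", "author"]))
```

An empty result means every field in the qf shares the same stopword settings. Anything else lists which fields fall into which group, which is where to look first when a stopword query comes back empty.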
Just not using stopwords seems like the simplest solution to me. What’s the reason for stopwords in the first place? Generally performance: a search clause on a very common word produces a huge result set, which slows down Lucene/Solr. My Solr is not as performant as I’d like, it’s true, but there are a whole bunch of different things I really need to look at for performance (so many that it’s kind of overwhelming to consider, honestly). Since using stopwords would make my Solr configuration more confusing and error-prone, assuming without any profiling that the lack of stopwords is my most important bottleneck seems like a kind of “premature optimization”. So no stopwords for now.
Erik Hatcher suggested in an IRC chat that if very common words are a performance bottleneck, it might make more sense to investigate Solr’s (or Lucene’s?) “commongrams” capability rather than stopwords. I need to put that on my list to look into. I know little about it: I get the basic concept, but don’t know how it’s implemented in Solr/Lucene or how to set it up.