Similarity in Elasticsearch: A Brief Introduction to the Similarity Models Available
This article is reposted from: https://www.found.no/foundation/similarity/
Introduction
A similarity model is a set of abstractions and metrics to define to what extent things are similar. That’s quite a general definition. In this article I will only consider textual similarity. In this context, the uses of similarity models can be divided into two categories: classification of documents, with a finite set of categories where the categories are known; and information retrieval, where the problem can be defined as ‘find the most relevant documents for a given query’. In this article I will look into the latter category.
Elasticsearch provides the following similarity models: default, bm25, dfr and ib. I have limited the scope of this article to default and bm25. The divergence from randomness and information based similarities may feature in a future article.
Default Similarity
The default similarity model in Elasticsearch is an implementation of tf/idf. Tf/idf is the most common vector space model. A vector space model is a model where each term of the query is considered a vector dimension. This allows for defining one vector for the query and another for the document considered. The scalar product of the two vectors is then considered the relevance of the document to the query. This implies that the positions of the words within a document are not used - a document is just a bag of words.
A simple vector space model would be to set each coordinate in the document vector to one if the document contains the word for that dimension, and otherwise to zero. The query vector would be all ones if each word has the same weight. In such a model the scalar product is simply the number of words common to the document and the query. However, such a model would rank a document mentioning a query term once in a side note the same as a document using the term repeatedly throughout the text, even though only the latter is actually about the term. The tf part of tf-idf helps us resolve this: tf stands for term frequency, and the solution is to use the term frequency instead of a simple zero or one in the document vector.
In its simplest form you could calculate the term frequency by counting the number of occurrences of the term in the document. This simplistic approach has one inherent problem. Assume a query with the terms fire fox generates the document vectors D1{15, 0} and D2{5, 5}. With a query vector of Q{1, 1}, document D1 will have a score of 15 and document D2 a score of 10. Let’s say D1 is a lengthy document about fire hazards and D2 is a tutorial on how to install the browser Firefox. Which document do you think corresponds best to the query? To get the best of both worlds, rewarding documents both for being specifically about a term and for covering all the terms, tf-idf uses the logarithm of the term frequency as the vector value. Those who remember their calculus know that the logarithm tends to negative infinity as its argument approaches zero, while the logarithm of one is zero. To get the desired behaviour, tf-idf simply adds one to each term frequency before taking the logarithm. Using natural logarithms, the document vectors become D1{2.772588722239781, 0} and D2{1.791759469228055, 1.791759469228055}, with scores 2.772588722239781 and 3.58351893845611 respectively. We now have a model that values having all terms without disregarding that a document can be more about one term than another.
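The fire/fox example above can be reproduced with a short sketch. The document vectors and the ln(tf + 1) damping are taken directly from the text; the function names are mine.

```python
import math

# Raw term counts for the query terms ("fire", "fox") in each document.
d1 = [15, 0]    # lengthy document about fire hazards
d2 = [5, 5]     # Firefox installation tutorial
query = [1, 1]  # equal weight for both query terms

def raw_score(doc, q):
    """Plain dot product of raw counts: rewards sheer repetition."""
    return sum(tf * w for tf, w in zip(doc, q))

def log_tf_score(doc, q):
    """Dot product using ln(tf + 1), as described above."""
    return sum(math.log(tf + 1) * w for tf, w in zip(doc, q))

print(raw_score(d1, query), raw_score(d2, query))        # D1 wins: 15 vs 10
print(log_tf_score(d1, query), log_tf_score(d2, query))  # D2 wins: ~2.77 vs ~3.58
```

With raw counts D1 outscores D2, but after log damping D2, which contains both query terms, comes out on top.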
The other part of tf-idf yet to be mentioned is IDF, the Inverse Document Frequency. The definition of idf is the total number of documents in the collection (or index) divided by the number of documents that contain the word. It is used to address the fact that some words are more common in the language than others. In fact, some words are so common that they are considered noise. If a word is included in every document of a collection, how can it help in choosing between documents? Why not just use stop words, you say? Well, Elasticsearch does use stop words; however, stop word lists can prove inaccurate and are, of course, not only language specific but also domain specific. IDF is a measure of the significance of a word’s occurrence in a document. As with TF, it is common to use the logarithm of IDF. The base of the logarithm is not important; the crux of the matter is that the document frequency must change by a multiplicative factor to shift the IDF by a fixed amount.
As division by zero is somewhat problematic, it’s recommended to add 1 to the denominator of the fraction. To avoid a negative IDF it is then common to add a +1 to the idf outside of the logarithm, as in Lucene’s TFIDFSimilarity.
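The smoothed idf just described can be sketched as follows; the +1 inside the denominator and the +1 outside the logarithm are both from the text, the function name and example numbers are mine.

```python
import math

def idf(num_docs, doc_freq):
    """Smoothed inverse document frequency: +1 in the denominator
    avoids division by zero, +1 outside the log keeps common terms
    from going negative (as in Lucene's classic TFIDFSimilarity)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# In a 1000-document index, a near-ubiquitous word carries little
# information, while a rare word carries much more.
print(idf(1000, 999))  # common word: idf of exactly 1.0
print(idf(1000, 9))    # rare word: considerably higher
```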
BM25
BM25 belongs to the probabilistic models while TF-IDF is a vector space model, but their formulas are not as different as you might expect. Both models define a weight for each term as a product of some idf-function and some tf-function, and then sum these term weights to produce the score for the whole document against the given query.
For those who would like a thorough summary of the theoretical basis of BM25 I recommend The Probabilistic Relevance Framework: BM25 and Beyond. In this article I will rather try to explain the practical difference.
Saturation
Both tf-idf and bm25 acknowledge that for highly frequent terms a further increase in term frequency has little significance for the relevance. The difference between them is that BM25 takes this a bit further. The saturation function of BM25 asymptotically approaches a limit for high term frequencies, while the logarithm of tf-idf has no boundary. For tf-idf, multiplying the term frequency by the base of the logarithm always adds the same fixed amount to the score, no matter how high the term frequency already is, so the score keeps growing without bound.
This difference in growth implies that for any fixed set of tuning parameters, the base of the logarithm included, you can create a document that will get a higher relative increase in term frequency score in tf-idf than in bm25.
The following functions describe how term frequency contributes to score in BM25 and tf-idf:
Saturation_BM25(tf_i) = tf_i / (k1 + tf_i)

Saturation_tf-idf(tf_i) = ln(tf_i + 1)
If we plot these functions for typical values of k1 and term frequency from 0 to 100 we get:
[Figure: saturation curves for BM25 (several values of k1) and tf-idf, term frequency 0 to 100]

When reading this chart it’s important to note that the absolute height of each graph is irrelevant, as it could easily be adjusted with a simple boost factor. What is interesting is the relative growth of each curve. There are two things to note from this graph. Firstly, it confirms that the logarithm has no upper boundary, and secondly, the differences between the BM25 curves show how k1 controls saturation: the higher k1 is, the longer an increase in term frequency keeps contributing noticeably to the score before the curve levels off.
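The two saturation functions can also be compared numerically. This sketch uses the formulas above with an illustrative k1 of 1.2.

```python
import math

def sat_bm25(tf, k1):
    """BM25 term-frequency saturation: asymptotically approaches 1."""
    return tf / (k1 + tf)

def sat_tfidf(tf):
    """tf-idf's log damping: slows down but grows without bound."""
    return math.log(tf + 1)

# BM25 is bounded above by 1 for any term frequency;
# the logarithm keeps climbing.
for tf in (1, 10, 100, 1000):
    print(tf, sat_bm25(tf, k1=1.2), sat_tfidf(tf))
```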
Average Document Length
The second major difference between tf-idf and BM25 is the use of document length in BM25. BM25 uses document length to compensate for the fact that a longer document in general has more words and is thus more likely to have a higher term frequency, without necessarily being more pertinent to the term and thus no more relevant to the query. But it’s not all black and white: some documents actually have a wider scope, which justifies the longer text. BM25 therefore adjusts the term frequency by a factor derived from the ratio of the document’s length to the average document length in the collection, weighted by the tuning parameter b.

By defining L as the document length divided by the average document length, the saturation function becomes:

Saturation_BM25(tf_i) = tf_i / (tf_i + k1 * (1 - b + b * L))

With b set to 0 document length is ignored entirely; with b set to 1 the term frequency is fully normalized by relative document length. This is how term frequency is used in BM25.
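A minimal sketch of BM25’s length-normalized term-frequency factor, assuming the standard form with parameters k1 and b (the default values shown are the usual Lucene defaults):

```python
def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 term-frequency factor with length normalization.
    L = doc_len / avg_len; b blends between ignoring length (b=0)
    and fully normalizing by it (b=1)."""
    L = doc_len / avg_len
    return tf / (tf + k1 * (1 - b + b * L))

# Same term frequency, but the ten-times-longer document is penalized.
print(bm25_tf(5, doc_len=100, avg_len=100))   # average-length document
print(bm25_tf(5, doc_len=1000, avg_len=100))  # much longer document
```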
The Complete Term Weight
To get the full weight for a term in BM25 we need to multiply the saturated term frequency by the term’s IDF:

w_i = IDF(q_i) * tf_i / (tf_i + k1 * (1 - b + b * L))

where IDF(q_i) is typically computed as ln((N - n_i + 0.5) / (n_i + 0.5)), N being the total number of documents in the collection and n_i the number of documents containing the term. The score of a document is then the sum of these weights over all query terms.
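Putting the pieces together, a full BM25 score for one document can be sketched as follows, assuming the standard probabilistic IDF; all numbers in the example are illustrative.

```python
import math

def bm25_idf(num_docs, doc_freq):
    """Probabilistic IDF commonly used with BM25."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_score(tfs, doc_freqs, num_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """Sum over query terms of idf times the length-normalized,
    saturated term frequency."""
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return sum(
        bm25_idf(num_docs, df) * tf / (tf + norm)
        for tf, df in zip(tfs, doc_freqs)
    )

# Two query terms with equal term frequency in the document; the rarer
# term (doc_freq 5 out of 1000) contributes almost all of the score.
score = bm25_score(tfs=[3, 3], doc_freqs=[5, 500],
                   num_docs=1000, doc_len=120, avg_len=100)
print(score)
```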
How Lucene Does BM25
Lucene uses a variation of BM25 where the numerator of the fraction is multiplied with (k1 + 1). This scales every term weight by the same constant, so it does not change the relative ranking of documents; it has the convenient property that a single occurrence of a term in a document of average length yields a term-frequency factor of exactly 1.
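That normalization property is easy to verify in a sketch of the Lucene-style factor (function name mine):

```python
def lucene_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Lucene's BM25 variant: the numerator carries an extra (k1 + 1),
    a constant factor that leaves document ordering unchanged."""
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return tf * (k1 + 1) / (tf + norm)

# A single occurrence in a document of average length scores exactly 1.
print(lucene_tf(1, doc_len=100, avg_len=100))
```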
As Lucene aims for performance, the formula is not calculated entirely at query time. Lucene uses the chosen similarity model both while indexing and while querying. This applies to Elasticsearch as well.
Tuning BM25
Regarding the tuning parameters, k1 and b, reasonable defaults are k1 = 1.2 and b = 0.75. k1 controls how quickly the term frequency saturates: the higher k1 is, the longer repeated occurrences of a term keep contributing to the score. b controls how strongly scores are normalized by document length, from 0 (no normalization) to 1 (full normalization). In Elasticsearch, both can be set by defining a custom similarity in the index settings and referencing it from a field mapping.
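As a sketch, assuming a recent Elasticsearch version, a custom BM25 similarity with explicit k1 and b could be configured along these lines (the index name, similarity name and field name are illustrative):

```json
PUT /my_index
{
  "settings": {
    "similarity": {
      "tuned_bm25": { "type": "BM25", "k1": 1.2, "b": 0.75 }
    }
  },
  "mappings": {
    "properties": {
      "body": { "type": "text", "similarity": "tuned_bm25" }
    }
  }
}
```

Because the similarity is used at index time as well as query time, changing these parameters generally calls for reindexing to take full effect.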
Conclusion
There is a reason why TF-IDF is as widespread as it is. It is conceptually easy to understand and implement while also performing pretty well. That said, there are other strong candidates, and they typically offer more tuning flexibility. In this article we have delved into one of them, BM25. In general, it is known to perform as well as or better than TF-IDF, especially on collections of short documents. Bearing that in mind, don’t forget that similarity rank is not the only ranking contributor in Elasticsearch. To create a good search experience it is key to combine textual similarity rank with metadata suited to the given case, such as the last time a document was updated or some measure of proximity between the authors of the query and the document. Read our follow-up article where we compare the precision and recall of the two models using Wikipedia articles: BM25 vs Lucene Default Similarity.