Term weighting algorithms in IR


1 TF-IDF

2 BM25

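f is the term frequency (the TF of TF-IDF), |D| is the length of document D, and avgdl is the average document length over the whole corpus. k1 and b are free parameters, usually chosen, in the absence of advanced optimization, as k1 ∈ [1.2, 2.0] and b = 0.75.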
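For reference, the symbols above match the standard Okapi BM25 scoring function. Assuming that is the formula intended here, the score of a document D for a query Q with terms q_1, ..., q_n can be written as below. IDF(q_i) denotes the inverse document frequency weight of q_i, which this post does not define, so its exact form is left open.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)}
```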

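Effect of b on relevance

Let y = 1 - b + b*x, where x = |D|/avgdl. y is linear in x with slope b, and y = 1 when x = 1 for every b.
The larger b is, the more document length affects the relevance score; the smaller b is, the less it does. With a large b, a document longer than average gets a lower relevance score, and a document shorter than average gets a higher one.
The intuition is that a longer document has more opportunities to contain qi, so for the same fi a long document should count as less relevant to qi than a short one.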
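The effect of b can be checked numerically. The following is a minimal sketch, assuming the standard BM25 term-frequency component tf*(k1+1)/(tf + k1*y) with y as defined above; the function names `length_norm` and `bm25_tf_component` and the sample document lengths are illustrative choices, not part of the original post.

```python
def length_norm(x: float, b: float) -> float:
    """Length normalization y = 1 - b + b*x, where x = |D| / avgdl."""
    return 1.0 - b + b * x


def bm25_tf_component(tf: float, doc_len: float, avgdl: float,
                      k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency part of BM25 for one term (IDF factor omitted)."""
    y = length_norm(doc_len / avgdl, b)
    return tf * (k1 + 1.0) / (tf + k1 * y)


# Same tf in a short document (half the average length) and a long one (twice the average).
for b in (0.0, 0.5, 1.0):
    short_doc = bm25_tf_component(tf=3, doc_len=50, avgdl=100, b=b)
    long_doc = bm25_tf_component(tf=3, doc_len=200, avgdl=100, b=b)
    print(f"b={b:.1f}  short={short_doc:.3f}  long={long_doc:.3f}")
```

With b = 0 the two documents score the same, and the gap between them widens as b grows, which is the behavior described above.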

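Effect of k (k1) on relevance

Let y = tf*(k+1)/(tf+k), and consider y as a function of k.
y grows from 1 at k = 0 toward tf as k becomes large, so within the usual range k ∈ [1.2, 2.0] the value of k has only a modest effect on the similarity score.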
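A quick numerical check of how y depends on k is sketched below. The function name `tf_saturation` and the sample values of tf and k are assumptions made for illustration only.

```python
def tf_saturation(tf: float, k: float) -> float:
    """y = tf*(k+1)/(tf+k): the BM25 term-frequency factor without length normalization."""
    return tf * (k + 1.0) / (tf + k)


# Tabulate y for a few term frequencies across typical values of k.
for tf in (1, 2, 5, 20):
    row = "  ".join(f"k={k:.1f}: {tf_saturation(tf, k):.3f}" for k in (1.2, 1.6, 2.0))
    print(f"tf={tf:<3d} {row}")
```

For tf = 1 the factor equals 1 regardless of k, and for larger tf it stays below tf, so k mainly controls how quickly repeated occurrences of a term stop adding to the score.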

3 DFR (Divergence From Randomness)

Basic Randomness Models

The DFR models are based on this simple idea: “The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d”. In other words, the term weight is inversely related to the probability of the term frequency within the document d obtained by a model M of randomness:

$$\mathrm{weight}(t \mid d) \propto -\log \mathrm{Prob}_M(t \in d \mid \mathrm{Collection}) \qquad (8)$$

where the subscript M stands for the type of model of randomness employed to compute the probability. The basic models are derived in the following table.

Basic DFR Models
D        Divergence approximation of the binomial
P        Approximation of the binomial
B_E      Bose-Einstein distribution
G        Geometric approximation of the Bose-Einstein
I(n)     Inverse Document Frequency model
I(F)     Inverse Term Frequency model
I(n_e)   Inverse Expected Document Frequency model

If the model M is the binomial distribution, then the basic model is P and computes the value:

$$-\log \mathrm{Prob}_P(t \in d \mid \mathrm{Collection}) = -\log\left(\binom{TF}{tf}\, p^{tf}\, q^{TF - tf}\right)$$

where:

TF is the term-frequency of the term t in the Collection
tf is the term-frequency of the term t in the document d
N is the number of documents in the Collection
p is 1/N and q=1-p
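The binomial model P can be evaluated directly from these definitions. The sketch below is one possible way to do it, assuming natural logarithms (the base is not stated here and only rescales the weight) and N ≥ 2 so that q > 0. The function name `dfr_binomial_weight` and its parameter names are hypothetical; `math.lgamma` is used so that large collection frequencies do not overflow.

```python
import math


def dfr_binomial_weight(tf: int, collection_tf: int, n_docs: int) -> float:
    """-log of the binomial probability of seeing tf occurrences of a term in a
    document, given collection_tf occurrences in total and p = 1/n_docs."""
    p = 1.0 / n_docs
    q = 1.0 - p
    # log of the binomial coefficient C(collection_tf, tf) via log-gamma
    log_binom = (math.lgamma(collection_tf + 1)
                 - math.lgamma(tf + 1)
                 - math.lgamma(collection_tf - tf + 1))
    log_prob = log_binom + tf * math.log(p) + (collection_tf - tf) * math.log(q)
    return -log_prob


# A term with 100 occurrences spread over 1000 documents: a higher tf is less
# likely under the random model, so it receives a larger weight.
for tf in (1, 2, 5):
    print(tf, round(dfr_binomial_weight(tf, collection_tf=100, n_docs=1000), 3))
```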

Similarly, if the model M is the geometric distribution, then the basic model is G and computes the value:

$$-\log \mathrm{Prob}_G(t \in d \mid \mathrm{Collection}) = -\log\left(\left(\frac{1}{1+\lambda}\right) \cdot \left(\frac{\lambda}{1+\lambda}\right)^{tf}\right)$$

where λ = F/N and F is the frequency of the term t in the Collection (written TF above).
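The geometric model G can be sketched in the same way. This example assumes the geometric form with the ratio λ/(1+λ) raised to the power tf, natural logarithms, and F > 0; the function name `dfr_geometric_weight` and its parameter names are hypothetical.

```python
import math


def dfr_geometric_weight(tf: int, collection_tf: int, n_docs: int) -> float:
    """-log of (1/(1+lam)) * (lam/(1+lam))**tf, with lam = collection_tf / n_docs."""
    lam = collection_tf / n_docs
    log_prob = -math.log(1.0 + lam) + tf * (math.log(lam) - math.log(1.0 + lam))
    return -log_prob


# The weight grows linearly with tf; rarer terms (smaller lam) grow faster.
for tf in (0, 1, 5):
    print(tf, round(dfr_geometric_weight(tf, collection_tf=100, n_docs=1000), 3))
```

Expanding the logarithm gives log(1+λ) + tf*log((1+λ)/λ), so each extra occurrence adds the same amount to the weight, and that amount is larger for terms that are rare in the collection.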

0 0