Term Weighting Algorithms in IR


1 TF-IDF

2 BM25

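Here f is the TF from TF-IDF (the frequency of the query term in the document), |D| is the length of document D, and avgdl is the average document length over the whole corpus. k1 and b are free parameters, usually chosen, in the absence of an advanced optimization, as k1 ∈ [1.2, 2.0] and b = 0.75.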
[Figure: BM25 scoring formula (image)]
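To make the roles of f, |D|, avgdl, k1 and b concrete, here is a minimal Python sketch of a BM25 scorer. It assumes the commonly used Okapi BM25 form, score(D, Q) = sum over q_i of IDF(q_i) * f_i * (k1 + 1) / (f_i + k1 * (1 - b + b * |D| / avgdl)). The IDF variant, the function name `bm25_score` and the toy corpus are assumptions made for illustration; they are not taken from the original post.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query (illustrative sketch).

    corpus is a list of tokenized documents, used for avgdl and IDF.
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs   # average document length
    doc_len = len(doc_terms)                        # |D|
    norm = 1.0 - b + b * doc_len / avgdl            # length normalization
    score = 0.0
    for q in query_terms:
        f = doc_terms.count(q)                      # term frequency f_i
        if f == 0:
            continue
        n_q = sum(1 for d in corpus if q in d)      # documents containing q
        # Assumed IDF variant: smoothed so that it never goes negative.
        idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5) + 1.0)
        score += idf * f * (k1 + 1.0) / (f + k1 * norm)
    return score

# Hypothetical toy corpus: same term frequency, different document lengths.
short_doc = ["c", "x"]
long_doc = ["c", "x", "x", "x", "x", "x"]
toy_corpus = [short_doc, long_doc]
print(bm25_score(["c"], short_doc, toy_corpus))
print(bm25_score(["c"], long_doc, toy_corpus))
```

With the same term frequency, the shorter document gets the higher score, which is the length-normalization behaviour that b controls.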

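Effect of b on relevance

Let y = 1 - b + b*x, where x = |D|/avgdl. The figure shows the relation between x and y.
The larger b is, the more the document length affects the relevance score; the smaller b is, the less it matters. For a large b, a document longer than the average length gets a lower relevance score, and a document shorter than the average gets a higher one.
Intuitively, a longer document has more chances to contain q_i, so for the same f_i a long document should be considered less relevant to q_i than a short one.
[Figure: y = 1 - b + b*x plotted against x = |D|/avgdl]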
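The relation y = 1 - b + b*x can be checked numerically. The sketch below tabulates y for a few values of b and a few document lengths; the helper name `length_norm`, the average length of 100 and the sample lengths are hypothetical choices for illustration.

```python
def length_norm(doc_len, avgdl, b):
    """Return y = 1 - b + b * x with x = doc_len / avgdl."""
    x = doc_len / avgdl
    return 1.0 - b + b * x

# Hypothetical setting: average length 100, documents of length 50, 100, 200.
for b in (0.0, 0.25, 0.75, 1.0):
    row = [round(length_norm(n, 100.0, b), 3) for n in (50, 100, 200)]
    print(f"b={b}: {row}")
```

With b = 0 the factor is always 1, so length is ignored. With b = 0.75 a document twice the average length gets y = 1.75, which enlarges the denominator of the BM25 term and lowers the score. At |D| = avgdl the factor is 1 for every b.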

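Effect of k on relevance

Let y = (tf*(k+1))/(tf+k), which is the BM25 term-frequency factor when the length normalization equals 1. The figure below shows how y depends on k.
[Figure: y = (tf*(k+1))/(tf+k) for different values of k]
The figure suggests that k has only a small effect on the similarity score.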
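The same observation can be explored without the figure. The sketch below evaluates y = (tf*(k+1))/(tf+k) for the two ends of the usual k1 range; the function name `tf_saturation` and the sample tf values are assumptions for illustration.

```python
def tf_saturation(tf, k):
    """Return y = tf * (k + 1) / (tf + k), the saturating tf factor."""
    return tf * (k + 1.0) / (tf + k)

# Compare the two ends of the commonly used range k in [1.2, 2.0].
for k in (1.2, 2.0):
    values = [round(tf_saturation(tf, k), 3) for tf in (1, 2, 5, 10, 100)]
    print(f"k={k}: {values}")
```

For tf = 1 the factor is exactly 1 whatever k is, and as tf grows it approaches k + 1 from below. Within k in [1.2, 2.0] the curves therefore stay between 1 and 3, and the differences between them only become noticeable at higher term frequencies.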

3 DFR (Divergence From Randomness)

Basic Randomness Models

The DFR models are based on this simple idea: “The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d”. In other words the term-weight is inversely related to the probability of term-frequency within the document d obtained by a model M of randomness:

$$\mathrm{weight}(t \mid d) \propto -\log \mathrm{Prob}_M(t \in d \mid \mathrm{Collection}) \qquad (8)$$
where the subscript M stands for the type of model of randomness employed to compute the probability. The basic models are derived in the following table.

Basic DFR Models
  D      Divergence approximation of the binomial
  P      Approximation of the binomial
  BE     Bose-Einstein distribution
  G      Geometric approximation of the Bose-Einstein
  I(n)   Inverse Document Frequency model
  I(F)   Inverse Term Frequency model
  I(ne)  Inverse Expected Document Frequency model

If the model M is the binomial distribution, then the basic model is P and computes the value:

$$-\log \mathrm{Prob}_P(t \in d \mid \mathrm{Collection}) = -\log\left(\binom{TF}{tf}\, p^{tf}\, q^{TF-tf}\right)$$

where:

  • TF is the term-frequency of the term t in the Collection
  • tf is the term-frequency of the term t in the document d
  • N is the number of documents in the Collection
  • p is 1/N and q=1-p
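As a rough illustration of the binomial model P, the sketch below computes the negative log-probability from the quantities listed above. The function name `dfr_binomial_weight`, the use of base-2 logarithms and the sample numbers are assumptions; the text does not state a logarithm base, and this is the exact binomial rather than any approximation a retrieval engine may use.

```python
import math

def dfr_binomial_weight(tf, TF, N):
    """Return -log2 of the binomial probability of seeing tf occurrences.

    tf: occurrences of the term in the document
    TF: occurrences of the term in the whole collection
    N:  number of documents in the collection (must be > 1)
    """
    p = 1.0 / N
    q = 1.0 - p
    log_prob = (math.log2(math.comb(TF, tf))   # binomial coefficient
                + tf * math.log2(p)            # p ** tf
                + (TF - tf) * math.log2(q))    # q ** (TF - tf)
    return -log_prob

# Hypothetical numbers: a term seen 10 times in a 1000-document collection.
for tf in (0, 1, 2, 5):
    print(tf, round(dfr_binomial_weight(tf, TF=10, N=1000), 3))
```

The less likely a within-document frequency is under the random model, the larger the weight, so a document in which the term appears 5 times scores much higher than one in which it appears once.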

Similarly, if the model M is the geometric distribution, then the basic model is G and computes the value:

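$$-\log \mathrm{Prob}_G(t \in d \mid \mathrm{Collection}) = -\log\left(\left(\frac{1}{1+\lambda}\right)\cdot\left(\frac{\lambda}{1+\lambda}\right)^{tf}\right)$$

where λ = F/N, and F is the frequency of the term t in the Collection (written TF above).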
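The geometric model G can be sketched in the same way. The code below assumes the usual geometric form, in which the factor λ/(1+λ) is raised to the power tf, and takes F to be the frequency of the term in the collection. The function name `dfr_geometric_weight`, the base-2 logarithm and the sample numbers are illustrative assumptions.

```python
import math

def dfr_geometric_weight(tf, F, N):
    """Return -log2( (1/(1+lam)) * (lam/(1+lam))**tf ) with lam = F / N.

    tf: occurrences of the term in the document
    F:  occurrences of the term in the whole collection (must be > 0)
    N:  number of documents in the collection
    """
    lam = F / N
    log_prob = math.log2(1.0 / (1.0 + lam)) + tf * math.log2(lam / (1.0 + lam))
    return -log_prob

# Hypothetical numbers: a term seen 10 times in a 1000-document collection.
for tf in (0, 1, 2, 5):
    print(tf, round(dfr_geometric_weight(tf, F=10, N=1000), 3))
```

Under this form the weight grows linearly with tf, by -log2(λ/(1+λ)) per extra occurrence, so rare terms (small λ) gain more per occurrence than frequent ones.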
