Term weight algorithm in IR
Source: Internet | Editor: 程序博客网 | Time: 2024/05/22 17:41
1 TF-IDF
2 BM25
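The BM25 score of a document D for a query Q with terms q_1, ..., q_n is:

$$\mathrm{score}(D,Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i,D)\,(k_1+1)}{f(q_i,D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}$$

Here f(q_i, D) is the TF of TF-IDF (the frequency of q_i in D), |D| is the length of document D, and avgdl is the average length of all documents in the corpus. k1 and b are free parameters, usually chosen, in absence of an advanced optimization, as k1 ∈ [1.2, 2.0] and b = 0.75.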
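The scoring described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard Okapi BM25 form. The function name `bm25_scores`, the toy documents, and the smoothed IDF variant are assumptions made for the example, not something the post specifies.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query:
            f = tf[q]
            if f == 0:
                continue  # a term absent from the document contributes nothing
            # Assumed IDF variant (always positive); the post does not fix one.
            idf = math.log(1 + (n_docs - df[q] + 0.5) / (df[q] + 0.5))
            # Length normalization: 1 - b + b * |D| / avgdl
            norm = 1 - b + b * len(d) / avgdl
            score += idf * f * (k1 + 1) / (f + k1 * norm)
        scores.append(score)
    return scores


# Hypothetical toy corpus: the shorter document containing "cat" should rank first.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat around the garden all day".split(),
    "birds sing".split(),
]
scores = bm25_scores(["cat"], docs)
print(scores)
```

Both of the first two documents contain "cat" exactly once, so the difference in their scores comes only from the length normalization term.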
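Effect of b
Let y = 1 - b + b*x, where x = |D|/avgdl. [Figure: relation between x and y] y is linear in x with slope b and equals 1 at x = 1 for any b.
The larger b is, the more document length influences the relevance score; the smaller b is, the less it does. With a large b, a document longer than the average gets a lower score, and a document shorter than the average gets a higher one.
The intuition is that a longer document has more chances to contain q_i, so for the same term frequency f_i a long document should count as less relevant to q_i than a short one.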
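A small numeric check of this behaviour, as a sketch: the helper names `length_norm` and `tf_component` and the sample values (average length 100, term frequency 3, k1 = 1.2) are assumptions chosen only for illustration.

```python
def length_norm(doc_len, avgdl, b):
    # y = 1 - b + b * x, with x = |D| / avgdl
    return 1 - b + b * doc_len / avgdl


def tf_component(f, k1, norm):
    # Term-frequency part of the BM25 score for one query term
    return f * (k1 + 1) / (f + k1 * norm)


avgdl, f, k1 = 100, 3, 1.2
gaps = {}
for b in (0.0, 0.5, 1.0):
    short = tf_component(f, k1, length_norm(50, avgdl, b))   # half the average length
    long_ = tf_component(f, k1, length_norm(200, avgdl, b))  # twice the average length
    gaps[b] = short - long_
    print(f"b={b}: short={short:.3f} long={long_:.3f} gap={gaps[b]:.3f}")
```

With b = 0 the two documents get the same value, and the gap in favour of the short document widens as b grows.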
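Effect of k1
Let y = tf*(k+1)/(tf+k), where k stands for k1. [Figure: relation between k and y]
The figure suggests that k has only a modest effect on the similarity score: y = 1 at tf = 1 for every k, and y never exceeds k + 1, so k mainly controls how quickly the term-frequency contribution saturates.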
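The saturation behaviour can be tabulated directly. This is a sketch; the function name `tf_saturation` and the chosen values of k and tf are illustrative assumptions.

```python
def tf_saturation(tf, k):
    # y = tf * (k + 1) / (tf + k): grows with tf but stays below k + 1
    return tf * (k + 1) / (tf + k)


for k in (0.5, 1.2, 2.0):
    row = [round(tf_saturation(tf, k), 3) for tf in (1, 2, 5, 20, 100)]
    print(f"k={k}: {row} (upper bound k+1={k + 1})")
```

Each row starts at 1 for tf = 1 and flattens out as tf grows, with a larger k delaying the plateau.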
3 DFR (Divergence From Randomness)
Basic Randomness Models
The DFR models are based on this simple idea: “The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d”. In other words the term-weight is inversely related to the probability of term-frequency within the document d obtained by a model M of randomness:
$$\mathrm{weight}(t \mid d) \propto -\log \mathrm{Prob}_M(t \in d \mid \mathrm{Collection}) \tag{8}$$
where the subscript M stands for the type of model of randomness employed to compute the probability. The basic models are derived in the following table.
Basic DFR Models
- D: Divergence approximation of the binomial
- P: Approximation of the binomial
- BE: Bose-Einstein distribution
- G: Geometric approximation of the Bose-Einstein
- I(n): Inverse Document Frequency model
- I(F): Inverse Term Frequency model
- I(ne): Inverse Expected Document Frequency model
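If the model M is the binomial distribution, then the basic model is P and computes the value:

$$-\log \mathrm{Prob}_P(t \in d \mid \mathrm{Collection}) = -\log\left(\binom{TF}{tf}\, p^{tf}\, q^{TF-tf}\right)$$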
where:
- TF is the term-frequency of the term t in the Collection
- tf is the term-frequency of the term t in the document d
- N is the number of documents in the Collection
- p is 1/N and q=1-p
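The binomial model P can be evaluated as in the sketch below. It assumes the value is -log2 of the binomial probability of seeing tf occurrences among TF trials with success probability p = 1/N. The base-2 logarithm and the function name `dfr_binomial_p` are assumptions, and the computation is done in log space so that large TF does not underflow.

```python
import math


def dfr_binomial_p(tf, TF, N):
    """-log2 of the binomial probability of tf occurrences out of TF (requires N > 1)."""
    p = 1.0 / N
    q = 1.0 - p
    # log of the binomial coefficient C(TF, tf), via log-gamma
    log_comb = math.lgamma(TF + 1) - math.lgamma(tf + 1) - math.lgamma(TF - tf + 1)
    log_prob = log_comb + tf * math.log(p) + (TF - tf) * math.log(q)
    return -log_prob / math.log(2)  # convert natural log to base 2 (assumed base)


# Hypothetical numbers: a term occurring 10 times in a collection of 100 documents.
for tf in (1, 2, 5):
    print(tf, round(dfr_binomial_p(tf, TF=10, N=100), 3))
```

With these numbers the expected count per document is 0.1, so each additional occurrence in a document makes the observation less probable under the random model and the weight larger.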
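Similarly, if the model M is the geometric distribution, then the basic model is G and computes the value:

$$-\log \mathrm{Prob}_G(t \in d \mid \mathrm{Collection}) = -\log\left(\frac{1}{1+\lambda}\left(\frac{\lambda}{1+\lambda}\right)^{tf}\right)$$

where λ = F/N and F is the term-frequency of the term t in the Collection (written TF above).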
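A sketch of the geometric model G follows. It assumes the probability has the form (1/(1+λ)) · (λ/(1+λ))^tf with λ = F/N, that F is the frequency of the term in the whole collection, and that logarithms are base 2. The function name `dfr_geometric_g` and the sample numbers are illustrative.

```python
import math


def dfr_geometric_g(tf, F, N):
    """-log2 of the geometric probability, expanded to avoid underflow for large tf."""
    lam = F / N  # mean frequency of the term per document
    # -log2( 1/(1+lam) * (lam/(1+lam))**tf )
    return math.log2(1 + lam) + tf * math.log2((1 + lam) / lam)


# Hypothetical collection of 100 documents: a rare term versus a common term.
print(round(dfr_geometric_g(tf=3, F=10, N=100), 3))    # rare term
print(round(dfr_geometric_g(tf=3, F=1000, N=100), 3))  # common term
```

For the same within-document frequency, the rare term receives the larger weight because its occurrences are less likely under the random model.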