Edit Distance
Source: Internet  Editor: 程序博客网  Time: 2024/05/06 12:40
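1. FuzzyQuery
Lucene's FuzzyQuery gathers all terms in the dictionary that are similar to the query string and builds a BooleanQuery from them.
How is the similarity between the query string and a dictionary term evaluated?
By edit distance, also known as Levenshtein distance.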
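The idea of picking similar terms out of a dictionary can be sketched as follows. This is only an illustration, not Lucene's code: `editDistance`, `similarTerms` and the `maxDistance` cutoff are hypothetical names chosen for this example, and Lucene's own thresholds and scoring are not reproduced here.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Plain dynamic-programming Levenshtein distance, keeping only two rows.
std::size_t editDistance(const std::string& s, const std::string& t) {
    std::vector<std::size_t> prev(t.size() + 1), curr(t.size() + 1);
    for (std::size_t j = 0; j <= t.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= s.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= t.size(); ++j) {
            const std::size_t cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost});
        }
        std::swap(prev, curr);
    }
    return prev[t.size()];
}

// Hypothetical helper: keep the dictionary terms whose distance to the
// query is at most maxDistance (an assumed cutoff, not a Lucene parameter).
std::vector<std::string> similarTerms(const std::vector<std::string>& dictionary,
                                      const std::string& query,
                                      std::size_t maxDistance) {
    std::vector<std::string> matches;
    for (const std::string& term : dictionary) {
        if (editDistance(query, term) <= maxDistance) matches.push_back(term);
    }
    return matches;
}
```

In a real search engine the matching terms would then be combined into one query, as the text describes for BooleanQuery; here they are simply returned as a list.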
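2. Levenshtein distance
Anyone working in natural language processing will be familiar with this concept. Edit distance is the minimum number of insertions, deletions, and substitutions needed to transform a source string (s) into a target string (t). It is widely used in NLP, for example in evaluation metrics such as WER and mWER, and is also commonly used to count the changes made to an original text.
The edit distance algorithm was first proposed by the Russian scientist Levenshtein, which is why it is also called Levenshtein distance.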
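As an illustration of the WER use case mentioned above, the same distance can be computed over words instead of characters and divided by the number of words in the reference. The sketch below assumes whitespace tokenization and the usual definition WER = (substitutions + deletions + insertions) / reference length; the function names are hypothetical, and the handling of an empty reference is an assumption made only for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Split a sentence into whitespace-separated words.
std::vector<std::string> splitWords(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string word; in >> word;) words.push_back(word);
    return words;
}

// Levenshtein distance where the compared units are whole words.
std::size_t wordEditDistance(const std::vector<std::string>& ref,
                             const std::vector<std::string>& hyp) {
    std::vector<std::size_t> prev(hyp.size() + 1), curr(hyp.size() + 1);
    for (std::size_t j = 0; j <= hyp.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= ref.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= hyp.size(); ++j) {
            const std::size_t cost = (ref[i - 1] == hyp[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost});
        }
        std::swap(prev, curr);
    }
    return prev[hyp.size()];
}

// Word error rate of a hypothesis against a reference sentence.
double wordErrorRate(const std::string& reference, const std::string& hypothesis) {
    const std::vector<std::string> ref = splitWords(reference);
    const std::vector<std::string> hyp = splitWords(hypothesis);
    // Assumption: an empty reference gives 0 for an empty hypothesis, 1 otherwise.
    if (ref.empty()) return hyp.empty() ? 0.0 : 1.0;
    return static_cast<double>(wordEditDistance(ref, hyp)) /
           static_cast<double>(ref.size());
}
```

For the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat", one substitution and one deletion are needed, so the rate is 2/6.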
Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the minimum number of deletions, insertions, or substitutions required to transform s into t. For example,
- If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
- If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.
The greater the Levenshtein distance, the more different the strings are.
Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance.
Step 1. Set n to be the length of s. Set m to be the length of t.
If n = 0, return m and exit.
If m = 0, return n and exit.
Construct a matrix d with rows 0..n and columns 0..m.
Step 2. Initialize the first row to 0..m.
Initialize the first column to 0..n.
Step 3. Examine each character of s (i from 1 to n).
Step 4. Examine each character of t (j from 1 to m).
Step 5. If s[i] equals t[j], the cost is 0.
If s[i] doesn't equal t[j], the cost is 1.
Step 6. Set cell d[i,j] of the matrix equal to the minimum of:
a. The cell immediately above plus 1: d[i-1,j] + 1.
b. The cell immediately to the left plus 1: d[i,j-1] + 1.
c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
Step 7. After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].
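The steps above can be sketched directly in C++. This is a minimal illustration, assuming byte-wise comparison of `std::string` characters; the function name `levenshtein` is chosen for this example. In this sketch d has one row per prefix of s and one column per prefix of t, and because C++ strings are 0-indexed, the character the steps call s[i] is read as `s[i - 1]`.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

std::size_t levenshtein(const std::string& s, const std::string& t) {
    // Step 1: lengths and trivial cases.
    const std::size_t n = s.size();
    const std::size_t m = t.size();
    if (n == 0) return m;
    if (m == 0) return n;

    // Steps 1-2: (n+1) x (m+1) matrix with the first row and column filled in.
    std::vector<std::vector<std::size_t>> d(n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = 0; i <= n; ++i) d[i][0] = i;
    for (std::size_t j = 0; j <= m; ++j) d[0][j] = j;

    // Steps 3-6: fill the remaining cells row by row.
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const std::size_t cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            d[i][j] = std::min({d[i - 1][j] + 1,          // cell above
                                d[i][j - 1] + 1,          // cell to the left
                                d[i - 1][j - 1] + cost}); // diagonal cell
        }
    }

    // Step 7: the distance is in the bottom-right cell.
    return d[n][m];
}
```

Calling `levenshtein("test", "tent")` reproduces the distance of 1 from the example earlier in the text. Time and memory are both proportional to n * m.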
3. Edit distance implementation in Lucene
#define min3(a, b, c) __t = (a < b) ? a : b; __t = (__t < c) ? __t : c;

int32_t FuzzyTermEnum::editDistance(const TCHAR* s, const TCHAR* t, const int32_t n, const int32_t m) {
    //Func - Calculates the Levenshtein distance, also known as edit distance: a measure of similarity
    //       between two strings where the distance is measured as the number of character
    //       deletions, insertions or substitutions required to transform one string to
    //       the other string.
    //Pre  - s != NULL and contains the source string
    //       t != NULL and contains the target string
    //       n >= 0 and contains the length of the source string
    //       m >= 0 and contains the length of the target string
    //Post - The distance has been returned

    CND_PRECONDITION(s != NULL, "s is NULL");
    CND_PRECONDITION(t != NULL, "t is NULL");
    CND_PRECONDITION(n >= 0, " n is a negative number");
    CND_PRECONDITION(m >= 0, " m is a negative number");

    int32_t i;   // iterates through s
    int32_t j;   // iterates through t
    TCHAR s_i;   // ith character of s

    if (n == 0)
        return m;
    if (m == 0)
        return n;

    // e, eWidth and eHeight are not declared in this function; they appear to be
    // members that persist between calls. Cell (i, j) is stored at e[i + j*eWidth].
    //Check if the array must be reallocated because it is too small or does not exist
    if (e == NULL || eWidth <= n || eHeight <= m) {
        //Delete e if possible
        _CLDELETE_ARRAY(e);
        //resize e
        eWidth = max(eWidth, n+1);
        eHeight = max(eHeight, m+1);
        e = _CL_NEWARRAY(int32_t, eWidth*eHeight);
    }

    CND_CONDITION(e != NULL, "e is NULL");

    // init matrix e
    for (i = 0; i <= n; i++) {
        e[i + (0*eWidth)] = i;
    }
    for (j = 0; j <= m; j++) {
        e[0 + (j*eWidth)] = j;
    }

    int32_t __t; //temporary variable for min3

    // start computing edit distance
    for (i = 1; i <= n; i++) {
        s_i = s[i - 1];
        for (j = 1; j <= m; j++) {
            if (s_i != t[j-1]) {
                // characters differ: every neighbour costs one more edit
                min3(e[i + (j*eWidth) - 1], e[i + ((j-1)*eWidth)], e[i + ((j-1)*eWidth)-1]);
                e[i + (j*eWidth)] = __t+1;
            } else {
                // characters match: the diagonal neighbour carries over at no cost
                min3(e[i + (j*eWidth) -1]+1, e[i + ((j-1)*eWidth)]+1, e[i + ((j-1)*eWidth)-1]);
                e[i + (j*eWidth)] = __t;
            }
        }
    }

    // we got the result!
    return e[n + ((m)*eWidth)];
}
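The listing above keeps its matrix in one flat array and only reallocates when a longer string arrives. A standalone sketch of that buffer-reuse idea, using standard containers instead of the library's macros, might look like this. The class name `EditDistanceCalculator` and its members are hypothetical and are not part of Lucene.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the object that owns the reusable matrix.
class EditDistanceCalculator {
public:
    int distance(const std::string& s, const std::string& t) {
        const std::size_t n = s.size();
        const std::size_t m = t.size();
        if (n == 0) return static_cast<int>(m);
        if (m == 0) return static_cast<int>(n);

        // Grow the buffer only when the current one is too small.
        if (width_ <= n || height_ <= m) {
            width_ = std::max(width_, n + 1);
            height_ = std::max(height_, m + 1);
            cells_.assign(width_ * height_, 0);
        }

        // Initialize the first row and column of the used region.
        for (std::size_t i = 0; i <= n; ++i) at(i, 0) = static_cast<int>(i);
        for (std::size_t j = 0; j <= m; ++j) at(0, j) = static_cast<int>(j);

        for (std::size_t i = 1; i <= n; ++i) {
            for (std::size_t j = 1; j <= m; ++j) {
                const int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                at(i, j) = std::min({at(i - 1, j) + 1,
                                     at(i, j - 1) + 1,
                                     at(i - 1, j - 1) + cost});
            }
        }
        return at(n, m);
    }

private:
    // Cell (i, j) lives at index i + j * width_ in the flat buffer.
    int& at(std::size_t i, std::size_t j) { return cells_[i + j * width_]; }

    std::vector<int> cells_;
    std::size_t width_ = 0;
    std::size_t height_ = 0;
};
```

Reusing one calculator across many comparisons avoids an allocation per call, which matters when a query string is compared against every term in a dictionary. The trade-off is that a single instance should not be shared between threads without synchronization.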