Edit Distance
Source: Internet. Editor: 程序博客网. Time: 2024/04/26 18:02
Problem: Edit Distance
Definition of Minimum Edit Distance
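Edit distance measures how similar two strings are.
The minimum edit distance between two strings is the smallest number of editing operations (insertion, deletion, substitution) needed to transform one string into the other.
As shown in the figure above, d (deletion) denotes a deletion, s (substitution) denotes a substitution, and i (insertion) denotes an insertion. (For brevity, Edit Distance is abbreviated as ED below.)
If every operation has a cost of 1, then ED = 5.
If substitution has a cost of 2 (a variant that some NLP texts also call Levenshtein distance; the more common definition charges 1 per substitution), then ED = 8.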
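So how do we find the minimum edit distance between two strings?
There are many ways (or "paths") to transform one string into another. We know the start state (the first string), the end state (the other string), and the basic operations (insertion, deletion, substitution); what we want is the shortest path.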
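Consider the following two strings:
X has length n and Y has length m.
We define D(i, j) as the distance between the first i characters of X, X[1...i], and the first j characters of Y, Y[1...j], where 0 <= i <= n and 0 <= j <= m. The distance between X and Y is therefore D(n, m).
To compute D(n, m), we can work bottom-up: start from the base cases D(i, 0) and D(0, j), compute D(i, j) for small i and j (starting from 1), then use those results to compute larger D(i, j) until we finally obtain D(n, m).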
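The definition of D(i, j) can be written down almost literally as a recursive function with memoization. The following is a minimal sketch, assuming every operation costs 1; the function name editDistance and the helper D are illustrative, not taken from the original text.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Unit-cost edit distance, computed top-down with memoization.
// D(i, j) is the distance between the first i characters of x
// and the first j characters of y, so the answer is D(n, m).
int editDistance(const std::string& x, const std::string& y) {
    const int n = static_cast<int>(x.size());
    const int m = static_cast<int>(y.size());
    // memo[i][j] == -1 means D(i, j) has not been computed yet.
    std::vector<std::vector<int>> memo(n + 1, std::vector<int>(m + 1, -1));

    std::function<int(int, int)> D = [&](int i, int j) -> int {
        if (i == 0) return j;  // only insertions remain
        if (j == 0) return i;  // only deletions remain
        int& cached = memo[i][j];
        if (cached != -1) return cached;
        const int sub = D(i - 1, j - 1) + (x[i - 1] == y[j - 1] ? 0 : 1);
        const int del = D(i - 1, j) + 1;
        const int ins = D(i, j - 1) + 1;
        cached = std::min({sub, del, ins});
        return cached;
    };
    return D(n, m);
}
```

Each D(i, j) is computed once and reused, so the work is proportional to n * m, the same as filling the table bottom-up.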
The algorithm proceeds as shown in the figure below:
Approach: How to calculate the minimum edit distance
Now let's take another example.
target: s t o p
source: s o t
You can insert "t" (cost 1) to get "stot", then substitute "p" for the final "t" (cost 2) to get "stop", for a total cost of 3.
Alternatively, you can substitute "t" for "o" to get "stt", then substitute "o" for the final "t" to get "sto", and finally insert "p" to get "stop", for a total cost of 5.
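The two edit scripts above can be replayed mechanically to check their costs. This is an illustrative sketch only: the Edit struct, the applyEdits helper, and the 0-based positions are assumptions made for this example, with insertion and deletion costing 1 and substitution costing 2 as in the text.

```cpp
#include <string>
#include <vector>

// One edit step: 'i' inserts ch before position pos,
// 'd' deletes the character at pos, 's' replaces the character at pos with ch.
struct Edit {
    char op;
    int pos;
    char ch;
};

// Applies the edits in order to `text` (modified in place) and returns
// the total cost, with insert = 1, delete = 1, substitute = 2.
int applyEdits(std::string& text, const std::vector<Edit>& edits) {
    int cost = 0;
    for (const Edit& e : edits) {
        switch (e.op) {
            case 'i':
                text.insert(text.begin() + e.pos, e.ch);
                cost += 1;
                break;
            case 'd':
                text.erase(text.begin() + e.pos);
                cost += 1;
                break;
            case 's':
                text[e.pos] = e.ch;
                cost += 2;
                break;
        }
    }
    return cost;
}
```

Starting from "sot", the script {i, 1, t}, {s, 3, p} produces "stop" at cost 3, while {s, 1, t}, {s, 2, o}, {i, 3, p} produces "stop" at cost 5.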
There are plenty of ways to edit the source into the target, so how do we find the minimum edit distance? The algorithm introduced below uses dynamic programming.
i: the length of the target prefix considered
j: the length of the source prefix considered
D(0, 0) = 0
D(i, 0) = insertCost * i: the target prefix has length i and the source prefix is empty, so you must insert i characters.
D(0, j) = deleteCost * j: the target prefix is empty and the source prefix has length j, so you must delete j characters.
D(i, j) = min { D(i - 1, j) + insertCost(target[i]);
                D(i - 1, j - 1) + substituteCost(source[j], target[i]);
                D(i, j - 1) + deleteCost(source[j]) }
That is, choose whichever option costs the least.
substituteCost(source[j], target[i]) = 0 if source[j] equals target[i], otherwise 2.
insertCost = 1
deleteCost = 1
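The recurrence and base cases above translate directly into a table-filling loop. The following is a minimal sketch under the stated costs; the function name weightedEditDistance and the constant names are hypothetical, and indices are 0-based in the code where the text uses 1-based positions.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Costs used in the text: insert = 1, delete = 1, substitute = 2 (0 on a match).
const int kInsertCost = 1;
const int kDeleteCost = 1;
const int kSubstituteCost = 2;

// Minimum cost of editing `source` into `target`.
// D[i][j] covers the first i characters of target and the first j of source.
int weightedEditDistance(const std::string& source, const std::string& target) {
    const int n = static_cast<int>(target.size());
    const int m = static_cast<int>(source.size());
    std::vector<std::vector<int>> D(n + 1, std::vector<int>(m + 1, 0));

    // Base cases: build a prefix from nothing, or delete a whole prefix.
    for (int i = 1; i <= n; ++i) D[i][0] = i * kInsertCost;
    for (int j = 1; j <= m; ++j) D[0][j] = j * kDeleteCost;

    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= m; ++j) {
            const int sub = (source[j - 1] == target[i - 1]) ? 0 : kSubstituteCost;
            D[i][j] = std::min({D[i - 1][j] + kInsertCost,     // insert target[i]
                                D[i - 1][j - 1] + sub,         // match or substitute
                                D[i][j - 1] + kDeleteCost});   // delete source[j]
        }
    }
    return D[n][m];
}
```

For the running example, weightedEditDistance("sot", "stop") evaluates to 3, which matches the cheaper of the two hand-written edit sequences.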
This may seem confusing at first, so let's apply the algorithm to the example above.
First, fill in the table entry by entry, as follows.
D(1, 1) = min { D(0, 1) + insert(t[1]) = 2;
                D(0, 0) + substitute(s[1], t[1]) = 0;
                D(1, 0) + delete(s[1]) = 2 } = 0
D(1, 2) = min { D(0, 2) + insert(t[1]) = 3;
                D(0, 1) + substitute(s[2], t[1]) = 3;
                D(1, 1) + delete(s[2]) = 1 } = 1
................
Continuing in this way, you reach the final entry:
D(4, 3) = min { D(3, 3) + insert(t[4]) = 3;
                D(3, 2) + substitute(s[3], t[4]) = 3;
                D(4, 2) + delete(s[3]) = 3 } = 3
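To check the hand calculation, the whole table can be generated and printed. This is a sketch only, assuming the same costs as above (insert 1, delete 1, substitute 2); buildTable and formatTable are hypothetical helper names.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Builds the full table D for editing `source` into `target`,
// with insert = 1, delete = 1, substitute = 2 (0 when characters match).
// Rows are indexed by target prefix length i, columns by source prefix length j.
std::vector<std::vector<int>> buildTable(const std::string& source,
                                         const std::string& target) {
    const int n = static_cast<int>(target.size());
    const int m = static_cast<int>(source.size());
    std::vector<std::vector<int>> D(n + 1, std::vector<int>(m + 1, 0));
    for (int i = 1; i <= n; ++i) D[i][0] = i;
    for (int j = 1; j <= m; ++j) D[0][j] = j;
    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= m; ++j) {
            const int sub = (source[j - 1] == target[i - 1]) ? 0 : 2;
            D[i][j] = std::min({D[i - 1][j] + 1, D[i - 1][j - 1] + sub, D[i][j - 1] + 1});
        }
    }
    return D;
}

// Renders the table as text: one line per row, values separated by spaces.
std::string formatTable(const std::vector<std::vector<int>>& D) {
    std::string out;
    for (const auto& row : D) {
        for (std::size_t j = 0; j < row.size(); ++j) {
            if (j > 0) out += ' ';
            out += std::to_string(row[j]);
        }
        out += '\n';
    }
    return out;
}
```

For source "sot" and target "stop", formatTable(buildTable("sot", "stop")) gives the rows "0 1 2 3", "1 0 1 2", "2 1 2 1", "3 2 1 2", and "4 3 2 3". The bottom-right entry is D(4, 3) = 3.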
In fact, you can recover the actual sequence of edits by tracing back through the table, step by step. The picture below shows the results.
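One way to recover such a path is to walk back from D(n, m) to D(0, 0), at each cell picking a neighbor whose value explains the current one. The sketch below is an assumption-laden illustration, not the original author's method: editScript is a hypothetical name, the costs are insert 1, delete 1, substitute 2, and ties are broken in the order diagonal, insertion, deletion, so other equally cheap paths may exist.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns one minimum-cost edit script turning `source` into `target`
// (insert = 1, delete = 1, substitute = 2, matching characters cost 0).
// Ties are broken in the order: diagonal move, insertion, deletion.
std::vector<std::string> editScript(const std::string& source,
                                    const std::string& target) {
    const int n = static_cast<int>(target.size());
    const int m = static_cast<int>(source.size());
    std::vector<std::vector<int>> D(n + 1, std::vector<int>(m + 1, 0));
    for (int i = 1; i <= n; ++i) D[i][0] = i;
    for (int j = 1; j <= m; ++j) D[0][j] = j;
    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= m; ++j) {
            const int sub = (source[j - 1] == target[i - 1]) ? 0 : 2;
            D[i][j] = std::min({D[i - 1][j] + 1, D[i - 1][j - 1] + sub, D[i][j - 1] + 1});
        }
    }

    // Walk back from D[n][m] to D[0][0], recording the step behind each value.
    std::vector<std::string> steps;
    int i = n, j = m;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0) {
            const bool same = source[j - 1] == target[i - 1];
            if (D[i][j] == D[i - 1][j - 1] + (same ? 0 : 2)) {
                steps.push_back(same
                    ? std::string("keep ") + target[i - 1]
                    : std::string("substitute ") + source[j - 1] + "->" + target[i - 1]);
                --i;
                --j;
                continue;
            }
        }
        if (i > 0 && D[i][j] == D[i - 1][j] + 1) {
            steps.push_back(std::string("insert ") + target[i - 1]);
            --i;
        } else {
            steps.push_back(std::string("delete ") + source[j - 1]);
            --j;
        }
    }
    std::reverse(steps.begin(), steps.end());  // steps were collected back to front
    return steps;
}
```

For source "sot" and target "stop", this tie-breaking order yields "keep s", "insert t", "keep o", "substitute t->p", with total cost 1 + 2 = 3.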
Dynamic programming implementation (DP)
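#include <algorithm>
#include <string>
#include <vector>
using namespace std;

class Solution {
public:
    // Edit distance where insertion, deletion, and substitution each cost 1
    // (unlike the worked example above, which charges 2 for a substitution).
    int minDistance(string word1, string word2) {
        const int n = static_cast<int>(word1.size());
        const int m = static_cast<int>(word2.size());
        // dist[i][j]: distance between the first i characters of word1 and
        // the first j characters of word2. A vector is used because
        // variable-length arrays are not standard C++.
        vector<vector<int>> dist(n + 1, vector<int>(m + 1, 0));
        for (int i = 0; i <= n; i++) dist[i][0] = i;
        for (int j = 0; j <= m; j++) dist[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                if (word1[i - 1] == word2[j - 1]) {
                    dist[i][j] = dist[i - 1][j - 1];  // characters match: no cost
                } else {
                    // 1 + the cheapest of substitute, delete, insert
                    dist[i][j] = 1 + min({dist[i - 1][j - 1], dist[i - 1][j], dist[i][j - 1]});
                }
            }
        }
        return dist[n][m];
    }
};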
References:
(1) 自然语言处理学习篇02--Edit Distance (Natural Language Processing Study Notes 02: Edit Distance)
(2) Edit Distance