Edit Distance

Source: Internet. Editor: 程序博客网. Date: 2024/04/26 18:02

Problem: Edit Distance

(link)

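Definition of Minimum Edit Distance

Edit distance measures how similar two strings are.

The minimum edit distance between two strings is the smallest number of edit operations (insertion, deletion, substitution) needed to transform one string into the other.

As shown in the figure above, d stands for deletion, s for substitution, and i for insertion. (For brevity, Edit Distance is abbreviated as ED below.)

If every operation has a cost of 1, then ED = 5.

If substitution has a cost of 2 (the variant referred to here as Levenshtein Distance; the classic definition charges 1 for every operation), then ED = 8.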
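The two totals can be checked with a small sketch. The operation string "dssis" used in the note below is an assumption: the figure is not reproduced here, so it is only one sequence with two insert/delete steps and three substitutions, which is what the totals 5 and 8 imply (a + b = 5 and a + 2b = 8 give b = 3, a = 2). The function name scriptCost is illustrative as well.

```cpp
#include <stdexcept>
#include <string>

// Total cost of an edit script written as a string of operation letters:
// 'd' = deletion, 'i' = insertion, 's' = substitution.
// Insertions and deletions cost 1; substitutions cost `subCost`.
int scriptCost(const std::string& ops, int subCost) {
    int total = 0;
    for (char op : ops) {
        switch (op) {
            case 'd':
            case 'i':
                total += 1;
                break;
            case 's':
                total += subCost;
                break;
            default:
                throw std::invalid_argument("unknown edit operation");
        }
    }
    return total;
}
```

With this sketch, scriptCost("dssis", 1) gives 5 and scriptCost("dssis", 2) gives 8, matching the two ED values above.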

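So how do we find the minimum edit distance between two strings?

There are many ways (or "paths") to transform one string into another. We know the start state (the first string), the end state (the other string), and the basic operations (insertion, deletion, substitution); what we want is the shortest path.

For the following two strings:

X has length n, and Y has length m.

We define D(i,j) as the distance between the first i characters of X, X[1...i], and the first j characters of Y, Y[1...j], where 0 <= i <= n and 0 <= j <= m. The distance between X and Y is therefore D(n,m).

To compute D(n,m), we can work bottom-up: first compute D(i,j) for small i and j (starting from 1), then use those results to compute D(i,j) for larger i and j, until we finally obtain D(n,m).

The algorithm is shown in the figure below: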


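The figure above uses the "Levenshtein Distance", i.e. substitution costs 2.
Study the loop body in the figure carefully. D(i,j) is the smallest of the following candidates:
1. D(i-1, j) + 1 (delete X[i]);
2. D(i, j-1) + 1 (insert Y[j]);
3. D(i-1, j-1) + 2 if X[i] and Y[j] differ (a substitution is needed), or D(i-1, j-1) + 0 if the two characters are the same.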
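The three candidates above can be sketched in C++ as follows. This is a minimal sketch, not the code from the figure: the function name editDistanceSub2 is made up for illustration, and the base cases D(i,0) = i and D(0,j) = j are the usual initialization, assumed here because the figure is not reproduced.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Minimum edit distance between x and y with insert = 1, delete = 1,
// substitute = 2. D[i][j] is the distance between the first i characters
// of x and the first j characters of y.
int editDistanceSub2(const std::string& x, const std::string& y) {
    const std::size_t n = x.size(), m = y.size();
    std::vector<std::vector<int>> D(n + 1, std::vector<int>(m + 1, 0));

    // Assumed base cases: turning a prefix into the empty string (or back)
    // takes one deletion (or insertion) per character.
    for (std::size_t i = 1; i <= n; ++i) D[i][0] = static_cast<int>(i);
    for (std::size_t j = 1; j <= m; ++j) D[0][j] = static_cast<int>(j);

    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const int sub = (x[i - 1] == y[j - 1]) ? 0 : 2;
            D[i][j] = std::min({D[i - 1][j] + 1,          // candidate 1
                                D[i][j - 1] + 1,          // candidate 2
                                D[i - 1][j - 1] + sub});  // candidate 3
        }
    }
    return D[n][m];
}
```

For example, editDistanceSub2("abc", "abd") is 2 (one substitution), and editDistanceSub2("", "abc") is 3 (three insertions).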


Approach: How to calculate the minimum edit distance

(link)


Now let us take another example.

target: s t o p

source: s o t

You can insert "t" (cost 1) to get "stot", then substitute "p" for the final "t" (cost 2) to get "stop", for a total cost of 3.

Alternatively, you can substitute "t" for "o" to get "stt", then substitute "o" for the second "t" to get "sto", and finally insert "p" to get "stop", for a total cost of 2 + 2 + 1 = 5.
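To check these two edit sequences mechanically, here is a small sketch that applies a list of edits to a string and adds up the cost (insert = 1, delete = 1, substitute = 2). The types Edit and EditResult and the function applyEdits are hypothetical names introduced only for this illustration; positions are 0-based.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct Edit {
    char kind;        // 'i' insert, 'd' delete, 's' substitute
    std::size_t pos;  // 0-based position in the current string
    char ch;          // character to insert or substitute (unused for 'd')
};

struct EditResult {
    std::string text;  // string after all edits
    int cost;          // total cost of the edits
};

// Applies the edits in order and returns the resulting string and cost.
EditResult applyEdits(std::string s, const std::vector<Edit>& edits) {
    int cost = 0;
    for (const Edit& e : edits) {
        switch (e.kind) {
            case 'i':
                s.insert(e.pos, 1, e.ch);  // insert before position pos
                cost += 1;
                break;
            case 'd':
                s.erase(e.pos, 1);
                cost += 1;
                break;
            case 's':
                s.at(e.pos) = e.ch;
                cost += 2;
                break;
            default:
                throw std::invalid_argument("unknown edit kind");
        }
    }
    return {s, cost};
}
```

The first sequence is applyEdits("sot", {{'i', 1, 't'}, {'s', 3, 'p'}}), which yields "stop" at cost 3. The second is applyEdits("sot", {{'s', 1, 't'}, {'s', 2, 'o'}, {'i', 3, 'p'}}), which also yields "stop" but at cost 5.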

So there are many ways to edit the source into the target. How do we find the minimum edit distance? The algorithm introduced below uses dynamic programming.

i: the length of the target prefix considered so far

j: the length of the source prefix considered so far

D(0, 0) = 0

D(i, 0) = insertCost * i: the target prefix has length i and the source prefix has length 0, so you must insert i characters.

D(0, j) = deleteCost * j: the target prefix has length 0 and the source prefix has length j, so you must delete j characters.

D(i, j) = min { D(i - 1, j) + insertCost(target[i]);

                D(i - 1, j - 1) + substituteCost(source[j], target[i]);

                D(i, j - 1) + deleteCost(source[j]) }

That is, choose whichever option costs the least.

substituteCost(source[j], target[i]) = 0 if source[j] equals target[i], and 2 otherwise.

insertCost = 1

deleteCost = 1
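The definitions above translate fairly directly into code. The following is a sketch under the stated costs; the names Table and buildTable are invented for this illustration, and the costs are passed as parameters only so the defaults (1, 1, 2) are visible in one place. Strings are 0-based in C++, so target[i] in the formulas becomes target[i - 1].

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

using Table = std::vector<std::vector<int>>;

// Builds the full table D, where D[i][j] is the cost of turning the first j
// characters of `source` into the first i characters of `target`.
Table buildTable(const std::string& target, const std::string& source,
                 int insertCost = 1, int deleteCost = 1, int substituteCost = 2) {
    const std::size_t n = target.size(), m = source.size();
    Table D(n + 1, std::vector<int>(m + 1, 0));

    // D(i, 0) = insertCost * i and D(0, j) = deleteCost * j.
    for (std::size_t i = 1; i <= n; ++i) D[i][0] = D[i - 1][0] + insertCost;
    for (std::size_t j = 1; j <= m; ++j) D[0][j] = D[0][j - 1] + deleteCost;

    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const int sub = (target[i - 1] == source[j - 1]) ? 0 : substituteCost;
            D[i][j] = std::min({D[i - 1][j] + insertCost,   // insert target[i]
                                D[i - 1][j - 1] + sub,      // substitute or match
                                D[i][j - 1] + deleteCost}); // delete source[j]
        }
    }
    return D;
}
```

Calling buildTable("stop", "sot") gives a 5 x 4 table whose bottom-right entry, D[4][3], is 3, the minimum edit distance for the example.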

This may look confusing at first, so let us apply the algorithm to the example above.

First, you obtain the table below.

 

D(1, 1) = min { D(0, 1) + insert(t[1]) = 2;

                D(0, 0) + substitute(s[1], t[1]) = 0;

                D(1, 0) + delete(s[1]) = 2 } = 0

D(1, 2) = min { D(0, 2) + insert(t[1]) = 3;

                D(0, 1) + substitute(s[2], t[1]) = 3;

                D(1, 1) + delete(s[2]) = 1 } = 1

................

Continuing in this way, you reach the last entry:

D(4, 3) = min { D(3, 3) + insert(t[4]) = 3;

                D(3, 2) + substitute(s[3], t[4]) = 3;

                D(4, 2) + delete(s[3]) = 3 } = 3

In fact, by noting which option produced the minimum at each step, you can recover the actual sequence of edits. The picture below shows the result.
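One way to recover such a path is to fill the table and then walk back from D(n, m) to D(0, 0). The sketch below is an illustration, not the method behind the picture: the function name editScript, the operation letters, and the tie-breaking order (diagonal first, then insert, then delete) are all assumptions, and other equally cheap paths may exist.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Returns one cheapest edit script turning `source` into `target`, as a
// string of operation letters read left to right:
// 'm' = keep a matching character, 's' = substitute, 'i' = insert, 'd' = delete.
// Costs follow the text: insert = 1, delete = 1, substitute = 2.
std::string editScript(const std::string& target, const std::string& source) {
    const std::size_t n = target.size(), m = source.size();
    std::vector<std::vector<int>> D(n + 1, std::vector<int>(m + 1, 0));
    for (std::size_t i = 1; i <= n; ++i) D[i][0] = static_cast<int>(i);
    for (std::size_t j = 1; j <= m; ++j) D[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const int sub = (target[i - 1] == source[j - 1]) ? 0 : 2;
            D[i][j] = std::min({D[i - 1][j] + 1, D[i - 1][j - 1] + sub,
                                D[i][j - 1] + 1});
        }
    }

    // Walk back from D[n][m], at each cell picking a predecessor that
    // explains the stored value.
    std::string ops;
    std::size_t i = n, j = m;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0) {
            const bool same = target[i - 1] == source[j - 1];
            if (D[i][j] == D[i - 1][j - 1] + (same ? 0 : 2)) {
                ops += (same ? 'm' : 's');
                --i;
                --j;
                continue;
            }
        }
        if (i > 0 && D[i][j] == D[i - 1][j] + 1) {
            ops += 'i';  // target[i] was inserted
            --i;
        } else {
            ops += 'd';  // source[j] was deleted
            --j;
        }
    }
    std::reverse(ops.begin(), ops.end());
    return ops;
}
```

For target "stop" and source "sot", this sketch returns "mims": keep "s", insert "t", keep "o", substitute "p" for "t", for a total cost of 1 + 2 = 3.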

Dynamic programming implementation (DP)

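#include <algorithm>
#include <string>
#include <vector>
using namespace std;

class Solution {
public:
    // Note: in this solution insert, delete and substitute each cost 1.
    int minDistance(string word1, string word2) {
        const size_t n = word1.size(), m = word2.size();
        // dist[i][j]: distance between the first i characters of word1
        // and the first j characters of word2. A vector replaces the
        // original variable-length array, which is not standard C++.
        vector<vector<int>> dist(n + 1, vector<int>(m + 1, 0));
        for (size_t i = 0; i <= n; i++)
            dist[i][0] = static_cast<int>(i);
        for (size_t j = 0; j <= m; j++)
            dist[0][j] = static_cast<int>(j);
        for (size_t i = 1; i <= n; i++) {
            for (size_t j = 1; j <= m; j++) {
                if (word1[i - 1] == word2[j - 1]) {
                    dist[i][j] = dist[i - 1][j - 1];
                } else {
                    int tmp = min(dist[i - 1][j], dist[i][j - 1]);
                    dist[i][j] = 1 + min(dist[i - 1][j - 1], tmp);
                }
            }
        }
        return dist[n][m];
    }
};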

References:

(1) 自然语言处理处理学习篇02--Edit Distance (Natural language processing study notes, part 02: Edit Distance)

(2) Edit Distance

