Minimum Edit Distance with the Edit Path Output

Source: Internet  Editor: 程序博客网  Time: 2024/05/22 21:05

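Edit distance, also known as Levenshtein distance, is the minimum number of edit operations needed to transform one string into another. The permitted operations are substituting one character for another, inserting a character, and deleting a character.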

For example, turning kitten into sitting takes three edits:

  1. sitten (k→s)
  2. sittin (e→i)
  3. sitting (→g)

The Russian scientist Vladimir Levenshtein introduced the concept in 1965.

(The introduction above is taken from Wikipedia, "编辑距离" (Edit distance), http://zh.wikipedia.org/wiki/%E7%B7%A8%E8%BC%AF%E8%B7%9D%E9%9B%A2)


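Computing the minimum edit distance therefore means finding the smallest number of insertions, deletions, and substitutions that convert one string into the other.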


A common solution is dynamic programming.



For the details of the computation, please consult the relevant articles; they are not repeated here.
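For orientation only, the standard recurrence behind the dynamic-programming solution can be written as below. The notation is an assumption of this note, not the original author's: D(i, j) is the distance between the first i characters of s and the first j characters of t, and [s_i ≠ t_j] is 1 when the two characters differ and 0 otherwise. The answer is D(n, m), where n and m are the lengths of s and t.

```latex
\[
\begin{aligned}
D(i, 0) &= i, \qquad D(0, j) = j, \\
D(i, j) &= \min
\begin{cases}
D(i-1, j) + 1 & \text{(delete } s_i\text{)} \\
D(i, j-1) + 1 & \text{(insert } t_j\text{)} \\
D(i-1, j-1) + [s_i \neq t_j] & \text{(substitute, or match at no cost)}
\end{cases}
\end{aligned}
\]
```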

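The approach with O(mn) space complexity, where m and n are the lengths of the two strings, keeps every intermediate result while the distance matrix is computed.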
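As a rough illustration of the full-matrix approach, the sketch below fills an (n+1) x (m+1) table and reads the distance from its bottom-right cell. The class name `LevenshteinMatrixSketch` and the method `buildMatrix` are hypothetical names chosen for this example; they do not come from any library.

```java
// Minimal sketch of the O(mn)-space dynamic program (names are illustrative).
class LevenshteinMatrixSketch {

    // d[i][j] holds the distance between the first i characters of s
    // and the first j characters of t.
    static int[][] buildMatrix(CharSequence s, CharSequence t) {
        int n = s.length();
        int m = t.length();
        int[][] d = new int[n + 1][m + 1];

        // Turning a prefix into the empty string takes one deletion per character,
        // and building a prefix from the empty string takes one insertion per character.
        for (int i = 0; i <= n; i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= m; j++) {
            d[0][j] = j;
        }

        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                int substitution = d[i - 1][j - 1] + cost;
                d[i][j] = Math.min(Math.min(deletion, insertion), substitution);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        int[][] d = buildMatrix("kitten", "sitting");
        System.out.println(d[6][7]); // prints 3
    }
}
```

Because every cell is retained, the table can later be walked backwards to recover which edits were made, at the price of memory proportional to the product of the two lengths.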

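The StringUtils class in Apache's Java utility library (package commons-lang; the latest version at the time of writing is commons-lang3-3.3.2) instead keeps only the previous row of results. This reduces the space used and avoids running out of memory on long strings. For details, see StringUtils.getLevenshteinDistance(CharSequence s, CharSequence t) in the library source, org.apache.commons.lang3.StringUtils: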

    // Misc
    //-----------------------------------------------------------------------
    /**
     * <p>Find the Levenshtein distance between two Strings.</p>
     *
     * <p>This is the number of changes needed to change one String into
     * another, where each change is a single character modification (deletion,
     * insertion or substitution).</p>
     *
     * <p>The previous implementation of the Levenshtein distance algorithm
     * was from <a href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ld.htm</a></p>
     *
     * <p>Chas Emerick has written an implementation in Java, which avoids an OutOfMemoryError
     * which can occur when my Java implementation is used with very large strings.<br>
     * This implementation of the Levenshtein distance algorithm
     * is from <a href="http://www.merriampark.com/ldjava.htm">http://www.merriampark.com/ldjava.htm</a></p>
     *
     * <pre>
     * StringUtils.getLevenshteinDistance(null, *)             = IllegalArgumentException
     * StringUtils.getLevenshteinDistance(*, null)             = IllegalArgumentException
     * StringUtils.getLevenshteinDistance("","")               = 0
     * StringUtils.getLevenshteinDistance("","a")              = 1
     * StringUtils.getLevenshteinDistance("aaapppp", "")       = 7
     * StringUtils.getLevenshteinDistance("frog", "fog")       = 1
     * StringUtils.getLevenshteinDistance("fly", "ant")        = 3
     * StringUtils.getLevenshteinDistance("elephant", "hippo") = 7
     * StringUtils.getLevenshteinDistance("hippo", "elephant") = 7
     * StringUtils.getLevenshteinDistance("hippo", "zzzzzzzz") = 8
     * StringUtils.getLevenshteinDistance("hello", "hallo")    = 1
     * </pre>
     *
     * @param s  the first String, must not be null
     * @param t  the second String, must not be null
     * @return result distance
     * @throws IllegalArgumentException if either String input {@code null}
     * @since 3.0 Changed signature from getLevenshteinDistance(String, String) to
     * getLevenshteinDistance(CharSequence, CharSequence)
     */
    public static int getLevenshteinDistance(CharSequence s, CharSequence t) {
        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }

        /*
           The difference between this impl. and the previous is that, rather
           than creating and retaining a matrix of size s.length() + 1 by t.length() + 1,
           we maintain two single-dimensional arrays of length s.length() + 1.  The first, d,
           is the 'current working' distance array that maintains the newest distance cost
           counts as we iterate through the characters of String s.  Each time we increment
           the index of String t we are comparing, d is copied to p, the second int[].  Doing so
           allows us to retain the previous cost counts as required by the algorithm (taking
           the minimum of the cost count to the left, up one, and diagonally up and to the left
           of the current cost count being calculated).  (Note that the arrays aren't really
           copied anymore, just switched...this is clearly much better than cloning an array
           or doing a System.arraycopy() each time  through the outer loop.)

           Effectively, the difference between the two implementations is this one does not
           cause an out of memory condition when calculating the LD over two very large strings.
         */

        int n = s.length(); // length of s
        int m = t.length(); // length of t

        if (n == 0) {
            return m;
        } else if (m == 0) {
            return n;
        }

        if (n > m) {
            // swap the input strings to consume less memory
            final CharSequence tmp = s;
            s = t;
            t = tmp;
            n = m;
            m = t.length();
        }

        int p[] = new int[n + 1]; //'previous' cost array, horizontally
        int d[] = new int[n + 1]; // cost array, horizontally
        int _d[]; //placeholder to assist in swapping p and d

        // indexes into strings s and t
        int i; // iterates through s
        int j; // iterates through t

        char t_j; // jth character of t

        int cost; // cost

        for (i = 0; i <= n; i++) {
            p[i] = i;
        }

        for (j = 1; j <= m; j++) {
            t_j = t.charAt(j - 1);
            d[0] = j;

            for (i = 1; i <= n; i++) {
                cost = s.charAt(i - 1) == t_j ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
            }

            // copy current distance counts to 'previous row' distance counts
            _d = p;
            p = d;
            d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now
        // actually has the most recent cost counts
        return p[n];
    }


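If you need the edit path as well as the Levenshtein distance, keeping only the previous row is no longer enough: the straightforward approach is to keep the whole matrix and then backtrack from its last cell to recover the path.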
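The backtracking idea can be sketched in a simplified form as follows. This is only an illustration under assumed names (`EditPathSketch`, `editPath`), and it reports each edit as a plain string rather than using the operation objects of the implementation described in this post: starting from the bottom-right cell, it steps to whichever neighbouring cell explains the current value and records the matching edit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Minimal sketch: fill the full matrix, then walk it backwards (names are illustrative).
class EditPathSketch {

    static List<String> editPath(String s, String t) {
        int n = s.length();
        int m = t.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }

        List<String> ops = new ArrayList<>();
        int i = n;
        int j = m;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && s.charAt(i - 1) == t.charAt(j - 1)
                    && d[i][j] == d[i - 1][j - 1]) {
                i--; // characters match: move diagonally, no edit recorded
                j--;
            } else if (i > 0 && j > 0 && d[i][j] == d[i - 1][j - 1] + 1) {
                ops.add("replace " + s.charAt(i - 1) + " -> " + t.charAt(j - 1));
                i--;
                j--;
            } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) {
                ops.add("delete " + s.charAt(i - 1));
                i--;
            } else {
                ops.add("insert " + t.charAt(j - 1));
                j--;
            }
        }
        // Edits were collected from the end of the strings; put them in reading order.
        Collections.reverse(ops);
        return ops;
    }

    public static void main(String[] args) {
        // prints [replace k -> s, replace e -> i, insert g]
        System.out.println(editPath("kitten", "sitting"));
    }
}
```

When several neighbours explain the same cell, more than one minimal path exists; the order of the branches decides which one is reported.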

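Define an operation class, OperateObj, that stores the position of each edit and its replacement target.

    package cn.com.sp.align.model;

    /*
     * Operation class: stores the details of a single edit operation.
     * It has three members:
     *     index       - the position of the operation
     *     targetStr   - the replacement target (delete: replaced with the empty string "";
     *                   replace: replaced with the target character;
     *                   add: replaced with the target string)
     *     operateType - the type of operation, defined as the enum OperateEnum.
     * The operation type could be inferred from the replacement target; it is stored
     * explicitly to avoid working it out at every step.
     */
    public class OperateObj {

        // Operation types: add, delete, replace
        public enum OperateEnum {
            add, delete, replace;
        }

        // Position of the operation in the original string, before the operation is applied
        private int index = 0;

        // Replacement string used by the operation
        private String targetStr = "";

        // Operation type
        private OperateEnum operateType;

        public OperateObj(int index, String targetStr, OperateEnum operateType) {
            this.index = index;
            this.targetStr = targetStr;
            this.operateType = operateType;
        }

        public int getIndex() {
            return index;
        }

        public void setIndex(int index) {
            this.index = index;
        }

        public String getTargetStr() {
            return targetStr;
        }

        public void setTargetStr(String targetStr) {
            this.targetStr = targetStr;
        }

        public OperateEnum getOperateType() {
            return operateType;
        }

        public void setOperateType(OperateEnum operateType) {
            this.operateType = operateType;
        }
    }
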
The implementation below computes the Levenshtein distance while keeping every entry of the matrix; the implementing class is StringUtils_SP:

    package cn.com.sp.align.levenshtein;

    import java.util.ArrayList;
    import java.util.List;

    import cn.com.sp.align.model.OperateObj;
    import cn.com.sp.align.model.OperateObj.OperateEnum;

    public class StringUtils_SP {

        public static int getLevenshteinDistance(CharSequence s, CharSequence t, List<OperateObj> operateList) {
            if (s == null || t == null) {
                throw new IllegalArgumentException("Strings must not be null");
            }

            int n = s.length(); // length of s
            int m = t.length(); // length of t

            // For an empty input the distance is returned directly and
            // operateList is left untouched.
            if (n == 0) {
                return m;
            } else if (m == 0) {
                return n;
            }

            // distance[i][j] is the edit distance between the first i characters
            // of s and the first j characters of t
            int[][] distance = new int[n + 1][m + 1];
            for (int i = 0; i <= n; ++i) {
                distance[i][0] = i;
            }
            for (int j = 1; j <= m; ++j) {
                distance[0][j] = j;
            }

            for (int i = 1; i <= n; ++i) {
                for (int j = 1; j <= m; ++j) {
                    int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                    int tempCost = Math.min(distance[i - 1][j] + 1, distance[i][j - 1] + 1);
                    distance[i][j] = Math.min(distance[i - 1][j - 1] + cost, tempCost);
                }
            }

            // Backtrack from the bottom-right cell, recording one operation per edit.
            int i = n, j = m;
            int minDistance = distance[i][j];
            while (i > 0 && j > 0) {
                if (distance[i][j - 1] + 1 == minDistance) {
                    // add: s[i-1] is replaced by s[i-1] followed by t[j-1]
                    operateList.add(new OperateObj(i - 1, s.charAt(i - 1) + "" + t.charAt(j - 1), OperateEnum.add));
                    minDistance = distance[i][j - 1];
                    j -= 1;
                } else if (distance[i - 1][j] + 1 == minDistance) {
                    // delete: s[i-1] is replaced by the empty string
                    operateList.add(new OperateObj(i - 1, "", OperateEnum.delete));
                    minDistance = distance[i - 1][j];
                    i -= 1;
                } else if (distance[i - 1][j - 1] + 1 == minDistance) {
                    // replace: s[i-1] is replaced by t[j-1]
                    operateList.add(new OperateObj(i - 1, t.charAt(j - 1) + "", OperateEnum.replace));
                    minDistance = distance[i - 1][j - 1];
                    i -= 1;
                    j -= 1;
                } else {
                    // the characters match: move diagonally without recording an edit
                    i -= 1;
                    j -= 1;
                }
            }

            // Whatever is left of s has to be deleted.
            while (i > 0) {
                operateList.add(new OperateObj(i - 1, "", OperateEnum.delete));
                i -= 1;
            }

            // Whatever is left of t has to be inserted at the front. Each add replaces
            // the character currently at index 0 with t[j-1] followed by that character,
            // so keep track of which character that is. (The original always used
            // s.charAt(0), which corrupts the result after the first such insertion.)
            char first = s.charAt(0);
            while (j > 0) {
                operateList.add(new OperateObj(0, t.charAt(j - 1) + "" + first, OperateEnum.add));
                first = t.charAt(j - 1);
                j -= 1;
            }

            // The distance is in the bottom-right cell (the original returned
            // distance[s.length()-1][t.length()-1], which is off by one row and column).
            return distance[n][m];
        }

        public static void main(String[] args) {
            String s = "中华人民共和国";
            String t = "中化人名和国";
            List<OperateObj> operateList = new ArrayList<OperateObj>();
            // "编辑距离为" means "the edit distance is"
            System.out.println("编辑距离为 : " + StringUtils_SP.getLevenshteinDistance(s, t, operateList));

            // Apply the operations in order, printing the working string before each step.
            String operateStr = s;
            for (int i = 0; i < operateList.size(); ++i) {
                OperateObj operateObj = operateList.get(i);
                System.out.println(operateStr);
                System.out.println(s.charAt(operateObj.getIndex()) + "(" + operateObj.getIndex() + ","
                        + operateObj.getOperateType() + ") -> " + operateObj.getTargetStr());
                operateStr = operateStr.substring(0, operateObj.getIndex()) + operateObj.getTargetStr()
                        + operateStr.substring(operateObj.getIndex() + 1);
            }
            System.out.println("");
            System.out.println(operateStr); // the final result, which should equal t
        }
    }

The output is as follows (the label 编辑距离为 means "the edit distance is"):

    编辑距离为 : 3
    中华人民共和国
    共(4,delete) -> 
    中华人民和国
    民(3,replace) -> 名
    中华人名和国
    华(1,replace) -> 化

    中化人名和国


