Comparing Python "Edit Distance" (Levenshtein Distance) Functions
Source: Internet  Editor: 程序博客网  Date: 2024/05/17 22:22
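This post collects several commonly used functions found online for computing the Levenshtein distance.
Function (1) uses the numerical package NumPy; function (2) uses an algorithm similar to (1), and both rely on dynamic programming (DP); function (3) comes from Wikipedia; function (4) calls the third-party Python library Levenshtein directly.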
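For orientation, the DP that functions (1) to (3) build on can be written with the full table, before any row-saving optimization. The following is a minimal sketch and not one of the functions measured in this post; the name levenshtein_full_matrix is hypothetical and chosen for illustration only.

```python
def levenshtein_full_matrix(s, t):
    """Reference Levenshtein distance using the full (len(s)+1) x (len(t)+1) table.

    Hypothetical helper for illustration; not part of the benchmark in this post.
    """
    rows, cols = len(s) + 1, len(t) + 1
    # d[i][j] = edit distance between the first i chars of s and the first j chars of t
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # turn s[:i] into "" by deleting i characters
    for j in range(cols):
        d[0][j] = j  # turn "" into t[:j] by inserting j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[-1][-1]


if __name__ == "__main__":
    print(levenshtein_full_matrix("kitten", "sitting"))  # prints 3
```

Each cell depends only on the row above and the cell to its left, which is why the two-row variants used in this post can discard everything older than the previous row.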
The source code and results follow:
import time
from functools import wraps

import numpy
import Levenshtein


def fn_timer(function):
    """Decorator that prints the wall-clock time taken by each call."""
    @wraps(function)
    def function_timer(*args, **kwargs):
        t0 = time.time()
        result = function(*args, **kwargs)
        t1 = time.time()
        # __name__ works on Python 2 and 3; func_name exists on Python 2 only.
        print("Total time running %s: %s seconds" % (function.__name__, str(t1 - t0)))
        return result
    return function_timer


def levenshtein1(source, target):
    if len(source) < len(target):
        return levenshtein1(target, source)

    # So now we have len(source) >= len(target).
    if len(target) == 0:
        return len(source)

    # We call tuple() to force strings to be used as sequences
    # ('c', 'a', 't', 's') - numpy uses them as values by default.
    source = numpy.array(tuple(source))
    target = numpy.array(tuple(target))

    # We use a dynamic programming algorithm, but with the
    # added optimization that we only need the last two rows
    # of the matrix.
    previous_row = numpy.arange(target.size + 1)
    for s in source:
        # Insertion (target grows longer than source):
        current_row = previous_row + 1

        # Substitution or matching:
        # Target and source items are aligned, and either
        # are different (cost of 1), or are the same (cost of 0).
        current_row[1:] = numpy.minimum(
            current_row[1:],
            numpy.add(previous_row[:-1], target != s))

        # Deletion (target grows shorter than source):
        current_row[1:] = numpy.minimum(
            current_row[1:],
            current_row[0:-1] + 1)

        previous_row = current_row

    return previous_row[-1]


def levenshtein2(s1, s2):
    if len(s1) < len(s2):
        return levenshtein2(s2, s1)

    # len(s1) >= len(s2)
    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            # j + 1 instead of j, since previous_row and current_row
            # are one element longer than s2.
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]


def levenshtein3(s, t):
    ''' From Wikipedia article; Iterative with two matrix rows. '''
    if s == t:
        return 0
    elif len(s) == 0:
        return len(t)
    elif len(t) == 0:
        return len(s)
    v0 = [None] * (len(t) + 1)
    v1 = [None] * (len(t) + 1)
    for i in range(len(v0)):
        v0[i] = i
    for i in range(len(s)):
        v1[0] = i + 1
        for j in range(len(t)):
            cost = 0 if s[i] == t[j] else 1
            v1[j + 1] = min(v1[j] + 1, v0[j + 1] + 1, v0[j] + cost)
        # Copy the current row into the previous row for the next iteration.
        for j in range(len(v0)):
            v0[j] = v1[j]

    return v1[len(t)]


# As originally published, calllevenshtein1 and calllevenshtein2 both called
# levenshtein3, so the timings listed under Results for wrappers 1-3 all
# measure levenshtein3. Each wrapper now calls the function it is named after.
@fn_timer
def calllevenshtein1(s, t, n):
    for i in range(n):
        levenshtein1(s, t)


@fn_timer
def calllevenshtein2(s, t, n):
    for i in range(n):
        levenshtein2(s, t)


@fn_timer
def calllevenshtein3(s, t, n):
    for i in range(n):
        levenshtein3(s, t)


@fn_timer
def calllevenshtein4(s, t, n):
    for i in range(n):
        Levenshtein.distance(s, t)


if __name__ == "__main__":
    n = 50000
    a = 'abddcdefdgbd22svb'
    b = 'bcdefg34rdyvdfsd'
    calllevenshtein1(a, b, n)
    calllevenshtein2(a, b, n)
    calllevenshtein3(a, b, n)
    calllevenshtein4(a, b, n)
Results:
Total time running calllevenshtein1: 16.0750000477 seconds
Total time running calllevenshtein2: 16.4990000725 seconds
Total time running calllevenshtein3: 16.2939999104 seconds
Total time running calllevenshtein4: 0.0629999637604 seconds
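Judging from the results, calling the third-party Python package is by far the most efficient, roughly 250 times faster than the pure-Python loop here, because it is implemented in C internally and has an optimized algorithm structure. Note that the timings for calllevenshtein1 to calllevenshtein3 were all measured with levenshtein3 being called, which is why they are nearly identical; they do not compare levenshtein1 and levenshtein2 against each other.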
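A comparison like this could also be run with the standard-library timeit module, which repeats the measurement and makes it harder to time the wrong function by accident. The following is only a sketch under assumptions: benchmark and two_row_distance are hypothetical names, two_row_distance stands in for the pure-Python functions in this post, and the Levenshtein package is treated as optional.

```python
import timeit


def two_row_distance(s, t):
    """Pure-Python Levenshtein distance keeping only two rows (illustrative stand-in)."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (cs != ct),  # substitution or match
            ))
        previous = current
    return previous[-1]


def benchmark(funcs, s, t, number=1000, repeat=3):
    """Return {name: best total seconds for `number` calls} for each callable in funcs."""
    results = {}
    for name, func in funcs.items():
        timer = timeit.Timer(lambda: func(s, t))
        # Take the minimum over the repeats to reduce noise from other processes.
        results[name] = min(timer.repeat(repeat=repeat, number=number))
    return results


if __name__ == "__main__":
    candidates = {"two_row_distance": two_row_distance}
    try:
        import Levenshtein  # third-party package used in this post; optional here
        candidates["Levenshtein.distance"] = Levenshtein.distance
    except ImportError:
        pass
    timings = benchmark(candidates, "abddcdefdgbd22svb", "bcdefg34rdyvdfsd")
    for name, seconds in timings.items():
        print("%s: %.4f seconds" % (name, seconds))
```

Passing the callables in a dictionary keyed by name ties each reported timing to the function that was actually executed.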