Chinese String Fuzzy Matching Algorithm | C# Levenshtein Distance
C# Levenshtein Distance
by Sam Allen - Updated November 27, 2009
You want to match strings approximately (fuzzy matching), using the Levenshtein distance algorithm. Many projects need this logic, including programs that manage prescription drugs, spell-checkers, suggestion searches and plagiarism detectors. Here we see a simple but complete implementation of this algorithm in the C# programming language.
Words: ant, aunt
Levenshtein distance: 1
Note: Only 1 edit is needed.
The 'u' must be added at index 1 (zero-based).
Words: Samantha, Sam
Levenshtein distance: 5
Note: The final 5 letters must be removed.
Words: Flomax, Volmax
Levenshtein distance: 3
Note: The first 3 letters must be changed.
Drug names are commonly confused.
Levenshtein algorithm
First, credit goes to Vladimir Levenshtein, a Russian scientist. Here we see the C# code I adapted and optimized. It uses a two-dimensional array instead of a jagged array because the table is rectangular: every row has the same width, so a single width and a single height describe it.
=== Program that implements the algorithm (C#) ===
using System;

/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
    /// <summary>
    /// Compute the distance between two strings.
    /// </summary>
    public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;

        // Step 1: if either string is empty, the distance is the other's length.
        if (n == 0)
        {
            return m;
        }
        if (m == 0)
        {
            return n;
        }

        // Allocate the table only once we know it is needed.
        int[,] d = new int[n + 1, m + 1];

        // Step 2: fill the first column and the first row.
        for (int i = 0; i <= n; i++)
        {
            d[i, 0] = i;
        }
        for (int j = 0; j <= m; j++)
        {
            d[0, j] = j;
        }

        // Step 3: walk each character of s.
        for (int i = 1; i <= n; i++)
        {
            // Step 4: walk each character of t.
            for (int j = 1; j <= m; j++)
            {
                // Step 5: a substitution costs 0 when the characters match.
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6: take the cheapest of deletion, insertion and substitution.
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }

        // Step 7: the bottom-right cell holds the distance.
        return d[n, m];
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
        Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
        Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
    }
}
=== Output from the program ===
1
5
3
Description. The Levenshtein method is static. This Compute method doesn't need to store state or instance data, which means you can declare it as static. This can also help performance slightly by avoiding callvirt instructions. You can verify that the implementation above is the standard version of Levenshtein by comparing it with a textbook description of the algorithm.
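For reference when checking the code against a textbook, the recurrence that the table d in Compute fills in can be written as below. The subscript notation is added here and is not from the original article: s_i and t_j stand for the i-th and j-th characters counted from 1, which the code reads as s[i - 1] and t[j - 1], and c_{i,j} is the value the code calls cost.

```latex
\begin{aligned}
d_{i,0} &= i, \qquad d_{0,j} = j, \\
d_{i,j} &= \min\bigl(d_{i-1,j} + 1,\; d_{i,j-1} + 1,\; d_{i-1,j-1} + c_{i,j}\bigr), \\
c_{i,j} &= \begin{cases} 0 & \text{if } s_i = t_j, \\ 1 & \text{otherwise.} \end{cases}
\end{aligned}
```

The three terms inside the minimum correspond to a deletion, an insertion and a substitution, and the distance returned is d_{n,m}.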
Performance notes. I adapted the code above from another source and optimized it so that it is three times faster. However, there are faster variants of the Levenshtein algorithm for some scenarios. [Levenshtein distance - wikipedia.org]
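As one illustration of such a variant, the sketch below keeps only two rows of the table instead of the full two-dimensional array, since each cell depends only on the current and previous rows. This is not the author's code: the class name LevenshteinTwoRows is hypothetical, and whether it is actually faster for your inputs is something to measure rather than assume. It mainly reduces memory from (n + 1) * (m + 1) cells to 2 * (m + 1).

```csharp
using System;

// Hypothetical class name, used only for this sketch.
static class LevenshteinTwoRows
{
    // Returns the same distance as the full-table version,
    // but stores only the previous and current rows.
    public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        if (n == 0) return m;
        if (m == 0) return n;

        int[] previous = new int[m + 1];
        int[] current = new int[m + 1];

        // Row 0: turning an empty prefix of s into the first j characters of t.
        for (int j = 0; j <= m; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= n; i++)
        {
            // Column 0: turning the first i characters of s into an empty string.
            current[0] = i;

            for (int j = 1; j <= m; j++)
            {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }

            // Reuse the two arrays: the row just filled becomes the previous row.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        // After the final swap, the last completed row is in 'previous'.
        return previous[m];
    }
}
```

Calling LevenshteinTwoRows.Compute("kitten", "sitting") should give the same result as the full-table Compute method for the same pair.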
Static classes. This algorithm is stateless, which means it doesn't store instance data and therefore can be put in a static class. Static classes are easier to add to new projects than separate methods.
Usage
Here we see how you can call the method in your C# programs. You will often want to compare multiple strings with the Levenshtein algorithm. The example here shows how you can compare strings in a loop. We use a List of string[] arrays.
=== Program that calls Levenshtein in loop (C#) ===
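using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        // Each array holds one pair of words to compare.
        List<string[]> l = new List<string[]>
        {
            new string[] { "ant", "aunt" },
            new string[] { "Sam", "Samantha" },
            new string[] { "clozapine", "olanzapine" },
            new string[] { "flomax", "volmax" },
            new string[] { "toradol", "tramadol" },
            new string[] { "kitten", "sitting" }
        };

        foreach (string[] a in l)
        {
            // Uses the LevenshteinDistance class from the first program.
            int cost = LevenshteinDistance.Compute(a[0], a[1]);
            Console.WriteLine("{0} -> {1} = {2}", a[0], a[1], cost);
        }
    }
}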
=== Output of the program ===
ant -> aunt = 1
Sam -> Samantha = 5
clozapine -> olanzapine = 3
flomax -> volmax = 3
toradol -> tramadol = 3
kitten -> sitting = 3
More resources
Michael Gilleland has an excellent page about the Levenshtein distance and many implementations of it, and that resource is important if you need more detailed reference. [Levenshtein Distance - merriampark.com]
Performance mistake
I found the C# version linked from merriampark.com, and adapted that code for a large performance improvement by changing the first statement below into the second. The "before" version allocates a new one-character string for every character it compares. The "after" version compares the characters directly, with no string copies made, and takes 75% less time to run.
=== Slow version that uses Substring ===
// It makes new strings.
cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);
=== Fast version that uses chars ===
// Doesn't make new strings with Substring.
cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
Summary
Here we saw the famous Levenshtein Distance algorithm, adapted and optimized for the C# programming language. The author places the code here in the public domain, and encourages you to test it and improve it. This means you are free to use it anywhere you want. Use this code to implement approximate string matching. The brilliance of the algorithm belongs to Dr. Levenshtein, not the author of this article.
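As a rough illustration of approximate string matching built on the distance, the sketch below picks the candidate closest to an input word, which is the kind of lookup a suggestion search might do. The names ClosestMatch, FindIndex and maxDistance are hypothetical and not from the article, and the private Distance method is a condensed copy of the article's Compute method, repeated only so the block compiles on its own.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, used only for this sketch.
static class ClosestMatch
{
    // Condensed copy of the article's Compute method.
    static int Distance(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        if (n == 0) return m;
        if (m == 0) return n;

        int[,] d = new int[n + 1, m + 1];
        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int j = 0; j <= m; j++) d[0, j] = j;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[n, m];
    }

    // Returns the index of the candidate with the smallest distance to input,
    // or -1 when no candidate is within maxDistance edits.
    // On a tie, the earliest candidate wins.
    public static int FindIndex(string input, IList<string> candidates, int maxDistance)
    {
        int bestIndex = -1;
        int bestDistance = maxDistance + 1;

        for (int k = 0; k < candidates.Count; k++)
        {
            int distance = Distance(input, candidates[k]);
            if (distance < bestDistance)
            {
                bestIndex = k;
                bestDistance = distance;
            }
        }
        return bestIndex;
    }
}
```

With the candidates "toradol", "tramadol" and "flomax", the misspelled input "tramadl" and a maxDistance of 2 select index 1, because "tramadol" is one insertion away. The comparison is case-sensitive and checks every candidate, so a large word list would likely need a smarter search.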