中文字符串模糊匹配算法|C# Levenshtein Distance

来源:互联网 发布:52弹弓淘宝店 编辑:程序博客网 时间:2024/06/05 22:41

中文字符串模糊匹配算法|C# Levenshtein Distance

2010-01-06 09:08:09  

C# Levenshtein Distance
by Sam Allen - Updated November 27, 2009
You want to match approximate strings with fuzzy logic, using the Levenshtein distance algorithm. Many projects need this logic, including programs that manage prescription drugs, spell-checkers, suggestion searches and plagiarism detectors. Here we see a simple but complete implementation of this algorithm using the C# programming language.

Words:                ant, aunt
Levenshtein distance: 1
Note:                 Only 1 edit is needed.
                      The 'u' must be added at index 2.

Words:                Samantha, Sam
Levenshtein distance: 5
Note:                 The final 5 letters must be removed.

Words:                Flomax, Volmax
Levenshtein distance: 3
Note:                 The first 3 letters must be changed
                      Drug names are commonly confused.Levenshtein algorithm
First, credit goes to Vladimir Levenshtein, a Russian scientist. Here we see the C# code I adapted and optimized. It uses a two-dimensional array instead of a jagged array because the space required will only have one width and one height.

=== Program that implements the algorithm (C#) ===

using System;

/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
    /// <summary>
    /// Compute the distance between two strings.
    /// </summary>
    public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // Step 1
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
        Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
        Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
    }
}

=== Output from the program ===

1
5
3Description. The Levenshtein method is static. This Compute method doesn't need to store state or instance data, which means you can declare it as static. This can also improve performance, avoiding callvirt instructions. You can easily verify that the above implementation is the standard version of Levenshtein by looking at one of the textbooks you were supposed to read.

Performance notes. The code I show above was adapted by me from another source, and optimized so that it is three times faster. However, there are faster variants of Levenshtein algorithms for some scenarios. [Levenshtein distance - wikipedia.org]

Static classes. This algorithm is stateless, which means it doesn't store instance data and therefore can be put in a static class. Static classes are easier to add to new projects than separate methods.

Usage
Here we see how you can call the method in your C# programs. You will often want to compare multiple strings with the Levenshtein algorithm. The example here shows how you can compare strings in a loop. We use a List of string[] arrays.

=== Program that calls Levenshtein in loop (C#) ===

static void Main()
{
    List<string[]> l = new List<string[]>
    {
        new string[]{"ant", "aunt"},
        new string[]{"Sam", "Samantha"},
        new string[]{"clozapine", "olanzapine"},
        new string[]{"flomax", "volmax"},
        new string[]{"toradol", "tramadol"},
        new string[]{"kitten", "sitting"}
    };

    foreach (string[] a in l)
    {
        int cost = Compute(a[0], a[1]);
        Console.WriteLine("{0} -> {1} = {2}",
            a[0],
            a[1],
            cost);
    }
}

=== Output of the program ===

ant -> aunt = 1
Sam -> Samantha = 5
clozapine -> olanzapine = 3
flomax -> volmax = 3
toradol -> tramadol = 3
kitten -> sitting = 3More resources
Michael Gilleland has an excellent page about the Levenshtein distance and many implementations of it, and that resource is important if you need more detailed reference. [Levenshtein Distance - merriampark.com]

Performance mistake
I found the C# version linked from merriampark.com, but I adapted that code for some big performance improvements. I changed the first statement into the second statement. The before version makes a new string copy for each single character. The after version examines characters directly, with no copy strings made, taking 75% less time to run.

=== Slow version that uses Substring ===

// It makes new strings.
cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);

=== Fast version that uses chars ===

// Doesn't make new strings with Substring.
cost = (t[j - 1] == s[i - 1]) ? 0 : 1;Summary
Here we saw the famous Levenshtein Distance algorithm, adapted and optimized for the C# programming language. The author places the code here in the public domain, and encourages you to test it and improve it. This means you are free to use it anywhere you want. Use this code to implement approximate string matching. The brilliance of the algorithm is from Dr. Levenshtein, not the author of this article. [Page protected by Copyscape; do not copy.]