KMP Algorithm

Source: Internet. Editor: 程序博客网. Time: 2024/06/01 07:40

Original article: Searching for Patterns | Set 2 (KMP Algorithm)

Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints every position at which pat[] occurs in txt[].

Examples:

Input:  txt[] = "THIS IS A TEST TEXT"
        pat[] = "TEST"
Output: Pattern found at index 10

Input:  txt[] = "AABAACAADAABAAABAA"
        pat[] = "AABA"
Output: Pattern found at index 0
        Pattern found at index 9
        Pattern found at index 13

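Pattern searching is an important problem in computer science. When we search for a string in Notepad or Word, in a browser, or in a database, a pattern searching algorithm produces the search results.

We discussed the naive pattern searching algorithm in an earlier chapter. Its worst-case time complexity is O(m(n-m+1)). The worst-case time complexity of the KMP search is O(n), plus O(m) to preprocess the pattern.

KMP (Knuth Morris Pratt) Pattern Searching

The naive pattern searching algorithm does not work well when many matching characters are followed by a mismatching character. Here are some examples.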

   txt[] = "AAAAAAAAAAAAAAAAAB"
   pat[] = "AAAAB"

   txt[] = "ABABABCABABABCABABABC"
   pat[] = "ABABAC" (not a worst case, but a bad case for Naive)
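
To make the cost of these bad cases concrete, here is a minimal sketch of the naive search with a comparison counter. The class name NaiveSearchDemo and the comparisons field are made up for this illustration and do not come from the original article.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original article.
class NaiveSearchDemo {
    // Number of character comparisons made by the last call to search().
    static long comparisons;

    // Returns every index at which pat occurs in txt, trying each window in turn.
    static List<Integer> search(String pat, String txt) {
        comparisons = 0;
        List<Integer> found = new ArrayList<>();
        int m = pat.length();
        int n = txt.length();
        for (int start = 0; start + m <= n; start++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (txt.charAt(start + j) != pat.charAt(j)) {
                    break;  // mismatch: give up on this window
                }
                j++;
            }
            if (j == m) {
                found.add(start);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(search("AABA", "AABAACAADAABAAABAA")); // prints [0, 9, 13]
        // Bad case from the text: almost every window matches four characters
        // before failing on the fifth.
        search("AAAAB", "AAAAAAAAAAAAAAAAAB");
        System.out.println("comparisons: " + comparisons);
    }
}
```

On the first bad case, nearly every window costs m comparisons, which is where the O(m(n-m+1)) bound comes from.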

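The KMP matching algorithm uses the degenerating property of the pattern (the same sub-patterns appear more than once in the pattern) and improves the worst-case complexity to O(n). The basic idea behind KMP is this: whenever we detect a mismatch (after some matches), we already know some of the characters in the text of the next window. We use this information to avoid re-matching characters that we know will match anyway. Consider the example below to understand this.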

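Matching Overview

txt = "AAAAABAAABA"
pat = "AAAA"

We first compare pat with the first window of the text:

txt = "AAAAABAAABA"
pat = "AAAA"  [initial position]

We find a match. This is the same as naive string matching.

In the next step, we compare pat with the next window of the text:

txt = "AAAAABAAABA"
pat =  "AAAA" [pattern shifted one position]

This is where KMP improves on the naive algorithm. In the second window, we only compare the fourth A of the pattern with the fourth character of the current window to decide whether the window matches. Since we already know that the first three characters match, we skip them.

Need for preprocessing?

The explanation above raises an important question: how do we know how many characters can be skipped? To answer it, we preprocess the pattern and prepare an integer array lps[] that tells us how many characters can be skipped.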

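Preprocessing Overview:

The KMP algorithm preprocesses pat[] and builds an auxiliary array lps[] of size m (the same as the size of the pattern), which is used to skip characters while matching.
The name lps stands for the longest proper prefix which is also a suffix. A proper prefix is a prefix that is not the whole string. For example, the prefixes of "ABC" are "", "A", "AB" and "ABC". The proper prefixes are "", "A" and "AB". The suffixes of the string are "", "C", "BC" and "ABC".
For each sub-pattern pat[0..i], where i runs from 0 to m-1, lps[i] stores the length of the longest proper prefix of pat[0..i] that is also a suffix of pat[0..i].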

lps[i] = the longest proper prefix of pat[0..i]
         which is also a suffix of pat[0..i].

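Note: lps[i] could also be defined as the longest prefix which is also a proper suffix. We need to use "proper" in at least one place to make sure that the whole substring is not considered.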

Examples of lps[] construction:
For the pattern "AAAA", lps[] is [0, 1, 2, 3]
For the pattern "ABCDE", lps[] is [0, 0, 0, 0, 0]
For the pattern "AABAACAABAA", lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
For the pattern "AAACAAAAAC", lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]
For the pattern "AAABAAA", lps[] is [0, 1, 2, 0, 1, 2, 3]
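
The examples above can be checked with a brute-force sketch that applies the definition of lps[] directly. This is only a slow reference version for illustration, not the linear-time preprocessing that KMP uses; the class name LpsByDefinition is invented here.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of the original article.
class LpsByDefinition {
    // For each i, try proper-prefix lengths of pat[0..i] from longest to
    // shortest and keep the first one that is also a suffix of pat[0..i].
    static int[] lps(String pat) {
        int m = pat.length();
        int[] lps = new int[m];
        for (int i = 0; i < m; i++) {
            String sub = pat.substring(0, i + 1);
            // len <= i guarantees the prefix is proper (shorter than sub).
            for (int len = i; len > 0; len--) {
                if (sub.endsWith(sub.substring(0, len))) {
                    lps[i] = len;
                    break;
                }
            }
        }
        return lps;
    }

    public static void main(String[] args) {
        // prints [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
        System.out.println(Arrays.toString(lps("AABAACAABAA")));
    }
}
```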

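Searching Algorithm:

The naive algorithm slides the pattern by one position and compares all characters at each shift. Here we instead use a value from lps[] to decide the next characters to be matched. The idea is to not match characters that we know will match anyway.

How do we use lps[] to decide the next position (or to know the number of characters to skip)?

We start comparing pat[j], with j = 0, against the characters of the current window of the text.
We keep matching txt[i] against pat[j] and keep incrementing i and j while pat[j] and txt[i] match.
When we see a mismatch:
– We know that the characters pat[0..j-1] match txt[i-j..i-1] (note that j starts at 0 and is incremented only when there is a match).
– We also know (from the definition above) that lps[j-1] is the number of characters of pat[0..j-1] that form both a proper prefix and a suffix.
– From the two points above we conclude that the first lps[j-1] characters of pat do not need to be matched against the end of txt[i-j..i-1], because we know these characters will match anyway. Consider the example above again to understand this.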

txt[] = "AAAAABAAABA"
pat[] = "AAAA"
lps[] = {0, 1, 2, 3}

i = 0, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 1, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 2, j = 2
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 3, j = 3
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 4, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3

Here, unlike the Naive algorithm, we do not match the first three
characters of this window. The value of lps[j-1] (in the step above)
gave us the index of the next character to match.

i = 4, j = 3
txt[] = "AAAAABAAABA"
pat[] =  "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 5, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3

Again, unlike the Naive algorithm, we do not match the first three
characters of this window. The value of lps[j-1] (in the step above)
gave us the index of the next character to match.

i = 5, j = 3
txt[] = "AAAAABAAABA"
pat[] =   "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[2] = 2

i = 5, j = 2
txt[] = "AAAAABAAABA"
pat[] =    "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[1] = 1

i = 5, j = 1
txt[] = "AAAAABAAABA"
pat[] =     "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[0] = 0

i = 5, j = 0
txt[] = "AAAAABAAABA"
pat[] =      "AAAA"
txt[i] and pat[j] do NOT match and j is 0, we do i++.

i = 6, j = 0
txt[] = "AAAAABAAABA"
pat[] =       "AAAA"
txt[i] and pat[j] match, do i++ and j++

i = 7, j = 1
txt[] = "AAAAABAAABA"
pat[] =       "AAAA"
txt[i] and pat[j] match, do i++ and j++

We continue this way...

// Java program for implementation of KMP pattern
// searching algorithm
class KMP_String_Matching
{
    void KMPSearch(String pat, String txt)
    {
        int M = pat.length();
        int N = txt.length();

        // create lps[] that will hold the longest
        // prefix suffix values for pattern
        int lps[] = new int[M];
        int j = 0;  // index for pat[]

        // Preprocess the pattern (calculate lps[]
        // array)
        computeLPSArray(pat, M, lps);

        int i = 0;  // index for txt[]
        while (i < N)
        {
            if (pat.charAt(j) == txt.charAt(i))
            {
                j++;
                i++;
            }
            if (j == M)
            {
                System.out.println("Found pattern " +
                              "at index " + (i - j));
                j = lps[j - 1];
            }

            // mismatch after j matches
            else if (i < N && pat.charAt(j) != txt.charAt(i))
            {
                // Do not match lps[0..lps[j-1]] characters,
                // they will match anyway
                if (j != 0)
                    j = lps[j - 1];
                else
                    i = i + 1;
            }
        }
    }

    void computeLPSArray(String pat, int M, int lps[])
    {
        // length of the previous longest prefix suffix
        int len = 0;
        int i = 1;
        lps[0] = 0;  // lps[0] is always 0

        // the loop calculates lps[i] for i = 1 to M-1
        while (i < M)
        {
            if (pat.charAt(i) == pat.charAt(len))
            {
                len++;
                lps[i] = len;
                i++;
            }
            else  // (pat[i] != pat[len])
            {
                // This is tricky. Consider the example.
                // AAACAAAA and i = 7. The idea is similar
                // to search step.
                if (len != 0)
                {
                    len = lps[len - 1];

                    // Also, note that we do not increment
                    // i here
                }
                else  // if (len == 0)
                {
                    lps[i] = len;
                    i++;
                }
            }
        }
    }

    // Driver program to test above function
    public static void main(String args[])
    {
        String txt = "ABABDABACDABABCABAB";
        String pat = "ABABCABAB";
        new KMP_String_Matching().KMPSearch(pat, txt);
    }
}
// This code has been contributed by Amit Khandelwal.

Output:

Found pattern at index 10
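
The program above prints each match. For checking the examples at the top of the article, it can be handier to collect the indices in a list. The following is a minimal sketch of such a variant, not the original author's code: the class name KmpIndexList and the choice to return an empty list for an empty pattern are assumptions made here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical variant of the search above that returns the match indices.
class KmpIndexList {
    // Same preprocessing as computeLPSArray, returning the array.
    static int[] computeLps(String pat) {
        int m = pat.length();
        int[] lps = new int[m];
        int len = 0;
        int i = 1;
        while (i < m) {
            if (pat.charAt(i) == pat.charAt(len)) {
                lps[i++] = ++len;
            } else if (len != 0) {
                len = lps[len - 1];  // fall back, i is unchanged
            } else {
                lps[i++] = 0;
            }
        }
        return lps;
    }

    static List<Integer> search(String pat, String txt) {
        List<Integer> found = new ArrayList<>();
        int m = pat.length();
        int n = txt.length();
        if (m == 0 || m > n) {
            return found;  // assumption: an empty pattern yields no matches
        }
        int[] lps = computeLps(pat);
        int i = 0;  // index into txt, never moves backwards
        int j = 0;  // index into pat
        while (i < n) {
            if (pat.charAt(j) == txt.charAt(i)) {
                i++;
                j++;
                if (j == m) {
                    found.add(i - j);
                    j = lps[j - 1];  // keep going to find overlapping matches
                }
            } else if (j != 0) {
                j = lps[j - 1];
            } else {
                i++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(search("TEST", "THIS IS A TEST TEXT"));  // prints [10]
        System.out.println(search("AABA", "AABAACAADAABAAABAA"));   // prints [0, 9, 13]
        System.out.println(search("AAAA", "AAAAABAAABA"));          // prints [0, 1]
    }
}
```

Because i never decreases, the text is scanned once from left to right, which is the source of the O(n) bound on the search phase.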

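Preprocessing Algorithm:

In the preprocessing part, we calculate the values in lps[]. To do that, we keep track of the length of the longest prefix suffix for the previous index (we use the variable len for this). We initialize lps[0] and len to 0. If pat[len] and pat[i] match, we increment len by 1 and assign the incremented value to lps[i]. If pat[i] and pat[len] do not match and len is not 0, we update len to lps[len-1]. If they do not match and len is 0, we set lps[i] to 0 and move on to the next i. See computeLPSArray() in the code above for details.

Illustration of preprocessing (construction of lps[])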

pat[] = "AAACAAAA"

len = 0, i = 0.
lps[0] is always 0, we move to i = 1

len = 0, i = 1.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 1, lps[1] = 1, i = 2

len = 1, i = 2.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 2, lps[2] = 2, i = 3

len = 2, i = 3.
Since pat[len] and pat[i] do not match, and len > 0,
set len = lps[len-1] = lps[1] = 1

len = 1, i = 3.
Since pat[len] and pat[i] do not match and len > 0,
len = lps[len-1] = lps[0] = 0

len = 0, i = 3.
Since pat[len] and pat[i] do not match and len = 0,
set lps[3] = 0 and i = 4.

len = 0, i = 4.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 1, lps[4] = 1, i = 5

len = 1, i = 5.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 2, lps[5] = 2, i = 6

len = 2, i = 6.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 3, lps[6] = 3, i = 7

len = 3, i = 7.
Since pat[len] and pat[i] do not match and len > 0,
set len = lps[len-1] = lps[2] = 2

len = 2, i = 7.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 3, lps[7] = 3, i = 8

We stop here as we have constructed the whole lps[].
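
The walkthrough above can be reproduced with a small tracing sketch of the preprocessing loop. The class name LpsTrace, the trace flag and the wording of the printed messages are invented for this illustration; the loop itself follows the steps described above.

```java
import java.util.Arrays;

// Hypothetical tracing version of the preprocessing step.
class LpsTrace {
    static int[] computeLps(String pat, boolean trace) {
        int m = pat.length();
        int[] lps = new int[m];  // lps[0] stays 0
        int len = 0;             // length of the previous longest prefix suffix
        int i = 1;
        while (i < m) {
            if (pat.charAt(i) == pat.charAt(len)) {
                len++;
                lps[i] = len;
                if (trace) System.out.println("match:    lps[" + i + "] = " + len);
                i++;
            } else if (len != 0) {
                if (trace) System.out.println("fallback: len " + len + " -> "
                        + lps[len - 1] + " (i stays " + i + ")");
                len = lps[len - 1];
            } else {
                lps[i] = 0;
                if (trace) System.out.println("no match: lps[" + i + "] = 0");
                i++;
            }
        }
        return lps;
    }

    public static void main(String[] args) {
        int[] lps = computeLps("AAACAAAA", true);
        System.out.println(Arrays.toString(lps));  // final line: [0, 1, 2, 0, 1, 2, 3, 3]
    }
}
```

The fallback lines correspond to the steps in the walkthrough where len is replaced by lps[len-1] while i stays put.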

0 0