The KMP Algorithm
Source: Internet  Editor: 程序博客网  Time: 2024/06/01 07:40
Original article: Searching for Patterns | Set 2 (KMP Algorithm)
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints every position at which pat[] occurs in txt[].
Examples:
Input:  txt[] = "THIS IS A TEST TEXT"
        pat[] = "TEST"
Output: Pattern found at index 10

Input:  txt[] = "AABAACAADAABAAABAA"
        pat[] = "AABA"
Output: Pattern found at index 0
        Pattern found at index 9
        Pattern found at index 13
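Pattern searching is an important problem in computer science. When we search for a string in Notepad, Word, a browser, or a database, a pattern searching algorithm is what produces the results.
We discussed the Naive pattern searching algorithm in an earlier post. Its worst-case time complexity is O(m(n-m+1)). The worst-case time complexity of the KMP search is O(n), plus O(m) to preprocess the pattern.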
KMP (Knuth Morris Pratt) Pattern Searching
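The Naive pattern searching algorithm does not work well when many matching characters are followed by a mismatching character. Here are some examples.

txt[] = "AAAAAAAAAAAAAAAAAB"
pat[] = "AAAAB"

txt[] = "ABABABCABABABCABABABC"
pat[] = "ABABAC" (not a worst case, but a bad case for Naive)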
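To make the cost of these inputs concrete, here is a minimal sketch of the Naive algorithm that also counts character comparisons. The class name NaiveSearchSketch, the method naiveSearch, and the comparisons counter are illustrative names chosen here, not part of the original article.

```java
import java.util.ArrayList;
import java.util.List;

class NaiveSearchSketch {
    // Number of character comparisons made by the last call (for illustration only).
    static long comparisons = 0;

    // Returns every index at which pat occurs in txt, re-checking each window from scratch.
    static List<Integer> naiveSearch(String pat, String txt) {
        comparisons = 0;
        List<Integer> found = new ArrayList<>();
        int m = pat.length();
        int n = txt.length();
        for (int start = 0; start + m <= n; start++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (txt.charAt(start + j) != pat.charAt(j)) {
                    break; // mismatch: give up on this window
                }
                j++;
            }
            if (j == m) {
                found.add(start);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String txt = "AAAAAAAAAAAAAAAAAB";
        String pat = "AAAAB";
        System.out.println("found at: " + naiveSearch(pat, txt));
        System.out.println("comparisons: " + comparisons);
    }
}
```

On the first input above, every window needs all 5 comparisons before it fails or succeeds, so the total is exactly m(n-m+1), which is the worst-case bound quoted earlier.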
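The KMP matching algorithm uses the degenerating property of the pattern (the pattern contains sub-patterns that appear more than once in it) and improves the worst-case complexity to O(n). The basic idea behind KMP is this: whenever we detect a mismatch (after some matches), we already know some of the characters in the text of the next window. We take advantage of this information to avoid re-comparing characters that we know will match anyway. Consider the following example.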
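Matching Overview

txt = "AAAAABAAABA"
pat = "AAAA"

We compare the first window of txt with pat:

txt = "AAAAABAAABA"
pat = "AAAA"   [initial position]

We find a match. This is the same as in Naive string matching.

In the next step, we compare the next window of txt with pat:

txt = "AAAAABAAABA"
pat =  "AAAA"  [pattern shifted one position]

This is where KMP improves on the Naive algorithm. In this second window, we only compare the fourth A of the pattern with the fourth character of the current text window to decide whether the window matches. Since we know the first three characters will match anyway, we skip them.

Need of preprocessing?

The explanation above raises an important question: how do we know how many characters can be skipped? To answer it, we preprocess the pattern and prepare an integer array lps[] that tells us the number of characters to skip.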
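Preprocessing Overview:
- The KMP algorithm preprocesses pat[] and constructs an auxiliary array lps[] of size m (the same as the size of the pattern), which is used to skip characters while matching.
- The name lps stands for the longest proper prefix which is also a suffix. A proper prefix is a prefix that is not the whole string. For example, the prefixes of "ABC" are "", "A", "AB" and "ABC". The proper prefixes are "", "A" and "AB". The suffixes of the string are "", "C", "BC" and "ABC".
- For each sub-pattern pat[0..i], where i runs from 0 to m-1, lps[i] stores the length of the longest proper prefix of pat[0..i] that is also a suffix of pat[0..i].
lps[i] = the length of the longest proper prefix of pat[0..i] which is also a suffix of pat[0..i].
Note: lps[i] could equally be defined as the longest prefix which is also a proper suffix. The word "proper" has to appear in at least one place to make sure the whole substring is not counted.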
Examples of lps[] construction:
For the pattern "AAAA", lps[] is [0, 1, 2, 3]
For the pattern "ABCDE", lps[] is [0, 0, 0, 0, 0]
For the pattern "AABAACAABAA", lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
For the pattern "AAACAAAAAC", lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]
For the pattern "AAABAAA", lps[] is [0, 1, 2, 0, 1, 2, 3]
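The examples above can be checked directly against the definition. The following is a minimal sketch that computes lps[] by brute force, trying every proper-prefix length for each sub-pattern. It is deliberately slow and is not the linear-time preprocessing that KMP actually uses; the names LpsByDefinition and lpsBruteForce are made up for this illustration.

```java
import java.util.Arrays;

class LpsByDefinition {
    // Computes lps[] straight from the definition: for each sub-pattern pat[0..i],
    // try candidate lengths from longest to shortest and keep the first that works.
    static int[] lpsBruteForce(String pat) {
        int m = pat.length();
        int[] lps = new int[m];
        for (int i = 0; i < m; i++) {
            String sub = pat.substring(0, i + 1);
            // len <= i guarantees the prefix is proper (shorter than sub itself).
            for (int len = i; len > 0; len--) {
                String suffix = sub.substring(sub.length() - len);
                if (sub.startsWith(suffix)) {
                    lps[i] = len;
                    break;
                }
            }
        }
        return lps;
    }

    public static void main(String[] args) {
        String[] patterns = {"AAAA", "ABCDE", "AABAACAABAA", "AAACAAAAAC", "AAABAAA"};
        for (String pat : patterns) {
            System.out.println(pat + " -> " + Arrays.toString(lpsBruteForce(pat)));
        }
    }
}
```

Running main prints one lps[] array per pattern, which can be compared with the values listed above.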
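Searching Algorithm:
Unlike the Naive algorithm, which slides the pattern by one and compares all characters at every shift, we use a value from lps[] to decide which characters to match next. The idea is to never re-match a character that we know will match anyway.
How do we use lps[] to decide the next position (that is, the number of characters to skip)?
- We start comparing pat[j], with j = 0, against the characters of the current window of the text.
- We keep matching txt[i] against pat[j], incrementing i and j for as long as they match.
- When we see a mismatch:
  - We know that the characters pat[0..j-1] match txt[i-j..i-1] (note that j starts at 0 and is incremented only on a match).
  - We also know (from the definition above) that lps[j-1] is the number of characters of pat[0..j-1] that form both a proper prefix and a suffix.
  - From these two points we can conclude that we do not need to compare the first lps[j-1] characters of the pattern with the end of txt[i-j..i-1], because we know they will match anyway. Let us go back to the example above to see this.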
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
lps[] = {0, 1, 2, 3}

i = 0, j = 0
txt[i] and pat[j] match, do i++, j++

i = 1, j = 1
txt[i] and pat[j] match, do i++, j++

i = 2, j = 2
txt[i] and pat[j] match, do i++, j++

i = 3, j = 3
txt[i] and pat[j] match, do i++, j++

i = 4, j = 4
Since j == M, print pattern found and reset j:
j = lps[j-1] = lps[3] = 3
Here, unlike the Naive algorithm, we do not match the first three characters of this window. The value of lps[j-1] (in the step above) gave us the index of the next character to match.

i = 4, j = 3
txt[i] and pat[j] match, do i++, j++

i = 5, j = 4
Since j == M, print pattern found and reset j:
j = lps[j-1] = lps[3] = 3
Again, unlike the Naive algorithm, we do not match the first three characters of this window. The value of lps[j-1] (in the step above) gave us the index of the next character to match.

i = 5, j = 3
txt[i] and pat[j] do NOT match and j > 0, change only j:
j = lps[j-1] = lps[2] = 2

i = 5, j = 2
txt[i] and pat[j] do NOT match and j > 0, change only j:
j = lps[j-1] = lps[1] = 1

i = 5, j = 1
txt[i] and pat[j] do NOT match and j > 0, change only j:
j = lps[j-1] = lps[0] = 0

i = 5, j = 0
txt[i] and pat[j] do NOT match and j is 0, so we do i++.

i = 6, j = 0
txt[i] and pat[j] match, do i++ and j++

i = 7, j = 1
txt[i] and pat[j] match, do i++ and j++

We continue this way...
// JAVA program for implementation of KMP pattern
// searching algorithm
class KMP_String_Matching
{
    void KMPSearch(String pat, String txt)
    {
        int M = pat.length();
        int N = txt.length();

        // create lps[] that will hold the longest
        // prefix suffix values for pattern
        int lps[] = new int[M];
        int j = 0;  // index for pat[]

        // Preprocess the pattern (calculate lps[]
        // array)
        computeLPSArray(pat, M, lps);

        int i = 0;  // index for txt[]
        while (i < N)
        {
            if (pat.charAt(j) == txt.charAt(i))
            {
                j++;
                i++;
            }
            if (j == M)
            {
                System.out.println("Found pattern " +
                                   "at index " + (i - j));
                j = lps[j - 1];
            }

            // mismatch after j matches
            else if (i < N && pat.charAt(j) != txt.charAt(i))
            {
                // Do not match lps[0..lps[j-1]] characters,
                // they will match anyway
                if (j != 0)
                    j = lps[j - 1];
                else
                    i = i + 1;
            }
        }
    }

    void computeLPSArray(String pat, int M, int lps[])
    {
        // length of the previous longest prefix suffix
        int len = 0;
        int i = 1;
        lps[0] = 0;  // lps[0] is always 0

        // the loop calculates lps[i] for i = 1 to M-1
        while (i < M)
        {
            if (pat.charAt(i) == pat.charAt(len))
            {
                len++;
                lps[i] = len;
                i++;
            }
            else  // (pat[i] != pat[len])
            {
                // This is tricky. Consider the example.
                // AAACAAAA and i = 7. The idea is similar
                // to search step.
                if (len != 0)
                {
                    len = lps[len - 1];

                    // Also, note that we do not increment
                    // i here
                }
                else  // if (len == 0)
                {
                    lps[i] = len;
                    i++;
                }
            }
        }
    }

    // Driver program to test above function
    public static void main(String args[])
    {
        String txt = "ABABDABACDABABCABAB";
        String pat = "ABABCABAB";
        new KMP_String_Matching().KMPSearch(pat, txt);
    }
}
// This code has been contributed by Amit Khandelwal.
Output:
Found pattern at index 10
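As a complement to the printing version, here is a hedged sketch of a variant that returns the match positions as a list and counts character comparisons, which makes the linear bound easy to observe. The names KmpIndexSketch, search, computeLps and the comparisons counter are assumptions made for this illustration; the empty-pattern guard is also an added assumption, since the article does not say how that case should behave.

```java
import java.util.ArrayList;
import java.util.List;

class KmpIndexSketch {
    // Text-versus-pattern comparisons made by the last search (illustration only).
    static long comparisons = 0;

    // Linear-time construction of lps[], same idea as the preprocessing step.
    static int[] computeLps(String pat) {
        int[] lps = new int[pat.length()];
        int len = 0; // length of the previous longest prefix suffix
        int i = 1;
        while (i < pat.length()) {
            if (pat.charAt(i) == pat.charAt(len)) {
                lps[i++] = ++len;
            } else if (len != 0) {
                len = lps[len - 1]; // fall back, do not advance i
            } else {
                lps[i++] = 0;
            }
        }
        return lps;
    }

    // Returns all indices at which pat occurs in txt.
    static List<Integer> search(String pat, String txt) {
        comparisons = 0;
        List<Integer> found = new ArrayList<>();
        int m = pat.length();
        int n = txt.length();
        if (m == 0 || m > n) {
            return found; // assumption: an empty pattern yields no matches
        }
        int[] lps = computeLps(pat);
        int i = 0; // index into txt
        int j = 0; // index into pat
        while (i < n) {
            comparisons++;
            if (pat.charAt(j) == txt.charAt(i)) {
                i++;
                j++;
                if (j == m) {
                    found.add(i - j);
                    j = lps[j - 1]; // keep going to find overlapping matches
                }
            } else if (j != 0) {
                j = lps[j - 1]; // reuse the part already known to match
            } else {
                i++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(search("ABABCABAB", "ABABDABACDABABCABAB"));
        System.out.println("comparisons: " + comparisons);
    }
}
```

Each loop iteration either advances i or strictly decreases j, and j can only grow when i advances, so the counter never exceeds 2n for a text of length n.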
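Preprocessing Algorithm:
In the preprocessing part, we calculate the values in lps[]. To do that, we keep track of the length of the longest prefix suffix for the previous index (we use the variable len for this). We initialize lps[0] and len to 0. If pat[len] and pat[i] match, we increment len by 1 and assign the incremented value to lps[i]. If pat[i] and pat[len] do not match and len is not 0, we update len to lps[len-1] without advancing i. If they do not match and len is 0, we set lps[i] to 0 and move on to the next i. See computeLPSArray() in the code above for details.
Illustration of preprocessing (construction of lps[])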
pat[] = "AAACAAAA"

len = 0, i = 0.
lps[0] is always 0, we move to i = 1

len = 0, i = 1.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 1, lps[1] = 1, i = 2

len = 1, i = 2.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 2, lps[2] = 2, i = 3

len = 2, i = 3.
Since pat[len] and pat[i] do not match, and len > 0, set len = lps[len-1] = lps[1] = 1

len = 1, i = 3.
Since pat[len] and pat[i] do not match and len > 0, len = lps[len-1] = lps[0] = 0

len = 0, i = 3.
Since pat[len] and pat[i] do not match and len = 0, set lps[3] = 0 and i = 4.

len = 0, i = 4.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 1, lps[4] = 1, i = 5

len = 1, i = 5.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 2, lps[5] = 2, i = 6

len = 2, i = 6.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 3, lps[6] = 3, i = 7

len = 3, i = 7.
Since pat[len] and pat[i] do not match and len > 0, set len = lps[len-1] = lps[2] = 2

len = 2, i = 7.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 3, lps[7] = 3, i = 8

We stop here as we have constructed the whole lps[].