The KMP String Matching Algorithm

Source: Internet | Published by: 菜瓜软件 | Editor: 程序博客网 | Time: 2024/06/05 04:29

1. The basic idea of the KMP algorithm

Problem: find the string ABABACA in the string ABABABACA and return the position of its first occurrence.
Let us walk through the matching process.

ABABABACA
ABABACA
     | mismatch here

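With the naive string matching algorithm, the pattern is shifted right by one position after a mismatch, and matching restarts from the pattern's first character: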

ABABABACA
 ABABACA
 | shifted right by one, matching restarts from the first character; it clearly does not match, so this shift is wasted

ABABABACA
  ABABACA
  | shifted right by one again, matching restarts from the first character and runs to the end of the pattern; the match succeeds
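The naive procedure traced above can be sketched as follows. This is only an illustration: the class name NaiveSearch and the method name indexOfNaive are made up for this example and do not come from the original article.

```java
final class NaiveSearch {
    // Hypothetical helper: returns the index of the first occurrence of
    // pattern in text, or -1 if there is none.
    public static int indexOfNaive(String text, String pattern) {
        // m is the current alignment of the pattern against the text.
        for (int m = 0; m + pattern.length() <= text.length(); m++) {
            int i = 0;
            // After every shift, comparison restarts at pattern index 0.
            while (i < pattern.length() && text.charAt(m + i) == pattern.charAt(i)) {
                i++;
            }
            if (i == pattern.length()) {
                return m;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOfNaive("ABABABACA", "ABABACA")); // prints 2
    }
}
```

The alignments m = 0 and m = 1 both fail before m = 2 succeeds, which is exactly the sequence shown in the diagrams.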

Can we avoid both the wasted shifts and the restart from the beginning of the pattern? That is exactly what the KMP algorithm does.

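ABABABACA
ABABACA
     | mismatch here; call this position pos
ABABABACA
  ABABACA
     | shift by 2 directly
     | the 3 characters ABA just before pos are already known to match, so matching resumes at pos instead of at the start of the pattern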

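Question: how are ABABA and ABA related, and how do we know that we can shift by 2 directly?
ABA is the longest string that is both a prefix and a suffix of ABABA.
The prefixes of ABABA (excluding the last character) are A, AB, ABA, ABAB.
The suffixes of ABABA (excluding the first character) are A, BA, ABA, BABA.
So the longest string common to the prefixes and suffixes of ABABA is ABA, with length 3.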
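The enumeration above can be reproduced with a small brute-force sketch. The names BorderDemo and longestCommonPrefixSuffix are invented for this example, and the approach is deliberately simple rather than efficient.

```java
import java.util.HashSet;
import java.util.Set;

final class BorderDemo {
    // Hypothetical helper: length of the longest string that is both a
    // prefix (excluding the whole string) and a suffix (excluding the
    // whole string) of s.
    public static int longestCommonPrefixSuffix(String s) {
        Set<String> prefixes = new HashSet<>();
        for (int len = 1; len < s.length(); len++) {
            prefixes.add(s.substring(0, len));
        }
        int best = 0;
        for (int len = 1; len < s.length(); len++) {
            String suffix = s.substring(s.length() - len);
            // Equal strings have equal length, so a hit means the suffix
            // matches the prefix of the same length.
            if (prefixes.contains(suffix)) {
                best = len;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestCommonPrefixSuffix("ABABA")); // prints 3
    }
}
```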

shift = number of matched characters - corresponding partial match value

In the example above, 5 characters are matched and the partial match value is 3, so the shift is 2.

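If the partial match value is computed for every position in advance, the shift can be read off directly and wasted shifts are avoided. The collection of these values is called the partial match table (Partial Match Table).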
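As a rough illustration of the formula, the sketch below computes the shift for every possible number of matched characters of the pattern ABABACA, using a brute-force partial match value. The names ShiftTable, partialMatch and shifts are assumptions made for this example only.

```java
import java.util.Arrays;

final class ShiftTable {
    // Hypothetical helper: brute-force partial match value of s, i.e. the
    // length of its longest common prefix/suffix (whole string excluded).
    public static int partialMatch(String s) {
        for (int len = s.length() - 1; len > 0; len--) {
            if (s.startsWith(s.substring(s.length() - len))) {
                return len;
            }
        }
        return 0;
    }

    // shift[k] = k - partialMatch(first k characters of the pattern).
    // Index 0 is left unused: a mismatch on the very first character
    // always moves the pattern by one position.
    public static int[] shifts(String pattern) {
        int[] shift = new int[pattern.length()];
        for (int k = 1; k < pattern.length(); k++) {
            shift[k] = k - partialMatch(pattern.substring(0, k));
        }
        return shift;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(shifts("ABABACA"))); // prints [0, 1, 2, 2, 2, 2, 6]
    }
}
```

The entry for 5 matched characters is 2, matching the example worked out above.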

2. How to compute the partial match table (the next array)

The first two entries of the next array are -1 and 0.

 A  B  A  B  A  C  A
-1  0

next[pos] is computed from the value of next[pos - 1].

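1. The character at pos - 1 equals the character at index cnd, where cnd = next[pos - 1]
In the accompanying figure, the light-blue blocks are the longest common prefix/suffix of the substring P[0..pos - 2], and the two dark-blue characters that follow them are equal. The longest common prefix/suffix of the substring P[0..pos - 1] therefore has length next[pos - 1] + 1, that is, cnd + 1.
[Figure omitted]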

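2. The character at pos - 1 differs from the character at index cnd, where cnd = next[pos - 1]
In the accompanying figure, the green blocks are the longest common prefix/suffix of the substring P[0..cnd - 1]. That green string is necessarily also a common prefix/suffix of P[0..pos - 2], although not the longest one. If the character at index next[cnd] equals the character at pos - 1, then next[pos] = next[cnd] + 1. If not, repeat the step with cnd replaced by next[cnd]. If cnd reaches 0 and the characters still differ, next[pos] = 0.
[Figure omitted]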
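The two cases can be condensed into a single step that derives next[pos] from the entries already known. The sketch below is only an illustration under the assumption that next[0..pos - 1] is already filled in and pos >= 2; the names NextStep and nextAt are invented for this example.

```java
final class NextStep {
    // Hypothetical helper: computes next[pos] from next[0..pos - 1].
    // Assumes pos >= 2 and that the earlier entries are correct.
    public static int nextAt(String p, int[] next, int pos) {
        int cnd = next[pos - 1];
        // Case 2: on a mismatch, fall back to the next shorter common
        // prefix/suffix until the characters agree or cnd reaches 0.
        while (cnd > 0 && p.charAt(pos - 1) != p.charAt(cnd)) {
            cnd = next[cnd];
        }
        // Case 1: the characters agree, so the common prefix/suffix grows by one.
        return p.charAt(pos - 1) == p.charAt(cnd) ? cnd + 1 : 0;
    }

    public static void main(String[] args) {
        String p = "ABABACA";
        int[] next = {-1, 0, 0, 1, 2, 3, 0};
        System.out.println(nextAt(p, next, 5)); // prints 3 (case 1 applies immediately)
        System.out.println(nextAt(p, next, 6)); // prints 0 (case 2: cnd falls back 3, 1, 0)
    }
}
```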

The implementation:

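private int[] getNext(String p) {
    if (p.length() == 1)
        return new int[] {-1};
    int[] next = new int[p.length()];
    next[0] = -1;
    next[1] = 0;
    int pos = 2; // index of the entry currently being computed, starting at 2
    int cnd = 0; // index of the character right after the longest common prefix/suffix found so far
    while (pos < p.length()) {
        if (p.charAt(pos - 1) == p.charAt(cnd)) {
            // case 1: the common prefix/suffix grows by one character
            next[pos++] = ++cnd;
        } else if (cnd > 0) {
            // case 2: fall back to the next shorter common prefix/suffix
            cnd = next[cnd];
        } else {
            // no common prefix/suffix is left
            next[pos++] = 0;
        }
    }
    return next;
}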

3. Optimizing the next array

The implementation so far is not ideal. Take the pattern ABABACA again, but this time search in the string ABABCABABACA, and look at how the matching proceeds.

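ABABCABABACA
ABABACA
    | mismatch here; according to the partial match table next[4] = 2 (the longest common prefix/suffix is AB), so shift right by 2
ABABCABABACA
  ABABACA
    | mismatch again. Note that the previous comparison was C against A, and this one is also C against A, so this shift is wasted as well. This is what needs to be optimized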

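The fix: when computing next[pos], check whether the current character p[pos] equals p[cnd], the character right after the common prefix. If they are equal, falling back to cnd would only repeat the comparison that has just failed, so set next[pos] = next[cnd] instead.
Result of the optimization: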

Original next array
 A  B  A  B  A  C  A
-1  0  0  1  2  3  0

Improved next array
 A  B  A  B  A  C  A
-1  0  0  0  0  3  0
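One way to see how the improved row follows from the original one is to apply the rule as a post-processing pass over the original table. This is only a sketch for illustration; the names ImproveNext and improve are invented here, and the article's own code computes the improved table in a single pass instead.

```java
import java.util.Arrays;

final class ImproveNext {
    // Hypothetical helper: derives the improved table from the original
    // next array of pattern p.
    public static int[] improve(String p, int[] next) {
        int[] improved = next.clone();
        for (int pos = 2; pos < p.length(); pos++) {
            int cnd = next[pos];
            // If the fallback position holds the same character as pos, the
            // retry is certain to fail, so skip straight to its own fallback.
            // The cnd > 0 guard mirrors the table above, which leaves entries
            // that fall back to index 0 unchanged.
            if (cnd > 0 && p.charAt(pos) == p.charAt(cnd)) {
                improved[pos] = improved[cnd];
            }
        }
        return improved;
    }

    public static void main(String[] args) {
        int[] original = {-1, 0, 0, 1, 2, 3, 0};
        System.out.println(Arrays.toString(improve("ABABACA", original)));
        // prints [-1, 0, 0, 0, 0, 3, 0]
    }
}
```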

The optimized code:

private int[] getNext(String p) {
    if (p.length() == 1)
        return new int[] {-1};
    int[] next = new int[p.length()];
    next[0] = -1;
    next[1] = 0;
    int cnd = 0;
    int pos = 2;
    while (pos < p.length()) {
        if (p.charAt(pos - 1) == p.charAt(cnd)) {
            // check whether the current character equals the character
            // right after the common prefix;
            // if it does, use next[pos] = next[cnd]
            if (p.charAt(pos) != p.charAt(++cnd))
                next[pos++] = cnd;
            else
                next[pos++] = next[cnd];
        } else if (next[cnd] > -1) {
            cnd = next[cnd];
        } else {
            next[pos++] = 0;
        }
    }
    return next;
}

4. Linear-time string matching with the next array

The implementation:

public int strStr(String str, String pattern) {
    if (str == null || pattern == null)
        return -1;
    if (pattern.length() == 0)
        return 0;
    int[] next = getNext(pattern);
    int m = 0; // position in str where the current alignment of the pattern starts
    int i = 0; // index in pattern of the character currently being compared
    while (m + i < str.length()) {
        if (str.charAt(m + i) == pattern.charAt(i)) {
            i++;
            if (i == pattern.length())
                return m;
        } else {
            if (next[i] == -1) {
                // no common prefix/suffix:
                // shift right by one and restart from the start of the pattern
                m++;
                i = 0;
            } else {
                // i is the matched length, next[i] the partial match length,
                // and i - next[i] the shift
                m = m + i - next[i]; // shift right
                i = next[i]; // the partial match length becomes the new matched length
            }
        }
    }
    return -1;
}
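To make the effect of the optimization from section 3 visible, the sketch below runs the same matching loop with a comparison counter and feeds it the two tables listed earlier. The class name KmpCompareCount, the method name search and the idea of passing the table in as a parameter are assumptions made for this example, not part of the original code.

```java
import java.util.Arrays;

final class KmpCompareCount {
    // Hypothetical helper: same loop as strStr, but takes the next table as
    // an argument and returns {match position, number of character comparisons}.
    public static int[] search(String str, String pattern, int[] next) {
        int m = 0;
        int i = 0;
        int comparisons = 0;
        while (m + i < str.length()) {
            comparisons++;
            if (str.charAt(m + i) == pattern.charAt(i)) {
                i++;
                if (i == pattern.length()) {
                    return new int[] {m, comparisons};
                }
            } else if (next[i] == -1) {
                m++;
                i = 0;
            } else {
                m = m + i - next[i];
                i = next[i];
            }
        }
        return new int[] {-1, comparisons};
    }

    public static void main(String[] args) {
        String text = "ABABCABABACA";
        String pattern = "ABABACA";
        int[] original = {-1, 0, 0, 1, 2, 3, 0};
        int[] improved = {-1, 0, 0, 0, 0, 3, 0};
        System.out.println(Arrays.toString(search(text, pattern, original))); // prints [5, 14]
        System.out.println(Arrays.toString(search(text, pattern, improved))); // prints [5, 13]
    }
}
```

Both tables find the match at position 5; the improved table saves the one repeated comparison of C against A described in section 3.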


2 0