String Matching Algorithms (Part 2)

Source: Internet; Published by: 行知职高新疆部; Editor: 程序博客网; Date: 2024/05/13 14:54

Note: this article is a loose translation of EXACT STRING MATCHING ALGORITHMS, with some filler removed and some explanations added.

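Every algorithm in this article reports all match positions. In the code the pattern is written x[m] and the text y[n], that is, x has length m and y has length n. All strings are built over a finite alphabet Σ of size σ.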

3. The Magic of Bit Operations: KR and SO

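Bit operations can often pull off seemingly impossible tricks. For example, how do you swap two numbers without a temporary variable? Someone who has never met this kind of puzzle is unlikely to come up with the answer. To borrow a term from the board game Go, bit operations are the "tesuji" of programming: the clever local move.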
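As a small illustration of the swap puzzle, here is a minimal sketch of the classic XOR swap. The function name xor_swap is made up for this example. The guard matters: XOR-ing an object with itself would set it to zero.

```c
/* Hypothetical helper: swaps *a and *b without a temporary variable. */
void xor_swap(unsigned int *a, unsigned int *b) {
    if (a == b)      /* same object: XOR with itself would zero it */
        return;
    *a ^= *b;        /* a = a0 ^ b0 */
    *b ^= *a;        /* b = b0 ^ (a0 ^ b0) = a0 */
    *a ^= *b;        /* a = (a0 ^ b0) ^ a0 = b0 */
}
```

Calling xor_swap(&p, &q) with p = 3 and q = 5 leaves p = 5 and q = 3.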

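Storing data bit by bit gives the best possible space utilization. Remarkably, compressing the space also makes things faster, because the CPU supports bit operations directly in hardware. For example, shifting an ordinary array by one position costs O(n) time, whereas a bitwise shift of a machine word is a single instruction.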
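The contrast between the two kinds of shift can be sketched as follows. Both function names are invented for this illustration; the first keeps one flag per array element, the second packs the same flags into one word.

```c
#include <stddef.h>

/* Shifts an array of 0/1 flags one slot toward higher indices: O(n) moves. */
void shift_flags(unsigned char flags[], size_t n) {
    size_t i;
    if (n == 0)
        return;
    for (i = n - 1; i > 0; --i)
        flags[i] = flags[i - 1];
    flags[0] = 0;
}

/* The same operation on flags packed into one word: a single shift. */
unsigned int shift_bits(unsigned int bits) {
    return bits << 1;
}
```

If flag i of the array corresponds to bit i of the word, both functions produce the same result as long as no flag is shifted out of the top.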

The KR algorithm

Karp-Rabin algorithm

Features:

uses a hashing function;
preprocessing phase in O(m) time complexity and constant space;
searching phase in O(mn) time complexity;
O(n+m) expected running time.

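The introduction in Part 1 said that KR relies on hashing, which is how the original text presents it. In my view, though, the hash is something of a smokescreen. The basic steps are the same as brute force; the only difference is that the hash values are compared before each character-by-character comparison, and if they differ the comparison is skipped. If the hash could not be computed efficiently, this "improvement" would be worse than none: computing a hash before every comparison takes as long as simply doing the comparison.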

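To turn character-by-character comparison into a comparison of two integers, KR treats a string of length m as an integer written in base 2, with each character acting as one "digit". Once this integer has been computed for the first window, each shift of the window only needs to remove the contribution of the leading character and add the new trailing character to obtain the new hash value. What if m is so large that the value exceeds the largest integer the machine can handle? No problem: reduce the value modulo a fixed number, and the properties of modular arithmetic keep everything consistent. If the modulus is 2^w, where w is the machine word size, the reduction happens automatically through wraparound, so the explicit modulo step can be dropped altogether (in C this wraparound is guaranteed only for unsigned types).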
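The sliding update and the implicit modulo can be checked with a short sketch. The names window_hash, roll and rolling_matches_direct are invented for this illustration; unsigned arithmetic is used so that overflow wraps modulo 2^w.

```c
/* Hash of w[0 .. m-1] read as a base-2 number; overflow wraps modulo 2^w. */
unsigned int window_hash(const char *w, int m) {
    unsigned int h = 0;
    int i;
    for (i = 0; i < m; ++i)
        h = (h << 1) + (unsigned char)w[i];
    return h;
}

/* Slides the window one position: removes `out`, appends `in`.
   d must equal 2^(m-1), also computed with wraparound. */
unsigned int roll(unsigned int h, unsigned int d, char out, char in) {
    return ((h - (unsigned char)out * d) << 1) + (unsigned char)in;
}

/* Returns 1 if rolling across y always agrees with recomputing from scratch. */
int rolling_matches_direct(const char *y, int n, int m) {
    unsigned int d = 1, h;
    int i, j;
    if (m < 1 || m > n)
        return 1;
    for (i = 1; i < m; ++i)
        d <<= 1;
    h = window_hash(y, m);
    for (j = 0; j + m < n; ++j) {
        h = roll(h, d, y[j], y[j + m]);
        if (h != window_hash(y + j + 1, m))
            return 0;
    }
    return 1;
}
```

The check also holds when m exceeds the word size, where the weights of the leading characters wrap to 0.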

Hashing provides a simple method to avoid a quadratic number of character comparisons in most practical situations. Instead of checking at each position of the text whether the pattern occurs, it seems to be more efficient to check only whether the contents of the window "look like" the pattern. In order to check the resemblance between these two words, a hashing function is used.

To be helpful for the string matching problem, a hashing function hash should have the following properties:
efficiently computable;
highly discriminating for strings;
hash(y[j+1 .. j+m]) must be easily computable from hash(y[j .. j+m-1]) and y[j+m]:
hash(y[j+1 .. j+m]) = rehash(y[j], y[j+m], hash(y[j .. j+m-1])).

For a word w of length m let hash(w) be defined as follows:
hash(w[0 .. m-1]) = (w[0]*2^(m-1) + w[1]*2^(m-2) + ··· + w[m-1]*2^0) mod q
where q is a large number.

Then, rehash(a, b, h) = ((h - a*2^(m-1))*2 + b) mod q
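The two definitions can be written with an explicit modulus as a minimal sketch. The value of Q and the names hash_mod, lead_weight and rehash_mod are assumptions made for this illustration; the source only says that q is a large number.

```c
#include <stdint.h>

#define Q 1000000007u   /* assumed modulus, chosen only for illustration */

/* hash(w[0 .. m-1]) = (w[0]*2^(m-1) + ... + w[m-1]*2^0) mod Q, by Horner's rule. */
uint32_t hash_mod(const char *w, int m) {
    uint64_t h = 0;
    int i;
    for (i = 0; i < m; ++i)
        h = (h * 2 + (unsigned char)w[i]) % Q;
    return (uint32_t)h;
}

/* 2^(m-1) mod Q, the weight of the leading character. */
uint32_t lead_weight(int m) {
    uint64_t d = 1;
    int i;
    for (i = 1; i < m; ++i)
        d = (d * 2) % Q;
    return (uint32_t)d;
}

/* rehash(a, b, h) = ((h - a*2^(m-1))*2 + b) mod Q. Q is added before the
   subtraction so the intermediate value never goes negative. */
uint32_t rehash_mod(unsigned char a, unsigned char b, uint32_t h, uint32_t d) {
    uint64_t t = ((uint64_t)h + Q - ((uint64_t)a * d) % Q) % Q;
    return (uint32_t)((t * 2 + b) % Q);
}
```

With 64-bit intermediates none of the products can overflow, so the results are exact residues modulo Q.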

The preprocessing phase of the Karp-Rabin algorithm consists in computing hash(x). It can be done in constant space and O(m) time.

During the searching phase, it is enough to compare hash(x) with hash(y[j .. j+m-1]) for 0 ≤ j ≤ n-m. If an equality is found, it is still necessary to check the equality x=y[j .. j+m-1] character by character.

The time complexity of the searching phase of the Karp-Rabin algorithm is O(mn) (when searching for a^m in a^n for instance). Its expected number of text character comparisons is O(n+m).

Here is the code for the KR algorithm:

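#include <stdio.h>
#include <string.h>

/* Report a match; here we simply print the start position. */
#define OUTPUT(j) printf("%d\n", (j))
/* Slide the window: drop the leading char a, append the trailing char b.
   Relies on the local variable d = 2^(m-1). */
#define REHASH(a, b, h) ((((h) - (a)*d) << 1) + (b))

void KR(const char *x, int m, const char *y, int n) {
   /* unsigned arithmetic wraps modulo 2^w, so no explicit "mod q" is needed */
   unsigned int d, hx, hy;
   int i, j;
   if (m < 1 || m > n)
      return;
   /* Preprocessing */
   /* computes d = 2^(m-1) with
      the left-shift operator */
   for (d = i = 1; i < m; ++i)
      d = (d<<1);
   for (hy = hx = i = 0; i < m; ++i) {
      hx = ((hx<<1) + (unsigned char)x[i]);
      hy = ((hy<<1) + (unsigned char)y[i]);
   }
   /* Searching */
   for (j = 0; j <= n-m; ++j) {
      if (hx == hy && memcmp(x, y + j, m) == 0)
         OUTPUT(j);
      if (j < n-m)   /* do not read y[n] after the last window */
         hy = REHASH((unsigned char)y[j], (unsigned char)y[j + m], hy);
   }
}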

Example:

(Characters are taken at their ASCII values: A is 65, C is 67, G is 71, and T is 84.)

hx = hash(x[0 .. 7]) = 17597

First attempt
GCATCGCAGAGAGTATACAGTACG
GCAGAGAG
hash(y[0 .. 7]) = 17819

Second attempt
GCATCGCAGAGAGTATACAGTACG
 GCAGAGAG
hash(y[1 .. 8]) = 17533

Third attempt
GCATCGCAGAGAGTATACAGTACG
  GCAGAGAG
hash(y[2 .. 9]) = 17979

Fourth attempt
GCATCGCAGAGAGTATACAGTACG
   GCAGAGAG
hash(y[3 .. 10]) = 19389

Fifth attempt
GCATCGCAGAGAGTATACAGTACG
    GCAGAGAG
hash(y[4 .. 11]) = 17339

Sixth attempt
GCATCGCAGAGAGTATACAGTACG
     12345678
     GCAGAGAG
hash(y[5 .. 12]) = 17597

Seventh attempt
GCATCGCAGAGAGTATACAGTACG
      GCAGAGAG
hash(y[6 .. 13]) = 17102

Eighth attempt
GCATCGCAGAGAGTATACAGTACG
       GCAGAGAG
hash(y[7 .. 14]) = 17117

Ninth attempt
GCATCGCAGAGAGTATACAGTACG
        GCAGAGAG
hash(y[8 .. 15]) = 17678

Tenth attempt
GCATCGCAGAGAGTATACAGTACG
         GCAGAGAG
hash(y[9 .. 16]) = 17245

Eleventh attempt
GCATCGCAGAGAGTATACAGTACG
          GCAGAGAG
hash(y[10 .. 17]) = 17917

Twelfth attempt
GCATCGCAGAGAGTATACAGTACG
           GCAGAGAG
hash(y[11 .. 18]) = 17723

Thirteenth attempt
GCATCGCAGAGAGTATACAGTACG
            GCAGAGAG
hash(y[12 .. 19]) = 18877

Fourteenth attempt
GCATCGCAGAGAGTATACAGTACG
             GCAGAGAG
hash(y[13 .. 20]) = 19662

Fifteenth attempt
GCATCGCAGAGAGTATACAGTACG
              GCAGAGAG
hash(y[14 .. 21]) = 17885

Sixteenth attempt
GCATCGCAGAGAGTATACAGTACG
               GCAGAGAG
hash(y[15 .. 22]) = 19197

Seventeenth attempt
GCATCGCAGAGAGTATACAGTACG
                GCAGAGAG
hash(y[16 .. 23]) = 16961

The Karp-Rabin algorithm performs 8 character comparisons on the example.
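The count of 8 comparisons can be reproduced with a sketch of KR that compares characters one by one instead of calling memcmp. The function name kr_count and its output parameters are invented for this illustration.

```c
/* Sketch of KR that counts character comparisons.
   Stores up to cap match positions in pos[] and returns how many it stored. */
int kr_count(const char *x, int m, const char *y, int n,
             int pos[], int cap, long *comparisons) {
    unsigned int d = 1, hx = 0, hy = 0;
    int i, j, found = 0;
    *comparisons = 0;
    if (m < 1 || m > n)
        return 0;
    for (i = 1; i < m; ++i)
        d <<= 1;                          /* d = 2^(m-1) */
    for (i = 0; i < m; ++i) {
        hx = (hx << 1) + (unsigned char)x[i];
        hy = (hy << 1) + (unsigned char)y[i];
    }
    for (j = 0; j <= n - m; ++j) {
        if (hx == hy) {                   /* hashes agree: verify by characters */
            for (i = 0; i < m; ++i) {
                ++*comparisons;
                if (x[i] != y[j + i])
                    break;
            }
            if (i == m && found < cap)
                pos[found++] = j;
        }
        if (j < n - m)                    /* roll the hash to the next window */
            hy = ((hy - (unsigned char)y[j] * d) << 1) + (unsigned char)y[j + m];
    }
    return found;
}
```

On the example, only the sixth window has the same hash as the pattern, so the 8 comparisons all come from verifying that single window at position 5.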

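As we can see, KR has an O(m) preprocessing phase, but that preprocessing does not seem to capture any structure of the pattern itself, so the searching phase remains O(mn) in the worst case. This usually does not show in practice, but searching for "aaaaa" in "aaaaaaaaaaaaaaaaaaaaaaaaa" makes it plain how slow KR can be.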
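To put a number on this worst case, here is a sketch of plain brute force that counts character comparisons (naive_comparisons is a name made up for this example). When every window equals the pattern, every window hash equals hash(x), so KR has to verify all n-m+1 windows in full and performs the same m*(n-m+1) comparisons, on top of the hashing work.

```c
/* Brute force with early exit; returns the number of character comparisons. */
long naive_comparisons(const char *x, int m, const char *y, int n) {
    long count = 0;
    int i, j;
    for (j = 0; j + m <= n; ++j)
        for (i = 0; i < m; ++i) {
            ++count;
            if (x[i] != y[j + i])
                break;
        }
    return count;
}
```

For a pattern of 5 a's and a text of 25 a's there are 21 windows of 5 comparisons each, 105 in total.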

Overall, KR is a little better than brute force: its expected number of character comparisons is O(m+n).

The Shift-Or algorithm

Features:

uses bitwise techniques;
efficient if the pattern length is no longer than the memory-word size of the machine;
preprocessing phase in O(m + σ) time and space complexity;
searching phase in O(n) time complexity (independent from the alphabet size and the pattern length);
adapts easily to approximate string matching.

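To get the most out of bit operations, Shift-Or accepts one major limitation: the pattern cannot be longer than the machine word. On the 32-bit machines common today the word size is 32, so it can only match patterns of at most 32 characters. The payoff is that the searching phase runs in O(n) time, as fast as an automaton, while preprocessing takes only O(m+σ) time and space, far less than building an automaton.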
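The limit follows the width of the integer type that holds the state. As a sketch under that assumption, the same technique on uint64_t handles patterns of up to 64 characters; the function so64 and its interface are invented for this illustration and are not part of the original code.

```c
#include <stdint.h>

/* Shift-Or on 64-bit words. Returns the number of matches and stores up to
   cap start positions in pos[]. Returns -1 if m is outside 1..64. */
int so64(const char *x, int m, const char *y, int n, int pos[], int cap) {
    uint64_t S[256], state = ~(uint64_t)0, hit;
    int i, j, found = 0;
    if (m < 1 || m > 64)
        return -1;
    for (i = 0; i < 256; ++i)
        S[i] = ~(uint64_t)0;
    for (i = 0; i < m; ++i)
        S[(unsigned char)x[i]] &= ~((uint64_t)1 << i);
    hit = (uint64_t)1 << (m - 1);         /* bit of the top level */
    for (j = 0; j < n; ++j) {
        state = (state << 1) | S[(unsigned char)y[j]];
        if ((state & hit) == 0) {         /* top level reached */
            if (found < cap)
                pos[found] = j - m + 1;
            ++found;
        }
    }
    return found;
}
```

This version tests the top bit directly rather than comparing against a threshold, which is a design choice of the sketch.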

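Let us see how it cleverly manages to look at the text only once:

Imagine a promotion system with m levels. In each round a newcomer is placed at level 0; then everyone in the system faces a test: those who pass move up one level, and those who fail are eliminated. Anyone who reaches the top level has passed m tests in a row, and that is the person we are looking for.

The Shift-Or algorithm follows exactly this promotion rule. The test is whether the pattern character at your position equals the current text character. Reaching the top level means that m consecutive pattern positions have agreed with the incoming text characters, which is exactly a successful match.

Once the idea is clear, a doubt arises: checking which positions agree with the text character takes m steps, doesn't it? Wouldn't the whole algorithm then be O(mn)?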
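The naive reading of the promotion system, with one flag per level, can be sketched as below. It does O(m) work per text character, which is exactly the O(mn) worry just raised. The names levels_search and MAXM are assumptions of this sketch.

```c
#include <string.h>

#define MAXM 64   /* assumed upper bound on the pattern length for this sketch */

/* Simulates the level system with one flag per level: O(m) work per text
   character. Returns the match count; stores up to cap positions in pos[]. */
int levels_search(const char *x, int m, const char *y, int n, int pos[], int cap) {
    unsigned char level[MAXM];   /* level[i] = 1: someone has matched x[0 .. i] */
    int i, j, found = 0;
    if (m < 1 || m > MAXM)
        return -1;
    memset(level, 0, sizeof level);
    for (j = 0; j < n; ++j) {
        /* go from the top down so each person is promoted only once */
        for (i = m - 1; i > 0; --i)
            level[i] = (unsigned char)(level[i - 1] && x[i] == y[j]);
        level[0] = (unsigned char)(x[0] == y[j]);   /* the newcomer */
        if (level[m - 1]) {
            if (found < cap)
                pos[found] = j - m + 1;
            ++found;
        }
    }
    return found;
}
```

The inner loop over the levels is the part that a shift and an OR on a single word replace.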

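This is where bit operations come in. Yes, the idea is naive, but bit operations are efficient:

Precompute, for each character of the alphabet, the positions where it occurs in the pattern, stored as the bits of an integer: 0 where it occurs, 1 where it does not. This takes σ integers in total (S in the code). Likewise, a single integer (state in the code) represents the promotion status: a level that holds someone is marked 0, an empty level 1. Promoting the whole system is then exactly a shift, and checking the positions takes a single OR with the mask S[y[j]]. The whole algorithm therefore runs in O(n). That is where the name Shift-Or comes from.

One thing looks odd: the roles of 0 and 1 are the reverse of the usual convention, where 1 means present and 0 means absent. The reason is that a left shift brings in a 0 at the low end, which is exactly the newcomer entering at level 0.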
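For comparison, here is a sketch of the mirror-image variant, commonly called Shift-And, in which 1 means "present". Because the shift brings in a 0, the newcomer has to be added with an extra "| 1" in every step, which is the operation the inverted convention saves. The function shift_and and its interface are invented for this illustration.

```c
/* Shift-And: the mirror image of Shift-Or with 1 = "present".
   Returns the match count; stores up to cap positions in pos[].
   Returns -1 if m does not fit in an unsigned int. */
int shift_and(const char *x, int m, const char *y, int n, int pos[], int cap) {
    unsigned int M[256] = {0}, state = 0, top;
    int i, j, found = 0;
    if (m < 1 || m > (int)(sizeof(unsigned int) * 8))
        return -1;
    for (i = 0; i < m; ++i)
        M[(unsigned char)x[i]] |= 1u << i;    /* 1 where the character occurs */
    top = 1u << (m - 1);
    for (j = 0; j < n; ++j) {
        /* the shift brings in a 0, so the newcomer is added with "| 1" */
        state = ((state << 1) | 1u) & M[(unsigned char)y[j]];
        if (state & top) {
            if (found < cap)
                pos[found] = j - m + 1;
            ++found;
        }
    }
    return found;
}
```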

With that in mind, the code is much easier to follow:

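#include <stdio.h>

#define WORDSIZE ((int)(sizeof(unsigned int)*8))
#define ASIZE 256
/* Report a match; here we simply print the start position. */
#define OUTPUT(j) printf("%d\n", (j))

/* Fills S[] with the occurrence masks of x and returns the threshold lim. */
unsigned int preSo(const char *x, int m, unsigned int S[]) {
        unsigned int j, lim;
        int i;
        for (i = 0; i < ASIZE; ++i)
                S[i] = ~0u;
        for (lim = 0, i = 0, j = 1; i < m; ++i, j <<= 1) {
                /* index with unsigned char: plain char may be negative */
                S[(unsigned char)x[i]] &= ~j;
                lim |= j;
        }
        lim = ~(lim>>1);
        return(lim);
}
void SO(const char *x, int m, const char *y, int n) {
        unsigned int lim, state;
        unsigned int S[ASIZE];
        int j;
        if (m < 1 || m > WORDSIZE) {
                fprintf(stderr, "SO: pattern size must be between 1 and the word size\n");
                return;
        }
        /* Preprocessing */
        lim = preSo(x, m, S);
        /* Searching */
        for (state = ~0u, j = 0; j < n; ++j) {
                state = (state<<1) | S[(unsigned char)y[j]];
                if (state < lim)
                        OUTPUT(j - m + 1);
        }
}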

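The variable lim in the code is simply a yardstick. In the example below, the largest state with the top level occupied is 01111111, and lim is 10000000, so any state smaller than lim means that a 0 has appeared at the top level.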
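The yardstick can be isolated in a small sketch. The helper names so_limit and top_level_reached are made up here; so_limit computes the same value that preSo returns, written in closed form.

```c
/* Threshold used by Shift-Or: all bits set except the low m-1 bits.
   Assumes 1 <= m <= number of bits in unsigned int. */
unsigned int so_limit(int m) {
    return ~((1u << (m - 1)) - 1u);
}

/* For states whose bits above m-1 are all 1 (which Shift-Or maintains),
   "state < lim" holds exactly when bit m-1 is 0. */
int top_level_reached(unsigned int state, int m) {
    return state < so_limit(m);
}
```

The bits above m-1 stay 1 because every mask in S has them set and each step ORs the state with one of those masks.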

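The original article's description of Shift-Or is rather hard to follow; reading the code against that explanation leaves one somewhat lost. I came up with the promotion analogy by working directly from the code.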

Example:

As R₁₂[7]=0 it means that an occurrence of x has been found at position 12-8+1=5.

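After the second for loop in preSo, lim = 2^m - 1, which is 11111111 in binary for m = 8. Then lim = ~(lim>>1) = 10000000. Only the low 8 bits are shown throughout; in a wider unsigned int the higher bits of this final lim, of every S entry, and of state are all 1.

The first for loop in preSo initializes every entry of S, the occurrence masks of all characters, to all 1s.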

The second for loop in preSo:

i = 0, j = 1 = 00000001, S[x[i]] = S[G] = S[G] & ~j = 11111110, lim = 00000001;

i = 1, j = 2 = 00000010, S[x[i]] = S[C] = S[C] & ~j = 11111101, lim = 00000011;

i = 2, j = 4 = 00000100, S[x[i]] = S[A] = S[A] & ~j = 11111011, lim = 00000111;

i = 3, j = 8 = 00001000, S[x[i]] = S[G] = S[G] & ~j = 11110110, lim = 00001111;

i = 4, j = 16 = 00010000, S[x[i]] = S[A] = S[A] & ~j = 11101011, lim = 00011111;

i = 5, j = 32 = 00100000, S[x[i]] = S[G] = S[G] & ~j = 11010110, lim = 00111111;

i = 6, j = 64 = 01000000, S[x[i]] = S[A] = S[A] & ~j = 10101011, lim = 01111111;

i = 7, j = 128 = 10000000, S[x[i]] = S[G] = S[G] & ~j = 01010110, lim = 11111111;

Finally:

S[A] = 10101011

S[C] = 11111101

S[G] = 01010110

All other entries are all 1s.
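These mask values can be reproduced with a short sketch of the preprocessing step alone. The function name build_masks is invented for this illustration; it performs the same mask construction as preSo without computing lim.

```c
/* Builds the Shift-Or masks: bit i of S[c] is 0 exactly when x[i] == c.
   S must have room for 256 entries. */
void build_masks(const char *x, int m, unsigned int S[]) {
    int i;
    for (i = 0; i < 256; ++i)
        S[i] = ~0u;
    for (i = 0; i < m; ++i)
        S[(unsigned char)x[i]] &= ~(1u << i);
}
```

For the pattern GCAGAGAG the low 8 bits come out as 0xAB for A, 0xFD for C and 0x56 for G, which are the binary values listed above.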

In the for loop of SO, the value of state after reading y[j] is the column R_j of the state table in the original example (R₁₂ above is one such column). Step by step, with the left operand already shifted:

j = 0, state = 11111110 | S[G] = 11111110;

j = 1, state = 11111100 | S[C] = 11111101;

j = 2, state = 11111010 | S[A] = 11111011;

j = 3, state = 11110110 | S[T] = 11111111;

j = 4, state = 11111110 | S[C] = 11111111;

j = 5, state = 11111110 | S[G] = 11111110;

j = 6, state = 11111100 | S[C] = 11111101;

j = 7, state = 11111010 | S[A] = 11111011;

j = 8, state = 11110110 | S[G] = 11110110;

j = 9, state = 11101100 | S[A] = 11101111;

j = 10, state = 11011110 | S[G] = 11011110;

j = 11, state = 10111100 | S[A] = 10111111;

j = 12, state = 01111110 | S[G] = 01111110;

......

In the table, bit 0 is the lowest bit and bit 7 the highest. Only at j = 12 does the highest bit become 0, making state smaller than lim.
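The trace can be regenerated with a sketch that records the low byte of state after each text character. The function so_trace and its interface are assumptions of this example, and it is restricted to m <= 8 so that one byte shows the whole state.

```c
/* Runs Shift-Or and records the low 8 bits of the state after each text
   character in trace[] (n entries). Returns the index j at which the first
   match ends, or -1 if there is none or if m is outside 1..8. */
int so_trace(const char *x, int m, const char *y, int n, unsigned char trace[]) {
    unsigned int S[256], state = ~0u, top;
    int i, j, first = -1;
    if (m < 1 || m > 8)
        return -1;
    for (i = 0; i < 256; ++i)
        S[i] = ~0u;
    for (i = 0; i < m; ++i)
        S[(unsigned char)x[i]] &= ~(1u << i);
    top = 1u << (m - 1);
    for (j = 0; j < n; ++j) {
        state = (state << 1) | S[(unsigned char)y[j]];
        trace[j] = (unsigned char)(state & 0xFFu);
        if ((state & top) == 0 && first < 0)
            first = j;
    }
    return first;
}
```

On the example text the first match ends at j = 12, where the recorded byte is 0x7E (01111110).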


0 0