String Matching Algorithms (Part 2)
Source: Internet. Editor: 程序博客网. Date: 2024/05/13 14:54
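Note: this article is loosely translated from EXACT STRING MATCHING ALGORITHMS, with some filler removed and some explanations added.
Every algorithm in this article outputs all match positions. In the code, the pattern is written x[m] and the text y[n]; all strings are built over a finite alphabet Σ of size σ.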
3. The Magic of Bit Operations: KR and SO
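Bit operations can often do surprising things. For example, how do you swap two numbers without a temporary variable? Someone who has never met this kind of problem is unlikely to work out the answer unaided. To borrow a metaphor from the game of Go, bit operations are the "tesuji" (clever tactical moves) of programming.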
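For readers who have not seen the swap puzzle before, here is a minimal C sketch of the classic XOR swap. The function name `xor_swap` is made up for illustration, and the aliasing guard is an addition: without it, swapping a variable with itself would zero it.

```c
/* Swap *a and *b without a temporary, using three XORs.
   xor_swap is an illustrative name, not something from the original article. */
void xor_swap(unsigned int *a, unsigned int *b) {
    if (a == b)      /* same object: x ^ x would zero it, so do nothing */
        return;
    *a ^= *b;        /* a = a0 ^ b0 */
    *b ^= *a;        /* b = b0 ^ (a0 ^ b0) = a0 */
    *a ^= *b;        /* a = (a0 ^ b0) ^ a0 = b0 */
}
```

Calling `xor_swap(&p, &q)` with p = 3 and q = 5 leaves p = 5 and q = 3. In everyday code a plain temporary is usually clearer and just as fast; the trick is shown only as an example of what bit operations make possible.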
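Bit-level storage gives the best possible space utilization, and while the data is being compressed the speed improves as well, because the CPU supports these operations directly in hardware. For example, shifting an ordinary array takes O(n) time, whereas shifting the bits of a machine word is a single instruction.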
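The contrast can be sketched in C as follows. The helper names (`shift_flags`, `shift_packed`, `pack_flags`) are hypothetical, and the packed version assumes at most as many flags as there are bits in an `unsigned int`.

```c
#include <stddef.h>

/* Shift an array of n flags by one position toward higher indices: O(n). */
void shift_flags(unsigned char flags[], size_t n) {
    size_t i;
    if (n == 0)
        return;
    for (i = n - 1; i > 0; --i)
        flags[i] = flags[i - 1];
    flags[0] = 0;
}

/* The same flags packed into one word (flag i stored in bit i): one shift.
   Unlike the array version, the old top flag moves into bit n instead of
   being dropped, so mask the result if only n bits matter. */
unsigned int shift_packed(unsigned int bits) {
    return bits << 1;
}

/* Pack the first n flags into a word so the two versions can be compared. */
unsigned int pack_flags(const unsigned char flags[], size_t n) {
    unsigned int bits = 0;
    size_t i;
    for (i = 0; i < n; ++i)
        if (flags[i])
            bits |= 1u << i;
    return bits;
}
```

Both algorithms in this section rely on the second form: the whole search state lives in one machine word and is advanced with a single shift.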
The KR Algorithm
Karp-Rabin algorithm
Features:
- uses a hashing function;
- preprocessing phase in O(m) time complexity and constant space;
- searching phase in O(mn) time complexity;
- O(n+m) expected running time.
Hashing provides a simple method to avoid a quadratic number of character comparisons in most practical situations. Instead of checking at each position of the text whether the pattern occurs, it seems more efficient to check only whether the contents of the window "look like" the pattern. To check the resemblance between these two words, a hashing function is used.
To be helpful for the string matching problem, a hashing function hash should have the following properties:
- efficiently computable;
- highly discriminating for strings;
- hash(y[j+1 .. j+m]) must be easily computable from hash(y[j .. j+m-1]) and y[j+m]:
hash(y[j+1 .. j+m]) = rehash(y[j], y[j+m], hash(y[j .. j+m-1])).
For a word w of length m let hash(w) be defined as follows:
$\mathrm{hash}(w[0\,..\,m-1]) = (w[0] \cdot 2^{m-1} + w[1] \cdot 2^{m-2} + \cdots + w[m-1] \cdot 2^{0}) \bmod q$
where q is a large number.
Then, $\mathrm{rehash}(a, b, h) = ((h - a \cdot 2^{m-1}) \cdot 2 + b) \bmod q$.
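The two definitions can be written out with an explicit modulus. The sketch below is only an illustration: the value of `KR_Q` and the function names are assumptions (the text says no more than "a large number"), and the subtraction is rearranged so that unsigned arithmetic never goes negative.

```c
/* Illustrative modulus; an assumption, the article only asks for a large q. */
#define KR_Q 1000003ULL

/* hash(w[0 .. m-1]) = (w[0]*2^(m-1) + ... + w[m-1]*2^0) mod q, by Horner's rule. */
unsigned long long kr_hash(const char *w, int m) {
    unsigned long long h = 0;
    int i;
    for (i = 0; i < m; ++i)
        h = (h * 2 + (unsigned char)w[i]) % KR_Q;
    return h;
}

/* 2^(m-1) mod q: the weight of the character that leaves the window. */
unsigned long long kr_top_weight(int m) {
    unsigned long long d = 1;
    int i;
    for (i = 1; i < m; ++i)
        d = (d * 2) % KR_Q;
    return d;
}

/* rehash(a, b, h) = ((h - a*2^(m-1))*2 + b) mod q, with d = 2^(m-1) mod q.
   KR_Q is added before subtracting so the intermediate value stays non-negative. */
unsigned long long kr_rehash(char a, char b, unsigned long long h,
                             unsigned long long d) {
    unsigned long long drop = ((unsigned char)a * d) % KR_Q;
    return (((h + KR_Q - drop) % KR_Q) * 2 + (unsigned char)b) % KR_Q;
}
```

The property that matters is that `kr_rehash(y[j], y[j+m], kr_hash(y + j, m), d)` equals `kr_hash(y + j + 1, m)` for every window, so each slide costs O(1) instead of O(m).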
The preprocessing phase of the Karp-Rabin algorithm consists in computing hash(x). It can be done in constant space and O(m) time.
During the searching phase, it is enough to compare hash(x) with hash(y[j .. j+m-1]) for 0 ≤ j ≤ n-m. If an equality is found, it is still necessary to check the equality x = y[j .. j+m-1] character by character.
The time complexity of the searching phase of the Karp-Rabin algorithm is O(mn) (when searching for $a^m$ in $a^n$, for instance). Its expected number of text character comparisons is O(n+m).
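    #include <stdio.h>
    #include <string.h>

    /* Report a match. The original leaves OUTPUT undefined; printing is an assumption. */
    #define OUTPUT(j) printf("%d\n", (j))
    /* Slide the window: drop a (weight d = 2^(m-1)), shift, append b.
       Unsigned arithmetic wraps around, so q is implicitly 2^(word size). */
    #define REHASH(a, b, h) ((((h) - (a)*d) << 1) + (b))

    void KR(const char *x, int m, const char *y, int n) {
        unsigned int d, hx, hy;
        int i, j;

        if (m <= 0 || m > n)
            return;
        /* Preprocessing */
        /* computes d = 2^(m-1) with the left-shift operator */
        for (d = 1, i = 1; i < m; ++i)
            d = (d<<1);
        for (hy = hx = 0, i = 0; i < m; ++i) {
            hx = ((hx<<1) + (unsigned char)x[i]);
            hy = ((hy<<1) + (unsigned char)y[i]);
        }
        /* Searching */
        j = 0;
        while (j <= n-m) {
            if (hx == hy && memcmp(x, y + j, m) == 0)
                OUTPUT(j);
            if (j < n-m)   /* do not read y[n] after the last window */
                hy = REHASH((unsigned char)y[j], (unsigned char)y[j + m], hy);
            ++j;
        }
    }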
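To make the O(mn) worst case concrete, the following sketch counts how many windows pass the hash test and would therefore need a full character-by-character check. It reuses the same wrap-around hash as the code above; `kr_hash_hits` is a hypothetical name introduced here.

```c
/* Count the windows whose hash equals hash(x). Each such window costs up to
   m character comparisons in the verification step of Karp-Rabin.
   kr_hash_hits is an illustrative helper, not part of the original code. */
int kr_hash_hits(const char *x, int m, const char *y, int n) {
    unsigned int d = 1, hx = 0, hy = 0;
    int i, j, hits = 0;

    if (m <= 0 || m > n)
        return 0;
    for (i = 1; i < m; ++i)
        d <<= 1;                                  /* d = 2^(m-1) */
    for (i = 0; i < m; ++i) {
        hx = (hx << 1) + (unsigned char)x[i];
        hy = (hy << 1) + (unsigned char)y[i];
    }
    for (j = 0; j <= n - m; ++j) {
        if (hx == hy)
            ++hits;
        if (j < n - m)                            /* slide the window by one */
            hy = ((hy - (unsigned char)y[j] * d) << 1) + (unsigned char)y[j + m];
    }
    return hits;
}
```

For x = aaaa and y = aaaaaaaaaa all 7 windows hit, so about 7 × 4 comparisons are made; in general $a^m$ in $a^n$ gives (n-m+1)·m comparisons. On typical text almost no window hits unless it is a real occurrence, which is where the O(n+m) expected time comes from.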
Example, with x = GCAGAGAG and y = GCATCGCAGAGAGTATACAGTACG, so that hash(x) = 17597:
hash(y[0 .. 7]) = 17819
hash(y[1 .. 8]) = 17533
hash(y[2 .. 9]) = 17979
hash(y[3 .. 10]) = 19389
hash(y[4 .. 11]) = 17339
hash(y[5 .. 12]) = 17597
hash(y[6 .. 13]) = 17102
hash(y[7 .. 14]) = 17117
hash(y[8 .. 15]) = 17678
hash(y[9 .. 16]) = 17245
hash(y[10 .. 17]) = 17917
hash(y[11 .. 18]) = 17723
hash(y[12 .. 19]) = 18877
hash(y[13 .. 20]) = 19662
hash(y[14 .. 21]) = 17885
hash(y[15 .. 22]) = 19197
hash(y[16 .. 23]) = 16961
The Karp-Rabin algorithm performs 8 character comparisons on the example.
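The listed values can be reproduced with a short routine. This is a sketch under two assumptions: the example text is y = GCATCGCAGAGAGTATACAGTACG with m = 8 (the usual example from the source material, consistent with the hashes listed above), and characters are ASCII. `window_hashes` is a made-up name.

```c
/* Fill out[0 .. n-m] with hash(y[j .. j+m-1]) for every window j, using the
   rolling update with q = 2^(word size) (unsigned wrap-around).
   Returns the number of windows written. out needs n-m+1 entries. */
int window_hashes(const char *y, int n, int m, unsigned int out[]) {
    unsigned int d = 1, h = 0;
    int i, j;

    if (m <= 0 || m > n)
        return 0;
    for (i = 1; i < m; ++i)
        d <<= 1;                                  /* d = 2^(m-1) */
    for (i = 0; i < m; ++i)
        h = (h << 1) + (unsigned char)y[i];       /* hash of the first window */
    for (j = 0; ; ++j) {
        out[j] = h;
        if (j == n - m)
            break;                                /* last window reached */
        h = ((h - (unsigned char)y[j] * d) << 1) + (unsigned char)y[j + m];
    }
    return n - m + 1;
}
```

With the assumed text, the routine yields 17 windows, and only window 5 carries the value 17597, which is the hash of GCAGAGAG. That single hash hit is verified with 8 character comparisons, matching the count stated above.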
The Shift-Or Algorithm
Features:
- uses bitwise techniques;
- efficient if the pattern length is no longer than the memory-word size of the machine;
- preprocessing phase in O(m + σ) time and space complexity;
- searching phase in O(n) time complexity (independent from the alphabet size and the pattern length);
- adapts easily to approximate string matching.
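    #include <stdio.h>
    #include <limits.h>

    /* Report a match. The original leaves OUTPUT undefined; printing is an assumption. */
    #define OUTPUT(j) printf("%d\n", (j))
    #define WORDSIZE ((int)(sizeof(unsigned int) * CHAR_BIT))
    #define ASIZE 256

    /* Build the masks: bit i of S[c] is 0 iff x[i] == c.
       Returns lim, whose low m-1 bits are 0 and whose higher bits are all 1. */
    unsigned int preSo(const char *x, int m, unsigned int S[]) {
        unsigned int j, lim;
        int i;
        for (i = 0; i < ASIZE; ++i)
            S[i] = ~0u;
        for (lim = 0, i = 0, j = 1; i < m; ++i, j <<= 1) {
            S[(unsigned char)x[i]] &= ~j;
            lim |= j;
        }
        lim = ~(lim>>1);
        return(lim);
    }

    void SO(const char *x, int m, const char *y, int n) {
        unsigned int lim, state;
        unsigned int S[ASIZE];
        int j;
        if (m < 1 || m > WORDSIZE) {
            fprintf(stderr, "SO: Use pattern size <= word size\n");
            return;
        }
        /* Preprocessing */
        lim = preSo(x, m, S);
        /* Searching */
        for (state = ~0u, j = 0; j < n; ++j) {
            state = (state<<1) | S[(unsigned char)y[j]];
            if (state < lim)   /* bit m-1 is 0: an occurrence ends at j */
                OUTPUT(j - m + 1);
        }
    }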
Example:
Since R12[7] = 0, an occurrence of x has been found at position 12 - 8 + 1 = 5.
After the second for loop in preSo, lim = 2^m - 1, which here is the binary number 11111111. Then lim = ~(lim>>1) = 10000000 (only the low 8 bits are shown; all higher bits are 1).
The first for loop in preSo initializes S[c] to all ones for every character c.
The second for loop in preSo:
i = 0, j = 1 = 00000001, S[x[i]] = S[G] = S[G] & ~j = 11111110, lim = 00000001;
i = 1, j = 2 = 00000010, S[x[i]] = S[C] = S[C] & ~j = 11111101, lim = 00000011;
i = 2, j = 4 = 00000100, S[x[i]] = S[A] = S[A] & ~j = 11111011, lim = 00000111;
i = 3, j = 8 = 00001000, S[x[i]] = S[G] = S[G] & ~j = 11110110, lim = 00001111;
i = 4, j = 16 = 00010000, S[x[i]] = S[A] = S[A] & ~j = 11101011, lim = 00011111;
i = 5, j = 32 = 00100000, S[x[i]] = S[G] = S[G] & ~j = 11010110, lim = 00111111;
i = 6, j = 64 = 01000000, S[x[i]] = S[A] = S[A] & ~j = 10101011, lim = 01111111;
i = 7, j = 128 = 10000000, S[x[i]] = S[G] = S[G] & ~j = 01010110, lim = 11111111;
Finally:
S[A] = 10101011
S[C] = 11111101
S[G] = 01010110
All other entries are all ones.
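The mask table above can be checked mechanically. The sketch below repeats the mask construction of preSo and adds a small formatter for the low m bits; both function names are hypothetical, and m is assumed to be between 1 and the number of bits in an `unsigned int`.

```c
#define ASIZE 256

/* Same construction as preSo: bit i of S[c] is 0 iff x[i] == c.
   Returns lim = ~((2^m - 1) >> 1). */
unsigned int build_masks(const char *x, int m, unsigned int S[ASIZE]) {
    unsigned int j = 1, lim = 0;
    int i;
    for (i = 0; i < ASIZE; ++i)
        S[i] = ~0u;
    for (i = 0; i < m; ++i, j <<= 1) {
        S[(unsigned char)x[i]] &= ~j;
        lim |= j;
    }
    return ~(lim >> 1);
}

/* Write the low m bits of v into out as a binary string, highest bit first.
   out must have room for m + 1 characters. */
void bits_to_string(unsigned int v, int m, char out[]) {
    int i;
    for (i = 0; i < m; ++i)
        out[i] = ((v >> (m - 1 - i)) & 1u) ? '1' : '0';
    out[m] = '\0';
}
```

Calling `build_masks("GCAGAGAG", 8, S)` and formatting `S['A']`, `S['C']` and `S['G']` with `bits_to_string(..., 8, buf)` gives 10101011, 11111101 and 01010110, the values derived by hand above.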
In the for loop of SO, the value of state after iteration j corresponds to column R_j of the example. Step by step (the left operand is the previous state already shifted left by one bit):
j = 0, state = 11111110 | S[G] = 11111110;
j = 1, state = 11111100 | S[C] = 11111101;
j = 2, state = 11111010 | S[A] = 11111011;
j = 3, state = 11110110 | S[T] = 11111111;
j = 4, state = 11111110 | S[C] = 11111111;
j = 5, state = 11111110 | S[G] = 11111110;
j = 6, state = 11111100 | S[C] = 11111101;
j = 7, state = 11111010 | S[A] = 11111011;
j = 8, state = 11110110 | S[G] = 11110110;
j = 9, state = 11101100 | S[A] = 11101111;
j = 10, state = 11011110 | S[G] = 11011110;
j = 11, state = 10111100 | S[A] = 10111111;
j = 12, state = 01111110 | S[G] = 01111110;
...
Bit 0 is the lowest bit and bit 7 the highest. Only at j = 12 does the highest bit become 0, which makes state smaller than lim.
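The trace can be reproduced by recording the state word after every text character. This is a sketch only: `so_trace` is a made-up name, the text y = GCATCGCAGAGAGTATACAGTACG is assumed from the example, and m must be between 1 and the number of bits in an `unsigned int`.

```c
#define ASIZE 256

/* Run Shift-Or and store the state after each text character in states[j].
   states must have room for n entries. Returns the number of occurrences. */
int so_trace(const char *x, int m, const char *y, int n, unsigned int states[]) {
    unsigned int S[ASIZE], state = ~0u, bit = 1, lim = 0;
    int i, j, found = 0;

    for (i = 0; i < ASIZE; ++i)
        S[i] = ~0u;
    for (i = 0; i < m; ++i, bit <<= 1) {
        S[(unsigned char)x[i]] &= ~bit;   /* bit i is 0 iff x[i] is this char */
        lim |= bit;
    }
    lim = ~(lim >> 1);
    for (j = 0; j < n; ++j) {
        state = (state << 1) | S[(unsigned char)y[j]];
        states[j] = state;
        if (state < lim)                  /* bit m-1 is 0: occurrence ends at j */
            ++found;
    }
    return found;
}
```

For the example, the low 8 bits of `states[12]` are 01111110 (0x7E), the only state with bit 7 cleared, so exactly one occurrence is reported, ending at j = 12 and starting at 12 - 8 + 1 = 5.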
References:
http://blog.csdn.net/oyd/article/details/3175805
http://www.cnblogs.com/Su-30MKK/archive/2012/09/17/2688122.html
http://blog.csdn.net/airfer/article/details/8951802