[Algorithms] String Pattern Matching: KMP, BM, Sunday, AC

Source: Internet. Editor: 程序博客网. Date: 2024/06/05 20:46

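Brute-Force Matching
KMP Algorithm
BM Algorithm
- Introduction and Example
- C Implementation
Sunday Algorithm
- Introduction and Example
- C++ Implementation
AC Algorithm
- Introduction
- C++ Implementation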

Text S[]: "A1B324"
Pattern P[]: "324"
Goal: find the position of P in S (index 3)

Brute-Force Matching

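#include <string.h>

int bf(char *s, char *p) {
    int sLen = strlen(s);
    int pLen = strlen(p);
    int i = 0; // index into S
    int j = 0; // index into P
    while (i < sLen && j < pLen) { // both indices still in range
        if (s[i] == p[j]) { // current characters match
            i++;
            j++;
        } else { // mismatch
            i = i - j + 1; // move i back to one past where this attempt started
            j = 0; // reset j to 0
        }
    }
    if (j == pLen) { // full match
        return i - j;
    } else {
        return -1;
    }
}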

KMP Algorithm

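#include <stdlib.h>
#include <string.h>

// Build the next[] table for pattern p
void getNext(char *p, int next[]) {
    int pLen = strlen(p);
    next[0] = -1;
    int k = -1;
    int j = 0;
    while (j < pLen - 1) {
        // p[k] is the end of the prefix, p[j] the end of the suffix
        if (k == -1 || p[j] == p[k]) {
            j++;
            k++;
            if (p[j] != p[k]) {
                next[j] = k;
            } else {
                next[j] = next[k]; // same character would fail again, so skip further
            }
        } else {
            k = next[k];
        }
    }
}

int kmpSearch(char *s, char *p) {
    int sLen = strlen(s);
    int pLen = strlen(p);
    int *next = malloc((pLen + 1) * sizeof(int)); // +1 keeps next[0] valid for an empty pattern
    if (next == NULL)
        return -1;
    getNext(p, next); // build the next array
    int i = 0, j = 0;
    while (i < sLen && j < pLen) {
        // j == -1 means restart from the head of P; otherwise advance on a match
        if (j == -1 || s[i] == p[j]) {
            i++;
            j++;
        } else {
            j = next[j]; // mismatch: slide P using next[], i never moves back
        }
    }
    free(next);
    if (j == pLen)
        return i - j;
    else
        return -1;
}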

BM Algorithm

Introduction and Example

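BM compares starting from the end of the pattern. It is usually faster than KMP in practice, and the idea is clever yet fairly easy to follow.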

Example: in the walkthrough below, S denotes the text and P the pattern; only 'P' in quotes denotes the character.
Text S[]: HERE IS A SIMPLE EXAMPLE
Pattern P[]: EXAMPLE

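Align the heads of S and P and compare starting from the tail.
1. 'S' and 'E' mismatch. 'S' is called the "bad character"; the mismatch is at index 6 of P.
2. Look for the rightmost 'S' in P. There is none, which is treated as 'S' occurring at index -1.
Shift P right by 6-(-1)=7 positions and go to the next step.
1. 'P' and 'E' mismatch. 'P' is the bad character; the mismatch is at index 6 of P.
2. Look for the rightmost 'P' in P. It occurs at index 4.
Shift P right by 6-4=2 positions so that the two 'P's line up.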

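Compare from the end of P again: ① 'E' matches ② 'L' matches ③ 'P' matches ④ 'M' matches
So the trailing 'M', 'P', 'L', 'E' all match. Every matched string at the tail is called a "good suffix"; note that "MPLE", "PLE", "LE" and "E" are all good suffixes.
The comparison then reaches 'I' against 'A':
1. 'I' and 'A' mismatch. 'I' is the bad character; the mismatch is at index 2 of P.
2. Look for 'I' in P. It is not found, so its index is taken as -1.
3. Shift P by 2-(-1)=3 positions?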
  
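  [Analysis] Is there a more efficient rule?
4. Consider the good suffixes. There are four candidates: "MPLE", "PLE", "LE", "E". Of these only "E" also occurs at the head of P ("EXAMPLE"), so the good suffix to use is "E". Taking the index of a suffix to be the index of its last character, the good suffix is at index 6 of P and its previous occurrence in P is at index 0.
5. Shift P by 6-0=6 positions, or by the 3 positions from before?
6. Two shifts are now available; for efficiency, take the larger one.
Shift P by 6-0=6 positions and go to the next step.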
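1. 'P' and 'E' mismatch; the mismatch is at index 6 of P.
2. Look for the rightmost occurrence of the bad character 'P' in P. It occurs at index 4.
Shift P by 6-4=2 positions and go to the next step.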

Comparing from the last character of P, every character matches and the algorithm ends.

In summary, the BM algorithm has two rules:

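Bad character rule
When a character of S fails to match the character of P it is compared against, that character of S is called the bad character and the pattern must move right. Shift = index in P of the mismatch position - index of the rightmost occurrence of the bad character in P. If the bad character does not occur in P, its rightmost occurrence is taken to be index -1.
Good suffix rule
When a mismatch occurs, shift = index of the good suffix in P - index of the previous occurrence of the good suffix in P. If the good suffix does not occur again in P, the previous occurrence is taken to be index -1.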

C Implementation

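#include <stdio.h>

/* These four macros are not defined in the listing; the values below are assumptions. */
#define ASIZE 256                         /* alphabet size */
#define XSIZE 256                         /* maximum pattern length */
#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define OUTPUT(j) printf("%d\n", (j))     /* report a match at index j */

/* Bad-character table: distance from the rightmost occurrence of each
   character (last position excluded) to the end of the pattern. */
void preBmBc(char *x, int m, int bmBc[]) {
   int i;
   for (i = 0; i < ASIZE; ++i)
      bmBc[i] = m;
   for (i = 0; i < m - 1; ++i)
      bmBc[(unsigned char)x[i]] = m - i - 1;
}

/* suff[i] = length of the longest suffix of x that ends at position i. */
void suffixes(char *x, int m, int *suff) {
   int f = 0, g, i;
   suff[m - 1] = m;
   g = m - 1;
   for (i = m - 2; i >= 0; --i) {
      if (i > g && suff[i + m - 1 - f] < i - g)
         suff[i] = suff[i + m - 1 - f];
      else {
         if (i < g)
            g = i;
         f = i;
         while (g >= 0 && x[g] == x[g + m - 1 - f])
            --g;
         suff[i] = f - g;
      }
   }
}

/* Good-suffix table built from suff[]. */
void preBmGs(char *x, int m, int bmGs[]) {
   int i, j, suff[XSIZE];
   suffixes(x, m, suff);
   for (i = 0; i < m; ++i)
      bmGs[i] = m;
   j = 0;
   for (i = m - 1; i >= 0; --i)
      if (suff[i] == i + 1)
         for (; j < m - 1 - i; ++j)
            if (bmGs[j] == m)
               bmGs[j] = m - 1 - i;
   for (i = 0; i <= m - 2; ++i)
      bmGs[m - 1 - suff[i]] = m - 1 - i;
}

/* x: pattern of length m, y: text of length n. Reports every match. */
void BM(char *x, int m, char *y, int n) {
   int i, j, bmGs[XSIZE], bmBc[ASIZE];
   /* Preprocessing */
   preBmGs(x, m, bmGs);
   preBmBc(x, m, bmBc);
   /* Searching */
   j = 0;
   while (j <= n - m) {
      for (i = m - 1; i >= 0 && x[i] == y[i + j]; --i);
      if (i < 0) {
         OUTPUT(j);
         j += bmGs[0];
      }
      else
         j += MAX(bmGs[i], bmBc[(unsigned char)y[i + j]] - m + 1 + i);
   }
}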

Sunday Algorithm

Introduction and Example

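Both KMP and BM have linear search time in the worst case.
In practice KMP is not much faster than the plain C library function strstr().
BM is often 3-5 times faster than KMP.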

Text S[]: "substring searching algorithm"
Pattern P[]: "search"

Align S and P at the left.
```
substring searching algorithm
search
 ^    ~
```
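1. The mismatch is at the second character (marked by ^).
2. Take the text character just after the last character of the current window (the letter i, marked by ~).
3. i does not occur in P, so P can skip a long way: shift = pattern length + 1 = 6+1 = 7, and matching resumes at the character after i (n).
P shifts right by 6+1=7 positions and matching resumes at n.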
```
substring searching algorithm
       search
       ^     ~
```
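1. The mismatch is at the first character (marked by ^).
2. Take the text character just after the last character of the current window (the character r, marked by ~).
3. 'r' is the third character from the end of P, so the distance from 'r' to the end of the pattern, plus 1, is 2+1 = 3.
4. Line up the two 'r's.
P shifts by 2+1=3 positions so that the 'r's line up.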
```
substring searching algorithm
          search
          ^
```
1. Compare from the first character of P to the right; the match succeeds.

C++ Implementation

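#include <cstdio>
#include <cstring>

// One char is 8 bits, so there are at most 256 distinct values
#define MAX_CHAR_SIZE 256
#define MAXSIZE 100

/* Build the shift table, keyed by character and based on each character's
   rightmost position in the pattern.
   If the text character just past the current window does not occur in the
   pattern, shift = pattern length + 1.
   If it does occur, shift = pattern length - its rightmost index in the pattern. */
int *setCharStep(char *subStr)
{
    int i;
    int *charStep = new int[MAX_CHAR_SIZE];
    int subStrLen = strlen(subStr);
    for (i = 0; i < MAX_CHAR_SIZE; i++)
        charStep[i] = subStrLen + 1;
    // Scan left to right so that the rightmost occurrence wins
    for (i = 0; i < subStrLen; i++)
    {
        charStep[(unsigned char)subStr[i]] = subStrLen - i;
    }
    return charStep;
}

/* Core idea: compare left to right; on a mismatch, look at the first text
   character to the right of the current window, find its rightmost position
   in the pattern, and advance by the precomputed shift until a match is found. */
int sundaySearch(char *mainStr, char *subStr, int *charStep)
{
    int mainStrLen = strlen(mainStr);
    int subStrLen = strlen(subStr);
    int main_i = 0;
    int sub_j = 0;
    while (main_i < mainStrLen)
    {
        // Remember where this attempt started; the shift is applied from here
        int tem = main_i;
        while (sub_j < subStrLen)
        {
            if (mainStr[main_i] == subStr[sub_j])
            {
                main_i++;
                sub_j++;
                continue;
            }
            else
            {
                // No character to the right of the window: the search fails
                if (tem + subStrLen > mainStrLen)
                    return -1;
                // Otherwise shift the window from its start and retry.
                // (The shift must be added to tem, not to the already advanced main_i.)
                char firstRightChar = mainStr[tem + subStrLen];
                main_i = tem + charStep[(unsigned char)firstRightChar];
                sub_j = 0;
                break; // abandon this failed attempt and start a new one
            }
        }
        if (sub_j == subStrLen)
            return main_i - subStrLen;
    }
    return -1;
}

int main()
{
    char mainStr[MAXSIZE];
    char subStr[MAXSIZE];
    int answer, i;
    printf("\nSunday String Searching Program");
    printf("\n===============================");
    printf("\n\nmainStr String --> ");
    if (!fgets(mainStr, MAXSIZE, stdin))
        return 1;
    mainStr[strcspn(mainStr, "\n")] = '\0'; // fgets keeps the newline; strip it
    printf("\nsubStr String --> ");
    if (!fgets(subStr, MAXSIZE, stdin))
        return 1;
    subStr[strcspn(subStr, "\n")] = '\0';
    int *charStep = setCharStep(subStr);
    if ((answer = sundaySearch(mainStr, subStr, charStep)) >= 0)
    {
        printf("\n");
        printf("%s\n", mainStr);
        for (i = 0; i < answer; i++)
            printf(" ");
        printf("%s", subStr);
        printf("\n\nPattern Found at location %d\n", answer);
    }
    else
        printf("\nPattern NOT FOUND.\n");
    delete[] charStep;
    return 0;
}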

AC Algorithm

Original: http://blog.chinaunix.net/uid-22237530-id-1781824.html

Introduction

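AC: the Aho-Corasick automaton is a multi-pattern matching algorithm.
Steps: build a Trie, build the failure pointers, then run the matching process.

Building the Trie
Building the failure pointers
Let the letter on the current node be C. Follow the failure pointer of its parent, and keep following failure pointers until reaching a node that has a child labeled C; then point the current node's failure pointer at that child. If the walk reaches the root without finding one, point the failure pointer at the root.
Matching process
1. The current character matches: there is an edge from the current node for that character, so follow it to the next node and advance the text pointer to the next character.
2. The current character does not match: move to the node that the current node's failure pointer points to and continue matching there; this fallback stops once the pointer reaches the root.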

C++ Implementation

#include <iostream>
#include <cstring>
#include <cstdio>
using namespace std;

const int MAXN = 1000001; // maximum text length is MAXN - 1
const int MAXM = 51;      // maximum keyword length is MAXM - 1
const int KEYSIZE = 26;   // 26 lowercase letters

struct Node {
    Node *fail;          // failure pointer
    Node *next[KEYSIZE]; // child pointers, one per letter
    int count;           // number of keywords ending at this node
    Node() {
        fail = NULL;
        count = 0;
        memset(next, 0, sizeof(next));
    }
} *q[MAXN / 2];          // queue used by the breadth-first traversal

void insert(char *str, Node *root)
{
    Node *p = root;
    int i = 0;
    while (str[i]) {
        int index = str[i] - 'a';
        if (p->next[index] == NULL)
            p->next[index] = new Node();
        p = p->next[index];
        i++;
    }
    p->count++; // count + 1 on the last node of a keyword marks one keyword
}

void build_ac_automation(Node *root)
{
    root->fail = NULL;
    int head, tail;
    head = tail = 0;
    q[tail++] = root;
    while (head < tail) {
        Node *temp = q[head++];
        for (int i = 0; i < KEYSIZE; i++) {
            if (temp->next[i] != NULL) {
                if (temp == root) {
                    temp->next[i]->fail = root;
                } else {
                    // Walk the parent's failure chain looking for a child with the same letter
                    Node *p = temp->fail;
                    while (p != NULL) {
                        if (p->next[i] != NULL) {
                            temp->next[i]->fail = p->next[i];
                            break;
                        }
                        p = p->fail;
                    }
                    if (p == NULL)
                        temp->next[i]->fail = root;
                }
                q[tail++] = temp->next[i];
            }
        }
    }
}

int AC_search(char *str, Node *root)
{
    int i = 0, cnt = 0;
    Node *p = root;
    while (str[i]) {
        int index = str[i] - 'a';
        while (p->next[index] == NULL && p != root)
            p = p->fail;
        p = p->next[index];
        p = (p == NULL) ? root : p;
        Node *temp = p;
        // Collect every keyword ending here; count = -1 marks a node as already counted
        while (temp != root && temp->count != -1) {
            cnt += temp->count;
            temp->count = -1;
            temp = temp->fail;
        }
        i++;
    }
    return cnt;
}

int main()
{
    int n;
    Node *root;
    char keyword[MAXM];    // keyword
    static char str[MAXN]; // text (static because about 1 MB is too large for the stack)
    printf("scanf the number of words-->\n");
    scanf("%d", &n);
    root = new Node();
    printf("scanf the words-->\n");
    while (n--) {
        scanf("%s", keyword);
        insert(keyword, root);
    }
    build_ac_automation(root);
    printf("scanf the text-->\n");
    scanf("%s", str);
    printf("there are %d words match\n", AC_search(str, root));
    return (0);
}
