KMP

Source: Internet  Editor: 程序博客网  Time: 2024/05/17 04:00

Oulipo
Time Limit: 1000MS  Memory Limit: 65536K
Total Submissions: 20878  Accepted: 8334

Description

The French author Georges Perec (1936–1982) once wrote a book, La disparition, without the letter 'e'. He was a member of the Oulipo group. A quote from the book:

Tout avait l’air normal, mais tout s’affirmait faux. Tout avait l’air normal, d’abord, puis surgissait l’inhumain, l’affolant. Il aurait voulu savoir où s’articulait l’association qui l’unissait au roman : sur son tapis, assaillant à tout instant son imagination, l’intuition d’un tabou, la vision d’un mal obscur, d’un quoi vacant, d’un non-dit : la vision, l’avision d’un oubli commandant tout, où s’abolissait la raison : tout avait l’air normal mais…

Perec would probably have scored high (or rather, low) in the following contest. People are asked to write a perhaps even meaningful text on some subject with as few occurrences of a given “word” as possible. Our task is to provide the jury with a program that counts these occurrences, in order to obtain a ranking of the competitors. These competitors often write very long texts with nonsense meaning; a sequence of 500,000 consecutive 'T's is not unusual. And they never use spaces.

So we want to quickly find out how often a word, i.e., a given string, occurs in a text. More formally: given the alphabet {'A', 'B', 'C', …, 'Z'} and two finite strings over that alphabet, a word W and a text T, count the number of occurrences of W in T. All the consecutive characters of W must exactly match consecutive characters of T. Occurrences may overlap.

Input

The first line of the input file contains a single number: the number of test cases to follow. Each test case has the following format:

  • One line with the word W, a string over {'A', 'B', 'C', …, 'Z'}, with 1 ≤ |W| ≤ 10,000 (here |W| denotes the length of the string W).
  • One line with the text T, a string over {'A', 'B', 'C', …, 'Z'}, with |W| ≤ |T| ≤ 1,000,000.

Output

For every test case in the input file, the output should contain a single number, on a single line: the number of occurrences of the word W in the text T.

Sample Input

3
BAPC
BAPC
AZA
AZAZAZA
VERDI
AVERDXIVYERDIAN

Sample Output

1
3
0

Code:

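#include <iostream>
#include <cstring>
#include <cstdio>
#define Maxn 1000010
using namespace std;

char pat[Maxn], ob[Maxn];  // pat: the word W (pattern); ob: the text T
int nxt[Maxn];             // failure table; named nxt because "next" can clash with std::next

// nxt[j] is the length of the longest proper prefix of pat[0..j-1]
// that is also a suffix of it; nxt[0] = -1 is a sentinel.
void get_next(int lenpat) {
    nxt[0] = -1;
    int j = 0, k = -1;
    while (j < lenpat - 1) {
        if (k == -1 || pat[k] == pat[j]) {
            k++; j++;
            nxt[j] = k;
        } else {
            k = nxt[k];
        }
    }
}

// Count the occurrences (overlaps included) of pat in ob.
int KMP(int lenpat, int lenob) {
    int i = 0, j = 0, cou = 0;
    while (i < lenob && j <= lenpat) {
        if (j == lenpat) {
            // Full match: count it, step back one character and fall back
            // through nxt so that overlapping occurrences are still found.
            i--; j--;
            cou++;
            j = nxt[j];
        }
        if (j == -1 || ob[i] == pat[j]) {
            i++; j++;
        } else {
            j = nxt[j];
        }
    }
    if (j >= lenpat) cou++;  // a match that ends on the last character of the text
    return cou;
}

int main() {
    int t;
    cin >> t;
    while (t--) {
        scanf("%s%s", pat, ob);
        int lenpat = strlen(pat), lenob = strlen(ob);
        get_next(lenpat);
        printf("%d\n", KMP(lenpat, lenob));
    }
    return 0;
}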

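Note: this is a bare-bones KMP problem. Speaking of this algorithm, one can only bow in admiration: it was discovered by three masters (Knuth, Morris and Pratt), of whom Knuth is probably the most formidable, so most people cannot grasp the core of the algorithm in a short time. Explaining it clearly in a short space is also very hard for me; something like this is best understood by experimenting with it yourself.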

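Still, let me sketch the core of the algorithm to help beginners get started. It differs from the naive string-matching method (BF) in that it makes use of information that BF throws away.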
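For reference, the naive method (BF) mentioned above can be sketched roughly as follows. This is only a baseline for comparison, not the solution used in this post; the name count_naive is made up for illustration, and std::string is used instead of the global char arrays for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Brute-force (BF) baseline: try every start position in the text and
// compare the word character by character. Worst case O(|W| * |T|).
// count_naive is a hypothetical name, not part of the original solution.
int count_naive(const std::string& word, const std::string& text) {
    assert(!word.empty());  // the problem guarantees 1 <= |W|
    if (word.size() > text.size()) return 0;
    int count = 0;
    for (std::size_t start = 0; start + word.size() <= text.size(); ++start) {
        std::size_t j = 0;
        while (j < word.size() && text[start + j] == word[j]) ++j;
        // A full match is counted; the next attempt starts at start + 1,
        // so overlapping occurrences are counted as well.
        if (j == word.size()) ++count;
    }
    return count;
}
```

On inputs like the 500,000 consecutive 'T's from the problem statement, each start position can cost up to |W| comparisons, which is why this baseline is likely too slow under the given limits.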

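The BF algorithm pays no attention to what it has already seen: when a mismatch occurs, it simply goes back to the position right after where the current attempt started and begins again. This is very inefficient, because a mismatch also tells us that the characters before it did match. Going back can be viewed as sliding the pattern to the right. After such a shift, the attempt can only succeed if the first few characters of the pattern match the few characters of the text just before the mismatch position. But, as noted, the pattern and the text agree up to the mismatch, so those characters of the text are the same as the characters of the pattern just before the mismatch position. The condition therefore becomes: the first few characters of the pattern must match the few characters of the pattern just before the mismatch position. This shows that the shift distance depends only on the pattern and not on the text, so the pattern can be preprocessed. That preprocessing is what get_next does in the code above.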
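To make the "depends only on the pattern" point concrete, here is a minimal sketch of the preprocessing and the counting loop. It uses the common prefix-function convention (fail[i] is the border length of word[0..i], with no -1 sentinel), which is shifted by one compared with the table in the code above; build_fail and count_kmp are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// fail[i] = length of the longest proper prefix of word[0..i] that is
// also a suffix of word[0..i]. Only the pattern is inspected here.
std::vector<int> build_fail(const std::string& word) {
    std::vector<int> fail(word.size(), 0);
    int k = 0;  // length of the border currently being extended
    for (std::size_t i = 1; i < word.size(); ++i) {
        while (k > 0 && word[i] != word[k]) k = fail[k - 1];
        if (word[i] == word[k]) ++k;
        fail[i] = k;
    }
    return fail;
}

// Count the occurrences (overlaps included) of word in text.
int count_kmp(const std::string& word, const std::string& text) {
    assert(!word.empty());  // the problem guarantees 1 <= |W|
    const std::vector<int> fail = build_fail(word);
    const int m = static_cast<int>(word.size());
    int count = 0;
    int k = 0;  // number of pattern characters currently matched
    for (std::size_t i = 0; i < text.size(); ++i) {
        // On a mismatch, slide the pattern using the table; i never moves back.
        while (k > 0 && text[i] != word[k]) k = fail[k - 1];
        if (text[i] == word[k]) ++k;
        if (k == m) {
            ++count;
            k = fail[k - 1];  // keep the longest border so overlaps are found
        }
    }
    return count;
}
```

For the pattern AZA the table is {0, 0, 1}: after matching AZA, the trailing A is also a prefix of the pattern, so the search continues with one character already matched instead of starting over.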

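Having written this much, I hope the principle is a little clearer now; to learn it in more depth, you will still have to work through it yourself!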