[String hash][Heap sort][AC automaton][usaco3.1.5]Contact

Source: Internet. Editor: 程序博客网. Time: 2024/04/29 17:02

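Description

The cows have become interested in scanning the universe outside their pasture with radio telescopes. Recently they noticed a very strange pulse-modulated microwave emission coming from the center of the galaxy. They want to know whether the signal is being transmitted by some form of extraterrestrial life or whether it is just the emission of ordinary stars.

Help the cows find the truth with a tool that analyzes the records they have written to a file. They are looking for the bit patterns of length A through B (inclusive, 1 <= A <= B <= 12) that repeat most often in each day's data file. An input limit tells you how many of the most frequent patterns to output.

Occurrences of a pattern may overlap, and only patterns that occur at least once are counted.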

Format

PROGRAM NAME: contact

INPUT FORMAT:

(file contact.in)

Line 1: Three space-separated integers: A, B, N; (1 <= N < 50)

Line 2 and beyond: A sequence of up to 200,000 characters, all 0 or 1; each line holds at most 80 characters.

OUTPUT FORMAT:

(file contact.out)

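Output the patterns with the N highest frequencies, in descending order of frequency. Patterns with the same frequency are ordered from shortest to longest; patterns of equal length are ordered by their binary value. If fewer than N frequencies occur, output those that exist.

For each frequency that occurs, first print a line containing only that frequency, then print the patterns with that frequency separated by spaces, six per line (unless fewer than six remain).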

SAMPLE INPUT

    2 4 10
    01010010010001000111101100001010011001111000010010011110010000000

In the sample, the pattern 100 occurs 12 times and the pattern 1000 occurs 5 times. The most frequent pattern is 00, which occurs 23 times.
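The three counts quoted for the sample can be checked with a brute-force counter that allows overlapping matches. This is only a verification sketch; `countOverlapping` is an illustrative helper name and is not part of either solution in this post.

```cpp
#include <string>

// Count how many times `pattern` occurs in `text`, allowing overlaps.
// Every start position is tested, so "000" contains "00" twice.
int countOverlapping(const std::string& text, const std::string& pattern) {
    if (pattern.empty() || pattern.size() > text.size()) return 0;
    int total = 0;
    for (std::size_t i = 0; i + pattern.size() <= text.size(); ++i) {
        if (text.compare(i, pattern.size(), pattern) == 0) ++total;
    }
    return total;
}
```

Called on the 65-character sample string, it should report 23 for `"00"`, 12 for `"100"` and 5 for `"1000"`.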

SAMPLE OUTPUT

    23
    00
    15
    01 10
    12
    100
    11
    11 000 001
    10
    010
    8
    0100
    7
    0010 1001
    6
    111 0000
    5
    011 110 1000
    4
    0001 0011 1100

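This one should also have passed on the first try. However, USACO's format requirements for this problem's test data are different, so I had to test offline against the downloaded data package.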

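KMP is the first approach that comes to mind:

The outer loop enumerates the pattern strings. Because a and b are small, their contribution can be ignored, and the number of patterns has an approximate upper bound of n.

The inner step is KMP, which is O(n), so the total time complexity is O(n^2). That is too slow to pass all the test data within the given limits.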

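Another approach is string hashing, using separate chaining to keep the space requirement manageable.

BKDRhash keeps the code fairly short and is reasonably efficient.

As before, enumerate every substring whose length is between a and b and compute its hash value; this gives the number of times each string occurs.

Then enumerate all hash values and push every string stored under them onto a heap.

Entries are compared by occurrence count first, then by numeric value. (To compare numeric values, compare the string lengths first and then the strings themselves.)

Finally, pop the entries in the order the problem requires.

The output format requirements are rather fiddly.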
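Since the output rules are where mistakes are easiest to make, here is a minimal sketch of just the formatting step, kept separate from the counting. It assumes the entries are already sorted in output order; `formatReport` and its `(count, pattern)` pair layout are illustrative choices, not taken from the listings in this post.

```cpp
#include <string>
#include <utility>
#include <vector>

// Format (count, pattern) pairs that are already sorted in output order.
// Prints at most n distinct frequencies, six patterns per line.
std::string formatReport(const std::vector<std::pair<int, std::string>>& sorted, int n) {
    std::string out;
    int groups = 0;   // number of distinct frequencies printed so far
    int onLine = 0;   // number of patterns on the current line
    int last = -1;    // frequency of the current group
    for (const auto& e : sorted) {
        if (e.first != last) {
            if (groups == n) break;            // N frequencies are already printed
            if (groups > 0) out += '\n';       // finish the previous pattern line
            ++groups;
            last = e.first;
            out += std::to_string(e.first) + '\n' + e.second;
            onLine = 1;
        } else if (onLine == 6) {
            out += '\n' + e.second;            // wrap after six patterns
            onLine = 1;
        } else {
            out += ' ' + e.second;
            ++onLine;
        }
    }
    if (!out.empty()) out += '\n';
    return out;
}
```

Building the whole report as a string first makes the wrap-around and the "stop after N frequencies" rule easy to test without files.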

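While reviewing this later, I found that BKDRhash is unnecessary: the binary string can simply be converted to an integer, which also distributes far more evenly.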
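A minimal sketch of that idea follows. One detail is an assumption on my part: a plain binary-to-integer conversion maps "01" and "001" to the same value, so the sketch prepends a sentinel 1 bit to keep different lengths apart. With B <= 12 every key is below 8192, so a plain array replaces the hash table. The names `countPatterns` and `keyOf` are illustrative.

```cpp
#include <string>
#include <vector>

// Key of a 0/1 pattern: its bits with one extra leading 1,
// so "01" -> 0b101 (5) and "001" -> 0b1001 (9) stay distinct.
int keyOf(const std::string& pattern) {
    int key = 1;
    for (char c : pattern) key = (key << 1) | (c - '0');
    return key;
}

// freq[keyOf(p)] = number of (possibly overlapping) occurrences of p in `bits`,
// for every pattern p whose length lies in [a, b].
std::vector<int> countPatterns(const std::string& bits, int a, int b) {
    std::vector<int> freq(std::size_t(1) << (b + 1), 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        int key = 1;  // sentinel bit
        // Extend the pattern starting at position i one character at a time.
        for (int len = 1; len <= b && i + len <= bits.size(); ++len) {
            key = (key << 1) | (bits[i + len - 1] - '0');
            if (len >= a) ++freq[key];
        }
    }
    return freq;
}
```

This does O(n * B) work with no collisions to resolve; a pattern is recovered from its key by dropping the leading 1 bit.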

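    /*
    ID: wuyihao1
    LANG: C++
    TASK: contact
    */
    #include <cstdio>
    #include <string>
    #include <queue>
    #include <iostream>
    using std::string;
    using std::cout;
    using std::priority_queue;

    const int hmod = 100009;   // number of hash buckets
    char str[200010];          // input bits, 1-indexed

    // Hash-table entry (separate chaining)
    struct node {
        node* nxt;
        string str;
        int amt;               // occurrence count
    };
    node* head[hmod];

    // Heap entry: the "largest" element is the one to print first
    struct node2 {
        int amt;
        string str;
        node2() {}
        node2(int aa, string& ss) { amt = aa; str = ss; }
        bool operator<(const node2& n2) const {
            if (amt != n2.amt) return amt < n2.amt;                              // higher count first
            if (str.size() != n2.str.size()) return str.size() > n2.str.size();  // then shorter
            return str > n2.str;                                                 // then smaller binary value
        }
    };
    priority_queue<node2> heap;

    // Record one occurrence of ss, whose hash value is h
    void inc(int h, string ss) {
        h = h % hmod;
        for (node* nn = head[h]; nn; nn = nn->nxt)
            if (ss == nn->str) {
                nn->amt++;
                return;
            }
        node* nn = new node;
        nn->str = ss;
        nn->nxt = head[h];
        nn->amt = 1;
        head[h] = nn;
    }

    // BKDR hash of a NUL-terminated string
    unsigned int gethash(char* str) {
        unsigned int seed = 131;
        unsigned int hash = 0;
        while (*str) {
            hash = hash * seed + (*str++);
        }
        return (hash & 0x7FFFFFFF);
    }

    int main() {
        freopen("contact.in", "r", stdin);
        freopen("contact.out", "w", stdout);
        int a, b, n;
        scanf("%d%d%d", &a, &b, &n);
        // Read the bits into str[1..t], skipping newlines
        int t = 0;
        while (1) {
            if (scanf("%c", &str[++t]) != 1) break;
            if (str[t] == '\n') t--;
        }
        t--;
        // Count every substring of length a..b
        for (int l = a; l < b + 1; l++) {
            for (int i = 1; i + l - 1 < t + 1; i++) {
                int j = i + l - 1;
                char ttt = str[j + 1];
                str[j + 1] = 0;              // temporarily terminate the substring
                inc(gethash(str + i), string(str + i, str + j + 1));
                str[j + 1] = ttt;
            }
        }
        // Move every counted string onto the heap
        for (int i = 0; i < hmod; i++) {
            for (node* nn = head[i]; nn; nn = nn->nxt) {
                heap.push(node2(nn->amt, nn->str));
            }
        }
        int j = 0;      // patterns printed on the current line
        int last = 0;   // frequency of the current group
        int cnt = 0;    // frequencies printed so far
        node2 tmp;
        while (!heap.empty()) {
            tmp = heap.top();
            heap.pop();
            if (tmp.amt == last) {
                j++;
                if (j == 7) {                // six per line: wrap and restart the counter
                    cout << '\n' << tmp.str;
                    j = 1;
                } else {
                    cout << ' ' << tmp.str;
                }
            } else {
                if (cnt == n) break;         // N frequencies already printed
                if (cnt) cout << '\n';
                cnt++;
                j = 1;
                cout << tmp.amt << '\n' << tmp.str;
                last = tmp.amt;
            }
        }
        cout << '\n';
        return 0;
    }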

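If this problem is solved with an AC automaton, it turns out to be a rather special case:

1. The trie is a binary tree.

2. A lookup never fails: the patterns include every 0/1 string of length a to b, so after following fail links a match is found at the root at the latest.

This AC automaton implementation is adapted from one found online, and it takes full advantage of the problem's special structure.

Because the trie is a binary tree, it is stored the way a heap is, which saves a lot of space, speeds things up considerably, and simplifies the code. Enumerating the nodes from 1 upward is already a breadth-first traversal, which is a neat trick.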
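To make the heap-style numbering concrete, the sketch below assumes the root is node 0 and the children of node p are 2p+1 (bit 0) and 2p+2 (bit 1). The closed form in `nodeOfClosedForm` is my own observation about that layout rather than something stated in the post, and all three function names are illustrative.

```cpp
#include <string>

// Child of node p along `bit` in the implicit complete binary trie (root = 0).
int childOf(int p, int bit) { return 2 * p + 1 + bit; }

// Index of the node that spells `s`, found by walking down from the root.
int nodeOf(const std::string& s) {
    int p = 0;
    for (char c : s) p = childOf(p, c - '0');
    return p;
}

// Same index without walking: 2^len - 1 nodes are shallower than depth len,
// and the binary value of s selects the slot inside that depth.
int nodeOfClosedForm(const std::string& s) {
    int value = 0;
    for (char c : s) value = value * 2 + (c - '0');
    return (1 << s.size()) - 1 + value;
}
```

Because all nodes of depth d come before all nodes of depth d+1, visiting the indices in increasing order is a breadth-first traversal, and visiting them in decreasing order processes longer strings before shorter ones.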

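The counting step works as follows. When a string occurs some number of times, each of its substrings occurs at least that many additional times, so a substring's count is greater than or equal to its superstring's count. The extra occurrences have already been counted directly (when fail points to the root and the match then succeeds). The superstring's count therefore has to be added onto the substring's, working from longer strings to shorter ones; enumerating the nodes in reverse order always satisfies this topological order.

Put another way: let string A be a substring of string B. Let A also denote the set of distinct occurrences of string A, and let $S(A)$ be the number of those occurrences. Then $B \subseteq A$, $S(A) \ge S(B)$, and $S(A) = S(B) + S(A \cap \complement_U B)$.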
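The accumulation can be sketched on the heap-numbered trie as follows. This is a simplified illustration under my own assumptions, not the listing from this post: it counts every length from 1 to b (filtering by A is left to the caller), it computes each fail link directly as "the same string without its first character", and `countViaSuffixLinks` is a hypothetical name.

```cpp
#include <string>
#include <vector>

// occ[node] = number of occurrences of the node's string in `bits`.
// Nodes are numbered like a heap: root 0, children 2p+1 and 2p+2, depth <= b.
std::vector<int> countViaSuffixLinks(const std::string& bits, int b) {
    int nodes = (1 << (b + 1)) - 1;
    std::vector<int> depth(nodes, 0), fail(nodes, 0), occ(nodes, 0);
    for (int p = 1; p < nodes; ++p) {
        int parent = (p - 1) / 2, bit = (p - 1) % 2;
        depth[p] = depth[parent] + 1;
        // Dropping the first character of p's string gives the child
        // (same bit) of the parent's fail node; depth-1 nodes fall to the root.
        fail[p] = depth[p] == 1 ? 0 : 2 * fail[parent] + 1 + bit;
    }
    int u = 0;
    for (char c : bits) {
        if (depth[u] == b) u = fail[u];   // at maximum depth: drop the oldest character
        u = 2 * u + 1 + (c - '0');
        ++occ[u];                         // only the longest match ending here is counted
    }
    // Reverse index order visits longer strings first, so each count is
    // complete before it is added to the suffix it fails to.
    for (int p = nodes - 1; p > 0; --p) occ[fail[p]] += occ[p];
    return occ;
}
```

In the relation above, the direct increments correspond to $S(A \cap \complement_U B)$ and the reverse loop adds the $S(B)$ terms. On the sample with b = 4, the nodes for 00, 100 and 1000 (indices 3, 11 and 23) should end with 23, 12 and 5.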

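~~The comparison function is not well written. There is a string comparison function that matches the qsort format and returns -1, 0, or 1, but I have forgotten what it is.~~

After comparing the lengths, the comparison function can use string::compare, which matches the format qsort expects.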
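A sketch of such a comparator is shown below. It sorts an array of pointers rather than the records themselves, because `qsort` may only move trivially copyable elements and `std::string` is not one. The `Pattern` struct and `comparePatterns` are illustrative names. Note that `string::compare` returns a negative, zero, or positive value rather than exactly -1, 0, or 1, which is all `qsort` needs.

```cpp
#include <cstdlib>
#include <string>

struct Pattern {
    int count;          // number of occurrences
    std::string bits;   // the 0/1 pattern
};

// qsort comparator over an array of `const Pattern*`:
// higher count first, then shorter pattern, then smaller binary value.
int comparePatterns(const void* a, const void* b) {
    const Pattern* x = *static_cast<const Pattern* const*>(a);
    const Pattern* y = *static_cast<const Pattern* const*>(b);
    if (x->count != y->count) return x->count > y->count ? -1 : 1;
    if (x->bits.size() != y->bits.size())
        return x->bits.size() < y->bits.size() ? -1 : 1;
    return x->bits.compare(y->bits);   // equal lengths: lexicographic order = binary order
}
```

It would be called as `std::qsort(ptrs, count, sizeof(ptrs[0]), comparePatterns)` on an array `ptrs` of `const Pattern*`.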

    /*
    ID: wuyihao1
    LANG: C++
    TASK: contact
    */
    #include <cstdio>
    #include <iostream>
    #include <cstdlib>
    #include <string>
    using std::cout;
    using std::string;

    int A;
    int B;
    int n;
    char str[200010];       // input bits stored as 0/1, 1-indexed
    int count[9000];        // 1 if the node's length lies in [A, B]
    int depth[9000];
    string postfix[9000];   // the 0/1 string spelled by each node
    int next[9000][2];
    int fail[9000];
    int ans[9000];          // occurrence count per node
    int que[9000];          // indices of the nodes to print
    int qc = 0;

    ///////////////////////////////////////
    // qsort comparator on node indices: count descending,
    // then length ascending, then binary value ascending
    int cmp(const void* a, const void* b) {
        int aa = *(int*)a;
        int bb = *(int*)b;
        if (ans[aa] < ans[bb]) return 1;
        if (ans[aa] > ans[bb]) return -1;
        if (postfix[aa].length() < postfix[bb].length()) return -1;
        if (postfix[aa].length() > postfix[bb].length()) return 1;
        if (postfix[aa] < postfix[bb]) return -1;
        if (postfix[aa] > postfix[bb]) return 1;
        return 0;
    }
    //////////// to be revised ////////////////////////

    int main() {
        freopen("contact.in", "r", stdin);
        freopen("contact.out", "w", stdout);
        scanf("%d%d%d", &A, &B, &n);
        // Read the bits into str[1..len], skipping newlines
        int len = 0;
        while (1) {
            len++;
            if (scanf("%c", str + len) != 1) break;
            if (str[len] == '\n')
                len--;
            else
                str[len] -= '0';
        }
        len--;
        // Build the complete binary trie down to depth B in BFS order
        int q = 0;
        for (int p = 0;; p++) {
            if (depth[p] == 0 && p > 0) break;   // ran past the last node
            if (depth[p] + 1 < B + 1) {
                next[p][0] = ++q;
                depth[q] = depth[p] + 1;
                postfix[q] = postfix[p] + '0';
                if (depth[q] >= A) count[q] = 1;
                next[p][1] = ++q;
                depth[q] = depth[p] + 1;
                postfix[q] = postfix[p] + '1';
                if (depth[q] >= A) count[q] = 1;
            }
        }
        // Fail links, computed in BFS (index) order
        for (int p = 1; p < q + 1; p++) {
            if (next[p][0]) {
                int f = fail[p];
                while (!next[f][0]) {
                    f = fail[f];
                }
                fail[next[p][0]] = next[f][0];
            }
            if (next[p][1]) {
                int f = fail[p];
                while (!next[f][1]) {
                    f = fail[f];
                }
                fail[next[p][1]] = next[f][1];
            }
        }
        // Run the text through the automaton, counting the longest match
        int u = 0;
        for (int i = 1; i < len + 1; i++) {
            while (!next[u][str[i]]) u = fail[u];
            u = next[u][str[i]];
            ans[u] += count[u];
        }
        // Reverse index order: add each string's count to the suffix it fails to
        for (int p = q; p > 0; p--) {
            ans[fail[p]] += ans[p];
        }
        for (int i = 1; i < q + 1; i++) {
            if (ans[i] > 0 && count[i] > 0) que[++qc] = i;
            //cout << ans[i] << ' ' << postfix[i] << '\n';
        }
        qsort(que + 1, qc, sizeof(int), cmp);
        int times = 1;    // patterns printed on the current line
        int times2 = 1;   // frequencies printed so far
        cout << ans[que[1]] << '\n' << postfix[que[1]];
        for (int i = 2; i < qc + 1; i++) {
            if (ans[que[i]] != ans[que[i - 1]]) {
                times = 1;
                times2++;
                if (times2 == n + 1) break;
                cout << '\n' << ans[que[i]] << '\n' << postfix[que[i]];
            } else {
                if (++times == 7) {   // six per line: the counter must advance to wrap
                    times = 1;
                    cout << '\n';
                } else
                    cout << ' ';
                cout << postfix[que[i]];
            }
        }
        cout << '\n';
        return 0;
    }