SGU 142 solution notes

Source: Internet  Editor: 程序博客网  Time: 2024/05/09 22:12

142. Keyword

time limit per test: 0.5 sec.
memory limit per test: 16384 KB

Kevin has invented a new algorithm to crypt and decrypt messages, which he thinks is unbeatable. The algorithm uses a very large key-string, out of which a keyword is found out after applying the algorithm. Then, based on this keyword, the message is easily crypted or decrypted. So, if one would try to decrypt some messages crypted with this algorithm, then knowing the keyword would be enough. Someone has found out how the keyword is computed from the large key-string, but because he is not a very experienced computer programmer, he needs your help. The key-string consists of N characters from the set {'a','b'}. The keyword is the shortest non-empty string made up of the letters 'a' and 'b', which is not contained as a contiguous substring (also called subsequence) inside the key-string. It is possible that more than one such string exists, but the algorithm is designed in such a way that any of these strings can be used as a keyword. Given the key-string, your task is to find one keyword.

Input

The first line contains the integer number N, the number of characters inside the key-string (1 <= N <= 500 000). The next line contains N characters from the set {'a','b'} representing the string.

Output

The first line of output should contain the number of characters of the keyword. The second line should contain the keyword.

Sample Input

11
aabaaabbbab

Sample Output

4
aaaa

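Problem link: http://acm.sgu.ru/problem.php?contest=0&problem=142

Problem summary:

Given a string S consisting only of 'a' and 'b', of length at most 500 000, find a shortest string T over 'a' and 'b' that is not a substring of S.

Solution notes:

First thought: a string problem.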
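The string can be as long as 5*10^5, but since there are only two letters I immediately thought of storing it in binary. A quick estimate: a string of length 5*10^5 can cover slightly fewer than 2^19 states. (At first I summed the states of every length, 2^1+2^2+2^3+...+2^n = 2^(n+1)-2, and concluded that n was at most 18. On reflection that is wrong: if every ab-combination of length T occurs, then every combination of any shorter length occurs as well, for example aab contains ab. So the lengths must not be added together, and the longest possible answer has length 19.)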
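The estimate above can be checked with a small counting sketch. It relies on one fact: a string of length n has only n - L + 1 windows of length L, so it cannot contain all 2^L patterns once 2^L exceeds that count. The function name keywordLengthBound is made up for illustration and is not part of the original solution.

```cpp
#include <cstdint>

// Upper bound on the keyword length for a key-string of length n:
// the smallest L for which the n - L + 1 windows of length L
// are too few to cover all 2^L patterns over {'a','b'}.
// (Hypothetical helper, used only to check the estimate.)
int keywordLengthBound(std::int64_t n) {
    for (int L = 1; ; ++L) {
        std::int64_t windows = n - L + 1;              // may be zero or negative
        std::int64_t patterns = std::int64_t{1} << L;  // 2^L
        if (windows < patterns) return L;              // some pattern must be missing
    }
}
```

For n = 500 000 the loop stops at L = 19, since 2^18 = 262 144 still fits into the 499 983 windows of length 18, while 2^19 = 524 288 does not fit into the 499 982 windows of length 19.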
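I considered hashing the compressed string states, but one point kept bothering me: with a=1 and b=0, the strings a, ba and bba are all stored as 1. So I ruled out the binary approach. That left only brute-force simulation, whose time complexity is too high, and I was stuck =_=#.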
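I had to turn to an editorial; once again I fell just short at the last step. The fix is to record the length along with each state: f[i][j], where i is the length of the string and j is its state. With that, hashing works (my use of hashing had not gone beyond one dimension =_=#).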
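To make the (length, state) idea concrete, here is a minimal sketch using the same convention a=1, b=0. The helper names encode and decode are assumptions for illustration only; the original solution does this inline.

```cpp
#include <string>
#include <utility>

// Encode a string over {'a','b'} as (length, state), with a = 1 and b = 0.
// (Hypothetical helper; the pair plays the role of the indices of f[i][j].)
std::pair<int, unsigned> encode(const std::string& s) {
    unsigned state = 0;
    for (char c : s) state = (state << 1) | (c == 'a' ? 1u : 0u);
    return {static_cast<int>(s.size()), state};
}

// Rebuild the string from (length, state), highest bit first.
std::string decode(int len, unsigned state) {
    std::string s(len, 'b');
    for (int k = 0; k < len; ++k)
        if ((state >> (len - 1 - k)) & 1u) s[k] = 'a';
    return s;
}
```

The strings a, ba and bba all have state 1, but their pairs (1, 1), (2, 1) and (3, 1) are distinct, so the leading b characters are no longer lost.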
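Simulation is slow because it throws away a lot of information. Suppose aab has already been found in the string; to look for aaba, it is enough to filter the positions where aab was found. Simulation instead matches everything from scratch and wastes what it already knows. Filtering on top of aab is awkward to implement, though, so hashing is used instead: at each character, hash the states formed by the up to 19 characters ending there, one state for each length from 1 to 19. Finally, scan from short to long for a state that was never marked. This is also an indirect way of reusing information already obtained.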
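The marking and scanning steps can be sketched as a standalone function that works on a std::string instead of standard input. This is only a sketch under stated assumptions: the name shortestAbsent and the constant MAXL are made up here, and MAXL = 19 is enough only for inputs of at most 500 000 characters.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return one shortest string over {'a','b'} that is not a substring of s.
// Assumes s.size() <= 500000, so an answer of length <= 19 always exists.
std::string shortestAbsent(const std::string& s) {
    const int MAXL = 19;
    std::vector<std::vector<bool>> seen(MAXL + 1);   // seen[len][state]
    for (int len = 1; len <= MAXL; ++len) seen[len].assign(1u << len, false);

    unsigned window = 0;                             // last (up to) 19 characters
    const unsigned full = (1u << MAXL) - 1;
    for (std::size_t i = 0; i < s.size(); ++i) {
        window = ((window << 1) | (s[i] == 'a' ? 1u : 0u)) & full;
        // Mark every substring of length len that ends at position i.
        for (int len = 1; len <= MAXL && static_cast<std::size_t>(len) <= i + 1; ++len)
            seen[len][window & ((1u << len) - 1)] = true;
    }

    // Scan from short to long for the first state that was never marked.
    for (int len = 1; len <= MAXL; ++len)
        for (unsigned v = 0; v < (1u << len); ++v)
            if (!seen[len][v]) {
                std::string t(len, 'b');
                for (int k = 0; k < len; ++k)
                    if ((v >> (len - 1 - k)) & 1u) t[k] = 'a';
                return t;
            }
    return "";  // unreachable under the length assumption above
}
```

On the sample key-string aabaaabbbab this returns a string of length 4. It need not be aaaa, since the problem accepts any shortest absent string.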

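Solution report:

N is at most 5*10^5, and each character takes 1 bit. Since 2^18 < 5*10^5 < 2^19, the required T has length at most 19. Hash the state of every substring of length up to 19, then scan from short to long for a state that was never hashed.

Code: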

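#include <cstdio>

int n, cnt;
bool f[19][530000];  // f[len-1][state]: a substring of length len with this bit pattern occurs

int main() {
    scanf("%d\n", &n);
    for (int i = 1; i <= n; i++) {
        int tmp = (getchar() == 'a');  // a = 1, b = 0
        // Keep only the last 19 characters so the shift never overflows a signed int.
        cnt = ((cnt << 1) + tmp) & ((1 << 19) - 1);
        // Mark every substring of length j (1..19) that ends at character i.
        for (int j = 1, mop = 0; (j <= 19 && i - j >= 0); j++) {
            mop = (mop << 1) + 1;
            f[j - 1][cnt & mop] = true;
        }
    }
    // Scan lengths from short to long for the first state never marked.
    for (int i = 1, mop = 1; i <= 19; i++) {
        mop = mop << 1;  // 2^i states of length i
        for (int j = 0; j < mop; j++)
            if (!f[i - 1][j]) {
                printf("%d\n", i);
                // Print the state from its highest bit down.
                for (int k = 1, sum = 1 << (i - 1); k <= i; k++, sum >>= 1)
                    if (j & sum)
                        printf("a");
                    else
                        printf("b");
                printf("\n");
                return 0;
            }
    }
}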

0 0