DNA Prefix (Trie)

Source: Internet. Editor: 程序博客网. Time: 2024/05/21 09:12

Given a set of n DNA samples, where each sample is a string containing characters from {A, C, G, T}, we are trying to find a subset of samples in the set, where the length of the longest common prefix multiplied by the number of samples in that subset is maximum.

To be specific, let the samples be:

ACGT

ACGTGCGT

ACCGTGC

ACGCCGT

If we take the subset {ACGT} then the result is 4 (4 * 1), if we take {ACGT, ACGTGCGT, ACGCCGT} then the result is 3 * 3 = 9 (since ACG is the common prefix), if we take {ACGT, ACGTGCGT, ACCGTGC, ACGCCGT} then the result is 2 * 4 = 8.

Now your task is to report the maximum result we can get from the samples.

Input

Input starts with an integer T (≤ 10), denoting the number of test cases.

Each case starts with a line containing an integer n (1 ≤ n ≤ 50000) denoting the number of DNA samples. Each of the next n lines contains a non-empty string whose length is not greater than 50. The strings contain only characters from {A, C, G, T}.

Output

For each case, print the case number and the maximum result that can be obtained.

Sample Input

3
4
ACGT
ACGTGCGT
ACCGTGC
ACGCCGT
3
CGCGCGCGCGCGCCCCGCCCGCGC
CGCGCGCGCGCGCCCCGCCCGCAC
CGCGCGCGCGCGCCCCGCCCGCTC
2
CGCGCCGCGCGCGCGCGCGC
GGCGCCGCGCGCGCGCGCTC

Sample Output

Case 1: 9

Case 2: 66

Case 3: 20

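Problem summary: given several DNA sequences (containing only the characters A, C, G, T), find the maximum value of (length of a common prefix) × (number of strings that share that prefix).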
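To make the scoring rule concrete, here is a minimal sketch that counts every prefix in a hash map instead of a trie. It is a reference check under the problem's limits, not the solution presented later, and the function name maxPrefixScoreBruteForce is made up for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical reference helper (not the trie solution): counts every prefix
// of every sample in a hash map and tracks the best length * count product.
long long maxPrefixScoreBruteForce(const std::vector<std::string>& samples) {
    std::unordered_map<std::string, int> seen;  // prefix -> samples starting with it
    long long best = 0;
    for (const std::string& s : samples) {
        for (std::size_t len = 1; len <= s.size(); ++len) {
            // The count only grows, so the final (largest) count of each
            // prefix is evaluated when its last sample is processed.
            int count = ++seen[s.substr(0, len)];
            best = std::max(best, static_cast<long long>(len) * count);
        }
    }
    return best;
}
```

For the four samples in the statement this evaluates ACG with count 3, which gives the score 9 worked out above.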

This problem is a good way to introduce the trie (dictionary tree). (The following is taken from Baidu Baike.)

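1. Definition: Also known as a word-lookup tree or trie, it is a tree structure and a variant of the hash tree. It is typically used to count, sort, and store large numbers of strings (though not only strings), which is why search engines often use it for word-frequency statistics. Its advantage is that it uses the common prefixes of strings to cut query time and avoid unnecessary string comparisons; the quoted entry states that its queries are faster than those of a hash tree.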

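2. Properties:

(1) The root node contains no character; every other node contains exactly one character.

(2) Concatenating the characters on the path from the root to a node gives the string that node represents.

(3) All children of a node contain different characters.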
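The three properties can be sketched as a small node type. This is a minimal illustration, assuming keys made only of lowercase letters 'a' to 'z'; the names TrieNode, insertKey, and countNodes are invented for the example.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

// Assumption: keys use only lowercase letters 'a'..'z'.
struct TrieNode {
    // Slot i is the edge labeled 'a' + i, so two children can never carry
    // the same character (property 3).
    std::array<std::unique_ptr<TrieNode>, 26> child;
    bool isEnd = false;  // true if a stored key ends at this node
};

// Walks from the character-less root (property 1), creating a node only when
// the edge is missing, so keys with a common prefix share that prefix's nodes.
void insertKey(TrieNode& root, const std::string& key) {
    TrieNode* cur = &root;
    for (char ch : key) {
        assert(ch >= 'a' && ch <= 'z');
        std::unique_ptr<TrieNode>& next = cur->child[ch - 'a'];
        if (!next) next = std::make_unique<TrieNode>();
        cur = next.get();
    }
    cur->isEnd = true;
}

// Counts all nodes in the subtree, including the node passed in.
int countNodes(const TrieNode& node) {
    int total = 1;
    for (const auto& c : node.child)
        if (c) total += countNodes(*c);
    return total;
}
```

Inserting "tea" and "ten" produces the root plus the nodes t, e, a, n: the shared prefix "te" is stored once.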

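3. Operations: build, query, delete. A query proceeds as follows:

(1) Start the search at the root node.

(2) Take the first letter of the keyword, select the subtree for that letter, and continue the search in that subtree.

(3) In that subtree, take the second letter of the keyword and again select the matching subtree.

(4) Repeat this process...

(5) When all letters of the keyword have been consumed at some node, read the information attached to that node; the lookup is complete.

The other operations are handled in a similar way.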
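The lookup steps above can be sketched as follows. This assumes lowercase keys and uses an end-of-key flag as the "information attached to the node"; TrieNode, insertKey, and containsKey are illustrative names.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

// Assumption: keys use only lowercase letters 'a'..'z'.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child;
    bool isEnd = false;  // the information attached to the node
};

void insertKey(TrieNode& root, const std::string& key) {
    TrieNode* cur = &root;
    for (char ch : key) {
        assert(ch >= 'a' && ch <= 'z');
        std::unique_ptr<TrieNode>& next = cur->child[ch - 'a'];
        if (!next) next = std::make_unique<TrieNode>();
        cur = next.get();
    }
    cur->isEnd = true;
}

// Steps (1)-(5): start at the root, follow one edge per letter, and read the
// flag on the node reached once every letter has been consumed.
bool containsKey(const TrieNode& root, const std::string& key) {
    const TrieNode* cur = &root;
    for (char ch : key) {
        if (ch < 'a' || ch > 'z') return false;  // outside the assumed alphabet
        cur = cur->child[ch - 'a'].get();
        if (cur == nullptr) return false;        // no subtree for this letter
    }
    return cur->isEnd;
}
```

With "car" and "cart" stored, looking up "ca" reaches a node but finds no end flag, so the prefix alone is not reported as a stored key.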

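4. Applications:

(1) Fast string lookup

Given a vocabulary list of N known words and an article written entirely in lowercase English, list every word that is not in the vocabulary, in order of first appearance.

For this problem we could scan an array, use hashing, or use a trie: first build a trie from the known words, then read the article and look each word up. This approach is quite efficient.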
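A minimal sketch of the trie approach to this exercise, assuming the article is lowercase words separated by whitespace and that each unknown word should be reported once. The names insertIfNew and unfamiliarWords are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Assumption: words use only lowercase letters 'a'..'z'.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child;
    bool isEnd = false;
};

// Inserts the word and returns true if it was not stored before.
bool insertIfNew(TrieNode& root, const std::string& word) {
    TrieNode* cur = &root;
    for (char ch : word) {
        assert(ch >= 'a' && ch <= 'z');
        std::unique_ptr<TrieNode>& next = cur->child[ch - 'a'];
        if (!next) next = std::make_unique<TrieNode>();
        cur = next.get();
    }
    bool wasNew = !cur->isEnd;
    cur->isEnd = true;
    return wasNew;
}

// Returns the words of the article that are not in the known list, in order
// of first appearance. Each reported word is added to the trie so that it is
// not reported a second time.
std::vector<std::string> unfamiliarWords(const std::vector<std::string>& known,
                                         const std::string& article) {
    TrieNode root;
    for (const std::string& w : known) insertIfNew(root, w);
    std::vector<std::string> result;
    std::istringstream in(article);
    std::string word;
    while (in >> word) {
        if (insertIfNew(root, word)) result.push_back(word);
    }
    return result;
}
```

Each word costs time proportional to its length, independent of the size of the vocabulary.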

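(2) Sorting strings

Given N distinct English names, each consisting of a single word, output them in lexicographic order.

Sort with a trie: build the trie with array-based children, so the children of every node are naturally ordered by letter. A preorder traversal of the tree then yields the sorted order.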
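The preorder idea can be sketched like this, assuming the names are lowercase only (real names with capitals would need a larger alphabet or normalization first). The names collectPreorder and trieSort are invented for the example.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Assumption: names use only lowercase letters 'a'..'z'.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child;
    bool isEnd = false;
};

void insertKey(TrieNode& root, const std::string& key) {
    TrieNode* cur = &root;
    for (char ch : key) {
        assert(ch >= 'a' && ch <= 'z');
        std::unique_ptr<TrieNode>& next = cur->child[ch - 'a'];
        if (!next) next = std::make_unique<TrieNode>();
        cur = next.get();
    }
    cur->isEnd = true;
}

// Preorder: emit the node's own key first, then visit the children in
// alphabetical slot order. A prefix therefore comes before its extensions.
void collectPreorder(const TrieNode& node, std::string& prefix,
                     std::vector<std::string>& out) {
    if (node.isEnd) out.push_back(prefix);
    for (int i = 0; i < 26; ++i) {
        if (node.child[i]) {
            prefix.push_back(static_cast<char>('a' + i));
            collectPreorder(*node.child[i], prefix, out);
            prefix.pop_back();
        }
    }
}

std::vector<std::string> trieSort(const std::vector<std::string>& names) {
    TrieNode root;
    for (const std::string& name : names) insertKey(root, name);
    std::vector<std::string> out;
    std::string prefix;
    collectPreorder(root, prefix, out);
    return out;
}
```

Because the exercise guarantees distinct names, no duplicate handling is needed; with duplicates, a per-node counter would replace the flag.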

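(3) Longest common prefix

Build a trie over all the strings. The length of the longest common prefix of two strings equals the depth of the lowest common ancestor of the nodes where they end, so the problem reduces to the lowest common ancestor (LCA) problem.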
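As a minimal sketch of this reduction, the nodes below store a parent pointer and a depth, and the LCA is found by simply climbing toward the root. This naive climb is chosen for brevity; it is not one of the faster LCA algorithms, and the names insertKey and lcpLength are illustrative. Lowercase keys are assumed.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

// Assumption: keys use only lowercase letters 'a'..'z'.
struct TrieNode {
    TrieNode* parent = nullptr;
    int depth = 0;  // number of characters on the path from the root
    std::array<std::unique_ptr<TrieNode>, 26> child;
};

// Inserts the key and returns the node where it ends.
const TrieNode* insertKey(TrieNode& root, const std::string& key) {
    TrieNode* cur = &root;
    for (char ch : key) {
        assert(ch >= 'a' && ch <= 'z');
        std::unique_ptr<TrieNode>& next = cur->child[ch - 'a'];
        if (!next) {
            next = std::make_unique<TrieNode>();
            next->parent = cur;
            next->depth = cur->depth + 1;
        }
        cur = next.get();
    }
    return cur;
}

// Naive LCA: lift the deeper node to the same depth, then climb both nodes
// together until they meet. The meeting node's depth is the LCP length.
int lcpLength(const TrieNode* a, const TrieNode* b) {
    while (a->depth > b->depth) a = a->parent;
    while (b->depth > a->depth) b = b->parent;
    while (a != b) {
        a = a->parent;
        b = b->parent;
    }
    return a->depth;
}
```

When one string is a prefix of the other, the LCA is the end node of the shorter string, so the result is that string's full length.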

......

Now let's look at the code for this problem:

#include <algorithm>
#include <cstdio>
#include <iostream>
#include <string>
#define MAX 4
using namespace std;

struct Trie {
    Trie *next[MAX];
    int cnt;    // number of samples whose prefix ends at this node
} *root;
int ans;

/**** allocate and initialize a new node ****/
Trie *newTrie() {
    Trie *temp = new Trie;
    for (int i = 0; i < MAX; i++)
        temp->next[i] = NULL;
    temp->cnt = 0;
    return temp;
}

/**** free the whole tree to stay within the memory limit ****/
void freedom(Trie *p) {
    for (int i = 0; i < MAX; i++) {    // start at 0 so the 'A' subtree is freed too
        if (p->next[i] != NULL)
            freedom(p->next[i]);
    }
    delete p;
}

/**** insert one sample and update the answer along its path ****/
void SetTrie(const string &s) {
    Trie *p = root;
    int len = s.size();
    for (int i = 0; i < len; i++) {
        int id = 0;
        switch (s[i]) {
        case 'A': id = 0; break;
        case 'C': id = 1; break;
        case 'G': id = 2; break;
        case 'T': id = 3; break;
        }
        if (p->next[id] == NULL)
            p->next[id] = newTrie();
        p = p->next[id];
        p->cnt++;
        ans = max(ans, (i + 1) * p->cnt);    // prefix length * samples sharing it
    }
}

int main() {
    int T;
    cin >> T;
    int Case = 0;
    while (T--) {
        Case++;
        root = newTrie();
        int n;
        cin >> n;
        string s;
        ans = 0;
        while (n--) {
            cin >> s;
            SetTrie(s);
        }
        printf("Case %d: %d\n", Case, ans);
        freedom(root);
    }
    return 0;
}
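As a possible variant, the same counting idea can be written as a pure function that stores nodes in vectors rather than allocating each one with new, which removes the need for a manual free step. This is a sketch under the problem's constraints (only A, C, G, T; at most 50000 samples of length 50, so the product fits in an int); baseIndex and maxPrefixScore are hypothetical names, not part of the code above.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: maps a base to a child slot.
// Assumption: the input contains only A, C, G, T.
int baseIndex(char base) {
    switch (base) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    }
    assert(false && "unexpected character");
    return 0;
}

int maxPrefixScore(const std::vector<std::string>& samples) {
    // Node 0 is the root. A child index of 0 means "no child", which is safe
    // because the root is never anyone's child.
    std::vector<std::array<int, 4>> child(1, std::array<int, 4>{});
    std::vector<int> count(1, 0);  // count[v] = samples passing through node v
    int best = 0;
    for (const std::string& s : samples) {
        int cur = 0;
        for (std::size_t i = 0; i < s.size(); ++i) {
            int id = baseIndex(s[i]);
            if (child[cur][id] == 0) {
                int fresh = static_cast<int>(child.size());
                child.push_back(std::array<int, 4>{});  // may reallocate
                count.push_back(0);
                child[cur][id] = fresh;                 // index again after the push
            }
            cur = child[cur][id];
            ++count[cur];
            best = std::max(best, static_cast<int>(i + 1) * count[cur]);
        }
    }
    return best;
}
```

Calling it once per test case with that case's samples gives the value to print after "Case k: "; the vectors are released automatically when the function returns.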
