Repeated DNA Sequences

Source: Internet | Editor: 程序博客网 | Date: 2024/05/21 09:34

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
Return: ["AAAAACCCCC", "CCCCCAAAAA"].


Solution approach:

This problem mainly relies on two techniques: bit manipulation and hashing.

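Encode A, C, G, and T in binary: A -> 00, C -> 01, G -> 10, T -> 11. Under this encoding every 10-character substring corresponds to a number, and a 10-character string occupies 20 bits. An int is typically 4 bytes (32 bits), so a single int can represent one 10-character string. For example: ACGTACGTAC -> 00011011000110110001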
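The encoding described above can be sketched as follows. This is a minimal illustration, not the code from the post: the helper names nucleotideCode and encodeSequence are hypothetical, and the input is assumed to contain only the characters A, C, G, and T.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Map one nucleotide to its 2-bit code: A=00, C=01, G=10, T=11.
// Assumption: the input contains only 'A', 'C', 'G', 'T'.
inline std::uint32_t nucleotideCode(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // 'T'
    }
}

// Pack a 10-letter sequence into the low 20 bits of an integer,
// with the first letter in the most significant position.
inline std::uint32_t encodeSequence(const std::string& seq) {
    assert(seq.size() == 10);
    std::uint32_t val = 0;
    for (char c : seq) {
        val = (val << 2) | nucleotideCode(c);
    }
    return val;
}
```

Calling encodeSequence("ACGTACGTAC") yields 0x1B1B1, which is the bit pattern 00011011000110110001 from the example above.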

A 20-bit binary number has at most 2^20 possible values, so the hash table needs 2^20 entries, i.e. 1024 * 1024. The hash table is declared as bool hashTable[1024 * 1024];
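One possible way to hold such a table is sketched below. The SeenTable class and its testAndSet method are hypothetical names introduced for illustration. The sketch uses std::vector<bool> instead of a plain bool array, an assumption made here so that the 2^20 flags are not placed on the stack.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of distinct 20-bit values: 2^20 = 1024 * 1024.
constexpr std::size_t kTableSize = std::size_t{1} << 20;

// Hypothetical wrapper around the direct-address table:
// one flag per possible encoded 10-letter sequence.
class SeenTable {
public:
    SeenTable() : seen_(kTableSize, false) {}

    // Marks val as seen and returns whether it had been seen before.
    bool testAndSet(std::uint32_t val) {
        assert(val < kTableSize);
        bool wasSeen = seen_[val];
        seen_[val] = true;
        return wasSeen;
    }

private:
    std::vector<bool> seen_;
};
```

Because every 20-bit value indexes its own slot, no collision handling is needed: the encoded value itself serves as the hash.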

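While scanning the string, each time the window moves right by one character, the window's int value is shifted left by 2 bits, its lowest 2 bits are set to the code of the new character, and finally the 2 bits that overflowed past the 20-bit window (bits 20 and 21) are cleared to 0. Once the value val of the current substring is known, check whether it has appeared before: if not, set hashTable[val] to true; otherwise, insert the current substring into a set container, which also removes duplicates when a sequence occurs more than twice.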
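The window update can be sketched in isolation as below. The function name slideWindow is hypothetical, and the sketch masks with 0xFFFFF to keep the low 20 bits, which has the same effect as clearing bits 20 and 21 whenever the previous value fit in 20 bits.

```cpp
#include <cassert>
#include <cstdint>

// Slide a 10-letter window one character to the right.
// val holds the current window in its low 20 bits;
// code is the 2-bit code (0..3) of the incoming character.
inline std::uint32_t slideWindow(std::uint32_t val, std::uint32_t code) {
    assert(code < 4);
    val <<= 2;        // make room for the new character
    val |= code;      // place the new character in the lowest 2 bits
    val &= 0xFFFFFu;  // drop the character that left the window
    return val;
}
```

For instance, starting from the window ACGTACGTAC (0x1B1B1) and sliding in G (code 2) gives 0x6C6C6, the encoding of CGTACGTACG.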

Code:

#include <iostream>
#include <string>
#include <vector>
#include <set>
#include <map>
using namespace std;

vector<string> findRepeatedDnaSequences(string s)
{
    vector<string> res;
    if (s.length() < 10) return res;

    // 2-bit code for each nucleotide
    map<char, int> mp;
    mp['A'] = 0;
    mp['C'] = 1;
    mp['G'] = 2;
    mp['T'] = 3;

    // One flag per possible 20-bit value. A vector is used instead of a
    // local bool array so that 1 MB is not placed on the stack.
    vector<bool> exist(1024 * 1024, false);

    // Encode the first 10 characters
    int val = 0;
    for (int i = 0; i < 10; i++)
    {
        val <<= 2;
        val |= mp[s[i]];
    }
    exist[val] = true;

    set<string> tmp;  // repeated sequences, deduplicated
    for (size_t i = 10; i < s.length(); i++)
    {
        val <<= 2;            // shift the window left by one character
        val |= mp[s[i]];      // append the new character
        val &= ~(0x300000);   // clear bits 20 and 21 (the character that left)
        if (exist[val]) tmp.insert(s.substr(i - 9, 10));
        else exist[val] = true;
    }

    res.assign(tmp.begin(), tmp.end());
    return res;
}

int main()
{
    string s;
    cin >> s;
    vector<string> res = findRepeatedDnaSequences(s);
    for (size_t i = 0; i < res.size(); i++)
    {
        cout << res[i] << ' ';
    }
    return 0;
}
