Repeated DNA Sequences--LeetCode

Source: Internet. Editor: 程序博客网. Time: 2024/06/10 01:17

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", return: ["AAAAACCCCC", "CCCCCAAAAA"].

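Straightforward method (TLE)

Algorithm analysis

Match strings directly. Build a next array that stores, for each character of the string, the position of its next occurrence in the string; during traversal, the candidate match positions are taken from the next array.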

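For simplicity, consider substrings of length 4.

case1:

src A C G T A C G T

next [4] [5] [6] [7] [-1] [-1] [-1] [-1]

To match the substring ACGT, it is enough to compare the 3 characters after position 0 with the 3 characters after next[0].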

case2:

src A C G T A A C G T

next [4] [6] [7] [8] [5] [-1] [-1] [-1] [-1]

The character A has several successors, so all of them must be tried: if the candidate at next[0] does not match, continue with next[next[0]].

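case3:

src A A A A A A

next [1] [2] [3] [4] [5] [-1]

Repeats: when the match at next[0] succeeds, next[next[0]] can be set to -1, because the length-4 substring starting at next[0] has already been matched and does not need to be matched again. This only reduces duplicates and does not eliminate them, so a set is still needed to store the successful matches and remove duplicates. Note that overwriting an entry this way also cuts the successor chain for every earlier position that passes through it: in ACCCAGGGACCCAGGG, matching ACCC at positions 0 and 8 sets next[8] to -1, so AGGG at position 4 is never compared with AGGG at position 12. Stopping at the first successful match and leaving next untouched avoids this.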
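The next array from the three cases can be sketched as follows. This is a minimal sketch, not the author's construction: the helper name buildNext is hypothetical, and it fills the array with a single backward scan instead of one search per position. For case3 (AAAAAA) it yields 1, 2, 3, 4, 5 and then std::string::npos, which plays the role of -1.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (the name is an assumption): next[i] is the index of
// the next occurrence of s[i] after position i, or std::string::npos if the
// character does not occur again.
std::vector<std::size_t> buildNext(const std::string& s) {
    std::vector<std::size_t> next(s.size(), std::string::npos);

    // last[c] holds the nearest position to the right where c was seen.
    std::size_t last[256];
    for (std::size_t& v : last) {
        v = std::string::npos;
    }

    // Scan from right to left so that last[c] is always "the next one".
    for (std::size_t i = s.size(); i-- > 0;) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        next[i] = last[c];
        last[c] = i;
    }
    return next;
}
```

Following next[i], next[next[i]], and so on visits every later position that starts with the same character as position i, which is exactly the candidate list the matching loop needs.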

Time complexity

Building the next array is O(n^2) and the traversal is O(n^2), so the total time complexity is O(n^2).

Implementation

#include <cstddef>
#include <set>
#include <string>
#include <vector>

class Solution {
public:
    std::vector<std::string> findRepeatedDnaSequences(std::string s);
};

std::vector<std::string> Solution::findRepeatedDnaSequences(std::string s) {
    std::vector<std::string> rel;
    const std::size_t n = s.length();
    if (n <= 10) {
        return rel;
    }

    // next[pos]: position of the next occurrence of s[pos], or npos if none.
    // A local vector replaces the raw member pointer, which leaked on repeated
    // calls and was deleted uninitialized if the function was never called.
    std::vector<std::size_t> next(n);
    for (std::size_t pos = 0; pos < n; ++pos) {
        next[pos] = s.find(s[pos], pos + 1);
    }

    // The set removes duplicate results.
    std::set<std::string> tmpRel;

    for (std::size_t pos = 0; pos + 10 <= n; ++pos) {
        // Walk every later position that starts with the same character.
        for (std::size_t nextPos = next[pos];
             nextPos != std::string::npos && nextPos + 10 <= n;
             nextPos = next[nextPos]) {
            if (s.compare(pos, 10, s, nextPos, 10) == 0) {
                tmpRel.insert(s.substr(pos, 10));
                // Stop at the first match. The next array is left untouched,
                // because overwriting an entry cuts the chain for other
                // positions and can hide repeats.
                break;
            }
        }
    }

    return std::vector<std::string>(tmpRel.begin(), tmpRel.end());
}


Hash table plus bit manipulation method

(See the problem's Show Tags; runtime 10 ms!)

Algorithm analysis

First, encode A, C, G and T in binary:

A -> 00

C -> 01

G -> 10

T -> 11

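With this encoding, every 10-character string corresponds to one number, and a 10-character string takes 20 bits. An int is typically 4 bytes, i.e. 32 bits, so one int is enough to hold a 10-character string. For example: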

ACGTACGTAC -> 00011011000110110001

AAAAAAAAAA -> 00000000000000000000

A 20-bit binary number has at most 2^20 combinations, so the hash table needs 2^20 entries, i.e. 1024 * 1024, and is declared as bool hashTable[1024 * 1024];
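The encoding described above can be sketched as a small packing routine. The helper names code and encode10 are hypothetical, and the sketch assumes the input contains only the letters A, C, G and T.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Map one nucleotide to its 2-bit code: A=00, C=01, G=10, T=11.
// Assumption: c is one of 'A', 'C', 'G', 'T'.
inline std::uint32_t code(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // 'T'
    }
}

// Hypothetical helper: pack the 10 characters starting at `start` into the
// low 20 bits of an integer. The first character ends up in the highest
// two bits of the 20-bit value.
std::uint32_t encode10(const std::string& s, std::size_t start = 0) {
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 10; ++i) {
        value = (value << 2) | code(s[start + i]);
    }
    return value;
}
```

Every result is below 2^20, so it can be used directly as an index into a table of 1024 * 1024 entries.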

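Traversal design

Each time the window moves one character to the right, shift the int value of the string left by 2 bits, set its lowest 2 bits to the code of the new character, and finally clear the 2 bits that overflowed above the 20-bit window (bits 20 and 21). For example: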

src CAAAAAAAAAC

subStr CAAAAAAAAA

int 01000000000000000000

subStr AAAAAAAAAC

int 00000000000000000001
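The sliding step can be sketched as a single function. The names baseCode, kWindowMask and slideWindow are hypothetical, and the input is again assumed to contain only A, C, G and T. In the CAAAAAAAAAC example, the window CAAAAAAAAA has only bit 18 set, and sliding in the final C gives the value 1.

```cpp
#include <cstdint>

// Low 20 bits: one 10-character window at 2 bits per character.
constexpr std::uint32_t kWindowMask = (1u << 20) - 1;

// 2-bit code of a nucleotide: A=00, C=01, G=10, T=11.
// Assumption: c is one of 'A', 'C', 'G', 'T'.
inline std::uint32_t baseCode(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // 'T'
    }
}

// Hypothetical helper: drop the oldest character of the window and append
// `incoming`. Shifting left by 2 pushes the oldest character into bits 20
// and 21, and the mask clears them again.
inline std::uint32_t slideWindow(std::uint32_t window, char incoming) {
    return ((window << 2) | baseCode(incoming)) & kWindowMask;
}
```

Masking with the low 20 bits has the same effect as clearing bits 20 and 21 explicitly, as long as the previous value was already within 20 bits.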

Time complexity

Traversing the string is O(n) and each hash table operation is O(1), so the total time complexity is O(n).

Implementation

#include <cstddef>
#include <cstring>
#include <string>
#include <unordered_set>
#include <vector>

// hashMap[h] is true once the 10-letter window with 20-bit code h has occurred.
bool hashMap[1024 * 1024];

class Solution {
public:
    std::vector<std::string> findRepeatedDnaSequences(std::string s);
};

std::vector<std::string> Solution::findRepeatedDnaSequences(std::string s) {
    std::vector<std::string> rel;
    if (s.length() <= 10) {
        return rel;
    }

    // Map each nucleotide to its 2-bit code, indexed by (letter - 'A').
    unsigned char convert[26] = {0};
    convert['A' - 'A'] = 0;  // 00
    convert['C' - 'A'] = 1;  // 01
    convert['G' - 'A'] = 2;  // 10
    convert['T' - 'A'] = 3;  // 11

    std::memset(hashMap, 0, sizeof(hashMap));

    // Encode the first 10 characters.
    int hashValue = 0;
    for (std::size_t pos = 0; pos < 10; ++pos) {
        hashValue = (hashValue << 2) | convert[s[pos] - 'A'];
    }
    hashMap[hashValue] = true;

    // Codes that have already been added to the result.
    std::unordered_set<int> strHashValue;

    // Slide the window one character at a time, keeping only the low 20 bits.
    for (std::size_t pos = 10; pos < s.length(); ++pos) {
        hashValue = ((hashValue << 2) | convert[s[pos] - 'A']) & 0xFFFFF;

        if (hashMap[hashValue]) {
            // insert().second is true only the first time a code is reported.
            if (strHashValue.insert(hashValue).second) {
                rel.push_back(s.substr(pos - 9, 10));
            }
        } else {
            hashMap[hashValue] = true;
        }
    }

    return rel;
}

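Brute-force enumeration will certainly time out, so the first idea is hashing: use each length-10 substring as the key and its number of occurrences as the value, and add the substring to the result when its count is already 1 at the moment it is seen again. But this uses too much memory and still fails.

After a little thought the idea became clear: compress the state. Compress each length-10 string into a single integer.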

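#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
    int dna['T' + 1];
    char rdna[4] = {'A', 'C', 'G', 'T'};

    vector<string> findRepeatedDnaSequences(string s) {
        dna['A'] = 0; dna['C'] = 1; dna['G'] = 2; dna['T'] = 3;

        unordered_map<unsigned int, int> tab;
        vector<string> res;

        int len = s.length();

        for (int i = 0; i < len - 9; i++) {
            // Compress the window into a 10-digit decimal number with digits
            // 0..3. The largest value, 3333333333, fits in a 32-bit unsigned int.
            unsigned int x = 0;
            for (int j = i; j <= i + 9; j++) {
                x = x * 10 + dna[s[j]];
            }
            if (tab[x] == 1) {
                // Convert x back to a string and add it to res. Decode a copy,
                // so that the key used by tab[x]++ below is not destroyed.
                unsigned int y = x;
                string tps(10, ' ');
                for (int j = 9; j >= 0; j--) {
                    tps[j] = rdna[y % 10];
                    y /= 10;
                }
                res.push_back(tps);
            }
            tab[x]++;
        }

        return res;
    }
};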

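PS: On reflection, this problem does not really need any special transformation. All we want is to find out whether any substring repeats, and however the input is transformed, every length-10 substring of the string still has to be visited. So we can simply use a hash_map or a map to record how many times each length-10 substring occurs, and keep those that occur more than once. The first pass, which builds the table, is O(n), and the second pass, which queries it, is also O(n) (on average with a hash map; an ordered map adds a log factor).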
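The counting idea from the PS can be sketched directly with std::unordered_map. The function name findRepeatedByCounting is hypothetical, and the sketch merges the two passes into one by reporting a substring at the moment its count reaches 2.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: count every length-10 substring and report each one
// exactly once, when it is seen for the second time.
std::vector<std::string> findRepeatedByCounting(const std::string& s) {
    std::vector<std::string> result;
    std::unordered_map<std::string, int> count;

    for (std::size_t i = 0; i + 10 <= s.size(); ++i) {
        const std::string window = s.substr(i, 10);
        if (++count[window] == 2) {
            result.push_back(window);
        }
    }
    return result;
}
```

This stores up to n strings of 10 characters each, which is the memory cost that the integer encodings above are meant to reduce.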

0 0