Repeated DNA Sequences--LeetCode
Source: Internet. Edited by: 程序博客网. Time: 2024/06/10 01:17
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
Return: ["AAAAACCCCC", "CCCCCAAAAA"].
Straightforward method (TLE)
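Algorithm analysis
Direct string matching: build a next array that stores, for each character of the string, the position of its next occurrence; the traversal then starts each comparison from the position given by next.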
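For simplicity, consider substrings of length 4.
case 1:
src  A   C   G   T   A    C    G    T
next [4] [5] [6] [7] [-1] [-1] [-1] [-1]
To match the string ACGT, it is enough to compare the 3 characters after position 0 with the 3 characters after next[0].
case 2:
src  A   C   G   T   A   A    C    G    T
next [4] [6] [7] [8] [5] [-1] [-1] [-1] [-1]
When A has several later occurrences, all of them have to be tried: if the match at next[0] fails, the match at next[next[0]] must be tried as well.
case 3:
src  A   A   A   A   A   A
next [1] [2] [3] [4] [5] [-1]
Repeated letters: once the match at next[0] succeeds, next[next[0]] can be set to -1, because the length-4 string starting at next[0] has already been matched and does not need to be matched again. This only reduces duplicates and does not eliminate them, so a set is still needed to store the successful matches and de-duplicate them. Note that cutting the chain this way also hides later occurrences from other start positions that share the same first letter, so it can miss results; simply stopping at the first successful match is the safe form of this shortcut.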
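The next arrays in the three cases above can be reproduced with a small helper. This is only a sketch: the name buildNext is hypothetical, and it swaps in a right-to-left scan with a last-seen table in place of a repeated forward search, so it fills the array in a single pass. It uses -1 for "no later occurrence", as in the cases above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: next[i] is the index of the next occurrence of s[i]
// after position i, or -1 if the character does not occur again.
std::vector<long> buildNext(const std::string& s) {
    std::vector<long> next(s.size(), -1);

    // last[c] holds the nearest index to the right where character c was seen.
    long last[256];
    for (long& v : last) {
        v = -1;
    }

    // Scan from right to left so last[c] is always "the next occurrence".
    for (std::size_t i = s.size(); i-- > 0;) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        next[i] = last[c];
        last[c] = static_cast<long>(i);
    }
    return next;
}
```

For "ACGTAACGT" (case 2) the two A entries come out as 4 and 5, which is the chain that has to be followed when the first candidate fails.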
Time complexity
Building the next array is O(n^2) and the traversal is O(n^2), so the total time complexity is O(n^2).
Implementation
#include <cstddef>
#include <set>
#include <string>
#include <vector>

class Solution {
public:
    std::vector<std::string> findRepeatedDnaSequences(std::string s);
};

std::vector<std::string> Solution::findRepeatedDnaSequences(std::string s) {
    std::vector<std::string> rel;

    if (s.length() <= 10) {
        return rel;
    }

    // next[pos]: index of the next occurrence of s[pos], or npos if none.
    // A local vector replaces the raw member array, so nothing leaks and
    // nothing uninitialized is deleted after an early return.
    std::vector<std::size_t> next(s.length());
    for (std::size_t pos = 0; pos < s.length(); ++pos) {
        next[pos] = s.find(s[pos], pos + 1);
    }

    std::set<std::string> tmpRel;  // de-duplicates the matches

    for (std::size_t pos = 0; pos + 10 <= s.length(); ++pos) {
        std::size_t nextPos = next[pos];
        // Only candidates with 10 characters left can match.
        while (nextPos != std::string::npos && nextPos + 10 <= s.length()) {
            if (s.compare(pos, 10, s, nextPos, 10) == 0) {
                tmpRel.insert(s.substr(pos, 10));
                // Stop at the first match. Setting next[nextPos] to npos
                // here would cut the chain for other start positions and
                // could miss results.
                break;
            }
            nextPos = next[nextPos];
        }
    }

    rel.assign(tmpRel.begin(), tmpRel.end());
    return rel;
}
Hash table plus bit manipulation method
(Suggested by the problem's Show Tags; runtime 10 ms.)
Algorithm analysis
First, encode A, C, G and T in binary:
A -> 00
C -> 01
G -> 10
T -> 11
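With this encoding, every 10-character substring corresponds to one number, and a 10-character string takes 20 bits. An int is typically 4 bytes (32 bits), so it can hold one 10-character string. For example: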
ACGTACGTAC -> 00011011000110110001
AAAAAAAAAA -> 00000000000000000000
A 20-bit binary number has at most 2^20 combinations, so the hash table needs 2^20 entries, i.e. 1024 * 1024. It is declared as bool hashTable[1024 * 1024]; (named hashMap in the code).
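The encoding can be sketched as follows. The names nucleotideCode and encodeWindow are hypothetical, and the sketch assumes the input contains only the letters A, C, G and T.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: 2-bit code for one nucleotide
// (A = 00, C = 01, G = 10, T = 11).
inline std::uint32_t nucleotideCode(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // assumption: anything else is 'T'
    }
}

// Packs the first 10 letters of window into the low 20 bits of an integer.
// The first letter ends up in the highest 2 of those 20 bits.
inline std::uint32_t encodeWindow(const std::string& window) {
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 10 && i < window.size(); ++i) {
        value = (value << 2) | nucleotideCode(window[i]);
    }
    return value;
}
```

With this packing, "ACGTACGTAC" becomes 00011011000110110001 in binary (0x1B1B1), and the largest possible code, "TTTTTTTTTT", is 0xFFFFF, which is why a table of 2^20 entries is enough.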
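Designing the string traversal
Each move of one character to the right corresponds to shifting the int value left by 2 bits, setting its lowest 2 bits to the code of the new character, and finally clearing the 2 bits that were shifted past the 20-bit window (bits 20 and 21). For example: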
src CAAAAAAAAAC
subStr CAAAAAAAAA
int 01000000000000000000
subStr AAAAAAAAAC
int 00000000000000000001
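The window update in the example above can be sketched as one small function. The name slideWindow is hypothetical, and the sketch assumes the input contains only A, C, G and T. Masking with 0xFFFFF keeps the low 20 bits, which for a value that never exceeds 22 bits has the same effect as clearing bits 20 and 21.

```cpp
#include <cstdint>

// Hypothetical helper: given the 20-bit code of the current 10-letter window
// and the letter entering on the right, returns the code of the next window.
inline std::uint32_t slideWindow(std::uint32_t value, char incoming) {
    std::uint32_t code;
    switch (incoming) {
        case 'A': code = 0; break;
        case 'C': code = 1; break;
        case 'G': code = 2; break;
        default:  code = 3; break;  // assumption: anything else is 'T'
    }
    // Shift left by one letter (2 bits), append the new letter,
    // then drop the letter that left the window.
    return ((value << 2) | code) & 0xFFFFFu;
}
```

Starting from the code of "CAAAAAAAAA" (a 01 followed by eighteen zeros) and appending 'C' gives 1, the code of "AAAAAAAAAC".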
Time complexity
Traversing the string is O(n) and each hash table access is O(1), so the total time complexity is O(n).
Implementation
#include <cstddef>
#include <cstring>
#include <string>
#include <unordered_set>
#include <vector>

bool hashMap[1024 * 1024];

class Solution {
public:
    std::vector<std::string> findRepeatedDnaSequences(std::string s);
};

std::vector<std::string> Solution::findRepeatedDnaSequences(std::string s) {
    std::vector<std::string> rel;
    if (s.length() <= 10) {
        return rel;
    }

    // map char to code
    unsigned char convert[26] = {0};
    convert[0] = 0;   // 'A' - 'A' 00
    convert[2] = 1;   // 'C' - 'A' 01
    convert[6] = 2;   // 'G' - 'A' 10
    convert[19] = 3;  // 'T' - 'A' 11

    // reset the table, then encode the first 10 characters
    std::memset(hashMap, false, sizeof(hashMap));

    int hashValue = 0;

    for (int pos = 0; pos < 10; ++pos) {
        hashValue <<= 2;
        hashValue |= convert[s[pos] - 'A'];
    }

    hashMap[hashValue] = true;

    std::unordered_set<int> strHashValue;  // codes already reported

    // slide the window one character at a time
    for (std::size_t pos = 10; pos < s.length(); ++pos) {
        hashValue <<= 2;
        hashValue |= convert[s[pos] - 'A'];
        hashValue &= ~(0x300000);  // clear the 2 bits shifted past bit 19

        if (hashMap[hashValue]) {
            if (strHashValue.find(hashValue) == strHashValue.end()) {
                rel.push_back(s.substr(pos - 9, 10));
                strHashValue.insert(hashValue);
            }
        } else {
            hashMap[hashValue] = true;
        }
    }

    return rel;
}
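Brute-force enumeration will certainly time out, so the first idea is hashing: use each length-10 substring as the key and its occurrence count as the value, and add the substring to the result when the value equals 1, that is, when it is seen for the second time. But this uses too much memory and still fails.
After a little thought the idea came: compress the state, packing each length-10 string into a single integer.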
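#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
    int dna['T' + 1];
    char rdna[4] = {'A', 'C', 'G', 'T'};

    vector<string> findRepeatedDnaSequences(string s) {
        dna['A'] = 0; dna['C'] = 1; dna['G'] = 2; dna['T'] = 3;
        unordered_map<unsigned int, int> tab;
        vector<string> res;
        int len = s.length();
        for (int i = 0; i < len - 9; i++) {
            // Encode s[i..i+9] as a 10-digit decimal number, one digit (0-3)
            // per letter. The largest value, 3333333333, still fits in a
            // 32-bit unsigned int. Integer arithmetic replaces pow().
            unsigned int x = 0;
            for (int j = i; j <= i + 9; j++) {
                x = x * 10 + dna[s[j]];
            }
            if (tab[x] == 1) {
                // Convert x back to a string and add it to res. Decode a
                // copy so that tab[x]++ below still updates the right key.
                unsigned int y = x;
                string tps(10, ' ');
                for (int j = 9; j >= 0; j--) {
                    tps[j] = rdna[y % 10];
                    y /= 10;
                }
                res.push_back(tps);
            }
            tab[x]++;
        }
        return res;
    }
};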
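PS: On reflection, this problem does not really need any conversion trick. The goal is only to find whether substrings repeat, and however the data is transformed, every length-10 substring of the whole string still has to be visited. So a hash_map (or map) can simply record how many times each length-10 substring occurs, and the substrings that occur more than once are saved. The first pass, which builds the counts, is O(n), and the second pass, which queries them, is also O(n).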
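A minimal sketch of the counting idea from the PS, assuming std::unordered_map with the substrings themselves as keys. The function name findRepeatedByCounting is hypothetical. It folds the two passes into one by reporting a substring at the moment its count reaches 2.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical function: counts every length-10 substring and reports each
// one exactly once, when it is seen for the second time.
std::vector<std::string> findRepeatedByCounting(const std::string& s) {
    std::vector<std::string> result;
    std::unordered_map<std::string, int> count;

    // i + 10 <= s.size() also covers inputs shorter than 10 characters.
    for (std::size_t i = 0; i + 10 <= s.size(); ++i) {
        std::string window = s.substr(i, 10);
        if (++count[window] == 2) {
            result.push_back(window);
        }
    }
    return result;
}
```

Each key here is a full 10-character string, so this version keeps more data per entry than the integer-coded versions; whether that matters depends on the judge's memory limit.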