Solving 187. Repeated DNA Sequences with Bitwise Operations

Source: Internet | Published by: 软件测试表情包 | Editor: 程序博客网 | Time: 2024/06/06 20:33

Problem

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA. Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Given s = “AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT”, return: [“AAAAACCCCC”, “CCCCCAAAAA”].

Problem analysis:

Given a DNA string, find every substring of exactly 10 characters that occurs more than once in it.

Approach:

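The simplest solution is brute force: two nested loops that compare every 10-character window against every other one. For a string of length n that is roughly O(n²) window comparisons, which times out, so one might try the KMP algorithm to reduce the cost. On closer reading, however, the string contains only the four letters A, C, G, and T. Looking at their ASCII codes in binary, A is 0100 0001, C is 0100 0011, G is 0100 0111, and T is 0101 0100, so the low three bits (001, 011, 111, 100) are different for every letter, and those three bits are enough to represent a letter. A substring has 10 letters, so it fits in 30 bits, and the mask 0x3FFFFFFF keeps exactly those low 30 bits. (0x7FFFFFFF keeps 31 bits, which lets one bit of the character that has just left the window leak into the key, so it should not be used here.)

The algorithm first packs the low three bits of the first nine characters of S into an integer. Reading the tenth character then completes the hash of the first substring of S, which is stored in a hash table with its count incremented. After that, each step shifts the value left by 3 bits, masks it, and appends the next character, which gives the hash of the next substring. If the count stored for that hash is already 1, the substring has appeared exactly once before, so it is added to the result. The count is incremented either way, which prevents the same substring from being added to the result more than once.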
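The encoding described above can be sketched in isolation. This is only an illustration, not part of the submitted solution: the helper names lowBits, encodeWindow, slide, and kMask30 are made up for this example. It checks at compile time that the four letters get distinct 3-bit codes, and shows that sliding a key by one character is meant to give the same result as re-encoding the window from scratch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Low three bits of the ASCII code: A=001, C=011, G=111, T=100.
constexpr std::uint32_t lowBits(char c) {
    return static_cast<std::uint32_t>(c) & 7u;
}

static_assert(lowBits('A') == 1 && lowBits('C') == 3 &&
              lowBits('G') == 7 && lowBits('T') == 4,
              "each nucleotide has its own 3-bit code");

constexpr std::uint32_t kMask30 = 0x3FFFFFFFu;  // keeps the low 30 bits

// Hypothetical helper: pack the 10 characters starting at pos into a 30-bit key.
std::uint32_t encodeWindow(const std::string& s, std::size_t pos) {
    assert(pos + 10 <= s.size());
    std::uint32_t key = 0;
    for (std::size_t k = 0; k < 10; ++k)
        key = (key << 3) | lowBits(s[pos + k]);
    return key;
}

// Hypothetical helper: move the window one character to the right.
// The shift pushes the oldest character past bit 29, the mask drops it,
// and the new character's code is appended in the low three bits.
std::uint32_t slide(std::uint32_t key, char next) {
    return ((key << 3) & kMask30) | lowBits(next);
}
```

For any valid position i, slide(encodeWindow(s, i), s[i + 10]) should equal encodeWindow(s, i + 1), which is why the solution never has to re-read the nine characters the two windows share.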

Accepted (AC) code

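#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        unordered_map<unsigned, int> m;  // 30-bit window key -> times seen so far
        vector<string> r;
        size_t i = 0, ss = s.size();
        if (ss < 10) return r;           // no 10-letter window exists
        unsigned t = 0;                  // unsigned, so the left shift cannot overflow a signed int
        // Pack the low three bits of the first nine characters.
        while (i < 9)
            t = t << 3 | (s[i++] & 7);
        // Each iteration appends one character and completes the window ending at i - 1.
        while (i < ss)
            if (m[t = (t << 3 & 0x3FFFFFFF) | (s[i++] & 7)]++ == 1)
                r.push_back(s.substr(i - 10, 10));  // second occurrence: report it once
        return r;
    }
};

int main() {
    string s = "AAAAAAAAAAA";
    Solution ss;
    vector<string> result = ss.findRepeatedDnaSequences(s);
    for (size_t i = 0; i < result.size(); ++i) {
        cout << result[i] << endl;
    }
    return 0;
}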
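One way to sanity-check the bit-packed solution is to compare its output with a plainer version that uses the 10-character substrings themselves as hash keys. The sketch below is such a reference; the name findRepeatedNaive is made up for this example. It is expected to be slower and to use more memory, since every window is copied into a string, but it follows the same rule of reporting a substring only on its second occurrence.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical reference version: counts every 10-character substring directly.
std::vector<std::string> findRepeatedNaive(const std::string& s) {
    std::vector<std::string> repeated;
    if (s.size() < 10) return repeated;        // no 10-letter window exists
    std::unordered_map<std::string, int> seen; // substring -> times seen so far
    for (std::size_t i = 0; i + 10 <= s.size(); ++i) {
        std::string window = s.substr(i, 10);
        if (seen[window]++ == 1)               // report only on the second occurrence
            repeated.push_back(window);
    }
    return repeated;
}
```

On the example from the problem statement this should return AAAAACCCCC followed by CCCCCAAAAA, the order in which the second occurrences are encountered.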
