Repeated DNA Sequences


All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", return: ["AAAAACCCCC", "CCCCCAAAAA"].

The approach follows this discussion: https://leetcode.com/discuss/24478/i-did-it-in-10-lines-of-c

A is 0x41, C is 0x43, G is 0x47, T is 0x54. Still don't see it? Let me write it in octal.

A is 0101, C is 0103, G is 0107, T is 0124. The last octal digit is different for all four letters. That's all we need!

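In other words, when the character codes are written in octal, the last digit differs for each letter. One octal digit corresponds to three binary digits, so each letter can be identified from the lowest 3 bits of its binary representation alone.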
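As a quick check of the observation above, here is a minimal sketch that prints the octal code and the low three bits of each letter. The class name `NucleotideBits` and the helper `code` are made up for illustration and are not part of the original solution.

```java
class NucleotideBits {
    // Low three bits of the character code, i.e. its last octal digit.
    static int code(char c) {
        return c & 7;
    }

    public static void main(String[] args) {
        for (char c : "ACGT".toCharArray()) {
            System.out.println(c + " -> octal " + Integer.toOctalString(c)
                    + ", low 3 bits = " + code(c));
        }
    }
}
```

The four values come out as 1, 3, 7 and 4, so they are pairwise distinct and none of them is zero.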

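Instead of storing strings in the map, we store an integer. An int has 32 bits and we need to look at 10 letters, so each time we read a letter we shift the previous value left by 3 bits (10 letters take 30 bits in total; the two extra high bits are cleared with & 0x3FFFFFFF) and then append the number that represents the current letter to the low end.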
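The rolling update described above can be sketched in isolation as follows. The names `RollingKey`, `push` and `pack` are hypothetical helpers introduced only for this illustration; the point is that pushing an 11th letter discards the oldest one, so the key always describes the most recent 10 letters.

```java
class RollingKey {
    // Keeps the low 30 bits: 10 letters at 3 bits each.
    static final int MASK = 0x3FFFFFFF;

    // Append one letter to the key and drop whatever falls outside 30 bits.
    static int push(int key, char c) {
        return ((key << 3) & MASK) | (c & 7);
    }

    // Build a key by pushing every letter of the string in order.
    static int pack(String letters) {
        int key = 0;
        for (int i = 0; i < letters.length(); i++) {
            key = push(key, letters.charAt(i));
        }
        return key;
    }

    public static void main(String[] args) {
        // Eleven letters pushed; the key should match the last ten on their own.
        int rolled = pack("AAAAACCCCCA");
        int direct = pack("AAAACCCCCA");
        System.out.println(rolled == direct);
        System.out.println(Integer.toOctalString(pack("AAAAACCCCC")));
    }
}
```

Reading the key in octal shows one digit per letter, which makes it easy to inspect a window while debugging.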

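    // Needs java.util.ArrayList, java.util.HashMap, java.util.List and java.util.Map.
    public List<String> findRepeatedDnaSequences(String s) {
        Map<Integer, Integer> map = new HashMap<Integer, Integer>();
        List<String> result = new ArrayList<String>();
        int num = 0;
        for (int i = 0; i < s.length(); i++) {
            // Shift in the low 3 bits of the current letter and keep only the last 10 letters (30 bits).
            num = ((num << 3 & 0x3FFFFFFF) | (s.charAt(i) & 7));
            // Report a window exactly when it is seen for the second time.
            // Keys built from fewer than 10 letters are all distinct, so i - 9 is never negative here.
            if (map.get(num) != null && map.get(num).equals(1)) {
                result.add(s.substring(i-9, i+1));
            }
            map.put(num, map.get(num) == null ? 1 : map.get(num)+1);
        }
        return result;
    }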
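To sanity-check the bit-packed version, it can help to compare it against a plain substring-based version. The sketch below is not the method from the linked discussion; it stores the 10-letter substrings themselves in hash sets, and the names `NaiveRepeatedDna` and `findRepeated` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NaiveRepeatedDna {
    static List<String> findRepeated(String s) {
        Set<String> seen = new HashSet<>();      // every window met so far
        Set<String> reported = new HashSet<>();  // windows already added to the result
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 10 <= s.length(); i++) {
            String window = s.substring(i, i + 10);
            // add() returns false when the window was already present.
            if (!seen.add(window) && reported.add(window)) {
                result.add(window);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findRepeated("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
    }
}
```

On the example string from the problem statement this yields AAAAACCCCC followed by CCCCCAAAAA, the same sequences listed there. It uses more memory than the integer keys, which is the trade-off the 3-bit encoding is meant to avoid.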

