LeetCode 187 Repeated DNA Sequences

Source: Internet | Published by: 动态加载数据js | Editor: 程序博客网 | Time: 2024/06/05 03:55

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", return ["AAAAACCCCC", "CCCCCAAAAA"].

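Three approaches follow, ordered from simplest to most involved, and each one runs faster than the one before it. The logic is explained in the code and its comments.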

Method 1: Runtime: 38 ms, beats 68.58% of Java submissions.

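// imports needed: java.util.ArrayList, java.util.HashSet, java.util.List, java.util.Set
public List<String> findRepeatedDnaSequences2(String s) {
    // dna holds every 10-letter window seen so far; res holds the windows seen more than once
    Set<String> dna = new HashSet<>(), res = new HashSet<>();
    for (int i = 10; i <= s.length(); i++) {
        String chars = s.substring(i - 10, i);
        // add() returns false when the window is already in the set
        if (!dna.add(chars)) res.add(chars);
    }
    return new ArrayList<>(res);
}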

Method 2: Runtime: 14 ms, beats 97.35% of Java submissions.

// imports needed: java.util.ArrayList, java.util.List
// Lookup table from nucleotide character to its 2-bit code
private static final byte[] t = new byte[128];
static {
    t['A'] = 0;
    t['C'] = 1;
    t['G'] = 2;
    t['T'] = 3;
}

public List<String> findRepeatedDnaSequences(String s) { // 12 ms, 99.77%
    boolean[] has = new boolean[1048576]; // 1048576 = 1 << 20
    boolean[] written = new boolean[1048576];
    List<String> list = new ArrayList<>();
    char[] c = s.toCharArray();
    int n = c.length, cur = 0;
    if (n < 10) return list;
    // Value of the first 9 characters; each character takes two bits
    for (int i = 0; i < 9; i++)
        cur = (cur << 2) | t[c[i]];
    for (int i = 9; i < n; i++) {
        // Keep only the value of the latest 10 characters (20 bits)
        cur = ((cur << 2) | t[c[i]]) & 0xFFFFF;
        if (has[cur]) {
            if (!written[cur]) {
                list.add(s.substring(i - 9, i + 1));
                written[cur] = true;
            }
        } else
            has[cur] = true;
    }
    return list;
}
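The 2-bit rolling encoding that Method 2 relies on can be looked at in isolation. The following is a minimal sketch, not part of the original solution: the class name RollingEncodingDemo and the helpers code, encode and roll are hypothetical names introduced only for illustration. It shows that shifting in one new letter and masking with 0xFFFFF gives the same value as encoding the new window from scratch.

```java
// Illustrative sketch only; the class and method names are assumptions, not from the original post.
class RollingEncodingDemo {
    // 2-bit code per nucleotide, the same mapping as the lookup table t in Method 2
    static int code(char ch) {
        switch (ch) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + ch);
        }
    }

    // Encode a window of exactly 10 letters into a 20-bit integer from scratch
    static int encode(String window) {
        if (window.length() != 10) {
            throw new IllegalArgumentException("window must have 10 letters");
        }
        int cur = 0;
        for (int i = 0; i < 10; i++) {
            cur = (cur << 2) | code(window.charAt(i));
        }
        return cur;
    }

    // Slide the window one letter to the right: shift the new letter in,
    // then mask to 20 bits so the oldest letter falls off
    static int roll(int cur, char next) {
        return ((cur << 2) | code(next)) & 0xFFFFF;
    }

    public static void main(String[] args) {
        String s = "AAAAACCCCCA";
        int first = encode(s.substring(0, 10));
        int second = roll(first, s.charAt(10));
        System.out.println(first);                                 // 341
        System.out.println(second == encode(s.substring(1, 11)));  // true
    }
}
```

Because every window maps to a distinct integer below 1 << 20, that integer can index an array directly and no string hashing is needed.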

Method 3: Runtime: 7 ms, beats 99.93% of Java submissions.

// imports needed: java.util.ArrayList, java.util.List
// Lookup table from nucleotide character to its 2-bit code
private static final byte[] t = new byte[128];
static {
    t['A'] = 0;
    t['C'] = 1;
    t['G'] = 2;
    t['T'] = 3;
}

public List<String> findRepeatedDnaSequences3(String s) {
    final long[] has = new long[16384]; // 16384 = 1 << 14
    final long[] written = new long[16384];
    ArrayList<String> dupSeqs = new ArrayList<>();
    if (s.length() <= 10) return dupSeqs;
    char[] c = s.toCharArray(); // String.charAt will be slower than char array access
    int cur = 0;
    for (int i = 0; i < 9; i++) {
        cur = (cur << 2) | t[c[i]];
    }
    for (int i = 9; i < c.length; i++) {
        // Keep only the value of the latest 10 characters; one character takes 2 bits
        cur = ((cur << 2) | t[c[i]]) & 0xFFFFF;
        // The upper 14 bits of cur are the array index; the lower 6 bits select the bit in the bitmap
        int idx = (cur >> 6);
        // A long has only 64 bits, and 64 is exactly 1 << 6. If this were cur >> 7, the remaining
        // 7 bits could ask for a left shift of 1L by more than 63 positions, which a long cannot represent.
        long bitmap = 1L << (cur & 0x3f);
        // Report the sequence if it has been seen before and has not been added yet
        if ((has[idx] & bitmap) != 0) {
            if ((written[idx] & bitmap) == 0) {
                written[idx] |= bitmap;
                dupSeqs.add(s.substring(i - 9, i + 1));
            }
        } else {
            has[idx] |= bitmap;
        }
    }
    return dupSeqs;
}
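The index/bit split used in Method 3 can be sketched separately. This is only an illustration under assumptions: BitmapFlagsDemo and testAndSet are hypothetical names that do not appear in the original solution. The upper 14 bits of a 20-bit key pick one long out of 16384, and the lower 6 bits pick one of the 64 bits inside it.

```java
// Illustrative sketch only; the class and method names are assumptions, not from the original post.
class BitmapFlagsDemo {
    // 1 << 20 flags packed 64 per long: (1 << 20) / 64 = 16384 words
    private final long[] words = new long[1 << 14];

    // Sets the flag for a 20-bit key and reports whether it was already set
    boolean testAndSet(int key) {
        int idx = key >>> 6;             // upper 14 bits choose the word
        long mask = 1L << (key & 0x3f);  // lower 6 bits choose the bit within that word
        boolean wasSet = (words[idx] & mask) != 0;
        words[idx] |= mask;
        return wasSet;
    }

    public static void main(String[] args) {
        BitmapFlagsDemo seen = new BitmapFlagsDemo();
        System.out.println(seen.testAndSet(341));       // false: first time this key is seen
        System.out.println(seen.testAndSet(341));       // true: the flag is already set
        System.out.println(seen.testAndSet(341 + 64));  // false: same bit position, next word
    }
}
```

The standard library's java.util.BitSet packs flags into long words in the same way and could be used instead of hand-written masks, at the cost of method calls in the inner loop.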

Method 1 uses a set, so it is not very efficient: every 10-letter window is created as a new String and hashed.

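Method 2 is simpler than Method 3 but slower. In my view, the reason is that its has and written arrays each hold more than a million elements, which is too long and makes array access costly. A length of 1 << 20 is nevertheless required, because a 10-letter sequence has 4^10 = 1 << 20 distinct binary encodings. Method 3 covers the same 1 << 20 flags with two long[16384] arrays by storing one flag per bit.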
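The sizes involved can be checked with a few lines of arithmetic. This is a small sketch with hypothetical names (SizeArithmeticDemo, distinctWindows, longsNeeded). Note that the count is a left shift, 1 << 20; in Java, 1 >> 20 evaluates to 0.

```java
// Illustrative sketch only; the class and method names are assumptions, not from the original post.
class SizeArithmeticDemo {
    // Number of distinct 10-letter windows over a 4-letter alphabet: 4^10
    static int distinctWindows() {
        int n = 1;
        for (int i = 0; i < 10; i++) {
            n *= 4;
        }
        return n;
    }

    // Number of 64-bit longs needed to hold the given number of one-bit flags
    static int longsNeeded(int flags) {
        return (flags + 63) / 64;
    }

    public static void main(String[] args) {
        System.out.println(distinctWindows());               // 1048576
        System.out.println(1 << 20);                         // 1048576
        System.out.println(1 >> 20);                         // 0
        System.out.println(longsNeeded(distinctWindows()));  // 16384
    }
}
```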

Reference: https://discuss.leetcode.com/topic/31963/8ms-of-java-solution/4

0 0