uva1392 - DNA Regions: maintaining a decreasing sequence + binary search


A DNA sequence or genetic sequence is a succession of letters representing the primary structure of a real or hypothetical DNA molecule or strand, with the capacity to carry information. The possible letters are A, C, G, and T, representing the four nucleotide subunits of a DNA strand: adenine, cytosine, guanine and thymine bases covalently linked to phospho-backbone.

DNA sequences undergo mutations during the evolution of species, which means that some letters are randomly replaced with others. Therefore, the DNA sequences of two closely related species are very similar, and the difference increases as the distance between the species increases. The mutations do not occur with uniform frequency throughout the sequence; typically there are fewer mutations at the biologically important parts, since even a single mutation can be lethal at such a place. On the other hand, if a part of the sequence does not carry any biologically relevant information, then mutations on this part have no effect. It follows that if we compare the DNA sequences of two species and a particular region of the sequence contains fewer than the average number of mutations, then most probably this part of the sequence plays an important biological role. Therefore, it is of crucial importance to identify such regions. More precisely, a conserved region is a consecutive interval of the DNA sequence such that in this region at most p percent of the letters are different in the two sequences. Your task is to write a program that, given two DNA sequences, finds the longest conserved region.

Input 

The input contains several blocks of test cases. Each case begins with a line containing two integers: $1 \le n \le 150000$, the length of the genetic sequences, and $1 \le p \le 99$, the maximum percentage of mutated letters allowed in a conserved region. This is followed by two lines, each containing a DNA sequence of length n. The sequence contains only the letters `A', `C', `G', and `T'.

The input is terminated by a test case with n = 0 .

Output 

For each test case, you have to output a line containing a single integer: the length of the longest conserved region between the two sequences. If there are no conserved regions in the input, then output `No solution.' (without quotes).

Sample Input 

14 25
ACCGGTAACGTGAA
ACTGGATACGTAAA
14 24
ACCGGTAACGTGAA
ACTGGATACGTAAA
8 1
AAAAAAAA
CCCCCCCC
8 33
AAACAAAA
CCCCCCCC
0 0

Sample Output 

8
7
No solution.
1

  Find the length of the longest region whose mutation rate does not exceed p%.

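  This is quite similar to the other DNA problem that is solved with a monotonic queue: both can be turned into slope problems, but the approach here is different. Let sum[i] be the number of mutations among the first i letters. For each right endpoint b we want the smallest a with (sum[b]-sum[a])/(b-a)<=p/100, that is, the smallest a with b*p-100*sum[b]>=a*p-100*sum[a]. Define key[i]=i*p-100*sum[i] (stored as d[i].key in the code). Scan from left to right and keep in array c the positions whose keys form a strictly decreasing sequence. If the current key is smaller than the last key recorded in c, no earlier position has a key that small, so no region ends here, and the current position is appended to c. Otherwise, binary search c for the lower bound, meaning the earliest position whose key is <= the current key. In that case the current position is not added to c: c already holds an earlier position with a key less than or equal to it, and that earlier position is always the better left endpoint.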
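The rearrangement into keys can be written out step by step. Here $S_i$ and $K_i$ are shorthand introduced only for this note: $S_i$ stands for sum[i] and $K_i$ for the key at position i. The region considered covers letters $a+1,\dots,b$ with $0 \le a < b \le n$.

```latex
\begin{align*}
\frac{S_b - S_a}{b - a} \le \frac{p}{100}
  &\iff 100\,(S_b - S_a) \le p\,(b - a) && \text{(since } b - a > 0\text{)} \\
  &\iff a\,p - 100\,S_a \le b\,p - 100\,S_b \\
  &\iff K_a \le K_b, \qquad K_i = i\,p - 100\,S_i,\quad K_0 = 0 .
\end{align*}
```

So for a fixed b the task reduces to finding the smallest index a < b whose key is at most $K_b$, and everything stays in integer arithmetic.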

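  Note that c[0]=0 is set up front, so position 0 (with key 0) is always a candidate left endpoint: if i*p-100*sum[i]>=0, that is, sum[i]/i<=p/100, then the whole prefix from 0 to i satisfies the condition.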
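A quick way to sanity-check the key formulation, including the key-0 entry at index 0, is a brute-force reference that tries every pair of endpoints. This is only a sketch for cross-checking small inputs, not the solution itself: the function name longestConservedBrute is made up for illustration, and at O(n^2) it is far too slow for n = 150000.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical brute-force reference: tries every pair 0 <= a < b <= n.
// Returns the longest conserved length, or 0 when no region qualifies.
int longestConservedBrute(const std::string& x, const std::string& y, int p) {
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());

    // key[i] = i*p - 100 * (mismatches among the first i letters).
    // key[0] = 0 plays the role of the c[0] = 0 entry.
    std::vector<int> key(n + 1, 0);
    int mismatches = 0;
    for (int i = 1; i <= n; ++i) {
        mismatches += (x[i - 1] != y[i - 1]);
        key[i] = i * p - 100 * mismatches;
    }

    int best = 0;
    for (int b = 1; b <= n; ++b) {
        for (int a = 0; a < b; ++a) {
            // Letters a+1..b form a conserved region exactly when key[a] <= key[b].
            if (key[a] <= key[b] && b - a > best) best = b - a;
        }
    }
    return best;
}
```

On the four sample cases this returns 8, 7, 0 and 1, where 0 corresponds to `No solution.'.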

#include <cstdio>
#include <algorithm>
using namespace std;

#define MAXN 150010

int N, P;
char a[MAXN], b[MAXN];
int c[MAXN];  // positions whose keys form a strictly decreasing sequence

struct DNA {
    int id, key, sum;  // position, id*P - 100*sum, mutations among the first id letters
} d[MAXN];

// Smallest m in [L, R] with d[c[m]].key <= v.
// The caller guarantees d[c[R]].key <= v, so such an m exists.
int bsearch(int L, int R, int v) {
    int m;
    while (L < R) {
        m = (L + R) / 2;
        if (d[c[m]].key <= v) R = m;
        else L = m + 1;
    }
    return L;
}

int main() {
    // The input ends with a test case where n = 0.
    while (scanf("%d%d", &N, &P) == 2 && N) {
        scanf("%s%s", a, b);
        d[0].id = d[0].sum = d[0].key = c[0] = 0;  // position 0 has key 0
        int cnt = 0, ans = 0;
        for (int i = 1; i <= N; i++) {
            d[i].sum = d[i - 1].sum + (a[i - 1] != b[i - 1]);
            d[i].id = i;
            d[i].key = i * P - 100 * d[i].sum;
            if (d[i].key < d[c[cnt]].key) {
                c[++cnt] = i;  // new minimum: no region ends at i
            } else {
                // earliest position whose key is <= the current key
                int s = bsearch(0, cnt, d[i].key);
                s = d[c[s]].id;
                ans = max(ans, i - s);
            }
        }
        if (!ans) printf("No solution.\n");
        else printf("%d\n", ans);
    }
    return 0;
}

