字符串匹配之二：DNA序列匹配度的计算

来源：互联网发布：淘宝店铺怎么从新激活编辑：程序博客网时间：2024/05/29 03:02

老师给出了两个DNA序列，分别是人类的致盲蛋白质序列和果蝇的致盲蛋白质序列，也给出了匹配评分矩阵，比如A与T匹配，得分－5，A与A匹配得分10，诸如此类。

有了上篇文章的四个模块，简直就是手到擒来：

def protein_alignment(scoring_matrix):    human_protein = read_protein('HumanEyelessProtein.txt')    fruitfly_protein = read_protein('FruitflyEyelessprotein.txt')        alignment_matrix = compute_alignment_matrix(seq_x=human_protein, seq_y=fruitfly_protein, scoring_matrix=scoring_matrix, global_flag=False)    score, align_human, align_fruitfly = compute_local_alignment(seq_x=human_protein, seq_y=fruitfly_protein, scoring_matrix=scoring_matrix, alignment_matrix=alignment_matrix)    return score, align_human, align_fruitfly

稍微写的有点啰嗦，如果函数直接传入参数的话，代码其实是很简单的。这里是为了区分开，一目了然。

蛋白质序列和得分矩阵在github中有。

局部匹配结果得分700多，应该是个很大的数值了，具体的验证稍后给出。

局部匹配出的基本可以看作是二者的共同点了，这是否是一个蛋白质呢？于是找来一个蛋白质的公共序列，把匹配出的序列在跟这个公共序列匹配。

def consensus_alignment(scoring_matrix, align_sequence):    pax_domain = read_protein('ConsensusPAXDomain.txt')    align_sequence = align_sequence.replace("-", "")    alignment_matrix = compute_alignment_matrix(seq_x=align_sequence, seq_y=pax_domain, scoring_matrix=scoring_matrix, global_flag=True)    score, global_align_sequence, global_align_pax = compute_global_alignment(seq_x=align_sequence, seq_y=pax_domain, scoring_matrix=scoring_matrix, alignment_matrix=alignment_matrix)    seq = difflib.SequenceMatcher(None, global_align_sequence, global_align_pax)    ratio = seq.ratio()       return global_align_sequence, ratio

返回的是相似度，二者分别是71.3%以及57.1%。这是个什么概念呢？

于是把蛋白质的公共序列打乱，组成跟上述两个匹配后序列相同长度的序列，然后跟公共序列匹配，结果分别是14.4％和1.14％。基本可以认定公共部分是蛋白质。

局部匹配得分700多又是一个什么概念呢？

这回引入Statistical Hypothesis 概念。考验一个事件的发生是否可能是巧合，随机产生多次事件，会形成一个正态分布。如果该事件出了3倍标准差之外，基本可以肯定不是巧合。

于是我们把果蝇的蛋白质序列打乱，跟人类的蛋白质匹配，重复1000次，计算每个score出现的次数，可以得到一个平均值为60，标准差为14的正态分布，700多已经位于48倍标准差之外了。Python的标准精度已经计算不出这个数值了。

def generate_null_distribution(seq_x, seq_y, scoring_matrix, num_trials):    scoring_distribution = {}    for _ in range(num_trials):                #shuffle the seq_y        list_y = list(seq_y)        random.shuffle(list_y)        rand_y = "".join(list_y)        alignment_matrix = compute_alignment_matrix(seq_x=seq_x, seq_y=rand_y, scoring_matrix=scoring_matrix, global_flag=False)        score = max([max(item) for item in alignment_matrix])        try:            scoring_distribution[score] += 1        except(KeyError):            scoring_distribution[score] = 1            return scoring_distribution

可以肯定，人类的致盲蛋白质与果蝇的致盲蛋白质来自于同一个祖先。从某种程度上来说，人类和果蝇拥有相同的祖先。

源代码：Github

0 0