perl应用:SNP的提取(4):信息的补全all.pl和重复区域的删除repeat_move_all_information.pl!
来源:互联网 发布:js上下文是什么意思 编辑:程序博客网 时间:2024/06/05 16:20
我们在上一篇文章中看了最后的输出结果如下:
pos TAIR bur can ct edi hi kn ler mt no oy po rsch sf tsu wil ws wu zu 1000 G - A - - - - - - - - - - - - - - - -10000003 C - - - - - G - - - - - - - - - - - -10000013 A - - R - - - - - - - - - - - - - - -10000021 G - - - - - - - - - - - C - - - - - -10000034 C - - - - - - - - - - - - - - - - Y -10000047 G R - - - - - - - - - - - - - - - R -10000049 A R - - - - - - - - - - - - - - - R -
我们发现里面有许多的--当然详细看过上面文章的会知道,有些地方在一个sample是有SNP的。但是在另一个sample这个位置就可能不是一个SNP,这样,在这个位置就是一个空白。我们设置的输出就是-;
但是这一部分信息,也是我们需要的,所以我们要到原来的.maf文件中去回对,也就是把每个位置的每个样品的碱基信息进行统计。其实关键就是一个回对的过程。
所以我们的解决思路就是:
1.用合并后的信息,建立一个hash,方便后面的查找
2.打开.maf以后,我们对每一个块,(这里说的块,包括score部分,ref序列,sample序列),我们先看在ref中是否有这个点的信息,因为,我设置的score的起始位点是90000.所以会有很多无法匹配的位点。如果有,那么取sample中对应位置的碱基,保存到hash中。
3.重复回对每一个文件,最终得到最后的结果。
程序如下;
#!/usr/bin/perluse strict;use warnings;my %position;my @information1;my @mout;my @score;my @info1;my @info2;my @sequen1;my @sequen2;my $cout1;my $cout;my @samples;my $samples;my $key1;my $value;my $z;open (DNA,"Chr5_join.txt")||die("can not open");#这里有需要替换成不同的染色体信息while(<DNA>){@information1=split;$position{$information1[0]}{sample0}=$information1[1];}open (DNA1,"TAIR_vs_bur.maf")||die("can not open");$z=<DNA1>;$z=<DNA1>;$z=<DNA1>;$z=<DNA1>;$z=<DNA1>;$z=<DNA1>;$z=<DNA1>;$z=<DNA1>;$z=<DNA1>;$z=<DNA1>;$z=<DNA1>;$z=<DNA1>;$z=<DNA1>;$z=<DNA1>;$/="\n\n";while(<DNA1>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless ($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample1}=$sequen2[$cout];}else {next;}}}}open (DNA2,"TAIR_vs_can.maf")||die("can not open");$z=<DNA2>;$z=<DNA2>;$z=<DNA2>;$z=<DNA2>;$z=<DNA2>;$z=<DNA2>;$z=<DNA2>;$z=<DNA2>;$z=<DNA2>;$z=<DNA2>;$z=<DNA2>;$z=<DNA2>;$z=<DNA2>;$z=<DNA2>;$/="\n\n";while(<DNA2>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless ($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample2}=$sequen2[$cout];}else {next;}}}}open (DNA3,"TAIR_vs_ct.maf")||die("can not open");$z=<DNA3>;$z=<DNA3>;$z=<DNA3>;$z=<DNA3>;$z=<DNA3>;$z=<DNA3>;$z=<DNA3>;$z=<DNA3>;$z=<DNA3>;$z=<DNA3>;$z=<DNA3>;$z=<DNA3>;$z=<DNA3>;$z=<DNA3>;$/="\n\n";while(<DNA3>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample3}=$sequen2[$cout];}else {next;}}}}open (DNA4,"TAIR_vs_edi.maf")||die("can not open");$z=<DNA4>;$z=<DNA4>;$z=<DNA4>;$z=<DNA4>;$z=<DNA4>;$z=<DNA4>;$z=<DNA4>;$z=<DNA4>;$z=<DNA4>;$z=<DNA4>;$z=<DNA4>;$z=<DNA4>;$z=<DNA4>;$z=<DNA4>;$/="\n\n";while(<DNA4>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample4}=$sequen2[$cout];}else {next;}}}}open (DNA5,"TAIR_vs_hi.maf")||die("can not open");$z=<DNA5>;$z=<DNA5>;$z=<DNA5>;$z=<DNA5>;$z=<DNA5>;$z=<DNA5>;$z=<DNA5>;$z=<DNA5>;$z=<DNA5>;$z=<DNA5>;$z=<DNA5>;$z=<DNA5>;$z=<DNA5>;$z=<DNA5>;$/="\n\n";while(<DNA5>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample5}=$sequen2[$cout];}else {next;}}}}open (DNA6,"TAIR_vs_kn.maf")||die("can not open");$z=<DNA6>;$z=<DNA6>;$z=<DNA6>;$z=<DNA6>;$z=<DNA6>;$z=<DNA6>;$z=<DNA6>;$z=<DNA6>;$z=<DNA6>;$z=<DNA6>;$z=<DNA6>;$z=<DNA6>;$z=<DNA6>;$z=<DNA6>;$/="\n\n";while(<DNA6>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample6}=$sequen2[$cout];}else {next;}}}}open (DNA7,"TAIR_vs_ler.maf")||die("can not open");$z=<DNA7>;$z=<DNA7>;$z=<DNA7>;$z=<DNA7>;$z=<DNA7>;$z=<DNA7>;$z=<DNA7>;$z=<DNA7>;$z=<DNA7>;$z=<DNA7>;$z=<DNA7>;$z=<DNA7>;$z=<DNA7>;$z=<DNA7>;$/="\n\n";while(<DNA7>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample7}=$sequen2[$cout];}else {next;}}}}open (DNA8,"TAIR_vs_mt.maf")||die("can not open");$z=<DNA8>;$z=<DNA8>;$z=<DNA8>;$z=<DNA8>;$z=<DNA8>;$z=<DNA8>;$z=<DNA8>;$z=<DNA8>;$z=<DNA8>;$z=<DNA8>;$z=<DNA8>;$z=<DNA8>;$z=<DNA8>;$z=<DNA8>;$/="\n\n";while(<DNA8>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample8}=$sequen2[$cout];}else {next;}}}}open (DNA9,"TAIR_vs_no.maf")||die("can not open");$z=<DNA9>;$z=<DNA9>;$z=<DNA9>;$z=<DNA9>;$z=<DNA9>;$z=<DNA9>;$z=<DNA9>;$z=<DNA9>;$z=<DNA9>;$z=<DNA9>;$z=<DNA9>;$z=<DNA9>;$z=<DNA9>;$z=<DNA9>;$/="\n\n";while(<DNA9>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample9}=$sequen2[$cout];}else {next;}}}}open (DNA10,"TAIR_vs_oy.maf")||die("can not open");$z=<DNA10>;$z=<DNA10>;$z=<DNA10>;$z=<DNA10>;$z=<DNA10>;$z=<DNA10>;$z=<DNA10>;$z=<DNA10>;$z=<DNA10>;$z=<DNA10>;$z=<DNA10>;$z=<DNA10>;$z=<DNA10>;$z=<DNA10>;$/="\n\n";while(<DNA10>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample10}=$sequen2[$cout];}else {next;}}}}open (DNA11,"TAIR_vs_po.maf")||die("can not open");$z=<DNA11>;$z=<DNA11>;$z=<DNA11>;$z=<DNA11>;$z=<DNA11>;$z=<DNA11>;$z=<DNA11>;$z=<DNA11>;$z=<DNA11>;$z=<DNA11>;$z=<DNA11>;$z=<DNA11>;$z=<DNA11>;$z=<DNA11>;$/="\n\n";while(<DNA11>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample11}=$sequen2[$cout];}else {next;}}}}open (DNA12,"TAIR_vs_rsch.maf")||die("can not open");$z=<DNA12>;$z=<DNA12>;$z=<DNA12>;$z=<DNA12>;$z=<DNA12>;$z=<DNA12>;$z=<DNA12>;$z=<DNA12>;$z=<DNA12>;$z=<DNA12>;$z=<DNA12>;$z=<DNA12>;$z=<DNA12>;$z=<DNA12>;$/="\n\n";while(<DNA12>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample12}=$sequen2[$cout];}else {next;}}}}open (DNA13,"TAIR_vs_sf.maf")||die("can not open");$z=<DNA13>;$z=<DNA13>;$z=<DNA13>;$z=<DNA13>;$z=<DNA13>;$z=<DNA13>;$z=<DNA13>;$z=<DNA13>;$z=<DNA13>;$z=<DNA13>;$z=<DNA13>;$z=<DNA13>;$z=<DNA13>;$z=<DNA13>;$/="\n\n";while(<DNA13>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample13}=$sequen2[$cout];}else {next;}}}}open (DNA14,"TAIR_vs_tsu.maf")||die("can not open");$z=<DNA14>;$z=<DNA14>;$z=<DNA14>;$z=<DNA14>;$z=<DNA14>;$z=<DNA14>;$z=<DNA14>;$z=<DNA14>;$z=<DNA14>;$z=<DNA14>;$z=<DNA14>;$z=<DNA14>;$z=<DNA14>;$z=<DNA14>;$/="\n\n";while(<DNA14>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample14}=$sequen2[$cout];}else {next;}}}}open (DNA15,"TAIR_vs_wil.maf")||die("can not open");$z=<DNA15>;$z=<DNA15>;$z=<DNA15>;$z=<DNA15>;$z=<DNA15>;$z=<DNA15>;$z=<DNA15>;$z=<DNA15>;$z=<DNA15>;$z=<DNA15>;$z=<DNA15>;$z=<DNA15>;$z=<DNA15>;$z=<DNA15>;$/="\n\n";while(<DNA15>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample15}=$sequen2[$cout];}else {next;}}}}open (DNA16,"TAIR_vs_ws.maf")||die("can not open");$z=<DNA16>;$z=<DNA16>;$z=<DNA16>;$z=<DNA16>;$z=<DNA16>;$z=<DNA16>;$z=<DNA16>;$z=<DNA16>;$z=<DNA16>;$z=<DNA16>;$z=<DNA16>;$z=<DNA16>;$z=<DNA16>;$z=<DNA16>;$/="\n\n";while(<DNA16>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample16}=$sequen2[$cout];}else {next;}}}}open (DNA17,"TAIR_vs_wu.maf")||die("can not open");$z=<DNA17>;$z=<DNA17>;$z=<DNA17>;$z=<DNA17>;$z=<DNA17>;$z=<DNA17>;$z=<DNA17>;$z=<DNA17>;$z=<DNA17>;$z=<DNA17>;$z=<DNA17>;$z=<DNA17>;$z=<DNA17>;$z=<DNA17>;$/="\n\n";while(<DNA17>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample17}=$sequen2[$cout];}else {next;}}}}open (DNA18,"TAIR_vs_zu.maf")||die("can not open");$z=<DNA18>;$z=<DNA18>;$z=<DNA18>;$z=<DNA18>;$z=<DNA18>;$z=<DNA18>;$z=<DNA18>;$z=<DNA18>;$z=<DNA18>;$z=<DNA18>;$z=<DNA18>;$z=<DNA18>;$z=<DNA18>;$z=<DNA18>;$/="\n\n";while(<DNA18>){@mout = split/\n/,$_;@score= split/=/, $mout[0];unless($score[1]<90000){@info1 = split/\s+/,$mout[1];@info2 = split/\s+/,$mout[2];@sequen1 = split// ,$info1[6];@sequen2 = split// ,$info2[6];for($cout1=$info1[2]+1;$cout1<$info1[3]+$info1[2]+1;$cout1++){if(exists $position{$cout1}){$cout=$cout1-$info1[2]-1;$position{$cout1}{sample18}=$sequen2[$cout];}else {next;}}}}@samples=qw/sample0 sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample11 sample12 sample13 sample14 sample15 sample16 sample17 sample18/;open(ALL,">all_information.txt")||die("can not open!");foreach $key1(sort keys %position){print ALL "$key1 ";foreach $samples(@samples){if(exists $position{$key1}{$samples}){$value = $position{$key1}{$samples};print ALL "$value ";}else{print ALL "- "}}print ALL "\n";}
结果如下:
10001656 G G - - G - G G G G - G A G G A G G G 10001667 G G - - G - G G G G - G G G G S G G G 10001670 G G - - G T G G G G - G G G G G G G G 10001672 T T - - T G T T T T - T T T T T T T T 10001673 G G - - G A G G G G - G G G G G G G G 10001696 G G - - G G G G G G - G A G G A G G G 10001713 C C - - C C C C C C - C C C C Y C C C 10001768 C C - - C C C C C C - C C C C Y C C C 10001791 A A - - G G G G A A - A G A G G A A G 10001812 G G - - G G G G A A - G G G G G G G G 10001816 A A - - A A A A A A - A G A A G A A A 10001820 C C - - C C C C C C - C C C C S C C C我们可以看到好多位置的信息已经补充完整了。
然后一个新的问题就是,我们知道基因在测序的过程中会有很多的重复区域,我们也要把相应的重复区域删除。
删除重复区域其实也是一个hash应用的过程。没有多少亮点。
我们首先来看一下这个repeat区域的格式:
504758552485289556677568875636986379656401564047564771711995784717853151386551387205138941139927514075914094851628061628445301999302104
我们要做的就是确定第一个,和最后一个位置,然后利用if循环,使数据递增。然后检查hash中时候有这样一个key值的存在,如果有的话,那么就delete这个hash,如果没有的就保存。就是这样。
我们来看一下代码:
#!/usr/bin/perl# move repeat arear outuse strict;use warnings;my %position;my @pos;my $cout;my $key;my $value;open (ALL,"all_information.txt")||die("can not open!");while(<ALL>){$position{$1}=$2 if $_=~/^(\d+)\s(.+)$/;}open (POS,"sepChr5.txt")||die("can not open!");while(<POS>){@pos=split;for($cout=$pos[1]+1;$cout<$pos[2]+1;$cout++){if(exists $position{$cout}){delete $position{$cout};}else {next;}}}open (RESULT,">without_repeat_information.txt")||die ("can not open!");foreach $key(sort keys %position){print RESULT "$key ";$value=$position{$key};print RESULT "$value \n";}
- perl应用:SNP的提取(4):信息的补全all.pl和重复区域的删除repeat_move_all_information.pl!
- perl应用:提取snp后续处理:删除带有“—”的行remove-.pl
- perl应用:SNP的提取(3):18个样品SNP的合并join.pl,+忽略-多的行
- perl应用:SNP的提取(2):从对比序列中找到SNP位点并输出 a.pl
- perl应用:SNP的提取(1):lastz
- perl应用:snp提取后续处理:非ATGC行的删除
- 登陆百度和GOOGLE的perl/Tk脚本.pl
- PL-SQL 包的创建和应用
- SNP的概念和特点
- PL/SQL删除锁表的进程
- PL/SQL删除锁表的进程
- PL/SQL 找回删除的表
- PL/SQL程序设计 第七章 包的创建和应用
- PL/SQL程序设计 第七章 包的创建和应用
- python分别用while和for于vcf格式提取复等位基因的snp(并计算分别条数)
- 对adRegisterWLSListeners.pl 和 adSyncContext.pl 脚本的研究
- PL/SQL高级应用的学习1
- PL/SQL高级应用的学习2
- draw和onDraw
- hello ,CSDN
- getRequestDispatcher()与sendRedirect()区别
- 客户端包分析
- 2013校园招聘——微软PM面试经历
- perl应用:SNP的提取(4):信息的补全all.pl和重复区域的删除repeat_move_all_information.pl!
- ubuntu关闭防火墙
- DDL语言详细梳理
- 二维数组和指针
- WEB-JSP标准标签库
- HDU 3049 求逆元模板
- jboss-web.xml 配置说明
- Python读写文件
- And解roid中Activity启动模式详