perl应用:DNA序列酶切图谱的创建
来源:互联网 发布:ubuntu下载 阿里云 编辑:程序博客网 时间:2024/05/02 00:40
程序里面有很多小的知识点,大家要认真的看,才能发现:
a.fasta的DNA序列如下:
> sample dna | (This is a typical fasta header.) agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga gagaaagaccccaagctagagattcgctatcggcacaagaagtcacggga gcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggag ggcgcaagaggcctgtccctgatccagacctgcagcgccgggcagggtca gggacaggggttggggccatgcttgctcggggctctgcttcgccccacaa atcctctccgcagcccttggtggccacacccagccagcatcaccagcagc agcagcagcagatcaaacggtcagcccgcatgtgtggtgagtgtgaggca tgtcggcgcactgaggactgtggtcactgtgatttctgtcgggacatgaa gaagttcgggggccccaacaagatccggcagaagtgccggctgcgccagt gccagctgcgggcccgggaatcgtacaagtacttcccttcctcgctctca ccagtgacgccctcagagtccctgccaaggccccgccggccactgcccac ccaacagcagccacagccatcacagaagttagggcgcatccgtgaagatg agggggcagtggcgtcatcaacagtcaaggagcctcctgaggctacagcc acacctgagccactctcagatgaggaccta
REBASE.txt的内容如下:
REBASE version 104 bionet.104 =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= REBASE, The Restriction Enzyme Database http://rebase.neb.com Copyright (c) Dr. Richard J. Roberts, 2001. All rights reserved. =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Rich Roberts Mar 30 2001 AaaI (XmaIII) C^GGCCG AacI (BamHI) GGATCC AaeI (BamHI) GGATCC AagI (ClaI) AT^CGAT AaqI (ApaLI) GTGCAC AarI CACCTGCNNNN^ AarI ^NNNNNNNNGCAGGTG AatI (StuI) AGG^CCT AatII GACGT^C AauI (Bsp1407I) T^GTACA AbaI (BclI) T^GATCA AbeI (BbvCI) CC^TCAGC AbeI (BbvCI) GC^TGAGG AbrI (XhoI) C^TCGAG AcaI (AsuII) TTCGAA AcaII (BamHI) GGATCC AcaIII (MstI) TGCGCA AcaIV (HaeIII) GGCC AccI GT^MKAC AccII (FnuDII) CG^CG AccIII (BspMII) T^CCGGA Acc16I (MstI) TGC^GCA Acc36I (BspMI) ACCTGCNNNN^ Acc36I (BspMI) ^NNNNNNNNGCAGGT Acc38I (EcoRII) CCWGG Acc65I (KpnI) G^GTACC Acc113I (ScaI) AGT^ACT AccB1I (HgiCI) G^GYRCC AccB2I (HaeII) RGCGC^Y AccB7I (PflMI) CCANNNN^NTGG AccBSI (BsrBI) CCG^CTC AccBSI (BsrBI) GAG^CGG AccEBI (BamHI) G^GATCC AceI (TseI) G^CWGC AceII (NheI) GCTAG^C AceIII CAGCTCNNNNNNN^ AceIII ^NNNNNNNNNNNGAGCTG AciI C^CGC AciI G^CGG AclI AA^CGTT AclNI (SpeI) A^CTAGT AclWI (BinI) GGATCNNNN^
这里面要求两次输入要读取的文件,第一次读取的是a.fasta。第二次读取的是REBASE.txt
sub IUB_to_regexp { # A subroutine that, given a sequence with IUB ambiguity codes, # outputs a translation with IUB codes changed to regular expressions # These are the IUB ambiguity codes my($iub) = @_; my $regular_expression = ''; my %iub2character_class = ( # 这里除了四种常用的碱基外,用来表明核酸序列中不常见或不明确的碱基 A => 'A', C => 'C', G => 'G', T => 'T', R => '[GA]',#R可以代表G或者A中一种 Y => '[CT]', M => '[AC]', K => '[GT]', S => '[GC]', W => '[AT]', B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]', N => '[ACGT]', ); # Remove the ^ signs from the recognition sites $iub =~ s/\^//g; # Translate each character in the iub sequence for ( my $i = 0 ; $i < length($iub) ; ++$i ) { $regular_expression .= $iub2character_class{substr($iub, $i, 1)}; } return $regular_expression; } sub get_file_data { # A subroutine to get data from a file given its filename #读取文件的子序列 my $dna_filename; my @filedata; print "please input the Path just like this f:\\\\perl\\\\data.txt\n"; chomp($dna_filename=<STDIN>); open(DNAFILENAME,$dna_filename)||die("can not open the file!"); @filedata = <DNAFILENAME>; close DNAFILENAME; return @filedata;#子函数的返回值一定要记住写 } sub parseREBASE { my($rebasefile) = @_; use strict; use warnings; my @rebasefile = ( ); my %rebase_hash = ( ); my $name; my $site; my $regexp; # Read in the REBASE file @rebasefile = get_file_data($rebasefile); foreach ( @rebasefile ) { # Discard header lines ( 1 .. /Rich Roberts/ ) and next; # Discard blank lines /^\s*$/ and next; # Split the two (or three if includes parenthesized name) fields my @fields = split( " ", $_); # Get and store the name and the recognition site # by not saving the middle field, if any, # just the first and last $name = $fields[0]; $site = $fields[-1]; # Translate the recognition sites to regular expressions $regexp = IUB_to_regexp($site); # Store the data into the hash # $site 表示位点序列,$regexp 表示位点的可以和DNA序列匹配的位点序列 $rebase_hash{$name} = "$site $regexp"; } # Return the hash containing the reformatted REBASE data return %rebase_hash; } sub extract_sequence_from_fasta_data { #******************************************************************* # A subroutine to extract FASTA sequence data from an array # 得到其中的序列 # fasta格式介绍: # 包括三个部分 # 1.第一行中以>开头的注释行,后面是名称和序列的来源 # 2.标准单字母符号的序列 # 3.*表示结尾 #******************************************************************* my (@fasta_file_data) =@_; my $sequence =' '; foreach my $line (@fasta_file_data) { #这里忽略空白行 if ($line=~/^\s*$/) { next; } #忽略注释行 elsif($line=~/^\s*#/) { next; } #忽略fasta的第一行 elsif($line=~/^>/) { next; } else { $sequence .=$line; } } $sequence=~s/\s//g; return $sequence; } sub match_positions{my ($regexp,$sequence) = @_;use strict;my @positions =( );while($sequence=~/$regexp/ig){push (@positions,pos($sequence)-length($&)+1)# pos返回最后一次匹配的位置# $&代表匹配的位置,$`代表匹配位置之前的位置,$'代表匹配位置之后的位置}return @positions;}use strict;use warnings;my %rebase_hash = ( );my @file_data = ( );my $query = '';my $dna = '';my $recognition_site= '';my $regexp = '';my @locations = ( );@file_data = get_file_data( );$dna = extract_sequence_from_fasta_data(@file_data);%rebase_hash = parseREBASE();do {print "Please input restriction enzyme name\n";chomp($query=<STDIN>);if ($query=~/^\s*$/){exit;}if ($rebase_hash{$query}){if ($rebase_hash{$query}){($recognition_site,$regexp) = split (" ",$rebase_hash{$query});@locations = match_positions($regexp,$dna);if (@locations){print "searching for $query $recognition_site $regexp\n";print "A restriction site for $query at locations:\n";print join(" ",@locations),"\n";}else{print "A restriction site for $query is not in the DNA:\n";}}print "\n";}}until ($query=~/quit/);
最后的 结果如下:
F:\>perl\a.plplease input the Path just like this f:\\perl\\data.txtf:\\perl\\a.fastaplease input the Path just like this f:\\perl\\data.txtf:\\perl\\REBASE.txtPlease input restriction enzyme nameAbeIsearching for AbeI GC^TGAGG GCTGAGGA restriction site for AbeI at locations:11Please input restriction enzyme name
- perl应用:DNA序列酶切图谱的创建
- perl应用:DNA互补序列的获取
- perl应用:DNA序列翻译为蛋白质的完整程序(中)
- perl应用:六框阅读翻译DNA序列
- perl应用:DNA序列翻译(下):从fasta格式中读取序列,然后输出蛋白质序列,以及fasta格式的介绍
- 创建图谱的初衷
- 知识图谱的应用
- 知识图谱的应用
- 知识图谱的应用
- 知识图谱的应用
- 知识图谱的应用
- 数学建模(6)-DNA限制性图谱的绘制
- perl:DNA序列翻译成氨基酸序列的若干方法,直接法,简并法,哈希法,以及perl中的uc和lc函数(上)
- DNA序列
- DNA序列
- DNA序列
- DNA序列
- DNA序列
- ARP攻击
- 面试记
- Linux下c语言多线程编程
- LInux 环境变量
- Ubuntu开机自动挂载Windows分区(NTFS FAT32)教程
- perl应用:DNA序列酶切图谱的创建
- 整理笔记。记的有点乱。
- TextView属性大全
- linux下配置jdk环境变量
- Xfire soapHeader的WebService权限控制forjava
- poj 1469
- Android开发EditText属性
- 对互联网海量数据实时计算的理解
- c/c++混合编程简明总结