submit assembly to NCBI

来源:互联网 发布:api接口怎么用java 编辑:程序博客网 时间:2024/04/29 10:23

二、submit assembly to NCBI

1、prepare data

首先要具有fasta格式数据(NO .gz),这是处理的基础,具体格式如下:

>Scaffold633TCATTTCTCCACTCTCGATGAACAAATCTGGAGGGATTTTTTTTCATTCCACTCAATAGGTTGTCTATAAAGGTGTGATTCGTGGAACTTCTTCACACAGCAGCTAGTCTATATAATACAGAAGATCG>Scaffold553AAAAAATTTTTTTTTTAAACTATCATCTCATGGATCAGCAGCAATTCTGAGTGTAACGTCTTCATTAAATGCGTATATAAATTTGCATAAAGATATGCGACCAATATTGAGCCTGGAAATATATGCGCAGAGTGCAAAATTGTGTTTTTTGATCGGTTAATTAAAGG>Scaffold641GTTTCCCAGTAGGTCTCTCCCGCTACGGCGTCCGCACGAACGCGATCTGCCCTCGTGCCCGCACCGCCATGACGGCAGAAGCCTTCGGCGAGAACAACACCGGCGTCGTCGGCCTCGATCCGCTTGCACCCGAGCGCGTCGCGACCCTGGTCAGCTACCTCGCATCCCCCGATTCCGACGAGATCAACGGACAGGTCTTCGTCGTCTACGGCAAGATGGTGGCGTTGATGGAAGCACCCAAGGTCGAGAACCGTTTCGACGCAGCCGGATCCGCGTTCACCGTCGAAGAACTCGGTGGCCAGCTCTCGTCTTACTTCTCCGGCCGTGGGCCGTACGAGACCTACTGGGAAAC

2、处理数据

分为几步:

(1)生成.greater, short.list和ZERO_BASE_COUNT文件

perl  ../ scaf_filter_2k.pl  Ascaris_suum.scaf.fa

scaf_filter_2k.pl代码

#!/usr/bin/perluse strict;use warnings; my $file=shift;#my $cutoff=shift;my $outfile="short.list";my $outfile2="$file.greater";my $outfile3="ZERO_BASE_COUNT"; open IN,"< $file" or die $!;open OUT,"> $outfile" or die $!;open OUT1,"> $outfile2" or die $!;open OUT2, "> $outfile3" or die $!;$/='>';<IN>;$/="\n";while(<IN>){        chomp;        my $id=$1 if(/^(\S+)/);        $/='>';        my $seq=<IN>;        chomp($seq);        $/="\n";        $seq=~s/\s//g;        my $len=length($seq);        if ($len < 200){                print  OUT "$id\t$len\n";                next;        }        else{                my $a=$seq=~tr/aA/aA/;                my $t=$seq=~tr/tT/tT/;                my $c=$seq=~tr/cC/cC/;                my $g=$seq=~tr/gG/gG/;                if($a==0 || $c==0 || $g==0 || $t==0){                        print OUT2 "$id\t$len\n";                        next;                }                print OUT1 ">$id\n$seq\n";        }}close IN;close OUT;close OUT1;close OUT2;

(2)生成.Nchange文件。

perl ../ info_N_plus.pl  Ascaris_suum.scaf.fa.greater  > Ascaris_suum.scaf.fa.greater.Nchange

info_N_plus.pl代码:

#!/usr/bin/perl -wuse strict;#use Getopt::Long;sub usage{  print STDERR <<USAGE;      ############################################            Version 1.1 by Wing-L 2011.07.15       usage: $0 <sequence.fa> <len> >STDOUT       ############################################USAGEexit;}&usage if(@ARGV <1);my ($fa,$len)=@ARGV;$len||=10;open IN,"<$fa" or die("$!\n");$/='>';<IN>;$/="\n";while(my $line=<IN>){    my @block;  $line=~/^\S+/;  my $tag={1};  $/='>';  my $seq=<IN>;  chomp $seq;  $/="\n";  $seq=~s/\s//g;  my $chr_length=length $seq;  while($seq=~/[^N]N{1,9}[^N]/g){    substr ($seq,$-[0]+1,$len)=~s/\S/N/g;  }  while($seq=~/N([^N]{1,49})N/g){      my $tlen=length $1;      substr ($seq,$-[0]+1,$tlen)=~s/\S/N/g;  }  if($seq=~/^N?[^N]{0,49}N+/){      print STDERR "$tag\t1\t{1}[0]\n";    substr($seq,0,{1}[0]-$-[0])='';  }  if($seq=~/N+[^N]{0,49}N{0,}$/){    print STDERR "$tag\t$-[0]\t$chr_length\n";    substr($seq,$-[0],$chr_length-$-[0])='';  }  print ">$tag\n$seq\n";}close IN;#open IN,"<" or die("$!\n");#while(my $line=<IN>){}#foreach my $e (){}#(split /\s+/,$line)[0]#open OUT,">" or die("$!\n");###############  sub  ###############

(3)生成分割文件

perl  ../get_scaftig.pl  Ascaris_suum.scaf.fa.greater.Nchange  > Ascaris_suum.scaf.fa.greater.Nchange.split

get_scaftig.pl代码:

#!/usr/bin/perl -w##Author: Ruan Jue <ruanjue@genomics.org.cn>#use warnings;use strict; my $min_length = 0;my $name = '';my $seq = ''; while(<>){   if(/^>(\S+)/){      &print_scafftig($name, $seq) if($seq);      $name = $1;      $seq  = '';   } else {      chomp;      $seq .= 
{1}

; }}&print_scafftig($name, $seq) if($seq); 1; sub print_scafftig { my $name = shift; my $seq = shift; my $temp = $seq; my $id = 1; my $flag = 0; my $pos = 1; while($seq=~/([ATGCatgc]+)/g){ my $s = $1; if($flag==1){ if($temp=~/([ATGCatgc]+[Nn]+)/g){ my $g = $1; $pos+=length($g); } } else{$flag=1;} next if(length($s) < $min_length); my $length=length($s); my $end=$pos+$length-1;# print ">$name\_$id start=$pos length=".length($s)."\n"; print ">$name\_$id\t$pos\t$end\t$length\n"; while($s=~/(.{1,60})/g){ print "$1\n"; } $id++;}}

(4)生成.agp文件

perl ../AGP.pl Ascaris_suum.scaf.fa.greater.Nchange.split > Ascaris_suum.scaf.fa.agp

AGP.pl 代码:

#!/usr/bin/perluse strict; my $fasta=shift; my %Scaf;open IN,$fasta or die "$!";while(<IN>){        if (/^>(\S+)(_\d+)\s+(\d+)\s+(\d+)\s+(\d+)/){                my $scaf=$1;                my $contig=$1.$2;                my $start=$3;                my $end=$4;                my $len=$5;                push @{$Scaf{$scaf}},[$contig,$start,$end,$len];        }}close IN; foreach my $id ( sort keys %Scaf ){        @{$Scaf{$id}} = sort {$a->[1]<=>$b->[1]}  @{$Scaf{$id}};} foreach my $id ( sort { $Scaf{$b}[-1][2] <=> $Scaf{$a}[-1][2] } keys %Scaf ){        my $p=$Scaf{$id};        my $idx=1;        for(my $i=0;$i<@$p;$i++){                if ($i>0){                        print join("\t",$id,$p->[$i-1][2]+1,$p->[$i][1]-1,$idx,'N',($p->[$i][1]-1)-($p->[$i-1][2]+1)+1,'fragment','yes',"\t")."\n";                        $idx++;                }                print join("\t",$id,$p->[$i][1],$p->[$i][2],$idx,'W',$p->[$i][0],1,$p->[$i][3],'+')."\n";                $idx++;        }}

至此为止,第一步准备数据的过程已经完成。下一步就是向3811提交任务。

这个过程需要的东西:

l  Ascaris_suum.scaf.fa.greater.Nchange.gz

l  Ascaris_suum.agp.gz

和NCBI沟通的文件,主要是确定的Project ID的准确性。

未完待续.........

 

	
				
		
原创粉丝点击