blat analysis

Source: Internet. Time: 2024/05/02 01:44
usage:
   blat database query [-ooc=11.ooc] output.psl
where:
   database and query are each either a .fa , .nib or .2bit file,
   or a list these files one file name per line.
   -ooc=11.ooc tells the program to load over-occurring 11-mers from
               and external file.  This will increase the speed
               by a factor of 40 in many cases, but is not required
   output.psl is where to put the output.
   Subranges of nib and .2bit files may specified using the syntax:
      /path/file.nib:seqid:start-end
   or
      /path/file.2bit:seqid:start-end
   or
      /path/file.nib:start-end
   With the second form, a sequence id of file:start-end will be used.
options:
   -t=type     Database type.  Type is one of:   [the database sequences]
                 dna - DNA sequence
                 prot - protein sequence
                 dnax - DNA sequence translated in six frames to protein
               The default is dna
   -q=type     Query type.  Type is one of:   [the query sequences]
                 dna - DNA sequence
                 rna - RNA sequence
                 prot - protein sequence
                 dnax - DNA sequence translated in six frames to protein
                 rnax - DNA sequence translated in three frames to protein
               The default is dna
   -prot       Synonymous with -t=prot -q=prot
   -ooc=N.ooc  Use overused tile file N.ooc.  N should correspond to
               the tileSize
   -tileSize=N sets the size of match that triggers an alignment.
               Usually between 8 and 12
               Default is 11 for DNA and 5 for protein.
   -stepSize=N spacing between tiles. Default is tileSize.
   -oneOff=N   If set to 1 this allows one mismatch in tile and still
               triggers an alignments.  Default is 0.
   -minMatch=N sets the number of tile matches.  Usually set from 2 to 4
               Default is 2 for nucleotide, 1 for protein.
   -minScore=N sets minimum score.  This is the matches minus the
               mismatches minus some sort of gap penalty.  Default is 30
   -minIdentity=N Sets minimum sequence identity (in percent).  Default is
               90 for nucleotide searches, 25 for protein or translated
               protein searches.
   -maxGap=N   sets the size of maximum gap between tiles in a clump.  Usually
               set from 0 to 3.  Default is 2. Only relevent for minMatch > 1.
   -noHead     suppress .psl header (so it's just a tab-separated file)
   -makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
   -repMatch=N sets the number of repetitions of a tile allowed before
               it is marked as overused.  Typically this is 256 for tileSize
               12, 1024 for tile size 11, 4096 for tile size 10.
               Default is 1024.  Typically only comes into play with makeOoc.
               Also affected by stepSize. When stepSize is halved repMatch is
               doubled to compensate.
   -mask=type  Mask out repeats.  Alignments won't be started in masked region
               but may extend through it in nucleotide searches.  Masked areas
               are ignored entirely in protein or translated searches. Types are
                 lower - mask out lower cased sequence
                 upper - mask out upper cased sequence
                 out   - mask according to database.out RepeatMasker .out file
                 file.out - mask database according to RepeatMasker file.out
   -qMask=type Mask out repeats in query sequence.  Similar to -mask above but
               for query rather than target sequence.
   -repeats=type Type is same as mask types above.  Repeat bases will not be
               masked in any way, but matches in repeat areas will be reported
               separately from matches in other areas in the psl output.
   -minRepDivergence=NN - minimum percent divergence of repeats to allow
               them to be unmasked.  Default is 15.  Only relevant for
               masking using RepeatMasker .out files.
   -dots=N     Output dot every N sequences to show program's progress
   -trimT      Trim leading poly-T
   -noTrimA    Don't trim trailing poly-A
   -trimHardA  Remove poly-A tail from qSize as well as alignments in
               psl output
   -fastMap    Run for fast DNA/DNA remapping - not allowing introns,
               requiring high %ID
   -out=type   Controls output file format.  Type is one of:
                   psl - Default.  Tab separated format, no sequence
                   pslx - Tab separated format with sequence
                   axt - blastz-associated axt format
                   maf - multiz-associated maf format
                   sim4 - similar to sim4 format
                   wublast - similar to wublast format
                   blast - similar to NCBI blast format
                   blast8- NCBI blast tabular format
                   blast9 - NCBI blast tabular format with comments
   -fine       For high quality mRNAs look harder for small initial and
               terminal exons.  Not recommended for ESTs
   -maxIntron=N  Sets maximum intron size. Default is 750000
   -extendThroughN - Allows extension of alignment through large blocks of N's
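
The -repMatch rule of thumb in the help text can be written down as a small lookup. This is only a sketch: the table holds just the three tileSize values the help text lists, the function name suggested_rep_match is made up for illustration, and scaling by tileSize/stepSize is an assumption read from "when stepSize is halved repMatch is doubled", not a copy of blat's own code.

```python
# Typical -repMatch values quoted in the blat help text, keyed by tileSize.
TYPICAL_REP_MATCH = {12: 256, 11: 1024, 10: 4096}


def suggested_rep_match(tile_size, step_size=None):
    """Return a -repMatch value following the help text's rule of thumb.

    Hypothetical helper: only tile sizes 10, 11 and 12 are covered, because
    those are the only ones the help text gives numbers for.
    """
    if tile_size not in TYPICAL_REP_MATCH:
        raise ValueError(f"no typical repMatch listed for tileSize={tile_size}")
    if step_size is None:
        step_size = tile_size  # documented default: stepSize equals tileSize
    if step_size <= 0:
        raise ValueError("stepSize must be positive")
    # Assumption: scale inversely with stepSize, so halving it doubles repMatch.
    return TYPICAL_REP_MATCH[tile_size] * tile_size // step_size


if __name__ == "__main__":
    print(suggested_rep_match(11))     # default step size
    print(suggested_rep_match(12, 6))  # step size halved
```

The result would be passed as -repMatch=N together with -makeOoc when building an .ooc file for a complete genome.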


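BLAT, short for the BLAST-Like Alignment Tool, was developed by W. James Kent in 2002. As the Human Genome Project progressed, quickly mapping large numbers of genes and ESTs onto large genomes became an urgent need. BLAST has several shortcomings for this kind of alignment: it is relatively slow, its results are hard to post-process, and it cannot represent a gene mapping that spans introns. BLAT was created to fill this gap.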

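BLAT's main strengths are speed and colinear output that is simple and easy to read. For aligning relatively small sequences (such as cDNAs) against a large genome, BLAT is usually the first choice. It joins related, colinear alignments into a larger alignment, from which exons and introns are easy to identify. As a result, BLAT is widely used for gene homology analysis between closely related species and for EST analysis.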

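As shown in the figure below, BLAST reports each alignment as a separate hit, whereas BLAT joins alignments that are colinear with one another and reports them as a single hit.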

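BLAT input files must be in FASTA format (the usage text above also lists .nib and .2bit files). Running it is very simple: no database-building step is needed before aligning. The basic BLAT command is: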

blat  database  query  [-options]  output.psl

During a normal run, once BLAT has read all subject sequences in the database, it prints summary statistics for the database to the screen:

Loaded 1493629 letters in 486 sequences   ### the 486 sequences contain 1493629 letters in total

Searched 1493629 bases in 486 sequences   ### here the database was aligned against itself

The default output is a tabular text file in psl format.

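The psl output contains detailed alignment coordinates, and the meaning of each column is listed at the top of the file. Columns 1-8 are overall alignment statistics, including the number of matching bases, the number of mismatches, and the number and total length of gaps in the query and in the subject. Columns 9-17 give position information: the strand, and the name, length, and alignment start and end positions for the query and for the subject. Columns 18-21 describe the gapless aligned blocks: the number of blocks, the length of each block, and each block's start position in the query and in the subject.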
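
The column layout just described can be sketched as a small parser. The column names below follow the psl header; the function name parse_psl_line and the example record are invented for illustration (the record is synthetic, not real blat output), and the parser assumes a header-free line such as one produced with -noHead.

```python
# The 21 psl columns in file order.
PSL_COLUMNS = (
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
)
TEXT_COLUMNS = {"strand", "qName", "tName"}
LIST_COLUMNS = {"blockSizes", "qStarts", "tStarts"}


def parse_psl_line(line):
    """Parse one tab-separated psl data line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(PSL_COLUMNS):
        raise ValueError(f"expected 21 columns, got {len(fields)}")
    record = {}
    for name, value in zip(PSL_COLUMNS, fields):
        if name in TEXT_COLUMNS:
            record[name] = value
        elif name in LIST_COLUMNS:
            # List columns are comma-separated with a trailing comma.
            record[name] = [int(x) for x in value.split(",") if x]
        else:
            record[name] = int(value)
    return record


# Synthetic example record: two 30-base blocks separated by a 200-base target gap.
EXAMPLE_LINE = "\t".join([
    "59", "1", "0", "0", "0", "0", "1", "200",
    "+", "query1", "60", "0", "60",
    "chrT", "1000", "100", "360",
    "2", "30,30,", "0,30,", "100,330,",
])

if __name__ == "__main__":
    rec = parse_psl_line(EXAMPLE_LINE)
    print(rec["qName"], rec["tName"], rec["blockSizes"], rec["tStarts"])
```

When the file still has its header, the header lines would need to be skipped before calling the parser.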

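Keep the following points in mind when reading psl output:

1. BLAT allows very large gaps on the subject (intron regions), so the regions that a single hit covers on the query and on the subject can differ greatly in size. This is different from BLAST.

2. When aligning genes to a genome, the number of blocks is not the number of exons. BLAT defines a block as an alignment with no insertions or deletions, so any inserted or deleted base ends a block, and a single exon may well consist of many blocks. The number of exons and introns should therefore be judged from gaps that are large enough.

3. Base positions in psl output are counted from 0, not from 1.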
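
Points 2 and 3 can be sketched in a few lines: merge neighbouring blocks whose gap on the subject is too small to be an intron, then convert the coordinates for display. The function names and the min_intron threshold of 20 bases are assumptions chosen for illustration, and the sketch assumes a plus-strand hit so that tStarts can be used directly. The conversion relies on psl coordinates being 0-based with an exclusive end.

```python
def group_blocks_into_exons(block_sizes, t_starts, min_intron=20):
    """Merge adjacent blocks separated by a subject gap shorter than min_intron.

    Returns a list of (start, end) pairs in psl-style coordinates
    (0-based start, exclusive end). min_intron is a hypothetical cutoff.
    """
    exons = []
    for size, start in zip(block_sizes, t_starts):
        end = start + size
        if exons and start - exons[-1][1] < min_intron:
            # Small gap: treat it as an indel inside the same exon.
            exons[-1] = (exons[-1][0], end)
        else:
            # First block, or a gap large enough to count as an intron.
            exons.append((start, end))
    return exons


def to_one_based(start, end):
    """Convert a 0-based, end-exclusive interval to 1-based, end-inclusive."""
    return start + 1, end


if __name__ == "__main__":
    # Three blocks: a 2-base gap (indel) followed by a 358-base gap (intron).
    exons = group_blocks_into_exons([30, 10, 40], [100, 132, 500])
    for start, end in exons:
        print(to_one_based(start, end))
```

A suitable threshold depends on the organism and the data, so min_intron would need tuning in practice.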

 

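When running different kinds of alignments, note that "-t" and "-q" must be set to the same kind of type, that is, both nucleotide or both protein-level. For example, if the database and the query are both protein sequences and both are declared as "prot", the alignment runs normally. If the database is DNA and the query is protein, then in addition to "-q=prot" you must also set "-t=dnax". The examples below use the DNA and protein sequences of the same gene.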

Command 1:

blat  cdna.seq  pro.seq  -q=prot  out.psl

The program exits with an error:

d  and  q  must both be either protein or dna

Command 2:

blat  cdna.seq  pro.seq  -t=dnax  -q=prot  -noHead  out.psl

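This command runs successfully.

Note how the output of a protein-level alignment differs from that of a nucleotide alignment: the strand column shows two "+" signs, meaning that both the query and the subject are aligned in the forward direction.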
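
A minimal sketch of reading the strand column in both cases follows. It assumes the usual psl convention that a one-character strand refers to the query (with the subject taken as "+"), and that in a two-character strand the first character is the query and the second is the subject. The function name split_strand is made up for illustration.

```python
def split_strand(strand):
    """Return (query_strand, subject_strand) for a psl strand field."""
    if any(ch not in "+-" for ch in strand):
        raise ValueError(f"unexpected strand value: {strand!r}")
    if len(strand) == 1:
        # Nucleotide alignment: only the query strand is given.
        return strand, "+"
    if len(strand) == 2:
        # Translated alignment: query strand first, subject strand second.
        return strand[0], strand[1]
    raise ValueError(f"unexpected strand value: {strand!r}")


if __name__ == "__main__":
    print(split_strand("++"))
    print(split_strand("-"))
```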

Command 3, a protein-level alignment of nucleotide sequences:

blat  cdna.seq  cdna.seq  -t=dnax  -q=dnax  -noHead out.psl
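
The three commands above can be assembled programmatically with a pre-flight check of the type rule. This is a sketch under stated assumptions: build_blat_command is a hypothetical helper, the allowed type names are taken from the help text, and the check only covers the coarse "both nucleotide or both protein-level" rule described here. blat itself may reject further combinations.

```python
T_TYPES = {"dna", "prot", "dnax"}                 # values listed for -t
Q_TYPES = {"dna", "rna", "prot", "dnax", "rnax"}  # values listed for -q
NUCLEOTIDE = {"dna", "rna"}


def build_blat_command(database, query, output,
                       t_type="dna", q_type="dna", no_head=False):
    """Return a blat argument list, or raise ValueError on mismatched types."""
    if t_type not in T_TYPES:
        raise ValueError(f"unknown -t type: {t_type}")
    if q_type not in Q_TYPES:
        raise ValueError(f"unknown -q type: {q_type}")
    # Coarse rule from the text: both nucleotide, or both protein-level.
    if (t_type in NUCLEOTIDE) != (q_type in NUCLEOTIDE):
        raise ValueError("-t and -q must both be nucleotide or both protein-level")
    cmd = ["blat", database, query, f"-t={t_type}", f"-q={q_type}"]
    if no_head:
        cmd.append("-noHead")
    cmd.append(output)
    return cmd


if __name__ == "__main__":
    # Equivalent of command 2 above.
    print(" ".join(build_blat_command("cdna.seq", "pro.seq", "out.psl",
                                      t_type="dnax", q_type="prot",
                                      no_head=True)))
```

If blat is installed, the returned list could be handed to subprocess.run(cmd, check=True) to execute it.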
