blat analysis

Source: Internet | Published by: 声音鉴定软件 | Editor: 程序博客网 | Time: 2024/05/02 01:44

usage:
   blat database query [-ooc=11.ooc] output.psl
where:
   database and query are each either a .fa, .nib or .2bit file,
   or a list of these files with one file name per line.
   -ooc=11.ooc tells the program to load over-occurring 11-mers from
               an external file. This will increase the speed
               by a factor of 40 in many cases, but is not required.
   output.psl is where to put the output.
   Subranges of .nib and .2bit files may be specified using the syntax:
      /path/file.nib:seqid:start-end
   or
      /path/file.2bit:seqid:start-end
   or
      /path/file.nib:start-end
   With the second form, a sequence id of file:start-end will be used.
options:
   -t=type        Database type (the database sequences). Type is one of:
                    dna  - DNA sequence
                    prot - protein sequence
                    dnax - DNA sequence translated in six frames to protein
                  The default is dna.
   -q=type        Query type (the query sequences). Type is one of:
                    dna  - DNA sequence
                    rna  - RNA sequence
                    prot - protein sequence
                    dnax - DNA sequence translated in six frames to protein
                    rnax - DNA sequence translated in three frames to protein
                  The default is dna.
   -prot          Synonymous with -t=prot -q=prot.
   -ooc=N.ooc     Use overused tile file N.ooc. N should correspond to
                  the tileSize.
   -tileSize=N    Sets the size of match that triggers an alignment.
                  Usually between 8 and 12.
                  Default is 11 for DNA and 5 for protein.
   -stepSize=N    Spacing between tiles. Default is tileSize.
   -oneOff=N      If set to 1, this allows one mismatch in a tile and still
                  triggers an alignment. Default is 0.
   -minMatch=N    Sets the number of tile matches. Usually set from 2 to 4.
                  Default is 2 for nucleotide, 1 for protein.
   -minScore=N    Sets minimum score. This is the matches minus the
                  mismatches minus some sort of gap penalty. Default is 30.
   -minIdentity=N Sets minimum sequence identity (in percent). Default is
                  90 for nucleotide searches, 25 for protein or translated
                  protein searches.
   -maxGap=N      Sets the size of maximum gap between tiles in a clump.
                  Usually set from 0 to 3. Default is 2. Only relevant for
                  minMatch > 1.
   -noHead        Suppress .psl header (so it's just a tab-separated file).
   -makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
   -repMatch=N    Sets the number of repetitions of a tile allowed before
                  it is marked as overused. Typically this is 256 for
                  tileSize 12, 1024 for tileSize 11, 4096 for tileSize 10.
                  Default is 1024. Typically only comes into play with
                  makeOoc. Also affected by stepSize: when stepSize is
                  halved, repMatch is doubled to compensate.
   -mask=type     Mask out repeats. Alignments won't be started in masked
                  region but may extend through it in nucleotide searches.
                  Masked areas are ignored entirely in protein or
                  translated searches. Types are:
                    lower    - mask out lower-cased sequence
                    upper    - mask out upper-cased sequence
                    out      - mask according to database.out RepeatMasker
                               .out file
                    file.out - mask database according to RepeatMasker
                               file.out
   -qMask=type    Mask out repeats in query sequence. Similar to -mask
                  above, but for query rather than target sequence.
   -repeats=type  Type is same as mask types above. Repeat bases will not
                  be masked in any way, but matches in repeat areas will be
                  reported separately from matches in other areas in the
                  psl output.
   -minRepDivergence=NN
                  Minimum percent divergence of repeats to allow them to
                  be unmasked. Default is 15. Only relevant for masking
                  using RepeatMasker .out files.
   -dots=N        Output a dot every N sequences to show program's progress.
   -trimT         Trim leading poly-T.
   -noTrimA       Don't trim trailing poly-A.
   -trimHardA     Remove poly-A tail from qSize as well as alignments in
                  psl output.
   -fastMap       Run for fast DNA/DNA remapping - not allowing introns,
                  requiring high %ID.
   -out=type      Controls output file format. Type is one of:
                    psl     - Default. Tab-separated format, no sequence
                    pslx    - Tab-separated format with sequence
                    axt     - blastz-associated axt format
                    maf     - multiz-associated maf format
                    sim4    - similar to sim4 format
                    wublast - similar to wublast format
                    blast   - similar to NCBI blast format
                    blast8  - NCBI blast tabular format
                    blast9  - NCBI blast tabular format with comments
   -fine          For high-quality mRNAs, look harder for small initial and
                  terminal exons. Not recommended for ESTs.
   -maxIntron=N   Sets maximum intron size. Default is 750000.
   -extendThroughN
                  Allows extension of alignment through large blocks of N's.
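The subrange syntax listed in the usage text can be easier to follow when split into its parts. The following is a minimal Python sketch, not part of blat itself: the function name parse_subrange is hypothetical, and it assumes the file path contains no colon. It only recognizes the three forms listed in the usage text.

```python
import re

# Matches the trailing "start-end" part of a subrange specification.
_RANGE = re.compile(r"^(\d+)-(\d+)$")


def parse_subrange(spec):
    """Split a blat-style subrange spec into (path, seq_id, start, end).

    Hypothetical helper: seq_id, start and end are None when absent.
    Assumes the path itself contains no ':' character.
    """
    parts = spec.rsplit(":", 2)
    match = _RANGE.match(parts[-1]) if len(parts) > 1 else None
    if match is None:
        # No recognizable start-end suffix: treat as a plain file name.
        return spec, None, None, None
    start, end = int(match.group(1)), int(match.group(2))
    if len(parts) == 3:
        # /path/file.2bit:seqid:start-end or /path/file.nib:seqid:start-end
        return parts[0], parts[1], start, end
    # /path/file.nib:start-end
    return parts[0], None, start, end


print(parse_subrange("/path/file.2bit:chr1:100-200"))
print(parse_subrange("/path/file.nib:100-200"))
```

The spec is split from the right so that the numeric range is identified first; anything that does not end in start-end is passed through unchanged as a file name.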

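BLAT, short for the BLAST-Like Alignment Tool, was developed by W. James Kent in 2002. As the Human Genome Project progressed, there was an urgent need to map large numbers of genes and ESTs quickly onto large genomes. BLAST has several drawbacks for this kind of alignment: it is relatively slow, its results are hard to post-process, and it cannot represent a gene placement that spans introns. BLAT was created in response to this situation.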

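BLAT's main strengths are speed and simple, readable output in which collinear alignments are merged. For aligning relatively short sequences (such as cDNAs) against a large genome, BLAT is the natural first choice. It joins related, collinear alignments into one larger alignment, from which exons and introns are easy to identify. As a result, BLAT is widely used for gene homology analysis between closely related species and for EST analysis.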

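As the figure shows, BLAST reports every alignment as a separate hit, whereas BLAT joins alignments that are collinear with one another and reports them as a single hit.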

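BLAT input files are normally in FASTA format (the usage text also lists .nib and .2bit files). Running it is very simple: no database has to be built beforehand, and the alignment can be started directly. The basic BLAT command is:

blat database query [-options] output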

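When the program runs normally, it prints statistics about the database to the screen once it has read all the subject sequences in the database:

Loaded 1493629 letters in 486 sequences    ### the 486 sequences contain 1493629 letters

Searched 1493629 bases in 486 sequences    ### the database was aligned against itself

The default output is a tabular text file in psl format.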

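A psl file contains detailed alignment position information, and the meaning of each column is listed at the top of the file. Columns 1-8 are overall alignment statistics, including the number of matching bases, the number of mismatches, and the number and total length of gaps on the query and on the subject. Columns 9-17 give the alignment position: the strand, the names and lengths of the query and the subject, and the start and end positions of the alignment on each. Columns 18-21 describe the exactly aligned (ungapped) blocks: the number of blocks, the length of each block, and each block's start position on the query and on the subject.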
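The column layout described above can be sketched as a small parser. This is a minimal Python illustration, not part of blat: the function name parse_psl_line is hypothetical, the field names follow the UCSC description of the PSL format, and the sample line is made up for the example (it is not real blat output). It assumes a 21-column psl data line, that is, a line from a file written with -noHead or one read after the header has been skipped.

```python
# Field names as given in the UCSC PSL format description, in column order.
PSL_COLUMNS = [
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",   # columns 1-8
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",                         # columns 9-17
    "blockCount", "blockSizes", "qStarts", "tStarts",           # columns 18-21
]
STRING_COLUMNS = {"strand", "qName", "tName"}
LIST_COLUMNS = {"blockSizes", "qStarts", "tStarts"}


def parse_psl_line(line):
    """Turn one tab-separated psl data line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(PSL_COLUMNS):
        raise ValueError("expected at least 21 tab-separated columns")
    record = {}
    for name, value in zip(PSL_COLUMNS, fields):
        if name in STRING_COLUMNS:
            record[name] = value
        elif name in LIST_COLUMNS:
            # Lists are comma-separated with a trailing comma, e.g. "30,30,".
            record[name] = [int(x) for x in value.split(",") if x]
        else:
            record[name] = int(value)
    return record


# Made-up example line: a 60-base query aligned in two 30-base blocks.
example = "\t".join([
    "59", "1", "0", "0", "0", "0", "1", "100",
    "+", "query1", "60", "0", "60",
    "chr1", "5000", "1000", "1160",
    "2", "30,30,", "0,30,", "1000,1130,",
])
record = parse_psl_line(example)
print(record["qName"], record["tName"], record["blockSizes"], record["tStarts"])
```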

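Keep the following points in mind when reading psl output:

1. BLAT allows very large gaps on the subject (intron regions), so the regions covered by a single hit on the query and on the subject can differ greatly in length. This is different from BLAST.

2. When aligning a gene against a genome, the number of blocks cannot be equated with the number of exons. BLAT defines a block as an alignment without insertions or deletions, so any inserted or deleted base ends a block, and a single exon may well consist of many blocks. The numbers of exons and introns therefore have to be judged from gaps that are sufficiently large.

3. Base positions in psl output are counted from 0, not from 1.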
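Points 2 and 3 can be illustrated with a short Python sketch. This is only a sketch under stated assumptions, not how blat or any UCSC tool does it: the function names and the min_intron threshold of 20 bases are hypothetical and should be tuned to the data. It assumes a plain nucleotide alignment whose tStarts are on the forward strand of the subject, and it relies on the UCSC PSL convention that intervals are 0-based with an exclusive end.

```python
def blocks_to_exons(block_sizes, t_starts, min_intron=20):
    """Merge adjacent blocks into putative exons on the subject.

    Blocks separated on the subject by fewer than min_intron bases
    (small indels) are merged; a larger gap is treated as an intron.
    min_intron=20 is an arbitrary illustrative threshold.
    """
    exons = []
    for size, start in zip(block_sizes, t_starts):
        end = start + size
        if exons and start - exons[-1][1] < min_intron:
            exons[-1][1] = end            # small gap: extend the current exon
        else:
            exons.append([start, end])    # large gap: start a new exon
    return [(s, e) for s, e in exons]


def to_one_based(start, end):
    """Convert a 0-based, end-exclusive interval to 1-based, inclusive."""
    return start + 1, end


# Three blocks: a 2-base gap after the first, a 443-base gap after the second.
exons = blocks_to_exons([30, 25, 40], [1000, 1032, 1500])
print(exons)                    # [(1000, 1057), (1500, 1540)]
print(to_one_based(*exons[0]))  # (1001, 1057)
```

Here three blocks collapse into two putative exons, which is why counting blocks alone would overestimate the number of exons.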

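When running different kinds of alignments, note that -t and -q must be set to compatible types: both protein-level or both nucleotide-level. For example, if the database and the query are both protein sequences and both are declared as prot, the alignment runs normally. If the database is a DNA sequence and the query is a protein sequence, then -q=prot must be accompanied by -t=dnax. The examples below use the DNA and protein sequences of the same gene.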

Command 1:

blat cdna.seq pro.seq -q=prot out.psl

The program exits with an error:

d and q must both be either protein or dna

Command 2:

blat cdna.seq pro.seq -t=dnax -q=prot -noHead out.psl

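This time the run completes correctly.

Note how the output of a protein-level alignment differs from that of a nucleotide alignment: the strand column shows two '+' signs, indicating that both the query and the subject are aligned in the forward direction.

Command 3, a protein-level alignment of nucleotide sequences: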

blat cdna.seq cdna.seq -t=dnax -q=dnax -noHead out.psl
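The type rule that these three commands illustrate can be sketched as a pre-flight check in Python. This is a minimal sketch, not blat's own logic: the helper names passes_type_check and blat_argv are hypothetical, and the grouping of prot, dnax and rnax as protein-level types is an assumption drawn from the option descriptions and the examples above. Passing the check is a necessary condition only; blat may still reject some combinations. The sketch only builds the argument list and does not run blat.

```python
TARGET_TYPES = {"dna", "prot", "dnax"}
QUERY_TYPES = {"dna", "rna", "prot", "dnax", "rnax"}
# Assumption: these types are compared at the protein level.
PROTEIN_LIKE = {"prot", "dnax", "rnax"}


def passes_type_check(t_type="dna", q_type="dna"):
    """Return True when -t and -q are both protein-level or both nucleotide."""
    if t_type not in TARGET_TYPES:
        raise ValueError(f"unknown -t type: {t_type}")
    if q_type not in QUERY_TYPES:
        raise ValueError(f"unknown -q type: {q_type}")
    return (t_type in PROTEIN_LIKE) == (q_type in PROTEIN_LIKE)


def blat_argv(database, query, output, t_type="dna", q_type="dna", no_head=False):
    """Build a blat argument list, refusing mismatched -t/-q types up front."""
    if not passes_type_check(t_type, q_type):
        raise ValueError("d and q must both be either protein or dna")
    argv = ["blat", database, query, f"-t={t_type}", f"-q={q_type}"]
    if no_head:
        argv.append("-noHead")
    argv.append(output)
    return argv


# Command 1 fails the check; commands 2 and 3 pass it.
print(passes_type_check("dna", "prot"))    # False
print(passes_type_check("dnax", "prot"))   # True
print(blat_argv("cdna.seq", "cdna.seq", "out.psl", "dnax", "dnax", no_head=True))
```

The resulting list could be handed to subprocess.run on a machine where blat is installed.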