Using bowtie and samtools in TopHat


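Introduction to Bowtie

1 Bowtie differs from general-purpose alignment tools: it is designed for aligning short reads to large genomes, although it also supports small reference sequences such as amplicons, and reads as long as 1024 bases. Bowtie takes a genome index and a set of reads as input and outputs a list of alignments. Its design assumes that 1) a short read has at least one good match in the genome, 2) most short reads are of relatively high quality, and 3) a read ideally has only one best-matching position in the genome. These criteria generally agree with the requirements of RNA-seq, ChIP-seq, other emerging sequencing technologies, and resequencing.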

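2 Bowtie has two alignment modes:

-n (the default mode)

In this mode an alignment may have no more than N mismatches, where N ranges from 0 to 3. The count applies only to mismatches within the seed.

In addition, the sum of the Phred quality values at all mismatched positions must not exceed the value of the -e option. For FASTA files, which carry no quality values, the quality defaults to 40.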

-v

An alignment may have no more than V mismatches, where V ranges from 0 to 3. Quality values are ignored in this mode.
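The two modes can be contrasted with a small Python sketch. This is not Bowtie's implementation: the function names and the representation of mismatches as (position, Phred quality) pairs, with positions counted from 0 at the 5' end, are assumptions made for illustration, and Maq-like quality rounding is ignored.

```python
def valid_v(mismatches, v=2):
    """-v mode: at most v mismatches over the whole read; qualities ignored."""
    return len(mismatches) <= v


def valid_n(mismatches, n=2, seed_len=28, max_qual_sum=70):
    """-n mode: at most n mismatches inside the seed (first seed_len bases),
    and the Phred sum over ALL mismatched positions must not exceed -e."""
    seed_mm = sum(1 for pos, _ in mismatches if pos < seed_len)
    qual_sum = sum(q for _, q in mismatches)
    return seed_mm <= n and qual_sum <= max_qual_sum


# Hypothetical read with three mismatches: one in the seed, two outside it.
mms = [(5, 30), (30, 20), (33, 10)]
in_v_mode = valid_v(mms, v=2)   # three mismatches in total
in_n_mode = valid_n(mms)        # one seed mismatch, quality sum 60
```

With these made-up values the read is rejected under -v 2 but accepted under the default -n settings, because only one of its mismatches falls inside the seed.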

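Strata

In -n mode, the stratum of an alignment is the number of mismatches in the seed region, whose length is set with the -l option. In -v mode, the stratum is the number of mismatches over the entire alignment; it is used together with the -m option.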
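A minimal sketch of how strata could be computed and used to keep only the best stratum, in the spirit of --best --strata. The helper names and the representation of an alignment as a list of 0-based mismatch positions are hypothetical; Bowtie's real search is far more involved.

```python
def stratum(mismatch_positions, mode="v", seed_len=28):
    """Return the stratum of one alignment.

    mode "n": number of mismatches within the seed (first seed_len bases).
    mode "v": number of mismatches over the whole alignment.
    """
    if mode == "n":
        return sum(1 for pos in mismatch_positions if pos < seed_len)
    return len(mismatch_positions)


def best_stratum_only(alignments, mode="v", seed_len=28):
    """Keep only the alignments that fall in the lowest (best) stratum."""
    if not alignments:
        return []
    scored = [(stratum(a, mode, seed_len), a) for a in alignments]
    best = min(s for s, _ in scored)
    return [a for s, a in scored if s == best]


# Three hypothetical alignments of one read, given as mismatch positions.
alns = [[3, 31], [40, 45], [2, 10, 35]]
kept_v = best_stratum_only(alns, mode="v")
kept_n = best_stratum_only(alns, mode="n")
```

The same three alignments are ranked differently in the two modes: -v counts every mismatch, while -n counts only those inside the seed.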

Reporting options: -k, -a, -m, -M, --best, --strata

3 Using Bowtie

3.1 Usage:

  bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | <s>} [<hit>]

 

  <m1>    Comma-separated list of files containing upstream mates (or the

          sequences themselves, if -c is set) paired with mates in <m2>

  <m2>    Comma-separated list of files containing downstream mates (or the

          sequences themselves if -c is set) paired with mates in <m1>

  <r>     Comma-separated list of files containing Crossbow-style reads.  Can be

          a mixture of paired and unpaired.  Specify "-" for stdin.

  <s>     Comma-separated list of files containing unpaired reads, or the

          sequences themselves, if -c is set.  Specify "-" for stdin.

  <hit>   File to write hits to (default: stdout)

3.2 Input options:

Input:

  -q                 query input files are FASTQ .fq/.fastq (default)

  -f                 query input files are (multi-)FASTA .fa/.mfa

  -r                 query input files are raw one-sequence-per-line

  -c                 query sequences given on cmd line (as <mates>, <singles>)

  -C                 reads and index are in colorspace

  -Q/--quals <file>  QV file(s) corresponding to CSFASTA inputs; use with -f -C

  --Q1/--Q2 <file>   same as -Q, but for mate files 1 and 2 respectively

  -s/--skip <int>    skip the first <int> reads/pairs in the input

  -u/--qupto <int>   stop after first <int> reads/pairs (excl. skipped reads)

  -5/--trim5 <int>   trim <int> bases from 5' (left) end of reads

  -3/--trim3 <int>   trim <int> bases from 3' (right) end of reads

  --phred33-quals    input quals are Phred+33 (default)

  --phred64-quals    input quals are Phred+64 (same as --solexa1.3-quals)

  --solexa-quals     input quals are from GA Pipeline ver. < 1.3

  --solexa1.3-quals  input quals are from GA Pipeline ver. >= 1.3

  --integer-quals    qualities are given as space-separated integers (not ASCII)

When TopHat calls bowtie, the input option it uses is -q.
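The --phred33-quals and --phred64-quals options differ only in the ASCII offset used to encode each quality score. The sketch below shows the decoding step; decode_quals is a hypothetical helper, not part of Bowtie, and it does not cover the older --solexa-quals scale.

```python
def decode_quals(qual_string, offset=33):
    """Convert an ASCII quality string to a list of Phred scores.

    offset=33 corresponds to --phred33-quals, offset=64 to --phred64-quals.
    """
    scores = [ord(ch) - offset for ch in qual_string]
    if min(scores, default=0) < 0:
        raise ValueError("negative score: wrong offset for this quality string?")
    return scores


# The same four scores written in the two encodings.
phred33 = decode_quals("II5+", offset=33)
phred64 = decode_quals("hhTJ", offset=64)
```

Decoding a Phred+33 string with offset 64 usually produces negative scores, which is why the sketch raises an error in that case.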

3.3 Alignment options:

Alignment:

  -v <int>           report end-to-end hits w/ <=v mismatches; ignore qualities

    or

  -n/--seedmms <int> max mismatches in seed (can be 0-3, default: -n 2)

  -e/--maqerr <int>  max sum of mismatch quals across alignment for -n (def: 70)

  -l/--seedlen <int> seed length for -n (default: 28)

  --nomaqround       disable Maq-like quality rounding for -n (nearest 10 <= 30)

  -I/--minins <int>  minimum insert size for paired-end alignment (default: 0)

  -X/--maxins <int>  maximum insert size for paired-end alignment (default: 250)

  --fr/--rf/--ff     -1, -2 mates align fw/rev, rev/fw, fw/fw (default: --fr)

  --nofw/--norc      do not align to forward/reverse-complement reference strand

  --maxbts <int>     max # backtracks for -n 2/3 (default: 125, 800 for --best)

  --pairtries <int>  max # attempts to find mate for anchor hit (default: 100)

  -y/--tryhard       try hard to find valid alignments, at the expense of speed

  --chunkmbs <int>   max megabytes of RAM for best-first search frames (def: 64)

When TopHat calls bowtie, the alignment option it uses is -v 2.

3.4 Reporting options:

Reporting:

  -k <int>           report up to <int> good alignments per read (default: 1)

  -a/--all           report all alignments per read (much slower than low -k)

  -m <int>           suppress all alignments if > <int> exist (def: no limit)

  -M <int>           like -m, but reports 1 random hit (MAPQ=0); requires --best

  --best             hits guaranteed best stratum; ties broken by quality (reported best to worst)

  --strata           hits in sub-optimal strata aren't reported (requires --best)

When TopHat calls bowtie, the reporting options it uses are -k 40 -m 40.
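The combined effect of -k and -m on a single read can be sketched as follows. This is a simplified model under stated assumptions: report is a hypothetical function, the alignments are placeholder strings, and Bowtie itself may stop searching before it has enumerated every alignment.

```python
def report(alignments, k=1, m=None):
    """Apply -m, then -k, to the valid alignments found for one read.

    -m <int>: if more than m alignments exist, report none of them.
    -k <int>: otherwise report up to k alignments.
    """
    if m is not None and len(alignments) > m:
        return []
    return alignments[:k]


hits = [f"hit{i}" for i in range(45)]   # hypothetical read with 45 alignments
few = hits[:3]                          # hypothetical read with 3 alignments

suppressed = report(hits, k=40, m=40)   # 45 > 40, so nothing is reported
reported = report(few, k=40, m=40)      # all three are reported
```

With -k 40 -m 40, a read is therefore either reported with all of its alignments (up to 40) or dropped entirely as too repetitive.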

3.5 Output options:

Output:

  -t/--time          print wall-clock time taken by search phases 

  -B/--offbase <int> leftmost ref offset = <int> in bowtie output (default: 0)

  --quiet            print nothing but the alignments

  --refout           write alignments to files refXXXXX.map, 1 map per reference

  --refidx           refer to ref. seqs by 0-based index rather than name

  --al <fname>       write aligned reads/pairs to file(s) <fname>

  --un <fname>       write unaligned reads/pairs to file(s) <fname>

  --max <fname>      write reads/pairs over -m limit to file(s) <fname>

  --suppress <cols>  suppresses given columns (comma-delim'ed) in default output

  --fullref          write entire ref name (default: only up to 1st space)

When TopHat calls bowtie, the output options it uses are --un and --max.

Colorspace:

  --snpphred <int>   Phred penalty for SNP when decoding colorspace (def: 30)

     or

  --snpfrac <dec>    approx. fraction of SNP bases (e.g. 0.001); sets --snpphred

  --col-cseq         print aligned colorspace seqs as colors, not decoded bases

  --col-cqual        print original colorspace quals, not decoded quals

  --col-keepends     keep nucleotides at extreme ends of decoded alignment

SAM:

  -S/--sam           write hits in SAM format

  --mapq <int>       default mapping quality (MAPQ) to print for SAM alignments

  --sam-nohead       suppress header lines (starting with @) for SAM output

  --sam-nosq         suppress @SQ header lines for SAM output

  --sam-RG <text>    add <text> (usually "lab=value") to @RG line of SAM header

Performance:

  -o/--offrate <int> override offrate of index; must be >= index's offrate

  -p/--threads <int> number of alignment threads to launch (default: 1)

  --mm               use memory-mapped I/O for index; many 'bowtie's can share

  --shmem            use shared mem for index; many 'bowtie's can share

Other:

  --seed <int>       seed for random number generator

  --verbose          verbose output (for debugging)

  --version          print version information and quit

  -h/--help          print this usage message

When TopHat calls bowtie, the performance option it uses is -p, which is also an option of TopHat itself.
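Putting together the options that the text attributes to TopHat (-q, -v 2, -k 40 -m 40, --un, --max, -p), a bowtie command line could be assembled as in the sketch below. The index basename and all file names are hypothetical, and the command TopHat actually records in run.log may contain further options.

```python
import shlex


def tophat_style_bowtie_cmd(index, reads, threads=1,
                            unaligned="unmapped.fq", over_limit="maxed.fq"):
    """Build a bowtie argument list with the options noted in the text."""
    return [
        "bowtie",
        "-q",                    # FASTQ input
        "-v", "2",               # at most 2 mismatches, qualities ignored
        "-k", "40", "-m", "40",  # reporting limits
        "-p", str(threads),      # alignment threads
        "--un", unaligned,       # reads that failed to align
        "--max", over_limit,     # reads that exceeded the -m limit
        index, reads,
    ]


# Hypothetical index basename and read file.
cmd = tophat_style_bowtie_cmd("genome_index", "reads.fq", threads=4)
cmd_line = shlex.join(cmd)
```

Keeping the command as a list makes it safe to pass to subprocess.run without shell quoting issues; shlex.join is only used to show it as a single line.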

 

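4 Introduction to bowtie-build

bowtie-build builds a Bowtie index from a set of DNA sequences. It outputs six index files with the suffix .ebwt. Together these files constitute the index, and they are all that is needed to align reads to the reference. Once the index has been built, the original sequence files are no longer needed.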
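Before aligning, it can be useful to confirm that all six index files are present. The sketch below assumes the standard small-index suffixes (.1.ebwt through .4.ebwt plus .rev.1.ebwt and .rev.2.ebwt); the basename "genome_index" and the helper name are hypothetical.

```python
from pathlib import Path

# The six files bowtie-build writes for a given basename (assumed suffixes).
EBWT_SUFFIXES = (".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt",
                 ".rev.1.ebwt", ".rev.2.ebwt")


def missing_index_files(basename):
    """Return the expected index files for basename that do not exist."""
    expected = [Path(basename + suffix) for suffix in EBWT_SUFFIXES]
    return [p for p in expected if not p.exists()]


# "genome_index" is a hypothetical basename passed to bowtie-build.
missing = missing_index_files("genome_index")
index_complete = not missing
```

An empty result means the index is complete and the original FASTA file is no longer required for alignment.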

4.1 Options

Usage: bowtie-build [options]* <reference_in> <ebwt_outfile_base>

    reference_in            comma-separated list of files with ref sequences

    ebwt_outfile_base       write Ebwt data to files with this dir/basename

Options:

    -f                      reference files are Fasta (default)

    -c                      reference sequences given on cmd line (as <seq_in>)

    -C/--color              build a colorspace index

    -a/--noauto             disable automatic -p/--bmax/--dcv memory-fitting

    -p/--packed             use packed strings internally; slower, uses less mem

    -B                      build both letter- and colorspace indexes

    --bmax <int>            max bucket sz for blockwise suffix-array builder

    --bmaxdivn <int>        max bucket sz as divisor of ref len (default: 4)

    --dcv <int>             diff-cover period for blockwise (default: 1024)

    --nodc                  disable diff-cover (algorithm becomes quadratic)

    -r/--noref              don't build .3/.4.ebwt (packed reference) portion

    -3/--justref            just build .3/.4.ebwt (packed reference) portion

    -o/--offrate <int>      SA is sampled every 2^offRate BWT chars (default: 5)

    -t/--ftabchars <int>    # of chars consumed in initial lookup (default: 10)

    --ntoa                  convert Ns in reference to As

    --seed <int>            seed for random number generator

    -q/--quiet              print only error messages (bowtie-build is verbose by default)

    -h/--help               print detailed description of tool and its options

    --usage                 print this usage message

    --version               print version information and quit

 

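5 Introduction to Samtools

Samtools mainly works on alignments in BAM format. It converts SAM data to BAM and then supports operations such as sorting, merging, indexing, and SNP calling.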
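The convert, sort, and index steps can be sketched as a list of samtools commands. The sketch assumes the samtools 1.x command syntax (older releases take an output prefix for sort instead of -o), and the file names are hypothetical.

```python
def sam_to_sorted_bam_cmds(sam_path, prefix):
    """Return samtools commands: SAM -> BAM, sort, index (1.x syntax)."""
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", bam, sam_path],  # convert to BAM
        ["samtools", "sort", "-o", sorted_bam, bam],      # sort by coordinate
        ["samtools", "index", sorted_bam],                # write the index file
    ]


# Hypothetical input file and output prefix.
cmds = sam_to_sorted_bam_cmds("hits.sam", "hits")
# To execute them in order:
#   import subprocess
#   for c in cmds:
#       subprocess.run(c, check=True)
```

Indexing requires a coordinate-sorted BAM file, which is why the sort step comes before the index step.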

 

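6 Summary of how TopHat calls bowtie and samtools

TopHat has a --bowtie-n option, but by default it uses bowtie's -v alignment mode. Otherwise, bowtie's options cannot be changed from the TopHat command line. run.log shows that TopHat calls bowtie with several options already fixed; these are the ones noted after each option list above. In other words, TopHat leaves essentially no room for adjusting bowtie's parameters.

TopHat has no explicit samtools-related options. According to run.log, the output of the tophat_reports step is processed with samtools to produce the final accepted_hits.bam file.

Running TopHat (aligning RNA-seq reads to a genome) produces the following result files: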

accepted_hits.bam     

deletions.bed         

insertions.bed        

junctions.bed         

left_kept_reads.info  

logs                  

right_kept_reads.info
