Assembling sequences with phrap

Source: Internet. Editor: 程序博客网. Time: 2024/06/10 13:33

phrap is part of the phred/phrap software package and is mainly used to assemble shotgun sequencing reads.

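Main features:

Full-length reads can be used (not just their high-quality parts)

Quality information is used during assembly to improve accuracy

Contig sequences are built from the highest-quality parts of the reads

Extensive assembly information is provided to help resolve problems such as misassemblies (including quality values for the contig sequence)

Relatively large data sets can be handled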

Usage

Command line for running the program:

phrap [sequence file] -new_ace > phrap.out

Input

Nucleotide sequences in FASTA format, for example pp.seq.screen:

>10_A8-9.ab1

gtgctctggtctctgctcctttcccctaagcaatagtaggcagaatcaac

aaaaacaaccccttctcccctccctacctggggaacagagccaatgagac

aggctcaggaacagggcaccagcacctgcactcaccattcaatctcttta

ggctcacggtccttcagaagctcttgtacctcctgccgacagcgctcctg

gtattccgggtgctttgcaaggtggtacaggacccaggagagaccactgg

cccccataaaaagtcacagtacctctgagggctcttgagtctaatctgag

acagtctctgaagattcatcctctttccagaaacccaagcccatcttgct

ctcctagaaacctttctataaaaaaaaaaaaan

>11_A8-9_R.ab1

gggagaggcggagctctggtccttgtcatctaagctgtgtggattgatcg

cctagaacctccctatctaccctccctacctggggaacagagccaatgag

aaaggctcaggaacagggcaccagcacctgcactcaccattcaatctctt

taggctcacggtccttcagaagctcttgtacctcctgccgacagcgctcc

caacttcttcccatcttcatcctggagagaaggcaataaccccccacccc

cacccccataaaaagtcacagtacctctgagggctcttgagtctaatctg

agacagtctctgaagattcatcctctttccagaaacccaagcccatcttg

ctctccagaacccttcttaaa

>15_A8-9.ab1

aagactggcagnggatctctgcatctagtcacctaagctatagctggtag

actcgaccaaaacaaccctttctaccctccctacctggggaacagagcca

atgagacaggctcaggaacag

… …

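If a quality file is available, it must be in the same directory as the sequence file and be named [sequence file name].qual. For example, if the sequence file is pp.seq.screen, the quality file must be named pp.seq.screen.qual. The quality file is not given on the command line. Its records must correspond one-to-one with the records in the sequence file, in the same order and with the same number of bases.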
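
The pairing rules above can be checked before running phrap. The following is a minimal sketch, not part of phrap itself: the function names are made up for illustration, and it assumes the quality file lists one whitespace-separated integer per base under the same ">" headers as the sequence file.

```python
from pathlib import Path


def read_fasta_like(path):
    """Parse a FASTA-style file into an ordered list of (name, body_lines)."""
    records = []
    name, lines = None, []
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if name is not None:
                records.append((name, lines))
            parts = line[1:].split()
            name, lines = (parts[0] if parts else ""), []
        elif name is not None:
            lines.append(line)
    if name is not None:
        records.append((name, lines))
    return records


def check_qual_matches(seq_path):
    """Return a list of problems; an empty list means the two files agree."""
    qual_path = Path(str(seq_path) + ".qual")  # naming rule from the text
    if not qual_path.exists():
        return [f"missing quality file: {qual_path}"]
    seqs = read_fasta_like(seq_path)
    quals = read_fasta_like(qual_path)
    problems = []
    if len(seqs) != len(quals):
        problems.append(f"record count differs: {len(seqs)} vs {len(quals)}")
    # Compare records pairwise, in file order.
    for i, ((s_name, s_lines), (q_name, q_lines)) in enumerate(zip(seqs, quals), 1):
        n_bases = sum(len(line) for line in s_lines)
        n_quals = sum(len(line.split()) for line in q_lines)
        if s_name != q_name:
            problems.append(f"record {i}: name {s_name!r} vs {q_name!r}")
        elif n_bases != n_quals:
            problems.append(f"record {i} ({s_name}): {n_bases} bases vs {n_quals} quality values")
    return problems
```

Calling check_qual_matches("pp.seq.screen") returns an empty list when pp.seq.screen.qual sits next to the sequence file and matches it record for record.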

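Output

Besides the screen output, a set of related files is written to the directory where the program runs:

*.contigs file. The assembled contig sequences, in FASTA format. This includes contigs consisting of a single read (reads that align with part of another contig but do not meet the criteria for being joined).

*.contigs.qual file. The quality file for the assembled contigs, in FASTA-style format. It records the base quality values of each contig.

*.singlets file. Reads that have no overlap with any other read, in FASTA format.

*.log and *.problems files. Of little use to most users.

*.ace file. Produced only when -new_ace or -old_ace is given; needed to view the assembly in consed.

*.view file. Produced only when -view is given; needed to view the assembly in phrapview.

In addition to these files, phrap writes to standard output, which can be redirected to a file, e.g. phrap > phrap.out. This output describes the composition of the contigs.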
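
The command line and the output list above can be wrapped in a small script. This is a sketch under assumptions: phrap is assumed to be on the PATH, the helper names are hypothetical, and each "*" in the list above is assumed to expand to the input file name (for example pp.seq.screen.contigs).

```python
import subprocess
from pathlib import Path


def build_phrap_command(seq_file, *extra_options):
    """Build the argument list for: phrap [sequence file] -new_ace ..."""
    return ["phrap", str(seq_file), "-new_ace", *extra_options]


def run_phrap(seq_file, log_name="phrap.out", extra_options=()):
    """Run phrap and redirect its standard output to log_name."""
    cmd = build_phrap_command(seq_file, *extra_options)
    with open(log_name, "w") as log:
        # check=True raises CalledProcessError if phrap exits with an error
        subprocess.run(cmd, stdout=log, check=True)
    return cmd


def expected_output_names(seq_file):
    """File names expected after a -new_ace run (assumed naming scheme)."""
    base = Path(seq_file).name
    suffixes = [".contigs", ".contigs.qual", ".singlets",
                ".log", ".problems", ".ace"]
    return [base + suffix for suffix in suffixes]
```

For example, run_phrap("pp.seq.screen", extra_options=("-minmatch", "20")) would run the assembly with a larger minimum word length and leave the contig composition report in phrap.out.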

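Parameters

See the phrap documentation for the full parameter list.

Parameters and their default values:

1. Alignment scoring

-penalty -2 Mismatch (substitution) penalty

-gap_init penalty-2 Gap initiation penalty

-gap_ext penalty-1 Gap extension penalty

-ins_gap_ext gap_ext Insertion gap extension penalty

-del_gap_ext gap_ext Deletion gap extension penalty

-matrix [None] Score matrix

-raw * Use raw Smith-Waterman scores only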
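
Several of the scoring defaults above are defined relative to other options. The sketch below works out the effective values, reading "penalty-2" as "the value of -penalty minus 2" and "gap_ext" as "the same value as -gap_ext". The function name is hypothetical and the code only mirrors the table; it does not call phrap.

```python
def alignment_defaults(penalty=-2):
    """Derive the dependent scoring defaults from the mismatch penalty."""
    gap_init = penalty - 2   # -gap_init defaults to penalty-2
    gap_ext = penalty - 1    # -gap_ext defaults to penalty-1
    return {
        "penalty": penalty,
        "gap_init": gap_init,
        "gap_ext": gap_ext,
        "ins_gap_ext": gap_ext,  # follows -gap_ext unless set separately
        "del_gap_ext": gap_ext,  # follows -gap_ext unless set separately
    }
```

With the default -penalty of -2 this gives a gap initiation penalty of -4 and a gap extension penalty of -3, so changing -penalty alone also shifts the gap penalties.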

2. Banded search

-minmatch 14 Minimum length of a matching word

-maxmatch 30 Maximum length of a matching word

-max_group_size 20 Group size limit

-word_raw * Use raw rather than complexity-adjusted word length, in testing against minmatch (N.B. maxmatch always refer to raw lengths).

-bandwidth 14 1/2 band width for banded SWAT searches (full width is 2 times bandwidth + 1).

3. Alignment filtering

-minscore 30 Minimum alignment score

-vector_bound 80 Number of potential vector bases at the beginning of each read

Special cases:

-masklevel 0 Report only the single highest-scoring alignment

-masklevel 100 Report any match whose domain is not completely contained within a higher-scoring match

-masklevel 101 Report all alignments

4. Input

-default_qual 15 Default base quality value when no quality file is supplied

-subclone_delim . Subclone name delimiter

-n_delim 1 Indicates which occurrence of the subclone delimiter character denotes the end of the subclone name

-group_delim _ Group name delimiter: Character used to indicate end of that part of the read name that corresponds to the group name (relevant only if option -preassemble is used);

-trim_start 0 Number of bases to remove from the start of each read

5. Assembly

-forcelevel 0 Relaxes stringency to varying degree during final contig merge pass.

-bypasslevel 1 Controls treatment of inconsistent reads in merge.

-maxgap 30 Maximum permitted size of an unmatched region in merging contigs, during first (most stringent) merging pass.

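-repeat_stringency .95 Controls the stringency of matching

-revise_greedy * Break the assembly at weak joins and try to rejoin

-shatter_greedy * Break the assembly at weak joins without trying to rejoin

-preassemble * Assemble the reads within each group first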

-force_high * Causes edited high-quality discrepancies to be ignored during final contig merge pass.

6. Consensus sequence construction

-node_seg 8 Minimum segment size (for purposes of traversing weighted directed graph).

-node_space 4 Spacing between nodes (in weighted directed graph).

7. Output

-tags * Tag selected lines in the standard output, to facilitate parsing.

-screen * when the -old_ace or -new_ace option is specified (see below), this option causes parts of the read sequences that consist of phrap-inferred sequencing vector and chimeric segments to be replaced by X's in the .ace file.

-old_ace * Produce an ace file in the old format

-new_ace * Produce an ace file in the new format

-ace * Same as -new_ace

-view * Produce a ".view" file suitable for phrapview

-qual_show 20 Cutoff for flagging "low_quality" regions in contig sequence and "high quality" discrepancies between read and contig.

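-print_extraneous_matches * Print information on non-local matches between contigs

8. Miscellaneous

-retain_duplicates * Keep exact duplicate reads instead of discarding them

-max_subclone_size 5000 Maximum subclone size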

-trim_penalty -2 Penalty used for identifying degenerate sequence at beginning & end of read.

-trim_score 20 Minimum score for identifying degenerate sequence at beginning & end of read.

-trim_qual 13 Quality value used to define the high-quality part of a read

-confirm_length 8 Minimum size of confirming segment.

-confirm_trim 1 Amount by which confirming segments are trimmed at edges.

-confirm_penalty -5 Penalty used in aligning against "confirming" reads.

-confirm_score 30 Minimum alignment score for a read to be allowed to "confirm" part of another read.

-indexwordsize 10 Size of indexing (hashing) words, used in finding word matches between sequences.

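Troubleshooting

Out of memory:

This is the case when the program terminates early with the following error message:

FATAL ERROR: REQUESTED MEMORY UNAVAILABLE

Long run times:

Try raising the value of -minmatch.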

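Notes on using phrap

Data volume and data characteristics

As a rule, keep the number of reads below 150,000.

If coverage is not very high and there are few repeats, phrap can complete assemblies of up to 500,000 reads.

If coverage is very high (several tens of fold or more) or there are many repeats, phrap has great difficulty.

Assembly strategies for special data

For non-finishing projects with repeats, use read depth statistics to remove the highly repeated reads and assemble only the reads from unique regions (the RePS approach).

The same strategy can be used in difficult finishing projects to ensure correctness, with the gaps filled in afterwards by other methods.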
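
The depth-based filtering idea can be sketched as follows. This is a simplified illustration and not the RePS implementation: it estimates depth by counting how many reads contain each k-mer, ignores the reverse strand, and the values of k and max_depth are arbitrary assumptions that would need tuning for real data.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping words of length k in seq (lower-cased)."""
    seq = seq.lower()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def filter_repetitive_reads(reads, k=20, max_depth=30):
    """Split reads (dict of name -> sequence) into (kept, dropped).

    A read is dropped when the median depth of its k-mers exceeds
    max_depth, i.e. when most of it looks like repeated sequence.
    """
    counts = Counter()
    for seq in reads.values():
        counts.update(set(kmers(seq, k)))  # count each k-mer once per read
    kept, dropped = {}, {}
    for name, seq in reads.items():
        words = kmers(seq, k)
        depth = median(counts[w] for w in words) if words else 0
        target = dropped if depth > max_depth else kept
        target[name] = seq
    return kept, dropped
```

The kept reads would then be written back to a FASTA file and passed to phrap, while the dropped reads are set aside for later gap filling.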

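The phrap.out file records how the reads were assembled into contigs, including their positions and orientations. This information can be extracted and stored in a contig.list file.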
