Cufflinks in Transcriptome Sequencing Analysis: Usage and Common Problems


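In the analysis of high-throughput transcriptome sequencing data, the main tasks are constructing (assembling) transcripts and measuring their expression levels; secondary tasks include studying post-transcriptional processing such as alternative splicing. Anyone who has analyzed RNA-seq data has therefore used, or at least heard of, Cufflinks.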

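Cufflinks is the result of a collaboration between the Laboratory for Mathematical and Computational Biology at the University of California, Berkeley, led by Lior Pachter; Steven Salzberg's group at the Center for Bioinformatics and Computational Biology, University of Maryland; and Barbara Wold's lab at the California Institute of Technology. Cufflinks takes the alignments produced by TopHat, assembles them into transcripts, estimates the abundance of those transcripts, and tests for differential expression and alternative splicing between samples. The software is really a suite of four programs: cufflinks, cuffcompare, cuffmerge and cuffdiff.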

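The tophat-->Cufflinks-->cuffdiff transcriptome analysis workflow is roughly as follows:

Step 1: cufflinks builds transcripts from the TopHat/Bowtie alignments (BAM format) and the reference genome; the resulting transcripts are saved in GTF format.
Step 2: cuffcompare compares how similar the transcripts in two or more transcript sets are. For example, comparing the transcripts built in step 1 against those in the Ensembl database shows how well the assembly went. In addition, cuffcompare assigns each assembled transcript a class according to its position relative to the known Ensembl transcripts, such as intronic, antisense or intergenic; there are about ten such classes.
Step 3: cuffmerge merges several transcript sets into one, for example combining the sets built from multiple tissue samples into a single set; cuffmerge does a good job of removing redundancy.
Step 4: cuffdiff identifies genes that are differentially expressed between two or more samples, for example transcripts differentially expressed between tumor and normal tissue; it can also test for differential alternative splicing.
At this point the routine analysis of the transcriptome sequencing data is essentially complete, and the next step is experimental validation or deeper analysis.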
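
As a rough illustration of the first and last steps above, the Python sketch below builds (but does not run) the command lines for per-sample assembly and for the final differential expression test. The file names, the condition labels and the helper names `cufflinks_cmd` and `cuffdiff_cmd` are hypothetical; the sketch assumes only the `-p`, `-o` and `-L` options and the convention that cuffdiff takes the replicates of one condition as a comma-separated list.

```python
import shlex


def cufflinks_cmd(bam, out_dir, threads=4):
    """Step 1: assemble one sample; cufflinks writes <out_dir>/transcripts.gtf."""
    return ["cufflinks", "-p", str(threads), "-o", out_dir, bam]


def cuffdiff_cmd(gtf, conditions, out_dir="diff_out", threads=4):
    """Step 4: differential expression over ONE unified GTF.

    `conditions` maps a label to the list of replicate BAM files.
    Replicates of a condition are joined with commas; each condition
    is passed as a separate argument.
    """
    labels = ",".join(conditions)
    bam_groups = [",".join(bams) for bams in conditions.values()]
    return ["cuffdiff", "-p", str(threads), "-o", out_dir,
            "-L", labels, gtf] + bam_groups


# Hypothetical inputs: two conditions with two replicates each.
conditions = {
    "normal": ["normal_1.bam", "normal_2.bam"],
    "tumor": ["tumor_1.bam", "tumor_2.bam"],
}

# One cufflinks run per BAM file, each with its own output directory.
assembly_cmds = [
    cufflinks_cmd(bam, "clout_" + bam[:-len(".bam")])
    for bams in conditions.values()
    for bam in bams
]

# "merged.gtf" stands for the unified transcript set (see the question below).
diff_cmd = cuffdiff_cmd("merged.gtf", conditions)

for cmd in assembly_cmds + [diff_cmd]:
    print(shlex.join(cmd))
```

Note that cuffdiff receives exactly one GTF file no matter how many samples there are, which is where the question in the next section comes from.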

OK, if the Cufflinks workflow makes sense so far, here comes the question!

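In the tophat-->cufflinks-->cuffdiff workflow, the first step produces one GTF file of transcripts per sample, so n samples give n GTF files, and the transcript sets differ from sample to sample. Cuffdiff, however, requires a single GTF file as input so that it can compare the transcripts in that file across samples. The n per-sample GTF files therefore have to be merged into one. The problem is that the Cufflinks suite offers two tools that can merge transcripts, cuffcompare and cuffmerge. Which one should be used, and how do they differ?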

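My own understanding and experience:

This question confuses almost everyone who is new to NGS analysis. In fact, there are three ways to obtain a single unified transcript set from the sets built for multiple samples:

1 Before assembly, merge the alignment BAM files of all samples, then run cufflinks on the merged BAM.
2 Run cufflinks separately on each sample's BAM file; once every sample has been assembled, combine the results into one transcript set with cuffcompare.
3 Run cufflinks separately on each sample's BAM file; once every sample has been assembled, merge the results into one transcript set with cuffmerge.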
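
The three options can be sketched as command lines. The Python helpers below only construct the commands; the function names, file names and output prefixes are hypothetical. The sketch assumes that `samtools merge` takes the output BAM first, that cuffcompare writes `<prefix>.combined.gtf`, and that cuffmerge reads a text file listing one GTF path per line and writes `merged.gtf` into its output directory.

```python
def pooled_assembly(bams, pooled_bam="pooled.bam", out_dir="pooled_asm"):
    """Option 1: merge the BAM files first, then assemble once."""
    return [
        ["samtools", "merge", pooled_bam] + list(bams),
        ["cufflinks", "-o", out_dir, pooled_bam],
    ]


def combine_with_cuffcompare(gtfs, prefix="cuffcmp"):
    """Option 2: cuffcompare writes <prefix>.combined.gtf."""
    return [["cuffcompare", "-o", prefix] + list(gtfs)]


def merge_with_cuffmerge(gtfs, manifest="assemblies.txt",
                         out_dir="merged_asm", ref_gtf=None):
    """Option 3: cuffmerge writes <out_dir>/merged.gtf.

    Returns the text to save as the manifest file plus the command.
    """
    manifest_text = "\n".join(gtfs) + "\n"
    cmd = ["cuffmerge", "-o", out_dir]
    if ref_gtf is not None:
        cmd += ["-g", ref_gtf]  # optional reference annotation
    return manifest_text, [cmd + [manifest]]


# Hypothetical per-sample inputs.
bams = ["s1.bam", "s2.bam"]
gtfs = ["clout_s1/transcripts.gtf", "clout_s2/transcripts.gtf"]

plan1 = pooled_assembly(bams)
plan2 = combine_with_cuffcompare(gtfs)
manifest_text, plan3 = merge_with_cuffmerge(gtfs, ref_gtf="genes.gtf")

for plan in (plan1, plan2, plan3):
    for cmd in plan:
        print(" ".join(cmd))
```

Options 2 and 3 start from the same per-sample GTF files and differ only in the final tool, so switching between them is cheap once the per-sample assemblies exist.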

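The three approaches differ as follows:
The first approach has the simplest workflow: all the work is handed to cufflinks, and you do not even need to know how cuffmerge or cuffcompare work. It looks like the perfect solution. The big problem is whether cufflinks can cope with the merged BAM. For organisms with small genomes it may, but for human data a BAM merged from seven or eight samples is likely to need more memory and run time than cufflinks can be given on ordinary hardware.
The second and third approaches are similar: both merge transcripts while preserving alternative splicing structure. The difference is that cuffcompare merges two transcripts A and B only when their structures are the same (more precisely, according to the reply from Cole Trapnell quoted below, when one is contained in the other), whereas cuffmerge merges them whenever parts of A and B overlap and their splicing structures do not conflict. In fact, when cuffmerge merges, it re-runs cufflinks on the overlapping transfrags to assemble them into a single transfrag.
Another major difference between the second and third approaches is how a reference annotation is used. Cuffmerge can use a reference annotation to guide the merge, while cuffcompare cannot: as the quoted reply below explains, cuffcompare only includes the reference in its combined output. This is an advantage of cuffmerge over cuffcompare.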
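
To make the difference in merge rules concrete, here is a toy model in Python. It is not the actual algorithm of either program: the `Transfrag` class, the interval-based compatibility test and the function names are simplifying assumptions. The rule for the cuffcompare-like function ("merge only if one transfrag is contained in the other") and for the cuffmerge-like function ("merge if they overlap, agree on splicing and share a strand") follows the description in Cole Trapnell's reply quoted below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfrag:
    name: str
    strand: str
    exons: tuple  # sorted, non-overlapping (start, end) pairs

    @property
    def start(self):
        return self.exons[0][0]

    @property
    def end(self):
        return self.exons[-1][1]

    @property
    def introns(self):
        return [(a[1], b[0]) for a, b in zip(self.exons, self.exons[1:])]


def compatible(a, b):
    """Same strand, overlapping spans, identical introns inside the overlap."""
    lo, hi = max(a.start, b.start), min(a.end, b.end)
    if a.strand != b.strand or lo >= hi:
        return False

    def introns_in_overlap(t):
        return {i for i in t.introns if i[0] < hi and i[1] > lo}

    return introns_in_overlap(a) == introns_in_overlap(b)


def contained(a, b):
    """True if a is compatible with b and lies entirely within b's span."""
    return compatible(a, b) and b.start <= a.start and a.end <= b.end


def union_exons(a, b):
    """Merge the exon intervals of two compatible transfrags."""
    merged = []
    for s, e in sorted(a.exons + b.exons):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return tuple(merged)


def cuffcompare_like(a, b):
    """Drop a transfrag only when it is redundant (contained in the other)."""
    if contained(a, b):
        return [b]
    if contained(b, a):
        return [a]
    return [a, b]


def cuffmerge_like(a, b):
    """Fuse any two compatible, overlapping transfrags into a longer one."""
    if compatible(a, b):
        return [Transfrag(a.name + "+" + b.name, a.strand, union_exons(a, b))]
    return [a, b]


# Made-up coordinates for illustration.
A = Transfrag("A", "+", ((100, 200), (300, 400), (500, 600)))
B = Transfrag("B", "+", ((350, 400), (500, 700)))  # extends A, same splicing
C = Transfrag("C", "+", ((320, 400), (500, 580)))  # contained in A
D = Transfrag("D", "+", ((150, 450),))             # retains A's introns

print(len(cuffcompare_like(A, B)), len(cuffmerge_like(A, B)))
print(len(cuffcompare_like(A, C)), len(cuffmerge_like(A, D)))
```

In this model A and B stay as two transcripts under the cuffcompare-like rule but fuse into one longer transcript under the cuffmerge-like rule, while D is never merged with A because it disagrees on splicing.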

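I therefore think the third approach comes closer to the first than the second does. Since the first is often impractical, the best available solution under current conditions is the third: run cufflinks on each BAM separately, then merge the results with cuffmerge.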

Hi, JiWen

I thought about this before, here is the answer from Cole Trapnell from the SEQanswers website:

"I can shed some light on this. We have an upcoming protocol paper that describes our recommended workflow for TopHat and Cufflinks that discusses some of these issues.

As turnersd outlined, there are three strategies:
1) merge bams and assemble in a single run of Cufflinks
2) assemble each bam and cuffcompare them to get a combined.gtf
3) assemble each bam and cuffmerge them to get a merged.gtf

All three options work a little differently depending on whether you're also trying to integrate reference transcripts from UCSC or another annotation source.

#1 is quite different from #2 and #3, so I'll discuss its pros and cons first. The advantage here is simplicity of workflow. It's one Cufflinks run, so no need to worry about the details of the other programs. As turnersd mentions, you might also think this maximizes the accuracy of the resulting assembly, and that might be the case, but it also might not (for technical reasons that I don't want to get into right now). The disadvantage of this approach is that your computer might not be powerful enough to run it. More data and more isoforms means substantially more memory and running time. I haven't actually tried this on something like the human body map, but I would be very impressed and surprised if Cufflinks can deal with all of that on a machine owned by mere mortals.

#2 and #3 are very similar - both are designed to gracefully merge full-length and partial transcript assemblies without ever merging transfrags that disagree on splicing structure. Consider two transfrags, A and B, each with a couple exons. If A and B overlap, and they don't disagree on splicing structure, we can (and according to Cufflinks' assembly philosophy, we should) merge them. The difference between Cuffcompare and Cuffmerge is that Cuffcompare will only merge them if A is "contained" in B, or vice versa. That is, only if one of the transfrags is essentially redundant. Otherwise, they both get included. Cuffmerge on the other hand, will merge them if they overlap, and agree on splicing, and are in the same orientation. As turnersd noted, this is done by converting the transfrags into SAM alignments and running Cufflinks on them.

The other thing that distinguishes these two options is how they deal with a reference annotation. You can read on our website how the Cufflinks Reference Annotation Based Transcript assembler (RABT) works. Cuffcompare doesn't do any RABT assembly, it just includes the reference annotation in the combined.gtf and discards partial transfrags that are contained and compatible with the reference. Cuffmerge actually runs RABT when you provide a reference, and this happens during the step where transfrags are converted into SAM alignments and assembled. We do this to improve quantification accuracy and reduce errors downstream. I should also say that Cuffmerge runs cuffcompare in order to annotate the merged assembly with certain helpful features for use later on.

So we recommend #3 for a number of reasons, because it is the closest in spirit to #1 while still being reasonably fast. For reasons that I don't want to get into here (pretty arcane details about the Cufflinks assembler) I also feel that option #3 is actually the most accurate in most experimental settings."

Hope this helps.

Wei Liao
Research Scientist, Brentwood Biomedical Research Institute
16111 Plummer St., Bldg 7, Rm D-122
North Hills, CA 91343
818-891-7711 ext 7645

> ---------- Forwarded message ----------
> From: 杨继文 <jiwenyang0605 at 126.com>
> To: galaxy-user at lists.bx.psu.edu
> Date: Mon, 23 Apr 2012 20:32:28 +0800 (CST)
> Subject: [galaxy-user] cuffcompare or cuffmerge
>
> Hi all,
> I read one paper "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks".
> They say the procedure for RNA-Seq analysis is Tophat-->cufflinks-->cuffmerge-->cuffdiff
>
> But what I normally do in Galaxy is Tophat-->cufflinks-->cuffcompare-->cuffdiff.
> I have six samples, which means I will generate 6 assembled transcript files by cufflinks. Then I run cuffcompare using all six assembled transcript files as input. The resulting "combined transcript" is the input for cufflinks.
>
> I don't know why I should use cuffmerge. Actually I don't understand the function of cuffmerge.
>
> Did I miss something??
> Please let me know your opinions.
> Jiwen

Difference between Cuffcompare and Cuffmerge

Cufflinks is one of the most commonly used programs for reference-genome-based transcriptome assembly, expression estimation and differential expression analysis. It comes with two supplementary programs (among a few others) for the post-Cufflinks workflow: Cuffcompare and Cuffmerge.

Although Cuffcompare and Cuffmerge may seem to perform the same task when it comes to handling multiple transcriptome assemblies, there are still substantial differences between the two. The Cufflinks manual, the Tuxedo pipeline (Bowtie-TopHat-Cufflinks-Cuffdiff) paper and forum posts from the developers have pointed out these differences, but I still get a lot of questions about them. I will try to explain the differences to make them clearer.

Cuffcompare and Cuffmerge are both used to merge multiple transcript assemblies, but in slightly different ways.
Cuffcompare takes all the transcripts from multiple assemblies (in GTF format) and creates a union of them from which redundant transcripts are removed. Cuffcompare does not change any assembled transcript in any of the assemblies; it simply compares the coordinates of the transcripts.
The resulting file, "combined.gtf", contains a set of "unified" transcripts across all the assemblies. The "combined.gtf" file can be used as the reference GTF file for quantification across the samples with Cuffdiff (another program in the Cufflinks toolkit).

Cuffmerge, on the other hand, creates a "merged" set of transcripts from multiple assemblies. During this merging, transcripts from all the assemblies (GTF files) are converted to representative reads in SAM format, and Cufflinks (the original assembly program) is run internally to see whether there are any gaps that can be filled and whether a longer consensus transcript can be created. Basically, Cuffmerge merges transcripts that overlap and share a compatible exon structure (or splicing structure) to generate a longer chain of connected exons.

Overall, Cuffcompare generates a non-redundant set of transcripts, while Cuffmerge generates a consensus assembly from multiple assemblies. So from Cuffmerge you get a cleaner, somewhat more complete assembly and, generally, fewer assembled transcripts than from Cuffcompare.

Additional note:
Cuffcompare is a more comprehensive program than simply a tool to combine assemblies. For example, the ".tracking" file generated by Cuffcompare records which samples each transcript was found in, so that you can get an idea of the multiplicity (recurrence) of each transcript across multiple samples. Cuffcompare can also annotate your transcript assemblies using a reference annotation file (in GTF format) and will assign the reference transcript ID (such as the Ensembl ID) and gene symbol to the assembled transcripts.
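
As a minimal sketch of how that recurrence could be read out, the Python snippet below counts, for each transfrag, how many input assemblies contain it. It assumes the tab-delimited ".tracking" layout of four fixed columns (transfrag ID, locus ID, reference ID, class code) followed by one column per input GTF, with "-" marking an assembly where the transfrag is absent. The function name `sample_counts` and the two example rows are made up for illustration.

```python
def sample_counts(tracking_text):
    """Map each transfrag ID to the number of assemblies that contain it.

    Assumed layout (tab-delimited): transfrag id, locus id, reference id,
    class code, then one column per input GTF; "-" means "not present".
    """
    counts = {}
    for line in tracking_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        counts[fields[0]] = sum(1 for col in fields[4:] if col != "-")
    return counts


# Made-up example with three input assemblies (q1, q2, q3).
example = (
    "TCONS_00000001\tXLOC_000001\tGeneA|txA\t=\t"
    "q1:CUFF.1|CUFF.1.1|100|10.0|8.0|12.0|15.0|1200\t"
    "q2:CUFF.3|CUFF.3.1|100|9.5|7.5|11.5|14.0|1200\t-\n"
    "TCONS_00000002\tXLOC_000002\t-\tu\t-\t-\t"
    "q3:CUFF.7|CUFF.7.1|100|2.0|1.0|3.0|4.0|800\n"
)

counts = sample_counts(example)
for transfrag_id, n in counts.items():
    print(transfrag_id, n)
```

A transfrag seen in only one assembly may deserve more scrutiny than one recovered in every sample, so a count like this is a cheap first filter before looking at expression values.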

Some of the small fragments that are unlikely to represent a functional transcript are removed by cuffcompare. This is the reason you are seeing transcripts missing from the tmap file. I think you can tell cuffcompare not to remove any transcripts during the filtering process.

0 0