LaTeX Template


First, test_latex.tex:


\documentclass{article}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage{subfigure}

\author{Hu Boqiang\thanks{Email: huboqiang@gmail.com}\\
Biodynamic Imaging Center\\
Peking University, Beijing 100871, China.}
\title{MCMC Methods in Computational\\ Molecular Evolution}

\begin{document}
\maketitle

\begin{abstract}
\emph{Markov chain Monte Carlo (MCMC) methods are a class of statistical algorithms for sampling from a probability distribution by constructing a Markov chain whose stationary distribution is the desired one. They have been widely used in many fields, including motif searching and phylogenetics. Here, the application of MCMC to motif scanning and to Bayesian phylogenetics is reviewed in detail.}
\end{abstract}

\section{Introduction}
In computational molecular evolution, two questions come up again and again: what differs among a set of sequences, and what do they have in common? For the differences, a multiple sequence alignment (MSA) is usually performed first; from the alignment we can build a phylogenetic tree that measures how far the sequences have diverged. For the commonalities, motif scanning is used to find potential regulatory elements in the genome. In both cases, MCMC methods are a useful tool. Here I use my own code to introduce their application to these two problems in detail.

\section{Markov Chain Monte Carlo Algorithms}
Markov chain Monte Carlo (MCMC) is a sampling method developed from the Monte Carlo method. A plain Monte Carlo method draws a set of random values that are independent of one another. An MCMC method instead treats the sequence of draws as a Markov chain, so that the present state depends on the previous one. The random values are therefore obtained by first drawing an initial value $R_1$ and then producing the rest by recursion:
\begin{equation}
R_i = R_{i-1} + random(range)
\end{equation}
Two questions arise from this equation: how should $random(range)$ be chosen, and under which condition should the new $R_i$ be accepted? Several algorithms answer them. Metropolis developed an algorithm described as follows \cite{metropolis1953equation}:
\begin{enumerate}
\item Set the initial state $\theta$.
\item Select the candidate state ${\theta}^{*}$ with equal probability; that is, the proposal density is a uniform distribution.
\item Calculate the probability of the present state, $\pi(\theta)$, and of the candidate state, $\pi({\theta}^{*})$. If the candidate state is more likely, accept it; otherwise, accept it with probability $\frac{\pi({\theta}^{*})}{\pi(\theta)}$.
\end{enumerate}
Hastings then showed that the proposal density need not be symmetric between the candidate states \cite{hastings1970monte}: in step~3, the new state is accepted with probability $\frac{\pi({\theta}^{*})}{\pi(\theta)}\frac{q(\theta|{\theta}^{*})}{q({\theta}^{*}|\theta)}$. The MCMC algorithm developed by Metropolis and Hastings is called the Metropolis--Hastings method.

In practice we usually have more than one parameter to estimate. If new states were proposed for all parameters at once, the calculation would be complex, so the parameters are usually updated one at a time. Gibbs sampling is based on this idea \cite{gelman1992inference}. In this algorithm the parameters are divided into blocks, one parameter per block. When a parameter is updated, its new value is drawn conditional on all the other parameters, so the proposed value is always accepted. In conclusion, MCMC comprises several algorithms: the Metropolis--Hastings algorithm is the basic one, and Gibbs sampling extends it to many parameters.
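To make the accept--reject step concrete, here is a minimal Python sketch of the Metropolis--Hastings method with a symmetric sliding-window proposal. The target density and the window width in the usage line are arbitrary illustrative choices, not taken from the examples below.
\begin{verbatim}
# Minimal sketch of the Metropolis-Hastings method described above.
import math
import random

def metropolis_hastings(pi, theta0, w, n_steps):
    """Sample from density pi with a uniform sliding-window proposal.

    The uniform proposal is symmetric, so the Hastings correction
    q(theta|theta*) / q(theta*|theta) equals 1 and cancels out.
    """
    theta = theta0
    samples = []
    for _ in range(n_steps):
        # Step 2: propose a new state uniformly within the window.
        theta_new = theta + random.uniform(-w / 2, w / 2)
        # Step 3: accept with probability min(1, pi(new)/pi(old)).
        if random.random() < min(1.0, pi(theta_new) / pi(theta)):
            theta = theta_new
        samples.append(theta)
    return samples

# Example: sample from an (unnormalized) standard normal density.
chain = metropolis_hastings(lambda t: math.exp(-t * t / 2),
                            0.0, 1.0, 10000)
\end{verbatim}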
\section{MCMC and Phylogenetics}
Several methods can be used to reconstruct a phylogenetic tree, including distance methods, maximum parsimony (MP), maximum likelihood (ML), and Bayesian approaches. A Bayesian approach uses two models to describe evolution \cite{rannala1996probability}. One is a tree-structure model, which uses a Markov chain to describe the time scale of evolution; it determines the positions of the internal nodes and the leaf nodes, which represent the possible ancestral sequences and the observed sequences. The other is the base-substitution model, which describes the rate of substitution from one base to another. A Bayesian approach then takes the branch lengths between the internal nodes, together with the arrangement of the species at the leaf nodes, as the parameters of a Bayesian function, and seeks the parameter values that give the highest posterior probability for the input sequences. A random set of parameters can be drawn as a prior sample, and its prior probability calculated with the two models above; summing over all possible parameter sets gives the total probability, from which the posterior probability of any given parameter set follows.

The Bayesian function itself could be optimized by maximum likelihood to find the best parameter values. With more than about five species, however, there are too many parameters to optimize this way, so a Monte Carlo approach was considered: independent samples are drawn from the prior, and the Monte Carlo results are summed to approximate the total probability. Because the parameter space has so many dimensions to scan, Yang used a Markov chain Monte Carlo approach \cite{yang1997bayesian}, drawing dependent samples instead, to shrink the space that must be scanned. One possible method is the Metropolis--Hastings algorithm; let us take an example from Yang's book \cite{yang2006computational} to explain it in detail.

To keep the phylogenetic reconstruction as simple as possible, imagine we have two sequences of length $n = 948$ with $x = 90$ differing sites. We use the JC69 model to estimate the distance between the two sequences:
\begin{enumerate}
\item Initialize the sliding window $(w)$ for proposing new parameter values; here let $w = 0.05$. An exponential distribution with mean $\mu = 0.2$ is used as the prior for the distance between the two sequences:
\begin{equation}
f(\theta) = \frac{1}{\mu} e^{-\theta/\mu}, \quad \mu = 0.2
\end{equation}
\item Initialize the parameter $\theta$, for example $\theta = 0.05$.
\item Propose a new parameter uniformly within the sliding window: ${\theta}^{*} = random[\theta - w/2, \theta + w/2]$.
\item Calculate the acceptance ratio $\alpha$, using a binomial distribution for the number of differing sites:
\begin{equation}
\alpha = \min\left(1, \frac{f({\theta}^{*}) f(x|{\theta}^{*})}{f(\theta) f(x|\theta)}\right)
\end{equation}
Under the JC69 base-substitution model, the probability that the two sequences differ at a single site is
\begin{equation}
P_{diff} = \frac{3}{4} - \frac{3}{4} e^{-\frac{4}{3}\theta}
\end{equation}
while the probability that they share the same base is
\begin{equation}
P_{same} = \frac{1}{4} + \frac{3}{4} e^{-\frac{4}{3}\theta}
\end{equation}
So with $x$ differing sites and $n - x$ identical sites, the likelihood is
\begin{equation}
f(x|\theta) = \left(\frac{3}{4} - \frac{3}{4} e^{-\frac{4}{3}\theta}\right)^{x} \left(\frac{1}{4} + \frac{3}{4} e^{-\frac{4}{3}\theta}\right)^{n-x}
\end{equation}
\item Accept or reject the new parameter ${\theta}^{*}$: generate a random number $r = random[0,1]$; if $r < \alpha$, accept the new parameter, $\theta = {\theta}^{*}$; otherwise keep the old one.
\item Repeat steps 3 to 5 until the chain becomes stable (a Python sketch of the whole procedure follows this list).
\end{enumerate}
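The sketch below follows the six steps above; $n$, $x$, $\mu$, $w$ and the initial $\theta$ come from the example, while the chain length and the reflection of negative proposals back to positive values are illustrative assumptions.
\begin{verbatim}
# Minimal sketch of the six-step JC69 example (n = 948, x = 90).
import math
import random

N, X = 948, 90     # sequence length, number of differing sites
MU, W = 0.2, 0.05  # prior mean, sliding-window width

def log_posterior(theta):
    # Exponential prior: f(theta) = (1/mu) exp(-theta/mu).
    log_prior = -theta / MU - math.log(MU)
    # JC69 likelihood: P(diff) = 3/4 - (3/4) exp(-4*theta/3).
    p_diff = 0.75 - 0.75 * math.exp(-4.0 * theta / 3.0)
    log_lik = (X * math.log(p_diff)
               + (N - X) * math.log(1.0 - p_diff))
    return log_prior + log_lik

theta = 0.05                        # step 2: initial value
samples = []
for _ in range(10000):
    # Step 3: propose within the window; reflect negative values
    # so the distance stays positive.
    theta_new = abs(theta + random.uniform(-W / 2, W / 2))
    # Step 4: acceptance ratio, computed in log space.
    log_alpha = log_posterior(theta_new) - log_posterior(theta)
    # Step 5: accept or reject.
    if random.random() < math.exp(min(0.0, log_alpha)):
        theta = theta_new
    samples.append(theta)

print(sum(samples) / len(samples))  # posterior mean, near 0.10
\end{verbatim}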
Using the Metropolis--Hastings algorithm, we solved the Bayesian function and obtained the posterior distribution (\textbf{Figure~1(a)}). Not only do we get the parameter value with the highest posterior probability, we also get a credibility interval for it. The mean of the posterior distribution is $E(\theta) = \int \theta f(\theta|x) d\theta = 0.10213$, with 95\% credibility interval $(0.08191, 0.12463)$. Furthermore, the prior has no significant influence on the posterior once there are enough trials (\textbf{Figure~1(b)}): chains started from $\theta_0 = 0.01, 0.1, 0.4, 0.8$ all converge to about $0.1$ after approximately 100 trials.

\begin{figure}[htbp]
\subfigure[Density of the posterior distribution]{
\label{fig:fft:a}
\begin{minipage}[c]{0.6\textwidth}
\includegraphics[width=6.5cm]{/Users/hubq/Documents/MolEvo_homework/JC69_density_10000.pdf}
\end{minipage}%
}
\subfigure[Posterior probabilities using different prior probabilities]{
\begin{minipage}[c]{0.7\textwidth}
\includegraphics[width=6.5cm]{/Users/hubq/Documents/MolEvo_homework/JC69.pdf}
\end{minipage}
}
\caption{Posterior probabilities using different prior probabilities}
\label{fig:fft}
\end{figure}

\section{MCMC and Motif Scanning}
Motif scanning can also use MCMC. Here I review a method developed by Xuhua Xia based on the position weight matrix (PWM) \cite{xia2012position}, the Gibbs sampling algorithm, and associated significance tests. Gibbs sampling was introduced above; let us see how it works in motif scanning. The input is $n$ DNA sequences. We select one k-mer in each sequence, forming an $n \times k$ string matrix, and from it obtain the count matrix and the fraction matrix by counting each base at each position. Taking the other bases of the sequences into account, a position weight matrix (PWM) is then calculated:
\begin{equation}
PWM_{ij} = \frac{Count_{ij} + \alpha \, Other_j}{len_i + \alpha \sum_{k \in \{A,T,G,C\}} Other_k}, \quad i = 1..6, \; j \in \{A,T,G,C\}
\end{equation}
where $Count_{ij}$ is the count matrix, $Other_j$ is the count of base $j$ in the sequences outside the k-mer matrix, and $\alpha$ is the pseudocount weight.
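A PWM of this kind can be computed with a few lines of Python. Expressing each entry as a log-odds score against the background frequency of the ``other'' bases is an assumption made here so that the values match the log-scale scores in Table~2; see Xia's paper for the exact definition.
\begin{verbatim}
# Minimal sketch of building a PWM from the n selected k-mers.
import math

BASES = "ATGC"

def build_pwm(kmers, others, alpha=0.1):
    """kmers: n strings of equal length k (the string matrix).
    others: dict base -> count of that base outside the k-mers
    (every base is assumed to occur at least once)."""
    k = len(kmers[0])
    total_other = sum(others[b] for b in BASES)
    pwm = []
    for i in range(k):
        column = {}
        for b in BASES:
            count = sum(kmer[i] == b for kmer in kmers)
            # Pseudocount-smoothed fraction of base b at position i.
            freq = ((count + alpha * others[b])
                    / (len(kmers) + alpha * total_other))
            # Log2 odds against the background frequency of base b.
            column[b] = math.log2(freq * total_other / others[b])
        pwm.append(column)
    return pwm
\end{verbatim}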
The motif we seek is the PWM with the highest score, together with the position in each DNA sequence of the k-mer supporting that PWM. The positions of the k-mers can therefore be treated as $n$ parameters, which we update one at a time to calculate new PWMs. In detail, we first select one k-mer at random in each sequence. Then we randomly choose one sequence and calculate the PWM from the k-mers of the remaining sequences. Using that PWM, we slide along the chosen sequence in 1-bp steps and calculate the odds ratio of each candidate k-mer:
\begin{equation}
Odds\ Ratio = \sum_{i=1}^{6} PWM_{i,\,kmer_i}
\end{equation}
where $kmer_i$ is the base at position $i$ of the candidate k-mer. The k-mer with the highest odds ratio becomes the new k-mer for that sequence and is put back into the $(n-1) \times k$ string matrix. This is Gibbs sampling because the position of the k-mer in one sequence is determined by the positions of all the other k-mers, and the best candidate is always accepted. When the PWM is stable after several trials, we evaluate it by calculating its score; we then repeat the whole process for several rounds and select the PWM with the highest score.
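One round of these Gibbs updates could be sketched as follows, reusing the build\_pwm function from the previous sketch. Visiting the sequences in a random order and passing the background counts in as a plain dictionary are illustrative choices.
\begin{verbatim}
# Minimal sketch of one round of Gibbs updates in the motif search.
import random

def kmer_score(pwm, kmer):
    # Odds ratio of a k-mer: the sum of its PWM entries (log scale).
    return sum(pwm[i][base] for i, base in enumerate(kmer))

def gibbs_round(seqs, positions, k, others):
    """seqs: the n DNA sequences; positions: current k-mer start
    positions, updated in place; others: background base counts."""
    for target in random.sample(range(len(seqs)), len(seqs)):
        # Build the PWM from the k-mers of all other sequences.
        rest = [seqs[j][positions[j]:positions[j] + k]
                for j in range(len(seqs)) if j != target]
        pwm = build_pwm(rest, others)
        # Scan the chosen sequence in 1 bp steps and always accept
        # the best-scoring position (the Gibbs step described above).
        seq = seqs[target]
        positions[target] = max(
            range(len(seq) - k + 1),
            key=lambda p: kmer_score(pwm, seq[p:p + k]))
    return positions
\end{verbatim}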
Here, I use the sequences from Xuhua Xia's paper (\textbf{Table~1}) as an example to show how this works.

\begin{table}
\caption{Input sequences}    % table caption
\label{tab:1}                % unique label
{\small                      % font size
\begin{tabular}{ll}          % one 'l' per column; use '|' between them for vertical rules
\hline\noalign{\smallskip}
Sample & Sequence \\
\noalign{\smallskip}\hline\noalign{\smallskip}
S1 & TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT \\
S2 & CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG \\
S3 & TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG \\
S4 & AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC \\
S5 & GGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGC \\
S6 & AGCCTAGAGTGATGACTCCTATCTGGGTCCCCAGCAGGA \\
S7 & GCCTCAGGATCCAGCACACATTATCACAAACTTAGTGTCCA \\
S8 & CATTATCACAAACTTAGTGTCCATCCATCACTGCTGACCCT \\
S9 & TCGGAACAAGGCAAAGGCTATAAAAAAAATTAAGCAGC \\
S10 & GCCCCTTCCCCACACTATCTCAATGCAAATATCTGTCTGAAACGGTTCC \\
S11 & CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG \\
S12 & GATTGGTCACAGCATTTCAAGGGAGAGACCTCATTGTAAG \\
S13 & TCCCCAACTCCCAACTGACCTTATCTGTGGGGGAGGCTTTTGA \\
S14 & CCTTATCTGTGGGGGAGGCTTTTGAAAAGTAATTAGGTTTAGC \\
S15 & ATTATTTTCCTTATCAGAAGCAGAGAGACAAGCCATTTCTCTTTCCTCCC \\
S16 & AGGCTATAAAAAAAATTAAGCAGCAGTATCCTCTTGGGGGCCCCTTC \\
S17 & CCAGCACACACACTTATCCAGTGGTAAATACACATCAT \\
S18 & TCAAATAGGTACGGATAAGTAGATATTGAAGTAAGGAT \\
S19 & ACTTGGGGTTCCAGTTTGATAAGAAAAGACTTCCTGTGGA \\
S20 & TGGCCGCAGGAAGGTGGGCCTGGAAGATAACAGCTAGTAGGCTAAGGCCA \\
S21 & CAACCACAACCTCTGTATCCGGTAGTGGCAGATGGAAA \\
S22 & CTGTATCCGGTAGTGGCAGATGGAAAGAGAAACGGTTAGAA \\
S23 & GAAAAAAAATAAATGAAGTCTGCCTATCTCCGGGCCAGAGCCCCT \\
S24 & TGCCTTGTCTGTTGTAGATAATGAATCTATCCTCCAGTGACT \\
S25 & GGCCAGGCTGATGGGCCTTATCTCTTTACCCACCTGGCTGT \\
S26 & CAACAGCAGGTCCTACTATCGCCTCCCTCTAGTCTCTG \\
S27 & CCAACCGTTAATGCTAGAGTTATCACTTTCTGTTATCAAGTGGCTTCAGC \\
S28 & GGGAGGGTGGGGCCCCTATCTCTCCTAGACTCTGTG \\
S29 & CTTTGTCACTGGATCTGATAAGAAACACCACCCCTGC \\
\noalign{\smallskip}\hline
\end{tabular}
} % matches {\small
\end{table}

Following the Gibbs sampling method described above, we obtain a PWM and the position of the k-mer in each sequence. For one set of initial sites, the PWM becomes stable after about 300 trials (\textbf{Figure~2(a)}). Furthermore, using the F-score,
\begin{equation}
F = \sum_{i=1}^{k}\sum_{j=1}^{m} Count_{ij} \, PWM_{ij}
\end{equation}
where $k$ is the k-mer length and $m = 4$ for the bases A, T, G, C, it can be shown that different initial sites have a significant effect on the final PWM. Among the different final PWMs, only the one with the highest F-score is selected (\textbf{Figure~2(b)}). We can conclude that with Gibbs sampling the result can differ from trial to trial, so sufficient trials must be run to obtain the best answer.

\begin{figure}[htbp]
\subfigure[For one set of initial k-mer sites, the PWM score of the first k-mer base over successive Gibbs sampling trials]{
\label{fig:fftt:a}
\begin{minipage}[c]{0.6\textwidth}
\includegraphics[width=6.5cm]{/Users/hubq/Documents/MolEvo_homework/test_PWM.pdf}
\end{minipage}%
}
\subfigure[For each set of initial k-mer sites, the final PWM score of the first k-mer base and the F-score of the result]{
\begin{minipage}[c]{0.7\textwidth}
\includegraphics[width=6.5cm]{/Users/hubq/Documents/MolEvo_homework/test_PWM_by_round.pdf}
\end{minipage}
}
\caption{Different initial k-mer sites and the final PWMs}
\label{fig:fftt}
\end{figure}

Finally, the PWM and the positions of all the motifs are given in \textbf{Table~2} and \textbf{Table~3}. This reproduces the result in Xia's paper.

\begin{table}
\caption{The final position weight matrix}    % table caption
\label{tab:2}                % unique label
{\small                      % font size
\begin{tabular}{lllllll}     % one 'l' per column
\hline\noalign{\smallskip}
Base & 1 & 2 & 3 & 4 & 5 & 6 \\
\noalign{\smallskip}\hline\noalign{\smallskip}
A & -0.6553147 & -5.47621642 & 0.94765472 & -5.47621642 & 0.25612859 & 0.66051329 \\
C & 0.22747199 & -5.53286628 & -5.53286628 & -5.53286628 & 0.86783264 & -0.96917759 \\
G & -5.57384204 & 0.13202315 & 0.24943652 & -5.57384204 & -5.57384204 & -5.57384204 \\
T & 0.86463807 & 1.20053795 & -5.3486469 & 1.52291643 & -5.3486469 & 0.55521111 \\
\noalign{\smallskip}\hline
\end{tabular}
} % matches {\small
\end{table}

\begin{table}
\caption{Motif positions}    % table caption
\label{tab:3}                % unique label
{\small                      % font size
\begin{tabular}{llll}        % one 'l' per column
\hline\noalign{\smallskip}
Sample & Motif 6-mer & Motif Position & Sequence \\
\noalign{\smallskip}\hline\noalign{\smallskip}
S1 & TTATCA & 18 & TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT \\
S2 & CGGTCA & 22 & CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG \\
S3 & CTATCA & 14 & TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG \\
S4 & AGATAA & 17 & AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC \\
S5 & TGGTAC & 7 & GGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGC \\
S6 & CTATCT & 18 & AGCCTAGAGTGATGACTCCTATCTGGGTCCCCAGCAGGA \\
S7 & TTATCA & 20 & GCCTCAGGATCCAGCACACATTATCACAAACTTAGTGTCCA \\
S8 & TTATCA & 2 & CATTATCACAAACTTAGTGTCCATCCATCACTGCTGACCCT \\
S9 & CTATAA & 17 & TCGGAACAAGGCAAAGGCTATAAAAAAAATTAAGCAGC \\
S10 & CTATCT & 14 & GCCCCTTCCCCACACTATCTCAATGCAAATATCTGTCTGAAACGGTTCC \\
S11 & TGGTCA & 21 & CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG \\
S12 & TTGTAA & 33 & GATTGGTCACAGCATTTCAAGGGAGAGACCTCATTGTAAG \\
S13 & TTATCT & 20 & TCCCCAACTCCCAACTGACCTTATCTGTGGGGGAGGCTTTTGA \\
S14 & TTATCT & 2 & CCTTATCTGTGGGGGAGGCTTTTGAAAAGTAATTAGGTTTAGC \\
S15 & TTATCA & 10 & ATTATTTTCCTTATCAGAAGCAGAGAGACAAGCCATTTCTCTTTCCTCCC \\
S16 & CTATAA & 3 & AGGCTATAAAAAAAATTAAGCAGCAGTATCCTCTTGGGGGCCCCTTC \\
S17 & TTATCC & 13 & CCAGCACACACACTTATCCAGTGGTAAATACACATCAT \\
S18 & AGATAT & 20 & TCAAATAGGTACGGATAAGTAGATATTGAAGTAAGGAT \\
S19 & TGATAA & 16 & ACTTGGGGTTCCAGTTTGATAAGAAAAGACTTCCTGTGGA \\
S20 & AGATAA & 24 & TGGCCGCAGGAAGGTGGGCCTGGAAGATAACAGCTAGTAGGCTAAGGCCA \\
S21 & CTGTAT & 12 & CAACCACAACCTCTGTATCCGGTAGTGGCAGATGGAAA \\
S22 & CTGTAT & 0 & CTGTATCCGGTAGTGGCAGATGGAAAGAGAAACGGTTAGAA \\
S23 & CTATCT & 23 & GAAAAAAAATAAATGAAGTCTGCCTATCTCCGGGCCAGAGCCCCT \\
S24 & TTGTCT & 4 & TGCCTTGTCTGTTGTAGATAATGAATCTATCCTCCAGTGACT \\
S25 & TTATCT & 17 & GGCCAGGCTGATGGGCCTTATCTCTTTACCCACCTGGCTGT \\
S26 & AGGTCC & 7 & CAACAGCAGGTCCTACTATCGCCTCCCTCTAGTCTCTG \\
S27 & TTATCA & 19 & CCAACCGTTAATGCTAGAGTTATCACTTTCTGTTATCAAGTGGCTTCAGC \\
S28 & CTATCT & 15 & GGGAGGGTGGGGCCCCTATCTCTCCTAGACTCTGTG \\
S29 & TTGTCA & 2 & CTTTGTCACTGGATCTGATAAGAAACACCACCCCTGC \\
\noalign{\smallskip}\hline
\end{tabular}
} % matches {\small
\end{table}
\section{Discussion}
This article reviewed two MCMC algorithms, the Metropolis--Hastings algorithm and Gibbs sampling, and introduced their applications in phylogenetics and motif scanning, each with an example. From the examples we found that MCMC results converge when enough trials are performed, and that the initialization of the chain can have a great effect on the results. The Python code for the two examples is also included in the e-mail; a small fix is needed if you want to plot Figure~1 and Figure~2.

\bibliography{test_latex}
\bibliographystyle{plain}

\end{document}


Then, test_latex.bib:

@article{metropolis1953equation,
  title={Equation of state calculations by fast computing machines},
  author={Metropolis, Nicholas and Rosenbluth, Arianna W and Rosenbluth, Marshall N and Teller, Augusta H and Teller, Edward},
  journal={The Journal of Chemical Physics},
  volume={21},
  number={6},
  pages={1087--1092},
  year={1953},
  publisher={AIP Publishing}
}

@article{hastings1970monte,
  title={Monte Carlo sampling methods using Markov chains and their applications},
  author={Hastings, W Keith},
  journal={Biometrika},
  volume={57},
  number={1},
  pages={97--109},
  year={1970},
  publisher={Biometrika Trust}
}

@article{gelman1992inference,
  title={Inference from iterative simulation using multiple sequences},
  author={Gelman, Andrew and Rubin, Donald B},
  journal={Statistical Science},
  pages={457--472},
  year={1992},
  publisher={JSTOR}
}

@article{rannala1996probability,
  title={Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference},
  author={Rannala, Bruce and Yang, Ziheng},
  journal={Journal of Molecular Evolution},
  volume={43},
  number={3},
  pages={304--311},
  year={1996},
  publisher={Springer}
}

@article{yang1997bayesian,
  title={Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method},
  author={Yang, Ziheng and Rannala, Bruce},
  journal={Molecular Biology and Evolution},
  volume={14},
  number={7},
  pages={717--724},
  year={1997},
  publisher={SMBE}
}

@book{yang2006computational,
  title={Computational molecular evolution},
  author={Yang, Ziheng},
  volume={284},
  year={2006},
  publisher={Oxford University Press}
}

@article{xia2012position,
  title={Position weight matrix, {G}ibbs sampler, and the associated significance tests in motif characterization and prediction},
  author={Xia, Xuhua},
  journal={Scientifica},
  volume={2012},
  year={2012},
  publisher={Hindawi Publishing Corporation}
}

To compile:

pdflatex test_latex.tex
bibtex test_latex
pdflatex test_latex.tex
pdflatex test_latex.tex


