PHYLOGENY RECONSTRUCTION: THE BASICS Ciona s.Homo Anopheles g. C. elegans C. briggsae Apis m. Drosophila m. Drosophila p. Danio r.Takifugu r.Tetraodon n. Gallus g. Xenopus t. Monodelphis d. Mus m. Rattus n. Canis f.Bos t.Maccaca m.Pan t. Saccharomyces s. Ciona s.Homo Anopheles g. C. elegans C. briggsae Apis m. Drosophila m. Drosophila p. Danio r.Takifugu r.Tetraodon n. Gallus g. Xenopus t. Monodelphis d. Mus m. Rattus n. Canis f.Bos t.Maccaca m.Pan t. Saccharomyces s. A Simple Concept of Speciation The Coalescent Model Wright Fisher population model with constant population size. Each generation chooses its parent at random. Pairs of lineages are traced back to a coalescent event. Kingman (1982) developed a continuous model that allows to estimate times between the coalescent events. The coalescent rate for any pair of genetic lineages is proportional to 1/Ne in generations or to 1/ in substitutions. The evolution of biological sequences The evolution of biological sequences Biological sequences can change by substitutions ...deletions and insertions At the leafs of the tree we find the contemporary sequences a a Divergence and Coalescent What we have... What we want... ? Public Data Sources The Problem: Finding the homologous positions Seq1: - A C G A Seq2: T A C G T Seq3: - A T - T Seq4: - A T G T A C G A T A C G T A T T A T G T Seq1 Seq2 Seq4 Seq3 The objective function An mathematical function able to measure the biological quality of an alignment... The objective function Related questions: What should a biologically correct alignment look like? To what extent can we define and formalize its properties? An mathematical function able to measure the biological quality of an alignment... The objective function Mathematical Optimal Alignment Biologically Optimal Alignment minimize Related questions: What should a biologically correct alignment look like? To what extent can we define and formalize its properties? An mathematical function able to measure the biological quality of an alignment... The objective function A mathematical function meant to measure the biological quality of an alignment... ! "(#) = S(ai i=1 n $ ,bi ) ! (): the score of the pairwise alignment n : length of ai : letter of sequence A at position i in bi : letter of sequence B at position i in The objective function A mathematical function meant to measure the biological quality of an alignment... ! Objective: find that maximizes ()! ! "(#) = S(ai i=1 n $ ,bi ) The scoring function S, an example then we look for that alignment, that gives us the highest score by summing up the column scores S(ai,bj) for all columns of the alignment. Given two sequences A ={a1,a2,....,an} and B={b1,b2,....,bm} and a scoring function S such that ! S(ai,bj ) = +5, if ai = bj "2, if ai # bj "6, for introduction of a gap $ % & ' & The scoring function S, an example For example: T G C T C G T A T - - T C A T A +5 -6 -6 +5 +5 +5 +5 = 11 then we look for that alignment, that gives us the highest score by summing up the column scores S(ai,bj) for all columns of the alignment.! S(ai,bj ) = +5, if ai = bj "2, if ai # bj "6, for introduction of a gap $ % & ' & Given two sequences A ={a1,a2,....,an} and B={b1,b2,....,bm} and a scoring function S such that continue..... T G C T C G T A T - - T C A T A +5 -6 -6 +5 +5 +5 +5 = 11 T G C T C G T A T - T - C A T A +5 -6 -2 -6 +5 +5 +5 = 4 A1: A2: etc... Why not just scoring all alignments? * There are far too many number of possible pairwise alignments: for two sequences of length N=300 there are 10179 possibilities ! 2n n " # $ % & ' Why not just scoring all alignments? * There are far too many number of possible pairwise alignments: for two sequences of length N=300 there are 10179 possibilities ! 2n n " # $ % & ' Hence, we need a smart way to cut the computation short, like the dynamic programming approach for pairwise alignments by Needleman and Wunsch (1970). Re-use of previous results T G C T C G T A T - - T C A T A +5 -6 -6 +5 +5 +5 +5 = 11 T G C T C G T A T - T - C A T A +5 -6 -2 -6 +5 +5 +5 = 4 A1: A2: etc... Dynamic Programming A dynamic programming approach usually includes: A mathematical description of the (biological) quality of an solution, i.e. an recursive objective function The computation of all intermediate values needed to obtain the globally optimal solution, thereby avoiding double-computations The reconstruction of the globally optimal solution from the values obtained in the previous step (backtracking) The Needleman-Wunsch pair-wise alignment "(i, j) = max "(i #1, j #1) + S(ai,bj ) "(i, j #1) + S(gap,bj ) "(i #1, j) + S(ai,gap) $ % & ' & S(ai,bj ) = +5, if ai = bj "2, if ai # bj "6, for introduction of a gap $ % & ' & Scoring function Objective function The Needleman-Wunsch algorithm (i,j-1) (i-1,j)(i-1,j-1) ! "(i, j) = max "(i #1, j #1) + S(ai,bj ) "(i, j #1) + S(gap,bj ) "(i #1, j) + S(ai,gap) $ % & ' & (i,j) is the optimal alignment score up to and including ai and bj S(ai,bj ) = +5, if ai = bj "2, if ai # bj "6, for introduction of a gap $ % & ' & Needleman-Wunsch algorithm: Initialization S(ai,bj ) = +5, if ai = bj "2, if ai # bj "6, for introduction of a gap $ % & ' & The Needleman-Wunsch algorithm: Recursion S(ai,bj ) = +5, if ai = bj "2, if ai # bj "6, for introduction of a gap $ % & ' & The Needleman-Wunsch algorithm: Recursion S(ai,bj ) = +5, if ai = bj "2, if ai # bj "6, for introduction of a gap $ % & ' & The Needleman-Wunsch algorithm: Recursion S(ai,bj ) = +5, if ai = bj "2, if ai # bj "6, for introduction of a gap $ % & ' & Needleman-Wunsch algorithm: Backtrack * * Needleman-Wunsch algorithm: Backtrack A* A* Needleman-Wunsch algorithm: Backtrack TA* TA* Needleman-Wunsch algorithm: Backtrack Alignment Score: 11*TGCTCGTA* *T--TCATA* Smith-Waterman pairwise local alignment "(i, j) = max "(i #1, j #1) + S(ai,bj ) "(i, j #1) + S(gap) "(i #1, j) + S(gap) 0 $ % & & ' & & S(ai,bj ) = +5, if ai = bj "2, if ai # bj "6, for introduction of a gap $ % & ' & Smith-Waterman pairwise local alignment *TCGTA* *TCATA* Alignment Score: 18 Affine Gap costs ! g(l) = go + l" ge ! "(i, j) = max "(i #1, j #1) + S(ai,bj ) maxk= 0 i#1 ("(k, j) + g(i # k)), gap in B maxk= 0 j#1 ("(i,k) + g( j # k)), gap in A $ % & ' & ! "(i, j) = max "(i #1, j #1) + S(ai,bj ) "(i, j #1) + S(gap,bj ) "(i #1, j) + S(ai,gap) $ % & ' & Alternative Scoring Functions Blosum62: PAM250: Many others... Exact vs. Heuristic searches Both, Needleman-Wunsch and Smith-Waterman alignment methods are exact methods since they guarantee a globally optimal solution for the optimization problem! Drawback: Computational expensive, i.e. O(nm) in time and memory Exact vs. Heuristic searches Solutions: omit regions from the grid, that cannot contribute to the optimal alignment (reduction of the search space, by remaining exact) Exact vs. Heuristic searches Solutions: use of heuristics (more rigorous reduction of the search space, sacrificing the guaranteed optimal solution for search speed) Hashing * Lookup method for finding an alignment Pos: 1 2 3 4 5 6 7 8 9 10 11 Seq 1: k c s p t a . . . . . Seq 2: . . . . . a c s p r k -10-r 066a --5t -594p -583s -572c 10111k OffsetPos in Seq 2Pos in Seq 1Amino acid Hashing * Lookup method for finding an alignment Pos: 1 2 3 4 5 6 7 8 9 10 11 Seq 1: k c s p t a . . . . . Seq 2: . . . . . a c s p r k -10-r 066a --5t -594p -583s -572c 10111k OffsetPos in Seq 2Pos in Seq 1Amino acid Seq 1: k c s p t a Seq 2: a c s p r k Resulting alignment: What we are really looking for: How to construct Multiple Sequence Alignments? Optimal Solution: Extend Needleman-Wunsch or Smith-Waterman to multiple sequences How to construct Multiple Sequence Alignments? Optimal Solution: Extend Needleman-Wunsch or Smith-Waterman to multiple sequences How to construct Multiple Sequence Alignments? Optimal Solution: Extend Needleman-Wunsch or Smith-Waterman to multiple sequences But O(nm) in time and memory: Computationally not feasible... 4 sequences of length 1000 -> 1TB RAM A new objective function: Sum of Pairs Seq1: AGA--CTA Seq2: G-A--CTT Seq3: AGAAACTT A new objective function: Sum of Pairs Seq1: AGA--CTA Seq2: G-A--CTT Seq3: AGAAACTT Seq1: AGA--CTA Seq2: G-A--CTT Seq1: AGA--CTA Seq3: AGAAACTT Seq2: G-A--CTT Seq3: AGAAACTT ! S(ai,bj ) = +5, if ai = bj "2, if ai # bj "6, for introduction of a gap $ % & ' & Seq1: AGA--CTA Seq2: G-A--CTT Seq1: AGA--CTA Seq3: AGAAACTT Seq2: G-A--CTT Seq3: AGAAACTT Score: +5 Score: +11 Score: 0 A new objective function: Sum of Pairs Seq1: AGA--CTA Seq2: G-A--CTT Seq3: AGAAACTT Seq1: AGA--CTA Seq2: G-A--CTT Seq1: AGA--CTA Seq3: AGAAACTT Seq2: G-A--CTT Seq3: AGAAACTT ! S(ai,bj ) = +5, if ai = bj "2, if ai # bj "6, for introduction of a gap $ % & ' & Seq1: AGA--CTA Seq2: G-A--CTT Seq1: AGA--CTA Seq3: AGAAACTT Seq2: G-A--CTT Seq3: AGAAACTT Score: +5 Score: +11 Score: 0 SUM OF PAIRS SCORE: 16 Progressive Alignment Strategies (ClustalW) The sequences are added stepwise. Thus, never more than two sequences (or multiple sequence alignments) are simultaneously aligned Sequences or MSAs are aligned using Dynamic Programming Progressive Alignment Strategies (ClustalW) Scoring for the alignment of two alignments respective weights of the sequences x and yx, y score for aligning position i in sequence x from alignment a to position j in sequence y from alignment b number of sequences in alignments a and b, respectivelyn,m score for aligning column i from alignment (or sequence) a to column j from alignment or sequence b (ai,bj): ! S(ax i ,by j ) ! "(ai ,b j ) = 1 n + m S(ax i ,by j ) y=1 m # x=1 n # $%x $%y Scoring for the alignment of two alignments With sequence weights: Score = (S(t,v)*15 + S(t,i)*16 + S(l,v)*25 + S(l,i)*26 + S(k,v)*35 + S(k,i)*36 + S(k,v)*45 + S(k,i)*46)/8 1 peeksavtal 2 geekaavlal 3 padktnvkaa 4 aadktnvkaa 4 egewglvlhv 5 aaektkirsa ! "(ai ,b j ) = 1 n + m S(ax i ,by j ) y=1 m # x=1 n # $%x $%y Features of ClustalW progressive strategy Distance based generation of a guide tree (approximative or exact) tree-guided (NJ) alignment change of the scoring matrix as the alignment proceeds (adaptation to increasing divergence of the sequences dynamic variation of gap penalties in position- and residue-specific manner * gap opening penalties are locally reduced in stretches of 5 or more hydrophilic residues (indicative of loop or random coil regions). * gap penalties are locally increased within eight residues of existing gaps. sequence weighting (Known) Problem of ClustalW: Local Optima a.k.a: Once a gap always a gap Iterative Alignment Strategy set of sequences initial (suboptimal) alignment refined alignment optimized alignment refinement checkno! yes! convergence? END Stochastic Iterative Alignment set of sequences initial (suboptimal) alignment modified alignment optimized alignment random modification assess score no! yes! convergence? END acceptance function accept reject/revert alignment Stochastic Iterative Alignment (SAGA) set of sequences initial (suboptimal) alignment(s) modified alignment optimized alignment random modification assess score no! yes! convergence? END acceptance function accept reject/revert alignment Genetic Algorithm: Alignments evolve by `mutation' and crossing over alignments score determines fitness over the generations, alignments survive and reproduce or die Non-Stochastic Iterative Alignment Point: The initial alignment is modified by splitting it into two groups and re-aligning them with dynamic programming. Example: Prrp, both, alignment (inner loop) and tree/weight (outer loop) are optimized. Consistency based algorithm Point: The optimal MSA is defined as the one that agrees the most with all optimal pair-wise alignments Features: does not depend on a specific substitution rate can apply any method capable to align two sequences position dependent, i.e. the score associated with the alignment of two residues depends on their position within the sequence rather that their individual nature rationale: given a set of independent observations, the constellation most often observed is often closer to the truth Consistency based Objective Function For alignEment Evaluation (COFFEE) The Principle of T-Coffee Position specific substitution matrix The score of each pair of residues depends on the compatibility of this pair with the rest of the library A comparison Reconstructing Trees from Sequences Reconstructing the tree of life Notations How many possible trees are there? How many possible trees are there? ! b(n) = (2n " 5)! 2n"3 (n " 3)! ! b(10) = 2027025 ! b(55) = 2.9 "1084 ! b(100) =1.7 "10182 Finding the root of the tree Finding the root of the tree Outgroup-routing Midpoint-routing Tree Formats Three typical representations of the same tree (((((CIOIN:0.4222,HOMSA:0.2777)97:0.0575,(ACRMI:0.2611,HYD MA:0.3700)100:0.0745)100:0.0764,DROME:0.4200)100:0.1034, CAEEL:0.6027):0.5804,SACCE:0.5832); #NEXUS begin taxa; dimensions ntax=7; taxlabels DROME CIOIN HYDMA SACCE CAEEL ACRMI HOMSA ; end; begin trees; tree [&r] tree_1 = (((((CIOIN:0.4222,HOMSA:0.2777) [&label=97]:0.0575, (ACRMI:0.2611,HYDMA:0.37)[&label=100]:0.0745) [&label=100]:0.0764, DROME:0.42)[&label=100]:0.1034,CAEEL:0.6027):0.5804, SACCE:0.5832); end; begin figtree; set appearance.backgroundColour=#-1; end figtree; Complex trees Tree display is an unsolved problem Some more notations Monophyletic group (clade, sistergroup) Chimpanzee Human Gorilla Paraphyletic group (e.g. reptiles) Birds Crocodiles Lizards Turtles Storks Birds of Prey Old world vultures New world vultures Polyphyletic group Character based phylogeny reconstruction: A character has to be expressed in at least two states in the taxa under study. Taxa are grouped on the basis of shared character states. An evolutionary derived character (state) is called an Apomorphy Aut-Apomorphy: an evolutionary derived character (state) present only in a single taxon Syn-Apomorphy: an evolutionary derived character (state) shared by a group of taxa. Plesiomorphy: an ancestral character (state) shared by a group of extant taxa. Homoplasy: A derived character (state) that is shared for reasons other than common decent. The Parsimony Principle A rule in science and philosophy stating that entities should not be multiplied needlessly. This rule is interpreted to mean that the simplest of two or more competing theories is preferable and that an explanation for unknown phenomena should first be attempted in terms of what is already known. Also called law of parsimony. (Ockham`s razor, ca 1285-1350) The Parsimony Principle A,B A,B,C A,C A,B A,B,C +C -B A,B A,B,C A,C A,C +B,-C +B The Parsimony Principle A,B A,B,C A,C A,B A,B,C +C -B A,B A,B,C A,C A,C +B,-C +B How to infer a tree from the data The Criterion of Maximum Parsimony The Criterion of Maximum Parsimony Find the tree that minimizes the following expression: where diff measures the distance between two characters j is an alignment specific weight factor L alignment length B number of branches in the tree k` and k`` are the two nodes connected by branch k The Criterion of Distance (Hamming Distance) caactgattattcacseq 4 tagccctttgaacgcseq 3 tagccctttaaatgcseq 2 tcattgtccattcgaseq 1 09108Seq 4 90211Seq 3 102011Seq 2 811110Seq 1 Seq 4Seq 3Seq 2Seq 1 ! Distance-based tree reconstruction Find branch lengths L(b) such that the sum of the branch lengths connecting any two leaves gets close to the measured distances between all pairs of leaves. That is Dmeasured (A,B)= L(1)+L(2)+L(3)+L(4) A BD C L(1) L(2) L(3) L(4) Clustering Methods UPGMA = Unweighted Pair Group Methods using Arithmetic means. The ultrametric condition implies the molecular clock Clustering methods work well, if sequences evolve according to a molecular clock or equivalently: if the ultrametric inequality is holds: d(A,B) " max d(A,C),d(B,C){ } for each triple (A,B,C) Theorem: Four-Point-Condition A distance matrix (di,j)i,j=1....n is representable as a tree, if and only if ! d(u,v)+ d(x,z) " max d(u,x)+ d(v,z),d(u,z)+ d(v,x){ } for all { }nzxvu ,...,2,1,,, ! Getting rid of the molecular clock requirement Theorem: Four-Point-Condition A distance matrix (di,j)i,j=1....n is representable as a tree, if and only if ! d(u,v)+ d(x,z) " max d(u,x)+ d(v,z),d(u,z)+ d(v,x){ } for all { }nzxvu ,...,2,1,,, ! 2.5 3.5 2 2.5 8.5 The Neighbour Joining Algorithm 1. begin with star tree: 2. compute for each pair (1,2) the net-divergence 3. take the pair (A,B) that minimizes Eq. (1) (1) The Neighbour Joining Algorithm 4. cluster (A,B) and define an interior node W 5. compute branch lengths for the external edges: L(A,W ) = 1 2 D(A,B) + 1 m " 2 D(A,k) " D(B,k) k=1 m # $ % & ' ( ) L(B,W ) = D(A,B) 2 " L(A,W ) The Neighbour Joining Algorithm 6. compute distance W to the remaining m-2 leaves: ! D(W ,k) = 1 2 D(A,k) + D(B,k) " D(A,B)( ) 7. continue with step 1 with the reduced set of leaves The Neighbour Joining Algorithm Least-squares Methods Find a tree that minimizes ! S(") = #(i,k) $ D(i,k)( ) i,k % 2 where (i,k) is the length of the unique path connecting leaves i and k in the tree. Distance Correction A G C C G C G C A G T = 0 A G C C A T G C A G T = 1 T = 2 T = 3 T = 4 T = 5 T = 6 T = 7 T = 8 T = 9 T =10 A G C C A C G C A G A G C C C C G C A G A G C C C G G C A G A G C A C G G C A G A G G A C G G C A G A G G A C T G C A G A G G A C T G C A A A G A A C T G C A A A G A A C C G C A A substitutions 0 2 4 6 8 10 0 2 4 6 8 10 true substitutions observerd differences Substitutions Jukes Cantor Correction Forumla ! obs(d) = 3 4 " 3 4 Exp["4d /3] obs(d) can be estimated from the number of observed different pairs of positions n1 between two aligned sequences of length l. Solving ! n1 l = 3 4 " 3 4 Exp["4d /3] leads to Jukes Cantor correction: d = " 3 4 Log 1" 4 3 n1 l # $% & '( Distance Correction substitutions true substitutions observerd differences Substitutions The Problem: Different alignments, different trees Seq1: - N Y L S Seq2: N K Y L S Seq3: - N F - S Seq4: - N F L S N Y L S N K Y L S N F S N F L S Seq1 Seq2 Seq4 Seq3 Seq1: N - Y L S Seq2: N K Y L S Seq3: N - F - S Seq4: N - F L S Seq1 Seq4 Seq2 Seq3 The Problem: Different alignments, different trees Seq1: - N Y L S Seq2: N K Y L S Seq3: - N F - S Seq4: - N F L S N Y L S N K Y L S N F S N F L S Seq1 Seq2 Seq4 Seq3 The alignment strategy may have more impact on the reconstructed tree than does the type of tree building method. Morrison and Ellis (1997) Mol. Biol. Evol. 14:428-441 Focussing on stable parts of the alignment Gblocks (Castresana (2000) Mol. Biol. Evol. 17:540-552 Objective: Define a set of conserved blocks from an alignment to be used in phylogeny reconstuction Approach: 1) Classification of Columns non-conserved : n/2 + 1 and < 85% identical residues highly conserved :>85% identical residues 2) discard contiguous stretches of non-conserved positions (default l = 8) 3) from remaining blocks: remove flanking positions until blocks begin and end with highly conserved positions, i.e. selected blocks are anchored by positions that can be aligned with high confidence 4) discard blocks with l < 15 5) remove all positions with gaps together with adjacent positions until a conserved position is reached 6) discard blocks with l < 10 Note: all given values are the program defaults as given in the original publication Focussing on stable parts of the alignment Hidden paralogy mimics orthology ma :0) A typical variant: Weighted Sum of Pairs Seq1: AGA--CTA Seq2: AGA--CTA Seq3: G-A--CTT Seq4: AGAAACTT Seq1: AGA--CTA Seq3: G-A--CTT Seq1: AGA--CTA Seq4: AGAAACTT Seq3: G-A--CTT Seq4: AGAAACTT SUM OF PAIRS SCORE: 62 Seq1: AGA--CTA Seq2: AGA--CTA Seq2: AGA--CTA Seq3: G-A--CTT Seq2: AGA--CTA Seq4: AGAAACTT Score: 2*(+5) Score: 2*(+11) Score: 0Score: +30 A typical variant: Weighted Sum of Pairs Seq1: AGA--CTA Seq2: AGA--CTA Seq3: G-A--CTT Seq4: AGAAACTT Seq1: AGA--CTA Seq3: G-A--CTT Seq1: AGA--CTA Seq4: AGAAACTT Seq3: G-A--CTT Seq4: AGAAACTT SUM OF PAIRS SCORE: 62 Seq1: AGA--CTA Seq2: AGA--CTA Seq2: AGA--CTA Seq3: G-A--CTT Seq2: AGA--CTA Seq4: AGAAACTT Score: 2*(+5) Score: 2*(+11) Score: 0Score: +30 Seq1/Seq2 Seq4 Seq30.49 0.2 0.29 Weighting of sequences: one variant Seq1/Seq2 Seq4 Seq30.49 0.2 0.29 Seq1: AGACTA Seq2: AGACTA Seq3: GACTT Seq4: AGAAACTT Dataset: - 3 3 -4 -2 -1 421 Compute Pairwise Distance Matrix Reconstruct Seq1: (0.29/2+0.2/3)=0.21 Seq2: (0.29/2+0.2/3)=0.21 Seq3: 0.49 Seq4: (0.29+0.2/3)=0.36 Apply weights Seq1: 0.43 Seq2: 0.43 Seq3: 1 Seq4: 0.73 Normalize A typical variant: Weighted Sum of Pairs ! "wsop (#) = $i i< j % $ j S(#i,# j ) Seq1: AGA--CTA Seq2: AGA--CTA Seq3: G-A--CTT Seq4: AGAAACTT Seq1: AGA--CTA Seq3: G-A--CTT Seq1: AGA--CTA Seq4: AGAAACTT Seq3: G-A--CTT Seq4: AGAAACTT SUM OF PAIRS SCORE: 16.7 Seq1: AGA--CTA Seq2: AGA--CTA Seq2: AGA--CTA Seq3: G-A--CTT Seq2: AGA--CTA Seq4: AGAAACTT Score: (0.43x5)2 Score: (0.43x0.73x11)2 Score: 0Score: 0.432x30 Orthologous Sequences, Please!! Arguments for orthology assumption: a sequence tree that is congruent to the species tree conservation of genomic position sequence similarity (typically, reciprocal best blast hit) similarity of function Orthologous Sequences, Please!! Arguments for orthology assumption: a sequence tree that is congruent to the species tree conservation of genomic position sequence similarity (typically, reciprocal best blast hit) similarity of function