>gi|5835135|ref|NC_001644.1| Pan paniscus mitochondrion, complete genome
GTTTATGTAGCTTACCCCCTTAAAGCAATACACTGAAAATGTTTCGACGGGTTTATATCACCCCATAAAC
AAACAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCGTCCCGTGAG
TCACCCTCTAAATCACCATGATCAAAAGGAACAAGTATCAAGCACACAGCAATGCAGCTCAAGACGCTTA
GCCTAGCCACACCCCCACGGGAGACAGCAGTGATAAACCTTTAGCAATAAACGAAAGTTTAACTAAGCCA
TACTAACCTCAGGGTTGGTCAATTTCGTGCTAGCCACCGCGGTCACACGATTAACCCAAGTCAATAGAAA
CCGGCGTAAAGAGTGTTTTAGATCACCCCCCCCCCAATAAAGCTAAAATTCACCTGAGTTGTAAAAAACT
CCAGCTGATACAAAATAAACTACGAAAGTGGCTTTAACACATCTGAACACACAATAGCTAAGACCCAAAC
TGGGATTAGATACCCCACTATGCTTAGCCCTAAACTTCAACAGTTAAATTAACAAAACTGCTCGCCAGAA
CACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGGAGCCTGTTCT
GTAATCGATAAACCCCGATCAACCTCACCGCCTCTTGCTCAGCCTATATACCGCCATCTTCAGCAAACCC
TGATGAAGGTTACAAAGTAAGCGCAAGTACCCACGTAAAGACGTTAGGTCAAGGTGTAGCCTATGAGGCG
GCAAGAAATGGGCTACATTTTCTACCCCAGAAAATTACGATAACCCTTATGAAACCTAAGGGTCGAAGGT
GGATTTAGCAGTAAACTAAGAGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGT
CACCCTCCTCAAGTATACTTCAAAGGATATTTAACTTAAACCCCTACGCATTTATATAGAGGAGATAAGT
CGTAACATGGTAAGTGTACTGGAAAGTGCACTTGGACGAACCAGAGTGTAGCTTAACATAAAGCACCCAA
CTTACACTTAGGAGATTTCAACTCAACTTGACCACTCTGAGCCAAACCTAGCCCCAAACCCCCTCCACCC
TACTACCAAACAACCTTAACCAAACCATTTACCCAAATAAAGTATAGGCGATAGAAATTGTAAATCGGCG
CAATAGATATAGTACCGCAAGGGAAAGATGAAAAATTACACCCAAGCATAATACAGCAAGGACTAACCCC
TGTACCTTTTGCATAATGAATTAACTAGAAATAACTTTGCAAAGAGAACTAAAGCCAAGATCCCCGAAAC
CAGACGAGCTACCTAAGAACAGCTAAAAGAGCACACCCGTCTATGTAGCAAAATAGTGGGAAGATTTATA
GGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGGTTGTCCAAGATAGAATCTTAGTTCAACTTTA
AATTTACCTACAGAACCCTCTAAATCCCCCTGTAAATTTAACTGTTAGTCCAAAGAGGAACAGCTCTTTA
GACACTAGGAAAAAACCTTATGAAGAGAGTAAAAAATTTAATGCCCATAGTAGGCCTAAAAGCAGCCACC
AATTAAGAAAGCGTTCAAGCTCAACACCCACAACCTCAAAAAATCCCAAGCATACAAGCGAACTCCTTAC
GCTCAATTGGACCAATCTATTACCCCATAGAAGAGCTAATGTTAGTATAAGTAACATGAAAACATTCTCC
TCCGCATAAGCCTACTACAGACCAAAATATTAAACTGACAATTAACAGCCCAATATCTACAATCAACCAA

MODULARIZACE VÝUKY EVOLUČNÍ A EKOLOGICKÉ BIOLOGIE (Modularization of Teaching in Evolutionary and Ecological Biology), CZ.1.07/2.2.00/15.0204
Definition of basic concepts:
- phylogenetic tree = phylogeny: rooted, unrooted
- branches = edges: peripheral, internal, central
- nodes = vertices: internal, terminal
- dichotomy = bifurcation, polytomy = multifurcation
- OTU = operational taxonomic unit, HTU = hypothetical taxonomic unit
- tree topology
- path: connects two terminal nodes
- lineage: connects a terminal node with the root
[Figures: example trees. Sources: http://www.almob.org/content/figures/1748-7188-2-8-1-l.jpg and http://www.vizachero.com/R1b1/R1bSplits.png]

How many trees?
Magnitudes shown for comparison: number of electrons in the visible universe (Eddington number) > Avogadro constant*)
*) 6.022 140 76 × 10²³ mol⁻¹

What type of data can we use?

Types of data:
DATA
- Distances: immunology, DNA-DNA hybridization
- Discrete characters:
  - binary: 11010010011
  - multistate, unordered: ACGTTAGCT
  - multistate, ordered (A → B → C): ABCDEF

Types of data:
Nucleotide and protein sequences:
H_sapiens  MTPMRKINPLMKLINHSFIDLPTPSNISAWWNFGS
P_troglod  ATGACCCCGACACGCAAAATTAACCCACTAATAAA
- site = character
- base = character state
- retroelements: SINE (Alu, B1, B2), LINE
- microsatellites, SNP

Problem with homology of sequences:
[Figure]
Individual sites in DNA sequences may not be fully independent!
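The "How many trees?" slide above compares the number of possible trees with very large physical constants. A minimal sketch of that comparison follows. It assumes the standard counting formulas for strictly bifurcating trees with n labelled taxa, (2n−5)!! unrooted and (2n−3)!! rooted topologies, which are not spelled out in the extracted slide text. All function names are illustrative.

```python
AVOGADRO = 6.02214076e23  # mol^-1, the value quoted on the slide


def double_factorial_odd(k: int) -> int:
    """Return k * (k-2) * ... * 3 * 1 for odd k; returns 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted(n_taxa: int) -> int:
    """Unrooted bifurcating topologies for n_taxa >= 3 labelled tips: (2n-5)!!"""
    if n_taxa < 3:
        raise ValueError("an unrooted bifurcating tree needs at least 3 taxa")
    return double_factorial_odd(2 * n_taxa - 5)


def count_rooted(n_taxa: int) -> int:
    """Rooted bifurcating topologies for n_taxa >= 2 labelled tips: (2n-3)!!"""
    if n_taxa < 2:
        raise ValueError("a rooted bifurcating tree needs at least 2 taxa")
    return double_factorial_odd(2 * n_taxa - 3)


def first_n_exceeding(limit: float) -> int:
    """Smallest number of taxa whose rooted-tree count is greater than limit."""
    n = 2
    while count_rooted(n) <= limit:
        n += 1
    return n


if __name__ == "__main__":
    for n in (4, 6, 10, 20):
        print(n, count_unrooted(n), count_rooted(n))
    print("rooted trees outnumber the Avogadro constant from n =",
          first_n_exceeding(AVOGADRO))
```

Python integers are arbitrary precision, so the counts stay exact even when they pass the limits of floating point.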
Sequences
DNA databases:
- EMBL (European Molecular Biology Laboratory) – European Bioinformatics Institute, Hinxton, UK: http://www.ebi.ac.uk/embl/
- GenBank – NCBI (National Center for Biotechnology Information), Bethesda, Maryland, USA: http://www.ncbi.nlm.nih.gov/Genbank/
- DDBJ (DNA Data Bank of Japan) – National Institute of Genetics, Mishima, Japan: http://www.ddbj.nig.ac.jp/
Database management: usually the Sybase or ORACLE packages
Outputs: ASCII (American Standard Code for Information Interchange)

Sequences
Protein databases:
- SWISS-PROT – University of Geneva & Swiss Institute of Bioinformatics: http://www.expasy.ch/sprot/ and http://www.ebi.ac.uk/swissprot/
- PIR (Protein Information Resource) – NBRF (National Biomedical Research Foundation, Washington, D.C., USA) & Tokyo University & JIPID (Japanese International Protein Information Database, Tokyo) & MIPS (Martinsried Institute for Protein Sequences, Martinsried, Germany): http://www-nbrf.georgetown.edu/
- PRF/SEQDB (Protein Research Foundation) – Osaka, Japan: http://www.prf.or.jp/en/os.htm
- PDB (Protein Data Bank) – Rutgers, the State University of New Jersey & San Diego Supercomputer Center, University of California & National Institute of Standards and Technology: http://www.rcsb.org/pdb/

File formats: FASTA
>H_sapiens
ATGACCCCAATACGCAAAATTAACCCCCTAATAAAATTAATTAACCACTCATTCATCGACCTCCCCACCC
CATCCAACATCTCCGCATGATGAAACTTCGGCTCACTCCTTGGCGCCTGCCTGATCCTCCAAATCACCAC
AGGACTATTCCTAGCCATACACTACTCACCAGACGCCTCAACCGCCTTTTCATCAATCGCCCACATCACT
CGAGACGTAAATTATGGCTGAATCATCCGCTACCTTCACGCCAATGGCGCCTCAATATTCTTTATCTGCC
TCTTCCTACACATCGGGCGAGGCCTATATTACGGATCATTTCTCTACTCAGAAACCTGAAACATCGGCAT
...
>P_troglod
ATGACCCCGACACGCAAAATTAACCCACTAATAAAATTAATTAATCACTCATTTATCGACCTCCCCACCC
CATCCAACATTTCCGCATGATGGAACTTCGGCTCACTTCTCGGCGCCTGCCTAATCCTTCAAATTACCAC
AGGATTATTCCTAGCTATACACTACTCACCAGACGCCTCAACCGCCTTCTCGTCGATCGCCCACATCACC
CGAGACGTAAACTATGGTTGGATCATCCGCTACCTCCACGCTAACGGCGCCTCAATATTTTTTATCTGCC
TCTTCCTACACATCGGCCGAGGTCTATATTACGGCTCATTTCTCTACCTAGAAACCTGAAACATTGGCAT
...
>P_paniscus
ATGACCCCAACACGCAAAATCAACCCACTAATAAAATTAATTAATCACTCATTTATCGACCTCCCCACCC
CATCCAATATTTCCACATGATGAAACTTCGGCTCACTTCTCGGCGCCTGCCTAATCCTTCAAATCACCAC
AGGACTATTCCTAGCTATACACTACTCACCAGACGCCTCAACCGCCTTCTCATCGATCGCCCACATTACC
CGAGACGTAAACTATGGTTGAATCATCCGCTACCTTCACGCTAACGGCGCCTCAATACTTTTCATCTGCC
TCTTCCTACACGTCGGTCGAGGCCTATATTACGGCTCATTTCTCTACCTAGAAACCTGAAACATTGGCAT
...

File formats: GenBank
ORIGIN
        1 tgaaatgaag atattctctt ctcaagacat caagaagaag gaactactcc ccaccaccag
       61 cacccaaagc tggcattcta attaaactac ttcttgtgta cataaattta catagtacaa
      121 tagtacattt atgtatatcg tacattaaac tattttcccc aagcatataa gcaagtacat
      181 ttaatcaatg atataggcca taaaacaatt atcaacataa actgatacaa accatgaata
      241 ttatactaat acatcaaatt aatgctttaa agacatatct gtgttatctg acatacacca
      301 tacagtcata aactcttctc ttccatatga ctatcccctt ccccatttgg tctattaatc
      361 taccatcctc cgtgaaacca acaacccgcc caccaatgcc cctcttctcg ctccgggccc
      421 attaaacttg ggggtagcta aactgaaact ttatcagaca tctggttctt acttcagggc
      481 catcaaatgc gttatcgccc atacgttccc cttaaataag acatctcgat ggtatcgggt
      541 ctaatcagcc catgaccaac ataactgtgg tgtcatgcat ttggtatttt tttattttgg
      601 cctactttca tcaacatagc cgtcaaggca tgaaaggaca gcacacagtc tagacgcacc
      661 tacggtgaag aatcattagt ccgcaaaacc caatcaccta aggctaatta ttcatgcttg
      721 ttagacataa atgctactca ataccaaatt ttaactctcc aaacccccca accccctcct
      781 cttaatgcca aaccccaaaa acactaagaa cttgaaagac atatattatt aactatcaaa
      841 ccctatgtcc tgatcgattc tagtagttcc caaaatatga ctcatatttt agtacttgta
      901 aaaattttac aaaatcatgc tccgtgaacc aaaactctaa tcacactcta ttacgcaata
      961 aatattaaca agttaatgta gcttaataac aaagcaaagc actgaaaatg cttagatgga
     1021 taattttatc cca
//
File formats: PHYLIP (“interleaved” format)
   6 1120
H_sapiens  ATGACCCCAA TACGCAAAAT TAACCCCCTA ATAAAATTAA TTAACCACTC
P_troglod  ATGACCCCGA CACGCAAAAT TAACCCACTA ATAAAATTAA TTAATCACTC
P_paniscus ATGACCCCAA CACGCAAAAT CAACCCACTA ATAAAATTAA TTAATCACTC
G_gorilla  ATGACCCCTA TACGCAAAAC TAACCCACTA GCAAAACTAA TTAACCACTC
P_pygmaeus ATGACCCCAA TACGCAAAAC CAACCCACTA ATAAAATTAA TTAACCACTC
H_lar      ATGACCCCCC TGCGCAAAAC TAACCCACTA ATAAAACTAA TCAACCACTC

           ATTCATCGAC CTCCCCACCC CATCCAACAT CTCCGCATGA TGAAACTTCG
           ATTTATCGAC CTCCCCACCC CATCCAACAT TTCCGCATGA TGGAACTTCG
           ATTTATCGAC CTCCCCACCC CATCCAATAT TTCCACATGA TGAAACTTCG
           ATTCATTGAC CTCCCTACCC CGTCCAACAT CTCCACATGA TGAAACTTCG
           ACTCATCGAC CTCCCCACCC CATCAAACAT CTCTGCATGA TGGAACTTCG
           ACTTATCGAC CTTCCAGCCC CATCCAACAT TTCTATATGA TGAAACTTTG

File formats: NEXUS (PAUP*, “interleaved”)
#NEXUS
begin data;
dimensions ntax=6 nchar=1120;
format datatype=DNA interleave missing=? gap=-;
matrix
P_troglod  ATGACCCCGACACGCAAAATTAACCCACTAATAAAATTAATTAATCACTC
P_paniscus ATGACCCCAACACGCAAAATCAACCCACTAATAAAATTAATTAATCACTC
H_sapiens  ATGACCCCAATACGCAAAATTAACCCCCTAATAAAATTAATTAACCACTC
G_gorilla  ATGACCCCTATACGCAAAACTAACCCACTAGCAAAACTAATTAACCACTC
P_pygmaeus ATGACCCCAATACGCAAAACCAACCCACTAATAAAATTAATTAACCACTC
H_lar      ATGACCCCCCTGCGCAAAACTAACCCACTAATAAAACTAATCAACCACTC

P_troglod  ATTTATCGACCTCCCCACCCCATCCAACATTTCCGCATGATGGAACTTCG
P_paniscus ATTTATCGACCTCCCCACCCCATCCAATATTTCCACATGATGAAACTTCG
H_sapiens  ATTCATCGACCTCCCCACCCCATCCAACATCTCCGCATGATGAAACTTCG
G_gorilla  ATTCATTGACCTCCCTACCCCGTCCAACATCTCCACATGATGAAACTTCG
P_pygmaeus ACTCATCGACCTCCCCACCCCATCAAACATCTCTGCATGATGGAACTTCG
H_lar      ACTTATCGACCTTCCAGCCCCATCCAACATTTCTATATGATGAAACTTTG
;
end;

File formats: Clustal X
P_troglod  ATGACCCCGACACGCAAAATTAACCCACTAATAAAATTAATTAATCACTCATTTATCGAC
P_paniscus ATGACCCCAACACGCAAAATCAACCCACTAATAAAATTAATTAATCACTCATTTATCGAC
H_sapiens  ATGACCCCAATACGCAAAATTAACCCCCTAATAAAATTAATTAACCACTCATTCATCGAC
G_gorilla  ATGACCCCTATACGCAAAACTAACCCACTAGCAAAACTAATTAACCACTCATTCATTGAC
P_pygmaeus ATGACCCCAATACGCAAAACCAACCCACTAATAAAATTAATTAACCACTCACTCATCGAC
H_lar      ATGACCCCCCTGCGCAAAACTAACCCACTAATAAAACTAATCAACCACTCACTTATCGAC
           ********    *******  ***** ***  **** **** ** ****** * ** ***

P_troglod  CTCCCCACCCCATCCAACATTTCCGCATGATGGAACTTCGGCTCACTTCTCGGCGCCTGC
P_paniscus CTCCCCACCCCATCCAATATTTCCACATGATGAAACTTCGGCTCACTTCTCGGCGCCTGC
H_sapiens  CTCCCCACCCCATCCAACATCTCCGCATGATGAAACTTCGGCTCACTCCTTGGCGCCTGC
G_gorilla  CTCCCTACCCCGTCCAACATCTCCACATGATGAAACTTCGGCTCACTCCTTGGTGCCTGC
P_pygmaeus CTCCCCACCCCATCAAACATCTCTGCATGATGGAACTTCGGCTCACTTCTAGGCGCCTGC
H_lar      CTTCCAGCCCCATCCAACATTTCTATATGATGAAACTTTGGTTCACTCCTAGGCGCCTGC
           ** **  **** ** ** ** **   ****** ***** ** ***** ** ** ******

File formats: FASTQ
- Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
- Line 2 is the raw sequence letters.
- Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
- Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

Progressive alignment - ClustalX
3 phases:
1. Alignment of sequence pairs → pairwise distances
2. Construction of a guide tree (e.g. Neighbor-Joining)
3. Alignment of all sequences according to the guide tree

Problem with progressive alignment
6 species:
gorilla  AGGTT
horse    AG-TT
panda    AG-TT
penguin  A-GTT
chicken  A-GTT
ostrich  AGGTT
[Figure: two alternative alignments of these six sequences, in the order above]
AGGTT    AGGTT
AG-TT    A-GTT
AG-TT    A-GTT
AG-TT    A-GTT
AG-TT    A-GTT
AGGTT    AGGTT

Many other alignment programs exist, e.g. MAFFT, MUSCLE, Geneious...
There are also methods without alignment.

Methods by data type:
- distances: UPGMA, neighbor-joining, Fitch-Margoliash, minimum evolution
- characters: maximum parsimony, maximum likelihood, Bayesian analysis

Efficiency: how fast is the method?
Power: how many characters do we need?
Consistency: does adding more characters lead to the true tree?
Robustness: how does the method perform when its assumptions are violated?
Falsifiability: does it allow its assumptions to be tested?
Together with efficiency and power, these criteria answer the question: how to assess the methods?

MAXIMUM PARSIMONY, MP
     I   II  III
A    1   0   1
B    0   0   1
C    1   0   0
D    0   1   0
E    1   0   1
Steps on the tree shown: I = 2 steps, II = 1 step, III = 2 steps
minimal number of steps = 3
real number of steps = 5 ⇒ 2 extra steps → homoplasy

William of Ockham (c. 1287 – 1347): Occam’s razor

Estimation of number of steps: Fitch algorithm
1. arbitrary root
2. Downward: w = C or T, x = T, y = A or T, z = T
3. Upward: z = T or A
   DELTRAN (DELayed TRANsformation)
   ACCTRAN (ACCelerated TRANsformation)
total length = 3

Parsimony-informative and non-informative characters (sites); non-informative:
- invariant sites (symplesiomorphies)
- singletons (autapomorphies)

Problem of homoplasy:
- index of consistency, CI = m/s
- retention index, RI = (g − s)/(g − m)
- rescaled consistency index, RC = CI × RI
- homoplasy index, HI = 1 − CI
where
m = min. no. of possible steps
s = min. no. needed for explaining the tree
g = max. no. of steps for any tree

Methods of parsimony:
- Fitch: X → Y and Y → X, unordered characters (A → T or A → G etc.)
- Wagner: X → Y and Y → X, ordered characters (1 → 2 → 3)
- Dollo: X → Y and Y → X, but after that X → Y is not possible again … restriction-site and restriction-fragment data
- Camin-Sokal: X → Y, not Y → X … SINE, LINE
- weighted = transversion parsimony
- generalized parsimony: cost matrix = step matrix
- “relaxed Dollo criterion”
[Figure: step matrices]
*) M is an arbitrarily large number, guaranteeing that only one transformation to each derived state will be permitted.
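The step counts quoted for characters I to III in the maximum parsimony example (2 + 1 + 2 = 5) can be reproduced with the downward pass of the Fitch algorithm. This is a minimal sketch under two assumptions. First, the rooted tree (((A,D),(B,C)),E) is hypothetical, chosen only because it yields those counts; the tree drawn on the original slide cannot be recovered from the text. Second, the consistency index is taken in its usual form CI = m/s. Function and variable names are illustrative.

```python
def fitch_downpass(tree, states):
    """Return (state_set, steps) for one character on a rooted binary tree.

    tree   -- a leaf name (str) or a 2-tuple (left_subtree, right_subtree)
    states -- dict mapping leaf name -> observed character state
    """
    if isinstance(tree, str):
        # Terminal node: its state set is the observed state, no steps yet.
        return {states[tree]}, 0
    left_set, left_steps = fitch_downpass(tree[0], states)
    right_set, right_steps = fitch_downpass(tree[1], states)
    shared = left_set & right_set
    if shared:
        # Non-empty intersection: no change is required at this node.
        return shared, left_steps + right_steps
    # Empty intersection: take the union and count one step.
    return left_set | right_set, left_steps + right_steps + 1


# Character matrix from the slide (columns are characters I, II, III).
MATRIX = {"A": "101", "B": "001", "C": "100", "D": "010", "E": "101"}
# Hypothetical topology, NOT taken from the slide.
TREE = ((("A", "D"), ("B", "C")), "E")


def steps_per_character(tree, matrix):
    """Fitch step count for every column of the matrix."""
    n_chars = len(next(iter(matrix.values())))
    return [
        fitch_downpass(tree, {taxon: row[i] for taxon, row in matrix.items()})[1]
        for i in range(n_chars)
    ]


steps = steps_per_character(TREE, MATRIX)
s = sum(steps)  # steps needed on this tree
# Minimum possible steps: (number of distinct states - 1) per character.
m = sum(len({row[i] for row in MATRIX.values()}) - 1 for i in range(len(steps)))
ci = m / s      # consistency index
hi = 1 - ci     # homoplasy index

if __name__ == "__main__":
    print("steps per character:", steps)
    print("s =", s, "m =", m, "CI =", ci, "HI =", round(hi, 3))
```

Only the downward pass is needed to obtain the tree length. The upward pass, where ACCTRAN and DELTRAN differ, assigns states to internal nodes and is left out here.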
Step matrices shown in the figure: Wagner, Fitch, Dollo, transversion*)

Parsimony and consistency
((A,B),(C,D))*   “true” tree
((A,C),(B,D))*   “wrong” tree
p >> q
*) tree written in Newick format
[Figure: the “Felsenstein zone”]
In the Felsenstein zone, parsimony is inconsistent.

Parsimony and consistency: simulation
[Figure: long branches; long-branch attraction (LBA)]

Search for optimal tree
1. Exact methods:
   a) exhaustive search
   b) branch-and-bound
2. Heuristic search:
   - stepwise addition
   - star decomposition
   - branch swapping

branch-and-bound:
- starts with 3 taxa, sequential addition
- if a partial tree is already longer than a randomly chosen reference tree, that branch of the search is abandoned
[Figure: all possible trees]

heuristic search, branch swapping:
- nearest-neighbor interchanges (NNI)
- subtree pruning and regrafting (SPR)
- tree bisection and reconnection (TBR)

Evolutionary models and distance methods

Jukes-Cantor (JC): equal base frequencies, equal substitution rates

                 Base after substitution
Original base    A     C     G     T
A               -¾     ¼     ¼     ¼
C                ¼    -¾     ¼     ¼
G                ¼     ¼    -¾     ¼
T                ¼     ¼     ¼    -¾

Q =   -   α   α   α
      α   -   α   α
      α   α   -   α
      α   α   α   -

Kimura 2-parameter (K2P): transitions ≠ transversions
[Figure: transitions and transversions]

Q =   -   β   α   β
      β   -   β   α
      α   β   -   β
      β   α   β   -

If α = β, K2P = JC

Felsenstein (F81): different base frequencies

Q =   -    πC   πG   πT
      πA   -    πG   πT
      πA   πC   -    πT
      πA   πC   πG   -

If πA = πC = πG = πT, F81 = JC

Hasegawa-Kishino-Yano (HKY): different base frequencies, transitions ≠ transversions

Q =   -     πCβ   πGα   πTβ
      πAβ   -     πGβ   πTα
      πAα   πCβ   -     πTβ
      πAβ   πCα   πGβ   -

General time-reversible (GTR, REV): different base frequencies, different substitution rates

Model                                     Base frequencies     Substitution rates
Jukes-Cantor (JC)                         πA = πC = πG = πT    α = β
Felsenstein (F81)                         πA ≠ πC ≠ πG ≠ πT    α = β
Kimura’s two-parameter (K2P)              πA = πC = πG = πT    α ≠ β
Hasegawa-Kishino-Yano (HKY)               πA ≠ πC ≠ πG ≠ πT    α ≠ β
Felsenstein (F84)                         πA ≠ πC ≠ πG ≠ πT    a = c = d = f = 1, b = (1 + K/πR), e = (1 + K/πY), where πR = πA + πG, πY = πC + πT
Kimura’s three-substitution-type (K3ST)   πA = πC = πG = πT    α ≠ β
Tamura-Nei (TrN)                          πA ≠ πC ≠ πG ≠ πT    α ≠ β
General time-reversible (GTR)             πA ≠ πC ≠ πG ≠ πT    a, b, c, d, e, f
[Figure labels, model hierarchy: unequal base frequencies; more than 1 type of substitution; 2 transition types]

Heterogeneity of substitution rates in different parts of sequences
[Figure: gamma distribution]
- Gamma distribution: shape parameter α
- discrete gamma model
- invariant sites → GTR+Γ+I or GTR+G+I
- the higher α, the more homogeneous the substitution rates

Model comparison:
- Likelihood ratio test (LRT): nested models
  LR = 2(ln L2 − ln L1)
  χ² distribution, p2 − p1 degrees of freedom
- Akaike information criterion (AIC): non-nested models
  AIC = −2 ln L + 2p, where p = number of free parameters
  better model → lower AIC
- Bayesian information criterion (BIC): non-nested models
  BIC = −2 ln L + p ln N, where N = sample size
[Figure: nesting dolls, illustrating nested models]

Model comparison:
- hierarchical LRT: ModelTest (Crandall and Posada), jModelTest
- dynamic LRT
[Figure: LRT]

Model comparison:
More parameters ⇒ more realism, but …
• … also less confidence (estimates are based on the same amount of data!)

Distances
- computed for each pair of taxa; the tree is inferred from the distance (or similarity) matrix
- distance methods are based on the assumption that if we know the true distances, we can very easily infer the true phylogeny
- advantage: very fast and simple (even with a calculator)

             1       10        20        30
sequence 1:  ACCCGTTAAGCTTAACGTACTTGGATCGAT
sequence 2:  ACCCGTTAGGCTTAATGTACGTGGATCGAT
p-distance: p = k/n = 3/30 = 0.10

[Figure: problem of saturation]
Distances for some models: [Figure: distance formulas]

Cluster analysis - UPGMA
1. Find min d(ij)
2. Calculate the new matrix: d(ŠB,k) = [d(B,k) + d(Š,k)]/2
3. Repeat steps 1 and 2.

                 chimp    bonobo   gorilla  human    orang.
chimp (Š)        --
bonobo (B)       0.0118   --
gorilla (G)      0.0427   0.0416   --
human (Č)        0.0382   0.0327   0.0371   --
orangutan (O)    0.0953   0.0916   0.0965   0.0928   --
[Figure: resulting tree, tips in the order Š B Č G O]

UPGMA (unweighted pair-group method using arithmetic means): d[(BŠČ)G] = {d(BG) + d(ŠG) + d(ČG)}/3
WPGMA: d[(BŠČ)G] = {d[(BŠ)G] + d(ČG)}/2
single-linkage (nearest-neighbor method)
complete-linkage (farthest-neighbor method)

Matrix after joining chimp (Š) and bonobo (B):
                 ŠB       gorilla  human    orang.
ŠB               --
gorilla (G)      0.0422   --
human (Č)        0.0355   0.0371   --
orangutan (O)    0.0935   0.0965   0.0928   --

UPGMA and consistency
- additive distances: dAB + dCD ≤ max (dAC + dBD, dAD + dBC), i.e. the distance between 2 taxa equals the sum of the branches connecting them
- ultrametric distances: dAC ≤ max (dAB, dBC)
[Figure: additive tree (A, B, C, D) and ultrametric tree (A, B, C)]

UPGMA and consistency: simulation
[Figure]

Neighbor-Joining, NJ
- Algorithmic method
- Principle of minimal evolution → minimizes the sum of branch lengths S
- Each pair of nodes is adjusted according to its divergence from the others
- Single additive tree
[Figure sequence: star tree → finding nearest neighbors → distance recalculation → repeating...; S = 32.4, S = 29.5, S = 28.0]

Drawbacks of distance data:
1. loss of information during transformation
2. after transformation to distances, we cannot infer the original data (different sequences may result in the same distance)
3. we cannot study the evolution in different parts of the sequence
4. difficult biological interpretation of branch lengths
5. we cannot combine several distance matrices
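To tie the distance slides together, the sketch below recomputes the p-distance example and runs UPGMA on the ape distance matrix given earlier. Some parts are assumptions rather than content from the slides. The Jukes-Cantor correction d = −¾ ln(1 − 4p/3) is the standard textbook formula, since the slide's own distance formulas were an image. Ties between equally close clusters are not treated specially, and all function names are illustrative.

```python
import math
from itertools import combinations


def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of differing sites, p = k/n, for two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    k = sum(a != b for a, b in zip(seq1, seq2))
    return k / len(seq1)


def jc_distance(p: float) -> float:
    """Jukes-Cantor corrected distance (standard formula, assumed here)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names -- list of taxon names
    dist  -- dict mapping frozenset({a, b}) -> distance between taxa a and b
    Returns (newick_string, merges), merges being (subtree, distance) pairs.
    """
    # Each cluster is keyed by the tuple of its leaves and stores its subtree.
    clusters = {(name,): name for name in names}

    def mean_distance(c1, c2):
        # Unweighted mean over all leaf pairs, as in d[(BŠČ)G] on the slide.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: mean_distance(*pair))
        d = mean_distance(c1, c2)
        subtree = f"({clusters.pop(c1)},{clusters.pop(c2)})"
        clusters[c1 + c2] = subtree
        merges.append((subtree, d))
    return next(iter(clusters.values())) + ";", merges


SEQ1 = "ACCCGTTAAGCTTAACGTACTTGGATCGAT"
SEQ2 = "ACCCGTTAGGCTTAATGTACGTGGATCGAT"

NAMES = ["chimp", "bonobo", "gorilla", "human", "orangutan"]
LOWER = [  # lower triangle of the matrix on the UPGMA slide
    [],
    [0.0118],
    [0.0427, 0.0416],
    [0.0382, 0.0327, 0.0371],
    [0.0953, 0.0916, 0.0965, 0.0928],
]
DIST = {frozenset((NAMES[i], NAMES[j])): LOWER[i][j]
        for i in range(len(NAMES)) for j in range(i)}

p = p_distance(SEQ1, SEQ2)
tree, merges = upgma(NAMES, DIST)

if __name__ == "__main__":
    print("p =", p, " JC distance =", round(jc_distance(p), 4))
    print(tree)
    for subtree, d in merges:
        print(round(d, 5), subtree)
```

The second merge joins human to the chimp-bonobo cluster at the mean distance (0.0382 + 0.0327)/2, which the slide's reduced matrix shows rounded as 0.0355. The returned merge distances are cluster-to-cluster means, not branch lengths; a full implementation would also place each node at half the merge distance.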