25/03/2016

Assumptions for population structure analysis:
• neutral loci = no effect of selection is included
• classical population genetics approach = populations are (thought to be) known (e.g. we want to quantify the level of genetic differentiation between two localities / populations)
• BUT populations are usually not known (e.g. because there is no obvious spatial heterogeneity over the distribution range) - we want to reveal any potential population differentiation/structure from our genetic data

We are interested in the genetic structure of populations
The genetic structure observed at present indicates what happened in the past
Genetic structure - any pattern in the genetic make-up of individuals within a population

AIMS:
• Detection of any genetic structure (subdivision) in a population (in my dataset)
• Are there any differences between "different" (in space and time) populations?
• Quantification of such differences = description of genetic structure in a population
• What factors shape (or have shaped) these differences? e.g. population history
• Is there any migration/connection between different populations? = detection and quantification of gene flow, and of what influences gene flow (e.g. spatial heterogeneity)
• What happens during migration/connection of populations?
= hybridisation

Population genetic structure (neutral markers)
• GENETIC DRIFT - creates subpopulation differentiation (changes in allele frequencies, in the extreme up to fixation of distinct alleles)
• MUTATION - may increase differentiation (not necessarily: homoplasy)
• MIGRATION (GENE FLOW) - acts AGAINST subpopulation differentiation

Effect of population structure on heterozygosity
• Wahlund effect - first documented by the Swedish geneticist Sten Wahlund (1901-1976) in 1928
• two isolated subpopulations with fixed distinct alleles
• both SUBPOPULATIONS are in HWE, but the pooled dataset (the whole POPULATION) shows a deficit of heterozygotes
[Figure: two subpopulations separated by a BARRIER, one containing only AA individuals, the other only aa individuals]

Wahlund effect (isolate breaking)
Homozygosity reduction when subpopulations merge
Wahlund, S. (1928) Zusammensetzung von Population und Korrelationserscheinung vom Standpunkt der Vererbungslehre aus betrachtet. Hereditas, 11: 65-106

Wahlund effect - an example
• Bunnersjöarna lake (northern Sweden) - "brown trout"
• one trait with 2 alleles

              170/170   170/172 (= Ho)   172/172   Total   p       2pq (= He)
Inflow        50        0 (0)            0         50      1.000   0.000
Outflow       1         13 (0.26)        36        50      0.150   0.255
Whole lake    51        13 (0.13)        36        100     0.575   0.489
(expected)    (33.1)    (48.9)           (18.1)

Expected number of 170/170 homozygotes in the whole lake: $p^2 N = 0.575^2 \times 100 \approx 33.1$
Ryman et al. 1979

Wright's F-statistics
Sewall Wright (1889-1988), Masatoshi Nei (*1931)
Wright (1950), Nei (e.g. 1987)
• detect and describe population structure
• describe heterozygosity (i.e. deviation from HWE) at different levels
• estimate the effect of population structure on genetic diversity
[Diagram: hierarchy of levels: total population (T), subpopulations (S), individuals (I)]
• defined for two alleles at a single locus (Wright 1950); more complicated for more alleles (Nei 1987)

F-statistics
$H_I$ - observed heterozygosity of individuals, averaged over subpopulations
$H_S$ - mean expected heterozygosity within subpopulations
$H_T$ - expected heterozygosity of the total population

$F_{IS} = \frac{H_S - H_I}{H_S}$ - heterozygosity decrease of an individual due to non-random mating in a subpopulation (vs. HWE)
$F_{ST} = \frac{H_T - H_S}{H_T}$ - influence of the division of the total population into subpopulations (i.e. heterozygosity decrease due to the Wahlund effect)
$F_{IT} = \frac{H_T - H_I}{H_T}$ - total coefficient of inbreeding; measures the heterozygosity decrease of an individual in relation to the total population
$(1 - F_{IT}) = (1 - F_{ST})(1 - F_{IS})$

Weir & Cockerham (1984): f (~ FIS), θ (~ FST), F (~ FIT)
Correction for sample size and number of subpopulations

Computation of F-statistics

          Subpopulation 1 (N1 = 40)     Subpopulation 2 (N2 = 20)     Mean p(A) in the
Locus     AA    AB    BB    p(A)        AA    AB    BB    p(A)        whole population   Note
Loc I     10    20    10    0.5         5     10    5     0.5         0.5                HWE
Loc II    16    8     16    0.5         4     4     12    0.3         0.4                heterozygote deficit
Loc III   12    28    0     0.65        6     12    2     0.6         0.625              heterozygote excess
Loc IV    0     0     40    0.0         20    0     0     1.0         0.5                alternatively fixed alleles

Steps: computation of allele frequencies → observed heterozygosity → expected heterozygosity → Wright's F-statistics

Locus     H1(obs)   H2(obs)   HI     HS       HT        FIS      FST      FIT
Loc I     0.5       0.5       0.5    0.5      0.5       0.0      0.0      0.0
Loc II    0.2       0.2       0.2    0.46     0.48      0.565    0.042    0.583
Loc III   0.7       0.6       0.65   0.4675   0.46875   -0.390   0.0027   -0.387
Loc IV    0.0       0.0       0.0    0.0      0.5       -        1.0      1.0
Mean                                                    0.058    0.261    0.300

(FIS is undefined for Loc IV because HS = 0; the mean FIS is taken over the remaining three loci.)
Mean values of F-statistics may hide distinct evolutionary histories of different loci

F-statistics
• FIS - decrease of heterozygosity in a local subpopulation; high values - inbreeding
• FIT - summary measure - limited use
• FST = subdivision measure = limited gene flow between subpopulations (i.e. existence of a barrier - Wahlund effect) - originally developed to estimate the amount of allelic fixation due to genetic drift (fixation index)

Permutation test of FST significance
1. Real measured populations → real FST
2. Merged into a single dataset
3. 1000 x randomly re-separated populations → 1000 x simulated FST
TWO DIFFERENT CASES:
• 0.80 % of simulated values higher than the real FST: p = 0.008 (i.e. significant difference)
• 35.40 % of simulated values higher than the real FST: p = 0.354 (e.g.
non-significant difference)

FST computation - an example

              170/170   170/172 (= Ho)   172/172   Total   p       2pq (= He)
Inflow        50        0 (0)            0         50      1.000   0.000
Outflow       1         13 (0.26)        36        50      0.150   0.255
Whole lake    51        13 (0.13)        36        100     0.575   0.489
(expected)    (33.1)    (48.9)           (18.1)

$F_{ST} = \frac{H_T - H_S}{H_T} = \frac{0.489 - 0.128}{0.489} \approx 0.738$
where $H_S$ is the mean of the two within-subpopulation values (0.000 and 0.255).
As a consequence of the gene flow barrier, heterozygosity is about 73.8 % lower than it would be under HWE.
Ryman et al. 1979

FST analysis - BE AWARE
• Global vs. pairwise indices
• Absolute values depend on the heterozygosity level of the loci used!!! (i.e. microsatellite-based FST cannot be compared to allozyme-based FST)
• Demands standardization: FST' = FST/FSTmax (Hedrick 2005) - e.g. GenAlEx
• If null alleles are present, FST needs to be corrected! (null alleles increase apparent homozygosity and so inflate FST); FreeNA software

Giant Panda
• 192 feces samples → 136 genotypes → 53 unique genotypes
• separation by a river (ca 26 ky ago) and by roads (recently)
• even the roads are important barriers, although weaker ones
Table 3. Pairwise FST in the Xiaoxiangling and Daxiangling populations (patches A to D). [The values are garbled in the source.] * Significant level after Bonferroni correction (P < 0.01).

GST (Nei 1973)
• Analogue of FST for haploid (haplodiploid) organisms, mtDNA sequences
• Takes into account haplotype (gene) diversity instead of heterozygosity
• Haplotype diversity = probability that any two randomly chosen sequences in a population are different
• It therefore works only with allele frequencies, not with the percentage of heterozygotes

RST
• Analogue of FST
• Takes into account the size of alleles (number of repeats at microsatellite loci)
• Assumes a known mutation model: the SMM (stepwise mutation model)
• Indicates traces of mutations
• RST > FST: higher effect of mutations
• RST = FST: higher effect of genetic drift
• Randomisation tests for RST significance (Hardy et al. 2003, program SPAGeDi 1.1)

AMOVA
Excoffier et al. 1992
Arlequin ver.
2.000: a software for population genetics data analysis
Authors: Stefan Schneider, David Roessli, Laurent Excoffier
Contact Arlequin: URL: http://anthropologie.unige.ch/arlequin  Mail: arlequin@sc2a.unige.ch

Analysis of Molecular Variance
• Analysis of the variance of allele frequencies (earlier in Cockerham & Weir 1987, 1993)
• Quantifies population differentiation
• Takes into account differences between alleles - allelic state (mutations)
• Program ARLEQUIN
• Data: sequences, microsatellites (assuming the SMM, stepwise mutation model)

Hierarchical AMOVA
How much variation may be explained by:
• differentiation between big groups of populations
• differentiation between populations within the groups
• differentiation between individuals within the populations

AMOVA and F-statistics
• a description of results, not of causes → alternative explanations are possible (use population history analyses, based on coalescence and allele phylogenies)

Clustering methods
DISTANCE-BASED methods
• a tree or a plot is constructed from a pairwise distance matrix
• clusters may then be defined visually
MODEL-BASED methods
• observations from each cluster are random draws from some parametric model
• inference for the parameters corresponding to each cluster is done jointly with inference for the cluster membership of each individual
• standard statistical methods are used (e.g. maximum-likelihood or Bayesian methods)

Turdus helleri
• Fragments of humid tropical forest
• Localities: Chawia, Ngangao, Mbololo, Yale (Kenya)
• 7 microsatellite loci
[Figure: neighbour-joining tree; * marks wrongly clustered individuals]
Clustering method based on microsatellite distances

Factorial correspondence analysis
[Figure: factorial correspondence plot; axis label "Factorial axis 1 (13%)"]
Fig. 2 A two-dimensional plot of the factorial correspondence analysis performed using Genetix, based on 12 microsatellite loci. Three geographical groups are bounded by grey lines.
- each locus as one variable, reduction of the number of variables
- Genetix
- a rough first check of whether the population is structured
- individuals vs. populations

STRUCTURE program
Pritchard, Stephens and Donnelly 2000, Genetics
• a model-based Bayesian clustering method
• uses multilocus genotype data (e.g. microsatellites, RFLPs, SNPs; various levels of ploidy)
• MCMC algorithm
• INFERS POPULATION STRUCTURE:
  - presence of population structure
  - assignment of individuals to populations
  - identification of migrants or admixed individuals (parameter Q - individual membership coefficient)

Model implemented in STRUCTURE
assumes:
- K populations/clusters (K may be unknown)
- each of the K populations is characterized by a set of allele frequencies at each locus
- within each of the K populations, marker loci are at LINKAGE EQUILIBRIUM with each other and in HARDY-WEINBERG EQUILIBRIUM
Under these assumptions each allele at each locus in each genotype is an independent draw from the appropriate frequency distribution, and this is completely specified by the probability distribution $P(X \mid Z, P)$
X - genotypes of the sampled individuals
Z - unknown populations of origin of the individuals
P - unknown allele frequencies in all populations

MODELS in STRUCTURE
ANCESTRY MODELS
• no admixture model
• admixture model
• linkage model
• models with informative priors
ALLELE FREQUENCY MODELS
• independent frequencies model
• correlated frequencies model

Ancestry models: NO ADMIXTURE MODEL
• each individual comes purely from one of the K populations
• the output reports the posterior probability that individual i is from population k
• the prior probability for each population is 1/K
This model is appropriate for studying fully discrete populations and is often more powerful than the admixture model at detecting subtle structure.
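The assignment step of the no-admixture model can be sketched in a few lines. This is only an illustration under a strong simplifying assumption: the allele frequencies P of every population are treated as already known, whereas STRUCTURE estimates P and Z jointly by MCMC. The function name `assignment_posterior` and the toy frequencies are hypothetical. Each allele copy is an independent draw (HWE and linkage equilibrium), and the prior for each population is 1/K.

```python
from math import exp, log


def assignment_posterior(genotype, freqs):
    """Posterior probability that one individual comes from each population.

    genotype: list over loci of (allele, allele); None marks a missing allele.
    freqs:    list over populations; each entry is a list over loci of
              {allele: frequency} dicts (assumed known and non-zero).
    """
    k = len(freqs)
    log_post = []
    for pop in freqs:
        ll = log(1.0 / k)  # uniform prior 1/K
        for locus, alleles in enumerate(genotype):
            for allele in alleles:
                if allele is None:
                    continue  # missing data contributes nothing
                ll += log(pop[locus][allele])  # independent draw per allele copy
        log_post.append(ll)
    # normalise on the log scale to avoid underflow with many loci
    top = max(log_post)
    weights = [exp(v - top) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy data: 2 populations, 2 loci
toy_freqs = [
    [{"A": 0.9, "a": 0.1}, {"B": 0.8, "b": 0.2}],  # population 1
    [{"A": 0.2, "a": 0.8}, {"B": 0.3, "b": 0.7}],  # population 2
]
toy_genotype = [("A", "A"), ("B", "b")]
posterior = assignment_posterior(toy_genotype, toy_freqs)
print(posterior)
```

The factor 2 for heterozygotes is the same in every population, so it cancels in the normalisation and is left out.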
Ancestry models: ADMIXTURE MODEL
• individuals may have mixed ancestry
• each individual has inherited some proportion of its genome from each of the K populations = Q
• the output records the posterior mean estimates of these proportions
Recommended as a starting point for most populations.
"It is a reasonably flexible model for dealing with many of the complexities of real populations. Admixture is a common feature of real data, and you probably won't find it if you use the no-admixture model."

Allele frequency models: INDEPENDENT FREQUENCIES MODEL
• the allele frequencies in each population are independent draws from a distribution that is specified by a parameter λ
• this prior says that we expect allele frequencies in different populations to be reasonably different from each other

Allele frequency models: CORRELATED FREQUENCIES MODEL
• frequencies in the different populations are likely to be similar (probably due to migration or shared ancestry)
• this prior says that the allele frequencies may be quite similar between the populations
• better clustering for closely related populations
• but may increase the risk of over-estimating K
• If one population is quite divergent from the others, the correlated model can sometimes achieve better inference if that population is removed.
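The difference between the two allele frequency priors can be illustrated by drawing from them. The sketch below assumes the Dirichlet forms described in the STRUCTURE papers: Dirichlet(λ, ..., λ) for the independent model, and, for the correlated model, a Dirichlet centred on ancestral frequencies p_A with parameters p_A (1 - F) / F, where F measures drift away from the ancestor. The function names, the ancestral frequencies and the value of F are hypothetical illustration values, not STRUCTURE defaults or output.

```python
import random


def dirichlet(alphas, rng):
    """Draw one frequency vector from a Dirichlet via normalised gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def independent_freqs(n_alleles, lam, rng):
    """Independent model: every population draws from Dirichlet(lam, ..., lam)."""
    return dirichlet([lam] * n_alleles, rng)


def correlated_freqs(ancestral, drift_f, rng):
    """Correlated model: draws scatter around shared ancestral frequencies.

    Small drift_f (0 < drift_f < 1) keeps populations close to the ancestor.
    """
    scale = (1.0 - drift_f) / drift_f
    return dirichlet([p * scale for p in ancestral], rng)


rng = random.Random(2016)
ancestral = [0.5, 0.3, 0.2]  # hypothetical ancestral allele frequencies
for pop in range(3):
    print("independent:", [round(x, 2) for x in independent_freqs(3, 1.0, rng)])
for pop in range(3):
    print("correlated: ", [round(x, 2) for x in correlated_freqs(ancestral, 0.01, rng)])
```

With a small F the correlated draws stay near the ancestral vector, which is why this prior helps with closely related populations; the independent draws can land anywhere in the frequency space.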
Falush, Stephens and Pritchard 2003, Genetics

How long to run it
• it is not possible to determine suitable run lengths theoretically; this requires some experimentation on the part of the user
• burnin length: how long to run the simulation before collecting data, to minimize the effect of the starting configuration
  - typically a burnin of 10,000-100,000 is more than adequate
• run length: how long to run the simulation after the burnin to get accurate parameter estimates
  - do several runs at each K, possibly of different lengths, and see whether you get consistent answers
  - you can get good estimates of the parameter values (P and Q) with runs of 10,000-100,000 steps, but accurate estimation of Pr(X|K) may require longer runs
  - at least 500,000
In practice your run length may be determined by your computer speed and patience as much as anything else.

STRUCTURE program
Pritchard, Stephens and Donnelly 2000, Genetics
[Screenshot: STRUCTURE project window listing parameter-set runs for several values of K, next to a table of microsatellite allele sizes]
Data format: genotypes of an individual in TWO rows (here -9 codes missing data)

                loc_a   loc_b   loc_c   loc_d   loc_e
George    1     -9      145     66      0       92
George    1     -9      -9      64      0       94
Paula     1     106     142     68      1       92
Paula     1     106     148     64      0       94
Matthew   2     110     145     -9      0       92
Matthew   2     110     148     66      1       -9
Bob       2     108     142     64      1       94
Bob       2     -9      ... [row truncated in the source]

Detecting the number of clusters of individuals using the software structure: a simulation study
G. EVANNO, S. REGNAUT and J. GOUDET
Department of Ecology and Evolution, Biology building, University of Lausanne, CH-1015 Lausanne, Switzerland
[Figure: plots against K (axis ticks 5 to 20); panel label K=5]

Post-processing of the STRUCTURE outputs
Clumpak - Cluster Markov Packager Across K
CLUMPAK was designed to aid users in four main objectives:
• Separate distinct solutions obtained from STRUCTURE-like programs.
• Compare and align solutions obtained for different K values.
• Compare results obtained using different models/data subsets/programs.
• Indicate the preferred value of K according to Evanno et al.

Graphical output from STRUCTURE - a series of barplots with increasing K
[Figure: barplots for K=2 to K=15, labelled "forced clustering"]
Picture of hierarchical structure between clusters
Bartáková et al. 2013
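The Evanno et al. criterion for the preferred K, mentioned above, can be sketched as follows. It divides the absolute second-order change of ln P(X|K) by the standard deviation of ln P(X|K) across replicate runs at that K. This sketch takes the second difference of the per-K mean log-likelihoods, which is one common reading of the formula; the function name `delta_k` and the log-likelihood values are hypothetical, not results from any real dataset.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Return {K: delta K} from replicate ln P(X|K) values.

    lnp_by_k maps K to a list of ln P(X|K), one value per replicate run
    (at least two runs per K are needed for the standard deviation).
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue  # the second difference needs both neighbouring K
        sd = stdev(lnp_by_k[k])
        if sd == 0:
            continue  # delta K is undefined when all replicates are identical
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_diff / sd
    return result


# Hypothetical ln P(X|K) from three replicate runs per K
toy_runs = {
    1: [-5000.0, -5002.0, -4998.0],
    2: [-4500.0, -4503.0, -4497.0],
    3: [-4100.0, -4102.0, -4098.0],
    4: [-4090.0, -4110.0, -4100.0],
    5: [-4095.0, -4120.0, -4085.0],
}
dk = delta_k(toy_runs)
best_k = max(dk, key=dk.get)
print(dk, "preferred K:", best_k)
```

Note that delta K cannot be computed for the smallest and largest K tested, so the method can never point to K = 1.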