POPULATION GENETICS

Biological hierarchy: Genes - Populations - Species - Communities - Ecosystems

I. GENETIC DIVERSITY
13 March 2017

[Diagram: a species divided into populations and subpopulations]

I. GENETIC DIVERSITY - ANALYSIS OF SINGLE POPULATIONS
15.5.2017

POPULATION and problems of definition
• a population is a group of interbreeding individuals that exist together in time and space
• to develop the basic concepts of population genetics, we initially consider the ideal population = large, random-mating

ALLELE FREQUENCY
• the proportion of one allele relative to all alleles of the same locus (gene) in a population sample
• a basic characteristic of the genetic diversity (variation) of a population
• population genetics studies genetic diversity and the processes that have created it and influence it, i.e. the dynamics of the distribution and frequency of alleles (genotypes -> phenotypes), i.e. the processes shaping evolution:
  - increase of genetic diversity: mutation and migration
  - decrease of genetic diversity: genetic drift (and natural selection)

MUTATIONS
• increase genetic diversity
• responsible for variation/heterogeneity in populations - essential to evolution
1. substitutions (transitions, transversions)
  - silent substitutions:
      in non-coding regions
      synonymous: GTC -> GTA (Val -> Val)
  - nonsynonymous substitutions:
      missense: GTC -> TTC (Val -> Phe)
      nonsense: AAG -> TAG (Lys -> amber stop codon)
2. insertions and deletions = indels -> frameshift mutations
  - insertion: ACGGT -> ACAGGT
  - deletion: ACGGT -> AGGT
• mutation rate - the rate at which mutations of the various types occur at a given position over time

OBSERVATION
Callimorpha dominula - scarlet tiger moth

Table 3.1. Data from a collection of 1612 scarlet tiger moths.
Phenotype         No. of individuals
White spotting    1469
Intermediate      138
Little spotting   5

Genotype and allele frequency
Relative numbers = frequencies:
• genotype frequencies: P (AA), Q (Aa), R (aa); $P + Q + R = 1$
• allele (gene) frequencies: p (A), q (a); $p + q = 1$

Genotype   Number   Frequency
AA         n1       $P = n_1/N$
Aa         n2       $Q = n_2/N$
aa         n3       $R = n_3/N$
Total      N

$p = (2n_1 + n_2)/2N$
$q = (n_2 + 2n_3)/2N$

Hardy-Weinberg Equilibrium (HWE)
Example: a single locus with 2 alleles

Allele   Allele frequency
A        p
a        q
($p + q = 1$)

Genotype   Expected genotype frequency
AA         $p^2$
Aa         $2pq$
aa         $q^2$

> the allele frequencies p, q are known from our samples -> expected genotype frequencies under Hardy-Weinberg equilibrium
> the observed genotype frequencies (Ho) are known from our samples
> deviation of Ho from the HWE expectation => tested, for example, with a $\chi^2$ test

Expected heterozygosity (He) under HWE:
$H_e = 1 - (p^2 + q^2)$ ... for 1 locus with the allele frequencies p and q

Assumptions for ideal population in HWE
• random mating
• negligible effect of mutations and migration ("closed populations")
• infinitely large population (negligible effect of random fluctuations in allele frequencies over time - genetic drift)
  - in a population in HWE the allele frequencies are stable = they do not change between generations
• Mendelian inheritance of the analysed loci
• neutral loci - not under selection
• diploid, sexually reproducing organisms with discrete generations
• loci are independent of each other - test for "linkage disequilibrium"
  - 2 loci physically close to each other (decreased probability of recombination - linkage disequilibrium)
    vs.
  - 2 loci physically distant (probability of recombination not influenced - linkage equilibrium)

LINKAGE DISEQUILIBRIUM (LD)
• loci in LINKAGE EQUILIBRIUM segregate independently of each other during meiosis
• the most common reason for non-random association among loci (LD) is the proximity of two loci on a chromosome (other causes include small population size - genetic drift, immigration, overlapping generations, admixture, etc.)
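As a rough illustration of what a test for LD measures, the sketch below computes the standard disequilibrium coefficient D = p(AB) - p(A) x p(B) and the squared correlation r2 for two biallelic loci. The function name and the haplotype counts are hypothetical and chosen only for illustration (they are not data from the lecture), and phased haplotypes are assumed; analyses of real, unphased genotype data would normally rely on dedicated software.

```python
from collections import Counter


def linkage_disequilibrium(haplotypes):
    """Return (D, r2) for two biallelic loci.

    haplotypes: list of two-character strings such as "AB" or "ab".
    The first character is the allele at locus 1 (A/a), the second
    the allele at locus 2 (B/b). Phased haplotypes are assumed.
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    p_ab = counts["AB"] / n                                      # p(AB)
    p_a = sum(c for h, c in counts.items() if h[0] == "A") / n   # p(A)
    p_b = sum(c for h, c in counts.items() if h[1] == "B") / n   # p(B)
    d = p_ab - p_a * p_b                 # zero under linkage equilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Hypothetical samples of 100 haplotypes each.
equilibrium = ["AB"] * 25 + ["Ab"] * 25 + ["aB"] * 25 + ["ab"] * 25
linked = ["AB"] * 45 + ["Ab"] * 5 + ["aB"] * 5 + ["ab"] * 45

d_eq, r2_eq = linkage_disequilibrium(equilibrium)
d_ld, r2_ld = linkage_disequilibrium(linked)
print(f"equilibrium sample: D = {d_eq:.3f}, r2 = {r2_eq:.3f}")
print(f"linked sample:      D = {d_ld:.3f}, r2 = {r2_ld:.3f}")
```

In the first sample the haplotype frequency equals the product of the allele frequencies, so D is zero; in the second the AB and ab haplotypes are over-represented and D is clearly positive.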
• under LD, haplotype frequencies differ from the product of the allele frequencies: $p(AB) \neq p(A) \times p(B)$
• in the presence of LD:
  - we have fewer independent loci for our genetic analysis than anticipated
  - neutral loci (alleles) linked to selected ones will appear non-neutral
• the presence of LD needs to be tested when analysing data from multiple loci

[Figure 3.4: genotype frequencies plotted against the allele frequency p (0.0 to 1.0), with q = 1 - p]
Figure 3.4 The combinations of homozygote and heterozygote frequencies that can be found in populations that are in HWE. Note that the frequency of heterozygotes is at its maximum when p = q = 0.5. When the allele frequencies are between 1/3 and 2/3, the genotype with the highest frequency will be the heterozygote.

Example of genetic diversity estimation in a sample of 4 individuals (at 4 loci)

                    Locus 1   Locus 2   Locus 3   Locus 4   Mean
Ind 1               170/170   223/227   116/116   316/316
Ind 2               170/172   223/225   112/112   316/316
Ind 3               172/172   223/225   112/112   316/316
Ind 4               170/172   223/227   112/112   316/316
Number of alleles   2         3         2         1         2
Ho                  0.5       1.00      0         0         0.375
p                   0.5       0.5       0.75      1.00
q                   0.5       0.25      0.25      0
r                             0.25
He                  0.5       0.625     0.375     0         0.375

$H_e = 1 - (p^2 + q^2)$ for two alleles; $H_e = 1 - (p^2 + q^2 + r^2)$ for three alleles
Proportion of polymorphic loci (polymorphism) = 0.75

Is our population in HWE?
Callimorpha

Table 3.1. Data from a collection of 1612 scarlet tiger moths.
Phenotype         No. of individuals   Assumed genotype   No. of A alleles   No. of a alleles
White spotting    1469                 AA                 1469 x 2 = 2938    -
Intermediate      138                  Aa                 138                138
Little spotting   5                    aa                 -                  5 x 2 = 10

Deviation from HWE
• HWE test - e.g. Genepop software ("exact probability tests")
• any significant deviation from HWE indicates that some of the HWE assumptions were not fulfilled -> detailed inspection required:
• heterozygote excess
  - negative assortative mating (i.e. preferential mating between dissimilar individuals)
  - the loci used are advantageous in the heterozygous state (= balancing selection favouring heterozygotes, e.g.
MHC genes)
  - mutation
  - migration
• heterozygote deficit
  - inbreeding (all loci are equally affected), assortative mating
  - genetic structure in populations
  - null alleles (only some loci are affected by the heterozygote deficit)

Quantifying genetic diversity
Polymorphism (proportion of polymorphic loci) - P
• polymorphic locus = a locus with at least two alleles, where the frequency of the most common allele is less than or equal to 0.95 (or 0.99)
• e.g. a population sample with four polymorphic loci out of five -> P = 0.8
Number of alleles - $N_A$
• number of alleles per locus (mean over loci)
• number of alleles corrected for sample size (rarefaction method, e.g. in FSTAT software)
Observed heterozygosity - $H_o$
• observed frequency of heterozygous genotypes (mean over loci)

[Figure: number of alleles plotted against sample size (up to 1500); series: Corallina, Gastroclonium, Gelidium, Perumytilus]

HAPLOID DIVERSITY
• genetic diversity for haploid data

HAPLOTYPE DIVERSITY (h; Nei and Tajima 1981)
• frequency of different haplotypes
$h = \frac{N}{N-1}\left(1 - \sum_i x_i^2\right)$
$x_i$ - frequency of each haplotype in the sample
$N$ - sample size

NUCLEOTIDE DIVERSITY ($\pi$; Nei 1987)
• quantifies the mean nucleotide divergence between sequences
• probability that two randomly chosen homologous nucleotides will be different
$\pi = \sum_{ij} x_i x_j \pi_{ij}$
$x_i$ and $x_j$ - respective frequencies of the i-th and j-th sequences
$\pi_{ij}$ - number of nucleotide differences per nucleotide site between the i-th and j-th sequences

WHAT INFLUENCES GENETIC DIVERSITY?
• influenced by a multitude of factors
• varies considerably between populations

MOST IMPORTANT DETERMINANTS OF GENETIC DIVERSITY:
> genetic drift
> population bottlenecks
> natural selection
> methods of reproduction

GENETIC DRIFT
• population not infinitely large -> population not in HWE -> increased influence of CHANCE -> allele frequencies vary between generations
• in the absence of selection, each allele goes to:
  1. fixation
  2. extinction
  -> DECREASE of genetic diversity, more quickly in smaller populations
• genetic drift - a process causing a population's allele frequencies to change from one generation to the next as a result of CHANCE

[Figure: random sampling and genetic drift in a population of n = 20]

• very profound effect of genetic drift in small populations - founder effect, bottleneck
• inextricable link between genetic drift and population size - the effective population size

OVERVIEW
[Figure 3.16: factors that raise or lower genetic diversity; factors shown: mutation, sexual reproduction, balancing selection, directional selection, gene flow, inbreeding, small population size (immediate loss of alleles; reduced Ne through variance in reproductive success, uneven sex ratio and population bottlenecks)]
Figure 3.16 An overview of some of the main factors that influence levels of genetic diversity within populations. Freeland et al. 2011

Assumption for population structure analysis:
• neutral loci = no effect of selection included
• classical population genetics approach = populations are (thought to be) known (e.g. we want to quantify the level of genetic differentiation between two localities / populations?)
• BUT populations are usually not known (e.g. because there is no obvious spatial heterogeneity over the distribution range) - we want to reveal any potential population differentiation/structure from our genetic data

We are interested in genetic structure of populations
• the currently observed genetic structure indicates what happened in the past
• genetic structure - any pattern in the genetic make-up of individuals within a population

AIMS:
• Detection of any genetic structure (subdivision) in a population (in my dataset)
• Are there any differences between "different" (in space and time) populations?
• Quantification of such differences = description of genetic structure in a population
• What factors shape (or have shaped) these differences? e.g.
population history
• Is there any migration/connection between different populations? = detection and quantification of gene flow, and of what influences gene flow (e.g. spatial heterogeneity)
• What happens during migration/connection of populations? = hybridisation

Population genetic structure
neutral markers
• GENETIC DRIFT - creates subpopulation differentiation (changes in allele frequencies, in the extreme up to fixation of distinct alleles)
• MUTATION - may increase differentiation (not necessarily - homoplasy)
• MIGRATION (GENE FLOW) - acts AGAINST subpopulation differentiation

Effect of population structure on heterozygosity
• Wahlund effect - first documented by the Swedish geneticist Sten Wahlund (1901-1976) in 1928
• two isolated subpopulations with fixed distinct alleles
• both SUBPOPULATIONS are in HWE, but the pooled dataset (the whole POPULATION) shows a deficit of heterozygotes

Wahlund effect (isolate breaking)
• homozygosity is reduced when subpopulations merge
Wahlund, S. (1928) Zusammensetzung von Populationen und Korrelationserscheinungen vom Standpunkt der Vererbungslehre aus betrachtet. Hereditas, 11: 65-106

Wahlund effect - an example
• Lakes Bunnersjöarna (northern Sweden) - brown trout
• one locus with 2 alleles

              170/170   170/172 (= Ho)   172/172   Total   p       2pq (= He)
Inlet         50        0 (0)            0         50      1.000   0.000
Outlet        1         13 (0.26)        36        50      0.150   0.255
Whole lake    51        13 (0.13)        36        100     0.575   0.489
(expected)    (33.1)    (48.9)           (18.1)

Expected homozygote frequencies in the whole lake: $p^2 = 0.575^2$, $q^2 = 0.425^2$
Ryman et al. 1979

Wright's F-statistics: $F_{IS}$, $F_{ST}$, $F_{IT}$
Sewall Wright (1889-1988), Masatoshi Nei (*1931)
Wright (1950), Nei (e.g. 1987)
• detecting and describing population structure
• describe heterozygosity (i.e.
deviation from HWE) at different levels

Estimate of population structure effect on genetic diversity
[Diagram: a total population divided into subpopulations S1, S2 and S3, with individuals drawn as symbols]
• 3 levels (Total, Subpopulation, Individual)
• x subpopulations (x = 1 to k; here k = 3)
• each subpopulation has $N_x$ individuals
• AA, AB, BB - genotypes shown with different symbols
• e.g. I1-13 = the 13th individual from the 1st subpopulation

F-statistics and heterozygosity
$H_I$ - averaged observed heterozygosity of an individual in a subpopulation
$H_S$ - expected heterozygosity of an individual in a subpopulation under HWE
$H_T$ - expected heterozygosity of an individual over the total population under HWE

$H_I = \frac{1}{k}\sum_{x=1}^{k} H_x$, where $H_x$ = observed heterozygosity in subpopulation x
$H_{S,x} = 1 - \sum_i p_{i,x}^2$, where $p_{i,x}$ = frequency of the i-th allele in subpopulation x
$H_S = \frac{1}{k}\sum_{x=1}^{k} H_{S,x}$ = averaged expected heterozygosity in a subpopulation
$H_T = 2 p_0 q_0$, where $p_0$, $q_0$ = allele frequencies in the total population
> for two alleles at a single locus (Wright 1950)
> more complicated for more alleles (Nei 1987)

F-statistics
$F_{IS} = \frac{H_S - H_I}{H_S}$ - heterozygosity decrease of an individual due to non-random mating in a subpopulation (vs. HWE)
$F_{ST} = \frac{H_T - H_S}{H_T}$ - subdivision measure
$F_{IT} = \frac{H_T - H_I}{H_T}$ - summary measure

          Ho (subpop. 1)   Ho (subpop. 2)   H_I    H_S      H_T       F_IS     F_ST     F_IT
Loc I     …
Loc II    0.2              0.2              0.2    0.46     0.48      0.565    0.042    0.583
Loc III   0.7              0.6              0.65   0.4675   0.46875   -0.390   0.0027   -0.387
Loc IV    0.0              0.0              0.0    0.0      0.5       -        1.0      1.0
Mean                                                                  0.058    0.261    0.300

Mean values of F-statistics may hide the distinct evolutionary histories of different loci

F-statistics
• $F_{IS}$ - decrease of heterozygosity in a local subpopulation; high values - inbreeding
• $F_{IT}$ - summary measure - limited use
• $F_{ST}$ = subdivision measure = limited gene flow between subpopulations (i.e. existence of a barrier - Wahlund effect)
  - originally developed to estimate the amount of allelic fixation due to genetic drift (fixation index)

Permutation test of $F_{ST}$ significance
1. Real measured populations
2. Merged into a single dataset
3. 1000 x randomly re-separated populations
-> the real $F_{ST}$ is compared with the 1000 simulated $F_{ST}$ values

TWO DIFFERENT CASES:
• $F_{ST}$ = 0.072: 0.8 % of simulated values are higher than the real $F_{ST}$ -> p = 0.008 (i.e. significant difference)
• $F_{ST}$ = 0.0013: 35.4 % of simulated values are higher than the real $F_{ST}$ -> p = 0.354 (i.e. non-significant difference)

$F_{ST}$ computation - an example

              170/170   170/172 (= Ho)   172/172   Total   p       2pq (= He)
Inlet         50        0 (0)            0         50      1.000   0.000
Outlet        1         13 (0.26)        36        50      0.150   0.255
Whole lake    51        13 (0.13)        36        100     0.575   0.489
(expected)    (33.1)    (48.9)           (18.1)

$F_{ST} = \frac{H_T - H_S}{H_T} = \frac{0.489 - 0.128}{0.489} \approx 0.738$

As a consequence of the gene flow barrier, heterozygosity is about 74% lower than it would be under HWE in the pooled population.
Ryman et al. 1979

$F_{ST}$ analysis - BE AWARE
• global vs. pairwise indices
• absolute values depend on the heterozygosity level of the loci used (i.e. microsatellite-based $F_{ST}$ cannot be compared with allozyme-based $F_{ST}$)
  - requires standardization: $F'_{ST} = F_{ST}/F_{ST\,max}$ (Hedrick 2005) - e.g. GenAlEx
• if null alleles are present, $F_{ST}$ needs to be corrected (null alleles increase homozygosity and thereby inflate $F_{ST}$); FreeNA software

Giant Panda
• 192 faeces samples -> 136 genotypes -> 53 unique genotypes
• separation by a river (ca 26 ky ago) and by roads (recently)
• even the roads are important barriers, although weaker ones

Table 3 Pairwise $F_{ST}$ in populations of the Xiaoxiangling and Daxiangling
[Pairwise values among patches A, B and C are garbled in extraction; every legible value carries an asterisk]
* Significant level after Bonferroni correction (P < 0.01).
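The heterozygosity-based definitions of the F-statistics can be checked on the brown trout example. The sketch below recomputes $H_I$, $H_S$, $H_T$ and the three indices from the genotype counts in the table (inlet 50/0/0, outlet 1/13/36). The function names are hypothetical, and subpopulations are averaged with equal weight, an assumption that fits here only because both samples contain 50 individuals. With unrounded heterozygosities, $F_{ST}$ comes out at about 0.74.

```python
def allele_frequency(n_aa, n_ab, n_bb):
    """Frequency of the first allele from diploid genotype counts."""
    n = n_aa + n_ab + n_bb
    return (2 * n_aa + n_ab) / (2 * n)


def f_statistics(subpops):
    """F-statistics for one biallelic locus.

    subpops: list of (n_AA, n_AB, n_BB) genotype counts, one tuple per
    subpopulation. Subpopulations are weighted equally (assumption).
    """
    k = len(subpops)
    # H_I: mean observed heterozygosity over subpopulations
    h_i = sum(n_ab / (n_aa + n_ab + n_bb) for n_aa, n_ab, n_bb in subpops) / k
    # H_S: mean expected heterozygosity (2pq) within subpopulations
    h_s = 0.0
    for counts in subpops:
        p = allele_frequency(*counts)
        h_s += 2 * p * (1 - p)
    h_s /= k
    # H_T: expected heterozygosity of the pooled sample
    pooled = [sum(column) for column in zip(*subpops)]
    p0 = allele_frequency(*pooled)
    h_t = 2 * p0 * (1 - p0)
    nan = float("nan")
    return {
        "H_I": h_i,
        "H_S": h_s,
        "H_T": h_t,
        "F_IS": (h_s - h_i) / h_s if h_s > 0 else nan,
        "F_ST": (h_t - h_s) / h_t if h_t > 0 else nan,
        "F_IT": (h_t - h_i) / h_t if h_t > 0 else nan,
    }


# Genotype counts (170/170, 170/172, 172/172) from the trout table.
trout = [(50, 0, 0), (1, 13, 36)]
stats = f_statistics(trout)
for name, value in stats.items():
    print(f"{name} = {value:.4f}")
```

The slightly negative $F_{IS}$ reflects that the outlet sample has marginally more heterozygotes (0.26) than expected under HWE (0.255), while the high $F_{ST}$ captures the Wahlund effect between inlet and outlet.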
$G_{ST}$ (Nei 1973)
• analogue of $F_{ST}$ for haploid (haplodiploid) organisms and mtDNA sequences
• takes into account haplotype (gene) diversity instead of heterozygosity
• haplotype diversity = probability that any two randomly chosen sequences in a population will be different
• it therefore works only with allele frequencies, not with the proportion of heterozygotes

$R_{ST}$
• analogue of $F_{ST}$
• takes into account the size of alleles (number of repeats at microsatellite loci)
• assumes a known mutation model: the SMM (stepwise mutation model)
• indicates traces of mutations
  - $R_{ST} > F_{ST}$: higher effect of mutations
  - $R_{ST} = F_{ST}$: higher effect of genetic drift
• randomisation tests for $R_{ST}$ significance (Hardy et al. 2003, program SPAGeDi 1.1)

AMOVA - Analysis of Molecular Variance
Excoffier et al. 1992
• analysis of the variance of allele frequencies (earlier in Cockerham & Weir 1987, 1993)
• quantifies population differentiation
• takes into account the difference between alleles - allelic state (mutations)
• program ARLEQUIN (Arlequin ver. 2.000)
• data: sequences, microsatellites (assuming the SMM - stepwise mutation model)

Hierarchical AMOVA
How much variation may be explained by:
• differentiation between big groups of populations
• differentiation between populations within the groups
• differentiation between individuals within the populations

Bombus pascuorum
Widmer & Schmid-Hempel 1999
• microsatellites, AMOVA
• most variation is explained by the Alps
[AMOVA table: d.f., SSD (sum of squared deviations), variance component and percentage of total variance for the levels "among populations", "among regions", "among populations within regions", "between north and south of the Alps" and "among populations north and south of the Alps, respectively"; values garbled in extraction; asterisks mark significant values]

AMOVA and F-statistics
• a description of results, not of causes -> alternative explanations are possible (use population history analyses - based on coalescence and allele phylogenies)

Clustering methods
DISTANCE-BASED methods
• a tree or a plot is constructed from a pairwise distance matrix
• clusters may then be defined visually
MODEL-BASED methods
• observations from each cluster are random draws from some parametric model
• inference for the parameters corresponding to each cluster is done jointly with inference for the cluster membership of each individual
• standard statistical methods are used (e.g. maximum-likelihood or Bayesian methods)

Turdus helleri
• fragments of humid tropical forest
• localities Chawia, Ngangao, Mbololo, Yale (Kenya)
• 7 microsatellite loci
[Figure: neighbour-joining tree from a clustering method based on microsatellite distances; * marks wrongly clustered individuals]

Factorial correspondence analysis
[Figure: population labels plotted on two factorial axes (horizontal axis: Factorial axis 1 (13%)), grouped as "Danube and Struma basins", "Dnieper, Dniester and Vistula basins" and "Turkey and Greece"]
Fig. 2 A two-dimensional plot of the factorial correspondence analysis performed using GENETIX based on 12 microsatellite loci. Three geographical groups are bounded by grey lines.
• each locus as one variable, reduction of the number of variables - Genetix
• inference about population structure - individuals vs. populations

STRUCTURE program
Pritchard, Stephens and Donnelly 2000, Genetics
• a model-based Bayesian clustering method
• uses multilocus genotype data (e.g.
microsatellites, RFLPs, SNPs; various levels of ploidy)
• MCMC algorithm
• INFERS POPULATION STRUCTURE:
  - presence of population structure
  - assignment of individuals to populations
  - identification of migrants or admixed individuals (parameter Q - individual membership coefficient)

Model implemented in STRUCTURE assumes:
  - K populations/clusters (K may be unknown)
  - each of the K populations is characterized by a set of allele frequencies at each locus
  - within each of the K populations, marker loci are at LINKAGE EQUILIBRIUM with each other and in HARDY-WEINBERG EQUILIBRIUM
Under these assumptions each allele at each locus in each genotype is an independent draw from the appropriate frequency distribution, and this is completely specified by the probability distribution $\Pr(X \mid Z, P)$
X - genotypes of the sampled individuals
Z - unknown populations of origin of the individuals
P - unknown allele frequencies in all populations

MODELS in STRUCTURE
ANCESTRY MODELS
• no admixture model
• admixture model
• linkage model
• models with informative priors
ALLELE FREQUENCY MODELS
• independent frequencies model
• correlated frequencies model

Ancestry models: NO ADMIXTURE MODEL
• each individual comes discretely from one of the K populations
• the output reports the posterior probability that individual i is from population k
• the prior probability for each population is 1/K
This model is appropriate for studying fully discrete populations and is often more powerful than the admixture model at detecting subtle structure.

Ancestry models: ADMIXTURE MODEL
• individuals may have mixed ancestry
• each individual has inherited some proportion of its genome from each of the K populations = Q
• the output records the posterior mean estimates of these proportions
Recommended as a starting point for most populations.
"It is a reasonably flexible model for dealing with many of the complexities of real populations.
Admixture is a common feature of real data, and you probably won't find it if you use the no-admixture model."

Allele frequency models: INDEPENDENT FREQUENCIES MODEL
• the allele frequencies in each population are independent draws from a distribution that is specified by a parameter $\lambda$
• this prior says that we expect allele frequencies in different populations to be reasonably different from each other

Allele frequency models: CORRELATED FREQUENCIES MODEL
• frequencies in different populations are likely to be similar (probably due to migration or shared ancestry)
• this prior says that the allele frequencies in different populations may be quite similar
• better clustering for closely related populations
• but may increase the risk of over-estimating K
• if one population is quite divergent from the others, the correlated model can sometimes achieve better inference if that population is removed
Falush, Stephens and Pritchard 2003, Genetics

How long to run it
• it is not possible to determine suitable run lengths theoretically
• this requires some experimentation on the part of the user
burnin length: how long to run the simulation before collecting data, to minimize the effect of the starting configuration
• typically a burnin of 10,000-100,000 is more than adequate
run length: how long to run the simulation after the burnin to get accurate parameter estimates
• do several runs at each K, possibly of different lengths, and see whether you get consistent answers
• you can get good estimates of the parameter values (P and Q) with runs of 10,000-100,000 steps, but accurate estimation of $\Pr(X \mid K)$ may require longer runs
  - at least 500,000
In practice your run length may be determined by your computer speed and patience as much as anything else.

STRUCTURE program
Pritchard, Stephens and Donnelly 2000, Genetics
[Screenshot: STRUCTURE graphical interface showing the project data (microsatellite allele sizes) and a list of parameter-set runs with different K]

Admixture model - allows assignment of an individual to several clusters
[Figure: barplot for K = 7 - genome proportion of each individual assigned to each of K clusters]
[Figure: barplot, Eurasia, K = 4]

Which K is the best?
[Figure: ln-likelihood (about -30000 to -20000) plotted against K (number of clusters)]
Evanno, G., Regnaut, S. and Goudet, J. (2005) Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Molecular Ecology 14, 2611-2620. doi: 10.1111/j.1365-294X.2005.02553.x
[Figure: panels from Evanno et al. plotted against K; example with K = 5]

Post-processing of the STRUCTURE outputs
Clumpak - Cluster Markov Packager Across K
Clumpak was designed to aid users in four main objectives:
• separate distinct solutions obtained from STRUCTURE-like programs
• compare and align solutions obtained for different K values
• compare results obtained using different models/data subsets/programs
• indicate the preferred value of K according to Evanno et al.

Graphical output from STRUCTURE
• a series of barplots with increasing K (K = 2 to K = 15)
• "forced clustering"
• picture of the hierarchical structure between clusters
Bartáková et al. 2013
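The criterion of Evanno et al. for choosing K can be sketched as follows. Delta K is taken here as the absolute second-order difference of the mean ln Pr(X|K) across replicate runs, divided by the standard deviation of ln Pr(X|K) at that K. Computing it from per-K means is one common way of evaluating the statistic and is an assumption of this sketch; the function name and all ln-likelihood values below are hypothetical, not results from the lecture or from any real dataset.

```python
from statistics import mean, stdev


def evanno_delta_k(lnp_by_k):
    """Delta K for every K that has both neighbours K-1 and K+1.

    lnp_by_k: dict mapping K to a list of ln Pr(X|K) values from
    replicate runs (at least two runs per K are needed for stdev).
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    delta = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue  # first and last K cannot be evaluated
        sd = stdev(lnp_by_k[k])
        if sd == 0:
            continue  # Delta K is undefined without variation among runs
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        delta[k] = second_diff / sd
    return delta


# Hypothetical ln Pr(X|K) values, three replicate runs per K.
runs = {
    1: [-3000.0, -3002.0, -2998.0],
    2: [-2600.0, -2605.0, -2595.0],
    3: [-2300.0, -2302.0, -2298.0],
    4: [-2290.0, -2300.0, -2280.0],
    5: [-2285.0, -2305.0, -2265.0],
}

delta_k = evanno_delta_k(runs)
best_k = max(delta_k, key=delta_k.get)
for k, value in delta_k.items():
    print(f"K = {k}: Delta K = {value:.1f}")
print("preferred K:", best_k)
```

In this made-up example the likelihood keeps creeping upward beyond K = 3, but its rate of change drops sharply there, so Delta K peaks at K = 3. The statistic cannot be evaluated for the smallest and largest K tested, so K = 1 can never be selected this way.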