Variant calling in targeted sequencing Eva Budinská autumn 2023 Aim of variant calling • - To define genomic positions with single nucleotide polymorphisms, deletions, insertions or long repetitive sequence insertions specific for the sample - and if of interest, consequently call a genotype (identify whether the sample is heterozygote or homozygote) for the allele at the particular position or haplotype. Variant, genotype and haplotype calling • Variant calling – identifies variable sites in genome • Genotype calling – determines the genotype for each individual at each site (0/0, 1/1 – homozygote, 0/1 - heterozygote) • Haplotype calling – determines the haplotype for each individual at each site C/TA/G A/TT/T A T T A G C T T Genotype of an individual Haplotype 1 Haplotype 2 Site 1 Reference: A Variants: G, C Site 3 Reference: T Variants: A Site 2 Reference: C Variants: T, TT Site 4 Reference: A Variants: T, G Typical applications • Mendelian disorders – identification of causative genes in single gene disorders (germline mutations) Complex diseases – identification of candidate genes in complex diseases for further functional studies Somatic mutations – identification of constitutional mutations as well as driver and passenger genes in cancer … Whole genome/exome sequencing, targeted sequencing Results of variant calling Results of variant calling algorithms are presented in standardized VCF file (variant calling format). .vcf/.bcf Variant call format (.vcf) and its binary form (.bcf) Standard format file for results of variant calling Developed in 2010 by 1000 genomes project (1000 Genomes | A Deep Catalog of Human Genetic Variation (internationalgenome.org)) Current release: v4.3. (November 2022) - VCFv4.3.pdf (samtools.github.io) ...format example from lecture 3 Interpreting the file header Version of the VCF specification to which the file conforms (important for handling and interpreting files!) The filter lines – what type of filters have been applied to the data. Which reference file was used INFO lines define shortcuts for various site-level annotations. These are then later used in the INFO field in the variant specification. Beware, the definitions may differ between tools generating vcf files... Interpreting the file header - INFO FORMAT lines define how the genotype and other sample-level information is represented. Beware, the definitions may differ between tools generating vcf files... Interpreting the file header - FORMAT Interpreting the records: Reference sequence at the position Alternative sequence(s) at the position Quality: The Phred-scaled probability that REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance of error. This number can grow very large when a large amount of data is used for variant calling => not often a very useful property for evaluating the quality of a variant call.Identifiers of the position and gene at which the variant was found Interpreting the records: FILTER Result of quality filters applied for filtering out low quality variant calls. The values are defined in the header: Let's get more in detail: INFO Various site-level annotations and their values, as defined in the INFO lines in the header. Let's get more in detail: FORMAT ●How the genotype and other sample-level information is represented in the sample columns! Sample NA0001, first line variant: ●Genotype: 0|0 (homozygote for reference allele) ●Genotype quality: 48 ●Read depth: 1 ●Haplotype quality: 51 and 51 More info about vcf files https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF- Variant-Call-Format http://www.1000genomes.org/node/101 VCFv4.3.pdf (samtools.github.io) Help files of the tools generating vcf files! Variant identification Tools can be (very roughly) divided based on what they can best do: • Germline callers – central part of finding causes of rare inherited diseases • Somatic callers – mainly cancer studies • CNV identification tools – large structural modifications • SV (structural variants) identification tools – inversions, translocations or large indels (insertions-deletions) Variant calling – the basic steps after alignment 1. The pileup step – use SAMtools to generate counts of insertions, deletions and substitutions at each covered position in the BAM/SAM file (efficient in terms of time and memory) 2. Estimation of background error rates (whatever the source) 3. Calling variants 4. Filtering variants 5. Calling genotypes 6. Calling haplotypes The variants • Are real mismatches (from the perspective of alignment) • We want to keep these reads • Depending on goal of the study, the aligner must be able to handle them very properly (e.g. if we want to detect SNPs or SNVs) Source of errors in variant calling • NGS data suffer from high error rates that are due to multiple factors, including basecalling and alignment errors • Low-coverage sequencing (<5X per site per individual) – high probability that only one of the two chromosomes of a diploid individual has been sampled at specific site • Accurate variant and genotype calling are difficult • It is crucial to quantify uncertainty of the results VCF tools • Different methods of calling vcf, often designed on specific NGS tools/data types – not necessarily compatible! • To distinguish between real variants and technical artifacts (DNA polymerase and sequencing errors), the background error rate must be estimated • The background error rate is not constant and can vary at different positions => each position has a specific error rate • Variant callers differ also based on the method estimating the background error rate and whether they use specific methods to LOCALLY REALIGN the reads to perfect the precision Pabinger et al. (2013) Survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform Somatic variant callers – for tumours • Background error estimation, tools use: • matched normal / tumour samples • control samples to model the error noise to provide the variant calling tool with the built-in model • use information from databases of germline SNPs Somatic callers: GATK – Mutect2 Strelka2 smCounter2 UMI-VarCal UMI based (somatic) variant calling Assumption: reads that have the same UMI should be identical • Main approach: 1. Perform a majority vote within a UMI family 2. Build a consensus read for each UMI family 3. Apply a statistical method (like Beta distribution) to model background error rates at each position and apply standard filters to call final variants DeepSNVMiner MAGERI smCounter2 UMI-VarCal UMI-VarCal • UMI-VarCal: • Does not rely on SAMtools (like MAGERI), uses innovative homemade pileup algorithm specifically designed to treat the UMI tags in the reads • After the pileup, a Poisson statistical test is applied at every position to determine if the frequency of the variant is significantly higher than the background error noise. • Finally, an analysis of UMI tags is performed, a strand bias and a homopolymer length filter are applied to achieve better accuracy • UMI-VarCal is faster than both raw-reads-based and UMI-based variant callers GATK – Genome Analysis Toolkit • GATK (broadinstitute.org) GATK – Genome Analysis Toolkit Genotype calling methods • Cutoff-based methods (early methods) • simple: based on SNP and genotype calling on separate analyses of data from each individual • Probabilistic methods • based usually on multiple samples and assigning probabilities for a given genotype Cutoff-based genotype calling methods Simple: based on SNP and genotype calling on separate analyses of data from each individual Steps: 1. filtering step in which only high-confidence bases are kept (Qphred>20) 2. genotype calling – counting number of times each allele is observed, using fixed cutoffs (e.g. if the proportion of the alternative allele is between 20-80%, heterozygote is called) Used mainly for high sequencing depths (>20X), so that the probability of a heterozygous individual falling outside the selected range(20-80%) is small - Empirically determined cutoffs can be used - Usually used in targeted sequencing - Does not provide measures of uncertainty in the genotype inference Probabilistic genotype calling methods • For moderate to low sequencing depths (mainly WGS, WES) – genotype calling based on fixed cutoffs will typically lead to under-calling of heterozygous phenotypes • Several probabilistic methods were developed that use the quality score to provide a posterior probability for each genotype: P (X | G) – genotype likelihood for genotype G can be computed (X represents all of the read data for a particular individual and site) P(G) – genotype prior • From these two, Bayes' formula is used to calculate P (G | X) – the posterior probability of genotype G • The genotype with the highest posterior probability is generally chosen and this probability is used as a measure of confidence. • Advantages: • higher accuracy • natural framework for incorporating information regarding allelle frequencies and patterns of LD (linkage disequilibrium) Calculating genotype likelihood P(X|G) • Can be computed from quality scores for each read • Xi – data in read I for a particular individual and particular site with genotype G • P(Xi |G) is given by rescaling of the quality core of Xi and the genotype likelihood P(X|G) can be calculated directly from the data by taking the product of P(Xi |G) over all i • See for instance: • https://www.broadinstitute.org/gatk/media/docs/Samtools.pdf Calculating genotype priors • Can be performed using single or multiple samples • Single-sample: Prior-genotype probability may be chosen to assign equal probability to all genotypes, or it can be based on external information (e.g. the reference sequence, SNP databases or an available population sample) • In SOAPsnp tool, a prior is chosen by the use of dbSNP • Example: if a G/T polymorphism is reported in dbSNP, the prior probabilities are set to be 0.454 for each of the genotypes GG and TT; 0.0909 for GT and less than 10e-4 for all other genotypes • Multiple-samples: Priors derived by joint analysis of multiple individuals – by analysis of allele or genotype frequencies estimated from larger data sets e.g. using maximum likelihood Tools for genotype calling Nielsen et al. (2011): Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics 12, 443-451 Haplotype calling / haplotype phasing • The exponential growth problem…. • Example – two heterozygous genotypes (A/B) Brian Browning | Haplotype phasing: methods and accuracy - YouTube A A B B A B B A Haplotype calling / haplotype phasing Brian Browning | Haplotype phasing: methods and accuracy - YouTube A A ? B B ? A B ? B A ? • The exponential growth problem…. • Example – three heterozygous genotypes (A/B)? Haplotype calling / haplotype phasing Brian Browning | Haplotype phasing: methods and accuracy - YouTube A A ? B B ? B B B A A A A A B B B A A B ? B A ? B A A A B B A B A B A B • The exponential growth problem…. • Example – three heterozygous genotypes (A/B)? Haplotype phasing algorithms • Main approaches: • Clark’s algorithm • EM algorithm (Arlequin, PL-EM) • Coalescent-based methods and hidden Markov models (MACH, IMPUTE2, PHASE, BEAGLE) • Haplotype phasing: Existing methods and new developments - PMC (nih.gov) Is genotype/haplotype calling necessary? • Attention – sometimes, it is not of interest to call a genotype – e.g. if heterogenous tissues are sampled and we can expect different clones! • Example: cancer cells – one clone from thousands can regrow into metastasis! Variant annotation • We have the possibility of predicting the functional impact of variants in an automated fashion • This is very important for biological interpretation of the results! • Not all the tools are able to annotate all types of variants • Output: usually vcf file, with information on annotation in the INPUT field Example of annotated .vcf file #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chr1 868329 . A C 25.56 LowQual AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;MQ= 99.00;QD=12.78;SOR=0.693;ANN=C|intron_variant|MODIFIER|S AMD11|SAMD11|transcript|NM_152486.2|Coding|4/13|c.305+ 1860A>C|||||| GT:AD:DP:FT:GQ:PL chr1 1665702 . T C 62.55 PASS AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;MQ= 60.00;QD=31.27;SOR=2.303;ANN=C|intron_variant|MODIFIER|S LC35E2|SLC35E2|transcript|NM_182838.2|Coding|5/5|c.732+4 27A>G||||||,C|intron_variant|MODIFIER|SLC35E2|SLC35E2|tra nscript|NM_001199787.1|Coding|6/6|c.732+427A>G|||||| GT:AD:DP:FT:GQ:PL chr1 1665740 . T C 66.55 PASS AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;MQ= 60.00;QD=33.27;SOR=2.303;ANN=C|intron_variant|MODIFIER|S LC35E2|SLC35E2|transcript|NM_182838.2|Coding|5/5|c.732+3 89A>G||||||,C|intron_variant|MODIFIER|SLC35E2|SLC35E2|tra nscript|NM_001199787.1|Coding|6/6|c.732+389A>G|||||| GT:AD:DP:FT:GQ:PL Variant annotation tools • Pabinger et al. (2013) Survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform Additional resources • Best practices for variant calling in clinical sequencing | Genome Medicine | Full Text (biomedcentral.com) Specific issues in variant calling The effect of sequencing depth and mutation frequency on performance of variant callers Systematic comparison of somatic variant calling performance among different sequencing depth and mutation frequency | Scientific Reports (nature.com) Base calling algorithms • Reducing the error rate of base calls and improving the accuracy of the per-base quality score (Phred) have important implications for assembly, variant detection and downstream population-genomic analyses. • Aside of NGS platform provided base calling algorithms, several other have been developed to optimize data acquisition, improving by 5-30% the error rates: • Pyrobayes (454 Roche) • Rsolid (SOLiD) • BayesCall, Ibis (Illumina) Alignment • The accuracy of alignment has a crucial role in variant detection. • Incorrectly aligned reads may lead to errors in SNP and genotype calling • Alignment algorithms – need to be: • able to cope with sequencing errors and potentially real differences (point mutations and indels) between the reference genome and the sequenced genome • able to produce well calibrated alignment quality values (variant calls and their posterior probabilities depend on these scores) Alignment - mismatches • The amount of sequence identity required between each read and the reference sequence is determined by a trade-off between accuracy and length • The optimal choice of tolerable number of mismatches depends on the organism and also on the genome part! • e.g. Drosophila melanogaster is more variable than human – using mapping criteria for human on Drosophila may lead to a severe loss of sequencing depth for Drosophila. Vice-versa, using Drosophila melanogaster mapping criteria on human would lead to large amount of incorrectly aligned reads • MHC complex – very variable between individuals – consider combining alignment and assembly! Steps increasing the precision • Pair-end reads are highly recommended, if not even a requirement for WES and WGS to overcome the problem of ambiguity • Reads that can only be mapped with many mismatches should be discarded from the analysis => variants backed only by such reads should not be considered • Multiple reads originating from only one template might be sequenced, interfering with variant calling statistics => remove the PCR duplicates after alignment in WGS and WES (not in targeted sequencing!!) Which aligners? *Lunter, G. & Goodson, M. (2011) Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 19: 936-939 Bowtie – cannot perform gapped alignment => cannot find short indels => think of Bowtie2! BWA BWT vs hash-based: BWT are faster, BUT – not as sensitive as hash-based => can introduce mapping biases in regions with high variability* According to a comparative study*, Novoalign and Stampy currently produce the most accurate overall results, being at the same time fast enough However, Stampy not applicable to SOLiD colour reads SHRiMP2 and BFAST – hash based, capable of dealing with SOLiD colour reads Stampy* ●Mapping hash-based with optional speedup using BWA ●No support of color reads ●Combines the speed-up of BWT and the sensitivity of hash-based aligners ●http://www.well.ox.ac.uk/project-stampy ●for download you need to register • *Lunter, G. & Goodson, M. (2011) Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 19: 936-939 Stampy + BWA • Steps: • 1. Create BWA index for the same reference genome as used by Stampy • bwa index • 2. Map the reads using BWA in the usual way and use samtools to convert the resulting SAM file into a BAM file • bwa aln • bwa samse or sampe • samtools view • 3. Remap the BAM file using Stampy, but keep the well-mapped reads • ./stampy.py -g hg18 \ • -h hg18 \ • -- bamkeepgoodreads -M bwa.bam After alignment ● Step 1: Post-alignment quality control (assessment of target coverage, removal of duplicates and non-unique alignments, ...) ● Step 2: Realignment around indels - variant calling specific step ● Step 3: Quality score recalibration - variant calling specific step Assessment of target coverage ● QC step very specific for targeted sequencing ● Target coverage – is the selection of targets working? ● Specific files used: “bait list” and “target list” (both are .bed files, provided by company supplying the enrichment kit) ● Baits are oligonucleotides that retrieve specific RNA or genomic DNA used for selection/enrichment of target regions. The desired DNA or RNA molecules hybridize with the baits, and others do not ● Targets are sequences of interest Bait file Target file Assessment of target coverage • Tools: picard CollectHsMetrics • java -jar picard.jar CollectHsMetrics \ • I=your.bam \ • O=result.txt \ • R=reference_sequence.fasta \ • BAIT_INTERVALS=bait_list.bed \ TARGET_INTERVALS=target_list.bed \ CollectHsMetrics result http://broadinstitute.github.io/picard/picard-metric-definitions.html#HsMetrics Coverage using bedtools • coverageBed -b file.bam \ • -a targets.bed \ • > result.txt Result: - chromosome - target start position - target end position - name of target region - number of features in bam file that overlapped (by at least one base pair) with target region - number of bases in target region that had non-zero coverage by reads in bam file - the length of target region - the fraction of bases in target region that had non-zero coverage by reads in bam file Coverage using bedtools Visualization of coverage Removing duplicates? ● Multiple reads originating from only one template (PCR effect) might be sequenced ● This interferes with variant calling statistics => remove the PCR duplicates ● However a duplicate could be PCR effect or reading same fragment twice, there is no way to tell. ● The degree to which we expect duplicate fragments is highly dependent on the depth of the library and the type of library (whole genome, exome, transcriptome, ChIP-Seq, etc.). ● We do not remove duplicates in amplicon based targeted sequencing – here we expect high number of duplicate reads by design! Defining a duplicate ● Different ways of definition of duplicates: ● Reads with identical sequence: + reduces the pool of reads before mapping – accurate removal is dependent on low error rate (if an error, then the read is not considered a duplicate) ● Reads with the same mapping position: + not influenced by error rate – reads at the same position may come from different DNA fragments (diploid genoma and polymorphism present...) ● Paired-end reads and longer reads help to correctly identify duplicates ● The best way to deal with duplicates that correspond to PCR amplification bias is to reduce their generation in the first place ● UMIs! – unique molecular identifiers Marking and removing duplicates • picard MarkDuplicates – the most recommended tool • samtools rmdup/rmdupse – alternative • "Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15." Indels – a real example AML data, NPM1 gene, patient heterozygous for short insertion CATG: wt: ATT CAA GAT CTC TGG CAG TGG ||| ||| ||| ||| || mut: ATT CAA GAT CTC TGC ATG GCA GTG G The NGS identified insertion TGCA due to incorrect alignment: wt: ATT CAA GAT CTC TGG CAG TGG ||| ||| ||| ||| mut: ATT CAA GAT CTC TGC ATG GCA GTG G pindel • Common variant calling tools are not designed to find long repetitive inserts! • Pindel is a very good alternative – mainly for targeted sequencing data. • http://gmt.genome.wustl.edu/packages/pindel/ • Generally, it is recommended to combine multiple tools, based on the biological questions and data type! Realignment around indels • These artifact mismatches can harm base quality recalibration and variant detection (unless a sophisticated caller like the HaplotypeCaller or MuTect2 is used) Three types of realignment targets • Known sites • Indels seen in original alignments • Sites where evidence suggests a hidden indel https://www.broadinstitute.org/gatk/events/slides/1212/GATKwh0-BP-2-Realignment.pdf https://www.broadinstitute.org/gatk/events/slides/1212/GATKwh0-BP-2-Realignment.pdf Realignment around indels • GATK • 2 step process: • Determining (small) suspicious intervals which are likely in need of realignment (RealignerTargetCreator tool) • Running the realigner over those intervals (IndelRealigner) • SMRA – realigning reads in color space (SOLiD) https://www.broadinstitute.org/gatk/events/slides/1212/GATKwh0-BP-2-Realignment.pdf https://www.broadinstitute.org/gatk/events/slides/1212/GATKwh0-BP-2-Realignment.pdf https://gatkforums.broadinstitute.org/gatk/discussion/7847 Long indels • However – GATK does not call well longer indels! - these analyzed by pindel, which needs realignment! Recalibration of base quality scores ● Phred-like quality scores issued by the sequencing platforms may often deviate from the true error rate ● Having accurate quality scores is essential for the modern SNP calling algorithms, as they integrate the Phred scores of the bases covering the site to be examined into their scoring function ● Main idea: estimate true base quality based on known sites without SNPs ● Recalibration tools/approaches: ●SOAPsnp (http://soap.genomics.org.cn/soapsnp.html) ●GATK, ... Recalibration of quality scores • All recalibrating algorithms use a comprehensive database of known SNPs. • If no such database is available, one can first identify candidate polymorphic sites that are highly likely to be real and use the remaining sites for the recalibration procedure – in this case, another round of SNP calling should be performed with recalibrated quality scores. • Also – in targeted sequencing – most of the targets is expected to have variants. In this case, the recalibration is not recommended (e.g. true SNPs would be evaluated as mismatch rate) SOAPsnp • Exploits sites in the reference genome without any reported SNPs. On these sites, it computes the empirical mismatch rate as an estimate for the true base quality. • For a: • given machine provided quality score • sequencing cycle (position of base in the read) and • substitution type (e.g. A->G: A in reference and G in read) • it calculates the average mismatch rate with respect to the reference genome • This mismatch rate is then used as the recalibrated quality score by adding to the raw quality scores the residual differences between empirical quality scores and the mismatch rates implied by the raw quality scores GATK • Similar concept as SOAPsnp: • 1. bases are grouped with respect to several features (raw quality, dinucleotide content) • 2. empirical mismatch rate is computed and used to correct the raw quality score Additional resources • Edge effects in calling variants from targeted amplicon sequencing | BMC Genomics (springer.com) • Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing | Genome Medicine | Full Text (biomedcentral.com) • Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery | BMC Genomics (springer.com) • Toward better understanding of artifacts in variant calling from highcoverage samples | Bioinformatics | Oxford Academic (oup.com)