Variant calling in targeted sequencing
Eva Budinská
autumn 2023
Aim of variant calling
• - To define genomic positions with single nucleotide polymorphisms, deletions,
insertions or long repetitive sequence insertions specific for the sample
- and if of interest, consequently call a genotype (identify whether the sample is
heterozygote or homozygote) for the allele at the particular position or haplotype.
Variant, genotype and haplotype calling
• Variant calling – identifies variable sites in genome
• Genotype calling – determines the genotype for each individual at each site (0/0, 1/1 –
homozygote, 0/1 - heterozygote)
• Haplotype calling – determines the haplotype for each individual at each site
C/TA/G A/TT/T
A T T A
G C T T
Genotype of an
individual
Haplotype 1
Haplotype 2
Site 1
Reference: A
Variants: G, C
Site 3
Reference: T
Variants: A
Site 2
Reference: C
Variants: T, TT
Site 4
Reference: A
Variants: T, G
Typical applications
• Mendelian disorders – identification of causative genes in single gene disorders
(germline mutations)
Complex diseases – identification of candidate genes in complex diseases for further
functional studies
Somatic mutations – identification of constitutional mutations as well as driver and
passenger genes in cancer
…
Whole genome/exome sequencing, targeted sequencing
Results of variant calling
Results of variant calling algorithms are presented in standardized
VCF file (variant calling format).
.vcf/.bcf
Variant call format (.vcf) and its binary form (.bcf)
Standard format file for results of variant calling
Developed in 2010 by 1000 genomes project
(1000 Genomes | A Deep Catalog of Human Genetic Variation
(internationalgenome.org))
Current release: v4.3. (November 2022) - VCFv4.3.pdf
(samtools.github.io)
...format example from lecture 3
Interpreting the file header
Version of the VCF specification to
which the file conforms (important
for handling and interpreting files!)
The filter lines – what type of filters have been applied to the data.
Which reference file was used
INFO lines define shortcuts for various site-level annotations.
These are then later used in the INFO field in the variant specification.
Beware, the definitions may differ between tools generating vcf files...
Interpreting the file header - INFO
FORMAT lines define how the genotype and other sample-level
information is represented.
Beware, the definitions may differ between tools generating vcf files...
Interpreting the file header - FORMAT
Interpreting the records:
Reference sequence
at the position
Alternative sequence(s)
at the position
Quality: The Phred-scaled probability that
REF/ALT polymorphism exists at this site given
sequencing data.
Because the Phred scale is -10 * log(1-p), a
value of 10 indicates a 1 in 10 chance of error,
while a 100 indicates a 1 in 10^10 chance of
error.
This number can grow very large when a large
amount of data is used for variant calling =>
not often a very useful property for evaluating
the quality of a variant call.Identifiers of the position and gene at which the variant was found
Interpreting the records: FILTER
Result of quality filters applied for filtering out low quality variant calls.
The values are defined in the header:
Let's get more in detail: INFO
Various site-level annotations and their values, as defined in the INFO lines in
the header.
Let's get more in detail: FORMAT
●How the genotype and other sample-level information is represented in the
sample columns!
Sample NA0001, first line variant:
●Genotype: 0|0 (homozygote for reference allele)
●Genotype quality: 48
●Read depth: 1
●Haplotype quality: 51 and 51
More info about vcf files
https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-
Variant-Call-Format
http://www.1000genomes.org/node/101
VCFv4.3.pdf (samtools.github.io)
Help files of the tools generating vcf files!
Variant identification
Tools can be (very roughly) divided based on what they can best do:
• Germline callers – central part of finding causes of rare inherited diseases
• Somatic callers – mainly cancer studies
• CNV identification tools – large structural modifications
• SV (structural variants) identification tools – inversions, translocations or large indels
(insertions-deletions)
Variant calling – the basic steps after alignment
1. The pileup step – use SAMtools to generate counts of insertions, deletions and
substitutions at each covered position in the BAM/SAM file (efficient in terms of time
and memory)
2. Estimation of background error rates (whatever the source)
3. Calling variants
4. Filtering variants
5. Calling genotypes
6. Calling haplotypes
The variants
• Are real mismatches (from the perspective of alignment)
• We want to keep these reads
• Depending on goal of the study, the aligner must be able to handle them very properly
(e.g. if we want to detect SNPs or SNVs)
Source of errors in variant calling
• NGS data suffer from high error rates that are due to multiple factors, including basecalling
and alignment errors
• Low-coverage sequencing (<5X per site per individual) – high probability that only one
of the two chromosomes of a diploid individual has been sampled at specific site
• Accurate variant and genotype calling are difficult
• It is crucial to quantify uncertainty of the results
VCF tools
• Different methods of calling vcf, often designed on specific NGS tools/data types – not
necessarily compatible!
• To distinguish between real variants and technical artifacts (DNA polymerase and sequencing
errors), the background error rate must be estimated
• The background error rate is not constant and can vary at different positions => each position
has a specific error rate
• Variant callers differ also based on the method estimating the background error rate and
whether they use specific methods to LOCALLY REALIGN the reads to perfect the precision
Pabinger et al. (2013) Survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform
Somatic variant callers – for tumours
• Background error estimation, tools use:
• matched normal / tumour samples
• control samples to model the error noise to provide the variant calling tool with the
built-in model
• use information from databases of germline SNPs
Somatic callers:
GATK – Mutect2
Strelka2
smCounter2
UMI-VarCal
UMI based (somatic) variant calling
Assumption: reads that have the same UMI should be identical
• Main approach:
1. Perform a majority vote within a UMI family
2. Build a consensus read for each UMI family
3. Apply a statistical method (like Beta distribution) to model background error rates at each position
and apply standard filters to call final variants
DeepSNVMiner
MAGERI
smCounter2
UMI-VarCal
UMI-VarCal
• UMI-VarCal:
• Does not rely on SAMtools (like MAGERI), uses innovative homemade pileup
algorithm specifically designed to treat the UMI tags in the reads
• After the pileup, a Poisson statistical test is applied at every position to determine
if the frequency of the variant is significantly higher than the background error
noise.
• Finally, an analysis of UMI tags is performed, a strand bias and a homopolymer
length filter are applied to achieve better accuracy
• UMI-VarCal is faster than both raw-reads-based and UMI-based variant callers
GATK – Genome Analysis Toolkit
• GATK (broadinstitute.org)
GATK – Genome Analysis Toolkit
Genotype calling methods
• Cutoff-based methods (early methods)
• simple: based on SNP and genotype calling on separate analyses of data from each
individual
• Probabilistic methods
• based usually on multiple samples and assigning probabilities for a given genotype
Cutoff-based genotype calling methods
Simple: based on SNP and genotype calling on separate analyses of data from each individual
Steps:
1. filtering step in which only high-confidence bases are kept (Qphred>20)
2. genotype calling – counting number of times each allele is observed, using fixed cutoffs (e.g.
if the proportion of the alternative allele is between 20-80%, heterozygote is called)
Used mainly for high sequencing depths (>20X), so that the probability of a heterozygous
individual falling outside the selected range(20-80%) is small
- Empirically determined cutoffs can be used
- Usually used in targeted sequencing
- Does not provide measures of uncertainty in the genotype inference
Probabilistic genotype calling methods
• For moderate to low sequencing depths (mainly WGS, WES) – genotype calling based on fixed cutoffs will
typically lead to under-calling of heterozygous phenotypes
• Several probabilistic methods were developed that use the quality score to provide a posterior probability for
each genotype:
P (X | G) – genotype likelihood for genotype G can be computed (X represents all of the read data for a
particular individual and site)
P(G) – genotype prior
• From these two, Bayes' formula is used to calculate P (G | X) – the posterior probability of genotype G
• The genotype with the highest posterior probability is generally chosen and this probability is used as a measure of
confidence.
• Advantages:
• higher accuracy
• natural framework for incorporating information regarding allelle frequencies and patterns of LD (linkage
disequilibrium)
Calculating genotype likelihood P(X|G)
• Can be computed from quality scores for each read
• Xi – data in read I for a particular individual and particular site with genotype G
• P(Xi |G) is given by rescaling of the quality core of Xi and the genotype likelihood
P(X|G) can be calculated directly from the data by taking the product of P(Xi |G) over
all i
• See for instance:
• https://www.broadinstitute.org/gatk/media/docs/Samtools.pdf
Calculating genotype priors
• Can be performed using single or multiple samples
• Single-sample: Prior-genotype probability may be chosen to assign equal probability to all genotypes, or it can
be based on external information (e.g. the reference sequence, SNP databases or an available population
sample)
• In SOAPsnp tool, a prior is chosen by the use of dbSNP
• Example: if a G/T polymorphism is reported in dbSNP, the prior probabilities are set to be 0.454 for each of
the genotypes GG and TT; 0.0909 for GT and less than 10e-4 for all other genotypes
• Multiple-samples: Priors derived by joint analysis of multiple individuals – by analysis of allele or genotype
frequencies estimated from larger data sets e.g. using maximum likelihood
Tools for genotype calling
Nielsen et al. (2011): Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics 12, 443-451
Haplotype calling / haplotype phasing
• The exponential growth problem….
• Example – two heterozygous genotypes (A/B)
Brian Browning | Haplotype phasing: methods and accuracy - YouTube
A A
B B
A B
B A
Haplotype calling / haplotype phasing
Brian Browning | Haplotype phasing: methods and accuracy - YouTube
A A ?
B B ?
A B ?
B A ?
• The exponential growth problem….
• Example – three heterozygous genotypes (A/B)?
Haplotype calling / haplotype phasing
Brian Browning | Haplotype phasing: methods and accuracy - YouTube
A A ?
B B ?
B B B
A A A
A A B
B B A
A B ?
B A ?
B A A
A B B
A B A
B A B
• The exponential growth problem….
• Example – three heterozygous genotypes (A/B)?
Haplotype phasing algorithms
• Main approaches:
• Clark’s algorithm
• EM algorithm (Arlequin, PL-EM)
• Coalescent-based methods and hidden Markov models (MACH, IMPUTE2, PHASE, BEAGLE)
• Haplotype phasing: Existing methods and new developments - PMC (nih.gov)
Is genotype/haplotype calling necessary?
• Attention – sometimes, it is not of interest to call a genotype – e.g. if heterogenous
tissues are sampled and we can expect different clones!
• Example: cancer cells – one clone from thousands can regrow into metastasis!
Variant annotation
• We have the possibility of predicting the functional impact of variants in an automated
fashion
• This is very important for biological interpretation of the results!
• Not all the tools are able to annotate all types of variants
• Output: usually vcf file, with information on annotation in the INPUT field
Example of
annotated .vcf
file
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 868329 . A C 25.56 LowQual AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=
99.00;QD=12.78;SOR=0.693;ANN=C|intron_variant|MODIFIER|S
AMD11|SAMD11|transcript|NM_152486.2|Coding|4/13|c.305+
1860A>C||||||
GT:AD:DP:FT:GQ:PL
chr1 1665702 . T C 62.55 PASS AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=
60.00;QD=31.27;SOR=2.303;ANN=C|intron_variant|MODIFIER|S
LC35E2|SLC35E2|transcript|NM_182838.2|Coding|5/5|c.732+4
27A>G||||||,C|intron_variant|MODIFIER|SLC35E2|SLC35E2|tra
nscript|NM_001199787.1|Coding|6/6|c.732+427A>G||||||
GT:AD:DP:FT:GQ:PL
chr1 1665740 . T C 66.55 PASS AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=
60.00;QD=33.27;SOR=2.303;ANN=C|intron_variant|MODIFIER|S
LC35E2|SLC35E2|transcript|NM_182838.2|Coding|5/5|c.732+3
89A>G||||||,C|intron_variant|MODIFIER|SLC35E2|SLC35E2|tra
nscript|NM_001199787.1|Coding|6/6|c.732+389A>G||||||
GT:AD:DP:FT:GQ:PL
Variant annotation
tools
• Pabinger et al. (2013) Survey of tools for variant
analysis of next-generation genome sequencing data.
Brief Bioinform
Additional resources
• Best practices for variant calling in clinical sequencing | Genome
Medicine | Full Text (biomedcentral.com)
Specific issues in variant calling
The effect of sequencing depth and mutation
frequency on performance of variant callers
Systematic comparison of somatic variant calling performance among different
sequencing depth and mutation frequency | Scientific Reports (nature.com)
Base calling algorithms
• Reducing the error rate of base calls and improving the accuracy of the per-base quality score
(Phred) have important implications for assembly, variant detection and downstream
population-genomic analyses.
• Aside of NGS platform provided base calling algorithms, several other have been developed to
optimize data acquisition, improving by 5-30% the error rates:
• Pyrobayes (454 Roche)
• Rsolid (SOLiD)
• BayesCall, Ibis (Illumina)
Alignment
• The accuracy of alignment has a crucial role in variant detection.
• Incorrectly aligned reads may lead to errors in SNP and genotype calling
• Alignment algorithms – need to be:
• able to cope with sequencing errors and potentially real differences (point mutations and indels)
between the reference genome and the sequenced genome
• able to produce well calibrated alignment quality values (variant calls and their posterior probabilities
depend on these scores)
Alignment - mismatches
• The amount of sequence identity required between each read and the reference sequence is
determined by a trade-off between accuracy and length
• The optimal choice of tolerable number of mismatches depends on the organism and also on the
genome part!
• e.g. Drosophila melanogaster is more variable than human – using mapping criteria for human on
Drosophila may lead to a severe loss of sequencing depth for Drosophila. Vice-versa, using Drosophila
melanogaster mapping criteria on human would lead to large amount of incorrectly aligned reads
• MHC complex – very variable between individuals – consider combining alignment and assembly!
Steps increasing the precision
• Pair-end reads are highly recommended, if not even a requirement for WES and WGS
to overcome the problem of ambiguity
• Reads that can only be mapped with many mismatches should be discarded from the
analysis => variants backed only by such reads should not be considered
• Multiple reads originating from only one template might be sequenced, interfering
with variant calling statistics => remove the PCR duplicates after alignment in WGS and
WES (not in targeted sequencing!!)
Which aligners?
*Lunter, G. & Goodson, M. (2011) Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 19: 936-939
Bowtie – cannot perform gapped alignment => cannot find short indels =>
think of Bowtie2!
BWA
BWT vs hash-based: BWT are faster, BUT – not as sensitive as hash-based =>
can introduce mapping biases in regions with high variability*
According to a comparative study*, Novoalign and Stampy currently produce
the most accurate overall results, being at the same time fast enough
However, Stampy not applicable to SOLiD colour reads
SHRiMP2 and BFAST – hash based, capable of dealing with SOLiD colour reads
Stampy*
●Mapping hash-based with optional speedup using BWA
●No support of color reads
●Combines the speed-up of BWT and the sensitivity of hash-based aligners
●http://www.well.ox.ac.uk/project-stampy
●for download you need to register
• *Lunter, G. & Goodson, M. (2011) Stampy: a statistical algorithm for sensitive and fast mapping
of Illumina sequence reads. Genome Res. 19: 936-939
Stampy + BWA
• Steps:
• 1. Create BWA index for the same reference genome as used by Stampy
• bwa index
• 2. Map the reads using BWA in the usual way and use samtools to convert the resulting SAM file into a BAM file
• bwa aln
• bwa samse or sampe
• samtools view
• 3. Remap the BAM file using Stampy, but keep the well-mapped reads
• ./stampy.py -g hg18 \
• -h hg18 \
• -- bamkeepgoodreads -M bwa.bam
After alignment
● Step 1: Post-alignment quality control (assessment of target coverage,
removal of duplicates and non-unique alignments, ...)
● Step 2: Realignment around indels - variant calling specific step
● Step 3: Quality score recalibration - variant calling specific step
Assessment of target coverage
● QC step very specific for targeted sequencing
● Target coverage – is the selection of targets working?
● Specific files used: “bait list” and “target list” (both are .bed files, provided by company
supplying the enrichment kit)
● Baits are oligonucleotides that retrieve specific RNA or genomic DNA used for selection/enrichment of target
regions. The desired DNA or RNA molecules hybridize with the baits, and others do not
● Targets are sequences of interest
Bait file
Target file
Assessment of target coverage
• Tools: picard CollectHsMetrics
• java -jar picard.jar CollectHsMetrics \
• I=your.bam \
• O=result.txt \
• R=reference_sequence.fasta \
• BAIT_INTERVALS=bait_list.bed \ TARGET_INTERVALS=target_list.bed \
CollectHsMetrics
result
http://broadinstitute.github.io/picard/picard-metric-definitions.html#HsMetrics
Coverage using bedtools
• coverageBed -b file.bam \
• -a targets.bed \
• > result.txt
Result:
- chromosome
- target start position
- target end position
- name of target region
- number of features in bam file that overlapped
(by at least one base pair) with target region
- number of bases in target region that had
non-zero coverage by reads in bam file
- the length of target region
- the fraction of bases in target region that had
non-zero coverage by reads in bam file
Coverage using bedtools
Visualization of
coverage
Removing duplicates?
● Multiple reads originating from only one template (PCR effect) might be sequenced
● This interferes with variant calling statistics => remove the PCR duplicates
● However a duplicate could be PCR effect or reading same fragment twice, there is no way to tell.
● The degree to which we expect duplicate fragments is highly dependent on the depth of the library and the type of library
(whole genome, exome, transcriptome, ChIP-Seq, etc.).
● We do not remove duplicates in amplicon based targeted sequencing – here we expect high number of duplicate reads by
design!
Defining a duplicate
● Different ways of definition of duplicates:
● Reads with identical sequence:
+ reduces the pool of reads before mapping
– accurate removal is dependent on low error rate (if an error, then the read is not considered a duplicate)
● Reads with the same mapping position:
+ not influenced by error rate
– reads at the same position may come from different DNA fragments (diploid genoma and polymorphism present...)
● Paired-end reads and longer reads help to correctly identify duplicates
● The best way to deal with duplicates that correspond to PCR amplification bias is to reduce their generation in the first place
● UMIs! – unique molecular identifiers
Marking and removing duplicates
• picard MarkDuplicates – the most recommended tool
• samtools rmdup/rmdupse – alternative
• "Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and
mapping orientations of each read pair. When doing this it takes into account all clipping that has taking
place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases
from the read were aligned, where would the 5' most base have been aligned". It then matches all read
pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair.
"Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15."
Indels – a real example
AML data, NPM1 gene, patient heterozygous for short insertion CATG:
wt: ATT CAA GAT CTC TGG CAG TGG
||| ||| ||| ||| ||
mut: ATT CAA GAT CTC TGC ATG GCA GTG G
The NGS identified insertion TGCA due to incorrect alignment:
wt: ATT CAA GAT CTC TGG CAG TGG
||| ||| ||| |||
mut: ATT CAA GAT CTC TGC ATG GCA GTG G
pindel
• Common variant calling tools are not designed to find long repetitive inserts!
• Pindel is a very good alternative – mainly for targeted sequencing data.
• http://gmt.genome.wustl.edu/packages/pindel/
• Generally, it is recommended to combine multiple tools, based on the biological
questions and data type!
Realignment around indels
• These artifact mismatches can harm base quality recalibration and variant
detection (unless a sophisticated caller like the HaplotypeCaller or MuTect2 is used)
Three types of realignment targets
• Known sites
• Indels seen in original alignments
• Sites where evidence suggests a hidden indel
https://www.broadinstitute.org/gatk/events/slides/1212/GATKwh0-BP-2-Realignment.pdf
https://www.broadinstitute.org/gatk/events/slides/1212/GATKwh0-BP-2-Realignment.pdf
Realignment around indels
• GATK
• 2 step process:
• Determining (small) suspicious intervals which are likely in need of realignment
(RealignerTargetCreator tool)
• Running the realigner over those intervals (IndelRealigner)
• SMRA – realigning reads in color space (SOLiD)
https://www.broadinstitute.org/gatk/events/slides/1212/GATKwh0-BP-2-Realignment.pdf
https://www.broadinstitute.org/gatk/events/slides/1212/GATKwh0-BP-2-Realignment.pdf
https://gatkforums.broadinstitute.org/gatk/discussion/7847
Long indels
• However – GATK does not call well longer indels! - these analyzed by pindel,
which needs realignment!
Recalibration of base quality scores
● Phred-like quality scores issued by the sequencing platforms may often deviate from the true error rate
● Having accurate quality scores is essential for the modern SNP calling algorithms, as they integrate the
Phred scores of the bases covering the site to be examined into their scoring function
● Main idea: estimate true base quality based on known sites without SNPs
● Recalibration tools/approaches:
●SOAPsnp (http://soap.genomics.org.cn/soapsnp.html)
●GATK, ...
Recalibration of quality scores
• All recalibrating algorithms use a comprehensive database of known SNPs.
• If no such database is available, one can first identify candidate polymorphic sites that are
highly likely to be real and use the remaining sites for the recalibration procedure – in this
case, another round of SNP calling should be performed with recalibrated quality scores.
• Also – in targeted sequencing – most of the targets is expected to have variants. In this
case, the recalibration is not recommended (e.g. true SNPs would be evaluated as
mismatch rate)
SOAPsnp
• Exploits sites in the reference genome without any reported SNPs. On these sites, it computes the empirical
mismatch rate as an estimate for the true base quality.
• For a:
• given machine provided quality score
• sequencing cycle (position of base in the read) and
• substitution type (e.g. A->G: A in reference and G in read)
• it calculates the average mismatch rate with respect to the reference genome
• This mismatch rate is then used as the recalibrated quality score by adding to the raw quality scores the
residual differences between empirical quality scores and the mismatch rates implied by the raw quality
scores
GATK
• Similar concept as SOAPsnp:
• 1. bases are grouped with respect to several features (raw quality, dinucleotide content)
• 2. empirical mismatch rate is computed and used to correct the raw quality score
Additional resources
• Edge effects in calling variants from targeted amplicon sequencing |
BMC Genomics (springer.com)
• Low concordance of multiple variant-calling pipelines: practical
implications for exome and genome sequencing | Genome Medicine
| Full Text (biomedcentral.com)
• Systematic benchmark of state-of-the-art variant calling pipelines
identifies major factors affecting accuracy of coding sequence variant
discovery | BMC Genomics (springer.com)
• Toward better understanding of artifacts in variant calling from highcoverage
samples | Bioinformatics | Oxford Academic (oup.com)