Lecture 3: DNA re-sequencing + Small variant calling
Vojta Bystry, vojtech.bystry@ceitec.muni.cz
Modern methods for genome analysis (PřF:Bi7420)

NGS data analysis
[Workflow diagram; labels recovered from the slide]
• Experiment design
• De-multiplexing
• Raw data (.fastq), QC
• Genome/transcriptome reference: mapping (.bam), QC
  ‒ Interaction analysis: ChIP-seq
  ‒ Expression analysis: RNA-seq
  ‒ Variant analysis: WES
  ‒ Methylation: bisulfite-seq…
• Not "classic" reference
  ‒ Immunogenetics: VDJ genes
  ‒ CRISPR: sgRNA
• Unknown reference
  ‒ Metagenomics
  ‒ Reference assembly

DNA re-sequencing
• Variant calling
• Medical genomics
  ‒ Cancer genomics
• Small variants (SNVs + small indels) vs. structural variants
• Germline vs. somatic

Mapping
• Computationally the most demanding step
• More or less standardized
• Output: .bam
  ‒ .bam = binary (compressed) .sam
  ‒ .sam = Sequence Alignment Map
• Tools
  ‒ BWA: DNA
  ‒ STAR: RNA (eukaryotic)

DNA re-sequencing
• Mapping
• QC

Small variant calling

Variant calling - Germline
• What you have from birth
• Family trio sequencing
• Predispositions

Variant calling - Somatic
• Diagnostics / prognosis / therapy decisions
• Tumor–normal paired samples
  ‒ Somatic variant calling without a matched normal needs high coverage (> 200x)
    • Not all germline variants will be filtered out
• Expected variant heterogeneity
• Expected variant allelic frequency (VAF)
  ‒ Histopathology estimates tend to overestimate tumor load
  ‒ Negatively correlated with the necessary coverage: the lower the expected VAF, the deeper the sequencing has to be
[Figure labels: tumor purity estimation; tumor composition]

Variant calling - Tools
• Multiple tools:
  ‒ Strelka2, VarDict, Mutect2, SomaticSniper, LoFreq, MuSE, VarScan
• Ensemble/meta callers usually outperform individual callers
  ‒ SomaticSeq
• Benchmarking
  ‒ Genome in a Bottle (GIAB)
  ‒ Son/father/mother trio of Ashkenazi Jewish ancestry

Variant calling - Tools
• The main problem is variant filtering
  ‒ Complex regions
  ‒ Pseudogenes
• Sensitivity vs. specificity tradeoff
  ‒ Sensitivity is usually preferred
  ‒ Accuracy is preferred for automated processing

Small variant annotation
• VEP (Variant Effect Predictor)
• Transcript "selection"
  ‒ RefSeq vs. Ensembl
• Population frequency
  ‒ 1000 Genomes Project
  ‒ gnomAD
• Many clinical variant databases
  ‒ Gene-based vs. variant-based
  ‒ dbSNP
  ‒ COSMIC
  ‒ ClinVar
  ‒ CGC

Small variant annotation – functional prediction
• General variant consequence
  ‒ Based on the position
  ‒ Impact
• Effect of the variant on protein structure
  ‒ PolyPhen
  ‒ SIFT

Cancer genomics introduction
• Based on molecular state
  ‒ Classification
  ‒ Prognosis
  ‒ Treatment selection
• Precision medicine

Cancer genomics introduction - Case report
• 5-year-old boy with diffuse intrinsic pontine glioma (DIPG); after 6 months of standard chemo/radiotherapy the tumor progressed, with only 6 months to live
• WES identified an activating mutation in the PI3K kinase -> Akt oncogenic signalling pathway
• Miltefosine (Impavido): the only approved Akt inhibitor
• Drug repurposing: miltefosine is a leishmaniasis drug
[Imaging timeline labels: at the beginning; after 6 months of treatment; after 4 months of miltefosine; after 8 months of miltefosine]

Somatic variant NGS data analysis
• Primary analysis and QC
• Variant calling
• Variant annotation
• Variant interpretation
• Aggregated feature extraction
• Predictive modeling
• …
• Clinical application

Variant interpretation – derived information
• Tumor mutational burden
  ‒ Several definitions
  ‒ Mutations per million bases
  ‒ Good indicator of response to immunotherapy
• Microsatellite instability
  ‒ Occurrence of specific variants
• HPV status
• Mutational signatures
  ‒ COSMIC
  ‒ Exposure to ultraviolet light
  ‒ Tobacco smoking
  ‒ Defective DNA damage repair

Genomic variant predictive modeling
• Genomic variant data are very problematic for modeling (curse of dimensionality)
  ‒ Enormous feature space
    • ~100 000 features
  ‒ Limited number of data points
    • Only one predictive label per patient
• Feature selection/extraction
  ‒ Biologically meaningful data extraction: variants -> genes -> pathways
• Increase the number of samples
  ‒ Usage of publicly available data

Genomic variant predictive modeling
• Pathway-level "disruption" score computed from gene- and mutation-level scores
  ‒ KEGG pathways
  ‒ Mutation effect: combination of CADD, EVE and PolyPhen-2 scores
FI MUNI Bioinformatics Seminar

Thank you for your attention!
Vojta Bystry, vojtech.bystry@ceitec.muni.cz
www.ceitec.eu, CEITEC, @CEITEC_Brno
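
Worked example: expected VAF and necessary coverage
The somatic variant calling slide notes that the expected VAF is negatively correlated with the necessary coverage. The sketch below illustrates that relationship under simplifying assumptions that are not from the slides: a clonal heterozygous variant at a diploid, copy-neutral locus (so expected VAF = purity / 2), a plain binomial model of read sampling, and a hypothetical calling rule of at least 5 variant-supporting reads. Real callers use more elaborate models.

```python
from math import comb


def expected_vaf(purity):
    """Expected VAF of a clonal heterozygous somatic variant.

    Assumption: diploid, copy-neutral locus, so only one of the two
    alleles in each tumor cell carries the variant.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    return purity / 2.0


def detection_probability(depth, vaf, min_alt_reads=5):
    """Probability of seeing at least min_alt_reads variant reads.

    Assumption: each read independently carries the variant with
    probability vaf (binomial sampling, no sequencing errors).
    """
    p_too_few = sum(
        comb(depth, k) * vaf**k * (1.0 - vaf) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_too_few


def required_depth(vaf, min_alt_reads=5, power=0.95, max_depth=5000):
    """Smallest depth reaching the requested detection probability.

    Returns None if the target is not reached within max_depth.
    """
    for depth in range(min_alt_reads, max_depth + 1):
        if detection_probability(depth, vaf, min_alt_reads) >= power:
            return depth
    return None


if __name__ == "__main__":
    # Hypothetical purities: lower purity -> lower VAF -> deeper sequencing.
    for purity in (0.8, 0.4, 0.2):
        vaf = expected_vaf(purity)
        print(f"purity={purity:.1f} VAF={vaf:.2f} depth={required_depth(vaf)}")
```

Running the script prints one line per purity value; the required depth grows as purity, and therefore expected VAF, drops. This is also why an overestimated tumor load from histopathology can lead to under-sequencing.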
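
Worked example: tumor mutational burden
The variant interpretation slide defines tumor mutational burden as mutations per million bases and notes that several definitions exist. The sketch below shows how the choice of definition changes the result. The variant records, the consequence labels, the VAF cutoff and the 35 Mb target size are all hypothetical values chosen for illustration, not taken from the lecture.

```python
def count_mutations(variants, consequences=None, min_vaf=0.0):
    """Count variants that qualify for the TMB numerator.

    variants: iterable of dicts with 'consequence' and 'vaf' keys
              (a hypothetical record layout).
    consequences: set of consequence labels to keep, or None to keep all.
    min_vaf: variants below this allelic frequency are ignored.
    """
    return sum(
        1
        for v in variants
        if v["vaf"] >= min_vaf
        and (consequences is None or v["consequence"] in consequences)
    )


def tumor_mutational_burden(n_mutations, covered_bases):
    """Mutations per million bases of sequenced territory."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return n_mutations / (covered_bases / 1_000_000)


if __name__ == "__main__":
    # Toy somatic calls; labels and values are made up.
    calls = [
        {"consequence": "missense", "vaf": 0.31},
        {"consequence": "synonymous", "vaf": 0.28},
        {"consequence": "missense", "vaf": 0.03},
        {"consequence": "stop_gained", "vaf": 0.22},
    ]
    target_size = 35_000_000  # assumed size of the covered exome in bases

    all_calls = count_mutations(calls)
    strict = count_mutations(
        calls, consequences={"missense", "stop_gained"}, min_vaf=0.05
    )
    print("TMB, all calls:", tumor_mutational_burden(all_calls, target_size))
    print("TMB, strict definition:", tumor_mutational_burden(strict, target_size))
```

Because the numerator and the covered territory both depend on the definition, TMB values are only comparable between samples processed with the same rules.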
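
Worked example: variants -> genes -> pathways
The last modeling slide describes a pathway-level "disruption" score built from gene- and mutation-level scores. The slides do not say how the scores are combined, so everything in the sketch below is an assumption made for illustration: predictor scores are taken as already rescaled to [0, 1], the mutation score is their mean, the gene score is the maximum over its mutations, and the pathway score is the mean over member genes (unmutated genes count as 0). Gene and pathway names are placeholders, not real KEGG content.

```python
from statistics import mean


def mutation_score(predictor_scores):
    """Combine per-predictor scores (assumed rescaled to [0, 1]) by averaging."""
    return mean(list(predictor_scores.values()))


def gene_scores(mutations):
    """Gene-level score: the most damaging mutation seen in each gene.

    mutations: iterable of (gene_name, predictor_scores) pairs.
    """
    scores = {}
    for gene, predictor_scores in mutations:
        score = mutation_score(predictor_scores)
        scores[gene] = max(scores.get(gene, 0.0), score)
    return scores


def pathway_scores(gene_level, pathways):
    """Pathway-level score: mean gene score over all member genes.

    Genes without any mutation contribute 0.0.
    """
    return {
        name: mean([gene_level.get(gene, 0.0) for gene in genes])
        for name, genes in pathways.items()
    }


if __name__ == "__main__":
    # Toy patient: three mutations in two placeholder genes.
    patient_mutations = [
        ("GENE_A", {"cadd": 0.9, "eve": 0.7, "polyphen2": 0.8}),
        ("GENE_A", {"cadd": 0.2, "eve": 0.1, "polyphen2": 0.3}),
        ("GENE_C", {"cadd": 0.5, "eve": 0.5, "polyphen2": 0.5}),
    ]
    # Placeholder pathway membership (not taken from KEGG).
    toy_pathways = {
        "PATHWAY_1": ["GENE_A", "GENE_B"],
        "PATHWAY_2": ["GENE_C"],
    }
    per_gene = gene_scores(patient_mutations)
    print(pathway_scores(per_gene, toy_pathways))
```

The point of the design is the reduction in feature count: a patient is described by one number per pathway instead of one feature per variant, which eases the curse of dimensionality when only one label per patient is available.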