Lecture 3: DNA re-sequencing + Small variant calling
Vojta Bystry
vojtech.bystry@ceitec.muni.cz
Modern methods for genome analysis (PřF:Bi7420)

NGS data analysis
[Workflow diagram; labels grouped by stage]
• Experiment design
• Raw data (.fastq)
  ‒ De-multiplexing
  ‒ QC
• Reference
  ‒ Genome/Transcriptome
  ‒ Not known reference: reference assembly
  ‒ Not "classic" reference: metagenomics, immunogenetics (VDJ genes), CRISPR (sgRNA)
• Mapping (.bam)
  ‒ QC
• Analysis
  ‒ Interaction analysis: ChIP-seq
  ‒ Expression analysis: RNA-seq
  ‒ Variant analysis: WES
  ‒ Methylation: bisulfite-seq
  ‒ …

DNA re-sequencing
• Variant calling
• Medical genomics
  ‒ Cancer genomics

Genome Variation
[Figure slides]

Mapping
• Computationally the most demanding step
• More or less standardized
• Output: .bam
  ‒ .bam = binary (compressed) .sam
  ‒ .sam = Sequence Alignment/Map
• Tools
  ‒ BWA: DNA
  ‒ STAR: RNA (eukaryotic)

DNA re-sequencing: Mapping QC
[Figure slides]

Clustering of regions by AF
[Figure: clustering of Czech regions by allele frequency (AF). Region labels: PH, JHC, JHM, KVK, VYS, HKK, LBK, MSK, OLK, PK, PLK, STC, ULK, ZLK]

Correlation of pathogenicity and AF
[Figure: axes "Allele frequency in Czech population" and "AlphaMissense pathogenicity score", both from 0.00 to 1.00]

Interesting variants: F5 (thrombophilia)

Gene_name  chrom  pos        ref  alt  HGVSc      HGVSp     AF_CZE  EUR_AF  EAS_AF  AFR_AF  AMR_AF  SAS_AF
F5         chr1   169542640  T    G    c.2450A>C  p.N817T   8.4%    6.2%    3.3%    0.5%    9.2%    6.4%
F5         chr1   169514323  T    C    c.6665A>G  p.D2222G  8.2%    6.6%    3.2%    0.2%    9.1%    6.4%
F5         chr1   169549811  C    T    c.1601G>A  p.R534Q   3.8%    1.2%    0.0%    0.0%    1.0%    1.1%

Variant Calling - Tools
• Multiple tools:
  ‒ Strelka2, VarDict, Mutect2, SomaticSniper, LoFreq, MuSE, VarScan
• Ensemble/meta callers usually outperform individual callers
  ‒ SomaticSeq
• Benchmarking
  ‒ Genome in a Bottle (GIAB)
  ‒ Son/father/mother trio of Ashkenazi Jewish ancestry

Variant Calling - Tools
• The problem is variant filtering
  ‒ Complex regions
  ‒ Pseudogenes
• Sensitivity vs. specificity tradeoff
  ‒ Sensitivity preferred
  ‒ Accuracy preferred for automated processing

Small Variant annotation
• VEP – Variant Effect Predictor
• Transcript "selection"
  ‒ RefSeq vs. Ensembl
• Population frequency
  ‒ 1000 Genomes Project
  ‒ gnomAD
• Many clinical variant DBs
  ‒ Gene based vs. variant based
  ‒ dbSNP
  ‒ COSMIC
  ‒ ClinVar
  ‒ CGC (Cancer Gene Census)

Small Variant annotation – functional prediction
• General variant consequence
  ‒ Based on the position
  ‒ Impact
• Effect of the variant on protein structure
  ‒ PolyPhen
  ‒ SIFT

Cancer genomics introduction
• Based on molecular state
  ‒ Classification
  ‒ Prognosis
  ‒ Treatment selection
• Precision medicine

Cancer genomics introduction - Case report
• 5-year-old boy with diffuse intrinsic pontine glioma (DIPG); 6 months of standard chemo/radiotherapy, followed by tumor progression; only 6 months to live
• WES identified an activating mutation in PI3K kinase -> Akt oncogenic signalling pathway
• Miltefosine (Impavido), the only approved Akt inhibitor
• Drug repurposing: original indication is leishmaniasis
[Figure panels: at the beginning; 6 months of treatment; 4 months of miltefosine; 8 months of miltefosine]

Somatic variant NGS data analysis
• Primary analysis and QC
• Variant calling
• Variant annotation
• Variant interpretation
• Aggregated feature extraction
• Predictive modeling
• …
• Clinical application

Variant interpretation – derived information
• Tumor mutational burden
  ‒ Several definitions
  ‒ Mutations per million bases
  ‒ Good indicator of immunotherapy response
• Microsatellite instability
  ‒ Occurrence of specific variants
• HPV status
• Mutational signatures
  ‒ COSMIC
  ‒ Exposure to ultraviolet light
  ‒ Tobacco smoking
  ‒ Defective DNA damage repair

Genomic variant predictive modeling
• Genomic variant data are very problematic for modeling
  ‒ Enormous feature space
    • ~100 000 features
  ‒ Limited number of data points
    • Only one predictive label per patient
• Feature selection/extraction
• Increase number of samples
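
The feature selection bullet above can be illustrated with a minimal sketch. It assumes each patient is represented as a set of variant identifiers and keeps only variants seen in at least `min_samples` patients. The function name, the threshold, and the toy cohort are all hypothetical; the slides do not say which selection method is actually used.

```python
from collections import Counter


def select_recurrent_variants(samples, min_samples=2):
    """Keep only variants observed in at least `min_samples` patients.

    `samples` maps a patient ID to a set of variant identifiers.
    Returns the kept variants (sorted) and a binary patient-by-variant matrix.
    This recurrence filter is an illustrative assumption, not the lecture's method.
    """
    counts = Counter()
    for variants in samples.values():
        counts.update(set(variants))  # count each variant once per patient

    kept = sorted(v for v, n in counts.items() if n >= min_samples)
    matrix = {
        patient: [1 if v in variants else 0 for v in kept]
        for patient, variants in samples.items()
    }
    return kept, matrix


# Hypothetical toy cohort: three patients, four distinct variants.
cohort = {
    "patient_1": {"chr1:100A>G", "chr2:200C>T", "chr3:300G>A"},
    "patient_2": {"chr1:100A>G", "chr4:400T>C"},
    "patient_3": {"chr1:100A>G", "chr2:200C>T"},
}

features, matrix = select_recurrent_variants(cohort, min_samples=2)
print(features)
print(matrix["patient_2"])
```

In this toy cohort the filter reduces four variant features to two. On real data the same idea shrinks the feature space, at the cost of discarding rare variants that may still matter.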
Genomic variant predictive modeling: curse of dimensionality
• Biologically meaningful data extraction
  ‒ Variants → Genes → Pathway
• Usage of publicly available data

Genomic variant predictive modeling: pathway-level score
• Pathway-level "disruption" score from gene- and mutation-level scores
  ‒ KEGG pathways
  ‒ Mutation effect: combination of CADD, EVE, and PolyPhen-2 scores

www.ceitec.eu, CEITEC, @CEITEC_Brno
Vojta Bystry
vojtech.bystry@ceitec.muni.cz
Thank you for your attention!
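
Supplementary sketch of the Variants → Genes → Pathway aggregation behind the pathway-level "disruption" score. The slides do not give the combination rules, so everything here is an assumption: predictor scores are taken as already rescaled to [0, 1], a mutation score is their mean, a gene score is its most damaging mutation, and a pathway score is the mean over all pathway genes. The gene list is a toy example, not an actual KEGG pathway export.

```python
from statistics import mean


def mutation_score(cadd_scaled, eve, polyphen2):
    """Mutation-level effect: mean of three predictors, each assumed in [0, 1]."""
    return mean([cadd_scaled, eve, polyphen2])


def gene_scores(mutations):
    """Gene-level score: the highest mutation score seen in each gene (assumed rule)."""
    scores = {}
    for gene, cadd_scaled, eve, polyphen2 in mutations:
        score = mutation_score(cadd_scaled, eve, polyphen2)
        scores[gene] = max(scores.get(gene, 0.0), score)
    return scores


def pathway_score(pathway_genes, gene_level):
    """Pathway-level score: mean gene score; genes without mutations count as 0."""
    return sum(gene_level.get(g, 0.0) for g in pathway_genes) / len(pathway_genes)


# Hypothetical patient: (gene, CADD rescaled to [0, 1], EVE, PolyPhen-2).
patient_mutations = [
    ("PIK3CA", 0.9, 0.8, 1.0),
    ("PIK3CA", 0.3, 0.2, 0.1),
    ("AKT1", 0.6, 0.5, 0.4),
]
# Toy pathway gene list, for illustration only.
toy_pathway = ["PIK3CA", "AKT1", "MTOR", "PTEN"]

per_gene = gene_scores(patient_mutations)
score = pathway_score(toy_pathway, per_gene)
print(round(score, 3))
```

Each patient is then described by one number per pathway instead of one feature per variant, which is the dimensionality reduction the slides aim for. Other aggregation rules (sum, weighted mean, counting damaging hits) would fit the same structure.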