DNA re-sequencing - Analysis
Vojtěch Bystrý
18 November 2019

Goals of the presentation
• Overview of NGS bioinformatics
  • NGS bioinformatics < Sequence analysis < Bioinformatics
• What to think about when you
  • plan an experiment
  • discuss data analyses
  • check results
• Not to teach you how to do bioinformatics

NGS Bioinformatics

NGS experiments
• Next Generation Sequencing
  • Cell Static: DNA re-sequencing
    • Recognize differences from “normal”
  • Cell Dynamics: RNAseq, Chipseq, CLIP-seq..
    • Counting elements

NGS data analysis
[Pipeline diagram. Stages shown: Experiment design, de-multiplexing, Raw data (.fastq) with QC, Mapping (.bam) with QC, Variant analysis, Structural variants]

Data pre-processing
§ Primer (adaptor) trimming
  § Cutting adapters is usually not necessary, but it is good practice
  § Primer removal is necessary
§ UMI extraction

UMI – unique molecular identifiers
§ Each molecular fragment gets a unique n-base sequence (n ~ 8-12)
§ Usage:
  § Mark duplicates
  § Consensus sequence
    § Sequencing (PCR) error removal

Raw data - QC
• Fastq - q stands for quality – coded Phred score
  Q = −10 · log10(P)

  Quality   Error probability
  5         31.6%
  10        10%
  20        1%
  30        0.1%

• Very good for early problem detection
• Reasonable for trimming and read filtering
  • RNA-seq - above Phred score 5
• Not good for individual variant analysis
Example quality string:
CFFFFEFFGCEEGECFGGGGAFF87@E:++6C<++3:,8,33,,:,,,:,,:,,,

Alignment
• Computationally most demanding
• More or less standardized
• Align to the genome, then select the region of interest (ROI) <- .bed file
  • Don’t force alignment
  • Keep the information about wrongly aligned reads for QC
  • Exception: targeted structural variant detection

Alignment - QC
• Mean coverage and variance
• Percentage of positions covered with at least a given depth
  • In WES we define good quality if at least 90% of positions are covered at least 20x
• Insert size
• BAM cross-contamination
  • Cross-sample SNP allele frequency correlation

Variant Calling
§ Type of comparison
  § Germline
  § Somatic
    § Tumor - normal
    § Somatic variant calling without normal needs high coverage
§ Expected variant heterogeneity
  § Indirectly correlates to the necessary coverage

Variant Calling
• Scope

  Scope       Genes     ~bp          ~% of WG     ~Germline variants
  WGS         ~22000    3 200 mil    100%         700 000
  WES         22000     30 mil       1%           60 000
  PanCancer   1049      1.2 mil      0.04%        3000
  CZECANCA    219       250 000      0.0083%      400
  TP53        1         25772        0.000859%    30

Variant Calling - planning
• Sample design
  • Germline
  • Somatic (Tumor - Normal)
  • Any relationship between samples for comparison improves specificity dramatically
    • Not sensitivity
  • Somatic variant calling without normal needs high coverage
• RNA
  • Depends on gene expression levels
  • Variant might not be there! – GTEx, previous runs QC

Variant Calling
• Specificity vs. Sensitivity
• Tools
  • varscan – no statistics = no assumptions
  • vardict
  • gatk haplotype caller
  • mutect – only SNP
  • pindel – only indels
  • freebayes
  • Combining callers – usual strategy
• Variant Annotation
  • Annovar – good database
  • snpEff
  • vep – variant effect predictor

Variant Calling
• Variant annotation can help variant calling significantly
  • Variant occurrence in normal population
    • 1000 Genomes Project – above 5%
  • Variant consequences cut off
• Database can help significantly – Sophia Genetics

Structural variants
• discordant read(-pairs) mapping
• copy number variants (CNV)

Structural variants
• CNV
  • Long variants in WGS – ControlFreec
  • Smaller variants for WES / target panel
    • Somatic – tumor, normal
    • Germline - lot of references
    • XHMM
• Read-pairs: very noisy, expect a lot of FP (false positives)
  • BreakPoint – target panel with short reads
  • Delly – everything else

Structural variants
• Manual check with IGV

Thank you for your attention
Central European Institute of Technology
Masaryk University
Kamenice 753/5
625 00 Brno, Czech Republic
www.ceitec.muni.cz | info@ceitec.muni.cz
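
Supplementary sketch for the Raw data - QC slide: the Phred relation Q = −10 · log10(P) can be inverted to recover the error probability from a quality score, and a FASTQ quality string can be decoded character by character. This is a minimal illustration, not part of the original talk. The function names are hypothetical, and the ASCII offset of 33 (Phred+33) is an assumption, since the slides do not state which encoding the example string uses.

```python
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(P): return the error probability P for Phred score Q."""
    return 10 ** (-q / 10)


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into per-base Phred scores.

    offset=33 (Phred+33) is an assumption; the slides do not name the encoding.
    """
    return [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    # Reproduce the quality / error-probability table from the slide.
    for q in (5, 10, 20, 30):
        print(f"Q{q}: {phred_to_error_prob(q):.2%}")

    # Decode the example quality string shown on the slide.
    example = "CFFFFEFFGCEEGECFGGGGAFF87@E:++6C<++3:,8,33,,:,,,:,,:,,,"
    scores = decode_quality(example)
    print("first bases:", scores[:5])
    print("last bases: ", scores[-5:])
```

Under the Phred+33 assumption, the scores drop toward the end of the example read, which is the kind of pattern that makes per-base quality useful for trimming and read filtering.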