Modern Methods for Genome Analysis: Bioinformatics I
Vojtěch Bystrý
29 October 2018

Goals of the presentation
• Overview of NGS bioinformatics
    • NGS bioinformatics < Sequence analysis < Bioinformatics
• What to think about when you
    • plan an experiment
    • discuss data analyses
    • check results
• Not to teach you how to do bioinformatics

NGS Bioinformatics
[Figure: RNA-seq heatmap (Heatmap_RNAseqV2_1.png)]

NGS experiments
Next Generation Sequencing
• Cell static: DNA sequencing
    • Recognize differences from "normal"
• Cell dynamics: RNA-seq, ChIP-seq, CLIP-seq, ...
    • Counting elements

NGS data analysis workflow
[Diagram: Experiment design > de-multiplexing > Raw data (.fastq) > QC > Mapping (.bam) > QC > Variant analysis / Expression analysis / Structural variants / Special cases]
• Special cases
    • Unknown reference: non-standard species
    • Assembly
    • IG/TR receptors
    • …

Experimental design
• Have a hypothesis!
• Consult a sequencing expert and a bioinformatician to find out:
    1. Whether the experiment you have in mind can be done the way you are planning it.
    2. Whether the results you want (the desired outcome) can be obtained from the planned sequencing.
    3. Whether the bioinformatician knows how to perform the specific types of analysis, and how long it will probably take.

De-multiplexing
• Not perfect
    • In silico contamination: a problem for MRD detection
• Sample naming and organisation
    • Naming
        • Unique names
        • _ vs - vs .
        • Special characters: $&|@+- …
        • Really tricky: - vs − (hyphen vs. minus sign)
    • Organisation
        • Should not be your worry
        • For any longer "operation" a comprehensive database is necessary
        • We are currently working on one ourselves :)
        • Please fill in the forms carefully and as completely as possible

Data pre-processing
• Primer (adaptor) trimming
    • Cutting adapters is usually not necessary, but it is good practice
    • Primer removal is necessary
• UMI extraction

UMI: unique molecular identifiers
• Each molecular fragment gets a unique n-base sequence (n ~ 8-12)
• Usage:
    • Mark duplicates
    • Consensus sequence
    • Sequencing (PCR) error removal

Raw data - QC
• Fastq: the q stands for quality, encoded as a Phred score

    Quality    Error probability
    5          ~31.6%
    10         10%
    20         1%
    30         0.1%

• Very good for early problem detection
• Reasonable for trimming and read filtering
    • RNA-seq: above Phred score 5
• Not good for individual variant analysis
• Example quality string: CFFFFEFFGCEEGECFGGGGAFF87@E:++6C<++3:,8,33,,:,,,:,,:,,,

Alignment
• Computationally the most demanding step
• More or less standardized
• Align to the whole genome, then select the region of interest (ROI) <- .bed file
    • Don't force alignment
    • Keep the information about wrongly aligned reads for QC
    • Exception: targeted SV detection
• Our standard
procedure:
    • BWA: DNA
    • STAR: RNA
    • Chimira: sRNA

Alignment - QC
• DNA
    • Mean coverage and variance
    • Percentage of positions covered to at least a given depth
        • In WES we define good quality as at least 90% of positions covered at least 20x
    • Per-base coverage: in smaller experiments
• RNA
    • Per-gene coverage
    • Variability of per-gene mapping
    • Gene counts distribution
    • rRNA content estimate
    • Tissue expression check: GTEx
• BAM cross-contamination
    • verifyBamID
        • FREEMIX: below 0.03 = OK
    • Cross-sample SNP allele frequency correlation
        • merge vcf tools

Variant Calling
(Exome sequencing workshop, VBCF, 1.8.2018)
• Type of comparison
    • Germline: to the reference genome
    • Somatic: to other sample(s)
• Expected variant heterogeneity
    • Indirectly correlates with the necessary coverage

Variant Calling - planning
• Scope

    Scope        genes     ~bp          ~% of WG
    WGS          ~22000    3 200 mil    100%
    WES          22000     30 mil       1%
    PanCancer    1049      1.2 mil      0.04%
    CZECANCA     219       250 000      0.0083%
    TP53         1         25772        0.000859%

• With a fixed cost per sequenced bp, the price seems to be linear in the scope
• The "price" of the analysis must also be considered
    • Power of the results (sensitivity, specificity)

Variant Calling - planning
• Example
    • Let's have an analysis with a per-base false positive error rate of 0.0001. This results in:
        • about 2 false variants in the TP53 gene
        • 3000 false variants in WES!
• WES on a single healthy person with the question:
    • Are there any variants?
• The answer is YES
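
The false-positive arithmetic in the example above can be reproduced with a few lines of Python. This is only a sketch: the scope sizes are copied from the scope table, the function name `expected_false_positives` is made up for illustration, and the calculation assumes the per-base error rate applies uniformly and independently to every sequenced position.

```python
# Expected number of false variants = per-base false positive rate x bases analysed.
# Assumption: the error rate is uniform and independent across positions.
PER_BASE_FP_RATE = 0.0001

# Approximate scope sizes in bp, taken from the scope table in the slides.
SCOPE_BP = {
    "TP53": 25_772,
    "CZECANCA": 250_000,
    "PanCancer": 1_200_000,
    "WES": 30_000_000,
}


def expected_false_positives(scope_bp: int, fp_rate: float = PER_BASE_FP_RATE) -> float:
    """Return the expected count of false positive calls for a target of scope_bp bases."""
    return scope_bp * fp_rate


for scope, bp in SCOPE_BP.items():
    print(f"{scope:<10} {bp:>12,} bp -> ~{expected_false_positives(bp):,.1f} false variants")
```

For TP53 the product is roughly 2.6, which the slide quotes as 2; for WES it is 3000. The point of the slide follows directly: the wider the scope, the more false calls a fixed per-base error rate produces, so "are there any variants?" is always answered with yes.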

Variant Calling - planning
• Sample design
    • Germline
    • Somatic
    • Tumor - Normal
    • Family
• Any relationship between the samples being compared improves specificity dramatically
    • Not sensitivity
• Somatic variant calling without a normal sample needs high coverage
• RNA
    • Depends on gene expression levels
    • The variant might not be there! (GTEx, previous runs QC)

Variant Calling
• Specificity vs. sensitivity
• Tools
    • VarScan: no statistics = no assumptions
    • VarDict
    • GATK HaplotypeCaller
    • MuTect: SNPs only
    • Pindel: indels only
    • FreeBayes
    • Combining callers: the usual strategy
• Variant annotation
    • ANNOVAR: good database
    • SnpEff
    • VEP: Variant Effect Predictor

Variant Calling
• Variant annotation can help variant calling significantly
    • Variant occurrence in the normal population
        • 1000 Genomes Project: above 5%
    • Variant consequences cut-off
• A database can help significantly: Sophia Genetics

Structural variants
• Discordant read(-pair) mapping
• Copy number variants (CNV)
You are looking for something you don't expect

Structural variants
• CNV
    • Long variants in WGS: Control-FREEC
    • Smaller variants for WES / target panel
        • Somatic: tumor, normal
        • Germline: a lot of references
        • XHMM
• Read pairs: very noisy, expect a lot of FP
    • BreakPoint: target panel with short reads
    • Delly: everything else

Structural variants
• Manual check with IGV (batch mode)
[Figures: IGV snapshots DEL58.right.png, INV47.right.png, DEL20.right.png]

Expression analysis

Counting schemes

Expression analysis - planning
• 3-way balance
    • Read depth
    • Biological replicates
    • Fold change (number of genes) sensitivity
[Figure: sensitivity, from not sensitive to very sensitive, as a function of biological replicates (1 BR to many BR) and read depth (low RD to high RD)]

Expression analysis - planning
• Replicates
    • Technical vs. biological
    • Technical replicates only for testing the technique
    • Highly suggested minimum = 4 replicates
[Figure: samples labelled N1, N2, N3, T1, T2, T3]

Expression analysis - planning
• Depth
    • Human ~ 22 000 genes = minimum 20 mil mapped reads
    • Good: 25 mil mapped reads
• Mapped reads!
    • rRNA removal: 90% rRNA
    • Size selection for sRNA
• Trade-off
    • 4 replicates with 20 mil vs. 3 replicates with 30 mil
    • 9 replicates with 25 mil vs. 10 replicates with 20 mil
    • Reality: 3 replicates with 15 mil reads

Visualization
[Figure: RNA-seq heatmap (Heatmap_RNAseqV2_1.png)]
• The bioinformatician provides everything that might be helpful
• The researcher requests exactly what they want
• Good to find a balance

Thank you for your attention
Central European Institute of Technology
Masaryk University
Kamenice 753/5
625 00 Brno, Czech Republic
www.ceitec.muni.cz | info@ceitec.muni.cz