CEITEC, Central European Institute of Technology, Brno, Czech Republic
European Union, European Regional Development Fund: Investing in Your Future. OP Research and Development for Innovation.

Modern Methods of Genome Analysis
Bioinformatics I
Mgr. Nikola Tom
Brno, 11.11.20

Bioinformatics
Bioinformatics is a fairly new field (first NGS in 2005).
How to analyse data derived from NGS = the bottleneck of NGS
AIM: clean the data and give them biological sense

Bioinformatics
SOLUTION 1:
• Commercial software and ready-to-use pipelines
BUT their settings are usually not transparent and/or they do not offer enough options (good programs are expensive).
Logos shown: CLC bio (a QIAGEN company), SOPHiA GENETICS

Bioinformatics
SOLUTION 2:
• Command-line based tools/software
Each tool solves only one part of the analysis.
• Need to set up the pipeline and tune the programs' parameters (challenging, but more precise!)
Logo shown: Ubuntu

Bioinformatics
The choice of programs and settings depends heavily on the type of experiment, the library preparation and the biological question.
A laptop or PC is usually not enough; a computing cluster is needed.

Before we start analysis
We have to know what we are dealing with and what we want to find out.
Concept of the project: DNA/RNA/methylation/...
DNA
• Targeted sequencing (amplicons, gene panels, exomes)
• Whole genome sequencing
Finding differences to a known reference genome = re-sequencing
De novo assembly = genome construction

Before we start analysis
RNA: gene expression, ncRNA, alternative splicing
Metagenomics (bacteria, viruses): composition of organisms in the sample, genetic variants
ChIP sequencing (DNA-protein interactions)

Bioinformatics' starting point
Raw sequencing data: READS
Produced during base calling: signal-to-sequence conversion and assignment of base quality scores (fastq file)

Fastq file
• Consists of reads: biological sequences (each read represents 1 input molecule sequenced on the flowcell)
• A corresponding quality score for each base
• Phred score: the probability that the base call is an error, on a log scale (Q = -10 log10 P)
• Each score is stored as one ASCII character
• (Other formats: fasta + qual, csfasta + csqual, sff)
• Paired-end sequencing: 2 fastq files

example.fastq
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

NGS pipeline
Input: reads (fastq) -> Pre-alignment quality control and trimming (fastq) -> Alignment (mapping) to a reference (bam) -> Variant calling -> Output: variants -> Annotation (dbSNP, HGMD, COSMIC)

Quality control (FastQC)
FastQC report modules: Basic Statistics, Per base sequence quality, Per sequence quality scores, Per base sequence content, Per base GC content, Per sequence GC content, Per base N content, Sequence Length Distribution, Sequence Duplication Levels, Overrepresented sequences, Kmer Content
[Figure: FastQC screenshot comparing a bad and a good sequence file; plot "Quality scores across all bases (Illumina >v1.3 encoding)"]

Cleaning reads (Cutadapt)
• Adaptor trimming (miRNA)
• Quality trimming
• Length filtering
[Figure: structure details of the sequencing library: P5, Index, Rd1 Seq Primer, Sequence of Interest, Rd2 Seq Primer, Index, P7]
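The Phred encoding and the quality trimming and length filtering steps described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming the Sanger/Illumina 1.8+ ASCII offset of 33; the function names and the thresholds `min_q=20` and `min_length=15` are hypothetical, and the simple "cut trailing low-quality bases" rule shown here is not the algorithm Cutadapt itself uses.

```python
def phred_scores(quality, offset=33):
    """Convert a fastq quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def trim_3prime(seq, quality, min_q=20, offset=33):
    """Remove bases from the 3' end while their score is below min_q."""
    scores = phred_scores(quality, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


def clean_read(seq, quality, min_q=20, min_length=15):
    """Quality-trim a read, then drop it (return None) if it became too short."""
    seq, quality = trim_3prime(seq, quality, min_q)
    return (seq, quality) if len(seq) >= min_length else None
```

For example, `phred_scores("!5I")` gives the scores 0, 20 and 40, and a score of 20 corresponds to an error probability of 1 in 100.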
Alignment (mapping)
• Usually reads are mapped to a reference sequence (DNA/cDNA/16S/other sequence) to find the corresponding location and the differences (substitutions, insertions, deletions, inversions, etc.)
• Problem: too many sequences and references billions of bp long, hence the need for special algorithms (Burrows-Wheeler transform, hash table indexing)
• BWA, Bowtie, BFAST, SHRiMP (BAM format)

Example of read mapping
[Figure: genome browser screenshot of Illumina MiSeq reads mapped to a contig, showing a sudden coverage change]

Usually the alignment is not perfect: false positive indels and substitutions => need for local indel realignment
[Figure: read pileup around an indel, illustrating local realignment]
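To give a feel for the Burrows-Wheeler transform mentioned on the alignment slide, here is a toy sketch of the transform and of the backward search that counts exact occurrences of a read in a reference. It is an illustration under simplifying assumptions: the function names are hypothetical, the construction sorts all rotations (far too slow for a real genome), and real aligners such as BWA or Bowtie add compressed indexes and mismatch handling on top of this idea.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations ('$' marks the end)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact matches of pattern in the original text by backward search."""
    # first[c] = row of the first rotation that starts with character c
    first = {}
    for row, ch in enumerate(sorted(bwt_str)):
        first.setdefault(ch, row)

    def occ(ch, row):
        # occurrences of ch in the BWT before the given row (naive scan)
        return bwt_str[:row].count(ch)

    lo, hi = 0, len(bwt_str)          # current range of matching rotations
    for ch in reversed(pattern):      # extend the match one base to the left
        if ch not in first:
            return 0
        lo = first[ch] + occ(ch, lo)
        hi = first[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


reference = "ACGTACGT"                # hypothetical toy reference
index = bwt(reference)
hits = count_occurrences(index, "CGT")
```

The search touches the index once per base of the read, regardless of how long the reference is, which is why this family of indexes copes with references billions of bp long.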
Mapping, coverage reports
• Repeat alignment/other steps with different criteria?
• Important checkpoint for the lab protocol
• Specificity of PCR
• Settings of variant calling thresholds, CNV
• Target bed file (Browser Extensible Data)

chr1 127471196 127472363
chr1 127472363 127473530
chr1 127473530 127474697
chr1 127474697 127475864
chr1 127475864 127477031
(bed format)

De novo assembly: an alternative to mapping on a reference sequence
• To uncover unknown genomes/transcriptomes
• To detect large structural variants
[Figure: assembly workflow: reads -> assemble into contigs -> map reads to contigs -> assemble contigs into scaffolds (gaps shown as NNNNNN) -> gap filling; long-sequence clustering and assembly into unigenes]

REMOVE PCR DUPLICATES
Each read represents 1 input molecule.
THEORY: e.g. in DNA re-sequencing, 1 diploid cell is represented by 2 reads because of its 2 chromosomes.
BUT PCR is used to amplify the genetic material so that it can be analysed => after PCR, 1 input molecule from 1 cell can be represented by more reads => biased variant allele frequency

How to solve it?
1) Molecular barcodes (very new method)
2) Identity of the start-end positions of a read pair

Introduction of molecular barcodes during library preparation
[Figure: single molecule tagging (SMT) workflow: custom probes 1 and 2 with upstream and downstream targeting sequences; PCR rounds 1 to 3; clean-up to remove P5-SMT; addition of P5 and P7-index; additional rounds of PCR; indexed and single-molecule-tagged amplicon library ready for cluster generation and sequencing. Plot: duplicate cluster size (times an SMT was observed per target, log scale). Source: Smith et al. 2014]

Variant calling
Mutation types:
• Germline mutations
• Somatic mutations
• Substitutions
• Insertions
• Deletions
• Complex variants
• Inversions
• Large structural variations (translocations, indels)
• Copy number variations
[Figure: read pileup with consensus sequence and coverage track]

Experimental designs (also depend on the types of samples available):
• Normal only (genotyping)
• Tumor only (genotyping, somatic mutations)
• Tumor + related normal control
• Tumor + unrelated normal controls
• Tumor in time
• Family (rare diseases, genotyping)

Program algorithms:
• Bayesian statistics (MuTect, deepSNV)
• Fisher's exact test (VarScan, VarDict)
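As an illustration of the Fisher's exact test approach in a tumor + normal design, the sketch below compares reference and variant read counts between the two samples at one position. It is a simplified example, not the implementation of VarScan or VarDict; the function names and the read counts in the usage lines are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # hypergeometric probability of seeing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # sum the probabilities of all tables at least as extreme as the observed one
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))


def somatic_p_value(tumor_ref, tumor_alt, normal_ref, normal_alt):
    """Small p-value: the variant fraction differs between tumor and normal."""
    return fisher_exact_two_sided(tumor_ref, tumor_alt, normal_ref, normal_alt)


# Hypothetical read counts at one position
p_somatic = somatic_p_value(70, 30, 100, 0)    # variant reads only in the tumor
p_shared = somatic_p_value(95, 5, 96, 4)       # similar low level in both samples
```

A variant seen in 30 of 100 tumor reads and in none of 100 normal reads gives a very small p-value, while a handful of variant reads in both samples does not, which is consistent with background error rather than a somatic mutation.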
The callers give a p-value based on different features.
Options for many parameters and filters:
• Minimum coverage
• Variant allele frequency
• Base quality
• Genomic context (homopolymers)
• Position in read (errors at the read ends)
• Mapping quality
• Presence in both forward and reverse reads (strand bias)
[Figure: read pileup with consensus sequence and coverage track]

To distinguish a real mutation from an ERROR (library preparation, sequencing, alignment)
• Usually 1 approach is not enough => GATK to combine several variant callers (aligners) and different settings
• Specific pipeline for each type of mutation (SNV, indels, CNV, ...)
[Figure: overlap of SNVs called by different variant-calling pipelines, including SOAPsnp, SAMtools and SNVer]
O'Rawe, J. et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Medicine 5, 28 (2013).
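One simple way to picture combining several variant callers is to count how many callers report each variant and keep those supported by a minimum number of them. The sketch below is only an illustration of that idea in plain Python; it does not reproduce how GATK merges call sets, and the caller names, the variants and the `min_callers` threshold are hypothetical.

```python
from collections import Counter


def concordance(call_sets, min_callers=2):
    """Count caller support per variant and return the consensus calls.

    call_sets maps a caller name to its variants, each variant being a
    (chromosome, position, ref, alt) tuple.
    """
    support = Counter()
    for calls in call_sets.values():
        support.update(set(calls))        # each caller counts once per variant
    consensus = sorted(v for v, n in support.items() if n >= min_callers)
    return consensus, support


# Hypothetical call sets from three callers
calls = {
    "caller_a": [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")],
    "caller_b": [("chr1", 100, "A", "G"), ("chr2", 50, "G", "GA")],
    "caller_c": [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")],
}
consensus, support = concordance(calls, min_callers=2)
```

Raising `min_callers` trades sensitivity for specificity: requiring all three callers keeps only the first variant, while `min_callers=1` keeps the union of all calls.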
VCF file
Example
##fileformat=VCFv4.0
##fileDate=20100707
##source=VCFtools
##reference=NCBI36
##INFO=...
[Figure: annotated VCF example. The remaining header descriptions and the data records are not recoverable from the slide; the surviving labels are listed below.]
• Deletion (SVTYPE=DEL;END=300), large SV, other event
• FORMAT column examples: GT:DP, GT:GQ, GT:GQ:DP
• Phased data (G and C above are on the same chromosome)
• Reference alleles (GT=0)
• Alternate alleles (GT>0 is an index to the ALT column)

Visualization of genotypes by IGV
Input: vcf
[Figure: IGV screenshot of a multi-sample VCF]
1) Each bar across the top of the plot shows the allele fraction for a single locus.
2) The genotypes for each locus in each sample. Dark blue = heterozygous, cyan = homozygous variant, grey = reference.
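The GT encoding described on the VCF slide (0 = reference allele, values above 0 index into the ALT column, "|" for phased data) can be decoded with a short sketch like the one below. The function names and the example record are hypothetical and the code handles only the fields shown; for real files a dedicated VCF parser is the safer choice.

```python
def decode_genotype(ref, alt, gt):
    """Translate a GT string such as '0|1' or '1/2' into allele sequences."""
    alleles = [ref] + alt.split(",")      # index 0 = REF, 1 and up = ALT entries
    phased = "|" in gt                    # '|' = phased, '/' = unphased
    called = []
    for index in gt.replace("|", "/").split("/"):
        # '.' marks a missing call
        called.append(None if index == "." else alleles[int(index)])
    return called, phased


def parse_sample(format_field, sample_field):
    """Pair the keys of the FORMAT column with one sample's values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


# Hypothetical record: REF=A, ALT=G,T, FORMAT=GT:GQ:DP, sample=1|2:21:6
sample = parse_sample("GT:GQ:DP", "1|2:21:6")
alleles, phased = decode_genotype("A", "G,T", sample["GT"])
```

Here the sample carries the two alternate alleles G and T, and the "|" separator marks the genotype as phased.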
Annotation
From genomic coordinate to biological meaning
• Provide links to various databases (RefSeq, dbSNP, etc.)
• To distinguish significant variants from non-significant ones (synonymous vs. non-synonymous, gene, exon, intron, cDNA, codon, transcript, frequency in the population, presence in other diseases, ...)
• Annotation sources: RefSeq, dbSNP, regulation, comparative genomics, repeats, functional annotation, Gene Ontology, etc.
• Databases: dbSNP, HGMD, COSMIC

Sensitivity & specificity as a matter of:
• Experiment design (library preparation + NGS technology + number of samples + amount of data)
• Data processing (pre-processing + alignment + variant calling + annotations + filtering)

Courses
http://meetings.embo.org/event/17-genome
http://www.embo.org/events/practical-courses