CEITEC - Central European Institute of Technology, Brno, Czech Republic

Modern Methods of Genome Analysis
Bioinformatics I
Mgr. Nikola Tom
Brno, 13.11.2017

European Union, European Regional Development Fund: Investing in Your Future. OP Research and Development for Innovation.

Bioinformatics
• A fairly new field (first NGS in 2005)
• Intersection of biology, computer science and statistics
• AIM: clean the data and give them biological meaning
• NGS data analysis = bottleneck of NGS

Bioinformatics - SOLUTION 1
• Commercial software and ready-to-use pipelines
• BUT their settings are usually not transparent and/or they offer too few options (good programs are expensive)
• Examples: CLC bio (a QIAGEN company), SOPHiA GENETICS

Bioinformatics - SOLUTION 2
• Command-line based tools/software
• Each tool solves only a part of the analysis
• Need to set up the pipeline and tune the programs' parameters (challenging, but more precise)

Bioinformatics
Modern laptop or PC might be enough... but bigger computer = better

Before we start the analysis
We have to know what we are dealing with and what we want to find out.
Choice of programs and settings heavily depends on the type of experiment, library preparation and biological question.

Concept of the project: DNA / RNA / epigenomics / metagenomics...

DNA
• Targeted sequencing - amplicons, gene panels, whole exomes (target enrichment methods - PCR, ligation...)
• Whole genome sequencing - finding differences from a known reference genome = re-sequencing
• De novo assembly - genome (re)construction

Before we start analysis
• RNA - gene expression, miRNA, ncRNA, alternative splicing
• Metagenomics (bacteria, viruses) - composition of the microorganisms in the sample, genetic variants
• Epigenomics - DNA-protein interactions, methylation

Bioinformatics' starting point
Raw sequencing data - READS
Produced during base calling: conversion of the signal (fluorescence, electric current) into sequence and assignment of base quality scores (fastq file)

Fastq file
• Consists of reads - biological sequences (each read represents 1 input molecule sequenced on the flowcell)
• Paired-end sequencing: 1 molecule = 2 reads = 2 fastq files (R1, R2)
• Corresponding quality score for each base
• Phred score - probability that the base call is an error, on a log scale (Q = -10 log10 P)
  Q10 = 1 in 10 = 90% base accuracy
  Q20 = 1 in 100 = 99% base accuracy
  Q30 = 1 in 1 000 = 99.9% base accuracy
  Q40 = 1 in 10 000 = 99.99% base accuracy
• Each quality score is stored as one ASCII character

example.fastq
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

[Figure: ASCII table - decimal, hexadecimal and character columns for codes 0-127]

NGS pipeline - DNA re-sequencing
Input: reads (fastq) + reference -> Pre-alignment quality control & trimming (fastq) -> Alignment (mapping) (bam) -> Variant calling (variants) -> Annotation (dbSNP, HGMD, COSMIC) -> Output

Pre-alignment quality control & trimming
Cleaning reads (Cutadapt, Trimmomatic)
• Adaptor trimming
• Quality trimming
• Length filtering
[Figure: library structure details - labels: P5, Rd1 Seq Primer, Sequence of Interest, Index Seq Primer, INDEX, P7]

Alignment (Mapping)
• Mapping reads onto a reference sequence (organism genome or part of the genome) - to find the corresponding location & differences (substitutions, insertions, deletions, inversions, etc.)
• Problems with:
  - too many sequences
  - references billions of bp long
  - non-perfect matches between reads and reference
  => need for special algorithms (Burrows-Wheeler transform, hash table indexing)
• Tools: BWA, Bowtie2, Bfast, SHRiMP (SAM/BAM/CRAM format)
[Figure: example of read alignment]

Usually alignment is not perfect - false positive indels & substitutions => need for local indel realignment
[Figure: read alignments over a repetitive region, illustrating local indel realignment]

Remove PCR duplicates
Each read represents 1 input molecule.
THEORY: In case of DNA re-sequencing, 1 diploid genome (1 cell) is represented by 2 reads because of 2 chromosomes.
BUT there is a PCR before the sequencing => 1 input molecule from 1 cell could be represented by more reads - PCR duplicates => biased variant allele frequency (EXAMPLE...)
How to solve it?
1) Molecular barcodes (very new method)
2) Identity of start-end positions of read pair (not suitable for amplicons)

Introduction of molecular barcodes during library preparation
[Figure: PCR rounds with custom probes (upstream and downstream targeting sequences) adding single molecule tags (SMT), clean-up, then addition of P5 and P7-index, giving an indexed and single-molecule-tagged amplicon library ready for cluster generation and sequencing; plot of duplicate cluster size (times SMT observed per target, log scale). Smith et al. 2014]

Variant calling
Mutation types:
• Germline mutations
• Somatic mutations
• Substitutions
• Insertions
• Deletions
• Complex variants
• Inversions
• Large structural variations (translocations, indels)
• Copy number variations
[Figure: read pileup with consensus and coverage tracks]

AIM of variant calling - to identify true variants and distinguish them from errors (library preparation, sequencing, alignment)
Experimental designs (also depends on the types of samples available):
• Normal only (genotyping)
• Tumor only (genotyping, somatic mutations)
• Tumor + related normal control
• Tumor collected over time
• Family (rare diseases, genotyping)

Program algorithms:
• Bayesian statistics (MuTect, deepSNV)
• Fisher exact test (VarScan, VarDict)
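The Fisher exact test mentioned above can be illustrated on read counts from a tumor and its matched normal control. The following is a minimal sketch in plain Python, not the implementation used by VarScan or VarDict; the function name and the read counts are hypothetical and chosen only for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum probabilities of all tables that are at most as likely as the observed one
    # (small relative tolerance guards against floating-point ties)
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Hypothetical read counts at one position: (variant reads, reference reads)
tumor = (12, 38)
normal = (0, 50)
p_value = fisher_exact_two_sided(tumor[0], tumor[1], normal[0], normal[1])
print(f"p = {p_value:.2e}")
```

A small p-value here suggests that the variant reads are enriched in the tumor compared with the normal sample, which is the kind of evidence a caller can use to separate a somatic variant from background error.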
The callers give a p-value based on different features.
Options for many parameters & filters:
• Minimum coverage
• Variant allele frequency
• Base quality
• Genomic context (homopolymers)
• Position in read (errors at the read ends)
• Mapping quality
• Presence in both forward and reverse reads (strand bias)
[Figure: read pileup with consensus and coverage tracks]

Usually 1 approach is not enough => combine more variant callers (aligners) & different settings
Specific pipeline for each type of mutation (SNV, indels, CNV...)
[Figure: comparison of variant-calling pipelines - labels: GATK, GNUMAP, SOAPsnp, SAMtools, SNVer]
O'Rawe, J. et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Medicine 5, 28 (2013).
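Some of the filters listed above (base quality, minimum coverage, variant allele frequency, presence on both strands) can be sketched as follows. This is only an illustration: the `BaseCall` structure, the function names and all thresholds are hypothetical, and the quality characters are assumed to use the Phred+33 ASCII encoding.

```python
from dataclasses import dataclass

PHRED_OFFSET = 33  # assumption: Phred+33 encoding of quality characters


@dataclass
class BaseCall:
    base: str       # base observed in one read at the position of interest
    qual_char: str  # ASCII-encoded Phred quality of that base
    reverse: bool   # True if the read maps to the reverse strand


def phred(qual_char):
    """Convert one ASCII quality character to its Phred score."""
    return ord(qual_char) - PHRED_OFFSET


def passes_filters(calls, alt, min_coverage=20, min_vaf=0.05, min_base_quality=20):
    """Apply base quality, coverage, allele frequency and strand filters."""
    # Base quality: ignore low-quality base calls entirely
    good = [c for c in calls if phred(c.qual_char) >= min_base_quality]
    coverage = len(good)
    if coverage < min_coverage:
        return False
    # Variant allele frequency among the remaining calls
    alt_calls = [c for c in good if c.base == alt]
    if len(alt_calls) / coverage < min_vaf:
        return False
    # Strand bias: require the variant on both forward and reverse reads
    return {c.reverse for c in alt_calls} == {True, False}


# Hypothetical pileup: 30 reference reads and 10 variant reads, all Q40 ('I')
ref_reads = [BaseCall("A", "I", i % 2 == 0) for i in range(30)]
both_strands = [BaseCall("G", "I", i < 4) for i in range(10)]
forward_only = [BaseCall("G", "I", False) for _ in range(10)]

print(passes_filters(ref_reads + both_strands, "G"))
print(passes_filters(ref_reads + forward_only, "G"))
```

Real variant callers expose many more options, and suitable thresholds depend on the experiment design, so the defaults above should be read as placeholders.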
VCF file
Example header lines:
##fileformat=VCFv4.0
##fileDate=20100707
##source=VCFtools
##reference=NCBI36
[Figure: annotated VCF example - remaining ##INFO and ##FORMAT header lines (e.g. genotype quality as a phred score, likelihoods for RR,RA,AA genotypes where R=ref and A=alt, type of structural variant, end position of the variant) and data rows with FORMAT fields such as GT:DP, GT:GQ and GT:GQ:DP for each sample. Callouts: Deletion; Large SV (SVTYPE=DEL;END=300); Other event; Phased data (G and C above are on the same chromosome); Reference alleles (GT=0); Alternate alleles (GT>0 is an index to the ALT column)]

Visualization of variants and alignments (IGV)

Annotation
• From genomic coordinate to biological meaning
• Provides links to various databases (RefSeq, dbSNP, etc.)
• Distinguishes significant variants from non-significant ones (synonymous vs. non-synonymous, gene, exon, intron, cDNA, codon, transcript, frequency in population, presence in other diseases...)
• Annotation sources: RefSeq, dbSNP, regulation, repeats, functional, gene ontology, etc.

Sensitivity & specificity as a matter of:
• Experiment design (library preparation + NGS technology + number of samples + amount of data)
• Data processing (pre-processing + alignment + variant calling + annotations + filtering)

Courses
http://www.embo.org/events/events-calendar
http://www.embo.org/events/practical-courses
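Returning to the VCF example: the GT field stores allele indices, where 0 is the reference allele and values above 0 index into the ALT column, and "|" marks phased data. A minimal sketch of decoding it is shown below; the function name and the example values are hypothetical, and a real project would more likely rely on an existing VCF parsing library.

```python
import re


def decode_genotype(ref, alt, gt):
    """Translate a VCF GT string into allele sequences.

    Returns (alleles, phased); a missing allele (".") is returned as None.
    """
    # Index 0 is REF, indices 1..n are the comma-separated ALT alleles
    alleles = [ref] + alt.split(",")
    phased = "|" in gt
    indices = re.split(r"[/|]", gt)
    called = [None if i == "." else alleles[int(i)] for i in indices]
    return called, phased


# Hypothetical records: (REF, ALT, GT)
print(decode_genotype("A", "G,T", "1/2"))  # unphased, two different ALT alleles
print(decode_genotype("C", "T", "0|1"))    # phased heterozygous genotype
print(decode_genotype("T", "C", "0"))      # haploid call, e.g. on a sex chromosome
```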