CEITEC - Central European Institute of Technology
Brno | Czech Republic

Modern methods of genome analysis - analysis
Mgr. Nikola Tom

Before we start the analysis
We have to know what we are dealing with and what we want to find out.
• Concept of the project
• DNA / RNA / methylation / ...
• DNA
  - Targeted sequencing (amplicons, gene panels, exomes)
  - Whole genome sequencing
  - Finding differences from a known reference genome = re-sequencing
  - De novo assembly - genome construction

Before we start the analysis...
• RNA - gene expression, alternative splicing
• Metagenomics (bacteria, viruses) - their composition, variants
• ChIP sequencing (DNA-protein interactions)

Library preparation - example of a DNA library
[Figure: PCR cycle 0 (extension to fill the 3' ends), then PCR amplification, giving sequencing-ready DNA fragments with Read 1 and Read 2 sites. The source citation in the figure is illegible apart from the year 2013.]

Library preparation - example of an mRNA library
1) PolyA+ RNA captured
2) RNA fragmented and primed
3) First strand cDNA synthesized
4) Second strand cDNA synthesized
5) 3' ends adenylated and 5' ends repaired
6) DNA sequencing adapters ligated
7) Ligated fragments PCR amplified
(Figure labels: Barcode, Rd1, Rd2, Index)

Bioinformatics
• Bioinformatics is quite a new field (first NGS in 2005)
• How to analyse the data derived from NGS = the bottleneck of NGS
• There are many tools for NGS data analysis
• Most of the tools are command-line based
• No tool works perfectly
• Each tool solves only a piece of the cake
• No single tool can perform the analysis from the very beginning to the end => a pipeline has to be set up

Bioinformatics
• Exception: commercial software and ready-to-use pipelines, BUT they usually have non-transparent settings and/or too few options
• The analysis depends heavily on the type of experiment, library preparation and project
• A laptop or PC is usually not enough; a cluster is needed

[Figure: overview of the analysis pipeline; only the label "annotation" is legible]

Base calling
• Signal-to-sequence conversion and assignment of base quality scores (fastq file)
• Phred score - log-scaled probability that a base call is an error: Q = -10 log10(P), so Q20 means a 1 % and Q30 a 0.1 % error probability

fastq
• Consists of reads - biological sequences (each read represents 1 input molecule sequenced on the flowcell)
• Corresponding quality score for each base
• Each score is stored as one ASCII character
• (Other formats: fasta + qual, csfasta + csqual, sff)
• Paired-end sequencing - 2 fastq files

  @SEQ_ID
  GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
  +
  !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Quality control (FastQC)
FastQC report modules: Basic Statistics, Per base sequence quality, Per sequence quality scores, Per base sequence content, Per base GC content, Per sequence GC content, Per base N content, Sequence Length Distribution, Sequence Duplication Levels, Overrepresented sequences, Kmer Content
[Figure: FastQC window with bad_sequence.txt and good_sequence_short.txt loaded; plot "Quality scores across all bases (Illumina >v1.3 encoding)"]

Cleaning reads (Cutadapt)
• Adaptor trimming (miRNA)
• Quality trimming
• Length filtering
[Figure: structure details of a library fragment - Rd1 Seq Primer, Index Seq Primer, index, sequence of interest]

Read mapping (alignment)
• Usually reads are mapped to a reference sequence (DNA / cDNA / 16S / other sequence) to find the corresponding location and differences
• Problem: too many sequences and references billions of bp long - special algorithms are needed (Burrows-Wheeler transform, hash table indexing)

Mapping of DNA reads
• To an existing DNA reference sequence (available for many organisms)
• To find substitutions, insertions, deletions, inversions, etc. Precisely!
  - BWA, Bowtie, BFAST, SHRiMP

Example of DNA re-sequencing
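As a toy illustration of re-sequencing, the following minimal Python sketch stacks already-aligned reads on a reference, counts the coverage per position and lists the positions where reads disagree with the reference. It is only a sketch under simplifying assumptions: alignments are gapless and given as (start, sequence) pairs, and the function name `pileup`, the short reference and the three reads are made up for this example. Real aligners such as BWA or Bowtie also deal with indels, base qualities and mapping qualities.

```python
from collections import Counter


def pileup(reference, alignments):
    """Stack gapless alignments on a reference.

    alignments: list of (start, read_sequence) with 0-based start positions.
    Returns (coverage, differences), where coverage is a list with one depth
    value per reference position and differences is a list of tuples
    (position, reference_base, read_base, read_base_count, depth).
    """
    columns = [Counter() for _ in reference]
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):  # ignore bases hanging over the ends
                columns[pos][base] += 1

    coverage = [sum(column.values()) for column in columns]
    differences = []
    for pos, (ref_base, column) in enumerate(zip(reference, columns)):
        for base, count in column.items():
            if base != ref_base:
                differences.append((pos, ref_base, base, count, coverage[pos]))
    return coverage, differences


# Hypothetical toy data: two of the three reads carry a C instead of G at position 9.
reference = "TATATTTAAGATGTTTTGCC"
reads = [(0, "TATATTTAAG"), (4, "TTTAACATGT"), (8, "ACATGTTTTG")]
coverage, differences = pileup(reference, reads)
print(coverage)
print(differences)
```

In this toy case the candidate substitution is supported by 2 of 3 reads; whether such a difference is reported as a variant depends on the thresholds chosen later in variant calling.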
[Figure: alignment viewer screenshot of unpaired Illumina MiSeq reads (unpaired_illumina_miseq) stacked on contig 44, around positions 74,300 to 74,360, with a coverage track. Annotations mark a sudden coverage change and unaligned read ends. Reference sequence shown: tatatttaagatgttttgcctgaaaagtgagcgaacgataaagtttttataatttcctcttgtcaggccggaataactccc]

Mapping of RNA reads - alternative splicing
Reads can span exon junctions - mRNA splicing
[Figure: a gene (DNA) with several exons is alternatively spliced into different mRNAs, which are translated into Protein A and Protein B]

Mapping of RNA reads
• To measure gene expression OR alternative splicing
• To an existing DNA reference sequence
  - To find alternative splicing
  - More tricky, complicated, slower - TopHat (de novo splice aligner)
• To transcriptome reference sequences
  - Reads can map to multiple transcripts (shared exons)
  - Easier, faster, no need for special aligners - BWA
• To miRNA sequences - miRBase
  - Grouping and annotation against miRBase

De novo assembly
• To uncover unknown genomes/transcriptomes
• To detect large structural variants
[Figure: reads -> assemble -> contigs -> map reads to contigs -> assemble contigs into scaffolds (gaps shown as NNNNNN) -> gap filling -> long sequence]
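The "reads -> assemble -> contigs" step of the scheme above can be sketched with a greedy overlap merge. This is a deliberately naive toy, not how production assemblers work (those typically build overlap or de Bruijn graphs and have to cope with sequencing errors, repeats and both strands). The function names `overlap` and `greedy_assemble`, the `min_len` parameter and the reads are assumptions made for this illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Stops when no pair overlaps by at least min_len bases and returns the
    remaining sequences (the "contigs").
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical error-free reads from one strand of a short sequence.
print(greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATC"]))
```

Reads without a sufficient overlap stay as separate contigs, which is the situation the scaffolding and gap-filling steps then have to resolve.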
Cluster and assembly - Unigene

SAM/BAM
Headers:
  @HD VN:1.0 SO:unsorted
  @SQ SN:gi|110640213|ref|NC_008253.1| LN:4938920
  @PG ID:bowtie2 PN:bowtie2 VN:2.1.0
Alignments (the last two records are cut off in the source screenshot):
  gi|110640213|ref|NC_008253.1|_418_952_1:0:0_1:0:0_0/1 0 gi|110640213|ref|NC_008253.1| 418 42 70M * 0 0 CCAGGCAGTGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCATCTGGTAGCGATGAT 2222222222222222222222222222222222222222222222222222222222222222222222 AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:8G61 YT:Z:UU
  gi|110640213|ref|NC_008253.1|_31_476_0:0:0_0:0:0_1/1 16 gi|110640213|ref|NC_008253.1| 407 42 70M * 0 0 GGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCATCTG 2222222222222222222222222222222222222222222222222222222222222222222222 AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:70 ...
  gi|110640213|ref|NC_008253.1|_210_743_2:0:0_1:1:0_2/1 0 gi|110640213|ref|NC_008253.1| 210 42 70M * 0 0 CATTACCACCACCATCACCATTACCACAGGAAACGGTGCGGGCTGACGCGTACAGGAAACACCGAAAAAA 2222222222222222222222222222222222222222222222222222222222222222222222 AS:i:-6 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:30T31A7 ...
Each row describes a single alignment of a raw read against the reference genome. Each alignment has 11 mandatory fields, followed by any number of optional fields.

Mapping, Coverage reports
• Repeat alignment/other steps with different criteria?
• Important check of the lab protocol
  - Specificity of PCR
• Settings of variant calling threshold, CNV

[Figure: read alignment view, rotated in the source and not recoverable as text; it appears to show the same reads before and after indel realignment]
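To make the "11 mandatory fields, followed by any number of optional fields" of a SAM alignment row concrete, here is a minimal Python sketch that splits one tab-separated record into named fields. The field names and the two FLAG bits used (0x4 unmapped, 0x10 reverse strand) follow the SAM format; the function name `parse_sam_line` and the shortened example record are made up for this illustration. For real work an established library is the safer choice than hand-written parsing.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line starting with '@')."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs 11 mandatory fields")

    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in INT_FIELDS:
        record[key] = int(record[key])

    # Optional fields have the form TAG:TYPE:VALUE, e.g. NM:i:1 or MD:Z:8G61.
    tags = {}
    for item in cols[11:]:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    record["tags"] = tags

    # Two commonly inspected FLAG bits.
    record["unmapped"] = bool(record["FLAG"] & 0x4)
    record["reverse_strand"] = bool(record["FLAG"] & 0x10)
    return record


# Hypothetical, shortened record (10 bases instead of 70).
line = "read1\t16\tref\t407\t42\t10M\t*\t0\t0\tGGAAAGCAAT\t2222222222\tAS:i:0\tNM:i:0\tMD:Z:10"
rec = parse_sam_line(line)
print(rec["POS"], rec["CIGAR"], rec["reverse_strand"], rec["tags"])
```

FLAG 16, as in the second record of the slide, means the read aligned to the reverse strand.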
Indel realignment
• The alignment is usually not perfect - false positive indels & substitutions
=> Need for local indel realignment

After PCR, 1 input molecule from 1 cell can be represented by several reads
=> Biased variant allele frequency
How to solve it?
1) Molecular barcodes (very new method)
2) Identity of the start-end positions of a read pair

Molecular barcodes
[Figure: custom probes 1 and 2 with upstream and downstream targeting sequences; PCR rounds 1 to 4; clean-up to remove P5-SMT; addition of P5 and P7 index; additional rounds of PCR; result: indexed and single-molecule-tagged (SMT) amplicon library ready for cluster generation and sequencing. Panel B: duplicate vs. unique reads per sample, and duplicate cluster size (times an SMT was observed per target, log scaled). Smith et al. 2014]

VCF
[Figure: annotated example VCF file. Optional header lines (metadata about the annotations in the VCF body): ##INFO, ##FORMAT, ##ALT. Body columns include #CHROM, POS, ID, QUAL, FILTER, INFO, FORMAT and the per-sample columns (SAMPLE1, ...). Four records on chromosome 1 at positions 1, 2, 5 and 100, with FORMAT values GT:DP, GT:GQ, GT:GQ and GT:GQ:DP and SAMPLE1 values 1/2:13, 0|1:100, 1|0:77 and 1/1:12:3. INFO values shown: H2;AA=T and SVTYPE=DEL;END=300. The remaining cells are not fully legible. Annotations: deletion, insertion, large SV, other event, reference alleles (GT=0), alternate alleles (GT>0 is an index to the ALT column), phased data (G and C above are on the same chromosome).]

DNA-Seq variant calling
• Tumor only (amplicon sequencing & diagnostics)
• Tumor & normal (exome sequencing)
  - to do variant calling and genotyping more precisely (somatic vs. germline mutations)
• Another option is to analyse a tumor vs. a group of tumors
Application of many statistical tests:
• negative beta-binomial test
• Bayesian statistics
• Fisher's exact test
The higher the coverage, the higher the sensitivity and specificity (but only up to a limit)
More about statistics and RNA sequencing in the next courses
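As an illustration of how Fisher's exact test can be applied to a tumor & normal pair, the sketch below compares the alternate and reference read counts at one position in both samples. It is a minimal one-sided version written with the standard library only. The function name `fisher_exact_greater` and the read counts are made up for this example, and actual variant callers combine such tests with further filters and error models.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Here: a = tumor alt reads, b = tumor ref reads,
          c = normal alt reads, d = normal ref reads.
    Returns the probability of seeing at least `a` alt reads in the tumor
    if alt reads were distributed between the samples purely by chance
    (hypergeometric null distribution with fixed margins).
    """
    row1 = a + b          # all tumor reads
    col1 = a + c          # all alt reads
    n = a + b + c + d     # all reads
    total = comb(n, col1)
    p = 0.0
    for x in range(a, min(row1, col1) + 1):
        p += comb(row1, x) * comb(n - row1, col1 - x) / total
    return p


# Hypothetical counts: 12 of 50 tumor reads and 0 of 50 normal reads show the variant.
p_value = fisher_exact_greater(12, 38, 0, 50)
print(f"p = {p_value:.2e}")
```

With the same allele fraction but only a tenth of the coverage, the counts would be too small to give a convincing p-value, which is one way to see why higher coverage improves sensitivity.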