Analysis of Sequencing Data (Illumina NGS technology)
Marek Mráz
Assistant Professor of Oncology
Group leader at CEITEC MU and Univ. Hospital Brno
9/2023

Today
•DNA... rules them all?
•PCR/Sanger
•DNA NGS: principles
•RNA NGS: principles
•NGS applications in general
•Examples in cancer research

Illumina platform?
•Library
•Adaptor
•Index
•Barcode
•Read
•Flowcell
•Sequencing by synthesis
•T4 ligase

Central Dogma
[Figure: central dogma]

DNA Has Two Jobs
•It serves as a store of information
•It directs the synthesis of proteins
[Figure: DNA helix]
Before we launch into DNA sequencing, it is important to spend a moment reviewing why DNA is important and what sequencing it can tell us.

•What are the other types of nucleic acids?
DNA: nucleus, mitochondria
RNA: mRNA, rRNA, tRNA, snoRNA, miRNA, lncRNA, ...
All RNAs can be converted to DNA. We always work with and sequence DNA: either DNA or cDNA.
•With genomic DNA we are interested in the sequence: mutations, SNPs, CNVs, translocations
•With RNA we are interested in other things... Like?
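The point above that RNA is always converted to DNA (cDNA) before sequencing can be illustrated with a small Python sketch. This is a toy string model only, not a lab protocol; the function names and the example sequence are made up for illustration.

```python
# Toy model: DNA -> mRNA (transcription) -> cDNA (reverse transcription).
# All names and the example sequence are hypothetical.

DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string, written 5'->3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe(coding_strand: str) -> str:
    """mRNA has the sequence of the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def reverse_transcribe(rna: str) -> str:
    """First-strand cDNA is the reverse complement of the RNA, as DNA."""
    return reverse_complement(rna.upper().replace("U", "T"))


if __name__ == "__main__":
    coding = "ATGGCC"
    mrna = transcribe(coding)
    first_strand = reverse_transcribe(mrna)
    second_strand = reverse_complement(first_strand)
    print(mrna, first_strand, second_strand)
```

The second cDNA strand has the same sequence as the original coding strand, which is why sequencing cDNA tells us what the RNA looked like.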
DNA Sequencing
•You have 3 billion bases
•~20,000 genes
Gene (DNA) gives rise to mRNA [Figure example: collagen]

PCR
•Mix DNA with dNTPs and primers
•Amplify with DNA polymerase
DNA has orientation, so PCR needs a primer

Sanger seq

Sanger Sequencing
•Advantages
§Long reads (~900 bp)
§Suitable for small projects
•Disadvantages
§Low throughput
§Expensive (cost per base)

Next Generation Sequencing
•Takes advantage of miniaturization to engage in massively parallel analysis
–Essentially carrying out millions of sequencing reactions simultaneously, one in each of 10 million tiny wells/spots
•Sophisticated computer analysis of huge amounts of information allows “assembly” of a given sequence
10 million tiny wells in a single machine, each of which…

Massively Parallel Seq workflow (starting from a sample)
1) Library preparation
2) Cluster generation on a flow cell
3) Sequencing & imaging
4) Data processing & analysis
SE (single-end) or PE (paired-end) reads, 50-250 bases (MiSeq)
This is the trick

High Parallelism is Achieved in Polony Sequencing
[Figure: polony vs. Sanger]
•Polony sequencing refers to all commercial technologies except for Helicos.
•Polony sequencing takes place on an array of polonies, in which all amplicons of the same DNA fragment are clustered together in the same region of the array. These groups of amplicons were termed polonies, short for polymerase colonies.
•The degree of parallelism that can be achieved through Sanger sequencing is only a fraction of what can be achieved in polony sequencing

Sequencing by synthesis
[Figure: template and growing strand (5’ to 3’), first base incorporated, excitation and emission]
Cycle 1: Add sequencing reagents, detect signal, cleave terminator and dye
Cycle 2-n: Add sequencing reagents and repeat

Purpose: Graphic support to explain Solexa technology.
Details: There are two features of the Solexa technology for genome analysis that are unique on the market: template clusters (in the flow cell) and reversibly terminated, individually fluorescently labeled nucleotides for sequencing by synthesis. The steps for sequencing by synthesis supported by this slide are:
•Hybridize the sequencing primer (complementary to the terminal adaptor).
•Add polymerase and all 4 dNTPs. Each dNTP is individually fluorescently labeled and is reversibly blocked at the 3’ hydroxyl. The 3’ block ensures that only a single base is added.
•Once the single base addition occurs, the clusters can be excited by a laser and the color of the added base can be imaged and determined. For every cluster, in cycle 1, the first base is added, turning the entire cluster the color of that base. In the example shown in this cartoon a “T” is added, turning this cluster “green”.
•Once the first base has been imaged and determined, the fluorescent label is cleaved and the 3’ blocking group is removed, regenerating a 3’-OH and enabling extension.
•Add polymerase and all 4 dNTPs again, and single base extension will occur for the second base in the template (in this cartoon the 2nd base is a “G”).
•The clusters can be excited by a laser and the color of the 2nd base is imaged and determined (in this cartoon the cluster color would be “blue”).
•Repeating this process allows us to determine the sequence of the original template step by step.
Conclusion: Explanation of technology behind sequencing by synthesis

Sequencing by Synthesis - Fluorescently Labeled Nucleotides (Illumina)
Complementary strand elongation: DNA polymerase
•Solexa sequencing uses DNA polymerase for elongation of the complementary strand; in each cycle a single fluorescently labeled nucleotide is added.
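The cycle described in the notes above (add one blocked, labeled base; image; cleave; repeat) can be sketched as a toy Python simulation of a single cluster. This is only an illustration, not Illumina software: the function name is hypothetical, and the dye colors for A and C are assumptions (the notes only mention T as "green" and G as "blue").

```python
# Toy simulation of sequencing by synthesis on one cluster.
# Hypothetical names; colours for A and C are assumed, not from the slides.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
DYE_COLOUR = {"T": "green", "G": "blue", "A": "yellow", "C": "red"}


def sequence_by_synthesis(template, cycles):
    """Return (read, colours) for a template given in 3'->5' order,
    i.e. in the order the polymerase moves along it."""
    read = []
    colours = []
    for cycle in range(min(cycles, len(template))):
        # The 3' block means exactly one labelled nucleotide is added.
        base = COMPLEMENT[template[cycle].upper()]
        # Laser excitation: the cluster emits the colour of the added base.
        colours.append(DYE_COLOUR[base])
        # Base call. Cleaving the dye and the 3' block is implicit here:
        # the next loop iteration is free to extend by one more base.
        read.append(base)
    return "".join(read), colours


if __name__ == "__main__":
    read, colours = sequence_by_synthesis("ACGTT", cycles=3)
    print(read, colours)
```

The number of cycles sets the read length: each imaging cycle yields exactly one base per cluster.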
•Read length is limited by the efficiency of cleavage of the blocking moiety and removal of the fluorescent (FL) labels

Index -> video
•https://www.youtube.com/watch?v=womKfikWlxM

RNA-Seq
•The general experimental procedure for RNA
Transcriptome = sum of all RNA (mRNA, rRNA, tRNA and noncoding RNA)
•Blunt end, adenylation

The general experimental procedure for miRNA
•Strict QC of starting material
•Appropriate quantification
•Gel images, Bioanalyzer traces
•Which carrier was used: salmon sperm DNA, yeast RNA ☹, linear acrylamide ☺
•How to get rid of rRNA...

Library preparation
–Fragmentation: Covaris, enzymes; for RNA, ions + heat
–Size selection: gel vs beads
[Figure: E-gel of GAPDH, no DNase vs. after DNase] Simon 2013

APPLICATIONS: NGS is good for many things
Genome, Exome, Amplicon, Transcriptome

Applications
•De novo genome assembly
•Genome re-sequencing:
–SNV = single nucleotide variants (mutation/SNP)
–CNV = copy number variation (duplication/deletion)
–Structural aberration (translocation/inversion)
•RNA-Seq (gene expression, exon-intron structure, small RNA profiling, and mutation)
•ChIP-Seq (protein-DNA interaction)
•Epigenetic profiling

Whole Genome Sequencing
•You sequence all of that, including the “junk”
1. De novo assembly: using the overlap of the reads to assemble a genome; needs good coverage
2. Re-sequencing: mapping to your reference genome... you need to have one
Whole-genome sequencing from ~$10,000 (research from $4,000)
Most often used to identify novel or rare mutations and chromosomal rearrangements, and to find new potential therapeutic targets

WES = whole exome sequencing
•You sequence only the coding regions of genes, the exons (approx.
2% of the genome)
•Effective and cheap
•Probably the most widely used

Targeted sequencing
•You already know the exact gene
•And you want to screen
•You are typically looking for a causative mutation that you know in advance can be there
•Cheap and fast. Good for detection of small clones using high coverage (but polymerase makes mistakes)

RNA sequencing
•Detection of expression levels: counting reads that map
•Somatic mutations (of expressed genes)
•Gene fusions
•Alternative splicing
•ncRNA... a whole new universe

Alternative Splicing Generates Distinct Proteins in Different Tissues
[Figure: a gene (promoter, exons, introns, terminator) spliced into transcript mRNA-1 or, by alternative splicing, into transcript mRNA-2]

Discovering noncoding RNAs
•The presence of ncRNAs in a genome is difficult to predict computationally with high certainty because of their evolutionary diversity
•Most have unknown function
Zeni and Mraz, 2020

Elucidating DNA-protein interactions through chromatin immunoprecipitation sequencing
•DNA-protein interactions are a key part of regulating gene expression
•ChIP: technique to study DNA-protein interactions
•Readout of ChIP-derived DNA sequences on NGS platforms
•Insights into transcription factor/histone binding sites in the human genome
[Figure: ChIP-seq]

Epigenomic variation
•Enables the study of genome-wide patterns of methylation and how these patterns change through the course of an organism’s development, cancer, etc.
Bisulfite conversion + NGS:
•Conversion C → U; methylated C is not changed
•Identification of methylated bases

Metagenomics
•Examples: ocean, acid mine site, soil, coral reefs, human microbiome, which may vary according to the health status of the individual
[Figure: metagenomics process]
Kahvejian et al.
2008

Integrating Omics
•mRNA expression
•Alternative splicing
•microRNA expression
•Protein-DNA interaction
•Mutation discovery
•Copy number variation
•National Cancer Institute (NCI): The Cancer Genome Atlas (TCGA)
•International Cancer Genome Consortium (ICGC): Cancer Genome Project; genome, transcriptome and epigenome in the 50 most common tumors

Examples of NGS applications in Oncology
•Molecular diagnostics: mutations (known, novel and subclonal)
•RNA-seq: new fusion genes
–EML4-ALK fusion in lung cancer
–TMPRSS2-ERG translocation in prostate cancer (Dong 2012)
•microRNA expression, gene patterns
Recurrent mutations: identical mutations detected repeatedly in a large number of samples
•Identification of germline mutations (WES):
–Familial pancreatic cancer (PALB2)
–Hereditary pheochromocytoma (MAX)
–Familial melanoma (MITF)
–... screening of large cohorts/families
•Targeted sequencing
–BRCA1 mutations associated with breast and ovarian cancer (difficult to detect by Sanger) (Walsh 2010)
–... good for huge genes

Hematooncology
•First genome of a cancer patient (WGS, 2008): normal cells vs AML cells → 8 new somatic mutations (Ley, Nature 2008)
Wang 2011
•Identification of novel recurrently mutated genes by WES... AML, CLL
•Clonal evolution in cancer: AML (WGS) Ding 2012
•Subclonal architecture of your tumor
Landau 2013
•... including new therapeutic targets
Jardin 2014

Thank you for your attention
•In summary: there is a whole new universe in front of you... one that nobody has ever seen
•New technologies: https://nanoporetech.com/how-it-works
•Marek Mraz
•CEITEC and University Hospital Brno
•marek.mraz@email.cz