NGS data analysis introduction
Bi7420: Moderní metody pro analýzu genomu (Modern Methods for Genome Analysis)
Vojta Bystry
vojtech.bystry@ceitec.muni.cz

Plan for Bi7420
• Next generation sequencing methods overview
‒ Focus on experiment planning and result interpretation
1. Introduction to NGS technology
2. Basic QC, DNA resequencing
3. DNA resequencing, Clinical genomics
4. miRNA, lncRNA in cancer - Marek Mráz
5. RNA-seq
6. RNA-seq, Single-cell RNA-seq, Spatial transcriptomics
7. ChIP-seq (CLIP-seq), other methods

Research groups

BioData CF: Your problem is our mission
[Diagram: CEITEC bioinformaticians help with NGS data processing; project-specific bioinformatics support; cultivation of bioinformatics know-how; multi-omics approaches; long-read sequencing; spatial transcriptomics; sensitive cloud; complex structural variants; data integration; teaching bioinformatics; NGS data analysis; BioRoots; bioinformatics framework]

What is NGS?
• Next generation sequencing
‒ New generation sequencing
‒ HTP = High-throughput sequencing
‒ Massively parallel sequencing
• Contrast to Sanger sequencing

NGS experiment workflow
1. Experimental design - Why we sequence
2. Library preparation - What we sequence
3. Sequencing - How we sequence
4. Data analysis
Consultation regarding data analysis is highly advisable.
Vocabulary
Library: Fragmented DNA with technical sequences attached
Pool: Mix of different libraries that are sequenced in one run
Read: String of letters coming out of a sequencer
Depth: How many reads we have coming from a single region of our reference
Flow Cell: The glass slide where sequencing happens
Barcode / Index: Technical sequence used to differentiate samples
Adapter: Technical sequence used to anchor the template to the Flow Cell

NGS sequencing technologies
Currently provided sequencing technologies:
Illumina: NovaSeq, NextSeq 500, MiSeq
PacBio: Sequel IIe
Oxford Nanopore: GridION, PromethION P2 Solo

ONT
• Nanopore adapters with motor proteins
• Nucleic acids are fed through the pore, generating a current
• The current signal (“squiggle”) is interpreted by algorithms to generate sequence, including base modifications
• Steps: Input material (example: DNA) → Library preparation → Flow cell loading → Shifts in electric current

ONT
The ONT sequencers:
1. MinION/Flongle
2. GridION
3. PromethION P2 Solo (developer version)

| Instrument | Flow cells | Channels/FC | Yield | Price |
|---|---|---|---|---|
| MinION | 1 | 512 | 10-15 Gb | ~900 €/FC |
| GridION | 5 | 512 | 10-15 Gb/FC | ~900 €/FC |
| P2 Solo | 2 | 2675 | 100-120 Gb | ~1600 €/FC |

ONT - “news”
Chemistry V14
- New motor protein
- New buffer composition - lower pH
- Lower flowcell loading amounts
New instrument: P2 Solo
- Connected to GridION
- Two high-yield flowcells (100-200 Gb)
Pore R10.4.1
- Improved enzyme-pore docking
- Faster speeds
- Higher output yield
Duplex reads for higher accuracy
POD5: New output file format
- Smaller file
- Faster file writing
- Incompatible with most current tools

ONT Direct RNA
Additional kits available:
● cDNA sequencing kit (PCR): full-length transcripts
● 16S sequencing kit (PCR)
● PCR sequencing kit: targeted amplicons

PacBio
SMRT Sequencing: Single Molecule Real-Time Sequencing
Warning: Whether or not to keep epigenetics information must be decided prior to the run!
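To make the vocabulary term "depth" concrete, here is a minimal sketch that relates the flow-cell yields quoted in the ONT overview to an expected mean depth, using the common approximation depth ≈ total sequenced bases / genome size. The function name `mean_depth`, the 3.1 Gb human genome size, and the reading of the quoted yields as per-flow-cell values are assumptions made for illustration, not figures taken from the slides.

```python
def mean_depth(yield_gb: float, genome_size_gb: float) -> float:
    """Expected mean depth: total sequenced bases divided by genome size."""
    if genome_size_gb <= 0:
        raise ValueError("genome_size_gb must be positive")
    return yield_gb / genome_size_gb


HUMAN_GENOME_GB = 3.1  # assumption: approximate human genome size in Gb

# Yield ranges quoted in the ONT overview (assumed here to be per flow cell).
flowcell_yields_gb = {
    "MinION": (10, 15),
    "P2 Solo": (100, 120),
}

for name, (low, high) in flowcell_yields_gb.items():
    d_low = mean_depth(low, HUMAN_GENOME_GB)
    d_high = mean_depth(high, HUMAN_GENOME_GB)
    print(f"{name}: roughly {d_low:.0f}x to {d_high:.0f}x mean depth on a human genome")
```

This is only a planning estimate: actual depth varies from region to region, and not every sequenced base maps to the reference.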
PacBio
❏ Generates ~2.2-2.4 million HiFi reads / 8M SMRT Cell
❏ HiFi reads have 99.9% accuracy*
❏ HiFi reads can reach between 18-25 kb*
❏ Movie times of 10-30 h → depends on library size
Warnings
* The longer the read, the less accurate!
1514 € / SMRT Cell without library preparation

PacBio MAS-Seq (Multiplexed Arrays Sequencing)
Preprint: High-throughput RNA isoform sequencing using programmable cDNA concatenation. Aziz M. Al’Khafaji et al. 2021
• 3,000-10,000 cells
• 10X Chromium Next GEM Single Cell 3’ kit (v3.1)
• 10X workflow for cDNA generation
• 15-75 ng of cDNA as input

Illumina Sequencing
• Short reads ~ 30 - 300 bases
• Random error, mostly mismatches
• Usually quite good quality (99.9% accuracy)
• A lot of data produced
• “Affordable”

Illumina Sequencing - Library prep
• Hundreds of methods to select the desired molecular landscape
• Adapters necessary
• Barcodes to differentiate individual samples

Illumina Sequencing - Clustering
• Signal from a single DNA molecule is not enough to be detected

Illumina Sequencing - Sequencing
• Sequencing by synthesis
• Each cycle - 1 nucleotide read
• Readout is machine dependent
• Different error profiles

Short-read sequencing result
• 10^5 – 10^10 reads
• 75 – 300 bp
• Could be paired-end

NGS library preparation - What we sequence
• Biological material
• DNA
• RNA (cDNA)
• Select some parts
Note on direct RNA sequencing using Oxford Nanopore

NGS data analysis
[Diagram of analysis paths]
• Experiment design, de-multiplexing, raw data (.fastq), QC
• Mapping to a genome/transcriptome reference (.bam), QC
‒ Interaction analysis (ChIP-seq)
‒ Expression analysis (RNA-seq)
‒ Variant analysis (WES)
• Not known reference: Reference assembly
• Not “classic” reference: Metagenomics, Immunogenetics (VDJ genes), CRISPR (sgRNA), Methylation (Bisulfite-seq…)

Metagenomics

Metagenomics results
• Environmental statistics about populations
‒ alpha, beta, gamma diversity
‒ identify known bacterial species
   • taxonomy profiling
‒ optionally functional profiling
   • E.g. antimicrobial resistance genes
• Sequencing techniques
‒ 16S rRNA sequencing
‒ Shotgun metagenomic sequencing

Metagenomics – 16S rRNA vs. Shotgun

| Factors | 16S rRNA sequencing | Shotgun Metagenomic Sequencing |
|---|---|---|
| Cost | ~$50 USD | Starting at ~$150, but price will depend on sequencing depth required |
| Sample preparation | Similar complexity to shotgun sequencing | Similar complexity to 16S rRNA sequencing |
| Functional profiling (profile microbial genes) | No (but ‘predicted’ functional profiling is possible) | Yes (but it only reveals information on functional potential) |
| Taxonomic resolution: genus, species, strain? | Bacterial genus (sometimes species); dependent on region(s) targeted | Bacterial species (sometimes strains and single nucleotide variants, if sequencing is deep enough) |
| Taxonomic coverage | Bacteria and archaea | All taxa, including viruses |
| Bioinformatics requirements | Beginner to intermediate expertise | Intermediate to advanced expertise |
| Databases | Established, well-curated | Relatively new, still growing |
| Sensitivity to host DNA contamination | Low (but PCR success depends on the absence of inhibitors and the presence of a detectable microbiome) | High, varies with sample type (but this can be mitigated by calibrating the sequencing depth) |
| Bias | Medium to high (retrieved taxonomic composition is dependent on selected primers and targeted variable region) | Lower (while metagenomics is “untargeted”, experimental and analytical biases can be introduced at various stages) |

Metagenomics – 16S rRNA vs. Shotgun
• Study Examples
‒ Assessment of the bacterial microbiome of Amazonian soil
   • 16S rRNA sequencing may provide more taxonomic resolution
‒ Changes in microbiome composition and antimicrobial gene carriage following fecal transplant
   • shotgun sequencing to assess both compositional and functional differences
‒ Daily fluctuations in gut microbiome following a 2-week dietary fiber intervention
   • shotgun sequencing or 16S rRNA
      ‒ shotgun: assesses both compositional and functional differences
      ‒ 16S rRNA: cheaper, and in this case ‘predicted’ functional profiling can be used

Reference Assembly
Reference assembly is problematic with short reads

Genome Assembly
• Very hard and costly (in eukaryotes)
• Multiple sequencing types needed
‒ Paired-end short reads
‒ Long reads
‒ Mate-pairs (e.g. Hi-C)
• T2T-CHM13

Transcriptome Assembly
• Assemble RNA fragments
‒ Similar reference helpful
• Genome guided assembly
‒ Good for poorly annotated organisms with known genomic reference

Immunogenetics
• T-cell receptor, Immunoglobulin (B-cell)
• Gene rearrangement during cell maturation
‒ VDJ recombination
• Different cell populations
‒ Clonal studies
‒ Repertoire usage
• Main usage – blood malignancies (leukemias)

Genome-wide CRISPR-Cas9 knockout screens
• Cas9 (CRISPR associated protein 9) is a protein which plays a vital role in the immunological defense of certain bacteria against DNA viruses
• sgRNA libraries
‒ Each sgRNA knocks out a specific gene
‒ 76,000 guide RNAs (sgRNAs) with four highly active guides per gene, targeting about 19,000 genes as well as non-targeting sgRNA controls
‒ Lentivirus
• Screen selection + expansion/enrichment of surviving cells
• NGS sequencing
• NGS data analysis
‒ Counting cells with different genes knocked out
‒ Counting sgRNA fragments
‒ Compare conditions
• Example study
Wei, L., Lee, D., Law, CT. et al. Genome-wide CRISPR/Cas9 library screening identified PHGDH as a critical driver for Sorafenib resistance in HCC. Nat Commun 10, 4681 (2019). https://doi.org/10.1038/s41467-019-12606-7

Thank you for your attention!
Vojta Bystry
vojtech.bystry@ceitec.muni.cz
www.ceitec.eu
CEITEC @CEITEC_Brno
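
Returning to the analysis steps of the CRISPR knockout screen (counting sgRNA fragments, then comparing conditions), the following is a minimal sketch of that idea on toy data. It assumes that the guide occupies the first 20 bases of each read and that only exact matches count; the function names, the two-guide library, and the reads are all hypothetical. Dedicated screen-analysis tools additionally handle mismatches, variable guide positions, and statistical testing.

```python
import math
from collections import Counter


def count_guides(reads, guide_to_gene, guide_len=20):
    """Count reads whose first `guide_len` bases exactly match a library guide."""
    counts = Counter({guide: 0 for guide in guide_to_gene})
    for read in reads:
        guide = read[:guide_len]
        if guide in guide_to_gene:
            counts[guide] += 1
    return counts


def log2_fold_change(treated, control, pseudocount=1.0):
    """Per-guide log2 ratio of counts-per-million, treated vs. control."""
    t_total = sum(treated.values()) or 1
    c_total = sum(control.values()) or 1
    lfc = {}
    for guide in control:
        t_cpm = treated[guide] / t_total * 1e6
        c_cpm = control[guide] / c_total * 1e6
        lfc[guide] = math.log2((t_cpm + pseudocount) / (c_cpm + pseudocount))
    return lfc


# Hypothetical two-guide library and toy reads (guide followed by extra bases).
GUIDE_A = "ACGTACGTACGTACGTACGT"
GUIDE_B = "TTTTGGGGCCCCAAAATTTT"
library = {GUIDE_A: "GENE_A", GUIDE_B: "GENE_B"}
tail = "GTTTTAGAGC"

control_reads = [GUIDE_A + tail] * 2 + [GUIDE_B + tail] * 2
treated_reads = [GUIDE_A + tail] * 3 + [GUIDE_B + tail] + ["N" * 30]

control_counts = count_guides(control_reads, library)
treated_counts = count_guides(treated_reads, library)
lfc = log2_fold_change(treated_counts, control_counts)

for guide, value in lfc.items():
    print(f"{library[guide]}: log2 fold change = {value:+.2f}")
```

Normalising to counts per million before taking the ratio keeps the comparison fair when the two samples were sequenced to different total depths; the pseudocount avoids division by zero for guides that dropped out completely.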