Basics of Sequencing Technologies

PřF:E4014 Project in Mathematical Biology and Biomedicine (Biomedical Bioinformatics)
FI:IV110 Project in Sequence Analysis
FI:IV114 Project in Bioinformatics and Systems Biology

Vojtěch Bartoň
vojtech.barton@recetox.muni.cz
RECETOX, Masaryk University
September 29, 2024

V. Barton · Sequencing technologies · September 29, 2024

Table of Contents
- Basics of sequencing
- Illumina Sequencing
- Oxford Nanopore Sequencing
- Comparison
- General Processing of Sequencing Data
- Summary

Sequencing: Sequencing

DNA Sequencing
DNA sequencing is the process of determining the nucleic acid sequence, that is, the order of nucleotides in DNA.

Question: What is it good for?

Sequencing: Sequencing Technology

Figure: History of sequencing technology [1]

Illumina: Illumina sequencing
- Next-generation sequencing (NGS) technology
- Sequencing by synthesis
- Utilizes PCR
- Widely used

Principle: https://youtu.be/fCd6B5HRaZ8?si=0Np6Q6pX4236HnvN

Nanopore: Oxford Nanopore
- Third-generation sequencing technology
- Sequencing by disruption of an ionic current (electrical signal)
- Long reads, real-time
- Squiggle (the raw signal trace)

Principle: https://youtu.be/RcP85JHLmnI?si=k732mK9liWV3gw5d

Comparison: Comparison

                  Illumina               Oxford Nanopore
Read length       < 600 bp               < 2 Mbp
Accuracy          99 %                   87-98 %
Price per Gbp     $40-60 (NextSeq)       $50-200 (MinION)
                  $10-35 (NovaSeq)       $20-40 (PromethION)
Real-time         -                      ✓
Epigenomics       (special chemistry)    ✓

Table: Comparison of technologies

Processing: General workflow
Processing: Basecalling

Definition
Basecalling is the process of converting raw sequencing signals into a nucleotide sequence (A, T, C, G).

Question: How is basecalling done for Illumina and Oxford Nanopore?

FASTQ format
The FASTQ format is a text-based file format that stores both the raw sequence data and the corresponding quality scores from sequencing. Each entry consists of four lines:
1. Sequence identifier starting with @.
2. Raw nucleotide sequence (A, T, C, G).
3. + symbol, sometimes followed by the same identifier.
4. PHRED quality scores encoded as ASCII characters, one character per nucleotide in the sequence.

Processing: FASTQ format

PHRED Score
The PHRED score is a quality score that indicates the accuracy of a nucleotide base call in DNA sequencing. Higher scores mean higher confidence and a lower error probability.

Processing: Quality control
- Describe the quality of sequencing data
- Set parameters of preprocessing (trimming and filtering)

Question: What quality parameters to assess?

Examples: FastQC, NanoPlot, fastp

Processing: Assembly & Alignment

De novo assembly
De novo assembly is the process of constructing a genome sequence from short DNA fragments without the use of a reference genome, by assembling overlapping reads into longer contiguous sequences (contigs).

Alignment
Mapping is the process of aligning sequencing reads to a reference genome to determine the origin of each read and to identify variations or similarities.
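Sketch: reading FASTQ records and decoding PHRED scores

Looking back at the FASTQ and PHRED slides, the four-line record layout can be illustrated with a few lines of Python. This is only a minimal sketch, not a replacement for tools such as FastQC. It assumes the common Phred+33 encoding (score = ASCII code - 33) and the usual relation P = 10^(-Q/10) between score Q and error probability P; neither detail is stated on the slides. The two reads in EXAMPLE and the names parse_fastq and error_probability are made up for illustration.

```python
import io
from statistics import mean

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def parse_fastq(handle):
    """Yield (identifier, sequence, scores) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # One ASCII character per base; subtract the offset to get the score.
        scores = [ord(char) - PHRED_OFFSET for char in quality]
        yield header[1:], sequence, scores


def error_probability(q):
    """Convert a PHRED score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Hypothetical two-read input, invented for this example.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+read2\n!5?I\n"

records = list(parse_fastq(io.StringIO(EXAMPLE)))
for name, seq, scores in records:
    print(name, seq, scores, f"mean Q = {mean(scores):.1f}")
```

Under these assumptions a score of 20 corresponds to an error probability of 0.01, i.e. one wrong base call in a hundred. Per-read mean quality and read length are typical inputs when choosing trimming and filtering thresholds.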
Examples (assembly and alignment tools)
- De novo assemblers: SPAdes
- Mappers: Bowtie2, BWA
- RNA mappers: STAR (splice-aware mapping)

Processing: SAM/BAM format

SAM/BAM format
SAM (Sequence Alignment/Map) and BAM (Binary Alignment/Map) are file formats used to store aligned sequencing reads. Both include information about the read sequences, their alignment positions, mapping quality, and optional metadata.

Processing: Alignment QC

Question: What parameters to collect?

Examples: Samtools, QualiMap, Picard tools

Processing: Postprocessing
Depends on the type of experiment, quality of data, study design, hypotheses, ...
- Visualization: Integrative Genomics Viewer (IGV)
- Feature quantification: RNA-sequencing (genes), metagenomics (bacteria)
- Variant calling: mutations, SNP, CNV

Summary: To Remember
- Bioinformatics (and especially sequencing bioinformatics) is a very new field
- No good books, no standards, nothing lasts forever, ... almost everything is old and outdated!
- Garbage in -> garbage out
- If you do not understand the whole process, you do not know what the results mean

Summary: Keywords

Important terms
Sequencing, Illumina, Oxford Nanopore, Basecalling, Paired-end sequencing, PCR, bridge PCR, Adapters, Index, Pooling, Demultiplexing, Squiggle, FASTA, FASTQ, SAM/BAM, De novo assembly, Alignment, Mapping, Splice-aware, Quality control, Filtering, Trimming, PHRED score, SNP, Mutation, CNV, Workflow, ...
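
Sketch: reading one SAM alignment line

As a closing illustration for the SAM/BAM slide, the sketch below splits a plain-text SAM alignment line into the fields that slide mentions (read sequence, alignment position, mapping quality, optional metadata). It relies on details from the SAM specification that the slides do not spell out: eleven mandatory tab-separated columns, header lines starting with @, and the FLAG bits 0x4 (unmapped) and 0x10 (reverse strand). BAM is binary and cannot be read this way. The example lines and the names SamRecord, parse_sam_line and read_sam are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SamRecord:
    qname: str   # read name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    seq: str     # read sequence
    qual: str    # ASCII-encoded base qualities
    tags: dict = field(default_factory=dict)  # optional TAG -> value (as text)

    @property
    def is_unmapped(self):
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self):
        return bool(self.flag & 0x10)


def parse_sam_line(line):
    """Parse one alignment line (not a header line) into a SamRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    # RNEXT, PNEXT and TLEN (mate information) are ignored in this sketch.
    qname, flag, rname, pos, mapq, cigar, _rnext, _pnext, _tlen, seq, qual = fields[:11]
    tags = {}
    for item in fields[11:]:
        tag, _type, value = item.split(":", 2)  # optional fields look like TAG:TYPE:VALUE
        tags[tag] = value
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar, seq, qual, tags)


def read_sam(lines):
    """Yield a SamRecord for every non-header, non-empty line."""
    for line in lines:
        if line.strip() and not line.startswith("@"):
            yield parse_sam_line(line)


# Hypothetical input: one header line and one reverse-strand alignment.
EXAMPLE_LINES = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0",
]

parsed = list(read_sam(EXAMPLE_LINES))
record = parsed[0]
print(record.qname, record.rname, record.pos, record.mapq, record.is_reverse)
```

Counting mapped versus unmapped reads or summarizing mapping quality from such records is the kind of parameter that alignment QC tools report; for real data, an established tool such as Samtools is the safer choice.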