Practical aspects of Illumina sequencing - summary

The steps of Illumina sequencing
1. Fragment genomic DNA, e.g. with a sonicator.
2. Ligate adapters to both ends of the fragments.
3. PCR-amplify the adapter-ligated fragments.
4. Spread the DNA molecules across the flow cell. The goal is to get exactly one DNA molecule per lawn of primers. This depends purely on probability, based on the concentration of DNA.
5. Use bridge PCR to amplify the single molecule on each lawn so that the signal is strong enough to detect. This usually requires several hundred to a few thousand molecules.
6. Sequence by synthesis of the complementary strand: reversible terminator chemistry.

Sources of errors: adapters
• In step 2, adapters are ligated to the ends of the fragments.
Sequencing random fragments of DNA is possible thanks to the addition of short nucleotide sequences (adapters), which:
• allow any DNA fragment to bind to a flow cell for next-generation sequencing
• allow PCR enrichment of adapter-ligated DNA fragments only
• allow indexing or 'barcoding' of samples, so that multiple DNA libraries can be mixed together in one sequencing lane (known as multiplexing)
From: http://tucf-genomics.tufts.edu/documents/protocols/TUCF_Understanding_Illumina_TruSeq_Adapters.pdf

Fragment add-ons
Adapters, primers, tags, barcodes, UMIs, spacers, linkers

Have to be present:
• P5/P7 – adapters for flow-cell binding
• SP1/SP2 – binding sites for the sequencing primers
Common add-ons:
• i5/i7 – sample indexes, to distinguish sequencing libraries
Optional:
• Barcode – unique sequence
• UMI – Unique Molecular Identifier – for identification of PCR duplicates
• Spacers – for sequence elongation
• Linkers – for better binding of oligonucleotides
Source: Indexed Sequencing Overview Guide (15057455) (ox.ac.uk)

Removal of adapters from library
• Necessary step!
• Removal of unligated adapters and adapter dimers (two adapters ligated to each other) is essential to improve data throughput and quality.
• Leftover adapters compete with library fragments for binding to the flow cell, reducing data output.
• Adapter dimers can also clonally amplify and generate sequencing "noise" that must be filtered out during data analysis.
• An excess of unligated adapters makes libraries more prone to index hopping during sequencing.

Sources of errors: PCR duplicates
• In step 3 we intentionally create multiple copies of each original genomic DNA molecule so that we have enough of them.
• PCR duplicates occur when two copies of the same original molecule land on different primer lawns in a flow cell.
• As a consequence, we read the very same sequence twice!
• Higher rates of PCR duplicates (e.g. 30%) arise when there is too little starting material, so that greater amplification of the library is needed in step 3, or when the variance in fragment size is too great, so that smaller fragments, which are easier to PCR-amplify, end up over-represented.
[Figure: DNA fragment with an adapter at each end, on a dense lawn of primers]
A beautiful explanation of the probabilities, and much more: https://www.cureffi.org/2012/12/11/how-pcr-duplicates-arise-in-next-generation-sequencing/

Step 0 of analysis
• Clusters of identical sequences are created.
• The identity of each base in the cluster is read from the sequence images.
• One cycle -> four images!
[Figure: How it works]

Error source: Library concentration
• The concentrations of prepared NGS libraries can vary widely due to differences in the quantity and quality of the input nucleic acid, as well as in the target enrichment method that may be used.
• Underclustering due to a low library concentration yields fewer reads than the flow cell's capacity.
• Overclustering (too many clusters) can result in low quality scores and problematic downstream analysis, because the image analysis program distinguishes the clusters poorly!
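The link between too little starting material and a high PCR duplicate rate, described in the PCR duplicates slide above, can be illustrated with a simple sampling model. The following is only a minimal sketch in Python. It assumes that every unique library molecule is equally likely to seed a cluster (random sampling with replacement), which ignores the fragment-size bias mentioned above. The function name and the example numbers are hypothetical and do not come from the slides.

```python
import math


def expected_duplicate_fraction(unique_molecules: float, reads: float) -> float:
    """Expected fraction of reads that are PCR duplicates.

    Assumption (not from the slides): each read is drawn uniformly at random,
    with replacement, from `unique_molecules` distinct library molecules.
    Under the Poisson approximation, the expected number of distinct molecules
    observed after N reads from M molecules is M * (1 - exp(-N / M)).
    """
    if unique_molecules <= 0 or reads <= 0:
        raise ValueError("unique_molecules and reads must be positive")
    distinct_seen = unique_molecules * (1.0 - math.exp(-reads / unique_molecules))
    # Every read beyond the first one from a given molecule is a duplicate.
    return 1.0 - distinct_seen / reads


if __name__ == "__main__":
    reads = 10_000_000  # hypothetical number of reads
    for complexity in (5_000_000, 50_000_000, 500_000_000):
        fraction = expected_duplicate_fraction(complexity, reads)
        print(f"{complexity:>12,} unique molecules -> {fraction:.1%} duplicates")
```

Under this model, the duplicate fraction falls as the number of unique molecules in the library grows relative to the number of reads, which is why low-input libraries tend to show more duplicates.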
Sources of errors: sequencing by synthesis – the fluorescence
• In step 5 we amplify the signal, and in step 6 we detect the fluorescence of each base.
• The assumption is that in each cycle, every molecule on the flow cell is extended by exactly one base.
• The reality:
  • Some molecules are not extended, or their base has no fluorescent dye.
  • The previous fluorescent dye is not cleaved – after a few cycles, the signal from the cluster is a mix of signals from the current and previous bases.

Sequencing coverage
Coverage in DNA sequencing is the number of unique reads that include a given nucleotide in the reconstructed sequence.

Depth of coverage (coverage depth / mapping depth)
How strongly is the genome "covered" by sequenced fragments (short reads)?
• Per-base coverage is the average number of times a base of a genome is sequenced (in other words, how many reads cover it).
• The coverage depth of a genome is calculated as the number of bases of all short reads that match the genome, divided by the length of the genome.
• It is often expressed as 1X, 2X, 3X, ... (1, 2, or 3 times coverage).

Average coverage of the genome (Av)
Av = (N × L) / G
G - length of the original genome
N - number of reads
L - average read length

Breadth of coverage (covered length)
What proportion of the genome is "covered" by short reads? Are there regions that are not covered, not even by a single read?
Breadth of coverage is the percentage of bases of a reference genome that are covered at a certain depth. For example: "90% of a genome is covered at 1X depth; and still 70% is covered at 5X depth."
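The definitions of average coverage, per-base depth and breadth of coverage given above can be made concrete with a small Python sketch. It assumes, purely for illustration, that mapped reads are given as 0-based, half-open (start, end) intervals on a single reference sequence. The function names and the toy data are hypothetical. In practice these values are computed from alignment files with dedicated tools.

```python
from typing import List, Sequence, Tuple


def average_coverage(num_reads: int, read_length: float, genome_length: int) -> float:
    """Average coverage Av = (N x L) / G, as defined above."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_reads * read_length / genome_length


def per_base_depth(reads: Sequence[Tuple[int, int]], genome_length: int) -> List[int]:
    """Depth at every position of the genome.

    Assumption: each mapped read is a 0-based, half-open (start, end) interval.
    A difference array is used, so the cost is O(reads + genome_length).
    """
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        start = max(start, 0)
        end = min(end, genome_length)
        if start < end:
            diff[start] += 1  # read begins covering here
            diff[end] -= 1    # read stops covering here
    depth = []
    running = 0
    for delta in diff[:genome_length]:
        running += delta
        depth.append(running)
    return depth


def breadth_of_coverage(depth: Sequence[int], min_depth: int = 1) -> float:
    """Fraction of positions covered by at least `min_depth` reads."""
    if not depth:
        raise ValueError("depth must not be empty")
    return sum(1 for d in depth if d >= min_depth) / len(depth)


if __name__ == "__main__":
    toy_reads = [(0, 4), (2, 6), (3, 8)]  # hypothetical mapped reads
    depth = per_base_depth(toy_reads, genome_length=10)
    print("per-base depth:", depth)
    print("mean depth:", sum(depth) / len(depth))
    print("breadth at 1X:", breadth_of_coverage(depth, 1))
    print("breadth at 2X:", breadth_of_coverage(depth, 2))
```

The mean of the per-base depth equals the total number of mapped bases divided by the genome length, which is the same quantity as Av when N counts mapped reads.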
Coverage recommendations
The required coverage is determined based on:
• Read lengths
• Genome size
• Application
• Recommendations in the literature
• Gene expression levels
• Complexities of the genome, repetitive regions
• Errors in the sequencing tool or methodology
• Analysis algorithm

Coverage recommendations / DNA
Source: Coverage and Read Depth Recommendations for Next-Generation Sequencing Applications (genohub.com)

Coverage recommendations / RNA
Source: Coverage and Read Depth Recommendations for Next-Generation Sequencing Applications (genohub.com)
• Different transcripts are expressed at different levels => more reads will be captured from highly expressed genes.
• Transcriptome complexity, alternative expression, 3' associated bias, and the distribution of expression levels make coverage estimation difficult.
• ATTENTION WHEN CALCULATING! We need to count mapped reads, not total reads.

Coverage recommendations / application
Source: Coverage and Read Depth Recommendations for Next-Generation Sequencing Applications (genohub.com)

How many samples per run?
It depends on the platform used: its maximum number of reads per run and the required number of reads per sample (in millions).
Source: Designing Next-Generation Sequencing Runs (genohub.com)

Single or paired-end?
Single-end sequencing
• Pros: fast, cheap
• Cons: limited use
• Usage: usually sufficient for studies that look at counts rather than structural changes, such as RNA-Seq or ChIP-Seq
[Figure: single-end sequencing; labels: DNA fragment, read, genome]
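The run-planning arithmetic behind the coverage recommendations and the samples-per-run question above can be sketched in Python as follows. This is a minimal illustration that rearranges Av = (N × L) / G to solve for N. The function names, the `mapped_fraction` parameter and all numbers are hypothetical and are not platform specifications.

```python
import math


def reads_for_target_coverage(
    target_coverage: float,
    genome_length: int,
    read_length: float,
    mapped_fraction: float = 1.0,
) -> int:
    """Total reads needed to reach a target average coverage.

    From Av = (N x L) / G it follows that N = Av * G / L. Because only mapped
    reads count towards coverage, the result is divided by the assumed
    fraction of reads that map (hypothetical parameter) and rounded up.
    """
    if not 0.0 < mapped_fraction <= 1.0:
        raise ValueError("mapped_fraction must be in (0, 1]")
    if read_length <= 0 or genome_length <= 0:
        raise ValueError("read_length and genome_length must be positive")
    return math.ceil(target_coverage * genome_length / (read_length * mapped_fraction))


def samples_per_run(reads_per_run: int, reads_per_sample: int) -> int:
    """How many samples fit into one run, given the reads each sample needs."""
    if reads_per_sample <= 0:
        raise ValueError("reads_per_sample must be positive")
    return reads_per_run // reads_per_sample


if __name__ == "__main__":
    # Hypothetical example: 30X coverage of a 3 Gb genome with 150-base reads.
    needed = reads_for_target_coverage(30, 3_000_000_000, 150)
    print("reads needed per sample:", needed)
    # Hypothetical run output of 400 million reads, 30 million reads per sample.
    print("samples per run:", samples_per_run(400_000_000, 30_000_000))
```

Rounding the read count up and the sample count down keeps both estimates on the safe side of the target.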
Paired-end sequencing
• Pros:
  • greater accuracy
  • double the number of reads per sample in one run (higher capacity) for less than the cost of two sequencing runs
• Cons: slower, more expensive (relatively)
• Usage:
  • de novo genome assembly
  • analysis of structural changes (deletions, insertions, inversions) and SNPs
  • study of splicing variants
  • epigenetic modifications (methylation)
[Figure: paired-end sequencing; labels: DNA fragment, read R1, read R2, genome]

Read length
• Longer reads provide more precise information about the relative positions of the bases in the genome, but they are more expensive than shorter ones.
• 50-75 cycles are typically sufficient for simple mapping of reads to a reference genome and for quantification experiments, e.g. gene expression (RNA-Seq).
• Read lengths of 100 or more are typically chosen for genome or transcriptome studies that require greater precision.
• The exact read length depends on the length of the inserts!

Read length and fragments
• The length of the fragments should roughly correspond to the length of the read (in the case of paired-end reads, to the sum of both reads).
• Uniformity of fragment sizes is essential because read lengths are limited.
• Significantly longer DNA inserts => some parts of the inserts remain unsequenced.
• Shorter than recommended => suboptimal use of sequencing reagents and resources.
• A combination of short and long inserts => reduced sequencing efficiency and problems in data analysis.
Source: Preparation of DNA Sequencing Libraries for Illumina Systems—6 Key Steps in the Workflow | Thermo Fisher Scientific - CZ
• Read length is limited by the sequencing platform and reagent kit.
Sources: How many cycles of SBS chemistry are in my kit? (illumina.com); Maximum read length for Illumina sequencing platforms

More resources
• Practical tips for lab library preparation: Preparation of DNA Sequencing Libraries for Illumina Systems—6 Key Steps in the Workflow | Thermo Fisher Scientific - CZ
• Practical tips for sequencing run setup: Designing Next-Generation Sequencing Runs (genohub.com)
• Indexed sequencing Illumina guide: Indexed Sequencing Overview Guide (15057455) (ox.ac.uk)
• Sequencing depth and coverage: key considerations in genomic analyses | Nature Reviews Genetics