Practical aspects of Illumina sequencing - summary
The steps of Illumina sequencing
1. Fragment genomic DNA, e.g. with a sonicator.
2. Ligate adapters to both ends of the fragments.
3. PCR-amplify the adapter-ligated fragments.
4. Spread the DNA molecules across the flow cell. The goal is to get exactly one DNA molecule per lawn of primers. This depends purely on probability, based on the concentration of DNA.
5. Use bridge PCR to amplify the single molecule on each lawn so that the signal is strong enough to detect. Usually this requires several hundred to a few thousand molecules.
6. Sequence by synthesis of the complementary strand, using reversible terminator chemistry.
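The "purely on probability" remark in step 4 can be sketched numerically. The sketch below assumes that the number of molecules landing on one lawn follows a Poisson distribution whose mean is set by the DNA concentration; this model and the function name `poisson_pmf` are illustrative assumptions, not something stated in the slides.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k molecules on a lawn, given a mean of lam per lawn."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Compare a few loading concentrations (mean molecules per lawn, assumed values).
for lam in (0.1, 0.5, 1.0, 2.0):
    empty = poisson_pmf(0, lam)
    single = poisson_pmf(1, lam)
    multiple = 1.0 - empty - single
    print(f"mean={lam:.1f}  empty={empty:.3f}  single={single:.3f}  multiple={multiple:.3f}")
```

Under this assumed model, the share of lawns with exactly one molecule peaks at a mean of one molecule per lawn, at roughly 37%; lower concentrations leave more lawns empty and higher ones produce more mixed lawns.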
Sources of errors: adapters
Sequencing random fragments of DNA is possible via the addition of short nucleotide sequences, which allow any DNA fragment to:
● Bind to a flow cell for next-generation sequencing
● Allow for PCR enrichment of adapter-ligated DNA fragments only
● Allow for indexing or 'barcoding' of samples, so multiple DNA libraries can be mixed together into one sequencing lane (known as multiplexing)
• In step 2, adapters are ligated to the ends of the fragments
From:
http://tucf-genomics.tufts.edu/documents/protocols/TUCF_Understanding_Illumina_TruSeq_Adapters.pdf
Fragment add-ons
Adapters, primers, tags, barcodes, UMIs, spacers, linkers
Have to be present:
P5/P7 – adapters for flow-cell binding
SP1/SP2 – binding sites for the sequencing primers
Common add-ons:
i5/i7 – sample indexes, used to distinguish sequencing libraries
Optional:
Barcode – unique sequence
UMI – Unique Molecular Identifier, used to identify PCR duplicates
Spacers – for sequence elongation
Linkers – for better binding of oligonucleotides
Indexed Sequencing Overview Guide (15057455) (ox.ac.uk)
Removal of adapters from library
• Necessary step!
• Removal of unligated adapters and adapter dimers (two adapters ligated to each other) is essential to improve data throughput and quality.
• Excess adapters compete with library fragments for binding to the flow cell, reducing data output.
• Adapter dimers can also clonally amplify and generate sequencing “noise” that must be filtered out during data analysis.
• An excess of unligated adapters makes libraries more prone to index hopping during sequencing.
Sources of errors: PCR duplicates
• In step 3 we intentionally create multiple copies of each original genomic DNA molecule so that we have enough of them.
• PCR duplicates occur when two copies of the same original molecule land on different primer lawns in a flow cell.
• As a consequence, we read the very same sequence twice!
Higher rates of PCR duplicates (e.g. 30%) arise when there is too little starting material, so that greater amplification of the library is needed in step 3, or when the variance in fragment size is too great, so that smaller fragments, which are easier to PCR-amplify, end up over-represented.
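One way to estimate a duplicate rate after mapping is to group reads that share the same mapping position and strand, optionally together with a UMI. The sketch below is a simplified illustration under that assumption; the `Read` type, its fields, and the toy data are hypothetical, and real duplicate-marking tools apply more careful rules.

```python
from collections import Counter
from typing import NamedTuple

class Read(NamedTuple):
    chrom: str
    start: int      # leftmost mapped position
    strand: str     # "+" or "-"
    umi: str        # unique molecular identifier, "" if the library has none

def duplicate_rate(reads: list[Read]) -> float:
    """Fraction of reads that are extra copies of an already-seen molecule."""
    if not reads:
        return 0.0
    groups = Counter(reads)              # identical key => same original molecule
    duplicates = len(reads) - len(groups)
    return duplicates / len(reads)

reads = [
    Read("chr1", 100, "+", "ACGT"),
    Read("chr1", 100, "+", "ACGT"),   # duplicate of the first read
    Read("chr1", 100, "+", "TTAG"),   # same position, different UMI: distinct molecule
    Read("chr2", 500, "-", "ACGT"),
]
print(duplicate_rate(reads))          # 0.25
```

Without UMIs (all `umi` fields empty), two independent molecules that happen to map to the same position would be counted as duplicates, which is why UMIs are listed among the optional fragment add-ons.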
[Figure labels: dense lawn of primers; adapter; DNA fragment. Caption: Clusters of identical sequences are created.]
A clear explanation of the probabilities, and much more, is available at: https://www.cureffi.org/2012/12/11/how-pcr-duplicates-arise-in-next-generation-sequencing/
Step 0 of analysis
• The identity of each base in the cluster is read from the sequence images.
• One cycle -> four images!
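A naive sketch of this step, assuming that the four images of a cycle give one intensity per base channel for each cluster and that the brightest channel is called. The function name and data layout are hypothetical; real base callers also correct for effects such as channel cross-talk and phasing.

```python
BASES = "ACGT"

def call_bases(intensities: list[tuple[float, float, float, float]]) -> str:
    """Call one base per cycle from (A, C, G, T) channel intensities of one cluster."""
    called = []
    for cycle in intensities:
        brightest = max(range(4), key=lambda i: cycle[i])  # index of strongest channel
        called.append(BASES[brightest])
    return "".join(called)

# Toy intensities for one cluster over three cycles.
cluster = [
    (0.9, 0.1, 0.0, 0.1),
    (0.1, 0.1, 0.8, 0.2),
    (0.0, 0.2, 0.1, 0.7),
]
print(call_bases(cluster))   # AGT
```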
How it works
Error source: library concentration
• The concentrations of prepared NGS libraries can vary widely due to differences in the quantity and quality of the input nucleic acid, as well as in the target enrichment method that may be used.
• Underclustering due to a low library concentration yields fewer reads than the flow cell's capacity.
• Overclustering (too many clusters) can result in low quality scores and problematic downstream analysis, because the image analysis program cannot distinguish the clusters well!
Sources of errors: sequencing by synthesis – the fluorescence
• In step 5 we amplify the signal, and in step 6 we detect the fluorescence of each base.
• The assumption is that in each cycle, every molecule on the flow cell is extended by one base.
• The reality:
• Some molecules are not extended, or their base carries no fluorescent dye.
• The previous fluorescent dye is not always cleaved, so after a few cycles the signal from a cluster is a mix of the current base and previous bases.
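The effect of incomplete extension can be illustrated with a toy model. It assumes that in every cycle a fixed fraction of the still in-step molecules in a cluster fails to extend and never catches up, so the in-step fraction after n cycles is (1 - p)^n. The failure rate used below and the function name are illustrative assumptions, not measured values.

```python
def in_phase_fraction(p_fail: float, cycles: int) -> float:
    """Fraction of molecules in a cluster still in step after `cycles` cycles."""
    return (1.0 - p_fail) ** cycles

# Assumed per-cycle failure rate of 0.5%.
for cycles in (1, 50, 100, 150):
    frac = in_phase_fraction(0.005, cycles)
    print(f"cycle {cycles:3d}: {frac:.1%} of the cluster in step")
```

Even a small per-cycle failure rate compounds over many cycles, so the signal from a cluster becomes progressively less pure toward the end of a read.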
Sequencing coverage
Coverage in DNA sequencing is the number of unique reads that include a given nucleotide in the reconstructed sequence.
Depth of coverage (coverage depth / mapping depth)
How strongly is the genome "covered" by sequenced fragments (short reads)?
Per-base coverage is the average number of times a base of a genome is sequenced (in other words, how many reads cover it).
The coverage depth of a genome is calculated as the number of bases of all short reads that match a genome divided by the length of this genome. It is often expressed as 1X, 2X, 3X, ... (1, 2, or 3 times coverage).
Average coverage of the genome (Av)
Av = (N × L) / G
G - length of the original genome
N - number of reads
L - average read length
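A short worked example of the formula above, together with its rearranged form N = Av × G / L for planning how many reads are needed. The genome size, read length, and read count are illustrative numbers only, and the function names are hypothetical.

```python
import math

def average_coverage(n_reads: int, read_length: float, genome_length: int) -> float:
    """Av = (N x L) / G."""
    return n_reads * read_length / genome_length

def reads_needed(target_coverage: float, read_length: float, genome_length: int) -> int:
    """Rearranged formula: N = Av x G / L, rounded up to whole reads."""
    return math.ceil(target_coverage * genome_length / read_length)

# Illustrative numbers: 5 Mb genome, 150 bp reads.
print(average_coverage(1_000_000, 150, 5_000_000))   # 30.0
print(reads_needed(30, 150, 5_000_000))              # 1000000
```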
Breadth of coverage (covered length)
What proportion of the genome is "covered" by short reads? Are there regions that are not covered by even a single read?
Breadth of coverage is the percentage of bases of a reference genome that are covered at a given depth or higher.
For example: "90% of a genome is covered at 1X depth; and still 70% is covered at 5X depth."
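The difference between breadth and depth can be sketched on a toy list of per-base depths. The function name and the depth values are hypothetical; in practice the per-base depths would come from a read-mapping result.

```python
def breadth_of_coverage(depths: list[int], min_depth: int = 1) -> float:
    """Percentage of reference positions covered by at least `min_depth` reads."""
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)

# Toy per-base depths for a 10 bp reference.
depths = [0, 1, 3, 5, 7, 6, 5, 2, 1, 0]
print(breadth_of_coverage(depths, 1))   # 80.0  (breadth at 1X)
print(breadth_of_coverage(depths, 5))   # 40.0  (breadth at 5X)
print(sum(depths) / len(depths))        # 3.0   (average depth)
```

The same reference can have a reasonable average depth while parts of it remain uncovered, which is why both numbers are worth reporting.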
Coverage recommendations
Coverage is determined based on:
• Read lengths
• Genome size
• Application
• Recommendations in the literature
• Gene expression levels
• Complexities of the genome, repetitive regions
• Errors in the sequencing tool or methodology
• Analysis algorithm
Coverage recommendations / DNA
Coverage and Read Depth Recommendations for Next-Generation Sequencing Applications (genohub.com)
Coverage recommendations / RNA
Coverage and Read Depth Recommendations for Next-Generation Sequencing Applications (genohub.com)
Different transcripts are expressed at different levels, so more reads will be captured from highly expressed genes.
Transcriptome complexity, alternative expression, 3'-associated bias, and the distribution of expression levels make coverage estimation difficult.
ATTENTION WHEN CALCULATING! We need to count mapped reads, not total reads.
Coverage recommendations / application
Coverage and Read Depth Recommendations for Next-Generation Sequencing Applications (genohub.com)
How many samples per run?
It depends on the platform used: its maximum number of reads per run and the number of reads required per sample (in millions).
Designing Next-Generation Sequencing Runs (genohub.com)
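The samples-per-run question reduces to a simple division. The sketch below uses made-up figures that are not the specification of any real platform, and the function name is hypothetical.

```python
def samples_per_run(run_capacity_reads: float, reads_per_sample: float) -> int:
    """Number of samples that fit in one run at the required number of reads each."""
    if reads_per_sample <= 0:
        raise ValueError("reads_per_sample must be positive")
    return int(run_capacity_reads // reads_per_sample)

# Hypothetical figures in millions of reads: 400 M per run, 30 M per sample.
print(samples_per_run(400, 30))   # 13
```

This leaves no safety margin; since library concentrations and cluster densities vary, planning with somewhat fewer samples than the computed maximum may be prudent.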
Single- or paired-end?
Single-end sequencing
• Pros: fast, cheap
• Cons: limited use
• Usage: usually sufficient for studies that detect counts rather than structural changes, such as RNA-Seq or ChIP-Seq
[Figure labels: fragment DNA; genome; reads.]
Paired-end sequencing
• Pros:
• Greater accuracy; double the number of reads per sample in one run (higher capacity) for less than the cost of two sequencing runs
• Cons: slower, more expensive (relatively)
• Usage:
• De novo genome assembly
• Analysis of structural changes (deletions, insertions, inversions) and SNPs
• Studies of splicing variants
• Epigenetic modifications (methylation)
[Figure labels: fragment DNA; genome; reads R1 and R2.]
Read length
• Longer reads provide more precise information about the relative positions of the bases in the genome, but they are more expensive than shorter ones.
• 50-75 cycles are typically sufficient for simple mapping of reads to a reference genome and for quantification experiments, e.g. gene expression (RNA-Seq).
• Read lengths of 100 or more are typically chosen for genome or transcriptome studies that require greater precision.
• The exact read length depends on the length of the inserts!
Read length and fragments
• The length of the fragments should roughly correspond to the length of the read (in the case of paired-end reads, their sum).
• Uniformity of fragment sizes is essential because read lengths are limited.
• Significantly longer DNA inserts => some parts of the inserts remain unsequenced.
• Shorter than recommended => suboptimal use of sequencing reagents and resources.
• The combination of short and long inserts => reduces sequencing efficiency and presents problems in data analysis.
Preparation of DNA Sequencing Libraries for Illumina Systems—6 Key Steps in the Workflow | Thermo Fisher Scientific - CZ
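The relationship between insert length and read length can be sketched as simple arithmetic. The function name and the example lengths are hypothetical; the sketch only compares the number of bases read per fragment with the insert length.

```python
def insert_read_summary(insert_length: int, read_length: int, paired_end: bool = True) -> dict[str, int]:
    """Compare the bases read per fragment with the insert length."""
    sequenced = read_length * (2 if paired_end else 1)
    return {
        # Middle of a long insert that neither read reaches.
        "unsequenced_bases": max(insert_length - sequenced, 0),
        # Cycles spent on read overlap or on reading past the insert end.
        "redundant_bases": max(sequenced - insert_length, 0),
    }

print(insert_read_summary(500, 150))   # long insert: 200 bases unsequenced
print(insert_read_summary(200, 150))   # short insert: 100 bases read redundantly
```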
Read length and fragments!
Read length is limited by the sequencing platform and reagent kit
How many cycles of SBS chemistry are in my kit? (illumina.com)
Maximum read length for Illumina sequencing platforms
More resources
• Practical tips for lab library preparation: Preparation of DNA Sequencing Libraries for Illumina Systems—6 Key Steps in the Workflow | Thermo Fisher Scientific - CZ
• Practical tips for sequencing run setup: Designing Next-Generation Sequencing Runs (genohub.com)
• Indexed sequencing Illumina guide: Indexed Sequencing Overview Guide (15057455) (ox.ac.uk)
• Sequencing depth and coverage: key considerations in genomic analyses | Nature Reviews Genetics