NGS introduction
Mgr. Eva Budinská, Ph.D., doc. Ing. Matej Lexa, Ph.D.
eva.budinska@recetox.muni.cz, matej.lexa@fi.muni.cz
IV110 Project in Bioinformatics I, IV114 Project in Bioinformatics and Systems Biology, E4014 Project in Mathematical Biology and Biomedicine – biomedical bioinformatics

Next-generation sequencing introduction
• Deciphering the DNA sequence is essential for all branches of "biological" research
• It has become widely adopted in numerous laboratories all over the world
• Next-generation sequencing (NGS) is an (almost) new sequencing technology
• It helps to overcome the limitations of older techniques in speed, scalability, throughput and resolution
[Figure: DNA sequencing costs, 2010 vs. 2023 – data from genome.gov]

*Seq things
• NGS has a wide range of uses
• One of many nice lists gives an example of all the possible applications:
• http://enseqlopedia.com/enseqlopedia/
• Approximately ~200 different techniques are listed there…
• Another (simpler) list of NGS-based techniques:
• https://liorpachter.wordpress.com/seq/

The NGS analysis pipeline
Step 0: base calling (image analysis) + base quality control

NGS is a high-throughput sequencing technology
• Polony sequencing takes place on an array of polonies, in which all amplicons of the same DNA fragment are clustered together on the same region of the array. These groups of amplicons were termed polonies, short for polymerase colonies.
[Figure: comparison of polony and Sanger sequencing]

DNA library preparation and sequencing by synthesis
[Figure: cycle 1 – add sequencing reagents, detect signal, cleave terminator and dye; cycles 2–n – add sequencing reagents and repeat; excitation/emission of the fluorescent labels]

The steps of Illumina sequencing
1. Fragment genomic DNA, e.g. with a sonicator.
2. Ligate adapters to both ends of the fragments.
3. PCR-amplify the fragments with adapters.
4. Spread the DNA molecules across flow cells. The goal is to get exactly one DNA molecule per flow-cell lawn of primers. This depends purely on probability, based on the concentration of DNA.
5. Use bridge PCR to amplify the single molecule on each lawn so that you get a strong enough signal to detect. Usually this requires several hundred to a few thousand molecules.
6. Sequence by synthesis of the complementary strand: reversible terminator chemistry.
Let's see a video.

Sources of errors: adapters
Sequencing random fragments of DNA is possible via the addition of short nucleotide sequences which allow any DNA fragment to:
● Bind to a flow cell for next-generation sequencing
● Allow for PCR enrichment of adapter-ligated DNA fragments only
● Allow for indexing or 'barcoding' of samples so multiple DNA libraries can be mixed together into one sequencing lane (known as multiplexing)
• In step 2, adapters are ligated to the ends of the fragments
From: http://tucf-genomics.tufts.edu/documents/protocols/TUCF_Understanding_Illumina_TruSeq_Adapters.pdf

Sources of errors: PCR duplicates
• In step 3 we intentionally create multiple copies of each original genomic DNA molecule so that we have enough of them.
• PCR duplicates occur when two copies of the same original molecule get onto different primer lawns in a flow cell.
• As a consequence, we read the very same sequence twice!
• Higher rates of PCR duplicates (e.g. 30%) arise when you have too little starting material, such that greater amplification of the library is needed in step 3, or when you have too great a variance in fragment size, such that smaller fragments, which are easier to PCR-amplify, end up over-represented.
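To make the duplicate problem concrete, here is a minimal Python sketch (with made-up example reads) that simply flags reads sharing an identical sequence; in real pipelines, PCR duplicates are usually marked after alignment from identical mapping coordinates, e.g. with Picard MarkDuplicates.

```python
# Minimal sketch: flag exact sequence duplicates among reads.
# Real pipelines mark duplicates after alignment (identical coordinates);
# the reads below are made-up examples.
from collections import Counter

reads = [
    "ACGTACGTGGCA",
    "ACGTACGTGGCA",   # PCR duplicate of the first read
    "TTGACCGTAAGC",
]

counts = Counter(reads)
for seq, n in counts.items():
    if n > 1:
        print(f"{seq} seen {n} times -> {n - 1} likely PCR duplicate(s)")
```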
[Figure: DNA fragment with adapters on a dense lawn of primers]

Sources of errors: sequencing by synthesis – the fluorescence
• In step 5 we amplify the signal and detect the fluorescence of each base
• The assumption is that in each cycle, every molecule on the flow cell is extended by one base
• The reality:
  • Some molecules are not extended, or their base carries no fluorescent dye
  • The previous fluorescent dye is not cleaved – the signal from the cluster after a few cycles is a mix of signals from previous bases

Sequencing by synthesis – fluorescently labeled nucleotides (Illumina)
̶ During the process, clusters of identical sequences are created

Step 0: base calling (image analysis)
̶ The identity of each base of a cluster is read off from sequential images
̶ One cycle -> one image

Flow-cell imaging – getting the sequences from clusters
̶ Illumina pipeline:
̶ Firecrest (image analysis) – locates clusters and calculates intensity and noise
̶ Bustard (base calling) – deconvolutes the signal and corrects for cross-talk and phasing

Image analysis data output
̶ 100 tiles per lane, 8 lanes per flow cell, 36 cycles
̶ 4 images (A, G, C, T) per tile per cycle = 115,200 images
̶ Each TIFF image is ~7 MB = 806,400 MB of data
̶ 1.6 TB per 70 nt read, 3.2 TB for a 70 nt paired-end read
̶ Most technologies erase the intensities as they sequence, because the amount of data is too high

Step 0: base calling (image analysis) + base quality control
Base call quality control
̶ Quality control (QC) of each base call is automatically performed by the sequencing platform
̶ In other words: for each letter in a read, we estimate the probability of it being erroneous (P).
̶ Per-base QC is specialized for each platform – each platform must solve challenges unique to the underlying sequencing technology

The PHRED score
Q_phred = -10 × log10(P_error)
̶ The Phred quality score is the negative ratio of the error probability to the reference level of P = 1, expressed in decibels (dB).
̶ The error estimate is based on a statistical model that provides a measure of certainty of each base call in addition to the nucleotide itself
̶ These statistical models base their error estimate on:
̶ Signal intensities from the recorded image
̶ The number of the sequencing cycle
̶ The distance to other sequence colonies
̶ The Phred score is encoded using ASCII in the fastq file (a small decoding sketch follows the read ID example below)

Phred score   Probability of incorrect base call   Base call accuracy
10            1 in 10                              90%
20            1 in 100                             99%
30            1 in 1 000                           99.9%
40            1 in 10 000                          99.99%
50            1 in 100 000                         99.999%
60            1 in 1 000 000                       99.9999%

Phred score encoding in ASCII: https://en.wikipedia.org/wiki/FASTQ_format

FASTA and FASTQ formats
̶ The reads obtained from the sequencer are typically stored in fasta (just the sequences) or fastq (sequences + QC measure) format files.
̶ For paired-end reads, we usually obtain two files.
̶ Reads are not generally grouped by strand, only by the order in which they were sequenced.

FASTA format
̶ General format to represent sequences
̶ Two lines per sequence (read):
̶ ID line (starting with >)
̶ Sequence line
̶ Typical file extension: .fa or .fasta

Example Illumina read ID fields:
• HWI-ST132 – unique instrument name
• 633 – run ID
• D17U2ACXX – flowcell ID
• 8 – flowcell lane
• 1101 – tile number within lane
• 14830 – x-coordinate of cluster within tile
• 2376 – y-coordinate of cluster within tile
• 1 – member of pair (1 or 2). Older versions: /1 and /2
• Y/N – whether the read failed quality control (Y = bad)
• 0 – none of the control bits are on
• CATGCA – index sequence (barcode)
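A minimal Python sketch of working with these conventions, assuming the common Phred+33 (Sanger/Illumina 1.8+) encoding; the header string below is reconstructed from the example fields above, and the quality string is made up.

```python
# Minimal sketch: split an Illumina read ID into the fields listed above and
# convert an ASCII-encoded quality string back to Phred scores.
# Header reconstructed from the example fields on this slide; quality string is made up.

header = "@HWI-ST132:633:D17U2ACXX:8:1101:14830:2376 1:N:0:CATGCA"
quality = "IIIIHHGF#"

read_id, description = header.lstrip("@").split(" ")
instrument, run, flowcell, lane, tile, x, y = read_id.split(":")
pair_member, filtered, control_bits, index = description.split(":")

print(instrument, run, flowcell, lane, tile, x, y)
print("mate:", pair_member, "failed QC:", filtered, "barcode:", index)

# Phred+33 decoding: Q = ord(character) - 33, and P(error) = 10^(-Q/10)
phred = [ord(c) - 33 for c in quality]
print(phred)   # e.g. 'I' -> 40 (99.99% accuracy), '#' -> 2
```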
FASTQ format
̶ Combines sequence and base call quality information.
̶ Typical file extension: .fastq
• Four lines per sequence (read):
  • ID line (starting with @)
  • Sequence line
  • Another ID line (starting with +)
  • Base qualities (one for each letter in the sequence)

Step 1: Read quality control and data filtering
̶ Uses the output file with information about the quality of base calls (.fastq)
̶ The first step in the pipeline that deals with actual sequencing data in base or color space
• Several metrics are evaluated, not all of them using the Phred score information:
  • Distribution of quality scores for each sequence, sequence composition, per-sequence and per-read distribution of GC content, library complexity, overrepresented sequences
• Initial overview – already in the base calling software
• More detailed quality overview – software such as SolexaQA or FastQC
̶ Based on the quality measures, we decide to remove low-quality bases and reads:
• Trimming – removes low-quality or unwanted bases from reads, thus shortening them. It is applied to increase the number of mappable reads.
• Filtering – removes whole reads that do not meet quality standards (e.g. too short, etc.)

Step 2: Alignment (mapping)
̶ To know where the short reads (in our filtered .fastq file) come from (i.e. which part of the genome or transcriptome they represent), they need to be (in most instances) aligned to a reference sequence

Reference sequence
̶ The reference sequence can be a genome, a transcriptome or a collection of specific sequences.
̶ Typically, the reference sequence(s) is given in a .fa or .fasta file
̶ An alternative is the GTF (gene transfer format), which stores gene structure
̶ The BED format is designed for annotation tracks in genome browsers
(we will learn where to get the reference sequences in one of the next lectures)

Step 2: Alignment (mapping)
• Intuitively an easy task
• However, trying all the possible options (alignments) is very time consuming!
• Efficient algorithms (aligners) exist
• The result of mapping is stored by many algorithms in the Sequence Alignment/Map (SAM) format
• We will talk about mapping in one of the future lectures

Step 3: Post-alignment QC and visualization
̶ Necessary in order to see the efficiency of the alignment.
• During the alignment, not all the reads get aligned – but what proportion?
• If they were aligned – are there any errors?
• How well is the reference genome covered?
• Important in determining whether:
  • we can proceed with the analysis or some pre-processing needs to be done
  • we possibly need to redo the alignment
  • or we need to realign those unaligned reads
̶ The IGV genome browser allows us to get a detailed look at the coverage of a given region.
http://software.broadinstitute.org/software/igv/

Alternative step 2: Genome/transcript (de novo) assembly
̶ When the reference sequence does not exist
• Alignment is dependent on the existence of a reference sequence.
• However – sometimes this reference does not exist! – de novo genome assembly – we practically need to create the reference genome.
• The assembly is sometimes preferred in order to identify large structural rearrangements even when the reference genome is known. In transcriptomics we can use it to detect alternative splicing events.
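As a toy illustration of the overlap idea behind de novo assembly (real assemblers use far more sophisticated approaches, e.g. de Bruijn graphs; the reads below are made-up examples), a greedy merge of the most overlapping reads might look like this:

```python
# Toy illustration of greedy overlap-based assembly; not a real assembler.

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b."""
    for l in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:l]):
            return l
    return 0

def greedy_assemble(reads, min_len=3):
    reads = list(reads)
    while len(reads) > 1:
        # find the pair of reads with the largest suffix-prefix overlap
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    l = overlap(a, b, min_len)
                    if l > best[0]:
                        best = (l, i, j)
        l, i, j = best
        if l == 0:          # no more overlaps -> several contigs remain
            break
        merged = reads[i] + reads[j][l:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["ATGCGTAC", "CGTACGGA", "ACGGATTT"]))
# -> ['ATGCGTACGGATTT']
```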
Step 4: Feature detection (quantification)
̶ Creates the final table with read counts for further statistical analyses
• The feature of interest differs based on the experiment:
  • gene, exon, intron… (WGS, WES)
  • transcript, isoform (RNA-seq)
  • variant – SNP, insertion, deletion, CNV (WGS, WES, targeted sequencing)
  • promoter sequence (ChIP-Seq)
• In transcriptomic NGS experiments, the emphasis is on the quantification of known transcripts (unless the aim is to find new isoforms) – we quantify the abundance of the RNA.
• In genomic NGS experiments, the emphasis is more on the detection of structural changes (the quantification is the % of alternative alleles found).
• The final output of this step is always a matrix with:
  • Information about the feature (ID, name, variant…)
  • Quantification of this feature in each of the samples

Step 5: Statistical data analysis
̶ The final matrix is the input to four main analysis types:
• Group comparison (between groups of samples or groups of features)
  • Differential gene expression / splicing
  • Differential variant detection
• Group discovery (within samples or features)
  • Clustering of patients into unknown subtypes based on their sequencing profiles
  • Searching for genes with similar expression
• Group prediction (usually for samples)
  • Finding genes for diagnosis…
• Special analyses: pathway analysis, construction of gene networks, survival analysis, …

To remember:
• Bioinformatics (and especially sequencing bioinformatics) is a very young field
• No good books, no standards, nothing lasts forever, … almost everything becomes old and outdated quickly!
• Bioinformaticians always have to be looking for new methods, tools and algorithms, … just as wet-lab people must search for novel methods that decrease bias, are faster, or require less input material, …
• Garbage in – garbage out
• If you do not understand the whole process, you do not know what the results mean

Some important terms

Sequencing coverage
Coverage in DNA sequencing is the number of unique reads that include a given nucleotide in the reconstructed sequence.

Depth of coverage (coverage depth / mapping depth)
How strongly is the genome "covered" by sequenced fragments (short reads)?
Per-base coverage is the average number of times a base of a genome is sequenced (in other words, how many reads cover it).
The coverage depth of a genome is calculated as the number of bases of all short reads that match the genome, divided by the length of that genome. It is often expressed as 1X, 2X, 3X, … (1, 2, or 3 times coverage).

Average coverage of the genome (Av):
Av = (N × L) / G
G – length of the original genome
N – number of reads
L – average read length

Breadth of coverage (covered length)
What proportion of the genome is "covered" by short reads? Are there regions that are not covered, not even by a single read?
Breadth of coverage is the percentage of bases of a reference genome that are covered at a certain depth. For example: "90% of a genome is covered at 1X depth, and 70% is still covered at 5X depth."
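A minimal sketch of these coverage formulas with made-up numbers (the per-base depth vector below is a hypothetical toy example):

```python
# Minimal sketch of the coverage formulas above, with made-up numbers.

G = 3_000_000_000        # genome length (bp), roughly a human genome
N = 600_000_000          # number of reads
L = 100                  # average read length (bp)

average_coverage = (N * L) / G          # Av = (N x L) / G
print(f"average (depth of) coverage: {average_coverage:.1f}X")   # 20.0X

# Breadth of coverage: fraction of bases covered at >= min_depth,
# computed here from a made-up per-base depth vector of a tiny "genome".
per_base_depth = [0, 1, 3, 5, 5, 2, 0, 4, 4, 1]
min_depth = 1
breadth = sum(d >= min_depth for d in per_base_depth) / len(per_base_depth)
print(f"breadth of coverage at {min_depth}X: {breadth:.0%}")      # 80%
```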
Single or paired-end?

Single-end sequencing
• Pros: fast, cheap
• Cons: limited use
• Usage: usually sufficient for studies looking to detect counts rather than structural changes, such as RNA-Seq or ChIP-Seq
[Figure: single-end reads from DNA fragments mapped to the genome]

Paired-end sequencing
• Pros: greater accuracy; double the number of reads per sample in one run (higher capacity) for less than the cost of two sequencing runs
• Cons: slower, (relatively) more expensive
• Usage:
  • de novo genome assembly
  • analysis of structural changes (deletions, insertions, inversions) and SNPs
  • study of splicing variants
  • epigenetic modifications (methylation)
[Figure: paired-end reads (R1/R2) from DNA fragments mapped to the genome]

Read length
• Longer reads provide more precise information about the relative positions of the bases in the genome, but they are more expensive than shorter ones.
• 50–75 cycles are typically sufficient for simple mapping of reads to a reference genome and for quantification experiments, e.g. gene expression (RNA-Seq)
• Read lengths greater than or equal to 100 are typically chosen for genome or transcriptome studies that require greater precision
• The exact read length depends on the length of the inserts!!!
[Figure: R1/R2 reads on short vs. long DNA fragments]
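A small sketch (made-up numbers) of how paired-end read length interacts with insert length: if the insert is shorter than twice the read length, the two mates overlap in the middle, and if it is shorter than a single read length, the reads run into the adapter sequence.

```python
# Minimal sketch, made-up numbers: relation between read length and insert length.

def mate_overlap(read_length, insert_length):
    """Number of bases in which R1 and R2 overlap (0 if they do not)."""
    return max(0, 2 * read_length - insert_length)

for insert in (500, 300, 150, 120):
    print(f"2 x 100 bp reads, {insert} bp insert -> "
          f"{mate_overlap(100, insert)} bp mate overlap")
```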