Bi5444 Analysis of sequencing data
Introduction to NGS pipeline
Eva Budinska
Aims for today
• Introduce the general NGS analysis pipeline and touch (almost) all parts of analysis in
order to get the general idea
• Explain how the raw read files are created and what is their format
What we learned so far about NGS data analysis…
1. There are tens to hundreds of different algorithms/SW solutions available for
analysis of NGS data
2. There is no “one and the best” way to perform an analysis. The final
selection/pipeline you use largely depends on:
• Your experiment hypothesis and sample type
• The latest review of similar methods for the analysis step you just read
• The accessibility and comprehensibility of the algorithm/SW solution (in other
words, never use something you do not understand!)
• The compatibility of inputs/outputs between algorithms from different steps
….
3. There is NO tool, that can perform every analysis from the very beginning to the end >
like in any other analysis
4. Most of the tools are command-line based – many work the best under Linux or
MacOS environments
5. Windows is the worst environment you can use
6. The tools are written in many different languages: Python, Java, Perl, C++, R…
7. You do not need to become expert in programming in these languages, however, you
need to understand how the tools are installed and used
The NGS
analysis
pipeline
Step 0:
base calling
(image analysis) +
base quality
control
Sequencing by
Synthesis Fluorescently
labeled
Nucleotides (Illumina)
• During the process, clustersof same sequences are created
Step 0: base calling
(image analysis)
• The identityof each base of a cluster is read off from
sequential images
• One cycle -> one image
Flow-cell
imaging
Getting the
sequences from
clusters
• Illuminapipeline
Firecrest(image analysis)
Locatesclusters and calculates intensity and noise
Bustard (base calling)
Deconvolutessignaland correctsfor cross-talk, phasing
Image analysis data output
• 100 tiles per lane, 8 lanes per flow cell, 36 cycles
• 4 images (A,G,C,T) per tile per cycle = 115,200 images
• Each tiff image is ~ 7 MB = 806,400 MB of data
• 1.6 TB per 70 nt read, 3.2 TB for 70 nt paired-end read
• Most technologies are erasing intensities as they are sequencing, because of a too
high amount of data
Step 0:
base calling
(image analysis)
+ base quality
control
Base call quality
control
• Quality control (QC) of each base
call is automatically performed by
the sequencing platform
• In other words: Foreach letter in a
read, we estimatethe probability
of it being erroneous(P).
• QC per base is specialized for each
platform – each platform must
solve challenges unique to the
underlying sequencing technology
Alternative base
calling
algorithms
• Multiple algorithms were proposed reporting
improvements in sequence quality with respect to
the manufacturer's algorithms
• See some reviews:
• Cacho, Ashley & Smirnova, Ekaterina &
Huzurbazar, Snehalata & Cui, Xinping. (2015).
A Comparison of Base-calling Algorithms for
Illumina Sequencing Technology. Briefings in
bioinformatics. 17. 10.1093/bib/bbv088.
• Ledergerber, Christian & Dessimoz,
Christophe. (2011). Base-calling for nextgeneration
sequencing platforms. Briefings in
bioinformatics. 12. 489-97.
10.1093/bib/bbq077.
The PHRED score
Qphred = - 10 x log10P(error)
• The Phred qualityscore is the negative ratioof the error probabilityto the reference level of P = 1 expressed
in Decibel (dB).
• The error estimate is based on statistical model providing measure of certainty of each base call
in addition to the nucleotide itself
• These statistical models base their error estimate on:
• Signal intensities from the recorded image
• Number of the sequencing cycle
• Distance to other sequence colonies
• Phred score is recoded using ASCII in fastq file
Phredscore Probability of
incorrect base
call
Base call
10 1 in 10 90%
20 1 in 100 99%
30 1 in 1000 99.9%
40 1 in 10 000 99.99%
50 1 in 100 000 99.999%
60 1 in 1 000 000 99.9999%
Phred score encoding in
ASCII
https://en.wikipedia.org/wiki/FASTQ_format
FASTA and
FASTQ formats
• The reads obtained from the sequencer
are typically stored in fasta (just the
sequences) or fastq (sequences + QC
measure) format files.
• For paired-end reads, we usually obtain
two files.
• Reads are not generally grouped by
strand, only by the order in which they
were sequenced.
FASTA format
• General format to represent sequences
• Two linesper sequence (read)
• ID line (startingwith >)
• Sequence line
• Typical file extension: .fa or .fasta
• HWI-ST132 - unique instrument name
• 633 - run ID
• D17U2ACXX - flowcell ID
• 8 - flowcell lane
• 1101 - tile number within lane
• 14830 - x-coordinateof cluster within tile
• 2376 - y-coordinateof cluster within tile
• 1 - member of pair(1 or 2). Older versions: /1 and /2
• Y/N - whether the read failed qualitycontrol (Y = bad)
• 0 - none of the controlbits are on
• CATGCA - indexsequence (barcode)
FASTQ format
• Combines sequence and base call quality information.
• Typical file extension:.fastq
• Four lines per sequence (read):
• ID (starting with @)
• Sequence line
• Another ID line (starting with +)
• Base qualities(one for each letter in the sequence)
Step 1:
Read quality
control and
data filtering
Step 1: Read quality
control and data
filtering
• Uses the output file with information about the quality of base
calls (.fastq)
• Firststep in the pipeline that deals with actual sequencingdata
in base or color space
• Several metrics are evaluated, not all of them use the Phred score information:
• Distribution of quality scores at each sequence, Sequence composition, Per-sequence and
per-read distribution of GC content, Library complexity, Overrepresented sequences
• Initial overview – already in base calling SW
• More quality overview – SW solutions SolexaQA, FastQC
Step 1: Read quality
control and data
filtering
• Based on the quality measures, we decide to remove low
quality bases and reads
• Trimming – removes low quality or unwanted bases from reads, thus shortening
them. Is applied to increase the number of mappable reads.
• Filtering – removes whole reads that do not meet quality standards (e.g. too
short etc)
Step 2:
Alignment
(mapping)
Step 2: Alignment (mapping)
• To know, where the short reads (in
our filtered .fastq file) come from
(which part of the genome or
transcriptome do they represent)
they need to be (in most instances)
aligned to a referencesequence
Reference
sequence
• The reference sequence can be a genome,
a transcriptome or a collection of specific
sequences.
• Typically, the reference sequence(s) is given
in a .fa or .fasta file
• An alternative is the GTF (gene transfer
format) - stores gene structure
• BED format (designed for annotation tracks
in genomic browsers)
(we will learn about where to get the reference
sequences in one of the next lectures)
Step 2: Alignment
(mapping)
• Intuitively an easy task
• However, trying all the possible options (alignments), is very time consuming!
• Efficient algorithms (aligners) exist
• The result of mapping is stored by many algorithms in the Sequence alignment/map (SAM)
format
• We will talk about mapping a in one of the future lectures
Step 3:
Post-alignment
QC and
visualization
Step 3: PostalignmentQC
and
visualization
• Necessary in order to see the efficiency of the alignment.
• During the alignment, not all the reads are aligned – but what proportion?
• If they were aligned – are there any errors?
• How well is the reference genome covered?
• Important in determining whether:
• we can proceed with the analysis or some pre-processing needs to be done
• we need to possibly redo the alignment
• or we need to realign those unaligned reads
Step 3: Postalignment
QC
and
visualization
Allows us to get a detailed look on
the coverage of a given region.
http://software.broadinstitute.org/software/igv/
IGV genome browser
Alternative
step 2:
Genome/transcript
(de-novo) assembly
Alternative step 2:
Genome/transcript
(de-novo)assembly
• When the reference sequence does not exist
• Alignment is dependent on the existence of reference sequence.
• However – sometimes this reference does not exist! – de novo genome assembly – we need to
practically create the reference genome.
• The assembly is sometimes preferred in order to identify large structural rearrangements even
when reference genome is known. In transcriptomics we can use it to detect alternative
splicing events
Step 4:
Feature
detection
(quantification)
Step 4: Feature
detection
(quantification)
• Creates the final table with read counts for further statistical
analyses
• A feature of interest differs based on the experiment:
• gene, exon, intron… (WGS, WES)
• transcript, isoform (RNA-seq)
• variant - SNP, insertion, deletion, CNV - (WGS, WES, targeted sequencing)
• promotor sequence (ChIP-Seq)
• In transcriptomics NGS experiments, the emphasis is on quantification of known transcripts
(unless the aim is to get new isoforms) – we quantify the abundanceof the RNA.
• In genomic NGS experiments, the emphasis is more on the detection of structural changes
(the quantification is the % of alternative alleles found).
Step 4: Feature
detection
(quantification)
• Creates the final table with read counts for further statistical
analyses
• The final output of this step is always a matrix with:
• Information about the feature (ID, name, variant…)
• Quantification of this feature in each of the samples
Step 5:
Statistical data
analysis
Step 5: Statistical
dataanalysis
• The final matrix is input to four main analysis types:
Group comparison(between
groups of samples or groups of
features)
•Differential gene expression /
splicing
•Differential variantsdetection
Group discovery (within samples
or features)
•Clustering of patientsinto
unknown subtypes based on
their sequencing profiles
•Searching for genes with similar
expression
Group prediction (usuallyfor
samples)
•Finding genes for diagnosis…
Special analyses: pathwayanalysis,construction of gene networks, analysisof survival,…
Analyzing and writing
the code
You cannot NGSanalyze without scripting(writingofcommands)and keeping track of it !
Why scripting and keeping track?
1. Reproducibility (you or anyone else must be able to reproduce your analysis
step by step)
2. Time saving (if something in your data changes, you can simply run all the
scripts again on new dataset)
3. No one-size-fits-all solutions (i. no program can cover all the possible
combinations of tools; ii. it is easier to change something in the existing script
than write it all over again)
4. Batch-execution of commands (high-performance cloud and cluster
computing requires commands in batches)
Examples of scripts for different analysis steps
• Quality control (using prinseq)
$ perl prinseq-lite.pl -fastq file1.fastq -graph_data file1.gd -out_good null -out_bad
null
• Alignment (using bwa)
$ bwa sampe -P hg19.fa file1.sai file2.sai \ file1.fastq file2.fastq > file_bwa.sam
• Variant calling
$ java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
-R hg19.fa \
-I file1.bam -I file2.bam -I file3.bam -I file4.bam \
-stand_call_conf 30 -stand_emit_conf 10 \
-o output.raw.snps.indels.vcf
Conversion, conversion, conversion
• … be prepared for never-ending format conversions …
(wrong format of input file is usually one of the most common reasons of errors)
• SAM to BAM,
• BAM to SAM,
• sorted SAM to BAM,
• BAM to sorted SAM,
• BAM to indexed BAM,
• aligned, realigned, indexed, ….
Conversion, conversion, conversion
• … be prepared for never-ending format conversions …
(wrong format of input file is usually one of the most common reasons of errors)
• SAM to BAM,
• BAM to SAM,
• sorted SAM to BAM,
• BAM to sorted SAM,
• BAM to indexed BAM,
• aligned, realigned, indexed, ….
Small first
example
1. Download the toy example .fastq file http://www.ebi.ac.uk/ena/data/view/SRR014849
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR014/SRR014849/SRR014849_1.fastq.gz
2. Unzip the file
$ gunzip SRR014849_1.fastq.gz
3. See the header of the file:
$ head SRR014849_1.fastq
4. Calculate total number of lines
$ wc -l SRR014849_1.fastq
5. Calculate total number of reads
$ wc -l SRR014849_1.fastq | awk '{print $1/4}'