E5444 Analysis of sequencing data
Feature quantification
Autumn 2023
Eva Budinska
The NGS
analysis
pipeline
Step 4:
Feature
detection
(quantification)
Once we finish the alignment, we can continue
with the quantification of the features
Important – there are important differences within this and previous steps
(alignment) in case of targeted gene sequencing experiments or metagenomics.
These will be discussed more in detail later in separate (focused) lectures.
Step 4: Feature
detection
(quantification)
• Creates the final table with read counts for further statistical
analyses
• A feature of interest differs based on the experiment:
• gene, exon, intron… (WGS, WES)
• transcript, isoform (RNA-seq)
• variant - SNP, insertion, deletion, CNV - (WGS, WES, targeted sequencing)
• promotor sequence (ChIP-Seq)
• In transcriptomics NGS experiments, the emphasis is on quantification of known transcripts
(unless the aim is to get new isoforms) – we quantify the abundanceof the RNA.
• In genomic NGS experiments, the emphasis is more on the detection of structural changes
(the quantification is the % of alternative alleles found).
Step 4: Feature
detection
(quantification)
• Creates the final table with read counts for further statistical
analyses
• The final output of this step is always a matrix with:
• Information about the feature (ID, name, variant…) - annotation
• Quantification of this feature in each of the samples
Featureannotation • Gathering all the information about the feature
• Based on the feature type, we are using different information in the annotation files
• RNAseq – ID and name of the gene/transcript, positionon the chromosome…
• Variants - specific format.vcf - includes referenceallele, variant, annotation, ….
• Meteagenomics– taxonomical assignment of the OTUs (ASVs)
• We are using GTF/GFF files (remember from the previous alignment lecture)
• List of genes,transcripts, exons, introns, CDS, …
• We are using known databases for feature annotation
• Important! Feature annotation can change using different versions of dbs!
GTF/GFFfiles
• General feature format (.gff) or Gene transfer format (.gtf)
Extensions:
• Gtf/.gff2/.gff3
• GTF files can describe a varietyof genomicfeatures,such as genes,
transcripts,exons,introns,and more.Each feature is represented as a
separate line in the GTF file, with the relevant attributes specified.
• We need information about what and where it is located in a reference
genome/transcriptome
• This information is usually stored in an annotation file
• Contains information about each feature and its location
• They are very similar but differ in the “strictness” and/or syntax
• Is often used for gene expression counting, localization of genes, etc.
• There is also .gff1 but it is very rare
GTF/GFFfiles
• General feature format (.gff) or Gene transfer format (.gtf)
Extensions:
• gtf/.gff2/.gff3
• Structure: A GTF file is a tab-delimited text file with various columns that provide information
about genomic features. The minimum required columns typically include the following:
• Chromosome/contig name
• Source (the program or database that generated the annotation)
• Feature type (e.g., gene, transcript, exon, etc.)
• Start position (the beginning of the feature)
• End position (the end of the feature)
• Strand (either "+" for the positive strand or "-" for the negative strand)
• Attributes (a set of key-value pairs describing additional information about the feature, including gene and transcript
identifiers)
Positivevs negative
strand
• There are several conventions for labeling the two strands of a
piece of DNA depending on the frame of reference.
• In the human genome, the National Center
for Biotechnology Information (NCBI) website
uses the term "positive strand" to refer
specifically to the strand whose 5' end begins
at the end of the p arm. Basepair numbering
starts at the 5' end of this strand.
https://researchguides.library.vanderbilt.edu/c.php?g=69346&p=816436
GTF/GFFfiles-
example
• General feature format (.gff) or Gene transfer format (.gtf)
Extensions:
• gtf/.gff2/.gff3
• The files are usually TAB delimited and 1 numbering based
• .gff2–Sanger Institute http://www.sanger.ac.uk/resources/software/gff/spec.html#t_2
• .gtf – Modification of .gff2, sometimes called gff2.5 http://mblab.wustl.edu/GTF22.html
• .gff3 – Sequence Ontology Project http://www.sequenceontology.org/gff3.shtml
GTF/GFFfiles–
comparisonof
differentformats
• General feature format (.gff) or Gene transfer format (.gtf)
Extensions:
• gtf/.gff2/.gff3
.gff2
.gtf
.gff3
GTF/GFFfiles–
comparisonof
differentformats
• General feature format (.gff) or Gene transfer format (.gtf)
Extensions:
• gtf/.gff2/.gff3
• GFF3 is preferred to GFF2, its newer and comprises hierarchical
structure
GTF/GFFfiles– the
hierarchy
• Suppose you havea gene, which is a parent feature.
• This gene may have multiple child features, such as exons and introns.
• Exons can further contain subfeatures,like codingsequences (CDS) and
untranslated regions (UTRs).
• In GFF3, this hierarchical relationshipis represented usingthe "Parent"
and "ID" attributes in the attributes column.
This hierarchical structure is useful for accurately representing the organization of genes and
their constituent elements in a genome, which is crucial for various bioinformatics analyses,
including gene prediction, functional annotation, and visualization of genomic data.
Where to get GTF
files
!IMPORTANT - GTF files downloaded from the UCSC Table Browser have the
same entries for gene id and transcript id.
(not suitable for analyses of different isoforms)
To get the correct formatting, select in the output format field “all fields from
selected table“ and apply UCSC tool genePredToGtf on the resulting file.
Downloadingfrom UCSC Genome Table Browser
(https://genome.ucsc.edu/cgi-bin/hgTables)
BED files
• Could be considered as another type of annotation file but is much simpler
• Most often used for visualization in genomic browsers (UCSC Genome Browser:
https://genome.ucsc.edu/, IGV https://www.broadinstitute.org/igv/, Tablet
https://ics.hutton.ac.uk/tablet/, …)
• Very often as a results of ChIP-Seq or similar experiments
• Used for targeted experiments to import target regions
IMPORTANT – in comparison with .gft/.gff2/.gff3 is zero-based
• Has only three mandatory columns – chromosome, start, end
• Everything else is optional
(gtf/.gff2/.gff3 has 8/9 mandatory columns)
• Browser Extensible Data (.bed)
BED file example
• Browser Extensible Data (.bed)
https://genome.ucsc.edu/FAQ/FAQformat.html#format1
VCF/BCF files
• VariantCall Format (.vcf)
• Binary Call Format (.bcf) is compressed version of .vcf
• Used in targeted sequencing while calling SNPs/Indels/…
• Contains information about “genetic variations”
• Officially: It contains meta-information lines, a header line, and then data lines each
containing information about a position in the genome. The format also has the ability to
contain genotype information on samples for each position.
• Defined format but again not every tool provide all “necessary” fields in the output
• Several versions, current at 4.3
• http://www.1000genomes.org/node/101
• http://samtools.github.io/hts-specs/VCFv4.3.pdf
Feature
quantification
• Counting the numbers of reads that aligned to the feature
• The concept seems simple – lets have a look where the read mapped and then
count all the reads within the feature
• For gene-level quantifications, 2 possibilities:
1. Directly count reads overlapping with the gene loci
2. In the case of transcriptomics, we can quantify on the level of transcripts and then
aggregatethe values per gene
• But, what about multimapped reads – this is a problem especially for isoforms
in transcriptome sequencing!
Feature
quantification – the
mostusedSW
• GenomicRanges, IRanges- packages developed by core team of the Bioconductor project (R
language) - include functions for counting reads that overlap genomic features
• HTSeq-count - function of the HT-Seq Python framework for processing RNA-seq or DNAseq
data
• BEDTools - a popular tool for finding overlaps between genomic features that can be used to
count overlaps between reads and features, in C++, much faster, but not specifically designed
for RNA-seq data, so can count reads for exons or interval features only, similar
to countOverlaps.
• featureCounts – optimized read count program, fast and flexible, used to quantify reads
generated from either RNA or DNA sequencing technologies in terms of any type of genomic
feature
• available either as a Unix command or as a function in the R package Rsubread (the core
coded in C)
Usually count uniquely mapped reads and relies on
counting scheme
Available in different languages (R, Python, C++, …)
Important / The SW (usually) evolves over time and gain
more functions and get faster
Feature
quantification –
selectingthe SW
• When counting reads, make sure you know how the program handles the
following:
• overlap size (full read vs. partial overlap);
• multi-mapping reads, i.e. reads with multiple hits in the genome;
• reads overlapping multiple genomic features of the same kind;
• reads overlapping introns
• The gene quantification will be strongly affected by the underlying gene models
that are usually supplied to the quantification programs via GTF or BED(-like)
HTSeq gene-based
counting schemes
http://www-
huber.embl.de/users/anders/
HTSeq/doc/count.html
Three different modes to tune its
behavior with respect to:
- the multimapping
- the gaps
featureCounts
Also allows to count reads overlapping with individual exons.
If an exon is part of more than one isoform in the annotation file,
featureCounts will return the read counts for the same exon multiple times (n
= number of transcripts with that exon).
If we want to assess differential expression of exons, it is highly recommended to
create an annotation file where overlapping exons of different isoforms are split
into artificially disjoint bins before applying featureCounts (e.g. using
dexseq_prepare_annotation.py script of the DEXSeq package)
Isoform mapping
• Simple count-based approaches tend to ignore reads that overlap with more than one feature
• This is a problem if the aim is to quantify different isoforms (multiple isoform transcripts of the
same gene may overlap)
• Special SW: Cufflinks, eXpress, DEXseq, RSEM
• RSEM is the one that tends to perform best in most comparisons and the statistical
interpretations and assumptions to handle transcript structures have been widely adopted.
Quantifying gene
vs transcript
abundance
https://www.semanticscholar.org/paper/Differential-analysis-
of-gene-regulation-at-with-Trapnell-
Hendrickson/55507aba54a92b31a200e8112f088f5c356ed7bc
Cuffdiff SW
The alignment free
methods for transcript
quantification!
• “Special” counting – RSEM; (Salmon, Kallisto)
• Mapping to transcriptome and dividing multi mapped reads between all isoforms
• Expression by transcripts summarized to gene expression
• https://f1000research.com/articles/4-1521/v2
Alignment is the most computationally expensive step in
the whole pipeline.
But what if we skipped this step??
Threemainstepsin the
alignmentfree
transcript
quantification!
1. The sequences for comparison (reads,
reference) are sliced up into collections of
unique (!) k-mers of a given length k.
2. For each pairwise comparison, we count
the number of times a specic k-mer
appears in both sequence strings that are
being compared.
3. To assess the similarity between the two
strings, some sort of distance function is
employed (Euclidean distance; identical
sequences have a distance of zero)
Zielezinski A, Vinga S, Almeida J, and Karlowski WM. Alignment-free sequence comparison:
Benefits,applications, and tools. Genome Biology, 2017. doi:10.1186/s13059-017-1319-7.
The alignment-free
transcript
quantification!
• In practice, Salmon and Kallisto will first generate an index of k-mers from
all known transcript sequences.
• These transcript k-mers will then be compared with the k-mers of the
sequenced reads, yielding a pseudoalignment that describes how many kmers
a read shares with a set of compatible transcripts (based on the
distances)
• By grouping all pseudoalignments that belong to the same set of
transcripts, they can then estimate the expression level of each transcript
model.
The alignment-free
transcript
quantification!
• The pseudoaligners rely absolutely on a precise and comprehensive transcript annotation.
• If a sequenced fragment originates from an intron or an unannotated transcript, it can be falsely
mapped to a transcript since the relevant genomic sequence is not available.
• Alignment-based tools will discard reads if their edit distance becomes too large,
pseudoalignment currently does not entail a comparable scoring system to validate the
compatibility; therefore there is no safeguard againstspuriousalignments.
• For example, that a 100-bp-read can pseudoalign with a transcript with which it shares only a
single k-mer { if no better match can be found within the universe of the pre-generated cDNA
index.)
The speed is increased but to a
cost!
Comparison of
different algorithms
for counting features
in RNAseq data
Feature featureCounts HTSeq-count Cufflinks eXpress DEXSeq RSEM
Developed by Bioconductorteam Simon Anders Cole Trapnell's Lab
Lior Pachter's
Lab
Simon Anders'Lab Bo Li's Lab
Compatibility
SAM, BAM, CRAM
formats
Primarily works with SAM
SAM, BAM, GTF
formats
SAM, BAM
formats
BAM, GFF formats
SAM, BAM
formats
Parallelization
Supports parallel
processing
Does not support
parallelization
Supports parallel
processing
Supports parallel
processing
Supports parallel
processing
Supports parallel
processing
Annotation
Formats
GTF, GFF, SAF
formats
Requires GFF featuretype
format
GTF format GTF format GFF format GFF format
Output Format
Tab-delimited text
file with counts
Tab-delimited textfilewith
counts
Various files and
tables
Various files
Tab-delimited text
files and tables
Various files and
tables
Speed
Known for speed
and efficiency
Efficient but may be
slower for largedatasets
Slightly slower for
large datasets
Known for speed
and efficiency
Slightly slower for
large datasets
Known for speed
and efficiency
Handling Options
Various options for
read counting,
handling multimapped
reads, and
strandedness
Options for secondary
alignments, ambiguous
reads, andstranded data
Performs transcript
assembly and
quantification
Efficient and
accurate
quantification
Differentialexon
usage analysis
Accurate
quantification,
support for
alternative
isoforms
Statistical Methods
Employs a counting
algorithmthat
considers multimapping
reads
Employs a counting
algorithmfor read
assignment
Utilizes a
likelihood-based
method for
transcript
quantification
Utilizes a
Bayesian
framework for
quantification
Employs statistical
models for
identifying
differentially
expressed exons
Utilizes an
Expectation-
Maximization
(EM) algorithm
for
quantification
Supports Isoform
Counts
No No Yes Yes No Yes
Additional
Features
Flexibleannotation
format compatibility
Primarily designedfor
SAMfiles and GFF feature
types
Transcript assembly
and differential
expression analysis
Speed and
memory
efficiency
Differentialexon
usage analysis
Support for
estimating
isoform
expression
Documentation
Well-documented
and actively
maintained
Well-documented and
widely used
Comprehensive
documentation
Documentation
available
Documentation
available
Documentation
available
Generated with help of ChatGPT