FeatureCounts
Ashebir Gogile
Uco540613
Feature Quantification
• htseq-count
• featureCounts
• BEDTools
• mmquant
• NGS technologies generate millions of short
sequence reads,
which are usually aligned to a reference genome.
• For many downstream analyses, the key quantity is the number of
reads mapping to each genomic feature (e.g. exons, genes).
• The process of counting reads is called read
summarization.
• featureCounts:
✓ an ultrafast and accurate read summarization
program
✓ requires far less computer memory than
existing methods.
• a highly efficient general-purpose read summarization
program that counts mapped reads for genomic features
• It can be used to count both RNA-seq and genomic DNA-seq
reads (SAM/BAM files)
• It works with either single-end or paired-end reads
featureCounts
• uses genomic annotation in GTF or SAF format
to count reads for genomic features (e.g. exons) and
meta-features (e.g. genes).
• For differential gene expression analysis, it is
convenient to have the counts for all samples in
a single file (gene count matrix).
• To get a gene count matrix file, run featureCounts on all
mapped files at once.
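A minimal Python sketch of running featureCounts on several mapped files at once. The file names are hypothetical, and only commonly documented options are used (`-a` annotation, `-o` output, `-T` threads, `-p` paired-end input); option behaviour can differ between Subread versions, so check `featureCounts` help before relying on it.

```python
import shlex

def build_featurecounts_command(annotation, output, bam_files,
                                paired_end=False, threads=1):
    """Return the argument list for one featureCounts run over all files."""
    if not bam_files:
        raise ValueError("at least one SAM/BAM file is required")
    cmd = [
        "featureCounts",
        "-a", annotation,      # annotation file (GTF is the default format)
        "-o", output,          # output count table
        "-T", str(threads),    # number of threads
    ]
    if paired_end:
        cmd.append("-p")       # input contains paired-end reads
    cmd.extend(bam_files)      # each input file becomes one count column
    return cmd

# Hypothetical file names, for illustration only.
cmd = build_featurecounts_command(
    "genes.gtf", "counts.txt", ["s1.bam", "s2.bam", "s3.bam"]
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing all files in one call gives one table with a count column per sample, which is the gene count matrix described above.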
• But when you run featureCounts on large
samples individually, the counts for each
sample end up in a separate text file.
• To get the merged gene count matrix from all
individual counts files, you can use bioinfokit
v2.0.5
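The merge step itself can be sketched with Python's standard library. This is not the bioinfokit API; it is an illustrative stand-in that assumes each file comes from a single-sample featureCounts run (a `#` comment line, a header starting with `Geneid`, and the count in the last column). Sample names and counts below are made up.

```python
import csv
import io

def read_counts(handle):
    """Read one single-sample featureCounts table into {gene_id: count}."""
    lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(lines, delimiter="\t")
    header = next(rows)                 # Geneid, Chr, Start, End, Strand, Length, <sample>
    gene_col = header.index("Geneid")
    return {row[gene_col]: int(row[-1]) for row in rows if row}

def merge_counts(tables):
    """Merge {sample: {gene: count}} into a header and rows; missing genes get 0."""
    samples = list(tables)
    genes = sorted({gene for counts in tables.values() for gene in counts})
    header = ["Geneid"] + samples
    rows = [[gene] + [tables[s].get(gene, 0) for s in samples] for gene in genes]
    return header, rows

# Hypothetical in-memory files standing in for two per-sample outputs.
s1 = io.StringIO(
    "# Program:featureCounts (hypothetical comment line)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\ts1.bam\n"
    "geneA\tchr1\t1\t100\t+\t100\t12\n"
    "geneB\tchr1\t200\t300\t-\t101\t0\n"
)
s2 = io.StringIO(
    "# Program:featureCounts (hypothetical comment line)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\ts2.bam\n"
    "geneA\tchr1\t1\t100\t+\t100\t7\n"
    "geneB\tchr1\t200\t300\t-\t101\t3\n"
)
tables = {"s1": read_counts(s1), "s2": read_counts(s2)}
header, rows = merge_counts(tables)
```

The resulting header and rows can be written out with `csv.writer` to produce a single matrix file.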
Input and output
Inputs
• takes as input Sequence Alignment/Map (SAM) or Binary
Alignment/Map (BAM) files, and
• an annotation file including the chromosomal coordinates
of features.
• The annotation file should be in either GTF
format or a simplified annotation format (SAF),
as shown below:
GeneID   Chr    Start     End       Strand
497097   chr1   3204563   3207049   -
497097   chr1   3411783   3411982   -
497097   chr1   3660633   3661579   -
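A short sketch of reading SAF rows into per-gene feature lists, which mirrors the feature (row) versus meta-feature (GeneID) distinction. The `Feature` tuple and `read_saf` function are hypothetical names; the strand of every row is assumed to be `-`, and coordinates are treated as 1-based and inclusive.

```python
import io
from collections import defaultdict, namedtuple

Feature = namedtuple("Feature", "gene_id chrom start end strand")

def read_saf(handle):
    """Group SAF rows by GeneID: {gene_id: [Feature, ...]}."""
    genes = defaultdict(list)
    handle.readline()                      # skip the header row
    for line in handle:
        fields = line.split()
        if len(fields) < 5:
            continue                       # ignore blank or incomplete rows
        gene_id, chrom, start, end, strand = fields[:5]
        genes[gene_id].append(
            Feature(gene_id, chrom, int(start), int(end), strand)
        )
    return dict(genes)

saf_text = (
    "GeneID\tChr\tStart\tEnd\tStrand\n"
    "497097\tchr1\t3204563\t3207049\t-\n"
    "497097\tchr1\t3411783\t3411982\t-\n"
    "497097\tchr1\t3660633\t3661579\t-\n"
)
genes = read_saf(io.StringIO(saf_text))

# Total annotated length of the meta-feature (sum over its features).
gene_length = sum(f.end - f.start + 1 for f in genes["497097"])
```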
Outputs
• the number of reads assigned to each feature
(or meta-feature).
• summary statistics for the overall summarization results,
including the number of successfully assigned reads and
the number of reads that failed to be assigned for various reasons.
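The summary statistics can be inspected with a few lines of Python. This sketch assumes a single-sample summary table with a header row followed by tab-separated `status<TAB>count` rows; the numbers are invented for illustration.

```python
import io

def read_summary(handle):
    """Return {status: read_count} from a single-sample summary table."""
    stats = {}
    next(handle)                           # header row: Status <tab> sample name
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            stats[fields[0]] = int(fields[1])
    return stats

def assigned_fraction(stats):
    """Fraction of all reads that were successfully assigned."""
    total = sum(stats.values())
    return stats.get("Assigned", 0) / total if total else 0.0

# Hypothetical summary content.
summary_text = (
    "Status\tsample1.bam\n"
    "Assigned\t800\n"
    "Unassigned_NoFeatures\t150\n"
    "Unassigned_Ambiguity\t50\n"
)
stats = read_summary(io.StringIO(summary_text))
fraction = assigned_fraction(stats)
```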
• After merging the individual count files, the output file
gene_matrix_count.csv is written to the same folder;
it contains the merged counts for all samples.
ALGORITHM
• Overlap of reads with features:
featureCounts takes account of any gaps (indels,
exon–exon junctions) that are found in the read.
• Multiple overlaps:
featureCounts gives users the option either to
exclude multi-overlap reads or to count them
for each feature that is overlapped.
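A toy illustration of both points, not the featureCounts implementation. A read is represented as the list of its aligned segments after splitting at gaps, and it overlaps a feature if any segment shares at least one base with it. The names `overlaps`, `assign_read` and `allow_multi_overlap` are hypothetical, and all coordinates are invented.

```python
def overlaps(segments, feature):
    """True if any aligned segment shares at least one base with the feature."""
    f_start, f_end = feature
    return any(s <= f_end and e >= f_start for s, e in segments)

def assign_read(segments, features, allow_multi_overlap=False):
    """Return the features a read is counted for.

    A read overlapping several features is dropped unless
    allow_multi_overlap is True, in which case it counts for each of them.
    """
    hits = [name for name, span in features.items() if overlaps(segments, span)]
    if len(hits) > 1 and not allow_multi_overlap:
        return []
    return hits

features = {"geneA": (100, 200), "geneB": (150, 300), "geneC": (500, 600)}

# Spliced read: the gap between 50 and 250 covers geneA, but no aligned
# base falls inside geneA, so only geneB is hit.
spliced = [(10, 50), (250, 260)]
single_hit = assign_read(spliced, features)

# Read whose segments touch three features.
multi = [(180, 199), (520, 540)]
excluded = assign_read(multi, features)
counted_for_each = assign_read(multi, features, allow_multi_overlap=True)
```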
• Chromosome hashing: a hash table is generated
for the reference sequence names, which allows fast
matching of reads and features by reference sequence.
• Genome bins and feature blocks: a two-level hierarchy
is created for each reference sequence.
The use of a hierarchical data structure (features within blocks
within bins) is a key component of the featureCounts algorithm.
• The query read is compared
first with genomic bins,
then with feature blocks within any overlapping bins,
and finally with features in any overlapping blocks.
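The search order above can be sketched in Python as follows. This is a simplified illustration of the idea (hash lookup by reference name, then bins, then blocks, then features), not the actual featureCounts code. The bin width, the block size of roughly the square root of the number of features in a bin, and every name here are assumptions of the sketch.

```python
import math
from collections import defaultdict

BIN_SIZE = 128_000  # illustrative bin width in bases (assumption)

def span(feats):
    """Smallest start and largest end over a group of (start, end, name)."""
    return min(f[0] for f in feats), max(f[1] for f in feats)

def hits(a, b):
    """True if closed intervals a and b share at least one position."""
    return a[0] <= b[1] and a[1] >= b[0]

def build_index(features):
    """features: (chrom, start, end, name) tuples.

    Returns {chrom: [(bin_span, [(block_span, [feature, ...]), ...]), ...]}.
    """
    bins = defaultdict(lambda: defaultdict(list))
    for chrom, start, end, name in sorted(features):   # sorted by start within chrom
        bins[chrom][start // BIN_SIZE].append((start, end, name))
    index = {}
    for chrom, chrom_bins in bins.items():
        index[chrom] = []
        for _, feats in sorted(chrom_bins.items()):
            size = max(1, math.isqrt(len(feats)))       # features per block
            blocks = [feats[i:i + size] for i in range(0, len(feats), size)]
            index[chrom].append((span(feats), [(span(b), b) for b in blocks]))
    return index

def query(index, chrom, start, end):
    """Names of features overlapping the read interval [start, end]."""
    read = (start, end)
    found = []
    for bin_span, blocks in index.get(chrom, []):       # hash lookup by name
        if not hits(read, bin_span):
            continue                                    # skip whole bin
        for block_span, feats in blocks:
            if not hits(read, block_span):
                continue                                # skip whole block
            found.extend(n for s, e, n in feats if hits(read, (s, e)))
    return found

# Invented features for illustration.
index = build_index([
    ("chr1", 100, 200, "e1"), ("chr1", 150, 300, "e2"),
    ("chr1", 500, 600, "e3"), ("chr1", 130_000, 130_500, "e4"),
    ("chr2", 100, 200, "e5"),
])
near_start = query(index, "chr1", 180, 190)
second_bin = query(index, "chr1", 130_100, 130_200)
```

Because whole bins and blocks are rejected with a single interval comparison, most features are never examined individually for a given read.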