Normalization methods
Barbora Zwinsová
September 2023
miRNA
Overview of the methods (miRNA)

Normalization method: CPM (counts per million)
  Description: counts scaled by total number of reads; the simplest form of normalization
  Accounted factors: sequencing depth
  Recommendations for use: gene count comparisons between replicates of the same sample group; NOT for within-sample comparisons or DE analysis

Normalization method: Total count scaling
  Description: after scaling each sample to its library size, the samples can be rescaled to a common value across all samples
  Accounted factors: sequencing depth and RNA composition

Normalization method: Upper quartile scaling
  Description: modified quantile-normalization method: the upper quartile of expressed miRNAs is used instead as a linear scaling factor
  Accounted factors: sequencing depth
  Recommendations for use: this method has been shown to yield better concordance with qPCR results than linear total count scaling for RNA-seq data

Normalization method: Trimmed mean of M values, TMM (edgeR)
  Description: calculates a linear scaling factor, d_i, for sample i, based on a weighted mean after trimming the data by log fold-changes (M) relative to a reference sample and by absolute intensity (A)
  Accounted factors: does not take into consideration the potentially different RNA composition across the samples
  Recommendations for use: gene count comparisons between and within samples and for DE analysis

Normalization method: DESeq2's median of ratios
  Description: counts divided by sample-specific size factors determined by the median ratio of gene counts relative to the geometric mean per gene
  Accounted factors: sequencing depth and RNA composition
  Recommendations for use: gene count comparisons between samples and for DE analysis; NOT for within-sample comparisons

Normalization method: Linear regression
  Description: assumes that the systematic bias is linearly dependent on the count abundance
  Reference: samples normalized to a baseline reference, which was defined as the median count of each element across the profiled samples

Normalization method: Cyclic loess (nonlinear regression)
  Reference: baseline reference

Normalization method: Quantile
  Description: non-scaling approach; forces the distribution of read counts in all samples across an experiment to be equivalent
  Assumptions: most targets are not differentially expressed and the true expression distribution is similar across all samples
Tam et al., 2015, Briefings in Bioinformatics
https://doi.org/10.1093/bib/bbv019
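The upper quartile scaling entry in the overview above can be illustrated with a minimal Python sketch. It is not the implementation used by Tam et al.: the sample names and counts are hypothetical toy data, the upper quartile is taken over non-zero ("expressed") counts only, and rescaling to the mean upper quartile across samples is an assumption made here so that the scaled values stay on a count-like scale.

```python
from statistics import mean, quantiles

# Hypothetical toy counts: one list of miRNA counts per sample.
samples = {
    "s1": [0, 10, 20, 30, 40],
    "s2": [0, 20, 40, 60, 80],
}

def upper_quartile(counts):
    """75th percentile of the expressed (non-zero) counts."""
    expressed = [c for c in counts if c > 0]
    return quantiles(expressed, n=4, method="inclusive")[2]

def uq_normalize(samples):
    """Scale each sample linearly so that all upper quartiles match."""
    uq = {name: upper_quartile(counts) for name, counts in samples.items()}
    # Assumption: rescale to the mean upper quartile across samples.
    target = mean(list(uq.values()))
    scaled = {
        name: [c * target / uq[name] for c in counts]
        for name, counts in samples.items()
    }
    return scaled, target

normalized, target = uq_normalize(samples)
for name, counts in normalized.items():
    print(name, [round(c, 2) for c in counts])
```

After scaling, every sample has the same upper quartile, while the shape of each sample's distribution is left unchanged because the factor is linear.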
Comparison of data distribution (Tam et al., 2015)
Variance comparisons (Tam et al., 2015)
Conclusion (miRNA) (Tam et al., 2015)
• simply adjusting miRNA counts to the sequencing depth is inadequate
• the distinct number of miRNAs identified in replicate samples may differ because of the
random sampling nature of the technology; normalizing to the library size ignores this.
• total count scaling introduces more variability by pushing all samples toward the same
distribution
• UQ, TMM, DESeq, cyclic loess and quantile normalization are highly similar
• quantile and cyclic loess normalization may be too aggressive because they force the
distribution of the samples to be the same; with these two methods, increased variability
was noted in the lower-abundance miRNAs compared with UQ- and TMM-normalized data
• Dillies et al. & Tam et al. support the use of TMM (and UQ) for the normalization of
miRNA count data
• Tam et al.: alignment with BWA allowing one mismatch across the entire read, and
normalization with UQ or TMM, lead to more accurate results in downstream analyses
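To make concrete what "forcing the distribution of the samples to be the same" means for quantile normalization, here is a minimal Python sketch. The data are hypothetical, and ties are broken by position rather than averaged, which is a simplification compared with established implementations.

```python
# Hypothetical toy counts: the same three features measured in two samples.
samples = {
    "s1": [5, 2, 3],
    "s2": [4, 1, 9],
}

def quantile_normalize(samples):
    """Replace each value by the mean of the values sharing its rank."""
    names = list(samples)
    n = len(samples[names[0]])
    sorted_cols = [sorted(samples[name]) for name in names]
    # Mean of the i-th smallest value across all samples.
    rank_means = [sum(col[i] for col in sorted_cols) / len(names) for i in range(n)]
    result = {}
    for name in names:
        # Feature indices ordered from smallest to largest count.
        order = sorted(range(n), key=lambda i: samples[name][i])
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = rank_means[rank]
        result[name] = normalized
    return result

qn = quantile_normalize(samples)
print(qn)
```

Every sample ends up with exactly the same set of values; only the assignment of values to features differs, which is why the method can be considered aggressive.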
Transcriptome
Overview of the methods (transcriptomics)

Normalization method: CPM (counts per million)
  Description: counts scaled by total number of reads; the simplest form of normalization
  Accounted factors: sequencing depth
  Recommendations for use: gene count comparisons between replicates of the same sample group; NOT for within-sample comparisons or DE analysis

Normalization method: TPM (transcripts per kilobase million)
  Description: counts per length of transcript (kb) per million reads mapped
  Accounted factors: sequencing depth and gene length
  Recommendations for use: gene count comparisons within a sample or between samples of the same sample group; NOT for DE analysis

Normalization method: RPKM/FPKM (reads/fragments per kilobase of exon per million reads/fragments mapped)
  Description: similar to TPM
  Accounted factors: sequencing depth and gene length
  Recommendations for use: gene count comparisons between genes within a sample; NOT for between-sample comparisons or DE analysis

Normalization method: DESeq2's median of ratios
  Description: counts divided by sample-specific size factors determined by the median ratio of gene counts relative to the geometric mean per gene
  Accounted factors: sequencing depth and RNA composition
  Recommendations for use: gene count comparisons between samples and for DE analysis; NOT for within-sample comparisons

Normalization method: edgeR's trimmed mean of M values (TMM)
  Description: uses a weighted trimmed mean of the log expression ratios between samples
  Accounted factors: sequencing depth, RNA composition, and gene length
  Recommendations for use: gene count comparisons between and within samples and for DE analysis
CPM, RPKM and TPM

Raw counts:

                   Sample 1   Sample 2   Sample 3
Gene A (1.5kb)     50         25         85
Gene B (2kb)       75         50         90
Sequencing depth   125        75         175

For the example I am scaling by 10 instead of 1000000.

CPM (Counts Per Million)

Normalize for sequencing depth (e.g. 50/12.5 = 4):

                   Sample 1   Sample 2   Sample 3
Gene A (1.5kb)     4          3.33       4.85
Gene B (2kb)       6          6.66       5.14

RPKM (Reads Per Kilobase Million)

Step 1: Normalize for sequencing depth (e.g. 50/12.5 = 4)

                   Sample 1   Sample 2   Sample 3
Gene A (1.5kb)     4          3.33       4.85
Gene B (2kb)       6          6.66       5.14

Step 2: Normalize for gene length (e.g. 4/1.5 = 2.66)

                   Sample 1   Sample 2   Sample 3
Gene A (1.5kb)     2.66       2.22       3.23
Gene B (2kb)       3          3.33       2.57
Seq. depth         5.66       5.55       5.8

TPM (Transcripts Per Kilobase Million)

Step 1: Normalize for gene length (e.g. 50/1.5 = 33.33)

                   Sample 1   Sample 2   Sample 3
Gene A (1.5kb)     33.33      16.66      56.66
Gene B (2kb)       37.5       25         45
Seq. depth         70.83      41.66      101.66

Step 2: Normalize for sequencing depth (e.g. 33.33/7.083 = 4.7)

                   Sample 1   Sample 2   Sample 3
Gene A (1.5kb)     4.7        3.99       5.57
Gene B (2kb)       5.29       6          4.426
Seq. depth         9.99       9.99       9.99
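The CPM, RPKM and TPM arithmetic of this worked example (Gene A 1.5 kb with counts 50/25/85, Gene B 2 kb with counts 75/50/90) can be reproduced with a short Python sketch. The function and variable names are illustrative only, and the scaling constant is 10 rather than 1,000,000, as in the example.

```python
SCALE = 10  # the example scales by 10 instead of 1,000,000

lengths_kb = {"Gene A": 1.5, "Gene B": 2.0}
counts = {
    "Sample 1": {"Gene A": 50, "Gene B": 75},
    "Sample 2": {"Gene A": 25, "Gene B": 50},
    "Sample 3": {"Gene A": 85, "Gene B": 90},
}

def cpm(sample):
    """Normalize for sequencing depth only."""
    depth = sum(sample.values())
    return {gene: c / depth * SCALE for gene, c in sample.items()}

def rpkm(sample):
    """Sequencing depth first, then gene length."""
    return {gene: v / lengths_kb[gene] for gene, v in cpm(sample).items()}

def tpm(sample):
    """Gene length first, then depth of the length-normalized values."""
    per_kb = {gene: c / lengths_kb[gene] for gene, c in sample.items()}
    total = sum(per_kb.values())
    return {gene: v / total * SCALE for gene, v in per_kb.items()}

for name, sample in counts.items():
    print(name)
    for label, func in (("CPM", cpm), ("RPKM", rpkm), ("TPM", tpm)):
        values = func(sample)
        rounded = {gene: round(v, 2) for gene, v in values.items()}
        print(f"  {label}: {rounded}  total = {round(sum(values.values()), 2)}")
```

In this toy example the TPM values of every sample add up to the scaling constant, whereas the RPKM totals differ from sample to sample.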
RPKM/FPKM (not recommended)
• the normalized count values output by the RPKM/FPKM method are
not comparable between samples
• the sum of the RPKM/FPKM-normalized counts differs from sample to
sample, so the normalized counts of a gene cannot be compared
directly between samples.
DESeq2-normalized counts: Median of ratios method
• tools for differential expression analysis compare the counts of
the same gene between sample groups, so gene length does not
need to be accounted for by the tool
• sequencing depth and RNA composition do need to be taken into
account
Median of ratios normalization
Step 1: create a pseudo-reference sample
(row-wise geometric mean)
Step 2: calculate the ratio of each sample to the
reference
Step 3: calculate the normalization factor for each
sample (size factor)
The median value (column-wise in the table of ratios)
of all ratios for a given sample is taken as the
normalization factor (size factor) for that sample
Step 4: calculate the normalized count values using
the normalization factor
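The four steps of the median of ratios method can be sketched in plain Python as follows. This is an illustration of the idea, not DESeq2's actual code: the gene and sample names and the counts are hypothetical, and genes with a zero count in any sample are simply left out of the pseudo-reference.

```python
from math import exp, log
from statistics import median

samples = ["s1", "s2"]
# Hypothetical toy counts; s2 is sequenced about twice as deeply as s1.
counts = {
    "geneA": {"s1": 100, "s2": 200},
    "geneB": {"s1": 50, "s2": 100},
    "geneC": {"s1": 30, "s2": 60},
    "geneD": {"s1": 0, "s2": 10},
}

def geometric_mean(values):
    return exp(sum(log(v) for v in values) / len(values))

# Step 1: pseudo-reference sample (row-wise geometric mean).
# Genes with a zero count are skipped because their geometric mean is zero.
reference = {
    gene: geometric_mean([row[s] for s in samples])
    for gene, row in counts.items()
    if all(row[s] > 0 for s in samples)
}

# Step 2: ratio of each sample to the reference, gene by gene.
ratios = {s: [counts[gene][s] / ref for gene, ref in reference.items()] for s in samples}

# Step 3: the size factor is the median ratio of each sample.
size_factors = {s: median(values) for s, values in ratios.items()}

# Step 4: normalized counts = raw counts / size factor.
normalized = {
    gene: {s: row[s] / size_factors[s] for s in samples}
    for gene, row in counts.items()
}

print({s: round(f, 3) for s, f in size_factors.items()})
print({gene: {s: round(v, 2) for s, v in row.items()} for gene, row in normalized.items()})
```

Because s2 has exactly twice the counts of s1 for the genes in the pseudo-reference, its size factor is twice as large, and those genes receive identical normalized counts in both samples.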
Figure: projection of samples in 2D after PCA. Panels: quantile normalization, smooth quantile normalization, conditional quantile normalization.
Comparison of different approaches
(sampling
MetaTranscriptome
Microbiome
Overview of the methods for zero replacement

Method: Add constant value
  Description: The simplest method is replacing all zeros with a constant value smaller than the detection limit. Martín-Fernández et al. (2003) found that 65% of the detection limit minimizes the distortion in the covariance structure. Using a constant value in the majority of cells leads to underestimation of the compositional variability.
  Example: 0.65 * detection limit = 0.65 * 1

Method: Using uniform values between 0 and detection limit
  Description: Uniform values between 0 and the detection limit (DL) are often used; setting the first parameter at 0.1*DL prevents imputed values from being too close to zero.
  Example: runif(n, 0.1*DL, DL), with n the number of zeros to replace

Method: Non-parametric multiplicative simple imputation
  Description: did not work if more than about half of the entries in the compositional data matrix were zero
  Implementation: zCompositions package in R

Method: Model-based multiplicative lognormal imputation
  Description: The replacement is done in an iterative manner, and for that purpose the EM algorithm, Markov Chain Monte Carlo (MCMC) or multiple imputation are utilized.
  Implementation: zCompositions package in R

Method: BDLs (below detection limit)
  Description: iterative model-based procedure which performs regressions to replace the zeros (e.g. ordinary multiple linear regression, robust regression, and partial least-squares (PLS) regression); the procedure is based on k-nearest-neighbour imputation. For a large number of zeros there are too few neighbours with non-zeros available, which makes the algorithm not applicable in this context.

Method: deepImp
  Description: Imputation with deep learning methods, particularly using deep artificial neural networks in an EM-based approach
  Implementation: deepImp package in R
Sugnet Lubbe, Peter Filzmoser, Matthias Templ,
Comparison of zero replacement strategies for compositional data with large numbers of zeros. Chemometrics and
Intelligent Laboratory Systems, 2021. https://doi.org/10.1016/j.chemolab.2021.104248
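The two simplest zero replacement strategies from the overview above (a constant of 0.65 * DL, and uniform values between 0.1 * DL and DL) can be sketched in Python as follows. The function names and the count vector are hypothetical, a detection limit of 1 is assumed as in the example, and the sketch does not rescale the non-zero parts afterwards.

```python
import random

DL = 1.0  # detection limit; 1 for count data, as in the example above

def replace_zeros_constant(values, dl=DL, fraction=0.65):
    """Replace every zero with a constant fraction of the detection limit."""
    return [fraction * dl if v == 0 else v for v in values]

def replace_zeros_uniform(values, dl=DL, rng=None):
    """Replace every zero with a uniform draw between 0.1*DL and DL."""
    rng = rng or random.Random()
    return [rng.uniform(0.1 * dl, dl) if v == 0 else v for v in values]

counts = [12, 0, 3, 0, 7]  # hypothetical taxon counts of one sample

constant_imputed = replace_zeros_constant(counts)
uniform_imputed = replace_zeros_uniform(counts, rng=random.Random(42))  # seeded for reproducibility

print(constant_imputed)
print([round(v, 3) for v in uniform_imputed])
```

With the constant, all imputed cells share one value, which is the source of the underestimated variability mentioned above; the uniform draws avoid identical values while keeping the lower bound away from zero.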
Figure: visual representation of distance matrices to compare the data structure between an
original simulated compositional data matrix and different zero-imputed matrices when 50%
of the values are zero.
Overview of the normalization methods

Normalization method: CLR (centered log-ratio)
  Description: divides each compositional part by the geometric mean of all parts. CLR removes the value-range restriction (which is good for some applications), but does not remove the sum constraint.

Normalization method: ILR (isometric log-ratio)
  Description: instead of analyzing relative abundances, y_i, of D different OTUs, the ILR transform produces D − 1 coordinates, x*_i (called "balances"). Each balance corresponds to a single internal node of the tree and represents the averaged difference in relative abundance between the taxa in the two sister clades descending from that node.

Normalization method: ALR (additive log-ratio)
  Description: one component is used as a baseline (reference); the proportion with the selected reference is logarithmized
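The CLR and ALR transforms can be sketched in a few lines of Python. The composition below is hypothetical, all parts are assumed to be strictly positive (zeros replaced beforehand), and the ILR transform is left out because it additionally requires a tree or basis defining the balances.

```python
from math import log

def clr(parts):
    """Centered log-ratio: log of each part over the geometric mean of all parts."""
    logs = [log(p) for p in parts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]

def alr(parts, ref_index=-1):
    """Additive log-ratio: log of each part over a chosen reference part."""
    ref_position = ref_index % len(parts)
    ref = parts[ref_position]
    return [log(p / ref) for i, p in enumerate(parts) if i != ref_position]

composition = [0.5, 0.3, 0.2]  # hypothetical relative abundances of three taxa

clr_values = clr(composition)
alr_values = alr(composition)  # the last part is the reference

print([round(v, 3) for v in clr_values])
print([round(v, 3) for v in alr_values])
```

The CLR values can be negative or positive but always add up to zero, which illustrates the remaining sum constraint, while ALR returns one coordinate fewer than the number of parts.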