Normalization methods
Barbora Zwinsová
September 2023

miRNA

Overview of the methods (miRNA)

CPM (counts per million)
  Description: counts scaled by the total number of reads; the simplest form of normalization
  Accounted factors: sequencing depth
  Recommendations for use: gene count comparisons between replicates of the same sample group; NOT for within-sample comparisons or DE analysis

Total count scaling
  Description: after scaling each sample to its library size, the counts can be rescaled to a common value across all samples; does not take into consideration the potentially different RNA composition across the samples
  Accounted factors: sequencing depth

Upper quartile (UQ) scaling
  Description: modified quantile-normalization method in which the upper quartile of expressed miRNAs is used as a linear scaling factor
  Accounted factors: sequencing depth
  Recommendations for use: has been shown to yield better concordance with qPCR results than linear total count scaling for RNA-seq data

Trimmed mean of M values (TMM, edgeR)
  Description: calculates a linear scaling factor, d_i, for sample i, based on a weighted mean after trimming the data by log fold-changes (M) relative to a reference sample and by absolute intensity (A)
  Accounted factors: sequencing depth and RNA composition
  Recommendations for use: gene count comparisons between and within samples and for DE analysis

DESeq2's median of ratios
  Description: counts divided by sample-specific size factors, determined by the median ratio of gene counts relative to the geometric mean per gene
  Accounted factors: sequencing depth and RNA composition
  Recommendations for use: gene count comparisons between samples and for DE analysis; NOT for within-sample comparisons

Linear regression
  Description: assumes that the systematic bias is linearly dependent on the count abundance; samples are normalized to a baseline reference, defined as the median count of each element across the profiled samples

Cyclic loess (nonlinear regression)
  Description: nonlinear regression against a baseline reference

Quantile
  Description: non-scaling approach that forces the distribution of read counts in all samples across an experiment to be equivalent; assumes that most targets are not differentially expressed and that the true expression distribution is similar across all samples

Tam et al., 2015, Briefings in Bioinformatics. https://doi.org/10.1093/bib/bbv019

Figure: Comparison of data distribution (Tam et al., 2015)

Figure: Variance comparisons (Tam et al., 2015)

Conclusion (miRNA) (Tam et al., 2015)
• simply adjusting miRNA counts to the sequencing depth is inadequate
• the number of distinct miRNAs identified in replicate samples may differ because of the random sampling nature of the technology; normalizing to the library size ignores this
• total count scaling introduces more variability by pushing all samples toward the same distribution
• UQ, TMM, DESeq, cyclic loess and quantile normalization give highly similar results
• quantile and cyclic loess normalization may be too aggressive because they force the distribution of the samples to be the same
• with these two methods, increased variability was noted in the lower-abundance miRNAs compared with UQ- and TMM-normalized data
• Dillies et al. and Tam et al. support the use of TMM (and UQ) for the normalization of miRNA count data
• Tam et al.: BWA alignment with one mismatch allowed across the entire read, combined with UQ or TMM normalization, leads to more accurate results in downstream analyses

Transcriptome

Overview of the methods (transcriptomics)

CPM (counts per million)
  Description: counts scaled by the total number of reads; the simplest form of normalization
  Accounted factors: sequencing depth
  Recommendations for use: gene count comparisons between replicates of the same sample group; NOT for within-sample comparisons or DE analysis

TPM (transcripts per kilobase million)
  Description: counts per length of transcript (kb) per million reads mapped
  Accounted factors: sequencing depth and gene length
  Recommendations for use: gene count comparisons within a sample or between samples of the same sample group; NOT for DE analysis

RPKM/FPKM (reads/fragments per kilobase of exon per million reads/fragments mapped)
  Description: similar to TPM
  Accounted factors: sequencing depth and gene length
  Recommendations for use: gene count comparisons between genes within a sample; NOT for between-sample comparisons or DE analysis

DESeq2's median of ratios
  Description: counts divided by sample-specific size factors, determined by the median ratio of gene counts relative to the geometric mean per gene
  Accounted factors: sequencing depth and RNA composition
  Recommendations for use: gene count comparisons between samples and for DE analysis; NOT for within-sample comparisons

edgeR's trimmed mean of M values (TMM)
  Description: uses a weighted trimmed mean of the log expression ratios between samples
  Accounted factors: sequencing depth, RNA composition, and gene length
  Recommendations for use: gene count comparisons between and within samples and for DE analysis

CPM, RPKM and TPM

Raw counts
                    Sample 1   Sample 2   Sample 3
  Gene A (1.5 kb)      50         25         85
  Gene B (2 kb)        75         50         90
  Sequencing depth    125         75        175

In all examples below, the scaling factor is 10 instead of 1,000,000.

CPM (Counts Per Million)
                    Sample 1   Sample 2   Sample 3
  Gene A (1.5 kb)     4.00       3.33       4.86
  Gene B (2 kb)       6.00       6.67       5.14
  Example (Gene A, Sample 1): 50 / 12.5 = 4

RPKM (Reads Per Kilobase Million)

Step 1: Normalize for sequencing depth
                    Sample 1   Sample 2   Sample 3
  Gene A (1.5 kb)     4.00       3.33       4.86
  Gene B (2 kb)       6.00       6.67       5.14
  Example (Gene A, Sample 1): 50 / 12.5 = 4

Step 2: Normalize for gene length
                    Sample 1   Sample 2   Sample 3
  Gene A (1.5 kb)     2.67       2.22       3.24
  Gene B (2 kb)       3.00       3.33       2.57
  Seq. depth          5.67       5.56       5.81
  Example (Gene A, Sample 1): 4 / 1.5 = 2.67

TPM (Transcripts Per Kilobase Million)

Step 1: Normalize for gene length
                    Sample 1   Sample 2   Sample 3
  Gene A (1.5 kb)    33.33      16.67      56.67
  Gene B (2 kb)      37.50      25.00      45.00
  Seq. depth         70.83      41.67     101.67
  Example (Gene A, Sample 1): 50 / 1.5 = 33.33

Step 2: Normalize for sequencing depth
                    Sample 1   Sample 2   Sample 3
  Gene A (1.5 kb)     4.71       4.00       5.57
  Gene B (2 kb)       5.29       6.00       4.43
  Seq. depth         10.00      10.00      10.00
  Example (Gene A, Sample 1): 33.33 / 7.083 = 4.71

RPKM/FPKM (not recommended)
• the normalized count values output by the RPKM/FPKM method are not comparable between samples
• the total number of RPKM/FPKM-normalized counts differs from sample to sample (5.67, 5.56 and 5.81 in the example above, whereas the TPM totals are all 10), so the normalized counts for a given gene cannot be compared equally between samples
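The CPM, RPKM and TPM arithmetic from the worked example can be reproduced with a short Python sketch. The function and variable names (`cpm`, `rpkm`, `tpm`, `column_sums`, `SCALE`) are illustrative choices and do not come from any package. The scaling factor is 10, as in the slides, rather than 1,000,000.

```python
# Raw counts from the worked example: one list per gene, one entry per sample.
counts = {"Gene A": [50, 25, 85], "Gene B": [75, 50, 90]}
lengths_kb = {"Gene A": 1.5, "Gene B": 2.0}
SCALE = 10  # the slides scale by 10 instead of 1,000,000


def column_sums(table):
    """Per-sample totals (the 'sequencing depth' row of a table)."""
    return [sum(column) for column in zip(*table.values())]


def cpm(table, scale=SCALE):
    """Divide every value by its sample total, then multiply by the scale."""
    depth = column_sums(table)
    return {g: [v * scale / d for v, d in zip(row, depth)]
            for g, row in table.items()}


def rpkm(table, lengths, scale=SCALE):
    """Depth first (step 1), then gene length (step 2)."""
    per_depth = cpm(table, scale)
    return {g: [v / lengths[g] for v in row] for g, row in per_depth.items()}


def tpm(table, lengths, scale=SCALE):
    """Gene length first (step 1), then depth of the length-scaled values (step 2)."""
    per_kb = {g: [v / lengths[g] for v in row] for g, row in table.items()}
    return cpm(per_kb, scale)


for name, result in (("CPM", cpm(counts)),
                     ("RPKM", rpkm(counts, lengths_kb)),
                     ("TPM", tpm(counts, lengths_kb))):
    print(name)
    for gene, row in result.items():
        print(" ", gene, [round(v, 2) for v in row])
    print("  total", [round(s, 2) for s in column_sums(result)])
```

The only difference between `rpkm` and `tpm` is the order of the two steps. That order is why the TPM totals are identical across samples and the RPKM totals are not.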
DESeq2-normalized counts: Median of ratios method
• tools for differential expression analysis compare the counts between sample groups for the same gene, so gene length does not need to be accounted for by the tool
• sequencing depth and RNA composition do need to be taken into account

Median of ratios normalization
Step 1: create a pseudo-reference sample (row-wise geometric mean)
Step 2: calculate the ratio of each sample to the reference
Step 3: calculate the normalization factor (size factor) for each sample: the median (column-wise over the table of ratios) of all ratios for a given sample is taken as the size factor for that sample
Step 4: calculate the normalized count values by dividing each raw count by the size factor of its sample

Quantile normalization

Smooth quantile normalization

Conditional quantile normalization

Figure: Projection of samples in 2D after PCA

Figure: Projection of samples in 2D after PCA

Comparison of different approaches (sampling

MetaTranscriptome

Microbiome

Overview of the methods for zero replacement

Add constant value
  Description: the simplest method is to replace all zeros with a constant value smaller than the detection limit. Martín-Fernández et al. (2003) found that 65% of the detection limit minimizes the distortion in the covariance structure. Using a constant value in the majority of cells leads to underestimation of the compositional variability.
  Implementation: 0.65 * detection limit = 0.65 * 1

Uniform values between 0 and the detection limit
  Description: uniform values between 0 and the detection limit (DL) are often used; setting the lower bound to 0.1 * DL prevents imputed values from being too close to zero.
  Implementation: runif(n, min = 0.1 * DL, max = DL), where n is the number of zeros to replace

Non-parametric multiplicative simple imputation
  Description: did not work if more than about half of the entries in the compositional data matrix were zero
  Implementation: zCompositions package in R

Model-based multiplicative lognormal imputation
  Description: the replacement is done iteratively, using the EM algorithm, Markov chain Monte Carlo (MCMC) or multiple imputation
  Implementation: zCompositions package in R

BDLs (below detection limit)
  Description: iterative model-based procedure that performs regressions to replace the zeros (e.g. ordinary multiple linear regression, robust regression, and partial least-squares (PLS) regression). The procedure is based on k-nearest-neighbour imputation; for a large number of zeros there are too few neighbours with non-zero values available, which makes the algorithm not applicable in this context.

deepImp
  Description: imputation with deep learning methods, in particular deep artificial neural networks in an EM-based approach
  Implementation: deepImp package in R

Sugnet Lubbe, Peter Filzmoser, Matthias Templ. Comparison of zero replacement strategies for compositional data with large numbers of zeros. Chemometrics and Intelligent Laboratory Systems, 2021. https://doi.org/10.1016/j.chemolab.2021.104248

Figure: Visual representation of distance matrices to compare data structure between an original simulated compositional data matrix and different zero-imputed matrices when 50% of the values are zero.

Overview of the normalization methods

CLR (centered log-ratio)
  Description: divides each compositional part by the geometric mean of all parts and takes the logarithm of the ratio. CLR removes the value-range restriction (which is good for some applications), but does not remove the sum constraint.

ILR (isometric log-ratio)
  Description: instead of analyzing the relative abundances, y_i, of D different OTUs, the ILR transform produces D − 1 coordinates, x*_i (called "balances"). Each balance corresponds to a single internal node of the tree and represents the averaged difference in relative abundance between the taxa in the two sister clades descending from that node.

ALR (additive log-ratio)
  Description: one component is used as a baseline (reference); the ratio of each other part to the selected reference is log-transformed.
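To make the CLR and ALR descriptions concrete, here is a minimal Python sketch that first applies the simplest zero replacement from the table above (a constant 0.65 * detection limit, with a detection limit of 1). The names `replace_zeros`, `clr` and `alr` and the example counts are illustrative assumptions. They are not the API of zCompositions or any other package, and the replacement is the plain constant substitution, not the multiplicative variants.

```python
import math


def replace_zeros(counts, detection_limit=1.0, fraction=0.65):
    """Replace every zero with fraction * detection_limit (simple constant rule)."""
    value = fraction * detection_limit
    return [c if c > 0 else value for c in counts]


def clr(parts):
    """Centered log-ratio: log of each part divided by the geometric mean of all parts."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


def alr(parts, ref_index=-1):
    """Additive log-ratio: log of each part relative to one reference part."""
    ref_index %= len(parts)
    log_ref = math.log(parts[ref_index])
    return [math.log(p) - log_ref
            for i, p in enumerate(parts) if i != ref_index]


# Hypothetical taxon counts for one sample, including a zero.
sample = [120, 0, 30, 50]
filled = replace_zeros(sample)

print("zero-replaced:", filled)
print("CLR:", [round(v, 3) for v in clr(filled)])
print("ALR (last taxon as reference):", [round(v, 3) for v in alr(filled)])
```

The CLR values of a sample always sum to zero, which is the remaining sum constraint noted in the table. ALR returns one coordinate fewer than the number of parts because the reference part is dropped.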