Bi5444 Analysis of sequencing data
Statistical analysis
Eva Budinska, budinska@recetox.muni.cz

Aim
• To connect the information on molecular abundances or processes from NGS with the conditions of the experiment
• Compare the molecular patterns between two or more groups (class comparison)
• Discover new groups based on the molecular patterns (class discovery)
• Predict existing groups (class prediction)
• Analyze survival
• Explore molecular events (pathway analysis…)

The count table…
• The count table is a quantitative representation of the abundance of each sequence in each sample
• The problem with the count table is that the counts are not directly comparable, for technical and biological reasons:
• we are unable to prepare libraries containing exactly the same amount of DNA for each sample
• some features (genes) have longer target sequences than others, hence they will have more reads assigned!

What to do? … Normalize!
• Goal: compute a normalization factor for each sample and adjust the read counts using this factor
• After the adjustment, the read counts for different samples (and for different genes within a sample) should be comparable
• Note that often the normalization is not performed explicitly but is built into the existing analytical framework

Normalization approach I: total count
• Define a reference sample (either one of the observed samples or a "pseudo-sample") – this gives a "target" library size N_ref
• The normalization factor for sample j is defined from the ratio of its library size N_j = Σ_i K_ij (the total count of sample j) to the target library size N_ref
• RPKM/FPKM is an extension of this normalization scheme, where we also normalize for the length of the gene

RPKM/FPKM – outdated
• FPKM = Fragments Per Kilobase per Million mapped reads
• Similar to RPKM = Reads Per Kilobase per Million mapped reads
• Not (anymore) recommended for general use
• Still OK for some plots and statistics
• Accounts for the different lengths of the features
• Comparable within the sample

Normalization approach II: TMM
• Trimmed Mean of M-values
• M-values = log fold changes (compared to a reference sample)
• A-values = average expression values
• Trim the genes with very small or very large M- or A-values
• Calculate the normalization factor from a weighted mean of the M-values of the remaining genes
• Assumption: most genes are not differentially expressed
• Incorporated in edgeR

Normalization approach II: TMM
• …

Normalization approach III: RLE/DESeq
• Define the normalization factor for sample j as the median, over genes, of the ratio of the count in sample j to the count in a pseudo-reference sample: s_j = median_i [ K_ij / K_i,ref ]
• The "pseudo-reference" sample has counts defined by the geometric mean of the individual sample counts, K_i,ref = (Π_v K_iv)^(1/m)
• Incorporated in DESeq/DESeq2

Normalization approach IV: other summary measures
• Other measures can be used instead of the sum of the counts:
• upper quartile
• median
• quantile normalization – adapted from microarrays

Comparison of normalization approaches
• A nice evaluation, but only one of many: http://www.ncbi.nlm.nih.gov/pubmed/22988256

Other approaches
• Spike-ins with known expression
• very precise but more expensive
• getting more and more common
• Use "housekeeping genes"
• estimate the normalization factor only from these
• how do we know that the housekeeping genes are actually stable? Very often they are not!
• if we do this, we have to choose at least several (> 5) "housekeeping" genes
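As a concrete illustration of the normalization approaches above, here is a minimal R sketch. It assumes a hypothetical gene-by-sample count matrix named counts (rows = genes, columns = samples); the total-count and RLE/median-of-ratios factors are computed by hand, and TMM is taken from edgeR.

```r
## Hypothetical input: `counts`, a genes x samples matrix of raw read counts

## Approach I - total count: rescale every sample to a common "target" library size
lib.size <- colSums(counts)                       # N_j, library size of each sample
target   <- mean(lib.size)                        # library size of a "pseudo-sample"
norm.total <- sweep(counts, 2, lib.size, "/") * target

## Approach III - RLE / median-of-ratios (DESeq-style): ratio of each sample to a
## geometric-mean pseudo-reference, summarised per sample by the median over genes
log.geo.mean <- rowMeans(log(counts))             # log geometric mean per gene (pseudo-reference)
size.factor  <- apply(counts, 2, function(k)
  exp(median((log(k) - log.geo.mean)[is.finite(log.geo.mean)])))  # genes with zeros are skipped
norm.rle <- sweep(counts, 2, size.factor, "/")

## Approach II - TMM, as implemented in edgeR
library(edgeR)
y <- calcNormFactors(DGEList(counts = counts))    # TMM is the default method
y$samples$norm.factors                            # one normalization factor per sample
```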
Read count distribution – Negative Binomial distribution (?)
• The NB distribution extends the Poisson distribution by allowing for variability in the probability of a read mapping to a gene (λ).
• This implies the possibility of over-dispersion (i.e., variance exceeding the mean).
• The NB distribution has two parameters: the mean (μ) and the dispersion parameter (ϕ):
  E[K] = μ,  var(K) = μ + ϕμ²

Negative Binomial distribution

From the NB distribution to differential gene expression
• Model the read count of the i-th gene in the j-th sample as K_ij ~ NB(p_ij·N_j, ϕ_ij), where N_j is the library size of sample j and p_ij is the fraction of reads coming from gene i
• Differential expression of a gene i is signified by differences in p_ij between the groups
• It is important to get the dispersion estimates right if we want to say something about the significance of the differences

Estimating the dispersion
• Due to the small sample sizes, dispersion (and hence variance) estimation is difficult
• But we have a lot of genes!
• They should not behave completely differently
• Solution: combine information across the genes to estimate the dispersion!
• BUT the estimates from real data suggest that the dispersion may not be constant across genes

Adjusting the dispersion
• Stabilize the individual estimates by squeezing them towards a common estimate or a trend (edgeR)
• Model the mean-dispersion (or mean-variance) relationship (DESeq)
• Bayesian approach using a prior based on empirical values (DESeq2)

Dispersion estimation example – DESeq2

Compositional nature of the NGS data
• The gene (transcript) abundances (read counts) are constrained by the maximum number of DNA reads that the sequencer can provide (the total count constraint)
• Hence the data in fact represent a proportion (composition) of genes!
(Illustration: a count table with Gene1-Gene3 in SAMPLE 1-SAMPLE 3.)

The data is compositional – so what?
• The compositional nature of the data induces strong dependencies among the abundances of the different taxa:
• an increase in the abundance of one gene implies a decrease in the observed number of counts – and hence in the proportions – of the other genes, and vice versa
(Illustration: the same count table for Gene1-Gene3 in SAMPLE 1-SAMPLE 3, before and after treatment.)

The data is compositional – so what?
• In a composition, the value of each component is not informative by itself; the relevant information is contained in the ratios between the components.

The data is compositional – so what? / part 2
• Compositional data do not live in Euclidean space, but in a special constrained space called the simplex
• Hence it is incorrect to apply multivariate techniques that depend on this space (e.g. PCA, clustering, …) without a proper transformation of the data
(Figure: the simplex spanned by three components, Bacteroides, Firmicutes and Actinobacteria (each 0-100 %), and a PCA of compositional data without proper transformation; points = individuals.)

Statistical methods for the analysis of compositional data need to fulfill these criteria:
1. Scale invariance (i.e. the result should be the same regardless of the scale of the measurement)
• Example: how similar are these two samples?

                        %              Absolute read counts
                      A      B           A          B
  Fusobacteria       10     11         700       11000
  Proteobacteria     15     14        1050       14000
  Bacteroides        25     20        1750       20000
  Firmicutes         50     55        3500       55000
  Euclidean distance    7.2               57088.6
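To see what scale invariance rules out in practice, the two Euclidean distances in the table can be reproduced with a few lines of R; the numbers below are taken directly from the example above.

```r
## The same two samples, once as percentages and once as absolute read counts
pct   <- cbind(A = c(10, 15, 25, 50),        B = c(11, 14, 20, 55))
reads <- cbind(A = c(700, 1050, 1750, 3500), B = c(11000, 14000, 20000, 55000))
rownames(pct) <- rownames(reads) <-
  c("Fusobacteria", "Proteobacteria", "Bacteroides", "Firmicutes")

dist(t(pct))     # ~7.2     : as percentages the two samples are nearly identical
dist(t(reads))   # ~57088.6 : as raw counts they look very different (sequencing depth dominates)
```

A method that is not scale invariant therefore answers a question about sequencing depth rather than about the composition itself.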
Statistical methods for the analysis of compositional data need to fulfill these criteria:
2. Subcompositional coherence (i.e. the analyses should lead to the same conclusions regardless of which components of the data are included)
• This is especially a problem for correlations between taxa, which tend to become more negative when we remove some taxa and recalculate the proportions.
• Alternative(s) to correlation:
  1. phi: Φ(A_i, A_j) = var(A_i − A_j) / var(A_i)
  2. rho: ρ(A_i, A_j) = 1 − var(A_i − A_j) / (var(A_i) + var(A_j))
  3. phis: Φs(A_i, A_j) = var(A_i − A_j) / var(A_i + A_j)
  where A_i denotes the log-transformed values of metagenomic component i in the data
  (Aitchison, 1982, J. R. Statist. Soc.; Lovell et al., 2015, PLoS Comput Biol; Quinn et al., 2017, Scientific Reports 7)

Data transformation (normalization)
• Compositional data can be normalized in order to make them suitable for existing statistical techniques
• Aitchison (1982) built the theory and concepts of the analysis of compositional data and suggested normalizations
• Basic concept: work with log-ratios between components
• ALR (additive log-ratio transformation)
  + good for most statistical techniques
  − needs a careful selection of one reference component, we work with k−1 taxa, more difficult to interpret
• CLR (centered log-ratio transformation)
  + ratio to the geometric mean, preserves all taxa, no need to select one component
  − creates a singular covariance matrix
• ILR (isometric log-ratio transformation) [Egozcue, 2003]
• PhILR (phylogenetic partitioning based ILR transform) [Silverman et al., 2017]

Compositional data – PCA before and after normalization
(Figure, three panels: PCA on absolute counts – the main variance lies in the sequencing depth; relative data – the simplex problem, coloured by individual; CLR-transformed data – coloured by individual.)

The excess zero problem
• Log-ratio transformations require data with positive values, so any statistical analysis of count compositions must be preceded by a proper replacement of the zeros
• We do not know whether the zeros are real or just below the detection threshold
• What to do?
• e.g. Bayesian multiplicative treatment of count zeros [Martín-Fernández, 2014, Statistical Modelling]
• Analysis and correction of compositional bias in sparse sequencing count data [Kumar et al., 2018, BMC Genomics 19:799]

Large number of genes – a lot of statistical tests
• p-values are suitable tools for inference when a single hypothesis is tested
• p-value = the probability of obtaining a test statistic at least as extreme as the one observed, if the null hypothesis is true (that is, without any true signal in the data)
• But this means that even if the null hypothesis is true, there is a nonzero probability of obtaining such an extreme test statistic
• If we perform many tests (even with true null hypotheses), we will get extreme test statistics (and correspondingly low p-values) every once in a while

What's the problem?
• As the number of hypotheses increases, the probability of obtaining at least one low p-value (e.g., < 0.05) increases, too

What can we do?
• The most common approach is to correct or adjust the observed p-values to account for the multiple testing, by controlling the family-wise error rate or the false discovery rate (FDR)
• Typically, each p-value is multiplied by a number ≥ 1 to obtain an adjusted p-value
• Only if the adjusted p-value is small do we call the result significant

Bonferroni correction
• Divide the alpha level by the number of tests
• e.g. 1000 tests and alpha = 0.05 => the adjusted alpha level is 0.00005!
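A small simulation makes the problem, and the Bonferroni fix, concrete; nothing here is specific to sequencing data, it only uses base R.

```r
## 1000 two-sample t-tests in which the null hypothesis is ALWAYS true (no real signal)
set.seed(1)
m <- 1000
p <- replicate(m, t.test(rnorm(5), rnorm(5))$p.value)

sum(p < 0.05)                                  # ~50 "discoveries" purely by chance
p.bonf <- p.adjust(p, method = "bonferroni")   # equivalent to testing each gene at alpha/m
sum(p.bonf < 0.05)                             # typically 0
## p.adjust(p, method = "BH") gives the Benjamini-Hochberg FDR adjustment discussed next
```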
Benjamini-Hochberg (FDR) correction
• Idea: instead of focusing on the per-gene false positive probability, try to control the fraction of false positives among the genes that are called significant.
• We can tolerate a few false positives if we simultaneously have a lot of true positive findings.
• After FDR (e.g. Benjamini-Hochberg) adjustment:
• an adjusted p-value of (e.g.) 0.05 means that the smallest false discovery rate at which we can call the given gene significant is 5%
• an adjusted p-value close to 1 means that we cannot call the corresponding gene significant without accepting that almost all of our findings will be false

Differential expression (DE) calculation

Inputs to the calculation
• Two main inputs:
  1. a table with raw or normalized gene counts – one column per sample and one row per gene
  2. a design table – the assignment of groups/conditions to the samples
• Additional input – a design/model matrix:
• "normal" comparison: ~ condition
• "paired" comparison: ~ patient + condition

edgeR
• Implemented in R
• Count-based approach
• Assumes a NB distribution
• TMM normalization by default (other alternatives available)
• Estimates dispersion by shrinking towards a common or trended estimate
• Allows a large variety of experimental designs through the use of a generalized linear model (GLM) framework

DESeq2
• Implemented in R
• Count-based approach
• Assumes a NB distribution
• RLE normalization
• Estimates dispersion by a Bayesian approach
• Implements outlier detection and independent filtering
• Allows a large variety of experimental designs through the use of a generalized linear model (GLM) framework
• (A minimal code sketch of the edgeR and DESeq2 workflows follows at the end of this section)

Other tools
• voom + limma
• baySeq
• Cuffdiff2 (+ cummeRbund)
• And many other R packages
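To connect the tools above with the inputs described earlier, here is a minimal sketch of a two-group comparison in both packages. The objects counts (a genes x samples integer matrix) and condition (a factor giving the group of each sample) are hypothetical placeholders.

```r
## Hypothetical inputs: `counts` (genes x samples, raw integer counts) and
## `condition` (factor giving the group of each sample); for a paired design
## the model formula would be ~ patient + condition instead of ~ condition

## --- edgeR: TMM normalization, shrunken NB dispersions, GLM/quasi-likelihood test ---
library(edgeR)
design <- model.matrix(~ condition)
y   <- DGEList(counts = counts, group = condition)
y   <- calcNormFactors(y)                 # TMM normalization factors
y   <- estimateDisp(y, design)            # common / trended / tagwise dispersions
fit <- glmQLFit(y, design)
res.edger <- topTags(glmQLFTest(fit, coef = 2), n = Inf)   # FDR column = BH-adjusted p-values

## --- DESeq2: RLE size factors, empirical-Bayes dispersions, Wald test ---------------
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = data.frame(condition = condition),
                              design    = ~ condition)
dds <- DESeq(dds)                         # size factors, dispersions, testing, outlier handling
res.deseq <- results(dds)                 # padj column = BH-adjusted p-values
```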