10. When assumptions are violated - data transformation and non-parametric methods

Log-normally distributed data

The log-normal distribution is very common in real data of all kinds: a variable is log-normally distributed if its logarithm follows a normal distribution. As a result, log-normal variables range from zero (excluding zero itself) to plus infinity, which is quite realistic e.g. for dimensions, mass, time, etc. In contrast to normally distributed variables, log-normal variables are positively skewed (i.e. not distributed symmetrically around the mean) and display a positive correlation between mean and variance (Fig. 10.1). A straightforward suggestion for such data is to apply a log-transformation of the values to obtain normally distributed variables (Figs 10.1, 10.2, Table 10.1). ANOVA applied to non-transformed and transformed data gives quite different results (Table 10.1).

Fig. 10.1. Example of a log-normal variable: length of phone calls depending on the job of the person calling. The left panel shows the boxplot on the ordinary linear scale, while the right panel shows the same values on a log-scaled y-axis.

Table 10.1. Summaries of ANOVA applied to the non-transformed and log-transformed data displayed in Fig. 10.1.

Analysis          R2     F      DF     p
non-transformed   0.26   4.13   3,36   0.013
log-transformed   0.42   8.72   3,36   0.0002

Fig. 10.2. Diagnostic plots of the ANOVA models applied to non-transformed (upper row of plots) and log-transformed data (lower row of plots). Note the improved normal fit on the QQ-plot and the homogeneity of variances after transformation (Residuals vs. Fitted and Scale-Location plots).

Note that log-transformation is not a simple utility procedure; it also affects the interpretation of the analysis. Log-transformation changes the scale from additive to multiplicative, i.e. we test the null hypothesis that the ratio between population means is 1 (instead of the difference being 0). We also work with different means: analysis on the log scale implies testing geometric means on the original scale. The same applies to regression coefficients, which become relative rather than absolute numbers, e.g. the slope indicates how many times the response variable changes with a unit change in the predictor. An example of log-transformation in linear regression is displayed in Fig. 10.3, Fig. 10.4 and Table 10.2, and a short R sketch follows below.

Log-transformation is sometimes also used for data which are not log-normally distributed, but just positively skewed. Such data may contain zeros and thus are not log-transformable; instead, the log(x + constant) transformation must be used. Alternatively, a square-root transformation may be considered for such data. Note that the analysis results do not depend on which logarithm is used; natural and decadic (base-10) logarithms are used most frequently. Just be careful to use the same logarithm consistently throughout the analysis.

Fig. 10.3. Example of a regression with a log-normal variable: how grain yield of maize depends on the amount of fertilizer applied. The left panel shows the scatterplot on the ordinary linear scale, while the right panel shows the same values on a log-scaled y-axis.

Table 10.2. ANOVA tables of linear models fitted to the non-transformed and log-transformed data displayed in Fig. 10.3.

Analysis          R2     F      DF     p
non-transformed   0.10   11.00  1,98   0.0013
log-transformed   0.14   16.05  1,98   0.0001

Fig. 10.4. Diagnostic plots of linear models fitted to non-transformed (upper row of plots) and log-transformed data (lower row of plots). Note the improved normal fit on the QQ-plot and the improved homogeneity of variances after transformation (Scale-Location plot).
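The workflow above can be sketched in R with simulated data. The following is only an illustration under assumed, made-up data: the objects job, call.length, fert and yield are invented for this sketch and are not the data behind Figs 10.1-10.4.

set.seed(1)
job <- factor(rep(c("manager", "technician", "clerk", "driver"), each = 10))
call.length <- rlnorm(40, meanlog = rep(c(1.0, 1.3, 1.6, 1.9), each = 10), sdlog = 0.6)

boxplot(call.length ~ job)                # original linear scale (cf. Fig. 10.1, left)
boxplot(call.length ~ job, log = "y")     # log-scaled y-axis (cf. Fig. 10.1, right)

m.raw <- aov(call.length ~ job)           # ANOVA on non-transformed data
m.log <- aov(log(call.length) ~ job)      # ANOVA on log-transformed data
summary(m.raw)
summary(m.log)

par(mfrow = c(2, 4))                      # diagnostic plots (cf. Fig. 10.2)
plot(m.raw)
plot(m.log)

exp(tapply(log(call.length), job, mean))  # back-transformed means = geometric means

# Regression with a log-transformed response (cf. Fig. 10.3)
fert  <- runif(100, 0, 200)
yield <- exp(0.5 + 0.004 * fert + rnorm(100, sd = 0.4))
r.log <- lm(log(yield) ~ fert)
exp(coef(r.log)["fert"])                  # multiplicative change in yield per unit of fertilizer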
Non-parametric tests

Some distributions cannot be approximated by the normal distribution and simple transformations do not help. This applies, for example, to many data on an ordinal scale, such as school grades, subjective rankings, etc. For such cases, non-parametric tests were developed (Table 10.3). These tests replace the original values by their ranks and use the ranks to test differences in central tendencies (which are not exactly means) between the samples, based only on the assumption that the samples come from the same distribution.

Table 10.3. List of parametric tests and their non-parametric counterparts, together with the appropriate R functions.

Parametric test        Non-parametric test     R function
two-sample t-test      Mann-Whitney U test     wilcox.test
paired t-test          Wilcoxon test           wilcox.test with parameter paired=TRUE
one-way ANOVA          Kruskal-Wallis test*    kruskal.test
Pearson correlation    Spearman correlation    cor.test with parameter method="spearman"

* The Dunn test may be used for post-hoc comparisons (function dunnTest in package FSA).

Permutation tests

Permutation tests represent useful alternatives to parametric tests. First, a statistic measuring the deviation from the null hypothesis (e.g. the difference between samples) is defined. This may be a raw or relative difference, or an F-ratio if multiple groups are analyzed. This statistic is computed for the observed data (the observed statistic). Subsequently, the values of the response variable are repeatedly permuted (reshuffled) and the same statistic is computed in each permutation. The p-value is then determined by the formula:

p = (x + 1) / (nperm + 1)

where x is the number of permutations in which the test statistic was higher than the observed test statistic and nperm is the total number of permutations.

How to do in R

1. Log-scaling of a graph axis: parameter log='axis to be log-scaled', i.e. mostly log='y'
2. Log-transformation: function log for the natural logarithm, log10 for the decadic (base-10) logarithm
3. Non-parametric tests: see Table 10.3 and the sketch after this list
4. Permutation tests are available in package coin (see the second sketch after this list):
   a. permutation-based ANOVA: function oneway_test
   b. permutation-based correlation: spearman_test
   Both functions require the parameter distribution=approximate(B=number of permutations) to be set; B is usually set to 999 or 9999.
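A minimal sketch of the non-parametric tests from Table 10.3, using made-up data (the objects grade, group, before, after, x and y are assumptions of this example, not data from the text):

set.seed(2)
group <- factor(rep(c("A", "B", "C"), each = 12))
grade <- sample(1:5, 36, replace = TRUE)              # ordinal response, e.g. school grades

wilcox.test(grade[group == "A"], grade[group == "B"]) # Mann-Whitney U test (two samples)

before <- rnorm(15)
after  <- before + rnorm(15, mean = 0.3)
wilcox.test(before, after, paired = TRUE)             # Wilcoxon paired (signed-rank) test

kruskal.test(grade ~ group)                           # Kruskal-Wallis test
# FSA::dunnTest(grade ~ group, data = data.frame(grade, group))  # post-hoc comparisons (needs package FSA)

x <- rnorm(30)
y <- x + rnorm(30)
cor.test(x, y, method = "spearman")                   # Spearman correlation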
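And a sketch of the permutation tests, again with simulated data. The group sizes and the number of permutations are arbitrary choices for this illustration; note that recent versions of the coin package name the number-of-permutations argument nresample instead of B.

library(coin)

set.seed(3)
group <- factor(rep(c("A", "B", "C", "D"), each = 10))
y <- rlnorm(40, meanlog = rep(c(1.0, 1.2, 1.4, 1.6), each = 10))

oneway_test(y ~ group, distribution = approximate(B = 9999))   # permutation-based ANOVA

x <- runif(40)
z <- x + rnorm(40, sd = 0.3)
spearman_test(z ~ x, distribution = approximate(B = 9999))     # permutation-based correlation

# The same p-value logic done by hand, with the F-ratio as the test statistic:
obs  <- summary(aov(y ~ group))[[1]][1, "F value"]              # observed statistic
perm <- replicate(999, summary(aov(sample(y) ~ group))[[1]][1, "F value"])
(sum(perm > obs) + 1) / (999 + 1)                               # p = (x + 1) / (nperm + 1)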