10. When assumptions are violated - data transformation and non-parametric methods

Log-normally distributed data
The log-normal distribution is very common in many kinds of biological data. These are random variables whose logarithm follows the normal distribution. As a result, log-normal variables may range from the zero limit (excluding zero itself) to plus infinity – which is quite realistic e.g. for dimensions, mass, time, etc. In contrast to the symmetric normal distribution, log-normal variables are positively skewed and display a positive correlation between mean and variance (Fig. 10.1). A straightforward suggestion for such data is to apply log-transformation to the values to obtain normally distributed variables (Figs 10.1, 10.2, Table 10.1). ANOVA applied on non-transformed and transformed data provides quite different results (Table 10.1).

Fig. 10.1. Example of a log-normal variable: effect of job on the length of phone calls. The left panel shows the boxplot on the ordinary linear scale, while the right panel shows the same values on the log-scaled y-axis.

Table 10.1. Summaries of ANOVA applied on non-transformed and transformed data displayed in Fig. 10.1.
Analysis          R2     F      DF     p
non-transformed   0.26   4.13   3,36   0.013
log-transformed   0.42   8.72   3,36   0.0002

Fig. 10.2. Diagnostic plots of ANOVA models applied on non-transformed (upper row) and log-transformed data (lower row). Note the improved normal fit on the QQ plot and the homogeneity of variances after transformation (Residuals vs. Fitted and Scale-Location plots).

Note that log-transformation is not a simple utility procedure: it also affects the interpretation of the analysis. Log-transformation changes the scale from additive to multiplicative, i.e. we test the null hypothesis stating that the ratio between population means is 1 (instead of the difference being 0). We also consider different means – analysis on the log scale implies testing the geometric means on the original scale. The same applies to regression coefficients, which become relative rather than absolute numbers, e.g. the slope indicates how many times the response variable changes with a unit change in the predictor. An example with log-transformation in linear regression is displayed in Figs 10.3, 10.4 and Table 10.2.

Log-transformation is sometimes used for data which are not log-normally distributed but merely positively skewed. Such data may contain zeros and thus are not directly log-transformable; instead, the log(x + constant) transformation must be used. Alternatively, square-root transformation may be considered for such data. Note that the analysis results do not depend on the logarithm used – natural and decadic logarithms are used most frequently. Just be consistent and use the same logarithm throughout the analysis.

Fig. 10.3. Example of a regression with a log-normal variable: how grain yield of maize depends on the amount of fertilizer applied. The left panel shows the scatterplot on the ordinary linear scale, while the right panel shows the same values on the log-scaled y-axis.

Table 10.2. ANOVA tables of linear models fitted on non-transformed and transformed data displayed in Fig. 10.3.
Analysis          R2     F       DF     p
non-transformed   0.10   11.0    1,98   0.0013
log-transformed   0.14   16.05   1,98   0.0001

Fig. 10.4. Diagnostic plots of linear models fitted on non-transformed (upper row of plots) and log-transformed data (lower row of plots). Note the improved normal fit on the QQ plot and the improved homogeneity of variances after transformation (Scale-Location plot).
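The analyses behind Tables 10.1 and 10.2 can be run in R roughly as sketched below. The data frames calls and maize are simulated stand-ins for the data of Figs 10.1 and 10.3 (the original data are not reproduced in this text), so the exact numbers will not match the tables; the function calls themselves (lm, anova, log, plot of the fitted model) are the standard ones.

# simulated stand-in for the phone-call data of Fig. 10.1
set.seed(1)
calls <- data.frame(job    = rep(c("admin", "IT", "manager", "sales"), each = 10),
                    length = rlnorm(40, meanlog = rep(c(0.5, 1, 1.5, 2), each = 10),
                                    sdlog = 0.8))

# boxplots on the linear and on the log-scaled y-axis (cf. Fig. 10.1)
par(mfrow = c(1, 2))
boxplot(length ~ job, data = calls)
boxplot(length ~ job, data = calls, log = "y")

# ANOVA on the raw and on the log-transformed response (cf. Table 10.1)
m.raw <- lm(length ~ job, data = calls)
m.log <- lm(log(length) ~ job, data = calls)
anova(m.raw)
anova(m.log)
par(mfrow = c(2, 2)); plot(m.log)   # diagnostic plots as in Fig. 10.2

# regression with a log-transformed response (cf. Fig. 10.3, Table 10.2)
maize <- data.frame(fertilizer = runif(100, 0, 10))
maize$yield <- rlnorm(100, meanlog = 1 + 0.1 * maize$fertilizer, sdlog = 0.5)
r.log <- lm(log(yield) ~ fertilizer, data = maize)
summary(r.log)
exp(coef(r.log)["fertilizer"])   # multiplicative change in yield per unit of fertilizer

The last line illustrates the multiplicative interpretation discussed above: back-transforming the slope with exp() gives the factor by which the (geometric mean) yield changes when the fertilizer dose increases by one unit.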
Non-parametric tests
Some distributions cannot be approximated by the normal distribution, and simple transformations may not be helpful. This applies e.g. to many data on the ordinal scale, such as school grades, subjective rankings etc. For such cases, non-parametric tests were developed (Table 10.3). These tests replace the original values by their ranks and use the ranks to test differences in central tendencies (which are not precisely means) between the samples. These tests are still based on the assumption that the samples come from the same distribution (which, however, is quite reasonable).

Table 10.3. List of parametric tests and their non-parametric counterparts together with appropriate R functions.
Parametric test       Non-parametric test     R function
two-sample t-test     Mann-Whitney U test     wilcox.test
paired t-test         Wilcoxon test           wilcox.test with parameter paired=T
one-way ANOVA         Kruskal-Wallis test*    kruskal.test
Pearson correlation   Spearman correlation    cor.test with parameter method="spearman"
* The Dunn test may be used for post-hoc comparisons (function dunnTest in package FSA).

Permutation tests
Permutation tests represent valuable alternatives to parametric or non-parametric tests. First, a statistic measuring the departure from the null hypothesis (e.g. the difference between samples) is defined. That may be the raw or relative difference, or an F-ratio if multiple groups are analyzed. This statistic is computed for the observed data (the observed statistic). Subsequently, the values of the response variable are repeatedly permuted (reshuffled), and the same statistic is computed in each permutation. The p-value is then determined by the formula:

p = (x + 1) / (nperm + 1)

where x is the number of permutations in which the test statistic was higher than the observed test statistic, and nperm is the total number of permutations.

How to do in R
1. Log-scaling of a graph axis: parameter log='axis to be log-scaled', i.e. mostly log='y'
2. Log-transformation: function log for the natural logarithm, log10 for the decadic one
3. Non-parametric tests: see Table 10.3; worked examples are sketched below.
4. Permutation tests are available in library coin:
   a. permutation-based ANOVA: function oneway_test
   b. permutation-based correlation: spearman_test
   Both functions require the parameter distribution=approximate(B=number of permutations) to be set; B is usually set to 999 or 9999. A hand-coded permutation test following the formula above is also sketched below.
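As a brief illustration of the functions listed in Table 10.3, the calls below show how the rank-based tests could be run. The data frame grades, with a score grade, a grouping variable school and a numeric variable age, is hypothetical and serves only to make the code self-contained.

# hypothetical ordinal data: school grades (1-5) in four schools
set.seed(3)
grades <- data.frame(grade  = sample(1:5, 40, replace = TRUE),
                     school = rep(c("A", "B", "C", "D"), each = 10),
                     age    = sample(10:15, 40, replace = TRUE))

# Mann-Whitney U test: grades of schools A and B only
wilcox.test(grade ~ school, data = grades, subset = school %in% c("A", "B"))

# paired Wilcoxon test would be wilcox.test(x, y, paired = TRUE)
# for two vectors x and y of paired measurements

# Kruskal-Wallis test across all four schools
kruskal.test(grade ~ school, data = grades)

# post-hoc comparisons with the Dunn test (package FSA must be installed):
# FSA::dunnTest(grade ~ school, data = grades)

# Spearman correlation between grade and age
cor.test(grades$grade, grades$age, method = "spearman")

With many ties (as is typical for ordinal data), wilcox.test and cor.test will warn that exact p-values cannot be computed and switch to an approximation; the tests still run.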
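To make the permutation-test formula concrete, the following sketch codes a permutation-based one-way ANOVA by hand, using the F-ratio as the test statistic. The data frame dat is simulated for the example; the commented lines at the end show the ready-made equivalent from the coin package mentioned in point 4 of "How to do in R" (coin must be installed).

# simulated example data: a log-normal response in three groups
set.seed(2)
dat <- data.frame(y = rlnorm(30),
                  g = factor(rep(c("A", "B", "C"), each = 10)))

# observed statistic: F-ratio of the one-way ANOVA
obs.F <- anova(lm(y ~ g, data = dat))$`F value`[1]

# recompute the statistic after repeatedly reshuffling the response
nperm <- 999
perm.F <- replicate(nperm, anova(lm(sample(y) ~ g, data = dat))$`F value`[1])

# p-value as defined above: p = (x + 1) / (nperm + 1)
x <- sum(perm.F > obs.F)     # permutations exceeding the observed statistic
p <- (x + 1) / (nperm + 1)
p

# the same test with the coin package:
# library(coin)
# oneway_test(y ~ g, data = dat, distribution = approximate(B = 9999))
# (newer versions of coin call this argument nresample instead of B)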