8. F-test and F distribution, analysis of variance (ANOVA)

F-test

Normally distributed data can be described by two parameters – the mean and the variance. We discussed testing the difference in means between two samples in the previous chapter. However, it is also possible to test whether two samples come from populations with the same variance, i.e. the null hypothesis stating:

σ₁² = σ₂²

As usual for population parameters, we do not know the σ², but they can be estimated by the sample variances s². A comparison between the sample variances is then made by the F-test

F = s₁² / s₂²

which is a simple ratio of the two sample variances. The F statistic follows the F distribution, the shape of which is defined by two degrees of freedom – the numerator DF and the denominator DF. These are found as n₁ – 1 and n₂ – 1 (i.e. the number of observations in the corresponding sample minus 1). When reporting test results in a text, both DFs must be reported (usually as subscripts), for instance: “Variances differed significantly between green and red apples (F₂₀,₂₅ = 2.52, p = 0.015).”

Fig. 8.1 Probability density plot of F-distributions with different DFs.
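In R, the two-sample F-test is available as var.test(), and the same result can be assembled by hand from the sample variances and pf(). The sketch below uses invented apple weights; the object names and numbers are purely illustrative.

# Illustrative data: weights of green and red apples (invented values)
green <- c(152, 148, 161, 139, 155, 147, 158, 150)
red   <- c(143, 170, 128, 162, 181, 135, 149, 173)

# Built-in two-sample F-test of equal variances
var.test(green, red)

# The same test assembled by hand
F.stat <- var(green) / var(red)
df1 <- length(green) - 1
df2 <- length(red) - 1
# two-sided p-value: twice the smaller tail probability of the F distribution
p <- 2 * min(pf(F.stat, df1, df2), 1 - pf(F.stat, df1, df2))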
Analysis of variance (ANOVA)

The F-test is rarely used to test differences in variance between two samples, because hypotheses about variances are not common. However, the F-test has its crucial application in the analysis of variance. In chapter 7, we discussed comparisons between the means of two samples using a t-test. A natural question, however, arises – what if we have more than two samples? We might try pair-wise comparisons between each pair of them. That would, however, lead to multiple non-independent tests and result in an inflated type I error probability. (If two tests are each performed at α = 0.05, the probability of making a type I error in at least one of them is 0.05 + 0.05 - 0.05² = 0.0975, already nearly twice the nominal level.) Therefore, we use analysis of variance (ANOVA) to solve such problems. ANOVA tests a null hypothesis on the means of multiple samples, which states that the population means are equal, i.e.

μ₁ = μ₂ = μ₃ = … = μₖ

The mechanism of ANOVA is based on decomposing the total variability into two components: 1. a systematic component corresponding to differences between groups and 2. an error (or residual) component corresponding to differences within groups. These differences are measured as squares. For each observation in the dataset, we can calculate its total square (measuring the difference between its value and the overall mean), its effect square (measuring the difference between the corresponding group mean and the overall mean), and its error square (measuring the difference between the value and the corresponding group mean) (Fig. 8.2).

Fig. 8.2 Mechanism of ANOVA: definition of squares exemplified with the red data point.

Subsequently, we can sum these squares over the whole dataset and get the sums of squares (SS): SStotal, SSeffect and SSerror. We can further calculate the mean squares (MS) by dividing each SS by the corresponding DF, with DFtotal = n – 1, DFeffect = k – 1 and DFerror = DFtotal – DFeffect, where n is the total number of observations and k the number of groups. Hence we get:

MSeffect = SSeffect / DFeffect
MSerror = SSerror / DFerror

And now it comes: the mean squares are actually variances. As a result, we can use an F-test to test whether MSeffect is larger than MSerror; such a test is equivalent to the test of the null hypothesis stating that all the means are equal:

F = MSeffect / MSerror, with DFeffect and DFerror degrees of freedom

The corresponding p-value is then found from the F distribution, as in an ordinary F-test. Note that rejecting the null hypothesis only means that at least one of the means differs significantly from at least one other. Besides the p-value, it is also possible to compute the proportion of variability explained by the groups:

r² = SSeffect / SStotal

A typical report of an ANOVA result in a text then reads: “Means were significantly different among the groups (r² = 0.70, F₂,₁₂ = 14.63, p = 0.0006).”

ANOVA assumptions

The application of ANOVA assumes that i. the samples come from normally distributed populations and ii. the variances are equal among the groups. These assumptions can be checked by an analysis of the residuals, as they can be restated as requirements for i. normal distribution and ii. constant variance of the residuals. There are formal tests of normality, such as the Shapiro-Wilk test, but their use is problematic because they test the null hypothesis that a given sample comes from a normal distribution. The tests are more powerful (more likely to reject the null) when there are many observations, but in that case ANOVA is rather robust to moderate violations of the assumption. By contrast, the formal tests fail to identify the most problematic cases, in which the assumptions are not met and the number of observations is low, because in such cases their power is low. Instead, I highly recommend a visual check of the residuals. In particular, a scatterplot of the standardized residuals and a normal quantile-quantile (QQ) plot are informative about possible problems with the ANOVA assumptions. Details on how to use these plots to assess the ANOVA assumptions are very nicely explained here: https://arc.lib.montana.edu/book/statistics-with-r-textbook/item/57

Post-hoc comparisons

When we get a significant result in ANOVA (and only in such a case!), we may be further interested in which mean differs from which. Statistical theory does not provide much help here; however, some pragmatic tools have been developed for this purpose. These are based on the principle of pair-wise comparisons (similar to a series of pair-wise two-sample t-tests), which control the inflation of the type I error probability by adjusting the p-values upwards. An example of such a test is the Tukey honest significant difference test (Tukey HSD). Results of these tests are frequently summarized in plots by letter indices, with different letters indicating significant differences (Fig. 8.3).

Fig. 8.3 Dotchart displaying means and 95%-confidence intervals for the means of the three samples. Means significantly different from each other at α = 0.05 are denoted by different letters (based on Tukey HSD test).
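Before turning to the built-in R functions, the decomposition described above can be made concrete by computing the sums of squares directly. The following sketch uses an invented data frame dat with a response y and a three-level grouping factor group; all names and values are illustrative.

# Illustrative data: three groups of five observations each (invented values)
dat <- data.frame(
  y     = c(5.1, 4.8, 5.6, 5.0, 5.3,
            6.2, 6.6, 5.9, 6.4, 6.1,
            7.0, 7.4, 6.8, 7.3, 7.1),
  group = factor(rep(c("A", "B", "C"), each = 5))
)

grand.mean  <- mean(dat$y)
group.means <- ave(dat$y, dat$group)       # group mean repeated for each observation

SS.total  <- sum((dat$y - grand.mean)^2)
SS.effect <- sum((group.means - grand.mean)^2)
SS.error  <- sum((dat$y - group.means)^2)  # SS.total = SS.effect + SS.error

n <- nrow(dat)
k <- nlevels(dat$group)
MS.effect <- SS.effect / (k - 1)           # DFeffect = k - 1
MS.error  <- SS.error / (n - k)            # DFerror = DFtotal - DFeffect = n - k

F.stat <- MS.effect / MS.error
p      <- 1 - pf(F.stat, k - 1, n - k)     # upper tail of the F distribution
r2     <- SS.effect / SS.total             # proportion of explained variability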
How to do in R

1. F-test, F distribution
Function var.test; pf and qf for F-distribution probabilities and quantiles.

2. ANOVA
Function aov – accepts formula syntax. Note that the predictor must be a factor; otherwise a linear regression is fitted (which is incorrect here, but no warning is given). summary(aov.object) displays the ANOVA table with SS, MS, F and p. plot(aov.object) displays the diagnostic plots for checking the ANOVA assumptions (a brief usage sketch follows this list). See https://arc.lib.montana.edu/book/statistics-with-r-textbook/item/57 for a detailed explanation. Note that this webpage refers to more general regression diagnostics, which we will discuss in upcoming classes; it is, however, basically the same except for diagnostic plot #4, which you may ignore for now.

3. Post-hoc test
TukeyHSD(aov.object) – produces just the pairwise differences between groups. Letters as in Fig. 8.3 must be produced manually.
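Putting these functions together, a typical workflow might look like the sketch below. It reuses the invented data frame dat (response y, grouping factor group) from the example above; the object names are illustrative, not output from real data.

# Fit the one-way ANOVA; the predictor 'group' must be a factor
fit <- aov(y ~ group, data = dat)

# ANOVA table with SS, MS, F and p
summary(fit)

# Diagnostic plots for checking the ANOVA assumptions
plot(fit)

# Pairwise post-hoc comparisons with Tukey HSD
TukeyHSD(fit)

# Proportion of explained variability (r2) from the ANOVA table
ss <- summary(fit)[[1]][["Sum Sq"]]
r2 <- ss[1] / sum(ss)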