8. F-test and F distribution, analysis of variance (ANOVA)

F-test

Normally distributed data can be described by two parameters – the mean and the variance. We discussed testing the difference in means between two samples in the previous chapter. However, it is also possible to test whether two samples come from populations with the same variance, i.e. the null hypothesis stating:

σ₁² = σ₂²

As usual for population parameters, we do not know the σ², but they can be estimated by the sample variances s². A comparison between the sample variances is then made by the F-test

F = s₁² / s₂²

which is a simple ratio of the two sample variances. The F statistic follows the F distribution, the shape of which is defined by two degrees of freedom – the numerator DF and the denominator DF. These are found as n₁ – 1 and n₂ – 1 (i.e. the number of observations in the corresponding sample minus 1). When reporting test results in a text, both DFs must be reported (usually as subscripts), for instance: “Variances differed significantly between green and red apples (F₂₀,₂₅ = 2.52, p = 0.015).”

Fig. 8.1 Probability density plot of F-distributions with different DFs.
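In R, the two-sample F-test is available as var.test(), and the same result can be assembled by hand from the sample variances and pf(). The sketch below uses invented apple weights; the object names and numbers are purely illustrative.

# Illustrative data: weights of green and red apples (invented values)
green <- c(152, 148, 161, 139, 155, 147, 158, 150)
red   <- c(143, 170, 128, 162, 181, 135, 149, 173)

# Built-in two-sample F-test of equal variances
var.test(green, red)

# The same test assembled by hand
F.stat <- var(green) / var(red)
df1 <- length(green) - 1
df2 <- length(red) - 1
# two-sided p-value: twice the smaller tail probability of the F distribution
p <- 2 * min(pf(F.stat, df1, df2), 1 - pf(F.stat, df1, df2))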
Analysis of variance (ANOVA)

The F-test is rarely used to test differences in variance between two samples, because hypotheses about variances are not common. However, the F-test has its crucial application in the analysis of variance. In chapter 7, we discussed comparisons between the means of two samples using a t-test. A natural question, however, arises – what if we have more than two samples? We might try pair-wise comparisons between each pair of them. That would, however, lead to multiple non-independent tests and result in an inflated type I error probability. (If two tests are each performed at α = 0.05, the probability of making a type I error in at least one of them is 0.05 + 0.05 - 0.05² = 0.0975, already nearly twice the nominal level.) Therefore, we use analysis of variance (ANOVA) to solve such problems. ANOVA tests a null hypothesis on the means of multiple samples, which states that the population means are equal, i.e.

μ₁ = μ₂ = μ₃ = … = μₖ

The mechanism of ANOVA is based on decomposing the total variability into two components: 1. a systematic component corresponding to differences between groups and 2. an error (or residual) component corresponding to differences within groups. These differences are measured as squares. For each observation in the dataset, we can calculate its total square (measuring the difference between its value and the overall mean), its effect square (measuring the difference between the corresponding group mean and the overall mean), and its error square (measuring the difference between the value and the corresponding group mean) (Fig. 8.2).

Fig. 8.2 Mechanism of ANOVA: definition of squares exemplified with the red data point.

Subsequently, we can sum these squares over the whole dataset and get the sums of squares (SS): SStotal, SSeffect and SSerror. We can further calculate the mean squares (MS) by dividing each SS by the corresponding DF, with DFtotal = n – 1, DFeffect = k – 1 and DFerror = DFtotal – DFeffect, where n is the total number of observations and k the number of groups. Hence we get:

MSeffect = SSeffect / DFeffect
MSerror = SSerror / DFerror

And now it comes: the mean squares are actually variances. As a result, we can use an F-test to test whether MSeffect is larger than MSerror; such a test is equivalent to the test of the null hypothesis stating that all the means are equal:

F = MSeffect / MSerror, with DFeffect and DFerror degrees of freedom

The corresponding p-value is then found from the F distribution, as in an ordinary F-test. Note that rejecting the null hypothesis only means that at least one of the means differs significantly from at least one other. Besides the p-value, it is also possible to compute the proportion of variability explained by the groups:

r² = SSeffect / SStotal

A typical report of an ANOVA result in a text then reads: “Means were significantly different among the groups (r² = 0.70, F₂,₁₂ = 14.63, p = 0.0006).”

ANOVA assumptions

The application of ANOVA assumes that i. the samples come from normally distributed populations and ii. the variances are equal among the groups. These assumptions can be checked by an analysis of the residuals, as they can be restated as requirements for i. normal distribution and ii. constant variance of the residuals. There are formal tests of normality, such as the Shapiro-Wilk test, but their use is problematic because they test the null hypothesis that a given sample comes from a normal distribution. The tests are more powerful (more likely to reject the null) when there are many observations, but in that case ANOVA is rather robust to moderate violations of the assumption. By contrast, the formal tests fail to identify the most problematic cases, in which the assumptions are not met and the number of observations is low, because in such cases their power is low. Instead, I highly recommend a visual check of the residuals. In particular, a scatterplot of the standardized residuals and a normal quantile-quantile (QQ) plot are informative about possible problems with the ANOVA assumptions. Details on how to use these plots to assess the ANOVA assumptions are very nicely explained here: https://arc.lib.montana.edu/book/statistics-with-r-textbook/item/57

Post-hoc comparisons

When we get a significant result in ANOVA (and only in such a case!), we may be further interested in which mean differs from which. Statistical theory does not provide much help here; however, some pragmatic tools have been developed for this purpose. These are based on the principle of pair-wise comparisons (similar to a series of pair-wise two-sample t-tests), which control the inflation of the type I error probability by adjusting the p-values upwards. An example of such a test is the Tukey honest significant difference test (Tukey HSD). Results of these tests are frequently summarized in plots by letter indices, with different letters indicating significant differences (Fig. 8.3).

Fig. 8.3 Dotchart displaying means and 95%-confidence intervals for the means of the three samples. Means significantly different from each other at α = 0.05 are denoted by different letters (based on Tukey HSD test).
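Before turning to the built-in R functions, the decomposition described above can be made concrete by computing the sums of squares directly. The following sketch uses an invented data frame dat with a response y and a three-level grouping factor group; all names and values are illustrative.

# Illustrative data: three groups of five observations each (invented values)
dat <- data.frame(
  y     = c(5.1, 4.8, 5.6, 5.0, 5.3,
            6.2, 6.6, 5.9, 6.4, 6.1,
            7.0, 7.4, 6.8, 7.3, 7.1),
  group = factor(rep(c("A", "B", "C"), each = 5))
)

grand.mean  <- mean(dat$y)
group.means <- ave(dat$y, dat$group)       # group mean repeated for each observation

SS.total  <- sum((dat$y - grand.mean)^2)
SS.effect <- sum((group.means - grand.mean)^2)
SS.error  <- sum((dat$y - group.means)^2)  # SS.total = SS.effect + SS.error

n <- nrow(dat)
k <- nlevels(dat$group)
MS.effect <- SS.effect / (k - 1)           # DFeffect = k - 1
MS.error  <- SS.error / (n - k)            # DFerror = DFtotal - DFeffect = n - k

F.stat <- MS.effect / MS.error
p      <- 1 - pf(F.stat, k - 1, n - k)     # upper tail of the F distribution
r2     <- SS.effect / SS.total             # proportion of explained variability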
How to do in R

1. F-test, F distribution
Function var.test; pf and qf for F-distribution probabilities and quantiles.

2. ANOVA
Function aov – accepts formula syntax. Note that the predictor must be a factor; otherwise a linear regression is fitted (which is incorrect here, but no warning is given). summary(aov.object) displays the ANOVA table with SS, MS, F and p. plot(aov.object) displays the diagnostic plots for checking the ANOVA assumptions (a brief usage sketch follows this list). See https://arc.lib.montana.edu/book/statistics-with-r-textbook/item/57 for a detailed explanation. Note that this webpage refers to more general regression diagnostics, which we will discuss in upcoming classes; it is, however, basically the same except for diagnostic plot #4, which you may ignore for now.

3. Post-hoc test
TukeyHSD(aov.object) – produces just the pairwise differences between groups. Letters as in Fig. 8.3 must be produced manually.
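Putting these functions together, a typical workflow might look like the sketch below. It reuses the invented data frame dat (response y, grouping factor group) from the example above; the object names are illustrative, not output from real data.

# Fit the one-way ANOVA; the predictor 'group' must be a factor
fit <- aov(y ~ group, data = dat)

# ANOVA table with SS, MS, F and p
summary(fit)

# Diagnostic plots for checking the ANOVA assumptions
plot(fit)

# Pairwise post-hoc comparisons with Tukey HSD
TukeyHSD(fit)

# Proportion of explained variability (r2) from the ANOVA table
ss <- summary(fit)[[1]][["Sum Sq"]]
r2 <- ss[1] / sum(ss)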