One-way ANOVA
Introduction
ANOVA = ANalysis Of VAriance
- Used when comparing group means of three and more groups/conditions.
- Two default variants:
- Between design: independent samples (ANOVA, ANCOVA, factorial ANOVA, etc.)
- Is there any difference between regions when it comes to the income in the Czech Republic?
- Within design: a comparison of a group mean across several points in time (Repeated Measures ANOVA).
- Have households increased fuel consumption during the last five years?
“Grammar”
- Factor - the most frequent attained level of education in the respective city district (primary, secondary, tertiary)
- Dependent variable - the average price for an apartment (60 square metres) in the respective city district in 2016.
- Null hypothesis:
- All of the groups have similar mean. Alternatively, there is no difference in the mean value of an apartment (60 square metres) in Brno based on the most frequent level of education in the respective city district.
- Alternative hypothesis:
- The average price for a flat of 60 square metres is the highest in the districts with university alumni as the most frequent category of the level of education.
F-test and F-ratio
- Null hypothesis: all groups are equal
- ANOVA provides a significance test
- Test statistic is the F-test (or F-ratio)
- How do we get the associated p-value?
- The “family” of F-distributions is based on:
- The number of observations (cases or simply n)
- The number of compared conditions (groups)
Simulation of the F-distribution
# Create the vector x
x <- seq(from = 0, to = 10, length = 2000)
# Evaluate the densities
y_1 <- df(x, 3, 100)
y_2 <- df(x, 1, 1)
y_3 <- df(x, 2, 100)
y_4 <- df(x, 3, 30)
y_5 <- df(x, 3, 500)
y_6 <- df(x, 3, 50)
y_7 <- df(x, 6, 1000)
# Plot the densities
plot(x, y_1, col = 1, type = "l")
lines(x, y_2, col = 2)
lines(x, y_3, col = 3)
lines(x, y_4, col = 4)
lines(x, y_5, col = 5)
lines(x, y_6, col = 6)
lines(x, y_7, col = 7)
# Add the legend
legend("topright", title = "F distributions",
c("df = (3, 100)", "df = (1, 1)", "df = (2, 100)", "df = (3, 30)",
"df = (3, 500)", "df = (3, 50)", "df = (6, 1000)"),
col = c(1, 2, 3, 4, 5, 6, 7), lty = 1)
Summary Table
Code
Bydleni_Brno$Vzdelani = factor(Bydleni_Brno$Vzdelani,
order = TRUE,
levels = c(1, 2, 3),
labels = c("Základní",
"Středoškolské",
"Vysokoškolské"))
Bydleni_Brno_60m2_2016 <- Bydleni_Brno %>%
filter(Prodej_60m2_2016 > 0)
- Summary statistics by groups:
describeBy(Bydleni_Brno_60m2_2016$Prodej_60m2_2016,
group = factor(Bydleni_Brno_60m2_2016$Vzdelani))
Descriptive statistics by group
group: Základní
---------------------------------------------------------------------------------------------
group: Středoškolské
---------------------------------------------------------------------------------------------
group: Vysokoškolské
bp1 = ggplot(Bydleni_Brno_60m2_2016, aes(Vzdelani, Prodej_60m2_2016))
bp1 + geom_boxplot(aes(fill=Vzdelani), alpha=I(0.5)) +
geom_point(position="jitter", alpha=0.5) +
geom_boxplot(outlier.size=0, alpha=0.5) +
theme(
axis.title.x = element_text(face="bold", color="black", size=12),
axis.title.y = element_text(face="bold", color="black", size=12),
plot.title = element_text(face="bold", color = "black", size=12)) +
labs(x="Převažující kategorie vzdělání ",
y = "Cena za byt o rozloze 60 metrů čtverečních (v Kč)",
title= "Cena za byt o rozloze 60 metrů čtverečních (v Kč) dle převažující kategorie vzdělání") +
theme(legend.position='none')
aov(dependent_var ~ independent_var)
summary()
anova_BydleniBrno <- aov(Prodej_60m2_2016 ~ Vzdelani,
data = Bydleni_Brno_60m2_2016)
- Look at the summary table of the result
summary(anova_BydleniBrno)
Df Sum Sq Mean Sq F value Pr(>F)
Vzdelani 2 3.465e+11 1.733e+11 0.963 0.396
Residuals 23 4.137e+12 1.799e+11
Size effect
library(lsr)
etaSquared(anova_BydleniBrno,
type = 2,
anova = FALSE)
eta.sq eta.sq.part
Vzdelani 0.07729335 0.07729335
anova <- lm(Prodej_60m2_2016 ~ Vzdelani,
data = Bydleni_Brno_60m2_2016)
performance::r2(anova)
[34m# R2 for Linear Regression
[39m R2: 0.077
adj. R2: -0.003
Assumptions
Variables
- “Dependent” variable at least on the interval level.
Normal distribution of the dependent variable
- In each of the groups/conditions.
- Violation of this assumption should not substantially bias the test result as long as the groups have a similar number of observations, and in each group, there are at least 30 observations.
- Non-parametric alternative – Kruskal-Wallis test
Homogenity of variance
- Sledujeme Levenův F-test, nulová hypotéza hovoří o homogenitě napříč skupinami
- Pokud Levenův F-test vychází statisticky signifikantní:
- Sledujeme poměr rozptylu u skupin s největším a nejmenším rozptylem, přičemž chceme, aby byl tento poměr menší než 3
- Narušení by nemělo vadit, pokud jsou skupiny stejně velké
- Při narušení lze použít Welchovo F
Independence of observation
Homogenity of variance
library("car")
- If you don’t specify additional arguments, the deviation scores are calculated by comparing each score to its group median.
- This is the default behaviour, even though they are typically calculated by comparing each score to its group mean.
- If you want to use means and not medians, add an argument center = mean. Do this now and compare the results to the first test.
leveneTest(Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 0.7446 0.486
23
- Levene’s test with center = mean:
leveneTest(Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016, center = mean)
Levene's Test for Homogeneity of Variance (center = mean)
Df F value Pr(>F)
group 2 0.8418 0.4438
23
Normality of the dependent variable:
ggplot(data=Bydleni_Brno_60m2_2016, aes(Prodej_60m2_2016)) +
geom_histogram(binwidth = 250000, col="red",
aes(fill=..count..)) +
scale_fill_gradient("Count", low = "green", high = "red") +
labs(title="Histogram ceny za byt o rozloze 60 metrů čtverečních v Brně v roce 2016") +
labs(x="Cena", y="Četnost") + theme(legend.position='none')
F-test
anova_wm_VNE = oneway.test(Prodej_60m2_2016 ~ Vzdelani,
data = Bydleni_Brno_60m2_2016,
var.equal=TRUE)
anova_wm_VNE
One-way analysis of means
data: Prodej_60m2_2016 and Vzdelani
F = 0.96333, num df = 2, denom df = 23, p-value = 0.3965
anova_wm_VE = oneway.test(Prodej_60m2_2016 ~ Vzdelani,
data = Bydleni_Brno_60m2_2016,
var.equal=FALSE)
anova_wm_VE
One-way analysis of means (not assuming equal variances)
data: Prodej_60m2_2016 and Vzdelani
F = 0.68429, num df = 2.0000, denom df = 5.6794, p-value = 0.5417
Post-Hoc Tests
Allow for multiple pairwise comparisons without an increase in the probability of a Type I error.
Používáme, pokud nemáme dopředu jasné hypotézy.
- Srovnávají vše se vším – každou skupinu s každou (ale neumí slučovat skupiny jako kontrasty).
- Z principu jsou oboustranné.
- Je jich mnoho – liší se v několika parametrech:
- Konzervativní (Ch. II. typu) versus Liberální (Ch. I. typu),
- Most liberal = no adjustment,
- Most conservative = adjust for every possible comparison that could be made,
- Ne/vhodné pro rozdílně velké skupiny,
- Ne/vhodné pro rozdílné skupinové rozptyly.
Recommendation by A. Field:
- Stejně velké skupiny a skupinové rozptyly (ideální situace):
- Pokud si chceme být jistí, že P chyby I. typu nepřekročí zvolenou hladinu:
- Pokud jsou velikosti skupin trochu/hodně rozdílné:
- Pokud pochybujeme o shodnosti skupinových rozptylů:
Tukey
anova_BydleniBrno = aov(Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016)
summary(anova_BydleniBrno)
Df Sum Sq Mean Sq F value Pr(>F)
Vzdelani 2 3.465e+11 1.733e+11 0.963 0.396
Residuals 23 4.137e+12 1.799e+11
tukey <- TukeyHSD(anova_BydleniBrno)
- Plot confidence intervals:
plot(tukey)
Bonferroni
The Bonferroni correction compensates for that increase by testing each individual hypothesis at a significance level of α/m, where α is the desired overall alpha level and m is the number of hypotheses.
- For example, if a trial is testing m = 20 hypotheses with a desired α = 0.05, then the Bonferroni correction would test each individual hypothesis at α = 0.05/20 = 0.0025.
pairwise.t.test(Bydleni_Brno_60m2_2016$Prodej_60m2_2016,
Bydleni_Brno_60m2_2016$Vzdelani, p.adjust = "bonferroni")
Pairwise comparisons using t tests with pooled SD
data: Bydleni_Brno_60m2_2016$Prodej_60m2_2016 and Bydleni_Brno_60m2_2016$Vzdelani
Základní Středoškolské
Středoškolské 1.00 -
Vysokoškolské 0.57 0.89
P value adjustment method: bonferroni
Contrasts
Umožňují porovnat jednotlivé skupiny v jednom kroku bez nutnosti korigovat hladinu významnosti (bez snížení síly testu)
- Jen když máme dopředu hypotézy
Kontrastů lze provést tolik, kolik je počet skupin – 1
- Každý kontrast srovnává 2 průměry
- Průměr skupiny nebo průměr více skupin dohromady
Např. „Základní" vs. „Středoškolské"
- Ortogonální (nezávislé) kontrasty
Skupina použitá v jednom srovnání není použitá v dalším
Neortogonální kontrasty
c1 = c(-1,0,1)
c2 = c(0,-1,1)
mat <- cbind(c1,c2)
contrasts(Bydleni_Brno_60m2_2016$Vzdelani) <- mat
model1 <- lm(Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016)
summary(model1)
Call:
lm(formula = Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016)
Residuals:
Min 1Q Median 3Q Max
-872820 -171177 87618 278193 601098
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2807714 100840 27.843 <2e-16 ***
Vzdelanic1 -178546 158610 -1.126 0.272
Vzdelanic2 -25648 117027 -0.219 0.828
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 424100 on 23 degrees of freedom
Multiple R-squared: 0.07729, Adjusted R-squared: -0.002942
F-statistic: 0.9633 on 2 and 23 DF, p-value: 0.3965
options(contrasts = c("contr.helmert", "contr.poly"))
contrasts(Bydleni_Brno_60m2_2016$Vzdelani) <- "contr.helmert"
model1 <- lm(Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016)
summary(model1)
Call:
lm(formula = Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016)
Residuals:
Min 1Q Median 3Q Max
-872820 -171177 87618 278193 601098
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2807714 100840 27.843 <2e-16 ***
Vzdelani1 -76449 117840 -0.649 0.523
Vzdelani2 -102097 74430 -1.372 0.183
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 424100 on 23 degrees of freedom
Multiple R-squared: 0.07729, Adjusted R-squared: -0.002942
F-statistic: 0.9633 on 2 and 23 DF, p-value: 0.3965