Outline


    01. Comparing two groups (t-tests)

    02. Comparing three and more groups (ANOVA)


    Data and packages

    Housing in Brno


    • Packages:
    require("readxl")
    require("dplyr")
    require("summarytools")
    require("ggplot2")
    library("psych")
    library(sjstats)
    library("car")


    • Data:
      Bydleni_Brno <- read_excel("Bydleni_Brno.xlsx")


    Comparing two groups


    Dependent t-test



    Assumptions


    • The sampling distribution is normally distributed.
      • In the dependent t-test this means that the sampling distribution of the differences between scores should be normal, not the scores themselves.
    • Data (“Dependent variable”) are measured at least at the interval level.


    Main Part of the Analysis


    • In the case of our dependent t-test, we need to specify these arguments to t.test():
     ?t.test


    • “paired=…”: Whether we’re doing a dependent (i.e. paired) t-test or independent t-test.
      • In case of the paired t-test it’s TRUE.
    • Note that t.test() carries out a two-sided t-test by default.


    • Variables:
      • x: Column of Bydleni_Brno containing prices for 2015.
      • y: Column of Bydleni_Brno containing prices for 2016


    • Conduct a paired t-test using the t.test function:
    t.test(Bydleni_Brno$Pronajem_m2_2015, Bydleni_Brno$Pronajem_m2_2016, mu = -1,
           paired = TRUE)
    
        Paired t-test
    
    data:  Bydleni_Brno$Pronajem_m2_2015 and Bydleni_Brno$Pronajem_m2_2016
    t = -2.4345, df = 28, p-value = 0.02154
    alternative hypothesis: true difference in means is not equal to -1
    95 percent confidence interval:
     -19.477518  -2.591448
    sample estimates:
    mean of the differences 
                  -11.03448 


    Effect Size


    • Packages:
    library("lsr")
    library("effsize")


    • Cohen’s d, the effsize way:
    cohen.d(x, y, 
            pooled=TRUE, 
            paired=TRUE,
            na.rm=FALSE, 
            hedges.correction=FALSE,
            conf.level=0.95, 
            noncentral=FALSE)
    
    ?cohen.d()


    cohen.d(Bydleni_Brno$Pronajem_m2_2015, Bydleni_Brno$Pronajem_m2_2016, paired=TRUE)
    
    Cohen's d
    
    d estimate: -0.2324419 (small)
    95 percent confidence interval:
          lower       upper 
    -0.40870688 -0.05617685 


    Visualisation of the Cohen’s d


    • This app will show you a graph of simulated data with a random number of observations in each of two groups and random effect size.
      • The effect size will be between -3 and 3, and your job is to guess the size of the effect.


    Independent t-test



    Assumptions


    • Homogeneity of variance.
    • Scores are independent (because they come from non-identical units).
    • Data (“Dependent variable”) are measured at least at the interval level.


    • Create subsets:
    Sidliste <- Bydleni_Brno %>%
     select(c("Sidliste", "Pronajem_m2_2016")) %>%
     filter(Sidliste == 1, Pronajem_m2_2016 > 0)
    
    Rodinne_Domy <- Bydleni_Brno %>%
     select(c("Sidliste", "Pronajem_m2_2016")) %>%
     filter(Sidliste == 0, Pronajem_m2_2016 > 0)


    • Summary statistics:
    Sidliste_view <- view(dfSummary(Sidliste))
    
    Rodinne_Domy_view <- view(dfSummary(Rodinne_Domy))


    • Visualise the data:
    Bydleni_Brno_Najem <- Bydleni_Brno %>%
     filter(Pronajem_m2_2016 > 0) %>%
     ggplot(aes(x = factor(Sidliste), 
                y = Pronajem_m2_2016, 
                fill = factor(Sidliste))) + 
      geom_boxplot()
    
    Bydleni_Brno_Najem


    Levene’s test (Load the car package):

    # Wrangle the data:
    Bydleni_Brno_Najem_Levene <- Bydleni_Brno %>%
     filter(Pronajem_m2_2016 > 0)
    
    # Levene's test
    leveneTest(Bydleni_Brno_Najem_Levene$Pronajem_m2_2016 ~ factor(Bydleni_Brno_Najem_Levene$Sidliste))
    Levene's Test for Homogeneity of Variance (center = median)
          Df F value  Pr(>F)  
    group  1  7.4252 0.01157 *
          25                  
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


    • Conduct an independent t-test:
      t.test(Sidliste$Pronajem_m2_2016, Rodinne_Domy$Pronajem_m2_2016,
             var.equal = FALSE)
    
        Welch Two Sample t-test
    
    data:  Sidliste$Pronajem_m2_2016 and Rodinne_Domy$Pronajem_m2_2016
    t = -0.98247, df = 9.591, p-value = 0.35
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -37.18558  14.51891
    sample estimates:
    mean of x mean of y 
     149.7778  161.1111 


    Size Effect



    • Cohen’s d, the lsr way:
    # Data
    Najem_2015 <- Bydleni_Brno %>% select(Pronajem_m2_2015)
    Sidliste_2015 <- Bydleni_Brno %>% select(Sidliste)
    
    
    cohensD(Bydleni_Brno$Pronajem_m2_2015 ~ Bydleni_Brno$Sidliste)
    [1] 0.4811341


    • Cohen’s d, the effsize way:
    cohen.d(Bydleni_Brno$Pronajem_m2_2015 ~ Bydleni_Brno$Sidliste)
    Cohercing rhs of formula to factor
    
    Cohen's d
    
    d estimate: -0.4811341 (small)
    95 percent confidence interval:
         lower      upper 
    -1.2770112  0.3147431 

    Comparing three and more groups

    One-way ANOVA


    Introduction


    ANOVA = ANalysis Of VAriance


    • Used when comparing group means of three and more groups/conditions.
    • Two default variants:
      • Between design: independent samples (ANOVA, ANCOVA, factorial ANOVA, etc.)
        • Is there any difference between regions when it comes to the income in the Czech Republic?
    • Within design: a comparison of a group mean across several points in time (Repeated Measures ANOVA).
      • Have households increased fuel consumption during the last five years?


    “Grammar”


    • Factor - the most frequent attained level of education in the respective city district (primary, secondary, tertiary)
    • Dependent variable - the average price for an apartment (60 square metres) in the respective city district in 2016.
    • Null hypothesis:
      • All of the groups have similar mean. Alternatively, there is no difference in the mean value of an apartment (60 square metres) in Brno based on the most frequent level of education in the respective city district.
    • Alternative hypothesis:
      • The average price for a flat of 60 square metres is the highest in the districts with university alumni as the most frequent category of the level of education.


    F-test and F-ratio

    • Null hypothesis: all groups are equal
      • ANOVA provides a significance test
    • Test statistic is the F-test (or F-ratio)


    • How do we get the associated p-value?
      • The “family” of F-distributions is based on:
        • The number of observations (cases or simply n)
        • The number of compared conditions (groups)


    Simulation of the F-distribution

    # Create the vector x
    x <- seq(from = 0, to = 10, length = 2000)
    
    # Evaluate the densities
    y_1 <- df(x, 3, 100)
    y_2 <- df(x, 1, 1)
    y_3 <- df(x, 2, 100)
    y_4 <- df(x, 3, 30)
    y_5 <- df(x, 3, 500)
    y_6 <- df(x, 3, 50)
    y_7 <- df(x, 6, 1000)
    
    # Plot the densities
    plot(x, y_1, col = 1, type = "l")
    lines(x, y_2, col = 2)
    lines(x, y_3, col = 3)
    lines(x, y_4, col = 4)
    lines(x, y_5, col = 5)
    lines(x, y_6, col = 6)
    lines(x, y_7, col = 7)
    
    # Add the legend
    legend("topright", title = "F distributions",
    c("df = (3, 100)", "df = (1, 1)", "df = (2, 100)", "df = (3, 30)",
    "df = (3, 500)", "df = (3, 50)", "df = (6, 1000)"),
    col = c(1, 2, 3, 4, 5, 6, 7), lty = 1)


    Summary Table



    Code


    • Data wrangling
    Bydleni_Brno$Vzdelani = factor(Bydleni_Brno$Vzdelani, 
                                   order = TRUE, 
                                   levels = c(1, 2, 3),
                                   labels = c("Základní", 
                                             "Středoškolské", 
                                             "Vysokoškolské"))
    
    Bydleni_Brno_60m2_2016 <- Bydleni_Brno %>%
                                filter(Prodej_60m2_2016 > 0)


    • Summary statistics by groups:
    describeBy(Bydleni_Brno_60m2_2016$Prodej_60m2_2016, 
               group = factor(Bydleni_Brno_60m2_2016$Vzdelani))
    
     Descriptive statistics by group 
    group: Základní
    --------------------------------------------------------------------------------------------- 
    group: Středoškolské
    --------------------------------------------------------------------------------------------- 
    group: Vysokoškolské


    • Boxplot:
    bp1 = ggplot(Bydleni_Brno_60m2_2016, aes(Vzdelani, Prodej_60m2_2016))
    bp1 + geom_boxplot(aes(fill=Vzdelani), alpha=I(0.5)) +
     geom_point(position="jitter", alpha=0.5) +
     geom_boxplot(outlier.size=0, alpha=0.5) +
     theme(
     axis.title.x = element_text(face="bold", color="black", size=12),
     axis.title.y = element_text(face="bold", color="black", size=12),
     plot.title = element_text(face="bold", color = "black", size=12)) +
     labs(x="Převažující kategorie vzdělání ",
     y = "Cena za byt o rozloze 60 metrů čtverečních (v Kč)",
     title= "Cena za byt o rozloze 60 metrů čtverečních (v Kč) dle převažující kategorie vzdělání") +
    theme(legend.position='none')
    


    • The aov function:
    aov(dependent_var ~ independent_var)
    summary()


    • Apply the aov function:
    anova_BydleniBrno <- aov(Prodej_60m2_2016 ~ Vzdelani, 
                             data = Bydleni_Brno_60m2_2016)


    • Look at the summary table of the result
    summary(anova_BydleniBrno)
                Df    Sum Sq   Mean Sq F value Pr(>F)
    Vzdelani     2 3.465e+11 1.733e+11   0.963  0.396
    Residuals   23 4.137e+12 1.799e+11               


    Size effect


    • The lsr package:
    library(lsr)
    etaSquared(anova_BydleniBrno, 
               type = 2,
               anova = FALSE)
                 eta.sq eta.sq.part
    Vzdelani 0.07729335  0.07729335



    • The sjstats package:
    anova <- lm(Prodej_60m2_2016 ~ Vzdelani, 
                    data = Bydleni_Brno_60m2_2016)
    performance::r2(anova)
    # R2 for Linear Regression
    
           R2: 0.077
      adj. R2: -0.003




    Assumptions


    Variables

    • “Dependent” variable at least on the interval level.


    Normal distribution of the dependent variable

    • In each of the groups/conditions.
    • Violation of this assumption should not substantially bias the test result as long as the groups have a similar number of observations, and in each group, there are at least 30 observations.
    • Non-parametric alternative – Kruskal-Wallis test


    Homogenity of variance

    • Sledujeme Levenův F-test, nulová hypotéza hovoří o homogenitě napříč skupinami
    • Pokud Levenův F-test vychází statisticky signifikantní:
    • Sledujeme poměr rozptylu u skupin s největším a nejmenším rozptylem, přičemž chceme, aby byl tento poměr menší než 3
    • Narušení by nemělo vadit, pokud jsou skupiny stejně velké
    • Při narušení lze použít Welchovo F


    Independence of observation


    Homogenity of variance
    library("car")


    • If you don’t specify additional arguments, the deviation scores are calculated by comparing each score to its group median.
      • This is the default behaviour, even though they are typically calculated by comparing each score to its group mean.
      • If you want to use means and not medians, add an argument center = mean. Do this now and compare the results to the first test.


    • Levene’s test:
    leveneTest(Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016)
    Levene's Test for Homogeneity of Variance (center = median)
          Df F value Pr(>F)
    group  2  0.7446  0.486
          23               


    • Levene’s test with center = mean:
    leveneTest(Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016, center = mean)
    Levene's Test for Homogeneity of Variance (center = mean)
          Df F value Pr(>F)
    group  2  0.8418 0.4438
          23               


    Normality of the dependent variable:
    ggplot(data=Bydleni_Brno_60m2_2016, aes(Prodej_60m2_2016)) +
     geom_histogram(binwidth = 250000, col="red",
     aes(fill=..count..)) +
     scale_fill_gradient("Count", low = "green", high = "red") +
     labs(title="Histogram ceny za byt o rozloze 60 metrů čtverečních v Brně v roce 2016") +
     labs(x="Cena", y="Četnost") + theme(legend.position='none')


    F-test

    • Fischer’s test (F-test):
    anova_wm_VNE = oneway.test(Prodej_60m2_2016 ~ Vzdelani, 
                               data = Bydleni_Brno_60m2_2016, 
                               var.equal=TRUE)
    
    anova_wm_VNE
    
        One-way analysis of means
    
    data:  Prodej_60m2_2016 and Vzdelani
    F = 0.96333, num df = 2, denom df = 23, p-value = 0.3965


    • Welch’s F-test:
    anova_wm_VE = oneway.test(Prodej_60m2_2016 ~ Vzdelani, 
                              data = Bydleni_Brno_60m2_2016, 
                              var.equal=FALSE)
    
    anova_wm_VE
    
        One-way analysis of means (not assuming equal variances)
    
    data:  Prodej_60m2_2016 and Vzdelani
    F = 0.68429, num df = 2.0000, denom df = 5.6794, p-value = 0.5417


    Post-Hoc Tests


    Allow for multiple pairwise comparisons without an increase in the probability of a Type I error.

    Používáme, pokud nemáme dopředu jasné hypotézy.

    • Srovnávají vše se vším – každou skupinu s každou (ale neumí slučovat skupiny jako kontrasty).
      • Z principu jsou oboustranné.
    • Je jich mnoho – liší se v několika parametrech:
      • Konzervativní (Ch. II. typu) versus Liberální (Ch. I. typu),
      • Most liberal = no adjustment,
      • Most conservative = adjust for every possible comparison that could be made,
      • Ne/vhodné pro rozdílně velké skupiny,
      • Ne/vhodné pro rozdílné skupinové rozptyly.


    Recommendation by A. Field:


    • Stejně velké skupiny a skupinové rozptyly (ideální situace):
      • REGWQ
      • Tukey


    • Pokud si chceme být jistí, že P chyby I. typu nepřekročí zvolenou hladinu:
      • Bonferroni


    • Pokud jsou velikosti skupin trochu/hodně rozdílné:
      • Gabriel
      • Hochberg GT2


    • Pokud pochybujeme o shodnosti skupinových rozptylů:
      • Games-Howell


    Tukey


    • Conduct ANOVA:
    anova_BydleniBrno = aov(Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016)


    • View summary:
    summary(anova_BydleniBrno)
                Df    Sum Sq   Mean Sq F value Pr(>F)
    Vzdelani     2 3.465e+11 1.733e+11   0.963  0.396
    Residuals   23 4.137e+12 1.799e+11               


    • Conduct Tukey procedure:
    tukey <- TukeyHSD(anova_BydleniBrno)


    • Plot confidence intervals:
    plot(tukey)


    Bonferroni


    The Bonferroni correction compensates for that increase by testing each individual hypothesis at a significance level of α/m, where α is the desired overall alpha level and m is the number of hypotheses.

    • For example, if a trial is testing m = 20 hypotheses with a desired α = 0.05, then the Bonferroni correction would test each individual hypothesis at α = 0.05/20 = 0.0025.


    • Pairwise t-test:
    pairwise.t.test(Bydleni_Brno_60m2_2016$Prodej_60m2_2016,
    Bydleni_Brno_60m2_2016$Vzdelani, p.adjust = "bonferroni")
    
        Pairwise comparisons using t tests with pooled SD 
    
    data:  Bydleni_Brno_60m2_2016$Prodej_60m2_2016 and Bydleni_Brno_60m2_2016$Vzdelani 
    
                  Základní Středoškolské
    Středoškolské 1.00     -            
    Vysokoškolské 0.57     0.89         
    
    P value adjustment method: bonferroni 



    Contrasts


    Umožňují porovnat jednotlivé skupiny v jednom kroku bez nutnosti korigovat hladinu významnosti (bez snížení síly testu)

    • Jen když máme dopředu hypotézy
    • Kontrastů lze provést tolik, kolik je počet skupin – 1

    • Každý kontrast srovnává 2 průměry
    • Průměr skupiny nebo průměr více skupin dohromady
    • Např. „Základní" vs. „Středoškolské"

    • Ortogonální (nezávislé) kontrasty
    • Skupina použitá v jednom srovnání není použitá v dalším

    Neortogonální kontrasty


    • Contrasts:
    c1 = c(-1,0,1)
    c2 = c(0,-1,1)
    
    mat <- cbind(c1,c2)
    contrasts(Bydleni_Brno_60m2_2016$Vzdelani) <- mat
    
    model1 <- lm(Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016)
    summary(model1)
    
    Call:
    lm(formula = Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -872820 -171177   87618  278193  601098 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  2807714     100840  27.843   <2e-16 ***
    Vzdelanic1   -178546     158610  -1.126    0.272    
    Vzdelanic2    -25648     117027  -0.219    0.828    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 424100 on 23 degrees of freedom
    Multiple R-squared:  0.07729,   Adjusted R-squared:  -0.002942 
    F-statistic: 0.9633 on 2 and 23 DF,  p-value: 0.3965
    options(contrasts = c("contr.helmert", "contr.poly"))
    
    contrasts(Bydleni_Brno_60m2_2016$Vzdelani) <- "contr.helmert"
    
    model1 <- lm(Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016)
    
    summary(model1)
    
    Call:
    lm(formula = Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -872820 -171177   87618  278193  601098 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  2807714     100840  27.843   <2e-16 ***
    Vzdelani1     -76449     117840  -0.649    0.523    
    Vzdelani2    -102097      74430  -1.372    0.183    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 424100 on 23 degrees of freedom
    Multiple R-squared:  0.07729,   Adjusted R-squared:  -0.002942 
    F-statistic: 0.9633 on 2 and 23 DF,  p-value: 0.3965


    Resources

    Conway, A. (n.d.) Intro to Statistics with R: Student’s T-test. Dostupné online na: https://www.datacamp.com/courses/intro-to-statistics-with-rstudents-t-test

    Effect size (n.d.). In Wikipedia: Staženo dne 10. 10. 2016 z https://en.wikipedia.org/wiki/Effect_size

    Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. Sage: UK.

    Navarro, D. J. (2014). Learning statistics with R: A tutorial for psychology students and other beginners. Available online: http://health.adelaide.edu.au/psychology/ccs/teaching/lsr/

    Standard error (n.d.). In Wikipedia: Staženo dne 10. 10. 2016 z https://en.wikipedia.org/wiki/Standard_error

    Student’s t-test (n.d.). In Wikipedia: Staženo dne 10. 10. 2016 z https://en.wikipedia.org/wiki/Student%27s_ttest

     

    A work by Vit Gabrhel

     

    ---
title: "**08. Comparing Two and More Groups**"
subtitle: "R101"
author: "Vít Gabrhel"
output: 
  html_notebook:
    toc: true
    toc_float: true
    theme: yeti
    code_folding: "show"
    
---

## Outline
<br>

<ul>
#### 01. **Comparing two groups (t-tests)**
#### 02. **Comparing three and more groups (ANOVA)**
<ul/>

<br>

## Data and packages
### *Housing in Brno*
<br>

* **Packages**:
```{r, result='hide', message=FALSE}
require("readxl")
require("dplyr")
require("summarytools")
require("ggplot2")
library("psych")
library(sjstats)
library("car")
```
<br>

* **Data**:
```{r}
  Bydleni_Brno <- read_excel("Bydleni_Brno.xlsx")
```


<br>

## Comparing two groups
<br>

### Dependent t-test
<br>

![](01.jpg)

<br>

#### *Assumptions*

<br>

* The sampling distribution is normally distributed. 
  + In the dependent t-test this means that the **sampling distribution** of the **differences between scores** should be **normal**, not the scores themselves.
* Data (*"Dependent variable"*) are measured at least at the **interval level**.

<br>

#### *Main Part of the Analysis*

<br>

* In the case of our dependent t-test, we need to specify these arguments to t.test():
```{r, eval=FALSE}
 ?t.test
```
<br>

* "paired=...": Whether we're doing a **dependent (i.e. paired) t-test** or **independent t-test**. 
    + In case of the **paired t-test** it's **TRUE**.
* Note that *t.test()* carries out a **two-sided t-test** by **default**.

<br>

* **Variables**:
  + x: Column of *Bydleni_Brno* containing prices for 2015.
  + y: Column of *Bydleni_Brno* containing prices for 2016

<br>

* **Conduct a paired t-test** using the t.test function:
```{r}
t.test(Bydleni_Brno$Pronajem_m2_2015, Bydleni_Brno$Pronajem_m2_2016, mu = -1,
       paired = TRUE)
```

![](02.jpg)

<br>

#### *Effect Size*

<br>

* Packages:
```{r, results='hide'}
library("lsr")
library("effsize")
```

<br>

* **Cohen's d**, the **effsize** way:
```{r, eval=FALSE}
cohen.d(x, y, 
        pooled=TRUE, 
        paired=TRUE,
        na.rm=FALSE, 
        hedges.correction=FALSE,
        conf.level=0.95, 
        noncentral=FALSE)

?cohen.d()
```

<br>

```{r}
cohen.d(Bydleni_Brno$Pronajem_m2_2015, Bydleni_Brno$Pronajem_m2_2016, paired=TRUE)
```

<br>

#### *Visualisation of the Cohen's d*

<br>

* This [app](http://shiny.psy.gla.ac.uk/guess/?fbclid=IwAR2EqXgySl7MDoDegYjE80vI1qFMa5qgEz1IAeXRXWv128x9y3jQlxsEHFk) will show you a graph of simulated data with a random number of observations in each of two groups and random effect size. 
  + The effect size will be between -3 and 3, and your job is to **guess the size of the effect**.

<br>
  
### Independent t-test

<br>

![](03.jpg)

<br>

#### *Assumptions*

<br>

* **Homogeneity of variance**. 
* Scores are **independent** (because they come from *non-identical units*).
* Data (*"Dependent variable"*) are measured at least at the **interval level**.

<br>

* Create subsets:
```{r}
Sidliste <- Bydleni_Brno %>%
 select(c("Sidliste", "Pronajem_m2_2016")) %>%
 filter(Sidliste == 1, Pronajem_m2_2016 > 0)

Rodinne_Domy <- Bydleni_Brno %>%
 select(c("Sidliste", "Pronajem_m2_2016")) %>%
 filter(Sidliste == 0, Pronajem_m2_2016 > 0)
```

<br>

* Summary statistics:
```{r, message = FALSE, eval = FALSE}
Sidliste_view <- view(dfSummary(Sidliste))

Rodinne_Domy_view <- view(dfSummary(Rodinne_Domy))
```

<br>

* Visualise the data:
```{r}
Bydleni_Brno_Najem <- Bydleni_Brno %>%
 filter(Pronajem_m2_2016 > 0) %>%
 ggplot(aes(x = factor(Sidliste), 
            y = Pronajem_m2_2016, 
            fill = factor(Sidliste))) + 
  geom_boxplot()

Bydleni_Brno_Najem
```

<br>

#### Levene's test (Load the `car` package):
```{r}
# Wrangle the data:
Bydleni_Brno_Najem_Levene <- Bydleni_Brno %>%
 filter(Pronajem_m2_2016 > 0)

# Levene's test
leveneTest(Bydleni_Brno_Najem_Levene$Pronajem_m2_2016 ~ factor(Bydleni_Brno_Najem_Levene$Sidliste))
```

<br>

* Conduct an **independent t-test**:
```{r}
  t.test(Sidliste$Pronajem_m2_2016, Rodinne_Domy$Pronajem_m2_2016,
         var.equal = FALSE)
```

<br>

#### *Size Effect*

<br>

![](04.jpg)

<br>

* **Cohen's d**, the **lsr** way:

```{r}
# Data
Najem_2015 <- Bydleni_Brno %>% select(Pronajem_m2_2015)
Sidliste_2015 <- Bydleni_Brno %>% select(Sidliste)


cohensD(Bydleni_Brno$Pronajem_m2_2015 ~ Bydleni_Brno$Sidliste)
```

<br>

* **Cohen's d**, the **effsize** way:
```{r, eval=TRUE}
cohen.d(Bydleni_Brno$Pronajem_m2_2015 ~ Bydleni_Brno$Sidliste)
```

## Comparing three and more groups

### One-way ANOVA

<br>

#### *Introduction*

<br>

**ANOVA** = **AN**alysis **O**f **VA**riance

<br>

* Used when **comparing group means** of **three and more groups/conditions**.
* Two default variants:
  + **Between design**: independent samples (ANOVA, ANCOVA, factorial ANOVA, etc.)
    + *Is there any difference between regions when it comes to the income in the Czech Republic?*
* **Within design**: a comparison of a group mean across several points in time (Repeated Measures ANOVA).
  + *Have households increased fuel consumption during the last five years?*

<br>
  
#### *"Grammar"*

<br>

* **Factor** - the most frequent attained level of education in the respective city district (primary, secondary, tertiary)
* **Dependent variable** - the average price for an apartment (60 square metres) in the respective city district in 2016.
* **Null hypothesis**:
  + *All of the groups have similar mean. Alternatively, there is no difference in the mean value of an apartment (60 square metres) in Brno based on the most frequent level of education in the respective city district.*
* **Alternative hypothesis**:
  + *The average price for a flat of 60 square metres is the highest in the districts with university alumni as the most frequent category of the level of education.*

<br>

#### *F-test and F-ratio*

* **Null hypothesis**: *all groups are equal*
  + ANOVA provides a **significance test**
* Test statistic is the **F-test** (or **F-ratio**)

<br>

* *How do we get the associated p-value*?
  + The "family" of **F-distributions** is based on:
    + The **number** of **observations** (cases or simply **n**)
    + The **number** of compared conditions (**groups**)

<br>
    
#### *Simulation of the F-distribution*

```{r}
# Create the vector x
x <- seq(from = 0, to = 10, length = 2000)

# Evaluate the densities
y_1 <- df(x, 3, 100)
y_2 <- df(x, 1, 1)
y_3 <- df(x, 2, 100)
y_4 <- df(x, 3, 30)
y_5 <- df(x, 3, 500)
y_6 <- df(x, 3, 50)
y_7 <- df(x, 6, 1000)

# Plot the densities
plot(x, y_1, col = 1, type = "l")
lines(x, y_2, col = 2)
lines(x, y_3, col = 3)
lines(x, y_4, col = 4)
lines(x, y_5, col = 5)
lines(x, y_6, col = 6)
lines(x, y_7, col = 7)

# Add the legend
legend("topright", title = "F distributions",
c("df = (3, 100)", "df = (1, 1)", "df = (2, 100)", "df = (3, 30)",
"df = (3, 500)", "df = (3, 50)", "df = (6, 1000)"),
col = c(1, 2, 3, 4, 5, 6, 7), lty = 1)
```

<br>

#### *Summary Table*

<br>

![](05.jpg) 

<br>

#### *Code*

<br>

* Data wrangling
```{r}
Bydleni_Brno$Vzdelani = factor(Bydleni_Brno$Vzdelani, 
                               order = TRUE, 
                               levels = c(1, 2, 3),
                               labels = c("Základní", 
                                         "Středoškolské", 
                                         "Vysokoškolské"))

Bydleni_Brno_60m2_2016 <- Bydleni_Brno %>%
                            filter(Prodej_60m2_2016 > 0)
```

<br>

* Summary statistics by groups:
```{r}
describeBy(Bydleni_Brno_60m2_2016$Prodej_60m2_2016, 
           group = factor(Bydleni_Brno_60m2_2016$Vzdelani))
```

<br>

* Boxplot:
```{r}
bp1 = ggplot(Bydleni_Brno_60m2_2016, aes(Vzdelani, Prodej_60m2_2016))
bp1 + geom_boxplot(aes(fill=Vzdelani), alpha=I(0.5)) +
 geom_point(position="jitter", alpha=0.5) +
 geom_boxplot(outlier.size=0, alpha=0.5) +
 theme(
 axis.title.x = element_text(face="bold", color="black", size=12),
 axis.title.y = element_text(face="bold", color="black", size=12),
 plot.title = element_text(face="bold", color = "black", size=12)) +
 labs(x="Převažující kategorie vzdělání ",
 y = "Cena za byt o rozloze 60 metrů čtverečních (v Kč)",
 title= "Cena za byt o rozloze 60 metrů čtverečních (v Kč) dle převažující kategorie vzdělání") +
theme(legend.position='none')

```

<br>

* The **aov** function:
```{r, eval=FALSE, error=FALSE, message=FALSE}
aov(dependent_var ~ independent_var)
summary()
```

<br>

* Apply the aov function:
```{r}
anova_BydleniBrno <- aov(Prodej_60m2_2016 ~ Vzdelani, 
                         data = Bydleni_Brno_60m2_2016)
```

<br>

* Look at the summary table of the result
```{r}
summary(anova_BydleniBrno)
```

<br>

#### *Size effect*

<br>

* The **lsr** package:
```{r}
library(lsr)
etaSquared(anova_BydleniBrno, 
           type = 2,
           anova = FALSE)
```

<br>

![](06.jpg)

<br>

* The **sjstats** package:
```{r}
anova <- lm(Prodej_60m2_2016 ~ Vzdelani, 
                data = Bydleni_Brno_60m2_2016)
performance::r2(anova)
```

<br>

![](07.jpg)

<br>

![](08.jpg)

<br>

#### Assumptions

<br>

**Variables**

* **"Dependent" variable** at least on the **interval level**.

<br>

**Normal distribution of the dependent variable**

* In **each** of the **groups/conditions**.
* **Violation** of this assumption **should not substantially bias** the **test result** as long as the **groups have a similar number of observations**, and in **each group**, there are **at least 30 observations**.
* **Non-parametric** alternative – **Kruskal-Wallis test**

<br>

**Homogenity of variance**

* Sledujeme Levenův F-test, nulová hypotéza hovoří o homogenitě napříč skupinami
* Pokud Levenův F-test vychází statisticky signifikantní:
* Sledujeme poměr rozptylu u skupin s největším a nejmenším rozptylem, přičemž chceme, aby byl tento poměr menší než 3
* Narušení by nemělo vadit, pokud jsou skupiny stejně velké
* Při narušení lze použít Welchovo F

<br>

**Independence of observation**

<br>

##### **Homogenity of variance**
```{r}
library("car")
```

<br>

* If you don't specify additional arguments, the **deviation scores** are calculated by **comparing each score** to **its group median**.
  + This is the default behaviour, even though they are typically calculated by comparing each score to its group mean.
  + If you want to **use means and not medians**, add an argument **center = mean**. Do this now and compare the results to the first test.

<br>

* Levene's test:
```{r}
leveneTest(Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016)
```

<br>

* Levene's test with center = mean:
```{r}
leveneTest(Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016, center = mean)
```

<br>

##### **Normality of the dependent variable**:
```{r}
ggplot(data=Bydleni_Brno_60m2_2016, aes(Prodej_60m2_2016)) +
 geom_histogram(binwidth = 250000, col="red",
 aes(fill=..count..)) +
 scale_fill_gradient("Count", low = "green", high = "red") +
 labs(title="Histogram ceny za byt o rozloze 60 metrů čtverečních v Brně v roce 2016") +
 labs(x="Cena", y="Četnost") + theme(legend.position='none')
```

<br>

#### F-test

* Fischer's test (F-test):
```{r}
anova_wm_VNE = oneway.test(Prodej_60m2_2016 ~ Vzdelani, 
                           data = Bydleni_Brno_60m2_2016, 
                           var.equal=TRUE)

anova_wm_VNE
```

<br>

* Welch's F-test:
```{r}
anova_wm_VE = oneway.test(Prodej_60m2_2016 ~ Vzdelani, 
                          data = Bydleni_Brno_60m2_2016, 
                          var.equal=FALSE)

anova_wm_VE
```

<br>

### Post-Hoc Tests

<br>

Allow for multiple pairwise comparisons without an increase in the probability of a Type I error.

Používáme, pokud nemáme dopředu jasné hypotézy.

* Srovnávají vše se vším – každou skupinu s každou (ale neumí slučovat skupiny jako kontrasty).
  + Z principu jsou oboustranné.

* Je jich mnoho – liší se v několika parametrech:
  + Konzervativní (Ch. II. typu) versus Liberální (Ch. I. typu),
  + Most liberal = no adjustment,
  + Most conservative = adjust for every possible comparison that could be made,
  + Ne/vhodné pro rozdílně velké skupiny,
  + Ne/vhodné pro rozdílné skupinové rozptyly.

<br>

#### <u>Recommendation by A. Field:</u>

<br>

* Stejně velké skupiny a skupinové rozptyly (ideální situace):
  + REGWQ
  + Tukey

<br>

* Pokud si chceme být jistí, že P chyby I. typu nepřekročí zvolenou hladinu:
  + Bonferroni

<br>

* Pokud jsou velikosti skupin trochu/hodně rozdílné:
  + Gabriel
  + Hochberg GT2

<br>
 
* Pokud pochybujeme o shodnosti skupinových rozptylů:
  + Games-Howell
  
<br>
  
##### **Tukey**

<br>

* Conduct ANOVA:
```{r}
anova_BydleniBrno = aov(Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016)
```

<br>

* View summary:
```{r}
summary(anova_BydleniBrno)
```

<br>

* Conduct Tukey procedure:
```{r}
tukey <- TukeyHSD(anova_BydleniBrno)
```

<br>

* Plot confidence intervals:
```{r}
plot(tukey)
```

<br>

#### **Bonferroni**

<br>

The Bonferroni correction compensates for that increase by testing each individual hypothesis at a significance level of α/m, where α is the desired overall alpha level and m is the number of hypotheses.

* For example, if a trial is testing m = 20 hypotheses with a desired α = 0.05, then the Bonferroni correction would test each individual hypothesis at α = 0.05/20 = 0.0025.

<br>

* Pairwise t-test:
```{r}
pairwise.t.test(Bydleni_Brno_60m2_2016$Prodej_60m2_2016,
Bydleni_Brno_60m2_2016$Vzdelani, p.adjust = "bonferroni")
```

<br>

![](09.jpg)

<br>

### Contrasts

<br>

Umožňují **porovnat jednotlivé skupiny** v **jednom kroku** bez nutnosti korigovat hladinu významnosti (**bez snížení síly testu**)

* Jen když máme dopředu hypotézy
* Kontrastů lze provést tolik, kolik je počet skupin – 1 

* **Každý kontrast srovnává 2 průměry**
* Průměr skupiny nebo průměr více skupin dohromady
* Např. „Základní" vs. „Středoškolské"

* **Ortogonální (nezávislé) kontrasty**
* Skupina použitá v jednom srovnání není použitá v dalším

**Neortogonální kontrasty**

<br>

* Contrasts:
```{r}
c1 = c(-1,0,1)
c2 = c(0,-1,1)

mat <- cbind(c1,c2)
contrasts(Bydleni_Brno_60m2_2016$Vzdelani) <- mat

model1 <- lm(Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016)
summary(model1)

options(contrasts = c("contr.helmert", "contr.poly"))

contrasts(Bydleni_Brno_60m2_2016$Vzdelani) <- "contr.helmert"

model1 <- lm(Prodej_60m2_2016 ~ Vzdelani, data = Bydleni_Brno_60m2_2016)

summary(model1)
```

<br>

## Resources

Conway, A. (n.d.) Intro to Statistics with R: Student's T-test. Dostupné online na:
https://www.datacamp.com/courses/intro-to-statistics-with-rstudents-t-test

Effect size (n.d.). In Wikipedia: Staženo dne 10. 10. 2016 z https://en.wikipedia.org/wiki/Effect_size

Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. Sage: UK.

Navarro, D. J. (2014). Learning statistics with R: A tutorial for psychology students and other beginners. Available
online: http://health.adelaide.edu.au/psychology/ccs/teaching/lsr/

Standard error (n.d.). In Wikipedia: Staženo dne 10. 10. 2016 z https://en.wikipedia.org/wiki/Standard_error

Student's t-test (n.d.). In Wikipedia: Staženo dne 10. 10. 2016 z https://en.wikipedia.org/wiki/Student%27s_ttest

&nbsp;
<hr />
<p style="text-align: center;">A work by <a href="https://github.com/VGabrhel">Vit Gabrhel</a></p>
<p style="text-align: center;"><span style="color: #808080;"><em>vit.gabrhel@gmail.com</em></span></p>

<!-- Add icon library -->
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css">

<!-- Add font awesome icons -->
<p style="text-align: center;">
    <a href="https://github.com/VGabrhel" class="fa fa-github"></a>
    <a href="https://cz.linkedin.com/in/v%C3%ADt-gabrhel-2b0a8b98" class="fa fa-linkedin"></a>
    <a href="https://scholar.google.com/citations?user=Y-NGJekAAAAJ&hl=en&oi=ao" class="fa fa-book"></a>
</p>

&nbsp;


  