5 prosince 2016

Exploratory data analysis

Performing exploratory data analysis (EDA) helps you to get to know your data.

Resources

Indtroduction: What is EDA?

Wickham & Grolemund (2016)

Exploratory data analysis (EDA) is an iterative cycle. You:

  1. Generate questions about your data.
  2. Search for answers by visualizing, transforming, and modelling your data.
  3. Use what you learn to refine your questions and or generate new questions.

Your goal during EDA is to develop an understanding of your data.

Indtroduction: What is EDA?

Wickham & Grolemund (2016)

two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

  • What type of variation occurs within my variables?
  • What type of covariation occurs between my variables?

Tools of EDA

You will need to clean, transform, and visualization:

  • tidyr for data cleaning
  • dplyr for transformations
  • ggplot2 for visualization

Variation within a variable

Example

We use a list of 1738 all-time woman's best performances on 10,000m (track) from http://www.alltime-athletics.com/w_10kok.htm. The following figure presents kernel density estimates and observed values ("ticks" on x-axis). Remember that the sample contains only observations from the left tail.


File: athletics_10000W.txt

Data

athl %>% print(n = 5)
## # A tibble: 1,738 × 9
##    best     time             name country date_of_birth position
##   <int>    <dbl>            <chr>   <chr>         <dbl>    <chr>
## 1     1 29.29083      Almaz Ayana     ETH          1991        1
## 2     2 29.52967      Wang Junxia     CHN          1973        1
## 3     3 29.54217 Vivian Cheruiyot     KEN          1983        2
## 4     4 29.70933  Tirunesh Dibaba     ETH          1985        3
## 5     5 29.89183   Alice Nawowuna     KEN          1994        4
## # ... with 1,733 more rows, and 3 more variables: place <chr>, date <dbl>,
## #   age <dbl>

Density

athl %>% 
  ggplot(
    aes(
        x = time
        )
  ) +
  geom_density(
      color = NA, 
      fill = "lightblue"
      ) +
  geom_rug(
      sides = "b"
      ) +
  scale_x_continuous(
      trans="reverse"
      ) +
  theme_bw() + xlab("Time in minutes") +
  ggtitle("All-time women's best 10 000m") -> p

Density

Summary statistics

athl$time %>% summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.29   31.45   31.77   31.65   31.99   32.17
  • Observed times has median of 31.77 and standard deviation of 0.46 minutes.
  • Current world record (August 2016) is equal to median - 5.4*standard error – seems to be clear outlier.


(Just for comparison my trail PB on 10,000m is slightly above 50 minutes.)

Outliers

If we see data like this we should consider whether we see a mistake (bug) or information.

Hawkins (1980) defined an outlier as "an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism"

Barnett and Lewis (1994) defined an outlier as "an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data"

Typical sources of bugs

  • Measurement/coding error – e.g. a race in Sevilla in 1999 was actually 9,600m long (one lap short)
  • Coding values – e.g. -99 is favourite number for indicating missing value
  • Changes in coding –
  • Errors in table joining – "This dataset uses ISO/Polity IV/… coding." Hardly.

Typical sources of bugs

  • Do your scripts do what they are supposed to do?
  • Does your solution really works in all settings?
  • Do you use proper IDs for observations?
  • Are joined data (e.g. from different survey waves) compatible?

Remember that this kind of bugs is very often silent! Careful reading of code books is the way to avoid them and EDA is the way to discover them.

Information

On the other hand it is fairly possible that data are correct. In this case the outliers often show us some important information about data generating process.

If it is the case we come up with a new question to investigate (Was doping involved? How about wind/altitude/…?)

Covariation between variables

Your data exploration is not by far restricted to investigating distributions individual variables. Information on potential mistakes or nature of data generating process can be obtained by analysis of relationships between variables.

Iris data

In order to demonstrate that we will use famous iris data set with petal and sepal sizes of iris setosa and iris versicolor.

File: iris.Rdata

Iris data

print(iris, n=5)
## # A tibble: 100 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## # ... with 95 more rows

Iris data

Relationship

We will focus on relationship between sepal dimensions:

iris %>% 
  ggplot(
      aes(
    x = Sepal.Length,
    y = Sepal.Width
  )
  ) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  coord_fixed() +
  ggtitle("Sepal dimensions") +
  xlab("Length") + ylab("Width") + theme_bw() -> p

Relationship

OLS spoke out clearly but the result is still a bit suspicious for someone at least touched by biology. To reveal what is really going on we need to dig a bit deeper.

Relationship

Apart measurements errors one can suspect that there is a third variable which drives regression results. One is right in this case:

iris %>% 
  ggplot(aes(
    x = Sepal.Length,
    y = Sepal.Width
  )) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  stat_ellipse(aes(color = Species)) +
  geom_smooth(method = "lm", se = FALSE, aes(color = Species)) +
  coord_fixed() +
  ggtitle("Sepal dimensions") + xlab("Length") + ylab("Width") +
  theme_bw() -> p

Relationship

Careful carrying out of EDA will helps you to better understand data generating process and subsequently prevent modeling mistakes.

Real-life example: Wages in Czech Republic

Introduction

In this exercise we will investigate whether there is a gender gap – a significant difference in wages – in Czech Republic.

Data

Dataset covers sample of observations from Czech Republic gathered by WageIndicator Foundation (http://www.wageindicator.org/).

File: wi_sample_cr.dta

Loading

Data are available in dta (Stata) format. Data from dta can be load using foreign::read.dta() (only older versions) or haven::read_dta() which supports only limited number of data classes (logical, integer, numeric, character, and factor).

library(haven)
read_dta("data/wi_sample_cr.dta") -> wages

Loading

Other statistical packages use very often labelled vectors (class labelled in R). Our dataset is not different:

lapply(wages,class) %>% unlist %>% head
##     casenum      locale     surveyy     survemm     surveqy     country 
##   "numeric" "character"  "labelled"  "labelled"  "labelled"  "labelled"

Loading

Object of class labelled contains coded observed values and codes descriptions. Labels are stored as attributes.

See example:

wages$depmale %>% attributes() %>% names
## [1] "label"        "format.stata" "class"        "labels"

Loading

Variable description:

wages$depmale %>% attributes() %>% extract2("label")
## [1] "co-workers in similar positions are mostly men"

Labels:

wages$depmale %>% attributes() %>% extract2("labels") %>% head
##                                other      missing (mapping not available) 
##                                -9999                                 -999 
##                         user missing            not filled (paper survey) 
##                                  -99                                  -91 
##        missing (checkbox not ticked) not asked (not in country / release) 
##                                  -55                                  -54

Labelled

Unfortunately there is just few methods which can deal with class labelled. For example haven expects class labelled to be coerced into normal factors.

And now we can do the coersion using mutate_at() and as.factor():

# There is no is.labelled() function, so we need to make a little hack:
lapply(wages,class) %>% unlist %>% 
  magrittr::extract(.,equals(.,"labelled")) %>% 
  names -> lvars

# Save labels
lapply(wages, attributes) -> wages_labels

# Coersion follows:
wages %<>% mutate_at(vars(one_of(lvars)),as.factor)

Variables

Our table contains 103 variables which is a bit too much for our exercise. Therefore we will restrict out variables to 6:

wages %<>% 
    select(
        surveyy,    #year of survey
        gender,     #gender
        wagegrhr,   #gross hourly wage in nat. currency
        tenuexpe,   #years of service
        satlife,    #matrix: satisfaction with life as-a-whole
        sathhdiv    #satisfaction with division of housework
    )

Variables

As you can see from the table automatic coersion to factor might not be a good idea. A lot of variables in our selection is clearly continous:

read_dta("data/wi_sample_cr.dta") %>% 
    transmute(
        surveyy = as.integer(surveyy),
        gender = as.factor(gender),
        wagegrhr = as.numeric(wagegrhr),
        tenuexpe = as.numeric(tenuexpe),
        satlife = as.integer(satlife),
        sathhdiv = as.integer(sathhdiv)
    ) -> wages

Exploring data

Distribution of discrete variables

At first we should get an overall picture. Dataset covers following range of years:

wages %$% range(surveyy) 
## [1] 2009 2015

Year of the survey is clearly a dicrete variable. Distribution of observations throughout years can be explored by simple bar plot:

wages %>% 
  ggplot(aes(surveyy)) +
  geom_bar() +
  theme_bw() -> p

Exploring data

Distribution of discrete variables

Exploring data

Distribution of discrete variables

It is worth of consideration not to use such plots as far as the data can be presented in more concise way:

wages %$% table(surveyy) 
## surveyy
##  2009  2010  2011  2012  2013  2014  2015 
## 20502  4288  1039  2146  2908  2474  1155

In this case a bar plot does not bring any added value – it presents the same information using more ink. (Do you remember Tufte?)

Exploring data

Distribution of discrete variables

As far as we are interested in gender gap we should be worried about number of males/females in the sample:

wages %>% 
  ggplot(aes(gender)) +
  geom_bar() +
  theme_bw() -> p

Exploring data

Distribution of discrete variables

Exploring data

Distribution of discrete variables

Genders 0, 1, and NA are not very revealing. Therefore we need to check labels to understand coding values:

wages_labels$gender %>% extract2("labels") -> genders
genders[genders %in% c(0,1)]
##   male female 
##      0      1

Exploring data

Distribution of discrete variables

Now we can see that 0 is a code for males and 1 for females. Notice, that original coding contains a lot of codes for missing values but they collapsed in simple NA while reading Stata file.

To make information understandable we should replace coding values by values one can read without checking coding manual:

wages %>% 
  ggplot(aes(
    gender
    )) +
  geom_bar() +
  scale_x_discrete(
    name = "",
    labels=c("Males",
             "Females",
             "N/A")
  ) +
  theme_bw() -> p

Exploring data

Distribution of discrete variables

Exploring data

Distribution of discrete variables

It might be also useful to see composition of sample for each year – i.e. put previous figures together:

wages %>% 
  filter(!is.na(gender)) %>% # We are not interested in observations we are going to drop anyway. 
  ggplot(aes(
    x = surveyy,
    fill = gender
    )) +
  geom_bar(
    position = "fill"
  ) +
  geom_hline(yintercept = 0.5, linetype=3) + # Just for clarity (be careful with it)
  scale_fill_discrete(
    name = "",
    labels=c("Males","Females")
  ) +
  theme_bw() -> p

Exploring data

Distribution of discrete variables

Questions and examples

Can I use a pie chart?

Questions and examples

Can I use a pie chart?

You shouldn't use pie charts because:

  • it is difficult to read values (join data representation with the scale)
  • ït is difficult to compare sizes of slices (different items are actually shown using different shapes) – Try to order shares from a pie chart above.

Have you ever seen a bar plot or a pie chart made in 3D? If so then you know that it makes everything 3Times worse.

Exploring data

Distribution of continuous variables

Our primary variable of interest is hourly wage which is clearly a continous variables. To get a basic picture it is useful to take a look at decriptive statistics:

wages %$% summary(wagegrhr) # gross hourly wage in nat. currency
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## 0.000e+00 9.800e+01 1.380e+02 1.846e+13 1.940e+02 5.132e+17      5755

Exploring data

Distribution of continuous variables

It seems that highest hourly wage in the sample is 5.132153110^{17} CZK. It is quite a suspicious figure. It seems that there might be a bug in the data.

Let's take a look at histogram:

wages %>% 
  ggplot(
    aes(
      x = wagegrhr 
    )
  ) +
  geom_histogram(bins = 50) +
  theme_bw() -> p

Exploring data

Distribution of continuous variables

Exploring data

Distribution of continuous variables

The suspicion is getting stronger. It seems that there are just few outliers. Let's take a look at raw data:

wages %>% 
  arrange(desc(wagegrhr)) %>% slice(1:10)
## # A tibble: 10 × 6
##    surveyy gender     wagegrhr tenuexpe satlife sathhdiv
##      <int> <fctr>        <dbl>    <dbl>   <int>    <int>
## 1     2013      0 5.132153e+17       29       5       NA
## 2     2014      1 1.776514e+16       17      NA       NA
## 3     2012      0 5.773672e+04       10      NA       NA
## 4     2013      0 3.648961e+04       20      NA       NA
## 5     2014      0 3.449009e+04       10      NA       NA
## 6     2014      0 2.655889e+04        8      NA       NA
## 7     2015      0 2.020785e+04        1      NA       NA
## 8     2014      0 1.385681e+04        2      NA       NA
## 9     2014      0 6.077550e+03        5      NA       NA
## 10    2011      0 5.775520e+03       10       8        4

Exploring data

Distribution of continuous variables

It seems that there are observations with unrealistic values. Keeping those two in tha sample could lead to severe bias in analysis results. Therefore we will exclude all observations with hourly wage bigger then… let's say a 1,000 CZK.

wages %<>% filter(wagegrhr<1000) # This call also drops NAs

Take a look at statistics again:

wages %$% summary(wagegrhr)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0038  98.1500 137.3000 162.7000 192.1000 993.1000

Exploring data

Distribution of continuous variables

Exploring data

Distribution of continuous variables

You can see that observed wage distribution follows a common pattern:

  • it is unimodal
  • it is asymetric (close to log-normal)
  • mean is higer than median (as a consequence)
  • there are very distant observations on the right tail

Exploring data

Distribution of continuous variables

Histogram depicts observed values. There are also geoms which allow us to plot density estimates:

wages %>% 
  ggplot(
    aes(
      x = wagegrhr 
    )
  ) +
  geom_histogram(binwidth = 50) +
  geom_density(
    aes(y = 50 * ..count..),
    color = "blue"
    ) +
  theme_bw() -> p

Exploring data

Distribution of continuous variables

Exploring data

Distribution of continuous variables

As far as our goal is to compare wages of males and females we are eager to see their distributions separately:

wages %>% 
  filter(!is.na(gender)) %>% 
  ggplot(
    aes(
      x = wagegrhr,
      fill = gender
    )
  ) +
  geom_histogram(binwidth = 50) +
  geom_density(
    aes(y = 50 * ..count..)
    ) +
  scale_fill_discrete(
    name = "",
    labels=c("Males","Females")
  ) +
  theme_bw() -> p

Exploring data

Distribution of continuous variables

Exploring data

Distribution of continuous variables

This figure is not very useful because it suffers by serious overploting.

The first thing we can try is to use geom_freqpoly() – geom which plots line connecting tops of histogram bins:

wages %>% 
  filter(!is.na(gender)) %>% 
  ggplot(
    aes(
      x = wagegrhr,
      color = gender
    )
  ) +
  geom_freqpoly(binwidth = 50) +
  geom_density(
    aes(y = 50 * ..count..),
    linetype = 3
    ) +
  scale_color_discrete(
    name = "",
    labels=c("Males","Females")
  ) + theme_bw() -> p

Exploring data

Distribution of continuous variables

Exploring data

Distribution of continuous variables

Another possibility is to use faceting:

wages %>% 
  filter(!is.na(gender)) %>% 
  ggplot(
    aes(
      x = wagegrhr
    )
  ) +
  geom_histogram(binwidth = 50) +
  geom_density(
    aes(y = 50 * ..count..),
    color = "blue"
    ) +
  facet_wrap(~gender) +
  theme_bw() -> p

Exploring data

Distribution of continuous variables

Exploring data

Distribution of continuous variables

Both distributions are clearly visible now, but it is extremely hard to compare them.

Maybe it is time to call for something completely different…

Exploring data

Distribution of continuous variables

For the sake of clarity we can drop some information and plot only basic information on distributions – i.e. use boxplot:

wages %>% 
  filter(!is.na(gender)) %>% 
  ggplot(
    aes(
      gender,
      wagegrhr
    )
  ) +
  geom_boxplot() +
  scale_x_discrete(labels=c("Males","Females"), name = "") +
  theme_bw() -> p

Exploring data

Distribution of continuous variables

Exploring data

Distribution of continuous variables

If you want to see boxplots without outliers you can filter them out using dplyr (other observations may become outliers in this case) or turn off their plotting and zoom on boxplots bodies:

wages %>% 
  filter(!is.na(gender)) %>% 
  ggplot(
    aes(
      gender,
      wagegrhr
    )
  ) +
  geom_boxplot(outlier.shape = NA) + # Turning off ouliers
  coord_cartesian(ylim = c(0,400)) + # Zooming in
  scale_x_discrete(labels=c("Males","Females"), name = "") +
  theme_bw() -> p

Exploring data

Distribution of continuous variables

Box… what?

Relationship of continuous variables

The essential part of EDA is getting an understanding of data generating process. We are basically curious about relationships between variables. Different types of variables requires different techniques.

In the first place we can take a look at the relationship of two continous variables: hourly wage (wagegrhr) and legth of practice (tenuexpe). The first we should do is to plot a scatterplot:

wages %>% 
  ggplot(
    aes(y=wagegrhr,x=tenuexpe)
  ) + geom_point() +
  theme_bw() -> p

Relationship of continuous variables

…and we don't see much. The Figure suffers for serious overplotting which makes impossible to read an information from the plot.

Relationship of continuous variables

There are several ways to deal with the issue:

The first one is to change properties of scatter plot – set an alpha or make "points" as small as possible using shape=".".

Relationship of continuous variables

Neither of them work in this case. The infromation on data generating process remains shrouded in overlaping points.

Relationship of continuous variables

In this case one should consider using different techniques:

Relationship of continuous variables

Theory shows that using rectangle bins might be misleading. It is better to use hexagon bins. (geom_hex has an extra dependency.) Practically horizontal orientation of "highest" bins suggests that there is actually no correlation between hourly wages (wagegrhr) and legth of practice (tenuexpe).

Relationship of continuous variables

wages %>% 
  ggplot(
    aes(y=wagegrhr,x=tenuexpe)
  ) + geom_density2d() +
  theme_bw() +
    ggtitle("geom_density2d()")

Relationship of continuous and discrete variable

Relationship between income and life satisfaction is a subject of many papers. Let's see what can we see in our data. Happiness (satlife) is (as usuall) evaluated using likert scale from 1 to 10 (bigger is better):

wages %>% 
  ggplot(
    aes(x=wagegrhr,y=satlife)
  ) + geom_point() +
  theme_bw()  -> p

Relationship of continuous and discrete variable

Relationship of continuous and discrete variable

We cannot see much in this Figure for two reasons:

  • overplotting,
  • satlife is not exactly continous (but often treated as such)

To see the relationship we have to deal with both. In this case it is quite effective to combine jittering (adding some noise to actual values) and simultaneously make datapoints as small as possible (and zoom majority of observations):

wages %>% 
  ggplot(
    aes(x=wagegrhr,y=satlife)
  ) + geom_jitter(shape=".") +
  coord_cartesian(xlim = c(0,1000)) +
  theme_bw() -> p

Relationship of continuous and discrete variable

Relationship of continuous and discrete variable

Or we can accept, that life satisfaction is discrete and use a boxplot which basically technique to display relationship discrete and continous variable:

wages %>% 
  ggplot(
    aes(y=wagegrhr,x=factor(satlife))
  ) + geom_boxplot() +
  coord_flip(ylim = c(0,1000)) +
  theme_bw() -> p

Relationship of continuous and discrete variable

Relationship of discrete variables

Happy life starts with happy home. And happy home starts with washed dishes. This bold so called wife hypothesis can be tested using our dataset which contains variable sathhdiv – satisfaction with housework division. But there is a problem – both satisfactions are evaluated using likert scale (from 1 to 5 in the case of sathhdiv).

wages %>% 
  ggplot(
    aes(y=satlife,x=sathhdiv)
  ) + geom_point() +
  theme_bw() -> p 

Relationship of discrete variables

A very definition of the word useless.

Relationship of discrete variables

Jittering might be helpful in this case:

wages %>% 
  ggplot(
    aes(y=satlife,x=sathhdiv)
  ) + geom_jitter(alpha=0.1) +
  theme_bw() 

Relationship of discrete variables

A different geom – geom_count – is designed for the case of two discrete variables:

wages %>% 
  ggplot(
    aes(y=satlife,x=sathhdiv)
  ) + geom_count() +
  theme_bw() 

Homework

Detect suspicious observations in migration flows estimates.

Homework

Use S2 database of migartion flows (Abel & Sander, 2014) from the file HW_abel.Rdata. The table abel contains 4 columns:

print(abel, n=5)
## # A tibble: 153,664 × 4
##   origin destination  flow period
##    <chr>       <chr> <int>  <int>
## 1    ABW         ABW     0   1990
## 2    AFG         ABW     0   1990
## 3    AGO         ABW     0   1990
## 4    ALB         ABW     0   1990
## 5    ARE         ABW     1   1990
## # ... with 1.537e+05 more rows

Homework

  • origin and destination contain ISO-3 country codes of countries of origin and destination
  • flow contains the estimated size of the migration flow from origin to destination
  • Estimated flows are available for 4 five-years long periods (1990-1995, 1995-2000, 2000-2005, 2005-2010). Column period contains the first year of each period.

Homework

You are supposed to:

  1. Compute total outflows from each origin in each period.
  2. Compute emigration rates – data on population in the first year of each period are available in HW_population_un.Rdata. Population data are in thousands of persons!
  3. Identify suspcious total outflows – total outflows which are less or equal to 10 % of the value of total outflows from previous period.
  4. Report only suspicious observations in the following structure:

Homework

solution %>% print(n=5)
## Source: local data frame [46 x 3]
## Groups: origin [35]
## 
##   origin period    emig_rate
##    <chr>  <int>        <dbl>
## 1    BIH   1995 0.000000e+00
## 2    CYP   1995 2.221212e-05
## 3    DJI   1995 0.000000e+00
## 4    ERI   1995 4.028640e-03
## 5    ESH   1995 0.000000e+00
## # ... with 41 more rows

Names of columns are mandatory. Do not perform any rounding etc.

Assign resulting table into variable abel_outliers.