Exploratory data analysis

Performing exploratory data analysis (EDA) helps you to get to know your data.

Resources for learning

Dianne Cook and D. F. Swayne, 2008: “Interactive and Dynamic Graphics for Data Analysis: With Examples Using R and GGobi”. http://www.ggobi.org/book/

Hadley Wickham and G. Grolemund, 2016: “R for Data Science”. http://r4ds.had.co.nz/

Indtroduction: What is EDA?

Introduction from Wickham & Grolemund (2016):

This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:

  1. Generate questions about your data.
  2. Search for answers by visualizing, transforming, and modelling your data.
  3. Use what you learn to refine your questions and or generate new questions.

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will hone in on a few particularly productive areas that you’ll eventually write up and communicate to others.

EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualization, transformation, and modelling.

Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.

EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought provoking questions—if you follow up each question with a new question based on what you find.

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

  • What type of variation occurs within my variables?
  • What type of covariation occurs between my variables?

Variation within a variable

All-time women’s best 10,000m

We use a list of 1738 all-time woman’s best performances on 10,000m (track) from http://www.alltime-athletics.com/w_10kok.htm. The following figure presents kernel density estimates and observed values (“ticks” on x-axis). Remember that the sample contains only observations from the left tail.

Observed times has median of 31.77 and standard deviation of 0.46 minutes. (Just for comparison my trail PB on 10,000m is slightly above 50 minutes.) You can see that there are some times very distant from median (outliers). For example current world record (August 2016) is equal to median - 5.4*standard error.

If we see data like this we should consider whether we see a mistake (bug) or information.

Hawkins (1980) defined an outlier as “an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism”

Barnett and Lewis (1994) defined an outlier as “an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data”

Typical sources of bugs:

  • Measurement/coding error – e.g. a race in Sevilla in 1999 was actually 9,600m long (one lap short)
  • Coding values – e.g. -99 is favourite number for indicating missing value
  • Changes in coding –
  • Errors in table joining – “This dataset uses ISO/Polity IV/… coding.” Hardly.

Do your scripts do what they are supposed to do? Does your solution really works in all settings? (See https://en.wikipedia.org/wiki/Unit_testing) Do you use proper IDs for observations? Are joined data (e.g. from different survey waves) compatible?

Remember that this kind of bugs is very often silent! Careful reading of code books is the way to avoid them and EDA is the way to discover them.

On the other hand it is fairly possible that data are correct. In this case the outliers often show us some important information about data generating process. If it is the case we come up with a new question to investigate (Was doping involved? How about wind/altitude/…?)

See an outlier in action: https://www.youtube.com/watch?v=jemU_Q7mqMA (Almaz Ayana, ETH, 29:17.45).

Covariation between variables

Your data exploration is not by far restricted to investigating distributions individual variables. Information on potential mistakes or nature of data generating process can be obtained by analysis of relationships between variables.

Iris data

In order to demonstrate that we will use famous iris data set with petal and sepal sizes of iris setosa and iris versicolor.

print(iris, n=5)
## # A tibble: 100 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## # ... with 95 more rows

We will focus on relationship between sepal dimensions:

OLS fitted smoothing line shows that there is only weak relationship between dimensions and negative if any. It means that longer sepals tend to be thinner.

OLS spoke out clearly but the result is still a bit suspicious for someone at least touched by biology. To reveal what is really going on we need to dig a bit deeper.

Apart measurements errors one can suspect that there is a third variable which drives regression results. One is right in this case:

Careful carrying out of EDA will helps you to better understand data generating process and subsequently prevent modeling mistakes.

Real-life example: Wages in Czech Republic

Introduction

In this exercise we will investigate whether there is a gender gap – a significant difference in wages – in Czech Republic.

Data

Dataset covers sample of observations from Czech Republic gathered by WageIndicator Foundation (http://www.wageindicator.org/).

Loading

Data are available in dta (Stata) format. Data from dta can be load using foreign::read.dta() (only older versions) or haven::read_dta() which supports only limited number of data classes (logical, integer, numeric, character, and factor).

library(haven)
read_dta("data/wi_sample_cr.dta") -> wages

Other statistical packages use very often labelled vectors (class labelled in R). Our dataset is not different:

lapply(wages,class) %>% unlist
##     casenum      locale     surveyy     survemm     surveqy     country 
##   "numeric" "character"  "labelled"  "labelled"  "labelled"  "labelled" 
##      contst    regihom1    isco0804    isco0801     occtype    occlevel 
##  "labelled"  "labelled"  "labelled"  "labelled"  "labelled"  "labelled" 
##     depmale        supv     supvwom    supvdich    nace2001    firmsize 
##  "labelled"  "labelled"  "labelled"  "labelled"  "labelled"  "labelled" 
##    firmfema    firmloca    firmowne     firmpri    caofirm1     caofirm 
##  "labelled"  "labelled"  "labelled"  "labelled"  "labelled"  "labelled" 
##     educat1    eduisced    educalmh    educaori    educatyr     yybirth 
##  "labelled"  "labelled"  "labelled"  "labelled"  "labelled"  "labelled" 
##         age  agegr5_mdn    training      contr7     hrswag1    wageper3 
##  "labelled"  "labelled"  "labelled"  "labelled"  "labelled"  "labelled" 
##    wageboth     wagecur     wagegr1     hrswage         ppp     wagegr4 
##  "labelled"  "labelled"  "labelled"  "labelled"  "labelled"  "labelled" 
##       wasum    washif61    waover61    wadirt61    waskil61    wa13mn61 
##  "labelled"  "labelled"  "labelled"  "labelled"  "labelled"  "labelled" 
##    waannu61    wayear61    waxmas61    waholi61    wagrou61    wapers62 
##  "labelled"  "labelled"  "labelled"  "labelled"  "labelled"  "labelled" 
##    waprof61    waothe61     wagene1    yyfstjob    jobpromo    tenuexpe 
##  "labelled"  "labelled"  "labelled"  "labelled"  "labelled"  "labelled" 
##    wagegrhr    wagenehr    memtrad1     memtrad    commtime     cobself 
##  "labelled"  "labelled"  "labelled"  "labelled"  "labelled"  "labelled" 
##    regicity      gender    hhnrmemb      hhstat        chld      satjob 
##  "labelled"  "labelled"  "labelled"  "labelled"  "labelled"  "labelled" 
##    wagesat2    satcombi     satlife    jobsecur    jobsecu2    jobsec61 
##  "labelled"  "labelled"  "labelled"  "labelled"  "labelled"  "labelled" 
##    jobsecu3    jobsec62     jobloss    satbonus    satcontr    sathhinc 
##  "labelled"  "labelled"  "labelled"  "labelled"  "labelled"  "labelled" 
##    sathealt     sathous    satleisu    sathhdiv    waperf61       break 
##  "labelled"  "labelled"  "labelled"  "labelled"  "labelled"  "labelled" 
##    satsecur    satenvir      sathrs     satcoll    satsupvi    satwelfa 
##  "labelled"  "labelled"  "labelled"  "labelled"  "labelled"  "labelled" 
##    satsubor    watips61     breaknr     satnetw     firmmne  reason_sal 
##  "labelled"  "labelled"  "labelled"  "labelled"  "labelled"  "labelled" 
##      region 
##   "numeric"

Object of class labelled contains coded observed values and codes descriptions. Labels are stored as attributes. See example:

wages$depmale %>% attributes()
## $label
## [1] "co-workers in similar positions are mostly men"
## 
## $format.stata
## [1] "%8.0g"
## 
## $class
## [1] "labelled"
## 
## $labels
##                                        other 
##                                        -9999 
##              missing (mapping not available) 
##                                         -999 
##                                 user missing 
##                                          -99 
##                    not filled (paper survey) 
##                                          -91 
##                missing (checkbox not ticked) 
##                                          -55 
##         not asked (not in country / release) 
##                                          -54 
##                      not asked (random item) 
##                                          -53 
##                   not asked (contst routing) 
##                                          -12 
##                          not asked (skipped) 
##                                          -11 
##                           missing (drop out) 
##                                           -9 
##                     missing (not applicable) 
##                                           -8 
##                          missing (dont know) 
##                                           -7 
##                      not asked (random item) 
##                                           -6 
## not asked (release = ... / country-specific) 
##                                           -5 
##                        not asked (no part 2) 
##                                           -4 
##                                 out of range 
##                                           -3 
##                     missing (reason unknown) 
##                                           -2 
##                          not asked (skipped) 
##                                           -1 
##                                           no 
##                                            0 
##                                          yes 
##                                            1

In this case attributes contain question asked (label), possible values and its meaning (labels), and of course class name. Of course it is abslutely crucial to keep those information. Unfortunately there is just few methods which can deal with class labelled. For example haven expects class labelled to be coerced into normal factors.

Before doing any coersion we will extract all labels into dedicated variable:

lapply(wages, attributes) -> wages_labels

And now we can do the coersion using mutate_at() and as.factor():

# There is no is.labelled() function, so we need to make a little hack:
lapply(wages,class) %>% unlist %>% 
  magrittr::extract(.,equals(.,"labelled")) %>% 
  names -> lvars

# Coersion follows:
wages %<>% mutate_at(vars(one_of(lvars)),as.factor)

Now we can take a look at what we have. Table wages contains 34512 observations (rows) and 103 variables (columns).

Let’s take a look at codes and variables.

Codes

As it is apperent from the attributes there is a lot of codes for missing values. Following table from Codebook presents all of them:

Code Label Explanation
-999 Missing (Mapping not available) This missing value is assigned to variables, derived from the mapping tables in the search trees, see section 4.10 for the mappings in use
-99 User missing This missing value is assigned when a question has been shown to the respondents but they chose not to answer it.
-55 Missing (Checkbox not ticked) This missing value is assigned to variables that are shown to the respondents, but not ticked (from 1/1/2011 on).
-54 Not asked (Not in country / Release) This missing value is assigned to variables that at the day of survey completion were not switched on for the country of survey.
-53 Not asked (Random item) This missing value is assigned to all questions that have not been shown to the respondent because it was not chosen from the pool of random items in the grid question (from 1/1/2011 on, see section 4.11).
-12 Not asked (Contst routing) This missing value is assigned when a question was not shown to the respondent because of the CONTST routing (see section 4.3). For example, employees do not respond to the questions for self-employed.
-11 Not asked (Skipped) This missing value is assigned when a question was not shown to the respondent because of routing instructions other than CONTST. For example, if a respondent ticks ‘no’ to the question about children, the questions about the number and age of children are skipped and assigned -11.
-9 Missing (Drop out) This missing value is assigned to all questions following the page where the respondent dropped out of the questionnaire
-8 Missing (Not applicable) This missing value is ticked by respondents when they regard a certain question not applicable.
-7 Missing (Don’t know) This missing value is ticked by respondents when they don’t know the answer to a certain question.
-4 Not asked (No part 2) This missing value is assigned to all questions in part 2, if the respondent has decided to quit the survey at the end of part 1.
-3 Out of range Out-of-range values are coded –3. This particularly applies to data on wages and bonusses, which are set to minimum and maximum boundaries. Note that alerts for out-of-range amounts have been applied only recently. Before, data has been classified out-of-range during data-processing
-2 Missing (Reason unknown) This missing value is assigned to cases with missing values when all reasons mentioned above were not applicable.
-1 Missing – Not asked (skipped) This missing value was assigned in the 2000-2005 dataset. From 2006 on this missing value was replaced by the values -11 and -12.

Variables

Our table contains 103 variables which is a bit too much for our exercise. Therefore we will restrict out variables to comprehendable 39. Following table is generated from labels:

Variable Description
age year of birth ? survey year
break how often career break - if > 1 job > 200900
caofirm covered by collective agreement
casenum case identification number
cobself country of birth self
contr7 has permanent employment contract
contst current employment status
depmale co-workers in similar positions are mostly men
educalmh mapping: education level low-middle-high
firmfema percentage women in workplace
firmmne employed in multinational company
firmowne organisation is domestic or foreign-owned
firmpri public or private sector
firmsize firm size workplace
gender gender
hhnrmemb how many household members
hhstat current marital status
hrswage waged hours per week
chld has children
isco0801 isco 2008 digit-1
isco0804 isco 2008 digit-4
memtrad member of a trade union
nace2001 nace-rev2.0 industries 1 digit
occlevel what educational level is required in your job?
occtype position in occupational hierarchy
regihom1 region home address - aggregate geo info
satcontr satisfaction with contract
sathealt satisfaction with health
sathhdiv satisfaction with division of housework
sathhinc satisfaction with household income
satlife matrix: satisfaction with life as-a-whole
surveyy year of survey
tenuexpe years of service
wagegrhr gross hourly wage in nat. currency
wagegr1 gross wage in nat. currency, not controlled for pay period and hours
wagenehr net hourly wage in nat. currency
wagene1 net wage in nat. currency, not controlled for pay period and hours
wageper3 wage period latest wage
yyfstjob year starting first job

As you can see from the table automatic coersion to factor might not be a good idea. A lot of variables in our selection is clearly continous.

So we can do it by hand:

read_dta("data/wi_sample_cr.dta") -> wages

# Hack:
wages %<>% rename(Break = `break`)

# Select variables
wages %<>% 
  select(
    age,    #year of birth ? survey year
    Break,  #how often career break - if > 1 job > 200900
    caofirm,    #covered by collective agreement
    casenum,    #case identification number
    cobself,    #country of birth self
    contr7, #has permanent employment contract
    contst, #current employment status
    depmale,    #co-workers in similar positions are mostly men
    educalmh,   #mapping: education level low-middle-high
    firmfema,   #percentage women in workplace
    firmmne,    #employed in multinational company
    firmowne,   #organisation is domestic or foreign-owned
    firmpri,    #public or private sector
    firmsize,   #firm size workplace
    gender, #gender
    hhnrmemb,   #how many household members
    hhstat, #current marital status
    hrswage,    #waged hours per week
    chld,   #has children
    isco0801,   #isco 2008 digit-1
    isco0804,   #isco 2008 digit-4
    memtrad,    #member of a trade union
    nace2001,   #nace-rev2.0 industries 1 digit
    occlevel,   #what educational level is required in your job?
    occtype,    #position in occupational hierarchy
    regihom1,   #region home address - aggregate geo info
    satcontr,   #satisfaction with contract
    sathealt,   #satisfaction with health
    sathhdiv,   #satisfaction with division of housework
    sathhinc,   #satisfaction with household income
    satlife,    #matrix: satisfaction with life as-a-whole
    surveyy,    #year of survey
    tenuexpe,   #years of service
    wagegrhr,   #gross hourly wage in nat. currency
    wagegr1,    #gross wage in nat. currency, not controlled for pay period and hours
    wagenehr,   #net hourly wage in nat. currency
    wagene1,    #net wage in nat. currency, not controlled for pay period and hours
    wageper3,   #wage period latest wage
    yyfstjob    #year starting first job
  )

wages$age            %<>%  as.numeric()     #year of birth ? survey year
wages$Break        %<>%  as.numeric()       #how often career break - if > 1 job > 200900
wages$hrswage        %<>%  as.numeric()     #waged hours per week
wages$satcontr   %<>%  as.numeric()     #satisfaction with contract
wages$sathealt   %<>%  as.numeric()     #satisfaction with health
wages$sathhdiv   %<>%  as.numeric()     #satisfaction with division of housework
wages$sathhinc   %<>%  as.numeric()     #satisfaction with household income
wages$satlife        %<>%  as.numeric()     #matrix: satisfaction with life as-a-whole
wages$surveyy        %<>%  as.numeric()     #year of survey
wages$tenuexpe   %<>%  as.numeric()     #years of service
wages$wagegrhr   %<>%  as.numeric()     #gross hourly wage in nat. currency
wages$wagegr1        %<>%  as.numeric()     #gross wage in nat. currency, not controlled for pay period and hours
wages$wagenehr   %<>%  as.numeric()     #net hourly wage in nat. currency
wages$wagene1        %<>%  as.numeric()     #net wage in nat. currency, not controlled for pay period and hours
wages$yyfstjob   %<>%  as.numeric()     #year starting first job

# There is no is.labelled() function, so we need to make a little hack:
lapply(wages,class) %>% unlist %>% 
  magrittr::extract(.,equals(.,"labelled")) %>% 
  names -> lvars

# Coersion follows:
wages %<>% mutate_at(vars(one_of(lvars)),as.factor)

Exploring data

Distribution of discrete variables

At first we should get an overall picture. Dataset covers following range of years:

wages %$% range(surveyy) 
## [1] 2009 2015

Year of the survey is clearly a dicrete variable. Distribution of observations throughout years can be explored by simple bar plot:

wages %>% 
  ggplot(aes(surveyy)) +
  geom_bar() +
  theme_bw()

It is clear the the first year dominates sample composition.

It is worth of consideration not to use such plots as far as the data can be presented in more concise way:

wages %$% table(surveyy) 
## surveyy
##  2009  2010  2011  2012  2013  2014  2015 
## 20502  4288  1039  2146  2908  2474  1155

In this case a bar plot does not bring any added value – it presents the same information using more ink. (Do you remember Tufte?)

As far as we are interested in gender gap we should be worried about number of males/females in the sample:

wages %>% 
  ggplot(aes(gender)) +
  geom_bar() +
  theme_bw()

Genders 0, 1, and NA are not very revealing. Therefore we need to check labels to understand coding values:

wages_labels$gender
## $label
## [1] "gender"
## 
## $format.stata
## [1] "%8.0g"
## 
## $class
## [1] "labelled"
## 
## $labels
##                                        other 
##                                        -9999 
##              missing (mapping not available) 
##                                         -999 
##                                 user missing 
##                                          -99 
##                    not filled (paper survey) 
##                                          -91 
##                missing (checkbox not ticked) 
##                                          -55 
##         not asked (not in country / release) 
##                                          -54 
##                      not asked (random item) 
##                                          -53 
##                   not asked (contst routing) 
##                                          -12 
##                          not asked (skipped) 
##                                          -11 
##                           missing (drop out) 
##                                           -9 
##                     missing (not applicable) 
##                                           -8 
##                          missing (dont know) 
##                                           -7 
##                      not asked (random item) 
##                                           -6 
## not asked (release = ... / country-specific) 
##                                           -5 
##                        not asked (no part 2) 
##                                           -4 
##                                 out of range 
##                                           -3 
##                     missing (reason unknown) 
##                                           -2 
##                          not asked (skipped) 
##                                           -1 
##                                         male 
##                                            0 
##                                       female 
##                                            1

Now we can see that 0 is a code for males and 1 for females. Notice, that original coding contains a lot of codes for missing values but they collapsed in simple NA while reading Stata file.

To make information understandable we should replace coding values by values one can read without checking coding manual:

wages %>% 
  ggplot(aes(
    gender
    )) +
  geom_bar() +
  scale_x_discrete(
    name = "",
    labels=c("Males","Females","N/A")
  ) +
  theme_bw()

# Or you can change levels in factor gender!

It might be also useful to see composition of sample for each year – i.e. put previous figures together:

wages %>% 
  filter(!is.na(gender)) %>% # We are not interested in observations we are going to drop anyway. 
  ggplot(aes(
    x = surveyy,
    fill = gender
    )) +
  geom_bar(
    position = "fill"
  ) +
  geom_hline(yintercept = 0.5, linetype=3) + # Just for clarity (be careful with it)
  scale_fill_discrete(
    name = "",
    labels=c("Males","Females")
  ) +
  theme_bw()

Questions and examples: Can I use a pie chart?

Of course you can, but… (see reason no. 1 for not using pie charts from oracle.com).

See a wikipedia (https://en.wikipedia.org/wiki/Pie_chart) example:

You shouldn’t use pie charts because:

  • it is difficult to read values (join data representation with the scale)
  • ït is difficult to compare sizes of slices (different items are actually shown using different shapes) – Try to order shares from a pie chart above.

Have you ever seen a bar plot or a pie chart made in 3D? If so then you know that it makes everything 3Times worse.

For more options of geom_bar() see ggplot2 handout (chapters on geoms and position adjustments).

Distribution of continuous variables

Our primary variable of interest is hourly wage which is clearly a continous variables. To get a basic picture it is useful to take a look at decriptive statistics:

wages %$% summary(wagegrhr) # gross hourly wage in nat. currency
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## 0.000e+00 9.800e+01 1.380e+02 1.846e+13 1.940e+02 5.132e+17      5755

It seems that highest hourly wage in the sample is 5.132153110^{17} CZK. It is quite a suspicious figure. It seems that there might be a bug in the data.

Let’s take a look at histogram:

wages %>% 
  ggplot(
    aes(
      x = wagegrhr 
    )
  ) +
  geom_histogram(bins = 50) +
  theme_bw()

The suspicion is getting stronger. It seems that there are just few outliers. Let’s take a look at raw data:

wages %>% 
  arrange(desc(wagegrhr)) %>% 
  slice(1:10) %>% 
  select(wagegrhr)
## # A tibble: 10 × 1
##        wagegrhr
##           <dbl>
## 1  5.132153e+17
## 2  1.776514e+16
## 3  5.773672e+04
## 4  3.648961e+04
## 5  3.449009e+04
## 6  2.655889e+04
## 7  2.020785e+04
## 8  1.385681e+04
## 9  6.077550e+03
## 10 5.775520e+03

Yes, it seems that there are observations with unrealistic values. Keeping those two in tha sample could lead to severe bias in analysis results. Therefore we will exclude all observations with hourly wage bigger then… let’s say a 1,000 CZK.

wages %<>% filter(wagegrhr<1000) # This call also drops NAs

Take a look at statistics again:

wages %$% summary(wagegrhr)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0038  98.1500 137.3000 162.7000 192.1000 993.1000

And histogram:

wages %>% 
  ggplot(
    aes(
      x = wagegrhr 
    )
  ) +
  geom_histogram(binwidth = 50) +
  theme_bw()

You can see that observed wage distribution follows a common pattern:

  • it is unimodal
  • it is asymetric (close to log-normal)
  • mean is higer than median (as a consequence)
  • there are very distant observations on the right tail

Histogram depicts observed values. There are also geoms which allow us to plot density estimates:

wages %>% 
  ggplot(
    aes(
      x = wagegrhr 
    )
  ) +
  geom_histogram(binwidth = 50) +
  geom_density(
    aes(y = 50 * ..count..),
    color = "blue"
    ) +
  theme_bw()

As far as our goal is to compare wages of males and females we are eager to see their distributions separately:

wages %>% 
  filter(!is.na(gender)) %>% 
  ggplot(
    aes(
      x = wagegrhr,
      fill = gender
    )
  ) +
  geom_histogram(binwidth = 50) +
  geom_density(
    aes(y = 50 * ..count..)
    ) +
  scale_fill_discrete(
    name = "",
    labels=c("Males","Females")
  ) +
  theme_bw()

This figure is not very useful because it suffers by serious overploting.

The first thing we can try is to use geom_freqpoly() – geom which plots line connecting tops of histogram bins:

wages %>% 
  filter(!is.na(gender)) %>% 
  ggplot(
    aes(
      x = wagegrhr,
      color = gender
    )
  ) +
  geom_freqpoly(binwidth = 50) +
  geom_density(
    aes(y = 50 * ..count..),
    linetype = 3
    ) +
  scale_color_discrete(
    name = "",
    labels=c("Males","Females")
  ) +
  theme_bw()

It is still a mess. All lines are too close to each other and it makes extremely difficult to read anything in the figure. (At least we can see that density estimates are almost identical to observed values.)

Another possibility is to use faceting:

wages %>% 
  filter(!is.na(gender)) %>% 
  ggplot(
    aes(
      x = wagegrhr
    )
  ) +
  geom_histogram(binwidth = 50) +
  geom_density(
    aes(y = 50 * ..count..),
    color = "blue"
    ) +
  facet_wrap(~gender) +
  theme_bw()

Both distributions are clearly visible now, but it is extremely hard to compare them.

Maybe it is time to call for something completely different…

For the sake of clarity we can drop some information and plot only basic information on distributions – i.e. use boxplot:

wages %>% 
  filter(!is.na(gender)) %>% 
  ggplot(
    aes(
      gender,
      wagegrhr
    )
  ) +
  geom_boxplot() +
  scale_x_discrete(labels=c("Males","Females"), name = "") +
  theme_bw()

Boxplots usually distinguish oridanry values (coverd in the body of boxplot) and outliers which are depicted as data points outside the body. (Usually a very simple rules for outliers identification are used – such as value > mean + 1.5*(q3-q1).)

If you want to see boxplots without outliers you can filter them out using dplyr (other observations may become outliers in this case) or turn off their plotting and zoom on boxplots bodies:

wages %>% 
  filter(!is.na(gender)) %>% 
  ggplot(
    aes(
      gender,
      wagegrhr
    )
  ) +
  geom_boxplot(outlier.shape = NA) + # Turning off ouliers
  coord_cartesian(ylim = c(0,400)) + # Zooming in
  scale_x_discrete(labels=c("Males","Females"), name = "") +
  theme_bw()

And now we can clearly see that females are discriminated on the labor market. Gender people knew it all along! Slow down cowboy/girl! We have removed a lot of observation and we do not consider people being different. But for sure boxplots provided us by the best insight into distributions of hourly wages.

Questions and examples: Box… what?

Consider a distribution of 10,000 draws from normal distribution with zero mean and standard deviation equal to 1:

data_frame(
  obs = rnorm(n=10000)
) -> sim.data

The sample has following properties:

sim.data$obs %>% summary
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -3.859000 -0.669700 -0.000630  0.007619  0.685400  3.980000

Let’s plot it. Upper plot contains estimated density with values of basic statistics. A boxplot in the other figure represents stylized facts on the distribution.

Relationship of continuous variables

The essential part of EDA is getting an understanding of data generating process. We are basically curious about relationships between variables. Different types of variables requires different techniques.

In the first place we can take a look at the relationship of two continous variables: hourly wage (wagegrhr) and legth of practice (tenuexpe). The first we should do is to plot a scatterplot:

wages %>% 
  ggplot(
    aes(y=wagegrhr,x=tenuexpe)
  ) + geom_point() +
  theme_bw()

…and we don’t see much. The Figure suffers for serious overplotting which makes impossible to read an information from the plot.

There are several ways to deal with the issue:

The first one is to change properties of scatter plot – set an alpha or make “points” as small as possible:

wages %>% 
  ggplot(
    aes(y=wagegrhr,x=tenuexpe)
  ) + geom_point(alpha = 0.1) +
  theme_bw() -> p1

wages %>% 
  ggplot(
    aes(y=wagegrhr,x=tenuexpe)
  ) + geom_point(shape = ".") +
  theme_bw() -> p2

gridExtra::grid.arrange(p1,p2, ncol=2)

Neither of them work in this case. The infromation on data generating process remains shrouded in overlaping points.

In this case one should consider using different techniques:

wages %>% 
  ggplot(
    aes(y=wagegrhr,x=tenuexpe)
  ) + geom_bin2d() +
  theme_bw() -> p1

wages %>% 
  ggplot(
    aes(y=wagegrhr,x=tenuexpe)
  ) + geom_hex() +
  theme_bw() -> p2

gridExtra::grid.arrange(p1,p2, ncol=1)

Theory shows that using rectangle bins might be misleading. It is better to use hexagon bins. (geom_hex has an extra dependency.) Practically horizontal orientation of “highest” bins suggests that there is actually no correlation between hourly wages (wagegrhr) and legth of practice (tenuexpe).

Relationship of continuous and discrete variable

Relationship between income and life satisfaction is a subject of many papers. Let’s see what can we see in our data. Happiness (satlife) is (as usuall) evaluated using likert scale from 1 to 10 (bigger is better):

wages %>% 
  ggplot(
    aes(x=wagegrhr,y=satlife)
  ) + geom_point() +
  theme_bw() 
## Warning: Removed 10806 rows containing missing values (geom_point).

We cannot see much in this Figure for two reasons:

  • overplotting,
  • satlife is not exactly continous (but often treated as such)

To see the relationship we have to deal with both. In this case it is quite effective to combine jittering (adding some noise to actual values) and simultaneously make datapoints as small as possible (and zoom majority of observations):

wages %>% 
  ggplot(
    aes(x=wagegrhr,y=satlife)
  ) + geom_jitter(shape=".") +
  coord_cartesian(xlim = c(0,1000)) +
  theme_bw() 
## Warning: Removed 10806 rows containing missing values (geom_point).

Or we can accept, that life satisfaction is discrete and use a boxplot which basically technique to display relationship discrete and continous variable:

wages %>% 
  ggplot(
    aes(y=wagegrhr,x=factor(satlife))
  ) + geom_boxplot() +
  coord_flip(ylim = c(0,1000)) +
  theme_bw() 

geom_boxplot expects one variable to be discrete but satlife is not:

wages$satlife %>% class
## [1] "numeric"

Therefore we need to convert it into factor in ggplot call.

If satlife is converted then NA values, which are skipped in geom_point, creates a level. If we need to omit them we need to filter them out:

wages %>% 
  filter(!is.na(satlife)) %>% 
  ggplot(
    aes(y=wagegrhr,x=factor(satlife))
  ) + geom_boxplot() +
  coord_flip(ylim = c(0,1000)) +
  theme_bw() 

One may wonder whether there the relationship differs for males and females:

wages %>% 
  filter(!is.na(satlife),!is.na(gender)) %>% 
  ggplot(
    aes(y = wagegrhr, x = factor(satlife), color = gender)
  ) + geom_boxplot() +
  coord_flip(ylim = c(0,1000)) +
  scale_color_discrete(labels=c("Males","Females"), name = "") +
  theme_bw() 

Relationship of discrete variables

Happy life starts with happy home. And happy home starts with washed dishes. This bold so called wife hypothesis can be tested using our dataset which contains variable sathhdiv – satisfaction with housework division. But there is a problem – both satisfactions are evaluated using likert scale (from 1 to 5 in the case of sathhdiv).

wages %>% 
  ggplot(
    aes(y=satlife,x=sathhdiv)
  ) + geom_point() +
  theme_bw() 
## Warning: Removed 18543 rows containing missing values (geom_point).

A very definition of the word useless.

Jittering might be helpful in this case:

wages %>% 
  ggplot(
    aes(y=satlife,x=sathhdiv)
  ) + geom_jitter(alpha=0.1) +
  theme_bw() 
## Warning: Removed 18543 rows containing missing values (geom_point).

Or we can try a different geom – geom_count – which is designed for the case of two discrete variables:

wages %>% 
  ggplot(
    aes(y=satlife,x=sathhdiv)
  ) + geom_count() +
  theme_bw() 
## Warning: Removed 18543 rows containing non-finite values (stat_sum).

The size of the points depends on the number of observation at the same coordinates.

And one more thing:

wages %>% 
  filter(!is.na(gender)) %>% 
  ggplot(
    aes(y=sathhdiv,x=gender)
  ) + geom_count() +
  scale_x_discrete(labels=c("Males","Females"), name = "") +
  theme_bw() 
## Warning: Removed 18543 rows containing non-finite values (stat_sum).

Did you know that partners in an average household do together about 170 % of all housework?

To identify the effect of gender on hourly wage we need to take into account all relevant variables. In order to do so we will use some econometrics…

Homework

Detect suspicious observations in migration flows estimates.

Use S2 database of migartion flows (Abel & Sander, 2014) from the file HW_abel.Rdata. The table abel contains 4 columns:

print(abel, n=5)
## # A tibble: 153,664 × 4
##   origin destination  flow period
##    <chr>       <chr> <int>  <int>
## 1    ABW         ABW     0   1990
## 2    AFG         ABW     0   1990
## 3    AGO         ABW     0   1990
## 4    ALB         ABW     0   1990
## 5    ARE         ABW     1   1990
## # ... with 1.537e+05 more rows
  • origin and destination contain ISO-3 country codes of countries of origin and destination
  • flow contains the estimated size of the migration flow from origin to destination
  • Estimated flows are available for 4 five-years long periods (1990-1995, 1995-2000, 2000-2005, 2005-2010). Column period contains the first year of each period.

You are supposed to:

  1. Compute total outflows from each origin in each period.
  2. Compute emigration rates – data on population in the first year of each period are available in HW_population_un.Rdata. Population data are in thousands of persons!
  3. Identify suspcious total outflows – total outflows which are less or equal to 10 % of the value of total outflows from previous period.
  4. Report only suspicious observations in the following structure:
solution %>% print(n=5)
## Source: local data frame [46 x 3]
## Groups: origin [35]
## 
##   origin period    emig_rate
##    <chr>  <int>        <dbl>
## 1    BIH   1995 0.000000e+00
## 2    CYP   1995 2.221212e-05
## 3    DJI   1995 0.000000e+00
## 4    ERI   1995 4.028640e-03
## 5    ESH   1995 0.000000e+00
## # ... with 41 more rows

Names of columns are mandatory. Do not perform any rounding etc.

Assign resulting table into variable abel_outliers.