Performing exploratory data analysis (EDA) helps you to get to know your data.
Dianne Cook and D. F. Swayne, 2008: “Interactive and Dynamic Graphics for Data Analysis: With Examples Using R and GGobi”. http://www.ggobi.org/book/
Hadley Wickham and G. Grolemund, 2016: “R for Data Science”. http://r4ds.had.co.nz/
Introduction from Wickham & Grolemund (2016):
This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will hone in on a few particularly productive areas that you’ll eventually write up and communicate to others.
EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualization, transformation, and modelling.
Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought provoking questions—if you follow up each question with a new question based on what you find.
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:
We use a list of 1738 all-time woman’s best performances on 10,000m (track) from http://www.alltime-athletics.com/w_10kok.htm. The following figure presents kernel density estimates and observed values (“ticks” on x-axis). Remember that the sample contains only observations from the left tail.
Observed times has median of 31.77 and standard deviation of 0.46 minutes. (Just for comparison my trail PB on 10,000m is slightly above 50 minutes.) You can see that there are some times very distant from median (outliers). For example current world record (August 2016) is equal to median - 5.4*standard error.
If we see data like this we should consider whether we see a mistake (bug) or information.
Hawkins (1980) defined an outlier as “an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism”
Barnett and Lewis (1994) defined an outlier as “an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data”
Typical sources of bugs:
-99
is favourite number for indicating missing valueDo your scripts do what they are supposed to do? Does your solution really works in all settings? (See https://en.wikipedia.org/wiki/Unit_testing) Do you use proper IDs for observations? Are joined data (e.g. from different survey waves) compatible?
Remember that this kind of bugs is very often silent! Careful reading of code books is the way to avoid them and EDA is the way to discover them.
On the other hand it is fairly possible that data are correct. In this case the outliers often show us some important information about data generating process. If it is the case we come up with a new question to investigate (Was doping involved? How about wind/altitude/…?)
See an outlier in action: https://www.youtube.com/watch?v=jemU_Q7mqMA (Almaz Ayana, ETH, 29:17.45).
Your data exploration is not by far restricted to investigating distributions individual variables. Information on potential mistakes or nature of data generating process can be obtained by analysis of relationships between variables.
In order to demonstrate that we will use famous iris
data set with petal and sepal sizes of iris setosa and iris versicolor.
print(iris, n=5)
## # A tibble: 100 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## # ... with 95 more rows
We will focus on relationship between sepal dimensions:
OLS fitted smoothing line shows that there is only weak relationship between dimensions and negative if any. It means that longer sepals tend to be thinner.
OLS spoke out clearly but the result is still a bit suspicious for someone at least touched by biology. To reveal what is really going on we need to dig a bit deeper.
Apart measurements errors one can suspect that there is a third variable which drives regression results. One is right in this case:
Careful carrying out of EDA will helps you to better understand data generating process and subsequently prevent modeling mistakes.
In this exercise we will investigate whether there is a gender gap – a significant difference in wages – in Czech Republic.
Dataset covers sample of observations from Czech Republic gathered by WageIndicator Foundation (http://www.wageindicator.org/).
Data are available in dta
(Stata) format. Data from dta
can be load using foreign::read.dta()
(only older versions) or haven::read_dta()
which supports only limited number of data classes (logical, integer, numeric, character, and factor).
library(haven)
read_dta("data/wi_sample_cr.dta") -> wages
Other statistical packages use very often labelled vectors (class labelled
in R). Our dataset is not different:
lapply(wages,class) %>% unlist
## casenum locale surveyy survemm surveqy country
## "numeric" "character" "labelled" "labelled" "labelled" "labelled"
## contst regihom1 isco0804 isco0801 occtype occlevel
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
## depmale supv supvwom supvdich nace2001 firmsize
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
## firmfema firmloca firmowne firmpri caofirm1 caofirm
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
## educat1 eduisced educalmh educaori educatyr yybirth
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
## age agegr5_mdn training contr7 hrswag1 wageper3
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
## wageboth wagecur wagegr1 hrswage ppp wagegr4
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
## wasum washif61 waover61 wadirt61 waskil61 wa13mn61
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
## waannu61 wayear61 waxmas61 waholi61 wagrou61 wapers62
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
## waprof61 waothe61 wagene1 yyfstjob jobpromo tenuexpe
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
## wagegrhr wagenehr memtrad1 memtrad commtime cobself
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
## regicity gender hhnrmemb hhstat chld satjob
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
## wagesat2 satcombi satlife jobsecur jobsecu2 jobsec61
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
## jobsecu3 jobsec62 jobloss satbonus satcontr sathhinc
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
## sathealt sathous satleisu sathhdiv waperf61 break
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
## satsecur satenvir sathrs satcoll satsupvi satwelfa
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
## satsubor watips61 breaknr satnetw firmmne reason_sal
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
## region
## "numeric"
Object of class labelled
contains coded observed values and codes descriptions. Labels are stored as attributes. See example:
wages$depmale %>% attributes()
## $label
## [1] "co-workers in similar positions are mostly men"
##
## $format.stata
## [1] "%8.0g"
##
## $class
## [1] "labelled"
##
## $labels
## other
## -9999
## missing (mapping not available)
## -999
## user missing
## -99
## not filled (paper survey)
## -91
## missing (checkbox not ticked)
## -55
## not asked (not in country / release)
## -54
## not asked (random item)
## -53
## not asked (contst routing)
## -12
## not asked (skipped)
## -11
## missing (drop out)
## -9
## missing (not applicable)
## -8
## missing (dont know)
## -7
## not asked (random item)
## -6
## not asked (release = ... / country-specific)
## -5
## not asked (no part 2)
## -4
## out of range
## -3
## missing (reason unknown)
## -2
## not asked (skipped)
## -1
## no
## 0
## yes
## 1
In this case attributes contain question asked (label
), possible values and its meaning (labels
), and of course class
name. Of course it is abslutely crucial to keep those information. Unfortunately there is just few methods which can deal with class labelled
. For example haven
expects class labelled
to be coerced into normal factors.
Before doing any coersion we will extract all labels into dedicated variable:
lapply(wages, attributes) -> wages_labels
And now we can do the coersion using mutate_at()
and as.factor()
:
# There is no is.labelled() function, so we need to make a little hack:
lapply(wages,class) %>% unlist %>%
magrittr::extract(.,equals(.,"labelled")) %>%
names -> lvars
# Coersion follows:
wages %<>% mutate_at(vars(one_of(lvars)),as.factor)
Now we can take a look at what we have. Table wages
contains 34512 observations (rows) and 103 variables (columns).
Let’s take a look at codes and variables.
As it is apperent from the attributes there is a lot of codes for missing values. Following table from Codebook presents all of them:
Code | Label | Explanation |
---|---|---|
-999 | Missing (Mapping not available) | This missing value is assigned to variables, derived from the mapping tables in the search trees, see section 4.10 for the mappings in use |
-99 | User missing | This missing value is assigned when a question has been shown to the respondents but they chose not to answer it. |
-55 | Missing (Checkbox not ticked) | This missing value is assigned to variables that are shown to the respondents, but not ticked (from 1/1/2011 on). |
-54 | Not asked (Not in country / Release) | This missing value is assigned to variables that at the day of survey completion were not switched on for the country of survey. |
-53 | Not asked (Random item) | This missing value is assigned to all questions that have not been shown to the respondent because it was not chosen from the pool of random items in the grid question (from 1/1/2011 on, see section 4.11). |
-12 | Not asked (Contst routing) | This missing value is assigned when a question was not shown to the respondent because of the CONTST routing (see section 4.3). For example, employees do not respond to the questions for self-employed. |
-11 | Not asked (Skipped) | This missing value is assigned when a question was not shown to the respondent because of routing instructions other than CONTST. For example, if a respondent ticks ‘no’ to the question about children, the questions about the number and age of children are skipped and assigned -11. |
-9 | Missing (Drop out) | This missing value is assigned to all questions following the page where the respondent dropped out of the questionnaire |
-8 | Missing (Not applicable) | This missing value is ticked by respondents when they regard a certain question not applicable. |
-7 | Missing (Don’t know) | This missing value is ticked by respondents when they don’t know the answer to a certain question. |
-4 | Not asked (No part 2) | This missing value is assigned to all questions in part 2, if the respondent has decided to quit the survey at the end of part 1. |
-3 | Out of range | Out-of-range values are coded –3. This particularly applies to data on wages and bonusses, which are set to minimum and maximum boundaries. Note that alerts for out-of-range amounts have been applied only recently. Before, data has been classified out-of-range during data-processing |
-2 | Missing (Reason unknown) | This missing value is assigned to cases with missing values when all reasons mentioned above were not applicable. |
-1 | Missing – Not asked (skipped) | This missing value was assigned in the 2000-2005 dataset. From 2006 on this missing value was replaced by the values -11 and -12. |
Our table contains 103 variables which is a bit too much for our exercise. Therefore we will restrict out variables to comprehendable 39. Following table is generated from labels:
Variable | Description |
---|---|
age | year of birth ? survey year |
break | how often career break - if > 1 job > 200900 |
caofirm | covered by collective agreement |
casenum | case identification number |
cobself | country of birth self |
contr7 | has permanent employment contract |
contst | current employment status |
depmale | co-workers in similar positions are mostly men |
educalmh | mapping: education level low-middle-high |
firmfema | percentage women in workplace |
firmmne | employed in multinational company |
firmowne | organisation is domestic or foreign-owned |
firmpri | public or private sector |
firmsize | firm size workplace |
gender | gender |
hhnrmemb | how many household members |
hhstat | current marital status |
hrswage | waged hours per week |
chld | has children |
isco0801 | isco 2008 digit-1 |
isco0804 | isco 2008 digit-4 |
memtrad | member of a trade union |
nace2001 | nace-rev2.0 industries 1 digit |
occlevel | what educational level is required in your job? |
occtype | position in occupational hierarchy |
regihom1 | region home address - aggregate geo info |
satcontr | satisfaction with contract |
sathealt | satisfaction with health |
sathhdiv | satisfaction with division of housework |
sathhinc | satisfaction with household income |
satlife | matrix: satisfaction with life as-a-whole |
surveyy | year of survey |
tenuexpe | years of service |
wagegrhr | gross hourly wage in nat. currency |
wagegr1 | gross wage in nat. currency, not controlled for pay period and hours |
wagenehr | net hourly wage in nat. currency |
wagene1 | net wage in nat. currency, not controlled for pay period and hours |
wageper3 | wage period latest wage |
yyfstjob | year starting first job |
As you can see from the table automatic coersion to factor might not be a good idea. A lot of variables in our selection is clearly continous.
So we can do it by hand:
read_dta("data/wi_sample_cr.dta") -> wages
# Hack:
wages %<>% rename(Break = `break`)
# Select variables
wages %<>%
select(
age, #year of birth ? survey year
Break, #how often career break - if > 1 job > 200900
caofirm, #covered by collective agreement
casenum, #case identification number
cobself, #country of birth self
contr7, #has permanent employment contract
contst, #current employment status
depmale, #co-workers in similar positions are mostly men
educalmh, #mapping: education level low-middle-high
firmfema, #percentage women in workplace
firmmne, #employed in multinational company
firmowne, #organisation is domestic or foreign-owned
firmpri, #public or private sector
firmsize, #firm size workplace
gender, #gender
hhnrmemb, #how many household members
hhstat, #current marital status
hrswage, #waged hours per week
chld, #has children
isco0801, #isco 2008 digit-1
isco0804, #isco 2008 digit-4
memtrad, #member of a trade union
nace2001, #nace-rev2.0 industries 1 digit
occlevel, #what educational level is required in your job?
occtype, #position in occupational hierarchy
regihom1, #region home address - aggregate geo info
satcontr, #satisfaction with contract
sathealt, #satisfaction with health
sathhdiv, #satisfaction with division of housework
sathhinc, #satisfaction with household income
satlife, #matrix: satisfaction with life as-a-whole
surveyy, #year of survey
tenuexpe, #years of service
wagegrhr, #gross hourly wage in nat. currency
wagegr1, #gross wage in nat. currency, not controlled for pay period and hours
wagenehr, #net hourly wage in nat. currency
wagene1, #net wage in nat. currency, not controlled for pay period and hours
wageper3, #wage period latest wage
yyfstjob #year starting first job
)
wages$age %<>% as.numeric() #year of birth ? survey year
wages$Break %<>% as.numeric() #how often career break - if > 1 job > 200900
wages$hrswage %<>% as.numeric() #waged hours per week
wages$satcontr %<>% as.numeric() #satisfaction with contract
wages$sathealt %<>% as.numeric() #satisfaction with health
wages$sathhdiv %<>% as.numeric() #satisfaction with division of housework
wages$sathhinc %<>% as.numeric() #satisfaction with household income
wages$satlife %<>% as.numeric() #matrix: satisfaction with life as-a-whole
wages$surveyy %<>% as.numeric() #year of survey
wages$tenuexpe %<>% as.numeric() #years of service
wages$wagegrhr %<>% as.numeric() #gross hourly wage in nat. currency
wages$wagegr1 %<>% as.numeric() #gross wage in nat. currency, not controlled for pay period and hours
wages$wagenehr %<>% as.numeric() #net hourly wage in nat. currency
wages$wagene1 %<>% as.numeric() #net wage in nat. currency, not controlled for pay period and hours
wages$yyfstjob %<>% as.numeric() #year starting first job
# There is no is.labelled() function, so we need to make a little hack:
lapply(wages,class) %>% unlist %>%
magrittr::extract(.,equals(.,"labelled")) %>%
names -> lvars
# Coersion follows:
wages %<>% mutate_at(vars(one_of(lvars)),as.factor)
At first we should get an overall picture. Dataset covers following range of years:
wages %$% range(surveyy)
## [1] 2009 2015
Year of the survey is clearly a dicrete variable. Distribution of observations throughout years can be explored by simple bar plot:
wages %>%
ggplot(aes(surveyy)) +
geom_bar() +
theme_bw()
It is clear the the first year dominates sample composition.
It is worth of consideration not to use such plots as far as the data can be presented in more concise way:
wages %$% table(surveyy)
## surveyy
## 2009 2010 2011 2012 2013 2014 2015
## 20502 4288 1039 2146 2908 2474 1155
In this case a bar plot does not bring any added value – it presents the same information using more ink. (Do you remember Tufte?)
As far as we are interested in gender gap we should be worried about number of males/females in the sample:
wages %>%
ggplot(aes(gender)) +
geom_bar() +
theme_bw()
Genders 0
, 1
, and NA
are not very revealing. Therefore we need to check labels to understand coding values:
wages_labels$gender
## $label
## [1] "gender"
##
## $format.stata
## [1] "%8.0g"
##
## $class
## [1] "labelled"
##
## $labels
## other
## -9999
## missing (mapping not available)
## -999
## user missing
## -99
## not filled (paper survey)
## -91
## missing (checkbox not ticked)
## -55
## not asked (not in country / release)
## -54
## not asked (random item)
## -53
## not asked (contst routing)
## -12
## not asked (skipped)
## -11
## missing (drop out)
## -9
## missing (not applicable)
## -8
## missing (dont know)
## -7
## not asked (random item)
## -6
## not asked (release = ... / country-specific)
## -5
## not asked (no part 2)
## -4
## out of range
## -3
## missing (reason unknown)
## -2
## not asked (skipped)
## -1
## male
## 0
## female
## 1
Now we can see that 0
is a code for males and 1
for females. Notice, that original coding contains a lot of codes for missing values but they collapsed in simple NA
while reading Stata file.
To make information understandable we should replace coding values by values one can read without checking coding manual:
wages %>%
ggplot(aes(
gender
)) +
geom_bar() +
scale_x_discrete(
name = "",
labels=c("Males","Females","N/A")
) +
theme_bw()
# Or you can change levels in factor gender!
It might be also useful to see composition of sample for each year – i.e. put previous figures together:
wages %>%
filter(!is.na(gender)) %>% # We are not interested in observations we are going to drop anyway.
ggplot(aes(
x = surveyy,
fill = gender
)) +
geom_bar(
position = "fill"
) +
geom_hline(yintercept = 0.5, linetype=3) + # Just for clarity (be careful with it)
scale_fill_discrete(
name = "",
labels=c("Males","Females")
) +
theme_bw()
Of course you can, but… (see reason no. 1 for not using pie charts from oracle.com).
See a wikipedia (https://en.wikipedia.org/wiki/Pie_chart) example:
You shouldn’t use pie charts because:
Have you ever seen a bar plot or a pie chart made in 3D? If so then you know that it makes everything 3Times worse.
For more options of geom_bar()
see ggplot2 handout (chapters on geoms and position adjustments).
Our primary variable of interest is hourly wage which is clearly a continous variables. To get a basic picture it is useful to take a look at decriptive statistics:
wages %$% summary(wagegrhr) # gross hourly wage in nat. currency
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000e+00 9.800e+01 1.380e+02 1.846e+13 1.940e+02 5.132e+17 5755
It seems that highest hourly wage in the sample is 5.132153110^{17} CZK. It is quite a suspicious figure. It seems that there might be a bug in the data.
Let’s take a look at histogram:
wages %>%
ggplot(
aes(
x = wagegrhr
)
) +
geom_histogram(bins = 50) +
theme_bw()
The suspicion is getting stronger. It seems that there are just few outliers. Let’s take a look at raw data:
wages %>%
arrange(desc(wagegrhr)) %>%
slice(1:10) %>%
select(wagegrhr)
## # A tibble: 10 × 1
## wagegrhr
## <dbl>
## 1 5.132153e+17
## 2 1.776514e+16
## 3 5.773672e+04
## 4 3.648961e+04
## 5 3.449009e+04
## 6 2.655889e+04
## 7 2.020785e+04
## 8 1.385681e+04
## 9 6.077550e+03
## 10 5.775520e+03
Yes, it seems that there are observations with unrealistic values. Keeping those two in tha sample could lead to severe bias in analysis results. Therefore we will exclude all observations with hourly wage bigger then… let’s say a 1,000 CZK.
wages %<>% filter(wagegrhr<1000) # This call also drops NAs
Take a look at statistics again:
wages %$% summary(wagegrhr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0038 98.1500 137.3000 162.7000 192.1000 993.1000
And histogram:
wages %>%
ggplot(
aes(
x = wagegrhr
)
) +
geom_histogram(binwidth = 50) +
theme_bw()
You can see that observed wage distribution follows a common pattern:
Histogram depicts observed values. There are also geoms which allow us to plot density estimates:
wages %>%
ggplot(
aes(
x = wagegrhr
)
) +
geom_histogram(binwidth = 50) +
geom_density(
aes(y = 50 * ..count..),
color = "blue"
) +
theme_bw()
As far as our goal is to compare wages of males and females we are eager to see their distributions separately:
wages %>%
filter(!is.na(gender)) %>%
ggplot(
aes(
x = wagegrhr,
fill = gender
)
) +
geom_histogram(binwidth = 50) +
geom_density(
aes(y = 50 * ..count..)
) +
scale_fill_discrete(
name = "",
labels=c("Males","Females")
) +
theme_bw()
This figure is not very useful because it suffers by serious overploting.
The first thing we can try is to use geom_freqpoly()
– geom which plots line connecting tops of histogram bins:
wages %>%
filter(!is.na(gender)) %>%
ggplot(
aes(
x = wagegrhr,
color = gender
)
) +
geom_freqpoly(binwidth = 50) +
geom_density(
aes(y = 50 * ..count..),
linetype = 3
) +
scale_color_discrete(
name = "",
labels=c("Males","Females")
) +
theme_bw()
It is still a mess. All lines are too close to each other and it makes extremely difficult to read anything in the figure. (At least we can see that density estimates are almost identical to observed values.)
Another possibility is to use faceting:
wages %>%
filter(!is.na(gender)) %>%
ggplot(
aes(
x = wagegrhr
)
) +
geom_histogram(binwidth = 50) +
geom_density(
aes(y = 50 * ..count..),
color = "blue"
) +
facet_wrap(~gender) +
theme_bw()
Both distributions are clearly visible now, but it is extremely hard to compare them.
Maybe it is time to call for something completely different…
For the sake of clarity we can drop some information and plot only basic information on distributions – i.e. use boxplot:
wages %>%
filter(!is.na(gender)) %>%
ggplot(
aes(
gender,
wagegrhr
)
) +
geom_boxplot() +
scale_x_discrete(labels=c("Males","Females"), name = "") +
theme_bw()
Boxplots usually distinguish oridanry values (coverd in the body of boxplot) and outliers which are depicted as data points outside the body. (Usually a very simple rules for outliers identification are used – such as value > mean + 1.5*(q3-q1)
.)
If you want to see boxplots without outliers you can filter them out using dplyr
(other observations may become outliers in this case) or turn off their plotting and zoom on boxplots bodies:
wages %>%
filter(!is.na(gender)) %>%
ggplot(
aes(
gender,
wagegrhr
)
) +
geom_boxplot(outlier.shape = NA) + # Turning off ouliers
coord_cartesian(ylim = c(0,400)) + # Zooming in
scale_x_discrete(labels=c("Males","Females"), name = "") +
theme_bw()
And now we can clearly see that females are discriminated on the labor market. Gender people knew it all along! Slow down cowboy/girl! We have removed a lot of observation and we do not consider people being different. But for sure boxplots provided us by the best insight into distributions of hourly wages.
Consider a distribution of 10,000 draws from normal distribution with zero mean and standard deviation equal to 1:
data_frame(
obs = rnorm(n=10000)
) -> sim.data
The sample has following properties:
sim.data$obs %>% summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.859000 -0.669700 -0.000630 0.007619 0.685400 3.980000
Let’s plot it. Upper plot contains estimated density with values of basic statistics. A boxplot in the other figure represents stylized facts on the distribution.
The essential part of EDA is getting an understanding of data generating process. We are basically curious about relationships between variables. Different types of variables requires different techniques.
In the first place we can take a look at the relationship of two continous variables: hourly wage (wagegrhr
) and legth of practice (tenuexpe
). The first we should do is to plot a scatterplot:
wages %>%
ggplot(
aes(y=wagegrhr,x=tenuexpe)
) + geom_point() +
theme_bw()
…and we don’t see much. The Figure suffers for serious overplotting which makes impossible to read an information from the plot.
There are several ways to deal with the issue:
The first one is to change properties of scatter plot – set an alpha or make “points” as small as possible:
wages %>%
ggplot(
aes(y=wagegrhr,x=tenuexpe)
) + geom_point(alpha = 0.1) +
theme_bw() -> p1
wages %>%
ggplot(
aes(y=wagegrhr,x=tenuexpe)
) + geom_point(shape = ".") +
theme_bw() -> p2
gridExtra::grid.arrange(p1,p2, ncol=2)
Neither of them work in this case. The infromation on data generating process remains shrouded in overlaping points.
In this case one should consider using different techniques:
wages %>%
ggplot(
aes(y=wagegrhr,x=tenuexpe)
) + geom_bin2d() +
theme_bw() -> p1
wages %>%
ggplot(
aes(y=wagegrhr,x=tenuexpe)
) + geom_hex() +
theme_bw() -> p2
gridExtra::grid.arrange(p1,p2, ncol=1)
Theory shows that using rectangle bins might be misleading. It is better to use hexagon bins. (geom_hex
has an extra dependency.) Practically horizontal orientation of “highest” bins suggests that there is actually no correlation between hourly wages (wagegrhr
) and legth of practice (tenuexpe
).
Relationship between income and life satisfaction is a subject of many papers. Let’s see what can we see in our data. Happiness (satlife
) is (as usuall) evaluated using likert scale from 1 to 10 (bigger is better):
wages %>%
ggplot(
aes(x=wagegrhr,y=satlife)
) + geom_point() +
theme_bw()
## Warning: Removed 10806 rows containing missing values (geom_point).
We cannot see much in this Figure for two reasons:
satlife
is not exactly continous (but often treated as such)To see the relationship we have to deal with both. In this case it is quite effective to combine jittering (adding some noise to actual values) and simultaneously make datapoints as small as possible (and zoom majority of observations):
wages %>%
ggplot(
aes(x=wagegrhr,y=satlife)
) + geom_jitter(shape=".") +
coord_cartesian(xlim = c(0,1000)) +
theme_bw()
## Warning: Removed 10806 rows containing missing values (geom_point).
Or we can accept, that life satisfaction is discrete and use a boxplot which basically technique to display relationship discrete and continous variable:
wages %>%
ggplot(
aes(y=wagegrhr,x=factor(satlife))
) + geom_boxplot() +
coord_flip(ylim = c(0,1000)) +
theme_bw()
geom_boxplot
expects one variable to be discrete but satlife
is not:
wages$satlife %>% class
## [1] "numeric"
Therefore we need to convert it into factor in ggplot call.
If satlife
is converted then NA
values, which are skipped in geom_point
, creates a level. If we need to omit them we need to filter them out:
wages %>%
filter(!is.na(satlife)) %>%
ggplot(
aes(y=wagegrhr,x=factor(satlife))
) + geom_boxplot() +
coord_flip(ylim = c(0,1000)) +
theme_bw()
One may wonder whether there the relationship differs for males and females:
wages %>%
filter(!is.na(satlife),!is.na(gender)) %>%
ggplot(
aes(y = wagegrhr, x = factor(satlife), color = gender)
) + geom_boxplot() +
coord_flip(ylim = c(0,1000)) +
scale_color_discrete(labels=c("Males","Females"), name = "") +
theme_bw()
Happy life starts with happy home. And happy home starts with washed dishes. This bold so called wife hypothesis can be tested using our dataset which contains variable sathhdiv
– satisfaction with housework division. But there is a problem – both satisfactions are evaluated using likert scale (from 1 to 5 in the case of sathhdiv
).
wages %>%
ggplot(
aes(y=satlife,x=sathhdiv)
) + geom_point() +
theme_bw()
## Warning: Removed 18543 rows containing missing values (geom_point).
A very definition of the word useless.
Jittering might be helpful in this case:
wages %>%
ggplot(
aes(y=satlife,x=sathhdiv)
) + geom_jitter(alpha=0.1) +
theme_bw()
## Warning: Removed 18543 rows containing missing values (geom_point).
Or we can try a different geom – geom_count
– which is designed for the case of two discrete variables:
wages %>%
ggplot(
aes(y=satlife,x=sathhdiv)
) + geom_count() +
theme_bw()
## Warning: Removed 18543 rows containing non-finite values (stat_sum).
The size of the points depends on the number of observation at the same coordinates.
And one more thing:
wages %>%
filter(!is.na(gender)) %>%
ggplot(
aes(y=sathhdiv,x=gender)
) + geom_count() +
scale_x_discrete(labels=c("Males","Females"), name = "") +
theme_bw()
## Warning: Removed 18543 rows containing non-finite values (stat_sum).
Did you know that partners in an average household do together about 170 % of all housework?
To identify the effect of gender on hourly wage we need to take into account all relevant variables. In order to do so we will use some econometrics…
Detect suspicious observations in migration flows estimates.
Use S2 database of migartion flows (Abel & Sander, 2014) from the file HW_abel.Rdata
. The table abel
contains 4 columns:
print(abel, n=5)
## # A tibble: 153,664 × 4
## origin destination flow period
## <chr> <chr> <int> <int>
## 1 ABW ABW 0 1990
## 2 AFG ABW 0 1990
## 3 AGO ABW 0 1990
## 4 ALB ABW 0 1990
## 5 ARE ABW 1 1990
## # ... with 1.537e+05 more rows
origin
and destination
contain ISO-3 country codes of countries of origin and destinationflow
contains the estimated size of the migration flow from origin to destinationperiod
contains the first year of each period.You are supposed to:
HW_population_un.Rdata
. Population data are in thousands of persons!solution %>% print(n=5)
## Source: local data frame [46 x 3]
## Groups: origin [35]
##
## origin period emig_rate
## <chr> <int> <dbl>
## 1 BIH 1995 0.000000e+00
## 2 CYP 1995 2.221212e-05
## 3 DJI 1995 0.000000e+00
## 4 ERI 1995 4.028640e-03
## 5 ESH 1995 0.000000e+00
## # ... with 41 more rows
Names of columns are mandatory. Do not perform any rounding etc.
Assign resulting table into variable abel_outliers
.