"It is often said that 80 % of data analysis is spent on the cleaning and preparing data."
"Tidy datasets are all alike but every messy dataset is messy in its own way."
– Hadley Wickham
Today we will deal with the bad, the ugly, and the messy.
2016
"It is often said that 80 % of data analysis is spent on the cleaning and preparing data."
"Tidy datasets are all alike but every messy dataset is messy in its own way."
– Hadley Wickham
Today we will deal with the bad, the ugly, and the messy.
Data comes in many different formats.
Typically we work with variables observed for cross-sectional unit (person, geographical unit,…) in time (1,…,infinity) or group (treatment/control).
Multiple rows contain one observation (state-year combination):
## # A tibble: 12 × 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## # ... with 8 more rows
Real life example:
One variable is in many columns + header contain values:
## # A tibble: 3 × 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766
Real life example:
wpp*
packages…or other monstrosities (e.g. values aggregated into one "cell"):
## # A tibble: 6 × 3 ## country year rate ## * <chr> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## # ... with 3 more rows
Real life example:
Unfortunately creativity is not forbidden nor punishable.
How to organize your data to get a format which is easy to work with?
Brief "tidy" format definition:
For full definition see Wickham, H. (2014): Tidy Data, https://www.jstatsoft.org/article/view/v059i10
Example of tidy data:
## # A tibble: 6 × 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## # ... with 2 more rows
Real life example:
pwt8
package.CRAN repository contains a lot of packages which contain tool useful for data tidying. For example:
tidyr
contains two basic functions spread()
and gather()
which can handle majority of tyding cases.
From now on we will work mostly with packages of tidyverse
(e.g. tidyr
, dplyr
, readr
, haven
, readxl
, ggplot2
, stringr
, tibble
, and many more). These packges are very often under heavy development.
More about tidyverse:
spread()
is a function for spreading a key-value pair across multiple columns
Messy example:
## # A tibble: 12 × 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583
Make it tidy:
spread(data = table2, key = type, value = count)
## # A tibble: 6 × 4 ## country year cases population ## * <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583
A call:
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
Arguments:
data
– an input data.framekey
– bare name of the column whose values will be used as column headingsvalue
– bare name of the column whose values will populate the cellstidyr is among packages which use so-called non-standard evaluation (NSE, see vignette("nse")
). The use of NSE has practical implications. With NSE the names of variables (columns) are called by bare (unquoted) names.
There are standard evaluation (SE) versions of some functions ending with "_" (e.g. NSE spread()
and SE spread_()
)
Basic arguments are sufficient for uncorrupted data, but not all data sets are flawless…
table2[-2,] %>% spread(type, count)
## # A tibble: 6 × 4 ## country year cases population ## * <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 NA ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583
spread()
automatically add NA
to empty cells. You can change this behavior using
fill
– If set, missing values will be replaced with this value.Assume that we want (for some reason) add zeros to empty cells:
table2[-2,] %>% spread(type, count, fill = 0)
## # A tibble: 6 × 4 ## country year cases population ## * <chr> <int> <dbl> <dbl> ## 1 Afghanistan 1999 745 0 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583
What if we have no observation for Afghanistan in 1999?
table2[-c(1:2),] %>% spread(type, count)
## # A tibble: 5 × 4 ## country year cases population ## * <chr> <int> <int> <int> ## 1 Afghanistan 2000 2666 20595360 ## 2 Brazil 1999 37737 172006362 ## 3 Brazil 2000 80488 174504898 ## 4 China 1999 212258 1272915272 ## 5 China 2000 213766 1280428583
spread()
will not create row for Afghanistan in 1999. That might be a problem. (Perhaps we want to add the values from different source later.)
Fortunately we can modify this behavior of separate()
by:
drop
– If FALSE, will keep factor levels that don't appear in the data, filling in missing combinations with fill
.table2[-c(1:2),] %>% spread(type, count, drop = FALSE)
## # A tibble: 6 × 4 ## country year cases population ## * <chr> <int> <int> <int> ## 1 Afghanistan 1999 NA NA ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583
table2[-c(1:2),] %>% spread(type, count, drop = FALSE, fill = 0)
## # A tibble: 6 × 4 ## country year cases population ## * <chr> <int> <dbl> <dbl> ## 1 Afghanistan 1999 0 0 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583
gather()
takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. You use gather()
when you notice that you have columns that are not variables.
Messy example:
## # A tibble: 3 × 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766
Make it tidy:
gather(table4a, key = year, value = value, -country)
## # A tibble: 6 × 3 ## country year value ## <chr> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Brazil 1999 37737 ## 3 China 1999 212258 ## 4 Afghanistan 2000 2666 ## 5 Brazil 2000 80488 ## 6 China 2000 213766
A call:
gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
Arguments:
key
,value
– Names of key and value columns to create in output. key
is the bare name of an existing column. value
is the name of the column being created....
– Specification of columns to gather. Use bare variable names. Select all variables between x
and z
with x:z
, exclude y
with -y
.So, it is possible to get the same result with the list of columns to gather and list of column not to gather:
gather(table4a, key = year, value = value, `1999`:`2000`)
## # A tibble: 6 × 3 ## country year value ## <chr> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Brazil 1999 37737 ## 3 China 1999 212258 ## 4 Afghanistan 2000 2666 ## 5 Brazil 2000 80488 ## 6 China 2000 213766
na.rm
– If TRUE
, will remove rows from output where the value column in NA
.convert
– If TRUE
will automatically run type.convert
on the key column. This is useful if the column names are actually numeric
, integer
, or logical
.factor_key
– If FALSE
, the default, the key values will be stored as a character vector. If TRUE
, will be stored as a factor, which preserves the original ordering of the columns.If you need to use ugly variable (column) name use "`" sign. Ugly is most often a name which is actually numeric or starts with numeric. Run "1_ahoj" and the same expression closed in the sign "`" and compare the results.
Ugly names can be quite often found in Stata data set converted into R – e.g. data sets from World Values Survey.
tidyr functions have recently adopted extended options for column specification which are very useful for programming. These options are identical with options available in select()
function from dplyr
package.
Observations can be missing explicitly or implicitly in your data set.
In the case of explicit missing values there is a combination of cross-sectional and time unit with NA
value.
See the example (population of Norway and Sweden from 1995 to 2000):
## # A tibble: 12 × 3 ## country year population ## <chr> <int> <int> ## 1 Norway 1995 4359788 ## 2 Norway 1996 NA ## 3 Norway 1997 NA ## 4 Norway 1998 4440109 ## 5 Norway 1999 NA ## 6 Norway 2000 4491572 ## 7 Sweden 1995 NA ## 8 Sweden 1996 8849420 ## 9 Sweden 1997 8859106 ## 10 Sweden 1998 8861204 ## 11 Sweden 1999 NA ## 12 Sweden 2000 8872284
Missing observations (whole rows) can be omitted from the data set. In that case we deal with implicit missing values:
## # A tibble: 7 × 3 ## country year population ## <chr> <int> <int> ## 1 Norway 1995 4359788 ## 2 Norway 1998 4440109 ## 3 Norway 2000 4491572 ## 4 Sweden 1996 8849420 ## 5 Sweden 1997 8859106 ## 6 Sweden 1998 8861204 ## 7 Sweden 2000 8872284
tidyr
contains functions which can turn implicit to turn implicit missing values into explicit ones. complete()
returns data set with implicit missing observations added.
complete(data, ..., fill = list())
Arguments:
data
– An input data frame...
– Specification of columns to expand. These columns form an unique ID of an observation (a country-year pair in this case).fill
– A value used to fill empty observations (NA
by default)POP_implicit
## # A tibble: 7 × 3 ## country year population ## <chr> <int> <int> ## 1 Norway 1995 4359788 ## 2 Norway 1998 4440109 ## 3 Norway 2000 4491572 ## 4 Sweden 1996 8849420 ## 5 Sweden 1997 8859106 ## 6 Sweden 1998 8861204 ## 7 Sweden 2000 8872284
complete(data = POP_implicit, country,year)
## # A tibble: 10 × 3 ## country year population ## <chr> <int> <int> ## 1 Norway 1995 4359788 ## 2 Norway 1996 NA ## 3 Norway 1997 NA ## 4 Norway 1998 4440109 ## 5 Norway 2000 4491572 ## 6 Sweden 1995 NA ## 7 Sweden 1996 8849420 ## 8 Sweden 1997 8859106 ## 9 Sweden 1998 8861204 ## 10 Sweden 2000 8872284
Note, that if there is a value missing for both countries in a specific year than this implicit missing observation is missing even in the output of complete()
.
complete()
works only with values present in input dataset!
To overcome this restriction it is possible to supply a vector with all possible values to complete()
:
complete(data = POP_implicit,country,year=1995:2000)
Fills missing values in previously used population data:
fill(data = POP_explicit, population, .direction = "down")
Arguments:
data
– An input data frame...
– Specification of columns to fill. Use bare variable names. Select all variables between x
and z
with x:z
, exclude y
with -y
..direction
– Direction in which to fill missing values. Currently either "down" (the default) or "up".Missing observations have to be explicit!
POP_explicit %>% mutate(population_filled = population) %>% fill(population_filled, .direction = "down")
## # A tibble: 12 × 4 ## country year population population_filled ## <chr> <int> <int> <int> ## 1 Norway 1995 4359788 4359788 ## 2 Norway 1996 NA 4359788 ## 3 Norway 1997 NA 4359788 ## 4 Norway 1998 4440109 4440109 ## 5 Norway 1999 NA 4440109 ## 6 Norway 2000 4491572 4491572 ## 7 Sweden 1995 NA 4491572 ## 8 Sweden 1996 8849420 8849420 ## 9 Sweden 1997 8859106 8859106 ## 10 Sweden 1998 8861204 8861204 ## 11 Sweden 1999 NA 8861204 ## 12 Sweden 2000 8872284 8872284
You need to be extra careful while using fill()
because it is sensitive to ordering. See example:
## # A tibble: 12 × 4 ## country year population population_filled ## <chr> <int> <int> <int> ## 1 Norway 1995 4359788 4359788 ## 2 Sweden 1995 NA 4359788 ## 3 Norway 1996 NA 4359788 ## 4 Sweden 1996 8849420 8849420 ## 5 Norway 1997 NA 8849420 ## 6 Sweden 1997 8859106 8859106 ## 7 Norway 1998 4440109 4440109 ## 8 Sweden 1998 8861204 8861204 ## # ... with 4 more rows
There might be a problem even if ordering is correct. In the first example of using fill()
the value for the first year in Sweden was filled by the last value for Norway!
Solution:
fill()
for each chunkDifficult? Wait for dplyr
. This problem can be easily solved using group_by()
.
Data from UN World Population Prospects 2015 contains net migration (in thousands) for five years periods. Data are available in wpp2015_migration.Rdata
. Problem is that data are messy as hell. Load it and make it tidy!
load("data/wpp2015_migration.Rdata") # Convert data into tibble migration %<>% as_data_frame() # Print names names(migration)
## [1] "country_code" "name" "1950-1955" "1955-1960" ## [5] "1960-1965" "1965-1970" "1970-1975" "1975-1980" ## [9] "1980-1985" "1985-1990" "1990-1995" "1995-2000" ## [13] "2000-2005" "2005-2010" "2010-2015" "2015-2020" ## [17] "2020-2025" "2025-2030" "2030-2035" "2035-2040" ## [21] "2040-2045" "2045-2050" "2050-2055" "2055-2060" ## [25] "2060-2065" "2065-2070" "2070-2075" "2075-2080" ## [29] "2080-2085" "2085-2090" "2090-2095" "2095-2100"
The task is to make it tidy and represent each interval by its center (e.g. "2095-2100" by 2097.5).
In the first step we should make it tidy (easy):
migration %<>% gather(period,value,-country_code,-name) migration %>% print(n=3)
## # A tibble: 7,230 × 4 ## country_code name period value ## <int> <fctr> <chr> <dbl> ## 1 900 WORLD 1950-1955 -0.002 ## 2 901 More developed regions 1950-1955 738.227 ## 3 902 Less developed regions 1950-1955 -738.229 ## # ... with 7,227 more rows
In the second step we need to compute an average from periods description:
"2095-2010"
into c(2095,2010)
Solution of the first problem: we will use separate()
for separating column period
into two new columns!
separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
data
– An input data frame.col
– Bare name of the column to be separated.into
– Names of new columns to create as character vector.sep
– Separator between columns.remove
– If TRUE
(default) then original column is removed.convert
– If TRUE
(default is FALSE
), will run type.convert
with as.is = TRUE
on new columns.We create two columns out of period containg first (year1) and last (year5) year. Notice, that newly created columns are converted into integer.
migration %<>% separate(period, c("year1","year5"), sep = "-", convert = TRUE) migration %>% print(n=4)
## # A tibble: 7,230 × 5 ## country_code name year1 year5 value ## * <int> <fctr> <int> <int> <dbl> ## 1 900 WORLD 1950 1955 -0.002 ## 2 901 More developed regions 1950 1955 738.227 ## 3 902 Less developed regions 1950 1955 -738.229 ## 4 941 Least developed countries 1950 1955 -522.004 ## # ... with 7,226 more rows
Do it hard way…
migration$period <- mean(c(migration$year1,migration$year5))
…or easy way (dplyr)
migration %<>% mutate( period = mean(c(year1,year5)) )
Google Forms is a free platform for creating simple questionnaires. Responses can be downloaded as csv (comma-separated values) file. Sample data was created using following questionnaire.
Responses are available in a file GoogleForms_AVED.csv
. Load it and make it tidy!
gdata <- read_csv("data/GoogleForms_AVED.csv") names(gdata)
## [1] "Timestamp" ## [2] "Choose one of following (radiobutton):" ## [3] "Choose multiple answers (checkbox):" ## [4] "Radiobutton grid: [Row 1]" ## [5] "Radiobutton grid: [Row 2]" ## [6] "Radiobutton grid: [Row 3]"
Google forms returns really ugly names. The first task is to assign reasonable names to columns:
print(gdata[,-1]) # Don't print time
## # A tibble: 5 × 5 ## Q1 Q2 Q3a Q3b Q3c ## <chr> <chr> <chr> <chr> <chr> ## 1 Option A Option 1;Option 3 Column 2 Column 1 Column 4 ## 2 Option C Option 3 Column 4 Column 3 Column 2 ## 3 Option B Option 2 <NA> Column 3 Column 2 ## 4 Option A Option 1;Option 2 Column 2 Column 3 Column 1 ## 5 Option C Option 1;Option 2;Option 3 Column 2 Column 3 <NA>
Google stores answers as (complete) strings – in the case of multiple-choice questions (Q2) are answers chosen separated by ";"
To get the data in useable tidy form we need to separate answers in Q2 into multiple columns. Each possible answer would get its own column with logical value according to anwer of the respondent.
The most straightforward solution is a combination of separate_rows()
, mutate()
, and spread()
.
A call:
Arguments:
data
– An input data frame....
– Specification of columns to fill. Use bare variable names. Select all variables between x
and z
with x:z
, exclude y
with -y
.sep
– Separator between columns.convert
– If TRUE
(default is FALSE
), will run type.convert
with as.is = TRUE
on new columns.We need to:
gdata %>% separate_rows(Q2, sep=";") %>% mutate(aux = TRUE) %>% spread(Q2, aux, fill = FALSE)
gdata %>% separate_rows(Q2, sep=";")
## # A tibble: 9 × 5 ## Q1 Q3a Q3b Q3c Q2 ## <chr> <chr> <chr> <chr> <chr> ## 1 Option A Column 2 Column 1 Column 4 Option 1 ## 2 Option A Column 2 Column 1 Column 4 Option 3 ## 3 Option C Column 4 Column 3 Column 2 Option 3 ## 4 Option B <NA> Column 3 Column 2 Option 2 ## 5 Option A Column 2 Column 3 Column 1 Option 1 ## 6 Option A Column 2 Column 3 Column 1 Option 2 ## 7 Option C Column 2 Column 3 <NA> Option 1 ## 8 Option C Column 2 Column 3 <NA> Option 2 ## 9 Option C Column 2 Column 3 <NA> Option 3
gdata %>% separate_rows(Q2, sep=";") %>% mutate(aux = TRUE)
## # A tibble: 9 × 6 ## Q1 Q3a Q3b Q3c Q2 aux ## <chr> <chr> <chr> <chr> <chr> <lgl> ## 1 Option A Column 2 Column 1 Column 4 Option 1 TRUE ## 2 Option A Column 2 Column 1 Column 4 Option 3 TRUE ## 3 Option C Column 4 Column 3 Column 2 Option 3 TRUE ## 4 Option B <NA> Column 3 Column 2 Option 2 TRUE ## 5 Option A Column 2 Column 3 Column 1 Option 1 TRUE ## 6 Option A Column 2 Column 3 Column 1 Option 2 TRUE ## 7 Option C Column 2 Column 3 <NA> Option 1 TRUE ## 8 Option C Column 2 Column 3 <NA> Option 2 TRUE ## 9 Option C Column 2 Column 3 <NA> Option 3 TRUE
gdata %>% separate_rows(Q2, sep=";") %>% mutate(aux = TRUE) %>% spread(Q2, aux, fill = FALSE)
## # A tibble: 5 × 7 ## Q1 Q3a Q3b Q3c `Option 1` `Option 2` `Option 3` ## * <chr> <chr> <chr> <chr> <lgl> <lgl> <lgl> ## 1 Option A Column 2 Column 1 Column 4 TRUE FALSE TRUE ## 2 Option A Column 2 Column 3 Column 1 TRUE TRUE FALSE ## 3 Option B <NA> Column 3 Column 2 FALSE TRUE FALSE ## 4 Option C Column 2 Column 3 <NA> TRUE TRUE TRUE ## 5 Option C Column 4 Column 3 Column 2 FALSE FALSE TRUE
This is solution is far from ideal:
For better solution (with factors) and more exercises see handout.
The task is to load a table from Eurostat and make it (almost) tidy. Table lfsa_urgacob
contains unemployment rates (%) of immigrants by sex, age and country of birth.
The table is in the file lfsa_urgacob.tsv
(tab-separated table):
## # A tibble: 33,751 × 22 ## `unit,sex,age,c_birth,geo\\time` `2015` `2014` `2013` `2012` `2011` ## <chr> <chr> <chr> <chr> <chr> <chr> ## 1 PC,F,Y15-19,EU15_FOR,AT : u : u : u : u : u ## 2 PC,F,Y15-19,EU15_FOR,BE : : u : u : u : bu ## 3 PC,F,Y15-19,EU15_FOR,CH : u : u : u : u : u ## 4 PC,F,Y15-19,EU15_FOR,CY : u : u : u : : u ## # ... with 3.375e+04 more rows, and 16 more variables: `2010` <chr>, ## # `2009` <chr>, `2008` <chr>, `2007` <chr>, `2006` <chr>, `2005` <chr>, ## # `2004` <chr>, `2003` <chr>, `2002` <chr>, `2001` <chr>, `2000` <chr>, ## # `1999` <chr>, `1998` <chr>, `1997` <chr>, `1996` <chr>, `1995` <chr>
:
stays for missing observations41.6 u
or : bu
letters denote notesYou are supposed to drop all notes and transform the table into following format:
## # A tibble: 24,003 × 35 ## unit sex c_birth geo year Y15_19 Y15_24 Y15_39 Y15_59 Y15_64 ## * <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 PC F EU15_FOR AT 1995 NA NA NA NA NA ## 2 PC F EU15_FOR AT 1996 NA NA NA NA NA ## # ... with 2.4e+04 more rows, and 25 more variables: Y15_74 <dbl>, ## # Y20_24 <dbl>, Y20_64 <dbl>, Y25_29 <dbl>, Y25_49 <dbl>, Y25_54 <dbl>, ## # Y25_59 <dbl>, Y25_64 <dbl>, Y25_74 <dbl>, Y30_34 <dbl>, Y35_39 <dbl>, ## # Y40_44 <dbl>, Y40_59 <dbl>, Y40_64 <dbl>, Y45_49 <dbl>, Y50_54 <dbl>, ## # Y50_59 <dbl>, Y50_64 <dbl>, Y50_74 <dbl>, Y55_59 <dbl>, Y55_64 <dbl>, ## # Y60_64 <dbl>, Y65_69 <dbl>, Y65_74 <dbl>, Y70_74 <dbl>
Assign transformed table into variable eudata
.
Hints: