AVED: Tidy data (tidyr)

2016

Data formats

"It is often said that 80 % of data analysis is spent on the cleaning and preparing data."

"Tidy datasets are all alike but every messy dataset is messy in its own way."

– Hadley Wickham

Today we will deal with the bad, the ugly, and the messy.

Data

Data comes in many different formats.

Typically we work with variables observed for cross-sectional unit (person, geographical unit,…) in time (1,…,infinity) or group (treatment/control).

Messy data: Examples (1)

Multiple rows contain one observation (state-year combination):

## # A tibble: 12 × 4
##       country  year       type    count
##         <chr> <int>      <chr>    <int>
## 1 Afghanistan  1999      cases      745
## 2 Afghanistan  1999 population 19987071
## 3 Afghanistan  2000      cases     2666
## 4 Afghanistan  2000 population 20595360
## # ... with 8 more rows

Real life example:

CSV files exported from OECD (https://stats.oecd.org/)
FAO or Barro-Lee data set

Messy data: Examples (2)

One variable is in many columns + header contain values:

## # A tibble: 3 × 3
##       country `1999` `2000`
## *       <chr>  <int>  <int>
## 1 Afghanistan    745   2666
## 2      Brazil  37737  80488
## 3       China 212258 213766

Real life example:

Files exported from http://data.worldbank.org/
UN data on population from wpp* packages

Messy data: Examples (3)

…or other monstrosities (e.g. values aggregated into one "cell"):

## # A tibble: 6 × 3
##       country  year            rate
## *       <chr> <int>           <chr>
## 1 Afghanistan  1999    745/19987071
## 2 Afghanistan  2000   2666/20595360
## 3      Brazil  1999 37737/172006362
## # ... with 3 more rows

Real life example:

dates in CSV files from Google Trends
tables from Eurostat

Unfortunately creativity is not forbidden nor punishable.

Tidy data (1)

How to organize your data to get a format which is easy to work with?

Brief "tidy" format definition:

Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table

For full definition see Wickham, H. (2014): Tidy Data, https://www.jstatsoft.org/article/view/v059i10

Tidy data (2)

Example of tidy data:

## # A tibble: 6 × 4
##       country  year cases population
##         <chr> <int> <int>      <int>
## 1 Afghanistan  1999   745   19987071
## 2 Afghanistan  2000  2666   20595360
## 3      Brazil  1999 37737  172006362
## 4      Brazil  2000 80488  174504898
## # ... with 2 more rows

Real life example:

Penn World Tables (excel file from http://www.rug.nl/research/ggdc/data/pwt/pwt-9.0) or Penn World Tables from pwt8 package.

Tidying up

CRAN repository contains a lot of packages which contain tool useful for data tidying. For example:

reshape (very obsolete, published on CRAN in 2005),
reshape2 (obsolete, published on CRAN in 2010),
tidyr (daisy fresh, published on CRAN in 2014)

tidyr contains two basic functions spread() and gather() which can handle majority of tyding cases.

Disclaimer

From now on we will work mostly with packages of tidyverse (e.g. tidyr, dplyr, readr, haven, readxl, ggplot2, stringr, tibble, and many more). These packges are very often under heavy development.

This slides/handouts may become obsolete in next 5 minutes…
It is useful from time to time to go through list of functions included in a package

More about tidyverse:

spread()

spread() is a function for spreading a key-value pair across multiple columns

spread()

Messy example:

## # A tibble: 12 × 4
##        country  year       type      count
##          <chr> <int>      <chr>      <int>
## 1  Afghanistan  1999      cases        745
## 2  Afghanistan  1999 population   19987071
## 3  Afghanistan  2000      cases       2666
## 4  Afghanistan  2000 population   20595360
## 5       Brazil  1999      cases      37737
## 6       Brazil  1999 population  172006362
## 7       Brazil  2000      cases      80488
## 8       Brazil  2000 population  174504898
## 9        China  1999      cases     212258
## 10       China  1999 population 1272915272
## 11       China  2000      cases     213766
## 12       China  2000 population 1280428583

spread()

Make it tidy:

spread(data = table2, key = type, value = count)

## # A tibble: 6 × 4
##       country  year  cases population
## *       <chr> <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

spread()

A call:

spread(data, key, value, 
       fill = NA, convert = FALSE, drop = TRUE,  sep = NULL)

Arguments:

data – an input data.frame
key – bare name of the column whose values will be used as column headings
value – bare name of the column whose values will populate the cells

Non-standard evaluation (NSE)

tidyr is among packages which use so-called non-standard evaluation (NSE, see vignette("nse")). The use of NSE has practical implications. With NSE the names of variables (columns) are called by bare (unquoted) names.

There are standard evaluation (SE) versions of some functions ending with "_" (e.g. NSE spread() and SE spread_())

spread()

Basic arguments are sufficient for uncorrupted data, but not all data sets are flawless…

spread(), fill (1)

table2[-2,] %>% spread(type, count)

## # A tibble: 6 × 4
##       country  year  cases population
## *       <chr> <int>  <int>      <int>
## 1 Afghanistan  1999    745         NA
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

spread() automatically add NA to empty cells. You can change this behavior using

fill – If set, missing values will be replaced with this value.

spread(), fill (2)

Assume that we want (for some reason) add zeros to empty cells:

table2[-2,] %>% spread(type, count, fill = 0)

## # A tibble: 6 × 4
##       country  year  cases population
## *       <chr> <int>  <dbl>      <dbl>
## 1 Afghanistan  1999    745          0
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

spread(), drop (1)

What if we have no observation for Afghanistan in 1999?

table2[-c(1:2),] %>% spread(type, count)

## # A tibble: 5 × 4
##       country  year  cases population
## *       <chr> <int>  <int>      <int>
## 1 Afghanistan  2000   2666   20595360
## 2      Brazil  1999  37737  172006362
## 3      Brazil  2000  80488  174504898
## 4       China  1999 212258 1272915272
## 5       China  2000 213766 1280428583

spread() will not create row for Afghanistan in 1999. That might be a problem. (Perhaps we want to add the values from different source later.)

spread(), drop (2)

Fortunately we can modify this behavior of separate() by:

drop – If FALSE, will keep factor levels that don't appear in the data, filling in missing combinations with fill.

table2[-c(1:2),] %>% spread(type, count, drop = FALSE)

## # A tibble: 6 × 4
##       country  year  cases population
## *       <chr> <int>  <int>      <int>
## 1 Afghanistan  1999     NA         NA
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

spread(), drop (3)

table2[-c(1:2),] %>% spread(type, count, drop = FALSE, fill = 0)

## # A tibble: 6 × 4
##       country  year  cases population
## *       <chr> <int>  <dbl>      <dbl>
## 1 Afghanistan  1999      0          0
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

gather()

gather() takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. You use gather() when you notice that you have columns that are not variables.

gather()

Messy example:

## # A tibble: 3 × 3
##       country `1999` `2000`
## *       <chr>  <int>  <int>
## 1 Afghanistan    745   2666
## 2      Brazil  37737  80488
## 3       China 212258 213766

gather()

Make it tidy:

gather(table4a, key = year, value = value, -country)

## # A tibble: 6 × 3
##       country  year  value
##         <chr> <chr>  <int>
## 1 Afghanistan  1999    745
## 2      Brazil  1999  37737
## 3       China  1999 212258
## 4 Afghanistan  2000   2666
## 5      Brazil  2000  80488
## 6       China  2000 213766

gather()

A call:

gather(data, key, value, ..., 
       na.rm = FALSE, convert = FALSE, factor_key = FALSE)

Arguments:

key,value – Names of key and value columns to create in output. key is the bare name of an existing column. value is the name of the column being created.
... – Specification of columns to gather. Use bare variable names. Select all variables between x and z with x:z, exclude y with -y.

gather()

So, it is possible to get the same result with the list of columns to gather and list of column not to gather:

gather(table4a, key = year, value = value, `1999`:`2000`)

## # A tibble: 6 × 3
##       country  year  value
##         <chr> <chr>  <int>
## 1 Afghanistan  1999    745
## 2      Brazil  1999  37737
## 3       China  1999 212258
## 4 Afghanistan  2000   2666
## 5      Brazil  2000  80488
## 6       China  2000 213766

gather(), more arguments

na.rm – If TRUE, will remove rows from output where the value column in NA.
convert – If TRUE will automatically run type.convert on the key column. This is useful if the column names are actually numeric, integer, or logical.
factor_key – If FALSE, the default, the key values will be stored as a character vector. If TRUE, will be stored as a factor, which preserves the original ordering of the columns.

Something ugly

If you need to use ugly variable (column) name use "`" sign. Ugly is most often a name which is actually numeric or starts with numeric. Run "1_ahoj" and the same expression closed in the sign "`" and compare the results.

Ugly names can be quite often found in Stata data set converted into R – e.g. data sets from World Values Survey.

Something good

tidyr functions have recently adopted extended options for column specification which are very useful for programming. These options are identical with options available in select() function from dplyr package.

Exercise 1: Population of Norway and Sweden

Observations can be missing explicitly or implicitly in your data set.

In the case of explicit missing values there is a combination of cross-sectional and time unit with NA value.

Exercise 1: Explicit missing values

See the example (population of Norway and Sweden from 1995 to 2000):

## # A tibble: 12 × 3
##    country  year population
##      <chr> <int>      <int>
## 1   Norway  1995    4359788
## 2   Norway  1996         NA
## 3   Norway  1997         NA
## 4   Norway  1998    4440109
## 5   Norway  1999         NA
## 6   Norway  2000    4491572
## 7   Sweden  1995         NA
## 8   Sweden  1996    8849420
## 9   Sweden  1997    8859106
## 10  Sweden  1998    8861204
## 11  Sweden  1999         NA
## 12  Sweden  2000    8872284

Exercise 1: Implicit missing values

Missing observations (whole rows) can be omitted from the data set. In that case we deal with implicit missing values:

## # A tibble: 7 × 3
##   country  year population
##     <chr> <int>      <int>
## 1  Norway  1995    4359788
## 2  Norway  1998    4440109
## 3  Norway  2000    4491572
## 4  Sweden  1996    8849420
## 5  Sweden  1997    8859106
## 6  Sweden  1998    8861204
## 7  Sweden  2000    8872284

Exercise 1: Turning Implicit into explicit

tidyr contains functions which can turn implicit to turn implicit missing values into explicit ones. complete() returns data set with implicit missing observations added.

complete(data, ..., fill = list())

Arguments:

data – An input data frame
... – Specification of columns to expand. These columns form an unique ID of an observation (a country-year pair in this case).
fill – A value used to fill empty observations (NA by default)

Exercise 1: Turning Implicit into explicit

POP_implicit

## # A tibble: 7 × 3
##   country  year population
##     <chr> <int>      <int>
## 1  Norway  1995    4359788
## 2  Norway  1998    4440109
## 3  Norway  2000    4491572
## 4  Sweden  1996    8849420
## 5  Sweden  1997    8859106
## 6  Sweden  1998    8861204
## 7  Sweden  2000    8872284

complete(data = POP_implicit,
         country,year)

## # A tibble: 10 × 3
##    country  year population
##      <chr> <int>      <int>
## 1   Norway  1995    4359788
## 2   Norway  1996         NA
## 3   Norway  1997         NA
## 4   Norway  1998    4440109
## 5   Norway  2000    4491572
## 6   Sweden  1995         NA
## 7   Sweden  1996    8849420
## 8   Sweden  1997    8859106
## 9   Sweden  1998    8861204
## 10  Sweden  2000    8872284

Exercise 1: Mystery of missing 1999

Note, that if there is a value missing for both countries in a specific year than this implicit missing observation is missing even in the output of complete().

complete() works only with values present in input dataset!

To overcome this restriction it is possible to supply a vector with all possible values to complete():

complete(data = POP_implicit,country,year=1995:2000)

Exercise 2: Replacing missing values

Fills missing values in previously used population data:

fill(data = POP_explicit, population, .direction = "down")

Arguments:

data – An input data frame
... – Specification of columns to fill. Use bare variable names. Select all variables between x and z with x:z, exclude y with -y.
.direction – Direction in which to fill missing values. Currently either "down" (the default) or "up".

Missing observations have to be explicit!

Exercise 2: Replacing missing values

POP_explicit %>% mutate(population_filled = population) %>% 
fill(population_filled, .direction = "down")

## # A tibble: 12 × 4
##    country  year population population_filled
##      <chr> <int>      <int>             <int>
## 1   Norway  1995    4359788           4359788
## 2   Norway  1996         NA           4359788
## 3   Norway  1997         NA           4359788
## 4   Norway  1998    4440109           4440109
## 5   Norway  1999         NA           4440109
## 6   Norway  2000    4491572           4491572
## 7   Sweden  1995         NA           4491572
## 8   Sweden  1996    8849420           8849420
## 9   Sweden  1997    8859106           8859106
## 10  Sweden  1998    8861204           8861204
## 11  Sweden  1999         NA           8861204
## 12  Sweden  2000    8872284           8872284

Exercise 2: Replacing missing values

You need to be extra careful while using fill() because it is sensitive to ordering. See example:

## # A tibble: 12 × 4
##   country  year population population_filled
##     <chr> <int>      <int>             <int>
## 1  Norway  1995    4359788           4359788
## 2  Sweden  1995         NA           4359788
## 3  Norway  1996         NA           4359788
## 4  Sweden  1996    8849420           8849420
## 5  Norway  1997         NA           8849420
## 6  Sweden  1997    8859106           8859106
## 7  Norway  1998    4440109           4440109
## 8  Sweden  1998    8861204           8861204
## # ... with 4 more rows

Exercise 2: Replacing missing values

There might be a problem even if ordering is correct. In the first example of using fill() the value for the first year in Sweden was filled by the last value for Norway!

Solution:

Split dataset into chunks – one for each country
Call fill() for each chunk
Join chunks together again

Difficult? Wait for dplyr. This problem can be easily solved using group_by().

Exercise 3: World Population Prospects 2015

Data from UN World Population Prospects 2015 contains net migration (in thousands) for five years periods. Data are available in wpp2015_migration.Rdata. Problem is that data are messy as hell. Load it and make it tidy!

Exercise 3: Load it and look at it

load("data/wpp2015_migration.Rdata")

# Convert data into tibble
migration %<>% as_data_frame()

# Print names
names(migration)

##  [1] "country_code" "name"         "1950-1955"    "1955-1960"   
##  [5] "1960-1965"    "1965-1970"    "1970-1975"    "1975-1980"   
##  [9] "1980-1985"    "1985-1990"    "1990-1995"    "1995-2000"   
## [13] "2000-2005"    "2005-2010"    "2010-2015"    "2015-2020"   
## [17] "2020-2025"    "2025-2030"    "2030-2035"    "2035-2040"   
## [21] "2040-2045"    "2045-2050"    "2050-2055"    "2055-2060"   
## [25] "2060-2065"    "2065-2070"    "2070-2075"    "2075-2080"   
## [29] "2080-2085"    "2085-2090"    "2090-2095"    "2095-2100"

Exercise 3: Make it tidy!

The task is to make it tidy and represent each interval by its center (e.g. "2095-2100" by 2097.5).

In the first step we should make it tidy (easy):

migration %<>% gather(period,value,-country_code,-name)

migration %>% print(n=3)

## # A tibble: 7,230 × 4
##   country_code                   name    period    value
##          <int>                 <fctr>     <chr>    <dbl>
## 1          900                  WORLD 1950-1955   -0.002
## 2          901 More developed regions 1950-1955  738.227
## 3          902 Less developed regions 1950-1955 -738.229
## # ... with 7,227 more rows

Exercise 3: Intervals

In the second step we need to compute an average from periods description:

Extract two numbers from a character: i.e. transform "2095-2010" into c(2095,2010)
Compute a mean

Solution of the first problem: we will use separate() for separating column period into two new columns!

Exercise 3: separate()

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE,
  convert = FALSE, extra = "warn", fill = "warn", ...)

data – An input data frame.
col – Bare name of the column to be separated.
into – Names of new columns to create as character vector.
sep – Separator between columns.
remove – If TRUE (default) then original column is removed.
convert – If TRUE (default is FALSE), will run type.convert with as.is = TRUE on new columns.

Exercise 3: Application of separate()

We create two columns out of period containg first (year1) and last (year5) year. Notice, that newly created columns are converted into integer.

migration %<>% separate(period, c("year1","year5"), 
                        sep = "-", convert = TRUE)
migration %>% print(n=4)

## # A tibble: 7,230 × 5
##   country_code                      name year1 year5    value
## *        <int>                    <fctr> <int> <int>    <dbl>
## 1          900                     WORLD  1950  1955   -0.002
## 2          901    More developed regions  1950  1955  738.227
## 3          902    Less developed regions  1950  1955 -738.229
## 4          941 Least developed countries  1950  1955 -522.004
## # ... with 7,226 more rows

Exercise 3: Create new period ID

Do it hard way…

migration$period <- mean(c(migration$year1,migration$year5))

…or easy way (dplyr)

migration %<>% 
    mutate(
        period = mean(c(year1,year5))
    )

Exercise 4: Google forms

Google Forms is a free platform for creating simple questionnaires. Responses can be downloaded as csv (comma-separated values) file. Sample data was created using following questionnaire.

Responses are available in a file GoogleForms_AVED.csv. Load it and make it tidy!

Exercise 4: Take a look at the data

gdata <- read_csv("data/GoogleForms_AVED.csv")

names(gdata)

## [1] "Timestamp"                             
## [2] "Choose one of following (radiobutton):"
## [3] "Choose multiple answers (checkbox):"   
## [4] "Radiobutton grid: [Row 1]"             
## [5] "Radiobutton grid: [Row 2]"             
## [6] "Radiobutton grid: [Row 3]"

Google forms returns really ugly names. The first task is to assign reasonable names to columns:

Exercise 4: Take a look at the data

print(gdata[,-1]) # Don't print time

## # A tibble: 5 × 5
##         Q1                         Q2      Q3a      Q3b      Q3c
##      <chr>                      <chr>    <chr>    <chr>    <chr>
## 1 Option A          Option 1;Option 3 Column 2 Column 1 Column 4
## 2 Option C                   Option 3 Column 4 Column 3 Column 2
## 3 Option B                   Option 2     <NA> Column 3 Column 2
## 4 Option A          Option 1;Option 2 Column 2 Column 3 Column 1
## 5 Option C Option 1;Option 2;Option 3 Column 2 Column 3     <NA>

Exercise 4: The Mess

Google stores answers as (complete) strings – in the case of multiple-choice questions (Q2) are answers chosen separated by ";"

To get the data in useable tidy form we need to separate answers in Q2 into multiple columns. Each possible answer would get its own column with logical value according to anwer of the respondent.

The most straightforward solution is a combination of separate_rows(), mutate(), and spread().

Exercise 4: separate_rows()

A call:

Arguments:

data – An input data frame.
... – Specification of columns to fill. Use bare variable names. Select all variables between x and z with x:z, exclude y with -y.
sep – Separator between columns.
convert – If TRUE (default is FALSE), will run type.convert with as.is = TRUE on new columns.

Exercise 4: Tidying up

We need to:

Separate answers stored in Q2 into multiple rows
Add new auxiliary column "aux" which is equal to TRUE for every row
Spread values from auxiliary variable using Q2 as a key

gdata %>%
  separate_rows(Q2, sep=";") %>% 
  mutate(aux = TRUE) %>% 
  spread(Q2, aux, fill = FALSE)

Exercise 4: Tidying up step by step (1)

gdata %>%
  separate_rows(Q2, sep=";")

## # A tibble: 9 × 5
##         Q1      Q3a      Q3b      Q3c       Q2
##      <chr>    <chr>    <chr>    <chr>    <chr>
## 1 Option A Column 2 Column 1 Column 4 Option 1
## 2 Option A Column 2 Column 1 Column 4 Option 3
## 3 Option C Column 4 Column 3 Column 2 Option 3
## 4 Option B     <NA> Column 3 Column 2 Option 2
## 5 Option A Column 2 Column 3 Column 1 Option 1
## 6 Option A Column 2 Column 3 Column 1 Option 2
## 7 Option C Column 2 Column 3     <NA> Option 1
## 8 Option C Column 2 Column 3     <NA> Option 2
## 9 Option C Column 2 Column 3     <NA> Option 3

Exercise 4: Tidying up step by step (2)

gdata %>%
  separate_rows(Q2, sep=";") %>% 
  mutate(aux = TRUE)

## # A tibble: 9 × 6
##         Q1      Q3a      Q3b      Q3c       Q2   aux
##      <chr>    <chr>    <chr>    <chr>    <chr> <lgl>
## 1 Option A Column 2 Column 1 Column 4 Option 1  TRUE
## 2 Option A Column 2 Column 1 Column 4 Option 3  TRUE
## 3 Option C Column 4 Column 3 Column 2 Option 3  TRUE
## 4 Option B     <NA> Column 3 Column 2 Option 2  TRUE
## 5 Option A Column 2 Column 3 Column 1 Option 1  TRUE
## 6 Option A Column 2 Column 3 Column 1 Option 2  TRUE
## 7 Option C Column 2 Column 3     <NA> Option 1  TRUE
## 8 Option C Column 2 Column 3     <NA> Option 2  TRUE
## 9 Option C Column 2 Column 3     <NA> Option 3  TRUE

Exercise 4: Tidying up step by step (3)

gdata %>%
  separate_rows(Q2, sep=";") %>% 
  mutate(aux = TRUE) %>% 
  spread(Q2, aux, fill = FALSE)

## # A tibble: 5 × 7
##         Q1      Q3a      Q3b      Q3c `Option 1` `Option 2` `Option 3`
## *    <chr>    <chr>    <chr>    <chr>      <lgl>      <lgl>      <lgl>
## 1 Option A Column 2 Column 1 Column 4       TRUE      FALSE       TRUE
## 2 Option A Column 2 Column 3 Column 1       TRUE       TRUE      FALSE
## 3 Option B     <NA> Column 3 Column 2      FALSE       TRUE      FALSE
## 4 Option C Column 2 Column 3     <NA>       TRUE       TRUE       TRUE
## 5 Option C Column 4 Column 3 Column 2      FALSE      FALSE       TRUE

Exercise 4

This is solution is far from ideal:

Names of newly created columns does not make much sense.
Column for "Option 4" is missing (because nobody chose it – but we don't want to lose this information)

For better solution (with factors) and more exercises see handout.

Homework (1)

Homework (2)

The task is to load a table from Eurostat and make it (almost) tidy. Table lfsa_urgacob contains unemployment rates (%) of immigrants by sex, age and country of birth.

The table is in the file lfsa_urgacob.tsv (tab-separated table):

## # A tibble: 33,751 × 22
##   `unit,sex,age,c_birth,geo\\time` `2015` `2014` `2013` `2012` `2011`
##                              <chr>  <chr>  <chr>  <chr>  <chr>  <chr>
## 1          PC,F,Y15-19,EU15_FOR,AT    : u    : u    : u    : u    : u
## 2          PC,F,Y15-19,EU15_FOR,BE      :    : u    : u    : u   : bu
## 3          PC,F,Y15-19,EU15_FOR,CH    : u    : u    : u    : u    : u
## 4          PC,F,Y15-19,EU15_FOR,CY    : u    : u    : u      :    : u
## # ... with 3.375e+04 more rows, and 16 more variables: `2010` <chr>,
## #   `2009` <chr>, `2008` <chr>, `2007` <chr>, `2006` <chr>, `2005` <chr>,
## #   `2004` <chr>, `2003` <chr>, `2002` <chr>, `2001` <chr>, `2000` <chr>,
## #   `1999` <chr>, `1998` <chr>, `1997` <chr>, `1996` <chr>, `1995` <chr>

Homework (3)

: stays for missing observations
in records in the form 41.6 u or : bu letters denote notes

You are supposed to drop all notes and transform the table into following format:

## # A tibble: 24,003 × 35
##    unit   sex  c_birth   geo  year Y15_19 Y15_24 Y15_39 Y15_59 Y15_64
## * <chr> <chr>    <chr> <chr> <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1    PC     F EU15_FOR    AT  1995     NA     NA     NA     NA     NA
## 2    PC     F EU15_FOR    AT  1996     NA     NA     NA     NA     NA
## # ... with 2.4e+04 more rows, and 25 more variables: Y15_74 <dbl>,
## #   Y20_24 <dbl>, Y20_64 <dbl>, Y25_29 <dbl>, Y25_49 <dbl>, Y25_54 <dbl>,
## #   Y25_59 <dbl>, Y25_64 <dbl>, Y25_74 <dbl>, Y30_34 <dbl>, Y35_39 <dbl>,
## #   Y40_44 <dbl>, Y40_59 <dbl>, Y40_64 <dbl>, Y45_49 <dbl>, Y50_54 <dbl>,
## #   Y50_59 <dbl>, Y50_64 <dbl>, Y50_74 <dbl>, Y55_59 <dbl>, Y55_64 <dbl>,
## #   Y60_64 <dbl>, Y65_69 <dbl>, Y65_74 <dbl>, Y70_74 <dbl>

Homework (4)

Assign transformed table into variable eudata.

Hints:

Pay attention to column names and classes!
You may find helpful to use some regular expressions
Age groups contains "-" in original table and "_" in the transformed one

This is the end