AVED: Data manipulation (dplyr)

This handout heavily borrows from dplyr (dplyr-associated) vignettes available at https://github.com/hadley/dplyr/tree/master/vignettes

Empirical work requires knowledge of data manipulation – it includes data filtering, modification, joining etc. Basic R provides sufficient functionality but those tasks can be handled more elegantly and effectively using tools from dplyr package. Dplyr is the best package for manipulation of small and middle size data set. (I work with data set of millions of observations with instant response on daily basis.) Big data are bit beyond powers of dplyr and require specialized tools (database systems).

The lecture is organized to following sections which covers most tasks needed in empirical work:

Basic verbs (functions) of dplyr provide basic functionality: sorting, subsetting, columns modifying, and summarizing
Basic verbs are empowered by so called grouped operations which allows user to perform operations separately for user-defined groups.
It is common that a researcher needs to work with data from multiple sources which have to be joined into single data set. Joining operations are performed in dplyr using so called two-table verbs.
Dplyr also provides window functions which might be needed when working with data with time dimension.

Topics covered in the lecture are not a complete list of dplyr capabilities – e.g. (SQL) database functions are not covered at all. dplyr is a package under quite heavy development with new functions popping out every few months.

dplyr is new completely re-implemented evolution of plyr package. If you need to load both packages (e.g. because of dependencies) make sure that you load plyr first!

Let’s do it…

library(dplyr)

Data used in examples

For dplyr demonstrations we will use data set from nycflights13 package which includes data on flights which departed New York in 2013.

library(nycflights13)

Package contains following tables:

airlines with airlines codes and names (carrier – two letter abbreviation, name – full name)

## # A tibble: 16 × 2
##   carrier                   name
##     <chr>                  <chr>
## 1      9E      Endeavor Air Inc.
## 2      AA American Airlines Inc.
## # ... with 14 more rows

airports with airports metadata (faa – FAA airport code, name – name of airport, lon – longitude, lat – latitude, alt – altitude in feet, tz – time zone from GMT, dst – daylight savings time zone)

## # A tibble: 1,396 × 7
##     faa                          name      lat       lon   alt    tz   dst
##   <chr>                         <chr>    <dbl>     <dbl> <int> <dbl> <chr>
## 1   04G             Lansdowne Airport 41.13047 -80.61958  1044    -5     A
## 2   06A Moton Field Municipal Airport 32.46057 -85.68003   264    -5     A
## # ... with 1,394 more rows

flights with data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013 (year, month, day – date of departure, dep_time, arr_time – actual time of arrival and departure (local tz), sched_dep_time, sched_arr_time – scheduled arrival and departure times (local tz), dep_delay, arr_delay – departure and arrival delays in minutes, hour, minute – time of scheduled departure broken into hour and minutes, carrier – two letter abbreviation, tailnum – plane tail number, flight – flight number, origin, dest – FAA codes, air_time – amount of time spent in the air, distance – distance flown, time_hour – scheduled date and hour of the flight as a POSIXct date)

## # A tibble: 336,776 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## # ... with 3.368e+05 more rows, and 12 more variables:
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

planes with plane metadata for all plane tail numbers found in the FAA aircraft registry. (tailnum – tail number, year – year manufactured, type – type of plane, manufacturer, model, enignes – number of engines, seats – number of seats, speed – average cruising speed in mph, engine – type of engine)

## # A tibble: 3,322 × 9
##   tailnum  year                    type     manufacturer     model engines
##     <chr> <int>                   <chr>            <chr>     <chr>   <int>
## 1  N10156  2004 Fixed wing multi engine          EMBRAER EMB-145XR       2
## 2  N102UW  1998 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
## # ... with 3,320 more rows, and 3 more variables: seats <int>,
## #   speed <int>, engine <chr>

weather with hourly meteorological data for LGA, JFK and EWR (origin – weather station, time_hour – scheduled date and hour of the flight as a POSIXct date, the rest is self-explanatory)

## # A tibble: 26,130 × 15
##   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
## 1    EWR  2013     1     1     0 37.04 21.92 53.97      230   10.35702
## 2    EWR  2013     1     1     1 37.04 21.92 53.97      230   13.80936
## # ... with 2.613e+04 more rows, and 5 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>

Basic verbs

Filter rows with filter() and slice()

A basic operation in data handling is a selection of observations (rows) which meet specific condition from a data set. This functionality is delivered by function filter(). For example we can filter planes from planes table which meets following conditions:

have 2 engines
were manufactured between 2000 and 2005 (and manufacturing year is known)
were not manufactured by Embraer or Boeing

filter(.data = planes, 
       engines == 2, 
       !is.na(year),
       year>=2000 & year<=2005, 
       !(manufacturer %in% c("EMBRAER","BOEING"))
       )

## # A tibble: 556 × 9
##   tailnum  year                    type     manufacturer    model engines
##     <chr> <int>                   <chr>            <chr>    <chr>   <int>
## 1  N117UW  2000 Fixed wing multi engine AIRBUS INDUSTRIE A320-214       2
## 2  N118US  2000 Fixed wing multi engine AIRBUS INDUSTRIE A320-214       2
## 3  N119US  2000 Fixed wing multi engine AIRBUS INDUSTRIE A320-214       2
## 4  N121UW  2000 Fixed wing multi engine AIRBUS INDUSTRIE A320-214       2
## # ... with 552 more rows, and 3 more variables: seats <int>, speed <int>,
## #   engine <chr>

The syntax simple: filter(.data,...) where ... represents a sequence of logical predicates.

Note that filter(), as well as other dplyr functions, uses non-standard evaluation. The version filter_() for standard evaluation is also available. (See handout for tidyr.)

Identical result can be obtained using more verbose code of standard R:

planes[planes$engines == 2 & 
         !is.na(planes$year) &
         planes$year >= 2000 & 
         planes$year <= 2005 & 
         planes$manufacturer != "EMBRAER" &
         planes$manufacturer != "BOEING",]

## # A tibble: 556 × 9
##   tailnum  year                    type     manufacturer    model engines
##     <chr> <int>                   <chr>            <chr>    <chr>   <int>
## 1  N117UW  2000 Fixed wing multi engine AIRBUS INDUSTRIE A320-214       2
## 2  N118US  2000 Fixed wing multi engine AIRBUS INDUSTRIE A320-214       2
## 3  N119US  2000 Fixed wing multi engine AIRBUS INDUSTRIE A320-214       2
## 4  N121UW  2000 Fixed wing multi engine AIRBUS INDUSTRIE A320-214       2
## # ... with 552 more rows, and 3 more variables: seats <int>, speed <int>,
## #   engine <chr>

Argument of filter() a sequence of logical predicates. Sometimes (rarely) is useful filter rows using their row numbers. filter() is unable to process row numbers as arguments. In this case we can use slice(.data,...) with similar syntax, where ... represents row numbers (integers). Assume that (for some reason) we want to filter only even rows:

slice(.data = planes, seq(from = 2, to = nrow(planes), by = 2))

## # A tibble: 1,661 × 9
##   tailnum  year                    type     manufacturer    model engines
##     <chr> <int>                   <chr>            <chr>    <chr>   <int>
## 1  N102UW  1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214       2
## 2  N104UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214       2
## 3  N105UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214       2
## 4  N108UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214       2
## # ... with 1,657 more rows, and 3 more variables: seats <int>,
## #   speed <int>, engine <chr>

Hack of the day: filter() is a quite usual name for a function. For example apart of dplyr you can find a function of the same name in package stats. If you need to call specific function you can do it using package::function notation (e.g. dplyr::filter()). With this notation you don’t need the package to be loaded!

Extract unique rows with distinct()

In some cases a data set contains identical rows. If we want to get rid of duplication we can use for example unique() or considerably faster distinct(.data, ..., .keep_all = FALSE) from dplyr:

distinct(.data = flights, tailnum, .keep_all = TRUE)

## # A tibble: 4,044 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## 3  2013     1     1      542            540         2      923
## 4  2013     1     1      544            545        -1     1004
## # ... with 4,040 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Arguments:

.data – a data.frame (tbl)
... – optional variables to use when determining uniqueness. If omitted, will use all variables.
.keep_all – if TRUE then all variables are kept in the output. (Default value is FALSE.)

The example shows that table flights with 336776 observations contains only 4044 unique tail numbers (i.e. planes). If .keep_all=TRUE is used then distinct() returns values for the first observation of the tail number.

We can also use distinct() with .keep_all=FALSE to get all observed combinations of some variables:

distinct(.data = flights, origin, dest, .keep_all = FALSE)

## # A tibble: 224 × 2
##   origin  dest
##    <chr> <chr>
## 1    EWR   IAH
## 2    LGA   IAH
## 3    JFK   MIA
## 4    JFK   BQN
## # ... with 220 more rows

Extract random rows with sample_n() and sample_frac()

For example when bootstrapping you need to randomly sample the observations (rows). You can use functions sample_n() for absolute sample size and sample_frac() for relative sample size. Both functions are wrappers around sample.int() from base and work with its arguments (i.e. weight and replace).

Sort rows with arrange()

With arrange(.data,...) dplyr provides function for sorting rows according to one or multiple columns:

arrange(.data = planes, manufacturer)

## # A tibble: 3,322 × 9
##   tailnum  year                    type manufacturer    model engines
##     <chr> <int>                   <chr>        <chr>    <chr>   <int>
## 1  N365AA  2001              Rotorcraft   AGUSTA SPA    A109E       2
## 2  N125UW  2009 Fixed wing multi engine       AIRBUS A320-214       2
## 3  N126UW  2009 Fixed wing multi engine       AIRBUS A320-214       2
## 4  N127UW  2010 Fixed wing multi engine       AIRBUS A320-214       2
## # ... with 3,318 more rows, and 3 more variables: seats <int>,
## #   speed <int>, engine <chr>

arrange(.data = planes, manufacturer, year)

## # A tibble: 3,322 × 9
##   tailnum  year                    type manufacturer    model engines
##     <chr> <int>                   <chr>        <chr>    <chr>   <int>
## 1  N365AA  2001              Rotorcraft   AGUSTA SPA    A109E       2
## 2  N186US  2002 Fixed wing multi engine       AIRBUS A321-211       2
## 3  N187US  2002 Fixed wing multi engine       AIRBUS A321-211       2
## 4  N188US  2002 Fixed wing multi engine       AIRBUS A321-211       2
## # ... with 3,318 more rows, and 3 more variables: seats <int>,
## #   speed <int>, engine <chr>

Arguments:

... comma separated list of unquoted variable names. If you want to sort a variable in descending order you need to use function desc() – see example:

arrange(.data = planes, desc(manufacturer))

## # A tibble: 3,322 × 9
##   tailnum  year                     type           manufacturer     model
##     <chr> <int>                    <chr>                  <chr>     <chr>
## 1  N397AA  1985 Fixed wing single engine           STEWART MACO FALCON XP
## 2  N521AA    NA Fixed wing single engine           STEWART MACO FALCON-XP
## 3  N347AA  1985               Rotorcraft               SIKORSKY     S-76A
## 4  N537JB  2012               Rotorcraft ROBINSON HELICOPTER CO       R66
## # ... with 3,318 more rows, and 4 more variables: engines <int>,
## #   seats <int>, speed <int>, engine <chr>

Select columns with select()

You can select columns from a table using select(.data,...), where ... is comma separated list of unquoted expressions. You can treat variable names like they are positions. See example:

select(.data = planes, year, manufacturer, model, seats, speed)

## # A tibble: 3,322 × 5
##    year     manufacturer     model seats speed
##   <int>            <chr>     <chr> <int> <int>
## 1  2004          EMBRAER EMB-145XR    55    NA
## 2  1998 AIRBUS INDUSTRIE  A320-214   182    NA
## 3  1999 AIRBUS INDUSTRIE  A320-214   182    NA
## 4  1999 AIRBUS INDUSTRIE  A320-214   182    NA
## # ... with 3,318 more rows

select(.data = planes, 2, 4, 5, 7, 8)

## # A tibble: 3,322 × 5
##    year     manufacturer     model seats speed
##   <int>            <chr>     <chr> <int> <int>
## 1  2004          EMBRAER EMB-145XR    55    NA
## 2  1998 AIRBUS INDUSTRIE  A320-214   182    NA
## 3  1999 AIRBUS INDUSTRIE  A320-214   182    NA
## 4  1999 AIRBUS INDUSTRIE  A320-214   182    NA
## # ... with 3,318 more rows

Special functions inside select()

You can use special function when addressing columns:

: can be used to address range of variables
- drops variables
= renames columns

select(.data = planes, year, manufacturer:seats)

## # A tibble: 3,322 × 5
##    year     manufacturer     model engines seats
##   <int>            <chr>     <chr>   <int> <int>
## 1  2004          EMBRAER EMB-145XR       2    55
## 2  1998 AIRBUS INDUSTRIE  A320-214       2   182
## 3  1999 AIRBUS INDUSTRIE  A320-214       2   182
## 4  1999 AIRBUS INDUSTRIE  A320-214       2   182
## # ... with 3,318 more rows

select(.data = planes, -tailnum, -year, -type)

## # A tibble: 3,322 × 6
##       manufacturer     model engines seats speed    engine
##              <chr>     <chr>   <int> <int> <int>     <chr>
## 1          EMBRAER EMB-145XR       2    55    NA Turbo-fan
## 2 AIRBUS INDUSTRIE  A320-214       2   182    NA Turbo-fan
## 3 AIRBUS INDUSTRIE  A320-214       2   182    NA Turbo-fan
## 4 AIRBUS INDUSTRIE  A320-214       2   182    NA Turbo-fan
## # ... with 3,318 more rows

select(.data = planes, time=year, manufacturer, seats, speed)

## # A tibble: 3,322 × 4
##    time     manufacturer seats speed
##   <int>            <chr> <int> <int>
## 1  2004          EMBRAER    55    NA
## 2  1998 AIRBUS INDUSTRIE   182    NA
## 3  1999 AIRBUS INDUSTRIE   182    NA
## 4  1999 AIRBUS INDUSTRIE   182    NA
## # ... with 3,318 more rows

It is possible to process even very complicated tasks using select() like selecting columns according to specific key or reordering columns. This and lot more can be done with special functions called select helpers:

starts_with(), ends_with(), contains(), matches() allows user to select columns according to some key. For example one can select all columns from weather related to wind.

select(.data = weather, contains("wind"))

## # A tibble: 26,130 × 3
##   wind_dir wind_speed wind_gust
##      <dbl>      <dbl>     <dbl>
## 1      230   10.35702  11.91865
## 2      230   13.80936  15.89154
## 3      230   12.65858  14.56724
## 4      230   13.80936  15.89154
## # ... with 26,126 more rows

when programming the names of columns needed are often stored in character vector. In this case you can use one_of() helper.

vars <- c("temp","humid","wind_speed")
select(.data = weather, one_of(vars))

## # A tibble: 26,130 × 3
##    temp humid wind_speed
##   <dbl> <dbl>      <dbl>
## 1 37.04 53.97   10.35702
## 2 37.04 53.97   13.80936
## 3 37.94 52.09   12.65858
## 4 37.94 54.51   13.80936
## # ... with 26,126 more rows

in the case of missing header or numeric column names in source data file columns are usually named automatically by combination of letter and number – e.g. V1 or X2008. To select those columns one can use num_range().

# Create dummy dataset with columns named V*
df <- as.data.frame(matrix(runif(100), nrow = 10)) %>% as_data_frame()
select(.data = df, num_range("V", c(1,4:6)))

## # A tibble: 10 × 4
##           V1        V4        V5         V6
##        <dbl>     <dbl>     <dbl>      <dbl>
## 1 0.65959353 0.3566706 0.5649707 0.45155619
## 2 0.08093357 0.0151294 0.3492659 0.41551914
## 3 0.42322792 0.5145382 0.4412013 0.87910739
## 4 0.89932347 0.9216651 0.9020654 0.06002039
## # ... with 6 more rows

For the clarity is sometimes useful to change ordering of columns – e.g. to put ID column on the first position. everything() can help with it.

select(.data = weather, time_hour, everything())

## # A tibble: 26,130 × 15
##             time_hour origin  year month   day  hour  temp  dewp humid
##                <dttm>  <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>
## 1 2013-01-01 01:00:00    EWR  2013     1     1     0 37.04 21.92 53.97
## 2 2013-01-01 02:00:00    EWR  2013     1     1     1 37.04 21.92 53.97
## 3 2013-01-01 03:00:00    EWR  2013     1     1     2 37.94 21.92 52.09
## 4 2013-01-01 04:00:00    EWR  2013     1     1     3 37.94 23.00 54.51
## # ... with 26,126 more rows, and 6 more variables: wind_dir <dbl>,
## #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>

rename()

dplyr provides specialized version of select(): rename(). rename() changes names of selected columns and keeps others in the data set.

select(.data = planes, time=year)

## # A tibble: 3,322 × 1
##    time
##   <int>
## 1  2004
## 2  1998
## 3  1999
## 4  1999
## # ... with 3,318 more rows

rename(.data = planes, time=year)

## # A tibble: 3,322 × 9
##   tailnum  time                    type     manufacturer     model engines
##     <chr> <int>                   <chr>            <chr>     <chr>   <int>
## 1  N10156  2004 Fixed wing multi engine          EMBRAER EMB-145XR       2
## 2  N102UW  1998 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
## 3  N103US  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
## 4  N104UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
## # ... with 3,318 more rows, and 3 more variables: seats <int>,
## #   speed <int>, engine <chr>

Create new columns with mutate()

Function mutate(.data,...) adds new variables according to specifications given in ... and preserves existing. See example – conversion of temp from weather table from Celsius to Fahrenheit:

mutate(weather,
       temp_f = temp*9/5 + 32
       ) %>% 
  # Just to make outcome pretty -- by default mutate() adds new columns at the end.
  select(
    temp, temp_f, everything()
  )

## # A tibble: 26,130 × 16
##    temp  temp_f origin  year month   day  hour  dewp humid wind_dir
##   <dbl>   <dbl>  <chr> <dbl> <dbl> <int> <int> <dbl> <dbl>    <dbl>
## 1 37.04  98.672    EWR  2013     1     1     0 21.92 53.97      230
## 2 37.04  98.672    EWR  2013     1     1     1 21.92 53.97      230
## 3 37.94 100.292    EWR  2013     1     1     2 21.92 52.09      230
## 4 37.94 100.292    EWR  2013     1     1     3 23.00 54.51      230
## # ... with 26,126 more rows, and 6 more variables: wind_speed <dbl>,
## #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>

transmute()

dplyr contains even function transmute(.data,...) which drops existing variables:

transmute(weather,
       temp_f = temp*9/5 + 32
       )

## # A tibble: 26,130 × 1
##    temp_f
##     <dbl>
## 1  98.672
## 2  98.672
## 3 100.292
## 4 100.292
## # ... with 26,126 more rows

Summarise values in columns with summarise()

Function summarise(.data,...) summaries multiple values to a single value. It means that it allows a user to apply a function on selected columns and get one row outcome. For example one may wonder about average temperature, humidity, and maximum wind_speed in 2013:

summarise(.data = weather,
          av_temp = mean(temp, na.rm = TRUE),
          av_humid = mean(humid, na.rm = TRUE),
          max_wspeed = max(wind_speed, na.rm = TRUE)
          )

## # A tibble: 1 × 3
##    av_temp av_humid max_wspeed
##      <dbl>    <dbl>      <dbl>
## 1 55.20351 62.34732   1048.361

Summarise and mutate multiple columns

Above described uses of summarise() and mutate() require manual specification of all rules – e.g. if you want to know averages of all weather characteristics then you would have to write a line for each of them. This is not really comfortable. However dplyr contains functions summarise_all(), summarise_if(), summarise_at() and their mutate counterparts mutate_all(), mutate_if(), and mutate_at() which make the task simple.

*_all()

Functions summarise_all(.tbl,.funs,...) and mutate_all(.tbl, .funs, ...) allows to apply a function to all columns. For example one might want to have square roots of all columns and mean values:

weather %>% select(-origin,-time_hour) %>% mutate_all(sqrt)

## Warning in `__dplyr_colwise_fun`(c(21.92, 21.92, 21.92, 23, 24.08, 26.06, :
## NaNs produced

## # A tibble: 26,130 × 13
##       year month   day     hour     temp     dewp    humid wind_dir
##      <dbl> <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 44.86647     1     1 0.000000 6.086050 4.681880 7.346428 15.16575
## 2 44.86647     1     1 1.000000 6.086050 4.681880 7.346428 15.16575
## 3 44.86647     1     1 1.414214 6.159545 4.681880 7.217340 15.16575
## 4 44.86647     1     1 1.732051 6.159545 4.795832 7.383089 15.16575
## # ... with 26,126 more rows, and 5 more variables: wind_speed <dbl>,
## #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>

weather %>% summarise_all(mean)

## Warning in mean.default(c("EWR", "EWR", "EWR", "EWR", "EWR", "EWR",
## "EWR", : argument is not numeric or logical: returning NA

## # A tibble: 1 × 15
##   origin  year    month      day     hour  temp  dewp humid wind_dir
##    <dbl> <dbl>    <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1     NA  2013 6.505741 15.67972 11.51841    NA    NA    NA       NA
## # ... with 6 more variables: wind_speed <dbl>, wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>

We can see two problems there. The first is, that function return NA when at least one observation is missing. To change this behavior using mean() arguments written in ...:

weather %>% summarise_all(mean, na.rm = TRUE)

## Warning in mean.default(c("EWR", "EWR", "EWR", "EWR", "EWR", "EWR",
## "EWR", : argument is not numeric or logical: returning NA

## # A tibble: 1 × 15
##   origin  year    month      day     hour     temp    dewp    humid
##    <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
## 1     NA  2013 6.505741 15.67972 11.51841 55.20351 41.3854 62.34732
## # ... with 7 more variables: wind_dir <dbl>, wind_speed <dbl>,
## #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>

The second problem is related to datatypes of columns which are incompatible with functions applied. To fix it we can use summarise_if(.tbl, .predicate, .funs,...) and mutate_if(.tbl, .predicate, .funs,...) which apply a function only if condition given in .predicate is true. In this case we want to process only numeric values:

weather %>% mutate_if(is.numeric,sqrt)

## Warning in `__dplyr_colwise_fun`(c(21.92, 21.92, 21.92, 23, 24.08, 26.06, :
## NaNs produced

## # A tibble: 26,130 × 15
##   origin     year month   day     hour     temp     dewp    humid wind_dir
##    <chr>    <dbl> <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1    EWR 44.86647     1     1 0.000000 6.086050 4.681880 7.346428 15.16575
## 2    EWR 44.86647     1     1 1.000000 6.086050 4.681880 7.346428 15.16575
## 3    EWR 44.86647     1     1 1.414214 6.159545 4.681880 7.217340 15.16575
## 4    EWR 44.86647     1     1 1.732051 6.159545 4.795832 7.383089 15.16575
## # ... with 26,126 more rows, and 6 more variables: wind_speed <dbl>,
## #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>

weather %>% summarise_if(is.numeric,mean,na.rm = TRUE)

## # A tibble: 1 × 13
##    year    month      day     hour     temp    dewp    humid wind_dir
##   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>    <dbl>
## 1  2013 6.505741 15.67972 11.51841 55.20351 41.3854 62.34732 198.0667
## # ... with 5 more variables: wind_speed <dbl>, wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>

The last two functions summarise_at(.tbl,.cols,.funs,...) and mutate_at(.tbl,.cols,.funs,...) apply a function to the selection of columns specified in .cols. You can use the same specifications as in select(). For example we can summarize only columns with wind characteristics:

weather %>% summarise_at(vars(starts_with("wind")),mean,na.rm = TRUE)

## # A tibble: 1 × 3
##   wind_dir wind_speed wind_gust
##      <dbl>      <dbl>     <dbl>
## 1 198.0667   10.39587  11.96336

Grouped operations

Basic verbs are quite useful by themselves but they do not really add a novel functionality to basic R (see example above). However they become a very powerful tool when used in so called grouped operations. dplyr allows user to group observations to mutually exclusive groups and perform operation separately for each group. A example above presented using summarise() for getting average temperature, humidity, and maximum wind speed. The statistics obtained however described average and maximum values for all airports together – resulting statistics therefore had no useful meaning. We would be probably more interested in summarizing data for each airport individually. We can do so using group_by():

weather %>% 
  group_by(origin) %>% 
  summarise(
          av_temp = mean(temp, na.rm = TRUE),
          av_humid = mean(humid, na.rm = TRUE),
          max_wspeed = max(wind_speed, na.rm = TRUE)
          )

## # A tibble: 3 × 4
##   origin  av_temp av_humid max_wspeed
##    <chr>    <dbl>    <dbl>      <dbl>
## 1    EWR 55.48703 62.86192 1048.36058
## 2    JFK 54.42183 65.04412   42.57886
## 3    LGA 55.70181 59.13616   40.27730

Now, the output contains values summarized for each group defined by variable origin. A group can be defined even by combination of variables:

weather %>% 
  group_by(origin,month) %>% 
  summarise(
          av_temp = mean(temp, na.rm = TRUE),
          av_humid = mean(humid, na.rm = TRUE),
          max_wspeed = max(wind_speed, na.rm = TRUE)
          )

## Source: local data frame [36 x 5]
## Groups: origin [?]
## 
##   origin month  av_temp av_humid max_wspeed
##    <chr> <dbl>    <dbl>    <dbl>      <dbl>
## 1    EWR     1 35.56318 61.94838   40.27730
## 2    EWR     2 34.16027 62.98858 1048.36058
## 3    EWR     3 40.03386 57.60311   29.92028
## 4    EWR     4 52.90100 54.40324   25.31716
## # ... with 32 more rows

Note that grouping variables (de facto group IDs) are always presented in the output. The output also remains grouped. See what it means:

weather  %<>% group_by(origin)
print(weather)

## Source: local data frame [26,130 x 15]
## Groups: origin [3]
## 
##   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
## 1    EWR  2013     1     1     0 37.04 21.92 53.97      230   10.35702
## 2    EWR  2013     1     1     1 37.04 21.92 53.97      230   13.80936
## 3    EWR  2013     1     1     2 37.94 21.92 52.09      230   12.65858
## 4    EWR  2013     1     1     3 37.94 23.00 54.51      230   13.80936
## # ... with 26,126 more rows, and 5 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>

weather %>% 
  summarise(
          av_temp = mean(temp, na.rm = TRUE),
          av_humid = mean(humid, na.rm = TRUE),
          max_wspeed = max(wind_speed, na.rm = TRUE)
          )

## # A tibble: 3 × 4
##   origin  av_temp av_humid max_wspeed
##    <chr>    <dbl>    <dbl>      <dbl>
## 1    EWR 55.48703 62.86192 1048.36058
## 2    JFK 54.42183 65.04412   42.57886
## 3    LGA 55.70181 59.13616   40.27730

It might by convenient in some cases to remove grouping from a data.frame:

weather %<>% ungroup()
print(weather)

## # A tibble: 26,130 × 15
##   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
## 1    EWR  2013     1     1     0 37.04 21.92 53.97      230   10.35702
## 2    EWR  2013     1     1     1 37.04 21.92 53.97      230   13.80936
## 3    EWR  2013     1     1     2 37.94 21.92 52.09      230   12.65858
## 4    EWR  2013     1     1     3 37.94 23.00 54.51      230   13.80936
## # ... with 26,126 more rows, and 5 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>

dplyr contains a special grouping function rowwise() which creates a group for each row.

Two-tables verbs

Data for your research are only rarely already compiled into one data set. You need to do this part by your own very often. dplyr provides functions for joining two data sets. It currently supports following join types (see help):

Mutating joins

Mutating joins are joins that returns columns from both data sets x and y:

left_join(): return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
right_join(): return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
inner_join(): return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.
full_join(): return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

Filtering joins

Filtering joins return columns only from x:

semi_join(): return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
anti_join(): return all rows from x where there are not matching values in y, keeping just columns from x.

See some examples: One may wonder who has the biggest planes on the sky of NYC. Data needed are spread across three tables: flights (tail numbers and carriers codes), planes (tail numbers and number of seats), and airlines (carrier codes and names).

It is essencial that all three tables contain a key which can be used to match observations.

First we can join flights with planes – i.e. we add technical parameters of airplanes to every single flight.

left_join(flights, planes, by = "tailnum") %>% 
  left_join(.,airlines, by = "carrier") -> joined_tables

The syntax of joining functions is simple. The first two slots contains tables to join (order matters in this case!). The key used for matching rows from x and y is defined by argument by:

if NULL (default) than columns with identical names are used for matching
if character vector than columns with listed names are used for matching (see example above)
to join tables using variables with different names it is possible to use a named vector: by = c("a" = "b", "c" = "d"), where a and c are column names from table x and b and d from y.

To get results on the biggest planes question we need to filter data a bit. Table flights contains all individual flights from NYC airports therefore one plane is recorded many times and so does the resulting table joined_tables. It would bias our results. Therefore in the first step we need to get a unique set of airplanes identifiers (tail numbers):

joined_tables %>% 
  distinct(tailnum, .keep_all = TRUE) %>% 
  # Now we need to get an average speed for each carrier. First we need to group the data by carrier name.
  group_by(name) %>% 
  # Now we can take average speeds
  summarise(
    av_seats = mean(seats, na.rm = TRUE)
  ) %>% 
  # And sort it according to av_speed
  arrange(
    desc(av_seats)
    ) %>% 
  # And get top three
  slice(1:3)

## # A tibble: 3 × 2
##                     name av_seats
##                    <chr>    <dbl>
## 1 Hawaiian Airlines Inc. 377.0000
## 2        US Airways Inc. 205.7687
## 3  United Air Lines Inc. 194.4013

…and biggest planes award goes to Hawaiian Airlines Inc.

Set operations (setops) and binding

…from R Studio Cheat Sheet

Joining multiple tables with Reduce()

Join functions in dplyr are truly two-tables verbs because they do not support joining more then two tables in one call. If this feature is needed one can use Reduce() from base package. Reduce() uses a binary function to successively combine the elements of a given vector. We can demonstrate it using left_join() problem from the first example above:

Reduce(left_join, list(flights,planes,airlines))

## Joining, by = c("year", "tailnum")

## Joining, by = "carrier"

## # A tibble: 336,776 × 27
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## 3  2013     1     1      542            540         2      923
## 4  2013     1     1      544            545        -1     1004
## # ... with 336,772 more rows, and 20 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, type <chr>, manufacturer <chr>,
## #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>,
## #   name <chr>

Reduce() progress from left to right (it can be changed) and successively applies given function. It is equivalent to chaining multiple successive calls of left_join():

Reduce(left_join, list(a,b,c,d,...))
left_join(a,b) %>% left_join(.,c) %>% left_join(.,d) %>% ...

Window functions

A window function is a variation on an aggregation function. Where an aggregation function, like sum() and mean(), takes n inputs and return a single value, a window function returns n values. dplyr contains three families of window functions:

ranking functions
offset functions
cumulative aggregates

See some examples.

Ranking functions

planes %>% 
  mutate(
    seats_rnum = row_number(seats), 
    seats_decile = ntile(seats, n = 10),
    seats_mrank = min_rank(seats),
    seats_drank = dense_rank(seats),
    seats_prank = percent_rank(seats),
    seats_cdist = cume_dist(seats)
  ) %>% 
  select(starts_with("seats"))

## # A tibble: 3,322 × 7
##   seats seats_rnum seats_decile seats_mrank seats_drank seats_prank
##   <int>      <int>        <int>       <int>       <int>       <dbl>
## 1    55        123            1         123          15  0.03673592
## 2   182       2410            8        2410          30  0.72538392
## 3   182       2411            8        2410          30  0.72538392
## 4   182       2412            8        2410          30  0.72538392
## # ... with 3,318 more rows, and 1 more variables: seats_cdist <dbl>

Offset functions

flights %>%
  group_by(flight) %>% 
  mutate(
    lead_arr_delay = lead(arr_delay, n = 1L),
    lag_arr_delay = lag(arr_delay, n = 1L)
  ) %>% 
  select(ends_with("arr_delay")) %>% 
  arrange(flight)

## Adding missing grouping variables: `flight`

## Source: local data frame [336,776 x 4]
## Groups: flight [3,844]
## 
##   flight arr_delay lead_arr_delay lag_arr_delay
##    <int>     <dbl>          <dbl>         <dbl>
## 1      1         6             29            NA
## 2      1        29              5             6
## 3      1         5             24            29
## 4      1        24            -36             5
## # ... with 336,772 more rows

Note that flight is in in the output even if it is not selected by select()! Grouping variables remain in selection automatically.

Cumulative aggregates

flights %>%
  group_by(flight) %>% 
  mutate(
    cummean_arr_delay = cummean(arr_delay)
  ) %>% 
  select(ends_with("arr_delay")) %>% 
  arrange(flight)

## Adding missing grouping variables: `flight`

## Source: local data frame [336,776 x 3]
## Groups: flight [3,844]
## 
##   flight arr_delay cummean_arr_delay
##    <int>     <dbl>             <dbl>
## 1      1         6           6.00000
## 2      1        29          17.50000
## 3      1         5          13.33333
## 4      1        24          16.00000
## # ... with 336,772 more rows

Exercises

OECD data: Aggregation of monthly data

In the first exercise we will process data on unemployment rate from OECD database. File OECD_STLABOUR_monthly.csv contains seasonally adjusted monthly unemployment rates for numerous population groups in OECD countries. We would like to get yearly unemployment rates for total population 15+ (series LRUNTTTT), males (LRUNTTMA), and females (LRUNTTFE).

read_csv("data/OECD_STLABOUR_monthly.csv") -> oecd
print(oecd)

## # A tibble: 18,210 × 19
##   `"LOCATION"`   Country  SUBJECT                                Subject
##          <chr>     <chr>    <chr>                                  <chr>
## 1          AUS Australia LRUN24FE Unemployment rate, Aged 15-24, Females
## 2          AUS Australia LRUN24FE Unemployment rate, Aged 15-24, Females
## 3          AUS Australia LRUN24FE Unemployment rate, Aged 15-24, Females
## 4          AUS Australia LRUN24FE Unemployment rate, Aged 15-24, Females
## # ... with 18,206 more rows, and 15 more variables: MEASURE <chr>,
## #   Measure <chr>, FREQUENCY <chr>, Frequency <chr>, TIME <chr>,
## #   Time <chr>, `Unit Code` <chr>, Unit <chr>, `PowerCode Code` <int>,
## #   PowerCode <chr>, `Reference Period Code` <chr>, `Reference
## #   Period` <chr>, Value <dbl>, `Flag Codes` <chr>, Flags <chr>

OECD data comes with a lot of information. In this case we need just country code, variable code, value, and time.

oecd %<>% 
  select(country = 1, 
         variable = SUBJECT, 
         value = Value, 
         time = Time) 
print(oecd)

## # A tibble: 18,210 × 4
##   country variable    value     time
##     <chr>    <chr>    <dbl>    <chr>
## 1     AUS LRUN24FE 12.16585 Jan-2000
## 2     AUS LRUN24FE 12.43789 Feb-2000
## 3     AUS LRUN24FE 11.52670 Mar-2000
## 4     AUS LRUN24FE 11.51421 Apr-2000
## # ... with 18,206 more rows

In the call used select() utilizes both standard ways to select a column – number (in the case of ugly column name) and names.

Data set contains 1 different unemployment rates. We are interested only in three of them:

oecd %<>% 
  filter(variable %in% c("LRUNTTTT","LRUNTTMA","LRUNTTFE"))
print(oecd)

## # A tibble: 3,954 × 4
##   country variable    value     time
##     <chr>    <chr>    <dbl>    <chr>
## 1     AUS LRUNTTFE 6.670316 Jan-2000
## 2     AUS LRUNTTFE 6.740396 Feb-2000
## 3     AUS LRUNTTFE 6.576782 Mar-2000
## 4     AUS LRUNTTFE 6.341064 Apr-2000
## # ... with 3,950 more rows

Now it is time to make it tidy:

oecd %<>% 
  spread(variable,value)
print(oecd)

## # A tibble: 1,320 × 5
##   country     time LRUNTTFE LRUNTTMA LRUNTTTT
## *   <chr>    <chr>    <dbl>    <dbl>    <dbl>
## 1     AUS Apr-2000 6.341064 6.400338 6.374269
## 2     AUS Apr-2001 6.509435 7.016913 6.791216
## 3     AUS Apr-2002 5.889289 6.709898 6.347627
## 4     AUS Apr-2003 5.850319 6.171773 6.027655
## # ... with 1,316 more rows

We want to aggregated monthly observations. For the purpose of grouping it useful to store year and month in different columns:

oecd %<>% 
  separate(time, c("month","year"), sep="-", convert = TRUE)
print(oecd)

## # A tibble: 1,320 × 6
##   country month  year LRUNTTFE LRUNTTMA LRUNTTTT
## *   <chr> <chr> <int>    <dbl>    <dbl>    <dbl>
## 1     AUS   Apr  2000 6.341064 6.400338 6.374269
## 2     AUS   Apr  2001 6.509435 7.016913 6.791216
## 3     AUS   Apr  2002 5.889289 6.709898 6.347627
## 4     AUS   Apr  2003 5.850319 6.171773 6.027655
## # ... with 1,316 more rows

Now we can take averages for years:

oecd %<>% 
  # We want to get an average for a country-year pairs. Therefore we need to make groups along these variables.
  group_by(country,year) %>% 
  # And we want to take averages only for numeric columns.
  summarise_if(is.numeric, mean, na.rm=T)
print(oecd)

## Source: local data frame [110 x 5]
## Groups: country [?]
## 
##   country  year LRUNTTFE LRUNTTMA LRUNTTTT
##     <chr> <int>    <dbl>    <dbl>    <dbl>
## 1     AUS  2000 6.030742 6.465302 6.273984
## 2     AUS  2001 6.455428 7.007580 6.763034
## 3     AUS  2002 6.148189 6.534425 6.363390
## 4     AUS  2003 5.960969 5.894895 5.924274
## # ... with 106 more rows

WIID data: Choose specific observation for each group

Inequality is hot topic. Because… you know… it is huge and growing beyond that. If you want to let data speak you will most likely need WIID data set.

The WIID database in 3.3 version is available in WIID_ver3_sept15_0.tsv (tab separated values).

read_tsv("data/WIID_ver3_sept15_0.tsv") -> wiid
print(wiid)

## # A tibble: 7,153 × 53
##       Country Countrycode3 Countrycode2  Year      POP         Incomegroup
##         <chr>        <chr>        <chr> <int>    <dbl>               <chr>
## 1 Afghanistan          AFG           AF  2007       NA          Low income
## 2 Afghanistan          AFG           AF  2008       NA          Low income
## 3     Albania          ALB           AL  1996 3.112597 Upper middle income
## 4     Albania          ALB           AL  1997 3.091902 Upper middle income
## # ... with 7,149 more rows, and 47 more variables: Mean <dbl>,
## #   Median <dbl>, Gini <dbl>, Source <chr>, D1 <dbl>, D2 <dbl>, D3 <dbl>,
## #   D4 <dbl>, D5 <dbl>, D6 <dbl>, D7 <dbl>, D8 <dbl>, D9 <dbl>, D10 <dbl>,
## #   Q1 <dbl>, Q2 <dbl>, Q3 <dbl>, Q4 <dbl>, Q5 <dbl>, P5 <dbl>, P95 <dbl>,
## #   Revision <chr>, Quality <chr>, Source_Comments <chr>, Region <chr>,
## #   Currency <chr>, Reference_period <chr>, Welfaredefn <chr>,
## #   IncSharU <chr>, UofAnala <chr>, Equivsc <chr>, AreaCovr <chr>,
## #   PopCovr <chr>, AgeCovr <chr>, EU <chr>, OECD <chr>, Year_cat <chr>,
## #   Welfaredefn_new <chr>, Equivsc_new <chr>, AgeCovr_new <chr>,
## #   AreaCovr_new <chr>, PopCovr_new <chr>, IncSharU_new <chr>,
## #   UofAnala_new <chr>, Exchangerate <dbl>, Mean_usd <dbl>,
## #   Median_usd <dbl>

WIID aggregates many data sets on income inequality with many indicators. We will focus on Gini coefficient (Gini). Each observation is rated according its quality (Quality) from High (best) to Not known (worst). To make our lives easier we will restrain the data set on country and year identifiers and Gini and Quality ratings.

wiid %<>% select(
  Country = Countrycode3, # ISO alpha3 country codes
  Year = Year,
  gini = Gini,
  quality = Quality
)
# Lets take a look at Australia in 1978
wiid %>% 
  filter(
    Country == "AUS",
    Year == 1978
  )

## # A tibble: 3 × 4
##   Country  Year  gini quality
##     <chr> <int> <dbl>   <chr>
## 1     AUS  1978  38.1     Low
## 2     AUS  1978  26.1 Average
## 3     AUS  1978  18.0 Average

…and it is apparent that we have a problem. There are multiple observations for country-year pairs and even for country-year-quality. Researcher might want to use the best quality available for each country-year pair and if there are more of the same quality then take an average.

In the first step we will create averages for unique country-year-quality combinations:

wiid %<>% 
  group_by(Country,Year,quality) %>% 
  summarise(
    gini = mean(gini, na.rm = TRUE)
  ) %>% ungroup
# See...
wiid %>% 
  filter(
    Country == "AUS",
    Year == 1978
  )

## # A tibble: 2 × 4
##   Country  Year quality  gini
##     <chr> <int>   <chr> <dbl>
## 1     AUS  1978 Average 22.05
## 2     AUS  1978     Low 38.10

In the second step we would like to choose the best observation available. The quality has following values:

wiid %>% distinct(quality)

## # A tibble: 4 × 1
##     quality
##       <chr>
## 1      High
## 2   Average
## 3       Low
## 4 Not known

The strategy is following. Order rows for each country-year pair by quality (best goes first,…) and get only the first row! Easy to say, harder to do. The best option is to convert character vector quality into ordered factor.

# Before:
wiid %>% distinct(quality) %>% arrange(quality)

## # A tibble: 4 × 1
##     quality
##       <chr>
## 1   Average
## 2      High
## 3       Low
## 4 Not known

# Converting into factor:
wiid$quality %<>% factor(levels = c("High","Average","Low","Not known"))
# After:
wiid %>% distinct(quality) %>% arrange(quality)

## # A tibble: 4 × 1
##     quality
##      <fctr>
## 1      High
## 2   Average
## 3       Low
## 4 Not known

Now we can do the rest:

wiid %>% 
  # at first it is wise to remove rows with NAs
  filter(!is.na(gini)) %>% 
  # we are looking for the best observation for country-year pair
  group_by(Country,Year) %>% 
  # order by quality
  arrange(quality) %>% 
  # and now we take the first row
  slice(1L) -> wiid_0

There is an alternative way of doing the last step without grouping:

wiid %>% 
  filter(!is.na(gini)) %>%
  arrange(quality) %>% 
  distinct(Country, Year, .keep_all = TRUE) -> wiid_1

Results are the same:

setdiff(wiid_0,wiid_1)

## Source: local data frame [0 x 4]
## Groups:  [?]
## 
## # ... with 4 variables: Country <chr>, Year <int>, quality <fctr>,
## #   gini <dbl>

Joining datasets

One may wonder if there is a correlation between unemployment and income inequality. In order to investigate it we would need to join WIID data on inequality and OECD data on unemployment.

wiid_0 %>% 
  # We don't need information on quality therefore we can drop it
  select(-quality) %>% 
  # Notice that joins are sensitive to inputs order -- you can place "." on the position where table going through the pipe should be.
  # The names of key columns are not identical. Therefore we need to use named vector describing matching rules.
  left_join(oecd,., 
            by = c("country"="Country",
                   "year"="Year")) -> oecd_wiid
print(oecd_wiid)

## Source: local data frame [110 x 6]
## Groups: country [?]
## 
##   country  year LRUNTTFE LRUNTTMA LRUNTTTT   gini
##     <chr> <int>    <dbl>    <dbl>    <dbl>  <dbl>
## 1     AUS  2000 6.030742 6.465302 6.273984 38.775
## 2     AUS  2001 6.455428 7.007580 6.763034 31.100
## 3     AUS  2002 6.148189 6.534425 6.363390 30.900
## 4     AUS  2003 5.960969 5.894895 5.924274 31.200
## # ... with 106 more rows

We have successfully joined the data and now we can compute correlation coefficient. It is 0.32. In R we can also make pictures…

…but we will talk about that later. (Think about regressions with and without FE.)

Homework

Czech Statistical Office publishes detailed statistics on population changes at municipal level. The database is divided into multiple tables in the same structure (one for a county/“okres”). Each table has following structure:

## # A tibble: 135 × 19
##    year city_id      name   est pop_b   nat  mort immig  emig pch_nat
##   <chr>   <chr>     <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1  1971  210001 Běchovice     -  1859    17    35    75    97     -18
## 2  1972  210001 Běchovice     -  1819    26    26   103    63      NA
## 3  1973  210001 Běchovice     -  1859    26    28    59    75      -2
## 4  1971  210002    Benice     -   336     9     4     6    15       5
## # ... with 131 more rows, and 9 more variables: pch_mig <dbl>,
## #   pch_tot <dbl>, area_ch1 <dbl>, pop_e <dbl>, area_ch2 <dbl>,
## #   from <chr>, from_name <chr>, where <chr>, where_name <chr>

For the purpose of the homework are important following columns:

year – year
city_id – ID of the city
pop_b – population at the beginning of the year
pch_mig – net absolute migration (number of persons)

Data are available in the file HW_CZSO_migration.Rdata in the variable CZSO which is a list of individual tables (data frames). You are supposed to:

join list of data frames into one big table (data frame)
compute net migration rate (to column mrate) as 100*pch_mig/pop_b
compute average migration rate (to column amrate) and average population size (from pop_b) for each city_id for the period 1991–2010 (incl. margin years), ignore missing values

Assign your solution to the variable czdata. Your solution should look like following table:

## # A tibble: 6,283 × 3
##   city_id   amrate      apop
##     <chr>    <dbl>     <dbl>
## 1  564982 14.71920 290.50000
## 2  546518 13.91694  62.94737
## 3  564907 13.78374 170.35000
## 4  517321 13.47592 207.25000
## # ... with 6,279 more rows

Hints:

You may find helpful to use some regular expressions
Carefuly check all names and classes.

Good luck!