This handout heavily borrows from dplyr (dplyr-associated) vignettes available at https://github.com/hadley/dplyr/tree/master/vignettes
Empirical work requires knowledge of data manipulation – it includes data filtering, modification, joining etc. Basic R provides sufficient functionality but those tasks can be handled more elegantly and effectively using tools from dplyr
package. Dplyr is the best package for manipulation of small and middle size data set. (I work with data set of millions of observations with instant response on daily basis.) Big data are bit beyond powers of dplyr and require specialized tools (database systems).
The lecture is organized to following sections which covers most tasks needed in empirical work:
dplyr
using so called two-table verbs.Topics covered in the lecture are not a complete list of dplyr
capabilities – e.g. (SQL) database functions are not covered at all. dplyr
is a package under quite heavy development with new functions popping out every few months.
dplyr
is new completely re-implemented evolution ofplyr
package. If you need to load both packages (e.g. because of dependencies) make sure that you loadplyr
first!
Let’s do it…
library(dplyr)
For dplyr
demonstrations we will use data set from nycflights13
package which includes data on flights which departed New York in 2013.
library(nycflights13)
Package contains following tables:
airlines
with airlines codes and names (carrier
– two letter abbreviation, name
– full name)## # A tibble: 16 × 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## # ... with 14 more rows
airports
with airports metadata (faa
– FAA airport code, name
– name of airport, lon
– longitude, lat
– latitude, alt
– altitude in feet, tz
– time zone from GMT, dst
– daylight savings time zone)## # A tibble: 1,396 × 7
## faa name lat lon alt tz dst
## <chr> <chr> <dbl> <dbl> <int> <dbl> <chr>
## 1 04G Lansdowne Airport 41.13047 -80.61958 1044 -5 A
## 2 06A Moton Field Municipal Airport 32.46057 -85.68003 264 -5 A
## # ... with 1,394 more rows
flights
with data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013 (year
, month
, day
– date of departure, dep_time
, arr_time
– actual time of arrival and departure (local tz
), sched_dep_time
, sched_arr_time
– scheduled arrival and departure times (local tz
), dep_delay
, arr_delay
– departure and arrival delays in minutes, hour
, minute
– time of scheduled departure broken into hour and minutes, carrier
– two letter abbreviation, tailnum
– plane tail number, flight
– flight number, origin
, dest
– FAA codes, air_time
– amount of time spent in the air, distance
– distance flown, time_hour
– scheduled date and hour of the flight as a POSIXct date)## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## # ... with 3.368e+05 more rows, and 12 more variables:
## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
planes
with plane metadata for all plane tail numbers found in the FAA aircraft registry. (tailnum
– tail number, year
– year manufactured, type
– type of plane, manufacturer
, model
, enignes
– number of engines, seats
– number of seats, speed
– average cruising speed in mph, engine
– type of engine)## # A tibble: 3,322 × 9
## tailnum year type manufacturer model engines
## <chr> <int> <chr> <chr> <chr> <int>
## 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2
## 2 N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## # ... with 3,320 more rows, and 3 more variables: seats <int>,
## # speed <int>, engine <chr>
weather
with hourly meteorological data for LGA, JFK and EWR (origin
– weather station, time_hour
– scheduled date and hour of the flight as a POSIXct date, the rest is self-explanatory)## # A tibble: 26,130 × 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 0 37.04 21.92 53.97 230 10.35702
## 2 EWR 2013 1 1 1 37.04 21.92 53.97 230 13.80936
## # ... with 2.613e+04 more rows, and 5 more variables: wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
A basic operation in data handling is a selection of observations (rows) which meet specific condition from a data set. This functionality is delivered by function filter()
. For example we can filter planes from planes
table which meets following conditions:
filter(.data = planes,
engines == 2,
!is.na(year),
year>=2000 & year<=2005,
!(manufacturer %in% c("EMBRAER","BOEING"))
)
## # A tibble: 556 × 9
## tailnum year type manufacturer model engines
## <chr> <int> <chr> <chr> <chr> <int>
## 1 N117UW 2000 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 2 N118US 2000 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 3 N119US 2000 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 4 N121UW 2000 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## # ... with 552 more rows, and 3 more variables: seats <int>, speed <int>,
## # engine <chr>
The syntax simple: filter(.data,...)
where ...
represents a sequence of logical predicates.
Note that
filter()
, as well as otherdplyr
functions, uses non-standard evaluation. The versionfilter_()
for standard evaluation is also available. (See handout fortidyr
.)
Identical result can be obtained using more verbose code of standard R:
planes[planes$engines == 2 &
!is.na(planes$year) &
planes$year >= 2000 &
planes$year <= 2005 &
planes$manufacturer != "EMBRAER" &
planes$manufacturer != "BOEING",]
## # A tibble: 556 × 9
## tailnum year type manufacturer model engines
## <chr> <int> <chr> <chr> <chr> <int>
## 1 N117UW 2000 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 2 N118US 2000 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 3 N119US 2000 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 4 N121UW 2000 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## # ... with 552 more rows, and 3 more variables: seats <int>, speed <int>,
## # engine <chr>
Argument of filter()
a sequence of logical predicates. Sometimes (rarely) is useful filter rows using their row numbers. filter()
is unable to process row numbers as arguments. In this case we can use slice(.data,...)
with similar syntax, where ...
represents row numbers (integers). Assume that (for some reason) we want to filter only even rows:
slice(.data = planes, seq(from = 2, to = nrow(planes), by = 2))
## # A tibble: 1,661 × 9
## tailnum year type manufacturer model engines
## <chr> <int> <chr> <chr> <chr> <int>
## 1 N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 2 N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 3 N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 4 N108UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## # ... with 1,657 more rows, and 3 more variables: seats <int>,
## # speed <int>, engine <chr>
Hack of the day:
filter()
is a quite usual name for a function. For example apart ofdplyr
you can find a function of the same name in packagestats
. If you need to call specific function you can do it usingpackage::function
notation (e.g.dplyr::filter()
). With this notation you don’t need the package to be loaded!
In some cases a data set contains identical rows. If we want to get rid of duplication we can use for example unique()
or considerably faster distinct(.data, ..., .keep_all = FALSE)
from dplyr:
distinct(.data = flights, tailnum, .keep_all = TRUE)
## # A tibble: 4,044 × 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## # ... with 4,040 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Arguments:
.data
– a data.frame (tbl)...
– optional variables to use when determining uniqueness. If omitted, will use all variables..keep_all
– if TRUE
then all variables are kept in the output. (Default value is FALSE
.)The example shows that table flights
with 336776 observations contains only 4044 unique tail numbers (i.e. planes). If .keep_all=TRUE
is used then distinct()
returns values for the first observation of the tail number.
We can also use distinct()
with .keep_all=FALSE
to get all observed combinations of some variables:
distinct(.data = flights, origin, dest, .keep_all = FALSE)
## # A tibble: 224 × 2
## origin dest
## <chr> <chr>
## 1 EWR IAH
## 2 LGA IAH
## 3 JFK MIA
## 4 JFK BQN
## # ... with 220 more rows
For example when bootstrapping you need to randomly sample the observations (rows). You can use functions sample_n()
for absolute sample size and sample_frac()
for relative sample size. Both functions are wrappers around sample.int()
from base
and work with its arguments (i.e. weight
and replace
).
With arrange(.data,...)
dplyr
provides function for sorting rows according to one or multiple columns:
arrange(.data = planes, manufacturer)
## # A tibble: 3,322 × 9
## tailnum year type manufacturer model engines
## <chr> <int> <chr> <chr> <chr> <int>
## 1 N365AA 2001 Rotorcraft AGUSTA SPA A109E 2
## 2 N125UW 2009 Fixed wing multi engine AIRBUS A320-214 2
## 3 N126UW 2009 Fixed wing multi engine AIRBUS A320-214 2
## 4 N127UW 2010 Fixed wing multi engine AIRBUS A320-214 2
## # ... with 3,318 more rows, and 3 more variables: seats <int>,
## # speed <int>, engine <chr>
arrange(.data = planes, manufacturer, year)
## # A tibble: 3,322 × 9
## tailnum year type manufacturer model engines
## <chr> <int> <chr> <chr> <chr> <int>
## 1 N365AA 2001 Rotorcraft AGUSTA SPA A109E 2
## 2 N186US 2002 Fixed wing multi engine AIRBUS A321-211 2
## 3 N187US 2002 Fixed wing multi engine AIRBUS A321-211 2
## 4 N188US 2002 Fixed wing multi engine AIRBUS A321-211 2
## # ... with 3,318 more rows, and 3 more variables: seats <int>,
## # speed <int>, engine <chr>
Arguments:
...
comma separated list of unquoted variable names. If you want to sort a variable in descending order you need to use function desc()
– see example:arrange(.data = planes, desc(manufacturer))
## # A tibble: 3,322 × 9
## tailnum year type manufacturer model
## <chr> <int> <chr> <chr> <chr>
## 1 N397AA 1985 Fixed wing single engine STEWART MACO FALCON XP
## 2 N521AA NA Fixed wing single engine STEWART MACO FALCON-XP
## 3 N347AA 1985 Rotorcraft SIKORSKY S-76A
## 4 N537JB 2012 Rotorcraft ROBINSON HELICOPTER CO R66
## # ... with 3,318 more rows, and 4 more variables: engines <int>,
## # seats <int>, speed <int>, engine <chr>
You can select columns from a table using select(.data,...)
, where ...
is comma separated list of unquoted expressions. You can treat variable names like they are positions. See example:
select(.data = planes, year, manufacturer, model, seats, speed)
## # A tibble: 3,322 × 5
## year manufacturer model seats speed
## <int> <chr> <chr> <int> <int>
## 1 2004 EMBRAER EMB-145XR 55 NA
## 2 1998 AIRBUS INDUSTRIE A320-214 182 NA
## 3 1999 AIRBUS INDUSTRIE A320-214 182 NA
## 4 1999 AIRBUS INDUSTRIE A320-214 182 NA
## # ... with 3,318 more rows
select(.data = planes, 2, 4, 5, 7, 8)
## # A tibble: 3,322 × 5
## year manufacturer model seats speed
## <int> <chr> <chr> <int> <int>
## 1 2004 EMBRAER EMB-145XR 55 NA
## 2 1998 AIRBUS INDUSTRIE A320-214 182 NA
## 3 1999 AIRBUS INDUSTRIE A320-214 182 NA
## 4 1999 AIRBUS INDUSTRIE A320-214 182 NA
## # ... with 3,318 more rows
You can use special function when addressing columns:
:
can be used to address range of variables-
drops variables=
renames columnsselect(.data = planes, year, manufacturer:seats)
## # A tibble: 3,322 × 5
## year manufacturer model engines seats
## <int> <chr> <chr> <int> <int>
## 1 2004 EMBRAER EMB-145XR 2 55
## 2 1998 AIRBUS INDUSTRIE A320-214 2 182
## 3 1999 AIRBUS INDUSTRIE A320-214 2 182
## 4 1999 AIRBUS INDUSTRIE A320-214 2 182
## # ... with 3,318 more rows
select(.data = planes, -tailnum, -year, -type)
## # A tibble: 3,322 × 6
## manufacturer model engines seats speed engine
## <chr> <chr> <int> <int> <int> <chr>
## 1 EMBRAER EMB-145XR 2 55 NA Turbo-fan
## 2 AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
## 3 AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
## 4 AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
## # ... with 3,318 more rows
select(.data = planes, time=year, manufacturer, seats, speed)
## # A tibble: 3,322 × 4
## time manufacturer seats speed
## <int> <chr> <int> <int>
## 1 2004 EMBRAER 55 NA
## 2 1998 AIRBUS INDUSTRIE 182 NA
## 3 1999 AIRBUS INDUSTRIE 182 NA
## 4 1999 AIRBUS INDUSTRIE 182 NA
## # ... with 3,318 more rows
It is possible to process even very complicated tasks using select()
like selecting columns according to specific key or reordering columns. This and lot more can be done with special functions called select helpers:
starts_with()
, ends_with()
, contains()
, matches()
allows user to select columns according to some key. For example one can select all columns from weather
related to wind.select(.data = weather, contains("wind"))
## # A tibble: 26,130 × 3
## wind_dir wind_speed wind_gust
## <dbl> <dbl> <dbl>
## 1 230 10.35702 11.91865
## 2 230 13.80936 15.89154
## 3 230 12.65858 14.56724
## 4 230 13.80936 15.89154
## # ... with 26,126 more rows
one_of()
helper.vars <- c("temp","humid","wind_speed")
select(.data = weather, one_of(vars))
## # A tibble: 26,130 × 3
## temp humid wind_speed
## <dbl> <dbl> <dbl>
## 1 37.04 53.97 10.35702
## 2 37.04 53.97 13.80936
## 3 37.94 52.09 12.65858
## 4 37.94 54.51 13.80936
## # ... with 26,126 more rows
V1
or X2008
. To select those columns one can use num_range()
.# Create dummy dataset with columns named V*
df <- as.data.frame(matrix(runif(100), nrow = 10)) %>% as_data_frame()
select(.data = df, num_range("V", c(1,4:6)))
## # A tibble: 10 × 4
## V1 V4 V5 V6
## <dbl> <dbl> <dbl> <dbl>
## 1 0.65959353 0.3566706 0.5649707 0.45155619
## 2 0.08093357 0.0151294 0.3492659 0.41551914
## 3 0.42322792 0.5145382 0.4412013 0.87910739
## 4 0.89932347 0.9216651 0.9020654 0.06002039
## # ... with 6 more rows
everything()
can help with it.select(.data = weather, time_hour, everything())
## # A tibble: 26,130 × 15
## time_hour origin year month day hour temp dewp humid
## <dttm> <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>
## 1 2013-01-01 01:00:00 EWR 2013 1 1 0 37.04 21.92 53.97
## 2 2013-01-01 02:00:00 EWR 2013 1 1 1 37.04 21.92 53.97
## 3 2013-01-01 03:00:00 EWR 2013 1 1 2 37.94 21.92 52.09
## 4 2013-01-01 04:00:00 EWR 2013 1 1 3 37.94 23.00 54.51
## # ... with 26,126 more rows, and 6 more variables: wind_dir <dbl>,
## # wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>
dplyr
provides specialized version of select()
: rename()
. rename()
changes names of selected columns and keeps others in the data set.
select(.data = planes, time=year)
## # A tibble: 3,322 × 1
## time
## <int>
## 1 2004
## 2 1998
## 3 1999
## 4 1999
## # ... with 3,318 more rows
rename(.data = planes, time=year)
## # A tibble: 3,322 × 9
## tailnum time type manufacturer model engines
## <chr> <int> <chr> <chr> <chr> <int>
## 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2
## 2 N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 3 N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 4 N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## # ... with 3,318 more rows, and 3 more variables: seats <int>,
## # speed <int>, engine <chr>
Function mutate(.data,...)
adds new variables according to specifications given in ...
and preserves existing. See example – conversion of temp
from weather
table from Celsius to Fahrenheit:
mutate(weather,
temp_f = temp*9/5 + 32
) %>%
# Just to make outcome pretty -- by default mutate() adds new columns at the end.
select(
temp, temp_f, everything()
)
## # A tibble: 26,130 × 16
## temp temp_f origin year month day hour dewp humid wind_dir
## <dbl> <dbl> <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>
## 1 37.04 98.672 EWR 2013 1 1 0 21.92 53.97 230
## 2 37.04 98.672 EWR 2013 1 1 1 21.92 53.97 230
## 3 37.94 100.292 EWR 2013 1 1 2 21.92 52.09 230
## 4 37.94 100.292 EWR 2013 1 1 3 23.00 54.51 230
## # ... with 26,126 more rows, and 6 more variables: wind_speed <dbl>,
## # wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
## # time_hour <dttm>
dplyr
contains even function transmute(.data,...)
which drops existing variables:
transmute(weather,
temp_f = temp*9/5 + 32
)
## # A tibble: 26,130 × 1
## temp_f
## <dbl>
## 1 98.672
## 2 98.672
## 3 100.292
## 4 100.292
## # ... with 26,126 more rows
Function summarise(.data,...)
summaries multiple values to a single value. It means that it allows a user to apply a function on selected columns and get one row outcome. For example one may wonder about average temperature, humidity, and maximum wind_speed in 2013:
summarise(.data = weather,
av_temp = mean(temp, na.rm = TRUE),
av_humid = mean(humid, na.rm = TRUE),
max_wspeed = max(wind_speed, na.rm = TRUE)
)
## # A tibble: 1 × 3
## av_temp av_humid max_wspeed
## <dbl> <dbl> <dbl>
## 1 55.20351 62.34732 1048.361
Above described uses of summarise()
and mutate()
require manual specification of all rules – e.g. if you want to know averages of all weather characteristics then you would have to write a line for each of them. This is not really comfortable. However dplyr
contains functions summarise_all()
, summarise_if()
, summarise_at()
and their mutate
counterparts mutate_all()
, mutate_if()
, and mutate_at()
which make the task simple.
Functions summarise_all(.tbl,.funs,...)
and mutate_all(.tbl, .funs, ...)
allows to apply a function to all columns. For example one might want to have square roots of all columns and mean values:
weather %>% select(-origin,-time_hour) %>% mutate_all(sqrt)
## Warning in `__dplyr_colwise_fun`(c(21.92, 21.92, 21.92, 23, 24.08, 26.06, :
## NaNs produced
## # A tibble: 26,130 × 13
## year month day hour temp dewp humid wind_dir
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 44.86647 1 1 0.000000 6.086050 4.681880 7.346428 15.16575
## 2 44.86647 1 1 1.000000 6.086050 4.681880 7.346428 15.16575
## 3 44.86647 1 1 1.414214 6.159545 4.681880 7.217340 15.16575
## 4 44.86647 1 1 1.732051 6.159545 4.795832 7.383089 15.16575
## # ... with 26,126 more rows, and 5 more variables: wind_speed <dbl>,
## # wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>
weather %>% summarise_all(mean)
## Warning in mean.default(c("EWR", "EWR", "EWR", "EWR", "EWR", "EWR",
## "EWR", : argument is not numeric or logical: returning NA
## # A tibble: 1 × 15
## origin year month day hour temp dewp humid wind_dir
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA 2013 6.505741 15.67972 11.51841 NA NA NA NA
## # ... with 6 more variables: wind_speed <dbl>, wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
We can see two problems there. The first is, that function return NA
when at least one observation is missing. To change this behavior using mean()
arguments written in ...
:
weather %>% summarise_all(mean, na.rm = TRUE)
## Warning in mean.default(c("EWR", "EWR", "EWR", "EWR", "EWR", "EWR",
## "EWR", : argument is not numeric or logical: returning NA
## # A tibble: 1 × 15
## origin year month day hour temp dewp humid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA 2013 6.505741 15.67972 11.51841 55.20351 41.3854 62.34732
## # ... with 7 more variables: wind_dir <dbl>, wind_speed <dbl>,
## # wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
## # time_hour <dttm>
The second problem is related to datatypes of columns which are incompatible with functions applied. To fix it we can use summarise_if(.tbl, .predicate, .funs,...)
and mutate_if(.tbl, .predicate, .funs,...)
which apply a function only if condition given in .predicate
is true. In this case we want to process only numeric values:
weather %>% mutate_if(is.numeric,sqrt)
## Warning in `__dplyr_colwise_fun`(c(21.92, 21.92, 21.92, 23, 24.08, 26.06, :
## NaNs produced
## # A tibble: 26,130 × 15
## origin year month day hour temp dewp humid wind_dir
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 44.86647 1 1 0.000000 6.086050 4.681880 7.346428 15.16575
## 2 EWR 44.86647 1 1 1.000000 6.086050 4.681880 7.346428 15.16575
## 3 EWR 44.86647 1 1 1.414214 6.159545 4.681880 7.217340 15.16575
## 4 EWR 44.86647 1 1 1.732051 6.159545 4.795832 7.383089 15.16575
## # ... with 26,126 more rows, and 6 more variables: wind_speed <dbl>,
## # wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
## # time_hour <dttm>
weather %>% summarise_if(is.numeric,mean,na.rm = TRUE)
## # A tibble: 1 × 13
## year month day hour temp dewp humid wind_dir
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 6.505741 15.67972 11.51841 55.20351 41.3854 62.34732 198.0667
## # ... with 5 more variables: wind_speed <dbl>, wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>
The last two functions summarise_at(.tbl,.cols,.funs,...)
and mutate_at(.tbl,.cols,.funs,...)
apply a function to the selection of columns specified in .cols
. You can use the same specifications as in select()
. For example we can summarize only columns with wind characteristics:
weather %>% summarise_at(vars(starts_with("wind")),mean,na.rm = TRUE)
## # A tibble: 1 × 3
## wind_dir wind_speed wind_gust
## <dbl> <dbl> <dbl>
## 1 198.0667 10.39587 11.96336
Basic verbs are quite useful by themselves but they do not really add a novel functionality to basic R (see example above). However they become a very powerful tool when used in so called grouped operations. dplyr
allows user to group observations to mutually exclusive groups and perform operation separately for each group. A example above presented using summarise()
for getting average temperature, humidity, and maximum wind speed. The statistics obtained however described average and maximum values for all airports together – resulting statistics therefore had no useful meaning. We would be probably more interested in summarizing data for each airport individually. We can do so using group_by()
:
weather %>%
group_by(origin) %>%
summarise(
av_temp = mean(temp, na.rm = TRUE),
av_humid = mean(humid, na.rm = TRUE),
max_wspeed = max(wind_speed, na.rm = TRUE)
)
## # A tibble: 3 × 4
## origin av_temp av_humid max_wspeed
## <chr> <dbl> <dbl> <dbl>
## 1 EWR 55.48703 62.86192 1048.36058
## 2 JFK 54.42183 65.04412 42.57886
## 3 LGA 55.70181 59.13616 40.27730
Now, the output contains values summarized for each group defined by variable origin
. A group can be defined even by combination of variables:
weather %>%
group_by(origin,month) %>%
summarise(
av_temp = mean(temp, na.rm = TRUE),
av_humid = mean(humid, na.rm = TRUE),
max_wspeed = max(wind_speed, na.rm = TRUE)
)
## Source: local data frame [36 x 5]
## Groups: origin [?]
##
## origin month av_temp av_humid max_wspeed
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 1 35.56318 61.94838 40.27730
## 2 EWR 2 34.16027 62.98858 1048.36058
## 3 EWR 3 40.03386 57.60311 29.92028
## 4 EWR 4 52.90100 54.40324 25.31716
## # ... with 32 more rows
Note that grouping variables (de facto group IDs) are always presented in the output. The output also remains grouped. See what it means:
weather %<>% group_by(origin)
print(weather)
## Source: local data frame [26,130 x 15]
## Groups: origin [3]
##
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 0 37.04 21.92 53.97 230 10.35702
## 2 EWR 2013 1 1 1 37.04 21.92 53.97 230 13.80936
## 3 EWR 2013 1 1 2 37.94 21.92 52.09 230 12.65858
## 4 EWR 2013 1 1 3 37.94 23.00 54.51 230 13.80936
## # ... with 26,126 more rows, and 5 more variables: wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
weather %>%
summarise(
av_temp = mean(temp, na.rm = TRUE),
av_humid = mean(humid, na.rm = TRUE),
max_wspeed = max(wind_speed, na.rm = TRUE)
)
## # A tibble: 3 × 4
## origin av_temp av_humid max_wspeed
## <chr> <dbl> <dbl> <dbl>
## 1 EWR 55.48703 62.86192 1048.36058
## 2 JFK 54.42183 65.04412 42.57886
## 3 LGA 55.70181 59.13616 40.27730
It might by convenient in some cases to remove grouping from a data.frame:
weather %<>% ungroup()
print(weather)
## # A tibble: 26,130 × 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 0 37.04 21.92 53.97 230 10.35702
## 2 EWR 2013 1 1 1 37.04 21.92 53.97 230 13.80936
## 3 EWR 2013 1 1 2 37.94 21.92 52.09 230 12.65858
## 4 EWR 2013 1 1 3 37.94 23.00 54.51 230 13.80936
## # ... with 26,126 more rows, and 5 more variables: wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
dplyr
contains a special grouping function rowwise()
which creates a group for each row.
Data for your research are only rarely already compiled into one data set. You need to do this part by your own very often. dplyr
provides functions for joining two data sets. It currently supports following join types (see help):
Mutating joins are joins that returns columns from both data sets x
and y
:
left_join()
: return all rows from x
, and all columns from x
and y.
Rows in x
with no match in y
will have NA
values in the new columns. If there are multiple matches between x
and y
, all combinations of the matches are returned.right_join()
: return all rows from y
, and all columns from x
and y
. Rows in y
with no match in x
will have NA
values in the new columns. If there are multiple matches between x
and y
, all combinations of the matches are returned.inner_join()
: return all rows from x
where there are matching values in y
, and all columns from x
and y
. If there are multiple matches between x
and y
, all combination of the matches are returned.full_join()
: return all rows and all columns from both x
and y
. Where there are not matching values, returns NA
for the one missing.Filtering joins return columns only from x
:
semi_join()
: return all rows from x
where there are matching values in y
, keeping just columns from x
. A semi join differs from an inner join because an inner join will return one row of x
for each matching row of y
, where a semi join will never duplicate rows of x
.anti_join()
: return all rows from x
where there are not matching values in y
, keeping just columns from x
.See some examples: One may wonder who has the biggest planes on the sky of NYC. Data needed are spread across three tables: flights
(tail numbers and carriers codes), planes
(tail numbers and number of seats), and airlines
(carrier codes and names).
It is essencial that all three tables contain a key which can be used to match observations.
First we can join flights
with planes
– i.e. we add technical parameters of airplanes to every single flight.
left_join(flights, planes, by = "tailnum") %>%
left_join(.,airlines, by = "carrier") -> joined_tables
The syntax of joining functions is simple. The first two slots contains tables to join (order matters in this case!). The key used for matching rows from x
and y
is defined by argument by
:
NULL
(default) than columns with identical names are used for matchingby = c("a" = "b", "c" = "d")
, where a
and c
are column names from table x
and b
and d
from y
.To get results on the biggest planes question we need to filter data a bit. Table flights
contains all individual flights from NYC airports therefore one plane is recorded many times and so does the resulting table joined_tables
. It would bias our results. Therefore in the first step we need to get a unique set of airplanes identifiers (tail numbers):
joined_tables %>%
distinct(tailnum, .keep_all = TRUE) %>%
# Now we need to get an average speed for each carrier. First we need to group the data by carrier name.
group_by(name) %>%
# Now we can take average speeds
summarise(
av_seats = mean(seats, na.rm = TRUE)
) %>%
# And sort it according to av_speed
arrange(
desc(av_seats)
) %>%
# And get top three
slice(1:3)
## # A tibble: 3 × 2
## name av_seats
## <chr> <dbl>
## 1 Hawaiian Airlines Inc. 377.0000
## 2 US Airways Inc. 205.7687
## 3 United Air Lines Inc. 194.4013
…and biggest planes award goes to Hawaiian Airlines Inc.
…from R Studio Cheat Sheet
Join functions in dplyr
are truly two-tables verbs because they do not support joining more then two tables in one call. If this feature is needed one can use Reduce()
from base
package. Reduce()
uses a binary function to successively combine the elements of a given vector. We can demonstrate it using left_join()
problem from the first example above:
Reduce(left_join, list(flights,planes,airlines))
## Joining, by = c("year", "tailnum")
## Joining, by = "carrier"
## # A tibble: 336,776 × 27
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## # ... with 336,772 more rows, and 20 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, type <chr>, manufacturer <chr>,
## # model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>,
## # name <chr>
Reduce()
progress from left to right (it can be changed) and successively applies given function. It is equivalent to chaining multiple successive calls of left_join()
:
Reduce(left_join, list(a,b,c,d,...))
left_join(a,b) %>% left_join(.,c) %>% left_join(.,d) %>% ...
A window function is a variation on an aggregation function. Where an aggregation function, like sum()
and mean()
, takes n inputs and return a single value, a window function returns n values. dplyr
contains three families of window functions:
See some examples.
planes %>%
mutate(
seats_rnum = row_number(seats),
seats_decile = ntile(seats, n = 10),
seats_mrank = min_rank(seats),
seats_drank = dense_rank(seats),
seats_prank = percent_rank(seats),
seats_cdist = cume_dist(seats)
) %>%
select(starts_with("seats"))
## # A tibble: 3,322 × 7
## seats seats_rnum seats_decile seats_mrank seats_drank seats_prank
## <int> <int> <int> <int> <int> <dbl>
## 1 55 123 1 123 15 0.03673592
## 2 182 2410 8 2410 30 0.72538392
## 3 182 2411 8 2410 30 0.72538392
## 4 182 2412 8 2410 30 0.72538392
## # ... with 3,318 more rows, and 1 more variables: seats_cdist <dbl>
flights %>%
group_by(flight) %>%
mutate(
lead_arr_delay = lead(arr_delay, n = 1L),
lag_arr_delay = lag(arr_delay, n = 1L)
) %>%
select(ends_with("arr_delay")) %>%
arrange(flight)
## Adding missing grouping variables: `flight`
## Source: local data frame [336,776 x 4]
## Groups: flight [3,844]
##
## flight arr_delay lead_arr_delay lag_arr_delay
## <int> <dbl> <dbl> <dbl>
## 1 1 6 29 NA
## 2 1 29 5 6
## 3 1 5 24 29
## 4 1 24 -36 5
## # ... with 336,772 more rows
Note that
flight
is in in the output even if it is not selected byselect()
! Grouping variables remain in selection automatically.
flights %>%
group_by(flight) %>%
mutate(
cummean_arr_delay = cummean(arr_delay)
) %>%
select(ends_with("arr_delay")) %>%
arrange(flight)
## Adding missing grouping variables: `flight`
## Source: local data frame [336,776 x 3]
## Groups: flight [3,844]
##
## flight arr_delay cummean_arr_delay
## <int> <dbl> <dbl>
## 1 1 6 6.00000
## 2 1 29 17.50000
## 3 1 5 13.33333
## 4 1 24 16.00000
## # ... with 336,772 more rows
In the first exercise we will process data on unemployment rate from OECD database. File OECD_STLABOUR_monthly.csv
contains seasonally adjusted monthly unemployment rates for numerous population groups in OECD countries. We would like to get yearly unemployment rates for total population 15+ (series LRUNTTTT
), males (LRUNTTMA
), and females (LRUNTTFE
).
read_csv("data/OECD_STLABOUR_monthly.csv") -> oecd
print(oecd)
## # A tibble: 18,210 × 19
## `"LOCATION"` Country SUBJECT Subject
## <chr> <chr> <chr> <chr>
## 1 AUS Australia LRUN24FE Unemployment rate, Aged 15-24, Females
## 2 AUS Australia LRUN24FE Unemployment rate, Aged 15-24, Females
## 3 AUS Australia LRUN24FE Unemployment rate, Aged 15-24, Females
## 4 AUS Australia LRUN24FE Unemployment rate, Aged 15-24, Females
## # ... with 18,206 more rows, and 15 more variables: MEASURE <chr>,
## # Measure <chr>, FREQUENCY <chr>, Frequency <chr>, TIME <chr>,
## # Time <chr>, `Unit Code` <chr>, Unit <chr>, `PowerCode Code` <int>,
## # PowerCode <chr>, `Reference Period Code` <chr>, `Reference
## # Period` <chr>, Value <dbl>, `Flag Codes` <chr>, Flags <chr>
OECD data comes with a lot of information. In this case we need just country code, variable code, value, and time.
oecd %<>%
select(country = 1,
variable = SUBJECT,
value = Value,
time = Time)
print(oecd)
## # A tibble: 18,210 × 4
## country variable value time
## <chr> <chr> <dbl> <chr>
## 1 AUS LRUN24FE 12.16585 Jan-2000
## 2 AUS LRUN24FE 12.43789 Feb-2000
## 3 AUS LRUN24FE 11.52670 Mar-2000
## 4 AUS LRUN24FE 11.51421 Apr-2000
## # ... with 18,206 more rows
In the call used select()
utilizes both standard ways to select a column – number (in the case of ugly column name) and names.
Data set contains 1 different unemployment rates. We are interested only in three of them:
oecd %<>%
filter(variable %in% c("LRUNTTTT","LRUNTTMA","LRUNTTFE"))
print(oecd)
## # A tibble: 3,954 × 4
## country variable value time
## <chr> <chr> <dbl> <chr>
## 1 AUS LRUNTTFE 6.670316 Jan-2000
## 2 AUS LRUNTTFE 6.740396 Feb-2000
## 3 AUS LRUNTTFE 6.576782 Mar-2000
## 4 AUS LRUNTTFE 6.341064 Apr-2000
## # ... with 3,950 more rows
Now it is time to make it tidy:
oecd %<>%
spread(variable,value)
print(oecd)
## # A tibble: 1,320 × 5
## country time LRUNTTFE LRUNTTMA LRUNTTTT
## * <chr> <chr> <dbl> <dbl> <dbl>
## 1 AUS Apr-2000 6.341064 6.400338 6.374269
## 2 AUS Apr-2001 6.509435 7.016913 6.791216
## 3 AUS Apr-2002 5.889289 6.709898 6.347627
## 4 AUS Apr-2003 5.850319 6.171773 6.027655
## # ... with 1,316 more rows
We want to aggregated monthly observations. For the purpose of grouping it useful to store year and month in different columns:
oecd %<>%
separate(time, c("month","year"), sep="-", convert = TRUE)
print(oecd)
## # A tibble: 1,320 × 6
## country month year LRUNTTFE LRUNTTMA LRUNTTTT
## * <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 AUS Apr 2000 6.341064 6.400338 6.374269
## 2 AUS Apr 2001 6.509435 7.016913 6.791216
## 3 AUS Apr 2002 5.889289 6.709898 6.347627
## 4 AUS Apr 2003 5.850319 6.171773 6.027655
## # ... with 1,316 more rows
Now we can take averages for years:
oecd %<>%
# We want to get an average for a country-year pairs. Therefore we need to make groups along these variables.
group_by(country,year) %>%
# And we want to take averages only for numeric columns.
summarise_if(is.numeric, mean, na.rm=T)
print(oecd)
## Source: local data frame [110 x 5]
## Groups: country [?]
##
## country year LRUNTTFE LRUNTTMA LRUNTTTT
## <chr> <int> <dbl> <dbl> <dbl>
## 1 AUS 2000 6.030742 6.465302 6.273984
## 2 AUS 2001 6.455428 7.007580 6.763034
## 3 AUS 2002 6.148189 6.534425 6.363390
## 4 AUS 2003 5.960969 5.894895 5.924274
## # ... with 106 more rows
Inequality is hot topic. Because… you know… it is huge and growing beyond that. If you want to let data speak you will most likely need WIID data set.
The WIID database in 3.3 version is available in WIID_ver3_sept15_0.tsv
(tab separated values).
read_tsv("data/WIID_ver3_sept15_0.tsv") -> wiid
print(wiid)
## # A tibble: 7,153 × 53
## Country Countrycode3 Countrycode2 Year POP Incomegroup
## <chr> <chr> <chr> <int> <dbl> <chr>
## 1 Afghanistan AFG AF 2007 NA Low income
## 2 Afghanistan AFG AF 2008 NA Low income
## 3 Albania ALB AL 1996 3.112597 Upper middle income
## 4 Albania ALB AL 1997 3.091902 Upper middle income
## # ... with 7,149 more rows, and 47 more variables: Mean <dbl>,
## # Median <dbl>, Gini <dbl>, Source <chr>, D1 <dbl>, D2 <dbl>, D3 <dbl>,
## # D4 <dbl>, D5 <dbl>, D6 <dbl>, D7 <dbl>, D8 <dbl>, D9 <dbl>, D10 <dbl>,
## # Q1 <dbl>, Q2 <dbl>, Q3 <dbl>, Q4 <dbl>, Q5 <dbl>, P5 <dbl>, P95 <dbl>,
## # Revision <chr>, Quality <chr>, Source_Comments <chr>, Region <chr>,
## # Currency <chr>, Reference_period <chr>, Welfaredefn <chr>,
## # IncSharU <chr>, UofAnala <chr>, Equivsc <chr>, AreaCovr <chr>,
## # PopCovr <chr>, AgeCovr <chr>, EU <chr>, OECD <chr>, Year_cat <chr>,
## # Welfaredefn_new <chr>, Equivsc_new <chr>, AgeCovr_new <chr>,
## # AreaCovr_new <chr>, PopCovr_new <chr>, IncSharU_new <chr>,
## # UofAnala_new <chr>, Exchangerate <dbl>, Mean_usd <dbl>,
## # Median_usd <dbl>
WIID aggregates many data sets on income inequality with many indicators. We will focus on Gini coefficient (Gini
). Each observation is rated according its quality (Quality
) from High
(best) to Not known
(worst). To make our lives easier we will restrain the data set on country and year identifiers and Gini and Quality ratings.
wiid %<>% select(
Country = Countrycode3, # ISO alpha3 country codes
Year = Year,
gini = Gini,
quality = Quality
)
# Lets take a look at Australia in 1978
wiid %>%
filter(
Country == "AUS",
Year == 1978
)
## # A tibble: 3 × 4
## Country Year gini quality
## <chr> <int> <dbl> <chr>
## 1 AUS 1978 38.1 Low
## 2 AUS 1978 26.1 Average
## 3 AUS 1978 18.0 Average
…and it is apparent that we have a problem. There are multiple observations for country-year pairs and even for country-year-quality. Researcher might want to use the best quality available for each country-year pair and if there are more of the same quality then take an average.
In the first step we will create averages for unique country-year-quality combinations:
wiid %<>%
group_by(Country,Year,quality) %>%
summarise(
gini = mean(gini, na.rm = TRUE)
) %>% ungroup
# See...
wiid %>%
filter(
Country == "AUS",
Year == 1978
)
## # A tibble: 2 × 4
## Country Year quality gini
## <chr> <int> <chr> <dbl>
## 1 AUS 1978 Average 22.05
## 2 AUS 1978 Low 38.10
In the second step we would like to choose the best observation available. The quality
has following values:
wiid %>% distinct(quality)
## # A tibble: 4 × 1
## quality
## <chr>
## 1 High
## 2 Average
## 3 Low
## 4 Not known
The strategy is following. Order rows for each country-year pair by quality (best goes first,…) and get only the first row! Easy to say, harder to do. The best option is to convert character vector quality
into ordered factor.
# Before:
wiid %>% distinct(quality) %>% arrange(quality)
## # A tibble: 4 × 1
## quality
## <chr>
## 1 Average
## 2 High
## 3 Low
## 4 Not known
# Converting into factor:
wiid$quality %<>% factor(levels = c("High","Average","Low","Not known"))
# After:
wiid %>% distinct(quality) %>% arrange(quality)
## # A tibble: 4 × 1
## quality
## <fctr>
## 1 High
## 2 Average
## 3 Low
## 4 Not known
Now we can do the rest:
wiid %>%
# at first it is wise to remove rows with NAs
filter(!is.na(gini)) %>%
# we are looking for the best observation for country-year pair
group_by(Country,Year) %>%
# order by quality
arrange(quality) %>%
# and now we take the first row
slice(1L) -> wiid_0
There is an alternative way of doing the last step without grouping:
wiid %>%
filter(!is.na(gini)) %>%
arrange(quality) %>%
distinct(Country, Year, .keep_all = TRUE) -> wiid_1
Results are the same:
setdiff(wiid_0,wiid_1)
## Source: local data frame [0 x 4]
## Groups: [?]
##
## # ... with 4 variables: Country <chr>, Year <int>, quality <fctr>,
## # gini <dbl>
One may wonder if there is a correlation between unemployment and income inequality. In order to investigate it we would need to join WIID data on inequality and OECD data on unemployment.
wiid_0 %>%
# We don't need information on quality therefore we can drop it
select(-quality) %>%
# Notice that joins are sensitive to inputs order -- you can place "." on the position where table going through the pipe should be.
# The names of key columns are not identical. Therefore we need to use named vector describing matching rules.
left_join(oecd,.,
by = c("country"="Country",
"year"="Year")) -> oecd_wiid
print(oecd_wiid)
## Source: local data frame [110 x 6]
## Groups: country [?]
##
## country year LRUNTTFE LRUNTTMA LRUNTTTT gini
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUS 2000 6.030742 6.465302 6.273984 38.775
## 2 AUS 2001 6.455428 7.007580 6.763034 31.100
## 3 AUS 2002 6.148189 6.534425 6.363390 30.900
## 4 AUS 2003 5.960969 5.894895 5.924274 31.200
## # ... with 106 more rows
We have successfully joined the data and now we can compute correlation coefficient. It is 0.32. In R we can also make pictures…
…but we will talk about that later. (Think about regressions with and without FE.)
Czech Statistical Office publishes detailed statistics on population changes at municipal level. The database is divided into multiple tables in the same structure (one for a county/“okres”). Each table has following structure:
## # A tibble: 135 × 19
## year city_id name est pop_b nat mort immig emig pch_nat
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1971 210001 Běchovice - 1859 17 35 75 97 -18
## 2 1972 210001 Běchovice - 1819 26 26 103 63 NA
## 3 1973 210001 Běchovice - 1859 26 28 59 75 -2
## 4 1971 210002 Benice - 336 9 4 6 15 5
## # ... with 131 more rows, and 9 more variables: pch_mig <dbl>,
## # pch_tot <dbl>, area_ch1 <dbl>, pop_e <dbl>, area_ch2 <dbl>,
## # from <chr>, from_name <chr>, where <chr>, where_name <chr>
For the purpose of the homework are important following columns:
year
– yearcity_id
– ID of the citypop_b
– population at the beginning of the yearpch_mig
– net absolute migration (number of persons)Data are available in the file HW_CZSO_migration.Rdata
in the variable CZSO
which is a list of individual tables (data frames). You are supposed to:
mrate
) as 100*pch_mig/pop_b
amrate
) and average population size (from pop_b
) for each city_id
for the period 1991–2010 (incl. margin years), ignore missing valuesAssign your solution to the variable czdata
. Your solution should look like following table:
## # A tibble: 6,283 × 3
## city_id amrate apop
## <chr> <dbl> <dbl>
## 1 564982 14.71920 290.50000
## 2 546518 13.91694 62.94737
## 3 564907 13.78374 170.35000
## 4 517321 13.47592 207.25000
## # ... with 6,279 more rows
Hints:
Good luck!