2016

Data manipulation

Majority of data-manipulating operation can be desribed using following verbs:

select, filter, add, transform, summarise, join

…these and even more operations can be done
using tools from dplyr.

library(dplyr)

Hadley Wickham and Romain Francois (2016). dplyr: A Grammar of Data Manipulation. R package version 0.5.0.
https://CRAN.R-project.org/package=dplyr

Table of content

  • Data used in examples
  • Basic verbs
  • Grouped operations
  • Two-tables verbs
  • Exercises
  • Homework

Data used in examples

Diamonds datasets

We will use two tables derived from ggplot2::diamonds table for demostration of dplyr functions.

Diamonds datasets: Diamonds_quality

print(Diamonds_quality, n=5)
## # A tibble: 53,940 × 4
##      ID     cut color price
##   <int>   <ord> <ord> <int>
## 1     1   Ideal     E   326
## 2     2 Premium     E   326
## 3     3    Good     E   327
## 4     4 Premium     I   334
## 5     5    Good     J   335
## # ... with 5.394e+04 more rows
  • ID – ID of the stone
  • cut – quality of the cut (Fair, Good, Very Good, Premium, Ideal)
  • color – diamond colour, from J (worst) to D (best)
  • price – price in USD

Diamonds datasets: Diamonds_size

print(Diamonds_size, n=4)
## # A tibble: 53,940 × 5
##      ID carat     x     y     z
##   <int> <dbl> <dbl> <dbl> <dbl>
## 1     1  0.23  3.95  3.98  2.43
## 2     2  0.21  3.89  3.84  2.31
## 3     3  0.23  4.05  4.07  2.31
## 4     4  0.29  4.20  4.23  2.63
## # ... with 5.394e+04 more rows
  • ID – ID of the stone
  • carat – weight of the diamond
  • x – length in mm
  • y – width in mm
  • z – depth in mm

Basic verbs

Operations with rows: filter(), slice(), distinct(), arrange()
Operations with columns: select(), mutate()
Others: summarise(), mutate_if(),…

Filter rows with filter() and slice()

A basic operation in data handling is a selection of observations (rows) which meet given predicates from a data set.
This functionality is delivered by function filter().

For example we can filter stones from table Diamonds_quality which meets following conditions:

  • price is between 1,000 and 15,000 USD
  • cut is Very Good or Premium
  • color is D (best)

filter()

filter(.data, ...)
  • .data – a table
  • ... – logical predicates

filter(), example

Diamonds_quality %>% 
    filter(
        price >= 1000,
        price <= 15000,
        cut %in% c("Premium","Very Good"),
        color == "D"
    ) %>% print(n=3)
## # A tibble: 2,159 × 4
##      ID       cut color price
##   <int>     <ord> <ord> <int>
## 1   101 Very Good     D  2760
## 2   128 Very Good     D  2763
## 3   132 Very Good     D  2764
## # ... with 2,156 more rows

base vs. dplyr

base call:

Diamonds_quality[ 
    price >= 1000 &
    price <= 15000 &
    cut %in% c("Premium",
            "Very Good") &
    color == "D",
    ]

dplyr call:

Diamonds_quality %>% 
  filter(
    price >= 1000,
    price <= 15000,
    cut %in% c("Premium",
            "Very Good"),
    color == "D"
  )

slice()

Argument of filter() is a sequence of logical predicates. Sometimes (rarely) is useful to filter rows using their row numbers. filter() is unable to process row numbers as arguments. In this case we can use slice(.data,...) with similar syntax, where ... represents row numbers (integers).

slice(), example

Assume that (for some reason) we want to filter only even rows:

slice(.data = Diamonds_quality, 
      seq(from = 2, to = nrow(Diamonds_quality), by = 2)
      ) %>% print(n=5)
## # A tibble: 26,970 × 4
##      ID       cut color price
##   <int>     <ord> <ord> <int>
## 1     2   Premium     E   326
## 2     4   Premium     I   334
## 3     6 Very Good     J   336
## 4     8 Very Good     H   337
## 5    10 Very Good     H   338
## # ... with 2.696e+04 more rows

Extract unique rows with distinct()

In some cases a tale may contain identical rows. If we want to get rid of duplicities we can use for example unique() or considerably faster distinct() from dplyr.

distinct(.data, ..., .keep_all = FALSE)
  • .data – a table
  • ... – Optional variables to use when determining uniqueness. If there are multiple rows for a given combination of inputs, only the first row will be preserved. If omitted, will use all variables.
  • .keep_all – If TRUE, keep all variables in .data. If a combination of ... is not distinct, this keeps the first row of values.

distinct(), example

Diamonds_quality %>% filter(cut=="Ideal", color=="E") %>% print(n=2)
## # A tibble: 3,903 × 4
##      ID   cut color price
##   <int> <ord> <ord> <int>
## 1     1 Ideal     E   326
## 2    83 Ideal     E   554
## # ... with 3,901 more rows
Diamonds_quality %>% distinct(cut, color, .keep_all = TRUE) %>% print(n=2)
## # A tibble: 35 × 4
##      ID     cut color price
##   <int>   <ord> <ord> <int>
## 1     1   Ideal     E   326
## 2     2 Premium     E   326
## # ... with 33 more rows

Sort rows with arrange()

With arrange(.data,...) dplyr provides function for sorting rows according to one or multiple columns.

Diamonds_quality %>% 
    arrange(cut,desc(price)) %>% print(n=5)
## # A tibble: 53,940 × 4
##      ID   cut color price
##   <int> <ord> <ord> <int>
## 1 27647  Fair     G 18574
## 2 27644  Fair     H 18565
## 3 27631  Fair     J 18531
## 4 27624  Fair     G 18515
## 5 27544  Fair     H 18308
## # ... with 5.394e+04 more rows

Select columns with select()

You can select columns from a table using select(.data,...), where ... is comma separated list of unquoted expressions. You can treat variable names like they are positions. See example:

Diamonds_quality %>% names()
## [1] "ID"    "cut"   "color" "price"
Diamonds_quality %>% select(ID,cut,color)
Diamonds_quality %>% select(1,2,3)

Special functions inside select()

You can use special function when addressing columns:

  • : can be used to address range of variables
  • - drops variables
  • = renames columns
Diamonds_quality %>% select(ID,cut,color)
Diamonds_quality %>% select(ID:color)
Diamonds_quality %>% select(-price)
Diamonds_quality %>% select(1,2,3)
Diamonds_quality %>% select(1:3)
Diamonds_quality %>% select(-4)

All above listed calls return identical outcome.

Special functions inside select()

Diamonds_quality %>% 
    select(ID,price) %>% 
    print(n=6)
## # A tibble: 53,940 × 2
##      ID price
##   <int> <int>
## 1     1   326
## 2     2   326
## 3     3   327
## 4     4   334
## 5     5   335
## 6     6   336
## # ... with 5.393e+04 more rows
Diamonds_quality %>% 
    select(id = ID,price) %>% 
    print(n=6)
## # A tibble: 53,940 × 2
##      id price
##   <int> <int>
## 1     1   326
## 2     2   326
## 3     3   327
## 4     4   334
## 5     5   335
## 6     6   336
## # ... with 5.393e+04 more rows

Select helpers (1)

What if…

  • I want to select all columns starting with letter "c".
  • I want to select all columns whose name contains letter "r".
  • I want to select columns whose names are stored in vector cnames <- c("ID","price").
  • I want to reorder columns.

In those cases you want to use select helpers.

Select helpers (1)

These functions allow you to select variables based on their names.

  • starts_with(): starts with a prefix
  • ends_with(): ends with a prefix
  • contains(): contains a literal string
  • matches(): matches a regular expression
  • num_range(): a numerical range like x01, x02, x03.
  • one_of(): variables in character vector.
  • everything(): all variables.

Select helpers (2)

starts_with(), ends_with(), contains(), matches() allows user to select columns according to some key. For example one can select all columns from Diamonds_quality which start with letter "c".

Diamonds_quality %>% 
    select(starts_with("c")) %>% print(n=4)
## # A tibble: 53,940 × 2
##       cut color
##     <ord> <ord>
## 1   Ideal     E
## 2 Premium     E
## 3    Good     E
## 4 Premium     I
## # ... with 5.394e+04 more rows

Select helpers (3)

When programming the names of columns needed are often stored in character vector. In this case you can use one_of() helper.

cnames <- c("ID","price")
Diamonds_quality %>% 
    select(one_of(cnames)) %>% print(n=4)
## # A tibble: 53,940 × 2
##      ID price
##   <int> <int>
## 1     1   326
## 2     2   326
## 3     3   327
## 4     4   334
## # ... with 5.394e+04 more rows

Select helpers (4)

In the case of missing header or numeric column names in source file columns are usually named automatically by combination of letter and number – e.g. V1 or X2008. To select those columns one can use num_range().

# Create dummy dataset with columns named V*
df <- as.data.frame(matrix(runif(100), nrow = 10)) %>% as_data_frame()
select(.data = df, num_range("V", c(1,5:6))) %>% print(n=4)
## # A tibble: 10 × 3
##          V1        V5        V6
##       <dbl>     <dbl>     <dbl>
## 1 0.9118840 0.7743316 0.3742713
## 2 0.9596149 0.3087521 0.3699380
## 3 0.5387952 0.8935246 0.4191943
## 4 0.6891536 0.8948257 0.3036369
## # ... with 6 more rows

Select helpers (5)

For the clarity is sometimes useful to change ordering of columns – e.g. to put price column on the first position. everything() can help with it.

Diamonds_quality %>% 
    select(price,everything()) %>% print(n=4)
## # A tibble: 53,940 × 4
##   price    ID     cut color
##   <int> <int>   <ord> <ord>
## 1   326     1   Ideal     E
## 2   326     2 Premium     E
## 3   327     3    Good     E
## 4   334     4 Premium     I
## # ... with 5.394e+04 more rows

Rename columns with rename()

dplyr provides specialized version of select(): rename(). rename() changes names of selected columns and keeps others in the data set.

Diamonds_quality %>% 
    select(id = ID) %>% 
    print(n=4)
## # A tibble: 53,940 × 1
##      id
##   <int>
## 1     1
## 2     2
## 3     3
## 4     4
## # ... with 5.394e+04 more rows
Diamonds_quality %>% 
    rename(id = ID) %>% 
    print(n=4)
## # A tibble: 53,940 × 4
##      id     cut color price
##   <int>   <ord> <ord> <int>
## 1     1   Ideal     E   326
## 2     2 Premium     E   326
## 3     3    Good     E   327
## 4     4 Premium     I   334
## # ... with 5.394e+04 more rows

Create new columns with mutate()

Function mutate(.data,...) adds new variables according to specifications given in ... and preserves existing. See example – lousy estimation of stones volume and conversion of carats to grams:

Diamonds_size %>% 
mutate(
    volume = x*y*z,
    weight = carat*0.2
) %>% print(n=3)
## # A tibble: 53,940 × 7
##      ID carat     x     y     z   volume weight
##   <int> <dbl> <dbl> <dbl> <dbl>    <dbl>  <dbl>
## 1     1  0.23  3.95  3.98  2.43 38.20203  0.046
## 2     2  0.21  3.89  3.84  2.31 34.50586  0.042
## 3     3  0.23  4.05  4.07  2.31 38.07688  0.046
## # ... with 5.394e+04 more rows

mutate()

You can also "recycle" the column name:

Diamonds_size %>% 
mutate(
    carat = sqrt(carat)
) %>% print(n=3)
## # A tibble: 53,940 × 5
##      ID     carat     x     y     z
##   <int>     <dbl> <dbl> <dbl> <dbl>
## 1     1 0.4795832  3.95  3.98  2.43
## 2     2 0.4582576  3.89  3.84  2.31
## 3     3 0.4795832  4.05  4.07  2.31
## # ... with 5.394e+04 more rows

mutate() vs. transmute()

mutate() leaves old columns intact. transmute() keeps only newly created column in the output.

Diamonds_size %>% 
mutate(
    carat = sqrt(carat)
) %>% print(n=3)
## # A tibble: 53,940 × 5
##      ID     carat     x     y     z
##   <int>     <dbl> <dbl> <dbl> <dbl>
## 1     1 0.4795832  3.95  3.98  2.43
## 2     2 0.4582576  3.89  3.84  2.31
## 3     3 0.4795832  4.05  4.07  2.31
## # ... with 5.394e+04 more rows
Diamonds_size %>% 
transmute(
    carat = sqrt(carat)
) %>% print(n=3)
## # A tibble: 53,940 × 1
##       carat
##       <dbl>
## 1 0.4795832
## 2 0.4582576
## 3 0.4795832
## # ... with 5.394e+04 more rows

Summarise values in columns with summarise()

Function summarise(.data,...) summaries multiple values to a single value. It means that it allows a user to apply a function on selected columns and get one row outcome.

summarise()

For example one may wonder about average weight, maximum x, and minimum y in the table:

Diamonds_size %>% 
    summarise(
        carat = mean(carat, na.rm = T),
        max_x = max(x),
        min_y = min(y)
    )
## # A tibble: 1 × 3
##       carat max_x min_y
##       <dbl> <dbl> <dbl>
## 1 0.7979397 10.74     0

Summarise and mutate multiple columns

Above described uses of summarise() and mutate() require manual specification of all rules – e.g. if you want to know averages of all size characteristics then you would have to write a line for each of them. This is not really comfortable. However dplyr contains functions summarise_all(), summarise_if(), summarise_at() and their mutate counterparts mutate_all(), mutate_if(), and mutate_at() which make the task simple.

*_all()

Functions summarise_all(.tbl,.funs,...) and mutate_all(.tbl, .funs, ...) allows user to apply a function to all columns

*_all()

Diamonds_size %>% select(-ID) %>% mutate_all(sqrt) %>% print(n=2)
## # A tibble: 53,940 × 4
##       carat        x        y        z
##       <dbl>    <dbl>    <dbl>    <dbl>
## 1 0.4795832 1.987461 1.994994 1.558846
## 2 0.4582576 1.972308 1.959592 1.519868
## # ... with 5.394e+04 more rows
Diamonds_size %>% select(-ID) %>% summarise_all(mean)
## # A tibble: 1 × 4
##       carat        x        y        z
##       <dbl>    <dbl>    <dbl>    <dbl>
## 1 0.7979397 5.731157 5.734526 3.538734

*_if()

The second problem is related to datatypes of columns which are incompatible with functions applied. To fix it we can use summarise_if(.tbl, .predicate, .funs,...) and mutate_if(.tbl, .predicate, .funs,...) which apply a function only if condition given in .predicate is true.

*_if()

In this case we want to process only numeric values:

Diamonds_quality %>% mutate_if(is.numeric,sqrt) %>%  print(n=2)
## # A tibble: 53,940 × 4
##         ID     cut color    price
##      <dbl>   <ord> <ord>    <dbl>
## 1 1.000000   Ideal     E 18.05547
## 2 1.414214 Premium     E 18.05547
## # ... with 5.394e+04 more rows
Diamonds_quality %>% summarise_if(is.numeric,mean,na.rm = TRUE)
## # A tibble: 1 × 2
##        ID  price
##     <dbl>  <dbl>
## 1 26970.5 3932.8

*_at()

The last two functions summarise_at(.tbl,.cols,.funs,...) and mutate_at(.tbl,.cols,.funs,...) apply a function to the selection of columns specified in .cols. You can use the same specifications as in select().

Why dplyr?

What we need dplyr for?

  • designed for tables
  • elegant
  • fast
  • able to process large tables
  • has grouped operations

Grouped operations

Grouped operations (1)

Basic verbs are quite useful by themselves but they do not really add a novel functionality to basic R (see example above). However they become a very powerful tool when used in so called grouped operations.

dplyr allows user to group observations to mutually exclusive groups and perform operations separately for each group.

Grouped operations (2)

Groups can be defined using function group_by():

group_by(.data, ..., add = FALSE)
  • .data – a table
  • ... – one or more variables to group by. All tbls accept variable names, some will also accept functions of variables. Duplicated groups will be silently dropped.
  • add – By default, when add = FALSE, group_by will override existing groups. To instead add to the existing groups, use add = TRUE

Grouped operations (3)

One grouping variable:

Diamonds_quality %>% 
    group_by(cut) %>% print(n=4)
## Source: local data frame [53,940 x 4]
## Groups: cut [5]
## 
##      ID     cut color price
##   <int>   <ord> <ord> <int>
## 1     1   Ideal     E   326
## 2     2 Premium     E   326
## 3     3    Good     E   327
## 4     4 Premium     I   334
## # ... with 5.394e+04 more rows

Grouped operations (4)

More grouping variables:

Diamonds_quality %>% 
    group_by(cut,color) %>% print(n=4)
## Source: local data frame [53,940 x 4]
## Groups: cut, color [35]
## 
##      ID     cut color price
##   <int>   <ord> <ord> <int>
## 1     1   Ideal     E   326
## 2     2 Premium     E   326
## 3     3    Good     E   327
## 4     4 Premium     I   334
## # ... with 5.394e+04 more rows

Grouped operations (5): deep down

Diamonds_quality is a simple tibble:

Diamonds_quality %>% class
## [1] "tbl_df"     "tbl"        "data.frame"

group_by() adds a class grouped_df:

Diamonds_quality %>% group_by(cut) %>% class
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

Deep down it creates a set of additional attributes which allows compatibile methods to perform "grouped operations".

Using grouped operations (1)

Mirror, mirror,… tell me which the average price of stones for each cut and how many stones for each cut we have in sample.

Diamonds_quality %>% group_by(cut) %>% 
    summarise(mean_price = mean(price, na.rm=T),
              nobs = n())
## # A tibble: 5 × 3
##         cut mean_price  nobs
##       <ord>      <dbl> <int>
## 1      Fair   4358.758  1610
## 2      Good   3928.864  4906
## 3 Very Good   3981.760 12082
## 4   Premium   4584.258 13791
## 5     Ideal   3457.542 21551

Grouping variables are preserved in resulting table!

Using grouped operations (2)

What about deviation of the price from mean…

Diamonds_quality %>% group_by(cut) %>% 
    mutate(
        mean_price_dev = price - mean(price, na.rm=T)
    ) %>% print(n=5)
## Source: local data frame [53,940 x 5]
## Groups: cut [5]
## 
##      ID     cut color price mean_price_dev
##   <int>   <ord> <ord> <int>          <dbl>
## 1     1   Ideal     E   326      -3131.542
## 2     2 Premium     E   326      -4258.258
## 3     3    Good     E   327      -3601.864
## 4     4 Premium     I   334      -4250.258
## 5     5    Good     J   335      -3593.864
## # ... with 5.394e+04 more rows

Using grouped operations (3)

Recall an example from tidyr slides:

POP_explicit %>% mutate(population_filled = population) %>% 
fill(population_filled, .direction = "down")
## # A tibble: 12 × 4
##    country  year population population_filled
##      <chr> <int>      <int>             <int>
## 1   Norway  1995    4359788           4359788
## 2   Norway  1996         NA           4359788
## 3   Norway  1997         NA           4359788
## 4   Norway  1998    4440109           4440109
## 5   Norway  1999         NA           4440109
## 6   Norway  2000    4491572           4491572
## 7   Sweden  1995         NA           4491572
## 8   Sweden  1996    8849420           8849420
## 9   Sweden  1997    8859106           8859106
## 10  Sweden  1998    8861204           8861204
## 11  Sweden  1999         NA           8861204
## 12  Sweden  2000    8872284           8872284

Using grouped operations (4)

Sweden has silently inherited last value from Norway.

Suggest a fix!

Using grouped operations (5)

POP_explicit %>% mutate(population_filled = population) %>% 
group_by(country) %>% fill(population_filled, .direction = "down")
## Source: local data frame [12 x 4]
## Groups: country [2]
## 
##    country  year population population_filled
##      <chr> <int>      <int>             <int>
## 1   Norway  1995    4359788           4359788
## 2   Norway  1996         NA           4359788
## 3   Norway  1997         NA           4359788
## 4   Norway  1998    4440109           4440109
## 5   Norway  1999         NA           4440109
## 6   Norway  2000    4491572           4491572
## 7   Sweden  1995         NA                NA
## 8   Sweden  1996    8849420           8849420
## 9   Sweden  1997    8859106           8859106
## 10  Sweden  1998    8861204           8861204
## 11  Sweden  1999         NA           8861204
## 12  Sweden  2000    8872284           8872284

rowwise()

There is a special case of grouping when each group consist of single observation (row). Such grouping can be arrange by a rowwise() function.

ungroup()

group_by() adds a new attribute to a tibble which is a permanent part of the object. It can be changed by:

  • another group_by(add = FALSE) call
  • ungroup() function which removes grouping attributes (result is a tibble)
  • as.data.frame() which results in ordinary data.frame
  • …and pretty any way which would modify grouped_df attribute

Two-tables verbs

  • Mutating joins
  • Filtering joins
  • Bindings

Two-tables verbs

Two-tables verbs are functions with two (or more in the case of bindigs) input tables. Their purpose is most often to join two tables together.

You can join multiple tables using two-tables verbs and Reduce(). For details see handout. (I think I have done that three times at most.)

Data

We will demonstrate joins using two small tables:

print(tabA)
## # A tibble: 3 × 2
##      x1    x2
##   <chr> <dbl>
## 1     A     1
## 2     B     2
## 3     C     3
print(tabB)
## # A tibble: 3 × 2
##      x1    x3
##   <chr> <lgl>
## 1     A  TRUE
## 2     B FALSE
## 3     D  TRUE

Mutating joins (1)

Mutating joins are joins that returns columns from both data sets x and y.

Mutating joins (2)

A call of a join function() X:

X(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
  • x and y – tables to join
  • by – a character vector of variables to join by. If NULL, the default, join will do a natural join, using all variables with common names across the two tables. A message lists the variables so that you can check they're right (to suppress the message, simply explicitly list the variables that you want to join). To join by different variables on x and y use a named vector. For example, by = c("a" = "b") will match x.a to y.b.
  • suffix – If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to diambiguate them.

Mutating joins: left_join()

left_join(): return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

left_join(tabA,tabB)
## # A tibble: 3 × 3
##      x1    x2    x3
##   <chr> <dbl> <lgl>
## 1     A     1  TRUE
## 2     B     2 FALSE
## 3     C     3    NA

Mutating joins: right_join()

right_join(): return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

right_join(tabA,tabB)
## # A tibble: 3 × 3
##      x1    x2    x3
##   <chr> <dbl> <lgl>
## 1     A     1  TRUE
## 2     B     2 FALSE
## 3     D    NA  TRUE

Mutating joins: inner_join()

inner_join(): return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

inner_join(tabA,tabB)
## # A tibble: 2 × 3
##      x1    x2    x3
##   <chr> <dbl> <lgl>
## 1     A     1  TRUE
## 2     B     2 FALSE

Mutating joins: full_join()

full_join(): return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

full_join(tabA,tabB)
## # A tibble: 4 × 3
##      x1    x2    x3
##   <chr> <dbl> <lgl>
## 1     A     1  TRUE
## 2     B     2 FALSE
## 3     C     3    NA
## 4     D    NA  TRUE

Beware of bugs…

print(tabA)
## # A tibble: 3 × 2
##      x1    x2
##   <chr> <dbl>
## 1     A     1
## 2     B     2
## 3     C     3
print(tabBb)
## # A tibble: 3 × 2
##      x1    x3
##   <chr> <lgl>
## 1     A  TRUE
## 2     B FALSE
## 3     B  TRUE

Beware of bugs

left_join(tabA,tabBb)
## # A tibble: 4 × 3
##      x1    x2    x3
##   <chr> <dbl> <lgl>
## 1     A     1  TRUE
## 2     B     2 FALSE
## 3     B     2  TRUE
## 4     C     3    NA

ID in x1 is not an unique identification of an obervation anymore! No warnings. No errors. No messages.

Filtering joins: semi_join()

Filtering joins return columns only from x:

semi_join(): return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.

semi_join(tabA,tabB)
## # A tibble: 2 × 2
##      x1    x2
##   <chr> <dbl>
## 1     A     1
## 2     B     2

Filtering joins: anti_join()

anti_join(): return all rows from x where there are not matching values in y, keeping just columns from x.

anti_join(tabA,tabB)
## # A tibble: 1 × 2
##      x1    x2
##   <chr> <dbl>
## 1     C     3

Binding

bind_rows(..., .id = NULL)
bind_cols(...)
  • ... Data frames to combine.

Each argument can either be a data frame, a list that could be a data frame, or a list of data frames.

When column-binding, rows are matched by position, not value so all data frames must have the same number of rows. To match by value, not position, see left_join etc. When row-binding, columns are matched by name, and any values that don't match will be filled with NA.

Exercises

Exercise 1: OECD data

File OECD_STLABOUR_monthly.csv contains seasonally adjusted monthly unemployment rates for numerous population groups in OECD countries. We would like to get yearly unemployment rates for total population 15+ (series LRUNTTTT), males (LRUNTTMA), and females (LRUNTTFE).

Exercise 1: Load it and see it

read_csv("data/OECD_STLABOUR_monthly.csv") -> oecd
print(oecd, n=2)
## # A tibble: 18,210 × 19
##   `"LOCATION"`   Country  SUBJECT                                Subject
##          <chr>     <chr>    <chr>                                  <chr>
## 1          AUS Australia LRUN24FE Unemployment rate, Aged 15-24, Females
## 2          AUS Australia LRUN24FE Unemployment rate, Aged 15-24, Females
## # ... with 1.821e+04 more rows, and 15 more variables: MEASURE <chr>,
## #   Measure <chr>, FREQUENCY <chr>, Frequency <chr>, TIME <chr>,
## #   Time <chr>, `Unit Code` <chr>, Unit <chr>, `PowerCode Code` <int>,
## #   PowerCode <chr>, `Reference Period Code` <chr>, `Reference
## #   Period` <chr>, Value <dbl>, `Flag Codes` <chr>, Flags <chr>

Exercise 1: Select relevant columns

OECD data comes with a lot of information. In this case we need just country code, variable code, value, and time.

oecd %<>% 
  select(country = 1, 
         variable = SUBJECT, 
         value = Value, 
         time = Time) 
print(oecd, n=3)
## # A tibble: 18,210 × 4
##   country variable    value     time
##     <chr>    <chr>    <dbl>    <chr>
## 1     AUS LRUN24FE 12.16585 Jan-2000
## 2     AUS LRUN24FE 12.43789 Feb-2000
## 3     AUS LRUN24FE 11.52670 Mar-2000
## # ... with 1.821e+04 more rows

Exercise 1: Filter relevant rows

Data set contains 1 different unemployment rates. We are interested only in three of them:

oecd %<>% 
  filter(variable %in% c("LRUNTTTT","LRUNTTMA","LRUNTTFE"))
print(oecd, n=5)
## # A tibble: 3,954 × 4
##   country variable    value     time
##     <chr>    <chr>    <dbl>    <chr>
## 1     AUS LRUNTTFE 6.670316 Jan-2000
## 2     AUS LRUNTTFE 6.740396 Feb-2000
## 3     AUS LRUNTTFE 6.576782 Mar-2000
## 4     AUS LRUNTTFE 6.341064 Apr-2000
## 5     AUS LRUNTTFE 6.259398 May-2000
## # ... with 3,949 more rows

Exercise 1: Some use of tidyr (1)

Now it is time to make it tidy:

oecd %<>% 
  spread(variable,value)
print(oecd, n=5)
## # A tibble: 1,320 × 5
##   country     time LRUNTTFE LRUNTTMA LRUNTTTT
## *   <chr>    <chr>    <dbl>    <dbl>    <dbl>
## 1     AUS Apr-2000 6.341064 6.400338 6.374269
## 2     AUS Apr-2001 6.509435 7.016913 6.791216
## 3     AUS Apr-2002 5.889289 6.709898 6.347627
## 4     AUS Apr-2003 5.850319 6.171773 6.027655
## 5     AUS Apr-2004 5.580715 5.417770 5.490434
## # ... with 1,315 more rows

Exercise 1: Some use of tidyr (2)

We want to aggregated monthly observations. For the purpose of grouping it useful to store year and month in different columns:

oecd %<>% 
  separate(time, c("month","year"), sep="-", convert = TRUE)
print(oecd, n=5)
## # A tibble: 1,320 × 6
##   country month  year LRUNTTFE LRUNTTMA LRUNTTTT
## *   <chr> <chr> <int>    <dbl>    <dbl>    <dbl>
## 1     AUS   Apr  2000 6.341064 6.400338 6.374269
## 2     AUS   Apr  2001 6.509435 7.016913 6.791216
## 3     AUS   Apr  2002 5.889289 6.709898 6.347627
## 4     AUS   Apr  2003 5.850319 6.171773 6.027655
## 5     AUS   Apr  2004 5.580715 5.417770 5.490434
## # ... with 1,315 more rows

Exercise 1: Do the aggregation (1)

Now we can take averages for years:

  1. We want to get an average for a country-year pairs. Therefore we need to make groups along these variables.
  2. And we want to take averages only for numeric columns.
oecd %<>% 
  group_by(country,year) %>% 
  summarise_if(is.numeric, mean, na.rm=T)

Exercise 1: Do the aggregation (2)

print(oecd, n=9)
## Source: local data frame [110 x 5]
## Groups: country [?]
## 
##   country  year LRUNTTFE LRUNTTMA LRUNTTTT
##     <chr> <int>    <dbl>    <dbl>    <dbl>
## 1     AUS  2000 6.030742 6.465302 6.273984
## 2     AUS  2001 6.455428 7.007580 6.763034
## 3     AUS  2002 6.148189 6.534425 6.363390
## 4     AUS  2003 5.960969 5.894895 5.924274
## 5     AUS  2004 5.515620 5.290276 5.390753
## 6     AUS  2005 5.198657 4.887887 5.027679
## 7     AUS  2006 4.899717 4.665082 4.771069
## 8     AUS  2007 4.756301 4.041257 4.364758
## 9     AUS  2008 4.565651 3.965455 4.237747
## # ... with 101 more rows

Exercise 2: WIID data

Inequality is hot topic. Because… you know… it is huge and growing beyond that. If you want to let the data speak you will most likely need WIID data set.

The WIID database in 3.3 version is available in WIID_ver3_sept15_0.tsv (tab separated values).

read_tsv("data/WIID_ver3_sept15_0.tsv") -> wiid

Exercise 2: Select relevant columns

WIID aggregates many data sets on income inequality with many indicators. We will focus on Gini coefficient (Gini). Each observation is rated according its quality (Quality) from High (best) to Not known (worst). To make our lives easier we will restrain the data set on country and year identifiers and Gini and Quality ratings.

wiid %<>% select(
  Country = Countrycode3, # ISO alpha3 country codes
  Year = Year,
  gini = Gini,
  quality = Quality
)

Exercise 2: See some data

Let's take a look at Australia in 1978

wiid %>% 
  filter(
    Country == "AUS",
    Year == 1978
  )
## # A tibble: 3 × 4
##   Country  Year  gini quality
##     <chr> <int> <dbl>   <chr>
## 1     AUS  1978  38.1     Low
## 2     AUS  1978  26.1 Average
## 3     AUS  1978  18.0 Average

Exercise 2: See the problem

…and it is apparent that we have a problem. There are multiple observations for country-year pairs and even for country-year-quality.

Researcher might want to use the best quality available for each country-year pair and if there are more of the same quality then take an average.

Exercise 2: Make averages

In the first step we will create averages for unique country-year-quality combinations:

wiid %<>% 
  group_by(Country,Year,quality) %>% 
  summarise(
    gini = mean(gini, na.rm = TRUE)
  ) %>% ungroup

See…

wiid %>%  filter(Country == "AUS", Year == 1978)
## # A tibble: 2 × 4
##   Country  Year quality  gini
##     <chr> <int>   <chr> <dbl>
## 1     AUS  1978 Average 22.05
## 2     AUS  1978     Low 38.10

Exercise 2: Pick the best obs. (1)

In the second step we would like to choose the best observation available. The quality has following values:

wiid %>% distinct(quality)
## # A tibble: 4 × 1
##     quality
##       <chr>
## 1      High
## 2   Average
## 3       Low
## 4 Not known

The strategy is following. Order rows for each country-year pair by quality (best goes first,…) and get only the first row!

Exercise 2: Pick the best obs. (2a)

Easy to say, harder to do. The best option is to convert character vector quality into ordered factor.

# Before:
wiid %>% distinct(quality) %>% arrange(quality)
## # A tibble: 4 × 1
##     quality
##       <chr>
## 1   Average
## 2      High
## 3       Low
## 4 Not known

Exercise 2: Pick the best obs. (2b)

# Converting into factor:
wiid$quality %<>% factor(levels = c("High","Average","Low","Not known"))
# After:
wiid %>% distinct(quality) %>% arrange(quality)
## # A tibble: 4 × 1
##     quality
##      <fctr>
## 1      High
## 2   Average
## 3       Low
## 4 Not known

Exercise 2: Pick the best obs. (3)

Now we can do the rest:

wiid %>% 
  # at first it is wise to remove rows with NAs
  filter(!is.na(gini)) %>% 
  # we are looking for the best observation for country-year pair
  group_by(Country,Year) %>% 
  # order by quality
  arrange(quality) %>% 
  # and now we take the first row
  slice(1L) -> wiid_0

Exercise 2: See the results

print(wiid_0, n = 6)
## Source: local data frame [2,921 x 4]
## Groups: Country, Year [2,921]
## 
##   Country  Year quality     gini
##     <chr> <int>  <fctr>    <dbl>
## 1     AFG  2007    High 29.00000
## 2     AFG  2008 Average 27.80000
## 3     AGO  2000 Average 58.60000
## 4     AGO  2009    High 46.66667
## 5     ALB  1996 Average 28.00000
## 6     ALB  1997 Average 29.10000
## # ... with 2,915 more rows

There is an alternative way of doing the last step without grouping. See handout.

Exercise 3: Join OECD and WIID (1)

One may wonder if there is a correlation between unemployment and income inequality. In order to investigate it we would need to join WIID data on inequality and OECD data on unemployment.

Exercise 3: Join OECD and WIID (2)

wiid_0 %>% 
  select(-quality) %>% 
  left_join(oecd,., 
            by = c("country"="Country",
                   "year"="Year")) -> oecd_wiid
  1. We don't need information on quality therefore we can drop it
  2. Notice that joins are sensitive to inputs order – you can place "." on the position where table going through the pipe should be.
  3. The names of key columns are not identical. Therefore we need to use named vector describing matching rules.

Exercise 3: See the results

print(oecd_wiid, n=6)
## Source: local data frame [110 x 6]
## Groups: country [?]
## 
##   country  year LRUNTTFE LRUNTTMA LRUNTTTT   gini
##     <chr> <int>    <dbl>    <dbl>    <dbl>  <dbl>
## 1     AUS  2000 6.030742 6.465302 6.273984 38.775
## 2     AUS  2001 6.455428 7.007580 6.763034 31.100
## 3     AUS  2002 6.148189 6.534425 6.363390 30.900
## 4     AUS  2003 5.960969 5.894895 5.924274 31.200
## 5     AUS  2004 5.515620 5.290276 5.390753 35.800
## 6     AUS  2005 5.198657 4.887887 5.027679     NA
## # ... with 104 more rows

We have successfully joined the data and now we can compute correlation between Gini and LRUNTTTT. It is 0.32.

Homework

Homework (1)

Czech Statistical Office publishes detailed statistics on population changes at municipal level.

The database is divided into multiple tables in the same structure (one for a county/"okres"). Each table has following structure:

## # A tibble: 135 × 19
##    year city_id      name   est pop_b   nat  mort immig  emig pch_nat
##   <chr>   <chr>     <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1  1971  210001 Běchovice     -  1859    17    35    75    97     -18
## 2  1972  210001 Běchovice     -  1819    26    26   103    63      NA
## 3  1973  210001 Běchovice     -  1859    26    28    59    75      -2
## 4  1971  210002    Benice     -   336     9     4     6    15       5
## # ... with 131 more rows, and 9 more variables: pch_mig <dbl>,
## #   pch_tot <dbl>, area_ch1 <dbl>, pop_e <dbl>, area_ch2 <dbl>,
## #   from <chr>, from_name <chr>, where <chr>, where_name <chr>

Homework (2)

For the purpose of the homework are important following columns:

  • year – year
  • city_id – ID of the city
  • pop_b – population at the beginning of the year
  • pch_mig – net absolute migration (number of persons)

Data are available in the file HW_CZSO_migration.Rdata in the variable CZSO which is a list of individual tables (data frames).

Homework (3)

You are supposed to:

  • join list of data frames into one big table (data frame)
  • compute net migration rate (to column mrate) as 100*pch_mig/pop_b
  • compute average migration rate (to column amrate) and average population size (from pop_b) for each city_id for the period 1991–2010 (incl. margin years), ignore missing values

Assign your solution to the variable czdata.

Homework (4)

Your solution should look like the following table:

## # A tibble: 6,283 × 3
##    city_id    amrate      apop
##      <chr>     <dbl>     <dbl>
## 1   564982 14.719198 290.50000
## 2   546518 13.916943  62.94737
## 3   564907 13.783738 170.35000
## 4   517321 13.475916 207.25000
## 5   572870 11.828373 158.80000
## 6   583464 11.291534 105.60000
## 7   566098 10.949050 201.40000
## 8   538264 10.663840 536.90000
## 9   564915 10.131669  78.35000
## 10  537543  9.197432  62.50000
## # ... with 6,273 more rows

Extra seminar

Extra seminar


  • Tuesday 15/11/16 at 14:35 in VT204
  • building a table from scratch
  • voluntary (obligatory for some of you)