Data transformation with dplyr:: Cheatsheet
dplyr functions work with pipes and expect tidy data. In tidy data:
nnn
III
nnn
Each variable is in Each observation, or its own column       case, is in its own row
pipes
x%>%f(y) becomes f(x, y)
Summarise Cases
Apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back).
S summary function
summarise(.data, •••) Compute table of summaries.
summarise(mtcars, avg = mean(mpg))
count).data,wt = NULL, sort = FALSE, name-NULL) Count number of rows in each group defined by the variables in ... Also tally)).
count(mtcars, cyl)
Group Cases
Use group_by(.data,..., .add = FALSE, .drop = TRUE) to create a "grouped" copy of a table grouped by columns in ... dplyr functions will manipulate each "group" separately and combine the results.
mtcars %>% group_by(cyl) %>% summarise(avg= mean(mpg))
Use rowwise).data,...) to group data into individual rows, dplyr functions will compute results for each row. Also apply functions to list-columns. See tidyr cheat sheet for list-column workflow.
■■■■ starwars%>%
->     - -> BBBB     rowwiseO %>%
mutate(film_count = length(films))
ungroup(x,...) Returns ungrouped copy of table.
ungroup(g_mtcars)
Qstud
10
Manipulate Cases
EXTRACT CASES
Row functions return a subset of rows as a new table.
filter).data,.preserve = FALSE) Extract rows that meet logical criteria.
filter(mtcars, mpg>20)
distinct).data,.keep_all= FALSE) Remove rows with duplicate values.
distinct(mtcars, gear)
preserve = FALSE) Select rows
slice).data, by position
slicefmtcars, 10:15'
slice sample).data,..., n, prop, weight_by = NULL, replace = FALSE) Randomly select rows. Use n to select a number of rows and prop to select a fraction of rows.
slice_sample(mtcars, n = 5, replace = TRUE)
slice_min(.data, order_by,n, prop, with_ties = TRUE) and slice_max() Select rows with the lowest and highest values.
slice_min(mtcars, mpg, prop = 0.25)
slice head).data,n, prop) and slice tail))
Select the first or last rows.
slice headfmtcars, n = 5)
Logical and boolean operators to use with filter))
==        < <=        is.naf)   %in% |
!= > >=        !is.na()   ! &
See ?base::Logic and ?Comparison for help.
xor()
ARRANGE CASES
arrange).data,.by_group= FALSE) Order rows by values of a column or columns (low to high), use with desc() to order from high to low.
arrange(mtcars, mpg) arrange(mtcars, desc(mpg))
ADD CASES
add_row(.data,.before = NULL, .after = Add one or more rows to a table.
add_row(cars, speed = 1, dist = 1)
NULL)
Manipulate Variables
EXTRACT VARIABLES
Column functions return a set of columns as a new vector or table.
pull).data, var = -1, name = NULL,...) Extract lB -►B       column values as a vector, by name or index.
pull(mtcars, wt)
" select).data,...) Extract columns as a table.
select(mtcars, mpg, wt)
■■■ ■
BBB   BBB relocate(.data,.before = NULL, .after = NULL) I Move columns to new position.
j relocate(mtcars, mpg, cyl, .after = last_col())
Use these helpers with select)) and across))
e.g. select(mtcars, mpg:cyl)
contains(match)      num_range(prefix, range) :, e.g. mpgxyl
ends_with(match)   all_of(x)/any_of(x,vars) -,e.g,-gear
starts_with(match) matches(match) everything))
MANIPULATE MULTIPLE VARIABLES AT ONCE
■■■   ■■■  across).cols, .funs,.names = NULL) Summarise I   ■■■  or mutate multiple columns in the same way.
summarise(mtcars, across(everything(), mean))
c_across(.cols) Compute across columns in row-wise data.
transmute(rowwise(UKgas), total = sum(c_across(l:2))
MAKE NEW VARIABLES
Apply vectorized functions to columns. Vectorized functions take vectors as input and return vectors of the same length as output (see back).
S vectorized function
lBB  mutate).data,.keep= "all", .before = NULL, IB-fr-BI    I  .after = NULL) Compute new column(s). Also
| add column)), add count)), and add tally)).
mutate(mtcars, gpm = 1 / mpg)
transmute).data,...) Compute new column(s), drop others.
transmute(mtcars, gpm = 1 / mpg)
rename).data,...) Rename columns. Use rename_with() to rename with a function.
renamefcars, distance = dist)
RStudio8isa trademark of RStudio, PBC • CCBYSA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn moreatdplyr.tidyverse.org • dplyr 1.0.7 • Updated: 2021-07
Vectorized Functions      Summary Functions      Combine Tables
TO USE WITH MUTATE ()
mutate)) and transmute)) apply vectorized functions to columns to create new columns. Vectorized functions take vectors as input and return vectors of the same length as output.
S vectorized function
OFFSET
dplyr::lag() - offset elements by 1 dplyr::lead() - offset elements by -1
CUMULATIVE AGGREGATE
dplyr::cumall() - cumulative all() dplyr::cumany() - cumulative any()
cummaxO - cumulative max() dplyr::cummean() - cumulative mean()
cummin)) - cumulative min()
cumprod() - cumulative prod()
cumsum() - cumulative sum()
RANKING
dplyr::cume_dist() - proportion of all values <= dplyr::dense_rank() - rank w ties = min, no gaps dplyr::min_rank() - rank with ties= min dplyr::ntile() - bins into n bins dplyr::percent_rank() - min_rank scaled to [0,1] dplyr::row_number() - rank with ties = "first"
MATH
+.-.*. /> A> %/%>%%" arithmetic ops
log(), Iog2(),logl0()- logs
<, <=, >, >=, !=, == - logical comparisons dplyr::between()-x>= left & x <= right dplyr::near() - safe == for floating point numbers
MISCELLANEOUS
dplyr::case_when() - multi-case if_else()
starwars%>% m utate(type = case_wh en ( height > 200 | mass > 200 ~ "large", species == "Droid" -"robot", TRUE -"other") )
dplyr::coalesce() - first non-NA values by
element across a set of vectors dplyr::if_else() - element-wise if() + else() dplyr::na_if() - replace specific values with NA pmax() - element-wise max() pmin() - element-wise min()
Qstud
TO USE WITH SUMMARISE ()
summarise)) applies summary functions to columns to create a new table. Summary functions take vectors as input and return single values as output.
S summary function
COUNT
plyr::n() - number of values/rows plyr::n_distinct() - # of uniques sum(!is.na()) - # of non-NA's
POSITION
mean() - mean, also mean(!is.na()) median)) - median
LOGICAL
mean)) - proportion of TRUE's sum()-#ofTRUE's
ORDER
plyr::first() - first value plyr::last() - last value plyr::nth() - value in nth location of vector
RANK
quantile() - nth quantile min() - minimum value max() - maximum value
SPREAD
IQR() - Inter-Quartile Range mad() - median absolute deviation sd() - standard deviation var() - variance
Row Names
Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.
)ble::rownames_to_column()
1 a t    i a i   Move row names into col.
2 b u    2 b u  a <- rownames_to_column(mtcars, var= "C")
3 c
le::column_to_rownames()
2 b u "*u 2 b  Move col into row names.
3 c v   v 3 c  column_to_rownames(a,var= "C")
Also tibble::has_rownames() and tibble::remove_rownames().
COMBINE VARIABLES
x y
□□B
a  t 1
ana
a  t 3
□BBBBB
a  t  1 a  t 3
bind_cols(..., .name_repair) Returns tables placed side by side as a single table. Column lengths must be equal. Columns will NOT be matched by id (to do that look at Relational Data below), so be sure to check that both tables are ordered the way you want before binding.
RELATIONAL DATA
Use a "Mutating Join" to join one table to columns from another, matching values with the rows that they correspond to. Each join retains a different combination of values from the tables.
□BBB leftjoin(x,y, by = NULL, copy = FALSE, Ull  suffix = c(".x", ".y"),..., keep = FALSE, c v 3 na na_matched = "na") Join matching values from y to x.
IBBB rightJoin(x, y, by = NULL, copy = FALSE, I lul I  suffix = c(".x", ".y"),..., keep = FALSE, d w na i   na_matches= "na") Join matching values from x to y.
□□BE] inner_join(x,y, by = NULL, copy = FALSE, III I  suffix = c(".x", ".y"),..., keep = FALSE,
na_matches= "na") Join data. Retain
only rows with matches.
□□BE] fullJoin(x,y, by = NULL, copy = FALSE,
III I suffix = c(".x", ".y"),..., keep = FALSE,
c „ 3 na na_matches= "na") Join data. Retain all
d w na i values, all rows.
COLUMN MATCHING FOR JOINS
Use by = c("coll", "col2",...) to
^2^2    specify one or more common c v 3 nana   columns to match on.
left_join(x, y, by= "A")
J  Use a named vector, by = c("coll" = b u 2 b u   "col2"), to match on columns that c v 3 a t   have different names in each table.
left_join(x, y, by = c("C" = "D"))
I  Use suffix to specify the suffix to a t i d w   give to unmatched columns that
have the same name in both tables.
left_join(x, y, by = c("C" = "D"), suffix = c("l", "2"))
2 b u
3 a t
COMBINE CASES
DOB
BOB
c v 3
EBEBB
bind_rows(..., .id = NULL) Returns tables one on top of the other as a single table. Set .id to a column name to add a column of the original table names (as pictured).
Use a "Filtering Join" to filter one table against the rows of another.
x y
□□B □□□
all . a t 3 _ b  u  2    T     b   u 2 —
c v  3 d w 1
□BE semi join(x, y, by = NULL, copy = FALSE, ° 1 2       na_matches= "na") Return rowsofx
that have a match in y. Use to see what
will be included in a join.
□BE antijoin(x, y, by= NULL, copy= FALSE, c v 3       na_matches= "na") Return rowsofx that do not have a match in y. Use to see what will not be included in a join.
Use a "Nest Join" to inner join one table to another into a nested data frame.
! nestjoin(x,y, by = NULL, copy = FALSE, keep = FALSE, name = NULL,...) Join data, nesting matches from y in a single new data frame column.
SET OPERATIONS
□BE   intersect(x,y,...)
cv3   Rows that appear in both x and y. ^
□BE   setdiff(x,y,...)
a t i   Rows that appear in x but not y.
b  u 2
□BE union(x,y,...)
a t i Rows that appear in x or y.
b " 2 (Duplicates removed), union all()
d «, 4 retains duplicates.
Use setequal() to test whether two data sets contain the exact same rows (in any order).
10
RStudio8isa trademark of RStudio, PBC • CCBYSA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn moreatdplyr.tidyverse.org • dplyr 1.0.7 • Updated: 2021-07