Data Transformation with dplyr: Cheatsheet
dplyr functions work with pipes and expect tidy data. In tidy data:
Each variable is in its own column.
Each observation, or case, is in its own row.
Pipes: x %>% f(y) becomes f(x, y).
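A minimal sketch of the pipe rule, assuming the dplyr package is installed and using the built-in mtcars data set. The names piped and direct are illustrative only.

```r
library(dplyr)

# The pipe passes its left-hand side as the first argument of the call on its right,
# so these two expressions describe the same computation.
piped  <- mtcars %>% summarise(avg = mean(mpg))
direct <- summarise(mtcars, avg = mean(mpg))

identical(piped, direct)
```

Pipes mostly pay off in longer chains, where each step reads left to right instead of inside out.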
Summarise Cases
These apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back).
summarise(.data, ...) Compute table of summaries.
summarise(mtcars, avg = mean(mpg))
count(x, ..., wt = NULL, sort = FALSE) Count number of rows in each group defined by the variables in ... Also tally().
count(iris, Species)
VARIATIONS
summarise_all() - Apply funs to every column.
summarise_at() - Apply funs to specific columns.
summarise_if() - Apply funs to all cols of one type.
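As an illustration of the summarising functions above, here is a small sketch using the built-in mtcars and iris data sets. The result names stats and species_counts are hypothetical.

```r
library(dplyr)

# One-row table holding two summary statistics of mtcars
stats <- summarise(mtcars, avg = mean(mpg), n_cars = n())

# One row per species, with the row count in a column named n
species_counts <- count(iris, Species)
```

Each summary function collapses a whole column to a single value, which is why stats has exactly one row.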
Group Cases
Use group_by() to create a "grouped" copy of a table. dplyr functions will then manipulate each "group" separately and combine the results.
mtcars %>% group_by(cyl) %>% summarise(avg = mean(mpg))
group_by(.data, ..., add = FALSE) Returns copy of table grouped by ...
g_iris <- group_by(iris, Species)
ungroup(x, ...) Returns ungrouped copy of table.
ungroup(g_iris)
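The grouping workflow can be sketched as follows, using the built-in mtcars and iris data sets. The names by_cyl and grouped are illustrative; is_grouped_df() is a dplyr helper used here only to inspect the result.

```r
library(dplyr)

# Summaries are computed once per group: one row for each value of cyl
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg = mean(mpg), n = n())

# Grouping is stored on the table itself and can be removed again
grouped <- group_by(iris, Species)
is_grouped_df(grouped)            # TRUE
is_grouped_df(ungroup(grouped))   # FALSE
```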
Manipulate Cases
EXTRACT CASES
Row functions return a subset of rows as a new table.
filter(.data, ...) Extract rows that meet logical criteria.
filter(iris, Sepal.Length > 7)
distinct(.data, ..., .keep_all = FALSE) Remove rows with duplicate values.
distinct(iris, Species)
sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = parent.frame()) Randomly select fraction of rows.
sample_frac(iris, 0.5, replace = TRUE)
sample_n(tbl, size, replace = FALSE, weight = NULL, .env = parent.frame()) Randomly select size rows.
sample_n(iris, 10, replace = TRUE)
slice(.data, ...) Select rows by position.
slice(iris, 10:15)
top_n(x, n, wt) Select and order top n entries (by group if grouped data).
top_n(iris, 5, Sepal.Width)
Logical and boolean operators to use with filter()
<    <=    is.na()     %in%    |    xor()
>    >=    !is.na()    !       &
See ?base::Logic and ?Comparison for help.
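A short sketch of how these operators combine inside filter(). The built-in airquality data set is not part of the cheatsheet; it is used here only because it contains missing values. All result names are illustrative.

```r
library(dplyr)

# & requires both conditions to hold
long_wide <- filter(iris, Sepal.Length > 7 & Sepal.Width > 3)

# %in% tests membership in a set of values
two_species <- filter(iris, Species %in% c("setosa", "virginica"))

# !is.na() keeps only rows where the value is present
complete_ozone <- filter(airquality, !is.na(Ozone))
```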
ARRANGE CASES
arrange(.data, ...) Order rows by values of a column or columns (low to high). Use with desc() to order from high to low.
arrange(mtcars, mpg)
arrange(mtcars, desc(mpg))
ADD CASES
add_row(.data, ..., .before = NULL, .after = NULL) Add one or more rows to a table.
add_row(faithful, eruptions = 1, waiting = 1)
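The two operations above might be used like this. Note that add_row() comes from the tibble package, so the sketch assumes both dplyr and tibble are installed. The names fastest and extended are hypothetical.

```r
library(dplyr)
library(tibble)  # add_row() is provided by tibble

# Highest mpg first
fastest <- arrange(mtcars, desc(mpg))

# Append one row to the end of the built-in faithful data set
extended <- add_row(faithful, eruptions = 1, waiting = 1)
```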
Manipulate Variables
EXTRACT VARIABLES
Column functions return a set of columns as a new vector or table.
pull(.data, var = -1) Extract column values as a vector. Choose by name or index.
pull(iris, Sepal.Length)
select(.data, ...) Extract columns as a table. Also select_if().
select(iris, Sepal.Length, Species)
Use these helpers with select(),
e.g. select(iris, starts_with("Sepal"))
contains(match)
ends_with(match)
matches(match)
num_range(prefix, range)
one_of(...)
starts_with(match)
:, e.g. mpg:cyl
-, e.g. -Species
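A brief sketch of the column helpers in use, with the built-in iris and mtcars data sets. The result names are illustrative only.

```r
library(dplyr)

sepal_cols  <- select(iris, starts_with("Sepal"))  # columns whose names begin with "Sepal"
width_cols  <- select(iris, ends_with("Width"))    # columns whose names end in "Width"
no_species  <- select(iris, -Species)              # everything except Species
mpg_to_disp <- select(mtcars, mpg:disp)            # a range of adjacent columns

# pull() returns a plain vector rather than a table
sepal_len <- pull(iris, Sepal.Length)
```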
MAKE NEW VARIABLES
These apply vectorized functions to columns. Vectorized funs take vectors as input and return vectors of the same length as output (see back).
mutate(.data, ...) Compute new column(s).
mutate(mtcars, gpm = 1/mpg)
transmute(.data, ...) Compute new column(s), drop others.
transmute(mtcars, gpm = 1/mpg)
mutate_all(.tbl, .funs, ...) Apply funs to every column. Use with funs(). Also mutate_if().
mutate_all(faithful, funs(log(.), log2(.)))
mutate_if(iris, is.numeric, funs(log(.)))
mutate_at(.tbl, .vars, .funs, ...) Apply funs to specific columns. Use with funs(), vars() and the helper functions for select().
mutate_at(iris, vars(-Species), funs(log(.)))
add_column(.data, ..., .before = NULL, .after = NULL) Add new column(s). Also add_count(), add_tally().
add_column(mtcars, new = 1:32)
rename(.data, ...) Rename columns.
rename(iris, Length = Sepal.Length)
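The difference between mutate(), transmute() and rename() can be seen in a small sketch like the one below, which assumes only dplyr and the built-in data sets. The result names are hypothetical.

```r
library(dplyr)

with_gpm <- mutate(mtcars, gpm = 1 / mpg)      # keeps every column and adds gpm
only_gpm <- transmute(mtcars, gpm = 1 / mpg)   # keeps only the new column
renamed  <- rename(iris, Length = Sepal.Length) # new_name = old_name
```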
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 0.7.0 • tibble 1.2.0 • Updated: 2019-
Vector Functions
TO USE WITH MUTATE()
mutate() and transmute() apply vectorized functions to columns to create new columns. Vectorized functions take vectors as input and return vectors of the same length as output.
OFFSETS
dplyr::lag() - Offset elements by 1
dplyr::lead() - Offset elements by -1
CUMULATIVE AGGREGATES
dplyr::cumall() - Cumulative all()
dplyr::cumany() - Cumulative any()
cummax() - Cumulative max()
dplyr::cummean() - Cumulative mean()
cummin() - Cumulative min()
cumprod() - Cumulative prod()
cumsum() - Cumulative sum()
RANKINGS
dplyr::cume_dist() - Proportion of all values <=
dplyr::dense_rank() - rank with ties = min, no gaps
dplyr::min_rank() - rank with ties = min
dplyr::ntile() - bins into n bins
dplyr::percent_rank() - min_rank scaled to [0,1]
dplyr::row_number() - rank with ties = "first"
MATH
+, -, *, /, ^, %/%, %% - arithmetic ops
log(), log2(), log10() - logs
<, <=, >, >=, !=, == - logical comparisons
dplyr::between() - x >= left & x <= right
dplyr::near() - safe == for floating point numbers
MISC
dplyr::case_when() - multi-case if_else()
iris %>% mutate(Species = case_when(
  Species == "versicolor" ~ "versi",
  Species == "virginica" ~ "virgi",
  TRUE ~ as.character(Species)))
dplyr::coalesce() - first non-NA values by element across a set of vectors
dplyr::if_else() - element-wise if() + else()
dplyr::na_if() - replace specific values with NA
pmax() - element-wise max()
pmin() - element-wise min()
dplyr::recode() - Vectorized switch()
dplyr::recode_factor() - Vectorized switch() for factors
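To make the "same length in, same length out" rule concrete, here is a small sketch applying several of the vector functions to a made-up vector x. The values and variable names are illustrative, not taken from the cheatsheet.

```r
library(dplyr)

x <- c(3, 1, 4, 1, 5)

lagged  <- dplyr::lag(x)          # NA 3 1 4 1
led     <- dplyr::lead(x)         # 1 4 1 5 NA
running <- cumsum(x)              # 3 4 8 9 14
ranks   <- min_rank(x)            # 3 1 4 1 5 (ties share the lowest rank)
dense   <- dense_rank(x)          # 2 1 3 1 4 (no gaps after ties)
in_2_4  <- between(x, 2, 4)       # TRUE FALSE TRUE FALSE FALSE
size    <- if_else(x > 2, "big", "small")
filled  <- coalesce(c(1, NA, 3), 0)   # 1 0 3
blanked <- na_if(c(1, -99, 3), -99)   # 1 NA 3
```

Every result has the same length as its input, which is what allows these functions to be used inside mutate().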
Summary Functions
TO USE WITH SUMMARISE()
summarise() applies summary functions to columns to create a new table. Summary functions take vectors as input and return single values as output.
COUNTS
dplyr::n() - number of values/rows
dplyr::n_distinct() - # of uniques
sum(!is.na()) - # of non-NA's
LOCATION
mean() - mean, also mean(!is.na())
median() - median
LOGICALS
mean() - Proportion of TRUE's
sum() - # of TRUE's
POSITION/ORDER
dplyr::first() - first value
dplyr::last() - last value
dplyr::nth() - value in nth location of vector
RANK
quantile() - nth quantile
min() - minimum value
max() - maximum value
SPREAD
IQR() - Inter-Quartile Range
mad() - median absolute deviation
sd() - standard deviation
var() - variance
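Several of these summary functions can be combined in one summarise() call. The sketch below uses the built-in iris data set; the output column names are hypothetical.

```r
library(dplyr)

species_stats <- iris %>%
  group_by(Species) %>%
  summarise(
    n          = n(),                      # rows per group
    n_widths   = n_distinct(Sepal.Width),  # number of unique values
    avg        = mean(Sepal.Length),
    med        = median(Sepal.Length),
    share_long = mean(Sepal.Length > 5),   # proportion of TRUE's
    first_len  = first(Sepal.Length),      # first value within the group
    spread     = sd(Sepal.Length)
  )
```

Each function returns one value per group, so the result has one row for each species.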
Row Names
Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.
rownames_to_column() Move row names into col.
a <- rownames_to_column(iris, var = "C")
column_to_rownames() Move col into row names.
column_to_rownames(a, var = "C")
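A round trip through both functions might look like the sketch below. It uses mtcars rather than iris because mtcars carries meaningful row names (the car models). The column name "model" and the result names are illustrative. Both functions, along with has_rownames(), come from the tibble package.

```r
library(tibble)

# Move the car names out of the row names into an ordinary column
cars_tbl <- rownames_to_column(mtcars, var = "model")

# Move them back again
restored <- column_to_rownames(cars_tbl, var = "model")

has_rownames(cars_tbl)   # FALSE: the names now live in the model column
has_rownames(restored)   # TRUE
```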
Combine Tables
COMBINE VARIABLES
x          y
A B C      A B D
a t 1      a t 3
b u 2      b u 2
c v 3      d w 1
Use bind_cols() to paste tables beside each other as they are.
bind_cols(...) Returns tables placed side by side as a single table.
BE SURE THAT ROWS ALIGN.
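A minimal sketch of bind_cols(), using two made-up tables (ids and scores are hypothetical names). The tables are matched purely by row position, which is why the rows must already line up.

```r
library(dplyr)

ids    <- data.frame(id = 1:3)
scores <- data.frame(score = c(90, 85, 77))

# Row 1 of ids is pasted next to row 1 of scores, and so on; no keys are compared
side_by_side <- bind_cols(ids, scores)
```

When the rows cannot be guaranteed to align, a join on a key column is the safer choice.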
Use a "Mutating Join" to join one table to columns from another, matching values with the rows that they correspond to. Each join retains a different combination of values from the tables.
left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...) Join matching values from y to x.
right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...) Join matching values from x to y.
inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...) Join data. Retain only rows with matches.
full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...) Join data. Retain all values, all rows.
Use by = c("col1", "col2", ...) to specify one or more common columns to match on.
left_join(x, y, by = "A")
Use a named vector, by = c("col1" = "col2"), to match on columns that have different names in each table.
left_join(x, y, by = c("C" = "D"))
Use suffix to specify the suffix to give to unmatched columns that have the same name in both tables.
left_join(x, y, by = c("C" = "D"), suffix = c("1", "2"))
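The four mutating joins can be compared on two small tables. The sketch below assumes x and y hold illustrative values (columns A, B, C and A, B, D); they are stand-ins, not data shipped with dplyr.

```r
library(dplyr)

x <- data.frame(A = c("a", "b", "c"), B = c("t", "u", "v"), C = c(1, 2, 3),
                stringsAsFactors = FALSE)
y <- data.frame(A = c("a", "b", "d"), B = c("t", "u", "w"), D = c(3, 2, 1),
                stringsAsFactors = FALSE)

left  <- left_join(x, y, by = c("A", "B"))   # all rows of x; D is NA where y has no match
right <- right_join(x, y, by = c("A", "B"))  # all rows of y; C is NA where x has no match
inner <- inner_join(x, y, by = c("A", "B"))  # only rows present in both
full  <- full_join(x, y, by = c("A", "B"))   # every row from either table

# Joining on A alone leaves B in both tables, so suffix tells the two copies apart
by_a <- left_join(x, y, by = "A", suffix = c("1", "2"))
```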
COMBINE CASES
x          y
A B C      A B C
a t 1      c v 3
b u 2      d w 4
c v 3
Use bind_rows() to paste tables below each other as they are.
bind_rows(..., .id = NULL) Returns tables one on top of the other as a single table. Set .id to a column name to add a column of the original table names.
intersect(x, y, ...) Rows that appear in both x and y.
setdiff(x, y, ...) Rows that appear in x but not y.
union(x, y, ...) Rows that appear in x or y. (Duplicates removed). union_all() retains duplicates.
Use setequal() to test whether two data sets contain the exact same rows (in any order).
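The row-combining functions can be tried on two small tables that share the same columns. The values in x and y below are illustrative stand-ins, and the result names are hypothetical.

```r
library(dplyr)

x <- data.frame(A = c("a", "b", "c"), B = c("t", "u", "v"), C = c(1, 2, 3),
                stringsAsFactors = FALSE)
y <- data.frame(A = c("c", "d"), B = c("v", "w"), C = c(3, 4),
                stringsAsFactors = FALSE)

# Naming the inputs lets .id record which table each row came from
stacked  <- bind_rows(x = x, y = y, .id = "DF")

both     <- intersect(x, y)   # the shared row (c, v, 3)
only_x   <- setdiff(x, y)     # rows a and b
either   <- union(x, y)       # 4 distinct rows
all_rows <- union_all(x, y)   # 5 rows, the shared row appears twice

# Row order does not matter to setequal()
same <- setequal(x, arrange(x, desc(C)))
```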
EXTRACT ROWS
x          y
A B C      A B D
a t 1      a t 3
b u 2      b u 2
c v 3      d w 1
Use a "Filtering Join" to filter one table against the rows of another.
semi_join(x, y, by = NULL, ...) Return rows of x that have a match in y. USEFUL TO SEE WHAT WILL BE JOINED.
anti_join(x, y, by = NULL, ...) Return rows of x that do not have a match in y. USEFUL TO SEE WHAT WILL NOT BE JOINED.
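Filtering joins keep or drop rows of x without adding any columns from y. A small sketch, assuming the illustrative tables x and y defined in the block itself:

```r
library(dplyr)

x <- data.frame(A = c("a", "b", "c"), B = c("t", "u", "v"), C = c(1, 2, 3),
                stringsAsFactors = FALSE)
y <- data.frame(A = c("a", "b", "d"), B = c("t", "u", "w"), D = c(3, 2, 1),
                stringsAsFactors = FALSE)

matched   <- semi_join(x, y, by = c("A", "B"))  # rows a and b: they would be joined
unmatched <- anti_join(x, y, by = c("A", "B"))  # row c: it would not be joined
```

Running anti_join() before a mutating join is a quick way to spot keys that will fail to match.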
Row names: also has_rownames(), remove_rownames().