Data tidying with tidyr:: Cheatsheet
Tidy data is a way to organize tabular data in a consistent data structure across packages. A table is tidy if:
111
Each variable is in its own column
Each observation, or case, is in its own row
Access variables as vectors
Preserve cases in vectorized operations
Tibbies
AN ENHANCED DATA FRAME
Tibbies are a table format provided by the tibble package. They inherit the data frame class, but have improved behaviors:
• Subset a new tibble with ], a vector with [[ and $.
• No partial matching when subsetting columns.
• Display concise views of the data on one screen.
options(tibble.print_max = n, tibble. print_min = m, tibble.width = Inf) Control default display settings.
View() or glimpse)) View the entire data set.
CONSTRUCT A TIBBLE
tibble)...) Construct by columns.
tibble(x = 1:3, y = c("a", "b", "c"))
tribble)...) Construct by rows.
tribble(~x, ~y,
1, "a",
2, "b",
3, "c")
A tibble: 3x2
<int> <chr> 11a
2 2b
3 3c
as_tibble(x,...) Convert a data frame to a tibble.
enframe(x, name = "name", value = "value") Convert a named vector to a tibble. Also deframe().
is_tibble(x) Test whether x is a tibble.
©Stud
10
Reshape Data
- Pivot data to reorganize values into a new layout.
table4a
KK1WTW HUHU
[1 0.7k
b 37k 80k C      212k 213k
b		37k
C		212k
a	IFTiTTl	2k
b		80k
c	Bililil	213k
table2
				1333
a	1999        3  0.7k iw	a 1999	0.7k	19m
a	1999 IBS 19m	a 2000	2k	20m
a	2000         j 2k	b 1999	37k	172m
a	2000 WEHM 20m	b 2000	80k	174m
b	1999 37k	C 1999	212k	1t
b	1999 BEH 172m	C 2000	213k	1t
b	2000 80k			
b	2000 BEH 174m			
C	1999            212k			
C	1999 BEH 1t			
C	2000            213k			
C	2000 WHOM 1t			
pivot_longer(data, cols, names_to = "name", values_to = "value", values_drop_na = FALSE)
"Lengthen" data by collapsing several columns into two. Column names move to a new names_to column and values to a newvalues_to column.
pivot_longer(table4a, cols = 2:3, namesjo - year", values_to = "cases")
pivot_wider(data, names_from = "name", values_from = "value")
The inverse of pivot_longer(). "Widen" data by expanding two columns into several. One column provides the new column names, the other the values.
pivot_wider(table2, names_from = type, values from = count)
Split Cells
table5
a 19
a 20
b 19
b 20
■ Use these functions to split or combine cells into individual, isolated values.
_table 3_
a 1999 0.7k/
a 2000 2k/
b 1999 37k/
b 2000 80k/
a 19
a 20
b 19
b 20
a 1999 0.7k
a 2000 2k
b 1999 37k
b 2000 80k
_table3
a 1999 0.7k/
a 2000 ?kHiIM—^. I
b 1999 37k/
b 2000 80k/
a 1999 0.7k
1999
a 2000 2k
a 2000
b 1999 37k
b 1999
b 2000 80k
b 2000
unite(data, col,..., sep =     remove = TRUE, na.rm = FALSE) Collapse cells across several columns into a single column.
unite(table5, century, year, col = "year", sep = "")
separate(data, col, into, sep = "[A[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn",...) Separate each cell in a column into several columns. Also extract))
separate(table3, rate, sep = "/", into = c("cases", "pop"))
separate rowsfdata,sep = "[A[:alnum:].]+", convert = FALSE) Separate each cell in a column into several rows.
separate_rows(table3, rate, sep = "/")
Expand Tables
Create new combinations of variables or identify implicit missing values (combinations of variables not present in the data).
a    1     3 .
B    1 4
EÜEE1     expand(data,...) Create a * a 2      new tibble with all possible g 2      combinations of the values of the variables listed in ... Drop other variables. expand(mtcars, cyl, gear, ca rb)
a 1 3 B    1 4
EOtaiS] complete(data,
,fill =
list()) Add missing possible combinations of values of variables listed in ... Fill remaining variables with NA.
complete(mtcars, cyl, gear, ca rb)
Handle Missing Values
Drop or replace explicit missing values (NA).
a 1 B na C na
a 1 B na C na
Eld   drop na(data,...) Drop □  3     rows containing NA's in .. columns.
drop_na(x, x2)
nira
a 1 B 1 C 1
111 TZW
fill(data,..., .direction = "down") Fill in NA's in ... columns using the next or previous value.
fill(x,x2)
replace_na(data, replace) Specify a value to replace NA in selected columns.
replace_na(x, list(x2 = 2))
RStudio8isatrademarkofRStudio,PBC • CCBYSA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn moreattidyr.tidyverse.org • tibble 3.1.2 • tidyr 1.1.3 • Updated: 2021-08
Nested Data
A nested data frame stores individual tables as a list-column of data frames within a larger organizing data frame. List-columns can also be lists of vectors or lists of varying data types. Use a nested data frame to:
• Preserve relationships between observations and subsets of data. Preserve the type of the variables being nested (factors and datetimes aren't coerced to character).
• Manipulate many sub-tables at once with purrr funcitons like map(), map2(), or pmap() or with dplyr rowwisef) grouping.
CREATE NESTED DATA
nest(data,...) Moves groups of cells into a list-column of a data frame. Use alone or with dplyr::group_by():
1. Group the data frame with group_by() and use nest() to move the groups into a list-column.
n_storms <- storms %>% group_by(name) %>% nest()
2. Use nest(new_col = c(x, y)) to specify the columns to group using dplyr::select() syntax.
n_storms <- storms %>% nestfdata = c(yeanlong))
"cell" contents
RESHAPE NESTED DATA
unnest(data, cols,keep_empty = FALSE) Flatten nested columns back to regular columns. The inverse of nest().
n_storms%>% unnest(data)
unnest_longer(data, col, values_to = NULL, indices_to = NULL) Turn each element of a list-column into a row.
starwars%>% select(name, films) %>% unnestjonger(films)
	KB	misa		KB	Ig] ESU	
Amy	1975	27.5 -79.0	Amy	1975	27.5 -79.0	
Amy	1975	28.5 -79.0	Amy	1975	28.5 -79.0	nested data frame
Amy	1975	29.5 -79.0	Amy	1975	29.5 -79.0	
Bob	1979	22.0 -96.0	Bob	1979	22.0 -96.0	_^   Amy   <tibble [50x3]>
Bob	1979	22.5 -95.3	Bob	1979	22.5 -95.3	Bob   <tibble [50x3]>
Bob	1979	23.0 -94.6	Bob	1979	23.0 -94.6	Zeta   <tibble [50x3]>
Zeta	2005	23.9 -35.6	Zeta	2005	23.9 -35.6	
Zeta	2005	24.2 -36.1	Zeta	2005	24.2 -36.1	
Zeta	2005	24.7 -36.6	Zeta	2005	24.7 -36.6	
1975 1975 1975
27.5 -79.0 28.5 -79.0 29.5 -79.0
Index list-columns with [
. n_storms$data[[l]]
1979 1979 1979
2005 2005 2005
22.0 -96.0 22.5 -95.3 23.0 -94.6
		^^^^H name	films 1
		Luke	TheEmpire Strik...
		Luke	Revenge of the S...
Luke			Return of the Jed...
Luke	<chr[5]>	I C-3PO	TheEmpire Strik...
C-3PO	<chr[6]>	1 C-3PO	Attack of the CI...
R2-D2	<chr[7]>	C-3PO	The Phantom M...
		R2-D2	TheEmpire Strik...
		R2-D2	Attack of the CI...
		R2-D2	The Phantom M...
23.9 -35.6 24.2 -36.1 24.7 -36.6
CREATE TIBBLES WITH LIST-COLUMNS
tibble::tribble(...) Makes list-columns when needed
tribble( -max, ~seq,
unnest_wider(data, col) Turn each element of a list-column into a regular column.
starwars%>% select(name, films) %>% unnest wider(films)
3, 1
4, 1
5, 1
			
3	<int [3]>	Luke	<chr[5]>
4	<int[4]>	C-3PO	<chr[6]>
5	<int [5]>	R2-D2	<chr[7]>
TfT^a   EH    EH EH			
Luke	1 The Empire...	Revenge of...	Return of...
C-3PO	1 The Empire...	Attack of...	The Phantom...
R2-D2	The Empire...	Attack of...	The Phantom...
tibble::tibble(...) Saves list input as list-columns.
tibble(max = c(3,4,5), seq = list(l:3,1:4,1:5))
tibble::enframe(x, name="name", value="value") Converts multi-level list to a tibble with list-cols.
enframe(list('3'=l:3, '4'=1:4, '5'=1:5), 'max', 'seq')
OUTPUT LIST-COLUMNS FROM OTHER FUNCTIONS
dplyr::mutate(), transmute)), and summarise)) will output list-columns if they return a list.
mtcars%>% group_by(cyl) %>% summarisefq = list(quantile(mpg)))
hoist).data, .col,.remove = TRUE) Selectively pull list components out into their own top-level columns. Uses purrr::pluck() syntax for selecting from lists.
starwarsc
e
	select(name, films) %>% hoist(films, first_film = 1, second_film	= 2)		
	■QE^HI BEnS^BB3SS3U			
Luke	<chr[5]>                     HHHHJ The Empire... <chr[6]>                     B C-3PO TheEmpire...	Revenge of...	<ch	[3]>
C-3PO		Attack of...	<ch	[4]>
R2-D2	<chr[7]>                         R2-D2 TheEmpire...	Attack of...	<ch	[5]>
TRANSFORM NESTED DATA
A vectorized function takes a vector, transforms each element in parallel, and returns a vector of the same length. By themselves vectorized functions cannot work with lists, such as list-columns.
dplyr::rowwise(.data,...) Group data so that each row is one group, and within the groups, elements of list-columns appear directly (accessed with [[), not as lists of length one. When you use rowwise)), dplyr functions will seem to apply functions to list-columns in a vectorized fashion.
<tibble [50x4]> <tibble [50x4]> <tibble [50x4]>
. í ! <tibb[e [50x4]> fUnj <tibble [50x4]> fun(  <tibb[e [50x4]>
result 1 result 2 result 3
Apply a function to a list-column and create a new list-column.
n_storms%>%
dimi) returns two values per row
rowwiseO %>% f
mutate(n = list(dim(data)))-^E
Apply a function to a list-column and create a regular column.
n_storms%>% rowwiseO %>% mutatefn = nrow(data)'
nrow() returns one integer per row
Collapse multiple list-columns into a single list-column.
appendO returnsa list for each row, so col type must be list
starwars %>% rowwiseO %>% mutate(transport = list(append(vehicles, starships))
Apply a function to multiple list-columns.
lengthQ returns one integer per row
starwars %>% rowwiseO %>% mutate(n_transports = length(c(vehicles, starships))
Studio
See purrr package for more list functions.
RStudio8 is a trademark of RStudio, PBC • CCBYSA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn moreattidyr.tidyverse.org • tibble 3.1.2 • tidyr 1.1.3 • Updated: 2021-08