Data Import:: Cheatsheet
R's tidyverse is built around tidy data stored in tibbles, which are enhanced data frames.
The front side of this sheet shows how to read text files into R with readr.
The reverse side shows how to create tibbles with tibble and to lay out tidy data with tidyr.
OTHER TYPES OF DATA
Try one of the following packages to import other types of files:
• haven - SPSS, Stata, and SAS files
• readxl - Excel files (.xls and .xlsx)
• DBI - databases
• jsonlite - JSON
• xml2 - XML
• httr - Web APIs
• rvest - HTML (web scraping)
Save Data
Save x, an R object, to path, a file path, as:
Comma delimited file
write_csv(x, path, na = "NA", append = FALSE, col_names = !append)
File with arbitrary delimiter
write_delim(x, path, delim = " ", na = "NA", append = FALSE, col_names = !append)
CSV for Excel
write_excel_csv(x, path, na = "NA", append = FALSE, col_names = !append)
String to file
write_file(x, path, append = FALSE)
String vector to file, one element per line
write_lines(x, path, na = "NA", append = FALSE)
Object to RDS file
write_rds(x, path, compress = c("none", "gz", "bz2", "xz"), ...)
Tab delimited files
write_tsv(x, path, na = "NA", append = FALSE, col_names = !append)
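As a minimal sketch of the save functions above, the following writes a small tibble to a temporary CSV file and reads it back. The example data and the temporary file are made up for illustration. Arguments are passed by position because the name of the second argument may differ between readr versions (an assumption to check against the installed version).

```r
library(readr)
library(tibble)

# Hypothetical example data; any data frame would do.
x <- tibble(a = c(1, 4), b = c(2, 5), c = c(3, NA))

# Write to a temporary file so the working directory is left untouched.
path <- tempfile(fileext = ".csv")
write_csv(x, path, na = "NA")

# Read the file back; the string "NA" is parsed as a missing value by default.
y <- read_csv(path)
```

The round trip keeps the column names and turns the written "NA" back into a missing value.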
Read Tabular Data
These functions share the common arguments:
read_*(file, col_names = TRUE, col_types = NULL, locale = default_locale(), na = c("", "NA"), quoted_na = TRUE, comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), progress = interactive())
[Figure: the same example data (header a, b, c; rows 1, 2, 3 and 4, 5, NA) shown as comma, semicolon, pipe, fixed width and tab delimited text, each read into a three-column tibble]
Comma Delimited Files
read_csv("file.csv")
To make file.csv run:
write_file(x = "a,b,c\n1,2,3\n4,5,NA", path = "file.csv")
Semi-colon Delimited Files
read_csv2("file2.csv")
write_file(x = "a;b;c\n1;2;3\n4;5;NA", path = "file2.csv")
Files with Any Delimiter
read_delim("file.txt", delim = "|")
write_file(x = "a|b|c\n1|2|3\n4|5|NA", path = "file.txt")
Fixed Width Files
read_fwf("file.fwf", col_positions = c(1, 3, 5))
write_file(x = "a b c\n1 2 3\n4 5 NA", path = "file.fwf")
Tab Delimited Files
read_tsv("file.tsv") Also read_table().
write_file(x = "a\tb\tc\n1\t2\t3\n4\t5\tNA", path = "file.tsv")
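The read functions above differ mainly in the delimiter they expect. A minimal sketch, assuming temporary files in place of the file names used on this sheet:

```r
library(readr)

# Pipe-delimited text, read with an explicit delimiter.
f_pipe <- tempfile(fileext = ".txt")
write_file("a|b|c\n1|2|3\n4|5|NA", f_pipe)
x_pipe <- read_delim(f_pipe, delim = "|")

# Semicolon-delimited text, read with read_csv2().
f_semi <- tempfile(fileext = ".csv")
write_file("a;b;c\n1;2;3\n4;5;NA", f_semi)
x_semi <- read_csv2(f_semi)
```

Both calls should return a tibble with the same two rows and three columns.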
USEFUL ARGUMENTS
Example file
write_file("a,b,c\n1,2,3\n4,5,NA", "file.csv")
f <- "file.csv"
No header
read_csv(f, col_names = FALSE)
Provide header
read_csv(f, col_names = c("x", "y", "z"))
Skip lines
Read in a subset
Missing Values
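The three headings above correspond to the skip, n_max and na arguments in the shared read_* signature. The calls below are a sketch of how each might be used on the example file. The temporary file path and the particular argument values are assumptions made for illustration.

```r
library(readr)

# Recreate the example file in a temporary location.
f <- tempfile(fileext = ".csv")
write_lines(c("a,b,c", "1,2,3", "4,5,NA"), f)

# Skip lines: drop the first line, so "1,2,3" is used as the header row.
skipped <- read_csv(f, skip = 1)

# Read in a subset: keep only the first data row.
first_row <- read_csv(f, n_max = 1)

# Missing values: treat the strings "1" and "." as NA.
# This replaces the default na = c("", "NA").
missing <- read_csv(f, na = c("1", "."))
```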
Read Non-Tabular Data
Read a file into a single string
read_file(file, locale = default_locale())
Read each line into its own string
read_lines(file, skip = 0, n_max = -1L, na = character(), locale = default_locale(), progress = interactive())
Read Apache style log files
read_log(file, col_names = FALSE, col_types = NULL, skip = 0, n_max = -1, progress = interactive())
Read a file into a raw vector
read_file_raw(file)
Read each line into a raw vector
read_lines_raw(file, skip = 0, n_max = -1L, progress = interactive())
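A short sketch of how the non-tabular readers above differ in what they return, using a made-up two-line temporary file:

```r
library(readr)

f <- tempfile(fileext = ".txt")
write_lines(c("first line", "second line"), f)

whole <- read_file(f)      # a single string holding the whole file
lines <- read_lines(f)     # a character vector, one element per line
bytes <- read_file_raw(f)  # a raw vector of the file's bytes
```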
Data types
readr functions guess the types of each column and convert types when appropriate (but will NOT convert strings to factors automatically).
A message shows the type of each column in the result.
## Parsed with column specification:
## cols(
##   age = col_integer(),
##   sex = col_character(),
##   earn = col_double()
## )
1. Use problems() to diagnose problems.
x <- read_csv("file.csv"); problems(x)
2. Use a col_ function to guide parsing.
• col_guess() - the default
• col_character()
• col_double(), col_euro_double()
• col_datetime(format = "") Also col_date(format = ""), col_time(format = "")
• col_factor(levels, ordered = FALSE)
• col_integer()
• col_logical()
• col_number(), col_numeric()
• col_skip()
x <- read_csv("file.csv", col_types = cols(A = col_double(), B = col_logical(), C = col_factor()))
3. Else, read in as character vectors then parse with a parse_ function.
• parse_guess()
• parse_character()
• parse_datetime() Also parse_date() and parse_time()
• parse_double()
• parse_factor()
• parse_integer()
• parse_logical()
• parse_number()
x$A <- parse_number(x$A)
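Steps 2 and 3 above can be sketched side by side. The file contents and column names here are hypothetical and chosen only so that each column has an obvious type.

```r
library(readr)

f <- tempfile(fileext = ".csv")
write_lines(c("A,B,C", "1,TRUE,x", "4,FALSE,y"), f)

# Step 2: guide parsing up front with a column specification.
x <- read_csv(f, col_types = cols(
  A = col_double(),
  B = col_logical(),
  C = col_character()
))

# Step 3: read every column as character, then parse afterwards.
y <- read_csv(f, col_types = cols(.default = col_character()))
y$A <- parse_number(y$A)
```

Supplying col_types also avoids relying on the guessed types, which can help when the first rows of a file are not representative.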
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at tidyverse.org • readr 1.1.0 • tibble 1.2.12 • tidyr 0.6.0 • Updated: 2019-
Tibbles
an enhanced data frame
The tibble package provides a new S3 class for storing tabular data, the tibble. Tibbles inherit the data frame class, but improve three behaviors:
• Subsetting - [ always returns a new tibble, [[ and $ always return a vector.
• No partial matching - You must use full column names when subsetting.
• Display - When you print a tibble, R provides a concise view of the data that fits on one screen.
[Figure: tibble display compared with data frame display of a large table. The tibble prints as "# A tibble: 234 x 6" with column names (manufacturer, model, displ, ...) and column types (<chr>, <chr>, <dbl>) in a view that fits on one screen. The data frame prints row after row until it reaches getOption("max.print") and omits the remaining rows.]
• Control the default appearance with options:
options(tibble.print_max = n,
tibble.print_min = m, tibble.width = Inf)
• View full data set with View() or glimpse()
• Revert to data frame with as.data.frame()
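The subsetting and partial-matching behaviors listed above can be checked with a small comparison. This is a sketch with made-up column names, not output taken from the sheet.

```r
library(tibble)

tb <- tibble(value = 1:3, label = c("a", "b", "c"))
df <- as.data.frame(tb)

# Subsetting: [ keeps a tibble a tibble, while a data frame drops to a vector.
sub_tb <- tb[, "value"]
sub_df <- df[, "value"]

# Partial matching: a tibble returns NULL (with a warning) for an incomplete
# name, while a data frame silently matches "val" to "value".
partial_tb <- suppressWarnings(tb$val)
partial_df <- df$val
```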
CONSTRUCT A TIBBLE IN TWO WAYS
tibble(...)
Construct by columns.
tibble(x = 1:3, y = c("a", "b", "c"))
tribble(...)
Construct by rows.
tribble(
  ~x, ~y,
   1, "a",
   2, "b",
   3, "c")
as_tibble(x, ...) Convert data frame to tibble.
enframe(x, name = "name", value = "value") Convert named vector to a tibble.
is_tibble(x) Test whether x is a tibble.
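The constructors and converters above can be tried together. This is a minimal sketch in which the values and the named vector are invented for illustration.

```r
library(tibble)

# Construct by columns and by rows.
by_cols <- tibble(x = 1:3, y = c("a", "b", "c"))
by_rows <- tribble(
  ~x, ~y,
   1, "a",
   2, "b",
   3, "c"
)

# Convert an existing data frame and a named vector.
from_df  <- as_tibble(data.frame(x = 1:3))
from_vec <- enframe(c(a = 1, b = 2))
```

Note that 1:3 creates an integer column, whereas the bare numbers in tribble() create a double column.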
Tidy Data with tidyr
Tidy data is a way to organize tabular data. It provides a consistent data structure across packages.
A table is tidy if:
• Each variable is in its own column
• Each observation, or case, is in its own row
Tidy data:
• Makes variables easy to access as vectors
• Preserves cases during vectorized operations
Reshape Data: change the layout of values in a table
Use gather() and spread() to reorganize the values of a table into a new layout.
gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
gather() moves column names into a key column, gathering the column values into a single value column.
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
spread() moves the unique values of a key column into the column names, spreading the values of a value column across the new columns.
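table4a
country	1999	2000
A	0.7K	2K
B	37K	80K
C	212K	213K
Result, with year as the key column and cases as the value column:
country	year	cases
A	1999	0.7K
B	1999	37K
C	1999	212K
A	2000	2K
B	2000	80K
C	2000	213K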
gather(table4a, `1999`, `2000`, key = "year", value = "cases")
table2
country	year	type	count
A	1999	cases	0.7K
A	1999	pop	19M
A	2000	cases	2K
A	2000	pop	20M
B	1999	cases	37K
B	1999	pop	172M
B	2000	cases	80K
B	2000	pop	174M
C	1999	cases	212K
C	1999	pop	1T
C	2000	cases	213K
C	2000	pop	1T
Result, with type as the key column and count as the value column:
country	year	cases	pop
A	1999	0.7K	19M
A	2000	2K	20M
B	1999	37K	172M
B	2000	80K	174M
C	1999	212K	1T
C	2000	213K	1T
spread(table2, type, count)
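The two reshaping functions are inverses of each other, which a small round trip can illustrate. This sketch builds its own table with numeric values rather than relying on the sheet's example tables. In newer tidyr releases gather() and spread() are superseded by pivot_longer() and pivot_wider(), though they remain available.

```r
library(tidyr)
library(tibble)

# A wide table: one column per year (hypothetical values).
wide <- tibble(
  country = c("A", "B", "C"),
  `1999` = c(0.7, 37, 212),
  `2000` = c(2, 80, 213)
)

# Gather the year columns into key/value pairs, then spread them back out.
long <- gather(wide, `1999`, `2000`, key = "year", value = "cases")
back <- spread(long, year, cases)
```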
Handle Missing Values
drop_na(data, ...)
Drop rows containing NA's in ... columns.
drop_na(x, x2)
fill(data, ..., .direction = c("down", "up"))
Fill in NA's in ... columns with most recent non-NA values.
fill(x, x2)
replace_na(data, replace = list(), ...)
Replace NA's by column.
replace_na(x, list(x2 = 2))
Example x, and the x2 column after each call:
x1	x2	fill(x, x2)	replace_na(x, list(x2 = 2))
A	1	1	1
B	NA	1	2
C	NA	1	2
D	3	3	3
E	NA	3	2
drop_na(x, x2) keeps only rows A and D.
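The three missing-value helpers can be compared on one small table. This sketch assumes an x with columns x1 and x2 shaped like the example in this section.

```r
library(tidyr)
library(tibble)

x <- tibble(
  x1 = c("A", "B", "C", "D", "E"),
  x2 = c(1, NA, NA, 3, NA)
)

dropped  <- drop_na(x, x2)                # remove rows where x2 is NA
filled   <- fill(x, x2)                   # carry the last non-NA value down
replaced <- replace_na(x, list(x2 = 2))   # substitute a fixed value for NA
```

fill() suits data where a blank means "same as above", while replace_na() suits cases where a missing value has a known default.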
Expand Tables: quickly create tables with combinations of values
complete(data, ..., fill = list())
Adds to the data missing combinations of the values of the variables listed in ...
complete(mtcars, cyl, gear, carb)
expand(data, ...)
Create new tibble with all possible combinations of the values of the variables listed in ...
expand(mtcars, cyl, gear, carb)
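The difference between the two functions is easier to see on a table with one missing combination. The data below is made up for illustration.

```r
library(tidyr)
library(tibble)

# The combination g = "b", year = 2000 is absent.
d <- tibble(
  g    = c("a", "a", "b"),
  year = c(1999, 2000, 1999),
  n    = c(1, 2, 3)
)

combos <- expand(d, g, year)                          # only the combinations
full   <- complete(d, g, year)                        # original data plus the missing row
full0  <- complete(d, g, year, fill = list(n = 0))    # same, with n filled as 0
```

expand() returns only the listed variables, whereas complete() keeps every column of the original data and pads the new rows with NA or the supplied fill values.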
Split Cells
Use these functions to split or combine cells into individual, isolated values.
separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
Separate each cell in a column to make several columns.
table3
country	year	rate
A	1999	0.7K/19M
A	2000	2K/20M
B	1999	37K/172M
B	2000	80K/174M
C	1999	212K/1T
C	2000	213K/1T
Result:
country	year	cases	pop
A	1999	0.7K	19M
A	2000	2K	20M
B	1999	37K	172M
B	2000	80K	174M
C	1999	212K	1T
C	2000	213K	1T
separate(table3, rate, sep = "/", into = c("cases", "pop"))
separate_rows(data, ..., sep = "[^[:alnum:].]+", convert = FALSE)
Separate each cell in a column to make several rows.
table3
country	year	rate
A	1999	0.7K/19M
A	2000	2K/20M
B	1999	37K/172M
B	2000	80K/174M
C	1999	212K/1T
C	2000	213K/1T
Result:
country	year	rate
A	1999	0.7K
A	1999	19M
A	2000	2K
A	2000	20M
B	1999	37K
B	1999	172M
B	2000	80K
B	2000	174M
C	1999	212K
C	1999	1T
C	2000	213K
C	2000	1T
separate_rows(table3, rate, sep = "/")
unite(data, col, ..., sep = "_", remove = TRUE)
Collapse cells across several columns to make a single column.
[Figure: table5, with country, century and year columns for Afghan, Brazil and China, and the result with century and year united into a single year column (1999 and 2000 for each country)]
unite(table5, century, year, col= "year", sep = "")
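A compact sketch of the three cell-splitting functions on hand-built tables. The values mirror the style of the examples above, and full_year is a hypothetical name for the united column.

```r
library(tidyr)
library(tibble)

t3 <- tibble(
  country = c("A", "A"),
  year    = c(1999, 2000),
  rate    = c("0.7K/19M", "2K/20M")
)

# One cell becomes several columns, or several rows.
as_cols <- separate(t3, rate, into = c("cases", "pop"), sep = "/")
as_rows <- separate_rows(t3, rate, sep = "/")

t5 <- tibble(
  country = c("A", "A"),
  century = c("19", "20"),
  year    = c("99", "00")
)

# Several columns collapse into one.
united <- unite(t5, century, year, col = "full_year", sep = "")
```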