Data Import :: Cheatsheet

R's tidyverse is built around tidy data stored in tibbles, which are enhanced data frames. The front side of this sheet shows how to read text files into R with readr. The reverse side shows how to create tibbles with tibble and how to lay out tidy data with tidyr.

OTHER TYPES OF DATA
Try one of the following packages to import other types of files:
• haven - SPSS, Stata, and SAS files
• readxl - Excel files (.xls and .xlsx)
• DBI - databases
• jsonlite - JSON
• xml2 - XML
• httr - Web APIs
• rvest - HTML (web scraping)

Save Data
Save x, an R object, to path, a file path, as:

Comma delimited file
    write_csv(x, path, na = "NA", append = FALSE, col_names = !append)
File with arbitrary delimiter
    write_delim(x, path, delim = " ", na = "NA", append = FALSE, col_names = !append)
CSV for Excel
    write_excel_csv(x, path, na = "NA", append = FALSE, col_names = !append)
String to file
    write_file(x, path, append = FALSE)
String vector to file, one element per line
    write_lines(x, path, na = "NA", append = FALSE)
Object to RDS file
    write_rds(x, path, compress = c("none", "gz", "bz2", "xz"), ...)
Tab delimited files
    write_tsv(x, path, na = "NA", append = FALSE, col_names = !append)

Read Tabular Data
These functions share the common arguments:
    read_*(file, col_names = TRUE, col_types = NULL, locale = default_locale(), na = c("", "NA"), quoted_na = TRUE, comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), progress = interactive())

Each example file below holds the same values: a header a, b, c followed by the rows 1, 2, 3 and 4, 5, NA. The write_file() call under each reader creates the file it reads.

Comma Delimited Files
    read_csv("file.csv")
    To make file.csv run: write_file(x = "a,b,c\n1,2,3\n4,5,NA", path = "file.csv")
Semi-colon Delimited Files
    read_csv2("file2.csv")
    write_file(x = "a;b;c\n1;2;3\n4;5;NA", path = "file2.csv")
Files with Any Delimiter
    read_delim("file.txt", delim = "|")
    write_file(x = "a|b|c\n1|2|3\n4|5|NA", path = "file.txt")
Fixed Width Files
    read_fwf("file.fwf", col_positions = fwf_widths(c(2, 2, 2)))
    write_file(x = "a b c\n1 2 3\n4 5 NA", path = "file.fwf")
Tab Delimited Files
    read_tsv("file.tsv") Also read_table().
    write_file(x = "a\tb\tc\n1\t2\t3\n4\t5\tNA", path = "file.tsv")

Data Types
readr functions guess the type of each column and convert types when appropriate (but will NOT convert strings to factors automatically). A message shows the type of each column in the result.

    ## Parsed with column specification:
    ## cols(
    ##   age = col_integer(),
    ##   sex = col_character(),
    ##   earn = col_double()
    ## )

1. Use problems() to diagnose problems.
    x <- read_csv("file.csv"); problems(x)
2. Use a col_ function to guide parsing.
    • col_guess() - the default
    • col_character()
    • col_double(), col_euro_double()
    • col_datetime(format = "") Also col_date(format = ""), col_time(format = "")
    • col_factor(levels, ordered = FALSE)
    • col_integer()
    • col_logical()
    • col_number(), col_numeric()
    • col_skip()
    x <- read_csv("file.csv", col_types = cols(A = col_double(), B = col_logical(), C = col_factor()))
3. Else, read in as character vectors, then parse with a parse_ function.
    • parse_guess()
    • parse_character()
    • parse_datetime() Also parse_date() and parse_time()
    • parse_double()
    • parse_factor()
    • parse_integer()
    • parse_logical()
    • parse_number()
    x$A <- parse_number(x$A)

USEFUL ARGUMENTS
Example file
    write_file("a,b,c\n1,2,3\n4,5,NA", "file.csv")
    f <- "file.csv"
No header
    read_csv(f, col_names = FALSE)
Provide header
    read_csv(f, col_names = c("x", "y", "z"))
Skip lines (skip argument)
Read in a subset (n_max argument)
Missing values (na argument)

Read Non-Tabular Data
Read a file into a single string
    read_file(file, locale = default_locale())
Read each line into its own string
    read_lines(file, skip = 0, n_max = -1L, na = character(), locale = default_locale(), progress = interactive())
Read Apache style log files
    read_log(file, col_names = FALSE, col_types = NULL, skip = 0, n_max = -1, progress = interactive())
Read a file into a raw vector
    read_file_raw(file)
Read each line into a raw vector
    read_lines_raw(file, skip = 0, n_max = -1L, progress = interactive())

RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at tidyverse.org • readr 1.1.0 • tibble 1.2.12 • tidyr 0.6.0 • Updated: 2017-01

Tibbles - an enhanced data frame
The tibble package provides a new S3 class for storing tabular data, the tibble. Tibbles inherit the data frame class, but improve three behaviors:
• Subsetting - [ always returns a new tibble, [[ and $ always return a vector.
• No partial matching - You must use full column names when subsetting.
• Display - When you print a tibble, R provides a concise view of the data that fits on one screen.

[Figure: tibble display (# A tibble: 234 x 6) next to data frame display]

• Control the default appearance with options:
    options(tibble.print_max = n, tibble.print_min = m, tibble.width = Inf)
• View full data set with View() or glimpse()
• Revert to data frame with as.data.frame()

CONSTRUCT A TIBBLE IN TWO WAYS
tibble(...) Construct by columns.
    tibble(x = 1:3, y = c("a", "b", "c"))
tribble(...) Construct by rows.
    tribble(
      ~x, ~y,
       1, "a",
       2, "b",
       3, "c")
Result: a tibble with columns x (1, 2, 3) and y ("a", "b", "c").

as_tibble(x, ...) Convert data frame to tibble.
enframe(x, name = "name", value = "value") Convert named vector to a tibble.
is_tibble(x) Test whether x is a tibble.

Tidy Data with tidyr
Tidy data is a way to organize tabular data. It provides a consistent data structure across packages.
A table is tidy if:
• Each variable is in its own column
• Each observation, or case, is in its own row
Tidy data:
• Makes variables easy to access as vectors
• Preserves cases during vectorized operations

Reshape Data - change the layout of values in a table
Use gather() and spread() to reorganize the values of a table into a new layout.

gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
Gather moves column names into a key column, gathering the column values into a single value column.
    gather(table4a, `1999`, `2000`, key = "year", value = "cases")

    table4a                        result
    country  1999   2000           country  year  cases
    A        0.7K   2K             A        1999  0.7K
    B        37K    80K            B        1999  37K
    C        212K   213K           C        1999  212K
                                   A        2000  2K
                                   B        2000  80K
                                   C        2000  213K

spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
Spread moves the unique values of a key column into the column names, spreading the values of a value column across the new columns.
    spread(table2, type, count)

    table2                             result
    country  year  type   count        country  year  cases  pop
    A        1999  cases  0.7K         A        1999  0.7K   19M
    A        1999  pop    19M          A        2000  2K     20M
    A        2000  cases  2K           B        1999  37K    172M
    A        2000  pop    20M          B        2000  80K    174M
    B        1999  cases  37K          C        1999  212K   1T
    B        1999  pop    172M         C        2000  213K   1T
    B        2000  cases  80K
    B        2000  pop    174M
    C        1999  cases  212K
    C        1999  pop    1T
    C        2000  cases  213K
    C        2000  pop    1T

Handle Missing Values
drop_na(data, ...) Drop rows containing NA's in ... columns.
    drop_na(x, x2)
fill(data, ..., .direction = c("down", "up")) Fill in NA's in ... columns with most recent non-NA values.
replace_na(data, replace = list(), ...) Replace NA's by column.
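The three missing-value helpers can be tried side by side on a small tibble. The following is a minimal sketch, not taken from the sheet: the tibble readings and its columns day and temp are made-up names for illustration.

```r
library(tibble)
library(tidyr)

# Hypothetical data: temp has gaps on days 1, 3 and 4.
readings <- tibble(
  day  = 1:5,
  temp = c(NA, 20, NA, NA, 23)
)

# drop_na(): keep only the rows where temp is present.
complete_rows <- drop_na(readings, temp)

# fill(): carry the most recent non-NA value downward (the default).
# A leading NA stays NA because nothing precedes it.
filled_down <- fill(readings, temp)

# fill() with .direction = "up": pull the next non-NA value upward instead.
filled_up <- fill(readings, temp, .direction = "up")

# replace_na(): substitute one fixed value per column for every NA.
zeroed <- replace_na(readings, list(temp = 0))
```

Which helper fits depends on what a gap means in the data: drop_na() discards the case, fill() assumes the value persists between observations, and replace_na() asserts a known default.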
Example, where x has columns x1 and x2:

    x1  x2 in x  fill(x, x2)  replace_na(x, list(x2 = 2))
    A   1        1            1
    B   NA       1            2
    C   NA       1            2
    D   3        3            3
    E   NA       3            2

    drop_na(x, x2) keeps only rows A and D.

Expand Tables - quickly create tables with combinations of values
complete(data, ..., fill = list()) Adds to the data missing combinations of the values of the variables listed in ...
    complete(mtcars, cyl, gear, carb)
expand(data, ...) Create new tibble with all possible combinations of the values of the variables listed in ...
    expand(mtcars, cyl, gear, carb)

Split Cells
Use these functions to split or combine cells into individual, isolated values.

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
Separate each cell in a column to make several columns.
    separate(table3, rate, into = c("cases", "pop"))

    table3                        result
    country  year  rate           country  year  cases  pop
    A        1999  0.7K/19M       A        1999  0.7K   19M
    A        2000  2K/20M         A        2000  2K     20M
    B        1999  37K/172M       B        1999  37K    172M
    B        2000  80K/174M       B        2000  80K    174M
    C        1999  212K/1T        C        1999  212K   1T
    C        2000  213K/1T        C        2000  213K   1T

separate_rows(data, ..., sep = "[^[:alnum:].]+", convert = FALSE)
Separate each cell in a column to make several rows. Also separate_rows_().
    separate_rows(table3, rate)

    table3                        result
    country  year  rate           country  year  rate
    A        1999  0.7K/19M       A        1999  0.7K
    A        2000  2K/20M         A        1999  19M
    B        1999  37K/172M       A        2000  2K
    B        2000  80K/174M       A        2000  20M
    C        1999  212K/1T        B        1999  37K
    C        2000  213K/1T        B        1999  172M
                                  B        2000  80K
                                  B        2000  174M
                                  C        1999  212K
                                  C        1999  1T
                                  C        2000  213K
                                  C        2000  1T

unite(data, col, ..., sep = "_", remove = TRUE)
Collapse cells across several columns to make a single column.
    unite(table5, century, year, col = "year", sep = "")

    table5                        result
    country  century  year        country  year
    Afghan   19       99          Afghan   1999
    Afghan   20       00          Afghan   2000
    Brazil   19       99          Brazil   1999
    Brazil   20       00          Brazil   2000
    China    19       99          China    1999
    China    20       00          China    2000
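separate() and unite() are often used together when one column packs two values and another value is spread over two columns. The sketch below assumes a small made-up tibble, rates, shaped like a mix of table3 and table5. The column name full_year and the explicit sep = "/" are illustrative choices, not part of the sheet.

```r
library(tibble)
library(tidyr)

# Hypothetical data: rate packs cases and population into one string,
# while the year is split across century and year.
rates <- tibble(
  country = c("A", "A", "B"),
  century = c("19", "20", "19"),
  year    = c("99", "00", "99"),
  rate    = c("745/19987071", "2666/20595360", "37737/172006362")
)

# separate(): split rate at "/" into two columns. convert = TRUE turns the
# resulting strings into numbers.
parts <- separate(rates, rate, into = c("cases", "pop"), sep = "/", convert = TRUE)

# unite(): paste century and year together with no separator. The inputs are
# dropped because remove = TRUE is the default.
tidy <- unite(parts, century, year, col = "full_year", sep = "")
```

Passing sep explicitly avoids relying on the default pattern, which splits on any run of non-alphanumeric characters and would also break values that contain a decimal point.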