Data import and export
Lukáš Lehotský and Petr Ocelík

Summary of the last lesson
• Logical conditions or index positions, inserted into square brackets, filter an object
• Square brackets must include all object dimensions
• When filtering data frames, the filtering condition usually applies to rows

condition <- df[ , 3] > 1000000
df[condition, ]
   cars type   price consumption
1   BMW    3 1200000         6.2
2  Audi   A4 1164000         5.9
df.sub <- df[condition, ]

Working directory
• The folder where all imports and exports take place
• Makes data import and export easier
• Functions setwd() and getwd()

getwd()
[1] "C:/Users/Lukas/. . .
setwd("C:\\Users\\Lukas\\Documents\\R intro")
setwd("C:/Users/Lukas/Documents/R intro")

Working directory
• There are a few issues/limitations
• R does not accept a single backslash \ in a Windows path
  • Typically, this is a path copied from Windows Explorer
  • It is necessary to replace each backslash \ with a forward slash / or a double backslash \\
• Sometimes there is an issue with non-standard letters in the path
  • Typically in user-name paths on Windows
  • The solution is to move R and the working directory out of such a path
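The path handling described on the working-directory slides can be tried safely in a temporary folder. This is a minimal sketch, not part of the original slides: the folder under tempdir() and the file name note.txt are made up for illustration, and the Windows path is only handled as a string, never opened.

```r
# Remember the current working directory so it can be restored
old_wd <- getwd()

# Hypothetical demo folder inside R's temporary directory
demo_dir <- file.path(tempdir(), "R intro")
dir.create(demo_dir, showWarnings = FALSE)

setwd(demo_dir)                      # relative paths now resolve here
writeLines("hello", "note.txt")      # written into demo_dir
found <- file.exists(file.path(demo_dir, "note.txt"))

# A path copied from Windows Explorer contains single backslashes.
# In R source they must be typed doubled; gsub() with fixed = TRUE
# then swaps each backslash for a forward slash.
win_path <- "C:\\Users\\Lukas\\Documents\\R intro"
r_path   <- gsub("\\", "/", win_path, fixed = TRUE)

setwd(old_wd)                        # return to the original directory
```

file.path() builds paths with forward slashes on every platform, which avoids the backslash problem in the first place.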
[Screenshot: RStudio menu Session > Set Working Directory, with the options To Source File Location, To Files Pane Location, and Choose Directory... (Ctrl+Shift+H)]

Data import
• RStudio visual import (Import Dataset)
• Functions
  • Structured data (tables)
  • Unstructured data
  • R native files

[Screenshot: RStudio window with the Import Dataset button in the Environment pane]
Data import: structured data
• Most common formats of tabular data
  • CSV - comma-separated values
  • TSV - tab-separated values
  • TXT - plain text file
  • Other formats (XLSX, ...)
• CSV/TSV are the most desirable
• Essentially, CSV and TSV files are text files

[Screenshot: the file C:\Users\Lukas\Downloads\BALIBOMBING2002_2002.csv opened in the PSPad text editor]
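To illustrate that a CSV file is ordinary text, the sketch below writes one by hand, reads it back as raw lines, and then parses the same file as a table with base R's read.table(). The values and column names are made up; they are not taken from the course data.

```r
# Write a tiny CSV by hand (made-up data) into a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,a151,a153",
             "151,0,2",
             "153,2,0"), csv_path)

# Read it back as plain text: one character string per line
raw_lines <- readLines(csv_path)

# Parse the same file as a table; the first column supplies the row names
toy <- read.table(csv_path, header = TRUE, sep = ",", row.names = 1)
```

Because the file is plain text, it can also be inspected or repaired in any text editor before import.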
[Screenshot: RStudio after the import. The data viewer shows df (27 obs. of 27 variables); the console, with working directory C:/Users/Lukas/Downloads/, shows the call below]

df <- read.table(file = "BALIBOMBING2002_2002.csv",
                 header = TRUE,
                 sep = ",",
                 row.names = 1,
                 na.strings = "NA",
                 stringsAsFactors = FALSE,
                 fileEncoding = "UTF-8")

Data import: unstructured data
• The most common choice is the function readLines()
• Can import any unstructured data as a character vector (one element per line)
• Data conversions/mining must follow

text <- readLines(con = "BALIBOMBING2002_2002.csv", warn = TRUE, encoding = "UTF-8")

Data import: RData files
• R has a native data format
• Allows saving the whole environment (all objects) or a particular object for later use
• Easiest way to access/share previous R work
• Function load()

load(file = "balibombing.rdata")

[Screenshot: RStudio console showing str(df): 'data.frame', 27 obs. of 27 variables, with the listed columns (X151, X153, ...) of type int]
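For the load() slide above, a small round trip shows what "native format" means in practice. This is a sketch under assumptions: toy_df and the temporary file are invented for the demo, and save() is used only to create something to load.

```r
# Throwaway file in R's temporary directory
rdata_path <- tempfile(fileext = ".rdata")

# Made-up object; save() is used here only to create a file to load
toy_df <- data.frame(actor = c("a", "b"), ties = c(2L, 0L))
save(toy_df, file = rdata_path)
rm(toy_df)                       # remove it from the workspace

# load() recreates the object under its original name and
# invisibly returns the names of everything it restored
restored <- load(rdata_path)
print(restored)

# Inspect the structure of what came back
str(toy_df)
```

Note that load() overwrites any existing object with the same name without asking, so it is worth checking the returned names.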
Data conversions
• A data frame may be converted to other object types
  • Function as.matrix() turns a DF into a matrix
  • Function as.list() turns a DF into a list of vectors
• Other object types may be converted to a data frame as well
  • Function as.data.frame()

Data conversions: DF to matrix
• Data are loaded as data frames by default - for networks, conversion is necessary
• The class of the data needs to be considered beforehand
  • If any text is included, the whole matrix will be coerced to character
  • Converting data classes after the DF conversion is painful - it needs to be done column by column

mat <- as.matrix(df)
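The coercion rule can be seen on two small data frames. This is a minimal sketch with made-up data; the names num_df, mix_df and safe_mat are hypothetical.

```r
# Made-up data frames: one purely numeric, one with a text column
num_df <- data.frame(x = c(0L, 2L), y = c(2L, 0L))
mix_df <- data.frame(x = c(0L, 2L), label = c("a", "b"),
                     stringsAsFactors = FALSE)

num_mat <- as.matrix(num_df)   # stays numeric (integer)
mix_mat <- as.matrix(mix_df)   # every cell becomes character

# Check the column classes before converting
all_numeric <- all(sapply(mix_df, is.numeric))

# One way out: convert only the numeric columns
safe_mat <- as.matrix(mix_df[sapply(mix_df, is.numeric)])
```

Checking the classes first is cheaper than repairing a character matrix afterwards.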
Data conversions: matrix to DF
• Inconvenient process
• Argument stringsAsFactors is important - with stringsAsFactors = TRUE (the default before R 4.0.0) any text becomes a factor variable
• The data class of the matrix will be preserved
  • No automatic type recognition is built into the conversion
• If the matrix is unnamed, column names are assigned automatically

mat2df <- as.data.frame(mat)
mat2df <- as.data.frame(mat, stringsAsFactors = FALSE)

Names of rows and columns
• Column names are an issue in data frames
• Data frame columns must have proper column (variable) names
  • Names starting with a number are not allowed (by default they are transformed to "X#")
  • Missing column names lead to automatically generated names (V1, V2, ..., Vn)
  • Conversion (e.g. from a list) might result in crazy column names

Names of rows and columns
• If there is an issue with names, renaming needs to follow
• Functions rownames() and colnames() give access to the existing names
• Renaming is counterintuitive - we need to assign a vector of names to the function result
  • It must be of the same length as the number of columns

colnames(df) <- c("var1", "var2", ... , "varN")

Names of rows and columns
• Names may be generated automatically using the paste() or paste0() functions - these functions paste pieces of text together
  • Function paste() adds spaces between the pieces
  • Function paste0() joins the pieces without spaces

colnames(df) <- paste0("actor", 1:27)

• In the case of a matrix, each dimension needs to be named separately

colnames(mat) <- paste0("actor", 1:27)
rownames(mat) <- colnames(mat)

Data export
• Export of the entire workspace
• Export of particular objects
  • R native files
  • Structured data (tables)
  • Unstructured data

Data export: saving entire workspace
• Function save.image()
• Rdata format
• No additional arguments required

save.image(file = "workspace 2018 07 04.rdata")
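Returning to the matrix-to-DF conversion and the renaming slides above, the steps can be sketched together on a tiny matrix. The values and the names mat_chr and toy_df are made up for illustration.

```r
# A small unnamed character matrix (made-up values)
mat_chr <- matrix(c("1", "2", "3", "4"), nrow = 2)

# Without names, as.data.frame() generates V1, V2, ...
toy_df <- as.data.frame(mat_chr, stringsAsFactors = FALSE)
auto_names <- colnames(toy_df)

# The character class is preserved, so convert column by column
toy_df[] <- lapply(toy_df, as.numeric)

# Rename by assigning a vector with one name per column
colnames(toy_df) <- paste0("actor", seq_len(ncol(toy_df)))

# paste() inserts a space between the pieces, paste0() does not
spaced <- paste("actor", 1:2)

# A matrix needs both dimensions named separately
colnames(mat_chr) <- paste0("actor", 1:2)
rownames(mat_chr) <- colnames(mat_chr)
```

The empty brackets in toy_df[] keep the result a data frame instead of replacing it with a plain list.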
Data export: saving particular object as Rdata
• Function save()
• Rdata is the R native format
• Preserves the object as is, with its object name
• Allows avoiding export and import issues

save(df, file = "BALIBOMBING_export.rdata")

Data export: saving tabular data
• Function write.table()
• The object to be exported has to be specified
• Otherwise similar to read.table()

write.table(df, "BALIBOMBING_export.csv",
            sep = ",",
            row.names = TRUE,
            col.names = TRUE,
            fileEncoding = "UTF-8")

Data export: saving unstructured data
• Function writeLines()
• Similar to readLines()
• Basically no additional arguments
• The exported object must be a character vector, such as text returned by readLines()

writeLines(text, "BALIBOMBING_export.txt")

Other input and output options
• Built-in CSV-specific functions
  • read.csv() and write.csv() functions
  • read.csv() requires fewer arguments than read.table(), because CSV defaults such as header = TRUE and sep = "," are preset
• Packages "xlsx" or "openxlsx" for XLSX input
  • Both packages contain functions read.xlsx() and write.xlsx() - similar to read.table()
• Package "readtext" is designed to read unstructured text data (PDFs, Word documents, etc.)
• Other packages for specific formats (XML, HTML, JSON, SQL, ...)

Practice 1
• Download the package "MEB433_434_03A_practice.zip" to your computer
• Unpack it into any folder
• Set the folder as the working directory
• Import the data "nrg_122a.csv" into R
• Explore and adjust the data set in a text editor, if necessary
• Achieve the following
  • The dataset has proper column names
  • Missing values are treated correctly (find the function arguments)
  • All columns are numeric (there's a catch here)

Practice 2
• Get rid of the summary rows in the data frame
• Get rid of the missing values for the last year
• Extract only the last 3 years into a separate object
• Export the subset in rdata format
• Export the subset in CSV format
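The export functions from the slides above can be checked with a round trip: write an object out, read it back, and compare. This is a sketch on made-up data, not a solution to the practice tasks; the object toy, its column names, and the temporary files are hypothetical.

```r
# Made-up data frame standing in for an object to be exported
toy <- data.frame(y2014 = c(1.5, 2.5), y2015 = c(3, NA),
                  row.names = c("CZ", "SK"))

csv_path   <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".rdata")

# CSV export: a text file, so separator, names and NA code are spelled out
write.table(toy, csv_path, sep = ",", row.names = TRUE,
            col.names = TRUE, na = "NA", fileEncoding = "UTF-8")

# Read it back with matching arguments; column 1 holds the row names
back <- read.table(csv_path, header = TRUE, sep = ",", row.names = 1,
                   na.strings = "NA")

# Native export: the object comes back unchanged, under its own name
save(toy, file = rdata_path)
env <- new.env()
load(rdata_path, envir = env)
same <- identical(env$toy, toy)
```

Loading into a separate environment, as done here with new.env(), keeps the restored copy from overwriting the original in the workspace.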