Text analysis
Lukáš Lehotský

"text analysis is just a fancy and convoluted way how to obtain independent or dependent variable"
Inaki Sagarzazu

Concepts

Bag of words
• The quick brown fox jumps over the lazy dog

Word    Occurrence
brown   1
dog     1
fox     1
jumps   1
lazy    1
over    1
quick   1
the     2

Co-occurrence
• The quick brown fox jumps over the lazy dog. Brown dog sleeps well.

Word    Sentence 1   Sentence 2
brown   1            1
dog     1            1
fox     1
jumps   1
lazy    1
over    1
quick   1
sleeps               1
the     2
well                 1

Co-locations and n-grams
• Established phrases - usually occur together and form a meaning
  • Ministry of the Environment
  • European Union
  • prime minister
  • toilet paper

Zipf's law
[Figure: Zipf's law; curves plotted against log(rank) for 30 languages, including English, German, Czech and Slovak (Jimenez, 2015)]

Manifest vs. latent content

Design of CA research
• Unitizing
• Sampling
• Recording/Coding
• Reducing
• Narrating
(Krippendorff 2013, p. 86)

Basic terminology
• Corpus
  • Body of all text pieces available for the content analysis
• Term
  • Text token, usually a word
• Term-document matrix
  • Matrix which records the occurrence of terms in documents

Methods

Methods of TA
• Supervised methods
  • Manual coding
• Semi-supervised
  • Dictionary-based methods
    • Deductively given dictionary
    • Dictionary obtained from data
      • Automatically
      • Manually
• Unsupervised
  • Frequencies
  • Topic modeling

Fully supervised - manual coding
• Manual coding of text units

Inductive vs. deductive coding
• Inductive - data-driven
  • Categories not known
  • Open coding - categories emerge in iterative text reading
  • Axial coding - abstraction from open coding into categories
• Deductive - theory-driven
  • Categories known a priori
  • Existing code-book applied over data

Fully supervised
• Coding is input for further analysis
  • Frequencies of codes
  • Temporal development
  • Standard statistical methods
  • Socio-semantic networks
• Discourse network analysis
  • Socio-semantic networks of actors and meanings (codes) they use

Fully supervised - DNA
[Figure: discourse network of actors, e.g. Angela Merkel [CDU], Stefan Mappus [CDU], Sylvia Kotting-Uhl [Grüne], Europäische Union, Deutsches Atomforum]

[Screenshot: RStudio with the console and the Packages pane]

    > one <- 1
    > one + one
    [1] 2
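The bag-of-words table and the n-gram idea from the Concepts slides can be reproduced in a few lines of base R. This is only a minimal sketch: the variable names (sentence, tokens, bag, bigrams) are made up for illustration, and the tokeniser simply lowercases the text and splits on anything that is not a letter a-z, which is an assumption that would not suit real corpora.

```r
# Sentence taken from the bag-of-words slide.
sentence <- "The quick brown fox jumps over the lazy dog"

# Assumed tokeniser: lowercase, then split on any run of non-letters.
tokens <- strsplit(tolower(sentence), "[^a-z]+")[[1]]
tokens <- tokens[tokens != ""]

# Bag of words: word order is discarded, only the counts are kept.
bag <- table(tokens)

# Bigrams (n-grams with n = 2): every pair of neighbouring tokens.
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

print(bag)
print(bigrams)
```

Counting how often each bigram occurs across a larger corpus is one simple way to spot candidate co-locations such as "prime minister".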
What is an object?
• Anything may become an object
• Temporary objects
  • Only appear in console
  • Their values must be stored in order to use them in operations
• Stored objects
  • Must be defined by user
  • Remain the same unless overwritten
  • Must be removed by user as well

[Screenshots: RStudio script editor and console]

Script:

    one <- 1

    one + one

    two <- one + one

    two

Console:

    > one <- 1
    > one + one
    [1] 2
    > two <- one + one
    > two
    [1] 2

Data classes

    > as.numeric(10.49)
    [1] 10.49
    >
    > as.integer(10.49)
    [1] 10
    >
    > as.character(-1)
    [1] "-1"
    >
    > as.numeric("anythingwithinquotes")
    [1] NA
    Warning message:
    NAs introduced by coercion
    >
    > 5 > 10
    [1] FALSE
    >
    > as.character(5 > 10)
    [1] "FALSE"

Object types - properties of objects
• vector
  • sequence (1-dimensional) of elements of same data class
• matrix
  • 2-dimensional rectangular collection of elements of same data class
  • array: n-dimensional matrix
• list
  • vector that can contain elements of different data classes
• data frame
  • list of vectors of equal length
  • table data

Vector

    > c(2, 3, 5)
    [1] 2 3 5
    >
    > c("aa", "bb", "cc", "dd", "ee")
    [1] "aa" "bb" "cc" "dd" "ee"
    >
    > c(TRUE, FALSE, TRUE, FALSE, FALSE)
    [1]  TRUE FALSE  TRUE FALSE FALSE

Matrix

    > m <- matrix(data = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
    +             nrow = 3,
    +             ncol = 4)
    >
    > m
         [,1] [,2] [,3] [,4]
    [1,]    1    4    7   10
    [2,]    2    5    8   11
    [3,]    3    6    9   12

List

    > n <- c(2, 3, 5)
    > s <- c("aa", "bb", "cc", "dd", "ee")
    > b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
    > x <- list(n, s, b, 3) # x contains copies of n, s, b
    > x
    [[1]]
    [1] 2 3 5

    [[2]]
    [1] "aa" "bb" "cc" "dd" "ee"

    [[3]]
    [1]  TRUE FALSE  TRUE FALSE FALSE

    [[4]]
    [1] 3

Data frame

    > teams <- c("PHI", "NYM", "FLA", "ATL", "WSN")
    > wins <- c(92, 89, 94, 72, 59)
    > losses <- c(70, 73, 77, 90, 102)
    >
    > data <- data.frame(teams, wins, losses)
    >
    > data
      teams wins losses
    1   PHI   92     70
    2   NYM   89     73
    3   FLA   94     77
    4   ATL   72     90
    5   WSN   59    102

R functions
• word() indicates a function

    > sqrt(9)
    [1] 3

• function(argument_1, argument_2, ...)

    > sample(x = 0:100, size = 10, replace = FALSE)
    [1] 48 50 37 94 42 39 21 19 63 95

• basic functions (part of the basic R package)
• package functions (part of the particular package)
• user functions (user-defined functions)

R libraries
• Libraries make pre-defined functions available according to the problem at hand
• Install, load and unload them either through RStudio or with functions in a script
• Installing a library downloads it and its dependencies automatically

[Screenshot: RStudio after start-up, with an empty environment and the R welcome message in the console]
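The install, load and unload steps listed above can also be written in a script. The following is a hedged sketch rather than a prescribed workflow: it uses the "splines" package only because it ships with R and therefore needs no download, and the variable names (pkg, loaded_after_library, loaded_after_detach) are illustrative.

```r
# Assumption: "splines" stands in for whatever package is needed;
# it ships with R, so install.packages() is normally skipped here.
pkg <- "splines"

# Install only when the package is missing (needs internet access).
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load: attach the package so its functions can be called directly.
library(pkg, character.only = TRUE)
loaded_after_library <- paste0("package:", pkg) %in% search()

# Unload: detach the package again when it is no longer needed.
detach(paste0("package:", pkg), character.only = TRUE, unload = TRUE)
loaded_after_detach <- paste0("package:", pkg) %in% search()
```

With a fixed package name the shorter forms library(splines) and detach("package:splines", unload = TRUE) do the same job; character.only = TRUE is only needed because the name is held in a variable.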
[Screenshots: RStudio Packages pane listing the installed libraries (assertthat, audio, beepr, BH, ...). Ticking the checkbox next to beepr runs the matching library() call in the console]

    > library("beepr", lib.loc="~/R/win-library/3.4")

The same call typed in a script:

    library(beepr)

[Screenshots: Install Packages dialog with the fields Install from (Repository (CRAN, CRANextra)), Packages (separate multiple with space or comma), Install to Library and Install dependencies; typing "netwo" suggests network, NetworkChange, networkD3, networkDynamic, ...]

    > install.packages("network")
    Installing package into 'C:/Users/Lukas/Documents/R/win-library/3.4'
    (as 'lib' is unspecified)
    trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/network_1.13.0.zip'
    Content type 'application/zip' length 661553 bytes (646 KB)
    downloaded 646 KB

    package 'network' successfully unpacked and MD5 sums checked

    The downloaded binary packages are in
        C:\Users\Lukas\AppData\Local\Temp\RtmpekQD3G\downloaded_packages

[Screenshots: the Packages pane now lists network 1.13.0; ticking its checkbox loads it, unticking it detaches it]

    > library("network", lib.loc="~/R/win-library/3.4")
    network: Classes for Relational Data
    Version 1.13.0 created on 2015-08-31.
    copyright (c) 2005, Carter T. Butts, University of California-Irvine
                        Mark S. Handcock, University of California -- Los Angeles
                        David R. Hunter, Penn State University
                        Martina Morris, University of Washington
                        Skye Bender-deMoll, University of Washington
     For citation information, type citation("network").
     Type help("network-package") to get started.

    > detach("package:network", unload=TRUE)

Basic R functions

    c()         # combine two or more elements into an object
    class()     # explore elements' data class
    length()    # number of elements in an object (for a data frame: number of columns)
    dim()       # explore dimensions of two-dimensional obj.
    nrow()      # number of rows
    ncol()      # number of columns
    head()      # first few rows of data
    tail()      # last few rows of data
    str()       # explore structure of object
    names()     # names in the named vector - one dimension
    rownames()  # names of rows - two dimensions
    colnames()  # names of columns - two dimensions

Working directory
• Folder where all imports and exports take place - it is enough to set it once
• Makes data import and export easier
• Functions setwd() and getwd()
• Does not accept a single backslash in a Windows path
  • Replace backslash \ with forward slash / or double backslash \\

    setwd("C:\\Users\\Lukas\\Documents\\R intro")
    setwd("C:/Users/Lukas/Documents/R intro")

[Screenshot: RStudio menu Session > Set Working Directory, with the options To Source File Location, To Files Pane Location and Choose Directory... (Ctrl+Shift+H)]
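A small sketch of how the working directory behaves, under stated assumptions: tempdir() stands in for a real project folder, "note.txt" is a throwaway file name, and the variable names are illustrative only.

```r
# Remember where we started so the change can be undone.
old_wd <- getwd()

# Assumption: tempdir() is a stand-in for a real project folder.
project_dir <- tempdir()
setwd(project_dir)

# Relative file names now resolve inside the working directory.
writeLines("hello", "note.txt")
file_written <- file.exists(file.path(project_dir, "note.txt"))

# file.path() joins path parts with forward slashes on every platform,
# which avoids the single-backslash problem on Windows.
example_path <- file.path("C:", "Users", "Lukas", "Documents", "R intro")

# Clean up and return to the original folder.
unlink("note.txt")
file_removed <- !file.exists(file.path(project_dir, "note.txt"))
setwd(old_wd)
```

Building paths with file.path() instead of typing separators by hand is one way to keep a script portable between Windows and other systems.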
Data output
• Save entire workspace
  • Save all R objects you've created so far
  • Allows you to return to your work or back up current work
• Save particular object
  • Export data to tabular objects
  • CSV as most common format

CSV - most common data format
• Comma-Separated Values
• Tabular data separated by commas (separator/delimiter) or other signs (tabulator, space, semicolon)
• CSV file (.csv), TSV file (.tsv) - always a plain text file (.txt)
• Every row must have the same number of columns (separators)

    cars,type,price,consumption,emissions,expensive
    BMW,3,1200000,6.2,0,0
    Audi,A4,1164000,5.9,0,0
    VW,Passat,950500,6.2,NA,NA

CSV - other examples

    cars;type;price;consumption;emissions
    BMW;3;1200000;6.2;0
    Audi;A4;1164000;5.9;0
    VW;Passat;950500;6.2;0

    "cars" "type" "price" "consumption" "emissions"
    "BMW" "3" "1,200,000" "6.2" "0"
    "Audi" "A4" "1,164,000" "5.9" "0"
    "VW" "Passat" "950,500" "6.2" "0"

    cars,type,price,consumption,emissions

• Bad data - improper use of the comma delimiter results in an uneven number of fields across rows

[Screenshot: RStudio data viewer showing an imported data set with the columns npreg, glu, bp, skin, bmi, ped, age, type]
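One way to check a CSV file before importing it is to count the fields on every line. The sketch below is an illustration under assumptions: the files are written to tempfile() locations, the data are the cars example from the slides, the "bad" file is a made-up case in which an unquoted price uses commas as thousands separators, and all variable names are hypothetical.

```r
# Well-formed file: every line has the same number of fields.
csv_lines <- c(
  "cars,type,price,consumption,emissions",
  "BMW,3,1200000,6.2,0",
  "Audi,A4,1164000,5.9,0",
  "VW,Passat,950500,6.2,0"
)
csv_file <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_file)

# count.fields() reports the number of fields on each line.
fields_per_line <- count.fields(csv_file, sep = ",")
all_rows_even <- length(unique(fields_per_line)) == 1

# Import as a data frame; the header line becomes the column names.
cars <- read.csv(csv_file, stringsAsFactors = FALSE)

# Bad data: the unquoted price 1,200,000 is split into three fields.
bad_lines <- c("cars,type,price", "BMW,3,1,200,000")
bad_file <- tempfile(fileext = ".csv")
writeLines(bad_lines, bad_file)
bad_fields <- count.fields(bad_file, sep = ",")

print(fields_per_line)
print(bad_fields)
```

A semicolon-separated file can be read the same way by passing sep = ";" to both count.fields() and read.csv().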
Exporting object - tabular
• Function write.table()
• Name of file must be specified
• Easy to import to Excel or other software

    frequencies <- c(92,89,94,72,59)
    write.table(frequencies, "frequencies.csv", sep = ",",
                row.names = FALSE, col.names = TRUE,
                fileEncoding = "UTF-8")

Exporting object - unstructured
• Function writeLines()
• Has basically no further arguments
• Saves the object as plain text, one element per line

    frequencies <- c(92,89,94,72,59)
    # writeLines() accepts only character vectors, so convert the numbers first
    writeLines(as.character(frequencies), "frequencies.txt")

Text analysis in R
• Most prominent package for text analysis is "tm" (stands for text mining)
  • Provides tools for corpus creation, text manipulation, term-document matrix creation
  • Makes it easy to read text documents as a corpus
• Competing packages - "quanteda"
  • Developed by Ken Benoit (WordScores)
  • Provides some TA methods
  • Overlaps with "tm" package - if both packages are loaded, their function names conflict (feature, not bug)

Corpus
• getSources() provides list of available sources
  • Files inside a directory - DirSource()
  • Text inside a vector - VectorSource()
  • Dataframe, XML, links to web-sites, ...
• Corpus() creates a corpus object out of text sources

    library(tm)

    my.texts <- "C:\\Users\\Lukas\\Desktop\\data\\"
    directory.source <- DirSource(directory = my.texts)
    text.corpus <- Corpus(directory.source)

Corpus operations - functions
• Useful functions:
  • removePunctuation() - remove all punctuation
  • removeWords() - remove stopwords
  • stripWhitespace() - remove duplicate white space
  • removeNumbers() - remove all numbers
  • stemDocument() - stem document
  • PlainTextDocument() - turn document into tm package's plain text format

Corpus operations
• tm_map() function applies manipulations over the corpus data

    edited.corpus <- text.corpus
    edited.corpus <- tm_map(edited.corpus, removeNumbers)
    edited.corpus <- tm_map(edited.corpus, removePunctuation)
    edited.corpus <- tm_map(edited.corpus, stripWhitespace)
    edited.corpus <- tm_map(edited.corpus, removeWords, stopwords("english"))

Term-document matrix
• Function TermDocumentMatrix()
  • Terms in rows
  • Documents in columns
• DocumentTermMatrix() creates the transposed matrix (documents in rows, terms in columns)
• Output is a non-standard matrix object
  • If matrix operations are needed, it must be converted to basic matrix format with the as.matrix() function

    tdm <- TermDocumentMatrix(edited.corpus)
    dtm <- DocumentTermMatrix(edited.corpus)
    tdm.matrixed <- as.matrix(tdm)

Useful functions in "tm"
• removeSparseTerms()
  • Removes terms down to a defined sparsity of the TDM - drops terms which are used sparsely across documents
• findFreqTerms()
  • Lists most frequent terms across the TDM
  • Does not provide frequencies, though
• findAssocs()
  • Correlation of appearance of a term with other terms across the TDM - returns Pearson's r

Frequencies
• findFreqTerms() shows frequent terms
  • Has two arguments defining bounds - lowfreq, highfreq
• Easier to calculate frequencies separately
  • Convert TDM to matrix with as.matrix()
  • Calculate sums of rows with rowSums()
  • Sort the vector with sort() using the decreasing argument

    tdm.matrixed <- as.matrix(tdm)
    frequencies <- rowSums(tdm.matrixed)
    frequencies <- sort(frequencies, decreasing = TRUE)

Wordclouds
• Package "wordcloud"
• Function wordcloud()

Attribute      Description
words          Terms
freq           Frequencies of terms
scale          Two values in c() function to bound upper and lower scale
max.words      Maximum number of words rendered
random.order   Binary - should terms be placed in random order?
rot.per        Proportion of terms placed vertically
colors         Color or color palette
random.color   Binary - should colors be assigned randomly or based on the word frequency?

Wordclouds

    tdm.matrixed <- as.matrix(tdm)
    frequencies <- rowSums(tdm.matrixed)
    frequencies <- sort(frequencies, decreasing = TRUE)
    terms <- names(frequencies)

    library(wordcloud)
    wordcloud(words = terms, freq = frequencies, scale = c(5, 0.5),
              max.words = 150, random.order = FALSE, rot.per = 0,
              colors = "red")

Wordclouds
[Figure: word cloud of the most frequent terms, e.g. shares, share, stock, company, market, securities, management, business, tender]
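To see what the term-document matrix and the rowSums() step compute, the same logic can be sketched in base R without the "tm" package. This is an illustration, not how tm works internally: the two documents are the sentences from the co-occurrence slide, tokenise() is a hypothetical helper, and splitting on non-letters is an assumed, very crude tokeniser.

```r
# Two tiny "documents" taken from the co-occurrence slide.
docs <- c(
  doc1 = "The quick brown fox jumps over the lazy dog.",
  doc2 = "Brown dog sleeps well."
)

# Hypothetical helper: lowercase and split on any run of non-letters.
tokenise <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens[tokens != ""]
}

token_list <- lapply(docs, tokenise)
vocabulary <- sort(unique(unlist(token_list)))

# Term-document matrix: terms in rows, documents in columns.
tdm <- sapply(token_list, function(tokens) {
  as.integer(table(factor(tokens, levels = vocabulary)))
})
rownames(tdm) <- vocabulary

# Overall term frequencies, most frequent terms first.
frequencies <- sort(rowSums(tdm), decreasing = TRUE)

print(tdm)
print(frequencies)
```

The resulting named vector has the same shape as the frequencies object built from as.matrix(tdm) in the slides, so names(frequencies) and the counts could be passed to wordcloud() in the same way.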