Text analysis
Lukáš Lehotský

R Studio layout
• Scripting window
• Environment (stored objects), History
• Console window
• Files, Plots, Packages, Help, Viewer

[Screenshot: empty RStudio session with the scripting window, the console showing the R startup message, an empty environment, and the Files/Plots/Packages/Help/Viewer pane]

Object
• An object is a container which holds data and can be manipulated with functions
• The most basic object is called a vector
• There are other types of objects - matrix, data frame, list

    one <- 1

    > one + one
    [1] 2
    > two <- one + one
    > two
    [1] 2

[Screenshots: RStudio with the script and console above and the Packages pane listing the user library (assertthat, audio, beepr, BH, bindr, bindrcpp, bitops, Cairo, chron, colorspace, data.table, dichromat, ...)]

    library("beepr", lib.loc = "~/R/win-library/3.4")

    library(beepr)
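The object types named on the Object slide can be tried out directly in the console. The lines below are only a minimal base R sketch; the object names `numbers`, `grid`, `people`, and `mixed` and their contents are made up for illustration and do not come from the course script.

```r
# Vector: the most basic object; a single number is a vector of length 1
one <- 1
two <- one + one
numbers <- c(one, two, 3)          # c() combines values into one vector

# Matrix: values of one type arranged in rows and columns
grid <- matrix(1:6, nrow = 2)

# Data frame: a table whose columns may have different types
people <- data.frame(name = c("Ann", "Bob"), age = c(31, 27))

# List: a container that can hold objects of any kind
mixed <- list(numbers, grid, people)

# Functions manipulate objects
length(numbers)   # number of elements in the vector
dim(grid)         # number of rows and columns of the matrix
nrow(people)      # number of rows of the data frame
class(mixed)      # type of the object
```

After running these lines, every object appears in the Environment pane under its name.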
[Screenshot: Session menu -> Set Working Directory -> Choose Directory... (Ctrl+Shift+H)]

Data export: saving XLSX
• Package "xlsx"
• Function write.xlsx()
• Arguments
    • x - object from the environment which you want to export
    • file - name of the file in your working directory

    write.xlsx(x = object, file = "mysheet.xlsx")

Quantitative TA in R

[Figure: overview of text-as-data methods (Grimmer & Stewart 2013)]
Acquire Documents (existing corpora, electronic sources, undigitized text) -> Preprocess -> Research Objective
• Classification
    • Known Categories
        • Dictionary Methods
        • Supervised Methods
            • Individual Classification (Individual Methods, Ensembles)
            • Measuring Proportions (ReadMe)
    • Unknown Categories
        • Fully Automated Clustering
            • Single Membership Models
            • Mixed Membership Models: Document Level (LDA), Date Level (Dynamic Multitopic Model), Author Level (Expressed Agenda Model)
        • Computer Assisted Clustering
• Ideological Scaling
    • Supervised (wordscores)
    • Unsupervised (wordfish)

"Be careful what is a result and what is just a residue of your data choices"
Jana Diesner, 2018

Text analysis in R
• Package "quanteda" (http://quanteda.io)
    • Developed by Ken Benoit (LSE)
    • Comprehensive package on text analysis methods
• Package "readtext"
    • Ken Benoit & Adam Obeng
    • Package which allows data import from text sources
    • Easy to work with
• Package "stopwords"
    • Ken Benoit, David Muhr & Kohei Watanabe
    • Package containing various stopwords for different languages

Before we start
• Open the folder "text_analysis_quanti"
• Open script file "text_analysis_l.R" in R Studio
• Install all libraries
    • quanteda, readtext, stopwords, xlsx

Steps leading to analysis
Set directory -> Load packages -> Read texts -> Create corpus -> Extract tokens -> Create DFM

Set working directory
• Window approach
    • Session -> Set Working Directory -> Choose Folder
• Script approach

    work.dir <- "C:\\path\\to\\folder\\"
    setwd(work.dir)

Load packages
• Window approach
    • Tick the package in the Packages pane
• Script approach

    library(readtext)
    library(quanteda)
    library(stopwords)
    library(xlsx)

Reading texts into R
• readtext() function loads all text files into R
• Very easy to use - reads everything in any specified folder
• Supports various document types
    • TXT, PDF, DOC, Twitter data format, JSON
• You only need to supply a path to a specific folder
• Arguments
    • file - path to a specific source file or path to a folder containing files
    • encoding

Reading texts into R
• Encoding
    • Text files are stored in a specific computer-readable format (character encoding)
    • Consider text "Príklad zlého kódovania"
    • ASCII/ISO-8859-1: "PrÃ­klad zlÃ©ho kÃ³dovania"
    • UTF-8: "Príklad zlého kódovania"
    • As a rule of thumb, UTF-8 encoding is desired

Reading texts into R

    text.dir <- "C:\\path\\to\\folder\\with\\texts\\"
    texts <- readtext(file = text.dir, encoding = "UTF-8")

• texts - name of a new object
• readtext - function
• file - argument specifying location of texts (object input)
• encoding - argument specifying character encoding (text input = quotes)

Corpus
• Simple function corpus()
• Creates corpus from all imported texts from the previous step
• Arguments
    • x - imported text files
    • docnames - optional specification of document names

Corpus
• All sorts of statistics may be acquired once the corpus is generated

    corp <- corpus(x = texts)

• summary()
    • Provides overview of corpus documents
• ndoc()
    • Counts number of documents in the corpus

    ndoc(corp)
    summary(corp)

From corpus to DFM
• Two-step process
• Tokenization of corpus
    • A step necessary to apply some pre-processing choices which are not text-based (removal of noise)
    • Remove numbers
    • Remove punctuation
    • Remove white space (separators)
• DFM generation from tokens
    • Further pre-processing choices (because of bag-of-words)
    • Stemming
    • Lowercasing
    • Stop words removal
    • Dictionary application

Tokenization
• Function tokens()
• Tokenization arguments
    • what - word, character, sentence
    • ngrams - ngramization of the corpus
• Pre-processing arguments
    • remove_numbers - numerals
    • remove_punct - punctuation
    • remove_symbols - special "Unicode" symbols (encoding residues)
    • remove_separators - white space, line ends, etc.
    • remove_hyphens - remove hyphens between words

Tokenization

    tokenization <- tokens(x = corp,
                           what = "word",
                           ngrams = 1,
                           remove_numbers = TRUE,
                           remove_punct = TRUE,
                           remove_separators = TRUE,
                           remove_hyphens = FALSE)

    tokenization.bigrams <- tokens(x = corp,
                                   what = "word",
                                   ngrams = c(1:2),
                                   remove_numbers = TRUE,
                                   remove_punct = TRUE,
                                   remove_separators = TRUE,
                                   remove_hyphens = FALSE)

Document-feature matrix
• Function dfm()
• Documents in rows, features (tokens) in columns
• Preprocessing arguments
    • tolower - converts words to lowercase
    • stem - applies a stemmer
    • remove - list of words to be dropped from the DFM
• Application arguments
    • dictionary - applies dictionary and converts features from tokens to dictionary dimensions
    • groups - adds another dimension by which the corpus can be grouped/split

Document-feature matrix

    basic.matrix <- dfm(x = tokenization, tolower = TRUE)
    bigram.matrix <- dfm(x = tokenization.bigrams, tolower = TRUE)
    stem.matrix <- dfm(x = tokenization, tolower = TRUE, stem = TRUE)
    prep.matrix <- dfm(x = tokenization, tolower = TRUE, stem = TRUE,
                       remove = stopwords(language = "en"))

DFM weighting
• DFM frequencies are displayed in absolute numbers
    • Document size bias
• dfm_weight()
    • Weighting of terms according to document size or other rules
    • Useful to offset the effect of the document size
• dfm_tfidf()
    • Frequency of a term in a document, weighted down by the number of documents in which the term occurs
    • Useful to find term importance within a document

DFM weighting

    weight.matrix <- dfm_weight(prep.matrix, scheme = "prop")
    tfidf.matrix <- dfm_tfidf(prep.matrix)

DFM manipulation
• dfm_trim()
    • Reduction in the dimensionality - removal of very sparse words, very frequent words, etc.
• dfm_subset()
    • Subsetting of the DFM - extraction of a DFM portion
• dfm_sample()
    • Random sampling from the DFM
    • Useful in various computation-intensive tests

DFM manipulation

    red.matrix <- dfm_trim(prep.matrix, min_termfreq = 5)
    sample.matrix <- dfm_sample(prep.matrix, size = 5)

Analysis
• Corpus-based
    • Require full texts
    • E.g. KWIC
• DFM-based
    • Require frequencies
    • Bag-of-words assumption
    • E.g. token frequencies, correspondence analysis, wordfish ...

Keywords in context
• kwic() function
• Modifying arguments
    • pattern
        • A term of interest or multiple terms of interest wrapped in function c()
    • window
        • Length of text part extracted before and after the keyword term
    • case_insensitive
        • Binary - should the function ignore the case of the term?
• You can save it using write.xlsx() function

Keywords in context

    keywords.in.context <- kwic(corp, pattern = "energy", window = 5)
    keywords.in.context.2 <- kwic(corp, pattern = c("energy", "russia"), window = 5)
    write.xlsx(x = keywords.in.context, file = "kwic.xlsx")

Keywords in context
Docname | From | To | Pre | Keyword | Post
2007-2008-CZ.txt.proc.txt | 3784 | 3784 | current issues related to renewable | energy | sources , cooperation in EU
2007-2008-CZ.txt.proc.txt | 3805 | 3805 | The V4 working group on | energy | meets regularly . The European
2007-2008-CZ.txt.proc.txt | 3842 | 3842 | cooperation in the field of | energy | with Nordic Council countries .
2008-2009-PL.txt.proc.txt | 101 | 101 | in early 2009 the Russian-Ukrainian | energy | crisis broke out , with
2008-2009-PL.txt.proc.txt | 1195 | 1195 | progress , the issue of | energy | security became the prime topic
2008-2009-PL.txt.proc.txt | 1269 | 1269 | group of governmental plenipotentiaries for | energy | security . On June 3rd
2008-2009-PL.txt.proc.txt | 1580 | 1580 | September 5th 2008 - | Energy | Expert Group meeting .
2008-2009-PL.txt.proc.txt | 1613 | 1613 | The Agency for the Co-operation of | Energy | Regulations [ ACER ] ,

Keywords in context
• May be plotted easily
• textplot_xray()
    • Function for plotting
    • One or several KWIC objects may be passed (each must be passed separately)
    • Argument scale switches between absolute and relative positions (normalized by document length)

Keywords in context

    textplot_xray(
      kwic(corp, pattern = "energy", window = 5),
      kwic(corp, pattern = "security", window = 5),
      sort = TRUE
    )

    textplot_xray(
      kwic(corp, pattern = "energy", window = 5),
      kwic(corp, pattern = "security", window = 5),
      sort = TRUE,
      scale = "absolute"
    )

[Figure: lexical dispersion plot of "security" and "energy" for each document from 1999-2000-CZ to 2014-2015-SK; x-axis: relative token index, 0.00 to 1.00]

[Figure: lexical dispersion plot of "security" and "energy" for each document from 1999-2000-CZ to 2014-2015-SK; x-axis: token index, 0 to 40000]

Frequencies
• Frequency of features in the DFM
    • Absolute token frequencies
    • Dictionary category frequencies
• topfeatures()
    • General function to extract the most frequent tokens

Frequencies

    freq.basic <- textstat_frequency(x = basic.matrix, ...)
    freq.stem <- textstat_frequency(x = stem.matrix, n = ...)
    freq.prep <- textstat_frequency(x = prep.matrix, n = ...)
    write.xlsx(x = freq.prep, file = "frequencies.xlsx")

Wordcloud
• Function textplot_wordcloud()
• Arguments
    • x - terms
    • max_words - maximum number of words rendered
    • min_size - size of smallest category
    • max_size - size of largest category
    • rotation - percentage of terms placed vertically
    • color - color or color palette
    • Many other arguments available (use help)

Wordcloud

    textplot_wordcloud(x = basic.matrix, max_words = 50,
                       min_size = 1, max_size = 4,
                       rotation = 0, color = "steelblue2")
    textplot_wordcloud(x = prep.matrix, max_words = 50,
                       min_size = 1, max_size = 4,
                       rotation = 0, color = "red3")

[Figures: word cloud of basic.matrix (unstemmed words such as "cooperation", "visegrad", "energy", "security") and word cloud of prep.matrix (stems such as "cooper", "visegrad", "energi", "secur")]

Dictionaries
• Two-step process
• Requires a dictionary object
    • Manually constructed dictionary
    • Dictionary in an external location
        • File "LaverGarry.cat" in your folder
    • Dictionary included in a package
        • Package "tidytext"
        • Sentiments dictionary
• Dictionary has to be applied in the DFM construction process

Dictionaries
CULTURE
    CULTURE-HIGH
        ART (1)
        ARTISTIC (1)
        DANCE (1)
        GALLER* (1)
        MUSEUM* (1)
        MUSIC* (1)
        OPERA* (1)
        THEATRE* (1)
    CULTURE-POPULAR
        MEDIA (1)
    SPORT
        ANGLER* (1)
    PEOPLE (1)
    WAR_IN_IRAQ (1)
    CIVIL_WAR (1)
ECONOMY
    +STATE+
        ACCOMMODATION (1)
        AGE (1)
        AMBULANCE (1)
        ASSIST (1)
        BENEFIT (1)
        CARE (1)
        CARER* (1)
        CHILD* (1)
        CLASS (1)
        CLASSES (1)
        CLINICS (1)
        COLLECTIVE* (1)

Dictionaries
• Using a dataset from a file/creating own dictionary
• Function dictionary() loads a file as a dictionary
• Arguments
    • file - specifies the path to the file (because we are in a working directory, we have to specify only a file name)
    • format - specifies the pre-defined format of the dictionary
• The new object will be used in the dfm() argument dictionary
• Useful to weight the DFM after application

Dictionaries

    wordstat.dict <- dictionary(file = "LaverGarry.cat", format = "wordstat")
    dfm.dict <- dfm(tokenization, dictionary = wordstat.dict)
    dfm.dict.w <- dfm_weight(dfm.dict, scheme = "prop")

[Figure: weighted frequencies of the dictionary categories VALUES.CONSERVATIVE and VALUES.LIBERAL for each document from 1999-2000-CZ to 2014-2015-SK]
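What the dictionary step does to a DFM can be illustrated without any package. The following base R lines are only a conceptual sketch, not quanteda's implementation: the two-category dictionary `toy.dict`, the documents in `toy.tokens`, and the helper `count.matches()` are all hypothetical. Tokens are matched against glob patterns, counted per category, and the counts are then turned into proportions, which mirrors the "prop" weighting applied to a dictionary DFM.

```r
# Toy dictionary: category name -> glob-style patterns (hypothetical entries)
toy.dict <- list(
  CULTURE = c("museum*", "music*", "opera*"),
  ECONOMY = c("benefit", "care", "child*")
)

# Two toy documents, already tokenized and lowercased
toy.tokens <- list(
  doc1 = c("the", "museum", "offers", "music", "for", "children"),
  doc2 = c("child", "benefit", "and", "care", "reform")
)

# Count the tokens of one document that match any pattern of one category
count.matches <- function(tokens, patterns) {
  rx <- glob2rx(patterns)                        # glob patterns -> regular expressions
  hits <- Reduce(`|`, lapply(rx, grepl, x = tokens))
  sum(hits)
}

# Documents in rows, dictionary categories in columns
counts <- sapply(toy.dict, function(p) sapply(toy.tokens, count.matches, patterns = p))

# Proportional weighting: each row is divided by its total
props <- counts / rowSums(counts)

counts
props
```

Because only matched tokens remain after the dictionary is applied, the proportions describe the balance between categories, not the share of the whole document.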
Second data set
• Parts of UK 2010 election manifestos
• Issue of migration
• English, already pre-formatted, part of quanteda package
• Just type data_char_ukimmig2010 into script
• Same drill as before
    • Texts
    • Corpus
    • Tokens
    • DFM

Distances
• The simplest algorithm to obtain scaling
• Function textstat_dist()
• Creates a distance object which is recognized by other R packages and functions
• We may use the hclust() function, which creates hierarchical clusters, and plot the result with the plot() function afterwards

Distances

    dist.analysis <- textstat_dist(mig.dfm)
    clusters <- hclust(as.dist(dist.analysis))
    plot(clusters)

[Figure: cluster dendrogram of the manifestos; dist.analysis, hclust (*, "complete")]

Keyness
• Useful method to evaluate keywords: finds words which are specific to a document in relation to the rest of the corpus
• Function textstat_keyness()
• Argument target
    • Specifies the numeric ID of the document which is compared to the rest of the corpus
• Can be plotted via textplot_keyness()

Keyness

    key.analysis <- textstat_keyness(x = mig.dfm, target = ...)
    textplot_keyness(key.analysis)

[Figure: keyness plot, BNP against the reference documents; x-axis: chi2]

Models

Correspondence analysis
• Method based on singular value decomposition
• Reduces the complexity of the matrix into a low-dimensional space (2 or 3 dimensions)
• No underlying assumptions about distributions
• Scaling is a method of capturing the variation in the observed data
• Not clear what variation is captured (actual positions, tone, style, ...)

Correspondence analysis
• Function textmodel_ca()
• Arguments
    • sparse
        • Omits less frequent words in order to reduce the use of computer memory
    • nd
        • By default as many dimensions as possible are estimated; nd limits the number of estimated dimensions
• Useful to explore the model with function summary()

Correspondence analysis

    model <- textmodel_ca(mig.dfm, sparse = TRUE)
    summary(model)
    textplot_scale1d(model)

[Figure: document positions from correspondence analysis, from BNP, UKIP, Greens, LibDem, Labour, SNP, Coalition to Conservative; x-axis: Document position]

WordFish
• Model based on naive Bayes classifier
• Estimation of one dominant dimension
• Assumes a word is drawn from a Poisson distribution, which is based on
    • How much the actor speaks
    • How frequently the word is used
    • How strongly the word discriminates within the underlying ideological space
    • Actors' underlying position
• Model is estimated given the observed data
• Again, it is not clear what the scale captures

WordFish
• Function textmodel_wordfish()
• Arguments allow a further specification of prior assumptions about the Poisson distribution, model parameters, ...
• The result also provides a standard error (SE) for each estimated position
• Function summary() shows the estimated model
• Function textplot_scale1d() visualizes results
    • Scaling of actors
    • Scaling of words using argument margin
    • Word highlight using argument highlighted and a word list wrapped in function c()

WordFish

    model <- textmodel_wordfish(mig.dfm, sparse = TRUE)
    summary(model)
    textplot_scale1d(model)

[Figure: estimated document positions from WordFish, from Conservative, Coalition, Labour, SNP, LibDem, Greens, UKIP to BNP; x-axis: Estimated theta]

WordFish vs. CA
Wordfish | Correspondence Analysis
Conservative | Conservative
Coalition | Coalition
Labour | SNP
SNP | Labour
Liberal Democrats | Plaid Cymru
Plaid Cymru | Liberal Democrats
Green Party | Green Party
UKIP | UKIP
British National Party | British National Party

WordFish

    textplot_scale1d(model, margin = "features")
    textplot_scale1d(model, margin = "features",
                     highlighted = c("eu", "multicultur"),
                     highlighted_color = "black")

[Figure: WordFish feature plot with the highlighted terms "eu" and "multicultur"; x-axis: Estimated beta]
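The four ingredients of the WordFish model listed earlier (how much an actor speaks, how frequent a word is, how strongly a word discriminates, and the actor's position) combine into a single Poisson rate, lambda_ij = exp(alpha_i + psi_j + beta_j * theta_i). The base R sketch below only illustrates this formula with made-up parameter values for three hypothetical actors and four words; it does not estimate anything and is not the quanteda implementation.

```r
# Hypothetical parameter values for 3 actors (A, B, C) and 4 words
alpha <- c(A = 0.0, B = 0.5, C = -0.2)    # actor effect: how much the actor speaks
theta <- c(A = -1.0, B = 0.0, C = 1.0)    # actor position on the latent dimension
psi   <- c(eu = 1.0, border = 0.5, student = 0.8, ethnic = 0.2)    # word frequency
beta  <- c(eu = 0.3, border = 1.2, student = -0.9, ethnic = 1.5)   # word discrimination

# Expected count of word j in the text of actor i:
# lambda_ij = exp(alpha_i + psi_j + beta_j * theta_i)
lambda <- exp(outer(alpha, psi, "+") + outer(theta, beta, "*"))

# Simulate one document-feature matrix from the Poisson model
set.seed(1)
sim.dfm <- matrix(rpois(length(lambda), lambda),
                  nrow = nrow(lambda),
                  dimnames = dimnames(lambda))

round(lambda, 2)
sim.dfm
```

A word with a large positive beta, such as "border" in this toy setup, is expected more often from actors with a higher theta, which is how word usage carries information about positions. Estimation runs in the opposite direction: the counts are observed and the four sets of parameters are recovered from them.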