Text analysis
Lukáš Lehotský
RStudio layout
• Scripting window
• Console window
• Environment (stored objects), History
• Plots, Packages, Help, Viewer
Object
• Object is a container which holds data, and can be manipulated with functions
• The most basic object is called a vector
• There are other types of objects: matrix, data frame, list
one <- 1
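A minimal sketch of the object types named above. All object names and values here are made up for illustration and do not come from the course materials.

```r
# A vector holds elements of a single type
numbers <- c(1, 2, 3)

# A matrix is a vector arranged in rows and columns
grid <- matrix(1:6, nrow = 2, ncol = 3)

# A data frame is a table whose columns may differ in type
people <- data.frame(name = c("Ann", "Bob"), age = c(31, 45))

# A list can hold objects of any kind, including the ones above
bundle <- list(numbers = numbers, grid = grid, people = people)

# class() reports which kind of object a name refers to
class(numbers)
class(grid)
class(people)
class(bundle)
```

Running the four class() calls one by one in the console is a quick way to check what kind of object a name refers to.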
Creating/storing objects
Object name <- Object
[Screenshot: RStudio after running the script line by line. The console shows:]
> one <- 1
> one + one
[1] 2
> two <- one + one
> two
[1] 2
[The Environment pane now lists the stored objects: one = 1, two = 2]
Functions
• Pre-defined methods
• To create an object with more than one element, the function c() is used
onetofive <- c(1, 3, 5, 4, 2)
• Any object may be manipulated with a function
sort(onetofive)
[1] 1 2 3 4 5
Functions
• To extend functionality, functions have pre-defined arguments
• Arguments are further options of functions
• Some functions have many arguments, some none
• To keep a function's result, it must be stored in the environment as an object
sort(onetofive)
[1] 1 2 3 4 5
sort(onetofive, decreasing = TRUE)
[1] 5 4 3 2 1
onetofive <- sort(onetofive, decreasing = TRUE)
Functions
• Arguments usually require a specific input format
• Boolean input - TRUE or FALSE
• Name of an object - onetofive
• Text value - "linear"
• The format of each argument can be found on the function's help page
• Just add a question mark in front of the function name
?sample
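The three input formats can be tried out with functions that ship with R. This is only a sketch: rev() and approx() are chosen here for illustration and are not used elsewhere in the slides.

```r
onetofive <- c(1, 3, 5, 4, 2)

# Boolean input: decreasing takes TRUE or FALSE
sort(onetofive, decreasing = TRUE)

# Object input: x takes the name of an object stored in the environment
rev(x = onetofive)

# Text input: method takes a quoted value such as "linear"
midpoint <- approx(x = c(1, 3), y = c(10, 30), xout = 2, method = "linear")$y
midpoint

# In an interactive session, the help page lists every argument
# and the kind of input it expects:
# ?approx
```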
[Screenshots: loading the beepr package in RStudio. Ticking the package in the Packages pane runs the first command below in the console; the second command does the same from a script.]
library("beepr", lib.loc="~/R/win-library/3.4")
library(beepr)
[Screenshot: RStudio menu Session -> Set Working Directory, with the options To Source File Location, To Files Pane Location and Choose Directory... (Ctrl+Shift+H)]
Data export: saving XLSX
• Package "xlsx"
• Function write.xlsx()
• Arguments
• x - object from the environment which you want to export
• file - name of the file in your working directory
write.xlsx(x = object,  file = "mysheet.xlsx")
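A hedged sketch of the export step. The data frame results is invented for this example, and the file is written to a temporary folder rather than the working directory. The call is skipped when the xlsx package (which needs Java) is not available.

```r
# Hypothetical results table; any data frame in the environment would do
results <- data.frame(term = c("energy", "security"), count = c(12, 7))

# A temporary folder is used here; with a bare file name the file
# would be saved in the working directory instead
out.file <- file.path(tempdir(), "mysheet.xlsx")

if (requireNamespace("xlsx", quietly = TRUE)) {
  xlsx::write.xlsx(x = results, file = out.file, row.names = FALSE)
  print(file.exists(out.file))
} else {
  message("Package 'xlsx' is not available; skipping the export")
}
```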
Quantitative TA in R
Acquire Documents (Existing Corpora, Electronic Sources, Undigitized Text) -> Preprocess -> Research Objective
Research Objective
	Classification
		Known Categories
			Dictionary Methods
			Supervised Methods
				Individual Classification (Individual Methods, Ensembles)
				Measuring Proportions (ReadMe)
		Unknown Categories
			Fully Automated Clustering
				Single Membership Models
				Mixed Membership Models
					Document Level (LDA)
					Date Level (Dynamic Multitopic Model)
					Author Level (Expressed Agenda Model)
			Computer Assisted Clustering
	Ideological Scaling
		Supervised (wordscores)
		Unsupervised (wordfish)
(Grimmer & Stewart 2013)
"Be careful what is a result and what is just a residue of your data choices."
Jana Diesner, 2018
Text analysis in R
• Package "quanteda" (http://quanteda.io)
• Developed by Ken Benoit (LSE)
• Comprehensive package on text analysis methods
• Package "readtext"
• Ken Benoit & Adam Obeng
• Package which allows data import from text sources
• Easy to work with
• Package "stopwords"
• Ken Benoit, David Muhr & Kohei Watanabe
• Package containing various stopwords for different languages
Before we start
• Open the folder "text_analysis_quanti"
• Open the script file "text_analysis_1.R" in RStudio
• Install all packages
• quanteda, readtext, stopwords, xlsx
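One possible way to install only what is missing, sketched with base R functions. The object names needed and to.install are made up for this example.

```r
needed <- c("quanteda", "readtext", "stopwords", "xlsx")

# Compare against the packages already present in the library
already.there <- rownames(installed.packages())
to.install <- setdiff(needed, already.there)

# Install only in an interactive session, so that a script
# never starts downloading packages unattended
if (length(to.install) > 0 && interactive()) {
  install.packages(to.install)
}
to.install
```

An empty result, character(0), means that all four packages are already installed.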
Steps leading to analysis
Set directory -> Load packages -> Read texts -> Create corpus -> Extract tokens -> Create DFM
Set working directory
• Window approach
• Session -> Set Working Directory -> Choose Folder
• Script approach
work.dir <- "C:\\path\\to\\folder\\"
setwd(work.dir)
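A small sketch of the script approach that runs on any machine. A temporary folder stands in for your own project folder, which is an assumption made only so that the example does not depend on a particular path.

```r
# A temporary folder stands in for your own project folder
work.dir <- tempdir()

# Remember the current directory so it can be restored afterwards
old.dir <- getwd()

setwd(work.dir)
getwd()        # now points to work.dir
list.files()   # files that R can reach by file name alone

# Restore the original working directory
setwd(old.dir)
```

On Windows, a path may be written either with doubled backslashes, as above in the slides, or with forward slashes.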
Load packages
• Window approach
• Packages pane -> tick the checkbox next to the package name
• Script approach
library(readtext)
library(quanteda)
library(stopwords)
library(xlsx)
Reading texts into R
• readtext() function loads all text files into R
• Very easy to use - reads everything in any specified folder
• Supports various document types
• TXT
• PDF
• DOC
• JSON (Twitter data format)
• Just need to insert a path to a specific folder
• Arguments
• file
• Path to a specific source file or to a folder containing files
• encoding
• Character encoding of the source files
Reading texts into R
• Encoding
• Text files are stored in a particular computer-readable format (character encoding)
• Consider text "Príklad zlého kódovania"
• ASCII/ISO-8859-1: "PrÃklad zlÃ©ho kÃ³dovania"
• UTF-8: "Príklad zlého kódovania"
• As a rule of thumb, UTF-8 encoding is desired
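The garbling shown above can be reproduced with base R. The sketch below deliberately misreads the UTF-8 bytes of the example text as ISO-8859-1 (latin1) and then reverses the mistake; the object names are made up for this example.

```r
# The example text, written with Unicode escapes so that the script
# does not depend on how the script file itself is encoded
original <- "Pr\u00edklad zl\u00e9ho k\u00f3dovania"

# Misread the UTF-8 bytes as if they were ISO-8859-1 (latin1):
# every accented letter turns into two wrong characters
garbled <- iconv(original, from = "latin1", to = "UTF-8")
garbled

# Reversing the wrong step recovers the original bytes
restored <- iconv(garbled, from = "UTF-8", to = "latin1")
identical(charToRaw(restored), charToRaw(original))
```

Declaring the correct encoding when the files are first read avoids having to repair text like this later.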
Reading texts into R
text.dir <- "C:\\path\\to\\folder\\with\\texts\\"
texts <- readtext(file = text.dir, encoding = "UTF-8")
Reading texts into R
text.dir <- "C:\\path\\to\\folder\\texts\\"
texts <- readtext(file = text.dir, encoding = "UTF-8")
• texts - name of a new object
• readtext - function
• file = text.dir - argument specifying location of texts (object input)
• encoding = "UTF-8" - argument specifying character encoding (text input = quotes)
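A self-contained sketch of the import step. The folder and the two file names are invented and created on the fly in a temporary location, and the call is skipped when the readtext package is not installed. A *.txt pattern is used here to restrict the import to text files.

```r
# Create two small text files in a temporary folder
# (folder and file names are made up for this example)
text.dir <- file.path(tempdir(), "texts")
dir.create(text.dir, showWarnings = FALSE)
writeLines("Energy security is a shared priority.",
           file.path(text.dir, "doc1.txt"))
writeLines("The group discussed regional cooperation.",
           file.path(text.dir, "doc2.txt"))

if (requireNamespace("readtext", quietly = TRUE)) {
  # The pattern *.txt picks up every text file in the folder
  texts <- readtext::readtext(file = file.path(text.dir, "*.txt"),
                              encoding = "UTF-8")
  print(texts$doc_id)        # one row per file
  print(nchar(texts$text))   # length of each imported text
}
```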
Corpus
• Simple function corpus()
• Creates a corpus from all texts imported in the previous step
• Arguments
• x
• Imported text files
• docnames
• Optional specification of document names
Corpus
• All sorts of statistics may be acquired once the corpus is generated
corp <- corpus(x = texts)
• summary()
• Provides an overview of corpus documents
• ndoc()
• Counts the number of documents in the corpus
ndoc(corp)
summary(corp)
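The same two functions can be tried on a toy corpus without importing any files. The two documents below are made up, and the block is skipped when quanteda is not installed.

```r
# Two made-up documents stand in for the imported texts
toy.texts <- c(doc1 = "Energy security is a shared priority.",
               doc2 = "The group discussed regional cooperation. It met twice.")

if (requireNamespace("quanteda", quietly = TRUE)) {
  toy.corp <- quanteda::corpus(x = toy.texts)
  n.docs <- quanteda::ndoc(toy.corp)   # number of documents
  print(n.docs)
  print(summary(toy.corp))             # per-document overview
}
```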
From corpus to DFM
• Two-step process
• Tokenization of corpus
• A step necessary to apply some pre-processing choices which are not text-based (removal of noise)
• Remove numbers
• Remove punctuation
• Remove white space (separators)
• DFM generation from tokens
• Further pre-processing choices (because of bag-of-words)
• Stemming
• Lowercasing
• Stop words removal
• Dictionary application
Tokenization
• Function tokens()
• Tokenization arguments
• what - word, character, sentence
• ngrams - ngramization of the corpus
• Pre-processing arguments
• remove_numbers - numerals
• remove_punct - punctuation
• remove_symbols - special "Unicode" symbols (encoding residues)
• remove_separators - white space, line ends, etc.
• remove_hyphens - remove hyphens between words
Tokenization
tokenization <- tokens(x = corp,
                       what = "word",
                       ngrams = 1,
                       remove_numbers = TRUE,
                       remove_punct = TRUE,
                       remove_separators = TRUE,
                       remove_hyphens = FALSE)
tokenization.bigrams <- tokens(x = corp,
                               what = "word",
                               ngrams = c(1:2),
                               remove_numbers = TRUE,
                               remove_punct = TRUE,
                               remove_separators = TRUE,
                               remove_hyphens = FALSE)
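Newer quanteda releases appear to build n-grams in a separate tokens_ngrams() step rather than through an ngrams argument. Treat that as an assumption and check ?tokens for the version you have installed. The sketch below uses one made-up sentence and is skipped when quanteda is not installed.

```r
if (requireNamespace("quanteda", quietly = TRUE)) {
  toy.corp <- quanteda::corpus(c(doc1 = "In 2009 the energy crisis broke out."))

  # Split into words and drop punctuation and numbers
  toy.tokens <- quanteda::tokens(toy.corp,
                                 what = "word",
                                 remove_punct = TRUE,
                                 remove_numbers = TRUE)

  # Build unigrams and bigrams in a separate step
  toy.bigrams <- quanteda::tokens_ngrams(toy.tokens, n = 1:2)

  print(toy.tokens)
  print(toy.bigrams)
}
```

Bigrams are joined with an underscore by default, so "energy crisis" becomes the single feature energy_crisis.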
Document-feature matrix
• Function dfm()
• Documents in rows, features (tokens) in columns
• Preprocessing arguments
• tolower - converts words to lowercase
• stem - applies a stemmer
• remove - list of words to be dropped from the DFM
• Application arguments
• dictionary - applies a dictionary and converts features from tokens to dictionary dimensions
• groups - adds another dimension by which the corpus can be grouped/split
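The layout with documents in rows and features in columns can be illustrated in base R. This is a conceptual sketch on two made-up documents, not the way dfm() is implemented.

```r
docs <- c(doc1 = "Energy security and energy policy",
          doc2 = "Regional security cooperation")

# Lowercase, then split each document on white space
tokens.list <- strsplit(tolower(docs), "\\s+")
names(tokens.list) <- names(docs)

# The vocabulary is every distinct token, in alphabetical order
features <- sort(unique(unlist(tokens.list)))

# Count each feature per document, then put documents in rows
toy.dfm <- t(sapply(tokens.list, function(tok) {
  tabulate(match(tok, features), nbins = length(features))
}))
colnames(toy.dfm) <- features
toy.dfm
```

Word order is lost in this table, which is what the bag-of-words assumption refers to.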
Document-feature matrix
basic.matrix <- dfm(x = tokenization,
                    tolower = TRUE)
bigram.matrix <- dfm(x = tokenization.bigrams,
                     tolower = TRUE)
stem.matrix <- dfm(x = tokenization,
                   tolower = TRUE,
                   stem = TRUE)
prep.matrix <- dfm(x = tokenization,
                   tolower = TRUE,
                   stem = TRUE,
                   remove = stopwords(language = "en"))
DFM weighting
• DFM frequencies are displayed in absolute numbers
• Document size bias
• dfm_weight()
• Weighting of terms according to document size or other rules
• Useful to offset the effect of the document size
• dfm_tfidf()
• Frequency of a term in a document, down-weighted by the number of documents in which the term occurs
• Useful to find term importance within a document
DFM weighting
weight.matrix <- dfm_weight(prep.matrix,
                            scheme = "prop")
tfidf.matrix <- dfm_tfidf(prep.matrix)
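Both weightings can be worked through by hand on a toy matrix of counts. The tf-idf formula used below, count times log10(number of documents / document frequency), is assumed to correspond to the defaults of dfm_tfidf(); check its help page before relying on that.

```r
# Toy DFM with raw counts: documents in rows, features in columns
counts <- rbind(doc1 = c(energy = 2, security = 1, policy = 1),
                doc2 = c(energy = 1, security = 0, policy = 0))

# Proportional weighting: divide every count by its document's total
prop <- counts / rowSums(counts)

# tf-idf: count times log10(number of documents / document frequency)
doc.freq <- colSums(counts > 0)
idf <- log10(nrow(counts) / doc.freq)
tfidf <- sweep(counts, MARGIN = 2, STATS = idf, FUN = "*")

prop
tfidf
```

The feature energy occurs in every document, so its tf-idf weight is zero; security and policy occur in one document only and keep a positive weight.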
DFM manipulation
• dfm_trim()
• Reduction in the dimensionality - removal of very sparse words, very frequent words, etc.
• dfm_subset()
• Subsetting of the DFM - extraction of a DFM portion
• dfm_sample()
• Random sampling from the DFM
• Useful in various computation-intensive tests
DFM manipulation
red.matrix <- dfm_trim(prep.matrix,
                       min_termfreq = 5)
sample.matrix <- dfm_sample(prep.matrix,
                            size = 5)
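What trimming and sampling do to the matrix can be sketched in base R. The counts are made up, and the code illustrates the idea only, not the internals of dfm_trim() or dfm_sample().

```r
counts <- rbind(doc1 = c(energy = 4, security = 1, the = 9),
                doc2 = c(energy = 3, security = 0, the = 7),
                doc3 = c(energy = 0, security = 2, the = 8))

# Trimming: keep features whose total count across documents is at least 5
trimmed <- counts[, colSums(counts) >= 5, drop = FALSE]
colnames(trimmed)

# Sampling: draw two documents (rows) at random
set.seed(1)
sampled <- counts[sample(nrow(counts), size = 2), , drop = FALSE]
rownames(sampled)
```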
Analysis
Analysis
• Corpus-based
• Require full texts
• E.g. KWIC
• DFM-based
• Require frequencies
• Bag-of-words assumption
• E.g. token frequencies, correspondence analysis, wordfish ...
Keywords in context
• kwic() function
• Modifying arguments
• pattern
• A term of interest, or multiple terms of interest wrapped in function c()
• window
• Number of tokens extracted before and after the keyword
• case_insensitive
• Logical - should matching ignore letter case?
• You can save the result using the write.xlsx() function
Keywords in context
kewords.in.context <- kwic(corp,
                           pattern = "energy",
                           window = 5)
kewords.in.context.2 <- kwic(corp,
                             pattern = c("energy", "russia"),
                             window = 5)
write.xlsx(x = kewords.in.context, file = "kwic.xlsx")
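The idea behind keywords in context can be sketched in base R on a short, made-up token sequence. The helper span() is hypothetical and only serves this example; kwic() itself is not implemented this way.

```r
tokens <- c("the", "issue", "of", "energy", "security", "became",
            "the", "prime", "topic", "of", "energy", "policy")
keyword <- "energy"
window <- 2

# Positions at which the keyword occurs, ignoring letter case
hits <- which(tolower(tokens) == keyword)

# Hypothetical helper: paste the tokens between two positions,
# ignoring positions that fall outside the text
span <- function(from, to) {
  idx <- seq(from, to)
  paste(tokens[idx[idx >= 1 & idx <= length(tokens)]], collapse = " ")
}

context <- data.frame(
  position = hits,
  pre      = sapply(hits, function(i) span(i - window, i - 1)),
  keyword  = tokens[hits],
  post     = sapply(hits, function(i) span(i + 1, i + window)),
  stringsAsFactors = FALSE
)
context
```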
Keywords in context
Docname	From	To	Pre	Keyword	Post
2007-2008-CZ.txt.proc.txt	3784	3784	current issues related to renewable	energy	sources , cooperation in EU
2007-2008-CZ.txt.proc.txt	3805	3805	The V4 working group on	energy	meets regularly . The European
2007-2008-CZ.txt.proc.txt	3842	3842	cooperation in the field of	energy	with Nordic Council countries .
2008-2009-PL.txt.proc.txt	101	101	in early 2009 the Russian-Ukrainian	energy	crisis broke out , with
2008-2009-PL.txt.proc.txt	1195	1195	progress , the issue of	energy	security became the prime topic
2008-2009-PL.txt.proc.txt	1269	1269	group of governmental plenipotentiaries for	energy	security . On June 3rd
2008-2009-PL.txt.proc.txt	1580	1580	September 5th 2008 -	Energy	Expert Group meeting . The
2008-2009-PL.txt.proc.txt	1613	1613	Agency for the Co-operation of	Energy	Regulations [ ACER ] ,
Keywords in context
• May be plotted easily
• textplot_xray()
• Function for plotting
• One or several KWIC objects may be passed (each must be passed separately)
• Argument scale switches between absolute positions and relative positions (normalized by document length)
Keywords in context
textplot_xray(
kwic(corp, pattern = "energy", window = 5),
kwic(corp, pattern = "security", window = 5),
sort = TRUE
)
textplot_xray(
kwic(corp, pattern = "energy", window = 5),
kwic(corp, pattern = "security", window = 5),
sort = TRUE, scale = "absolute"
)
Lexical dispersion plot
[Figure: lexical dispersion of "security" and "energy" in each document from 1999-2000-CZ to 2014-2015-SK; x-axis: relative token index, 0.00 to 1.00]
Lexical dispersion plot
[Figure: the same plot with absolute positions; x-axis: token index, 0 to 40000]
Frequencies
• Frequency of features in the DFM
• Absolute token frequencies
• Dictionary category frequencies
• topfeatures()
• General function to extract the most frequent features and their counts
Frequencies
freq.basic <- textstat_frequency(x = basic.matrix, n = …)
freq.stem <- textstat_frequency(x = stem.matrix, n = …)
freq.prep <- textstat_frequency(x = prep.matrix, n = …)
write.xlsx(x = freq.prep, file = "frequencies.xlsx")
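What a frequency table reports can be checked by hand in base R. The counts are made up and the sketch shows the idea only; it does not call any quanteda function.

```r
counts <- rbind(doc1 = c(energy = 4, security = 1, policy = 2),
                doc2 = c(energy = 3, security = 5, policy = 0))

# Total frequency of every feature across all documents
totals <- colSums(counts)

# The n most frequent features, highest first
n <- 2
top <- sort(totals, decreasing = TRUE)[seq_len(n)]
top
```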
Wordcloud
• Function textplot_wordcloud()
Argument	Description
x	Terms
max_words	Maximum number of words rendered
min_size	Size of smallest category
max_size	Size of largest category
rotation	Percentage of terms placed vertically
color	Color or color palette
...	Many other arguments available (use help)
Wordcloud
textplot_wordcloud(x = basic.matrix,
                   max_words = 50, min_size = 1, max_size = 4,
                   rotation = 0, color = "steelblue2")
textplot_wordcloud(x = prep.matrix,
                   max_words = 50, min_size = 1, max_size = 4,
                   rotation = 0, color = "red3")
[Figure: two word clouds, the first from basic.matrix (unprocessed words such as "visegrad", "energy", "cooperation", "security") and the second from prep.matrix (stemmed words with stop words removed, such as "visegrad", "energi", "secur", "defenc")]
Dictionaries
• Two-step process
• Requires a dictionary object
• Manually constructed dictionary
• Dictionary in the external location
• File "LaverGarry.cat" in your folder
• Dictionary included in a package
• Package "tidytext"
• Sentiments dictionary
• The dictionary has to be applied during DFM construction
Dictionaries
CULTURE
    CULTURE-HIGH
        ART (1) ARTISTIC (1) DANCE (1) GALLER* (1) MUSEUM* (1) MUSIC* (1) OPERA* (1) THEATRE* (1)
    CULTURE-POPULAR
        MEDIA (1)
    SPORT
        ANGLER* (1)
    PEOPLE (1)
    WAR_IN_IRAQ (1)
    CIVIL_WAR (1)
ECONOMY
    +STATE+
        ACCOMMODATION (1) AGE (1) AMBULANCE (1) ASSIST (1) BENEFIT (1) CARE (1) CARER* (1) CHILD* (1) CLASS (1) CLASSES (1) CLINICS (1) COLLECTIVE* (1)
Dictionaries
• Using a dictionary from a file / creating your own dictionary
• Function dictionary() loads a file as a dictionary
• Arguments
• file - path to the file (because we are in the working directory, the file name alone is enough)
• format - the pre-defined format of the dictionary
• The new object is passed to the dictionary argument of dfm()
• Useful to weight the DFM after applying the dictionary
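The two steps (look up dictionary categories, then weight the result) can be sketched in base R without any package. Everything here is hypothetical: toy.dict, toy.tokens and the helper count.matches() are invented for illustration, and the wildcard matching via glob2rx() only approximates what a dictionary lookup does. Proportional weighting corresponds to the "prop" scheme: each document's category counts are divided by their row total.

```r
# Hypothetical two-category dictionary with wildcard (glob) patterns
toy.dict <- list(
  culture = c("art", "museum*", "music*"),
  state   = c("care", "child*", "benefit")
)

# Hypothetical tokenized documents
toy.tokens <- list(
  doc1 = c("the", "museums", "and", "music", "need", "care"),
  doc2 = c("children", "benefit", "from", "child", "care")
)

# Count the tokens of one document that match any pattern of one category
count.matches <- function(tokens, patterns) {
  regex <- paste(glob2rx(patterns), collapse = "|")
  sum(grepl(regex, tokens))
}

# Documents in rows, dictionary categories in columns
dict.counts <- t(sapply(toy.tokens, function(tok) {
  sapply(toy.dict, function(pat) count.matches(tok, pat))
}))

# Proportional weighting: each row sums to 1
dict.prop <- dict.counts / rowSums(dict.counts)
print(dict.prop)
```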
Dictionaries
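wordstat.dict <- dictionary(file = "LaverGarry.cat",
                            format = "wordstat")
# In quanteda 3 and later the dictionary argument of dfm() is deprecated;
# there, use dfm_lookup(dfm(tokenization), dictionary = wordstat.dict)
dfm.dict <- dfm(tokenization,
                dictionary = wordstat.dict)
dfm.dict.w <- dfm_weight(dfm.dict, scheme = "prop")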
[Figure: proportionally weighted dictionary scores per document, panels VALUES.CONSERVATIVE and VALUES.LIBERAL; documents from 1999-2000-CZ.txt.proc.txt to 2014-2015-SK.txt.proc.txt; value axis from 0.00 to 0.06]
Second data set
• Parts of UK 2010 election manifestos
• Issue of migration
• English, already pre-formatted, part of quanteda package
• Just type data_char_ukimmig2010 into the script
• Same drill as before
• Texts
• Corpus
• Tokens
• DFM
Distances
• The simplest way to obtain a scaling of documents
• Function textstat_dist()
• Creates a distance object that other R packages and functions recognize
• The hclust() function builds hierarchical clusters from it, which can then be drawn with plot()
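The distance-then-cluster idea can be tried on a small matrix with base R alone. This is a sketch under stated assumptions: toy.dfm and its counts are invented, and dist() with Euclidean distance stands in for the distance computation on a real DFM.

```r
# Hypothetical toy document-feature matrix (counts invented for illustration)
toy.dfm <- matrix(
  c(5, 0, 1,
    4, 1, 0,
    0, 6, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("docA", "docB", "docC"), c("border", "student", "visa"))
)

# Euclidean distances between documents (rows)
toy.dist <- dist(toy.dfm, method = "euclidean")

# Agglomerative hierarchical clustering; "complete" linkage is the hclust() default
toy.clusters <- hclust(toy.dist, method = "complete")

# Draw the dendrogram
plot(toy.clusters)
```

docA and docB have nearly identical counts, so they are merged first; docC joins at a much greater height.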
Distances
dist.analysis <- textstat_dist(mig.dfm)
# as.dist() ensures hclust() receives a standard distance object
clusters <- hclust(as.dist(dist.analysis))
plot(clusters)
[Figure: Cluster Dendrogram; dist.analysis, hclust (*, "complete")]
Keyness
• Useful method to evaluate keywords: it finds words that are specific to one document in relation to the rest of the corpus
• Function textstat_keyness()
• Argument target
• Specifies the numeric ID of the document that is compared to the rest of the corpus
• Can be plotted via textplot_keyness()
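For a single word, a chi-squared keyness score can be sketched in base R. The counts below are invented, and the sketch uses chisq.test() without continuity correction; the defaults of textstat_keyness() may differ, so treat this as an illustration of the idea rather than a reproduction of its output.

```r
# Hypothetical counts (invented): occurrences of one word in the target
# document and in the rest of the corpus, plus the total token counts
word.target     <- 30
total.target    <- 1000
word.reference  <- 20
total.reference <- 4000

# 2 x 2 contingency table: word vs. all other tokens, target vs. reference
tab <- matrix(
  c(word.target,    total.target    - word.target,
    word.reference, total.reference - word.reference),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("target", "reference"), c("word", "other"))
)

# Pearson chi-squared statistic without continuity correction
test <- chisq.test(tab, correct = FALSE)

# Sign the statistic: positive when the word is over-represented in the target
sign.dir <- ifelse(tab["target", "word"] > test$expected["target", "word"], 1, -1)
keyness <- unname(sign.dir * test$statistic)
print(keyness)
```

A positive value marks a word that is typical of the target document; a negative value marks a word that is typical of the reference documents.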
Keyness
key.analysis <- textstat_keyness(x = mig.dfm, target = ...)
textplot_keyness(key.analysis)
[Figure: keyness plot comparing BNP (target) with the reference documents; x-axis: chi2]
Models
Correspondence analysis
• Based on singular value decomposition
• Reduces the complexity of the matrix to a low-dimensional space (2 or 3 dimensions)
• No underlying assumptions about distributions
• Scaling is a method of capturing the variation in the observed data
• It is not clear what variation is captured (actual positions, tone, style, ...)
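The link between correspondence analysis and singular value decomposition can be shown in a few lines of base R. This is a textbook-style sketch on an invented matrix (toy.dfm is hypothetical), not the code behind textmodel_ca(): the counts are turned into proportions, compared with what independence of documents and features would predict, and the standardized residuals are decomposed with svd().

```r
# Hypothetical toy document-feature matrix (counts invented for illustration)
toy.dfm <- matrix(
  c(10, 2, 1,
     8, 3, 2,
     1, 9, 7,
     2, 8, 9),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("border", "student", "visa"))
)

P <- toy.dfm / sum(toy.dfm)   # correspondence matrix (proportions)
row.mass <- rowSums(P)        # document masses
col.mass <- colSums(P)        # feature masses

# Standardized residuals from the independence model
E <- outer(row.mass, col.mass)
S <- (P - E) / sqrt(E)

# Singular value decomposition of the residual matrix
dec <- svd(S)

# Document coordinates on the first dimension (standard coordinates)
doc.position <- dec$u[, 1] / sqrt(row.mass)
print(round(doc.position, 2))
```

The sign of the dimension is arbitrary; what matters is that doc1 and doc2 fall on one side and doc3 and doc4 on the other.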
Correspondence analysis
• Function textmodel_ca()
• Arguments
• sparse
• Omits less frequent words to reduce memory use
• nd
• Limits the number of estimated dimensions; by default as many dimensions as possible are estimated
• Function summary() is useful for exploring the model
Correspondence analysis
model <- textmodel_ca(mig.dfm, sparse = TRUE)
summary(model)
textplot_scale1d(model)
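[Figure: textplot_scale1d() output for the correspondence analysis model; x-axis: Document position; documents from top to bottom: BNP, UKIP, Greens, LibDem, Labour, SNP, Coalition, Conservative]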
WordFish
• Scaling model that relies on the naive Bayes assumption (words are treated as occurring independently of each other)
• Estimation of one dominant dimension
• Assumes each word count is drawn from a Poisson distribution whose rate depends on
• How much the actor speaks
• How frequently the word is used overall
• How strongly the word discriminates positions in the underlying ideological space
• The actor's underlying position
• Model is estimated given the observed data
• Again, it is unclear what the scale captures
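In the usual notation for Wordfish, the four ingredients of the Poisson rate can be written as one equation. The symbols are the conventional ones and are an assumption here, since the slides do not define them: $y_{ij}$ is the count of word $j$ in the text of actor $i$.

```latex
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \theta_i
```

Here $\alpha_i$ captures how much actor $i$ speaks, $\psi_j$ the overall frequency of word $j$, $\beta_j$ how strongly word $j$ discriminates along the dimension, and $\theta_i$ the position of actor $i$. The plots labelled "Estimated theta" and "Estimated beta" show the last two sets of parameters.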
WordFish
• Function textmodel_wordfish()
• Arguments allow further specification of prior assumptions about the Poisson distribution, model parameters, ...
• The result also provides a standard error (SE) for each estimated position
• Function summary() shows the estimated model
• Function textplot_scale1d() visualizes the results
• Scaling of actors
• Scaling of words using argument margin
• Word highlighting using argument highlighted and a word list wrapped in function c()
WordFish
model <- textmodel_wordfish(mig.dfm)
summary(model)
textplot_scale1d(model)
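[Figure: textplot_scale1d() output for the WordFish model; x-axis: Estimated theta; documents from top to bottom: Conservative, Coalition, Labour, SNP, LibDem, Greens, UKIP, BNP]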
WordFish vs. CA
Wordfish	Correspondence Analysis
Conservative	Conservative
Coalition	Coalition
Labour	SNP
SNP	Labour
Liberal Democrats	Plaid Cymru
Plaid Cymru	Liberal Democrats
Green Party	Green Party
UKIP	UKIP
British National Party	British National Party
WordFish
textplot_scale1d(model, margin = "features")
textplot_scale1d(model,
                 margin = "features",
                 highlighted = c("eu", "multicultur"),
                 highlighted_color = "black")
[Figure: textplot_scale1d() feature plot for the WordFish model; x-axis: Estimated beta; the words "eu" and "multicultur" are highlighted]