Text analysis
Lukáš Lehotský

Why text analysis?
• Text as discourse
• Text as patterns
• Manifest vs. latent content

"text analysis is just a fancy and convoluted way how to obtain independent or dependent variable" (Inaki Sagarzazu, 2016)

Design of CA research
[Figure: components of content analysis research; components shown: unitizing, sampling, recording/coding, reducing, narrating (Krippendorff 2013, p. 86)]

Methods

Methods of TA
• Supervised
• Unsupervised
• Semi-supervised

[Figure: overview of text analysis methods (Grimmer & Stewart 2013). Acquire documents (existing corpora, electronic sources, undigitized text) -> preprocess -> research objective. Classification with known categories: dictionary methods; supervised methods (individual classification with individual methods or ensembles; measuring proportions with ReadMe). Classification with unknown categories: fully automated clustering (single membership models; mixed membership models at document level (LDA), date level (Dynamic Multitopic Model) and author level (Expressed Agenda Model)); computer assisted clustering. Ideological scaling: supervised (Wordscores), unsupervised (Wordfish).]

Methods of text analysis
• Supervised methods
  • Keywords in context (KWIC)
  • Manual coding
• Semi-supervised
  • Dictionary-based methods
    • Deductively given dictionary
    • Dictionary obtained from data
      • Automatically
      • Manually
• Unsupervised
  • Frequencies
  • Topic modeling

Fully supervised
• Requires manual text processing
• Most approaches are based on manual coding of text units
• Inductive vs. deductive coding
• Inductive - data-driven
  • Categories not known
  • Open coding - categories emerge in iterative text reading
  • Axial coding - abstraction from open coding into categories
  • Constant comparative approach - re-reading already coded units
• Deductive - theory-driven
  • Categories known a priori
  • Existing code-book applied over data
  • Comparative Manifesto Project

Keywords in Context (KWIC)
• Relational analysis of a concept
• Analyzes the context of the concept through the way the concept is used within the text - the original linguistic environment of the text
• Exploration of the corpus
• Requires prior knowledge of keywords
• Input for further analysis
  • Dictionary construction
  • Frequency analysis
  • Coding

Keywords in Context (KWIC)
Example for the keyword "energy" (keyword in brackets):
  exchange of information about [energy] policy and coordination of
  and coordination of the [energy] policy of V4
  sphere of new EU [energy] legislation, especially
  rule of trans-European [Energy] Networks, concentrate on
  with the operation of [energy] facility, impact of
  the field of the [energy] sector, industry and
  [Energy] continuation of meeting of
  establishment of a common [energy] and gas market. -
  operation in the [energy] sector in the usual

Fully supervised
• Discourse analysis
• Socio-semantic networks
• Discourse network analysis
  • Socio-semantic networks of actors and the meanings (codes) they use

Fully supervised - DNA
[Figure: example discourse network linking actors such as Angela Merkel [CDU], Stefan Mappus [CDU], Sylvia Kotting-Uhl [Grüne], Nicolas Sarkozy, Europäische Union and Deutsches Atomforum]
[Screenshot: empty Discourse Network Analyzer (DNA) window before a database is created]
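The keywords-in-context idea described above can be sketched in a few lines of base R. This is only an illustrative sketch: the function name `kwic_sketch`, its `window` argument and the example sentence (assembled from the KWIC rows above) are assumptions made for this example, and dedicated text analysis packages offer more complete KWIC tools.

```r
# Minimal keywords-in-context (KWIC) sketch in base R.
# `kwic_sketch` and its arguments are hypothetical names used for illustration.
kwic_sketch <- function(text, keyword, window = 3) {
  # Lower-case the text and split it into tokens on non-alphanumeric characters
  tokens <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  tokens <- tokens[tokens != ""]
  hits <- which(tokens == tolower(keyword))
  # For every hit, collect up to `window` tokens on each side of the keyword
  rows <- lapply(hits, function(i) {
    left  <- tail(tokens[seq_len(i - 1)], window)
    right <- head(tokens[seq_len(length(tokens) - i) + i], window)
    data.frame(left = paste(left, collapse = " "),
               keyword = tokens[i],
               right = paste(right, collapse = " "),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Illustrative sentence, not a quotation from a real corpus
example_text <- "Exchange of information about energy policy and coordination of the energy policy of V4."
kwic_result <- kwic_sketch(example_text, "energy", window = 3)
print(kwic_result)
```

Each row of `kwic_result` holds one occurrence of the keyword with its left and right context, which can then feed dictionary construction, frequency analysis or coding.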
[Screenshot: DNA "Create new database" dialog; summary: database C:\Users\Lukas\Documents\new_db.dna, number of coders: 1, statement types: 2]

Data import
• Manual import of data - text by text
• Folder import
  • Place all documents in one folder (text format)
  • Identify the folder
  • Documents get loaded into the software
• Automatic construction of metadata
  • A properly formed document name results in automatic metadata construction
  • dd.mm.yyyy - Name Surname - Publication name - TEXTTYPE.txt

[Screenshot: DNA Document menu with the items "Add new document...", "Import text files...", "Import from DNA 2.0 file...", "Import from DNA 1.31 file..." and "Batch-recode metadata..."]
[Screenshot: "Add new document..." dialog with the fields title, date, coder, author, source, section, type and notes, and a text area for pasting the document contents]
[Screenshot: "Import text files..." dialog with regex patterns for title, author, source, section, type and date (date format dd.MM.yyyy), a metadata regex preview and the buttons "Select folder", "Refresh" and "Import files"]
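To make the naming convention concrete, here is a hedged sketch of how such a file name could be split into metadata with a regular expression in base R. It is not DNA's own import logic or its regex patterns; the helper `parse_doc_name` and the pattern are assumptions written for this example, and the file name follows the convention shown above.

```r
# Sketch: split a file name of the form
# "dd.mm.yyyy - Name Surname - Publication name - TEXTTYPE.txt" into metadata.
# `parse_doc_name` is a hypothetical helper, not part of DNA.
parse_doc_name <- function(file_name) {
  pattern <- "^([0-9]{2}\\.[0-9]{2}\\.[0-9]{4}) - (.+?) - (.+?) - ([A-Z0-9-]+)\\.txt$"
  parts <- regmatches(file_name, regexec(pattern, file_name, perl = TRUE))[[1]]
  if (length(parts) == 0) {
    stop("File name does not follow the expected convention: ", file_name)
  }
  # parts[1] is the full match; the capture groups follow
  list(
    date   = as.Date(parts[2], format = "%d.%m.%Y"),
    author = parts[3],
    source = parts[4],
    type   = parts[5]
  )
}

meta <- parse_doc_name("03.02.2015 - Author Unspecified - Mlada fronta Dnes - NATIONALPRINT.txt")
str(meta)
```

A file name that does not match the convention raises an error rather than producing silently wrong metadata, which is usually the safer choice when importing a whole folder.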
[Screenshot: "Import text files..." dialog with a folder chooser pointing at C:\Users\Lukas\Documents\dna_export]
[Screenshot: "Import text files..." dialog listing nine files dated 03.02.2015 to 07.02.2015, named like "03.02.2015 - Author Unspecified - Mlada fronta Dnes - NATIONALPRINT.txt"; sources: Mlada fronta Dnes, Pravo, Lidové noviny, Blesk]
[Screenshot: metadata regex preview for the first file: title "03.02.2015 - Author Unspecified - Mlada fronta Dnes - NATIONALPRINT", author "Author Unspecified", source "Mlada fronta Dnes", type "NATIONALPRINT", coder "Admin", date 03.02.2015 (format dd.MM.yyyy)]
[Screenshot: DNA main window listing five imported documents (3 and 4 February 2015) and showing a Czech newspaper article on brown-coal mining limits; a statement popup (ID 3, start 1169, end 1195) offers the variables person, organization, concept and agreement]
[Screenshot: DNA main window with the document properties pane (title, date, coder, author, source, section, type, notes) and the first three coded statements]
[Screenshot: DNA main window showing the same article with six coded statements, in English translation (truncated as in the statement list): 1 "tries to weaken the main trump card of ...", 2 "believes that jobs would ...", 3 "dreads a 'social catastrophe'", 4 "points out that more brown ...", 5 "refuses to demolish dwellings", 6 "it was stated bluntly that for ..."]

Recoding
• Open coding as the first step
• Establishment of relations between codes (axial coding)
• Adjustment of original codes' labels
• Reduction of dimensions
• Any variable can be recoded
  • From original variable value to target value

[Screenshot: DNA recode table for the variable "concept" with the columns "original value" and "edited value"; the original value "sociální katastrofa" (social catastrophe) is edited to "zaměstnanost" (employment), the other values are left unchanged]
[Screenshot: confirmation dialog: "Are you sure you want to recode all values that have been changed?"]
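Recoding "from original variable value to target value" amounts to applying a lookup table. The following base R sketch shows the idea outside DNA; the concept labels are illustrative English stand-ins for coded values, and the object names (`concept`, `recode_map`, `recoded`) are assumptions made for this example.

```r
# Sketch of recoding with a named lookup vector in base R.
# The labels are illustrative stand-ins, not an export from DNA.
concept <- c("employment", "employment", "social catastrophe",
             "electricity export", "protection of dwellings", "district heating")

# Only values that change need an entry:
# names are the original values, elements are the target values
recode_map <- c("social catastrophe" = "employment")

# Replace values found in the map, keep all others as they are
recoded <- ifelse(concept %in% names(recode_map), recode_map[concept], concept)
recoded <- unname(recoded)

# Recoding reduces the number of distinct categories
length(unique(concept))
length(unique(recoded))
table(recoded)
```

Keeping the original vector and writing the result to a new object makes the recoding step reversible and easy to document.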
Data export
• Versatile data export options
  • One-mode network - values of a single variable
  • Two-mode network - matrix of values
• Data export formats
  • CSV
  • DL (network analysis software UCINET)
  • GraphML (network visualization software visone)

[Screenshot: DNA Export menu with the item "Export network..."]
[Screenshot: export dialog with the options: type of network (two-mode network, one-mode network, event list), variable 1 and variable 2 (e.g. organization, concept), qualifier (agreement) and qualifier aggregation, normalization, isolates, duplicates, include from / include until dates, moving time window, exclusion of values by variable, and file format]
[Screenshot: save dialog for the exported network file (.csv)]
[Screenshot: message "Data were exported to "C:\Users\Lukas\Documents\file.csv""]
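To illustrate the difference between the two-mode and one-mode networks mentioned under data export, here is a small base R sketch. DNA produces these exports itself; the statements, actor names and output file name below are invented for illustration only.

```r
# Sketch: build a two-mode (actor x concept) matrix from coded statements,
# project it to a one-mode actor network, and write it to CSV.
# All data below are invented for illustration.
statements <- data.frame(
  actor   = c("Actor A", "Actor B", "Actor B", "Actor C", "Actor C"),
  concept = c("employment", "employment", "district heating",
              "protection of dwellings", "employment"),
  stringsAsFactors = FALSE
)

# Two-mode network: rows are actors, columns are concepts, cells count statements
two_mode <- unclass(table(statements$actor, statements$concept))

# One-mode network: two actors are tied by the number of concepts both refer to
incidence <- (two_mode > 0) * 1
one_mode <- incidence %*% t(incidence)
diag(one_mode) <- 0  # drop self-ties

# Write the one-mode matrix to a CSV file in a temporary directory
out_file <- file.path(tempdir(), "one_mode_network.csv")
write.csv(one_mode, out_file)
```

The same projection applied to `t(incidence)` would give a concept-by-concept network instead of an actor-by-actor one.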
What next?

Quantitative TA

Zipf's law
[Figure: Zipf's law plotted against log(rank) for 30 languages, among them English, Czech, Slovak, German, Polish, Hungarian and Esperanto (Jimenez, 2015)]

Corpus
• Decision over document unitizing
• Decision over sampling
  • Do 5M texts provide more information than 15k?
  • Random vs. non-random sampling
• Inclusion of metadata - additional information
  • Author
  • Time and date
  • Source (e.g. media/newspaper)

Words and content
• Some words are used to convey meaning, others are used functionally to allow meaningful language
  • Nouns, verbs, adjectives, pronouns
• Stopwords - make sense only when connected with other terms
• Depends on the task at hand
  • In some cases, it is justified to drop them
  • In others, these words are important

Text pre-processing
• Considerations over pre-processing
  • Dropping sparse terms
  • Dropping most frequent terms
  • Dropping "stopwords"
  • Dropping numerals, punctuation, ...
  • Dropping time and place information
  • ...
• Method dependent
• Sometimes affects results (topic modeling)

Text pre-processing
• Stemming/lemmatization
  • Disposal of grammatical features of text
  • Dictionary-based
  • Rules-based
  • Both introduce some error into the corpus
• Lemmatization
  • Identification of lemmas (lexemes) of the words - transformation to lemmas
• Stemming
  • Stripping the word of prefixes or suffixes, leaving only word stems

Lemmatization and stemming
"This was the most tranquil presidential address. President's approach was very relaxed."
• Lemmatization
  "This be the most tranquil presidential address. President approach be very relax."
• Stemming
  "This be the most tranquil presidenti address. Presid approach be veri relax."

Bag of words
• The quick brown fox jumps over the lazy dog

  Word    Occurrence
  brown   1
  dog     1
  fox     1
  jumps   1
  lazy    1
  over    1
  quick   1
  the     2

Document-feature matrix
• Matrix - most methods are based on this
  • 1st dim - Features/Tokens (words, phrases, ...)
  • 2nd dim - Documents/units
  • Cells - frequency of tokens in documents
    • Boolean - present vs. not present (1/0)
    • Weighted
      • Absolute frequency (how many times a word occurs in a document)
      • TF-IDF
• Grows large easily
  • 500 documents * 1k unique tokens = 0.5M cells
• Usually very sparse
  • Most cells are empty - contain 0

Document-feature matrix

  Feature      2003-2004-cz  2004-2005-pl  2005-2006-hu  2006-2007-sk  2007-2008-cz  Sum
  agriculture             3             6             2             5             3   19
  aim                     4             2             7            12             6   31
  area                   11             8             8            28            26   81
  base                    1             2             2             2             5   12
  border                  5             9             9             3             3   29
  central                 2             3             6             3             5   19
  cohesion                3             1             7             4             4   19
  commission              2             7             3             2             4   18
  common                 10             9            17             8            17   61
  community               2             2             3             3             6   16
  concern                 9            13            12            18             6   58

Co-occurrence
• The quick brown fox jumps over the lazy dog.
• Brown dog sleeps well.

  Feature  Document 1  Document 2
  brown    1           1
  dog      1           1
  fox      1           0
  jumps    1           0
  lazy     1           0
  over     1           0
  quick    1           0
  sleeps   0           1
  the      2           0
  well     0           1

Co-locations and N-grams
• Co-locations
  • Established phrases, usually occur together
  • Provide little information over text
  • Ministry of the Environment
  • European Union
  • prime minister
• N-grams
  • Phrases which are not established, but occur together in text
  • Provide insights
  • Crooked Hillary

"Be careful what is a result and what is just a residue of your data choices" (Jana Diesner, 2018)

R basics

R is a language
• Any programming language is just very condensed and formalized speech
• Understanding and formulating the process is key
• Scripting is just a matter of knowing the right expressions

Many resources out there
• R package / library manuals
• R site: http://cran.r-project.org
• Community forums:
  • http://stackoverflow.com
  • http://www.statmethods.net
  • http://www.r-bloggers.com
• YouTube videos: https://www.youtube.com/watch?v=qHfSTRNg6iE
• Googling (often fastest)

Introduction to R
• You should have two programs installed on your computer
  • R
  • RStudio
• Both have to be installed to run RStudio
• We are going to use RStudio
  • More convenient to work with

RStudio layout
[Figure: RStudio layout with four panes: scripting window; console window; environment (stored objects) and history; plots, packages, help and viewer]
[Screenshot: RStudio after start-up with an untitled script, an empty environment and the R welcome message in the console]
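Once R and RStudio are running, the pre-processing, bag-of-words and n-gram steps from the quantitative part can be tried directly in the console. The sketch below uses base R only; the two-word stopword list is an assumption made for this example and not a standard list, and real projects would normally rely on a text analysis package.

```r
# Sketch of basic pre-processing in base R: lower-casing, tokenizing,
# dropping punctuation and a tiny hand-picked stopword list, then forming bigrams.
text <- "The quick brown fox jumps over the lazy dog."

# Tokenize: lower-case, split on anything that is not a letter, drop empty strings
tokens <- strsplit(tolower(text), "[^[:alpha:]]+")[[1]]
tokens <- tokens[tokens != ""]

# Stopword removal (illustrative list, assumed for this example)
stopwords_example <- c("the", "over")
content_tokens <- tokens[!tokens %in% stopwords_example]

# Bigrams (n-grams with n = 2): each token pasted to its successor
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

# Bag of words: token counts, word order discarded
bag <- table(tokens)

print(content_tokens)
print(bigrams)
print(bag)
```

Whether stopwords should be dropped depends on the task at hand, so it is worth keeping both `tokens` and `content_tokens` and comparing results.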
[Screenshot: RStudio window annotated with the pane names: scripting window, console, environment and history, and files, plots, packages, help and viewer]

Object
• An object is a container which holds data and can be manipulated with functions
• The most basic object is called a vector
• There are other types of objects - matrix, data frame, list

  one <- 1

[Screenshot: RStudio with the value one = 1 listed in the Environment pane and the installed packages listed in the Packages pane]
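The object types named on this slide can be created as follows. This is a small base R sketch; the object names `v`, `m`, `df` and `l` and their contents are arbitrary examples.

```r
# The object types named above, created with base R
one <- 1                              # a single number is a vector of length 1
v   <- c(1, 2, 3)                     # vector: several values of one type
m   <- matrix(1:6, nrow = 2)          # matrix: two dimensions, one type
df  <- data.frame(word = c("fox", "dog"),
                  count = c(1, 1))    # data frame: columns may differ in type
l   <- list(v, m, df)                 # list: can hold objects of any kind

# Functions inspect and manipulate objects
length(one)
length(v)
dim(m)
names(df)
class(l[[3]])
sum(v)
```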
[Screenshot: RStudio console after running one <- 1 and one + one, which prints [1] 2]

Object
• Anything may become an object
• New values must be stored as an object
  • Conscious choice to keep a result
• An object remains the same unless overwritten
  • It must be removed by the user as well

  two <- one + one
  two
  [1] 2

[Screenshot: RStudio console after running two <- one + one, with the Packages pane listing the installed packages]
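A short base R sketch of the points on this slide: results are kept only when stored, objects do not change unless overwritten, and they stay until removed.

```r
one <- 1
two <- one + one

one + 10         # the result is printed but not stored; `one` is still 1
one

one <- one + 10  # overwriting: `one` is now 11
two              # `two` is still 2; it is not recomputed automatically

rm(two)          # objects stay in the environment until the user removes them
exists("two")    # FALSE once `two` has been removed
```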
RStudio screenshot: the four commands (one <- 1, one + one, two <- one + one, two) typed into the script editor, with the same output in the console.
RStudio screenshot: the Packages pane lists the installed packages. Loading beepr from the pane runs the following command in the console:
> library("beepr", lib.loc="~/R/win-library/3.4")
RStudio screenshot: beepr loaded by typing the command into the console and into the script editor:
library(beepr)
RStudio screenshot: the Install Packages dialog, with the fields "Install from" (Repository (CRAN, CRANextra)), "Packages (separate multiple with space or comma)", "Install to Library" (C:/Users/Lukas/Documents/R/win-library/3.4 [Default]) and the option "Install dependencies".
RStudio screenshot: typing "netwo" into the Packages field of the Install Packages dialog offers matching package names, among them network, networkD3 and networkDynamic.
RStudio screenshot: console output of the installation:
> install.packages("network")
Installing package into 'C:/Users/Lukas/Documents/R/win-library/3.4'
(as 'lib' is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/network_1.13.0.zip'
Content type 'application/zip' length 661553 bytes (646 KB)
downloaded 646 KB
package 'network' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Lukas\AppData\Local\Temp\RtmpekQD3G\downloaded_packages
RStudio screenshot: after installation, network (Classes for Relational Data, version 1.13.0) appears in the Packages pane. Loading it prints the package's startup message:
> library("network", lib.loc="~/R/win-library/3.4")
network: Classes for Relational Data
Version 1.13.0 created on 2015-08-31.
copyright (c) 2005, Carter T. Butts, University of California-Irvine
                    Mark S. Handcock, University of California -- Los Angeles
                    David R. Hunter, Penn State University
                    Martina Morris, University of Washington
                    Skye Bender-deMoll, University of Washington
For citation information, type citation("network").
Type help("network-package") to get started.
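The workflow in the screenshots (install a package once, load it in every session) can also be scripted. The sketch below is one possible pattern, not something taken from the slides: it checks which packages from a list are already installed before anything is downloaded. The list itself is an assumption chosen for illustration; stats ships with R, so it is always found. The install.packages() call is left commented out because it needs an internet connection.

```r
# Packages we would like to use (an illustrative list).
wanted <- c("stats", "quanteda", "readtext", "stopwords")

# requireNamespace() returns TRUE if a package is installed and can be
# loaded, and FALSE otherwise, without attaching it to the search path.
is_installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
is_installed

# Only the packages that are not there yet need to be installed.
missing_pkgs <- wanted[!is_installed]
missing_pkgs

# Uncomment to install the missing ones (requires internet access):
# install.packages(missing_pkgs)
```

Installing is needed only once per computer, whereas library() has to be called again in every new R session.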
A loaded package can be unloaded again:
> detach("package:network", unload=TRUE)

Working directory
• The folder where everything takes place - it is enough to set it once
• Makes data import and export easier
• Function setwd()
• Does not accept a single backslash in a Windows path
• Replace each backslash \ with a forward slash / or a double backslash \\
setwd("C:\\Users\\Lukas\\Documents\\R intro")
setwd("C:/Users/Lukas/Documents/R intro")
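The two spellings of a Windows path, and the effect of setwd(), can be tried without touching any real folder. This is a minimal sketch under stated assumptions: it uses a throwaway folder inside tempdir() instead of the slide's C:/Users/... path, and the file name note.txt is invented for the example.

```r
# Inside an R string, "\\" stands for one backslash.
win_path <- "C:\\Users\\Lukas\\Documents\\R intro"
cat(win_path, "\n")   # prints the path with single backslashes

# The same path written with forward slashes.
slash_path <- gsub("\\", "/", win_path, fixed = TRUE)
slash_path

# Demonstrate setwd() on a temporary folder that exists on any system.
old_dir <- getwd()                              # remember where we started
work_dir <- file.path(tempdir(), "R intro")
dir.create(work_dir, showWarnings = FALSE)
setwd(work_dir)
inside <- basename(getwd())                     # name of the current folder

# Relative file names are now resolved against the working directory.
writeLines("saved with a relative path", "note.txt")

setwd(old_dir)                                  # go back to the original folder
file.exists(file.path(work_dir, "note.txt"))
```

Building paths with file.path() avoids the backslash problem altogether, because it joins the parts with forward slashes.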
RStudio screenshot: the Session menu, where Set Working Directory offers To Source File Location, To Files Pane Location and Choose Directory... (Ctrl+Shift+H).

Data output
• Save the entire workspace
• Saves all R objects you have created so far
• Allows you to return to your work later or to back up current work
RStudio screenshot: the data viewer showing the data frame pima_tr with the columns npreg, glu, bp, skin, bmi, ped, age and type.
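Saving and restoring objects can be sketched with the base R functions save() and load(). The example is an illustration under assumptions rather than the slide's own code: it writes to a temporary file with an invented name, and it saves two named objects so that it behaves the same wherever it is run. Calling save.image() with a file name would store every object in the global workspace in the same format.

```r
one <- 1
two <- one + one

# Save the chosen objects to an .RData file (temporary, hypothetical name).
workspace_file <- file.path(tempdir(), "my_workspace.RData")
save(one, two, file = workspace_file)

# Simulate starting over: the objects are gone ...
rm(one, two)

# ... and load() brings them back under their original names.
load(workspace_file)
two
```

The saved file can be copied elsewhere as a backup and loaded in a later session.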
Quantitative TA in R

Text analysis in R
• Package "quanteda"
  • Developed by Ken Benoit (LSE)
  • Comprehensive package of text analysis methods
• Package "readtext"
  • Ken Benoit & Adam Obeng
  • Package which allows data import from text sources
  • Easy to work with
• Package "stopwords"
  • Ken Benoit, David Muhr & Kohei Watanabe
  • Package containing stop word lists for different languages

Before we start...
• Open the folder "text_analysis_quanti"
• Open the script file "text_analysis_l.R" in RStudio

First steps in R
• Install all libraries
  • quanteda
  • readtext
  • stopwords
• Set working directory
work.dir <- "C:\\path\\to\\folder\\"
setwd(work.dir)
library(readtext)
library(quanteda)
library(stopwords)

Reading texts into R
• The readtext() function loads all text files into R
• Very easy to use - reads everything in the folder
• Supports various document types
  • TXT
  • PDF
  • DOC
  • Twitter data format JSON
• Arguments
  • File source
  • Encoding

Reading texts into R
• Encoding
  • Text files are stored in a particular character encoding
  • Consider the text "Príklad zlého kódovania" (Slovak for "An example of bad encoding")
  • ASCII/ISO-8859-1: "PrÃklad zlÃ©ho kÃ³dovania"
  • UTF-8: "Príklad zlého kódovania"
  • As a rule of thumb, UTF-8 encoding is the desired choice

Reading texts into R
text.dir <- "C:\\path\\to\\folder\\texts\\"
texts <- readtext(file = text.dir, encoding = "UTF-8")
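The encoding problem from the slide can be reproduced in base R without any files. The sketch below is an illustration under assumptions, not the course script: it takes the UTF-8 bytes of the sample sentence and decodes them once wrongly (as ISO-8859-1, also called Latin-1) and once correctly (as UTF-8). The accented letters are written as \u escapes so that the code itself contains only ASCII characters.

```r
# The sample text from the slide: "Priklad zleho kodovania" with accents.
original <- "Pr\u00edklad zl\u00e9ho k\u00f3dovania"

# The bytes that a UTF-8 encoded text file would contain.
utf8_bytes <- charToRaw(enc2utf8(original))

# Wrong: read every byte as one ISO-8859-1 (Latin-1) character.
garbled <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")

# Right: declare the same bytes to be UTF-8.
correct <- rawToChar(utf8_bytes)
Encoding(correct) <- "UTF-8"

nchar(original)      # number of characters in the text
length(utf8_bytes)   # number of bytes: each accented letter takes two
nchar(garbled)       # the wrong decoding turns every byte into a character
correct == original  # the right decoding gives the text back
```

Each of the three accented letters occupies two bytes in UTF-8, which is why the wrongly decoded string is three characters longer than the original.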
text.dir <- "C:\\path\\to\\folder\\texts\\"
texts <- readtext(file = text.dir, encoding = "UTF-8")
• texts - name of a new object
• readtext - function
• file = text.dir - argument specifying the location of the texts (object input)
• encoding = "UTF-8" - argument specifying the character encoding (text input = quotes)

Corpus
• Simple function corpus()
• Creates a corpus from all texts imported in the previous step
• All sorts of statistics may be acquired once the corpus is generated - e.g. with the function summary()
corp <- corpus(x = texts)
ndoc(corp)
summary(corp)

Document-feature matrix
• Two-step process in the "quanteda" package
• Tokenization of the corpus
  • A step necessary to apply some pre-processing choices which are not text-based (removal of noise)
  • Remove numbers
  • Remove punctuation
  • Remove white space (separators)
• Creation of the DFM
  • Further pre-processing choices
  • Stemming
  • Lowercasing
  • Stop words removal

Document-feature matrix
• Function dfm()
• Documents in rows
• Features (tokens) in columns
• Output is in a format understood by quanteda

Document-feature matrix
tokenization <- tokens(x = corp,
                       remove_numbers = TRUE,
                       remove_punct = TRUE,
                       remove_separators = TRUE,
                       remove_hyphens = FALSE)
doc.term.matrix <- dfm(x = tokenization,
                       tolower = TRUE)

Wordcloud
• Function textplot_wordcloud()
Argument    Description
x           Terms
max_words   Maximum number of words rendered
min_size    Size of smallest category
max_size    Size of largest category
rotation    Percentage of terms placed vertically
color       Color or color palette
• Many other arguments available

Wordclouds
textplot_wordcloud(x = doc.term.matrix,
                   max_words = 50,
                   min_size = 1,
                   max_size = 4,
                   rotation = 0,
                   color = "steelblue")

Wordcloud
Wordcloud figure: the most frequent words of the Czech-language corpus (e.g. stát, který, roce, není, dnes, limitů, korun, vláda).
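To make the idea of a document-feature matrix concrete, the two steps (tokenization, then counting) can be imitated in base R. This is a simplified sketch for illustration only and not how quanteda is implemented: the two example documents, the two-word stop word list and the helper tokenize() are all invented here, and lowercasing and stop word removal are folded into the tokenization step for brevity.

```r
# Two tiny example documents (invented for illustration).
docs <- c(doc1 = "Energy policy and energy security, 2016.",
          doc2 = "The energy market; the gas market.")

# A very small stop word list (an assumption, not a real lexicon).
stop_words <- c("and", "the")

# Tokenization with the pre-processing choices named on the slides.
tokenize <- function(text) {
  text <- tolower(text)                           # lowercasing
  text <- gsub("[[:digit:]]+", " ", text)         # remove numbers
  text <- gsub("[[:punct:]]+", " ", text)         # remove punctuation
  tokens <- strsplit(text, "[[:space:]]+")[[1]]   # split on white space
  tokens[tokens != "" & !(tokens %in% stop_words)]  # drop empties and stop words
}

tokens_list <- lapply(docs, tokenize)
features <- sort(unique(unlist(tokens_list)))

# Documents in rows, features in columns, counts in the cells.
dfm_sketch <- t(sapply(tokens_list, function(tok) {
  table(factor(tok, levels = features))
}))

dfm_sketch

# Total frequency of each feature: the quantity a wordcloud visualizes.
sort(colSums(dfm_sketch), decreasing = TRUE)
```

The column totals in the last line are what determines word size in a wordcloud: the more often a feature occurs across documents, the larger it is drawn.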