Text as Data
Juraj Medzihorsky
2016-11-28

Motivation

all models are wrong, but some are useful
George E. P. Box

Document Scraping
• Numbers and text in files
• Local
• Web

Web
• eXtensible Markup Language (XML)
• APIs (e.g. for Twitter)

Text Analysis

Text Analysis: The Big Picture

Text Analysis
• Discourse
• Content

Content Analysis
• ‘Manual’ Text Analysis
• Computer-Assisted Text Analysis (CATA)

‘Manual’ Text Analysis
• Humans do most of the work
• Expensive
• Slow
• Reliability issues

Computer-Assisted Text Analysis
• Computers do most of the work
• Boom
• Huge amount of digitized text available
• Cheap computing power
• New methods – CS & PS

Political Text in CATA
Examples:
• Manifestos & platforms
• Press releases
• Social media content
• Floor and debate speeches

‘Bag of Words’
• Common assumption in CATA
• Order of words (n-grams of words) does not matter
• Texts as vectors of word counts

Table 2: A word-frequency matrix for two randomly selected sentences from the corpus: a) ‘Let’s enforce the laws already on the book.’ (Cain in 2012); b) ‘Now, the administration has $800 million on hand right now, cash on hand.’ (Hunter in 2008)

Stem        Sentence a   Sentence b
administr   0            1
book        1            0
cash        0            1
enforc      1            0
hand        0            2
law         1            0
million     0            1

CATA in Political Science
• Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, mps028.
Fig. 1 An overview of text as data methods.

Dictionary-Based Methods

Google N-Grams

Dictionary-Based Methods
• Build a dictionary
• Statistical models are not necessary

Figure 1: Frequency of appearance of “public intellectual” in Google Books from 1980 to 2000. Source: Graph.
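
A minimal sketch of the ‘bag of words’ representation and of dictionary-based counting, using the two sentences from Table 2. The stopword list and the two-category dictionary are hypothetical and chosen only for illustration; stemming is skipped, so the terms are full words rather than the stems shown in Table 2.

```python
import re
from collections import Counter

# The two example sentences from Table 2.
sentences = {
    "a": "Let's enforce the laws already on the book.",
    "b": "Now, the administration has $800 million on hand right now, cash on hand.",
}

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"let", "s", "the", "on", "has", "now", "already", "right"}

# Hypothetical dictionary: category -> words that signal it.
DICTIONARY = {
    "economy": {"cash", "million"},
    "law": {"enforce", "laws"},
}


def bag_of_words(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def dictionary_scores(bag):
    """Count how often each dictionary category's words occur in a bag."""
    return {cat: sum(bag[w] for w in words) for cat, words in DICTIONARY.items()}


# One bag (vector of word counts) per sentence; word order is discarded.
bags = {name: bag_of_words(text) for name, text in sentences.items()}

# Word-frequency matrix: one row per term, one column per sentence.
vocab = sorted(set().union(*bags.values()))
matrix = {term: [bags[name][term] for name in sorted(bags)] for term in vocab}

# Dictionary-based scores need no statistical model, only counting.
scores = {name: dictionary_scores(bag) for name, bag in bags.items()}

for term in vocab:
    print(f"{term:<15}{matrix[term][0]:>3}{matrix[term][1]:>3}")
print(scores)
```

Because each sentence is reduced to a vector of counts, the dictionary step is only a sum over matching words, which is why no statistical model is required.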
[Figure: Red, Green, and Blue in Google Books English Fiction Corpus, Google Books 1800 to 2000. Y-axis: smoothed %; series: red, green, blue. By Juraj Medzihorsky, twitter.com/medzihorsky]

Scaling

Scaling Goals
• One or more dimensions
• Place documents (texts, speeches) in a space
• Place words in the same space

Common Scaling Methods
Supervised: Wordscores
Unsupervised: Wordfish
Unsupervised: Correspondence Analysis

A Scaling Example
• 2008 and 2012 Republican presidential primaries
• Debate transcripts from a UCSB website
• Expected move towards Tea Party positions
• Unsupervised scaling: Wordfish

Another Scaling Example
• Transcripts of US presidential debates
• Unsupervised scaling with correspondence analysis
• Two dimensions

[Figure: Sixth GOP'16 Presidential Debate, two dimensions, both axes from −1 to 1. Panel ‘Candidates’: Bush, Carson, Christie, Cruz, Kasich, Rubio, Trump; legend: Governor, Senator, Outsider. Panel ‘Words’: word stems placed in the same space. By @medzihorsky]

Yet Another Scaling Example
• Slovakian party manifestos: text and CMP codes
• Unsupervised scaling with correspondence analysis
• Two dimensions

And One More Scaling Example
• Nielsen’s (2013) dissertation
• 25,000+ documents by ∼100 clerics
• Jihad Score
• Supervised scaling with a training set

[Figure: Histogram of cleric Jihad scores. X-axis: Jihad Score, from −0.15 to 0.05. Labeled clerics: Abdallah Azzam, Sayyid Qutb, Usama bin Laden, Ibn Uthaymeen, Ibn Baz. Example excerpts follow.]

Ruling on Fighting Now in Palestine and Afghanistan. The foregoing has clarified that if an inch of Muslim lands are attacked, then Jihad is obligatory for the people of that area, and those near by. If they do not succeed or are incapable or lazy, the individual obligation widens to those behind them and then gradually the individual obligation expands until it is general for the whole land, from East to West. (Abdallah Azzam)

If a person arrives while the Imam is preaching at Friday prayers, he should pray two brief prostrations and sit without greeting anyone as greeting people in this circumstance is forbidden because the Prophet, peace be upon him, says, "If your friend speaks to you during the Friday prayers, silence him while the Imam preaches because it is idle talk." (Ibn Uthaymeen)

There is a fundamental fact about the nature of this religion and the way it works in people's lives. A fundamental, simple fact, but although it is simple, it is often forgotten or not realized at all. Forgetting this fact, or failing to recognize it arises from a serious omission from views of this religion: its truthfulness and historical, present, and future reality. (Sayyid Qutb)

Failed Unsupervised Scaling

‘Topic’ Models

Back to the Big Picture
Fig. 1 An overview of text as data methods.
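
As a recap of the unsupervised scaling used in several of the examples above, here is a minimal sketch of correspondence analysis on a document-by-word count matrix. The counts are made up for illustration, and the function name `correspondence_analysis` is hypothetical. The sketch uses the standard formulation through a singular value decomposition of standardized residuals, and it assumes every document and every word has a nonzero total count.

```python
import numpy as np


def correspondence_analysis(counts, n_dims=2):
    """Place rows (documents) and columns (words) in a shared low-dimensional space.

    Assumes every row sum and every column sum is positive.
    Returns row coordinates, column coordinates, and the inertia per dimension.
    """
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                      # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)            # expected frequencies under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates: scale singular vectors by singular values and masses.
    row_coords = (U[:, :n_dims] * sv[:n_dims]) / np.sqrt(r)[:, None]
    col_coords = (Vt.T[:, :n_dims] * sv[:n_dims]) / np.sqrt(c)[:, None]
    return row_coords, col_coords, sv[:n_dims] ** 2


# Made-up counts: 3 documents (rows) by 4 words (columns).
counts = np.array([
    [10, 2, 1, 5],
    [3, 8, 2, 4],
    [1, 3, 9, 6],
])

doc_coords, word_coords, inertia = correspondence_analysis(counts, n_dims=2)
print("documents:\n", doc_coords)
print("words:\n", word_coords)
print("inertia per dimension:", inertia)
```

Documents and words end up in the same two-dimensional space, which matches the scaling goals listed earlier: documents with similar word profiles land close together, and words sit near the documents that use them disproportionately.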