Yoshikoder: An Open Source Multilingual Content Analysis Tool for Social Scientists

Will Lowe
will.lowe@nottingham.ac.uk
Methods and Data Institute, University of Nottingham

This short paper is about the Yoshikoder¹, an open-source desktop tool for performing classical computer-aided content analysis in multiple languages. The paper starts with some background on content analysis, continues with a short technical characterization of the Yoshikoder as a content analysis tool, and concludes with some necessarily brief examples of the kind of analysis the Yoshikoder makes possible.

Classical Content Analysis

By classical content analysis I mean the tradition of examining word frequencies, creating concordances, and building content dictionaries in order to operationalize substantively interesting aspects of document meaning (see West, 2001; Neuendorf, 2002, for reviews). There are, of course, other traditions of content analysis, e.g. discourse analysis, cognitive mapping, and collocational clustering, with specialized software often available to apply each method (see Herrera and Braumoeller, 2004, for some comparisons). Content analysis also borrows technology from computational linguistics (Manning and Schütze, 2000; Jurafsky and Martin, 2000). However, the Yoshikoder is designed primarily for classical content analysis, so I will not discuss alternative methods.

Classical content analysis was originally performed manually (see Krippendorff, 1980, for a history), but its emphasis on classifying and counting individual words made it quite straightforward to automate (see e.g. Stone, 1997, for an early example). As of 2006 there exists a wide variety of computer packages to help researchers perform classical content analyses (see Lowe, 2002, for a functional typology and review).

Yoshikoder as a Content Analysis Tool

Among existing packages, the Yoshikoder is the only one I am aware of that runs on any operating system, is distributed for free as open-source software, and deals with documents in any natural language. Let me provide some motivation for these features.

Yoshikoder is written in the Java language as a desktop application that runs on all major operating systems. Moreover, the content analysis machinery does not require any particular interface, and may be used in a server environment².

It is a minimal requirement of the scientific replication standard (King, 1995) that the algorithms in a content analysis package used for academic research be known. And it is helpful if the financial cost of replication is kept low. The Yoshikoder fulfils both by making all its source code publicly available for download³.

Thanks to Michael Laver and John Garry for providing their U.K. political manifesto dictionary. Work on the Yoshikoder has been funded by Harvard University's Weatherhead Center for International Affairs and the Princeton Center for Advanced Study.
¹ See http://www.yoshikoder.org for details.
² Yes, a web service version of the Yoshikoder is under development.

Computerized versions of classical content analysis have traditionally been performed on West European language sources, a fact no doubt partly determined by path dependencies in the history of computing. However, computer languages now have excellent implementations of Unicode, the universal character set and character encoding standard (equivalently ISO 10646)⁴. The practical consequence of this technological advance is that the Yoshikoder's document import mechanism allows users to work with documents in almost any encoding, whilst operating internally in Unicode. For example, a project might contain Russian language documents encoded variously in ISO-8859-5, KOI8-R, and Macintosh Cyrillic.
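
The import step described above can be illustrated with a minimal Python sketch. This is not the Yoshikoder's Java code; the sample sentence, the `import_document` helper, and the in-memory "files" are assumptions made purely for illustration. Only the three encodings come from the text, written here using Python's codec names for them.

```python
# Hypothetical sample text; any Russian sentence would do.
text = "Свобода и порядок"

# Python codec names for ISO-8859-5, KOI8-R and Macintosh Cyrillic.
encodings = ["iso8859_5", "koi8_r", "mac_cyrillic"]

# Simulate three files on disk: the same sentence as raw bytes,
# each written in a different legacy encoding.
documents = {enc: text.encode(enc) for enc in encodings}


def import_document(raw: bytes, encoding: str) -> str:
    """Decode raw bytes into a Unicode string using the declared encoding."""
    return raw.decode(encoding)


decoded = {enc: import_document(raw, enc) for enc, raw in documents.items()}

# The byte sequences differ from file to file, but once decoded
# every document is the same Unicode string.
all_bytes_differ = len(set(documents.values())) == len(encodings)
all_text_equal = all(s == text for s in decoded.values())
print(all_bytes_differ, all_text_equal)
```

The point of the sketch is that the encoding has to be declared per document: the bytes alone do not say which of the three tables was used to write them.
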
To ensure that the document is segmented into words appropriately, users may also specify a document locale⁵, e.g. Russian as spoken in Russia, in order to distinguish it from other languages also written in Cyrillic script, such as Serbian. For some languages, notably Chinese, Japanese, and Thai, segmenting a text into words automatically is a difficult and computationally demanding task. For these cases, the Yoshikoder allows third parties to write and distribute 'tokenizer plugins' that perform the relevant segmentation. A tokenizer plugin for Chinese, as spoken in the People's Republic of China, is currently available.

Using the Yoshikoder

The first thing to do with the Yoshikoder is to make a project. A project consists of a single content analysis dictionary and a set of documents from your filesystem. When the program starts it loads the last project you were working on, or else a default dictionary with no entries and an empty document set.

Adding Documents

Now to add some documents. For the examples below I'll use the U.K. political party manifestos from the 1992 and 1997 elections⁶. When you add or import a document, Yoshikoder does not copy the document but simply keeps a reference to where it is, how it's encoded, and what locale it is written for. Only when you click on its name in the interface is the text loaded from the file. This means you can have a project with more documents in it than would fit in computer memory.

Computing Word Frequencies

Before moving to the dictionary there are several things we can ask, starting with a sortable word frequency breakdown for each document. This is perhaps the simplest form of report available. It lists the number of times each word type occurs and the corresponding proportion of the text its tokens take up. Reports are shown as tables that can be saved in various formats or just highlighted, copied, and pasted into other applications.
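
A word frequency report of this kind can be sketched in a few lines of Python. The sketch below is illustrative only: the `tokenize` rule (lowercasing and splitting on runs of word characters) is an assumption standing in for the Yoshikoder's locale-aware segmentation, and the function names and sample sentence are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Crude tokenizer (an assumption): lowercase, then take runs of word characters."""
    return re.findall(r"\w+", text.lower())


def frequency_report(text: str) -> list[tuple[str, int, float]]:
    """Return (word type, token count, proportion of all tokens) rows.

    Rows are sorted by descending count, with ties broken alphabetically.
    """
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    rows = [(word, n, n / total) for word, n in counts.items()]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Hypothetical sample sentence, not taken from any manifesto.
sample = "The police support the community and the community supports the police"
report = frequency_report(sample)
for word, count, proportion in report:
    print(f"{word:<10} {count:>3} {proportion:.3f}")
```

Note that this rule would not segment Chinese, Japanese, or Thai text into words; that is the job the tokenizer plugins are described as doing.
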
For comparative purposes it is often more useful to have a unified frequency report, which lists the same statistics for every word type appearing in any of the documents you select in the interface and collects them in one large table. For example, the unified report notes that in the 1992 manifestos 'police' occurs 34 times in the Conservative manifesto, 14 times in the Liberal Democrat manifesto, and only 4 times in the Labour manifesto.

³ http://www.yoshikoder.org also hosts content dictionaries in Yoshikoder format to foster replication and reuse.
⁴ Language support for Unicode is now often better than that of the underlying operating system, so the problem of working with foreign language materials is reduced to locating a suitable font to display them in.
⁵ A locale is the combination of a language and a country. Sometimes both are necessary to determine appropriate word segmentation.
⁶ Available to download from http://www.wordscores.com.

Content Analysis Dictionaries

Word frequency statistics are useful for getting a feel for your documents, but for further analysis a dictionary is helpful. A Yoshikoder dictionary is a tree of possibly nested categories containing patterns. A pattern is a possibly wildcarded string that matches one or more words in a text. The asterisk is used to indicate one or more unspecified letters, e.g. chin* matches both 'china' and 'chinese'. Each category and pattern has a name and an optional numerical score. The Yoshikoder can read dictionaries in its own format (a simple XML dialect), and also files created by the DOS-based content analysis program VBPro⁷.

The Yoshikoder allows users to add, edit, and move categories and patterns manually, but for these examples I will use Laver and Garry's dictionary of policy position terms⁸ (Laver and Garry, 2000). The Laver-Garry dictionary organizes 594 patterns into 9 top level and 18 nested policy categories.
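
The dictionary structure described above can be sketched as a nested mapping whose leaves are lists of patterns. The Python below is a minimal illustration, not the Yoshikoder's implementation or its XML format: the category names echo those used in this paper, but every pattern is invented, the 'anti-state' subcategory is hypothetical, and reading the asterisk as "one or more word characters" is an assumption based on the description in the text.

```python
import re
from collections import Counter

# Hypothetical miniature dictionary; the patterns are invented for illustration
# and are NOT taken from the Laver-Garry dictionary.
DICTIONARY = {
    "economy": {
        "pro-state": ["nationali*", "regulat*"],
        "anti-state": ["privati*", "deregulat*"],
    },
    "law and order": ["police", "crim*", "prison*"],
}


def compile_pattern(pattern: str) -> re.Pattern:
    """Turn a wildcarded pattern into a regex; '*' means one or more word characters."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(r"\w+".join(parts))


def flatten(tree, path=()):
    """Yield (category path, compiled pattern) for every pattern in the tree."""
    if isinstance(tree, dict):
        for name, subtree in tree.items():
            yield from flatten(subtree, path + (name,))
    else:
        for pattern in tree:
            yield ">".join(path), compile_pattern(pattern)


def dictionary_report(tokens, dictionary):
    """Count, per leaf category, the tokens matched by at least one of its patterns."""
    compiled = list(flatten(dictionary))
    counts = Counter()
    for token in tokens:
        matched = {cat for cat, regex in compiled if regex.fullmatch(token)}
        counts.update(matched)
    return counts


# Hypothetical, already lowercased and tokenized text.
tokens = "the police will cut crime while firms face regulation".split()
report = dictionary_report(tokens, DICTIONARY)
for category, count in sorted(report.items()):
    print(f"{category}: {count} of {len(tokens)} tokens")
```

In this sketch only leaf categories receive counts; a fuller version would presumably also sum each parent category over its children.
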

Comparing Documents using a Dictionary

Applying the dictionary to the 1992 Labour manifesto reveals that about 0.5% (55 words) of the manifesto consists of terms in the category law and order. The category with the highest proportion is economy, accounting for 8.3% (952) of the manifesto's words. If these seem like small proportions, bear in mind that about 50% of all tokens in English text are contentless grammatical function words.

To examine the policy differences between 'old' and 'new' Labour, run a report comparing the 1992 (old) and 1997 (new) Labour manifestos with respect to dictionary categories. This reveals that the proportion of words in the category economy>pro-state, that is, the subcategory of economy containing words indicating more governmental influence in the economy, has shrunk by more than a third (from 0.04 to 0.025), whereas representation of the category law and order has more than doubled (from 0.005 to 0.011), consistent with substantive theory about the policy preferences of the current U.K. Labour party.

Making Reliable Comparisons

Since these are large changes in small proportions it is useful to have a measure of reliability. The Yoshikoder offers a statistical comparison report that computes risk ratio estimates and confidence intervals for each dictionary category. For example, to test the reliability of the increase in representation of law and order, the Yoshikoder computes the ratio of the probability of seeing a law and order term given that we are reading the 1997 Labour manifesto to the same probability given that we are instead reading the 1992 manifesto. Call the ratio $r$. If $r > 1$ then the 1997 manifesto contains $(r - 1) \times 100\%$ more law and order words. Alternatively, if $r < 1$ the comparison is read in reverse: the 1992 manifesto contains $(r^{-1} - 1) \times 100\%$ more law and order words than the 1997 manifesto. If, moreover, the confidence interval for $r$ excludes one, then the percentage change in law and order words is statistically significant.
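
The risk ratio and its confidence interval can be computed along the following lines. This is a sketch under stated assumptions rather than the Yoshikoder's own routine: the paper does not say which interval the program uses, so the standard large-sample interval on the log scale is assumed here, and the function name and all counts in the example are hypothetical.

```python
import math
from statistics import NormalDist


def risk_ratio(count_new, total_new, count_old, total_old, level=0.95):
    """Risk ratio of a category between two documents, with a confidence interval.

    Assumes the usual log-scale normal approximation; every count must be positive.
    """
    p_new = count_new / total_new
    p_old = count_old / total_old
    r = p_new / p_old
    # Standard error of log(r) under the large-sample approximation.
    se_log = math.sqrt(
        1 / count_new - 1 / total_new + 1 / count_old - 1 / total_old
    )
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(math.log(r) - z * se_log)
    upper = math.exp(math.log(r) + z * se_log)
    return r, lower, upper


# Hypothetical counts: 120 category tokens out of 10,000 in the newer document,
# 50 out of 10,000 in the older one.
r, lower, upper = risk_ratio(120, 10_000, 50, 10_000)
print(f"r = {r:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
if lower > 1 or upper < 1:
    print("The interval excludes one, so the change is statistically significant.")
```

Working on the log scale keeps the interval positive and makes it symmetric around log r, which is why the limits are not equidistant from r itself.
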
In these manifestos the risk ratio for law and order is 2.32 with 95% confidence interval [1.72, 3.12], an increase of 132% between old and new Labour platforms. The estimate and interval for economy>pro-state is 0.62 [0.54, 0.71]: the 1997 proportion is about 62% of the 1992 one, so the 1992 manifesto contains roughly 60% more pro-state language.

Measuring Local Context

The Yoshikoder is designed to compare whole documents on the basis of the categories in a content analysis dictionary. However, it is sometimes useful to apply the dictionary to more local contexts, e.g. to gain an idea of how some person, place, or theme is talked about in subsections of the document. One way to get at this information is to organize a dictionary so that it has categories for, e.g., positive and negative language, and another category capturing references to the subject. Generate a suitably wide concordance for the subject category, and run the Yoshikoder's concordance report. The concordance report bundles all the left and right surrounding context from each line of the concordance, centered on references to the subject, to form a pseudo-document of local contexts. The dictionary is then applied to this pseudo-document, generating a characterization of the context local to the subject in the form of a regular dictionary report⁹.

⁷ Intermittently available from http://mmmiller.com/vbpro/vbpro.htm. VBPro format has a particularly simple form that lets users create large dictionaries offline from existing wordlists rather than adding patterns manually via the interface.
⁸ Available from http://www.yoshikoder.org.

The Bigger Picture

The Yoshikoder is designed to help non-technical social scientists perform classical content analyses on text in arbitrary languages. Using the Yoshikoder helps support the replication standard and annoys people who sell similar functionality in proprietary packages, but it is also part of a larger project to unify, standardize, and disseminate the theory and technology of content analysis.
To this end, the Yoshikoder homepage also hosts a free application for converting PDF files, Word documents, and web pages into plain text in bulk for subsequent content analysis, and should soon host the older but widely-used DOS-based content analysis program VBPro. The homepage also hosts content analysis dictionaries in various languages. If you have a dictionary you'd like to make more widely available, I'll translate it into Yoshikoder format and host it there too.

References

Herrera, Y. F. and Braumoeller, B. F., editors (2004). Symposium on discourse and content analysis. Qualitative Methods Newsletter. Available from http://www.people.fas.harvard.edu/%7Eherrera/papers.html.
Jurafsky, D. and Martin, J. H. (2000). Speech and Language Processing. Prentice-Hall, Upper Saddle River NJ.
King, G. (1995). Replication, replication. PS: Political Science and Politics, 28(3):443-499.
Krippendorff, K. (1980). Content Analysis: An Introduction to Its Methodology. Sage, Beverly Hills CA.
Laver, M. and Garry, J. (2000). Estimating policy positions from political texts. American Journal of Political Science, 44(3):619-634.
Lowe, W. (2002). A review of content analysis packages. Available from http://www.wcfia.harvard.edu/misc/initiative/identity/.
Manning, C. D. and Schütze, H. (2000). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge MA.
Neuendorf, K. A. (2002). The Content Analysis Guidebook. Sage, Thousand Oaks CA.
Stone, P. J. (1997). Thematic text analysis: New agendas for analyzing text content. In Roberts, C., editor, Text Analysis for the Social Sciences. Lawrence Erlbaum Associates.
West, M. D., editor (2001). Theory, Method, and Practice in Content Analysis, volume 16 of Progress in Communication Sciences. Ablex, Westport CT.

⁹ This process is a rather noisy way to detect predication, but a full linguistic analysis is unlikely to be more reliable, particularly when there is no one to write the parser for the language involved.