Scaling things we can count*

Will Lowe
Princeton University

DRAFT: April 2016

* Thanks to audiences at the universities of Mannheim, Konstanz, and Princeton for pointing out which parts of the paper were incomprehensible, to Thomas Dauebler for pointing out which part was false, and to Brandon Stewart for forcing his class to read it. Any remaining errors are probably why I never got a job in the UK.

Abstract

This paper reviews some of the many methods political scientists have used to extract positions from textual data and argues that they are identical to, approximations to, or special cases of a single count data unfolding model that produces low dimensional representations of patterns of relative emphasis. Ideological positioning via a logic of relative emphasis not only generates a useful scaling technique for anything we can count, but also vindicates (some of) the key intuitions of the saliency theory of party positioning. I also use the model to motivate biplots as an underused graphical tool for understanding scaling models.

“There should be one – and preferably only one – obvious way to do it”1

Political scientists often need to infer policy positions from text, and have over the last 30 years proposed, developed, or borrowed methods and techniques of steadily increasing sophistication. The available methods seem very heterogeneous. Many statistical frameworks have been applied, from factor analysis (Warwick 2002), unfolding (Elff 2013), and item-response theory (IRT; Bakker 2009; Albright 2008) to heuristic or numerical approaches such as Laver, Benoit, and Garry (2003), Gabel and Huber (2000), and Lowe et al. (2011). Researchers have distinguished between ‘a priori’ scaling methods that assume that the positions of at least some texts or words are known (Laver, Benoit, and Garry 2003; Pennings and Keman 2002) and ‘inductive’ methods in which positions and other parameters are simultaneously estimated (Slapin and Proksch 2008; Monroe and Maeda 2004).
These methods have been applied to a range of units, e.g. counts of word types (Slapin and Proksch 2008; Monroe and Maeda 2004; Laver, Benoit, and Garry 2003), sentences and quasi-sentences (Budge, Robertson, and Hearl 1987), dictionary-based content analysis categories (Laver and Garry 2000), metadata tags (König and Luig 2009), and – most popularly in comparative politics – subsets of the category scheme developed by Budge, Robertson, and Hearl's Comparative Manifesto Project (e.g. Pennings and Keman 2002; Lowe et al. 2011; Elff 2013; König, Marbach, and Osnabrugge 2013). These applications imply, though never quite make explicit, an important desideratum: a suitable scaling model should apply equally well to anything we can count. This issue is taken up in more detail in the appendix. In the face of this variation researchers might reasonably ask: Is there a common set of substantive and statistical assumptions underlying these text scaling methods? What am I committing to substantively by choosing one method over another? Empirical reviews of the consequences of these different methods exist for particular cases: e.g., Dinas and Gemenis (2009) compare Greek party positions, Klemmensen, Hobolt, and Hansen (2007) Danish ones, and Klüver (2009) looks at interest groups in relation to the European Commission. But these reviews do not in general attempt to provide insight into the methods themselves. The first goal of this paper is to provide a unifying theory for existing text scaling methods. I will argue that there is essentially only one way to scale the kind of count data that textual data implies, and that the methods above are particular implementations, special cases, or computational approximations to it. This is possible because there is a common logic to scaling count data.
Specifically, I show that existing methods reflect the idea that document positions are based on low dimensional reconstructions of patterns of relative emphasis by deriving them from a model that explicitly does so. Low dimensional reconstruction is a common theme in latent variable modeling, of which factor analysis and IRT models are perhaps the most widely used instances in political science. However, scaling texts is essentially a problem of count data, for which these approaches are not appropriate. Methods like factor analysis assume conditionally Normal responses, and the IRT models borrowed from the analysis of roll call votes assume that the data is binary. In contrast, for count data it can only be the relative emphasis of one word (or category) over another that signals position. Consequently the empirical quantities that need low dimensional reconstruction in count data scaling are not correlations but associations, so the appropriate model class will be association models and their relatives. One interesting substantive corollary to the argument will be that all these methods embody – and to the extent they work, also vindicate – some but not all of the key claims of the saliency theory of positioning (Budge 2001). This connection makes it easier to explain what substantive commitments about political text come with text scaling. A practical consequence of the unified theory is to offer new ways to interpret and visualise existing text scaling models and also to define useful new ones. In particular, this paper argues for the use of biplots for visualizing the results of text scaling models. The paper proceeds as follows: I present a first text scaling model defined as a choice model over words.

1. From The Zen of Python. The quote continues “(although that way may not be obvious at first unless you’re Dutch.)”
I then consider an important application, scaling manifesto positions from CMP data (Budge, Robertson, and Hearl 1987), and note that two recent methods, logit scoring (Lowe et al. 2011) and the CMP's own preferred measures, are special cases. I then show how a computationally convenient way to estimate this model recovers Wordfish (Slapin and Proksch 2008) and the Rhetorical Ideal-point model (Monroe and Maeda 2004). I then re-derive the model directly as an association model (Goodman 1979) and introduce canonical correlation analysis, perhaps better known as simple correspondence analysis, as an efficient least squares approximation. I note in passing that the ‘vanilla method’ (Gabel and Huber 2000) is in practice another approximation and derive Wordscores (Laver, Benoit, and Garry 2003) as a special case. Finally I show the advantages of the method for inference about dimensionality and position visualisation.

Text Scaling

Text scaling models almost all make a ‘bag of words’ assumption about each document in a collection. This means that a set of documents is represented by an N×V matrix of counts C where entry C_{ij} is the number of times the j-th textual unit (word, category, or sentence) occurs in the i-th text (or document or speech). Denote the row marginal totals as C_{i+}, the column marginal totals as C_{+j}, and the grand total as C_{++}. For concreteness I will where possible refer to N documents and a vocabulary of V words, although sometimes these ‘words’ will be categories, topics or some other countable unit. In this terminology C is a cross-tabulation of documents and word types, C_{i+} is the length of document i in words, and C_{+j} is the frequency of word j in the document collection. The sampling scheme for C is product multinomial because words are sampled in a stratification determined by document.2

The scaling model

Each row of the cross-tabulation [C_{i1} … C_{iV}] represents the content of document i as a vector of word counts.
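As a concrete illustration, the cross-tabulation C and its marginals can be built in a few lines. The three 'documents' and toy vocabulary below are hypothetical, purely to fix notation:

```python
import numpy as np

# A minimal sketch of the 'bag of words' representation: three short
# hypothetical documents reduced to an N x V matrix of counts C.
docs = ["taxes taxes benefits", "benefits welfare welfare", "taxes welfare"]
vocab = sorted({w for d in docs for w in d.split()})

C = np.array([[d.split().count(w) for w in vocab] for d in docs])

row_totals = C.sum(axis=1)   # C_{i+}: document lengths
col_totals = C.sum(axis=0)   # C_{+j}: corpus word frequencies
grand_total = C.sum()        # C_{++}
```

Each row of `C` is then the count vector [C_{i1} … C_{iV}] for one document.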
For text scaling we suppose that each row is generated by an actor attempting to express, and thereby provide information about, her unobserved position θ on one or more policy dimensions. Perhaps the simplest generalized linear statistical model that reflects this relationship between which words she chooses to use and her position is a multinomial IRT model, modeled fairly directly on the Binomial model familiar from roll-call vote analysis:

\[
[C_{i1} \ldots C_{iV}] \sim \mathrm{Multinomial}(\pi_{i1} \ldots \pi_{iV},\, C_{i+}), \qquad
\log\left(\frac{\pi_{ij}}{\pi_{ik}}\right) = \psi_{j/k} + \theta_i \beta_{j/k} \tag{1}
\]

where the two sets of word parameters ψ and β are labelled to indicate the word contrasts they apply to and the k-th word operates as a baseline.

The same scaling model

Practically, Eq. (1) can be hard to estimate because conditioning on the document length C_{i+} couples the word parameters so they must be jointly estimated. For large V this can be impractical. Fortunately, a completely equivalent ‘surrogate Poisson model’ formulation of the model decouples them by assuming independent Poisson processes for each cell count and introducing nuisance parameters α_i to capture the effect of conditioning on each C_{i+}:

\[
C_{ij} \sim \mathrm{Poisson}(\mu_{ij}), \qquad
\log \mu_{ij} = \alpha_i + \psi_j + \theta_i \beta_j. \tag{2}
\]

The model parameters from (1) and (2) are linearly related:

\[
\log\left(\frac{\pi_{ij}}{\pi_{ik}}\right)
= \log\left(\frac{\mu_{ij}/\sum_j \mu_{ij}}{\mu_{ik}/\sum_j \mu_{ij}}\right)
= \log \mu_{ij} - \log \mu_{ik}
= (\alpha_i - \alpha_i) + (\psi_j - \psi_k) + \theta_i(\beta_j - \beta_k)
= \psi_{j/k} + \theta_i \beta_{j/k}.
\]

This relationship between the two formulations is the multinomial-Poisson transform (Palmgren 1981; Baker 1994). Eq. (2) has the practical advantage over Eq. (1) that it can be simply fitted by iteratively optimising α and θ conditional on the current values of ψ and β, and then ψ and β conditional on the values of α and θ, until a peak in the likelihood is reached, as first suggested by Goodman (1979).

2. Only Wordscores makes use of this fact, though it need not have bothered since only quantities that are invariant to row and column margin totals are estimated as positions.
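A minimal sketch of this alternating scheme on simulated counts. Plain gradient ascent is used within each half-step for brevity (the data, step size, and the particular identification handling are illustrative assumptions, not a description of any published implementation):

```python
import numpy as np

# Alternating estimation of Eq. (2):
#   C_ij ~ Poisson(mu_ij),  log mu_ij = alpha_i + psi_j + theta_i * beta_j
rng = np.random.default_rng(0)
N, V = 6, 10
C = rng.poisson(5.0, size=(N, V)).astype(float)

alpha = np.log(C.sum(axis=1) / V)    # start each row at its mean count
psi = np.zeros(V)
theta = rng.normal(0, 0.1, N)
beta = np.zeros(V)

def loglik():
    eta = alpha[:, None] + psi[None, :] + np.outer(theta, beta)
    return float((C * eta - np.exp(eta)).sum())

lr = 1e-3
before = loglik()
for _ in range(500):
    # half-step 1: update alpha, theta with psi, beta held fixed
    mu = np.exp(alpha[:, None] + psi[None, :] + np.outer(theta, beta))
    alpha += lr * (C - mu).sum(axis=1)
    theta += lr * (C - mu) @ beta
    # half-step 2: update psi, beta with alpha, theta held fixed
    mu = np.exp(alpha[:, None] + psi[None, :] + np.outer(theta, beta))
    psi += lr * (C - mu).sum(axis=0)
    beta += lr * (C - mu).T @ theta
    # identification: theta to mean zero, unit variance, compensating
    # psi and beta so the linear predictor is unchanged
    m, s = theta.mean(), theta.std()
    psi += m * beta
    beta *= s
    theta = (theta - m) / s
after = loglik()
```

The likelihood increases across the zig-zag iterations; a serious implementation would replace the gradient steps with the per-row and per-column Newton updates the text describes.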
For large tables, which includes essentially all text scaling applications, this is a far better choice than Fisher-scoring methods. The multinomial-Poisson transformation also guarantees that either of the two profile likelihoods can be used for conditional inference about parameters from the other margin. Interestingly, this also means that the model is consistent in increasing N, in either fixed or random effects formulation, due to parameter orthogonality (Lancaster 2002; Charbonneau 2012). A second practical advantage of the equivalence is that if we are willing to assume that the word parameters are well-estimated, the formulation in Eq. (1) allows easy computation of asymptotic standard errors for θ without bootstrapping. Implementers can therefore fit using Eq. (2) and construct conditional standard errors using Eq. (1). Lowe and Benoit (2013) discuss this and alternative methods of uncertainty estimation. Conditional standard errors are implemented in the austin R package.

Special cases: Wordfish and Rhetorical Ideal-points

Legislative scholars will recognise Eq. (2) as Wordfish (Slapin and Proksch 2008). With the addition of a grand mean parameter it is also the ‘Rhetorical ideal-point’ model of Monroe and Maeda (2004) in fixed effects formulation.

The scaling model, again

An illuminating alternative way to motivate Eq. (2) as a text scaling model is not as a cheaper way to estimate Eq. (1) but as a log-multiplicative extension to the set of hierarchical log-linear models of C. Of the five possible log-linear models of the two-way table C, two are relevant to word or topic scaling. These are the independence model

\[
\log \mu_{ij} = \lambda + \lambda^R_i + \lambda^C_j \tag{3}
\]

and the saturated model

\[
\log \mu_{ij} = \lambda + \lambda^R_i + \lambda^C_j + \lambda^{RC}_{ij}. \tag{4}
\]

Under the independence model expected counts depend only on the length of each document and the corpus frequency of the words or topics used within it. Clearly no positions can be expressed or inferred from data assumed to be generated this way.
In contrast, the saturated model always recovers C perfectly but provides no analysis of C into positional and non-positional variation. While both models are substantively unsatisfactory, they do locate the source of positioning information: the (N−1)(V−1) odds ratios that determine the N×V matrix of interaction terms λ^{RC}. Substantively, when scaling topic counts these odds ratios reflect facts such as: party A talks about the economy twice as much as social welfare whereas party B talks about it three times as much. This reflects the idea that it is not how much a party talks about the economy that matters but how much more or less it talks about it in proportion to other policy topics. The full set of (N−1)(V−1) such facts exhausts the information available to determine positioning, and the assertion that all such facts can be represented adequately in a much lower dimensional space is the assertion that the spatial model is true of political speech in the same way as it is of other behaviour. To explore the patterns of relative emphasis represented by the full set of odds ratios we define a set of models with complexity greater than the independence model but less than the saturated model by systematically decomposing λ^{RC}. To motivate this approach intuitively, note that every matrix has a singular value decomposition

\[
\lambda^{RC} = U \Sigma V^{\mathsf{T}} = \sum_{m=1}^{M} u_{(m)} \sigma_{(m)} v_{(m)}^{\mathsf{T}}
\]

into orthogonal left and right singular vectors and singular values. This offers a principled way to define a sequence of lower dimensional spatial models

\[
\log \mu_{ij} = \lambda + \lambda^R_i + \lambda^C_j + \sum_{m=1}^{M} u_{i(m)} \sigma_{(m)} v_{j(m)} \tag{5}
\]
\[
\approx \lambda + \lambda^R_i + \lambda^C_j + u_i \sigma v_j \tag{6}
\]

where the second line is a rank one approximation to λ^{RC} constructed by taking the largest singular value and its associated vectors. The general form in Eq. (5) is the M-dimensional row-column association model, or RC(M) model, developed by Goodman (1979, 1985). When M = 0 Eq. (5) defines the independence model, so the documents have no positions.
When M = min(N − 1, V − 1) each document (and word) has a position in M-dimensional space and the model is saturated. Intermediate choices of M define models of intermediate complexity. In applications we usually hope that M is small and substantively interpretable. Eq. (6) is the lowest dimensional model that gives documents and words positions. This parameterisation and identification strategy have three advantages for text scaling. First, the symmetrical normalisation for document positions u and word positions v brings out their symmetric role. Unlike a vote scaling IRT model, nothing changes except the substantive interpretation if we work with C^{\mathsf{T}} instead of C. Second, zero values are interpretable: if v_j ≈ 0 then the j-th word occurs at a rate determined by the independence model and so takes no role in determining document positions. Third, σ provides a measure of how much of the variation in C is due to positioning and how much is to be expected ‘by chance’ from the baseline independence model. In addition, when M > 1 it provides a relative measure of the number of dimensions inherent in C. This is discussed below.

The same scaling model, yet again

Finally, as a fourth formulation of the model, Eq. (5) can be interpreted as a multidimensional unfolding model: a proximity formulation is

\[
\log \mu_{ij} = \lambda + \lambda^R_i + \lambda^C_j - \sigma (u_i - v_j)^2 \tag{7}
\]

where σ is a shared inverse variance. However, without any extra information about u and v the ‘proximity’ model in Eq. (7) has Eq. (6), an explicitly ‘directional’ formulation, as its reduced form.3 (Expand the quadratic terms, reduce, and flip some signs.) Eq. (7) can be seen as i's unobserved utility for deploying a word or topic with position β_j, which depends quadratically on its distance to θ_i, in the same way as i must determine whether the status quo β_{nay} or the bill's position β_{yea} is closer to θ_i in a roll call.
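The parenthetical instruction to expand, reduce, and flip signs amounts to the following standard reduction, written out here for completeness:

```latex
% Expanding the quadratic in Eq. (7):
-\sigma (u_i - v_j)^2 = -\sigma u_i^2 + 2\sigma u_i v_j - \sigma v_j^2
% so that
\log \mu_{ij}
  = \lambda
  + \underbrace{(\lambda^R_i - \sigma u_i^2)}_{\tilde\lambda^R_i}
  + \underbrace{(\lambda^C_j - \sigma v_j^2)}_{\tilde\lambda^C_j}
  + 2\sigma u_i v_j .
```

The purely row-dependent and column-dependent terms are absorbed into the row and column effects, and rescaling u and v by 1/\sqrt{2} recovers the bilinear term u_i σ v_j of Eq. (6).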
Whereas roll calls consist, under the roll call analysis model, of a sequence of simple choices, each of which generates a single decision boundary and a logistic regression with coefficient β_{yea/nay}, a speech consists of a V-way choice made C_{i+} times, implying the multinomial logistic regression structure of Eq. (1).

Special case: Wordfish and Rhetorical Ideal-points

To recover Slapin and Proksch's Wordfish from the RC(M) parameterisation set

α ← λ + λ^R, ψ ← λ^C, β ← σv, θ ← u.

For Monroe and Maeda's Rhetorical Ideal-points, set β to σv. The appendix describes some identification issues with these models and their solutions.

3. It is possible, and perhaps desirable, to assert a non-quadratic utility structure, but that is not what existing models do.

Special cases: Logit scores and proportional differences

If C is collapsed over left and right categories, or if there are to begin with only two categories in the column set, then the \hat{θ} of Eq. (11) are just linear transformations of the estimated θ from Eq. (2). This makes Wordfish the multi-category generalisation of the logit scores described in Lowe et al. (2011). Proofs are straightforward and summarised in Meyer and Wagner (2014, Appendix).

Approximation: Canonical correlation analysis, correspondence analysis

The intuition embodied in the Poisson formulations above is that positions can only be uncovered after factoring out the counts that would be expected just on the basis of document length and word frequencies. This is realised by a reduced rank model of the interaction terms λ^{RC} embedded in an otherwise log-linear model. However, a computationally cheaper and almost equivalent realisation of the same intuition is to analyse the residuals from the independence model directly. Define P_{ij} as C_{ij}/C_{++}.
The predicted cell probabilities under the independence model are each P_{i+}P_{+j}, with chi-squared residuals

\[
R_{ij} = \frac{P_{ij} - P_{i+}P_{+j}}{\sqrt{P_{i+}P_{+j}}}.
\]

As before, the matrix of residuals has singular value decomposition

\[
R = U \Sigma V^{\mathsf{T}} = \sum_{m=1}^{M} u_{(m)} \sigma_{(m)} v_{(m)}^{\mathsf{T}}.
\]

When a smaller number of u and v are retained this is known as canonical correlation analysis (see e.g. Agresti 2002, sec. 9.6.3; σ_{(m)} is the ‘correlation’), and in the case where M = 2 and u and v are plotted in the same space, simple correspondence analysis (see M. Greenacre 2007, for a review). The status of this procedure as an approximation to the association model in Eq. (5) is clearer when it is formulated as an explicit model of the elements of P:

\[
P_{ij} = P_{i+} P_{+j} \left(1 + \sum_{m=1}^{M} u_{i(m)} \sigma_{(m)} v_{j(m)}\right) \tag{8}
\]
\[
\approx P_{i+} P_{+j} \left(1 + u_i \sigma v_j\right) \tag{9}
\]

where the second line is a rank one approximation that is optimal in the least-squares sense (Eckart and Young 1936). This is referred to as the ‘reconstitution formula’ in the correspondence analysis literature. As the notation suggests, the u and v of Eqs. (8) and (9) play the same role as, and are highly correlated with, those of Eqs. (5) and (6). The approximation will be best when relatively little variation in the data is due to positioning. This is because e^x ≈ 1 + x when x is close to zero. Goodman (1985) provides an exhaustive description of the similarities and differences between canonical correlation / correspondence analysis and the RC(M) association model. In practical applications θ from Eq. (6) and u from Eq. (9) are very highly correlated, as we might expect of a maximum likelihood model and its least squares approximation. Perhaps not surprisingly for a linear approximation, there are values of u and v that reconstruct entries in P that are less than zero or larger than one. This is a function of the geometric roots of correspondence analysis rather than the generative roots of association models. Substantively, Eq. (9) reflects the same logic of positioning by relative emphasis and should be used as such.
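The residual-SVD procedure and the reconstitution formula can be sketched directly. The 4×5 count matrix below is hypothetical; with all dimensions retained, reconstitution recovers P exactly:

```python
import numpy as np

# Canonical correlation / correspondence analysis as described above:
# SVD of the chi-squared residuals from the independence model.
rng = np.random.default_rng(1)
C = rng.poisson(10.0, size=(4, 5)) + 1.0   # hypothetical counts, no zeros

P = C / C.sum()                        # P_ij = C_ij / C_++
r, c = P.sum(axis=1), P.sum(axis=0)    # masses P_{i+}, P_{+j}
E = np.outer(r, c)                     # expected under independence
R = (P - E) / np.sqrt(E)               # chi-squared residuals

U, sigma, Vt = np.linalg.svd(R, full_matrices=False)

# Standardised coordinates appearing in Eq. (8):
# u_{i(m)} = U_{im} / sqrt(P_{i+}),  v_{j(m)} = V_{jm} / sqrt(P_{+j})
u = U / np.sqrt(r)[:, None]
v = Vt / np.sqrt(c)[None, :]

# The full reconstitution formula, Eq. (8), recovers P exactly;
# truncating the sum gives the low-rank approximation of Eq. (9).
P_hat = E * (1 + (u * sigma) @ v)
```

The first columns of `u` and rows of `v` are the one-dimensional document and word 'positions'.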
Application: Manifesto positions from coded manifestos

In applications based on CMP data (e.g. Budge, Robertson, and Hearl 1987; Bakker 2009) C_{ij} contains the percentage of quasi-sentences in manifesto i devoted to category j rather than counts. With this important difference to the definition of C, researchers have suggested several ways to estimate party policy positions. The CMP project itself does so by identifying a set L of the column indices corresponding to ‘left’ categories and another set R for ‘right’ categories (e.g. Budge 2001, table 1), asserting these are equally informative and substantively equivalent, and then collapsing C over R and L to create a new matrix with two columns from which row positions are derived. Different functions contrasting the two columns correspond to different ways to estimate positions. A straightforward choice is the difference in the proportion of each document spent expressing left versus right policy categories:

\[
\tilde{\theta}_i \propto \sum_{j \in R} C_{ij} - \sum_{j \in L} C_{ij} \tag{10}
\]

When the divisor is C_{i+} this is the preferred CMP estimate of left-right position, ‘RILE’. Kim and Fording (2002) and Laver and Garry (2000) have suggested that a better denominator might be the sum total of left and right category counts. Lowe et al. (2011) have argued, partly on psychophysical grounds, that sentences categorised on either side of an issue should have decreasing rather than constant marginal effects on a position estimate. This leads to so-called ‘logit scores’:

\[
\hat{\theta}_i = \log \frac{\sum_{j \in R} C_{ij}}{\sum_{j \in L} C_{ij}} \tag{11}
\]

Notice that Eqs. (1) and (11) mark out two modeling limits. Eq. (11) places almost no constraints on possible positions and in this respect resembles a saturated Binomial version of Eq. (1) with as many parameters as positions. The lack of an explicit model has the undesirable consequence that zero counts must be dealt with in an ad-hoc manner, by adding a constant.
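On a hypothetical three-manifesto count matrix with illustrative left and right category sets, Eqs. (10) and (11) are one line each:

```python
import numpy as np

# Toy illustration of RILE-style differences, Eq. (10), and logit
# scores, Eq. (11). The counts and L/R index sets are hypothetical.
C = np.array([[20.0, 5.0, 10.0],   # columns: a 'left', a 'right',
              [8.0, 16.0, 6.0],    # and an uncategorised topic
              [10.0, 10.0, 12.0]])
L, R = [0], [1]                    # 'left' and 'right' column indices

left = C[:, L].sum(axis=1)
right = C[:, R].sum(axis=1)

rile = (right - left) / C.sum(axis=1)   # Eq. (10) with divisor C_{i+}
logit = np.log(right / left)            # Eq. (11), logit scores
```

Note that `logit` is undefined when either sum is zero, which is why zero counts must be handled by adding a constant.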
In contrast, Eq. (1) with scalar θ is nearly the strongest possible model because it forces its linear predictors into a much lower dimensional structure than the original count data. As the dimension of θ increases, Eq. (1) provides fewer constraints on the fitted counts until there are as many parameters as positions and the positions themselves are just the empirical logits in (11). If left and right category counts are fairly balanced then \tilde{θ} ≈ \hat{θ}. Budge and McDonald (2012) confirm that this is the case in applications. We can make an independent check on whether the choices of left and right categories that go into RILE are indeed substantively equivalent by fitting the scaling model and examining the item parameters β. Fig. 1 shows the results for post-war manifestos in Germany. Here all the basic CMP categories are plotted in order of estimated position, in blocks. The figure shows first that while the left and right categories do seem to support these interpretations in Germany, some categories are further left or right than others, so it is not strictly true that they are substantively equivalent, and aggregating them is weakly motivated. Also, other CMP categories from the middle block could have been included in the left-right measure but were not. The scaling model therefore provides some validation of the basic choices but opens the way to a more complete and nuanced scale construction if needed: θ estimates would perhaps be a better choice of scale than either aggregated differences or their log ratio.

Special case: the Vanilla Method

Correlation models compute the SVD of R, which is a transformation of C from which expected variation due to differing document lengths and word frequencies has been removed. Practically this is done by subtracting expected means and dividing by expected standard deviations under independence. But what happens if the SVD is applied to C directly rather than to R, as in Gabel and Huber's ‘vanilla method’?
In this case the first left and right singular vectors will capture the information provided by the independence model, essentially the relative document lengths and relative word frequencies, and the second and subsequent left and right singular vectors will have positional interpretations. The ‘vanilla method’ for estimating manifesto positions from CMP data is to apply principal component analysis (PCA) and take the first principal component score as the manifesto's position. Although there is a close connection between SVD and PCA – if the SVD of a matrix A is U \Sigma V^{\mathsf{T}} then U are also the eigenvectors of AA^{\mathsf{T}}, V the eigenvectors of A^{\mathsf{T}}A, and σ the square roots of the eigenvalues of AA^{\mathsf{T}} and A^{\mathsf{T}}A – we would not necessarily expect the first principal component of the data matrix to be close to the u of Eq. (9), or even to be particularly interpretable. However, the CMP represents each manifesto as a vector of policy area percentages rather than counts.4 Because these should, and mostly do, sum to one hundred in each row, document length information is already effectively removed from the data. Moreover, standard routines for PCA suggest standardising by column to remove the effects of having variables (columns) on different scales.5 After these row and column normalisations, the manifesto positions constructed by the vanilla method will therefore be extremely close to those constructed using Eq. (9) or indeed Eq. (6). One substantive implication of this is that researchers should probably not use the ‘vanilla method’ on general C matrices, but only on the CMP's preferred format.

Implementations

Practical implementations of Eq. (8) require a method for computing the singular value decomposition. Efficient matrix methods are now widely available, but in the past a reciprocal averaging scheme was used.
This consists of alternating

\[
u_i^{(t)} \leftarrow \frac{\sum_j C_{ij} v_j^{(t-1)}}{C_{i+}}, \qquad
v_j^{(t)} \leftarrow \frac{\sum_i C_{ij} u_i^{(t-1)}}{C_{+j}} \tag{12}
\]

As t increases u^{(t)} → u and v^{(t)} → v from any non-identical starting values (Hill 1974, prop. 1). At the end of each iteration a normalisation of u, e.g. Eq. (13), must be applied to prevent u and v collapsing to vectors of ones.

4. Original counts can be recovered by multiplication. Mostly. This would work better if the project believed in distributing policy area counts in even single precision.

5. For example R's prcomp function, which describes its argument scale as follows: “a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place. The default is ‘FALSE’ for consistency with S, but in general scaling is advisable.” (emphasis added). Alternatively PCA operates on the correlation matrix of the data, which clearly performs both row and column normalisation.

Figure 1: CMP category scores for manifestos from post-war Germany, grouped by left RILE (top), unused (middle), and right RILE (bottom).

Special case: Wordscores

When all document positions u are considered to be known, only the word positions v need to be estimated. Wordscores applies just one application of the right hand side of Eq. (12) and constructs positions for out-of-sample documents using the left hand side. This defines the Wordscores algorithm6 (Laver, Benoit, and Garry 2003). A more detailed description of this connection, with reference to Eq. (9) in the form of correspondence analysis, is given in Lowe (2008). Wordscores appears not to require the imposition of any normalisation step such as Eq. (13) because the assumption of known u implies that no iterations are necessary. However, a related issue does arise when predicting the new document positions (Benoit and Laver 2008; Martin and Vanberg 2007). This identification issue cannot be avoided. The connection between canonical correlation and Wordscores has one more substantive implication: the linearity of (9)'s approximation to (6) implies that if the true data are generated by Eq. (5) then the word and document position parameters in Eq. (8) will be shrunk towards the centre at the edges. This is related to the endpoint shrinkage discussed by Lowe (2008) and in more detail by ter Braak and Looman (1986).
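A sketch of the reciprocal averaging iteration of Eq. (12) on hypothetical counts. A mean-zero, weighted-unit-variance normalisation stands in for the paper's Eq. (13), which is not reproduced in this excerpt:

```python
import numpy as np

# Reciprocal averaging: alternately average word scores into document
# scores and document scores into word scores, normalising u each
# iteration so the pair does not collapse to vectors of ones.
rng = np.random.default_rng(2)
C = rng.poisson(8.0, size=(5, 7)).astype(float) + 1.0
row, col = C.sum(axis=1), C.sum(axis=0)

v = rng.normal(size=7)               # any non-identical starting values
for _ in range(5000):
    u = (C @ v) / row                # u_i <- sum_j C_ij v_j / C_{i+}
    u -= (row @ u) / row.sum()       # remove the trivial solution...
    u /= np.sqrt((row @ u ** 2) / row.sum())   # ...and fix the scale
    v = (C.T @ u) / col              # v_j <- sum_i C_ij u_i / C_{+j}
```

At convergence `u` and `v` are (up to scaling) the first-dimension document and word positions of the correspondence analysis solution; the Wordscores procedure corresponds to a single pass of these two updates with `u` taken as known.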
Figure 2: Party positions in German federal elections of 1990, 1994, 1998, 2002, and 2005 from a two-dimensional canonical correlation model. Positions are labelled with party initials. Labels are sized from 1990 (smallest) to 2005 (largest).

6. Strictly, the left recursion in Wordscores is performed on a column-normalised version of C, but dividing each column by a constant makes no difference, except through rounding error, to the final u and v.

Figure 3: Canonical correlation parameters σ for the first 20 dimensions of the German economic manifesto data.

Practical Advantages

In addition to providing a unifying conceptual scheme for text scaling models, and perhaps stemming the tide of new formulations of the same (good) intuition, there are practical advantages to working with association models or their canonical correlation approximations for text scaling.

Easily fitted multidimensional models

A considerable practical advantage of canonical correlation is the ease and speed of fitting models with multiple dimensions, to which models like Wordfish have not yet been extended. For example, Fig. 2 shows positions on the first two dimensions for the German manifesto data used in the original Wordfish paper (the combined economic sections of five main parties in five German elections; Slapin and Proksch 2008). The positions have been coloured and labelled by party, and larger labels indicate more recent elections. Fig. 3 shows the estimated values of σ for the 20 largest dimensions and suggests that a model with M = 2 might be reasonable: there appear to be two main dimensions of variation in this data.
So we can now ask: what substantive meaning, if any, do the dimensions in Fig. 2 have? Knowledge of recent German politics suggests that there are indeed two interpretable dimensions embedded in the positions, but that these are not (quite) aligned with the axes. This will in general be true – there is no reason to expect substantive dimensions to be orthogonal (Albright 2010). If the positions are projected onto a line extending from middle left to bottom right then the parties order on a free market economics dimension from the PDS on the left to the FDP on the right. At 90 degrees to this the parties order by time, with earlier elections towards the bottom left and later ones towards the top right of the figure. A one dimensional model would have recovered only the first dimension, which would mostly remove the distinctions in economic policy between the centre-right CDU and the explicitly free market-oriented FDP. An interesting consequence of estimating the second dimension is the movement of some but not all party positions towards the centre of the substantive first dimension, and the large changes over time in the left wing parties' ways of talking about economics, or more likely the choice of economic topics that their manifestos covered over this set of elections.

Better visualisation

Fig. 4 shows a two dimensional canonical correlation model of CMP category counts for the major post-war German party manifestos (inflated to suitable C matrices). This plot is a biplot (M. J. Greenacre 2010) and its substantive interpretation is as follows: parties appearing close together have similar profiles of topic emphasis, and topics close together are emphasised by similar sets of parties. Points appearing at the origin use each column category at rates that would be expected by chance, that is, according to the independence model.
That is, a party appearing at the origin uses each policy topic exactly as much as would be expected from the frequency of references to these topics in manifestos of the period, and a topic at the origin is one used by each party equally often. Deviations from the origin represent how much more relative emphasis each manifesto puts on each category. Finally, the angle between a vector from the origin to a party manifesto and a vector from the origin to a policy topic indicates how much a party emphasised that topic over others. Thus a topic such as ‘Freedom and Human Rights’ (lower left quadrant) is emphasised almost equally strongly by the FDP (lower right quadrant) and the Greens and left parties (left side). ‘Education Expansion’ has a similar party profile but its emphasis is weaker. In contrast, ‘Welfare State Limitation’ is almost exclusively an FDP policy topic. The biplot also nicely shows the two non-orthogonal dimensions of policy dispute in the party system: a liberal-conservative ordering on social issues from left to top right, and a redistributive-neoliberal dimension from left to bottom right. The biplot shows only the policy topics recommended for a left-right dimension by the CMP, but we can also scale and project all measured policy topics in the same space. Figure 5 shows all topics.

The saliency theory of positioning

Budge (2001) lays out the basic assumptions of the saliency theory of party competition. The three most relevant for this paper are:

1. “policy differences between parties thus consist of contrasting emphases on different policy areas (thus, one party often mentions taxes, another benefits)”.

2. “party strategists see electors as overwhelmingly favouring one course of action on most issues. Hence all programmes endorse that position, with only minor exceptions”.

3.
In certain special cases “sets of policy emphases which go together can be added numerically and contrasted with sets of opposing emphases to form a unified directional index such as Right versus Left” (Budge 2001).

Figure 4: Biplot of post-war German manifestos and their use of left-right CMP categories. Dimension 1 accounts for 34.86% and Dimension 2 for 17.14% of the variation.

The assumptions of the models discussed above reflect exactly the first assumption. Relative emphasis on a category in a manifesto means, in the CMP quasi-sentence operationalisation, talking more about it relative to other categories. The models discussed here operationalise this idea more precisely because they require a speaker or writer to use more words or sentences on one topic than another relative to what would be expected by chance according to the independence model.
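The independence-model baseline is easy to make concrete: the expected count for each cell is the product of its margins over the total, and relative emphasis is the deviation from that expectation. A sketch with made-up counts (the numbers and layout here are purely illustrative):

```python
import numpy as np

# made-up counts: 3 manifestos by 4 policy topics
C = np.array([[30, 10, 5, 5],
              [10, 25, 10, 5],
              [5, 10, 20, 15]], dtype=float)

row, col, N = C.sum(axis=1), C.sum(axis=0), C.sum()
expected = np.outer(row, col) / N   # counts expected by chance alone
emphasis = C - expected             # > 0: more emphasis than chance

# the baseline depends on document length and topic prevalence, so the
# same raw count can signal different amounts of relative emphasis
```

Because the expected counts reproduce the margins exactly, the deviations necessarily sum to zero within each manifesto and within each topic.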
Thus the actual number of words or quasi-sentences needed to do this differs according to the chance baseline, the prevalence of the other categories, and the length of the document. Nevertheless, the intuition is the same as 1) above. In particular cases, some categories are naturally contrasted (or bundled) and are good candidates for an index measure. When the bundles of categories are known then logit or linear indexes are a natural option. When bundle contents are unknown they can be estimated as in Fig. 1.

Figure 5: Biplot of post-war German manifestos and their use of all CMP categories. Dimension 1 accounts for 31.3% and Dimension 2 for 14.3% of the variation.

The models above also provide a counter to the claim that all programmes have the same position. Here, position just is the low-dimensional summary of relative emphases over the appropriate unit, so it is not possible to differ in relative emphasis while maintaining the same position. To retain the intuition in 2) it might be reasonable to say that electors all see the policy advantages of, e.g., reducing state intervention in the economy and also the advantages of increasing the state’s role, but weigh the benefits of these differently. Finally, the model above not only locates positioning information, but also more traditional salience information. In the association model fitted to category data, e.g. fitted to CMP category counts from a single country, λL offers a view of the relative amounts of emphasis spent on each policy topic. This might be a reasonable operationalisation of policy salience. An interesting implication of the model is that salience is nearly, but not quite, independent of position, because increasing salience by raising the base rate of discussion about a policy area also changes the exact amount of emphasis, in sentence or word units, necessary for an actor to maintain the same positions on potentially all issues. To be clear, this discussion of saliency theory is not intended to criticise (or support) it as a substantive theory of party competition, but only to point out that two of the three main assumptions of saliency theory are in fact realised by the models discussed in this paper.
This makes CMP objections to text scaling somewhat ironic.

There are always positions

The logic of positioning by relative emphasis implies that anything we can count can always be scaled; the worst that can happen is that no low-dimensional projection of counts will clearly account for a large amount of the observed variation. So the question for political scientists is not: ‘Are there positions?’ but rather: ‘Are these the patterns of relative emphasis we are looking for?’

Conclusion

This paper has presented a unified theory of scaling count data and shown many existing methods to be special cases or approximations of it. The theory rests on the idea that a logic of relative emphasis can be used to model and visualise contingency table data of all kinds. It also shows that models based on this theory recover and allow us to extend previous manual coding-based approaches. While more sophisticated probabilistic modeling will continue to be developed for particular political science applications, we should be able to agree on the basic logic of positioning and concentrate on harnessing it to address substantive questions.

A Identification

The focus of this paper is on the logic and models of scaling rather than particular estimation strategies, but it is important to note some practical considerations concerning identification. When θ is modeled as a fixed but unobserved quantity it must be constrained for the model to be identified. For one-dimensional θ a common approach is to require that

∑i θi = 0    ∑i θi² = 1    θa > θb    (13)

where a and b are any two row indices and the third constraint sets the left-right interpretation. For comparisons between the association model and correspondence analysis formulations of the models it is sometimes convenient to make these weighted averages, with weights given by the margins of C. For identification it is only necessary that there be two linear constraints.
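The constraints in (13) can be imposed on any unidentified estimate after the fact by centring, rescaling, and resolving the reflection. A minimal sketch (the function name and the default choice a = 0, b = 1 are mine):

```python
import numpy as np

def identify(theta, a=0, b=1):
    """Impose the constraints of (13): mean zero, unit sum of squares,
    and theta[a] > theta[b] to pin down the left-right direction."""
    theta = np.asarray(theta, dtype=float)
    theta = theta - theta.mean()                  # sum theta_i = 0
    theta = theta / np.sqrt(np.sum(theta ** 2))   # sum theta_i^2 = 1
    if theta[a] <= theta[b]:                      # resolve the reflection
        theta = -theta
    return theta

theta = identify([3.0, 1.0, 2.0, 5.0])
```

In a full model the remaining parameters would of course have to be transformed correspondingly so that the likelihood is unchanged.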
Multidimensional θ will also require constraints on the covariance between positions on different dimensions. Setting them all to zero is a convenient choice. If θ is instead considered as a sample from a population of possible policy positions, θ ∼ Normal(0, 1) is also sufficient to identify the model, though any sufficiently informative prior will work. The normal prior defines a latent trait model (Moustaki and Knott 2000) or, equivalently, makes the connection to multinomial IRT models (Clinton, Jackman, and Rivers 2004). It is also possible to define models that condition θ on external information, e.g. about the speaker, which can also serve to identify positions.

Regardless of the strategy for identifying θ, the other model parameters also need identifying. Interestingly, the Wordfish model is not likelihood identified. Any change in β̄ = ∑j βj / V can be compensated for by changes in α without changing the probability of the data under the model. Specifically, let m and s be the average and standard deviation of β. The translations between the two models are

Wordfish to Association Model:
u ← θ
v ← (β − m)/s
σ ← s
r ← α + θm
λR ← r − r̄
λC ← λC − λ̄C
λ ← r̄ + λ̄C

Association Model to Wordfish:
θ ← u
β ← vσ + m
a ← λ + λR − θm
α ← a − ā1
ψ ← λC + ā1

where a bar indicates an arithmetic average. In the second column, m may be chosen arbitrarily; the association model fixes it to be zero. Despite this lack of identification, the Wordfish model as a whole is identified because of the extra constraint provided by a ridge prior on β. Contrary to the original paper’s claim, this does not fix a ‘technical issue’ by adding a regularisation term but rather completes the identification7.

7. So don’t leave it out… Alternating conditional maximization is not actually an EM algorithm either, despite the paper’s claims.

B Scaling in the space of text analysis models

Why should a model apply to counts at all levels?
It is easier to see why this should be true from a measurement perspective. Starting at the lowest level, if words are modeled as conditionally independent given a latent variable, e.g. a position, then we can scale words to recover positions directly. If, as is more plausible, positions are patterns of relative emphasis over policy topics rather than individual word types, then scaling topic counts rather than word counts will recover positions. But if, as topic modelers assume, topics are themselves distributions of word types conditioned on unobserved topic indicators, then scaling words directly will recover noisy versions of the topic-defined positions without our needing to know or infer the topic indicators, but with extra uncertainty due to not making use of the subspace of word counts induced by convex combinations of underlying topics. It is sometimes helpful to think of K topics as inducing a K − 1-dimensional subspace in the N-dimensional space of word counts, and of scaling models as pursuing an M ≪ K dimensional summary of relative emphasis across K topics. In this generative scheme knowing intermediate topic structure decreases variance in position estimation and increases bias and substantive interpretability, but that detail does not compromise our understanding of the form and substantive implications of the model.

References

Agresti, Alan. 2002. Categorical data analysis. 2nd ed. New York: Wiley-Interscience.

Albright, Jeremy J. 2008. Bayesian estimates of party left-right scores. Working Paper, Society for Political Methodology 801. July 11.

———. 2010. “The multidimensional nature of party competition.” Party Politics 16, no. 6 (November 1): 699–719. Accessed June 18, 2015.

Baker, Stuart G. 1994. “The multinomial-Poisson transformation.” Journal of the Royal Statistical Society. Series D (The Statistician) 43 (4): 495–504. JSTOR: 2348134.

Bakker, Ryan. 2009.
“Re-measuring left–right: A comparison of SEM and Bayesian measurement models for extracting left–right party placements.” Electoral Studies 28, no. 3 (September): 413–421.

Benoit, Kenneth R., and Michael Laver. 2008. “Compared to what? A comment on ‘A robust transformation procedure for interpreting political text’ by Martin and Vanberg.” Political Analysis 16 (1): 101–111.

Budge, Ian. 2001. “Validating party policy placements.” British Journal of Political Science 31 (1): 210–223. JSTOR: 10.2307/3593282.

Budge, Ian, and Michael D. McDonald. 2012. “Conceptualising and measuring ‘centrism’ correctly on the Left–Right scale (RILE) - without systematic bias. A general response by MARPOR.” Electoral Studies 31 (3): 609–612.

Budge, Ian, David Robertson, and Derek Hearl, eds. 1987. Ideology, strategy and party change: Spatial analyses of post-war election programmes in 19 democracies. Cambridge UK: Cambridge University Press.

Charbonneau, Karyne. 2012. “Multiple fixed effects in nonlinear panel data models.” Accessed April 7, 2016.

Clinton, Joshua, Simon Jackman, and Douglas Rivers. 2004. “The statistical analysis of roll call data.” American Political Science Review 98, no. 2 (June): 1–16.

Dinas, Elias, and Kostas Gemenis. 2009. “Measuring parties’ ideological positions with manifesto data: A critical evaluation of the competing methods”: 1–25.

Eckart, Carl, and Gale Young. 1936. “The approximation of one matrix by another of lower rank.” Psychometrika 1, no. 3 (September): 211–218. Accessed June 18, 2015.

Elff, Martin. 2013. “A dynamic state-space model of coded political texts.” Political Analysis 21 (2): 217–232.

Gabel, Matthew J., and John D. Huber. 2000. “Putting parties in their place: Inferring party left-right ideological positions from party manifestos data.” American Journal of Political Science 44, no. 1 (January): 94. JSTOR: 2669295.

Goodman, Leo A. 1979.
“Simple models for the analysis of association in cross-classifications having ordered categories.” Journal of the American Statistical Association 74 (367): 537–552.

———. 1985. “The analysis of cross-classified data having ordered and/or unordered categories: Association models, correlation models, and asymmetry models for contingency tables with or without missing entries.” The Annals of Statistics 13 (1): 10–69. JSTOR: 10.2307/2241143.

Greenacre, Michael. 2007. Correspondence analysis in practice. Chapman and Hall/CRC.

Greenacre, Michael J. 2010. “Correspondence analysis.” WIREs Computational Statistics.

Hill, Mark O. 1974. “Correspondence analysis: A neglected multivariate method.” Applied Statistics 23 (3): 340–354.

Kim, Heemin, and Richard C. Fording. 2002. “Government partisanship in Western democracies, 1945-1998.” European Journal of Political Research 41 (2): 187–206.

Klemmensen, Robert, Sara Binzer Hobolt, and Martin Ejnar Hansen. 2007. “Estimating policy positions using political texts: An evaluation of the Wordscores approach.” Electoral Studies 26, no. 4 (December): 746–755.

Klüver, Heike. 2009. “Measuring interest group influence using quantitative text analysis.” European Union Politics 10 (4): 535–549.

König, Thomas, and Bernd Luig. 2009. “German ’LexIconSpace’: policy positions and their legislative context.” German Politics 18, no. 3 (September): 345–364.

König, Thomas, Moritz Marbach, and Moritz Osnabrugge. 2013. “Estimating party positions across countries and time–a dynamic latent variable model for manifesto data.” Political Analysis 21, no. 4 (October 1): 468–491. Accessed June 18, 2015.

Lancaster, Tony. 2002. “Orthogonal parameters and panel data.” The Review of Economic Studies 69 (3): 647–666.

Laver, Michael, Kenneth R. Benoit, and John Garry. 2003. “Extracting policy positions from political texts using words as data.” American Political Science Review 97, no. 2 (May): 311–331.

Laver, Michael, and John Garry. 2000.
“Estimating policy positions from political texts.” American Journal of Political Science 44 (3): 619–634.

Lowe, Will. 2008. “Understanding Wordscores.” Political Analysis 16 (4): 356–371.

Lowe, Will, and Kenneth R. Benoit. 2013. “Validating estimates of latent traits from textual data using human judgment as a benchmark.” Political Analysis 21 (3): 298–313.

Lowe, Will, Kenneth R. Benoit, Slava Mikhaylov, and Michael Laver. 2011. “Scaling policy preferences from coded political texts.” Legislative Studies Quarterly 36 (1): 123–155.

Martin, L. W., and G. Vanberg. 2007. “A robust transformation procedure for interpreting political text.” Political Analysis 16 (1): 93–100.

Meyer, Thomas M., and Markus Wagner. 2014. “How parties and candidates move: Shifting emphasis or positions?” October.

Monroe, Burt L., and Ko Maeda. 2004. “Talk’s cheap: Text-based estimation of rhetorical ideal-points.”

Moustaki, Irini, and Martin Knott. 2000. “Generalized latent trait models.” Psychometrika 65, no. 3 (September): 391–411.

Palmgren, Juni. 1981. “The Fisher information matrix for log linear models arguing conditionally on observed explanatory variables.” Biometrika 68, no. 2 (August): 563–566.

Pennings, Paul, and Hans Keman. 2002. “Towards a new methodology of estimating party policy positions.” Quality and Quantity 36 (1): 55–79.

Slapin, Jonathan B., and Sven-Oliver Proksch. 2008. “A scaling model for estimating time-series party positions from texts.” American Journal of Political Science 52, no. 3 (July): 705–722.

Ter Braak, Cajo J. F., and Caspar W. N. Looman. 1986. “Weighted averaging, logistic regression and the Gaussian response model.” Plant Ecology 65, no. 1 (January): 3–11.

Warwick, Paul V. 2002. “Toward a common dimensionality in West European policy spaces.” Party Politics 8, no. 1 (January): 101–122.