Modeling Science
David M. Blei
Department of Computer Science, Princeton University
April 17, 2008
Joint work with John Lafferty (CMU)
D. Blei Modeling Science 1 / 53

Modeling Science
[Figure: two articles from Science — August 13, 1886 and June 24, 1994 — each shown with its most probable topic words, e.g. "water, disease, cholera, bacteria, koch" for 1886 and "evolution, rna, disease, phylogenetic, malaria" for 1994.]
• On-line archives of document collections require better organization; manual organization is not practical.
• Our goal: to discover the hidden thematic structure with hierarchical probabilistic models called topic models.
• Use this structure for browsing, search, and similarity.

Modeling Science
• Our data are the pages of Science from 1880–2002 (from JSTOR).
• No reliable punctuation, meta-data, or references.
• Note: this is just a subset of JSTOR's archive.

Discover topics from a corpus

  "Genetics"    "Evolution"    "Disease"     "Computers"
  human         evolution      disease       computer
  genome        evolutionary   host          models
  dna           species        bacteria      information
  genetic       organisms      diseases      data
  genes         life           resistance    computers
  sequence      origin         bacterial     system
  gene          biology        new           network
  molecular     groups         strains       systems
  sequencing    phylogenetic   control       model
  map           living         infectious    parallel
  information   diversity      malaria       methods
  genetics      group          parasite      networks
  mapping       new            parasites     software
  project       two            united        new
  sequences     common         tuberculosis  simulations
Model the evolution of topics over time
[Figure: posterior word probabilities over time (1880–2000) for two topics — "Theoretical Physics" (RELATIVITY, LASER, FORCE) and "Neuroscience" (NERVE, OXYGEN, NEURON).]

Model connections between topics
[Figure: a portion of a topic graph estimated from Science, with topics such as {wild type, mutant, mutations}, {p53, cell cycle, cyclin}, {earthquake, earthquakes, fault}, {climate, ocean, ice, climate change}, {virus, hiv, aids, infection}, and {stars, astronomers, universe, galaxies}, and edges indicating which topics tend to co-occur.]

Outline
1 Introduction
2 Latent Dirichlet allocation
3 Dynamic topic models
4 Correlated topic models

Probabilistic modeling
1 Treat data as observations that arise from a generative probabilistic process that includes hidden variables.
  • For documents, the hidden variables reflect the thematic structure of the collection.
2 Infer the hidden structure using posterior inference.
  • What are the topics that describe this collection?
3 Situate new data into the estimated model.
  • How does this query or new document fit into the estimated topic structure?

Intuition behind LDA
Simple intuition: documents exhibit multiple topics.

Generative process
• Cast this intuition into a generative probabilistic process:
• Each document is a random mixture of corpus-wide topics.
• Each word is drawn from one of those topics.

Generative process
• In reality, we only observe the documents.
• Our goal is to infer the underlying topic structure:
• What are the topics?
• How are the documents divided among those topics?
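The generative process sketched above can be written as a short simulation. This is an illustrative sketch only — the vocabulary size, number of topics, and hyperparameters below are toy values, not the settings used in the talk:

```python
# Sketch of the LDA generative process: topics, then per-document
# proportions, then per-word assignments and words.  All sizes and
# hyperparameters here are toy values.
import numpy as np

rng = np.random.default_rng(0)

V, K, D, N = 50, 3, 4, 20          # vocab size, topics, docs, words per doc
eta, alpha = 0.1, 0.5              # topic and proportion hyperparameters

# 1. Draw each topic beta_k ~ Dirichlet(eta).
beta = rng.dirichlet([eta] * V, size=K)       # K x V, rows sum to one

corpus = []
for d in range(D):
    # 2a. Draw topic proportions theta_d ~ Dirichlet(alpha).
    theta = rng.dirichlet([alpha] * K)
    doc = []
    for n in range(N):
        # 2b-i. Draw the word's topic assignment z ~ Mult(theta).
        z = rng.choice(K, p=theta)
        # 2b-ii. Draw the word itself w ~ Mult(beta_z).
        w = rng.choice(V, p=beta[z])
        doc.append(w)
    corpus.append(doc)

print(len(corpus), len(corpus[0]))  # D documents of N words each
```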
Graphical models (aside)
[Figure: a directed graphical model with node Y pointing to X_1, …, X_N, and its equivalent plate notation.]
• Nodes are random variables.
• Edges denote possible dependence.
• Observed variables are shaded.
• Plates denote replicated structure.

Graphical models (aside)
• The structure of the graph defines the pattern of conditional dependence between the ensemble of random variables.
• E.g., this graph corresponds to

  p(y, x_1, …, x_N) = p(y) ∏_{n=1}^{N} p(x_n | y)

Latent Dirichlet allocation
[Graphical model: α → θ_d → z_{d,n} → w_{d,n}, with η → β_k; plates over the N words, D documents, and K topics.]
• α: Dirichlet parameter; θ_d: per-document topic proportions; z_{d,n}: per-word topic assignment; w_{d,n}: observed word; β_k: topics; η: topic hyperparameter.
• Each piece of the structure is a random variable.

Latent Dirichlet allocation
1 Draw each topic β_k ∼ Dir(η), for k ∈ {1, …, K}.
2 For each document:
  1 Draw topic proportions θ_d ∼ Dir(α).
  2 For each word:
    1 Draw z_{d,n} ∼ Mult(θ_d).
    2 Draw w_{d,n} ∼ Mult(β_{z_{d,n}}).

Latent Dirichlet allocation
• From a collection of documents, infer
• the per-word topic assignments z_{d,n},
• the per-document topic proportions θ_d,
• the per-corpus topic distributions β_k.
• Use posterior expectations to perform the task at hand, e.g., information retrieval, document similarity, etc.

Latent Dirichlet allocation
• Computing the posterior is intractable:

  p(θ, z_{1:N} | w_{1:N}, α, β_{1:K}) =
    p(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β_{1:K})
    / ∫_θ p(θ | α) ∏_{n=1}^{N} Σ_{z_n=1}^{K} p(z_n | θ) p(w_n | z_n, β_{1:K}) dθ

• Several approximation techniques have been developed:
• Mean-field variational methods (Blei et al., 2001, 2003)
• Expectation propagation (Minka and Lafferty, 2002)
• Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)
• Collapsed variational inference (Teh et al., 2006)
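Of the approximation methods listed above, collapsed Gibbs sampling (Griffiths and Steyvers) is the easiest to sketch in code: integrate out θ and β analytically and resample each word's topic assignment from its conditional given all other assignments. The corpus, sizes, and hyperparameters below are toy values:

```python
# Minimal collapsed Gibbs sampler for LDA.  Each sweep removes a word's
# current assignment from the count tables, computes the conditional
# p(z = k | z_-dn, w) ∝ (ndk + alpha)(nkw + eta)/(nk + V eta), and
# resamples.  Toy data and hyperparameters throughout.
import numpy as np

rng = np.random.default_rng(1)

docs = [[0, 0, 1, 2], [2, 3, 3, 3], [0, 1, 1, 2]]   # word ids per document
V, K = 4, 2
alpha, eta = 0.5, 0.1

z = [[rng.integers(K) for _ in doc] for doc in docs]  # random init
ndk = np.zeros((len(docs), K))     # topic counts per document
nkw = np.zeros((K, V))             # word counts per topic
nk = np.zeros(K)                   # total words per topic
for d, doc in enumerate(docs):
    for n, w in enumerate(doc):
        k = z[d][n]
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for it in range(50):               # Gibbs sweeps
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]            # remove this word's current assignment
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
            k = rng.choice(K, p=p / p.sum())
            z[d][n] = k            # add it back under the new assignment
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

# Posterior-mean estimate of each document's topic proportions.
theta_hat = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
print(theta_hat.round(2))
```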
Example inference
• Data: the OCR'ed collection of Science from 1990–2000:
• 17K documents
• 11M words
• 20K unique terms (stop words and rare words removed)
• Model: a 100-topic LDA model fit using variational inference.

Example inference
[Figure: bar plot of the inferred topic proportions (probability vs. topic index, 1–100) for a single document.]

Example topics
[Table: the four example topics shown earlier — "Genetics" (human, genome, dna, genetic, genes, …), "Evolution" (evolution, evolutionary, species, organisms, life, …), "Disease" (disease, host, bacteria, diseases, resistance, …), and "Computers" (computer, models, information, data, computers, …).]

LDA summary
• LDA is a powerful model for
• visualizing the hidden thematic structure in large corpora;
• generalizing new data to fit into that structure.
• LDA is a mixed-membership model (Erosheva, 2004) that builds on the work of Deerwester et al. (1990) and Hofmann (1999).
• For document collections and other grouped data, this can be more appropriate than a simple finite mixture.

LDA summary
• Modular: it can be embedded in more complicated models.
• E.g., syntax and semantics; authorship; word sense.
• General: the data-generating distribution can be changed.
• E.g., images; social networks; population-genetics data.
• Variational inference is fast; it lets us analyze large data sets.
• See Blei et al. (2003) for details and a quantitative comparison.
• Code to play with LDA is freely available on my web site, http://www.cs.princeton.edu/∼blei.

LDA summary
• But LDA makes certain assumptions about the data. When are they appropriate?
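As an aside, tables like the example topics above are read directly off a fit: each topic β_k is a distribution over the vocabulary, and we list its highest-probability words. A minimal sketch, with random stand-ins for the fitted topics and a toy vocabulary:

```python
# Extract the most probable words from each topic of a (stand-in) fit.
import numpy as np

rng = np.random.default_rng(2)
vocab = ["human", "genome", "dna", "evolution", "species", "disease",
         "host", "computer", "models", "data"]
K = 3
beta = rng.dirichlet([0.1] * len(vocab), size=K)   # stand-in for fitted topics

def top_words(beta_k, vocab, n=3):
    """Return the n most probable words under one topic."""
    order = np.argsort(beta_k)[::-1]
    return [vocab[i] for i in order[:n]]

for k in range(K):
    print(f"topic {k}:", ", ".join(top_words(beta[k], vocab)))
```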
Outline
1 Introduction
2 Latent Dirichlet allocation
3 Dynamic topic models
4 Correlated topic models

LDA and exchangeability
• LDA assumes that documents are exchangeable, i.e., that their joint probability is invariant to permutation.
• This is too restrictive.

Documents are not exchangeable
"Instantaneous Photography" (1890) vs. "Infrared Reflectance in Leaf-Sitting Neotropical Frogs" (1977)
• Documents about the same topic are not exchangeable.
• Topics evolve over time.

Dynamic topic model
• Divide the corpus into sequential slices (e.g., by year).
• Assume each slice's documents are exchangeable, drawn from an LDA model.
• Allow the topic distributions to evolve from slice to slice.

Dynamic topic models
[Graphical model: one LDA model per time slice, with each topic β_{k,t} chained to β_{k,t−1} across slices.]

Modeling evolving topics
• Use a logistic normal distribution to model evolving topics (Aitchison, 1980).
• This is a state-space model on the natural parameters of the topic multinomial (West and Harrison, 1997):

  β_{t,k} | β_{t−1,k} ∼ N(β_{t−1,k}, σ²I)

  p(w | β_{t,k}) = exp{ β_{t,k,w} − log(1 + Σ_{v=1}^{V−1} exp{β_{t,k,v}}) }

Posterior inference
• Our goal is to compute the posterior distribution

  p(β_{1:T,1:K}, θ_{1:T,1:D}, z_{1:T,1:D} | w_{1:T,1:D}).

• Exact inference is impossible:
• per-document mixed-membership model;
• non-conjugacy between p(w | β_{t,k}) and p(β_{t,k}).
• MCMC is not practical for this amount of data.
• Solution: variational inference.

Science data
[Sample document: "Techview: DNA Sequencing. Sequencing the Genome, Fast," by James C. Mullikin and Amanda A. McMurray — "Genome sequencing projects reveal the genetic makeup of an organism by reading off the sequence of the DNA bases, which encodes all of the information necessary for the life of the organism. The base sequence contains four nucleotides — adenine, thymidine, guanosine, and cytosine — which are linked together into long double-helical chains. Over the last two decades, automated DNA sequencers have made the process of obtaining the base-by-base sequence of DNA..."]
• Analyze JSTOR's entire collection of Science (1880–2002).
• Restrict to 30K terms that occur more than ten times.
• The data are 76M words in 130K documents.

Analyzing a document
[Figure: an original article alongside its inferred topic proportions.]

Analyzing a document
[Figure: the same article with the most likely words from its top topics — e.g. "sequence, genome, genes, sequences, human, gene, dna, sequencing"; "devices, device, materials, current, gate, light, silicon"; "data, information, network, web, computer, language, networks".]
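The state-space evolution of topics described above is simple to simulate: a topic's natural parameters follow a Gaussian random walk, and each slice's word distribution is their softmax. A sketch with toy sizes (not the talk's settings):

```python
# Simulate the DTM's topic dynamics: beta_t | beta_{t-1} ~ N(beta_{t-1},
# sigma^2 I), then map natural parameters to word distributions.
import numpy as np

rng = np.random.default_rng(3)
V, T, sigma = 6, 5, 0.5            # toy vocab size, time slices, drift scale

beta = np.zeros((T, V))
beta[0] = rng.normal(size=V)
for t in range(1, T):
    # Gaussian random walk on the natural parameters.
    beta[t] = beta[t - 1] + sigma * rng.normal(size=V)

def softmax(b):
    """Map natural parameters to a distribution over the vocabulary."""
    e = np.exp(b - b.max())        # subtract the max for numerical stability
    return e / e.sum()

word_dists = np.array([softmax(beta[t]) for t in range(T)])
print(word_dists.round(2))         # one word distribution per time slice
```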
Analyzing a topic
[Table: the ten most probable words in one topic, by decade:]
1880: electric, machine, power, engine, steam, two, machines, iron, battery, wire
1890: electric, power, company, steam, electrical, machine, two, system, motor, engine
1900: apparatus, steam, power, engine, engineering, water, construction, engineer, room, feet
1910: air, water, engineering, apparatus, room, laboratory, engineer, made, gas, tube
1920: apparatus, tube, air, pressure, water, glass, gas, made, laboratory, mercury
1930: tube, apparatus, glass, air, mercury, laboratory, pressure, made, gas, small
1940: air, tube, apparatus, glass, laboratory, rubber, pressure, small, mercury, gas
1950: tube, apparatus, glass, air, chamber, instrument, small, laboratory, pressure, rubber
1960: tube, system, temperature, air, heat, chamber, power, high, instrument, control
1970: air, heat, power, system, temperature, chamber, high, flow, tube, design
1980: high, power, design, heat, system, systems, devices, instruments, control, large
1990: materials, high, power, current, applications, technology, devices, design, device, heat
2000: devices, device, materials, current, gate, high, light, silicon, material, technology

Visualizing trends within a topic
[Figure: posterior word probabilities over time (1880–2000) for "Theoretical Physics" (RELATIVITY, LASER, FORCE) and "Neuroscience" (NERVE, OXYGEN, NEURON).]

Time-corrected document similarity
• Consider the expected Hellinger distance between the topic proportions of two documents,

  d_{ij} = E[ Σ_{k=1}^{K} ( √θ_{i,k} − √θ_{j,k} )² | w_i, w_j ].

• This uses the latent structure to define similarity.
• Time has been factored out, because the topics associated with the components differ from year to year.
• Similarity is based only on the topic proportions.
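The time-corrected similarity above, as code. For simplicity this sketch plugs point estimates of θ into the squared Hellinger distance, rather than taking the full posterior expectation:

```python
# Squared Hellinger distance between two documents' topic proportions.
import numpy as np

def hellinger_sq(theta_i, theta_j):
    """sum_k (sqrt(theta_ik) - sqrt(theta_jk))^2 for two points on the simplex."""
    return float(np.sum((np.sqrt(theta_i) - np.sqrt(theta_j)) ** 2))

doc_a = np.array([0.7, 0.2, 0.1])     # toy topic proportions
doc_b = np.array([0.6, 0.3, 0.1])
doc_c = np.array([0.1, 0.1, 0.8])

# doc_a is closer to doc_b (similar proportions) than to doc_c.
print(hellinger_sq(doc_a, doc_b), hellinger_sq(doc_a, doc_c))
```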
Time-corrected document similarity
[Example: documents most similar to "The Brain of the Orang" (1880).]

Time-corrected document similarity
[Example: documents most similar to "Representation of the Visual Field on the Medial Wall of Occipital-Parietal Cortex in the Owl Monkey" (1976).]

Browser of Science
[Screenshot: a topic-based browser of the Science archive.]

Quantitative comparison
• Compute the probability of each year's documents conditional on all the previous years' documents,

  p(w_t | w_1, …, w_{t−1}).

• Compare exchangeable (LDA) and dynamic topic models.

Quantitative comparison
[Figure: per-word negative log likelihood by year (1920–2000) for LDA and the DTM.]

Outline
1 Introduction
2 Latent Dirichlet allocation
3 Dynamic topic models
4 Correlated topic models

The hidden assumptions of the Dirichlet distribution
• The Dirichlet is an exponential-family distribution on the simplex, the set of positive vectors that sum to one.
• However, the near independence of its components makes it a poor choice for modeling topic proportions.
• An article about fossil fuels is more likely to also be about geology than about genetics.

The logistic normal distribution
• The logistic normal is a distribution on the simplex that can model dependence between components.
• The natural parameters of the multinomial are drawn from a multivariate Gaussian distribution:

  X ∼ N_{K−1}(μ, Σ)

  θ_i = exp{ x_i − log(1 + Σ_{j=1}^{K−1} exp{x_j}) }

Correlated topic model (CTM)
[Graphical model: as in LDA, but the per-document topic proportions η_d are drawn from N(μ, Σ) and mapped to the simplex.]
• Draw topic proportions from a logistic normal, so that topic occurrences can exhibit correlation.
• Use it for:
• providing a "map" of topics and how they are related;
• better prediction via correlated topics.
[Figure: the topic graph of the correlated topic model fit to Science — topics such as {wild type, mutant, mutations}, {p53, cell cycle, cyclin}, {sequence, sequences, genome, dna sequencing}, {earthquake, earthquakes, fault}, {climate, ocean, ice, climate change}, {virus, hiv, aids, infection}, and {stars, astronomers, universe, galaxies}, with edges connecting topics that tend to co-occur.]
Summary
• Topic models provide useful descriptive statistics for analyzing and understanding the latent structure of large text collections.
• Probabilistic graphical models are a useful way to express assumptions about the hidden structure of complicated data.
• Variational methods allow us to perform posterior inference, automatically inferring that structure from large data sets.
• Current research:
• choosing the number of topics;
• continuous-time dynamic topic models;
• topic models for prediction;
• inferring the impact of a document.

"We should seek out unfamiliar summaries of observational material, and establish their useful properties... And still more novelty can come from finding, and evading, still deeper lying constraints." (John Tukey, The Future of Data Analysis, 1962)

Supervised topic models (with Jon McAuliffe)
• Most topic models are unsupervised; they are fit by maximizing the likelihood of a collection of documents.
• Consider documents paired with response variables, for example:
• movie reviews paired with a number of stars;
• web pages paired with a number of "diggs."
• We develop supervised topic models: models of documents and responses, fit to find topics predictive of the response.

Supervised LDA
[Graphical model: LDA augmented with a per-document response Y_d, governed by coefficients η and variance σ².]
1 Draw topic proportions θ | α ∼ Dir(α).
2 For each word:
  1 Draw topic assignment z_n | θ ∼ Mult(θ).
  2 Draw word w_n | z_n, β_{1:K} ∼ Mult(β_{z_n}).
3 Draw the response variable y | z_{1:N}, η, σ² ∼ N(η^T z̄, σ²), where z̄ = (1/N) Σ_{n=1}^{N} z_n.

Comments
• sLDA is used as follows:
• fit the coefficients and topics from a collection of document–response pairs;
• use the fitted model to predict the responses of previously unseen documents,

  E[Y | w_{1:N}, α, β_{1:K}, η, σ²] = η^T E[z̄ | w_{1:N}, α, β_{1:K}].
• The process enforces that the document is generated first, followed by the response. The response is generated from the particular topics that were realized in generating the document.

Example: movie reviews
[Figure: the ten topics of the fitted sLDA model, ordered by their coefficients (roughly −30 to 20) — from strongly negative topics ("awful, featuring, routine, dry"; "bad, guys, watchable"; "least, problem, unfortunately, supposed, worse, flat, dull") to strongly positive ones ("both, motion, simple, perfect, fascinating, power, complex"; "cinematography, screenplay, performances, pictures, effective").]
• We fit a 10-topic sLDA model to movie-review data (Pang and Lee, 2005).
• The documents are the words of the reviews.
• The responses are the number of stars associated with each review (modeled as continuous).
• Each component of the coefficient vector η is associated with a topic.

Simulations
[Figure: predictive R² and per-word held-out log likelihood as a function of the number of topics, for sLDA and LDA, on the movie corpus and the Digg corpus.]

Diversion: variational inference
• Let x_{1:N} be observations and z_{1:M} be latent variables.
• Our goal is to compute the posterior distribution

  p(z_{1:M} | x_{1:N}) = p(z_{1:M}, x_{1:N}) / ∫ p(z_{1:M}, x_{1:N}) dz_{1:M}.

• For many interesting distributions, the marginal likelihood of the observations is difficult to compute efficiently.
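For a model small enough to enumerate, both the difficulty and the variational workaround can be made concrete: compute log p(x) exactly, and check that E_q[log p(z, x)] − E_q[log q(z)] lower-bounds it for any q, with equality at the true posterior. The two-state model below is a toy illustration:

```python
# Tiny numerical check of the variational lower bound on log p(x).
import numpy as np

# Toy model: z in {0, 1} with prior p(z) and likelihood p(x | z) of the
# single observed x.
p_z = np.array([0.6, 0.4])
p_x_given_z = np.array([0.2, 0.9])

joint = p_z * p_x_given_z                 # p(z, x) for each value of z
log_px = np.log(joint.sum())              # exact evidence (enumerable here)

def elbo(q):
    """E_q[log p(z, x)] - E_q[log q(z)] for a distribution q over z."""
    q = np.asarray(q, float)
    return float(np.sum(q * (np.log(joint) - np.log(q))))

posterior = joint / joint.sum()
print(elbo([0.5, 0.5]) <= log_px)           # any q lower-bounds log p(x)
print(np.isclose(elbo(posterior), log_px))  # tight at the true posterior
```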
Variational inference
• Use Jensen's inequality to bound the log probability of the observations:

  log p(x_{1:N}) ≥ E_{q_ν}[log p(z_{1:M}, x_{1:N})] − E_{q_ν}[log q_ν(z_{1:M})].

• We have introduced a distribution over the latent variables with free variational parameters ν.
• We optimize those parameters to tighten this bound.
• This is the same as finding the member of the family q_ν that is closest in KL divergence to p(z_{1:M} | x_{1:N}).

Mean-field variational inference
• The complexity of the optimization is determined by the factorization of q_ν.
• In mean-field variational inference, q_ν is fully factored:

  q_ν(z_{1:M}) = ∏_{m=1}^{M} q_{ν_m}(z_m).

• The latent variables are independent under q, each governed by its own variational parameter ν_m.
• In the true posterior they can exhibit dependence (often, this is what makes exact inference difficult).

MFVI and conditional exponential families
• Suppose the distribution of each latent variable conditional on the observations and the other latent variables is in the exponential family:

  p(z_m | z_{−m}, x) = h_m(z_m) exp{ g_m(z_{−m}, x)^T z_m − a_m(g_m(z_{−m}, x)) }.

• Assume q_ν is fully factorized and each factor is in the same exponential family:

  q_{ν_m}(z_m) = h_m(z_m) exp{ ν_m^T z_m − a_m(ν_m) }.

MFVI and conditional exponential families
• Variational inference is then the following coordinate-ascent algorithm:

  ν_m = E_{q_ν}[ g_m(Z_{−m}, x) ].

• Notice the relationship to Gibbs sampling.

Variational family for the DTM
[Figure: the topic chain β_{k,1}, β_{k,2}, …, β_{k,T} with free variational "observations" β̂_{k,1}, …, β̂_{k,T}.]
• The distribution over θ and z is fully factorized (Blei et al., 2003).
• The distribution over {β_{1,k}, …, β_{T,k}} is a variational Kalman filter:
• a Gaussian state-space model with free observations β̂_{k,t};
• fit the observations so that the corresponding posterior over the chain is close to the true posterior.

Variational family for the DTM
• Given a document collection, run coordinate ascent on all the variational parameters until the KL divergence converges.
• This yields a distribution close to the true posterior of interest.
• Take expectations with respect to the simpler variational distribution.
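The coordinate-ascent scheme above can be seen in action on the simplest case with dependent latents: take a bivariate Gaussian as a stand-in "posterior" and fit a fully factorized q. The means converge to the true means, while the factorized variances understate the true marginal variances — a known property of mean-field. The numbers below are toy values:

```python
# Mean-field coordinate ascent for a correlated bivariate Gaussian
# target, approximated by a fully factorized Gaussian q.
import numpy as np

mu = np.array([1.0, -2.0])                 # true posterior mean
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])             # correlated components
Lam = np.linalg.inv(Sigma)                 # precision matrix

m = np.array([10.0, 10.0])                 # variational means, poor init
for _ in range(100):                       # coordinate-ascent sweeps
    # Update q(z1): its mean is the conditional expectation given E[z2].
    m[0] = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m[1] - mu[1])
    # Update q(z2), using the freshly updated m[0].
    m[1] = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m[0] - mu[0])

var = 1.0 / np.diag(Lam)                   # factorized variances
print(m.round(4), var.round(4))
```

Note how each update is an expectation under the other factor, mirroring the ν_m = E_{q_ν}[g_m(Z_{−m}, x)] updates, and how it resembles Gibbs sampling with the draw replaced by an expectation.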