Introduction to Natural Language Processing (600.465)
HMM Parameter Estimation: The Baum-Welch Algorithm
Dr. Jan Hajič, CS Dept., Johns Hopkins University
hajic@cs.jhu.edu, www.cs.jhu.edu/~hajic

HMM: The Tasks
- HMM (the general case): a five-tuple (S, S0, Y, PS, PY), where:
  - S = {s1, s2, ..., sT} is the set of states, S0 is the initial state,
  - Y = {y1, y2, ..., y|Y|} is the output alphabet,
  - PS(sj|si) is the set of probability distributions of transitions,
  - PY(yk|si, sj) is the set of output (emission) probability distributions.
- Given an HMM and an output sequence Y = {y1, y2, ..., yk}:
  - (Task 1) compute the probability of Y;
  - (Task 2) compute the most likely sequence of states which has generated Y;
  - (Task 3) estimate the parameters (the transition/output distributions).

A Variant of EM
- Idea (∼EM; for another variant see LM smoothing):
  - Start with (possibly random) estimates of PS and PY.
  - Compute (fractional) "counts" of the state transitions/emissions taken, from PS and PY, given the data Y.
  - Adjust the estimates of PS and PY from these "counts" (using MLE, i.e. relative frequency as the estimate).
- Remarks:
  - many more parameters than in the simple four-way smoothing;
  - no proofs here; see Jelinek, Chapter 9.

Setting
- HMM (without PS, PY): (S, S0, Y), and data T = {yi ∈ Y}i=1..|T|
  - we will use T ∼ |T|
- HMM structure is given: (S, S0)
- PS: typically, one wants to allow a "fully connected" graph (i.e. no transitions forbidden ∼ no transitions set to a hard 0)
  - why? → better to leave that to the learning phase, based on the data!
  - sometimes it is possible to remove some transitions ahead of time
- PY: should be restricted (if not, we will not get anywhere!)
  - restricted ∼ hard 0 probabilities of p(y|s,s′)
- "Dictionary": states ↔ words, an "m:n" mapping on S × Y (in general)

Initialization
- For computing the initial expected "counts".
- Important part:
  - EM is guaranteed to find only a local maximum (albeit a good one in most cases).
- PY initialization is more important:
  - fortunately, often easy to determine;
  - together with the dictionary ↔ vocabulary mapping, get counts, then MLE.
- PS initialization is less important:
  - e.g. a uniform distribution for each p(·|s).

Data Structures
- Will need storage for:
  - the predetermined structure of the HMM (unless fully connected → no need to keep it!);
  - the parameters to be estimated (PS, PY);
  - the expected counts (same size as (PS, PY));
  - the training data T = {yi ∈ Y}i=1..|T|;
  - the trellis (if fully connected).

The Algorithm, Part I
1. Initialize PS, PY.
2. Compute "forward" probabilities:
   - follow the procedure for the trellis (summing), computing α(s, i) everywhere;
   - use the current values of PS, PY (p(s′|s), p(y|s,s′)):
     α(s′, i) = Σs→s′ α(s, i−1) × p(s′|s) × p(yi|s,s′)
   - NB: do not throw away the previous stage!
3. Compute "backward" probabilities:
   - start at all nodes of the last stage and proceed backwards, computing β(s, i),
   - i.e., the probability of the "tail" of the data from stage i to the end of the data:
     β(s′, i) = Σs′←s β(s, i+1) × p(s|s′) × p(yi+1|s′,s)
   - also keep the β(s, i) at all trellis states.

The Algorithm, Part II
4. Collect counts: for each output/transition pair compute
   c(s, s′) = Σy∈Y c(y, s, s′)   (assuming all observed yi are in Y)
   c(s) = Σs′∈S c(s, s′)
5. Reestimate:
   p′(s′|s) = c(s, s′)/c(s)
   p′(y|s, s′) = c(y, s, s′)/c(s, s′)
6. Repeat steps 2-5 until the desired convergence limit is reached.
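To make the procedure concrete, here is a minimal Python sketch of one Baum-Welch iteration for the arc-emission HMM above (outputs generated on transitions, p(y|s,s′)). The data layout (trans, emit, data) and the function name are illustrative assumptions, not part of the original notes; the per-stage normalization recommended in the "Tips & Tricks" slide that follows is replaced here by a single division by P(Y), which is fine only for short data.

```python
from collections import defaultdict

def baum_welch_step(states, s0, trans, emit, data):
    """One forward-backward reestimation step (a sketch, not the original code).
    trans[s][s2] = p(s2|s); emit[(s, s2)][y] = p(y|s, s2); data = outputs y1..yk."""
    k = len(data)
    # Step 2: forward probabilities alpha[i][s] = P(y1..yi, state s at stage i)
    alpha = [defaultdict(float) for _ in range(k + 1)]
    alpha[0][s0] = 1.0
    for i in range(1, k + 1):
        y = data[i - 1]
        for s in states:
            for s2 in states:
                alpha[i][s2] += alpha[i - 1][s] * trans[s].get(s2, 0.0) * emit.get((s, s2), {}).get(y, 0.0)
    # Step 3: backward probabilities beta[i][s] = P(y_{i+1}..yk | state s at stage i)
    beta = [defaultdict(float) for _ in range(k + 1)]
    for s in states:
        beta[k][s] = 1.0
    for i in range(k - 1, -1, -1):
        y = data[i]
        for s in states:
            for s2 in states:
                beta[i][s] += beta[i + 1][s2] * trans[s].get(s2, 0.0) * emit.get((s, s2), {}).get(y, 0.0)
    total = sum(alpha[k].values())          # P(Y) under the current parameters
    if total == 0.0:
        return trans, emit                  # data impossible under current parameters; nothing to update
    # Step 4: collect fractional counts c(y,s,s'), c(s,s'), c(s)
    c_ysd, c_sd, c_s = defaultdict(float), defaultdict(float), defaultdict(float)
    for i in range(k):
        y = data[i]
        for s in states:
            for s2 in states:
                inc = (alpha[i][s] * trans[s].get(s2, 0.0) * emit.get((s, s2), {}).get(y, 0.0)
                       * beta[i + 1][s2] / total)
                c_ysd[(y, s, s2)] += inc
                c_sd[(s, s2)] += inc
                c_s[s] += inc
    # Step 5: reestimate by relative frequency (MLE from the fractional counts)
    new_trans = {s: {s2: c_sd[(s, s2)] / c_s[s] for s2 in states if c_s[s] > 0.0} for s in states}
    new_emit = {(s, s2): {y: c_ysd[(y, s, s2)] / c_sd[(s, s2)] for y in set(data)}
                for s in states for s2 in states if c_sd[(s, s2)] > 0.0}
    return new_trans, new_emit
```

Repeating baum_welch_step until the estimates stop changing implements step 6.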
Baum-Welch: Tips & Tricks
- Normalization is badly needed:
  - long training data → extremely small probabilities.
- Normalize α and β using the same normalization factor N(i) = Σs∈S α(s, i), as follows (see also the code sketch at the end of these Baum-Welch notes):
  - compute α(s, i) as usual (Step 2 of the algorithm), computing the sum N(i) at the given stage i as you go;
  - at the end of each stage, recompute all alphas (for each state s): α*(s, i) = α(s, i)/N(i);
  - use the same N(i) for the βs at the end of each backward (Step 3) stage: β*(s, i) = β(s, i)/N(i).

Example
- Task: pronunciation of "the".
- Solution: build an HMM, fully connected, with 4 states:
  - S - short article, L - long article, C, V - word starting with a consonant / a vowel;
  - thus only "the" is ambiguous (a, an, the are not members of C, V).
- Output from states only (p(w|s,s′) = p(w|s′)).

Example: Initialization
- Output probabilities:
  - pinit(w|c) = c(c, w)/c(c), where c(S, the) = c(L, the) = c(the)/2
  - (other than that, everything is deterministic).
- Transition probabilities:
  - pinit(c′|c) = 1/4 (uniform).
- Don't forget:
  - about the space needed;
  - initialize α(X, 0) = 1 (X: the never-occurring front-buffer state);
  - initialize β(s, T) = 1 for all s (except for s = X).

Fill in Alpha, Beta
- Left to right, alpha: α(s′, i) = Σs→s′ α(s, i−1) × p(s′|s) × p(wi|s′), where the output wi is emitted by the state s′.
- Remember the normalization (N(i)).
- Similarly beta, on the way back from the end.

Counts & Reestimation
- One pass through the data.
- At each position i, go through all pairs (si, si+1).
- Increment the appropriate counters by the fractional counts (Step 4):
  - inc(yi+1, si, si+1) = α(si, i) × p(si+1|si) × p(yi+1|si+1) × β(si+1, i+1)
  - c(y, si, si+1) += inc (for the output y at position i+1)
  - c(si, si+1) += inc (always)
  - c(si) += inc (always)
- For example, if the word at position 8 is "big":
  - inc(big, L, C) = α(L, 7) × p(C|L) × p(big|C) × β(C, 8)
  - inc(big, S, C) = α(S, 7) × p(C|S) × p(big|C) × β(C, 8)
- Reestimate p(s′|s), p(y|s), and hope for an increase in p(C|S) and p(V|L)!

HMM: Final Remarks
- Parameter "tying":
  - keep certain parameters the same (∼ just one "counter" for all of them);
  - any combination is possible in principle;
  - example: smoothing (just one set of lambdas).
- Real-number outputs:
  - output alphabet Y of infinite size (R, Rⁿ);
  - a parametric distribution (typically with few parameters) is needed (e.g., a "Gaussian").
- "Empty" transitions: do not generate output
  - ∼ vertical arcs in the trellis; do not use them in "counting".
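As a companion to the "Tips & Tricks" slide above, here is a minimal sketch of the scaled forward pass. It reuses the assumed trans/emit layout of the earlier sketch; the function name and return values are illustrative, not from the original notes. A pleasant side effect of this scaling is that the log-likelihood can be recovered as log P(Y) = Σi log N(i).

```python
from collections import defaultdict

def scaled_forward(states, s0, trans, emit, data):
    """Forward pass with per-stage normalization: alpha*(s,i) = alpha(s,i)/N(i),
    where N(i) = sum over s of alpha(s,i). Returns the scaled alphas and the
    factors N(i); the same N(i) are reused to scale the betas on the way back."""
    k = len(data)
    alphas = [defaultdict(float)]
    alphas[0][s0] = 1.0
    norms = []
    for i in range(1, k + 1):
        y = data[i - 1]
        cur = defaultdict(float)
        for s in states:
            for s2 in states:
                cur[s2] += alphas[i - 1][s] * trans[s].get(s2, 0.0) * emit.get((s, s2), {}).get(y, 0.0)
        n = sum(cur.values()) or 1.0        # N(i); the 'or 1.0' guards an all-zero stage
        for s2 in cur:
            cur[s2] /= n                    # alpha*(s,i) = alpha(s,i)/N(i)
        alphas.append(cur)
        norms.append(n)
    return alphas, norms                    # log P(Y) = sum of log N(i)
```

The betas would then be computed as in Step 3 and divided by the same norms entry at each stage, exactly as the slide prescribes.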
Introduction to Natural Language Processing (600.465)
HMM Tagging
Dr. Jan Hajič, CS Dept., Johns Hopkins University
hajic@cs.jhu.edu, www.cs.jhu.edu/~hajic

Review
- Recall: tagging ∼ morphological disambiguation.
- tagset VT ⊂ (C1, C2, ..., Cn)
  - Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
- mapping w → {t ∈ VT} exists
  - a restriction of Morphological Analysis: A+ → 2^(L, C1, C2, ..., Cn), where A is the language alphabet and L is the set of lemmas
- extension to punctuation, sentence boundaries (treated as words)

The Setting
- Noisy Channel setting.
- Goal (as usual): discover the "input" to the channel (T, the tag sequence) given the "output" (W, the word sequence):
  - p(T|W) = p(W|T) p(T) / p(W)
  - p(W) is fixed (W is given), so argmaxT p(T|W) = argmaxT p(W|T) p(T).

The Model
- Two models (d = |W| = |T| is the sequence length):
  - p(W|T) = Πi=1..d p(wi|w1,...,wi−1, t1,...,td)
  - p(T) = Πi=1..d p(ti|t1,...,ti−1)
- Too many parameters (as always).
- Approximation using the following assumptions:
  - words do not depend on the context;
  - the tag depends on a limited history: p(ti|t1,...,ti−1) ≅ p(ti|ti−n+1,...,ti−1) (an n-gram tag "language" model);
  - a word depends on its tag only: p(wi|w1,...,wi−1, t1,...,td) ≅ p(wi|ti).

The HMM Model Definition
- (Almost) general HMM:
  - output (words) emitted by states (not arcs);
  - states: (n−1)-tuples of tags if an n-gram tag model is used;
  - a five-tuple (S, s0, Y, PS, PY), where:
    - S = {s0, s1, ..., sT} is the set of states, s0 is the initial state,
    - Y = {y1, y2, ...} is the output alphabet (the words),
    - PS(sj|si) is the set of probability distributions of transitions:
      PS(sj|si) = p(ti|ti−n+1,...,ti−1), where sj = (ti−n+2,...,ti) and si = (ti−n+1,...,ti−1),
    - PY(yk|si) is the set of output (emission) probability distributions;
      another simplification: PY(yk|si) = PY(yk|sj) if si and sj contain the same rightmost tag, so that PY(yk|si) = p(wi|ti).

Supervised Learning (Manually Annotated Data Available)
- Use MLE:
  - p(wi|ti) = cwt(ti, wi)/ct(ti)
  - p(ti|ti−n+1,...,ti−1) = ctn(ti−n+1,...,ti−1, ti)/ct(n−1)(ti−n+1,...,ti−1)
- Smooth (both!):
  - p(wi|ti): "add 1" for all possible (tag, word) pairs using a predefined dictionary (thus some zeros are kept!);
  - p(ti|ti−n+1,...,ti−1): linear interpolation, e.g. for a trigram model:
    pλ(ti|ti−2, ti−1) = λ3 p(ti|ti−2, ti−1) + λ2 p(ti|ti−1) + λ1 p(ti) + λ0/|VT|

Unsupervised Learning
- Completely unsupervised learning is impossible:
  - at least if only the tagset is given: how would we associate words with tags?
- Assumed (minimal) setting:
  - tagset known;
  - dictionary/morphological analysis available (providing the possible tags for any word).
- Use:
  - the Baum-Welch algorithm (see class 15, 10/13);
  - "tying": output (state-emitting only; the same distribution for two states with the same "final" tag).

Comments on Unsupervised Learning
- Initialization of Baum-Welch:
  - if some annotated data are available, use them;
  - keep 0 for impossible output probabilities.
- Beware of:
  - degradation of accuracy (the Baum-Welch criterion is entropy, not accuracy!);
  - use heldout data for cross-checking.
- Supervised training is almost always better.

Unknown Words
- "OOV" words (out-of-vocabulary):
  - we do not have a list of possible tags for them,
  - and we certainly have no output probabilities.
- Solutions:
  - try all tags (uniform distribution);
  - try open-class tags (uniform or unigram distribution);
  - try to "guess" the possible tags (based on the suffix/ending): use a different output distribution based on the ending (and/or other factors, such as capitalization).

Running the Tagger
- Use Viterbi (a code sketch follows below):
  - remember to handle unknown words;
  - single-best or n-best output possible.
- Another option:
  - always assign the best tag at each word, but consider all possibilities for the previous tags (no back pointers, no path-backpass);
  - introduces random errors and implausible sequences, but might get higher accuracy (fewer secondary errors).
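A minimal sketch of Viterbi decoding for the bigram (n = 2) version of the tagging HMM above, working in log space to avoid underflow. The interfaces tags_for, p_trans and p_emit are assumptions made for illustration; tags_for is expected to handle unknown words (e.g. by returning the open-class tags).

```python
import math

def viterbi_tag(words, tags_for, p_trans, p_emit, start="<s>"):
    """Single-best tag sequence for a bigram tag model:
    p_trans(t, t_prev) = p(t|t_prev), p_emit(w, t) = p(w|t);
    start is the assumed sentence-boundary tag known to p_trans."""
    # best[t] = (log prob of the best path ending in tag t, that path as a list of tags)
    best = {start: (0.0, [])}
    for w in words:
        new_best = {}
        for t in tags_for(w):                       # only tags the dictionary allows for w
            emit = p_emit(w, t)
            if emit == 0.0:
                continue                            # respect the zeros kept by smoothing
            candidates = [
                (lp + math.log(p_trans(t, t_prev)) + math.log(emit), path + [t])
                for t_prev, (lp, path) in best.items() if p_trans(t, t_prev) > 0.0
            ]
            if candidates:
                new_best[t] = max(candidates)       # keep only the best predecessor
        best = new_best
    return max(best.values())[1] if best else []    # the highest-scoring tag sequence
```

For an n-gram tag model the same code applies, with states being (n−1)-tuples of tags instead of single tags.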
(Tagger) Evaluation
- A must. Test data (S), previously unseen (in training):
  - change the test data often, if at all possible! ("feedback cheating")
- Error-rate based.
- Formally:
  - Out(w) = the set of output "items" for an input "item" w
  - True(w) = the single correct output (annotation) for w
  - Errors(S) = Σi=1..|S| δ(Out(wi) ≠ True(wi))
  - Correct(S) = Σi=1..|S| δ(True(wi) ∈ Out(wi))
  - Generated(S) = Σi=1..|S| |Out(wi)|

Evaluation Metrics
- Accuracy: single output (tagging: each word gets a single tag):
  - Error rate: Err(S) = Errors(S)/|S|
  - Accuracy: Acc(S) = 1 − Errors(S)/|S| = 1 − Err(S)
- What if there are multiple (or no) outputs?
  - Recall: R(S) = Correct(S)/|S|
  - Precision: P(S) = Correct(S)/Generated(S)
  - Combination, the F measure: F = 1/(α/P + (1−α)/R)
    - α is the weight given to precision vs. recall; for α = 0.5, F = 2PR/(R+P).

Introduction to Natural Language Processing (600.465)
Transformation-Based Tagging
Dr. Jan Hajič, CS Dept., Johns Hopkins University
hajic@cs.jhu.edu, www.cs.jhu.edu/~hajic

The Task, Again
- Recall: tagging ∼ morphological disambiguation.
- tagset VT ⊂ (C1, C2, ..., Cn)
  - Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
- mapping w → {t ∈ VT} exists
  - a restriction of Morphological Analysis: A+ → 2^(L, C1, C2, ..., Cn), where A is the language alphabet and L is the set of lemmas
- extension to punctuation, sentence boundaries (treated as words)

Setting
- Not a source-channel view.
- Not even a probabilistic model (no "numbers" are used when tagging a text after the model is developed).
- Statistical, yes:
  - uses training data (a combination of supervised [manually annotated data available] and unsupervised [plain text, large volume] training);
  - learning [rules];
  - criterion: accuracy (that's what we are interested in in the end, after all!).

The General Scheme
- (diagram in the original slides)

The Learner
- (diagram in the original slides)

The I/O of an Iteration
- In (iteration i):
  - intermediate data (initial, or the result of the previous iteration);
  - the TRUTH (the annotated training data);
  - the pool of possible rules.
- Out:
  - one rule rselected(i) to enhance the set of rules learned so far;
  - intermediate data (the input data transformed by the rule learned in this iteration, rselected(i)).

The Initial Assignment of Tags
- One possibility: NN.
- Another: the most frequent tag for a given word form.
- Even: use an HMM tagger for the initial assignment.
- Not particularly sensitive.

The Criterion
- Error rate (or accuracy):
  - at the beginning of an iteration: some error rate Ein;
  - each possible rule r, when applied at every data position:
    - makes an improvement somewhere in the data (cimproved(r)),
    - makes it worse at some places (cworsened(r)),
    - and, of course, does not touch the remaining data.
- Rule contribution to the improvement of the error rate:
  - contrib(r) = cimproved(r) − cworsened(r)
- Rule selection at iteration i:
  - rselected(i) = argmaxr contrib(r)
- New error rate: Eout = Ein − contrib(rselected(i))
  (a code sketch of this selection step follows the next slide)

The Stopping Criterion
- Obvious: no improvement can be made (contrib(r) ≤ 0), or the improvement is too small (contrib(r) ≤ Threshold).
- NB: prone to overtraining!
  - therefore, setting a reasonable threshold is advisable.
- Heldout data (H)?
  - maybe: remove rules which degrade performance on H.
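A minimal sketch of the rule-selection step just described: compute contrib(r) = cimproved(r) − cworsened(r) for every candidate rule and pick the argmax, subject to the threshold. The data representation and the rule interface (applies, new_tag) are assumptions made for illustration, not part of the original notes.

```python
def select_rule(rules, data, truth, threshold=0):
    """data: list of (word, current_tag); truth: list of correct tags; rules: candidate pool.
    Returns (best rule, its contribution), or (None, 0) when the stopping criterion applies."""
    best_rule, best_contrib = None, threshold
    for r in rules:
        improved = worsened = 0
        for i, (_, tag) in enumerate(data):
            if not r.applies(data, i):
                continue                          # the rule does not touch this position
            if tag != truth[i] and r.new_tag == truth[i]:
                improved += 1                     # a wrong tag gets corrected
            elif tag == truth[i] and r.new_tag != truth[i]:
                worsened += 1                     # a correct tag gets spoiled
        contrib = improved - worsened             # contrib(r) = c_improved(r) - c_worsened(r)
        if contrib > best_contrib:
            best_rule, best_contrib = r, contrib
    return best_rule, (best_contrib if best_rule else 0)
```

The learner of the "In Other Words..." slide below repeatedly calls this selection, applies the chosen rule to the intermediate data, and stops once None is returned.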
The Pool of Rules (Templates)
- Format: change the tag at position i from a to b / condition.
- Context rules: the condition is defined by a "template" over the neighboring tags/words.

Lexical Rules
- The other type: lexical rules.
- Examples:
  - wi has the suffix -ied
  - wi has the prefix ge-

Rule Application
- Two possibilities:
  - immediate consequences (left-to-right): a change made at position i is already visible as context when the rule is considered at the following positions;
  - delayed ("fixed input"): use the original input for the context (in the slide's worked example, the rule then applies twice).

In Other Words...
1. Strip the tags off the truth; keep the original truth.
2. Initialize the stripped data by some simple method.
3. Start with an empty set of selected rules, S.
4. Repeat until the stopping criterion applies:
   - compute the contribution of each rule r: contrib(r) = cimproved(r) − cworsened(r);
   - select the rule r with the biggest contribution contrib(r), add it to the final set of selected rules S, and apply it to the current intermediate data.
5. Output the set S.

The Tagger
- Input:
  - untagged data;
  - the rules (S) learned by the learner.
- Tagging (a code sketch follows at the end of these notes):
  - use the same initialization as the learner did;
  - for i = 1..n (n = the number of rules learned): apply rule i to the whole intermediate data, changing (some) tags;
  - the last intermediate data is the output.

N-best & Unsupervised Modifications
- N-best modification:
  - allow rules to add tags;
  - criterion: an optimal combination of accuracy and the number of tags per word (we want it close to 1).
- Unsupervised modification:
  - use only unambiguous words for the evaluation criterion;
  - works extremely well for English;
  - does not work for languages with few unambiguous words.
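To close, a minimal sketch of the tagger described in "The Tagger" above: initialize the tags the same way the learner did, then apply the learned rules in order. The rule interface (applies, new_tag) and initial_tag are the same illustrative assumptions used in the selection sketch earlier; the list comprehension implements the delayed ("fixed input") application, since every position is checked against the data as it was before the current rule changed anything.

```python
def tbl_tag(words, rules, initial_tag):
    """words: word forms; rules: the learned rules, in the order they were learned;
    initial_tag(w): the same initial assignment the learner used (e.g. most frequent tag)."""
    data = [(w, initial_tag(w)) for w in words]        # same initialization as the learner
    for r in rules:                                    # apply each rule to the whole intermediate data
        data = [(w, r.new_tag if r.applies(data, i) else t)
                for i, (w, t) in enumerate(data)]      # conditions see the pre-rule data ("fixed input")
        # for immediate left-to-right application, update data[i] in place inside a loop instead
    return [t for _, t in data]                        # the last intermediate data is the output
```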