HMM Algorithms: Trellis and Viterbi
PA154 Language Modeling (4.2)
Pavel Rychly
pary@fi.muni.cz March 9, 2023
Source: Introduction to Natural Language Processing (600.465) Jan Hajic, CS Dept., Johns Hopkins Univ. www.cs.jhu.edu/~hajic
HMM: The Two Tasks
■ HMM (the general case):
■ five-tuple $(S, S_0, Y, P_S, P_Y)$, where:
■ $S = \{s_1, s_2, \ldots, s_T\}$ is the set of states, $S_0$ is the initial state,
■ $Y = \{y_1, y_2, \ldots, y_V\}$ is the output alphabet,
■ $P_S(s_j \mid s_i)$ is the set of prob. distributions of transitions,
■ $P_Y(y_k \mid s_i, s_j)$ is the set of output (emission) probability distributions.
■ Given an HMM & an output sequence $Y = \{y_1, y_2, \ldots, y_k\}$:
(Task 1) compute the probability of Y;
(Task 2) compute the most likely sequence of states which has generated Y.
Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023
Trellis - Deterministic Output
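[Figure: the HMM and its trellis, with time/position t on the horizontal axis]
$p(\text{toe}) = .6 \times .88 \times 1 + .4 \times .1 \times 1 = .568$
■ trellis state: (HMM state, position)
■ each state holds one number (prob): $\alpha$
■ $\alpha(\times, 0) = 1$, $\alpha(A, 1) = .6$, $\alpha(C, 1) = .4$, $\alpha(D, 2) = .568$, $\alpha(B, 3) = .568$
■ probability of Y: $\sum \alpha$ in the last stage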
Creating the Trellis: The Start
■ Start in the start state (×),
■ set its $\alpha(\times, 0)$ to 1.
■ Create the first stage:
■ get the first "output" symbol $y_1$
■ create the first stage (column)
■ but only those trellis states which generate $y_1$
■ set their $\alpha(state, 1)$ to $P_S(state \mid \times) \times \alpha(\times, 0)$
■ ...and forget about the 0-th stage
[Figure: position/stage 0 and 1 of the trellis; $y_1$: t]
Trellis: The Next Step
■ Suppose we are in stage $i$,
■ Creating the next stage:
■ create all trellis states in the next stage which generate $y_{i+1}$, but only those reachable from any of the stage-$i$ states
■ set their $\alpha(state, i+1)$ to: $P_S(state \mid prev.state) \times \alpha(prev.state, i)$ (add up all such numbers on arcs going to a common trellis state)
■ ...and forget about stage $i$
Trellis: The Last Step
Continue until "output" exhausted
■ $|Y| = 3$: until stage 3
Add together all the $\alpha(state, |Y|)$. That's the $P(Y)$.
Observation (pleasant):
■ memory usage max: $2|S|$
■ multiplications max: $|S|^2 |Y|$
[Figure: last position/stage of the trellis; $\alpha = .568$, $P(Y) = .568$]
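The three steps (start, next step, last step) can be sketched in Python as follows. This is a minimal sketch under assumptions: the dictionaries `TRANSITIONS` and `OUTPUT` and the state names A, B, C, D are a reconstruction containing only the arcs that appear in the p(toe) sum, so the HMM in the original figure may have more arcs; the function name `forward` is illustrative.

```python
from collections import defaultdict

# Assumed reconstruction: only the arcs used in the slide's p(toe) sum.
TRANSITIONS = {
    "x": {"A": 0.6, "C": 0.4},
    "A": {"D": 0.88},
    "C": {"D": 0.1},
    "D": {"B": 1.0},
}
# Deterministic output: every state emits exactly one symbol (assumed labels).
OUTPUT = {"A": "t", "C": "t", "D": "o", "B": "e"}


def forward(outputs, start="x"):
    """Return P(outputs) by building the trellis one stage at a time."""
    alpha = {start: 1.0}  # stage 0: alpha(x, 0) = 1
    for symbol in outputs:
        nxt = defaultdict(float)
        for prev, a in alpha.items():
            for state, p in TRANSITIONS.get(prev, {}).items():
                if OUTPUT[state] == symbol:  # only states that generate the symbol
                    nxt[state] += p * a      # add up arcs into a common trellis state
        alpha = dict(nxt)  # forget the previous stage
    return sum(alpha.values())  # add together all alphas of the last stage


p_toe = forward("toe")
print(round(p_toe, 3))  # 0.568
```

Only two stages are alive at any time (`alpha` and `nxt`), which is where the 2|S| memory bound comes from.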
Trellis: The General Case (still, bigrams)
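Start as usual:
■ start state (×), set its $\alpha(\times, 0)$ to 1.
[Figure: general HMM with arcs labeled by output symbol and probability (e.g. o,.06; e,.12); $\alpha = 1$ in the start state]
$p(\text{toe}) = .48 \times .616 \times .6 + .2 \times 1 \times .176 + .2 \times 1 \times .12 \approx .237$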
General Trellis: The Next Step
We are in stage $i$:
■ Generate the next stage $i+1$ as before (except now arcs generate output, thus use only those arcs marked by the output symbol $y_{i+1}$)
■ For each generated state compute $\alpha(state, i+1) = \sum_{incoming\ arcs} P_Y(y_{i+1} \mid state, prev.state) \times \alpha(prev.state, i)$
■ ...and forget about stage $i$ as usual
[Figure: stages 0 and 1 of the general trellis; $\alpha = 1$ at stage 0, $\alpha = .48$ and $\alpha = .2$ at stage 1]
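When arcs rather than states generate the output, the same stage-by-stage computation can be sketched over a list of labeled arcs. The arc layout in `ARCS` below is hypothetical: it was chosen only so that the three products in the p(toe) sum (.48 × .616 × .6, .2 × 1 × .176, .2 × 1 × .12) appear as three paths, and it is not a transcription of the figure. The name `forward_arcs` is likewise illustrative.

```python
from collections import defaultdict

# Hypothetical arcs: (from_state, to_state, output_symbol, probability).
ARCS = [
    ("x", "A", "t", 0.48),
    ("x", "C", "t", 0.2),
    ("A", "B", "o", 0.616),
    ("C", "D", "o", 1.0),
    ("B", "B", "e", 0.6),
    ("D", "A", "e", 0.176),
    ("D", "B", "e", 0.12),
]


def forward_arcs(outputs, arcs=ARCS, start="x"):
    """Return P(outputs) for an HMM whose arcs emit the output symbols."""
    alpha = {start: 1.0}
    for symbol in outputs:
        nxt = defaultdict(float)
        for src, dst, out, p in arcs:
            # use only arcs marked with the current symbol and leaving a live state
            if out == symbol and src in alpha:
                nxt[dst] += p * alpha[src]
        alpha = dict(nxt)  # forget the previous stage
    return sum(alpha.values())


p_toe_general = forward_arcs("toe")
print(round(p_toe_general, 3))  # 0.237
```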
Trellis: The Complete Example
The Case of Trigrams
■ Like before, but:
■ states correspond to bigrams,
■ output function always emits the second output symbol of the pair (state) to which the arc goes:
Multiple paths not possible → trellis not really needed
Trigrams with Classes
More interesting:
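■ n-gram class LM: $p(w_i \mid w_{i-2}, w_{i-1}) = p(w_i \mid c_i)\, p(c_i \mid c_{i-2}, c_{i-1})$
→ states are pairs of classes $(c_{i-1}, c_i)$, and emit "words" (letters in our example)
Usual, non-overlapping classes: p(t|C) = 1, p(o|V) = .3, p(e|V) = .6, p(y|V) = .1
[Figure: class-pair HMM; V states emit o,e,y]
$p(\text{toe}) = .6 \times 1 \times .88 \times .3 \times .07 \times .6 \approx .00665$
$p(\text{teo}) = .6 \times 1 \times .88 \times .6 \times .07 \times .3 \approx .00665$
$p(\text{toy}) = .6 \times 1 \times .88 \times .3 \times .07 \times .1 \approx .00111$
$p(\text{tty}) = .6 \times 1 \times .12 \times 1 \times 1 \times .1 = .0072$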
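The four products can be checked with a short Python sketch of the class trigram model. The emission table comes from the slide; the transition table `TRANS` is an assumption inferred from the factors in the products (.6, .88, .12, .07, 1), with "x" standing for the start symbol, and it lists only the transitions those products use.

```python
# Class of each letter (non-overlapping classes, so the class is unique).
CLASS_OF = {"t": "C", "o": "V", "e": "V", "y": "V"}
# p(letter | class), from the slide.
EMIT = {("t", "C"): 1.0, ("o", "V"): 0.3, ("e", "V"): 0.6, ("y", "V"): 0.1}
# p(c_i | c_{i-2}, c_{i-1}); assumed from the factors in the products.
TRANS = {
    ("x", "x", "C"): 0.6,
    ("x", "C", "V"): 0.88,
    ("x", "C", "C"): 0.12,
    ("C", "V", "V"): 0.07,
    ("C", "C", "V"): 1.0,
}


def class_trigram_prob(word):
    """Multiply p(c_i | c_{i-2}, c_{i-1}) * p(w_i | c_i) along the single class path."""
    c2, c1 = "x", "x"
    prob = 1.0
    for letter in word:
        c = CLASS_OF[letter]
        prob *= TRANS.get((c2, c1, c), 0.0) * EMIT[(letter, c)]
        c2, c1 = c1, c
    return prob


for w in ("toe", "teo", "toy", "tty"):
    print(w, round(class_trigram_prob(w), 5))
```

Because every letter belongs to exactly one class, there is a single path per word and no summation is needed, which is why the trellis is not really needed here.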
Class Trigrams: the Trellis
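Trellis generation (Y = "toy"):
p(t|C) = 1, p(o|V) = .3, p(e|V) = .6, p(y|V) = .1
[Figure: class-trigram trellis for Y = t o y; V states emit o,e,y]
$\alpha = .6 \times 1$
$\alpha = .6 \times .88 \times .3$
$\alpha = .1584 \times .07 \times .1 \approx .00111$
■ again, trellis useful but not really needed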
Overlapping Classes
Imagine that classes may overlap
■ e.g. 'r' is sometimes vowel sometimes consonant, belongs to V as well as C:
p(t|C) = .3
p(r|C) = .7
p(o|V) = .1
p(e|V) = .3
p(y|V) = .4
p(r|V) = .2
[Figure: class-pair HMM with states emitting o,e,y,r]
p(try) = ?
Overlapping Classes: Trellis Example
Trellis: Remarks
■ So far, we went left to right (computing $\alpha$)
■ Same result: going right to left (computing $\beta$)
■ supposing we know where to start (finite data)
■ In fact, we might start in the middle, going left and right
■ Important for parameter estimation (Forward-Backward Algorithm alias Baum-Welch)
■ Implementation issues:
■ scaling/normalizing probabilities, to avoid too small numbers & addition problems with many transitions
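One common way to handle the scaling issue is to renormalize the α values at every stage and accumulate the logarithms of the scale factors, which yields log P(Y) without ever storing a tiny number. The sketch below assumes a hypothetical two-state HMM whose states emit the symbols; `START`, `TRANS` and `EMIT` are made-up numbers, not taken from the slides, and the function names are illustrative.

```python
import math

# Hypothetical two-state HMM (all numbers invented for illustration).
STATES = range(2)
START = [0.5, 0.5]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]


def step(alpha, y):
    """One trellis stage: sum over predecessors, then emit y."""
    return [sum(alpha[r] * TRANS[r][s] for r in STATES) * EMIT[s][y] for s in STATES]


def prob_unscaled(outputs):
    """Plain left-to-right pass; underflows to 0.0 on long inputs."""
    alpha = [START[s] * EMIT[s][outputs[0]] for s in STATES]
    for y in outputs[1:]:
        alpha = step(alpha, y)
    return sum(alpha)


def log_prob_scaled(outputs):
    """Same pass, but alpha is renormalized to sum to 1 after every stage."""
    log_p = 0.0
    alpha = None
    for y in outputs:
        if alpha is None:
            alpha = [START[s] * EMIT[s][y] for s in STATES]
        else:
            alpha = step(alpha, y)
        scale = sum(alpha)           # equals P(y_t | y_1 .. y_{t-1})
        log_p += math.log(scale)     # the log scale factors add up to log P(Y)
        alpha = [a / scale for a in alpha]
    return log_p


long_output = "ab" * 5000
print(prob_unscaled(long_output))    # 0.0 (underflow)
print(log_prob_scaled(long_output))  # a finite negative number
```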
The Viterbi Algorithm
■ Solving the task of finding the most likely sequence of states which generated the observed data
■ i.e., finding
$S_{best} = \arg\max_S P(S \mid Y)$
which is equal to (Y is constant and thus P(Y) is fixed):
$S_{best} = \arg\max_S P(S, Y) = \arg\max_S P(s_0, s_1, s_2, \ldots, s_k, y_1, y_2, \ldots, y_k) = \arg\max_S \prod_{i=1}^{k} P(y_i \mid s_i, s_{i-1})\, P(s_i \mid s_{i-1})$
The Crucial Observation
Imagine the trellis built as before (but do not compute the $\alpha$s yet; assume they are o.k.); stage $i$:
[Figure: stage 1 with $\alpha = .6$ and $\alpha = .4$; arcs into (D,2), one labeled .5; at (D,2), $\alpha = \max(.3, .32) = .32$]
■ NB: remember the previous state from which we got the maximum ("reverse" the arc).
■ This is certainly the "backwards" maximum to (D,2)... but it cannot change even when we go forward (Markov property: limited history).
Viterbi Example
p(t|C) = .3, p(r|C) = .7, p(o|V) = .1, p(e|V) = .3, p(y|V) = .4, p(r|V) = .2
$\arg\max_{XYZ} p(XYZ \mid rry) = ?$
Possible state seq.: (×,V)(V,C)(C,V) [VCV], (×,C)(C,C)(C,V) [CCV], (×,C)(C,V)(V,V) [CVV]
Viterbi Computation
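p(t|C) = .3, p(r|C) = .7, p(o|V) = .1, p(e|V) = .3, p(y|V) = .4, p(r|V) = .2
$\alpha$ in trellis state: best prob from start to here
[Figure: Viterbi trellis for Y = rry; states emit o,e,y,r]
$\alpha = .4 \times .2 = .08$
$\alpha = .08 \times 1 \times .7 = .056$
$\alpha = .42 \times .12 \times .7 = .03528$
$\alpha = .07392 \times .07 \times .4 \approx .002070$
$\alpha_{C,C} = .03528 \times 1 \times .4 \approx .01411$
$\alpha_{V,C} = .056 \times .8 \times .4 = .01792 = \alpha_{max}$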
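The computation for Y = rry can be reproduced with a short Viterbi sketch in Python. The emission table is from the slide. The transition table `TRANS` is an assumption inferred from the factors in the products (.6, .4, .12, .88, 1, .8, .07) and lists only those arcs. For brevity the sketch stores the whole class path in each trellis state instead of a back pointer; the function name `viterbi` is illustrative.

```python
# p(letter | class), from the slide.
EMIT = {
    "C": {"t": 0.3, "r": 0.7},
    "V": {"o": 0.1, "e": 0.3, "y": 0.4, "r": 0.2},
}
# p(next class | state), state = (c_{i-2}, c_{i-1}); assumed, partial.
TRANS = {
    ("x", "x"): {"C": 0.6, "V": 0.4},
    ("x", "C"): {"C": 0.12, "V": 0.88},
    ("x", "V"): {"C": 1.0},
    ("C", "C"): {"V": 1.0},
    ("V", "C"): {"V": 0.8},
    ("C", "V"): {"V": 0.07},
}


def viterbi(word):
    """Return (probability, class sequence) of the best path generating word."""
    # best[state] = (best prob from start to this state, class path so far)
    best = {("x", "x"): (1.0, "")}
    for letter in word:
        nxt = {}
        for (c2, c1), (a, path) in best.items():
            for c, p in TRANS.get((c2, c1), {}).items():
                cand = a * p * EMIT[c].get(letter, 0.0)
                # keep the maximum instead of the sum used for P(Y)
                if cand > nxt.get((c1, c), (0.0, ""))[0]:
                    nxt[(c1, c)] = (cand, path + c)
        best = nxt
    return max(best.values())


prob, classes = viterbi("rry")
print(round(prob, 5), classes)  # 0.01792 VCV
```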
n-best State Sequences
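■ Keep track of n best "back pointers":
■ Ex.: n = 2: Two "winners":
■ VCV (best)
■ CCV (2nd best)
[Figure: Viterbi trellis for Y = rry with two back pointers kept]
$\alpha = .4 \times .2 = .08$
$\alpha = .08 \times 1 \times .7 = .056$
$\alpha = .42 \times .12 \times .7 = .03528$
$\alpha = .07392 \times .07 \times .4 \approx .002070$
$\alpha_{C,C} = .03528 \times 1 \times .4 \approx .01411$
$\alpha_{V,C} = .056 \times .8 \times .4 = .01792 = \alpha_{max}$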
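A simple way to obtain the n best state sequences is to keep the n best hypotheses in every trellis state rather than only one. The Python sketch below does this by storing complete partial paths, which is a simplification and not a back-pointer implementation. The emission table is from the slides; `TRANS` is an assumed, partial transition table inferred from the factors in the products, and the name `n_best_paths` is illustrative.

```python
import heapq

# p(letter | class), from the slides.
EMIT = {
    "C": {"t": 0.3, "r": 0.7},
    "V": {"o": 0.1, "e": 0.3, "y": 0.4, "r": 0.2},
}
# p(next class | state), state = (c_{i-2}, c_{i-1}); assumed, partial.
TRANS = {
    ("x", "x"): {"C": 0.6, "V": 0.4},
    ("x", "C"): {"C": 0.12, "V": 0.88},
    ("x", "V"): {"C": 1.0},
    ("C", "C"): {"V": 1.0},
    ("V", "C"): {"V": 0.8},
    ("C", "V"): {"V": 0.07},
}


def n_best_paths(word, n=2):
    """Return the n most probable (probability, class sequence) pairs."""
    hyps = {("x", "x"): [(1.0, "")]}
    for letter in word:
        nxt = {}
        for (c2, c1), candidates in hyps.items():
            for c, p in TRANS.get((c2, c1), {}).items():
                e = EMIT[c].get(letter, 0.0)
                if e == 0.0:
                    continue
                for a, path in candidates:
                    nxt.setdefault((c1, c), []).append((a * p * e, path + c))
        # keep only the n best hypotheses per trellis state
        hyps = {state: heapq.nlargest(n, cands) for state, cands in nxt.items()}
    finals = [h for cands in hyps.values() for h in cands]
    return heapq.nlargest(n, finals)


for prob, classes in n_best_paths("rry", n=2):
    print(classes, round(prob, 5))
```

With n = 1 this reduces to the ordinary Viterbi search.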
Tracking Back the n-best paths
■ Backtracking-style algorithm:
■ Start at the end, in the best of the n states ($s_{best}$)
■ Put the other n-1 best nodes/back pointer pairs on stack, except those leading from sbest to the same best-back state.
■ Follow the back "beam" towards the start of the data, spitting out nodes on the way (backwards of course) using always only the best back pointer.
■ At every beam split, push the diverging node/back pointer pairs onto the stack (node/beam width is sufficient!).
■ When you reach the start of data, close the path, and pop the topmost node/back pointer(width) pair from the stack.
■ Repeat until the stack is empty; expand the result tree if necessary.
Pruning
Sometimes, too many trellis states in a stage:
[Figure: a trellis stage with many states, $\alpha$ = .002, .043, .001, .231, .002, .000003; neighboring values $\alpha = .000435$, $\alpha = .0066$]
criteria:
(a) $\alpha$ < threshold
(b) $\sum \pi$ < threshold
(c) # of states > threshold (get rid of smallest $\alpha$)
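Criteria (a) and (c) can be sketched as a filter applied to one trellis stage before the next stage is generated. In the Python sketch below the α values are the ones shown in the figure, while the state names `s1` to `s6`, the function name `prune` and its parameters are hypothetical.

```python
def prune(stage, threshold=None, max_states=None):
    """Return a pruned copy of a trellis stage given as {state: alpha}.

    threshold  -- criterion (a): drop states whose alpha is below it
    max_states -- criterion (c): keep at most this many states,
                  getting rid of the smallest alphas first
    """
    kept = dict(stage)
    if threshold is not None:
        kept = {s: a for s, a in kept.items() if a >= threshold}
    if max_states is not None and len(kept) > max_states:
        ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
        kept = dict(ranked[:max_states])
    return kept


# Alpha values from the figure; state names are placeholders.
stage = {"s1": 0.002, "s2": 0.043, "s3": 0.001,
         "s4": 0.231, "s5": 0.002, "s6": 0.000003}

print(sorted(prune(stage, threshold=0.0015)))  # ['s1', 's2', 's4', 's5']
print(sorted(prune(stage, max_states=2)))      # ['s2', 's4']
```

Pruning makes the search approximate: a state dropped here can no longer be part of the best path found later.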