HMM Algorithms: Trellis and Viterbi PA154 Language Modeling (4.2) Pavel Rychly pary@fi.muni.cz March 9,2023 Source: Introduction to Natural Language Processing (600.465) Jan Hajic, CS Dept., Johns Hopkins Univ. www.cs.jhu.edu/~hajic HMM: The Two Tasks ■ HMM (the general case): ■ five-tuple (S, S0, Y, Ps, PY), where: ■ S = {5i,52,... ,57-} is the set of states, S0 is the initial, ■ Y = {yij/2, • • • ,yv} is the output alphabet, ■ Ps(Sj\Sj) is the set of prob. distributions of transitions, ■ Py{yk\Si,Sj) is the set of output (emission) probability distributions. ■ Given an HMM & an output sequence Y = {y\,yj, • • • 5y&} (Task 1) compute the probability of Y; (Task 2) compute the most Likely sequence of states which has generated Y. Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023 2/22 Trellis - Deterministic Output HMM Trellis: time/position t 1 2 3 4. p(toe)=x.6 x .88 x 1+ x.4 x .1 x 1 = .568 - treLLis state: (HMM state, position) -each state: holds one number (prob):a a(x,o) = iQ(^i) = .6a(D,2) = .568 a(s,3) = .568 - probability or Y: Ha in the Last state a^c ^ = 4 Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023 3/22 Creating the Trellis: The Start ■ Start in the start state (x), ■ its a(x,0) to 1. ■ Create the first stage: ■ get the first "output" symbol y1 ■ create the first stage (column) ■ but only those trellis states which generate yi ■ set their a(state,l) to the Ps(state ■ ...and forget about the 0-th stage position/stage 0 1 yi:t Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023 Trellis: The Next Step position/stage ■ Suppose we are in stage /, ■ Creating the next stage: ■ create all trellis state in the next stage which generate y/+1, but only those reachable from any of the stage-/ states ■ set their a(state, i + 1) to: Ps{state\ prev.state) xa(prev.state, i) (add up all such numbers on arcs going to a common trellis state) ■ ...and forget about stage / Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023 Trellis: The Last Step Continue until "output" exhausted ■ |/| = 3: until stage 3 Add together all the a(state,\Y\) That's the P(Y). Observation (pleasant): ■ memory usage max: 2|S ■ multiplications max: S2 Y Last position/stage a = .568 a = .568 P(Y)=568 Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023 Trellis: The General Case (still, bigrams) Start as usual: ■ start state (x), set its a(x,0) to 1. o,.06 e,.l2 p(toe) = .48 x .616 x .6 + .2 x 1 x .176 + .2 x 1 x .12 ^ .237 a = 1 Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023 General Trellis: The Next Step We are in stage /: ■ Generate the next stage i+1 as before (except now arcs generate output, thus use only those arcs marked by the output symbol yi+1) ■ For each generated state compute a(state,i +1) = = incomingarcsPy{yi+i\state,prev.state) X a(prev.statej) position/stage 0 1 a = 1 x o,.06 1^06 e,.12 V e t,2 t.48 o,.08 e,.17i \L.088 B o,.4 e,.6 a = .48 a = .2 and forget about stage / as usual q ) o,.6167 q Pavel Rychly • HMM Algorithms: Trellis and WterbK . March 9, 2023 8/22 Trellis: The Complete Example The Case of Trigrams ■ Like before, but: ■ states correspond to bigrams, ■ output function always emits the second output symbol of the pair (state) to which the arc goes: Multiple paths not possible -»trellis not really needed Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023 10/22 Trigrams with Classes More interesting: ■ n-gram class LM:p(w/|w/_2, = p(vv/|c/)p(c/|c/_2, c,-_i) ->> states are pairs of classes (c,-_i. q), and emit "words": (letters in our example) p(t|C) = 1 usual, p(o|V) = .3 non- p(e|V) = .6 overlapping p(y|V) = .1 classes o,e,y o,e,y p{toe) = .6 x 1 x .88 x .3 x .07 x .6 ^ .00665 p{teo) = .6 x 1 x .88 x .6 x .07 x .3 ^ .00332 p{toy) = .6 x 1 x .88 x .3 x .07 x .1 ^ .00111 p{tty) = .6 x 1 x .12 x 1 x 1 x .1 ^ .0072 Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023 11/22 Class Trigrams: the Trellis TreLLis generation (Y = toy") p(t|C) = 1 p(o|V) = .3 p(e|V) = .6 p(y|V) = .1 a = .6 x 1 again, treLLis useful but not really needed a = .1584 x .07 x .1 ,00111 o,e,y o,e,y Y: t a = .6 x .88 x .3 o Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023 12/22 Overlapping Classes Imagine that classes may overlap ■ e.g. Y is sometimes vowel sometimes consonant, belongs to V as well as C: p(t|Q = .3 p(r|Q = .7 p(o|V) = .1 p(e|V) = .3 P(Y|V) = A p(r|V) = .2 o,e,y,r o,e,y,r p(try) = ? Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023 13/22 Overlapping Classes: Trellis Example Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023 14/22 Trellis: Remarks ■ So far, we went Left to right (computing a) ■ Same result: going right to Left (computing (3) ■ supposed we know where to start (finite data) ■ In fact, we might start in the middle going Left and right ■ Important for parameter estimation (Forward-Backward ALgortihm alias Baum-WeLch) ■ Implementation issues: ■ scaling/normalizing probabilities, to avoid too small numbers & addition problems with many transitions Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023 15/22 The Viterbi Algorithm ■ Solving the task of finding the most Likely sequence of states which generated the observed data ■ i.e., finding Sbest = a rg maxSP(S\Y) which is equal to (Y is constant and thus P(Y) is fixed): Sbest = a rg maxsP(Sy) = = argmaxsP(so,s1,s2,... ,sk,yi,y2,... ,yk) = = argmaxsni=1„k P(y1|s/,s/_1)P(s/|s/_1) Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023 16/22 The Crucial Observation Imagine the trellis build as before (but do not compute the as yet; assume they are o.k.); stage /: stage 1 a = .6 NB: remember previous state .5 from which we got the maximum: a = A a = .max{.l>, .32) = .32 ?...max! stage 1 this is certainly the "backwards" maximum to (D,2)...but it cannot change even whenever we go forward (M. Property: Limited History) reverse" the arc a = .32 Pavel Rychlý • HMM Algorithms: Trellis and Viterbi • March 9, 2023 17/22 Viterbi Example p(t|Q = .3 p(r|C) = .7 p(o|V) = .1 p(e|V) = .3 P(y|v) = .4 p(r|V) = .2 argmaxxvz p(rry|XYZ) = ? Possible state seq.: (x,V)(V,C)(C,V)[VCV], (x,C)(C,C)(C,V)[CCV], (x,C)(C,V)(V,V)[CW] Pavel Rychlý • HMM Algorithms: Trellis and Viterbi • March 9, 2023 18/22 Viterbi Computation P(t|Q = p(r|Q = p(o|V) = p(e|V) = P(V|V) = p(r|V) = .3 .7 .1 .3 .4 .2 a in trellis state: best prob from start to here o,e,y,r Y: 42 x .12 x .7 : .03528 o,e,y,r o,e,y,r a = .07392 x .07 x .4 =.002070 v,vk o,cc = .03528 x 1 x .4 '=.01411 a.vc = .056 x .8 x .4 ' = .01792 = amax -- .0.8 x 1 X .7 = 056 .4 x .2 .08 Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023 19/22 n-best State Sequences ■ Keep track of n best "back pointers": ■ Ex.: n= 2: Two "winners": ■ VCV (best) ■ CCV {2nd best) Y: .42 x .12 x .7 .03528 a = .07392 x .07 x .4 =.002070 ac>c = .03528 x 1 X .4 ='.01411 aV}C = -056 x .8 x .4 = .01792 = amax a = .0.8 X 1 X .7 = 056 a = .4 x .2 = 08 Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023 20/22 Tracking Back the n-best paths ■ Backtracking-style algorithm: ■ Start at the end, in the best of the n states {sbest) ■ Put the other n-1 best nodes/back pointer pairs on stack, except those leading from sbest to the same best-back state. ■ Follow the back "beam" towards the start of the data, spitting out nodes on the way (backwards of course) using always only the best back pointer. ■ At every beam split, push the diverging node/back pointer pairs onto the stack (node/beam width is sufficient!). ■ When you reach the start of data, close the path, and pop the topmost node/back pointer(width) pair from the stack. ■ Repeat until the stack is empty; expand the result tree if necessary. Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023 21/22 Pruning Sometimes, too many trellis states in a stage: a = .002 a a a a = a = .043 .001 .231 .002 .000003 criteria: (a) a < threshold (b) Y.H < threshold (c) # of states > threshold (get rid of smallest a) x a = .000435 a = .0066 Pavel Rychly • HMM Algorithms: Trellis and Viterbi • March 9, 2023 22/22