HMM Algorithms: Trellis and Viterbi PA154 Jazykové modelovaní (5.2) Pavel Rychlý pary@fi.muni.cz March 16, 2020 Source: Introduction to Natural Language Processing (600.465) Jan Hajic, CS Dept., Johns Hopkins Univ. www.cs.jhu.edu/~hajic H M M: The Two Tasks ■ HMM (the general case): ► five-tuple (S, So, Y, Ps, Py), where: ► S = {si, S2,..., St} is the set of states, So is the initial, ► Y = {yi,3/2,.. •, yv} is the output alphabet, ► Ps(sj\si) is the set of prob. distributions of transitions, ► Py{yk\s;1 Sj) is the set of output (emission) probability distributions. ■ Given an HMM & an output sequence Y = {yi,y2, • • • , y/c} (Task 1) compute the probability of Y; (Task 2) compute the most likely sequence of states which has generated Y. PA154 Jazykové modelování (5.2) HMM Algorithms: Trellis and Viterbi 2/22 Trellis - Deterministic Output HMM Trellis: time/position t 1 2 3 4. p(toe)=x.6 x .88 x 1+ x.4 x .1 x 1 = .568 - trellis state: (HMM state, position) - each state: holds one number (prob):a - probability or Y: la in the last state Y: t o e a(x, 0) = 1 ot(A, 1) = .6 a(D, 2) = .568 a(ß, 3) a(C, 1) = .4 .568 PA154 Jazykové modelování (5.2) HMM Algorithms: Trellis and Viterbi 3/22 Creating the Trellis: The Start ■ Start in the start state (x), ► its a(x,0) to 1. ■ Create the first stage: ► get the first "output" symbol yi ► create the first stage (column) ► but only those trellis states which generate yi ► set their a(state,l) to the Ps(state\x) a(x, 0) ■ . .. and forget about the 0-th stage PA154 Jazykové modelování (5.2) HMM Algorithms: Trellis and Viterbi Trellis: The Next Step position/stage ■ Suppose we are in stage /, i=i ■ Creating the next stage: ► create all trellis state in the next stage which generate y,+i, but only those reachable from any of the stage-/ states ► set their a(state, i + 1) to: Ps(state\ prev.state) xa(prev.state, /) (add up all such numbers on arcs going to a common trellis state) ► ... and forget about stage / PA154 Jazykové modelování (5.2) HMM Algorithms: Trellis and Viterbi Trellis: The Last Step Continue until "output" exhausted - Y\ = 3: until stage 3 Add together all the a(state,\Y\) That's the P(Y). Observation (pleasant): ► memory usage max: 2|S ► multiplicationsmax: S 2 Y last position/stage a .568 a — .568 P(Y)=.568 PA154 Jazykové modelování (5.2) HMM Algorithms: Trellis and Viterbi 6/22 Trellis: The General Case (still, bigrams) Start as usual: ► start state (x), set its a(x,0) to 1 O..06 e,.12 p(toe) = .48 x .616 x .6 + .2 x 1 x .176 + .2 x 1 x .12 ^ .237 a = 1 PA154 Jazykové modelování (5.2) HMM Algorithms: Trellis and Viterbi 7/22 General Trellis: The Next Step We are in stage /: ► Generate the next stage i+1 as before (except now arcs generate output, thus use only those arcs marked by the output symbol y,+i) ► For each generated state compute a(state,i + 1) = position/stage 0 incoming arcs a(prev.state,i) o,.06 Py(yi+i\state,prev.state) x e,.12 a = 1 a = .48 a = .2 and forget about stage / as usual PA154 Jazykové modelování (5.2) HMM Algorithms: Trellis and Viterbi 8/22 Trellis: The Complete Example The Case of Trigrams ■ Like before, but: ► states correspond to bigrams, ► output function always emits the second output symbol of the pair (state) to which the arc goes: Multiple paths not possible —>> trellis not really needed PA154 Jazykové modelování (5.2) HMM Algorithms: Trellis and Viterbi 10/22 Trigrams with Classes More interesting: ► n-gram class LM: p(i/i/;|i/i/;_2, = p(w/|c/)p(c/|c/_2, Q_i) —>» states are pairs of classes (c,-_i, q), and emit "words": (letters in our example) p(t | C) = 1 usual, p(o|V) = .3 non-p(e|V) = .6 overlapping P(y|V) = 1 classes o,e,y o,e,y p(toe) = .6 X 1 X .88 X .3 X .07 X .6 ^ .00665 p(teo) = .6 X 1 X .88 X .6 X .07 X .3 ^ .00665 p(toy) = .6 X 1 X .88 X .3 X .07 X .1 ^ .00111 p(tty) = .6 X 1 X .12 x 1 X 1 X .1 ^ .0072 PA154 Jazykové modelování (5.2) HMM Algorithms: Trellis and Viterbi 11/22 Class Trigrams: the Trellis ■ Trellis generation (Y = "toy"): o,e,y o,e,y t a = .6 x .88 x .3 PA154 Jazykové modelování (5.2) Y- t n HMM Algorithms: Trellis and Viterbi ^^^^ 12/22 Overlapping Classes ■ Imagine that classes may overlap ► e.g. V is sometimes vowel sometimes as C: o,e,y,r o,e,y,r t,r consonant, belongs to V as well p(t|C) = .3 P(r|C) = .7 p(o|V) = .1 p(e|V) = .3 p(y|V) = .4 p(r|V) = .2 p(try) = ? PA154 Jazykové modelování (5.2) HMM Algorithms: Trellis and Viterbi 13/22 Overlapping Classes: Trellis Example o,e,y,r o,e,y,r t,r a = .18 X .88 X .2 = .03168 PA154 Jazykové modelování (5.2) HMM Algorithms: Trellis and Viterbi Trellis: Remarks ■ So far, we went left to right (computing a) ■ Same result: going right to left (computing /3) ► supposed we know where to start (finite data) ■ In fact, we might start in the middle going left and right ■ Important for parameter estimation (Forward-Backward Algortihm alias Baum-Welch) ■ Implementation issues: ► scaling/normalizing probabilities, to avoid too small numbe & addition problems with many transitions PA154 Jazykové modelování (5.2) HMM Algorithms: Trellis and Viterbi The Viterbi Algorithm ■ Solving the task of fmding the most likely sequence of states which generated the observed data ■ i.e., finding Sbest = argmaxsP(S\Y) which is equal to (Y is constant and thus P(Y) is fixed): Sbest = argmaxsP{S,Y) = = argmaxsP(so,s1,s2,.. .,sklylly2,... ,y/c) = = argmaxsPUi=lmmk p(yi|s/5 s/_i)p(s/|s/_i) PA154 Jazykové modelování (5.2) HMM Algorithms: Trellis and Viterbi 16/22 "he Crucial Observation ■ Imagine the trellis build as before (but do not compute the as yet; assume they are o.k.); stage /: stage a = .6 ?... max! this is certainly the "backwards" maximum to (D,2)... but it cannot change even whenever we go forward (M. Property: Limited History) PA154 Jazykové modelování (5.2) HMM Algorithms: Trellis and Viterbi 17/22 Viterbi Example ■ V classification (C or V?, sequence?): p(t|C) = .3 p(r|C) = .7 p(o|V) = .1 p(e|V) = .3 p(y|V) = .4 p(r|V) = .2 argmaxxyz p(rry|XYZ) = ? Possible state seq.: (x, V)(V, C)(C, V)[VCV],(x,C)(C, C)(C, V)[CCV],(x, C)(C, V)(V, V)[CVV] PA154 Jazykové modelovaní (5.2) HMM Algorithms: Trellis and Viterbi 18/22 Viterbi Computation Y: p(t|C) P(r|C) p(o|V) p(e|V) p(y|v) p(r|V) ,3 7 .1 .3 .4 .2 a in trellis state: o best prob from start to here o,e,y,r o,e,y,r 18 X .12 x .7 : .03528 a = .07392 x .07 X .4 = .002070 X .88 X .2 ľ07392 a = .0.8 X 1 X .7 = .056 = .0141 avc = .( = .or a = .4 X .2 .08 PA154 Jazykové modelování (5.2) HMM Algorithms: Trellis and Viterbi 19/22 n-best State Sequences Y: a. = 1 Keep track of n best "back pointers": Ex.: n= 2: Two "winners": VCV (best) CCV (2nd best) a = .18 X .12 x .7 = .03528 a = .07392 X .07 X .4 =.002070 .88 x .2 PA154 Jazykové modelování (5.2) a = .0.8 X 1 X .7 = .056 a = .4 X .2 = .08 HMM Algorithms: Trellis and Viterbi aC,C = -03528 X 1 > ='.01411 olvc = .056 X .8 X = .01792 = amax 20/22 Tracking Back the n-best paths ■ Backtracking-style algorithm: ► Start at the end, in the best of the n states (sbest) ► Put the other n-1 best nodes/back pointer pairs on stack, except those leading from Sbest to the same best-back state. ■ Follow the back "beam" towards the start of the data, spitting out nodes on the way (backwards of course) using always only the best back pointer. ■ At every beam split, push the diverging node/back pointer pairs onto the stack (node/beam width is sufficient!). ■ When you reach the start of data, close the path, and pop the topmost node/back pointer(width) pair from the stack. ■ Repeat until the stack is empty; expand the result tree if necessary. PA154 Jazykové modelování (5.2) HMM Algorithms: Trellis and Viterbi 21/22 runing Sometimes, too many trellis states in a stage: a .002 a a a a .043 = .001 = .231 = .002 criteria: (a) a < threshold (b) Y.-K < threshold (c) # of states > threshold (get rid of smallest a) a .000003 a .000435 a .0066 PA154 Jazykové modelování (5.2) HMM Algorithms: Trellis and Viterbi 22/22