HMM Algorithms: Trellis and Viterbi
PA154 Jazykové modelování (5.2)
Pavel Rychlý, pary@fi.muni.cz
March 20, 2019
Source: Introduction to Natural Language Processing (600.465), Jan Hajič, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

HMM: The Two Tasks

HMM (the general case):
► a five-tuple (S, S0, Y, PS, PY), where:
  ► S = {s1, s2, ..., sT} is the set of states, S0 is the initial state,
  ► Y = {y1, y2, ..., yV} is the output alphabet,
  ► PS(sj|si) is the set of probability distributions of transitions,
  ► PY(yk|si, sj) is the set of output (emission) probability distributions.
Given an HMM & an output sequence Y = {y1, y2, ..., yk}:
► (Task 1) compute the probability of Y;
► (Task 2) compute the most likely sequence of states which has generated Y.

Trellis - Deterministic Output

(figure: the HMM unrolled into a trellis over positions t = 0, 1, 2, 3)
► trellis state: (HMM state, position)
► each trellis state holds one number (a probability, α)
► probability of Y: Σ α in the last stage
► example: p(toe) = .6 × .88 × 1 + .4 × .1 × 1 = .568
  (α(×,0) = 1; α(A,1) = .6, α(C,1) = .4; α(D,2) = .568; α(B,3) = .568)

Creating the Trellis: The Start

► Start in the start state (×), set its α(×,0) to 1.
► Create the first stage:
  ► get the first "output" symbol y1,
  ► create the first stage (column), but only those trellis states which generate y1,
  ► set their α(state,1) to PS(state|×) · α(×,0).
► ... and forget about the 0-th stage.

Trellis: The Next Step

► Suppose we are in stage i.
► Creating the next stage:
  ► create all trellis states in stage i+1 which generate yi+1, but only those reachable from any of the stage-i states,
  ► set their α(state, i+1) to:
    Σ PS(state|prev. state) × α(prev. state, i)
    (add up all such numbers on arcs going to a common trellis state),
  ► ... and forget about stage i.

Trellis: The Last Step

► Continue until the "output" is exhausted: |Y| = 3, so until stage 3.
► Add together all the α(state, |Y|): that is P(Y). Here P(Y) = .568.
► Observation (pleasant):
  ► memory usage max: 2|S|
  ► multiplications max: |S|² |Y|

Trellis: The General Case (still, bigrams)

► Start as usual: start state (×), set its α(×,0) to 1.
► Now the output is generated on the arcs:
  p(toe) = .48 × .616 × .6 + .2 × 1 × .176 + .2 × 1 × .12 ≅ .237

General Trellis: The Next Step

► We are in stage i:
  ► generate the next stage i+1 as before (except that now the arcs generate the output, thus use only those arcs marked by the output symbol yi+1),
  ► for each generated state compute
    α(state, i+1) = Σ(incoming arcs) PY(yi+1|state, prev. state) × α(prev. state, i),
► ... and forget about stage i, as usual.

Trellis: The Complete Example

► In the last stage, α = .024 + .177408 = .201408 in one state and .0352 in the other;
  P(Y) = p(toe) = .201408 + .0352 = .236608.

The Case of Trigrams

► Like before, but:
  ► states correspond to bigrams,
  ► the output function always emits the second output symbol of the pair (state) to which the arc goes.
► p(toe) = .6 × .88 × .07 ≅ .037
► Multiple paths are not possible → a trellis is not really needed.

Trigrams with Classes

► More interesting: an n-gram class LM:
  p(wi|wi−2, wi−1) = p(wi|ci) p(ci|ci−2, ci−1)
  → states are pairs of classes (ci−1, ci), and they emit "words" (letters in our example).
► Usual, non-overlapping classes:
  p(t|C) = 1, p(o|V) = .3, p(e|V) = .6, p(y|V) = .1
► p(toe) = .6 × .88 × .07 × 1 × .3 × .6 ≅ .00665
  p(too) = .6 × .88 × .07 × 1 × .3 × .3 ≅ .00333
  p(toy) = .6 × .88 × .07 × 1 × .3 × .1 ≅ .00111

Class Trigrams: the Trellis

► Trellis generation (Y = "toy"), with the same emission probabilities:
  p(t|C) = 1, p(o|V) = .3, p(e|V) = .6, p(y|V) = .1
► Again, the trellis is useful but not really needed: α = .6 × .88 × .3 = .1584 after "to".
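The forward (trellis) computation above can be sketched in a few lines of Python. This is an illustrative reconstruction, not code from the slides: the `trans`/`emit` dictionary layout and the arc-emission encoding are assumptions. The toy model reproduces the deterministic "toe" example (transitions .6/.88/1 and .4/.1/1, every arc emitting its symbol with probability 1), so the pass returns the slide's P(toe) = .568.

```python
from collections import defaultdict

def forward(start, trans, emit, outputs):
    """Forward (trellis) pass for an arc-emission HMM.

    trans[s][t]     -- P_S(t | s), transition probability
    emit[(s, t)][y] -- P_Y(y | s, t), output probability on the arc s -> t
    Returns P(outputs), the sum of the alphas in the last stage.
    """
    alpha = {start: 1.0}                      # stage 0: all mass in the start state
    for y in outputs:
        nxt = defaultdict(float)
        for s, a in alpha.items():            # only states reachable so far
            for t, p in trans.get(s, {}).items():
                e = emit.get((s, t), {}).get(y, 0.0)
                if e > 0.0:                   # only arcs that generate y
                    nxt[t] += a * p * e       # add up arcs entering a common state
        alpha = dict(nxt)                     # ... and forget the previous stage
    return sum(alpha.values())

# Toy model: the deterministic "toe" example (state names A-D as on the slides;
# the encoding as arc emissions with probability 1 is an assumption).
trans = {'x': {'A': .6, 'C': .4}, 'A': {'D': .88}, 'C': {'D': .1}, 'D': {'B': 1.0}}
emit = {('x', 'A'): {'t': 1.0}, ('x', 'C'): {'t': 1.0},
        ('A', 'D'): {'o': 1.0}, ('C', 'D'): {'o': 1.0}, ('D', 'B'): {'e': 1.0}}
print(forward('x', trans, emit, 'toe'))       # ~ .568
```

Note the pleasant memory behaviour from the slide: only the current stage (`alpha`) and the one being built (`nxt`) are ever kept, i.e. at most 2|S| numbers.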
► Final-stage α ≅ .00111 (= .1584 × .07 × .1).

Overlapping Classes

► Imagine that classes may overlap:
  ► e.g., 'r' is sometimes a vowel, sometimes a consonant: it belongs to V as well as C.
► Emission probabilities:
  p(t|C) = .3, p(r|C) = .7, p(o|V) = .1, p(e|V) = .3, p(y|V) = .4, p(r|V) = .2
► p(try) = ?

Overlapping Classes: Trellis Example

(figure: trellis for Y = "try" with the emission probabilities above; output alphabet {t, r, o, e, y})

Trellis: Remarks

► So far, we went left to right (computing α).
► Same result: going right to left (computing β),
  ► supposing we know where to start (finite data).
► In fact, we might start in the middle, going left and right.
► Important for parameter estimation (the Forward-Backward algorithm, alias Baum-Welch).
► Implementation issues:
  ► scaling/normalizing probabilities, to avoid too-small numbers & addition problems with many transitions.

The Viterbi Algorithm

► Solves the task of finding the most likely sequence of states which generated the observed data,
► i.e., finding Sbest = argmaxS P(S|Y),
  which is equal to (Y is constant and thus P(Y) is fixed):
  Sbest = argmaxS P(S,Y)
        = argmaxS P(s0, s1, s2, ..., sk, y1, y2, ..., yk)
        = argmaxS Π(i=1..k) p(yi|si, si−1) p(si|si−1)

The Crucial Observation

► Imagine the trellis built as before (but do not compute the αs yet; assume they are o.k.); at stage i:
  take the maximum over the incoming arcs instead of the sum, e.g. max(.3, .32) = .32.
► This is certainly the best path so far to that trellis state, e.g. (D,2), and going forward it cannot change (Markovian Property: Limited History).
► NB: remember the previous state from which we got the maximum (the "backwards" pointer).

Viterbi Example

► 'r' classification (C or V? what is the sequence?):
  p(t|C) = .3, p(r|C) = .7, p(o|V) = .1, p(e|V) = .3, p(y|V) = .4, p(r|V) = .2
► argmaxXYZ p(try|XYZ) = ?
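The Viterbi recursion just described (max instead of sum, plus a back pointer to the argmax predecessor) can be sketched as follows. Again an illustrative reconstruction, not course code, reusing the assumed arc-emission toy model of the "toe" example; a real implementation would work with sums of log probabilities, which also addresses the scaling issue mentioned in the remarks.

```python
def viterbi(start, trans, emit, outputs):
    """Most likely state sequence: the forward pass with max instead of sum,
    remembering for each trellis state the predecessor that won the max."""
    delta = {start: 1.0}
    backs = []                                # backs[i][t]: best predecessor of t
    for y in outputs:
        nxt, bp = {}, {}
        for s, d in delta.items():
            for t, p in trans.get(s, {}).items():
                score = d * p * emit.get((s, t), {}).get(y, 0.0)
                if score > nxt.get(t, 0.0):   # keep only the best incoming arc
                    nxt[t], bp[t] = score, s
        delta = nxt
        backs.append(bp)
    best = max(delta, key=delta.get)          # best state in the last stage ...
    path = [best]
    for bp in reversed(backs):                # ... then follow the back pointers
        path.append(bp[path[-1]])
    path.reverse()
    return delta[best], path

# Same assumed toy model as for the forward pass ("toe" example):
trans = {'x': {'A': .6, 'C': .4}, 'A': {'D': .88}, 'C': {'D': .1}, 'D': {'B': 1.0}}
emit = {('x', 'A'): {'t': 1.0}, ('x', 'C'): {'t': 1.0},
        ('A', 'D'): {'o': 1.0}, ('C', 'D'): {'o': 1.0}, ('D', 'B'): {'e': 1.0}}
print(viterbi('x', trans, emit, 'toe'))       # best path x-A-D-B, prob ~ .528
```

The back pointers are exactly the "backwards" pointers of the crucial observation: thanks to the limited history, once a trellis state's best predecessor is fixed it never needs to be revised.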
► Possible state sequences:
  (×,V)(V,C)(C,V) [VCV], (×,C)(C,C)(C,V) [CCV], (×,C)(C,V)(V,V) [CVV]

Viterbi Computation

(figure: Viterbi trellis for "try"; each α is a maximum over incoming arcs, e.g. .8 × 1 × .7)

n-best State Sequences

► Keep track of the n best "back pointers".
► Ex.: n = 2: two "winners": VCV (best), CCV (2nd best).
  (figure: competing αs, e.g. .03528 vs. .01411, and .056 × .32 = .01792)

Tracking Back the n-best Paths

Backtracking-style algorithm:
► Start at the end, in the best of the n states (sbest).
► Put the other n−1 best node/back-pointer pairs on a stack, except those leading from sbest to the same best-back state.
► Follow the back "beam" towards the start of the data, spitting out nodes on the way (backwards, of course), always using only the best back pointer.
► At every beam split, push the diverging node/back-pointer pairs onto the stack (node/beam width is sufficient!).
► When you reach the start of the data, close the path, and pop the topmost node/back-pointer (width) pair from the stack.
► Repeat until the stack is empty; expand the result tree if necessary.

Pruning

► Sometimes there are too many trellis states in a stage (e.g. states with α as small as .002).
► Pruning criteria:
  (a) α < threshold,
  (b) Σα < threshold,
  (c) # of states > threshold (get rid of the smallest αs).
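The pruning criteria can be sketched as a small helper that trims one trellis stage. The function name, data layout, and threshold values are illustrative, not from the slides; it implements criteria (a) and (c) directly, and criterion (b), thresholding a sum of the discarded αs, would be a straightforward variant.

```python
def prune_stage(alpha, min_alpha=0.0, max_states=None):
    """Trim one trellis stage, given as a dict {state: alpha}.

    (a) drop states whose alpha falls below min_alpha;
    (c) if more than max_states remain, keep only the largest alphas.
    """
    kept = {s: a for s, a in alpha.items() if a >= min_alpha}
    if max_states is not None and len(kept) > max_states:
        top = sorted(kept, key=kept.get, reverse=True)[:max_states]
        kept = {s: kept[s] for s in top}
    return kept

# Hypothetical stage values for illustration:
stage = {'A': .002, 'B': .1584, 'C': .0009}
print(prune_stage(stage, min_alpha=.001))     # drops C
print(prune_stage(stage, max_states=1))       # keeps only B
```

Pruning trades exactness for speed: a discarded state might have been on the best path, so thresholds must be chosen with the application's error tolerance in mind.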