MUNI FI HMM Algorithms: Trellis and Viterbi PA154 Language Modeling (5.2) Pavel Rychlý pary@fi.muni.cz March 19, 2024 HMM: The Two Tasks HMM (the general case): ■ five-tuple (S, S0. Y, Ps>Py)> where: ■ S = {s1,s2,... :sT} is the set of states, SQ is the initiaL, ■ Y = {yi,y2, ■ ■ ■ ,yv} is the output aLphabet, ■ Ps(sj\si) is the set of prob. distributions of transitions, ■ Py(yk\si:Sj) is the set of output (emission) probability distributions. Given an HMM & an output sequence V = {yi,yi, ■ ■ ■ ,yi<} (Task 1) compute the probability of Y; (Task 2) compute the most likely sequence of states which has generated Y. Source: Introduction to NaturaL Language Processing (600.465) Jan Hajic.CS Dept.Jonns Hopkins Univ. www.cs.jnu.edu/~najic ~aveL RycnLy • HMM Algorithms: Trellis and Viterbi • Marc Trellis - Deterministic Output HMM: Trellis: time/position f 0 12 3 p(toe)=x.6 x .88 x + x.4 x .1 x 1 = .568 Y' t o e -trellis state: (HMM state, position) ' -each state: holds one number (prob):a <*(x,o) =1^,1) = .60(0,2) = - probability orY: la in the last state ^(c 1) = 4 Creating the Trellis: The Start Start in the start state (x), ■ its q(x,0) to 1. Create the first stage: ■ get the first "output" symbol y\ ■ create the first stage (column) ■ but only those trellis states which generate yi ■ set their a(state,l) to the Ps(state\ x' ...and forget about the 0-th stage position/stage 0 1 q(x,0) y-:t -aveL RycnLy . HMM Algorithms: Trellis and Viterbi . Marcn 19,2024 -aveL RycnLy • HMM Algorithms: Trellis and Viterbi • Marc Trellis: The Next Step Trellis: The Last Step Suppose we are in stage /, Creating the next stage: ■ create all trellis state in the next stage which generate y,+1, but only those reachable from any of the stage-/ states ■ set their a(state, i + 1) to: Ps{state\ prev.state) xa(prev.state, i) (add up all such numbers on arcs going to a common trellis state) ■ ...and forget about stage / position/stage i=l Continue until "output" exhausted ■ \Y\ = 3: until stage 3 Add together all the a(state,\ Y\) That's the P(Y). Observation (pleasant): ■ memory usage max: 2|S| ■ multiplications max: \S\2\Y\ last position/stage P(Y)-568 -aveL RycnLy . HMM Algorithms: Trellis and Viterbi . Marcn 19,2024 5/22 -aveL RycnLy • HMM Algorithms: Trellis and Viterbi • Marcn 19, 2024 6/22 Trellis: The General Case (still, bigrams) General Trellis: The Next Step Start as usual: ■ start state (x), set its q(x,0) to 1. p(toe) = .48 x .616 x .6 + .2 x 1 x .176 + .2 x 1 x .12 .237 ^aveL RychLy . HMM Algorithms: Trellis and Viterbi . March 19,2024 We are in stage /: ■ Generate the next stage i+1 as before (except now arcs generate output, thus use only those arcs marked by the output symbol y,+1) ■ For each generated state compute a(state,i + 1) - :sPy(yi+i \state,prev.state) ...and forget about stage / as usual ^aveL RychLy . HMM March 19, 2024 Trellis: The Complete Example Stage: 9/22 The Case of Trigrams ■ Like before, but: ■ states correspond to bigrams, ■ output function always emits the second output symbol of the pair (state) to which the arc goes: Multiple paths not possible -> trellis not really needed ^aveL RychLy . HMM Algorithms: Trellis and Viterbi . March 19, 2024 10 / 22 Trigrams with Classes ■ More interesting: ■ n-gram class LM:p(wi\wi_2, = p(rV/|c/)p(c,-|c; -> states are pairs of classes {cj_^ o],and emit "words' fo,e,y ^letters in our example) P(t|Q -p(o|V) = p(e|V) --p(y|V) = „(roe) = .6 p(reo) = .6 p(roy) = p(rr/) = .6 usual, non-overlapping classes .88 x .3 x .07 x .6 £s .00665 .88 x .6 x .07 x .3 ~ .00332 : .88 x .3 x .07 x .1 ~ .00111 .12 x 1 x 1 x .1 ~ .0072 Class Trigrams: the Trellis ■ Trellis generation (Y = "toy"): p(t|C) = 1 p(o|V) = .3 p(e|V) = .6 p(y|V) = .1 again, treLLis usefuL but not really needed ^aveL RychLy . HMM Algorithms: Trellis and Viterbi . March 19,2024 11/22 3aveL RychLy • HMM Algorithms: Trellis and Viterbi • March 19,2024 12/22 Overlapping Classes Imagine that classes may overlap ■ e.g. Y is sometimes vowel sometimes consonant, belongs to V as well as C: o,e,y,r o,e,y,r p(t|C) = .3 p(r|C) = .7 p(o|V) = .1 p(e|V) = .3 p(y|V) = .4 p(r|V) = .2 p(try) = ? Overlapping Classes: Trellis Example p(t|C) = .3 p(r|C) = .7 p(o|V) = .1 p(e|V) = .3 P(V|V) = .4 p(r|V) = .2 y p(Y)=.006955 o,e,y,r o,e,y,r ^aveL RychLy . HMM Algorithms: Trellis and Viterbi . March 19,2024 3aveL RychLy • HMM Algorithms: Trellis and Viterbi • March 19, 2024 Trellis: Remarks The Viterbi Algorithm So far, we went left to right (computing a) Same result: going right to left (computing fS) ■ supposed we know where to start (finite data) In fact, we might start in the middle going left and right Important for parameter estimation (Forward-Backward Algortihm alias Baum-Welch) Implementation issues: ■ scaling/normalizing probabilities, to avoid too small numbers & addition problems with many transitions ■ Solving the task of finding the most likely sequence of states which generated the observed data ■ i.e.,finding Sbest - argmaxsP(S\Y) which is equal to (Y is constant and thus P(Y) is fixed): Sbest ~ argmaxsP(S,Y) -= argmaxsP(so,s1,S2, ■ ■ ■,sk,y1,y2, ■ ■ ■,») = = argmaxsY\j=i k P(yi|s;,s;_i)P(s;|s,-_i) ^aveL RychLy . HMM Algorithms: Trellis and Viterbi . March 19,2024 3aveL RychLy • HMM Algorithms: Trellis and Viterbi • March 19, 2024 The Crucial Observation Viterbi Example Imagine the trellis build as before (but do not compute the as yet; assume they are o.k.); stage /: stage stage NB: remember previous state 5 from which we got the maximum q = .max{3, .32) = .32 © this is certainLythe "backwards" maximum to (D,2)...but it cannot change even whenever we go forward (M. Property: Limited History) Y classification (C or V?, sequence?): fo,e,y,r o,e,y,r o,e,y,r P(t|Q = .3 p(r|C) = .7 p(o|V) = .1 p(e|V) = .3 p(y|V) = .4 p(r|V) = .2 argmaxxvz p(rry|XYZ) = ? Possible state seq.: (x, V)(V, C)(C, V)[VCV], (x,C)(C,C)(C,V)[CQ/], (x, C)(C, V)(V, V)[CW] ^aveL RychLy . HMM Algorithms: Trellis and Viterbi . March 19,2024 17/22 3aveL RychLy • HMM Algorithms: Trellis and Viterbi • March 19,2024 18/22 Tracking Back the n-best paths Backtracking-style algorithm: ■ Start at the end, in the best of the n states (s6est) ■ Put the other n-1 best nodes/back pointer pairs on stack, except those leading from sbest to the same best-back state. Follow the back "beam" towards the start of the data, spitting out nodes on the way (backwards of course) using always only the best back pointer. At every beam split, push the diverging node/back pointer pairs onto the stack (node/beam width is sufficient!). When you reach the start of data, close the path, and pop the topmost node/back pointer(width) pair from the stack. Repeat until the stack is empty; expand the result tree if necessary. Pruning Sometimes, too many trellis states in a stage: a = .002 criteria: (a) a < threshold (b) Zir < threshold (c) # of states > threshold (get rid of smallest a) ^aveLRychLy . HMM Algorithms: Trellis and Viterbi . March 19,2024 21/22 3aveL RychLy • HMM Algorithms: Trellis and Viterbi • March 19, 2024 22/22