HMM Parameter Estimation: the Baum-Welch algorithm
PA154 Jazykové modelování (6.1)
Pavel Rychlý
pary@fi.muni.cz
March 23, 2020

HMM: The Tasks

HMM (the general case):
► five-tuple (S, S0, Y, PS, PY), where:
  ► S = {s1, s2, ..., sT} is the set of states, S0 is the initial state,
  ► Y = {y1, y2, ..., yV} is the output alphabet,
  ► PS(sj|si) is the set of probability distributions of transitions,
  ► PY(yk|si, sj) is the set of output (emission) probability distributions.
Given an HMM & an output sequence Y = {y1, y2, ..., yk}:
► (Task 1) compute the probability of Y;
► (Task 2) compute the most likely sequence of states which has generated Y;
► (Task 3) estimate the parameters (transition/output distributions).

Source: Introduction to Natural Language Processing (600.465), Jan Hajič, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

A variant of Expectation-Maximization

Idea (~EM, for another variant see LM smoothing (lect. 3)):
► Start with (possibly random) estimates of PS and PY.
► Compute (fractional) "counts" of state transitions/emissions taken, from PS and PY, given the data Y.
► Adjust the estimates of PS and PY from these "counts" (using MLE, i.e. relative frequency as the estimate).
Remarks:
► many more parameters than the simple four-way smoothing
► no proofs here; see Jelinek, Chapter 9

Setting

HMM (without PS, PY): (S, S0, Y), and data T = {yi ∈ Y}, i = 1..|T|
► will use T ~ |T|
HMM structure is given: (S, S0)
PS: typically, one wants to allow a "fully connected" graph
► (i.e. no transitions forbidden ~ no transitions set to hard 0)
► why? better to leave it to the learning phase, based on the data!
► sometimes possible to remove some transitions ahead of time
PY: should be restricted (if not, we will not get anywhere!)
► restricted ~ hard 0 probabilities of p(y|s,s')
► "Dictionary": states ↔ words, "m:n" mapping on S × Y (in general)

Initialization

For computing the initial expected "counts".
Important part:
► EM is guaranteed to find a local maximum only (albeit a good one in most cases)
PY initialization more important
► fortunately, often easy to determine
► together with the dictionary ↔ vocabulary mapping, get counts, then MLE
PS initialization less important
► e.g. uniform distribution for each p(·|s)

Data structures

Will need storage for:
► the predetermined structure of the HMM (unless fully connected: then need not keep it!)
► the parameters to be estimated (PS, PY)
► the expected counts (same size as (PS, PY))
► the training data T = {yi ∈ Y}, i = 1..T
► the trellis (if fully connected):
  ► each trellis state: two [float] numbers (forward/backward probability)
  ► size: T × S (precisely, |T| × |S|), and then some
[Trellis figure]

The Algorithm: Part I

1. Initialize PS, PY.
2. Compute "forward" probabilities:
► follow the procedure for the trellis (summing), compute α(s, i) everywhere
► use the current values of PS, PY (p(s'|s), p(y|s,s')):
  α(s', i) = Σ_{s→s'} α(s, i-1) × p(s'|s) × p(yi|s,s')
► NB: do not throw away the previous stage!
3. Compute "backward" probabilities:
► start at all nodes of the last stage, proceed backwards, β(s, i)
► i.e. the probability of the "tail" of the data, from stage i to the end
  β(s', i) = Σ_{s'←s} β(s, i+1) × p(s|s') × p(y_{i+1}|s',s)
► also keep the β(s, i) at all trellis states

The Algorithm: Part II

4. Collect counts:
► for each output/transition pair compute
  c(y, s, s') = Σ_{i=0..k-1, y_{i+1}=y} α(s, i) × p(s'|s) × p(y_{i+1}|s,s') × β(s', i+1)
  (prefix probability × transition probability × output probability × tail probability)
► one pass through the data, stopping only at (output) y
► c(s, s') = Σ_{y∈Y} c(y, s, s') (assuming all observed yi are in Y)
► c(s) = Σ_{s'∈S} c(s, s')
5. Reestimate:
► p'(s'|s) = c(s, s')/c(s)
► p'(y|s,s') = c(y, s, s')/c(s, s')
6. Repeat steps 2-5 until the desired convergence limit is reached.
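The steps above translate fairly directly into code. Below is a minimal, unoptimized Python sketch of one EM iteration (forward pass, backward pass, count collection, reestimation) for an HMM with outputs on transitions, p(y|s,s'), as in the slides. The function name, array layout, and the use of NumPy are my own assumptions, not part of the lecture; no per-stage normalization is applied (see the Tips & Tricks slide below), so this version is only suitable for short data.

```python
import numpy as np

def baum_welch_step(obs, p_trans, p_emit, init_state=0):
    """One EM iteration for an HMM with output on transitions, p(y|s,s').

    obs        : observation sequence as symbol indices y_1 .. y_k
    p_trans    : array (S, S),    p_trans[s, s2]   = p(s2|s)
    p_emit     : array (S, S, V), p_emit[s, s2, y] = p(y|s, s2)
    init_state : index of the initial state S0
    Returns the re-estimated (p_trans, p_emit).
    """
    S = p_trans.shape[0]
    k = len(obs)

    # Step 2: forward probabilities alpha(s, i), i = 0..k
    alpha = np.zeros((k + 1, S))
    alpha[0, init_state] = 1.0                     # alpha(S0, 0) = 1
    for i in range(1, k + 1):
        step = p_trans * p_emit[:, :, obs[i - 1]]  # [s, s2] = p(s2|s) p(y_i|s,s2)
        alpha[i] = alpha[i - 1] @ step

    # Step 3: backward probabilities beta(s, i) -- the "tail" probability
    beta = np.zeros((k + 1, S))
    beta[k, :] = 1.0                               # beta(s, k) = 1 for all s
    for i in range(k - 1, -1, -1):
        step = p_trans * p_emit[:, :, obs[i]]      # emits y_{i+1}
        beta[i] = step @ beta[i + 1]

    # Step 4: expected counts c(y, s, s'); the constant factor 1/P(Y) is
    # omitted because it cancels in the ratios of Step 5.
    c_yss = np.zeros_like(p_emit)
    for i in range(k):                             # one pass through the data
        y = obs[i]
        c_yss[:, :, y] += (alpha[i][:, None] * p_trans * p_emit[:, :, y]
                           * beta[i + 1][None, :])
    c_ss = c_yss.sum(axis=2)                       # c(s, s')
    c_s = c_ss.sum(axis=1)                         # c(s)

    # Step 5: re-estimate by relative frequency (MLE from fractional counts)
    new_trans = c_ss / np.where(c_s > 0, c_s, 1.0)[:, None]
    new_emit = c_yss / np.where(c_ss > 0, c_ss, 1.0)[:, :, None]
    return new_trans, new_emit
```

Step 6 of the algorithm then simply calls this function repeatedly, feeding each iteration's output back in as the new parameters, until the estimates stop changing appreciably.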
Baum-Welch: Tips & Tricks

Normalization badly needed
► long training data → extremely small probabilities
Normalize α, β using the same normalization factor
  N(i) = Σ_{s∈S} α(s, i)
as follows:
► compute α(s, i) as usual (Step 2 of the algorithm), accumulating the sum N(i) at the given stage i as you go;
► at the end of each stage, recompute all alphas (for each state s):
  α*(s, i) = α(s, i)/N(i)
► use the same N(i) for the βs at the end of each backward (Step 3) stage:
  β*(s, i) = β(s, i)/N(i)

Example

Task: pronunciation of "the"
Solution: build an HMM, fully connected, 4 states:
► S - short article, L - long article, C, V - word starting with a consonant, vowel
► thus, only "the" is ambiguous (a, an, the - not members of C, V)
► output from states only (p(w|s,s') = p(w|s'))
Data: an egg and a piece of the big .... the end
[Trellis figure]

Example: Initialization

Output probabilities:
► pinit(w|c) = c(c, w)/c(c), where c(S, the) = c(L, the) = c(the)/2 (other than that, everything is deterministic)
Transition probabilities:
► pinit(c'|c) = 1/4 (uniform)
Don't forget:
► about the space needed
► initialize α(X, 0) = 1 (X: the never-occurring front buffer state)
► initialize β(s, T) = 1 for all s (except for s = X)

Fill in alpha, beta

Left to right, alpha:
  α(s', i) = Σ_{s→s'} α(s, i-1) × p(s'|s) × p(wi|s'), where the output is from state s'
Remember the normalization (N(i)).
Similarly, beta (on the way back from the end).
Data: an egg and a piece of the big .... the end
[Trellis figure, annotated with p(S,7), p(S|V), p(the|S)]
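As a sketch of how the example's initialization could look in code: only the half-split of the "the" count and the uniform 1/4 transitions come from the slide; the state labels, the assignment of "a"/"an" to the short/long article states, and the helper function itself are illustrative assumptions.

```python
from collections import Counter

def init_the_example(tokens, vowels="aeiou"):
    """Initial estimates for the "the" example: 4 fully connected states.

    S = short article, L = long article, C/V = word starting with a
    consonant/vowel.  Only "the" is ambiguous; its count is split evenly
    between S and L.  Transitions start out uniform (1/4).
    """
    states = ["S", "L", "C", "V"]

    # Transition probabilities: p_init(c2|c) = 1/4 (uniform)
    p_trans = {s: {s2: 0.25 for s2 in states} for s in states}

    # Emission counts c(state, word); everything except "the" is deterministic
    counts = {s: Counter() for s in states}
    for w, n in Counter(tokens).items():
        if w == "the":
            counts["S"][w] += n / 2.0   # c(S, the) = c(L, the) = c(the)/2
            counts["L"][w] += n / 2.0
        elif w == "a":                  # assumption: "a" goes to the short-article state
            counts["S"][w] += n
        elif w == "an":                 # assumption: "an" goes to the long-article state
            counts["L"][w] += n
        elif w[0] in vowels:
            counts["V"][w] += n
        else:
            counts["C"][w] += n

    # MLE output probabilities p_init(w|state) = c(state, w) / c(state)
    p_emit = {}
    for s in states:
        total = sum(counts[s].values())
        p_emit[s] = {w: c / total for w, c in counts[s].items()} if total else {}
    return states, p_trans, p_emit
```

With these initial estimates in hand, the trellis is filled exactly as on the last slide: α(X, 0) = 1 for the front buffer state, β(s, T) = 1 at the last stage, and the forward/backward recursions (with per-stage normalization by N(i)) supply the fractional counts for reestimation.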