MUNI FI
HMM Parameter Estimation: the Baum-Welch Algorithm
PA154 Language Modeling (5.1)
Pavel Rychlý
pary@fi.muni.cz
March 16, 2023

HMM: The Tasks

HMM (the general case):
■ a five-tuple (S, s0, Y, PS, PY), where:
  ■ S = {s1, s2, ..., sT} is the set of states, s0 is the initial state,
  ■ Y = {y1, y2, ..., yV} is the output alphabet,
  ■ PS(sj|si) is the set of probability distributions of transitions,
  ■ PY(yk|si, sj) is the set of output (emission) probability distributions.
Given an HMM and an output sequence Y = {y1, y2, ..., yk}:
■ (Task 1) compute the probability of Y;
■ (Task 2) compute the most likely sequence of states that has generated Y;
■ (Task 3) estimate the parameters (transition/output distributions).

Source: Introduction to Natural Language Processing (600.465), Jan Hajic, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

A Variant of Expectation-Maximization

Idea (~EM; for another variant, see LM smoothing, lecture 2.2):
■ Start with (possibly random) estimates of PS and PY.
■ Compute (fractional) "counts" of the state transitions/emissions taken, from PS and PY, given the data Y.
■ Adjust the estimates of PS and PY from these "counts" (using MLE, i.e. relative frequency as the estimate).
Remarks:
■ many more parameters than the simple four-way smoothing
■ no proofs here; see Jelinek, Chapter 9

Setting

HMM (without PS, PY): (S, s0, Y), and data T = {y_i ∈ Y}, i = 1..|T|
■ will use T ~ |T|
HMM structure is given: (S, s0)
PS: typically, one wants to allow a "fully connected" graph
■ i.e. no transitions forbidden ~ no transitions set to a hard 0
■ why? → better to leave it to the learning phase, based on the data!
■ sometimes it is possible to remove some transitions ahead of time
PY: should be restricted (if not, we will not get anywhere!)
■ restricted ~ hard 0 probabilities of p(y|s,s')
■ "dictionary": states ↔ words, an "m:n" mapping on S × Y (in general)

Initialization

Needed for computing the initial expected "counts"; an important part:
■ EM is guaranteed to find only a local maximum (albeit a good one in most cases)
PY initialization is the more important one:
■ fortunately, it is often easy to determine
■ together with the dictionary ↔ vocabulary mapping, get counts, then MLE
PS initialization is less important:
■ e.g. a uniform distribution for each p(·|s)

Data Structures

Will need storage for:
■ the predetermined structure of the HMM (unless fully connected → no need to keep it!)
■ the parameters to be estimated (PS, PY)
■ the expected counts (same size as (PS, PY))
■ the training data T = {y_i ∈ Y}, i = 1..T
■ the trellis (if fully connected), of size T × S (precisely, |T| × |S|); each trellis state holds two [float] numbers (forward/backward probability), and then some
[Figure: the trellis, a grid of T stages × S states]

The Algorithm, Part I

1. Initialize PS, PY.
2. Compute "forward" probabilities:
■ follow the trellis procedure (summing), computing α(s, i) everywhere
■ use the current values of PS, PY (p(s'|s), p(y|s,s')):
  α(s', i) = Σ_{s→s'} α(s, i-1) × p(s'|s) × p(y_i|s, s')
■ NB: do not throw away the previous stage!
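The forward recursion above drops straight into code. Below is a minimal sketch of Step 2, assuming the fully connected case with arc emissions p(y|s,s'); the data layout and all names (trans[s][s2] ~ p(s2|s), emit[s][s2][y] ~ p(y|s,s2)) are illustrative choices for this sketch, not notation fixed by the lecture.

```python
def forward(states, s0, trans, emit, data):
    """Return alpha, where alpha[i][s] = P(y_1..y_i, state s at stage i)."""
    # Stage 0: all probability mass sits in the initial state s0.
    alpha = [{s: (1.0 if s == s0 else 0.0) for s in states}]
    for i, y in enumerate(data, start=1):
        stage = {}
        for s2 in states:
            # Sum (not maximize) over all predecessor states s -> s2,
            # multiplying in the transition and emission probabilities.
            stage[s2] = sum(alpha[i - 1][s] * trans[s][s2] * emit[s][s2].get(y, 0.0)
                            for s in states)
        alpha.append(stage)  # keep every stage, per the NB above
    return alpha
```

Keeping the whole list of stages rather than overwriting it is exactly the "do not throw away the previous stage" point: the counting step below needs α(s, i) at every position.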
The Algorithm, Part II

3. Compute "backward" probabilities:
■ start at all nodes of the last stage and proceed backwards, computing β(s, i)
■ i.e., the probability of the "tail" of the data, from stage i to the end:
  β(s', i) = Σ_{s'→s} β(s, i+1) × p(s|s') × p(y_{i+1}|s', s)
■ also keep the β(s, i) at all trellis states
4. Collect counts:
■ for each output/transition pair, compute
  c(y, s, s') = Σ_{i: y_{i+1}=y} α(s, i) × p(s'|s) × p(y_{i+1}|s, s') × β(s', i+1)
  (α(s, i) is the prefix probability, p(s'|s) this transition's probability, p(y_{i+1}|s, s') the output probability, and β(s', i+1) the tail probability; one pass through the data, stopping only at positions whose output is y)
■ c(s, s') = Σ_{y∈Y} c(y, s, s') (assuming all observed y are in Y)
■ c(s) = Σ_{s'∈S} c(s, s')
5. Reestimate:
  p'(s'|s) = c(s, s') / c(s)
  p'(y|s, s') = c(y, s, s') / c(s, s')
6. Repeat steps 2-5 until the desired convergence limit is reached.

Baum-Welch: Tips & Tricks

Normalization is badly needed:
■ long training data → extremely small probabilities
Normalize α, β using the same normalization factor, as follows:
■ compute α(s, i) as usual (Step 2 of the algorithm), computing the sum N(i) at the given stage i as you go
■ at the end of each stage, recompute all the α's (for each state s):
  α*(s, i) = α(s, i) / N(i)
■ use the same N(i) for the β's at the end of each backward (Step 3) stage:
  β*(s, i) = β(s, i) / N(i)

Example

Task: pronunciation of "the"
Solution: build an HMM, fully connected, with 4 states:
■ S - short article, L - long article, C, V - word starting with a consonant / a vowel
■ thus, only "the" is ambiguous (a, an, the - not members of C, V)
■ output from states only: p(w|s, s') = p(w|s')
Data: an egg and a piece of the big .... the end
[Figure: the trellis over the data, one column per word]

Example: Initialization

Output probabilities:
■ p_init(w|c) = c(c, w) / c(c), where c(S, the) = c(L, the) = c(the)/2 (other than that, everything is deterministic)
Transition probabilities:
■ p_init(c'|c) = 1/4 (uniform)

Fill in alpha, beta

Don't forget about the space needed:
■ initialize α(X, 0) = 1 (X: the never-occurring "front buffer" state)
■ initialize β(s, T) = 1 for all s (except for s = X)
Left to right, alpha:
  α(s', i) = Σ_{s→s'} α(s, i-1) × p(s'|s) × p(w_i|s'), where s' is the state emitting w_i (output from states)
Remember the normalization (N(i)). Similarly beta (on the way back from the end).

Counting

an egg and a piece of the big ....
■ inc = α(s_i, i) × p(s_{i+1}|s_i) × p(y_{i+1}|s_{i+1}) × β(s_{i+1}, i+1)
■ c(y, s_i, s_{i+1}) += inc (for the y at position i+1)
■ c(s_i, s_{i+1}) += inc (always)
■ c(s_i) += inc (always)
For example, at position 7 ("the" → "big"):
  inc(big, L, C) = α(L, 7) × p(C|L) × p(big|C) × β(C, 8)
  inc(big, S, C) = α(S, 7) × p(C|S) × p(big|C) × β(C, 8)
Then reestimate p(s'|s), p(y|s)
■ and hope for an increase in p(C|S) and p(V|L)...!!

Final Remarks

Parameter "tying":
■ keep certain parameters the same (~ just one "counter" for all of them)
■ any combination is possible in principle
■ example: smoothing (just one set of lambdas)
Real-number outputs:
■ Y of infinite size (R, R^n)
■ a parametric distribution (with typically few parameters) is needed (e.g., a "Gaussian")
"Empty" transitions: do not generate output
■ ~ vertical arcs in the trellis; do not use them in "counting"
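To tie Steps 3-5 together, here is a sketch of one full Baum-Welch iteration under the same illustrative data layout as the forward sketch above (and reusing its forward function). It collects the expected counts c(y,s,s'), c(s,s'), c(s) and reestimates by relative frequency; for readability it omits the N(i) scaling from the Tips & Tricks slide, which any realistically long training text would require.

```python
from collections import defaultdict

def backward(states, trans, emit, data):
    """Return beta, where beta[i][s] = P(y_{i+1}..y_T | state s at stage i)."""
    T = len(data)
    beta = [dict() for _ in range(T + 1)]
    beta[T] = {s: 1.0 for s in states}  # an empty tail has probability 1
    for i in range(T - 1, -1, -1):
        y = data[i]  # y_{i+1} in the slides' 1-based notation
        for s in states:
            beta[i][s] = sum(trans[s][s2] * emit[s][s2].get(y, 0.0) * beta[i + 1][s2]
                             for s2 in states)
    return beta

def em_step(states, s0, trans, emit, data):
    """One EM iteration: collect expected counts, then reestimate by MLE."""
    alpha = forward(states, s0, trans, emit, data)
    beta = backward(states, trans, emit, data)
    c_y, c_ss, c_s = defaultdict(float), defaultdict(float), defaultdict(float)
    for i, y in enumerate(data):  # the transition from stage i to i+1 emits y
        for s in states:
            for s2 in states:
                inc = (alpha[i][s] * trans[s][s2]
                       * emit[s][s2].get(y, 0.0) * beta[i + 1][s2])
                c_y[y, s, s2] += inc  # c(y,s,s')
                c_ss[s, s2] += inc    # c(s,s')
                c_s[s] += inc         # c(s)
    # MLE reestimates; a state/transition with zero count keeps its old value.
    new_trans = {s: {s2: (c_ss[s, s2] / c_s[s]) if c_s[s] > 0 else trans[s][s2]
                     for s2 in states} for s in states}
    new_emit = {s: {s2: ({y: c_y[y, s, s2] / c_ss[s, s2] for y in set(data)}
                         if c_ss[s, s2] > 0 else emit[s][s2])
                    for s2 in states} for s in states}
    return new_trans, new_emit
```

Repeating em_step until the data likelihood (Σ_s α(s, T), available from the forward pass) stops improving is Step 6 of the algorithm.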