LM Smoothing (The EM Algorithm)
PA154 Jazykové modelování (3)
Pavel Rychlý, pary@fi.muni.cz
March 9, 2017
Source: Introduction to Natural Language Processing (600.465), Jan Hajič, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

The Zero Problem
■ "Raw" n-gram language model estimate:
  - necessarily, some zeros
    ► many: a trigram model has 2.16 × 10^14 parameters, data ~10^9 words
  - which are the true zeros?
    ► optimal situation: even the least frequent trigram would be seen several times, in order to distinguish its probability from the other trigrams
    ► the optimal situation cannot happen, unfortunately (open question: how much data would we need?) → we don't know
  - we must eliminate the zeros
■ Two kinds of zeros: p(w|h) = 0, or even p(h) = 0!

Why do we need Nonzero Probs?
■ To avoid infinite cross entropy:
  - happens when an event is found in test data which has not been seen in training data: H(p) = ∞ prevents comparing data with > 0 "errors"
■ To make the system more robust:
  - low count estimates: they typically happen for "detailed" but relatively rare appearances
  - high count estimates: reliable but less "detailed"

Eliminating the Zero Probabilities: Smoothing
■ Get a new p'(w) (same Ω): almost p(w), but no zeros
■ Discount w for (some) p(w) > 0: new p'(w) < p(w)
    Σ_{w ∈ discounted} (p(w) − p'(w)) = D
■ Distribute D to all w with p(w) = 0: new p'(w) > p(w)
  - possibly also to other w with low p(w)
■ For some w (possibly): p'(w) = p(w)
■ Make sure Σ_{w ∈ Ω} p'(w) = 1
■ There are many ways of smoothing

Smoothing by Adding 1
■ Simplest but not really usable:
  - predicting words w from a vocabulary V, training data T:
      p'(w|h) = (c(h,w) + 1) / (c(h) + |V|)
    ► for non-conditional distributions: p'(w) = (c(w) + 1) / (|T| + |V|)
  - problem if |V| > c(h) (as is often the case; even >> c(h)!)
■ Example: Training data: <s> what is it what is small?   |T| = 8
  V = {what, is, it, small, ?, <s>, flying, birds, are, a, bird, .}, |V| = 12
  p(it) = .125, p(what) = .25, p(.) = 0
  p(what is it?) = .25² × .125² ≈ .001
  p(it is flying.) = .125 × .25 × 0² = 0
  p'(it) = .1, p'(what) = .15, p'(.) = .05
  p'(what is it?) = .15² × .1² ≈ .0002
  p'(it is flying.) = .1 × .15 × .05² ≈ .00004

Adding less than 1
■ Equally simple:
  - predicting a word w from a vocabulary V, training data T:
      p'(w|h) = (c(h,w) + λ) / (c(h) + λ|V|),   λ < 1
    ► for non-conditional distributions: p'(w) = (c(w) + λ) / (|T| + λ|V|)
■ Example: Training data: <s> what is it what is small?   |T| = 8
  V = {what, is, it, small, ?, <s>, flying, birds, are, a, bird, .}, |V| = 12
  p(it) = .125, p(what) = .25, p(.) = 0
  p(what is it?) = .25² × .125² ≈ .001
  p(it is flying.) = .125 × .25 × 0² = 0
  Use λ = .1:
  p'(it) ≈ .12, p'(what) ≈ .23, p'(.) ≈ .01
  p'(what is it?) = .23² × .12² ≈ .0007
  p'(it is flying.) = .12 × .23 × .01² ≈ .000003

Good-Turing
■ Suitable for estimation from large data:
  - similar idea: discount/boost the relative frequency estimate:
      p_r(w) = (c(w) + 1) × N(c(w) + 1) / (|T| × N(c(w)))
    where N(c) is the count of words with count c (count of counts)
  - specifically, for c(w) = 0 (unseen words): p_r(w) = N(1) / (|T| × N(0))
■ Example: Training data: <s> what is it what is small?   |T| = 8
  V = {what, is, it, small, ?, <s>, flying, birds, are, a, bird, .}, |V| = 12
  p(it) = .125, p(what) = .25, p(.) = 0
  p(what is it?) = .25² × .125² ≈ .001
  p(it is flying.) = .125 × .25 × 0² = 0
■ Raw estimation (N(0) = 6, N(1) = 4, N(2) = 2, N(i) = 0 for i > 2):
  p_r(it) = (1+1) × N(2) / (8 × N(1)) = 2 × 2 / (8 × 4) = .125
  p_r(what) = (2+1) × N(3) / (8 × N(2)) = 3 × 0 / (8 × 2) = 0: keep the original p(what)
  p_r(.) = (0+1) × N(1) / (8 × N(0)) = 1 × 4 / (8 × 6) ≈ .083
■ Normalize (divide by 1.5 = Σ_{w ∈ V} p_r(w)) and compute:
  p'(it) ≈ .08, p'(what) ≈ .17, p'(.) ≈ .06
  p'(what is it?) = .17² × .08² ≈ .0002
  p'(it is flying.) = .08 × .17 × .06² ≈ .00004
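The add-1 and add-λ estimates from the slides above are easy to check mechanically. Below is a minimal Python sketch; the function name add_lambda_unigram and the explicit token lists are illustrative only, assuming the toy training data and vocabulary shown on the slides (with <s> as the eighth training token).

```python
from collections import Counter

def add_lambda_unigram(train_tokens, vocab, lam=1.0):
    """Non-conditional add-lambda estimate: p'(w) = (c(w) + lam) / (|T| + lam*|V|)."""
    counts = Counter(train_tokens)
    T = len(train_tokens)
    V = len(vocab)
    return {w: (counts[w] + lam) / (T + lam * V) for w in vocab}

# Toy data from the slides (assumed tokenization): |T| = 8, |V| = 12
train = ["<s>", "what", "is", "it", "what", "is", "small", "?"]
vocab = ["what", "is", "it", "small", "?", "<s>",
         "flying", "birds", "are", "a", "bird", "."]

p_add1 = add_lambda_unigram(train, vocab, lam=1.0)   # adding 1
p_add01 = add_lambda_unigram(train, vocab, lam=0.1)  # adding less than 1

print(round(p_add1["it"], 3), round(p_add1["what"], 3), round(p_add1["."], 3))
# -> 0.1 0.15 0.05 (the add-1 slide)
print(round(p_add01["it"], 2), round(p_add01["what"], 2), round(p_add01["."], 2))
# -> 0.12 0.23 0.01 (the add-lambda slide with lambda = .1)
```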
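The Good-Turing estimate on the same toy data can be sketched the same way; good_turing_unigram is an illustrative name, and the fallback to the raw relative frequency when N(c+1) = 0 mirrors the "keep the original p(what)" step on the slide above.

```python
from collections import Counter

def good_turing_unigram(train_tokens, vocab):
    """Good-Turing: p_r(w) = (c(w)+1) * N(c(w)+1) / (|T| * N(c(w))),
    keeping the raw relative frequency where N(c(w)+1) = 0, then renormalizing."""
    counts = Counter(train_tokens)
    T = len(train_tokens)
    # count-of-counts; N(0) = number of vocabulary words unseen in training
    N = Counter(counts[w] for w in vocab)
    p_r = {}
    for w in vocab:
        c = counts[w]
        if N[c + 1] > 0:
            p_r[w] = (c + 1) * N[c + 1] / (T * N[c])
        else:
            p_r[w] = c / T      # keep the original estimate (as done for "what")
    norm = sum(p_r.values())    # 1.5 on the toy data
    return {w: p / norm for w, p in p_r.items()}

train = ["<s>", "what", "is", "it", "what", "is", "small", "?"]
vocab = ["what", "is", "it", "small", "?", "<s>",
         "flying", "birds", "are", "a", "bird", "."]
p = good_turing_unigram(train, vocab)
print(round(p["it"], 2), round(p["what"], 2), round(p["."], 2))
# -> 0.08 0.17 0.06 (matches the slide)
```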
Smoothing by Combination: Linear Interpolation
■ Combine what?
  ► distributions of various levels of detail vs. reliability
■ n-gram models:
  ► use (n−1)-gram, (n−2)-gram, ..., uniform → reliability increases, detail decreases
■ Simplest possible combination:
  - sum of probabilities, normalize:
    ► p(0|0) = .8, p(1|0) = .2, p(0|1) = 1, p(1|1) = 0, p(0) = .4, p(1) = .6
    ► p'(0|0) = .6, p'(1|0) = .4, p'(0|1) = .7, p'(1|1) = .3

Typical n-gram LM Smoothing
■ Weight in less detailed distributions using λ = (λ0, λ1, λ2, λ3):
    p'_λ(w_i | w_{i−2}, w_{i−1}) = λ3 p3(w_i | w_{i−2}, w_{i−1}) + λ2 p2(w_i | w_{i−1}) + λ1 p1(w_i) + λ0 / |V|
■ Normalize: λ_i ≥ 0, Σ_{i=0}^{3} λ_i = 1 is sufficient (λ0 = 1 − Σ_{i=1}^{3} λ_i) (n = 3)
■ Estimation using MLE:
  - fix the p3, p2, p1 and |V| parameters as estimated from the training data
  - then find such {λ_i} which minimizes the cross entropy (maximizes the probability of the data): −(1/|D|) Σ_{i=1}^{|D|} log2(p'_λ(w_i | h_i))

Held-out Data
■ What data to use?
  - try the training data T: but we will always get λ3 = 1
    ► why? let p_iT be an i-gram distribution estimated using relative frequencies from T
    ► minimizing H_T(p'_λ) over a vector λ, p'_λ = λ3 p3T + λ2 p2T + λ1 p1T + λ0 / |V|
      - remember: H_T(p'_λ) = H(p3T) + D(p3T || p'_λ); p3T is fixed, so H(p3T) is fixed (and best)
      - which p'_λ minimizes H_T(p'_λ)? Obviously, a p'_λ for which D(p3T || p'_λ) = 0
      - ...and that is p3T (because D(p || p) = 0, as we know)
      - ...and certainly p'_λ = p3T if λ3 = 1 (maybe in some other cases, too)
      - (p'_λ = 1 × p3T + 0 × p2T + 0 × p1T + 0 / |V|)
  - thus: do not use the training data for the estimation of λ!
    ► we must hold out part of the training data (heldout data, H)
    ► ...call the remaining data the (true/raw) training data, T
    ► the test data S (e.g., for comparison purposes): still different data!

The Formulas
■ Repeat: minimizing −(1/|H|) Σ_{i=1}^{|H|} log2(p'_λ(w_i | h_i)) over λ:
    p'_λ(w_i | h_i) = p'_λ(w_i | w_{i−2}, w_{i−1}) = λ3 p3(w_i | w_{i−2}, w_{i−1}) + λ2 p2(w_i | w_{i−1}) + λ1 p1(w_i) + λ0 / |V|
■ "Expected counts of lambdas", j = 0..3:
    c(λ_j) = Σ_{i=1}^{|H|} λ_j p_j(w_i | h_i) / p'_λ(w_i | h_i)   (with p_0(w_i | h_i) = 1/|V|)
■ "Next λ", j = 0..3:
    λ_{j,next} = c(λ_j) / Σ_{k=0}^{3} c(λ_k)

The (Smoothing) EM Algorithm
1. Start with some λ such that λ_j > 0 for all j ∈ 0..3.
2. Compute the "expected counts" for each λ_j.
3. Compute a new set of λ_j, using the "next λ" formula.
4. Start over at step 2, unless a termination condition is met.
■ Termination condition: convergence of λ.
  - Simply set an ε and finish if |λ_j − λ_{j,next}| < ε for each j (step 3).
■ Guaranteed to converge: follows from Jensen's inequality, plus a technical proof.
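A compact sketch of steps 1-4 for the trigram case follows. It assumes p3, p2, p1 are callables already fixed on the raw training data and that the heldout data comes as (word, history) pairs; the function smoothing_em and its signature are illustrative, not part of the lecture.

```python
def smoothing_em(heldout, p3, p2, p1, vocab_size,
                 lambdas=(0.25, 0.25, 0.25, 0.25), eps=1e-3):
    """EM for linear-interpolation weights (lambda3 .. lambda0).

    heldout    : list of (w, h) pairs, h = (w_{i-2}, w_{i-1})
    p3, p2, p1 : callables p3(w, h), p2(w, w_prev), p1(w), fixed on training data
    vocab_size : |V|, used for the uniform component lambda0 / |V|
    """
    l3, l2, l1, l0 = lambdas
    while True:
        # step 2: expected counts c(lambda_j), accumulated over the heldout data
        c = [0.0, 0.0, 0.0, 0.0]                 # order: c(l0), c(l1), c(l2), c(l3)
        for w, h in heldout:
            parts = (l0 / vocab_size,
                     l1 * p1(w),
                     l2 * p2(w, h[1]),
                     l3 * p3(w, h))
            p_interp = sum(parts)                # p'_lambda(w | h)
            for j in range(4):
                c[j] += parts[j] / p_interp
        # step 3: "next lambda" = normalized expected counts
        total = sum(c)
        new = [cj / total for cj in c]
        # step 4: stop when every lambda has moved by less than eps
        if all(abs(n - o) < eps for n, o in zip(new, (l0, l1, l2, l3))):
            return new[3], new[2], new[1], new[0]    # lambda3, lambda2, lambda1, lambda0
        l0, l1, l2, l3 = new
```

Keeping the four weighted components in one tuple per token means p'_λ and the expected-count contributions come out of a single pass, which is also what the "recompute p'_λ first for efficient computation" hint on the Simple Example slide suggests.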
Remark on Linear Interpolation Smoothing
■ "Bucketed" smoothing:
  - use several vectors of λ instead of one, based on (the frequency of) the history: λ(h)
    ► e.g. for h = (micrograms, per) we will have λ(h) = (.999, .0009, .00009, .00001) (because "cubic" is the only word to follow...)
  - actually: not a separate set for each history, but rather a set for "similar" histories ("buckets"):
      λ(b(h)), where b: V² → N (in the case of trigrams)
    b classifies histories according to their reliability (~ frequency)

Bucketed Smoothing: The Algorithm
■ First, determine the bucketing function b (use the heldout data!):
  - decide in advance that you want e.g. 1000 buckets
  - compute the total frequency of histories in one bucket (f_max(b))
  - gradually fill your buckets from the most frequent bigrams so that the sum of frequencies does not exceed f_max(b) (you might end up with slightly more than 1000 buckets)
■ Divide your heldout data according to the buckets
■ Apply the previous algorithm to each bucket and its data

Simple Example
■ Raw distribution (unigram only; smooth with uniform):
    p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, p(α) = 0 for the rest: s, t, u, v, w, x, y, z
■ Heldout data: baby; use one set of λ (λ1: unigram, λ0: uniform)
■ Start with λ1 = λ0 = .5:
    p'_λ(b) = .5 × .5 + .5/26 ≈ .27
    p'_λ(a) = .5 × .25 + .5/26 ≈ .14
    p'_λ(y) = .5 × 0 + .5/26 ≈ .02
    c(λ1) = .5 × .5/.27 + .5 × .25/.14 + .5 × .5/.27 + .5 × 0/.02 ≈ 2.72
    c(λ0) = .5 × .04/.27 + .5 × .04/.14 + .5 × .04/.27 + .5 × .04/.02 ≈ 1.28
■ Normalize: λ1,next ≈ .68, λ0,next ≈ .32
■ Repeat from step 2 (recompute p'_λ first for efficient computation, then the c(λ_i), ...). Finish when the new lambdas are almost equal to the old ones (say, < 0.01 difference).

Some More Technical Hints
■ Set V = {all words from the training data}.
  ► You may also consider V = T ∪ H, but it does not make the coding in any way simpler (in fact, harder).
  ► But: you must never use the test data for your vocabulary!
■ Prepend two "words" in front of all data:
  ► avoids beginning-of-data problems
  ► call these indices −1 and 0: then the formulas hold exactly
■ When c_n(w,h) = 0:
  ► Assign 0 probability to p_n(w|h) where c_{n−1}(h) > 0, but a uniform probability (1/|V|) to those p_n(w|h) where c_{n−1}(h) = 0 (this must be done both when working on the heldout data during EM and when computing the cross-entropy on the test data!)
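To tie the algorithm to the Simple Example slide, the same EM loop can be specialized to two weights (unigram plus a uniform distribution over 26 letters) and run on the heldout word "baby". This is an illustrative sketch, not the lecture's code; the first-iteration printout should reproduce the slide's numbers (≈ 2.72, 1.28, .68, .32).

```python
# Unigram distribution from the Simple Example slide
def p1(w):
    if w == "b":
        return 0.5
    if w == "a":
        return 0.25
    if w in "cdefghijklmnopqr":
        return 1 / 64
    return 0.0                  # s..z get zero probability

UNIFORM = 1 / 26
heldout = list("baby")

l1, l0 = 0.5, 0.5               # start with lambda1 = lambda0 = .5
for step in range(100):
    c1 = c0 = 0.0
    for w in heldout:
        p_interp = l1 * p1(w) + l0 * UNIFORM       # p'_lambda(w)
        c1 += l1 * p1(w) / p_interp                # expected count of lambda1
        c0 += l0 * UNIFORM / p_interp              # expected count of lambda0
    new1, new0 = c1 / (c1 + c0), c0 / (c1 + c0)    # "next lambda"
    if step == 0:
        print(round(c1, 2), round(c0, 2), round(new1, 2), round(new0, 2))
        # -> 2.72 1.28 0.68 0.32
    if abs(new1 - l1) < 0.01 and abs(new0 - l0) < 0.01:
        break                   # lambdas almost equal to the old ones
    l1, l0 = new1, new0
```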