LM Smoothing (The EM Algorithm) PA154 Language Modeling (3.2) Pavel Rychlý pary@fi.muni.cz March 5, 2024 Source: Introduction to Natural Language Processing (600.465) Jan Hajič, CS Dept., Johns Hopkins Univ. www.cs.jhu.edu/~hajic The Zero Problem ■ "Raw" n-gram language model estimate: ■ necessarily, some zeros ■ Imany: trigram model 2.16 x 1014 parameters, data ~109 words ■ which are true 0? ■ optimal situation: even the least frequent trigram would be seen several times, in order to distinguish it's probability vs. other trigrams ■ optimal situation cannot happen, unfortunately (open question: how many data would we need?) ■ ^ we don't know ■ we must eliminate zeros ■ Two kinds of zeros: p(w|h) = 0, or even p(h) = 0! Pavel Rychly • LM Smoothing • March 5, 2024 2/18 Why do we need Nonzero Probs? ■ To avoid infinite Cross Entropy: ■ happens when an event is found in test data which has not been seen in training data H(p) = oo: prevents comparing data with > 0 "errors" ■ To make the system more robust ■ low count estimates: ■ they typically happen for "detailed" but relatively rare appearances ■ high count estimates: reliable but less "detailed" Pavel Rychly • LM Smoothing • March 5, 2024 3/18 Eliminating the Zero Probabilites: Smoothing ■ Get new p'(w) (same ft): almost p(w) but no zeros ■ Discount w for (some) p(w) > 0: new p\w) < p(w) £ (p(w)-p'(w)) = D wediscounted ■ Distribute D to all w; p(w) = 0: new p\w) > p(w) ■ possibly also to other w with low p(w) ■ For some w (possibly): p\w) = p(w) ■ Make sure Y.wetiP\w) =J_ ■ There are many ways of smoothing Pavel Rychly • LM Smoothing • March 5, 2024 4/18 Smoothing by Adding 1 Simplest but not really usable: ■ Predicting words w from a vocabulary V, training data T: p\w\h) = c(h, w) + 1 c(h) + | V\ for non-conditional distributions: pf{w) = CW+1 T\ + \V\ Problem if | V\ > c(h) (as is often the case; even >> c(h)\) Example Training data: what is it what is small? |T| = 8 V = {what, is, it, small, ?, ,flying, birds, are, a, bird,.}, |V| = 12 p(it) = .125, p(what) = .25, p(.)=0 p(what is it?) = .252 x .1252 ^ .001 p(it is flying.) = .125x.25 x 02 = 0 p'(what is it?) = .152 x .12 ^ .0002 p'(it) = .1, p'(what) = .15, P'(-) = -05 p'(it is flying.) = .1 x .15 x .052 ^ .00004 Pavel Rychly • LM Smoothing • March 5, 2024 5/18 Adding less than 1 Equally simple: ■ Predicting word w from a vocabulary V, training data T: c(h, w) + A p'(w\h) = X < 1 c(h) + X\V\' for non-conditional distributions: p/(w) = \t\+\\v\ Example Training data: what is it what is small? |T| = 8 V = {what, is, it, small, ?, ,flying, birds, are, a, bird,.}, |V| = 12 p(it) = .125, p(what) = .25, p(.)=0 p(what is it?) = .252 x .1252 ^ .001 p(it is flying.) = .125x.25 x 02 = 0 Use A = .1 p'(it) ^ .12, p'(what) ^ .23, p'(what is it?) = .232 x .122 ^ .0007 p'(.) ^ -01 p'(it is flying.) = .12x.23 x .012 ^ .000003 Pavel Rychly • LM Smoothing • March 5, 2024 6/18 Good-Turing Suitable for estimation from large data ■ similar idea: discount/boost the relative frequency estimate (c(w) + 1)x/V(c(w) + 1) Pr{w) =-\T\xN(c(w))- where A/(c) is the count of words with count c (count-of-counts) specifically, for c(w) = 0 (unseen words), pr(w) = |t^n(o) ■ good for small counts (< 5-10, where A/(c) is high) ■ normalization! (so that we have ^^(i^) = 1) Pavel Rychly • LM Smoothing • March 5, 2024 7/18 Good-Turing: An Example Remember: pr(w) = _ (c(i/i/)+1)xA/(c(i/i/)+1) T\xN(c(w)) Training data: what is it what is small? |T| = 8 V = {what, is, it, small, ?, ,flying, birds, are, a, bird,.}, |V| = 12 p(it) = .125, p(what) = .25, p(.)=0 p(what is it?) = .252 x .1252 ^ .001 p(it is flying.) = .125x.25 x 02 = 0 ■ Raw estimation (A/(0) = 6, A/(1) = 4, A/(2) = 2, N(i) = 0, for / > 2): pr(it) = (1+1)xN(1+1)/(8xN(1)) = 2x2/(8x4) = .125 pr(what) = (2+1)xN(2+1)/(8xN(2)) = 3x0/(8x2) = 0: keep orig. p(what) pr(.) = (0+1)xN(0+1)/(8xN(0)) = 1 x4/(8x6) ^ .083 ■ Normalize (divide by 1.5 = J2we\v\ Pr(w)) ancl compute: p'(it) = .08, p'(what) ^ .17, p'(.) = .06 p'(what is it?) = .172 x .082 ^ .0002 p'(it is flying.) = .082 x .17 x .062 ^ .00004 Pavel Rychly • LM Smoothing • March 5, 2024 8/18 Smoothing by Combination: Linear Interpolation ■ Combine what? ■ distribution of various level of detail vs. reliability ■ n-gram models: ■ use (n-1)gram, (n-2)gram, uniform —> reliability <— detail ■ Simplest possible combination: - sum of probabilities, normalize: ■ p(0|0) = .8, p(1|0) = .2, p(0|1) = 1, p(1|1) = 0, p(0) = .4, p(1) = .6 ■ p'(0|0) = .6, p'(1|0) = .4, p'(0|1) = .7, p'(1|1) = .3 Pavel Rychly • LM Smoothing • March 5, 2024 9/18 Typical n-gram LM Smoothing ■ Weight in less detailed distributions using A = (A0, X^, A2, A3): p\(Wj\Wj-2, = A3P3(^/|^/-2, + A2fl2(w/|w/-i) + Aipi(wj) + Xq/\V ■ Normalize: A, > 0, J2i=o xi = 1 is sufficient (A0 = 1 - £-=1 A 0 for all j e 0..3 2. Compute "Expected Counts" for eachAy. 3. Compute new set of Ay, using "Next A" formula. 4. Start over at step 2, unless a termination condition is met. ■ Termination condition: convergence of A. - Simply set an s, and finish if |Ay - Xj^extl < £ for each j (step 3). ■ Guaranteed to converge: follows from Jensen's inequality, plus a technical proof. Pavel Rychly • LM Smoothing • March 5, 2024 13/18 Remark on Linear Interpolation Smoothing ■ "Bucketed Smoothing": - use several vectors of A instead of one, based on (the frequency of) history: A(h) ■ e.g. for h = (micrograms,per) we will have - actually: not a separate set for each history, but rather a set for "similar" histories ("bucket"): ^ A(b(h)), where b: V2 N (in the case of trigrams) b classifies histories according to their reliability (-frequency) A(h) = (.999, .0009, .00009, .00001) (because "cubic" is the only word to follow...) Pavel Rychly • LM Smoothing • March 5, 2024 14/18 Bucketed Smoothing: The Algorithm ■ First, determine the bucketing function b (use heldout!): - decide in advance you want e.g. 1000 buckets - compute the total frequency of histories in 1 bucket {fmax{b)) - gradually fill your buckets from the most frequent bigrams so that the sum of frequencies does not exceed fmax{b) (you might end up with slightly more than 1000 buckets) ■ Divide your heldout data according to buckets ■ Apply the previous algorithm to each bucket and its data Pavel Rychly • LM Smoothing • March 5, 2024 15/18 Simple Example ■ Raw distribution (unigram only; smooth with uniform): p(a) = .25, p(b) = .5, p(a) = 1/64 for a e {c.r}, = 0 for the rest: s, t, u, v, w, x, y, z ■ Heldout data: baby; use one set of A (A-i: unigram, A0: uniform) ■ Start with A0 = A-i = .5: p^(ib) = .5x.5 + .5/26 = .27 p^(a) = .5x .25+ .5/26 = .14 P'\{y) = -5 x 0 + .5/26 = .02 c(Ai) = .5x.5/.27 + .5x.25/.14 + .5x.5/.27 + .5x0/.02 = 2.27 c(A0) = .5x.04/.27 + .5x.04/.14 + .5x.04/.27 + .5x.04/.02 = 1.28 Normalize AMexf = .68, X0,next = -32 Repeat from step 2 (recompute pfx first for efficient computation, then c(A/), ...). Finish when new lambdas almost equal to the old ones (say, < 0.01 difference). Pavel Rychly • LM Smoothing • March 5, 2024 16/18 Some More Technical Hints ■ Set V = {all words from training data}. ■ You may also consider V = T u H, but it does not make the coding in any way simpler (in fact, harder). ■ But: you must never use the test data for your vocabulary ■ Prepend two "words" in front of all data: ■ avoids beginning-of-data problems ■ call these index -1 and 0: then the formulas hold exactly ■ When cn(w,h) = 0: ■ Assing 0 probability to pn(w|h) where cn_i (h) > 0, but a uniform probablity (1/|V|) to those pn(w|h) where c„-i(h) = 0 (this must be done both when working on the heldout data during EM, as well as when computing cross-entropy on the test data!) Pavel Rychly • LM Smoothing • March 5, 2024 17/18 Back-off model Combines n-gram models using lower order in not enough information in higher order Pbo(Wj\Wj_n+i ... = C(w/_n+i ...ivmiv/) uWj_n+i aWj-n+i.. ..Wi-,Pbo(Wi\Wi-n+2... otherwise if C(Wj-n+i ...Wj)>k Pavel Rychly • LM Smoothing • March 5, 2024 18/18