Language Modeling (and the Noisy Channel)
PA154 Language Modeling (3.1)
Pavel Rychlý, pary@fi.muni.cz, March 5, 2024
Source: Introduction to Natural Language Processing (600.465), Jan Hajič, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

The Noisy Channel
Prototypical case: input → the channel (adds noise) → output
Model: probability of error (noise), example:
  p(0|1) = .3   p(1|1) = .7   p(1|0) = .4   p(0|0) = .6
The task: known: the noisy output; want to know: the input (decoding)

Noisy Channel Applications
OCR
  – straightforward: text → print (adds noise), scan → image
Handwriting recognition
  – text → neurons, muscles ("noise"), scan/digitize → image
Speech recognition (dictation, commands, etc.)
  – text → conversion to acoustic signal ("noise") → acoustic waves
Machine Translation
  – text in target language → translation ("noise") → source language
Also: Part-of-Speech Tagging
  – sequence of tags → selection of word forms → text

The Golden Rule of OCR, ASR, HR, MT, ...
Recall: p(A|B) = p(B|A) p(A) / p(B)   (Bayes formula)
A_best = argmax_A p(B|A) p(A)   (The Golden Rule)
p(B|A): the acoustic/image/translation/lexical model
  – application-specific name
  – will explore later
p(A): the language model

The Perfect Language Model
Sequence of word forms (forget about tagging for the moment)
Notation: A ~ W = (w1, w2, w3, ..., wd)
The big (modeling) question: p(W) = ?
Well, we know (chain rule):
  p(W) = p(w1, w2, w3, ..., wd)
       = p(w1) × p(w2|w1) × p(w3|w1, w2) × ... × p(wd|w1, w2, ..., wd−1)
Not practical (even a short W → too many parameters)

Markov Chain
Unlimited memory (cf. previous foil):
  – for wi we know all its predecessors w1, w2, w3, ..., wi−1
Limited memory:
  – we disregard "too old" predecessors
  – remember only k previous words: wi−k, wi−k+1, ..., wi−1
  – called "k-th order Markov approximation"
+ stationary character (no change over time):
  p(W) ≅ Π_{i=1..d} p(wi | wi−k, wi−k+1, ..., wi−1),  d = |W|

n-gram Language Models
(n − 1)-th order Markov approximation → n-gram LM:
  p(W) =df Π_{i=1..d} p(wi | wi−n+1, wi−n+2, ..., wi−1)
  (wi is the prediction; wi−n+1, ..., wi−1 is the history)
In particular (assume vocabulary |V| = 60k):
  0-gram LM: uniform model, p(w) = 1/|V|, 1 parameter
  1-gram LM: unigram model, p(w), 6×10^4 parameters
  2-gram LM: bigram model, p(wi|wi−1), 3.6×10^9 parameters
  3-gram LM: trigram model, p(wi|wi−2, wi−1), 2.16×10^14 parameters

LM: Observations
How large n?
  – nothing is enough (theoretically)
  – but anyway: as much as possible (→ close to the "perfect" model)
  – empirically: 3
    – parameter estimation? (reliability, data availability, storage space, ...)
    – 4 is too much: |V| = 60k → 1.296×10^19 parameters
    – but: 6–7 would be (almost) ideal (having enough data): in fact, one can recover the original from 7-grams!
Reliability ~ (1/Detail) (→ need a compromise)
For now, keep word forms (no "linguistic" processing)

The Length Issue
∀n: Σ_{w∈Ω^n} p(w) = 1  ⇒  Σ_{n=1..∞} Σ_{w∈Ω^n} p(w) ≫ 1 (→ ∞)
We want to model all sequences of words
  – for "fixed-length" tasks: no problem – n fixed, sum is 1
    (tagging, OCR/handwriting – if words are identified ahead of time)
  – for "variable-length" tasks: have to account for (discount) shorter sentences
General model: for each sequence of words of length n, define p′(w) = λn p(w)
such that Σ_{n=1..∞} λn = 1  ⇒  Σ_{n=1..∞} Σ_{w∈Ω^n} p′(w) = 1
e.g. estimate λn from data; or use the normal or another distribution
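To make the last two slides concrete, here is a minimal Python sketch (not part of the original slides): `cond_prob` is assumed to be a mapping from (history tuple, word) to the conditional probability p(w|h), k is the order of the Markov approximation, and λn is estimated from data as the relative frequency of sentence length n, one of the options suggested above.

```python
from collections import Counter

def estimate_length_weights(sentences):
    """Estimate lambda_n as the relative frequency of sentence length n,
    so that sum_n lambda_n = 1."""
    length_counts = Counter(len(s) for s in sentences)
    total = sum(length_counts.values())
    return {n: c / total for n, c in length_counts.items()}

def markov_prob(sentence, cond_prob, k):
    """p(W) under a k-th order Markov approximation:
    product over i of p(w_i | w_{i-k}, ..., w_{i-1})."""
    p = 1.0
    for i, w in enumerate(sentence):
        history = tuple(sentence[max(0, i - k):i])
        p *= cond_prob.get((history, w), 0.0)  # unseen events get 0 here (see the MLE slides)
    return p

def length_adjusted_prob(sentence, cond_prob, k, lam):
    """p'(W) = lambda_n * p(W) for n = |W|, the length-weighted model above."""
    return lam.get(len(sentence), 0.0) * markov_prob(sentence, cond_prob, k)
```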
Parameter Estimation
Parameter: a numerical value needed to compute p(w|h)
From data (how else?)
Data preparation:
  – get rid of formatting etc. ("text cleaning")
  – define words (separate but include punctuation, call it a "word")
  – define sentence boundaries (insert the "words" <s> and </s>)
  – letter case: keep, discard, or be smart:
    – name recognition
    – number type identification
    (these are huge problems per se!)
  – numbers: keep, replace by <num>, or be smart (form ~ punctuation)

Maximum Likelihood Estimate
MLE: relative frequency...
  – ...best predicts the data at hand (the "training data")
Trigrams from training data T:
  – count sequences of three words in T: c3(wi−2, wi−1, wi)
    (NB: the notation just says that three words follow each other)
  – count sequences of two words in T: c2(wi−1, wi):
    – either use c2(y, z) = Σ_w c3(y, z, w),
    – or count differently at the beginning (and end) of the data!
p(wi | wi−2, wi−1) =est. c3(wi−2, wi−1, wi) / c2(wi−2, wi−1)

Character Language Model
Use individual characters instead of words:
  p(W) =df Π_{i=1..d} p(ci | ci−n+1, ci−n+2, ..., ci−1)
Same formulas etc.
Might consider 4-grams, 5-grams or even more
Good only for comparison of languages
Transform cross-entropy between letter- and word-based models:
  H_S(p_c) = H_S(p_w) / (avg. # of characters per word in S)

LM: an Example
Training data: He can buy the can of soda .
  – Unigram: p1(He) = p1(buy) = p1(the) = p1(of) = p1(soda) = p1(.) = .125, p1(can) = .25
  – Bigram: p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5, p2(of|can) = .5, p2(the|buy) = 1, ...
  – Trigram: p3(He|<s>,<s>) = 1, p3(can|<s>,He) = 1, p3(buy|He,can) = 1, p3(of|the,can) = 1, ..., p3(.|of,soda) = 1
  – Entropy: H(p1) = 2.75, H(p2) = .25, H(p3) = 0 ← Great?!

LM: an Example (The Problem)
Cross-entropy: S = It was the greatest buy of all.
Even H_S(p1) fails (= H_S(p2) = H_S(p3) = ∞), because:
  – all unigrams but p1(the), p1(buy), p1(of) and p1(.) are 0,
  – all bigram probabilities are 0,
  – all trigram probabilities are 0.
We want to make all (theoretically possible*) probabilities non-zero.
*in fact, all of them: remember our graph from day 1?
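To connect the Maximum Likelihood Estimate slide to code, a minimal sketch (assuming `tokens` is the training data T, already cleaned and tokenized as described under Parameter Estimation); it uses the first of the two options for c2, i.e. c2(y, z) = Σ_w c3(y, z, w):

```python
from collections import Counter

def mle_trigram(tokens):
    """Relative-frequency (MLE) trigram model:
    p(w_i | w_{i-2}, w_{i-1}) = c3(w_{i-2}, w_{i-1}, w_i) / c2(w_{i-2}, w_{i-1})."""
    c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))
    c2 = Counter()
    for (y, z, w), count in c3.items():
        c2[(y, z)] += count          # c2(y, z) = sum_w c3(y, z, w)
    return {(y, z, w): count / c2[(y, z)] for (y, z, w), count in c3.items()}
```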
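The zero-probability problem of the last two slides can be reproduced directly. The sketch below (the tokenization of the two sentences is assumed to match the example) builds the unigram MLE model p1 from the training sentence, recovers H(p1) = 2.75 on the training data, and gets an infinite cross-entropy on S because It, was, greatest and all were never seen:

```python
import math
from collections import Counter

train = "He can buy the can of soda .".split()        # 8 training tokens
test  = "It was the greatest buy of all .".split()    # S, 8 test tokens

# Unigram MLE: p1(can) = 2/8 = .25, every other training token 1/8 = .125
counts = Counter(train)
total = sum(counts.values())
p1 = {w: c / total for w, c in counts.items()}

def cross_entropy(model, data):
    """H_S(p) = -(1/|S|) * sum_i log2 p(w_i); any zero probability makes it infinite."""
    total_log = 0.0
    for w in data:
        p = model.get(w, 0.0)
        total_log += math.log2(p) if p > 0 else float("-inf")
    return -total_log / len(data)

print(cross_entropy(p1, train))   # 2.75, the training entropy H(p1) from the example
print(cross_entropy(p1, test))    # inf: 'It', 'was', 'greatest', 'all' have p1 = 0
```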