Language Modeling (and the
Noisy Channel)
PA154 Language Modeling (3.1)
Pavel Rychlý
pary@fi.muni.cz
March 5, 2024
Source: Introduction to Natural Language Processing (600.465)
Jan Hajič, CS Dept., Johns Hopkins Univ.
www.cs.jhu.edu/~hajic
The Noisy Channel
Prototypical case
Model: probability of error (noise):
Example: p(0|1) = .3, p(1|1) = .7, p(1|0) = .4, p(0|0) = .6
The task:
known: the noisy output; want to know: the input (decoding)
Pavel Rychlý · Noisy Channel · March 5, 2024 2 / 14
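The decoding task above can be sketched in a few lines of Python. The channel probabilities are the ones from this slide; the uniform prior over inputs is an assumption added for illustration.

```python
# Noisy-channel decoding for the bit channel above.
# channel[(out, inp)] = p(out | inp), the error probabilities from the slide.
channel = {
    (0, 1): 0.3, (1, 1): 0.7,
    (1, 0): 0.4, (0, 0): 0.6,
}
# Prior over inputs p(inp) -- an assumed uniform prior, not given on the slide.
prior = {0: 0.5, 1: 0.5}

def decode(observed):
    """Return the input bit maximizing p(observed | input) * p(input)."""
    return max(prior, key=lambda inp: channel[(observed, inp)] * prior[inp])
```

With a uniform prior, observing 1 decodes to input 1 (0.35 vs. 0.20) and observing 0 decodes to input 0 (0.30 vs. 0.15).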
Noisy Channel Applications
OCR
– straightforward: text → print (adds noise), scan → image
Handwriting recognition
– text → neurons, muscles ("noise"), scan/digitize → image
Speech recognition (dictation, commands, etc.)
– text → conversion to acoustic signal ("noise") → acoustic waves
Machine Translation
– text in target language → translation ("noise") → source language
Also: Part of Speech Tagging
– sequence of tags → selection of word forms → text
The Golden Rule of OCR, ASR, HR, MT,...
Recall:
p(A|B) = p(B|A) p(A) / p(B) (Bayes formula)
Abest = argmaxA p(B|A) p(A) (The Golden Rule)
p(B|A): the acoustic/image/translation/lexical model
– application-specific name
– will explore later
p(A): language model
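In code, the Golden Rule is a one-line argmax. The candidate words and their model scores below are made-up numbers for an OCR-style ambiguity, not values from the slides.

```python
# A_best = argmax_A p(B|A) * p(A), for a hypothetical blurry image B
# that could be read as "cat" or "cot".
candidates = {
    # word A: (p(B|A) image model, p(A) language model) -- illustrative only
    "cat": (0.6, 0.010),
    "cot": (0.7, 0.001),
}
best = max(candidates, key=lambda a: candidates[a][0] * candidates[a][1])
```

Here "cat" wins (0.6 × 0.010 > 0.7 × 0.001): the language model p(A) overrides a slightly better image match.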
The Perfect Language Model
Sequence of word forms (forget about tagging for the moment)
Notation: A ~ W = (w1, w2, w3, ..., wd )
The big (modeling) question:
p(W) = ?
Well, we know (chain rule):
p(W) = p(w1, w2, w3, ..., wd ) =
p(w1) × p(w2|w1) × p(w3|w1, w2) × ... × p(wd |w1, w2, ...wd−1)
Not practical (even short W → too many parameters)
Markov Chain
Unlimited memory (cf. previous foil):
– for wi we know all its predecessors w1, w2, w3, ..., wi−1
Limited memory:
– we disregard "too old" predecessors
– remember only k previous words: wi−k , wi−k+1, ..., wi−1
– called "kth order Markov approximation"
+ stationary character (no change over time):
p(W) ≅ ∏i=1..d p(wi|wi−k, wi−k+1, ..., wi−1), d = |W|
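A kth order model is just the chain rule with the history truncated to k words. In the sketch below, `cond_prob` is an assumed callable standing in for whatever estimate of p(wi | history) is available.

```python
def markov_prob(words, cond_prob, k):
    """p(W) under a k-th order Markov approximation: each word is
    conditioned on at most its k previous words (stationary model)."""
    p = 1.0
    for i, w in enumerate(words):
        history = tuple(words[max(0, i - k):i])  # drop "too old" predecessors
        p *= cond_prob(w, history)
    return p
```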
n-gram Language Models
(n − 1)th order Markov approximation → n-gram LM:
p(W) =df ∏i=1..d p(wi | wi−n+1, wi−n+2, ..., wi−1)
(wi is the prediction; wi−n+1, ..., wi−1 is the history)
In particular (assume vocabulary |V| = 60k):
0-gram LM: uniform model, p(w) = 1/|V|, 1 parameter
1-gram LM: unigram model, p(w), 6 × 10⁴ parameters
2-gram LM: bigram model, p(wi|wi−1), 3.6 × 10⁹ parameters
3-gram LM: trigram model, p(wi|wi−2, wi−1), 2.16 × 10¹⁴ parameters
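The parameter counts on this slide are simply |V|**n; a quick check with |V| = 60k:

```python
# An n-gram model needs one conditional probability per possible n-gram,
# i.e. |V|**n parameters (ignoring that each distribution sums to 1).
V = 60_000
params = {n: V ** n for n in (1, 2, 3, 4)}
# 1-gram: 6e4, 2-gram: 3.6e9, 3-gram: 2.16e14, 4-gram: 1.296e19
```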
LM: Observations
How large n?
– nothing is enough (theoretically)
– but anyway: as much as possible (→ close to the "perfect" model)
– empirically: 3
parameter estimation? (reliability, data availability, storage space, ...)
4 is too much: |V| = 60k → 1.296 × 10¹⁹ parameters
but: 6–7 would be (almost) ideal (having enough data): in fact, one can recover the original text from 7-grams!
Reliability ~(1/Detail) (→ need compromise)
For now, keep word forms (no "linguistic" processing)
The Length Issue
∀n: Σw∈Ωⁿ p(w) = 1 ⇒ Σn=1..∞ Σw∈Ωⁿ p(w) ≠ 1 (→ ∞)
We want to model all sequences of words
– for "fixed" length tasks: no problem – n fixed, sum is 1
tagging, OCR/handwriting (if words identified ahead of time)
– for "variable" length tasks: have to account for all lengths → discount shorter sentences
General model: for each sequence of words of length n,
define p′(w) = λn p(w) such that Σn=1..∞ λn = 1 ⇒
Σn=1..∞ Σw∈Ωⁿ p′(w) = 1
e.g. estimate λn from data; or use normal or other distribution
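One concrete choice for the length weights λn (not prescribed by the slide) is a geometric distribution, which sums to 1 over n = 1..∞:

```python
# lambda_n = (1 - q) * q**(n - 1) sums to 1 over n = 1, 2, ...
# q is an assumed decay parameter; in practice lambda_n would be
# estimated from the empirical distribution of sentence lengths.
q = 0.9

def lam(n):
    return (1 - q) * q ** (n - 1)
```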
Parameter Estimation
Parameter: numerical value needed to compute p(w|h)
From data (how else?)
Data preparation:
get rid of formatting etc. ("text cleaning")
define words (separate but include punctuation, call it "word")
define sentence boundaries (insert "words" <s> and </s>)
letter case: keep, discard, or be smart:
– name recognition
– number type identification (these are huge problems per se!)
– numbers: keep, replace by <num>, or be smart (form ~ punctuation)
Maximum Likelihood Estimate
MLE: Relative Frequency...
– ...best predicts the data at hand (the "training data")
Trigrams from training data T:
– count sequences of three words in T: c3(wi−2, wi−1, wi )
– (NB: notation: just saying that three words follow each other)
– count sequences of two words in T: c2(wi−1, wi )
either use c2(y, z) = Σw c3(y, z, w)
or count differently at the beginning (& end) of the data!
p(wi|wi−2, wi−1) =est. c3(wi−2, wi−1, wi) / c2(wi−2, wi−1)
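The MLE above is a direct ratio of counts. A minimal sketch, deriving c2 by summing c3 over continuations as the slide suggests (so the very end of the data is handled by that convention rather than counted separately):

```python
from collections import Counter

def mle_trigram(tokens):
    """p(w3 | w1, w2) = c3(w1, w2, w3) / c2(w1, w2)."""
    c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))
    c2 = Counter()
    for (w1, w2, _), c in c3.items():
        c2[(w1, w2)] += c  # c2(y, z) = sum over w of c3(y, z, w)
    return lambda w3, w1, w2: c3[(w1, w2, w3)] / c2[(w1, w2)]
```

For example, on the toy data "a b c a b d", the context (a, b) is seen twice with two different continuations, so p(c|a,b) = p(d|a,b) = .5.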
Character Language Model
Use individual characters instead of words:
p(W) =df ∏i=1..d p(ci|ci−n+1, ci−n+2, ..., ci−1)
Same formulas etc.
Might consider 4-grams, 5-grams or even more
Good only for language comparison
Transform cross-entropy between letter- and word-based models:
HS(pc) = HS(pw) / (avg. # of characters per word in S)
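The conversion formula in code; splitting on whitespace is a simplification of the tokenization described two slides back.

```python
def per_char_entropy(H_word, text):
    """H_S(p_c) = H_S(p_w) / (avg. # of characters per word in text)."""
    words = text.split()
    avg_len = sum(len(w) for w in words) / len(words)
    return H_word / avg_len
```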
LM: an Example
Training data:
He can buy the can of soda.
– Unigram:
p1(He) = p1(buy) = p1(the) = p1(of) = p1(soda) = p1(.) = .125
p1(can) = .25
– Bigram:
p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5, p2(of|can) = .5,
p2(the|buy) = 1,...
– Trigram:
p3(He|<s>,<s>) = 1, p3(can|<s>,He) = 1, p3(buy|He,can) = 1,
p3(of|the,can) = 1, ..., p3(.|of,soda) = 1.
– Entropy:
H(p1) = 2.75, H(p2) = .25, H(p3) = 0 ← Great?!
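The unigram figure on this slide can be reproduced directly; a short check of H(p1) = 2.75 on the training sentence, tokenized as on the previous slides (with "." as a separate word):

```python
from collections import Counter
from math import log2

tokens = "He can buy the can of soda .".split()
n = len(tokens)  # 8 tokens; "can" occurs twice
# Entropy of the unigram MLE distribution: one word at p = .25, six at p = .125
H1 = -sum((c / n) * log2(c / n) for c in Counter(tokens).values())
```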
LM: an Example (The Problem)
Cross-entropy:
S = It was the greatest buy of all.
Even HS(p1) fails (= HS(p2) = HS(p3) = ∞), because:
all unigrams but p1(the), p1(buy), p1(of) and p1(.) are 0.
all bigram probabilities are 0.
all trigram probabilities are 0.
We want: to make all (theoretically possible*) probabilities
non-zero.
*in fact, all of them: remember our graph from day 1?