Statistical Natural Language Processing
PA153
P. Rychlý
September 25, 2023

1. Word Lists
2. Collocations
3. Language Modeling
4. N-grams
5. Evaluation of Language Models

Statistical Natural Language Processing
■ statistics provides a summary (of a text)
■ highlights important or interesting facts
■ can be used to model data
■ foundation of estimating probabilities
■ fundamental statistics: size (+ domain, range)

          lines    words     bytes
Book 1    3,715    37,703    223,415
Book 2    1,601    16,859    91,031

Word list
■ list of all words from a text
■ list of most frequent words
■ words, lemmas, senses, tags, domains, years ...

Book 1: the, and, of, to, you, his, in, said, that, I, will, him, your, he, a, my, was, with, s, for, me, He, is, father, God, it, them, be, The, all, land, have, from, Jacob, on, her, Yahweh, son, Joseph, are, their, were, they, which, sons, t, up, Abraham, had, there

Book 2: the, I, to, a, of, is, that, little, you, he, and, said, was, prince, in, it, not, me, my, have, And, are, one, for, But, his, be, The, It, at, all, with, on, will, as, very, had, this, him, He, from, they, planet, so, them, no, You, do, would, like

Frequency
■ number of occurrences (raw frequency)
■ relative frequency (hits per million)
■ document frequency (number of documents with a hit)
■ reduced frequency (ARF, ALDf), 1 ≤ reduced ≤ raw
■ normalization for comparison
■ hapax legomena (= 1 hit)

Zipf's Law
rank × frequency = constant
[figure: rank-frequency plot; x-axis: rank, y-axis: frequency]
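As an illustration of the word list, frequency, and Zipf's law slides above, here is a minimal Python sketch (not part of the original slides); the naive tokenization, the input file name, and the output format are my own assumptions.

```python
from collections import Counter
import re

def word_list(text):
    """Return (word, raw frequency, hits per million), most frequent first."""
    tokens = re.findall(r"[a-z']+", text.lower())   # naive tokenization (assumption)
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(w, f, f * 1_000_000 / total) for w, f in counts.most_common()]

# Zipf's law: rank * frequency should stay roughly constant for the top words
text = open("book1.txt", encoding="utf-8").read()   # hypothetical input file
for rank, (w, raw, hpm) in enumerate(word_list(text)[:10], start=1):
    print(f"{rank:2d}  {w:10s}  raw={raw:6d}  per-million={hpm:9.1f}  rank*freq={rank * raw}")
```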
Zipf's Law
[figure: log-log rank-frequency plot; x-axis: Rank (log scale); labelled points: the, and, to, in, that, his, it]

Keywords
■ select only important words from a word list
■ compare to reference text (norm)
■ simple math score:

score = (freq_focus + N) / (freq_reference + N)

Genesis: son, God, father, Jacob, Yahweh, Joseph, Abraham, wife, behold, daughter
Little Prince: prince, planet, flower, little, fox, never, too, drawing, reply, star

Collocations
■ meaning of words is defined by the context
■ collocations are salient words in the context
■ usually not the most frequent
■ filtering by part of speech, grammatical relation
■ compare to reference = context for other words
■ many statistics (usually single use only) based on frequencies
■ MI-score, t-score, χ², ...
■ logDice - scalable

logDice = 14 + log2 (2 f_AB / (f_A + f_B))

Collocations of Prince
[word sketch for "prince": modifiers of "prince" (little, fair, dear, great), verbs with "prince" as object (say, ask, demand, see), verbs with "prince" as subject (say, come, go, add, ask, inquire, flush, repeat), with example sentences such as "said the little prince", "asked the little prince", "the little prince went away"]

[figure: graph visualisation of the same collocations of "prince", grouped by grammatical relation]

Thesaurus
■ comparing collocation distributions
■ counting same context

son (as noun, 301x)          Abraham (as noun, 134x)
 1  brother     161           1  Isaac        82
 2  wife        125           2  Jacob       184
 3  father      278           3  Joseph      157
 4  daughter    103           4  Noah         41
 5  child        80           5  Abram        61
 6  man         187           6  Laban        54
 7  servant      91           7  Esau         78
 8  Esau         78           8  God         234
 9  Jacob       184           9  Abimelech    24
10  name         85          10  father      278

Multi-word units
■ meaning of some words is completely different in the context of a specific co-occurring word
■ black hole: is not black and is not a hole
■ strong collocations
■ uses the same statistics with a different threshold
■ better to compare context distributions instead of only numbers
■ terminology - compare to a reference corpus

Language models - what are they good for?
■ assigning scores to sequences of words
■ predicting words
■ generating text
■ statistical machine translation
■ automatic speech recognition
■ optical character recognition

OCR + MT
[figure]
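For illustration, a minimal sketch (not from the slides) of the logDice score defined on the Collocations slide above; the co-occurrence counts are invented.

```python
import math

def log_dice(f_ab, f_a, f_b):
    """logDice = 14 + log2(2*f_AB / (f_A + f_B)); the theoretical maximum is 14."""
    return 14 + math.log2(2 * f_ab / (f_a + f_b))

# toy counts (assumed): "little" and "prince" co-occur 300x,
# "little" occurs 400x and "prince" 350x in the corpus
print(round(log_dice(300, 400, 350), 2))   # strong collocation, close to 14
print(round(log_dice(2, 400, 5000), 2))    # weak collocation, much lower score
```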
Language models - probability of a sentence
■ LM is a probability distribution over all possible word sequences.
■ What is the probability of utterance of s?

Probability of a sentence:
P_LM(Catalonia President urges protests)
P_LM(President Catalonia urges protests)
P_LM(urges Catalonia protests President)

Ideally, the probability should strongly correlate with the fluency and intelligibility of a word sequence.

N-gram models
■ an approximation of long sequences using short n-grams
■ a straightforward implementation
■ an intuitive approach
■ good local fluency

Randomly generated text
Czech: "Jsi nebylo vidět vteřin přestal po schodech se dal do deníku a položili se táhl ji viděl na konci místnosti 101," řekl důstojník.
Hungarian: A társaság kötelezettségeiért kapta a középkori temploma az volt, hogy a felhasználók az adottságai, a felhasználó azonosítása az egyesület alapszabályát.

N-gram models, naïve approach
Markov's assumption:
p(this is a sentence) = p(this) × p(is | this) × p(a | this, is) × p(sentence | is, a)
p(a | this, is) = count(this is a) / count(this is)

Sparse data problem.

Probabilities, practical issue
■ probabilities of words are very small
■ multiplying small numbers goes quickly to zero
■ limits of floating point numbers: 10^-38 (single precision), 10^-308 (double precision)
■ using log space:
  ■ avoids underflow
  ■ adding is faster than multiplying

log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4

Computing LM probabilities, estimation
A trigram model uses the 2 preceding words for probability learning. Using maximum-likelihood estimation:

p(w3 | w1, w2) = count(w1, w2, w3) / Σ_w count(w1, w2, w)

quadrigram: (lord, of, the, ?)

w        count     p(w)
rings    30,156    0.425
flies     2,977    0.042
well      1,536    0.021
manor       907    0.012
dance       767    0.010
...

Larger LM - n-gram counts
How many unique n-grams are in a corpus?

order       unique        singletons
unigram         86,700        33,447 (38.6%)
bigram       1,948,935     1,132,844 (58.1%)
trigram      8,092,798     6,022,286 (74.4%)
4-gram      15,303,847    13,081,621 (85.5%)
5-gram      19,882,175    18,324,577 (92.2%)

Corpus: Europarl, 30 M tokens.

Smoothing of probabilities
The problem: an n-gram is missing in the training data but appears in a sentence → p(sentence) = 0.
We need to assign non-zero p to unseen data. This must hold: ∀w : p(w) > 0
The issue is more pronounced for higher-order models.
Smoothing: an attempt to amend real counts of n-grams to expected counts in any (unseen) data.
Add-one, Add-α, Good-Turing smoothing
More in PA154 (Language Modeling).
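To tie the slides above together, here is a minimal sketch (mine, not from the slides) of a trigram model: maximum-likelihood counts, add-one smoothing so that unseen trigrams keep a non-zero probability, and scoring in log space. The start/end markers and the toy training sentences are assumptions.

```python
import math
from collections import defaultdict

class TrigramLM:
    """Minimal trigram model: MLE counts, add-one smoothing, log-space scoring."""
    def __init__(self, sentences):
        self.tri = defaultdict(int)   # count(w1, w2, w3)
        self.bi = defaultdict(int)    # count(w1, w2) as trigram context
        self.vocab = set()
        for s in sentences:
            words = ["<s>", "<s>"] + s + ["</s>"]
            self.vocab.update(words)
            for w1, w2, w3 in zip(words, words[1:], words[2:]):
                self.tri[(w1, w2, w3)] += 1
                self.bi[(w1, w2)] += 1

    def logprob(self, w3, w1, w2):
        # add-one smoothing keeps p > 0 for unseen trigrams
        v = len(self.vocab)
        return math.log((self.tri[(w1, w2, w3)] + 1) / (self.bi[(w1, w2)] + v))

    def sentence_logprob(self, sentence):
        words = ["<s>", "<s>"] + sentence + ["</s>"]
        return sum(self.logprob(w3, w1, w2)
                   for w1, w2, w3 in zip(words, words[1:], words[2:]))

# toy training data (assumption); in practice the counts come from a large corpus
lm = TrigramLM([["this", "is", "a", "sentence"], ["this", "is", "another", "sentence"]])
print(lm.sentence_logprob(["this", "is", "a", "sentence"]))   # higher (less negative)
print(lm.sentence_logprob(["sentence", "a", "is", "this"]))   # lower
```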
Quality and comparison of LMs
We need to compare the quality of various LMs (various orders, various data, smoothing techniques, etc.):
1. extrinsic evaluation (WER, MT, ASR, OCR)
2. intrinsic evaluation (perplexity)

A good LM should assign a higher probability to a good(-looking) text than to an incorrect text. For a fixed test text we can compare various LMs.

Cross-entropy
H(P_LM) = -(1/n) log2 P_LM(w_1, w_2, ... w_n) = -(1/n) Σ_{i=1..n} log2 P_LM(w_i | w_1, ... w_{i-1})

Cross-entropy is the average value of the negative logarithms of the words' probabilities in a testing text. It corresponds to a measure of the uncertainty of a probability distribution: the lower, the better.
A good LM should reach an entropy close to the real entropy of the language. That cannot be measured directly, but quite reliable estimates exist, e.g. Shannon's game. For English, the entropy is estimated at approx. 1.3 bits per letter.

Cross-perplexity
PP = 2^H(P_LM)

Cross-perplexity is a simple transformation of cross-entropy. A good LM should not waste probability mass on improbable phenomena. The lower the entropy, the better → the lower the perplexity, the better.
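A minimal sketch (not from the slides) of the cross-entropy and cross-perplexity computation above, assuming per-word probabilities from some LM are already available; the probability values below are invented.

```python
import math

def cross_entropy(word_log2_probs):
    """Average negative base-2 log probability per word of a test text (in bits)."""
    return -sum(word_log2_probs) / len(word_log2_probs)

def perplexity(word_log2_probs):
    """PP = 2 ** H(P_LM)."""
    return 2 ** cross_entropy(word_log2_probs)

# made-up per-word probabilities of a short test text under some LM
probs = [0.2, 0.05, 0.1, 0.01]
log2_probs = [math.log2(p) for p in probs]
print(cross_entropy(log2_probs))   # ~ 4.15 bits per word
print(perplexity(log2_probs))      # ~ 17.8
```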