Statistical Natural Language Processing
PA153
Pavel Rychly, NLP Centre, FI MU, Brno
October 14, 2024

Outline
■ Word Lists
■ Collocations
■ Language Modeling
■ N-grams
■ Evaluation of Language Models

Statistical Natural Language Processing
■ statistics provides a summary (of a text)
■ highlights important or interesting facts
■ can be used to model data
■ foundation of estimating probabilities
■ fundamental statistics: size (+ domain, range)

          lines    words     bytes
Book 1    3,715    37,703    223,415
Book 2    1,601    16,859     91,031

Word lists: Word list
■ list of all words from a text
■ list of most frequent words
■ words, lemmas, senses, tags, domains, years ...

Book 1: the, and, of, to, you, his, in, said, that, I, will, him, your, he, a, my, was, with, s, for, me, He, is, father, God, it, them, be, The, all, land, have, from, Jacob, on, her, Yahweh, son, Joseph, are, their, were, they, which, sons, t, up, Abraham, had, there

Book 2: the, I, to, a, of, is, that, little, you, he, and, said, was, prince, in, it, not, me, my, have, And, are, one, for, But, his, be, The, It, at, all, with, on, will, as, very, had, this, him, He, from, they, planet, so, them, no, You, do, would, like
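A minimal sketch of how such a frequency-sorted word list can be built; the tokenisation and the file name are illustrative assumptions, not the tool used for the lists above:

import re
from collections import Counter

def word_list(path, top=50):
    # read a plain-text book and count word forms (naive tokenisation)
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[A-Za-z']+", f.read())
    # most frequent words first, as in the Book 1 / Book 2 lists above
    return Counter(words).most_common(top)

# usage (file name is a placeholder):
# for word, freq in word_list("book1.txt"):
#     print(word, freq)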
Word lists: Frequency
■ number of occurrences (raw frequency)
■ relative frequency (hits per million)
■ document frequency (number of documents with a hit)
■ reduced frequency (ARF, ALDf): 1 ≤ reduced ≤ raw
■ normalization for comparison
■ hapax legomena (= 1 hit)

Word lists: Keywords
■ select only important words from a word list
■ compare to a reference text (norm)
■ simple math:

score = (freq_focus + N) / (freq_reference + N)

Genesis: son, God, father, Jacob, Yahweh, Joseph, Abraham, wife, behold, daughter
Little Prince: prince, planet, flower, little, fox, never, too, drawing, reply, star

Collocations
■ meaning of words is defined by the context
■ collocations are salient words in the context
■ usually not the most frequent ones
■ filtering by part of speech, grammatical relation
■ compare to reference = context of other words
■ many statistics (usually single-use only) based on frequencies: MI-score, t-score, χ², ...
■ logDice is scalable (a scoring sketch follows after the Thesaurus slide):

logDice = 14 + log2(2 · f_AB / (f_A + f_B))

Collocations: Collocations of "prince" (word sketch)

modifiers of "prince": little (the little prince), fair (fair, little prince), Oh (Oh, little prince), dear (dear little prince), prince (prince, dear little prince), great (great prince)

verbs with "prince" as object: say (said the little prince), ask (asked the little prince), demand (demanded the little prince), see (when he saw the little prince coming), inquire (inquired the little prince), repeat (repeated the little prince)

verbs with "prince" as subject: say (the little prince said to himself), come (saw the little prince coming)

[word sketch visualisation of the collocates of "prince"]

Collocations: Thesaurus
■ comparing collocation distributions
■ counting same contexts

"son" as noun (frequency 301), similar words with their frequencies: brother 161, wife 125, father 278, daughter 108, child 80, man 187, servant 91, Esau 78, Jacob 184, name 85

"Abraham" as noun (frequency 134), similar words with their frequencies: Isaac 82, Jacob 184, Joseph 157, Noah 41, Abram 61, Laban 54, Esau 78, God 234, Abimelech 24, father 278
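A minimal sketch of the keyword and logDice scores defined above, assuming the frequencies have already been counted; the numbers in the usage comments are illustrative only:

import math

def keyword_score(freq_focus, freq_reference, n=100):
    # simple keyword score: (freq_focus + N) / (freq_reference + N)
    # frequencies should be comparable, e.g. normalised to hits per million
    return (freq_focus + n) / (freq_reference + n)

def log_dice(f_ab, f_a, f_b):
    # logDice = 14 + log2(2 * f_AB / (f_A + f_B))
    # f_ab: co-occurrence frequency; f_a, f_b: frequencies of the two words
    return 14 + math.log2(2 * f_ab / (f_a + f_b))

# usage (illustrative frequencies):
# print(keyword_score(freq_focus=300, freq_reference=2))   # e.g. "prince" in focus vs. reference text
# print(log_dice(f_ab=50, f_a=120, f_b=90))                # e.g. "little" + "prince"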
Collocations: Multi-word units
■ meaning of some words is completely different in the context of a specific co-occurring word
■ black hole: is not black and is not a hole
■ strong collocations
■ uses the same statistics with a different threshold
■ better to compare context distributions instead of only numbers
■ terminology: compare to a reference corpus

Language Modeling: Language models, what are they good for?
■ assigning scores to sequences of words
■ predicting words
■ generating text
■ statistical machine translation
■ automatic speech recognition
■ optical character recognition

Language Modeling: OCR + MT
[illustration slide]

Language Modeling: Language models, probability of a sentence
■ an LM is a probability distribution over all possible word sequences
■ what is the probability of an utterance of s?

Probability of a sentence:
P_LM(Catalonia President urges protests)
P_LM(President Catalonia urges protests)
P_LM(urges Catalonia protests President)

Ideally, the probability should strongly correlate with the fluency and intelligibility of a word sequence.

N-grams: N-gram models
■ an approximation of long sequences using short n-grams
■ a straightforward implementation
■ an intuitive approach
■ good local fluency

Randomly generated text (Czech):
"Jsi nebylo vidět vteřin přestal po schodech se dal do deníku a položili se táhl ji viděl na konci místnosti 101," řekl důstojník.

Randomly generated text (Hungarian):
A társaság kötelezettségeiért kapta a középkori temploma az volt, hogy a felhasználók az adottságai, a felhasználó azonosítása az egyesület alapszabályát.

N-grams: N-gram models, naive approach

W = w_1, w_2, ..., w_n

Markov's assumption:
p(W) = Π_i p(w_i | w_{i-2}, w_{i-1})

p(this is a sentence) = p(this) × p(is | this) × p(a | this, is) × p(sentence | is, a)

p(a | this, is) = count(this is a) / count(this is)

Sparse data problem.

N-grams: Probabilities, practical issue
■ probabilities of words are very small
■ multiplying small numbers quickly underflows to zero
■ limits of floating point numbers: about 10^-38 (single precision), 10^-308 (double precision)
■ using log space:
  ■ avoids underflow
  ■ adding is faster

log(p_1 × p_2 × p_3 × p_4) = log p_1 + log p_2 + log p_3 + log p_4

N-grams: Computing LM probabilities, estimation

A trigram model uses the 2 preceding words for probability learning. Using maximum-likelihood estimation:

p(w_3 | w_1, w_2) = count(w_1, w_2, w_3) / count(w_1, w_2)

quadrigram: (lord, of, the, ?)

w         count     p(w)
rings     30,156    0.425
flies      2,977    0.042
well       1,536    0.021
manor        907    0.012
dance        767    0.010
...
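A minimal sketch of this maximum-likelihood estimation combined with log-space scoring, assuming pre-tokenised text; it returns -inf for any sentence containing an unseen trigram, which is exactly the sparse-data problem addressed by smoothing below:

import math
from collections import Counter

def train_trigram(tokens):
    # MLE counts for p(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    return trigrams, bigrams

def log_prob(sentence, trigrams, bigrams):
    # sum log probabilities instead of multiplying, to avoid underflow
    logp = 0.0
    for w1, w2, w3 in zip(sentence, sentence[1:], sentence[2:]):
        c3, c2 = trigrams[(w1, w2, w3)], bigrams[(w1, w2)]
        if c3 == 0:
            return float("-inf")   # unseen trigram -> p(sentence) = 0
        logp += math.log2(c3 / c2)
    return logp

# usage (toy data):
# trigrams, bigrams = train_trigram("this is a sentence this is a test".split())
# print(log_prob("this is a sentence".split(), trigrams, bigrams))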
N-grams: Large LM, n-gram counts

How many unique n-grams are in a corpus?

order      unique        singletons
unigram    86,700        33,447 (38.6%)
bigram     1,948,935     1,132,844 (58.1%)
trigram    8,092,798     6,022,286 (74.4%)
4-gram     15,303,847    13,081,621 (85.5%)
5-gram     19,882,175    18,324,577 (92.2%)

Corpus: Europarl, 30 M tokens.

N-grams: Language model smoothing

The problem: an n-gram missing from the data but present in a sentence gives p(sentence) = 0.
We need to assign a non-zero p to unseen data. This must hold: for all w: p(w) > 0.
The issue is more pronounced for higher-order models.
Smoothing: an attempt to amend the real counts of n-grams to the expected counts in any (unseen) data.
Add-one, add-α, Good-Turing smoothing.
More in PA154 (Language Modeling).

Evaluation of Language Models: Quality and comparison of LMs

We need to compare the quality of various LMs (various orders, various data, smoothing techniques, etc.):
1. extrinsic evaluation (WER, MT, ASR, OCR)
2. intrinsic evaluation (perplexity)

A good LM should assign a higher probability to a good(-looking) text than to an incorrect text.
For a fixed test text we can compare various LMs.

Evaluation of Language Models: Cross-entropy

H(P_LM) = -(1/n) · log_2 P_LM(w_1, w_2, ..., w_n)
        = -(1/n) · Σ_{i=1}^{n} log_2 P_LM(w_i | w_1, ..., w_{i-1})

Cross-entropy is the average value of the negative logarithms of the word probabilities in the test text. It corresponds to a measure of the uncertainty of a probability distribution: the lower, the better.

A good LM should reach an entropy close to the real entropy of the language. That cannot be measured directly, but quite reliable estimates exist, e.g. Shannon's game. For English, the entropy is estimated at approximately 1.3 bits per letter.

Evaluation of Language Models: Perplexity

PP = 2^H(P_LM)

Perplexity is a simple transformation of cross-entropy. A good LM should not waste probability mass on improbable phenomena. The lower the entropy, the better; hence the lower the perplexity, the better.
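A minimal sketch of intrinsic evaluation on a test text, reusing the hypothetical train_trigram counts from the earlier sketch; add-one smoothing and the fixed vocabulary size are illustrative assumptions (not the technique prescribed on the slides) so that unseen trigrams do not yield infinite cross-entropy:

import math

def add_one_prob(w1, w2, w3, trigrams, bigrams, vocab_size):
    # add-one (Laplace) smoothed trigram probability: unseen trigrams get p > 0
    return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + vocab_size)

def cross_entropy(test_tokens, trigrams, bigrams, vocab_size):
    # H(P_LM) = -(1/n) * sum_i log2 p(w_i | w_{i-2}, w_{i-1})
    logs = [math.log2(add_one_prob(w1, w2, w3, trigrams, bigrams, vocab_size))
            for w1, w2, w3 in zip(test_tokens, test_tokens[1:], test_tokens[2:])]
    return -sum(logs) / len(logs)

def perplexity(test_tokens, trigrams, bigrams, vocab_size):
    # PP = 2 ** H(P_LM); the lower, the better
    return 2 ** cross_entropy(test_tokens, trigrams, bigrams, vocab_size)

# usage (toy data; vocab_size = number of distinct word types in training):
# trigrams, bigrams = train_trigram("this is a sentence this is a test".split())
# print(perplexity("this is a test sentence".split(), trigrams, bigrams, vocab_size=5))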