Statistical Natural Language Processing
PA153
Pavel Rychly, NLP Centre, FI MU, Brno
October 14, 2024

Outline
■ Word Lists
■ Collocations
■ Language Modeling
■ N-grams
■ Evaluation of Language Models

Statistical Natural Language Processing
■ statistics provides a summary (of a text)
■ highlights important or interesting facts
■ can be used to model data
■ foundation of estimating probabilities
■ fundamental statistics: size (+ domain, range)

          lines    words     bytes
Book 1    3,715    37,703    223,415
Book 2    1,601    16,859     91,031

Word lists: Word list
■ list of all words from a text
■ list of most frequent words
■ words, lemmas, senses, tags, domains, years ...

Book 1: the, and, of, to, you, his, in, said, that, I, will, him, your, he, a, my, was, with, s, for, me, He, is, father, God, it, them, be, The, all, land, have, from, Jacob, on, her, Yahweh, son, Joseph, are, their, were, they, which, sons, t, up, Abraham, had, there

Book 2: the, I, to, a, of, is, that, little, you, he, and, said, was, prince, in, it, not, me, my, have, And, are, one, for, But, his, be, The, It, at, all, with, on, will, as, very, had, this, him, He, from, they, planet, so, them, no, You, do, would, like
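A minimal sketch of how such a frequency-sorted word list can be built; the tokenisation and the file name are illustrative assumptions, not the tool used for the lists above:

import re
from collections import Counter

def word_list(path, top=50):
    # read a plain-text book and count word forms (naive tokenisation)
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[A-Za-z']+", f.read())
    # most frequent words first, as in the Book 1 / Book 2 lists above
    return Counter(words).most_common(top)

# usage (file name is a placeholder):
# for word, freq in word_list("book1.txt"):
#     print(word, freq)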
Word lists: Frequency
■ number of occurrences (raw frequency)
■ relative frequency (hits per million)
■ document frequency (number of documents with a hit)
■ reduced frequency (ARF, ALDf): 1 ≤ reduced ≤ raw
■ normalization for comparison
■ hapax legomena (= 1 hit)

Word lists: Keywords
■ select only important words from a word list
■ compare to a reference text (norm)
■ simple math:

score = (freq_focus + N) / (freq_reference + N)

Genesis: son, God, father, Jacob, Yahweh, Joseph, Abraham, wife, behold, daughter
Little Prince: prince, planet, flower, little, fox, never, too, drawing, reply, star

Collocations
■ meaning of words is defined by the context
■ collocations are salient words in the context
■ usually not the most frequent ones
■ filtering by part of speech, grammatical relation
■ compare to reference = context of other words
■ many statistics (usually single-use only) based on frequencies: MI-score, t-score, χ², ...
■ logDice is scalable (a scoring sketch follows after the Thesaurus slide):

logDice = 14 + log2(2 · f_AB / (f_A + f_B))

Collocations: Collocations of "prince" (word sketch)

modifiers of "prince": little (the little prince), fair (fair, little prince), Oh (Oh, little prince), dear (dear little prince), prince (prince, dear little prince), great (great prince)

verbs with "prince" as object: say (said the little prince), ask (asked the little prince), demand (demanded the little prince), see (when he saw the little prince coming), inquire (inquired the little prince), repeat (repeated the little prince)

verbs with "prince" as subject: say (the little prince said to himself), come (saw the little prince coming)

[word sketch visualisation of the collocates of "prince"]

Collocations: Thesaurus
■ comparing collocation distributions
■ counting same contexts

"son" as noun (frequency 301), similar words with their frequencies: brother 161, wife 125, father 278, daughter 108, child 80, man 187, servant 91, Esau 78, Jacob 184, name 85

"Abraham" as noun (frequency 134), similar words with their frequencies: Isaac 82, Jacob 184, Joseph 157, Noah 41, Abram 61, Laban 54, Esau 78, God 234, Abimelech 24, father 278
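A minimal sketch of the keyword and logDice scores defined above, assuming the frequencies have already been counted; the numbers in the usage comments are illustrative only:

import math

def keyword_score(freq_focus, freq_reference, n=100):
    # simple keyword score: (freq_focus + N) / (freq_reference + N)
    # frequencies should be comparable, e.g. normalised to hits per million
    return (freq_focus + n) / (freq_reference + n)

def log_dice(f_ab, f_a, f_b):
    # logDice = 14 + log2(2 * f_AB / (f_A + f_B))
    # f_ab: co-occurrence frequency; f_a, f_b: frequencies of the two words
    return 14 + math.log2(2 * f_ab / (f_a + f_b))

# usage (illustrative frequencies):
# print(keyword_score(freq_focus=300, freq_reference=2))   # e.g. "prince" in focus vs. reference text
# print(log_dice(f_ab=50, f_a=120, f_b=90))                # e.g. "little" + "prince"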
Collocations: Multi-word units
■ meaning of some words is completely different in the context of a specific co-occurring word
■ black hole: is not black and is not a hole
■ strong collocations
■ uses the same statistics with a different threshold
■ better to compare context distributions instead of only numbers
■ terminology: compare to a reference corpus

Language Modeling: Language models, what are they good for?
■ assigning scores to sequences of words
■ predicting words
■ generating text
■ statistical machine translation
■ automatic speech recognition
■ optical character recognition

Language Modeling: OCR + MT
[illustration slide]

Language Modeling: Language models, probability of a sentence
■ an LM is a probability distribution over all possible word sequences
■ what is the probability of an utterance of s?

Probability of a sentence:
P_LM(Catalonia President urges protests)
P_LM(President Catalonia urges protests)
P_LM(urges Catalonia protests President)

Ideally, the probability should strongly correlate with the fluency and intelligibility of a word sequence.

N-grams: N-gram models
■ an approximation of long sequences using short n-grams
■ a straightforward implementation
■ an intuitive approach
■ good local fluency

Randomly generated text (Czech):
"Jsi nebylo vidět vteřin přestal po schodech se dal do deníku a položili se táhl ji viděl na konci místnosti 101," řekl důstojník.

Randomly generated text (Hungarian):
A társaság kötelezettségeiért kapta a középkori temploma az volt, hogy a felhasználók az adottságai, a felhasználó azonosítása az egyesület alapszabályát.

N-grams: N-gram models, naive approach

W = w_1, w_2, ..., w_n

Markov's assumption:
p(W) = Π_i p(w_i | w_{i-2}, w_{i-1})

p(this is a sentence) = p(this) × p(is | this) × p(a | this, is) × p(sentence | is, a)

p(a | this, is) = count(this is a) / count(this is)

Sparse data problem.

N-grams: Probabilities, practical issue
■ probabilities of words are very small
■ multiplying small numbers quickly underflows to zero
■ limits of floating point numbers: about 10^-38 (single precision), 10^-308 (double precision)
■ using log space:
  ■ avoids underflow
  ■ adding is faster

log(p_1 × p_2 × p_3 × p_4) = log p_1 + log p_2 + log p_3 + log p_4

N-grams: Computing LM probabilities, estimation

A trigram model uses the 2 preceding words for probability learning. Using maximum-likelihood estimation:

p(w_3 | w_1, w_2) = count(w_1, w_2, w_3) / count(w_1, w_2)

quadrigram: (lord, of, the, ?)

w         count     p(w)
rings     30,156    0.425
flies      2,977    0.042
well       1,536    0.021
manor        907    0.012
dance        767    0.010
...
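A minimal sketch of this maximum-likelihood estimation combined with log-space scoring, assuming pre-tokenised text; it returns -inf for any sentence containing an unseen trigram, which is exactly the sparse-data problem addressed by smoothing below:

import math
from collections import Counter

def train_trigram(tokens):
    # MLE counts for p(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    return trigrams, bigrams

def log_prob(sentence, trigrams, bigrams):
    # sum log probabilities instead of multiplying, to avoid underflow
    logp = 0.0
    for w1, w2, w3 in zip(sentence, sentence[1:], sentence[2:]):
        c3, c2 = trigrams[(w1, w2, w3)], bigrams[(w1, w2)]
        if c3 == 0:
            return float("-inf")   # unseen trigram -> p(sentence) = 0
        logp += math.log2(c3 / c2)
    return logp

# usage (toy data):
# trigrams, bigrams = train_trigram("this is a sentence this is a test".split())
# print(log_prob("this is a sentence".split(), trigrams, bigrams))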
N-grams: Large LM, n-gram counts

How many unique n-grams are in a corpus?

order      unique        singletons
unigram    86,700        33,447 (38.6%)
bigram     1,948,935     1,132,844 (58.1%)
trigram    8,092,798     6,022,286 (74.4%)
4-gram     15,303,847    13,081,621 (85.5%)
5-gram     19,882,175    18,324,577 (92.2%)

Corpus: Europarl, 30 M tokens.

N-grams: Language model smoothing

The problem: an n-gram missing from the data but present in a sentence gives p(sentence) = 0.
We need to assign a non-zero p to unseen data. This must hold: for all w: p(w) > 0.
The issue is more pronounced for higher-order models.
Smoothing: an attempt to amend the real counts of n-grams to the expected counts in any (unseen) data.
Add-one, add-α, Good-Turing smoothing.
More in PA154 (Language Modeling).

Evaluation of Language Models: Quality and comparison of LMs

We need to compare the quality of various LMs (various orders, various data, smoothing techniques, etc.):
1. extrinsic evaluation (WER, MT, ASR, OCR)
2. intrinsic evaluation (perplexity)

A good LM should assign a higher probability to a good(-looking) text than to an incorrect text.
For a fixed test text we can compare various LMs.

Evaluation of Language Models: Cross-entropy

H(P_LM) = -(1/n) · log_2 P_LM(w_1, w_2, ..., w_n)
        = -(1/n) · Σ_{i=1}^{n} log_2 P_LM(w_i | w_1, ..., w_{i-1})

Cross-entropy is the average value of the negative logarithms of the word probabilities in the test text. It corresponds to a measure of the uncertainty of a probability distribution: the lower, the better.

A good LM should reach an entropy close to the real entropy of the language. That cannot be measured directly, but quite reliable estimates exist, e.g. Shannon's game. For English, the entropy is estimated at approximately 1.3 bits per letter.

Evaluation of Language Models: Perplexity

PP = 2^H(P_LM)

Perplexity is a simple transformation of cross-entropy. A good LM should not waste probability mass on improbable phenomena. The lower the entropy, the better; hence the lower the perplexity, the better.
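A minimal sketch of intrinsic evaluation on a test text, reusing the hypothetical train_trigram counts from the earlier sketch; add-one smoothing and the fixed vocabulary size are illustrative assumptions (not the technique prescribed on the slides) so that unseen trigrams do not yield infinite cross-entropy:

import math

def add_one_prob(w1, w2, w3, trigrams, bigrams, vocab_size):
    # add-one (Laplace) smoothed trigram probability: unseen trigrams get p > 0
    return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + vocab_size)

def cross_entropy(test_tokens, trigrams, bigrams, vocab_size):
    # H(P_LM) = -(1/n) * sum_i log2 p(w_i | w_{i-2}, w_{i-1})
    logs = [math.log2(add_one_prob(w1, w2, w3, trigrams, bigrams, vocab_size))
            for w1, w2, w3 in zip(test_tokens, test_tokens[1:], test_tokens[2:])]
    return -sum(logs) / len(logs)

def perplexity(test_tokens, trigrams, bigrams, vocab_size):
    # PP = 2 ** H(P_LM); the lower, the better
    return 2 ** cross_entropy(test_tokens, trigrams, bigrams, vocab_size)

# usage (toy data; vocab_size = number of distinct word types in training):
# trigrams, bigrams = train_trigram("this is a sentence this is a test".split())
# print(perplexity("this is a test sentence".split(), trigrams, bigrams, vocab_size=5))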