Statistical Natural Language Processing

P. Rychlý
NLP Centre, FI MU, Brno
September 23, 2022


Outline

1. Word lists
2. Collocations
3. Language Modeling, n-grams
4. Evaluation of Language Models


Statistical Natural Language Processing

• statistics provides a summary (of a text)
• highlights important or interesting facts
• can be used to model data
• foundation of estimating probabilities
• fundamental statistics: size (+ domain, range)

          lines   words    bytes
  Book 1  3,715   37,703   223,415
  Book 2  1,601   16,859   91,031


Word list

• list of all words from a text
• list of most frequent words
• words, lemmas, senses, tags, domains, years ...

Book 1: the, and, of, to, you, his, in, said, that, I, will, him, your, he, a, my, was, with, s, for, me, He, is, father, God, it, them, be, The, all, land, have, from, Jacob, on, her, Yahweh, son, Joseph, are, their, were, they, which, sons, t, up, Abraham, had, there

Book 2: the, I, to, a, of, is, that, little, you, he, and, said, was, prince, in, it, not, me, my, have, And, are, one, for, But, his, be, The, It, at, all, with, on, will, as, very, had, this, him, He, from, they, planet, so, them, no, You, do, would, like


Frequency

• number of occurrences (raw frequency)
• relative frequency (hits per million)
• document frequency (number of documents with a hit)
• reduced frequency (ARF, ALDf), 1 < reduced < raw
• normalization for comparison
• hapax legomena (= 1 hit)


Zipf's Law

• rank-frequency plot
• rank × frequency = constant

[Figure: rank-frequency plot on a linear scale; x-axis: rank, 0-100]

[Figure: rank-frequency plot on a log-log scale with the top-ranked words (the, of, and, to, in, that, his, I) labelled; x-axis: Rank (log scale)]
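A minimal sketch (in Python, which the lecture does not prescribe) tying the word-list, Frequency, and Zipf's Law slides together: build a frequency list, normalize to hits per million, and check that rank × frequency stays roughly constant. The file name book.txt and the naive \w+ tokenizer are placeholder assumptions, not artifacts from the lecture.

    # Word list + Zipf check; "book.txt" is a placeholder path.
    import re
    from collections import Counter

    with open("book.txt", encoding="utf-8") as f:
        words = re.findall(r"\w+", f.read().lower())

    freqs = Counter(words)
    total = len(words)
    print(f"{total} tokens, {len(freqs)} types")

    print("rank  word          freq  per-million  rank*freq")
    for rank, (word, freq) in enumerate(freqs.most_common(20), start=1):
        # Zipf's law predicts rank * freq to be roughly constant.
        print(f"{rank:4}  {word:12}  {freq:4}  {freq/total*1e6:11.0f}  {rank*freq:9}")

    # Hapax legomena: types with exactly one occurrence.
    print(sum(1 for f in freqs.values() if f == 1), "hapax legomena")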
Keywords

• select only important words from a word list
• compare to reference text (norm)
• simple math score (a scoring sketch follows after the language-model slides below):

  score = (freq_focus + N) / (freq_reference + N)

  Genesis:       son, God, father, Jacob, Yahweh, Joseph, Abraham, wife, behold, daughter
  Little Prince: prince, planet, flower, little, fox, never, too, drawing, reply, star


Collocations

• meaning of words is defined by the context
• collocations = salient words in the context
• usually not the most frequent
• filtering by part of speech, grammatical relation
• compare to reference = context for other words
• many statistics (usually single use only) based on frequencies
• MI-score, t-score, χ², ...
• logDice - scalable (a sketch follows after the language-model slides below):

  logDice = 14 + log₂ (2 · f_AB / (f_A + f_B))


Collocations of Prince

modifiers of "prince":
  little   (the little prince)
  fair     (fair, little prince)
  Oh       (Oh, little prince)
  dear     (dear little prince)
  prince   (prince. dear little prince)
  great    (great prince)

verbs with "prince" as object:
  say      (said the little prince)
  ask      (asked the little prince)
  demand   (demanded the little prince)
  see      (when he saw the little prince coming)
  inquire  (inquired the little prince)
  repeat   (repeated the little prince)

verbs with "prince" as subject:
  say      (the little prince said to himself)
  come     (saw the little prince coming)
  go       (And the little prince went away)
  add      (the little prince added)
  ask      (the little prince asked)
  flush    (The little prince flushed)


Collocations of Prince

[Figure: word-sketch visualisation of the collocates of "prince", grouped into modifiers of "prince", verbs with "prince" as object, verbs with "prince" as subject, and "prince" and/or ...]


Thesaurus

• comparing collocation distributions
• counting same context

  son (noun, freq 301):
    brother 161, wife 125, father 273, daughter 103, child 80,
    man 137, servant 91, Esau 78, Jacob 134, name 35

  Abraham (noun, freq 134):
    Isaac 82, Jacob 184, Joseph 157, Noah 41, Abram 61,
    Laban 54, Esau 78, God 234, Abimelech 24, father 278


Multi-word units

• meaning of some words is completely different in the context of a specific co-occurring word
• a black hole is not black and is not a hole
• strong collocations
• uses the same statistics with a different threshold
• better to compare context distributions instead of only numbers
• terminology - compare to a reference corpus


Language models - what are they good for?

• assigning scores to sequences of words
• predicting words
• generating text
• statistical machine translation
• automatic speech recognition
• optical character recognition


OCR + MT

[Figure: photo of a sign, an OCR + MT example: Russian "ВЫХОД В ГОРОД" recognised and translated as "ACCESS TO CITY"]


Language models - probability of a sentence

• LM is a probability distribution over all possible word sequences.
• What is the probability of utterance of s?

Probability of a sentence:
  p_LM(Catalonia President urges protests)
  p_LM(President Catalonia urges protests)
  p_LM(urges Catalonia protests President)

Ideally, the probability should strongly correlate with the fluency and intelligibility of a word sequence.
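Returning to the Keywords slide: a minimal sketch of the simple math score, score = (freq_focus + N) / (freq_reference + N). It assumes raw counts are first normalized to hits per million (following the "normalization for comparison" bullet of the Frequency slide); the toy counts are invented stand-ins, only the two corpus sizes come from the word/byte table above.

    # Simple-math keyword score; word counts below are invented toy numbers.
    from collections import Counter

    def keyword_scores(focus, focus_size, ref, ref_size, n=1.0):
        scores = {}
        for word, f in focus.items():
            fpm_focus = f / focus_size * 1e6            # hits per million
            fpm_ref = ref.get(word, 0) / ref_size * 1e6
            # N dampens rare words; a larger N favours frequent words.
            scores[word] = (fpm_focus + n) / (fpm_ref + n)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    genesis = Counter({"the": 2000, "god": 234, "jacob": 184, "prince": 1})
    prince = Counter({"the": 1800, "prince": 300, "planet": 60, "god": 2})
    for word, score in keyword_scores(prince, 16859, genesis, 37703):
        print(f"{word:8} {score:10.1f}")   # "planet" and "prince" come out on top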
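And a minimal sketch of the logDice association score from the Collocations slide, logDice = 14 + log₂(2 · f_AB / (f_A + f_B)). The counts are again invented; the point is that the score has a fixed maximum of 14 and does not depend on corpus size, which is why the slide calls it scalable.

    # logDice: f_ab = co-occurrence count, f_a / f_b = word counts.
    import math

    def log_dice(f_ab, f_a, f_b):
        return 14 + math.log2(2 * f_ab / (f_a + f_b))

    print(log_dice(f_ab=120, f_a=300, f_b=150))   # strong pair, close to 14
    print(log_dice(f_ab=2, f_a=300, f_b=5000))    # weak pair, much lower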
N-gram models

• an approximation of long sequences using short n-grams
• a straightforward implementation
• an intuitive approach
• good local fluency

Randomly generated text

Czech: "Jsi nebylo vidět vteřin přestal po schodech se dal do deníku a položili se táhl ji viděl na konci místnosti 101," řekl důstojník.
(approximately: "'You it was not seen seconds he stopped down the stairs took to the diary and they lay down he dragged her saw at the end of room 101,' said the officer" - locally fluent, globally incoherent)

Hungarian: A társaság kötelezettségeiért kapta a középkori temploma az volt, hogy a felhasználók az adottságai, a felhasználó azonosítása az egyesület alapszabályát.
(approximately: "For the obligations of the company it received its medieval church was that the users its attributes, the identification of the user the statutes of the association" - again incoherent)


N-gram models, naive approach

W = w_1 w_2 ... w_n

p(W) = ∏_i p(w_i | w_1 ... w_{i-1})

Markov's assumption (here for a trigram model):

p(W) ≈ ∏_i p(w_i | w_{i-2}, w_{i-1})

p(this is a sentence) = p(this) × p(is | this) × p(a | this, is) × p(sentence | is, a)

p(a | this, is) = count(this is a) / count(this is)

Sparse data problem.


Probabilities, practical issue

• probabilities of words are very small
• multiplying small numbers goes quickly to zero
• limits of floating-point numbers: ~10^-38 (single precision), ~10^-308 (double precision)
• using log space:
  ► avoids underflow
  ► adding is faster

log(p_1 × p_2 × p_3 × p_4) = log p_1 + log p_2 + log p_3 + log p_4


Computing, LM probabilities estimation

A trigram model uses 2 preceding words for probability learning. Using maximum-likelihood estimation:

p(w_3 | w_1, w_2) = count(w_1, w_2, w_3) / Σ_w count(w_1, w_2, w)

quadrigram: (lord, of, the, ?)

  w       count   p(w)
  rings   30,156  0.425
  flies    2,977  0.042
  well     1,536  0.021
  manor      907  0.012
  dance      767  0.010


Large LM - n-gram counts

How many unique n-grams are in a corpus?

  order     unique      singletons
  unigram       86,700      33,447 (38.6%)
  bigram     1,948,935   1,132,844 (58.1%)
  trigram    8,092,798   6,022,286 (74.4%)
  4-gram    15,303,847  13,081,621 (85.5%)
  5-gram    19,882,175  18,324,577 (92.2%)

Corpus: Europarl, 30 M tokens.


Language models smoothing

The problem: an n-gram is missing in the data but occurs in a sentence → p(sentence) = 0. We need to assign non-zero probability to unseen data. This must hold: ∀w : p(w) > 0. The issue is more pronounced for higher-order models.

Smoothing: an attempt to amend real counts of n-grams to expected counts in any (unseen) data.

Add-one, Add-α, Good-Turing smoothing (a minimal add-one sketch follows below).
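A minimal sketch pulling the naive n-gram slides together: maximum-likelihood trigram estimates, with the sentence scored as a sum of logs instead of a product of probabilities, per the practical-issue slide. The toy training text and the <s>/</s> padding markers are assumptions for illustration, not part of the lecture.

    # MLE trigram model scored in log space; toy data, not a real corpus.
    import math
    from collections import Counter

    tokens = "<s> <s> this is a sentence </s> <s> <s> this is a test </s>".split()
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))

    def p_mle(w3, w1, w2):
        # p(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

    def log_p(sentence):
        # Summing logs avoids the underflow of multiplying small numbers.
        w1, w2, logp = "<s>", "<s>", 0.0
        for w in sentence.split() + ["</s>"]:
            logp += math.log(p_mle(w, w1, w2))
            w1, w2 = w2, w
        return logp

    print(log_p("this is a sentence"))        # log(1/2): only p(sentence|is,a) < 1
    print(math.exp(log_p("this is a test")))  # back to a plain probability, 0.5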
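Finally, a minimal add-one (Laplace) sketch for the smoothing slide: every trigram, seen or unseen, receives a non-zero probability, so p(sentence) no longer collapses to 0 on a missing n-gram. The toy text is again an invented assumption.

    # Add-one smoothing: p = (count + 1) / (context count + V).
    from collections import Counter

    tokens = "<s> <s> this is a sentence </s>".split()
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    V = len(set(tokens))                       # vocabulary size, here 6

    def p_add_one(w3, w1, w2):
        # Guarantees p(w) > 0 for every w, seen or not.
        return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + V)

    print(p_add_one("sentence", "is", "a"))    # seen trigram: (1 + 1) / (1 + 6)
    print(p_add_one("planet", "is", "a"))      # unseen trigram: (0 + 1) / (1 + 6) > 0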