Statistical Natural Language Processing
PA153
P. Rychlý
September 25, 2023

1. Word Lists
2. Collocations
3. Language Modeling
4. N-grams
5. Evaluation of Language Models

Statistical Natural Language Processing
■ statistics provides a summary (of a text)
■ highlights important or interesting facts
■ can be used to model data
■ foundation of estimating probabilities
■ fundamental statistics: size (+ domain, range)

          lines    words     bytes
Book 1    3,715    37,703    223,415
Book 2    1,601    16,859    91,031

Word list
■ list of all words from a text
■ list of most frequent words
■ words, lemmas, senses, tags, domains, years ...

Book 1: the, and, of, to, you, his, in, said, that, I, will, him, your, he, a, my, was, with, s, for, me, He, is, father, God, it, them, be, The, all, land, have, from, Jacob, on, her, Yahweh, son, Joseph, are, their, were, they, which, sons, t, up, Abraham, had, there

Book 2: the, I, to, a, of, is, that, little, you, he, and, said, was, prince, in, it, not, me, my, have, And, are, one, for, But, his, be, The, It, at, all, with, on, will, as, very, had, this, him, He, from, they, planet, so, them, no, You, do, would, like

Frequency
■ number of occurrences (raw frequency)
■ relative frequency (hits per million)
■ document frequency (number of documents with a hit)
■ reduced frequency (ARF, ALDf), 1 ≤ reduced ≤ raw
■ normalization for comparison
■ hapax legomena (= 1 hit)

Zipf's Law
rank × frequency = constant
[figure: rank-frequency plot; x-axis: rank, y-axis: frequency]
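As an illustration of the word list, frequency, and Zipf's law slides above, here is a minimal Python sketch (not part of the original slides); the naive tokenization, the input file name, and the output format are my own assumptions.

```python
from collections import Counter
import re

def word_list(text):
    """Return (word, raw frequency, hits per million), most frequent first."""
    tokens = re.findall(r"[a-z']+", text.lower())   # naive tokenization (assumption)
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(w, f, f * 1_000_000 / total) for w, f in counts.most_common()]

# Zipf's law: rank * frequency should stay roughly constant for the top words
text = open("book1.txt", encoding="utf-8").read()   # hypothetical input file
for rank, (w, raw, hpm) in enumerate(word_list(text)[:10], start=1):
    print(f"{rank:2d}  {w:10s}  raw={raw:6d}  per-million={hpm:9.1f}  rank*freq={rank * raw}")
```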
Zipf's Law
[figure: log-log rank-frequency plot; x-axis: Rank (log scale); labelled points: the, and, to, in, that, his, it]

Keywords
■ select only important words from a word list
■ compare to reference text (norm)
■ simple math score:

score = (freq_focus + N) / (freq_reference + N)

Genesis: son, God, father, Jacob, Yahweh, Joseph, Abraham, wife, behold, daughter
Little Prince: prince, planet, flower, little, fox, never, too, drawing, reply, star

Collocations
■ meaning of words is defined by the context
■ collocations are salient words in the context
■ usually not the most frequent
■ filtering by part of speech, grammatical relation
■ compare to reference = context for other words
■ many statistics (usually single use only) based on frequencies
■ MI-score, t-score, χ², ...
■ logDice - scalable

logDice = 14 + log2 (2 f_AB / (f_A + f_B))

Collocations of Prince
[word sketch for "prince": modifiers of "prince" (little, fair, dear, great), verbs with "prince" as object (say, ask, demand, see), verbs with "prince" as subject (say, come, go, add, ask, inquire, flush, repeat), with example sentences such as "said the little prince", "asked the little prince", "the little prince went away"]

[figure: graph visualisation of the same collocations of "prince", grouped by grammatical relation]

Thesaurus
■ comparing collocation distributions
■ counting same context

son (as noun, 301x)          Abraham (as noun, 134x)
 1  brother     161           1  Isaac        82
 2  wife        125           2  Jacob       184
 3  father      278           3  Joseph      157
 4  daughter    103           4  Noah         41
 5  child        80           5  Abram        61
 6  man         187           6  Laban        54
 7  servant      91           7  Esau         78
 8  Esau         78           8  God         234
 9  Jacob       184           9  Abimelech    24
10  name         85          10  father      278

Multi-word units
■ meaning of some words is completely different in the context of a specific co-occurring word
■ black hole: is not black and is not a hole
■ strong collocations
■ uses the same statistics with a different threshold
■ better to compare context distributions instead of only numbers
■ terminology - compare to a reference corpus

Language models - what are they good for?
■ assigning scores to sequences of words
■ predicting words
■ generating text
■ statistical machine translation
■ automatic speech recognition
■ optical character recognition

OCR + MT
[figure]
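For illustration, a minimal sketch (not from the slides) of the logDice score defined on the Collocations slide above; the co-occurrence counts are invented.

```python
import math

def log_dice(f_ab, f_a, f_b):
    """logDice = 14 + log2(2*f_AB / (f_A + f_B)); the theoretical maximum is 14."""
    return 14 + math.log2(2 * f_ab / (f_a + f_b))

# toy counts (assumed): "little" and "prince" co-occur 300x,
# "little" occurs 400x and "prince" 350x in the corpus
print(round(log_dice(300, 400, 350), 2))   # strong collocation, close to 14
print(round(log_dice(2, 400, 5000), 2))    # weak collocation, much lower score
```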
Language models - probability of a sentence
■ LM is a probability distribution over all possible word sequences.
■ What is the probability of utterance of s?

Probability of a sentence:
P_LM(Catalonia President urges protests)
P_LM(President Catalonia urges protests)
P_LM(urges Catalonia protests President)

Ideally, the probability should strongly correlate with the fluency and intelligibility of a word sequence.

N-gram models
■ an approximation of long sequences using short n-grams
■ a straightforward implementation
■ an intuitive approach
■ good local fluency

Randomly generated text
Czech: "Jsi nebylo vidět vteřin přestal po schodech se dal do deníku a položili se táhl ji viděl na konci místnosti 101," řekl důstojník.
Hungarian: A társaság kötelezettségeiért kapta a középkori temploma az volt, hogy a felhasználók az adottságai, a felhasználó azonosítása az egyesület alapszabályát.

N-gram models, naïve approach
Markov's assumption:
p(this is a sentence) = p(this) × p(is | this) × p(a | this, is) × p(sentence | is, a)
p(a | this, is) = count(this is a) / count(this is)

Sparse data problem.

Probabilities, practical issue
■ probabilities of words are very small
■ multiplying small numbers goes quickly to zero
■ limits of floating point numbers: 10^-38 (single precision), 10^-308 (double precision)
■ using log space:
  ■ avoids underflow
  ■ adding is faster than multiplying

log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4

Computing LM probabilities, estimation
A trigram model uses the 2 preceding words for probability learning. Using maximum-likelihood estimation:

p(w3 | w1, w2) = count(w1, w2, w3) / Σ_w count(w1, w2, w)

quadrigram: (lord, of, the, ?)

w        count     p(w)
rings    30,156    0.425
flies     2,977    0.042
well      1,536    0.021
manor       907    0.012
dance       767    0.010
...

Larger LM - n-gram counts
How many unique n-grams are in a corpus?

order       unique        singletons
unigram         86,700        33,447 (38.6%)
bigram       1,948,935     1,132,844 (58.1%)
trigram      8,092,798     6,022,286 (74.4%)
4-gram      15,303,847    13,081,621 (85.5%)
5-gram      19,882,175    18,324,577 (92.2%)

Corpus: Europarl, 30 M tokens.

Smoothing of probabilities
The problem: an n-gram is missing in the training data but appears in a sentence → p(sentence) = 0.
We need to assign non-zero p to unseen data. This must hold: ∀w : p(w) > 0
The issue is more pronounced for higher-order models.
Smoothing: an attempt to amend real counts of n-grams to expected counts in any (unseen) data.
Add-one, Add-α, Good-Turing smoothing
More in PA154 (Language Modeling).
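To tie the slides above together, here is a minimal sketch (mine, not from the slides) of a trigram model: maximum-likelihood counts, add-one smoothing so that unseen trigrams keep a non-zero probability, and scoring in log space. The start/end markers and the toy training sentences are assumptions.

```python
import math
from collections import defaultdict

class TrigramLM:
    """Minimal trigram model: MLE counts, add-one smoothing, log-space scoring."""
    def __init__(self, sentences):
        self.tri = defaultdict(int)   # count(w1, w2, w3)
        self.bi = defaultdict(int)    # count(w1, w2) as trigram context
        self.vocab = set()
        for s in sentences:
            words = ["<s>", "<s>"] + s + ["</s>"]
            self.vocab.update(words)
            for w1, w2, w3 in zip(words, words[1:], words[2:]):
                self.tri[(w1, w2, w3)] += 1
                self.bi[(w1, w2)] += 1

    def logprob(self, w3, w1, w2):
        # add-one smoothing keeps p > 0 for unseen trigrams
        v = len(self.vocab)
        return math.log((self.tri[(w1, w2, w3)] + 1) / (self.bi[(w1, w2)] + v))

    def sentence_logprob(self, sentence):
        words = ["<s>", "<s>"] + sentence + ["</s>"]
        return sum(self.logprob(w3, w1, w2)
                   for w1, w2, w3 in zip(words, words[1:], words[2:]))

# toy training data (assumption); in practice the counts come from a large corpus
lm = TrigramLM([["this", "is", "a", "sentence"], ["this", "is", "another", "sentence"]])
print(lm.sentence_logprob(["this", "is", "a", "sentence"]))   # higher (less negative)
print(lm.sentence_logprob(["sentence", "a", "is", "this"]))   # lower
```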
Quality and comparison of LMs
We need to compare the quality of various LMs (various orders, various data, smoothing techniques, etc.):
1. extrinsic evaluation (WER, MT, ASR, OCR)
2. intrinsic evaluation (perplexity)

A good LM should assign a higher probability to a good(-looking) text than to an incorrect text. For a fixed test text we can compare various LMs.

Cross-entropy
H(P_LM) = -(1/n) log2 P_LM(w_1, w_2, ... w_n) = -(1/n) Σ_{i=1..n} log2 P_LM(w_i | w_1, ... w_{i-1})

Cross-entropy is the average value of the negative logarithms of the words' probabilities in a testing text. It corresponds to a measure of the uncertainty of a probability distribution: the lower, the better.
A good LM should reach an entropy close to the real entropy of the language. That cannot be measured directly, but quite reliable estimates exist, e.g. Shannon's game. For English, the entropy is estimated at approx. 1.3 bits per letter.

Cross-perplexity
PP = 2^H(P_LM)

Cross-perplexity is a simple transformation of cross-entropy. A good LM should not waste probability mass on improbable phenomena. The lower the entropy, the better → the lower the perplexity, the better.
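A minimal sketch (not from the slides) of the cross-entropy and cross-perplexity computation above, assuming per-word probabilities from some LM are already available; the probability values below are invented.

```python
import math

def cross_entropy(word_log2_probs):
    """Average negative base-2 log probability per word of a test text (in bits)."""
    return -sum(word_log2_probs) / len(word_log2_probs)

def perplexity(word_log2_probs):
    """PP = 2 ** H(P_LM)."""
    return 2 ** cross_entropy(word_log2_probs)

# made-up per-word probabilities of a short test text under some LM
probs = [0.2, 0.05, 0.1, 0.01]
log2_probs = [math.log2(p) for p in probs]
print(cross_entropy(log2_probs))   # ~ 4.15 bits per word
print(perplexity(log2_probs))      # ~ 17.8
```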