PV211: Introduction to Information Retrieval
https://www.fi.muni.cz/~sojka/PV211
IIR 12: Language Models for IR (handout version)
Petr Sojka, Hinrich Schütze et al.
Faculty of Informatics, Masaryk University, Brno
Center for Information and Language Processing, University of Munich
2023-04-12

Overview
1 What matters in language modeling (LM)
2 Language models
3 Language models for IR
4 Discussion

Take-away today
- What matters in language modeling?
- Feature (term) selection for text classification and similarity
- Statistical language models: introduction
- Statistical language models in IR
- Large language models
- Discussion: properties of different probabilistic models in use in IR, the hype around LLMs

Language modeling (in IR)
- A language model is a probability distribution over sequences of words.
- Language models are used in IR in the query likelihood model. There, a separate language model is associated with each document in a collection. Documents are ranked based on the probability of the query Q under the document's language model Md: P(Q | Md).
- Commonly, the unigram language model (bag of words model) is used for this purpose.
- An n-gram language model is a language model that models sequences of words as a Markov process.
- Subword models (character: Morse, Turing; syllable) or phrase models (SWOMPT in English) are also possible.

Latest milestones in language modeling
- 2013: Mikolov et al.: Efficient Estimation of Word Representations in Vector Space (word2vec)
- 2017: Vaswani et al.: Attention Is All You Need
- 2018: Generative pretrained models (GPT): large language models consisting of deep neural networks with billions of trainable parameters, trained on massive datasets of unlabelled text, have demonstrated impressive results on a wide variety of natural language processing tasks. This development has shifted research focus toward the use of general-purpose LLMs.

Which language model is best?
- Perplexity: in information theory, perplexity measures how well a probability distribution or probability model predicts a sample.
- A low perplexity indicates that the probability distribution is good at predicting the sample.
- https://en.wikipedia.org/wiki/Perplexity
- https://huggingface.co/docs/transformers/perplexity
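Perplexity is easy to make concrete for the unigram case. The following minimal sketch is not from the slides; the function name and the toy four-word model are illustrative. It computes the perplexity of a unigram model on a token sequence; a uniform model over k outcomes has perplexity k, which the toy check reproduces.

```python
import math

def perplexity(model, tokens):
    """Perplexity of a unigram model (dict: token -> probability) on a token sequence.

    Perplexity = 2^(-(1/N) * sum_i log2 P(w_i)); lower is better.
    """
    assert tokens, "need at least one token"
    log2_prob = sum(math.log2(model[t]) for t in tokens)
    return 2 ** (-log2_prob / len(tokens))

# Toy check: a uniform distribution over 4 outcomes has perplexity 4.
uniform4 = {w: 0.25 for w in "a b c d".split()}
print(perplexity(uniform4, "a b c d a".split()))  # 4.0
```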
Feature (term) selection
- In text classification, we usually represent documents in a high-dimensional space, with each dimension corresponding to a term.
- In this lecture: axis = dimension = word = term = feature
- Many dimensions correspond to rare words.
- Rare words can mislead the classifier.
- Rare misleading features are called noise features.
- Eliminating noise features from the representation increases the efficiency and effectiveness of text classification.
- Eliminating features is called feature selection.

Example of a noise feature
- Say we are doing text classification for the class China.
- Suppose a rare term, say arachnocentric, carries no information about China, but all instances of arachnocentric happen to occur in China documents in our training set.
- Then we may learn a classifier that incorrectly interprets arachnocentric as evidence for the class China.
- Such an incorrect generalization from an accidental property of the training set is called overfitting.
- Feature selection reduces overfitting and improves the accuracy of the classifier.

Basic feature selection algorithm
SelectFeatures(D, c, k)
  V <- ExtractVocabulary(D)
  L <- []
  for each t in V
      do A(t, c) <- ComputeFeatureUtility(D, t, c)
         Append(L, <A(t, c), t>)
  return FeaturesWithLargestValues(L, k)
How do we compute A, the feature utility?

Different feature selection methods
- A feature selection method is mainly defined by the feature utility measure it employs.
- Feature utility measures:
  - Frequency: select the most frequent terms.
  - Mutual information: select the terms with the highest mutual information. Mutual information is also called information gain in this context.
  - Chi-square (see book)

Mutual information
- Compute the feature utility A(t, c) as the mutual information (MI) of term t and class c.
- MI tells us "how much information" the term contains about the class and vice versa.
- For example, if a term's occurrence is independent of the class (the same proportion of documents inside and outside the class contain the term), then MI is 0.
- Definition:
  I(U; C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} P(U = e_t, C = e_c) \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t) P(C = e_c)}

How to compute MI values
- Based on maximum likelihood estimates, the formula we actually use is:
  I(U; C) = \frac{N_{11}}{N} \log_2 \frac{N N_{11}}{N_{1.} N_{.1}} + \frac{N_{01}}{N} \log_2 \frac{N N_{01}}{N_{0.} N_{.1}} + \frac{N_{10}}{N} \log_2 \frac{N N_{10}}{N_{1.} N_{.0}} + \frac{N_{00}}{N} \log_2 \frac{N N_{00}}{N_{0.} N_{.0}}
- N_{10}: number of documents that contain t (e_t = 1) and are not in c (e_c = 0)
- N_{11}: number of documents that contain t (e_t = 1) and are in c (e_c = 1)
- N_{01}: number of documents that do not contain t (e_t = 0) and are in c (e_c = 1)
- N_{00}: number of documents that do not contain t (e_t = 0) and are not in c (e_c = 0)
- Marginals: N_{1.} = N_{11} + N_{10}, N_{.1} = N_{11} + N_{01}, etc.; N = N_{00} + N_{01} + N_{10} + N_{11}

How to compute MI values (2)
- Alternative way of computing MI:
  I(U; C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} P(U = e_t, C = e_c) \log_2 \frac{N(U = e_t, C = e_c)}{E(U = e_t, C = e_c)}
- N(U = e_t, C = e_c) is the observed count of documents with values e_t and e_c.
- E(U = e_t, C = e_c) is the expected count of documents with values e_t and e_c if we assume that the two random variables are independent.

MI example for poultry/export in Reuters
                        e_c = e_poultry = 1    e_c = e_poultry = 0
  e_t = e_export = 1    N_{11} = 49            N_{10} = 27,652
  e_t = e_export = 0    N_{01} = 141           N_{00} = 774,106
Plugging these values into the formula (N = 801,948):
  I(U; C) = \frac{49}{801,948} \log_2 \frac{801,948 \cdot 49}{(49+27,652)(49+141)}
          + \frac{141}{801,948} \log_2 \frac{801,948 \cdot 141}{(141+774,106)(49+141)}
          + \frac{27,652}{801,948} \log_2 \frac{801,948 \cdot 27,652}{(49+27,652)(27,652+774,106)}
          + \frac{774,106}{801,948} \log_2 \frac{801,948 \cdot 774,106}{(141+774,106)(27,652+774,106)}
          \approx 0.0001105
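The MI formula and the SelectFeatures pseudocode above translate directly into code. The sketch below is mine, not from the slides: the names mutual_information and select_features, the corpus representation (a list of token lists plus a list of labels), and the convention that a summand with a zero joint count contributes 0 are all my choices. Run on the export/POULTRY counts, it reproduces the value of roughly 0.00011.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI of a term and a class from the four contingency counts (MLE-based formula)."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # term present / term absent marginals
    n_1, n_0 = n11 + n01, n10 + n00   # in class / not in class marginals

    def summand(n_joint, n_row, n_col):
        # Convention: a summand with zero joint count contributes 0 (limit of x*log x).
        if n_joint == 0:
            return 0.0
        return (n_joint / n) * math.log2(n * n_joint / (n_row * n_col))

    return (summand(n11, n1_, n_1) + summand(n01, n0_, n_1) +
            summand(n10, n1_, n_0) + summand(n00, n0_, n_0))

def select_features(docs, labels, c, k):
    """Return the k terms with the highest MI for class c.

    docs: list of token lists; labels: list of class labels (one per doc).
    """
    vocab = {t for d in docs for t in d}
    scored = []
    for t in vocab:
        n11 = sum(1 for d, l in zip(docs, labels) if t in d and l == c)
        n10 = sum(1 for d, l in zip(docs, labels) if t in d and l != c)
        n01 = sum(1 for d, l in zip(docs, labels) if t not in d and l == c)
        n00 = sum(1 for d, l in zip(docs, labels) if t not in d and l != c)
        scored.append((mutual_information(n11, n10, n01, n00), t))
    return [t for _, t in sorted(scored, reverse=True)[:k]]

print(mutual_information(49, 27652, 141, 774106))  # ~0.00011 (export/POULTRY example)
```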
MI feature selection on Reuters
Class coffee:                 Class sports:
  term        MI                term      MI
  coffee      0.0111            soccer    0.0681
  bags        0.0042            cup       0.0515
  growers     0.0025            match     0.0441
  kg          0.0019            matches   0.0408
  colombia    0.0018            played    0.0388
  brazil      0.0016            league    0.0386
  export      0.0014            beat      0.0301
  exporters   0.0013            game      0.0299
  exports     0.0013            games     0.0284
  crop        0.0012            team      0.0264

Naive Bayes: Effect of feature selection
[Figure: F1 measure as a function of the number of features selected (1 to 10,000, log scale) for multinomial Naive Bayes with MI, chi-square, and frequency-based feature selection, and for Bernoulli ("binomial") Naive Bayes with MI.]

Feature selection for Naive Bayes
- In general, feature selection is necessary for Naive Bayes to get decent performance.
- This also holds for many other learning methods in text classification: you need feature selection for optimal performance.

Exercise
(i) Compute the "Kyoto"/JAPAN contingency table (analogous to the "export"/POULTRY table below) for the collection given below.
(ii) Make up a contingency table for which MI is 0, that is, for which term and class are independent of each other.
"export"/POULTRY table:
                        e_c = e_poultry = 1    e_c = e_poultry = 0
  e_t = e_export = 1    N_{11} = 49            N_{10} = 27,652
  e_t = e_export = 0    N_{01} = 141           N_{00} = 774,106
Collection (training set):
  docID   words in document        in c = Japan?
  1       Kyoto Osaka Taiwan       yes
  2       Japan Kyoto              yes
  3       Taipei Taiwan            no
  4       Macao Taiwan Shanghai    no
  5       London                   no

Using language models (LMs) for IR
1 LM = language model
2 We view the document as a generative model that generates the query.
3 What we need to do:
4 Define the precise generative model we want to use
5 Estimate parameters (different parameters for each document's model)
6 Smooth to avoid zeros
7 Apply to the query and find the document most likely to have generated the query
8 Present the most likely document(s) to the user
9 Note that steps 4-7 are very similar to what we did in Naive Bayes. (A minimal code sketch of these steps, still without smoothing, follows below.)
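To make the outline concrete before the details, here is a minimal, unsmoothed sketch of query-likelihood retrieval. The names (unigram_mle, query_likelihood, rank) and the data representation are illustrative, not from the slides; it uses maximum likelihood unigram estimates only, so it still suffers from the zero problem that the smoothing slides below fix.

```python
from collections import Counter
from math import prod

def unigram_mle(doc_tokens):
    """Step 5: maximum likelihood unigram model of one document, P(t|Md) = tf_{t,d} / |d|."""
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    return {t: c / length for t, c in counts.items()}

def query_likelihood(query_tokens, model):
    """Step 7: P(q|Md) under term independence; 0 if any query term is unseen (no smoothing yet)."""
    return prod(model.get(t, 0.0) for t in query_tokens)

def rank(docs, query):
    """Rank documents by P(q|Md); docs is a dict name -> token list."""
    models = {name: unigram_mle(toks) for name, toks in docs.items()}            # step 5
    scores = {name: query_likelihood(query, m) for name, m in models.items()}    # step 7
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)            # step 8
```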
What is a language model?
- We can view a finite state automaton as a deterministic language model.
- Example: an automaton that generates "I wish I wish I wish I wish I wish ..."
- It cannot generate "wish I wish" or "I wish I".
- Our basic model: each document was generated by a different automaton like this, except that these automata are probabilistic.

A probabilistic language model
This is a one-state probabilistic finite-state automaton, a unigram language model, and the state emission distribution for its one state q1. STOP is not a word, but a special symbol indicating that the automaton stops.
  w      P(w|q1)    w       P(w|q1)
  STOP   0.2        toad    0.01
  the    0.2        said    0.03
  a      0.1        likes   0.02
  frog   0.01       that    0.04
  ...               ...
String: "frog said that toad likes frog STOP"
P(string) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 4.8 · 10^-12

A different language model for each document
  language model of d1                language model of d2
  w      P(w|.)   w       P(w|.)      w      P(w|.)   w       P(w|.)
  STOP   .2       toad    .01         STOP   .2       toad    .02
  the    .2       said    .03         the    .15      said    .03
  a      .1       likes   .02         a      .08      likes   .02
  frog   .01      that    .04         frog   .01      that    .05
  ...              ...                ...              ...
Query: "frog said that toad likes frog STOP"
P(query|Md1) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 4.8 · 10^-12
P(query|Md2) = 0.01 · 0.03 · 0.05 · 0.02 · 0.02 · 0.01 · 0.2 = 12 · 10^-12
P(query|Md1) < P(query|Md2)
Thus, document d2 is "more relevant" to the query "frog said that toad likes frog STOP" than d1 is.
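A few lines of code make the per-document scoring above concrete. This is a minimal sketch with my own variable names; it encodes only the query terms of the two emission tables from the slide and reproduces the two probabilities 4.8 · 10^-12 and 1.2 · 10^-11.

```python
from math import prod

# Emission probabilities for the terms occurring in the query (values from the slide).
m_d1 = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01, "likes": 0.02, "STOP": 0.2}
m_d2 = {"frog": 0.01, "said": 0.03, "that": 0.05, "toad": 0.02, "likes": 0.02, "STOP": 0.2}

query = "frog said that toad likes frog STOP".split()

def p_query(model, q):
    """Unigram query likelihood: product of per-token emission probabilities."""
    return prod(model[t] for t in q)

print(p_query(m_d1, query))  # ~4.8e-12
print(p_query(m_d2, query))  # ~1.2e-11
```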
Using language models in IR
- Each document is treated as (the basis for) a language model.
- Given a query q, rank documents based on P(d|q):
  P(d|q) = \frac{P(q|d) P(d)}{P(q)}
- P(q) is the same for all documents, so we can ignore it.
- P(d) is the prior, often treated as the same for all d; but we can give a higher prior to "high-quality" documents, e.g., those with high PageRank.
- P(q|d) is the probability of q given d.
- For a uniform prior, ranking documents according to P(q|d) and according to P(d|q) is equivalent.

Where we are
- In the LM approach to IR, we attempt to model the query generation process.
- Then we rank documents by the probability that a query would be observed as a random sample from the respective document model.
- That is, we rank according to P(q|d).
- Next: how do we compute P(q|d)?

How to compute P(q|d)
- We make the same conditional independence assumption as for Naive Bayes:
  P(q|M_d) = P(\langle t_1, \ldots, t_{|q|} \rangle | M_d) = \prod_{1 \le k \le |q|} P(t_k | M_d)
  (|q|: length of q; t_k: the token occurring at position k in q)
- This is equivalent to:
  P(q|M_d) = \prod_{\text{distinct term } t \text{ in } q} P(t|M_d)^{tf_{t,q}}
  (tf_{t,q}: term frequency, i.e., number of occurrences, of t in q)
- This is the multinomial model (omitting the constant factor).

Parameter estimation
- Missing piece: where do the parameters P(t|M_d) come from?
- Start with maximum likelihood estimates (as we did for Naive Bayes):
  \hat{P}(t|M_d) = \frac{tf_{t,d}}{|d|}
  (|d|: length of d; tf_{t,d}: number of occurrences of t in d)
- As in Naive Bayes, we have a problem with zeros: a single term t with P(t|M_d) = 0 makes the whole product P(q|M_d) zero.
- We would give a single term "veto power". For example, for the query [Michael Jackson top hits], a document about "top songs" (but not using the word "hits") would get P(q|M_d) = 0. That's bad.
- We need to smooth the estimates to avoid zeros.

Smoothing
- Key intuition: a non-occurring term is possible (even though it didn't occur), but no more likely than would be expected by chance in the collection.
- Notation: M_c is the collection model; cf_t is the number of occurrences of t in the collection; T = \sum_t cf_t is the total number of tokens in the collection.
  \hat{P}(t|M_c) = \frac{cf_t}{T}
- We will use \hat{P}(t|M_c) to "smooth" P(t|d) away from zero.

Jelinek-Mercer smoothing
  P(t|d) = \lambda P(t|M_d) + (1 - \lambda) P(t|M_c)
- Mixes the probability from the document with the general collection frequency of the word.
- High value of \lambda: "conjunctive-like" search, tends to retrieve documents containing all query words.
- Low value of \lambda: more disjunctive, suitable for long queries.
- Correctly setting \lambda is very important for good performance.

Jelinek-Mercer smoothing: Summary
  P(q|d) \propto \prod_{1 \le k \le |q|} (\lambda P(t_k|M_d) + (1 - \lambda) P(t_k|M_c))
- What we model: the user has a document in mind and generates the query from this document. The equation represents the probability that the document the user had in mind was in fact this one.

Example
- Collection: d1 and d2
- d1: Jackson was one of the most talented entertainers of all time
- d2: Michael Jackson anointed himself King of Pop
- Query q: Michael Jackson
- Use the mixture model with \lambda = 1/2:
  P(q|d1) = [(0/11 + 1/18)/2] · [(1/11 + 2/18)/2] ≈ 0.003
  P(q|d2) = [(1/7 + 1/18)/2] · [(1/7 + 2/18)/2] ≈ 0.013
- Ranking: d2 > d1
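The worked example above is easy to reproduce in code. The sketch below uses my own function and variable names; it builds the document and collection models from raw token counts and applies Jelinek-Mercer smoothing with lambda = 1/2, printing approximately 0.0028 for d1 and 0.0126 for d2, matching the slide.

```python
from collections import Counter
from math import prod

def jm_score(query, doc, collection, lam=0.5):
    """Jelinek-Mercer smoothed query likelihood:
    P(q|d) = prod_k [ lam * P(t_k|Md) + (1 - lam) * P(t_k|Mc) ]."""
    d_counts, c_counts = Counter(doc), Counter(collection)
    d_len, c_len = len(doc), len(collection)
    return prod(lam * d_counts[t] / d_len + (1 - lam) * c_counts[t] / c_len
                for t in query)

d1 = "Jackson was one of the most talented entertainers of all time".split()
d2 = "Michael Jackson anointed himself King of Pop".split()
collection = d1 + d2            # 18 tokens in total
query = "Michael Jackson".split()

print(jm_score(query, d1, collection))  # ~0.0028
print(jm_score(query, d2, collection))  # ~0.0126
```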
Exercise: Compute ranking
- Collection: d1 and d2
- d1: Xerox reports a profit but revenue is down
- d2: Lucene narrows quarter loss but revenue decreases further
- Query q: revenue down
- Use the mixture model with \lambda = 1/2:
  P(q|d1) = [(1/8 + 2/16)/2] · [(1/8 + 1/16)/2] = 1/8 · 3/32 = 3/256
  P(q|d2) = [(1/8 + 2/16)/2] · [(0/8 + 1/16)/2] = 1/8 · 1/32 = 1/256
- Ranking: d1 > d2

Dirichlet smoothing
  \hat{P}(t|d) = \frac{tf_{t,d} + \alpha \hat{P}(t|M_c)}{L_d + \alpha}
  (L_d: length of document d)
- The background distribution \hat{P}(t|M_c) is the prior for \hat{P}(t|d).
- Intuition: before having seen any part of the document, we start with the background distribution as our estimate; as we read the document and count terms, we update this estimate.
- The weighting factor \alpha determines how strong an effect the prior has.

Jelinek-Mercer or Dirichlet?
- Dirichlet performs better for keyword queries, Jelinek-Mercer performs better for verbose queries.
- Both models are sensitive to the smoothing parameters; you shouldn't use these models without parameter tuning.

Sensitivity of Dirichlet to smoothing parameter
[Figure omitted. µ is the Dirichlet smoothing parameter (called \alpha on the previous slides).]
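For comparison with the Jelinek-Mercer sketch above, here is a minimal Dirichlet-smoothed version. The names are again illustrative; the default alpha = 2000 is a value often quoted as reasonable in the smoothing literature, not something prescribed by the slides, and for the tiny exercise collection a much smaller alpha is used so the prior does not swamp the document evidence.

```python
from collections import Counter
from math import prod

def dirichlet_score(query, doc, collection, alpha=2000):
    """Dirichlet-smoothed query likelihood:
    P(t|d) = (tf_{t,d} + alpha * P(t|Mc)) / (L_d + alpha)."""
    d_counts, c_counts = Counter(doc), Counter(collection)
    d_len, c_len = len(doc), len(collection)
    return prod((d_counts[t] + alpha * c_counts[t] / c_len) / (d_len + alpha)
                for t in query)

# Reusing the exercise collection; alpha = 2 keeps the prior weak for this toy case.
d1 = "Xerox reports a profit but revenue is down".split()
d2 = "Lucene narrows quarter loss but revenue decreases further".split()
collection = d1 + d2
query = "revenue down".split()

print(dirichlet_score(query, d1, collection, alpha=2))
print(dirichlet_score(query, d2, collection, alpha=2))  # d1 ranks above d2, as in the exercise
```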
Language models are generative models
- We have assumed that queries are generated by a probabilistic process that looks like this (as in Naive Bayes):
[Figure: class node C = China generating the word sequence X1 = Beijing, X2 = and, X3 = Taipei, X4 = join, X5 = WTO.]

Naive Bayes and LM generative models
- Naive Bayes: we want to classify document d. LM: we want to classify a query q.
- Naive Bayes: classes are, e.g., geographical regions like China, UK, Kenya. LM: each document in the collection is a different class.
- Naive Bayes: assume that d was generated by the generative model. LM: assume that q was generated by a generative model.
- Key question in Naive Bayes: which of the classes is most likely to have generated the document? In LM: which document (= class) is most likely to have generated the query q?
- Or: for which class (Naive Bayes) or for which document as the source of the query (LM) do we have the most evidence?

Naive Bayes multinomial model / IR language models
[Figure: class node C = China generating X1 = Beijing, X2 = and, X3 = Taipei, X4 = join, X5 = WTO.]

Naive Bayes Bernoulli model / Binary independence model
[Figure: class node C = China with binary term variables U_Alaska = 0, U_Beijing = 1, U_India = 0, U_join = 1, U_Taipei = 1, U_WTO = 1.]

Comparison of the two models (multinomial model / IR language model vs. Bernoulli model / BIM)
- event model: generation of a (multi)set of tokens vs. generation of a subset of the vocabulary
- random variable(s): X = t iff t occurs at a given position vs. U_t = 1 iff t occurs in the doc
- document representation: d = \langle t_1, \ldots, t_k, \ldots, t_{n_d} \rangle, t_k \in V vs. d = \langle e_1, \ldots, e_i, \ldots, e_M \rangle, e_i \in \{0,1\}
- parameter estimation: \hat{P}(X = t|c) vs. \hat{P}(U_i = e|c)
- decision rule: maximize \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(X = t_k|c) vs. maximize \hat{P}(c) \prod_{t_i \in V} \hat{P}(U_i = e_i|c)
- multiple occurrences: taken into account vs. ignored
- length of docs: can handle longer docs vs. works best for short docs
- number of features: can handle more vs. works best with fewer
- estimate for "the": \hat{P}(X = the|c) ≈ 0.05 vs. \hat{P}(U_the = 1|c) ≈ 1.0

LMs vs. multinomial Naive Bayes
- We classify the query in LMs; we classify documents in text classification.
- Each document is a class in LMs, whereas classes are human-defined in text classification.

Vector space (tf-idf) vs. LM
  Recall           tf-idf    LM        %chg      significant
  0.0              0.7439    0.7590    +2.0
  0.1              0.4521    0.4910    +8.6
  0.2              0.3514    0.4045    +15.1     *
  0.4              0.2093    0.2572    +22.9     *
  0.6              0.1024    0.1405    +37.1     *
  0.8              0.0160    0.0432    +169.6    *
  1.0              0.0028    0.0050    +76.9
  11-point average 0.1868    0.2233    +19.6     *
(The table shows precision at the given recall levels; * marks significant changes.)
- The language modeling approach always does better in these experiments, but note that where the approach shows significant gains is at higher levels of recall.

Vector space vs. BM25 vs. LM
- BM25 / LM: based on probability theory. Vector space: based on similarity, a geometric/linear algebra notion.
- Term frequency is directly used in all three models. LMs: raw term frequency; BM25 / vector space: more complex.
- Length normalization: vector space: cosine or pivot normalization; LMs: probabilities are inherently length normalized; BM25: tuning parameters for optimizing length normalization.
- idf: BM25 / vector space use it directly. LMs: mixing term and collection frequencies has an effect similar to idf. Terms that are rare in the general collection, but common in some documents, will have a greater influence on the ranking.
- Collection frequency (LMs) vs. document frequency (BM25, vector space).

Language models for IR: Assumptions
- Simplifying assumption: queries and documents are objects of the same type. Not true! There are other LMs for IR that do not make this assumption. The vector space model makes the same assumption.
- Simplifying assumption: terms are conditionally independent. Again, the vector space model (and Naive Bayes) make the same assumption.
- LMs offer a cleaner statement of assumptions than the vector space model, and thus a better theoretical foundation, but "pure" LMs perform much worse than "tuned" LMs.

Take-away today
- What matters in language modeling?
- Feature (term) selection for text classification and similarity
- Statistical language models: introduction
- Statistical language models in IR
- Large language models
- Discussion: properties of different probabilistic models in use in IR, the hype around LLMs

Resources
- Chapter 13 of IIR (feature selection)
- Chapter 12 of IIR
- Resources at https://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org; materials in MU IS and the FI MU library
- Ponte and Croft's 1998 SIGIR paper (one of the first on LMs in IR)
- Zhai and Lafferty: A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. (2004)
- Lemur toolkit (good support for LMs in IR)