PV211: Introduction to Information Retrieval
https://www.fi.muni.cz/~sojka/PV211
IIR 12: Language Models for IR (handout version)
Petr Sojka, Hinrich Schütze et al.
Faculty of Informatics, Masaryk University, Brno
Center for Information and Language Processing, University of Munich
2023-04-12

Overview
1 What matters in language modeling (LM)
2 Language models
3 Language models for IR
4 Discussion

Take-away today
- What matters in language modeling?
- Feature (term) selection for text classification and similarity
- Statistical language models: introduction
- Statistical language models in IR
- Large language models
- Discussion: properties of different probabilistic models in use in IR, the hype around LLMs

Language modeling (in IR)
- A language model is a probability distribution over sequences of words.
- Language models are used in IR in the query likelihood model. There, a separate language model is associated with each document in a collection. Documents are ranked based on the probability of the query Q under the document's language model Md: P(Q | Md).
- Commonly, the unigram language model (bag of words model) is used for this purpose.
- An n-gram language model is a language model that models sequences of words as a Markov process.
- Subword models (character: Morse, Turing; syllable) or phrase models (SWOMPT in English) are also possible.

Latest milestones in language modeling
- 2013: Mikolov et al.: Efficient Estimation of Word Representations in Vector Space (word2vec)
- 2017: Vaswani et al.: Attention Is All You Need
- 2018: Generative pretrained models (GPT): large language models consisting of deep neural networks with billions of trainable parameters, trained on massive datasets of unlabelled text, have demonstrated impressive results on a wide variety of natural language processing tasks. This development has shifted research focus toward the use of general-purpose LLMs.

Which language model is best?
- Perplexity: in information theory, perplexity measures how well a probability distribution or probability model predicts a sample.
- A low perplexity indicates that the probability distribution is good at predicting the sample.
- https://en.wikipedia.org/wiki/Perplexity
- https://huggingface.co/docs/transformers/perplexity
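Perplexity is easy to make concrete for the unigram case. The following minimal sketch is not from the slides; the function name and the toy four-word model are illustrative. It computes the perplexity of a unigram model on a token sequence; a uniform model over k outcomes has perplexity k, which the toy check reproduces.

```python
import math

def perplexity(model, tokens):
    """Perplexity of a unigram model (dict: token -> probability) on a token sequence.

    Perplexity = 2^(-(1/N) * sum_i log2 P(w_i)); lower is better.
    """
    assert tokens, "need at least one token"
    log2_prob = sum(math.log2(model[t]) for t in tokens)
    return 2 ** (-log2_prob / len(tokens))

# Toy check: a uniform distribution over 4 outcomes has perplexity 4.
uniform4 = {w: 0.25 for w in "a b c d".split()}
print(perplexity(uniform4, "a b c d a".split()))  # 4.0
```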
Feature (term) selection
- In text classification, we usually represent documents in a high-dimensional space, with each dimension corresponding to a term.
- In this lecture: axis = dimension = word = term = feature
- Many dimensions correspond to rare words.
- Rare words can mislead the classifier.
- Rare misleading features are called noise features.
- Eliminating noise features from the representation increases the efficiency and effectiveness of text classification.
- Eliminating features is called feature selection.

Example of a noise feature
- Say we are doing text classification for the class China.
- Suppose a rare term, say arachnocentric, carries no information about China, but all instances of arachnocentric happen to occur in China documents in our training set.
- Then we may learn a classifier that incorrectly interprets arachnocentric as evidence for the class China.
- Such an incorrect generalization from an accidental property of the training set is called overfitting.
- Feature selection reduces overfitting and improves the accuracy of the classifier.

Basic feature selection algorithm
SelectFeatures(D, c, k)
  V <- ExtractVocabulary(D)
  L <- []
  for each t in V
      do A(t, c) <- ComputeFeatureUtility(D, t, c)
         Append(L, <A(t, c), t>)
  return FeaturesWithLargestValues(L, k)
How do we compute A, the feature utility?

Different feature selection methods
- A feature selection method is mainly defined by the feature utility measure it employs.
- Feature utility measures:
  - Frequency: select the most frequent terms.
  - Mutual information: select the terms with the highest mutual information. Mutual information is also called information gain in this context.
  - Chi-square (see book)

Mutual information
- Compute the feature utility A(t, c) as the mutual information (MI) of term t and class c.
- MI tells us "how much information" the term contains about the class and vice versa.
- For example, if a term's occurrence is independent of the class (the same proportion of documents inside and outside the class contain the term), then MI is 0.
- Definition:
  I(U; C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} P(U = e_t, C = e_c) \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t) P(C = e_c)}

How to compute MI values
- Based on maximum likelihood estimates, the formula we actually use is:
  I(U; C) = \frac{N_{11}}{N} \log_2 \frac{N N_{11}}{N_{1.} N_{.1}} + \frac{N_{01}}{N} \log_2 \frac{N N_{01}}{N_{0.} N_{.1}} + \frac{N_{10}}{N} \log_2 \frac{N N_{10}}{N_{1.} N_{.0}} + \frac{N_{00}}{N} \log_2 \frac{N N_{00}}{N_{0.} N_{.0}}
- N_{10}: number of documents that contain t (e_t = 1) and are not in c (e_c = 0)
- N_{11}: number of documents that contain t (e_t = 1) and are in c (e_c = 1)
- N_{01}: number of documents that do not contain t (e_t = 0) and are in c (e_c = 1)
- N_{00}: number of documents that do not contain t (e_t = 0) and are not in c (e_c = 0)
- Marginals: N_{1.} = N_{11} + N_{10}, N_{.1} = N_{11} + N_{01}, etc.; N = N_{00} + N_{01} + N_{10} + N_{11}

How to compute MI values (2)
- Alternative way of computing MI:
  I(U; C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} P(U = e_t, C = e_c) \log_2 \frac{N(U = e_t, C = e_c)}{E(U = e_t, C = e_c)}
- N(U = e_t, C = e_c) is the observed count of documents with values e_t and e_c.
- E(U = e_t, C = e_c) is the expected count of documents with values e_t and e_c if we assume that the two random variables are independent.

MI example for poultry/export in Reuters
                        e_c = e_poultry = 1    e_c = e_poultry = 0
  e_t = e_export = 1    N_{11} = 49            N_{10} = 27,652
  e_t = e_export = 0    N_{01} = 141           N_{00} = 774,106
Plugging these values into the formula (N = 801,948):
  I(U; C) = \frac{49}{801,948} \log_2 \frac{801,948 \cdot 49}{(49+27,652)(49+141)}
          + \frac{141}{801,948} \log_2 \frac{801,948 \cdot 141}{(141+774,106)(49+141)}
          + \frac{27,652}{801,948} \log_2 \frac{801,948 \cdot 27,652}{(49+27,652)(27,652+774,106)}
          + \frac{774,106}{801,948} \log_2 \frac{801,948 \cdot 774,106}{(141+774,106)(27,652+774,106)}
          \approx 0.0001105
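The MI formula and the SelectFeatures pseudocode above translate directly into code. The sketch below is mine, not from the slides: the names mutual_information and select_features, the corpus representation (a list of token lists plus a list of labels), and the convention that a summand with a zero joint count contributes 0 are all my choices. Run on the export/POULTRY counts, it reproduces the value of roughly 0.00011.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI of a term and a class from the four contingency counts (MLE-based formula)."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # term present / term absent marginals
    n_1, n_0 = n11 + n01, n10 + n00   # in class / not in class marginals

    def summand(n_joint, n_row, n_col):
        # Convention: a summand with zero joint count contributes 0 (limit of x*log x).
        if n_joint == 0:
            return 0.0
        return (n_joint / n) * math.log2(n * n_joint / (n_row * n_col))

    return (summand(n11, n1_, n_1) + summand(n01, n0_, n_1) +
            summand(n10, n1_, n_0) + summand(n00, n0_, n_0))

def select_features(docs, labels, c, k):
    """Return the k terms with the highest MI for class c.

    docs: list of token lists; labels: list of class labels (one per doc).
    """
    vocab = {t for d in docs for t in d}
    scored = []
    for t in vocab:
        n11 = sum(1 for d, l in zip(docs, labels) if t in d and l == c)
        n10 = sum(1 for d, l in zip(docs, labels) if t in d and l != c)
        n01 = sum(1 for d, l in zip(docs, labels) if t not in d and l == c)
        n00 = sum(1 for d, l in zip(docs, labels) if t not in d and l != c)
        scored.append((mutual_information(n11, n10, n01, n00), t))
    return [t for _, t in sorted(scored, reverse=True)[:k]]

print(mutual_information(49, 27652, 141, 774106))  # ~0.00011 (export/POULTRY example)
```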
MI feature selection on Reuters
Class coffee:                 Class sports:
  term        MI                term      MI
  coffee      0.0111            soccer    0.0681
  bags        0.0042            cup       0.0515
  growers     0.0025            match     0.0441
  kg          0.0019            matches   0.0408
  colombia    0.0018            played    0.0388
  brazil      0.0016            league    0.0386
  export      0.0014            beat      0.0301
  exporters   0.0013            game      0.0299
  exports     0.0013            games     0.0284
  crop        0.0012            team      0.0264

Naive Bayes: Effect of feature selection
[Figure: F1 measure as a function of the number of features selected (1 to 10,000, log scale) for multinomial Naive Bayes with MI, chi-square, and frequency-based feature selection, and for Bernoulli ("binomial") Naive Bayes with MI.]

Feature selection for Naive Bayes
- In general, feature selection is necessary for Naive Bayes to get decent performance.
- This also holds for many other learning methods in text classification: you need feature selection for optimal performance.

Exercise
(i) Compute the "Kyoto"/JAPAN contingency table (analogous to the "export"/POULTRY table below) for the collection given below.
(ii) Make up a contingency table for which MI is 0, that is, for which term and class are independent of each other.
"export"/POULTRY table:
                        e_c = e_poultry = 1    e_c = e_poultry = 0
  e_t = e_export = 1    N_{11} = 49            N_{10} = 27,652
  e_t = e_export = 0    N_{01} = 141           N_{00} = 774,106
Collection (training set):
  docID   words in document        in c = Japan?
  1       Kyoto Osaka Taiwan       yes
  2       Japan Kyoto              yes
  3       Taipei Taiwan            no
  4       Macao Taiwan Shanghai    no
  5       London                   no

Using language models (LMs) for IR
1 LM = language model
2 We view the document as a generative model that generates the query.
3 What we need to do:
4 Define the precise generative model we want to use
5 Estimate parameters (different parameters for each document's model)
6 Smooth to avoid zeros
7 Apply to the query and find the document most likely to have generated the query
8 Present the most likely document(s) to the user
9 Note that steps 4-7 are very similar to what we did in Naive Bayes. (A minimal code sketch of these steps, still without smoothing, follows below.)
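To make the outline concrete before the details, here is a minimal, unsmoothed sketch of query-likelihood retrieval. The names (unigram_mle, query_likelihood, rank) and the data representation are illustrative, not from the slides; it uses maximum likelihood unigram estimates only, so it still suffers from the zero problem that the smoothing slides below fix.

```python
from collections import Counter
from math import prod

def unigram_mle(doc_tokens):
    """Step 5: maximum likelihood unigram model of one document, P(t|Md) = tf_{t,d} / |d|."""
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    return {t: c / length for t, c in counts.items()}

def query_likelihood(query_tokens, model):
    """Step 7: P(q|Md) under term independence; 0 if any query term is unseen (no smoothing yet)."""
    return prod(model.get(t, 0.0) for t in query_tokens)

def rank(docs, query):
    """Rank documents by P(q|Md); docs is a dict name -> token list."""
    models = {name: unigram_mle(toks) for name, toks in docs.items()}            # step 5
    scores = {name: query_likelihood(query, m) for name, m in models.items()}    # step 7
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)            # step 8
```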
What is a language model?
- We can view a finite state automaton as a deterministic language model.
- Example: an automaton that generates "I wish I wish I wish I wish I wish ..."
- It cannot generate "wish I wish" or "I wish I".
- Our basic model: each document was generated by a different automaton like this, except that these automata are probabilistic.

A probabilistic language model
This is a one-state probabilistic finite-state automaton, a unigram language model, and the state emission distribution for its one state q1. STOP is not a word, but a special symbol indicating that the automaton stops.
  w      P(w|q1)    w       P(w|q1)
  STOP   0.2        toad    0.01
  the    0.2        said    0.03
  a      0.1        likes   0.02
  frog   0.01       that    0.04
  ...               ...
String: "frog said that toad likes frog STOP"
P(string) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 4.8 · 10^-12

A different language model for each document
  language model of d1                language model of d2
  w      P(w|.)   w       P(w|.)      w      P(w|.)   w       P(w|.)
  STOP   .2       toad    .01         STOP   .2       toad    .02
  the    .2       said    .03         the    .15      said    .03
  a      .1       likes   .02         a      .08      likes   .02
  frog   .01      that    .04         frog   .01      that    .05
  ...              ...                ...              ...
Query: "frog said that toad likes frog STOP"
P(query|Md1) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 4.8 · 10^-12
P(query|Md2) = 0.01 · 0.03 · 0.05 · 0.02 · 0.02 · 0.01 · 0.2 = 12 · 10^-12
P(query|Md1) < P(query|Md2)
Thus, document d2 is "more relevant" to the query "frog said that toad likes frog STOP" than d1 is.
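A few lines of code make the per-document scoring above concrete. This is a minimal sketch with my own variable names; it encodes only the query terms of the two emission tables from the slide and reproduces the two probabilities 4.8 · 10^-12 and 1.2 · 10^-11.

```python
from math import prod

# Emission probabilities for the terms occurring in the query (values from the slide).
m_d1 = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01, "likes": 0.02, "STOP": 0.2}
m_d2 = {"frog": 0.01, "said": 0.03, "that": 0.05, "toad": 0.02, "likes": 0.02, "STOP": 0.2}

query = "frog said that toad likes frog STOP".split()

def p_query(model, q):
    """Unigram query likelihood: product of per-token emission probabilities."""
    return prod(model[t] for t in q)

print(p_query(m_d1, query))  # ~4.8e-12
print(p_query(m_d2, query))  # ~1.2e-11
```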
Using language models in IR
- Each document is treated as (the basis for) a language model.
- Given a query q, rank documents based on P(d|q):
  P(d|q) = \frac{P(q|d) P(d)}{P(q)}
- P(q) is the same for all documents, so we can ignore it.
- P(d) is the prior, often treated as the same for all d; but we can give a higher prior to "high-quality" documents, e.g., those with high PageRank.
- P(q|d) is the probability of q given d.
- For a uniform prior, ranking documents according to P(q|d) and according to P(d|q) is equivalent.

Where we are
- In the LM approach to IR, we attempt to model the query generation process.
- Then we rank documents by the probability that a query would be observed as a random sample from the respective document model.
- That is, we rank according to P(q|d).
- Next: how do we compute P(q|d)?

How to compute P(q|d)
- We make the same conditional independence assumption as for Naive Bayes:
  P(q|M_d) = P(\langle t_1, \ldots, t_{|q|} \rangle | M_d) = \prod_{1 \le k \le |q|} P(t_k | M_d)
  (|q|: length of q; t_k: the token occurring at position k in q)
- This is equivalent to:
  P(q|M_d) = \prod_{\text{distinct term } t \text{ in } q} P(t|M_d)^{tf_{t,q}}
  (tf_{t,q}: term frequency, i.e., number of occurrences, of t in q)
- This is the multinomial model (omitting the constant factor).

Parameter estimation
- Missing piece: where do the parameters P(t|M_d) come from?
- Start with maximum likelihood estimates (as we did for Naive Bayes):
  \hat{P}(t|M_d) = \frac{tf_{t,d}}{|d|}
  (|d|: length of d; tf_{t,d}: number of occurrences of t in d)
- As in Naive Bayes, we have a problem with zeros: a single term t with P(t|M_d) = 0 makes the whole product P(q|M_d) zero.
- We would give a single term "veto power". For example, for the query [Michael Jackson top hits], a document about "top songs" (but not using the word "hits") would get P(q|M_d) = 0. That's bad.
- We need to smooth the estimates to avoid zeros.

Smoothing
- Key intuition: a non-occurring term is possible (even though it didn't occur), but no more likely than would be expected by chance in the collection.
- Notation: M_c is the collection model; cf_t is the number of occurrences of t in the collection; T = \sum_t cf_t is the total number of tokens in the collection.
  \hat{P}(t|M_c) = \frac{cf_t}{T}
- We will use \hat{P}(t|M_c) to "smooth" P(t|d) away from zero.

Jelinek-Mercer smoothing
  P(t|d) = \lambda P(t|M_d) + (1 - \lambda) P(t|M_c)
- Mixes the probability from the document with the general collection frequency of the word.
- High value of \lambda: "conjunctive-like" search, tends to retrieve documents containing all query words.
- Low value of \lambda: more disjunctive, suitable for long queries.
- Correctly setting \lambda is very important for good performance.

Jelinek-Mercer smoothing: Summary
  P(q|d) \propto \prod_{1 \le k \le |q|} (\lambda P(t_k|M_d) + (1 - \lambda) P(t_k|M_c))
- What we model: the user has a document in mind and generates the query from this document. The equation represents the probability that the document the user had in mind was in fact this one.

Example
- Collection: d1 and d2
- d1: Jackson was one of the most talented entertainers of all time
- d2: Michael Jackson anointed himself King of Pop
- Query q: Michael Jackson
- Use the mixture model with \lambda = 1/2:
  P(q|d1) = [(0/11 + 1/18)/2] · [(1/11 + 2/18)/2] ≈ 0.003
  P(q|d2) = [(1/7 + 1/18)/2] · [(1/7 + 2/18)/2] ≈ 0.013
- Ranking: d2 > d1
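The worked example above is easy to reproduce in code. The sketch below uses my own function and variable names; it builds the document and collection models from raw token counts and applies Jelinek-Mercer smoothing with lambda = 1/2, printing approximately 0.0028 for d1 and 0.0126 for d2, matching the slide.

```python
from collections import Counter
from math import prod

def jm_score(query, doc, collection, lam=0.5):
    """Jelinek-Mercer smoothed query likelihood:
    P(q|d) = prod_k [ lam * P(t_k|Md) + (1 - lam) * P(t_k|Mc) ]."""
    d_counts, c_counts = Counter(doc), Counter(collection)
    d_len, c_len = len(doc), len(collection)
    return prod(lam * d_counts[t] / d_len + (1 - lam) * c_counts[t] / c_len
                for t in query)

d1 = "Jackson was one of the most talented entertainers of all time".split()
d2 = "Michael Jackson anointed himself King of Pop".split()
collection = d1 + d2            # 18 tokens in total
query = "Michael Jackson".split()

print(jm_score(query, d1, collection))  # ~0.0028
print(jm_score(query, d2, collection))  # ~0.0126
```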
Exercise: Compute ranking
- Collection: d1 and d2
- d1: Xerox reports a profit but revenue is down
- d2: Lucene narrows quarter loss but revenue decreases further
- Query q: revenue down
- Use the mixture model with \lambda = 1/2:
  P(q|d1) = [(1/8 + 2/16)/2] · [(1/8 + 1/16)/2] = 1/8 · 3/32 = 3/256
  P(q|d2) = [(1/8 + 2/16)/2] · [(0/8 + 1/16)/2] = 1/8 · 1/32 = 1/256
- Ranking: d1 > d2

Dirichlet smoothing
  \hat{P}(t|d) = \frac{tf_{t,d} + \alpha \hat{P}(t|M_c)}{L_d + \alpha}
  (L_d: length of document d)
- The background distribution \hat{P}(t|M_c) is the prior for \hat{P}(t|d).
- Intuition: before having seen any part of the document, we start with the background distribution as our estimate; as we read the document and count terms, we update this estimate.
- The weighting factor \alpha determines how strong an effect the prior has.

Jelinek-Mercer or Dirichlet?
- Dirichlet performs better for keyword queries, Jelinek-Mercer performs better for verbose queries.
- Both models are sensitive to the smoothing parameters; you shouldn't use these models without parameter tuning.

Sensitivity of Dirichlet to smoothing parameter
[Figure omitted. µ is the Dirichlet smoothing parameter (called \alpha on the previous slides).]
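For comparison with the Jelinek-Mercer sketch above, here is a minimal Dirichlet-smoothed version. The names are again illustrative; the default alpha = 2000 is a value often quoted as reasonable in the smoothing literature, not something prescribed by the slides, and for the tiny exercise collection a much smaller alpha is used so the prior does not swamp the document evidence.

```python
from collections import Counter
from math import prod

def dirichlet_score(query, doc, collection, alpha=2000):
    """Dirichlet-smoothed query likelihood:
    P(t|d) = (tf_{t,d} + alpha * P(t|Mc)) / (L_d + alpha)."""
    d_counts, c_counts = Counter(doc), Counter(collection)
    d_len, c_len = len(doc), len(collection)
    return prod((d_counts[t] + alpha * c_counts[t] / c_len) / (d_len + alpha)
                for t in query)

# Reusing the exercise collection; alpha = 2 keeps the prior weak for this toy case.
d1 = "Xerox reports a profit but revenue is down".split()
d2 = "Lucene narrows quarter loss but revenue decreases further".split()
collection = d1 + d2
query = "revenue down".split()

print(dirichlet_score(query, d1, collection, alpha=2))
print(dirichlet_score(query, d2, collection, alpha=2))  # d1 ranks above d2, as in the exercise
```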
Language models are generative models
- We have assumed that queries are generated by a probabilistic process that looks like this (as in Naive Bayes):
[Figure: class node C = China generating the word sequence X1 = Beijing, X2 = and, X3 = Taipei, X4 = join, X5 = WTO.]

Naive Bayes and LM generative models
- Naive Bayes: we want to classify document d. LM: we want to classify a query q.
- Naive Bayes: classes are, e.g., geographical regions like China, UK, Kenya. LM: each document in the collection is a different class.
- Naive Bayes: assume that d was generated by the generative model. LM: assume that q was generated by a generative model.
- Key question in Naive Bayes: which of the classes is most likely to have generated the document? In LM: which document (= class) is most likely to have generated the query q?
- Or: for which class (Naive Bayes) or for which document as the source of the query (LM) do we have the most evidence?

Naive Bayes multinomial model / IR language models
[Figure: class node C = China generating X1 = Beijing, X2 = and, X3 = Taipei, X4 = join, X5 = WTO.]

Naive Bayes Bernoulli model / Binary independence model
[Figure: class node C = China with binary term variables U_Alaska = 0, U_Beijing = 1, U_India = 0, U_join = 1, U_Taipei = 1, U_WTO = 1.]

Comparison of the two models (multinomial model / IR language model vs. Bernoulli model / BIM)
- event model: generation of a (multi)set of tokens vs. generation of a subset of the vocabulary
- random variable(s): X = t iff t occurs at a given position vs. U_t = 1 iff t occurs in the doc
- document representation: d = \langle t_1, \ldots, t_k, \ldots, t_{n_d} \rangle, t_k \in V vs. d = \langle e_1, \ldots, e_i, \ldots, e_M \rangle, e_i \in \{0,1\}
- parameter estimation: \hat{P}(X = t|c) vs. \hat{P}(U_i = e|c)
- decision rule: maximize \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(X = t_k|c) vs. maximize \hat{P}(c) \prod_{t_i \in V} \hat{P}(U_i = e_i|c)
- multiple occurrences: taken into account vs. ignored
- length of docs: can handle longer docs vs. works best for short docs
- number of features: can handle more vs. works best with fewer
- estimate for "the": \hat{P}(X = the|c) ≈ 0.05 vs. \hat{P}(U_the = 1|c) ≈ 1.0

LMs vs. multinomial Naive Bayes
- We classify the query in LMs; we classify documents in text classification.
- Each document is a class in LMs, whereas classes are human-defined in text classification.

Vector space (tf-idf) vs. LM
  Recall           tf-idf    LM        %chg      significant
  0.0              0.7439    0.7590    +2.0
  0.1              0.4521    0.4910    +8.6
  0.2              0.3514    0.4045    +15.1     *
  0.4              0.2093    0.2572    +22.9     *
  0.6              0.1024    0.1405    +37.1     *
  0.8              0.0160    0.0432    +169.6    *
  1.0              0.0028    0.0050    +76.9
  11-point average 0.1868    0.2233    +19.6     *
(The table shows precision at the given recall levels; * marks significant changes.)
- The language modeling approach always does better in these experiments, but note that where the approach shows significant gains is at higher levels of recall.

Vector space vs. BM25 vs. LM
- BM25 / LM: based on probability theory. Vector space: based on similarity, a geometric/linear algebra notion.
- Term frequency is directly used in all three models. LMs: raw term frequency; BM25 / vector space: more complex.
- Length normalization: vector space: cosine or pivot normalization; LMs: probabilities are inherently length normalized; BM25: tuning parameters for optimizing length normalization.
- idf: BM25 / vector space use it directly. LMs: mixing term and collection frequencies has an effect similar to idf. Terms that are rare in the general collection, but common in some documents, will have a greater influence on the ranking.
- Collection frequency (LMs) vs. document frequency (BM25, vector space).

Language models for IR: Assumptions
- Simplifying assumption: queries and documents are objects of the same type. Not true! There are other LMs for IR that do not make this assumption. The vector space model makes the same assumption.
- Simplifying assumption: terms are conditionally independent. Again, the vector space model (and Naive Bayes) make the same assumption.
- LMs offer a cleaner statement of assumptions than the vector space model, and thus a better theoretical foundation, but "pure" LMs perform much worse than "tuned" LMs.

Take-away today
- What matters in language modeling?
- Feature (term) selection for text classification and similarity
- Statistical language models: introduction
- Statistical language models in IR
- Large language models
- Discussion: properties of different probabilistic models in use in IR, the hype around LLMs

Resources
- Chapter 13 of IIR (feature selection)
- Chapter 12 of IIR
- Resources at https://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org; materials in MU IS and the FI MU library
- Ponte and Croft's 1998 SIGIR paper (one of the first on LMs in IR)
- Zhai and Lafferty: A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. (2004)
- Lemur toolkit (good support for LMs in IR)