PV211: Introduction to Information Retrieval
https://www.fi.muni.cz/~sojka/PV211
IIR 8: Evaluation & Result Summaries
Handout version
Petr Sojka, Martin Líška, Hinrich Schütze et al.
Faculty of Informatics, Masaryk University, Brno
Center for Information and Language Processing, University of Munich
2023-03-15 (compiled on 2023-03-15 11:53)

Overview
1 Introduction
2 Unranked evaluation
3 Ranked evaluation
4 Benchmarks
5 Result summaries

Take-away today
Introduction to evaluation: measures of an IR system
Evaluation of unranked and ranked retrieval
Evaluation benchmarks
Result summaries

Evaluation
How well does an IR system work?

Measures for a search engine
How fast does it index? E.g., number of bytes per hour.
How fast does it search? E.g., latency as a function of queries per second.
What is the cost per query? In dollars.

Measures for a search engine
All of the preceding criteria are measurable: we can quantify speed / size / money.
However, the key measure for a search engine is user happiness.
What is user happiness? Factors include:
speed of response,
size of index,
uncluttered UI,
most important: relevance (actually, maybe even more important: it's free).
Note that none of these is sufficient: blindingly fast, but useless answers won't make a user happy.
How can we quantify user happiness?

Who is the user?
Who is the user we are trying to make happy?
Web search engine: searcher. Success: the searcher finds what she was looking for. Measure: rate of return to this search engine.
Web search engine: advertiser. Success: the searcher clicks on an ad. Measure: clickthrough rate.
E-commerce: buyer. Success: the buyer buys something. Measures: time to purchase, fraction of "conversions" of searchers to buyers.
E-commerce: seller. Success: the seller sells something. Measure: profit per item sold.
Enterprise: CEO. Success: employees are more productive (because of effective search). Measure: profit of the company.

Most common definition of user happiness: Relevance
User happiness is equated with the relevance of search results to the query.
But how do you measure relevance?
The standard methodology in information retrieval consists of three elements:
a benchmark document collection,
a benchmark suite of queries,
an assessment of the relevance of each query-document pair.
Relevance: query vs. information need
Relevance to what?
First take: relevance to the query.
"Relevance to the query" is very problematic.
Information need i: "I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine."
This is an information need, not a query.
Query q: [red wine white wine heart attack]
Consider document d′: "At the heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving."
d′ is an excellent match for query q . . .
. . . but d′ is not relevant to the information need i.

Relevance: query vs. information need
User happiness can only be measured by relevance to an information need, not by relevance to queries.
Our terminology is sloppy in these slides and in IIR: we talk about query-document relevance judgments even though we mean information-need-document relevance judgments.

Precision and recall
Precision (P) is the fraction of retrieved documents that are relevant:
Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant | retrieved)
Recall (R) is the fraction of relevant documents that are retrieved:
Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved | relevant)

Precision and recall

                relevant               nonrelevant
retrieved       true positives (TP)    false positives (FP)
not retrieved   false negatives (FN)   true negatives (TN)

P = TP / (TP + FP)
R = TP / (TP + FN)

Precision/recall tradeoff
You can increase recall by returning more docs.
Recall is a non-decreasing function of the number of docs retrieved.
A system that returns all docs has 100% recall!
The converse is also true (usually): it's easy to get high precision for very low recall.
Suppose the document with the largest score is relevant. How can we maximize precision?

A combined measure: F
F allows us to trade off precision against recall.
F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R), where β² = (1 − α)/α
α ∈ [0, 1] and thus β² ∈ [0, ∞]
Most frequently used: balanced F with β = 1 or α = 0.5.
This is the harmonic mean of P and R: 1/F = (1/2)·(1/P + 1/R)
What value range of β weights recall higher than precision?

Example for precision, recall, F1

                relevant   not relevant        total
retrieved             20             40           60
not retrieved         60      1,000,000    1,000,060
total                 80      1,000,040    1,000,120

P = 20/(20 + 40) = 1/3
R = 20/(20 + 60) = 1/4
F1 = 2 / (1/(1/3) + 1/(1/4)) = 2/7
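A minimal Python sketch of these set-based measures (my own illustration, not taken from the course's pv211-utils; the function name and the zero-denominator guards are assumptions):

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Set-based precision, recall, and F_beta from contingency counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    f = (b2 + 1) * precision * recall / denom if denom else 0.0
    return precision, recall, f

# The worked example above: 20 relevant retrieved, 40 nonrelevant retrieved,
# 60 relevant documents missed.
p, r, f1 = precision_recall_fbeta(tp=20, fp=40, fn=60)
print(p, r, f1)   # 1/3, 1/4, 2/7 (approx. 0.333, 0.25, 0.286)
```

Calling the same function with beta > 1 weights recall higher than precision, which is the answer to the question on the F slide.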
Accuracy
Why do we use complex measures like precision, recall, and F? Why not something simple like accuracy?
Accuracy is the fraction of decisions (relevant/nonrelevant) that are correct.
In terms of the contingency table above, accuracy = (TP + TN)/(TP + FP + FN + TN).
Why is accuracy not a useful measure for web information retrieval?

Exercise
Compute precision, recall and F1 for this result set:

                relevant   not relevant
retrieved             18              2
not retrieved         82  1,000,000,000

The snoogle search engine always returns 0 results ("0 matching results found"), regardless of the query. Why does snoogle demonstrate that accuracy is not a useful measure in IR?

Why accuracy is a useless measure in IR
Simple trick to maximize accuracy in IR: always say no and return nothing.
You then get 99.99% accuracy on most queries.
Searchers on the web (and in IR in general) want to find something and have a certain tolerance for junk.
It's better to return some bad hits as long as you return something.
→ We use precision, recall, and F for evaluation, not accuracy.

F: Why harmonic mean?
Why don't we use a different mean of P and R as a measure, e.g., the arithmetic mean?
The simple (arithmetic) mean is 50% for a "return-everything" search engine, which is too high.
Desideratum: punish really bad performance on either precision or recall.
Taking the minimum achieves this.
But the minimum is not smooth and is hard to weight.
F (harmonic mean) is a kind of smooth minimum.

F1 and other averages
We can view the harmonic mean as a kind of soft minimum.

Difficulties in using precision, recall and F
We need relevance judgments for information-need-document pairs – but they are expensive to produce.
For alternatives to using precision/recall and having to produce relevance judgments, see the end of this lecture.

Mean Average Precision
MAP(Q) = (1/|Q|) · Σ_{j=1}^{|Q|} (1/m_j) · Σ_{k=1}^{m_j} Precision(R_jk)
For one query it is the area under the uninterpolated precision-recall curve, and so MAP is roughly the average area under the precision-recall curve for a set of queries.

Precision-recall curve
Precision/recall/F are measures for unranked sets.
We can easily turn set measures into measures of ranked lists.
Just compute the set measure for each "prefix": the top 1 (P@1), top 2, top 3, top 4, etc., results.
Doing this for precision and recall gives you a precision-recall curve.

A precision-recall curve
[Figure: a sawtooth precision-recall curve, precision on the y-axis vs. recall on the x-axis, with the interpolated curve in red.]
Each point corresponds to a result for the top k ranked hits (k = 1, 2, 3, 4, . . .).
Interpolation (in red): take the maximum of all future points.
Rationale for interpolation: the user is willing to look at more stuff if both precision and recall get better.
Questions?
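These ranked-retrieval quantities can be written down in a few lines. The sketch below is my own (not the pv211-utils implementation): the toy ranking is invented, and relevant documents that are never retrieved simply contribute zero precision before dividing by m_j, in line with the MAP definition above. The eleven_point_average function mirrors the 11-point interpolated average precision shown next.

```python
def precision_recall_points(is_relevant, num_relevant):
    """(recall, precision) after each rank k of one ranked result list.

    is_relevant: booleans, True if the document at that rank is relevant.
    num_relevant: total number of relevant documents for the information need.
    """
    points, hits = [], 0
    for k, rel in enumerate(is_relevant, start=1):
        hits += rel
        points.append((hits / num_relevant, hits / k))
    return points

def interpolated_precision(points, recall_level):
    """Interpolated precision: maximum precision at any recall >= recall_level."""
    return max((p for r, p in points if r >= recall_level), default=0.0)

def eleven_point_average(points):
    """Average of interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(points, lv) for lv in levels) / 11

def average_precision(is_relevant, num_relevant):
    """Uninterpolated average precision for one ranked list (the inner sum of MAP)."""
    ap, hits = 0.0, 0
    for k, rel in enumerate(is_relevant, start=1):
        if rel:
            hits += 1
            ap += hits / k
    return ap / num_relevant if num_relevant else 0.0

def mean_average_precision(runs):
    """MAP over a set of queries; runs is a list of (is_relevant, num_relevant)."""
    return sum(average_precision(r, m) for r, m in runs) / len(runs)

# Hypothetical ranking: relevant documents at ranks 1, 3 and 6; 4 relevant in total.
ranking = [True, False, True, False, False, True]
pts = precision_recall_points(ranking, num_relevant=4)
print(pts)                                  # [(0.25, 1.0), (0.25, 0.5), (0.5, 0.67), ...]
print(eleven_point_average(pts))
print(mean_average_precision([(ranking, 4)]))
```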
11-point interpolated average precision

Recall   Interpolated Precision
0.0      1.00
0.1      0.67
0.2      0.63
0.3      0.55
0.4      0.45
0.5      0.41
0.6      0.36
0.7      0.29
0.8      0.13
0.9      0.10
1.0      0.08

11-point average: ≈ 0.425
How can precision at recall 0.0 be > 0?

Averaged 11-point precision/recall graph
[Figure: averaged 11-point precision/recall graph, precision vs. recall.]
Compute interpolated precision at recall levels 0.0, 0.1, 0.2, . . .
Do this for each of the queries in the evaluation benchmark.
Average over queries.
This measures performance at all recall levels.
The curve is typical of performance levels at TREC.
Note that performance is not very good!

ROC curve
[Figure: ROC curve, sensitivity (= recall) on the y-axis vs. 1 − specificity on the x-axis.]
Similar to the precision-recall graph, but we are only interested in the small area in the lower left corner.
The precision-recall graph "blows up" this area.

Variance of measures like precision/recall
For a test collection, it is usual that a system does badly on some information needs (e.g., P = 0.2 at R = 0.1) and really well on others (e.g., P = 0.95 at R = 0.1).
Indeed, it is usually the case that the variance of the same system across queries is much greater than the variance of different systems on the same query.
That is, there are easy information needs and hard ones.

What we need for a benchmark
A collection of documents. Documents must be representative of the documents we expect to see in reality.
A collection of information needs . . . which we will often incorrectly refer to as queries. Information needs must be representative of the information needs we expect to see in reality.
Human relevance assessments. We need to hire/pay "judges" or assessors to do this. Expensive, time-consuming. Judges must be representative of the users we expect to see in reality.

First standard relevance benchmark: Cranfield
Pioneering: the first testbed allowing precise quantitative measures of information retrieval effectiveness.
Late 1950s, UK.
1,398 abstracts of aerodynamics journal articles, a set of 225 queries, exhaustive relevance judgments of all query-document pairs.
Too small, too untypical for serious IR evaluation today.
Second-generation relevance benchmark: TREC
TREC = Text Retrieval Conference
Organized by the U.S. National Institute of Standards and Technology (NIST)
TREC is actually a set of several different relevance benchmarks.
Best known: TREC Ad Hoc, used for the first 8 TREC evaluations between 1992 and 1999.
1.89 million documents, mainly newswire articles, 450 information needs.
No exhaustive relevance judgments – too expensive.
Rather, NIST assessors' relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation for which the information need was developed.

Standard relevance benchmarks: Others
GOV2: another TREC/NIST collection of 25 million web pages. Used to be the largest collection that is easily available, but still 3 orders of magnitude smaller than what Google/Yahoo/MSN index.
NTCIR: East Asian language and cross-language information retrieval.
CLEF: Cross Language Evaluation Forum. This evaluation series has concentrated on European languages and cross-language information retrieval.
Many others.

Example of a more recent benchmark: ClueWeb datasets
ClueWeb09: 1 billion web pages, 25 terabytes (compressed: 5 terabytes), collected during a January/February 2009 crawl of pages in 10 languages.
Unique URLs: 4,780,950,903 (325 GB uncompressed, 105 GB compressed).
Total outlinks: 7,944,351,835 (71 GB uncompressed, 24 GB compressed).
ClueWeb12: 733,019,372 docs, 27.3 TB (5.54 TB compressed).
Indexed in Sketch Engine, cf. LREC 2012 paper.

Validity of relevance assessments
Relevance assessments are only usable if they are consistent.
If they are not consistent, then there is no "truth" and experiments are not repeatable.
How can we measure this consistency or agreement among judges? → Kappa measure

Kappa measure
Kappa is a measure of how much judges agree or disagree.
Designed for categorical judgments.
Corrects for chance agreement.
P(A) = proportion of the time the judges agree
P(E) = what agreement we would get by chance
κ = (P(A) − P(E)) / (1 − P(E))
What is κ for (i) chance agreement, (ii) total agreement?

Kappa measure (2)
Values of κ in the interval [2/3, 1.0] are seen as acceptable.
With smaller values: need to redesign the relevance assessment methodology used, etc.

Calculating the kappa statistic

                            Judge 2 relevance
                            Yes    No    Total
Judge 1        Yes          300    20      320
relevance      No            10    70       80
               Total        310    90      400

Observed proportion of the times the judges agreed:
P(A) = (300 + 70)/400 = 370/400 = 0.925
Pooled marginals:
P(nonrelevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125
P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7875
Probability that the two judges agreed by chance:
P(E) = P(nonrelevant)² + P(relevant)² = 0.2125² + 0.7875² ≈ 0.665
Kappa statistic:
κ = (P(A) − P(E))/(1 − P(E)) = (0.925 − 0.665)/(1 − 0.665) ≈ 0.776 (still in the acceptable range)
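The same calculation as a short Python sketch (my own; the function and argument names are invented), reproducing the worked example above:

```python
def cohen_kappa(yes_yes, yes_no, no_yes, no_no):
    """Kappa for two judges making yes/no relevance judgments.

    Argument order is (judge 1, judge 2): e.g. yes_no means judge 1 said yes
    and judge 2 said no.  Uses pooled marginals, as on the slide.
    """
    total = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / total
    # Each judge contributes `total` judgments to the pooled marginals.
    p_relevant = ((yes_yes + yes_no) + (yes_yes + no_yes)) / (2 * total)
    p_nonrelevant = 1 - p_relevant
    p_chance = p_relevant ** 2 + p_nonrelevant ** 2
    return (p_agree - p_chance) / (1 - p_chance)

# Worked example above: P(A) = 0.925, P(E) ≈ 0.665, kappa ≈ 0.776.
print(cohen_kappa(yes_yes=300, yes_no=20, no_yes=10, no_no=70))
```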
Interjudge agreement at TREC

information need   docs judged   number of disagreements
 51                  211             6
 62                  400           157
 67                  400            68
 95                  400           110
127                  400           106

Impact of interjudge disagreement
Judges disagree a lot. Does that mean that the results of information retrieval experiments are meaningless?
No.
Large impact on absolute performance numbers.
Virtually no impact on the ranking of systems.
Suppose we want to know if algorithm A is better than algorithm B.
An information retrieval experiment will give us a reliable answer to this question . . .
. . . even if there is a lot of disagreement between judges.

Evaluation at large search engines
Recall is difficult to measure on the web.
Search engines often use precision at top k, e.g., k = 10 . . .
. . . or use measures that reward you more for getting rank 1 right than for getting rank 10 right.
Search engines also use non-relevance-based measures.
Example 1: clickthrough on first result. Not very reliable if you look at a single clickthrough (you may realize after clicking that the summary was misleading and the document is nonrelevant) . . . but pretty reliable in the aggregate.
Example 2: ongoing studies of user behavior in the lab – recall last lecture.
Example 3: A/B testing.

A/B testing
Purpose: test a single innovation.
Prerequisite: you have a large search engine up and running.
Have most users use the old system.
Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
Evaluate with an "automatic" measure like clickthrough on first result.
Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.
Variant: give users the option to switch to the new algorithm/interface.
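A toy sketch of how such an A/B comparison might be summarized (entirely my own illustration with made-up traffic counts; the slide only calls for an automatic measure such as clickthrough on the first result, and the added two-proportion z-test is just one common way to check that the observed difference is larger than chance, not something the slide prescribes):

```python
from math import erf, sqrt

def clickthrough_rate(clicks, impressions):
    """Fraction of result pages on which the first result was clicked."""
    return clicks / impressions

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """Rough two-proportion z-test for the difference in clickthrough rate
    between the old system (a) and the small diverted bucket (b)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return z, p_value

# Hypothetical numbers: roughly 1% of traffic goes to the new ranker.
print(clickthrough_rate(41_200, 99_000))   # old system
print(clickthrough_rate(452, 1_000))       # new system
print(two_proportion_z(41_200, 99_000, 452, 1_000))
```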
Critique of pure relevance
We've defined relevance for an isolated query-document pair.
Alternative definition: marginal relevance.
The marginal relevance of a document at position k in the result list is the additional information it contributes over and above the information that was contained in documents d1 . . . dk−1.

Exercise
Why is marginal relevance a more realistic measure of user happiness?
Give an example where a non-marginal measure like precision or recall is a misleading measure of user happiness, but marginal relevance is a good measure.
In a practical application, what is the difficulty of using marginal measures instead of non-marginal measures?

Yet more metrics used in IR evaluation
There are many other evaluation measures in IR.
DCG, nDCG: discounted cumulative gain and normalized discounted cumulative gain.
ROC: receiver operating characteristic.
Bpref: computes a preference relation of whether judged relevant documents are retrieved ahead of judged irrelevant documents. Thus, it is based on the relative ranks of judged documents only.
Most metrics are implemented and documented in our glorious pv211-utils, and their use is supported by tutorials (still under preparation in the develop branch).
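For DCG and nDCG, a minimal sketch (my own, not the pv211-utils implementation; it uses the simple rel/log2(rank + 1) gain, and the 2^rel − 1 gain variant is also common):

```python
from math import log2

def dcg(gains, k=None):
    """Discounted cumulative gain of a ranked list of graded relevance values."""
    gains = gains[:k] if k else gains
    return sum(g / log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains, k=None):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal else 0.0

# Hypothetical graded judgments (3 = highly relevant, 0 = nonrelevant)
# for the top 5 results of one query.
print(ndcg([3, 2, 0, 3, 1], k=5))
```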
How do we present results to the user?
Most often: as a list – aka "10 blue links".
How should each document in the list be described?
This description is crucial. The user can often identify good hits (= relevant hits) based on the description alone.
No need to actually view any document.

Doc description in result list
Most commonly: doc title, URL, some metadata . . .
. . . and a summary.
How do we "compute" the summary?

Summaries
Two basic kinds: (i) static, (ii) dynamic.
A static summary of a document is always the same, regardless of the query that was issued by the user.
Dynamic summaries are query-dependent. They attempt to explain why the document was retrieved for the query at hand.

Static summaries
In typical systems, the static summary is a subset of the document.
Simplest heuristic: the first 50 or so words of the document.
More sophisticated: extract from each document a set of "key" sentences. Simple NLP heuristics score each sentence; the summary is made up of the top-scoring sentences. Machine learning approach: see IIR 13.
Most sophisticated: complex NLP to synthesize/generate a summary. For most IR applications: not quite ready for prime time yet.

Dynamic summaries
Present one or more "windows" or snippets within the document that contain several of the query terms.
Prefer snippets in which the query terms occur as a phrase.
Prefer snippets in which the query terms occur jointly in a small window.
The summary that is computed this way gives the entire content of the window – all terms, not just the query terms.

Google dynamic summaries for [vegetarian diet running]
A good example that snippet selection is non-trivial.
Criteria: occurrence of keywords, density of keywords, coherence of snippet, number of different snippets in summary, good cutting points, etc.

A dynamic summary
Query: [new guinea economic development]
Snippets (in bold on the original slide) that were extracted from a document:
". . . In recent years, Papua New Guinea has faced severe economic difficulties and economic growth has slowed, partly as a result of weak governance and civil war, and partly as a result of external factors such as the Bougainville civil war which led to the closure in 1989 of the Panguna mine (at that time the most important foreign exchange earner and contributor to Government finances), the Asian financial crisis, a decline in the prices of gold and copper, and a fall in the production of oil. PNG's economic development record over the past few years is evidence that governance issues underly many of the country's problems. Good governance, which may be defined as the transparent and accountable management of human, natural, economic and financial resources for the purposes of equitable and sustainable development, flows from proper public sector management, efficient fiscal and accounting mechanisms, and a willingness to make service delivery a priority in practice. . . ."

Generating dynamic summaries
Where do we get these other terms in the snippet from?
We cannot construct a dynamic summary from the positional inverted index – at least not efficiently.
We need to cache documents.
The positional index only tells us: the query term occurs at position 4378 in the document. Byte offset or word offset?
Note that the cached copy can be outdated.
Don't cache very long documents – just cache a short prefix.

Dynamic summaries
Real estate on the search result page is limited → snippets must be short . . .
. . . but snippets must be long enough to be meaningful.
Snippets should communicate whether and how the document answers the query.
Ideally: linguistically well-formed snippets.
Ideally: the snippet should answer the query, so we don't have to look at the document.
Dynamic summaries are a big part of user happiness because . . .
. . . we can quickly scan them to find the relevant document we then click on, and
. . . in many cases, we don't have to click at all and save time.
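A rough sketch of query-dependent snippet selection in the spirit of the criteria above (my own toy code: the scoring prefers windows that cover the most distinct query terms and breaks ties by the tightest term span, which is only a crude stand-in for the phrase and proximity preferences; the document excerpt is shortened, and rejoining words drops the original punctuation):

```python
import re

def best_snippet(document, query_terms, window=25):
    """Pick the window of `window` consecutive words that covers the most
    distinct query terms; ties are broken by preferring terms that occur
    close together."""
    words = re.findall(r"\w+", document)
    terms = {t.lower() for t in query_terms}
    best, best_score = "", (-1, 0)
    for start in range(max(1, len(words) - window + 1)):
        chunk = words[start:start + window]
        positions = [i for i, w in enumerate(chunk) if w.lower() in terms]
        if not positions:
            continue
        distinct = len({chunk[i].lower() for i in positions})
        span = positions[-1] - positions[0]          # smaller = tighter window
        score = (distinct, -span)
        if score > best_score:
            best, best_score = " ".join(chunk), score
    return best

doc = ("In recent years, Papua New Guinea has faced severe economic difficulties "
       "and economic growth has slowed ... PNG's economic development record over "
       "the past few years is evidence that governance issues underly many of the "
       "country's problems.")
print(best_snippet(doc, ["new", "guinea", "economic", "development"]))
```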
Resources
Chapter 8 of IIR
Resources at https://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and FI MU library
The TREC home page – TREC had a huge impact on information retrieval evaluation.
Originator of the F-measure: Keith van Rijsbergen
More on A/B testing
Too much A/B testing at Google?
Tombros & Sanderson 1998: one of the first papers on dynamic summaries
Google VP of Engineering on search quality evaluation at Google
ClueWeb12 and other datasets available in Sketch Engine