Seminar 6
Definition 1 (Recall) Recall describes how many of the relevant documents are retrieved.
𝑟𝑒𝑐𝑎𝑙𝑙 = 𝑅 =
#𝑟𝑒𝑙𝑒𝑣𝑎𝑛𝑡 𝑟𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑
#𝑟𝑒𝑙𝑒𝑣𝑎𝑛𝑡
Definition 2 (Precision) Precision describes how many of the retrieved documents are
relevant.
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝑃 =
#𝑟𝑒𝑙𝑒𝑣𝑎𝑛𝑡 𝑟𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑
#𝑟𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑
Definition 3 (𝐹 -measure) A balanced 𝐹-measure (𝐹1-measure) defines a recall-precision
relationship represented by their weighted harmonic mean:
𝐹 =
2 · 𝑅 · 𝑃
𝑅 + 𝑃
Definition 4 (Mean Average Precision) MAP expresses the precision in each point a
new relevant document is included in the result. Is counted as
𝑀 𝐴𝑃(𝑄) =
1
|𝑄|
·
⎛
⎝
∑︁
𝑞∈𝑄
1
𝑟𝑒𝑙 𝑞
·
⎛
⎝
𝑟𝑒𝑙 𝑞
∑︁
𝑖=1
𝑝𝑟𝑒𝑐𝑖
⎞
⎠
⎞
⎠
where 𝑟𝑒𝑙 𝑞 is the number of relevant documents retrieved by query 𝑞 and 𝑝𝑟𝑒𝑐𝑖 is the precision
at the 𝑖-th document.
Definition 5 (𝜅 statistic) Let 𝑁 be the total number of documents, 𝐽 is a set of judges
and 𝑃(𝐴) = #𝑎𝑔𝑟𝑒𝑒
𝑁 the number of documents on which the judges agree. Let also define 𝑅 𝑗
and 𝑁 𝑅 𝑗 be the number of relevant and non-relevant documents, respectively, according to
the judge 𝑗 ∈ 𝐽 and
𝑃(𝑅) =
∑︀
𝑗∈𝐽 𝑅 𝑗
|𝐽| · 𝑁
and 𝑃(𝑁 𝑅) =
∑︀
𝑗∈𝐽 𝑁 𝑅 𝑗
|𝐽| · 𝑁
as the number of relevant and non-relevant documents, respectively. Let finally define
𝑃(𝐸) = 𝑃(𝑅)2
+ 𝑃(𝑁 𝑅)2
as the approximate number of disagreements between the judges. Then the 𝜅 statistic is
defined as the measure of agreement between the judges
𝜅 =
𝑃(𝐴) − 𝑃(𝐸)
1 − 𝑃(𝐸)
.
Definition 6 (Rocchio relevance feedback) Rocchio relevance feedback has the form
𝑞 𝑚 = 𝛼𝑞0 + 𝛽
1
|𝐷 𝑟|
∑︁
⃗𝑑 𝑟∈𝐷 𝑟
⃗𝑑 𝑟 − 𝛾
1
|𝐷 𝑛𝑟|
∑︁
⃗𝑑 𝑛𝑟∈𝐷 𝑛𝑟
⃗𝑑 𝑛𝑟
where 𝑞0 is the original query vector, 𝐷 𝑟 is the set of relevant documents, 𝐷 𝑛𝑟 is the set of
non-relevant documents and the values 𝛼, 𝛽, 𝛾 depend on the system setting.
1
Exercise 1
The following ordered list of 20 letters R and N represents relevant (R) and non-relevant
(N) retrieved documents as an answer for a query on a collection of 10 000 documents.
The leftmost document is expected to be the most relevant. The list contains 6 relevant
documents. Assume that the collection contains 8 documents relevant to the query.
R R N N N N N N R N R N N N R N N N N R
a) What is the precision on the first 20 results?
b) What is the 𝐹-measure on the first 20 results?
c) What is the non-interpolated precision of the system at 25% recall? (R=25%)
d) What is the interpolated precision of the system at 33% recall? (R>33%)
e) Assume that these 20 documents is the complete list of retrieved documents. What is the
MAP of the system?
Now assume that the system returned all 10,000 documents in an ordered list and above is
the top 20.
f) What is the highest possible MAP the system can achieve?
g) What is the lowest possible MAP the system can achieve?
Applying the Definition 2 do calculate (a) we get 𝑃 = 6
20 = 3
10 . For (b) it is necessary to
count the recall with Definition 1 as 𝑅 = 6
8 = 3
4 . Using Definition 3 with 𝛼 = 0.5 we count
(𝛽2
+ 1)𝑃 𝑅
𝛽2 𝑃 + 𝑅
=
(12
+ 1) · 3
10 · 3
4
3
10 + 3
4
=
9
20
21
20
=
3
7
.
To count the non-interpolated precision of the system at 25% recall for (c) we need the
precisions for the documents of recall equal to 25%:
1. 𝑃 = 1
1 𝑅 = 1
8 = 12.5%
2. 𝑃 = 2
2 𝑅 = 2
8 = 25%
3. 𝑃 = 2
3 𝑅 = 2
8 = 25%
· · ·
8. 𝑃 = 2
8 𝑅 = 2
8 = 25%
9. 𝑃 = 3
9 𝑅 = 3
8 = 37.5%
We see that the first document has 𝑅 = 12.5% but this value is less than 25%. Documents 2
to 8 have the desired recall of 25%. Document 9 has already a higher value so we do not
include it in the result. Non-interpolated precision is then the set of precisions of these 7
documents
𝑃 =
{︂
2
2
,
2
3
,
2
4
,
2
5
,
2
6
,
2
7
,
2
8
}︂
.
2
For the interpolated precision (d) we are looking for the highest precision for the relevant
documents of recall higher than 33%. Recall changes if and only if the result contains a
relevant document. Therefore, the values are calculated for the documents 11, 15 and 20.
11. 𝑃 = 4
11 𝑅 = 4
8 = 50%
15. 𝑃 = 5
15 𝑅 = 5
8 = 62.5%
20. 𝑃 = 6
20 𝑅 = 6
8 = 75%
The requested recall value is exceeded by retrieving the document 9 (37.5 %). Now we only
have to find 𝑚𝑎𝑥{𝑃9, 𝑃11, 𝑃15, 𝑃20} = 𝑚𝑎𝑥{3
9 , 4
11 , 5
15 , 6
20 } = 4
11 = 0.36.
To estimate the MAP of the system (e) we use the Definition 4. Since we only have one
query 𝑁 = |𝑄| = 1 and the first 20 documents contain 𝑟𝑒𝑙 𝑞 = 6 relevant documents
𝑀 𝐴𝑃(𝑄) =
1
1
⎛
⎝
1∑︁
𝑗=1
1
6
(︃ 6∑︁
𝑖=1
𝑃(𝑑𝑜𝑐𝑖)
)︃⎞
⎠ =
1
1
·
1
6
⎛
⎜
⎜
⎝
1
1⏟ ⏞
𝑃 (1.)
+
2
2⏟ ⏞
𝑃 (2.)
+
3
9⏟ ⏞
𝑃 (9.)
+
4
11⏟ ⏞
𝑃 (11.)
+
5
15⏟ ⏞
𝑃 (15.)
+
6
20⏟ ⏞
𝑃 (20.)
⎞
⎟
⎟
⎠
=
1
1
·
1
6
·
1099
330
=
1099
1980
= 0.5550
From Definition 2 and Definition4 follows that if the two remaining relevant documents
were on positions 21 and 22 then MAP with 𝑟𝑒𝑙 𝑞 = 8 relevant documents is the highest
possible (f)
𝑀 𝐴𝑃(𝑄) =
1
8
(︂
1
1
+
2
2
+
3
9
+
4
11
+
5
15
+
6
20
+
7
21
+
8
22
)︂
= 0.503409
and, on the other hand, if they were on the last places the MAP would be minimal (g)
𝑀 𝐴𝑃(𝑄) =
1
8
(︂
1
1
+
2
2
+
3
9
+
4
11
+
5
15
+
6
20
+
7
9999
+
8
10000
)︂
= 0.41647538.
Exercise 2
Below is a table showing how two judges judged the relevance (0 = non-relevant, 1 = relevant)
of the set of 12 documents with respect to a query. Assume that you developed an IR system,
that for this query returns the documents {4, 5, 6, 7, 8}.
3
Doc ID Judge 1 Judge 2
1 0 0
2 0 0
3 1 1
4 1 1
5 1 0
6 1 0
7 1 0
8 1 0
9 0 1
10 0 1
11 0 1
12 0 1
Table 1: Judges judging the relevance of documents.
a) Calculate the 𝜅 statistic.
b) Calculate the recall, precision and 𝐹-measure of your system in which a document is
considered relevant if the judges agree.
c) Calculate the recall, precision and 𝐹-measure of your system in which a document is
considered relevant if at least one of the judges thinks so.
For (a) it is necessary to use the Definition 5. First we count 𝑃(𝐴) as the number of documents
on which the judges agree. Since these are the documents {1, 2, 3, 4} and the total number of
them is 𝑁 = 12, then 𝑃(𝐴) = |{1,2,3,4}|
12 = 4
12 = 1
3 . Now we need the counts of disagreements
between the judges. Judge 1 considers the documents 𝑁 𝑅1 = {1, 2, 9, 10, 11, 12} and judge
2 𝑁 𝑅2 = {1, 2, 5, 6, 7, 8} as non-relevant. Plugging in to the formula for 𝑃(𝑁 𝑅) we get
𝑃(𝑁 𝑅) = |𝑁 𝑅1|+|𝑁 𝑅2|
2·12 = 2·6
24 = 1
2 . We repeat this for 𝑃(𝑅). Since the number of relevant is
equal to non-relevant, then 𝑃(𝑅) = 𝑃(𝑁 𝑅) = 1
2 . We finally can count 𝑃(𝐸) as
𝑃(𝐸) = 𝑃(𝑁 𝑅)2
+ 𝑃(𝑅)2
=
(︂
1
2
)︂2
+
(︂
1
2
)︂2
=
1
4
+
1
4
=
1
2
and
𝜅 =
𝑃(𝐴) − 𝑃(𝐸)
1 − 𝑃(𝐸)
=
1
3 − 1
2
1 − 1
2
= −
1
3
.
If 𝜅 < 0 then the agreement between the judges is more than random.
For (b) it is necessary to calculate recall and precision. Our system retrieves the documents
{4, 5, 6, 7, 8} as relevant whereas the judges only agree on {3, 4}. The intersection is {4}. As
to the Definition 2 we obtain
𝑃 =
|{4}|
|{4, 5, 6, 7, 8}|
=
1
5
4
and, as the number of relevant documents is |{3, 4}| = 2, the recall is 𝑅 = 1
2 by Definition 1.
Then, by Definition 3 the 𝐹-measure is equal to
𝐹 =
(𝛽2
+ 1)𝑃 𝑅
𝛽2 𝑃 + 𝑅
=
(1 + 1)1
5
1
2
1
5 + 1
2
=
2
7
.
We similarly count these for (c) which says that a document is relevant if at least one
judge considers it relevant. This makes the documents {3, 4, 5, 6, 7, 8, 9, 10, 11, 12} relevant.
Their intersection with our result {4, 5, 6, 7, 8} is the set {4, 5, 6, 7, 8} of size 5. Recall is
𝑅 = |{4,5,6,7,8}|
|{3,4,5,6,7,8,9,10,11,12}| = 5
10 = 1
2 , precision is 𝑃 = 5
5 = 1 and 𝐹-measure is
𝐹 =
(1 + 1)1
2
1 + 1
2
=
2
3
.
Exercise 3
A user’s primary query is cheap CDs cheap DVDs extremely cheap CDs. The user has a look
on two documents: doc1 a doc2, marking doc1 CDs cheap software cheap CDs as relevant
and doc2 cheap thrills DVDs as non-relevant. Assume that we use a simple tf scheme without
vector length normalization. What would be the restructured query vector after considering
the Rocchio relevance feedback with values 𝛼 = 1, 𝛽 = 0.75 and 𝛾 = 0.25?
We rewrite the exercise to the table for an easier processing.
relevant non-relevant
terms doc1 doc2 query
Cds 2 0 2
cheap 2 1 3
software 1 0 0
thrills 0 1 0
DVDs 0 1 1
extremely 0 0 1
Table 2:
Now we mark the input if the algorithm by Definition 6.
𝑑 𝑟 ∈ 𝐷 𝑟 =
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎝
2
2
1
0
0
0
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎠
, 𝑑 𝑛𝑟 ∈ 𝐷 𝑛𝑟 =
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎝
0
1
0
1
1
0
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎠
, 𝑞 =
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎝
2
3
0
0
1
1
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎠
5
By filling the values to the formula for 𝑞 𝑚 we get
𝑞 𝑚 = 1 ·
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎝
2
3
0
0
1
1
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎠
+ 0.75 · 1
1
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎝
2
2
1
0
0
0
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎠
− 0.25 · 1
1
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎝
0
1
0
1
1
0
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎠
=
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎝
2
3
0
0
1
1
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎠
+
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎝
1.5
1.5
0.75
0
0
0
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎠
−
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎝
0
0.25
0
0.25
0.25
0
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎠
=
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎝
3.5
4.25
0.75
−0.25
0.75
1
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎠
6