Seminar 6 Definition 1 (Recall) Recall describes how many of the relevant documents are retrieved. π‘Ÿπ‘’π‘π‘Žπ‘™π‘™ = 𝑅 = #π‘Ÿπ‘’π‘™π‘’π‘£π‘Žπ‘›π‘‘ π‘Ÿπ‘’π‘‘π‘Ÿπ‘–π‘’π‘£π‘’π‘‘ #π‘Ÿπ‘’π‘™π‘’π‘£π‘Žπ‘›π‘‘ Definition 2 (Precision) Precision describes how many of the retrieved documents are relevant. π‘π‘Ÿπ‘’π‘π‘–π‘ π‘–π‘œπ‘› = 𝑃 = #π‘Ÿπ‘’π‘™π‘’π‘£π‘Žπ‘›π‘‘ π‘Ÿπ‘’π‘‘π‘Ÿπ‘–π‘’π‘£π‘’π‘‘ #π‘Ÿπ‘’π‘‘π‘Ÿπ‘–π‘’π‘£π‘’π‘‘ Definition 3 (𝐹 -measure) A balanced 𝐹-measure (𝐹1-measure) defines a recall-precision relationship represented by their weighted harmonic mean: 𝐹 = 2 Β· 𝑅 Β· 𝑃 𝑅 + 𝑃 Definition 4 (Mean Average Precision) MAP expresses the precision in each point a new relevant document is included in the result. Is counted as 𝑀 𝐴𝑃(𝑄) = 1 |𝑄| Β· βŽ› ⎝ βˆ‘οΈ π‘žβˆˆπ‘„ 1 π‘Ÿπ‘’π‘™ π‘ž Β· βŽ› ⎝ π‘Ÿπ‘’π‘™ π‘ž βˆ‘οΈ 𝑖=1 π‘π‘Ÿπ‘’π‘π‘– ⎞ ⎠ ⎞ ⎠ where π‘Ÿπ‘’π‘™ π‘ž is the number of relevant documents retrieved by query π‘ž and π‘π‘Ÿπ‘’π‘π‘– is the precision at the 𝑖-th document. Definition 5 (πœ… statistic) Let 𝑁 be the total number of documents, 𝐽 is a set of judges and 𝑃(𝐴) = #π‘Žπ‘”π‘Ÿπ‘’π‘’ 𝑁 the number of documents on which the judges agree. Let also define 𝑅 𝑗 and 𝑁 𝑅 𝑗 be the number of relevant and non-relevant documents, respectively, according to the judge 𝑗 ∈ 𝐽 and 𝑃(𝑅) = βˆ‘οΈ€ π‘—βˆˆπ½ 𝑅 𝑗 |𝐽| Β· 𝑁 and 𝑃(𝑁 𝑅) = βˆ‘οΈ€ π‘—βˆˆπ½ 𝑁 𝑅 𝑗 |𝐽| Β· 𝑁 as the number of relevant and non-relevant documents, respectively. Let finally define 𝑃(𝐸) = 𝑃(𝑅)2 + 𝑃(𝑁 𝑅)2 as the approximate number of disagreements between the judges. Then the πœ… statistic is defined as the measure of agreement between the judges πœ… = 𝑃(𝐴) βˆ’ 𝑃(𝐸) 1 βˆ’ 𝑃(𝐸) . Definition 6 (Rocchio relevance feedback) Rocchio relevance feedback has the form π‘ž π‘š = π›Όπ‘ž0 + 𝛽 1 |𝐷 π‘Ÿ| βˆ‘οΈ ⃗𝑑 π‘Ÿβˆˆπ· π‘Ÿ ⃗𝑑 π‘Ÿ βˆ’ 𝛾 1 |𝐷 π‘›π‘Ÿ| βˆ‘οΈ ⃗𝑑 π‘›π‘Ÿβˆˆπ· π‘›π‘Ÿ ⃗𝑑 π‘›π‘Ÿ where π‘ž0 is the original query vector, 𝐷 π‘Ÿ is the set of relevant documents, 𝐷 π‘›π‘Ÿ is the set of non-relevant documents and the values 𝛼, 𝛽, 𝛾 depend on the system setting. 1 Exercise 1 The following ordered list of 20 letters R and N represents relevant (R) and non-relevant (N) retrieved documents as an answer for a query on a collection of 10 000 documents. The leftmost document is expected to be the most relevant. The list contains 6 relevant documents. Assume that the collection contains 8 documents relevant to the query. R R N N N N N N R N R N N N R N N N N R a) What is the precision on the first 20 results? b) What is the 𝐹-measure on the first 20 results? c) What is the non-interpolated precision of the system at 25% recall? (R=25%) d) What is the interpolated precision of the system at 33% recall? (R>33%) e) Assume that these 20 documents is the complete list of retrieved documents. What is the MAP of the system? Now assume that the system returned all 10,000 documents in an ordered list and above is the top 20. f) What is the highest possible MAP the system can achieve? g) What is the lowest possible MAP the system can achieve? Applying the Definition 2 do calculate (a) we get 𝑃 = 6 20 = 3 10 . For (b) it is necessary to count the recall with Definition 1 as 𝑅 = 6 8 = 3 4 . Using Definition 3 with 𝛼 = 0.5 we count (𝛽2 + 1)𝑃 𝑅 𝛽2 𝑃 + 𝑅 = (12 + 1) Β· 3 10 Β· 3 4 3 10 + 3 4 = 9 20 21 20 = 3 7 . To count the non-interpolated precision of the system at 25% recall for (c) we need the precisions for the documents of recall equal to 25%: 1. 𝑃 = 1 1 𝑅 = 1 8 = 12.5% 2. 𝑃 = 2 2 𝑅 = 2 8 = 25% 3. 𝑃 = 2 3 𝑅 = 2 8 = 25% Β· Β· Β· 8. 𝑃 = 2 8 𝑅 = 2 8 = 25% 9. 𝑃 = 3 9 𝑅 = 3 8 = 37.5% We see that the first document has 𝑅 = 12.5% but this value is less than 25%. Documents 2 to 8 have the desired recall of 25%. Document 9 has already a higher value so we do not include it in the result. Non-interpolated precision is then the set of precisions of these 7 documents 𝑃 = {οΈ‚ 2 2 , 2 3 , 2 4 , 2 5 , 2 6 , 2 7 , 2 8 }οΈ‚ . 2 For the interpolated precision (d) we are looking for the highest precision for the relevant documents of recall higher than 33%. Recall changes if and only if the result contains a relevant document. Therefore, the values are calculated for the documents 11, 15 and 20. 11. 𝑃 = 4 11 𝑅 = 4 8 = 50% 15. 𝑃 = 5 15 𝑅 = 5 8 = 62.5% 20. 𝑃 = 6 20 𝑅 = 6 8 = 75% The requested recall value is exceeded by retrieving the document 9 (37.5 %). Now we only have to find π‘šπ‘Žπ‘₯{𝑃9, 𝑃11, 𝑃15, 𝑃20} = π‘šπ‘Žπ‘₯{3 9 , 4 11 , 5 15 , 6 20 } = 4 11 = 0.36. To estimate the MAP of the system (e) we use the Definition 4. Since we only have one query 𝑁 = |𝑄| = 1 and the first 20 documents contain π‘Ÿπ‘’π‘™ π‘ž = 6 relevant documents 𝑀 𝐴𝑃(𝑄) = 1 1 βŽ› ⎝ 1βˆ‘οΈ 𝑗=1 1 6 (οΈƒ 6βˆ‘οΈ 𝑖=1 𝑃(π‘‘π‘œπ‘π‘–) )οΈƒβŽž ⎠ = 1 1 Β· 1 6 βŽ› ⎜ ⎜ ⎝ 1 1⏟ ⏞ 𝑃 (1.) + 2 2⏟ ⏞ 𝑃 (2.) + 3 9⏟ ⏞ 𝑃 (9.) + 4 11⏟ ⏞ 𝑃 (11.) + 5 15⏟ ⏞ 𝑃 (15.) + 6 20⏟ ⏞ 𝑃 (20.) ⎞ ⎟ ⎟ ⎠ = 1 1 Β· 1 6 Β· 1099 330 = 1099 1980 = 0.5550 From Definition 2 and Definition4 follows that if the two remaining relevant documents were on positions 21 and 22 then MAP with π‘Ÿπ‘’π‘™ π‘ž = 8 relevant documents is the highest possible (f) 𝑀 𝐴𝑃(𝑄) = 1 8 (οΈ‚ 1 1 + 2 2 + 3 9 + 4 11 + 5 15 + 6 20 + 7 21 + 8 22 )οΈ‚ = 0.503409 and, on the other hand, if they were on the last places the MAP would be minimal (g) 𝑀 𝐴𝑃(𝑄) = 1 8 (οΈ‚ 1 1 + 2 2 + 3 9 + 4 11 + 5 15 + 6 20 + 7 9999 + 8 10000 )οΈ‚ = 0.41647538. Exercise 2 Below is a table showing how two judges judged the relevance (0 = non-relevant, 1 = relevant) of the set of 12 documents with respect to a query. Assume that you developed an IR system, that for this query returns the documents {4, 5, 6, 7, 8}. 3 Doc ID Judge 1 Judge 2 1 0 0 2 0 0 3 1 1 4 1 1 5 1 0 6 1 0 7 1 0 8 1 0 9 0 1 10 0 1 11 0 1 12 0 1 Table 1: Judges judging the relevance of documents. a) Calculate the πœ… statistic. b) Calculate the recall, precision and 𝐹-measure of your system in which a document is considered relevant if the judges agree. c) Calculate the recall, precision and 𝐹-measure of your system in which a document is considered relevant if at least one of the judges thinks so. For (a) it is necessary to use the Definition 5. First we count 𝑃(𝐴) as the number of documents on which the judges agree. Since these are the documents {1, 2, 3, 4} and the total number of them is 𝑁 = 12, then 𝑃(𝐴) = |{1,2,3,4}| 12 = 4 12 = 1 3 . Now we need the counts of disagreements between the judges. Judge 1 considers the documents 𝑁 𝑅1 = {1, 2, 9, 10, 11, 12} and judge 2 𝑁 𝑅2 = {1, 2, 5, 6, 7, 8} as non-relevant. Plugging in to the formula for 𝑃(𝑁 𝑅) we get 𝑃(𝑁 𝑅) = |𝑁 𝑅1|+|𝑁 𝑅2| 2Β·12 = 2Β·6 24 = 1 2 . We repeat this for 𝑃(𝑅). Since the number of relevant is equal to non-relevant, then 𝑃(𝑅) = 𝑃(𝑁 𝑅) = 1 2 . We finally can count 𝑃(𝐸) as 𝑃(𝐸) = 𝑃(𝑁 𝑅)2 + 𝑃(𝑅)2 = (οΈ‚ 1 2 )οΈ‚2 + (οΈ‚ 1 2 )οΈ‚2 = 1 4 + 1 4 = 1 2 and πœ… = 𝑃(𝐴) βˆ’ 𝑃(𝐸) 1 βˆ’ 𝑃(𝐸) = 1 3 βˆ’ 1 2 1 βˆ’ 1 2 = βˆ’ 1 3 . If πœ… < 0 then the agreement between the judges is more than random. For (b) it is necessary to calculate recall and precision. Our system retrieves the documents {4, 5, 6, 7, 8} as relevant whereas the judges only agree on {3, 4}. The intersection is {4}. As to the Definition 2 we obtain 𝑃 = |{4}| |{4, 5, 6, 7, 8}| = 1 5 4 and, as the number of relevant documents is |{3, 4}| = 2, the recall is 𝑅 = 1 2 by Definition 1. Then, by Definition 3 the 𝐹-measure is equal to 𝐹 = (𝛽2 + 1)𝑃 𝑅 𝛽2 𝑃 + 𝑅 = (1 + 1)1 5 1 2 1 5 + 1 2 = 2 7 . We similarly count these for (c) which says that a document is relevant if at least one judge considers it relevant. This makes the documents {3, 4, 5, 6, 7, 8, 9, 10, 11, 12} relevant. Their intersection with our result {4, 5, 6, 7, 8} is the set {4, 5, 6, 7, 8} of size 5. Recall is 𝑅 = |{4,5,6,7,8}| |{3,4,5,6,7,8,9,10,11,12}| = 5 10 = 1 2 , precision is 𝑃 = 5 5 = 1 and 𝐹-measure is 𝐹 = (1 + 1)1 2 1 + 1 2 = 2 3 . Exercise 3 A user’s primary query is cheap CDs cheap DVDs extremely cheap CDs. The user has a look on two documents: doc1 a doc2, marking doc1 CDs cheap software cheap CDs as relevant and doc2 cheap thrills DVDs as non-relevant. Assume that we use a simple tf scheme without vector length normalization. What would be the restructured query vector after considering the Rocchio relevance feedback with values 𝛼 = 1, 𝛽 = 0.75 and 𝛾 = 0.25? We rewrite the exercise to the table for an easier processing. relevant non-relevant terms doc1 doc2 query Cds 2 0 2 cheap 2 1 3 software 1 0 0 thrills 0 1 0 DVDs 0 1 1 extremely 0 0 1 Table 2: Now we mark the input if the algorithm by Definition 6. 𝑑 π‘Ÿ ∈ 𝐷 π‘Ÿ = βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 2 2 1 0 0 0 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ , 𝑑 π‘›π‘Ÿ ∈ 𝐷 π‘›π‘Ÿ = βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 0 1 0 1 1 0 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ , π‘ž = βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 2 3 0 0 1 1 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ 5 By filling the values to the formula for π‘ž π‘š we get π‘ž π‘š = 1 Β· βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 2 3 0 0 1 1 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ + 0.75 Β· 1 1 βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 2 2 1 0 0 0 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ βˆ’ 0.25 Β· 1 1 βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 0 1 0 1 1 0 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ = βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 2 3 0 0 1 1 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ + βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 1.5 1.5 0.75 0 0 0 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ βˆ’ βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 0 0.25 0 0.25 0.25 0 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ = βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 3.5 4.25 0.75 βˆ’0.25 0.75 1 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ 6