Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University http://disa.fi.muni.cz Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 2 ¡The Cranfield Paradigm ¡Retrieval Performance Evaluation ¡Evaluation Using Reference Collections ¡Interactive Systems Evaluation ¡Search Log Analysis using Clickthrough Data To evaluate an IR system is to measure how well the system meets the information needs of the users This is troublesome, given that a same result set might be interpreted differently by distinct users To deal with this problem, some metrics have been defined that, on average, have a correlation with the preferences of a group of users Without proper retrieval evaluation, one cannot determine how well the IR system is performing compare the performance of the IR system with that of other systems, objectively Retrieval evaluation is a critical and integral component of any modern IR system Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 3 Cleverdon obtained a grant from the National Science Foundation to compare distinct indexing systems These experiments provided interesting insights, that culminated in the modern metrics of precision and recall Recall ratio: the fraction of relevant documents retrieved Precision ratio: the fraction of documents retrieved that are relevant For instance, it became clear that, in practical situations, the majority of searches does not require high recall Instead, the vast majority of the users require just a few relevant answers Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 5 The next step was to devise a set of experiments that would allow evaluating each indexing system in isolation more thoroughly The result was a test reference collection composed of documents, queries, and relevance judgements It became known as the Cranfield-2 collection The reference collection allows using the same set of documents and queries to evaluate different ranking systems The uniformity of this setup allows quick evaluation of new ranking functions Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 6 Reference collections, which are based on the foundations established by the Cranfield experiments, constitute the most used evaluation method in IR A reference collection is composed of: A set D of pre-selected documents A set I of information need descriptions used for testing A set of relevance judgements associated with each pair [im, dj ], im ∈ I and dj ∈ D The relevance judgement has a value of 0 if document dj is non-relevant to im, and 1 otherwise These judgements are produced by human specialists Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Consider, I: an information request R: the set of relevant documents for I A: the answer set for I, generated by an IR system R∩ A: the intersection of the sets R and A Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) The recall and precision measures are defined as follows Recall is the fraction of the relevant documents (the set R) which has been retrieved i.e., Recall = |R ∩ A| | R| Precision is the fraction of the retrieved documents (the set A) which is relevant i.e., Precision = |R ∩ A| | A| Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) The definition of precision and recall assumes that all docs in the set A have been examined However, the user is not usually presented with all docs in the answer set A at once User sees a ranked set of documents and examines them starting from the top Thus, precision and recall vary as the user proceeds with their examination of the set A Most appropriate then is to plot a curve of precision versus recall Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Consider a reference collection and a set of test queries Let Rq 1 be the set of relevant docs for a query q1: Rq 1 = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123} Consider a new IR algorithm that yields the following answer to q1 (relevant docs are marked with a bullet): 01. d123 • 02. d84 03. d56 • 04. d6 05. d8 06. d9 • 07. d511 08. d129 09. d187 10. d25 • 11. d38 12. d48 13. d250 14. d113 15. d3 • Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) If we examine this ranking, we observe that The document d123, ranked as number 1, is relevant This document corresponds to 10% of all relevant documents Thus, we say that we have a precision of 100% at 10% recall The document d56, ranked as number 3, is the next relevant At this point, two documents out of three are relevant, and two of the ten relevant documents have been seen Thus, we say that we have a precision of 66.6% at 20% recall Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 01. d123 • 02. d84 03. d56 • 04. d6 05. d8 06. d9 • 07. d511 08. d129 09. d187 10. d25 • 11. d38 12. d48 13. d250 14. d113 15. d3 • If we proceed with our examination of the ranking generated, we can plot a curve of precision versus recall as follows: Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Consider now a second query q2 whose set of relevant answers is given by Rq 2 = {d3, d56, d129} The previous IR algorithm processes the query q2 and returns a ranking, as follows 01. d425 06. d615 11. d193 02. d87 07. d512 12. d715 03. d56 • 08. d129 • 13. d810 04. d32 09. d4 14. d5 05. d124 10. d130 15. d3 • Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) If we examine this ranking, we observe The first relevant document is d56 It provides a recall and precision levels equal to 33.3% The second relevant document is d129 It provides a recall level of 66.6% (with precision equal to 25%) The third relevant document is d3 It provides a recall level of 100% (with precision equal to 20%) 01. d425 06. d615 11. d193 02. d87 07. d512 12. d715 03. d56 • 08. d129 • 13. d810 04. d32 09. d4 14. d5 05. d124 10. d130 15. d3 • Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) The precision figures at the 11 standard recall levels are interpolated as follows Let rj , j ∈ {0, 1, 2, . . . , 10}, be a reference to the j-th standard recall level Then, P (rj ) = max∀r | rj≤r P (r) In our last example, this interpolation rule yields the precision and recall figures illustrated below Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) In the examples above, the precision and recall figures have been computed for single queries Usually, however, retrieval algorithms are evaluated by running them for several distinct test queries To evaluate the retrieval performance for Nq queries, we average the precision at each recall level as follows where P (rj ) is the average precision at the recall level rj Pi(rj ) is the precision at recall level rj for the i-th query Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) To illustrate, the figure below illustrates precision-recall figures averaged over queries q1 and q2 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Average precision-recall curves are normally used to compare the performance of distinct IR algorithms The figure below illustrates average precision-recall curves for two distinct retrieval algorithms Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) In the case of Web search engines, the majority of searches does not require high recall Higher the number of relevant documents at the top of the ranking, more positive is the impression of the users Precision at 5 (P@5) and at 10 (P@10) measure the precision when 5 or 10 documents have been seen These metrics assess whether the users are getting relevant documents at the top of the ranking or not Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) To exemplify, consider again the ranking for the example query q1 we have been using: For this query, we have P@5 = 40% and P@10 = 40% Further, we can compute P@5 and P@10 averaged over a sample of 100 queries, for instance These metrics provide an early assessment of which algorithm might be preferable in the eyes of the users Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 01. d123 • 02. d84 03. d56 • 04. d6 05. d8 06. d9 • 07. d511 08. d129 09. d187 10. d25 • 11. d38 12. d48 13. d250 14. d113 15. d3 • The idea here is to average the precision figures obtained after each new relevant document is observed For relevant documents not retrieved, the precision is set to 0 To illustrate, consider again the precision-recall curve for the example query q1 The mean average precision (MAP) for q1 is given by MAP1 = 1 + 0.66 + 0.5 + 0.4 + 0.33 + 0 + 0 + 0 + 0 + 0 10 = 0.28 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Let R be the total number of relevant docs for a given query The idea here is to compute the precision at the R-th position in the ranking For the query q1, the R value is 10 and there are 4 relevants among the top 10 documents in the ranking Thus, the R-Precision value for this query is 0.4 The R-precision measure is a useful for observing the behavior of an algorithm for individual queries Additionally, one can also compute an average R-precision figure over a set of queries However, using a single number to evaluate a algorithm over several queries might be quite imprecise Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) The R-precision computed for several queries can be used to compare two algorithms as follows Let, RPA(i) : R-precision for algorithm A for the i-th query RPB (i) : R-precision for algorithm B for the i-th query Define, for instance, the difference RPA/B (i) = RPA(i) − RPB (i) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Figure below illustrates the RPA/B (i) values for two retrieval algorithms over 10 example queries The algorithm A performs better for 8 of the queries, while the algorithm B performs better for the other 2 queries Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) MRR is a good metric for those cases in which we are interested in the first correct answer such as Question-Answering (QA) systems Search engine queries that look for specific sites URL queries Homepage queries Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Let, Ri: ranking relative to a query qi Scorrect(Ri): position of the first correct answer in Ri Sh: threshold for ranking position Then, the reciprocal rank RR(Ri) for query qi is given by The mean reciprocal rank (MRR) for a set Q of Nq queries is given by Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) A measure that combines recall and precision The idea is to allow the user to specify whether he is more interested in recall or in precision The E measure is defined as follows where r(j) is the recall at the j-th position in the ranking P (j) is the precision at the j-th position in the ranking b ≥ 0 is a user specified parameter E(j) is the E metric at the j-th position in the ranking Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) The parameter b is specified by the user and reflects the relative importance of recall and precision If b =0 E(j) = 1 − P (j) low values of b make E(j) a function of precision If b → ∞ limb→∞ E(j) = 1 − r(j) high values of b make E(j) a function of recal For b =1, the E-measure becomes the F-measure Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) The F-measure is also a single measure that combines recall and precision ¡where ¡r(j) is the recall at the j-th position in the ranking ¡P (j) is the precision at the j-th position in the ranking ¡F (j) is the harmonic mean at the j-th position in the ranking Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) The function F assumes values in the interval [0, 1] It is 0 when no relevant documents have been retrieved and is 1 when all ranked documents are relevant Further, the harmonic mean F assumes a high value only when both recall and precision are high To maximize F requires finding the best possible compromise between recall and precision Notice that setting b =1 in the formula of the E-measure yields F (j) = 1 − E(j) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Precision and recall allow only binary relevance assessments As a result, there is no distinction between highly relevant docs and mildly relevant docs These limitations can be overcome by adopting graded relevance assessments and metrics that combine them The discounted cumulated gain (DCG) is a metric that combines graded relevance assessments effectively Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) When examining the results of a query, two key observations can be made: highly relevant documents are preferable at the top of the ranking than mildly relevant ones relevant documents that appear at the end of the ranking are less valuable Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Consider that the results of the queries are graded on a scale 0–3 (0 for non-relevant, 3 for strong relevant docs) For instance, for queries q1 and q2, consider that the graded relevance scores are as follows: Rq 1 = { [d3, 3], [d5, 3], [d9, 3], [d25, 2], [d39, 2], [d44, 2], [d56, 1], [d71, 1], [d89, 1], [d123, 1] } Rq 2 = { [d3, 3], [d56, 2], [d129, 1] } That is, while document d3 is highly relevant to query q1, document d56 is just mildly relevant Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Given these assessments, the results of a new ranking algorithm can be evaluated as follows Specialists associate a graded relevance score to the top 10–20 results produced for a given query q This list of relevance scores is referred to as the gain vector G Considering the top 15 docs in the ranking produced for queries q1 and q2, the gain vectors for these queries are: G1 = (1, 0, 1, 0, 0, 3, 0, 0, 0, 2, 0, 0, 0, 0, 3) G2 = (0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 3) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) By summing up the graded scores up to any point in the ranking, we obtain the cumulated gain (CG) For query q1, for instance, the cumulated gain at the first position is 1, at the second position is 1+0, and so on Thus, the cumulated gain vectors for queries q1 and q2 are given by CG1 = (1, 1, 2, 2, 2, 5, 5, 5, 5, 7, 7, 7, 7, 7, 10) CG2 = (0, 0, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 6) For instance, the cumulated gain at position 8 of CG1 is equal to 5 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) In formal terms, we define Given the gain vector Gj for a test query qj , the CGj associated with it is defined as where CGj [i] refers to the cumulated gain at the i-th position of the ranking for query qj Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) We also introduce a discount factor that reduces the impact of the gain as we move upper in the ranking A simple discount factor is the logarithm of the ranking position If we consider logs in base 2, this discount factor will be log2 2 at position 2, log2 3 at position 3, and so on By dividing a gain by the corresponding discount factor, we obtain the discounted cumulated gain (DCG) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) More formally, Given the gain vector Gj for a test query qj , the vector DCGj associated with it is defined as where DCGj [i] refers to the discounted cumulated gain at the i-th position of the ranking for query qj Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) For the example queries q1 and q2, the DCG vectors are given by DCG1 = (1.0, 1.0, 1.6, 1.6, 1.6, 2.8, 2.8, 2.8, 2.8, 3.4, 3.4, 3.4, 3.4, 3.4, 4.2) DCG2 = (0.0, 0.0, 1.3, 1.3, 1.3, 1.3, 1.3, 1.6, 1.6, 1.6, 1.6, 1.6, 1.6, 1.6, 2.4) Discounted cumulated gains are much less affected by relevant documents at the end of the ranking By adopting logs in higher bases the discount factor can be accentuated Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) To produce CG and DCG curves over a set of test queries, we need to average them over all queries Given a set of Nq queries, average CG[i] and DCG[i] over all queries are computed as follows For instance, for the example queries q1 and q2, these averages are given by CG = (0.5, 0.5, 2.0, 2.0, 2.0, 3.5, 3.5, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0) DCG = (0.5, 0.5, 1.5, 1.5, 1.5, 2.1, 2.1, 2.2, 2.2, 2.5, 2.5, 2.5, 2.5, 2.5, 3.3) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Then, average curves can be drawn by varying the rank positions from 1 to a pre-established threshold Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Recall and precision figures are computed relatively to the set of relevant documents CG and DCG scores, as defined above, are not computed relatively to any baseline This implies that it might be confusing to use them directly to compare two distinct retrieval algorithms One solution to this problem is to define a baseline to be used for normalization This baseline are the ideal CG and DCG metrics, as we now discuss Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) For a given test query q, assume that the relevance assessments made by the specialists produced: n3 documents evaluated with a relevance score of 3 n2 documents evaluated with a relevance score of 2 n1 documents evaluated with a score of 1 n0 documents evaluated with a score of 0 The ideal gain vector IG is created by sorting all relevance scores in decreasing order, as follows: IG = (3, . . . , 3, 2, . . . , 2, 1, . . . , 1, 0, . . . , 0) For instance, for the example queries q1 and q2, we have IG1 = (3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0) IG2 = (3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Ideal CG and ideal DCG vectors can be computed analogously to the computations of CG and DCG For the example queries q1 and q2, we have ICG1 = (3, 6, 9, 11, 13, 15, 16, 17, 18, 19, 19, 19, 19, 19, 19) ICG2 = (3, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6) The ideal DCG vectors are given by IDCG1 = (3.0, 6.0, 7.9, 8.9, 9.8, 10.5, 10.9, 11.2, 11.5, 11.8, 11.8, 11.8, 11.8, 11.8, 11.8) IDCG2 = (3.0, 5.0, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Further, average ICG and average IDCG scores can be computed as follows For instance, for the example queries q1 and q2, ICG and IDCG vectors are given by ICG = (3.0, 5.5, 7.5, 8.5, 9.5, 10.5, 11.0, 11.5, 12.0, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5) IDCG = (3.0, 5.5, 6.8, 7.3, 7.7, 8.1, 8.3, 8.4, 8.6, 8.7, 8.7, 8.7, 8.7, 8.7, 8.7) By comparing the average CG and DCG curves for an algorithm with the average ideal curves, we gain insight on how much room for improvement there is Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Precision and recall figures can be directly compared to the ideal curve of 100% precision at all recall levels DCG figures, however, are not build relative to any ideal curve, which makes it difficult to compare directly DCG curves for two distinct ranking algorithms This can be corrected by normalizing the DCG metric Given a set of Nq test queries, normalized CG and DCG metrics are given by Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) For instance, for the example queries q1 and q2, NCG and NDCG vectors are given by NCG = NDCG (0.17, 0.09, 0.27, 0.24, 0.21, 0.33, 0.32, 0.35, 0.33, 0.40, 0.40, 0.40, 0.40, 0.40, 0.64) = (0.17, 0.09, 0.21, 0.20, 0.19, 0.25, 0.25, 0.26, 0.26, 0.29, 0.29, 0.29, 0.29, 0.29, 0.38) The area under the NCG and NDCG curves represent the quality of the ranking algorithm Higher the area, better the results are considered to be Thus, normalized figures can be used to compare two distinct ranking algorithms Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) CG and DCG metrics aim at taking into account multiple level relevance assessments This has the advantage of distinguishing highly relevant documents from mildly relevant ones The inherent disadvantages are that multiple level relevance assessments are harder and more time consuming to generate Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Despite these inherent difficulties, the CG and DCG metrics present benefits: They allow systematically combining document ranks and relevance scores Cumulated gain provides a single metric of retrieval performance at any position in the ranking It also stresses the gain produced by relevant docs up to a position in the ranking, which makes the metrics more imune to outliers Further, discounted cumulated gain allows down weighting the impact of relevant documents found late in the ranking Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Precision and recall allow comparing the relevance of the results produced by two ranking functions However, there are situations in which we cannot directly measure relevance we are more interested in determining how differently a ranking function varies from a second one that we know well In these cases, we are interested in comparing the relative ordering produced by the two rankings This can be accomplished by using statistical functions called rank correlation metrics Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Let rankings R1 and R2 A rank correlation metric yields a correlation coefficient C(R1, R2) with the following properties: −1 ≤ C(R1, R2) ≤ 1 if C(R1, R2) = 1, the agreement between the two rankings is perfect i.e., they are the same. if C(R1, R2) = −1, the disagreement between the two rankings is perfect i.e., they are the reverse of each other. if C(R1, R2) = 0, the two rankings are completely independent. increasing values of C(R1, R2) imply increasing agreement between the two rankings. Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) The Spearman coefficient is likely the mostly used rank correlation metric It is based on the differences between the positions of a same document in two rankings Let s1,j be the position of a document dj in ranking R1 and s2,j be the position of dj in ranking R2 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Consider 10 example documents retrieved by two distinct rankings R1 and R2. Let s1,j and s2,j be the document position in these two rankings, as follows: documents s1,j s2,j si,j − s2,j (s1,j − s2,j)2 d123 1 2 -1 1 d84 2 3 -1 1 d56 3 1 +2 4 d6 4 5 -1 1 d8 5 4 +1 1 d9 6 7 -1 1 d511 7 8 -1 1 d129 8 10 -2 4 d187 9 6 +3 9 d25 10 9 +1 1 Sum of Square Distances 24 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) By plotting the rank positions for R1 and R2 in a 2-dimensional coordinate system, we observe that there is a strong correlation between the two rankings Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) To produce a quantitative assessment of this correlation, we sum the squares of the differences for each pair of rankings If there are K documents ranked, the maximum value for the sum of squares of ranking differences is given by K × (K2 − 1) 3 Let K = 10 If the two rankings were in perfect disagreement, then this value is (10 × (102 − 1))/ 3, or 330 On the other hand, if we have a complete agreement the sum is 0 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Let us consider the fraction Its value is 0 when the two rankings are in perfect agreement +1 when they are in perfect disagreement If we multiply the fraction by 2, its value shifts to the range [0, +2] If we now subtract the result from 1, the resultant value shifts to the range [−1, +1] Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) This reasoning suggests defining the correlation between the two rankings as follows Let s1,j and s2,j be the positions of a document dj in two rankings R1 and R2, respectively Define where S(R1, R2) is the Spearman rank correlation coefficient K indicates the size of the ranked sets Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) For the rankings in Figure below, we have documents s1,j s2,j si,j − s2,j (s1,j − s2,j)2 d123 1 2 -1 1 d84 2 3 -1 1 d56 3 1 +2 4 d6 4 5 -1 1 d8 5 4 +1 1 d9 6 7 -1 1 d511 7 8 -1 1 d129 8 10 -2 4 d187 9 6 +3 9 d25 10 9 +1 1 Sum of Square Distances 24 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) It is difficult to assign an operational interpretation to Spearman coefficient One alternative is to use a coefficient that has a natural and intuitive interpretation, as the Kendall Tau coefficient Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) When we think of rank correlations, we think of how two rankings tend to vary in similar ways To illustrate, consider two documents dj and dk and their positions in the rankings R1 and R2 Further, consider the differences in rank positions for these two documents in each ranking, i.e., s1,k s2,k − − s1,j s2,j If these differences have the same sign, we say that the document pair [dk , dj ] is concordant in both rankings If they have different signs, we say that the document pair is discordant in the two rankings Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) for a total of 1 × 5 × 4, or 10 ordered pairs Consider the top 5 documents in rankings R1 and R2 documents s1,j s2,j si,j − s2,j d123 1 2 -1 d84 2 3 -1 d56 3 1 +2 d6 4 5 -1 d8 5 4 +1 The ordered document pairs in ranking R1 are [d123, d84], [d123, d56], [d123, d6], [d123, d8], [d84, d56], [d84, d6], [d84, d8], [d56, d6], [d56, d8], [d6, d8] 2 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Repeating the same exercise for the top 5 documents in ranking R2, we obtain [d56, d123], [d123, d84], [d56, d84], [d123, d8], [d56, d8], [d56, d6], [d123, d6], [d84, d6], [d84, d8], [d8, d6] We compare these two sets of ordered pairs looking for concordant and discordant pairs Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Let us mark with a C the concordant pairs and with a D the discordant pairs For ranking R1, we have C, D, C, C, D, C, C, C, C, D For ranking R2, we have D, D, C, C, C, C, C, C, C, D Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) That is, a total of 20, i.e., K(K − 1), ordered pairs are produced jointly by the two rankings Among these, 14 pairs are concordant and 6 pairs are discordant The Kendall Tau coefficient is defined as τ (R1, R2) = P (R1 = R2) − P (R1 ≠ R2) In our example τ (R1, R2) = 14 6 − 20 20 = 0.4 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Let, ∆(R1, R2): number of discordant document pairs in R1 and R2 K(K − 1) − ∆(R1, R2): number of concordant document pairs in R1 and R2 Then, P (R1 = R2) = K(K − 1) − ∆(R1, R2) P (R1 ≠ R2) = K(K− 1) ∆(R1, R2) K(K− 1) which yields τ (R1, R2) = 1 − 2 × ∆(R1,R2) K(K−1) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) τ (R1, R2) = 1 − 5(5 − 1) = 0.4 For the case of our previous example, we have ∆(R1, R2) = 6 K = 5 Thus, 2 × 6 as before The Kendall Tau coefficient is defined only for rankings over a same set of elements Most important, it has a simpler algebraic structure than the Spearman coefficient Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) A form of evaluating two different systems is to evaluate their results side by side Typically, the top 10 results produced by the systems for a given query are displayed in side-by-side panels Presenting the results side by side allows controlling: differences of opinion among subjects influences on the user opinion produced by the ordering of the top results Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Side by side panels for Yahoo! and Google Top 5 answers produced by each search engine, with regard to the query “information retrieval evaluation” Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) The side-by-side experiment is simply a judgement on which side provides better results for a given query By recording the interactions of the users, we can infer which of the answer sets are preferred to the query Side by side panels can be used for quick comparison of distinct search engines Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) There are a number of limitations with current approaches for relevance evaluation For instance, the Cranfield paradigm is expensive and has obvious scalability issues Recently, crowdsourcing has emerged as a feasible alternative for relevance evaluation Crowdsourcing is a term used to describe tasks that are outsourced to a large group of people, called “workers” It is an open call to solve a problem or carry out a task, one which usually involves a monetary value in exchange for such service Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) To illustrate, crowdsourcing has been used to validate research on the quality of search snippets One of the most important aspects of crowdsourcing is to design the experiment carefully It is important to ask the right questions and to use well-known usability techniques Workers are not information retrieval experts, so the task designer should provide clear instructions Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)