Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University http://disa.fi.muni.cz ¡Suppose our input data to a map-reduce system are integer values (the keys are not important) §The map function takes an integer i and produces pairs (p, i) such that p is a prime divisor of i §Example: map(‘any_key’, 12) = [(2,12), (3,12)] §The reduce function is addition §Example: reduce(p, [i , i , ..., i]) is (p, i + i + ... + i) ¡Compute the output, if the input is the set of integers 15, 21, 24, 30, 49 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 2 ¡Map-reduce input: a set of key-value pairs ¡Programmer specifies two methods: §Map(k, v) ® * §Takes a key-value pair and outputs a set of key-value pairs §E.g., key is the filename, value is a single line in the file §There is one Map call for every (k,v) pair §Reduce(k’, *) ® * §All values v’ with same key k’ are reduced together and processed in v’ order §There is one Reduce function call per unique key k’ Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 3 ¡Map functions: §map(‘any_key’, 15) = [(3, 15), (5, 15)] §map(‘any_key’, 21) = [(3, 21), (7, 21)] §map(‘any_key’, 24) = [(2, 24), (3, 24)] §map(‘any_key’, 30) = [(2, 30), (3, 30), (5, 30)] §map(‘any_key’, 49) = [(7, 49)] ¡ ¡Reduce functions: §reduce(2, [24, 30]) = (2, 54) §reduce(3, [15, 21, 24, 30]) = (3, 90) §reduce(5, [15, 30]) = (5, 45) §reduce(7, [21, 49]) = (7, 70) ¡ ¡Output: (2, 54), (3, 90), (5, 45), (7, 70) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 4 ¡Suppose we have the following relations R, S: § R S ¡ ¡ ¡ ¡ ¡Apply the natural join algorithm §Apply the Map function to the tuples of relations §Construct the elements that are input to the Reduce function Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 5 Textové pole: A B 0 1 1 2 2 3 B C 0 1 1 2 2 3 ¡Natural-join algorithm §Finding tuples that agree on common attributes, i.e., only the attribute B is in both relations R and S Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 6 A B 0 1 1 2 2 3 4 1 B C 0 1 1 2 2 3 1 5 ¡Map functions – for each tuple (a, b) of R, the key-value pair (b, (R, a)) is produced and, analogically, for each tuple (b, c) of S, the pair (b, (S, c)) is created: §R: S: §map(R, (0, 1)) = (1, (R, 0)) map(S, (0, 1)) = (0, (S, 1)) §map(R, (1, 2)) = (2, (R, 1)) map(S, (1, 2)) = (1, (S, 2)) §map(R, (2, 3)) = (3, (R, 2)) map(S, (2, 3)) = (2, (S, 3)) § ¡Based on the 4 different keys as the result of all the map calls, the following elements are input to the 4 reduce functions: §(0, [(S, 1)]) reduce(0, [(S, 1)]) = {} §(1, [(R, 0), (S, 2)]) reduce(1, [(R, 0), (S, 2)]) = {(0, 1, 2)} §(2, [(R, 1), (S, 3)]) reduce(2, [(R, 1), (S, 3)]) = {(1, 2, 3)} §(3, [(R, 2)]) reduce(3, [(R, 2)]) = {} Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 7 ¡Design MapReduce algorithms that take a very large file of integers and produce as output: 1)The largest integer; 2)The average of all the integers; 3)The same set of integers, but with each integer appearing only once; 4)The count of the number of distinct integers in the input. ¡Suppose that the file is divided into parts that can be read in parallel by map functions Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 8 ¡map(key, value): ¡// key: document name; value: text of the document ¡ for each word w in value: ¡ emit(w, 1) ¡ reduce(key, values): // key: a word; value: an iterator over counts result = 0 for each count v in values: result += v emit(key, result) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 9 ¡Example of algorithm for counting words ¡1) The largest integer §The idea is to compute a local maximum independently within each map function and then compute the global maximum within a single reducer – ensured by using the same “max” key within all map-function calls § §map(file_id, iterator_over_numbers) § max_local = MIN_INTEGER § for each number n in interator_over_numbers § if (n > max_local) § max_local = n § emit(‘max’, max_local) § §reduce(key, iterator_over_all_max_values) § max_total = MIN_INTEGER § for each number n in iterator_over_all_max_values § if (n > max_total) § max_total = n § emit(‘max’, max_total) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 10 ¡2) The average of all the integers §The idea is to compute a local sum and count independently within each map function and then compute the global average within a single reducer – ensured by using the same “avg” key within all map-function calls § §map(file_id, iterator_over_numbers) § sum_local = 0 § count_local = 0 § for each number n in interator_over_numbers § sum_local += n § count_local += 1 § emit(‘avg’, (sum_local, count_local)) §reduce(key, iterator_over_sum_count_pairs) § sum_total = 0 § count_total = 0 § for each pair (sum_local, count_local) in iterator_over_sum_count_pairs § sum_total += sum_local § count_total += count_local § emit(‘avg’, sum_total/count_total) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 11 ¡3) The same set of integers, but with each integer appearing only once §The idea is to send each specific number to a single reducer, thus guaranteeing that each reducer emits the given value only once § §map(file_id, iterator_over_numbers) § for each number n in interator_over_numbers § emit(n, 1) § §reduce(key, iterator_over_numbers) § emit(key, 1) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 12 ¡4) The count of the number of distinct integers in the input §The idea is to send all the different numbers to a single reducer that eliminates duplicates using the union operation and counts the values § §map(file_id, iterator_over_numbers) § number_set = {} § for each number n in interator_over_numbers § number_set = number_set ∪ {n} § emit(‘count’, number_set) § §reduce(key, iterator_over_number_sets) § total_number_set = {} § for each number_set in iterator_over_number_sets § total_number_set = total_number_set ∪ number_set § emit(‘count’, |total_number_set|) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 13 ¡The algorithm retrieves the six most convenient documents for each query. We focus on the first relevant document retrieved. 1)Determine a convenient measure for this task 2)Compute the measure on the following four query rankings with relevant/irrelevant objects: §R1 = {d7, d5, d3, d8, d1} §R2 = {d5, d6, d3, d2, d4} §R3 = {d9, d3, d4, d8, d5} §R4 = {d9, d3, d1, d7, d5} 3)How can be the result value interpreted? § Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 14 ¡Mean Reciprocal Rank (MRR) §A good metric for those cases in which we are interested in the first correct answer §MRR = an average over reciprocal rankings RR §Definition of RR: §Ri: ranking relative to a query qi §Scorrect(Ri): position of the first correct answer in Ri §Sh: threshold for ranking position §Then, the reciprocal rank RR(Ri) for query qi is: Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 15 1)The Mean Reciprocal Rank (MRR) is the most convenient measure for this task 2)Results for individual rankings (RRi): §RR1 = 0.25 §RR2 = 0.5 §RR3 = 0.33 §RR4 = 0 3)The first correct answer is at the 3.7-th position within an algorithm ranking (1/0.27 = 3.7) on average Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 16 MRR = 0.27 ¡Assume the following two rankings of documents (for some query): §R1 = {d7, d5, d3, d8, d1} §R2 = {d5, d8, d3, d1, d7} ¡Based on these rankings compute: §Spearman rank correlation coefficient §Kendall Tau coefficient Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 17 ¡The Spearman coefficient §The mostly used rank correlation metric §Based on the differences between the positions of the same document in two rankings §Definition: §s1,j be the position of a document dj in ranking R1 §s2,j be the position of dj in ranking R2 §K indicates the size of the ranked sets §S(R1, R2) is the Spearman rank correlation coefficient § Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 18 ¡(s1,d7 – s2, d7)2 = 16 ¡(s1,d5 – s2, d5)2 = 1 ¡(s1,d3 – s2, d3)2 = 0 ¡(s1,d8 – s2, d8)2 = 4 ¡(s1,d1 – s2, d1)2 = 1 ¡ ¡Spearman coefficient: §1 – [6 * (16 + 1 + 0 + 4 + 1) / 120] = –0.1 ¡ Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 19 ¡The Kendall Tau coefficient §When we think of rank correlations, we think of how two rankings tend to vary in similar ways §Consider two documents dj and dk and their positions in rankings R1 and R2 §Further, consider the differences in rank positions for these two documents in each ranking, i.e., §s1,k – s1,j §s2,k – s2,j §If these differences have the same sign, we say that the document pair (dk, dj) is concordant (C) in both rankings; if they have different signs, it is discordant (D) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 20 ¡The Kendall Tau coefficient §Definition: §∆(R1, R2): number of discordant document pairs in R1 and R2 §K: the size of the ranked sets § Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 21 τ (R1, R2) = 1 − 2 × ∆(R1,R2) K(K−1) ¡R1: §(d7, d5), (d7, d3), (d7, d8), (d7, d1) => D D D D §(d5, d3), (d5, d8), (d5, d1) => C C C §(d3, d8), (d3, d1) => D C §(d8, d1) => C ¡R2: §(d5, d8), (d5, d3), (d5, d1), (d5, d7) => C C C D §(d8, d3), (d8, d1), (d8, d7) => D C D §(d3, d1), (d3, d7) => C D §(d1, d7) => D ¡ ¡∆(R1, R2) = 10 ¡Kendall Tau coefficient: §1 – [ (2 * 10) / 20 ] = 0 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 22 ¡The Sum Squared Error (SSE) is a common measure of the quality of a cluster §Sum of the squares of the distances between each of the points of the cluster and the centroid ¡Sometimes, we decide to split a cluster in order to reduce the SSE §Suppose a cluster consists of the following three points: (9,5), (2,2) and (4,8) §Calculate the reduction in the SSE if we partition the cluster optimally into two clusters Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 23 ¡Centroid of points is detemined by averaging the values in each dimension independently => centroid of that three points: (5,5) §[(9,5) – (5,5)]2 = 16 [(4,8) – (5,5)]2 = 10 [(2,2) – (5,5)]2 = 18 §=> SSE = 16 + 10 + 18 = 44 ¡ ¡Then, we group the closest two points, i.e., points (9,5) and (4,8), to one cluster and compute its centroid: (6.5,6.5) §[(9,5) – (6.5,6.5)]2 = 8.5 [(4,8) – (6.5,6.5)]2 = 8.5 => SSE1 = 8.5 + 8.5 = 17 ¡The second cluster has only one point, which is also centroid §[(2,2) – (2,2)]2 = 0 => SSE2 = 0 ¡=> SSE = SSE1 + SSE2 = 17 + 0 = 17 ¡ ¡The reduction in the SSE: §44 – 17 = 27 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 24 ¡Perform a hierarchical clustering on points A–F §Using the centroid proximity measure § § § § § § § § §There is a tie for which pair of clusters is closest. Follow both choices and identify the clusters. Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 25 E [5, 27] [33, 33] [21, 21] [28, 6] [10, 10] [0, 0] C D B A F ¡Hierarchical clustering §Key operation – repeatedly combine two nearest clusters Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 26 http://www.mathworks.com/help/toolbox/stats/dendrogram.gif ¡Centroid proximity measure – distance between two clusters is the distance between their centroids 1) 1){A, B} with centroid at (5,5) 2){C, F} with centroid at (24.5,13.5) 3)Tie: §{A, B, C, F} with centroid at (14.75,9.25) => {A, B, C, E, F}, {D} §{C, D, F} with centroid at (27.33,20) => {A, B, E}, {C, D, F} Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 27