Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University http://disa.fi.muni.cz ¡Suppose 100 items (numbered 1 to 100) and 100 baskets (numbered 1 to 100) §Item i is in basket b if and only if i divides b with no remainder, i.e., item 1 is in all the baskets, item 2 is in all fifty of the even-numbered baskets, etc. ¡Tasks: 1)Identify the frequent items when the support threshold is set to 5 2)Compute the confidence of these association rules a){5, 7} → 2 b){2, 3, 4} → 5 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 2 ¡Simplest question: Find sets of items that appear together “frequently” in baskets ¡Support for item set I: Number of baskets containing all items in I §(Often expressed as a fraction of the total number of baskets) ¡Given a support threshold s, then sets of items that appear in at least s baskets are called frequent itemsets ¡ Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 3 Support of {Beer, Bread} = 2 4 ¡Association Rules: If-then rules about the contents of baskets ¡{i1, i2,…,ik} → j means: “if a basket contains all of i1,…,ik then it is likely to contain j” ¡In practice there are many rules, want to find significant/interesting ones! ¡Confidence of this association rule is the probability of j given I = {i1,…,ik} ¡ ¡ Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 1)20 frequent items: 1–20 2)Association rules: a)The baskets containing both items 5 and 7 are baskets 35 and 70, in which only basket 70 also contains item 2. Hence, the confidence of the rule {5, 7} → 2 is 1/2. b)The baskets whose numbers are the multiples of 12 contain item set {2, 3, 4} as a subset – there are 8 such baskets. The baskets whose numbers are the multiples of 60 contain item set {2, 3, 4, 5} as a subset – there is 1 such basket. Hence, the confidence of the rule {2, 3, 4} → 5 is 1/8. Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 5 ¡Consider the following twelve baskets, each of them contains 3 of 6 items (1 through 6): §{1, 2, 3} {2, 3, 4} {3, 4, 5} {4, 5, 6} §{1, 3, 5} {2, 4, 6} {1, 3, 4} {2, 4, 5} §{3, 5, 6} {1, 2, 4} {2, 3, 5} {3, 4, 6} ¡Suppose the support threshold is 4. On the first pass of the PCY algorithm, a hash table with 11 buckets is used, and the set {i, j} is hashed to bucket i×j mod 11: 1)Compute the support for each item and each pair of items 2)Which pairs hash to which buckets? 3)Which buckets are frequent? 4)Which pairs are counted on the second pass? Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 6 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 7 New in PCY Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 8 1)Compute the support for each item and each pair of items §Support for each item: ¡ ¡item 1 2 3 4 5 6 ¡support 4 6 8 8 6 4 § §Support for each pair of items: ¡ ¡pair {1, 2} {1, 3} {1, 4} {1, 5} {1, 6} {2, 3} {2, 4} {2, 5} ¡support 2 3 2 1 0 3 4 2 ¡ ¡pair {2, 6} {3, 4} {3, 5} {3, 6} {4, 5} {4, 6} {5, 6} ¡support 1 4 4 2 3 3 2 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 9 2)Which pairs hash to which buckets? §The set {i, j} is hashed to bucket no.: i×j mod 11 ¡ ¡pair {1, 2} {1, 3} {1, 4} {1, 5} {1, 6} {2, 3} {2, 4} {2, 5} ¡bucket 2 3 4 5 6 6 8 10 ¡ ¡pair {2, 6} {3, 4} {3, 5} {3, 6} {4, 5} {4, 6} {5, 6} ¡bucket 1 1 4 7 9 2 8 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 10 3)Which buckets are frequent? §Bucket support – sum of supports of pairs belonging to the given bucket: ¡ ¡bucket 0 1 2 3 4 5 6 7 ¡support 0 5 5 3 6 1 3 2 ¡ ¡bucket 8 9 10 ¡support 6 3 2 § §The frequent buckets are those with support above 4, i.e., buckets: 1, 2, 4, 8 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 11 4)Which pairs are counted on the second pass §As only pairs in frequent buckets will be counted on the second pass of PCY, they are: § {1, 2}, {1, 4}, {2, 4}, {2, 6}, {3, 4}, {3, 5}, {4, 6}, {5, 6} Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 12 ¡Compute the Jaccard similarities of each pair of the following three sets: §A = {1, 2, 3, 4} §B = {2, 3, 5, 7} §C = {2, 4, 6} Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 13 ¡sim(A, B) = 2/6 = 1/3 ¡sim(A, C) = 2/5 ¡sim(B, C) = 1/6 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 14 ¡Consider two documents A and B §If their 3-shingle resemblance is 1 (using Jaccard similarity), does that mean that A and B are identical? §If so, prove it. If not, give a counterexample. Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 15 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 16 ¡Document D1 is a set of its k-shingles C1=S(D1) ¡Equivalently, each document is a 0/1 vector in the space of k-shingles §Each unique shingle is a dimension §Vectors are very sparse ¡A natural similarity measure is the Jaccard similarity: ¡ sim(D1, D2) = |C1ÇC2|/|C1ÈC2| Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 17 ¡No, the documents A and B need not be identical §Counterexample: §A: abab §3-shingles: S(A) = {aba, bab} §B: baba §3-shingles: S(B) = {bab, aba} § §sim(A, B) = | S(A) Ç S(B) | / | S(A) È S(B)| = 1 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 18 ¡For the matrix ¡ ¡ ¡ ¡ ¡ 1)Compute the minhash signature for each column (document) using the following hash functions: §h1(x) = 2x + 1 mod 6 §h2(x) = 3x + 2 mod 6 §h3(x) = 5x + 2 mod 6 2)Which of these hash functions are true permutations? 3)How close are the estimated Jaccard similarities for the six pairs of columns to the true Jaccard similarities? Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 19 Element D1 D2 D3 D4 0 0 1 0 1 1 0 1 0 0 2 1 0 0 1 3 0 0 1 0 4 0 0 1 1 5 1 0 0 0 ¡Rows = elements (e.g., shingles) ¡Columns = sets (e.g., documents) §1 in row e (shingle) and column s (document) if and only if e is a member of s §Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1) §Typical matrix is sparse! ¡Each document is a column: §Example: sim(C1 ,C2) = ? §Size of intersection = 3; size of union = 6, Jaccard similarity (not distance) = 3/6 §d(C1,C2) = 1 – (Jaccard similarity) = 3/6 ¡ Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 0 1 0 1 0 1 1 1 1 0 0 1 1 0 0 0 1 0 1 0 1 0 1 1 0 1 1 1 Documents 20 3 4 7 2 6 1 5 Signature matrix M 1 2 1 2 5 7 6 3 1 2 4 1 4 1 2 4 5 1 6 7 3 2 2 1 2 1 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 2nd element of the permutation is the first to map to a 1 4th element of the permutation is the first to map to a 1 0 1 0 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 Input matrix (Shingles x Documents) Permutation p 21 Min-Hashing Example 1)Compute the minhash signature for each column using the following hash functions: §h1(x) = 2x + 1 mod 6 §h2(x) = 3x + 2 mod 6 §h3(x) = 5x + 2 mod 6 §Hashes are computed on element IDs: § § § § § 2) 2)Which of these hash functions are true permutations: h3 only Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 22 Element D1 D2 D3 D4 h1(x) h2(x) h3(x) 0 0 1 0 1 1 2 2 1 0 1 0 0 3 5 1 2 1 0 0 1 5 2 0 3 0 0 1 0 1 5 5 4 0 0 1 1 3 2 4 5 1 0 0 0 5 5 3 D1 D2 D3 D4 5 1 1 1 2 2 2 2 0 1 4 0 Minhash signature: (rows correspond to hash functions) 3)How close are the estimated Jaccard similarities for the six pairs of columns (documents) to the true Jaccard similarities? 4) 4) ¡ ¡ ¡=> the estimated Jaccard similarities are not close to the true ones at all §To make the estimated similarity closer to the true one, there is a need of more and better (i.e., resulting in true permutations) hash functions Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 23 Jaccard similarities on D1 / D2 D1 / D3 D1 / D4 D2 / D3 D2 / D4 D3 / D4 Original documents 0 0 0.25 0 0.25 0.25 Minhash signatures 0.33 0.33 0.67 0.67 0.67 0.67 ¡Suppose we are maintaining a count of 1s using the DGIM method §Each bucket is represented by (i, t) §i – the number of 1s in the bucket §t – the bucket timestamp (time of the most recent 1) ¡Consider the following properties: §Current time is 200 §Window size is 60 §Current buckets are: §(16, 148) (8, 162) (8, 177) (4, 183) (2, 192) (1, 197) (1, 200) §At the next ten clocks (201 through 210), the stream has 0101010101 ¡What will the sequence of buckets be at the end of these ten inputs? Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 24 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 25 N 1 of size 2 2 of size 4 2 of size 8 At least 1 of size 16. Partially beyond window. 2 of size 1 1001010110001011010101010101011010101010101110101010111010100010110010 Each stream bit has a timestamp (starting 1, 2, …), recorded by modulo N A bucket is a record consisting of: (A) The timestamp of its end (B) The number of 1s between its beginning and end Three properties of buckets that are maintained: •Either one or two buckets with the same power-of-2 number of 1s •Buckets do not overlap in timestamps •Buckets are sorted by size Buckets disappear when their end-time is > N time units in the past 26 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) N 1001010110001011010101010101011010101010101110101010111010100010110010 ¡If the current bit is 1: §(1) Create a new bucket of size 1, for just this bit §End timestamp = current time §(2) If there are now three buckets of size 1, combine the oldest two into a bucket of size 2 §(3) If there are now three buckets of size 2, combine the oldest two into a bucket of size 4 §(4) And so on … 27 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 28 1001010110001011010101010101011010101010101110101010111010100010110010 0010101100010110101010101010110101010101011101010101110101000101100101 0010101100010110101010101010110101010101011101010101110101000101100101 0101100010110101010101010110101010101011101010101110101000101100101101 0101100010110101010101010110101010101011101010101110101000101100101101 0101100010110101010101010110101010101011101010101110101000101100101101 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Current state of the stream: Bit of value 1 arrives Two orange buckets get merged into a yellow bucket Next bit 1 arrives, new orange bucket is created, then 0 comes, then 1: Buckets get merged… State of the buckets after merging Updating buckets (example): ¡There are 5 occurrences of 1s in the upcoming stream 0101010101. Each one updates the buckets to be: §(1) Combine the oldest two buckets of size 1 ¡ (16, 148) (8, 162) (8, 177) (4, 183) (2, 192) (1, 197) (1, 200) (1, 202) ¡ => (16, 148) (8, 162) (8, 177) (4, 183) (2, 192) (2, 200) (1, 202) §(2) No combination needed ¡ (16, 148) (8, 162) (8, 177) (4, 183) (2, 192) (2, 200) (1, 202) (1, 204) §(3) Combine the oldest two buckets of size 1, and then oldest two buckets of size 2 ¡ (16, 148) (8, 162) (8, 177) (4, 183) (2, 192) (2, 200) (1, 202) (1, 204) (1, 206) ¡ => (16, 148) (8, 162) (8, 177) (4, 183) (2, 192) (2, 200) (2, 204) (1, 206) ¡ => (16, 148) (8, 162) (8, 177) (4, 183) (4, 200) (2, 204) (1, 206) §(4) No combination needed; window size is 60, so (16, 148) should be dropped ¡ (16, 148) (8, 162) (8, 177) (4, 183) (4, 200) (2, 204) (1, 206) (1, 208) ¡ => (8, 162) (8, 177) (4, 183) (4, 200) (2, 204) (1, 206) (1, 208) §(5) Combine the oldest two buckets of size 1 ¡ (8, 162) (8, 177) (4, 183) (4, 200) (2, 204) (1, 206) (1, 208) (1, 210) ¡ => (8, 162) (8, 177) (4, 183) (4, 200) (2, 204) (2, 208) (1, 210) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 29 ¡Assume the Bloom Filter technique, B as a single array of 8 bits, and the following two hash functions: §h1(x) = x mod 3, §h2(x) = x mod 7. ¡For the set S = {3, 5} of two keys and the stream 7, 12, ... of integer values: 1)Determine the content of the B bit array; 2)Apply the Bloom Filter technique to the first two stream values (i.e., 7 and 12) and decide whether they pass through the filter, or not; 3)What should be the best number of hash functions for this scenario with |S| = 2 keys and |B| = 8 bits? Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 30 ¡Consider: |S| = m, |B| = n ¡k hash functions h1 ,…, hk ¡Initialization: §Set B to all 0s §Hash each element sÎ S using each hash function hi, set B[hi(s)] = 1 (for each i = 1,.., k) ¡Run-time: §When a stream element with key x arrives §If B[hi(x)] = 1 for all i = 1,..., k then declare that x is in S §That is, x hashes to a bucket set to 1 for every hash function hi(x) §Otherwise discard the element x Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 31 (note: we have a single array B!) 0010001011000 Bit array B Stream item x Hash x by k functions Drop the item x Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 32 Number of hash functions, k Fraction of 1s in B is (1 – e^-km/n) False positive probability = (1 – e^-km/n)^k ¡Keys: S = {3, 5} Stream: 7, 12, ... ¡Hash functions: h1(x) = x mod 3 h2(x) = x mod 7 ¡ 1)The B bit array of size 8 contains 1s at the positions to which the keys are hashed by all the hash functions §h1(3) = 3 mod 3 = 0 h1(5) = 5 mod 3 = 2 §h2(3) = 3 mod 7 = 3 h2(5) = 5 mod 7 = 5 2)Stream values 7 and 12: §h1(7) = 7 mod 3 = 1 h1(12) = 12 mod 3 = 0 §h2(7) = 7 mod 7 = 0 h2(12) = 12 mod 7 = 5 3)The best number of hash functions for |S| = 2 keys and |B| = 8 bits: § n / m ∙ ln(2) = |B| / |S| ∙ ln(2) = 8 / 2 ∙ ln(2) = 2.77 ~ 3 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 33 1 0 1 1 0 1 0 0 B: è 7 does not pass; 12 passes