Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University http://disa.fi.muni.cz ¡Similarity search examples: §Images, faces, motions, time series… §+ visual examples Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 2 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors in high-dimensional space ¡Examples: §Pages with similar words §For duplicate detection, classification by topic §Customers who purchased similar products §Products with similar customer sets §Images with similar features §Users who visited similar websites Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) teaser_input 3 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 4 ¡Goal: Find near-neighbors in high-dim. space §We formally define “near neighbors” as points that are a “small distance” apart ¡For each application, we first need to define what “distance” means ¡Today: Jaccard distance/similarity §The Jaccard similarity of two sets is the size of their intersection divided by the size of their union: sim(C1, C2) = |C1ÇC2|/|C1ÈC2| §Jaccard distance: d(C1, C2) = 1 - |C1ÇC2|/|C1ÈC2| § § Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 3 in intersection 8 in union Jaccard similarity= 3/8 Jaccard distance = 5/8 6 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 7 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 8 Docu- ment The set of strings of length k that appear in the doc- ument Signatures: short integer vectors that represent the sets, and reflect their similarity Locality- Sensitive Hashing Candidate pairs: those pairs of signatures that we need to test for similarity Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 9 Step 1: Shingling: Convert documents to sets Docu- ment The set of strings of length k that appear in the doc- ument Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 11 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 12 ¡To compress long shingles, we can hash them to (say) 4 bytes ¡Represent a document by the set of hash values of its k-shingles §Idea: Two documents could (rarely) appear to have shingles in common, when in fact only the hash-values were shared ¡Example: k=2; document D1= abcab Set of 2-shingles: S(D1) = {ab, bc, ca} Hash the singles: h(D1) = {1, 5, 7} Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 13 ¡Document D1 is a set of its k-shingles C1=S(D1) ¡Equivalently, each document is a 0/1 vector in the space of k-shingles §Each unique shingle is a dimension §Vectors are very sparse ¡A natural similarity measure is the Jaccard similarity: ¡ sim(D1, D2) = |C1ÇC2|/|C1ÈC2| Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 14 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 15 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 16 Step 2: Minhashing: Convert large sets to short signatures, while preserving similarity Docu- ment The set of strings of length k that appear in the doc- ument Signatures: short integer vectors that represent the sets, and reflect their similarity Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 18 ¡Rows = elements (shingles) ¡Columns = sets (documents) §1 in row e and column s if and only if e is a member of s §Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1) §Typical matrix is sparse! ¡Each document is a column: §Example: sim(C1 ,C2) = ? §Size of intersection = 3; size of union = 6, Jaccard similarity (not distance) = 3/6 §d(C1,C2) = 1 – (Jaccard similarity) = 3/6 ¡ Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 0 1 0 1 0 1 1 1 1 0 0 1 1 0 0 0 1 0 1 0 1 0 1 1 0 1 1 1 Documents 19 ¡So far: §Documents ® Sets of shingles §Represent sets as boolean vectors in a matrix ¡Next goal: Find similar columns while computing small signatures §Similarity of columns == similarity of signatures Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 20 ¡Next Goal: Find similar columns, Small signatures ¡Naïve approach: §1) Signatures of columns: small summaries of columns §2) Examine pairs of signatures to find similar columns §Essential: Similarities of signatures and columns are related §3) Optional: Check that columns with similar signatures are really similar ¡Warnings: §Comparing all pairs may take too much time: Job for LSH §These methods can produce false negatives, and even false positives (if the optional check is not made) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 21 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 22 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 23 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 24 3 4 7 2 6 1 5 Signature matrix M 1 2 1 2 5 7 6 3 1 2 4 1 4 1 2 4 5 1 6 7 3 2 2 1 2 1 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 2nd element of the permutation is the first to map to a 1 4th element of the permutation is the first to map to a 1 0 1 0 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 Input matrix (Shingles x Documents) Permutation p Note: Another (equivalent) way is to store row indexes: 1 5 1 5 2 3 1 3 6 4 6 4 25 ¡Choose a random permutation p ¡Claim: Pr[hp(C1) = hp(C2)] = sim(C1, C2) ¡Why? §Let X be a doc (set of shingles), yÎ X is a shingle §Then: Pr[p(y) = min(p(X))] = 1/|X| §It is equally likely that any yÎ X is mapped to the min element §Let y be s.t. p(y) = min(p(C1ÈC2)) §Then either: p(y) = min(p(C1)) if y Î C1 , or § p(y) = min(p(C2)) if y Î C2 §So the prob. that both are true is the prob. y Î C1 Ç C2 §Pr[min(p(C1))=min(p(C2))]=|C1ÇC2|/|C1ÈC2|= sim(C1, C2) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) One of the two cols had to have 1 at position y 26 Size of the universe of all possible vals of min((C[1]C[2])) is |C[1]C[2]| and in |C[1]C[2]| of cases it can be that min((C[1]))=min((C[2])) which exactly the jaccard between C1 and C2 For two columns A and B, we have h_π(A) = h_π(B) exactly when the minimum hash value of the union A ∪ B lies in the intersection A ∩ B. Thus Pr[h_π(A) = h_π(B)] = |A ∩ B| / |A ∪ B|. ¡Given cols C1 and C2, rows may be classified as: § C1 C2 § A 1 1 § B 1 0 § C 0 1 § D 0 0 §a = # rows of type A, etc. ¡Note: sim(C1, C2) = a/(a +b +c) ¡Then: Pr[h(C1) = h(C2)] = Sim(C1, C2) §Look down the cols C1 and C2 until we see a 1 §If it’s a type-A row, then h(C1) = h(C2) If a type-B or type-C row, then not ¡ Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 27 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 28 Each agress with prob s. So to estimate s we compute what fraction of hash functions agree Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Similarities: 1-3 2-4 1-2 3-4 Col/Col 0.75 0.75 0 0 Sig/Sig 0.67 1.00 0 0 Signature matrix M 1 2 1 2 5 7 6 3 1 2 4 1 4 1 2 4 5 1 6 7 3 2 2 1 2 1 0 1 0 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 Input matrix (Shingles x Documents) 3 4 7 2 6 1 5 Permutation p 29 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 30 ¡Permuting rows even once is prohibitive ¡Row hashing! §Pick K = 100 hash functions ki §Ordering under ki gives a random row permutation! ¡One-pass implementation §For each column C and hash-func. ki keep a “slot” for the min-hash value §Initialize all sig(C)[i] = ¥ §Scan rows looking for 1s §Suppose row j has 1 in column C §Then for each ki : §If ki(j) < sig(C)[i], then sig(C)[i] ¬ ki(j) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) How to pick a random hash function h(x)? Universal hashing: ha,b(x)=((a·x+b) mod p) mod N where: a,b … random integers p … prime number (p > N) 31 Docu- ment The set of strings of length k that appear in the doc- ument Signatures: short integer vectors that represent the sets, and reflect their similarity Locality- Sensitive Hashing Candidate pairs: those pairs of signatures that we need to test for similarity Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 1 2 1 2 1 4 1 2 2 1 2 1 33 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 1 2 1 2 1 4 1 2 2 1 2 1 34 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 1 2 1 2 1 4 1 2 2 1 2 1 35 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Signature matrix M r rows per band b bands One signature 1 2 1 2 1 4 1 2 2 1 2 1 36 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 37 Matrix M r rows b bands Buckets Columns 2 and 6 are probably identical (candidate pair) Columns 6 and 7 are surely different. Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 38 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 39 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 1 2 1 2 1 4 1 2 2 1 2 1 40 ¡Find pairs of ³ s=0.8 similarity, set b=20, r=5 ¡Assume: sim(C1, C2) = 0.8 §Since sim(C1, C2) ³ s, we want C1, C2 to be a candidate pair: We want them to hash to at least 1 common bucket (at least one band is identical) ¡Probability C1, C2 identical in one particular band: (0.8)5 = 0.328 ¡Probability C1, C2 are not similar in all of the 20 bands: (1-0.328)20 = 0.00035 §i.e., about 1/3000th of the 80%-similar column pairs are false negatives (we miss them) §We would find 99.965% pairs of truly similar documents Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 1 2 1 2 1 4 1 2 2 1 2 1 41 ¡Find pairs of ³ s=0.8 similarity, set b=20, r=5 ¡Assume: sim(C1, C2) = 0.3 §Since sim(C1, C2) < s we want C1, C2 to hash to NO common buckets (all bands should be different) ¡Probability C1, C2 identical in one particular band: (0.3)5 = 0.00243 ¡Probability C1, C2 identical in at least 1 of 20 bands: 1 - (1 - 0.00243)20 = 0.0474 §In other words, approximately 4.74% pairs of docs with similarity 0.3% end up becoming candidate pairs §They are false positives since we will have to examine them (they are candidate pairs) but then it will turn out their similarity is below threshold s Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 1 2 1 2 1 4 1 2 2 1 2 1 42 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 1 2 1 2 1 4 1 2 2 1 2 1 43 Similarity t =sim(C1, C2) of two sets Probability of sharing a bucket No chance if t < s Probability = 1 if t > s Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 44 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Remember: Probability of equal hash-values = similarity Similarity t =sim(C1, C2) of two sets Probability of sharing a bucket 45 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 46 t r All rows of a band are equal 1 - Some row of a band unequal ( )b No bands identical 1 - At least one band identical s ~ (1/b)1/r Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Similarity t=sim(C1, C2) of two sets Probability of sharing a bucket 47 ¡Similarity threshold s ¡Prob. that at least 1 band is identical: Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) s 1-(1-sr)b .2 .006 .3 .047 .4 .186 .5 .470 .6 .802 .7 .975 .8 .9996 48 ¡Picking r and b to get the best S-curve §50 hash-functions (r=5, b=10) Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) Blue area: False Negative rate Green area: False Positive rate Similarity 49 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 50 ¡Shingling: Convert documents to sets §We used hashing to assign each shingle an ID ¡Min-Hashing: Convert large sets to short signatures, while preserving similarity §We used similarity preserving hashing to generate signatures with property Pr[hp(C1) = hp(C2)] = sim(C1, C2) §We used hashing to get around generating random permutations ¡Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents §We used hashing to find candidate pairs of similarity ³ s § Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 51