Advanced Search Techniques for Large Scale Data Analytics
Pavel Zezula and Jan Sedmidubsky
Masaryk University
http://disa.fi.muni.cz

¡Similarity search examples:
§Images, faces, motions, time series…
§+ visual examples
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
2

¡Many problems can be expressed as
finding “similar” sets:
§Find near-neighbors in high-dimensional space
¡Examples:
§Pages with similar words
§For duplicate detection, classification by topic
§Customers who purchased similar products
§Products with similar customer sets
§Images with similar features
§Users who visited similar websites
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
teaser_input
3

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
4


¡Goal: Find near-neighbors in high-dim. space
§We formally define “near neighbors” as
points that are a “small distance” apart
¡For each application, we first need to define what “distance” means
¡Today: Jaccard distance/similarity
§The Jaccard similarity of two sets is the size of their intersection divided by the size of their
union:
sim(C1, C2) = |C1ÇC2|/|C1ÈC2|
§Jaccard distance: d(C1, C2) = 1 - |C1ÇC2|/|C1ÈC2|
§
§
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
3 in intersection
8 in union
Jaccard similarity= 3/8
Jaccard distance = 5/8
6

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
7


Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
8


Docu-
ment
The set
of strings
of length k
that appear
in the doc-
ument
Signatures:
short integer
vectors that
represent the
sets, and
reflect their
similarity
Locality-
Sensitive
Hashing
Candidate
pairs:
those pairs
of signatures
that we need
to test for
similarity
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
9

Step 1: Shingling: Convert documents to sets
Docu-
ment
The set
of strings
of length k
that appear
in the doc-
ument

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
11


Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
12


¡To compress long shingles, we can hash them to (say) 4 bytes
¡Represent a document by the set of hash values of its k-shingles
§Idea: Two documents could (rarely) appear to have shingles in common, when in fact only the
hash-values were shared
¡Example: k=2; document D1= abcab
Set of 2-shingles: S(D1) = {ab, bc, ca}
Hash the singles: h(D1) = {1, 5, 7}
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
13

¡Document D1 is a set of its k-shingles C1=S(D1)
¡Equivalently, each document is a
0/1 vector in the space of k-shingles
§Each unique shingle is a dimension
§Vectors are very sparse
¡A natural similarity measure is the
Jaccard similarity:
¡ sim(D1, D2) = |C1ÇC2|/|C1ÈC2|
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
14

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
15


Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
16


Step 2: Minhashing: Convert large sets to short signatures, while preserving similarity
Docu-
ment
The set
of strings
of length k
that appear
in the doc-
ument
Signatures:
short integer
vectors that
represent the
sets, and
reflect their
similarity

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
18


¡Rows = elements (shingles)
¡Columns = sets (documents)
§1 in row e and column s if and only if e is a member of s
§Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
§Typical matrix is sparse!
¡Each document is a column:
§Example: sim(C1 ,C2) = ?
§Size of intersection = 3; size of union = 6,
Jaccard similarity (not distance) = 3/6
§d(C1,C2) = 1 – (Jaccard similarity) = 3/6
¡
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
0
1
0
1
0
1
1
1
1
0
0
1
1
0
0
0
1
0
1
0
1
0
1
1
0
1
1
1
Documents
19

¡So far:
§Documents ® Sets of shingles
§Represent sets as boolean vectors in a matrix
¡Next goal: Find similar columns while computing small signatures
§Similarity of columns == similarity of signatures
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
20

¡Next Goal: Find similar columns, Small signatures
¡Naïve approach:
§1) Signatures of columns: small summaries of columns
§2) Examine pairs of signatures to find similar columns
§Essential: Similarities of signatures and columns are related
§3) Optional: Check that columns with similar signatures are really similar
¡Warnings:
§Comparing all pairs may take too much time: Job for LSH
§These methods can produce false negatives, and even false positives (if the optional check is not
made)
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
21

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
22


Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
23


Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
24


3
4
7
2
6
1
5
Signature matrix M
1
2
1
2
5
7
6
3
1
2
4
1
4
1
2
4
5
1
6
7
3
2
2
1
2
1
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
2nd element of the permutation is the first to map to a 1
4th element of the permutation is the first to map to a 1
0
1
0
1
0
1
0
1
1
0
1
0
1
0
1
0
1
0
1
0
1
0
0
1
0
1
0
1
Input matrix (Shingles x Documents)
Permutation p
Note: Another (equivalent) way is to
store row indexes:
1
5
1
5
2
3
1
3
6
4
6
4
25

¡Choose a random permutation p
¡Claim: Pr[hp(C1) = hp(C2)] = sim(C1, C2)
¡Why?
§Let X be a doc (set of shingles), yÎ X is a shingle
§Then: Pr[p(y) = min(p(X))] = 1/|X|
§It is equally likely that any yÎ X is mapped to the min element
§Let y be s.t. p(y) = min(p(C1ÈC2))
§Then either: p(y) = min(p(C1))  if y Î C1 , or
§ p(y) = min(p(C2))  if y Î C2
§So the prob. that both are true is the prob. y Î C1 Ç C2
§Pr[min(p(C1))=min(p(C2))]=|C1ÇC2|/|C1ÈC2|= sim(C1, C2)
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
One of the two
cols had to have
1 at position y
26

Size of the universe of all possible vals of min((C[1]C[2])) is |C[1]C[2]| and in |C[1]C[2]| of
cases it can be that min((C[1]))=min((C[2])) which exactly the jaccard between C1 and C2
For two columns A and B, we have h_π(A) = h_π(B) exactly when the minimum hash value of the union A
∪ B lies in the intersection A ∩ B. Thus Pr[h_π(A) = h_π(B)] = |A ∩ B| / |A ∪ B|.

¡Given cols C1 and C2, rows may be classified as:
§ C1 C2
§ A 1 1
§ B 1 0
§ C 0 1
§ D 0 0
§a = # rows of type A, etc.
¡Note: sim(C1, C2) = a/(a +b +c)
¡Then: Pr[h(C1) = h(C2)] = Sim(C1, C2)
§Look down the cols C1 and C2 until we see a 1
§If it’s a type-A row, then h(C1) = h(C2)
If a type-B or type-C row, then not
¡
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
27

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
28

Each agress with prob s.
So to estimate s we compute what fraction of hash functions agree

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Similarities:
                   1-3      2-4    1-2   3-4
Col/Col   0.75    0.75    0       0
Sig/Sig   0.67    1.00    0       0
Signature matrix M
1
2
1
2
5
7
6
3
1
2
4
1
4
1
2
4
5
1
6
7
3
2
2
1
2
1
0
1
0
1
0
1
0
1
1
0
1
0
1
0
1
0
1
0
1
0
1
0
0
1
0
1
0
1
Input matrix (Shingles x Documents)
3
4
7
2
6
1
5
Permutation p
29

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
30


¡Permuting rows even once is prohibitive
¡Row hashing!
§Pick K = 100 hash functions ki
§Ordering under ki gives a random row permutation!
¡One-pass implementation
§For each column C and hash-func. ki keep a “slot” for the min-hash value
§Initialize all sig(C)[i] = ¥
§Scan rows looking for 1s
§Suppose row j has 1 in column C
§Then for each ki :
§If ki(j) < sig(C)[i], then sig(C)[i] ¬ ki(j)
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
How to pick a random
hash function h(x)?
Universal hashing:
ha,b(x)=((a·x+b) mod p) mod N
where:
a,b … random integers
p … prime number (p > N)
31

Docu-
ment
The set
of strings
of length k
that appear
in the doc-
ument
Signatures:
short integer
vectors that
represent the
sets, and
reflect their
similarity
Locality-
Sensitive
Hashing
Candidate
pairs:
those pairs
of signatures
that we need
to test for
similarity

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
1
2
1
2
1
4
1
2
2
1
2
1
33

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
1
2
1
2
1
4
1
2
2
1
2
1
34

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
1
2
1
2
1
4
1
2
2
1
2
1
35

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Signature matrix  M
r rows
per band
b  bands
   One
signature
1
2
1
2
1
4
1
2
2
1
2
1
36

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
37


Matrix M
r rows
b bands
Buckets
Columns 2 and 6
are probably identical
(candidate pair)
Columns 6 and 7 are
surely different.
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
38

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
39


Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
1
2
1
2
1
4
1
2
2
1
2
1
40

¡Find pairs of ³ s=0.8 similarity, set b=20, r=5
¡Assume: sim(C1, C2) = 0.8
§Since sim(C1, C2) ³ s, we want C1, C2 to be a candidate pair: We want them to hash to at least 1
common bucket (at least one band is identical)
¡Probability C1, C2 identical in one particular
band: (0.8)5 = 0.328
¡Probability C1, C2 are not similar in all of the 20 bands: (1-0.328)20 = 0.00035
§i.e., about 1/3000th of the 80%-similar column pairs
are false negatives (we miss them)
§We would find 99.965% pairs of truly similar documents
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
1
2
1
2
1
4
1
2
2
1
2
1
41

¡Find pairs of ³ s=0.8 similarity, set b=20, r=5
¡Assume: sim(C1, C2) = 0.3
§Since sim(C1, C2) < s we want C1, C2 to hash to NO
common buckets (all bands should be different)
¡Probability C1, C2 identical in one particular band: (0.3)5  = 0.00243
¡Probability C1, C2 identical in at least 1 of 20 bands: 1 - (1 - 0.00243)20 = 0.0474
§In other words, approximately 4.74% pairs of docs with similarity 0.3% end up becoming candidate
pairs
§They are false positives since we will have to examine them (they are candidate pairs) but then it
will turn out their similarity is below threshold s
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
1
2
1
2
1
4
1
2
2
1
2
1
42

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
1
2
1
2
1
4
1
2
2
1
2
1
43

       Similarity t =sim(C1, C2) of two sets
Probability
of sharing
a bucket
No chance
if t < s
Probability = 1 if t > s
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
44

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Remember:
Probability of
equal hash-values
= similarity
       Similarity t =sim(C1, C2) of two sets
Probability
of sharing
a bucket
45

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
46


t r
All rows
of a band
are equal
1 -
Some row
of a band
unequal
(
)b
No bands
identical
1 -
At least
one band
identical
s ~ (1/b)1/r
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
       Similarity t=sim(C1, C2) of two sets
Probability
of sharing
a bucket
47

¡Similarity threshold s
¡Prob. that at least 1 band is identical:
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
 s
 1-(1-sr)b
.2
   .006
.3
   .047
.4
   .186
.5
   .470
.6
   .802
.7
   .975
.8
   .9996
48

¡Picking r and b to get the best S-curve
§50 hash-functions (r=5, b=10)
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Blue area: False Negative rate
Green area: False Positive rate
Similarity
49

Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
50


¡Shingling: Convert documents to sets
§We used hashing to assign each shingle an ID
¡Min-Hashing: Convert large sets to short signatures, while preserving similarity
§We used similarity preserving hashing to generate signatures with property Pr[hp(C1) = hp(C2)] =
sim(C1, C2)
§We used hashing to get around generating random permutations
¡Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
§We used hashing to find candidate pairs of similarity ³ s
§
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
51