Similarity Search by Aggregation of Multiple
Pivot-Permutation Ranking
David Novak, Pavel Zezula
Masaryk University
Brno, Czech Republic
}w¡¢£¤¥¦§¨!"#$%&123456789@ACDEFGHIPQRS`ye|
CEMI meeting, April 16, 2014
David Novak (MU Brno) PPP-Codes CEMI meeting 1 / 17
Outline of the Talk
1 Approximate Distance-based Similarity Search
2 PPP-Codes Approach
Multiple Pivot Space Partitioning
Ranking of the Data Objects
Indexing and Searching
3 Eﬃciency of our Approach
4 Summary
David Novak (MU Brno) PPP-Codes CEMI meeting 2 / 17
Approximate Distance-based Similarity Search
Preliminaries
generic similarity search
data modeled as metric space (D, δ), where D is a domain of objects
and δ is a total distance function δ : D × D −→ R+
0 satisfying
postulates of identity, symmetry, and triangle inequality
David Novak (MU Brno) PPP-Codes CEMI meeting 3 / 17
Approximate Distance-based Similarity Search
Preliminaries
generic similarity search
data modeled as metric space (D, δ), where D is a domain of objects
and δ is a total distance function δ : D × D −→ R+
0 satisfying
postulates of identity, symmetry, and triangle inequality
query by example: K-NN(q) returns K objects x from the dataset
X ⊆ D with the smallest δ(q, x)
David Novak (MU Brno) PPP-Codes CEMI meeting 3 / 17
Approximate Distance-based Similarity Search
Preliminaries
generic similarity search
data modeled as metric space (D, δ), where D is a domain of objects
and δ is a total distance function δ : D × D −→ R+
0 satisfying
postulates of identity, symmetry, and triangle inequality
query by example: K-NN(q) returns K objects x from the dataset
X ⊆ D with the smallest δ(q, x)
dataset X may be very large
distance function δ may be time consuming
David Novak (MU Brno) PPP-Codes CEMI meeting 3 / 17
Approximate Distance-based Similarity Search
Preliminaries
generic similarity search
data modeled as metric space (D, δ), where D is a domain of objects
and δ is a total distance function δ : D × D −→ R+
0 satisfying
postulates of identity, symmetry, and triangle inequality
query by example: K-NN(q) returns K objects x from the dataset
X ⊆ D with the smallest δ(q, x)
dataset X may be very large
distance function δ may be time consuming
requires approximate search
David Novak (MU Brno) PPP-Codes CEMI meeting 3 / 17
Approximate Distance-based Similarity Search
Motivation
current indexes for large-scale approximate search:
dataset X is partitioned
given query q, the “most-promising” partitions form the candidate set
the candidate set SC is reﬁned by calculating δ(q, x), ∀x ∈ SC
David Novak (MU Brno) PPP-Codes CEMI meeting 4 / 17
Approximate Distance-based Similarity Search
Motivation
current indexes for large-scale approximate search:
dataset X is partitioned
given query q, the “most-promising” partitions form the candidate set
the candidate set SC is reﬁned by calculating δ(q, x), ∀x ∈ SC
reading and reﬁnement of SC form majority of the search costs
accuracy of the candidate set is key
David Novak (MU Brno) PPP-Codes CEMI meeting 4 / 17
PPP-Codes Approach
Our Approach in a Nutshell
1 data space is partitioned multiple-times independently
each partitioning is deﬁned by one pivot space
David Novak (MU Brno) PPP-Codes CEMI meeting 5 / 17
PPP-Codes Approach
Our Approach in a Nutshell
1 data space is partitioned multiple-times independently
each partitioning is deﬁned by one pivot space
2 given query q, multiple ranked candidate sets are generated
David Novak (MU Brno) PPP-Codes CEMI meeting 5 / 17
PPP-Codes Approach
Our Approach in a Nutshell
1 data space is partitioned multiple-times independently
each partitioning is deﬁned by one pivot space
2 given query q, multiple ranked candidate sets are generated
3 these multiple candidate rankings are eﬀectively merged
the merged candidate set is smaller and more accurate
David Novak (MU Brno) PPP-Codes CEMI meeting 5 / 17
PPP-Codes Approach
Our Approach in a Nutshell
1 data space is partitioned multiple-times independently
each partitioning is deﬁned by one pivot space
2 given query q, multiple ranked candidate sets are generated
3 these multiple candidate rankings are eﬀectively merged
the merged candidate set is smaller and more accurate
4 the ﬁnal candidate set is retrieved and reﬁned
David Novak (MU Brno) PPP-Codes CEMI meeting 5 / 17
PPP-Codes Approach Multiple Pivot Space Partitioning
Voronoi Partitioning & Pivot Permutations
Pivot space is deﬁned by a set of k pivots {p1, . . . , pk} ⊆ D
David Novak (MU Brno) PPP-Codes CEMI meeting 6 / 17
PPP-Codes Approach Multiple Pivot Space Partitioning
Voronoi Partitioning & Pivot Permutations
Pivot space is deﬁned by a set of k pivots {p1, . . . , pk} ⊆ D
David Novak (MU Brno) PPP-Codes CEMI meeting 6 / 17
PPP-Codes Approach Multiple Pivot Space Partitioning
Voronoi Partitioning & Pivot Permutations
Pivot space is deﬁned by a set of k pivots {p1, . . . , pk} ⊆ D
David Novak (MU Brno) PPP-Codes CEMI meeting 6 / 17
PPP-Codes Approach Multiple Pivot Space Partitioning
Voronoi Partitioning & Pivot Permutations
Pivot space is deﬁned by a set of k pivots {p1, . . . , pk} ⊆ D
Formally:
object x ∈ X is mapped to its
pivot permutation (PP):
Πx on {1, . . . , k} such that Πx (i)
is the i-th closest pivot from x
David Novak (MU Brno) PPP-Codes CEMI meeting 6 / 17
PPP-Codes Approach Multiple Pivot Space Partitioning
Voronoi Partitioning & Pivot Permutations
Pivot space is deﬁned by a set of k pivots {p1, . . . , pk} ⊆ D
Formally:
object x ∈ X is mapped to its
pivot permutation (PP):
Πx on {1, . . . , k} such that Πx (i)
is the i-th closest pivot from x
each Voronoi cell corresponds to
a pivot permutation preﬁx (PPP)
of length l: Πx (1..l)
David Novak (MU Brno) PPP-Codes CEMI meeting 6 / 17
PPP-Codes Approach Multiple Pivot Space Partitioning
Multiple Pivot Space Partitioning
We propose to create λ independent pivot space partitionings
p
1
1
p2
1
p3
1
p4
1
p5
1
p6
1
p7
1
p8
1
x5
p1
2
p2
2
p3
2
p4
2
p5
2
p6
2
p7
2
p
8
2
x
5
David Novak (MU Brno) PPP-Codes CEMI meeting 7 / 17
PPP-Codes Approach Multiple Pivot Space Partitioning
Multiple Pivot Space Partitioning
We propose to create λ independent pivot space partitionings
p
1
1
p2
1
p3
1
p4
1
p5
1
p6
1
p7
1
p8
1
x5
p1
2
p2
2
p3
2
p4
2
p5
2
p6
2
p7
2
p
8
2
x
5
data objects x ∈ X are encoded as
PPP1..λ
l (x) = Π1
x (1..l), . . . , Πλ
x (1..l)
David Novak (MU Brno) PPP-Codes CEMI meeting 7 / 17
PPP-Codes Approach Multiple Pivot Space Partitioning
Multiple Pivot Space Partitioning
We propose to create λ independent pivot space partitionings
p
1
1
p2
1
p3
1
p4
1
p5
1
p6
1
p7
1
p8
1
x5
p1
2
p2
2
p3
2
p4
2
p5
2
p6
2
p7
2
p
8
2
x
5
data objects x ∈ X are encoded as
PPP1..λ
l (x) = Π1
x (1..l), . . . , Πλ
x (1..l)
in the example above λ = 2, k = 8, l = 4:
PPP1..2
4 (x5) = 7, 4, 8, 5 , 7, 8, 4, 6
David Novak (MU Brno) PPP-Codes CEMI meeting 7 / 17
PPP-Codes Approach Ranking of the Data Objects
Ranking within a Single Pivot Space
Task: Having data x ∈ X encoded by PPP Πx (1..l) (single recursive
Voronoi partitioning), deﬁne ranking of the PPPs with respect to q ∈ D
David Novak (MU Brno) PPP-Codes CEMI meeting 8 / 17
PPP-Codes Approach Ranking of the Data Objects
Ranking within a Single Pivot Space
Task: Having data x ∈ X encoded by PPP Πx (1..l) (single recursive
Voronoi partitioning), deﬁne ranking of the PPPs with respect to q ∈ D
p3
p2
p4
C<2,1>
C<1,2>
C<4,3>
<4,1>C
C<3,4>
<1,3>C
q
p1
C<3,2>
C<1,4>
CC <2,3><3,1>
David Novak (MU Brno) PPP-Codes CEMI meeting 8 / 17
PPP-Codes Approach Ranking of the Data Objects
Ranking within a Single Pivot Space
Solution: We deﬁne distance between Voronoi cell C i1,...,il
and query q as
a weighted arithmetic mean of distances δ(q, pi1 ), . . . , δ(q, pil
)
p3
p2
p4
C<2,1>
C<1,2>
C<4,3>
<4,1>C
C<3,4>
<1,3>C
q
p1
C<3,2>
C<1,4>
CC <2,3><3,1>
David Novak (MU Brno) PPP-Codes CEMI meeting 8 / 17
PPP-Codes Approach Ranking of the Data Objects
Ranking using Multiple Pivot Spaces
Task: Having λ rankings of PPPs from λ pivot spaces, aggregate these
rankings eﬀectively into a ﬁnal ranking
David Novak (MU Brno) PPP-Codes CEMI meeting 9 / 17
PPP-Codes Approach Ranking of the Data Objects
Ranking using Multiple Pivot Spaces
Task: Having λ rankings of PPPs from λ pivot spaces, aggregate these
rankings eﬀectively into a ﬁnal ranking
q ∈ D
ψq
1
: {x y1
y2
} {y3
y4
y5
} {y6
} ...
ψq
2
: {y3
y2
} {y1
y4
y6
y7
} {x y8
} ...
ψq
3
: {x} {y3
y4
y5
} {y2
y6
} ...
ψq
4
: {y1
y2
} {y3
y4
y5
} {y8
} {y6
} ...
ψq
5
: {y1
y2
} {y4
y5
} {y3
} {x y7
} ...
objects with the rank '1'
rank '2'
rank '3'
David Novak (MU Brno) PPP-Codes CEMI meeting 9 / 17
PPP-Codes Approach Ranking of the Data Objects
Ranking using Multiple Pivot Spaces
Solution: Ranking of object x is p-percentile (e.g. median) of its λ ranks
Ψp(q, x) = percentilep(ψ1
q(x), ψ2
q(x), . . . , ψλ
q (x))
q ∈ D
ψq
1
: {x y1
y2
} {y3
y4
y5
} {y6
} ...
ψq
2
: {y3
y2
} {y1
y4
y6
y7
} {x y8
} ...
ψq
3
: {x} {y3
y4
y5
} {y2
y6
} ...
ψq
4
: {y1
y2
} {y3
y4
y5
} {y8
} {y6
} ...
ψq
5
: {y1
y2
} {y4
y5
} {y3
} {x y7
} ...
objects with the rank '1'
Ψ0.5
(q, x) = percentile0.5
{1, 1, 3, 4, ?} = 3
rank '2'
rank '3'
David Novak (MU Brno) PPP-Codes CEMI meeting 9 / 17
PPP-Codes Approach Ranking of the Data Objects
Idea Behind the Rank Aggregation
the Voronoi cells span large areas of the space
David Novak (MU Brno) PPP-Codes CEMI meeting 10 / 17
PPP-Codes Approach Ranking of the Data Objects
Idea Behind the Rank Aggregation
the Voronoi cells span large areas of the space
given a query, the “close” cells contain also distant data objects
there is many more distant ones
David Novak (MU Brno) PPP-Codes CEMI meeting 10 / 17
PPP-Codes Approach Ranking of the Data Objects
Idea Behind the Rank Aggregation
the Voronoi cells span large areas of the space
given a query, the “close” cells contain also distant data objects
there is many more distant ones
having several “orthogonal” partitionings
the query-relevant objects should be often at top positions
the distant objects vary
David Novak (MU Brno) PPP-Codes CEMI meeting 10 / 17
PPP-Codes Approach Ranking of the Data Objects
Idea Behind the Rank Aggregation
the Voronoi cells span large areas of the space
given a query, the “close” cells contain also distant data objects
there is many more distant ones
having several “orthogonal” partitionings
the query-relevant objects should be often at top positions
the distant objects vary
the percentile-based aggregation increases probability that
query-relevant objects are ranked higher than the distant ones
David Novak (MU Brno) PPP-Codes CEMI meeting 10 / 17
PPP-Codes Approach Indexing and Searching
Indexing the PPP-Codes
We build trie-like structure for each pivot space
leafs: only suﬃxes of PPPs (spare memory)
dynamic splits to optimize the memory usage
possible grouping and delta-encoding of IDs in leaves
1 2 k3 ...
2 k3 ... 1 k3 ...Π(2)=
Π(1)=
3 k4 ...Π(3)= −1k
−1k
Π l(4.. ),ID Π l(4.. ),ID Π l(4.. ),ID Π l(4.. ),ID
Π l(3.. ),ID
...
1 3 ...
1 2 ...
... ... ...
...
David Novak (MU Brno) PPP-Codes CEMI meeting 11 / 17
PPP-Codes Approach Indexing and Searching
PPPRank: The Search Algorithm
ψq
1
: {x y1
y2
} {y3
y4
y5
} {y6
} ...
ψq
2
: {y3
y2
} {y1
y4
y6
y7
} {x y8
} ...
ψq
3
: {x} {y3
y4
y5
} {y2
y6
} ...
ψq
4
: {y1
y2
} {y3
y4
y5
} {y8
} {y6
} ...
ψq
5
: {y1
y2
} {y4
y5
} {y3
} {x y7
} ...
objects with the rank '1'
Ψ0.5
(q, x) = percentile0.5
{1, 1, 3, 4, ?} = 3
rank '2'
rank '3'
Given query q ∈ D, our search algorithm:
David Novak (MU Brno) PPP-Codes CEMI meeting 12 / 17
PPP-Codes Approach Indexing and Searching
PPPRank: The Search Algorithm
ψq
1
: {x y1
y2
} {y3
y4
y5
} {y6
} ...
ψq
2
: {y3
y2
} {y1
y4
y6
y7
} {x y8
} ...
ψq
3
: {x} {y3
y4
y5
} {y2
y6
} ...
ψq
4
: {y1
y2
} {y3
y4
y5
} {y8
} {y6
} ...
ψq
5
: {y1
y2
} {y4
y5
} {y3
} {x y7
} ...
objects with the rank '1'
Ψ0.5
(q, x) = percentile0.5
{1, 1, 3, 4, ?} = 3
rank '2'
rank '3'
Given query q ∈ D, our search algorithm:
1 generates one-by-one individual rankings ψj
q
(GetNextIDs algorithm, it uses the trie structures)
David Novak (MU Brno) PPP-Codes CEMI meeting 12 / 17
PPP-Codes Approach Indexing and Searching
PPPRank: The Search Algorithm
ψq
1
: {x y1
y2
} {y3
y4
y5
} {y6
} ...
ψq
2
: {y3
y2
} {y1
y4
y6
y7
} {x y8
} ...
ψq
3
: {x} {y3
y4
y5
} {y2
y6
} ...
ψq
4
: {y1
y2
} {y3
y4
y5
} {y8
} {y6
} ...
ψq
5
: {y1
y2
} {y4
y5
} {y3
} {x y7
} ...
objects with the rank '1'
Ψ0.5
(q, x) = percentile0.5
{1, 1, 3, 4, ?} = 3
rank '2'
rank '3'
Given query q ∈ D, our search algorithm:
1 generates one-by-one individual rankings ψj
q
(GetNextIDs algorithm, it uses the trie structures)
2 outputs objects with the best aggregated ranks
(PPPRank algorithm based on MedRank by Fagin et al.)
David Novak (MU Brno) PPP-Codes CEMI meeting 12 / 17
PPP-Codes Approach Indexing and Searching
PPPRank: The Search Algorithm
calculate λ∙k
query-pivot
distances δ(q,pi
j
)
K-NN(q) PPPRank(q,p,R):
merge λ ranks to
get top R objects
GetNextIDs(q,2):
generate ψq
2
ranking
GetNextIDs(q,1):
generate ψq
1
ranking
... GetNextIDs(q,λ):
generate ψq
λ
ranking
retrieve
R objects
SSD
refine R objects
by δ(q,x)
1 2
3
4 5
Given query q ∈ D, our search algorithm:
1 generates one-by-one individual rankings ψj
q
(GetNextIDs algorithm, it uses the trie structures)
2 outputs objects with the best aggregated ranks
(PPPRank algorithm based on MedRank by Fagin et al.)
David Novak (MU Brno) PPP-Codes CEMI meeting 12 / 17
Eﬃciency of our Approach
Evaluation: Accuracy of the Candidate Set
Given K-NN, we consider recall(A) = |A∩AP |
K · 100% vs. candidate set size
David Novak (MU Brno) PPP-Codes CEMI meeting 13 / 17
Eﬃciency of our Approach
Evaluation: Accuracy of the Candidate Set
Given K-NN, we consider recall(A) = |A∩AP |
K · 100% vs. candidate set size
λ=1
λ=2
λ=4
λ=6
λ=8
10
100
1000
10000
32 64 128 256
candidatesetsizeR
0.2
0.2
0.2
0.2
0.05
0.07
0.05
0.06
k# of pivots
Candidate set size R necessary to achieve 80% of 1-NN recall
Settings: 1M CoPhIR dataset, l = 8 and p = 0.75
David Novak (MU Brno) PPP-Codes CEMI meeting 13 / 17
Eﬃciency of our Approach
Experimental Evaluation Criteria
three datasets:
100M CoPhIR (280-dim, complex metric, obj.: 600 B, δ time 0.01 ms)
1M SQFD (quadratic form distance, obj.: 2 kB, δ time 0.5 ms)
10M ADJ ([0, 1]32 uniform, L2, obj.: 0.5–4.0 kB, δ time 0.001–1.0 ms)
David Novak (MU Brno) PPP-Codes CEMI meeting 14 / 17
Eﬃciency of our Approach
Experimental Evaluation Criteria
three datasets:
100M CoPhIR (280-dim, complex metric, obj.: 600 B, δ time 0.01 ms)
1M SQFD (quadratic form distance, obj.: 2 kB, δ time 0.5 ms)
10M ADJ ([0, 1]32 uniform, L2, obj.: 0.5–4.0 kB, δ time 0.001–1.0 ms)
technical evaluation of our approach:
mutual inﬂuence of various parameters to recall
k, l, λ, p, size of the PPP-Code representation
David Novak (MU Brno) PPP-Codes CEMI meeting 14 / 17
Eﬃciency of our Approach
Experimental Evaluation Criteria
three datasets:
100M CoPhIR (280-dim, complex metric, obj.: 600 B, δ time 0.01 ms)
1M SQFD (quadratic form distance, obj.: 2 kB, δ time 0.5 ms)
10M ADJ ([0, 1]32 uniform, L2, obj.: 0.5–4.0 kB, δ time 0.001–1.0 ms)
technical evaluation of our approach:
mutual inﬂuence of various parameters to recall
k, l, λ, p, size of the PPP-Code representation
k ∈ {64, 128, 256, 512}, l = 8, λ = 5, p = 0.5 (3rd rank out of 5)
David Novak (MU Brno) PPP-Codes CEMI meeting 14 / 17
Eﬃciency of our Approach
Evaluation: Candidate Set vs. Recall
candidate set size R vs. recall
David Novak (MU Brno) PPP-Codes CEMI meeting 15 / 17
Eﬃciency of our Approach
Evaluation: Candidate Set vs. Recall
candidate set size R vs. recall
1−NN recall (left axis)
10−NN recall (left axis)
50−NN recall (left axis)
search time (righ axis)
0
20
40
60
80
100
2000 4000 6000 8000 10000
0
200
400
600
800
1000
1200
1400
K−NNrecall
searchtime[ms]
candidate set size R
Recall and search time on while increasing candidate set size R.
Settings: 100M CoPhIR dataset, k = 512
David Novak (MU Brno) PPP-Codes CEMI meeting 15 / 17
Eﬃciency of our Approach
Evaluation: Tradeoﬀ
complexity of the PPPRank algorithm vs. candidate set reduction
David Novak (MU Brno) PPP-Codes CEMI meeting 16 / 17
Eﬃciency of our Approach
Evaluation: Tradeoﬀ
complexity of the PPPRank algorithm vs. candidate set reduction
PPP-Code / size of an object [bytes]
M-Index [ms] 512 1024 2048 4096
δtime
0.001 ms 370 / 240 370 / 410 370 / 1270 370 / 1700
0.01 ms 380 / 660 380 / 750 380 / 1350 380 / 1850
0.1 ms 400 / 5400 400 / 5400 420 / 5500 420 / 5700
1 ms 1100 / 52500 1100 / 52500 1100 / 52500 1100 / 52500
Search times [ms] of PPP-Codes / M-Index smaller search times are in boldface.
Settings: 10M ADJUSTABLE dataset, 10-NN recall = 85 %, k = 128;
PPP-Codes: R = 1000; M-Index: R = 400000
David Novak (MU Brno) PPP-Codes CEMI meeting 16 / 17
Summary
Conclusions
The PPP-Codes technique
use multiple pivot spaces to encode data objects
rank data with respect to query within individual pivot spaces
ﬁnal candidate set is aggregation of these rankings
eﬃcient indexing and searching mechanisms are deﬁned
David Novak (MU Brno) PPP-Codes CEMI meeting 17 / 17
Summary
Conclusions
The PPP-Codes technique
use multiple pivot spaces to encode data objects
rank data with respect to query within individual pivot spaces
ﬁnal candidate set is aggregation of these rankings
eﬃcient indexing and searching mechanisms are deﬁned
The results show that
even two pivot spaces help, more than ﬁve do not help much
the candidate set is reduced by one–two orders of magnitude
the rank & merge algorithm is complex but usually worth
David Novak (MU Brno) PPP-Codes CEMI meeting 17 / 17