Similarity Search by Aggregation of Multiple Pivot-Permutation Ranking David Novak, Pavel Zezula Masaryk University Brno, Czech Republic }w¡¢£¤¥¦§¨!"#$%&123456789@ACDEFGHIPQRS`ye| CEMI meeting, April 16, 2014 David Novak (MU Brno) PPP-Codes CEMI meeting 1 / 17 Outline of the Talk 1 Approximate Distance-based Similarity Search 2 PPP-Codes Approach Multiple Pivot Space Partitioning Ranking of the Data Objects Indexing and Searching 3 Efficiency of our Approach 4 Summary David Novak (MU Brno) PPP-Codes CEMI meeting 2 / 17 Approximate Distance-based Similarity Search Preliminaries generic similarity search data modeled as metric space (D, δ), where D is a domain of objects and δ is a total distance function δ : D × D −→ R+ 0 satisfying postulates of identity, symmetry, and triangle inequality David Novak (MU Brno) PPP-Codes CEMI meeting 3 / 17 Approximate Distance-based Similarity Search Preliminaries generic similarity search data modeled as metric space (D, δ), where D is a domain of objects and δ is a total distance function δ : D × D −→ R+ 0 satisfying postulates of identity, symmetry, and triangle inequality query by example: K-NN(q) returns K objects x from the dataset X ⊆ D with the smallest δ(q, x) David Novak (MU Brno) PPP-Codes CEMI meeting 3 / 17 Approximate Distance-based Similarity Search Preliminaries generic similarity search data modeled as metric space (D, δ), where D is a domain of objects and δ is a total distance function δ : D × D −→ R+ 0 satisfying postulates of identity, symmetry, and triangle inequality query by example: K-NN(q) returns K objects x from the dataset X ⊆ D with the smallest δ(q, x) dataset X may be very large distance function δ may be time consuming David Novak (MU Brno) PPP-Codes CEMI meeting 3 / 17 Approximate Distance-based Similarity Search Preliminaries generic similarity search data modeled as metric space (D, δ), where D is a domain of objects and δ is a total distance function δ : D × D −→ R+ 0 satisfying postulates of identity, symmetry, and triangle inequality query by example: K-NN(q) returns K objects x from the dataset X ⊆ D with the smallest δ(q, x) dataset X may be very large distance function δ may be time consuming requires approximate search David Novak (MU Brno) PPP-Codes CEMI meeting 3 / 17 Approximate Distance-based Similarity Search Motivation current indexes for large-scale approximate search: dataset X is partitioned given query q, the “most-promising” partitions form the candidate set the candidate set SC is refined by calculating δ(q, x), ∀x ∈ SC David Novak (MU Brno) PPP-Codes CEMI meeting 4 / 17 Approximate Distance-based Similarity Search Motivation current indexes for large-scale approximate search: dataset X is partitioned given query q, the “most-promising” partitions form the candidate set the candidate set SC is refined by calculating δ(q, x), ∀x ∈ SC reading and refinement of SC form majority of the search costs accuracy of the candidate set is key David Novak (MU Brno) PPP-Codes CEMI meeting 4 / 17 PPP-Codes Approach Our Approach in a Nutshell 1 data space is partitioned multiple-times independently each partitioning is defined by one pivot space David Novak (MU Brno) PPP-Codes CEMI meeting 5 / 17 PPP-Codes Approach Our Approach in a Nutshell 1 data space is partitioned multiple-times independently each partitioning is defined by one pivot space 2 given query q, multiple ranked candidate sets are generated David Novak (MU Brno) PPP-Codes CEMI meeting 5 / 17 PPP-Codes Approach Our Approach in a Nutshell 1 data space is partitioned multiple-times independently each partitioning is defined by one pivot space 2 given query q, multiple ranked candidate sets are generated 3 these multiple candidate rankings are effectively merged the merged candidate set is smaller and more accurate David Novak (MU Brno) PPP-Codes CEMI meeting 5 / 17 PPP-Codes Approach Our Approach in a Nutshell 1 data space is partitioned multiple-times independently each partitioning is defined by one pivot space 2 given query q, multiple ranked candidate sets are generated 3 these multiple candidate rankings are effectively merged the merged candidate set is smaller and more accurate 4 the final candidate set is retrieved and refined David Novak (MU Brno) PPP-Codes CEMI meeting 5 / 17 PPP-Codes Approach Multiple Pivot Space Partitioning Voronoi Partitioning & Pivot Permutations Pivot space is defined by a set of k pivots {p1, . . . , pk} ⊆ D David Novak (MU Brno) PPP-Codes CEMI meeting 6 / 17 PPP-Codes Approach Multiple Pivot Space Partitioning Voronoi Partitioning & Pivot Permutations Pivot space is defined by a set of k pivots {p1, . . . , pk} ⊆ D David Novak (MU Brno) PPP-Codes CEMI meeting 6 / 17 PPP-Codes Approach Multiple Pivot Space Partitioning Voronoi Partitioning & Pivot Permutations Pivot space is defined by a set of k pivots {p1, . . . , pk} ⊆ D David Novak (MU Brno) PPP-Codes CEMI meeting 6 / 17 PPP-Codes Approach Multiple Pivot Space Partitioning Voronoi Partitioning & Pivot Permutations Pivot space is defined by a set of k pivots {p1, . . . , pk} ⊆ D Formally: object x ∈ X is mapped to its pivot permutation (PP): Πx on {1, . . . , k} such that Πx (i) is the i-th closest pivot from x David Novak (MU Brno) PPP-Codes CEMI meeting 6 / 17 PPP-Codes Approach Multiple Pivot Space Partitioning Voronoi Partitioning & Pivot Permutations Pivot space is defined by a set of k pivots {p1, . . . , pk} ⊆ D Formally: object x ∈ X is mapped to its pivot permutation (PP): Πx on {1, . . . , k} such that Πx (i) is the i-th closest pivot from x each Voronoi cell corresponds to a pivot permutation prefix (PPP) of length l: Πx (1..l) David Novak (MU Brno) PPP-Codes CEMI meeting 6 / 17 PPP-Codes Approach Multiple Pivot Space Partitioning Multiple Pivot Space Partitioning We propose to create λ independent pivot space partitionings p 1 1 p2 1 p3 1 p4 1 p5 1 p6 1 p7 1 p8 1 x5 p1 2 p2 2 p3 2 p4 2 p5 2 p6 2 p7 2 p 8 2 x 5 David Novak (MU Brno) PPP-Codes CEMI meeting 7 / 17 PPP-Codes Approach Multiple Pivot Space Partitioning Multiple Pivot Space Partitioning We propose to create λ independent pivot space partitionings p 1 1 p2 1 p3 1 p4 1 p5 1 p6 1 p7 1 p8 1 x5 p1 2 p2 2 p3 2 p4 2 p5 2 p6 2 p7 2 p 8 2 x 5 data objects x ∈ X are encoded as PPP1..λ l (x) = Π1 x (1..l), . . . , Πλ x (1..l) David Novak (MU Brno) PPP-Codes CEMI meeting 7 / 17 PPP-Codes Approach Multiple Pivot Space Partitioning Multiple Pivot Space Partitioning We propose to create λ independent pivot space partitionings p 1 1 p2 1 p3 1 p4 1 p5 1 p6 1 p7 1 p8 1 x5 p1 2 p2 2 p3 2 p4 2 p5 2 p6 2 p7 2 p 8 2 x 5 data objects x ∈ X are encoded as PPP1..λ l (x) = Π1 x (1..l), . . . , Πλ x (1..l) in the example above λ = 2, k = 8, l = 4: PPP1..2 4 (x5) = 7, 4, 8, 5 , 7, 8, 4, 6 David Novak (MU Brno) PPP-Codes CEMI meeting 7 / 17 PPP-Codes Approach Ranking of the Data Objects Ranking within a Single Pivot Space Task: Having data x ∈ X encoded by PPP Πx (1..l) (single recursive Voronoi partitioning), define ranking of the PPPs with respect to q ∈ D David Novak (MU Brno) PPP-Codes CEMI meeting 8 / 17 PPP-Codes Approach Ranking of the Data Objects Ranking within a Single Pivot Space Task: Having data x ∈ X encoded by PPP Πx (1..l) (single recursive Voronoi partitioning), define ranking of the PPPs with respect to q ∈ D p3 p2 p4 C<2,1> C<1,2> C<4,3> <4,1>C C<3,4> <1,3>C q p1 C<3,2> C<1,4> CC <2,3><3,1> David Novak (MU Brno) PPP-Codes CEMI meeting 8 / 17 PPP-Codes Approach Ranking of the Data Objects Ranking within a Single Pivot Space Solution: We define distance between Voronoi cell C i1,...,il and query q as a weighted arithmetic mean of distances δ(q, pi1 ), . . . , δ(q, pil ) p3 p2 p4 C<2,1> C<1,2> C<4,3> <4,1>C C<3,4> <1,3>C q p1 C<3,2> C<1,4> CC <2,3><3,1> David Novak (MU Brno) PPP-Codes CEMI meeting 8 / 17 PPP-Codes Approach Ranking of the Data Objects Ranking using Multiple Pivot Spaces Task: Having λ rankings of PPPs from λ pivot spaces, aggregate these rankings effectively into a final ranking David Novak (MU Brno) PPP-Codes CEMI meeting 9 / 17 PPP-Codes Approach Ranking of the Data Objects Ranking using Multiple Pivot Spaces Task: Having λ rankings of PPPs from λ pivot spaces, aggregate these rankings effectively into a final ranking q ∈ D ψq 1 : {x y1 y2 } {y3 y4 y5 } {y6 } ... ψq 2 : {y3 y2 } {y1 y4 y6 y7 } {x y8 } ... ψq 3 : {x} {y3 y4 y5 } {y2 y6 } ... ψq 4 : {y1 y2 } {y3 y4 y5 } {y8 } {y6 } ... ψq 5 : {y1 y2 } {y4 y5 } {y3 } {x y7 } ... objects with the rank '1' rank '2' rank '3' David Novak (MU Brno) PPP-Codes CEMI meeting 9 / 17 PPP-Codes Approach Ranking of the Data Objects Ranking using Multiple Pivot Spaces Solution: Ranking of object x is p-percentile (e.g. median) of its λ ranks Ψp(q, x) = percentilep(ψ1 q(x), ψ2 q(x), . . . , ψλ q (x)) q ∈ D ψq 1 : {x y1 y2 } {y3 y4 y5 } {y6 } ... ψq 2 : {y3 y2 } {y1 y4 y6 y7 } {x y8 } ... ψq 3 : {x} {y3 y4 y5 } {y2 y6 } ... ψq 4 : {y1 y2 } {y3 y4 y5 } {y8 } {y6 } ... ψq 5 : {y1 y2 } {y4 y5 } {y3 } {x y7 } ... objects with the rank '1' Ψ0.5 (q, x) = percentile0.5 {1, 1, 3, 4, ?} = 3 rank '2' rank '3' David Novak (MU Brno) PPP-Codes CEMI meeting 9 / 17 PPP-Codes Approach Ranking of the Data Objects Idea Behind the Rank Aggregation the Voronoi cells span large areas of the space David Novak (MU Brno) PPP-Codes CEMI meeting 10 / 17 PPP-Codes Approach Ranking of the Data Objects Idea Behind the Rank Aggregation the Voronoi cells span large areas of the space given a query, the “close” cells contain also distant data objects there is many more distant ones David Novak (MU Brno) PPP-Codes CEMI meeting 10 / 17 PPP-Codes Approach Ranking of the Data Objects Idea Behind the Rank Aggregation the Voronoi cells span large areas of the space given a query, the “close” cells contain also distant data objects there is many more distant ones having several “orthogonal” partitionings the query-relevant objects should be often at top positions the distant objects vary David Novak (MU Brno) PPP-Codes CEMI meeting 10 / 17 PPP-Codes Approach Ranking of the Data Objects Idea Behind the Rank Aggregation the Voronoi cells span large areas of the space given a query, the “close” cells contain also distant data objects there is many more distant ones having several “orthogonal” partitionings the query-relevant objects should be often at top positions the distant objects vary the percentile-based aggregation increases probability that query-relevant objects are ranked higher than the distant ones David Novak (MU Brno) PPP-Codes CEMI meeting 10 / 17 PPP-Codes Approach Indexing and Searching Indexing the PPP-Codes We build trie-like structure for each pivot space leafs: only suffixes of PPPs (spare memory) dynamic splits to optimize the memory usage possible grouping and delta-encoding of IDs in leaves 1 2 k3 ... 2 k3 ... 1 k3 ...Π(2)= Π(1)= 3 k4 ...Π(3)= −1k −1k Π l(4.. ),ID Π l(4.. ),ID Π l(4.. ),ID Π l(4.. ),ID Π l(3.. ),ID ... 1 3 ... 1 2 ... ... ... ... ... David Novak (MU Brno) PPP-Codes CEMI meeting 11 / 17 PPP-Codes Approach Indexing and Searching PPPRank: The Search Algorithm ψq 1 : {x y1 y2 } {y3 y4 y5 } {y6 } ... ψq 2 : {y3 y2 } {y1 y4 y6 y7 } {x y8 } ... ψq 3 : {x} {y3 y4 y5 } {y2 y6 } ... ψq 4 : {y1 y2 } {y3 y4 y5 } {y8 } {y6 } ... ψq 5 : {y1 y2 } {y4 y5 } {y3 } {x y7 } ... objects with the rank '1' Ψ0.5 (q, x) = percentile0.5 {1, 1, 3, 4, ?} = 3 rank '2' rank '3' Given query q ∈ D, our search algorithm: David Novak (MU Brno) PPP-Codes CEMI meeting 12 / 17 PPP-Codes Approach Indexing and Searching PPPRank: The Search Algorithm ψq 1 : {x y1 y2 } {y3 y4 y5 } {y6 } ... ψq 2 : {y3 y2 } {y1 y4 y6 y7 } {x y8 } ... ψq 3 : {x} {y3 y4 y5 } {y2 y6 } ... ψq 4 : {y1 y2 } {y3 y4 y5 } {y8 } {y6 } ... ψq 5 : {y1 y2 } {y4 y5 } {y3 } {x y7 } ... objects with the rank '1' Ψ0.5 (q, x) = percentile0.5 {1, 1, 3, 4, ?} = 3 rank '2' rank '3' Given query q ∈ D, our search algorithm: 1 generates one-by-one individual rankings ψj q (GetNextIDs algorithm, it uses the trie structures) David Novak (MU Brno) PPP-Codes CEMI meeting 12 / 17 PPP-Codes Approach Indexing and Searching PPPRank: The Search Algorithm ψq 1 : {x y1 y2 } {y3 y4 y5 } {y6 } ... ψq 2 : {y3 y2 } {y1 y4 y6 y7 } {x y8 } ... ψq 3 : {x} {y3 y4 y5 } {y2 y6 } ... ψq 4 : {y1 y2 } {y3 y4 y5 } {y8 } {y6 } ... ψq 5 : {y1 y2 } {y4 y5 } {y3 } {x y7 } ... objects with the rank '1' Ψ0.5 (q, x) = percentile0.5 {1, 1, 3, 4, ?} = 3 rank '2' rank '3' Given query q ∈ D, our search algorithm: 1 generates one-by-one individual rankings ψj q (GetNextIDs algorithm, it uses the trie structures) 2 outputs objects with the best aggregated ranks (PPPRank algorithm based on MedRank by Fagin et al.) David Novak (MU Brno) PPP-Codes CEMI meeting 12 / 17 PPP-Codes Approach Indexing and Searching PPPRank: The Search Algorithm calculate λ∙k query-pivot distances δ(q,pi j ) K-NN(q) PPPRank(q,p,R): merge λ ranks to get top R objects GetNextIDs(q,2): generate ψq 2 ranking GetNextIDs(q,1): generate ψq 1 ranking ... GetNextIDs(q,λ): generate ψq λ ranking retrieve R objects SSD refine R objects by δ(q,x) 1 2 3 4 5 Given query q ∈ D, our search algorithm: 1 generates one-by-one individual rankings ψj q (GetNextIDs algorithm, it uses the trie structures) 2 outputs objects with the best aggregated ranks (PPPRank algorithm based on MedRank by Fagin et al.) David Novak (MU Brno) PPP-Codes CEMI meeting 12 / 17 Efficiency of our Approach Evaluation: Accuracy of the Candidate Set Given K-NN, we consider recall(A) = |A∩AP | K · 100% vs. candidate set size David Novak (MU Brno) PPP-Codes CEMI meeting 13 / 17 Efficiency of our Approach Evaluation: Accuracy of the Candidate Set Given K-NN, we consider recall(A) = |A∩AP | K · 100% vs. candidate set size λ=1 λ=2 λ=4 λ=6 λ=8 10 100 1000 10000 32 64 128 256 candidatesetsizeR 0.2 0.2 0.2 0.2 0.05 0.07 0.05 0.06 k# of pivots Candidate set size R necessary to achieve 80% of 1-NN recall Settings: 1M CoPhIR dataset, l = 8 and p = 0.75 David Novak (MU Brno) PPP-Codes CEMI meeting 13 / 17 Efficiency of our Approach Experimental Evaluation Criteria three datasets: 100M CoPhIR (280-dim, complex metric, obj.: 600 B, δ time 0.01 ms) 1M SQFD (quadratic form distance, obj.: 2 kB, δ time 0.5 ms) 10M ADJ ([0, 1]32 uniform, L2, obj.: 0.5–4.0 kB, δ time 0.001–1.0 ms) David Novak (MU Brno) PPP-Codes CEMI meeting 14 / 17 Efficiency of our Approach Experimental Evaluation Criteria three datasets: 100M CoPhIR (280-dim, complex metric, obj.: 600 B, δ time 0.01 ms) 1M SQFD (quadratic form distance, obj.: 2 kB, δ time 0.5 ms) 10M ADJ ([0, 1]32 uniform, L2, obj.: 0.5–4.0 kB, δ time 0.001–1.0 ms) technical evaluation of our approach: mutual influence of various parameters to recall k, l, λ, p, size of the PPP-Code representation David Novak (MU Brno) PPP-Codes CEMI meeting 14 / 17 Efficiency of our Approach Experimental Evaluation Criteria three datasets: 100M CoPhIR (280-dim, complex metric, obj.: 600 B, δ time 0.01 ms) 1M SQFD (quadratic form distance, obj.: 2 kB, δ time 0.5 ms) 10M ADJ ([0, 1]32 uniform, L2, obj.: 0.5–4.0 kB, δ time 0.001–1.0 ms) technical evaluation of our approach: mutual influence of various parameters to recall k, l, λ, p, size of the PPP-Code representation k ∈ {64, 128, 256, 512}, l = 8, λ = 5, p = 0.5 (3rd rank out of 5) David Novak (MU Brno) PPP-Codes CEMI meeting 14 / 17 Efficiency of our Approach Evaluation: Candidate Set vs. Recall candidate set size R vs. recall David Novak (MU Brno) PPP-Codes CEMI meeting 15 / 17 Efficiency of our Approach Evaluation: Candidate Set vs. Recall candidate set size R vs. recall 1−NN recall (left axis) 10−NN recall (left axis) 50−NN recall (left axis) search time (righ axis) 0 20 40 60 80 100 2000 4000 6000 8000 10000 0 200 400 600 800 1000 1200 1400 K−NNrecall searchtime[ms] candidate set size R Recall and search time on while increasing candidate set size R. Settings: 100M CoPhIR dataset, k = 512 David Novak (MU Brno) PPP-Codes CEMI meeting 15 / 17 Efficiency of our Approach Evaluation: Tradeoff complexity of the PPPRank algorithm vs. candidate set reduction David Novak (MU Brno) PPP-Codes CEMI meeting 16 / 17 Efficiency of our Approach Evaluation: Tradeoff complexity of the PPPRank algorithm vs. candidate set reduction PPP-Code / size of an object [bytes] M-Index [ms] 512 1024 2048 4096 δtime 0.001 ms 370 / 240 370 / 410 370 / 1270 370 / 1700 0.01 ms 380 / 660 380 / 750 380 / 1350 380 / 1850 0.1 ms 400 / 5400 400 / 5400 420 / 5500 420 / 5700 1 ms 1100 / 52500 1100 / 52500 1100 / 52500 1100 / 52500 Search times [ms] of PPP-Codes / M-Index smaller search times are in boldface. Settings: 10M ADJUSTABLE dataset, 10-NN recall = 85 %, k = 128; PPP-Codes: R = 1000; M-Index: R = 400000 David Novak (MU Brno) PPP-Codes CEMI meeting 16 / 17 Summary Conclusions The PPP-Codes technique use multiple pivot spaces to encode data objects rank data with respect to query within individual pivot spaces final candidate set is aggregation of these rankings efficient indexing and searching mechanisms are defined David Novak (MU Brno) PPP-Codes CEMI meeting 17 / 17 Summary Conclusions The PPP-Codes technique use multiple pivot spaces to encode data objects rank data with respect to query within individual pivot spaces final candidate set is aggregation of these rankings efficient indexing and searching mechanisms are defined The results show that even two pivot spaces help, more than five do not help much the candidate set is reduced by one–two orders of magnitude the rank & merge algorithm is complex but usually worth David Novak (MU Brno) PPP-Codes CEMI meeting 17 / 17