Enhancing Similarity Search Performance by Dynamic Query Reordering
Filip Nálepa, Michal Batko, Pavel Zezula

Stream Processing in Similarity Search
▪ Motivation
  ▪ Image annotation – annotate a stream of images collected by a web crawler
  ▪ Publish/subscribe applications – categorize a stream of documents
  ▪ → stream of query objects
▪ Stream: potentially infinite sequence of query objects (q1, q2, …)
▪ Process as many query objects as possible; the processing of a query object can be delayed → maximize throughput

Problem Definition
▪ Domain of objects D
▪ Database DB ⊆ D of objects indexed in a metric space
▪ Distance function d: D × D → R determines the similarity of two objects
▪ Stream of query objects ((q1, t1), (q2, t2), …)
  ▪ qi ∈ D
  ▪ ti – time of arrival, ti ≤ ti+1
▪ Evaluate a k-NN query for each qi, i.e., find the k objects in DB most similar to qi
▪ Optimization criterion – throughput
  ▪ Maximize the number of processed query objects

Similarity Search Approach
▪ Typical similarity search techniques:
  ▪ Partitioned data of DB stored on a disk
  ▪ Read a subset of the partitions during query evaluation → bottleneck
▪ Assumption: similar query objects need similar sets of partitions
▪ Idea: reuse loaded partitions to save disk accesses → data partition caching
▪ Problem: huge metric space → low probability of data partition intersection
▪ Solution: reorder query objects to obtain sequences of similar query objects

Architecture
▪ Buffer: waiting query objects, query object reordering
▪ Metric index: query evaluation
▪ Cache: in-memory caching of data partitions
[Diagram: query stream → buffer → metric index (backed by the cache and the disk) → result]

Query Object Reordering within the Buffer
▪ Task: find sequences of similar query objects
▪ Solution:
  ▪ cluster the query objects
  ▪ select a cluster and evaluate all the query objects in that cluster

How to Cluster?
▪ Has to be efficient
▪ Pivot-based clustering
  ▪ Fixed set of pivots p1, …, pn in the metric space
  ▪ Compute the metric distance of a new query object to all the pivots
  ▪ Order the pivots from the nearest to the farthest one → pivot permutation = cluster
[Figure: three pivots p1, p2, p3 partition the space into regions labeled by their pivot permutations, e.g., (p2, p3, p1), (p1, p3, p2), (p2, p1, p3)]
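The pivot permutation of a query object can be computed with a single pass over the pivot set. Below is a minimal sketch, assuming Euclidean vectors as the object domain; the pivot coordinates, the distance function, and the function names are illustrative and not part of the original system.

import math

def euclidean(a, b):
    """Metric distance d: D x D -> R (here: Euclidean distance on vectors)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pivot_permutation(query, pivots, distance=euclidean):
    """Order pivot indices from the nearest to the farthest pivot.

    The resulting tuple (the pivot permutation) identifies the cluster of
    the query object; a shared permutation prefix means similar clusters.
    """
    ranked = sorted(range(len(pivots)), key=lambda i: distance(query, pivots[i]))
    return tuple(ranked)

# Example: three pivots in the plane, one incoming query object.
pivots = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # p1, p2, p3
query = (0.2, 0.9)
print(pivot_permutation(query, pivots))          # -> (2, 0, 1), i.e., (p3, p1, p2)

Query objects whose permutations share a long prefix lie close to the same pivots and are therefore likely to need overlapping sets of data partitions.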
Hierarchical Clustering
▪ Individual levels correspond to the length of the common pivot permutation prefix
▪ Internal node – common prefix of all children
▪ Leaves – query objects
▪ Query ordering: depth-first tree traversal
  ▪ Find the lowest nonempty parent of the previous query object → similar cluster
  ▪ Select the child containing the oldest query object → no query starvation, sufficient cluster density
[Figure: tree of pivot-permutation prefixes (p1, p2, p3; p1p2, p1p3, p2p1, p2p3, p3p1, p3p2) with the waiting query objects q1, q2, … in the leaves; the example is stepped through as query objects arrive and are processed]

Experiments – Fixed Input Rate
▪ DB: 10 million images represented by MPEG-7 descriptors
▪ Stream of query objects: evaluation of approximate 10-NN queries (10 nearest neighbors)
▪ Cache size: 90,000 objects (0.9% of the DB)
▪ Fixed input rate: a new query object arrives every x time units
▪ Average query time with no reordering and no caching: 113 ms
[Graph: buffer size (thousands of query objects) over elapsed time [s] for input rates of 20 ms, 30 ms, and 60 ms]

Throughput Delay Tradeoff
▪ Motivation
  ▪ publish/subscribe application
    ▪ requirement to obtain the latest data
    ▪ e.g., 10% of the data is required to be processed with a delay of at most 1 minute
  ▪ image annotation
    ▪ requirement to search for the latest images
    ▪ e.g., 10% of the images should be findable by keywords within 1 minute of their acquisition
▪ Goal: maximize the expression w · |beforeDelayLimit| + |afterDelayLimit| for a given delay limit
  ▪ |beforeDelayLimit| = number of query objects processed within the given delay limit
  ▪ |afterDelayLimit| = number of query objects processed after the given delay limit
  ▪ w = weight parameter
▪ Solution: modification of the cluster ordering

Throughput Delay Tradeoff Approach
▪ Original ordering: oldest cluster first
▪ Modification (a sketch of this scoring follows below):
  ▪ score for each cluster = a · |beforeLimitQueries| + b · oldestQueryAge
  ▪ beforeLimitQueries: set of query objects younger than the delay limit
  ▪ oldestQueryAge: age of the oldest query object in the cluster
  ▪ a, b: weighting parameters
▪ Depth-first traversal of the tree of clusters: select the child with the highest score
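A minimal sketch of the cluster score and of the child selection at one level of the traversal, assuming arrival times tracked in seconds; the Cluster class, the helper names, and the 60-second delay limit are illustrative assumptions, not the authors' implementation.

import time

DELAY_LIMIT = 60.0   # delay limit in seconds (1 minute in the experiments)

class Cluster:
    def __init__(self, arrival_times):
        # Arrival times (seconds) of the query objects waiting in this cluster.
        self.arrival_times = arrival_times

def score(cluster, a, b, now=None):
    """score = a * |beforeLimitQueries| + b * oldestQueryAge"""
    now = time.time() if now is None else now
    ages = [now - t for t in cluster.arrival_times]
    before_limit = sum(1 for age in ages if age < DELAY_LIMIT)  # still within the delay limit
    oldest_age = max(ages) if ages else 0.0
    return a * before_limit + b * oldest_age

def pick_child(children, a, b, now=None):
    """At each tree level, descend into the child cluster with the highest score."""
    return max(children, key=lambda c: score(c, a, b, now))

With a = 0, b = 1 the score degenerates to oldest-cluster-first (pure throughput maximization); with a = 1, b = 0 it prefers the cluster with the most query objects still within the delay limit.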
Throughput Delay Tradeoff Approach Experiments
▪ 30 ms input frequency
▪ Delay limit: 1 minute
▪ Runtime: 4 hours
▪ Experiments with different "a" weights (thousands in the graphs)
▪ b weight = 1
[Graphs: buffer size (thousands) over elapsed time [s] for a-weights of 0, 100, 500, and 5000 (thousands); number of processed queries (thousands) per weight, split into those processed before and after the delay limit]

Throughput Delay Tradeoff Approach #2
▪ score = a · |beforeLimitQueries| + b · oldestQueryAge
▪ Switch between different strategies for cluster ordering, i.e., change the weights dynamically (a minimal sketch of the switching logic follows the summary)
  ▪ throughput maximization: select the cluster containing the oldest query object
    ▪ a = 0; b = 1
  ▪ maximization of low-delayed query objects: select the cluster containing the highest number of the newest query objects
    ▪ a = 1; b = 0
  ▪ switch strategies based on buffer size limits
    ▪ upper limit exceeded → maximize throughput
    ▪ lower limit reached → focus on low delays

Throughput Delay Tradeoff Approach #2 Experiments
▪ 30 ms input frequency, different buffer size limits
▪ Delay limit (DL) = 1 minute
[Graph: buffer size (thousands) over elapsed time [s] for buffer size limits of 30–40, 70–80, and 100–110 thousand]

Lower limit   Upper limit   Queries before DL [%]
30,000        40,000        13
70,000        80,000        18
100,000       110,000       19
(Results computed after the 2nd switch)

Summary
▪ Stream of similarity query objects
▪ Enhancing the throughput by query reordering and data partition caching
▪ Throughput delay tradeoff by modification of the ordering strategies
[Diagram: query stream → buffer → metric index (backed by the cache and the disk) → result]
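As referenced above, a minimal sketch of the buffer-size-based strategy switch of Approach #2; the Scheduler class, the default limits, and the weight tuples are illustrative assumptions built on the scoring sketch shown earlier.

# Weight tuples (a, b) for the cluster score a * |beforeLimitQueries| + b * oldestQueryAge.
THROUGHPUT_WEIGHTS = (0.0, 1.0)   # a = 0, b = 1: oldest cluster first
LOW_DELAY_WEIGHTS  = (1.0, 0.0)   # a = 1, b = 0: most fresh query objects first

class Scheduler:
    def __init__(self, lower_limit=70_000, upper_limit=80_000):
        self.lower_limit = lower_limit
        self.upper_limit = upper_limit
        self.weights = LOW_DELAY_WEIGHTS      # start by favouring low delays

    def update(self, buffer_size):
        """Switch the cluster-ordering strategy based on the current buffer size."""
        if buffer_size > self.upper_limit:
            # Too many waiting query objects: maximize throughput.
            self.weights = THROUGHPUT_WEIGHTS
        elif buffer_size < self.lower_limit:
            # Buffer has drained again: focus on low-delayed query objects.
            self.weights = LOW_DELAY_WEIGHTS
        return self.weights

The gap between the lower and the upper limit acts as a hysteresis band, so the strategy does not flip on every processed query object.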