Distributed Similarity
Search Architectures
David Novak, Vladimír Míč, Pavel Zezula
DISA Lab
Metric-based Similarity
● generic similarity search
○ applicable to many domains
data modeled metric space (D, δ), where D is a domain of objects
and δ is a total distance function δ : D × D → R+
0
satisfying
postulates of identity, symmetry, triangle inequality
● search - query by example
● K-NN(q) query returns K objects
x with the smallest δ(q,x)
disk storage
query
Similarity Indexing
● objective: organize the dataset X ⊆ D
○ so that similarity queries are processed efficiently
disk storage
● data volumes can be large
● distance δ can be demanding
Objectives
● Distributed, horizontally scalable architecture
● ... for generic similarity search
● ... in Big data collections (hundreds of millions)
● ... single query efficiency
● … high query throughput
Outline
● motivation
● specific distributed systems
○ building blocks
○ our existing solutions (M-Chord, distributed M-Index)
○ other possible solutions
● analytical approach:
○ system model + cost model (just basic ideas)
● future work
Building Blocks - Notation
PPP-Tree (Pivot Permutation Prefix)
● recursive Voronoi tree
PPP-Tree with leaves pointing to
buckets with data (disk or memory)
PPP
PPP
data data data
PPP
IDs IDs IDs
PPP-Tree with leaves storing only IDs
of objects (typically memory)
M-Index
● simple PPP-Tree + memory
or disk storagePPP
data data
● given an approximate query
k-NN(q)
q uery
datadata
● most relevant buckets are
(read from disk) and refined
data
Multi M-Index
● λ independent PPP-Trees
● data in buckets
○ either in memory (shared)
○ or on disk (replicated)
● given query k-NN(q)
● most relevant buckets from
each tree are accessed
1
PPP
data datadata
2
PPP
data datadata
3
PPP
data datadata
q uery
● λ independent PPP-Trees
○ IDs in leaf nodes (memory)
● ID-object storage
○ SSD disk
● given query k-NN(q)
q uery
1
PPP
IDs IDs IDs
2
PPP
IDs IDs IDs
3
PPP
IDs IDs IDs
PPP-Rank algorithm
data objs.
ID-obj index
1. relevant leaves from each tree
accessed (λ priority queues of IDs)
2. PPP-Rank merges the ID queues
○ final candidate set is “very” small
3. refine candidate set one-by-one
PPP-Codes
PPP-Code: Pros & Cons
● Weak points
○ if dataset large, requires SSD (not feasible on HDD)
○ PPP-Trees with IDs take some memory
○ PPP-Rank algorithm takes some time
● Strong points
○ candidate set can be much smaller (2 orders of mgn.)
■ important for larger objects or expensive distance
○ data stored in ID-object store
■ a shared store for other indexes...
Distributed Indexes
● M-Chord (2006)
● Distributed M-Index (2012)
● Future organizations
basic component of distributed
system is a “node”
PPPPPP
M-Chord
● basic component is “fixedlevel”
PPP mapping
○ static
● PPP mapping determines
query-relevant buckets
PPP
data data
PPP
data data
data data
data
data
data
data
q uery
● these buckets are accessed
and refined on nodes
PPP
data data
M-Chord: Pros & Cons
● The fixed (static) PPP is not that precise as
dynamic PPP-Tree
● ..but it allows easy replication
PPP PPP
Distributed M-Index
● Space partitioning by
dynamic PPP-Tree
● PPP mapping determines
query-relevant bucketsdata data
data data
data data
data data
data
data
data
q uery
● these buckets are accessed
and refined on nodes
PPPPPP
Distributed M-Index (Local Indexes)
data data
data data
data data
data
data
q uery
PPPPPP
● Data buckets can be
organized by local indexes
PPP
PPP
PPP
Distributed M-Index (Multiple PPPs)
● Space partitioning by λ
PPP-Trees
data data
data data
data data
data
data
data
q uery
PPPPPP
● Search in all PPP spaces
● Data replicated λ-times
data
data data
M-Chord + Distribute M-Index
● Relatively large candidate set
○ but navigation on the bucket level - few messages
● Data distributed by the similarity space
○ data retrieved by sequential reads (HDD)
■ ...but difficult to build secondary indexes on the
same data
■ e.g. ID-object index
data objs.
ID-obj hash
data objs.
ID-obj hash
data objs.
ID-obj hash
data objs.
ID-obj hash
data objs.
ID-obj hash
PPP-Rank algorithm
Distributed PPP-Codes: Variant 1
IDs
IDsIDsIDs
q uery
1
PPP
IDsIDs
2
PPP
IDsIDsIDs
3
PPP
PPP-Codes
candidate set CX
of IDs1
distributed
refinement of CX
2
merge partial answers3
● PPP-Codes can be bottleneck
● all nodes participate on each search
● all advantages of ID-obj approach
○ data on one place
○ multi-modal search
key-value store (ID-object) on dataset X
ID-obj hash
data objs.
ID-obj hash
data objs.
ID-obj hash
data objs.
ID-obj hash
data objs.
ID-obj hash
Distributed PPP-Codes: Variant 2
● PPP-Codes index is
somewhat distributed
○ “no” bottleneck
● but complex
communication
during PPP-Ranking
data objs.
ID-obj hash
PPP-Rank
3
1
2
q uery
candidate set CX
⊆ X1
key-value store (ID-object) on dataset X
distributed
refinement of CX
2
merge partial answers3
data objs.
ID-obj hash
data objs.
ID-obj hash
data objs.
ID-obj hash
data objs.
ID-obj hash
PPP-Codes on Local Data
● no central point
● all m nodes
participate
● uniform load
distribution
data objs.
ID-obj hash
q uery
local refinement of
partial cand. set
2
key-value store (ID-object) on dataset X
local PPP-Rank
at each node
1
PPP-Codes
PPP-Codes
PPP-Codes PPP-Codes
PPP-Codes
merge partial answers3
● replication of each
partition (3-times)
● query: access btw.
m/3 to m nodes
System Model
1. What global metric approach is used?
○ fixed/dynamic Voronoi, PPP-Codes, other (sketches)
○ result: data partitions (buckets, single objs., sketches)
2. How are the data partitions distributed?
○ using the metric partitioning, independent hash, …
3. Other questions:
○ replication of partitions
○ local indexes on the level of partitions or nodes
○ communication among nodes (log n or direct)
Cost Model
Analytical evaluation of:
● time of a one query processing
● query throughput
○ number of queries per second
1. Derive formulas for different system settings
2. Simulate costs for various parameters
Cost Model: Variables
● size of dataset: n objects
● number of nodes: m
● time(d) of d(q,x) comp. [ms], size of each objects [B]
● I/O speed [B/s], network latency [ms] & bandwidth [B/s]
For each approach (get from papers or measure)
● # of partitions
● size of candidate set |C(q)| for given query recall
● time of C(q) generation
● distribution of partitions and candidate parts. to nodes
Cost Model: Query Processing
1. processing on the coordinating node:
#pivots * time(d) + cand_generation_time(|C(q)|) [ms]
2. communication:
latency + (message_size / bandwidth) [ms]
3. slowest node processing:
#objects_to_process = max_cand_frac * |C(q)|
○ read the data:
#objects_to_process * size(x) / I/O_speed [ms]
○ refine the data:
#objects_to_process * time(d) [ms]
4. communication back: latency + (0 / bandwidth)
5. merge of the answers and return
Thank you for your attention
Distributed M-Index (Local Indexes)
data data
data data
data data
data
data
q uery
PPPPPP
● Data buckets can be
organized by local indexes
PPP
PPP
Advantages of ID-object store
● ID-object queries (w/o another database)
○ all data in one places (good for consistency)
● efficient access by multiple modalities
○ data not partitioned by a single similarity modality
○ enables indexes on other attributes/modalities
○ final ranking by combination of modalities + filtering
● multiple collections use common ID-obj. store
M-Chord + Distribute M-Index
● Relatively large candidate set
○ but navigation on the bucket level - few messages
○ e.g. ID-object index could be built, but:
■ global part of the index + local on each bucket
■ as index grows, data is redistributed
● data objects moved to other buckets
■ another similarity index almost impossible
● Data distributed by the similarity space
○ data retrieved by sequential reads (HDD)
■ ...but difficult to build secondary indexes
Distributed PPP-Codes: Other Indexes
● other indexes:
○ multi-modal search: combination, filtering, re-ranking
key-value store (ID-object) on the whole dataset X
worker worker
similarity
index IXi
field inverted file
index IXi
field2
attribute
index IXi
field3
similarity
index IXj
field
worker
worker
worker
PPP-Rank: the Details
q uery
1
PPP
IDs IDs IDs
2
PPP
IDs IDs IDs
3
PPP
IDs IDs IDs
PPP-Rank algorithm
data objs.
ID-obj index
data objs.
ID-obj hash
data objs.
ID-obj hash
data objs.
ID-obj hash
data objs.
ID-obj hash
Distributed PPP-Codes: Variant 3
● The PPP-Ranking
is fully distributed
○ to all nodes
● Communication is
“one-way”
● PPP-Trees can be
replicated
● Size of ψq
j
must be
set a priori
data objs.
ID-obj hash
3
1
2
q uery
generate ψq
j1
local refinement of
partial cand. set
4
merge partial answers5
PPP-
Rank
PPP-
Rank
PPP-
Rank
PPP-
Rank
PPP-
Rank
employ ID hash
+ reshuffle IDs
2
key-value store (ID-object) on dataset X
local PPP-Rank
at each node
3