Metric learning, product quantization, approximate search
Jan Sedmidubský
Masaryk University
sedmidubsky@mail.muni.cz

Outline
• Metric learning
• Vector/Product quantization
• Approximate similarity search (e.g., using FAISS)
[Figure: an index over the embeddings (e.g., 64x compression)]

Metric learning
• Metric learning goal – representing data objects, such as images, text, or anything else, with numerical vectors
  • Vectors = embeddings or embedding vectors
  • A function f transforms a given object (e.g., an image x) into an n-dimensional vector f(x) ∈ ℝⁿ
• Former approach – the individual features of the vector representation had to be specified manually
• Current approach – the vector representation is learned automatically
[Figure: a neural network mapping x to f(x) ∈ ℝⁿ, e.g., (0.2, 0.9, …, 0.7)]

Metric learning
• Desired properties:
  • Similar data objects → vectors that are close together
  • Dissimilar data objects → vectors that are far apart
• Quantification of similarity/closeness:
  • Requires a distance measure in the underlying vector space
  • Commonly used measure – the Euclidean distance function (L2 norm):
    dist(f(x), f(y)) = ‖f(x) − f(y)‖₂
• Metric learning process – pulling together the embeddings of similar objects and pushing apart those of dissimilar objects
• What exactly is meant by similar and dissimilar objects?

Metric learning
• Examples of similar and dissimilar objects under identity-based similarity:
  • Face recognition
  • Retail-product recognition
• Object identities (products, persons) lead to supervised clustering of the learned embeddings → why not just use a classifier?
  • Extreme classification – a very large number of classes (e.g., tens of thousands) with highly unbalanced training data
  • The Stanford Online Products dataset (scraped from eBay) contains 120K images for 23K product classes
  • An output layer of a deep neural network with 23K nodes and about 4 images per class → the result will be unsatisfactory

Metric learning
• Embedding vectors should be:
  • As close together as they can be for the images in each class
  • As far as they can be from the embeddings of the other classes
• Example – recognition of facial expressions
  [Roy et al.: Contrastive Learning of View-invariant Representations for Facial Expressions Recognition, ACM TOMM 2023]
• The basic ideas in metric learning revolve around:
  • Pairwise contrastive loss
    [Hadsell et al.: Dimensionality Reduction by Learning an Invariant Mapping, CVPR 2006]
  • Triplet loss
    [Schroff et al.: FaceNet: A Unified Embedding for Face Recognition and Clustering, CVPR 2015]

Pairwise Contrastive (PC) loss
• Training a neural network in batches (batch – a list of training samples)
• Goal – extract positive and negative pairs of training samples from a batch
  • Positive pairs – carry the same class label
  • Negative pairs – carry different class labels
[Figure: a batch with three classes (x, o, *), highlighting one positive pair and one negative pair]
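To make the pair-extraction step concrete, here is a minimal sketch (not from the slides) that enumerates the positive and negative index pairs of a labelled batch with plain Python loops; the names (extract_pairs, labels) are illustrative. A vectorized, GPU-friendly formulation is shown later in the Coding issues part.

```python
import torch

def extract_pairs(labels: torch.Tensor):
    """Enumerate index pairs (i, j), i < j, split into positive
    (same label) and negative (different label) pairs."""
    positive, negative = [], []
    B = labels.shape[0]
    for i in range(B):
        for j in range(i + 1, B):
            if labels[i] == labels[j]:
                positive.append((i, j))
            else:
                negative.append((i, j))
    return positive, negative

# Toy batch: 6 samples from three classes
labels = torch.tensor([0, 0, 1, 1, 2, 2])
pos, neg = extract_pairs(labels)
print(pos)  # [(0, 1), (2, 3), (4, 5)]
print(neg)  # the remaining 12 cross-class pairs
```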
PC loss – idea of calculation
• The loss (cost or objective) function L measures the discrepancy between the predicted output of the model and the actual target values
  • Purpose – to give the network feedback on how well it is performing, so that it can adjust its parameters (weights and biases) to improve over time
  • During the training process, the loss should gradually decrease (ideally down to 0)
• PC loss calculation:
  • A sum of the values calculated separately from the positive pairs and the negative pairs
  • Contribution to the loss by positive pairs (L_p) + contribution to the loss by negative pairs (L_n)

PC loss – positive pairs
• Contribution to the loss by positive pairs
  • Positive pair (x_1^i, x_2^i) – pairwise distances should be as small as possible
• Positive loss – sum over all positive pairs in the batch:
    L_p = Σ_i dist(f(x_1^i), f(x_2^i))²
  • i – indexes all the positive pairs in the batch
  • The distance is squared because the square is differentiable everywhere

PC loss – negative pairs
• Contribution to the loss by negative pairs
  • Negative pair (x_1^j, x_2^j) – pairwise distances should be as large as possible
  • j – indexes all the negative pairs in the batch
• But pushing apart already very dissimilar items amounts to wasting the learning effort
  • Two well-separated samples in a negative pair should not even participate in learning
  • Threshold on maximal dissimilarity quantified by a margin m:
    • If dist(x_1^j, x_2^j) > m, the contribution to the loss should be 0
    • If dist(x_1^j, x_2^j) ≤ m, the contribution is m − dist(x_1^j, x_2^j)
• Negative loss – sum over all negative pairs in the batch:
    L_n = Σ_j max(0, m − dist(f(x_1^j), f(x_2^j)))²

PC loss
• Overall loss for all pairs in the batch, obtained by combining L_p and L_n
• Binary variable y ∈ {0, 1}:
  • y = 0 → positive pair
  • y = 1 → negative pair
    L = Σ_i [ (1 − y_i) · dist(f(x_1^i), f(x_2^i))² + y_i · max(0, m − dist(f(x_1^i), f(x_2^i)))² ]
  • i – goes over all the pairs in the batch
[Batch example: pair 3 is positive (y_3 = 0), pair 78 is negative (y_78 = 1)]
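A minimal PyTorch sketch of the PC loss above, assuming emb1 and emb2 hold the two embeddings of each pair and y marks negative pairs with 1 (all names and the margin value are illustrative, not prescribed by the slides):

```python
import torch

def pc_loss(emb1: torch.Tensor, emb2: torch.Tensor,
            y: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Pairwise contrastive loss.
    emb1, emb2: (P, M) embeddings of the two members of each pair
    y: (P,) 0 for positive pairs, 1 for negative pairs
    """
    dist = torch.norm(emb1 - emb2, p=2, dim=1)               # Euclidean distance per pair
    pos_term = (1.0 - y) * dist.pow(2)                       # pull positive pairs together
    neg_term = y * torch.clamp(margin - dist, min=0).pow(2)  # push negatives beyond the margin
    return (pos_term + neg_term).sum()

# Toy usage
emb1 = torch.randn(4, 8, requires_grad=True)
emb2 = torch.randn(4, 8, requires_grad=True)
y = torch.tensor([0., 1., 0., 1.])
loss = pc_loss(emb1, emb2, y)
loss.backward()
```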
Triplet loss
• Creating triplets (Anchor, Positive, Negative) from a batch
  • (Anchor, Positive) – carry the same class label
  • (Anchor, Negative) – carry different class labels
• Different mining strategies with different computational properties:
  • Negative-hard mining
  • Negative semi-hard mining

Triplet loss – creating triplets
• For every pair of samples having the same class label:
  • One is selected as Anchor, the other as Positive: (Anchor, Positive)
• For every (Anchor, Positive) pair:
  • Negative objects are identified – objects with a different class than Anchor/Positive
    • n1 (hard negative) – must be pushed further out
    • n2 (semi-hard negative)
    • n3 (easy negative)

Triplet loss – determining negatives
• Criteria for dividing the set of negatives:
  • Hard-negative mining (n ~ n1): negatives closer to the anchor than the positive
    dist(a, n) < dist(a, p)
  • Semi-hard negative mining (n ~ n2): negatives falling within the margin δ
    dist(a, p) < dist(a, n) < dist(a, p) + δ
  • Easy negatives (n ~ n3): negatives lying beyond the margin
    dist(a, p) + δ < dist(a, n)
• Suitability of negatives for training:
  • Only easy negatives – insignificant role in learning, if any at all
  • Only hard negatives – the network:
    • Converges to a local minimum at best, or
    • Collapses to a state in which all the embeddings are zero
  • The most suitable negatives for training are the semi-hard ones

Triplet loss – determining negatives
• Semi-hard negative mining (n2)
  • Negatives sufficiently close to the anchor
• The margin δ has to be set carefully:
  • Appropriate value – the network correctly distinguishes between positive/negative samples
  • Too large or too small a value – the network optimization process may get stuck in a local minimum
• To set the margin properly, all embeddings are normalized to unit length
  • The margin is typically set to δ = 0.2

Triplet loss – calculation
• List of triplets (x_i^a, x_i^p, x_i^n) of a batch of cardinality N:
    L = Σ_{i=1}^{N} max(0, dist²(f(x_i^a), f(x_i^p)) − dist²(f(x_i^a), f(x_i^n)) + δ)
• No contributions from the negatives that are outside the margin (max returns 0)
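A minimal sketch of the triplet loss above, assuming the triplets have already been mined and the embeddings are L2-normalized; the function and tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, delta: float = 0.2):
    """Triplet loss with squared Euclidean distances.
    anchor, positive, negative: (N, M) L2-normalized embeddings."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)   # dist^2(a, p)
    d_an = (anchor - negative).pow(2).sum(dim=1)   # dist^2(a, n)
    return torch.clamp(d_ap - d_an + delta, min=0).sum()

# Toy usage with unit-length embeddings
a = F.normalize(torch.randn(16, 128), dim=1)
p = F.normalize(torch.randn(16, 128), dim=1)
n = F.normalize(torch.randn(16, 128), dim=1)
print(triplet_loss(a, p, n).item())
```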
Results on CIFAR-10
• Training on the CIFAR-10 dataset
  • Contrastive loss learning: Precision@1 = 74%
  • Triplet loss learning: Precision@1 = 84%

Coding issues
• Goal – determine pairs/triplets → calculate pairwise distances
• Variables:
  • Batch of size B
  • Dimensionality of embeddings: M
  • Embeddings-data array X of shape B × M
  • Labels of embeddings: B × 1
• Easy solution – iterative processing (for-loops) to determine the distances
  • The cost of iterative processing is simply too high
• GPU solution – matrix multiplications (GPUs are not friends with for-loops)
  • A thousand-fold speedup when you eliminate the loops that would otherwise be needed for the distance estimation

Coding issues
• Easy solution – iterative processing
[Code snippet shown on the slide]

Coding issues
• GPU solution – matrix multiplications
  • Implementation based on tensors
  • With iterative processing: if the batch size is 128, this results in 8,192 fetches from the GPU memory (128²/2)
  • Solution using a single GPU fetch → >8,000x speedup
• 1) Determining pairs using the structure with labels (of dimensionality B):
[Code snippet shown on the slide]

Coding issues
• GPU solution – matrix multiplications
• 2) Calculating the distance matrix:
  • Squared Euclidean distance between vectors x and y: ‖x − y‖² = ‖x‖² − 2·x·yᵀ + ‖y‖²
    • The squared norms of the two vectors
    • The value of the dot product between the two vectors
  • Calculating the dot product of every pair of embedding vectors in X:
    dot_products = X @ X.T
  • The squared vector norms of the embedding vectors lie on the diagonal:
    squared_norms_embedding_vecs = torch.diagonal(dot_products)
  • Dot products between pairs of embedding vectors are in the off-diagonal elements
  • Calculating the B × B matrix of (squared) Euclidean distances (B = 6):
    distance_matrix = squared_norms_embedding_vecs.view(1, 6) − 2.0 * dot_products + squared_norms_embedding_vecs.view(6, 1)
  • Note: tensor operators are overloaded (broadcasting) to add or subtract tensors of different shapes; a runnable sketch of both steps follows below
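A runnable PyTorch sketch of the two GPU-side steps above – pair masks from the labels and the squared-distance matrix via broadcasting. The variable names follow the slides where possible; the rest (same_label, positive_mask, …) are illustrative assumptions:

```python
import torch

B, M = 6, 128
X = torch.randn(B, M)                      # batch of embeddings
labels = torch.tensor([0, 0, 1, 1, 2, 2])  # class label per embedding

# 1) Determining pairs: (B, B) boolean masks, self-pairs excluded
same_label = labels.unsqueeze(0) == labels.unsqueeze(1)
not_self = ~torch.eye(B, dtype=torch.bool)
positive_mask = same_label & not_self
negative_mask = ~same_label

# 2) Distance matrix: ||x - y||^2 = ||x||^2 - 2*x.y + ||y||^2
dot_products = X @ X.T
squared_norms_embedding_vecs = torch.diagonal(dot_products)
distance_matrix = (squared_norms_embedding_vecs.view(1, B)
                   - 2.0 * dot_products
                   + squared_norms_embedding_vecs.view(B, 1))
distance_matrix = distance_matrix.clamp(min=0)   # guard against tiny negative round-off

# Example: mean squared distance within positive and within negative pairs
print(distance_matrix[positive_mask].mean(), distance_matrix[negative_mask].mean())
```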
Image embeddings – what we know
• Training a neural network based on PC/triplet loss learning
• Transforming the query and each database image into an embedding
  • Mapping function f() extracting the embedding f(x) for an object x
• Given two images x and y, their closeness can be quantified by the Euclidean distance:
    dist(f(x), f(y)) = ‖f(x) − f(y)‖₂
[Figure: a neural network mapping x to f(x), e.g., (0.2, 0.9, …, 0.7)]

Search over image embeddings
• Query image ~ query embedding ~ query
• k-nearest neighbor (k-NN) query
  • Finding the k database images that are the most similar to the query image
  • Similarity between the query and database images based on the Euclidean/cosine distance between their embeddings
[Figure: search pipeline – image database → image representation → index; query → query representation → similarity → ranked images, sorted by decreasing similarity]

Search over image embeddings
• Many applications of image search, e.g.:
  • Photo organization – grouping images into albums automatically by recognizing similar faces, locations, or objects
  • Fashion and e-commerce – outfit pairing in online stores
    • Recommending matching items (e.g., shoes, bags, or accessories) by comparing product images based on style, color, and texture
    • Visual similarity in product search – enabling users to upload an image of a product to find similar items in the store
  • Cultural heritage and art preservation
    • Artifact identification – comparing images of newly found artifacts with existing ones to determine origin or classification
    • Style similarity matching – finding paintings with similar styles for study or curation
  • Duplicate image detection – removing similar or exact copies of an image in large databases to ensure unique content

Search over large image databases
• Brute-force approach:
  • Comparing the query embedding against each database embedding
  • O(N) complexity (N is the size of the dataset) – not scalable
• How to search when the database of embeddings is very large?
• Solution – Approximate Nearest Neighbor (ANN) algorithms
• Many algorithms exist, but the leading ones are:
  • Locality Sensitive Hashing (LSH) – PA212 course
  • Product Quantization (PQ)

Product Quantization (PQ)
• PQ – an extension of the very old Vector Quantization (VQ) idea
• VQ idea:
  • Compressing a high-dimensional image vector (embedding) into a single codeword, typically an integer value
• Pre-processing phase:
  • Create a mapping from all codewords to the original image vectors within the database
  • Each codeword points to the list of all the database images with that codeword
  • This mapping structure is referred to as the lookup table or inverted index
• Search phase:
  • Transform the query into a single codeword
  • Use the lookup table to get the candidate image vectors having the same codeword as the query
    • It is also possible to consider candidate vectors whose codeword is "relevant" (not strictly the same)
  • Compute the distance between the original query vector and all candidate vectors
  • Return the k nearest candidate vectors with respect to the query vector
[Figure: a vector (0.2, 0.9, …, 0.7) quantized to codeword 4; an inverted index with codewords 1–4 pointing to lists of DB vectors]

VQ
• Compression idea:
  • Partitioning a vector space into Voronoi cells with respect to a set of points
  • The points are typically called the centroids (pivots) of the cells in which they reside
• Voronoi diagram – partitioning the D-dimensional space of embeddings into cells {c1, c2, …, cn} determined by pivots {p1, p2, …, pn}; each cell ci gets its ID ~ codeword
• All the vectors in cell ci are closer to the pivot pi than to any other pivot
[Figure: an example Voronoi diagram for four pivots {p1, p2, p3, p4} in the 2D plane, cells labeled 1–4]

VQ
• Pre-processing phase:
  • Creating a codebook of n = 2^B codewords
    • The n centroids (~cells) are typically the K centroids generated by applying the K-means algorithm to (a sample of) the database of vectors
    • Any vector in the underlying vector space can be quantized to one of the K cluster centers
    • Vectors that fall in the same cluster are mapped to the same codeword
    • Clusters can be populated differently, depending on the data distribution
    • Each codeword is represented by an integer value → B-bit representation
  • Managing a lookup table with 2^B codewords
    • Transforming each database image vector into a codeword
    • Each codeword in the lookup table is associated with the list of database images (or paths to these images) having that codeword
• The dimensionality of the data can be reduced dramatically – example scenario:
  • Each image represented by a 512-D embedding vector of floats → 512 · 4 = 2,048 bytes
  • B = 16 → 512-D vectors quantized to 2^16 = 65,536 codewords
  • Each image then represented by a 16-bit code (2 bytes) → 1,024x compression
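A minimal sketch of the VQ pre-processing and search phases under toy sizes. It uses FAISS only for its k-means implementation (a plain scikit-learn KMeans would work the same way); all names and parameter values are illustrative assumptions, not from the slides:

```python
import numpy as np
import faiss

d, n_db, B = 64, 20_000, 8              # toy sizes; B bits -> 2^B codewords
n_codewords = 2 ** B

rng = np.random.default_rng(0)
xb = rng.standard_normal((n_db, d)).astype('float32')   # database embeddings

# Pre-processing: build the codebook with k-means and quantize the database
kmeans = faiss.Kmeans(d, n_codewords, niter=20, seed=123)
kmeans.train(xb)
_, codes = kmeans.index.search(xb, 1)    # nearest centroid id per vector
codes = codes.ravel()                    # one B-bit codeword per database vector

# Inverted index (lookup table): codeword -> ids of database vectors in that cell
inverted_index = {c: np.where(codes == c)[0] for c in range(n_codewords)}

# Search: quantize the query, then compare it only against candidates in its cell
xq = rng.standard_normal((1, d)).astype('float32')
_, q_code = kmeans.index.search(xq, 1)
candidates = inverted_index[int(q_code[0, 0])]
dists = np.linalg.norm(xb[candidates] - xq, axis=1)
topk = candidates[np.argsort(dists)[:5]]
print(topk)   # ids of the 5 nearest candidates within the query's cell
```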
VQ
• Search phase:
  • Before the query is transformed into a codeword, a ranking of the pivots is created
    • The distance between the query vector and each pivot vector must be calculated
  • The query gets the codeword corresponding to the nearest (best-ranked) pivot
    • Other "query-relevant" codewords can also be considered, e.g., the 2nd or 3rd best-ranked ones
  • The database vectors associated with the same codeword as the query (or with any of the "query-relevant" codewords) become candidates
  • The distance between the query vector and each candidate is evaluated
    • Each candidate vector must be loaded, e.g., from secondary storage
  • The k most similar candidates (those with the smallest distance) are returned as the query result
[Figure: illustration of query evaluation – the query vector falls into cell 4, so the candidates are the database vectors associated with codeword 4]

VQ
• Limitations of VQ:
  • If the codebook is too large (e.g., B = 64 → 2^64 clusters would be needed), it is impossible for K-means to detect such a huge number of clusters
  • Returning the k most query-relevant database vectors requires loading a large set of candidate vectors and calculating their distances with respect to the query

PQ
• Product quantization (PQ) idea:
  • The original database vector is split into several segments (sub-vectors)
  • Each sub-vector undergoes vector quantization independently
  • Quantized vector – the concatenation of the codewords of the individual segments
• Pre-processing phase:
  • Create a sub-quantizer – a codebook for each of the segments – separately (i.e., #codebooks = #segments)
    • A clustering operation is applied to each set of sub-vectors
  • Create a mapping from all codewords of each codebook to the original image vectors within the database
• Search phase:
  • The original database vectors need not be loaded (e.g., from secondary storage)
  • The distance between the query and a database vector is efficiently approximated
    • Based on pre-computed distances between the query segments and the centroids in each codebook
[Figure: a vector split into segments with codewords 2, 8, 9, 16]

PQ
• Pre-processing phase:
  • Creating m segments → m codebooks, each of n = 2^B codewords
    • Each codeword is again represented by an integer value → B-bit representation
    • The codewords of all segments are concatenated → (m · B)-bit representation
  • Managing a lookup table with 2^B codewords for each segment (m lookup tables)
• Example scenario:
  • An image as a 512-D vector of floats → 512 · 4 = 2,048 bytes
  • m = 32 → 32 segments (codebooks)
  • B = 8 → 16-D sub-vectors quantized to 2^8 = 256 codewords
  • Each image then stored as a (32 · 8)-bit code (32 bytes) → 64x compression
[Figure: a vector split into m = 4 segments with codewords 2, 8, 9, 16; each segment has its own Voronoi diagram with n = 4 (B = 2) pivots – codewords 1–4 for segment 1, 5–8 for segment 2, 9–12 for segment 3, 13–16 for segment 4]

PQ
• Search phase:
  • 1) Precompute the distances between each query sub-vector and each centroid in the corresponding codebook → m · 2^B distances in total
    • The distances are kept in a query distance matrix QDM of size m × 2^B
    • E.g., 32 · 256 = 8,192 distances in the previous example scenario, which is cheap to compute
• Illustration of the distance precomputation for m = 4, B = 2 (QDM of size m × 2^B = 4 × 4); rows correspond to segments, columns to the codewords of the corresponding codebook:
    QDM = | 0.4  0.2  0.6  0.3 |
          | 0.6  0.4  0.4  0.1 |
          | 0.2  0.4  0.6  0.7 |
          | 0.4  0.3  0.2  0.1 |
[Figure: the query vector split into m = 4 segments with codewords 2, 8, 9, 16, shown against the per-segment Voronoi diagrams from above]

PQ
• Search phase:
  • 2) Identify the nearest cluster(s) in the same way as in the VQ approach
  • 3) Approximate the distance of each candidate vector in the nearest cluster(s):
    • For the i-th segment and the associated codeword c_i of the candidate vector, approximate the sub-distance between the i-th query segment and the i-th candidate segment by the precomputed sub-distance QDM[i, c_i]
    • Sum the sub-distances over all the segments: Σ_{i=0}^{m−1} QDM[i, c_i]
• Illustration of approximating the distance for the query with segment codewords (2, 8, 9, 16):
  • Candidate (2, 8, 9, 15): Σ_{i=0}^{3} QDM[i, c_i] = 0.2 + 0.1 + 0.2 + 0.2 = 0.7
  • Candidate (2, 8, 12, 15): 0.2 + 0.1 + 0.7 + 0.2 = 1.2
  • Candidate (1, 8, 9, 16): 0.4 + 0.1 + 0.2 + 0.1 = 0.8
  • (A runnable sketch of this approximation follows below)
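A minimal NumPy sketch of the PQ distance approximation described above. The per-segment codebooks are assumed to have been trained already (here they are just random placeholders), and all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, B, sub_d = 4, 2, 16                 # segments, bits per segment, sub-vector dimension
n_codewords = 2 ** B
D = m * sub_d                          # full vector dimensionality

# Assumed to be already trained (one k-means per segment): shape (m, 2^B, sub_d)
codebooks = rng.standard_normal((m, n_codewords, sub_d)).astype('float32')

def encode(x):
    """Quantize a vector into m codeword ids (one per segment)."""
    segs = x.reshape(m, sub_d)
    return np.array([np.argmin(((codebooks[i] - segs[i]) ** 2).sum(axis=1))
                     for i in range(m)])

def approx_distances(query, codes):
    """Approximate squared distances between a query and PQ-encoded vectors."""
    q_segs = query.reshape(m, sub_d)
    # 1) QDM: squared distance from each query segment to each centroid of its codebook
    qdm = ((codebooks - q_segs[:, None, :]) ** 2).sum(axis=2)   # shape (m, 2^B)
    # 3) Sum the looked-up sub-distances over all segments
    return sum(qdm[i, codes[:, i]] for i in range(m))

# Toy usage: encode a small database, then rank it for one query
xb = rng.standard_normal((1000, D)).astype('float32')
codes = np.vstack([encode(v) for v in xb])          # (1000, m) codeword ids
xq = rng.standard_normal(D).astype('float32')
approx = approx_distances(xq, codes)
print(np.argsort(approx)[:5])                       # ids of the 5 nearest candidates
```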
PQ
• Summary:
  • For the distance approximations, the candidate vectors are not accessed
    • Accessing the candidate vectors can be a bottleneck in VQ, especially when the vectors are stored in secondary storage
  • ANN search can be carried out efficiently in high-dimensional vector spaces even when the database has billions of vectors
  • The approximated distances need not be perfect

Similarity search using FAISS
• FAISS – Facebook AI Similarity Search
  • A library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors
  • Useful for large-scale similarity search problems, which are common in various machine learning and information retrieval tasks
  • Designed to work on either the GPU or the CPU and provides significant performance improvements compared to other nearest neighbor search algorithms
  • One of the best implementations of the Product Quantization approach to similarity search
  • Implemented in C++ with Python bindings

FAISS
• Several techniques to achieve efficient similarity search:
  • Quantization – compresses the embeddings, which significantly reduces memory usage and accelerates distance computations
    • Supports Product Quantization (PQ)
  • Indexing – FAISS provides multiple index types for different use cases and trade-offs between search speed and search quality
    • Flat index – a brute-force index that computes exact distances between query vectors and indexed vectors
    • IVF (Inverted File) index – a partitioned index that divides the vector space into Voronoi cells
    • HNSW (Hierarchical Navigable Small World) – a graph-based index that builds a hierarchical graph structure, enabling efficient nearest neighbor search with logarithmic complexity

FAISS
• Example of the IVF (Inverted File) index:
  • The nprobe parameter specifies the number of nearest cluster(s) to be visited
  • Clusters are ranked by the distance between the cluster centroid and the query vector
[Figure: index selection guidelines]
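A minimal usage sketch combining the IVF partitioning with PQ compression via faiss.IndexIVFPQ; the data is random and all sizes and parameter values (nlist, m, nbits, nprobe) are illustrative choices, not recommendations from the slides:

```python
import numpy as np
import faiss

d, n_db, n_queries = 128, 20_000, 5
rng = np.random.default_rng(0)
xb = rng.standard_normal((n_db, d)).astype('float32')      # database embeddings
xq = rng.standard_normal((n_queries, d)).astype('float32')  # query embeddings

nlist = 256          # number of Voronoi cells (IVF partitions)
m, nbits = 16, 8     # PQ: 16 segments, 8 bits each -> 16 bytes per vector

quantizer = faiss.IndexFlatL2(d)                  # coarse quantizer for the IVF part
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)      # learn the coarse centroids and the per-segment PQ codebooks
index.add(xb)        # encode and store the database vectors

index.nprobe = 8     # number of nearest cells visited at query time
D, I = index.search(xq, 10)   # approximate 10-NN for each query
print(I[0])          # ids of the 10 approximate nearest neighbors of the first query
```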
FAISS
• Useful references
  • Tutorials:
    • https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/
    • https://github.com/facebookresearch/faiss/wiki/
  • Research papers:
    • Johnson et al.: Billion-scale similarity search with GPUs, 2017: https://arxiv.org/abs/1702.08734
    • Douze et al.: The FAISS library, 2024: https://arxiv.org/abs/2401.08281

Sources
• Avi Kak and Charles Bouman: Metric Learning with Deep Neural Networks. Purdue University, 2024
• https://www.pinecone.io/learn/series/faiss/faiss-tutorial/