NEURAL DOCUMENT RETRIEVAL
Martin Fajčík, MUNI, PV211, 2024
Last edited: 15.04.2024

Outline

Dense Retrieval vs Sparse Retrieval Model
(The PV211 final term may ask about this.)
• Dense vector: |1.1|16.3|7.8|-14.1|1.1|-4.3|-0.4|9.2|
• Sparse vector (the majority of elements have the same value): |0|0|0|0|-3.1|0|0|2.8|-4.1|0|-4.3|0|14.1|1.2|0|0|0|0|0|

Dense Passage Retrieval
[Figure: bi-encoder diagram. A Question Encoder embeds "What will the PV211 ask about?" and a Passage Encoder embeds "The 2022 test contained skip-lists, explanation of posting lists, and BM25".]

Dense Passage Retrieval
• The similarity function between the two representations is the dot product between the question embedding and the passage embedding.
• Assume a dataset with m training instances.
• Each instance contains one question and one relevant (positive) passage, along with n irrelevant (negative) passages.

Dense Passage Retrieval Training
• Cross-entropy contrastive loss, i.e. the positive passage must score higher than all n negatives:
$$L(q, p^+, p^-_1, \ldots, p^-_n) = -\log \frac{e^{\operatorname{sim}(q, p^+)}}{e^{\operatorname{sim}(q, p^+)} + \sum_{j=1}^{n} e^{\operatorname{sim}(q, p^-_j)}}$$
• Hyperparameters and model choices:
• The question encoder and passage encoder are both based on BERT pre-trained models.
• 21M 100-word passages from Wikipedia were indexed.
• Each passage is also prepended with the title of the Wikipedia article it comes from, separated with a [SEP] token.
• Batch size 128.
• Learning rate 1e-5, Adam optimizer, linear scheduling with warmup rate 0.1, dropout 0.1.
• The training set of each dataset had around 60k samples.

Dense Passage Retrieval Inference
• Index construction:
• Create embeddings for all (21M) passages using the passage encoder.
• This takes a lot of memory (fp16 of 21M 768-dimensional embeddings ≈ 32 GB).
• Query time:
• Use the query encoder to encode the question.
• Find the nearest neighbors with a full dot product (O(n)) against all 21M embeddings.
• Then compute the arg top-K to find the K nearest passages.
• (Optional) use approximate nearest-neighbor methods with logarithmic expected computational complexity, such as HNSW.

Dense Passage Retrieval Negative Sample Mining
• Top BM25 passage that does not contain the answer string (the original DPR was built for open-domain QA).
• In-batch negatives.
• Pros and cons of both? (discussion)
• BM25 negatives:
•C: The answer list is non-exhaustive, so false negatives are possible.
•C: Unclear how to mine such negatives for non-QA applications.
•P: BM25 is unsupervised.
•P: BM25 provides near-to-relevant negatives.
• In-batch negatives:
•P: Cheaply obtained, no need for extra encoding.
•C: Requires a large batch size.
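A minimal sketch (my illustration, not the DPR authors' code) of the contrastive loss above in the common in-batch-negatives form: every other passage in the batch serves as a negative for a given question. The names q_emb and p_emb are placeholders for the [CLS] embeddings produced by the two BERT encoders.

import torch
import torch.nn.functional as F

def dpr_in_batch_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """Cross-entropy contrastive loss with in-batch negatives.

    q_emb: (B, d) question embeddings from the question encoder.
    p_emb: (B, d) passage embeddings from the passage encoder;
           p_emb[i] is the positive passage for q_emb[i], and the other
           B-1 rows act as negatives for that question.
    """
    # Similarity function: dot product between question and passage embeddings.
    scores = q_emb @ p_emb.T                      # (B, B) similarity matrix
    targets = torch.arange(q_emb.size(0), device=q_emb.device)
    # Softmax over all passages in the batch; maximize the diagonal entries.
    return F.cross_entropy(scores, targets)

# Toy usage with random embeddings standing in for BERT outputs:
q = torch.randn(8, 768)
p = torch.randn(8, 768)
print(dpr_in_batch_loss(q, p))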
Cross-Encoder
• A cross-encoder tends to work better than a bi-encoder.
• A cross-encoder can't be used for (first-stage) retrieval.

Reranking
[Figure: the query "How much plastic Coca-cola produces globally?" is paired with each candidate passage: "…3.43 million metric tons in 2022.", "…admits it produces 3m tonnes of plastic packaging a year.", "Coca Cola and PepsiCo responsible for 25% of packaging pollution found on UK beaches.", "…25 per cent of its packaging globally to be reusable by 2030". The question and each candidate passage are concatenated, fed to BERT, and scored with binary cross-entropy; each passage is classified independently.]

Cross-Encoder
• Reranking dataset (from the original paper):
• Positives: relevant passages from the dataset.
• Negatives: top-1000 non-relevant passages.

Contriever: Parallels with DPR
• Cross-entropy loss similar to DPR; $\tau$ is a temperature hyperparameter (0.05 in both pretraining and finetuning).
• Based on a bi-encoder BERT architecture (same as DPR), but the encoder is shared between queries and passages.
• Inference is the same as with DPR.

Contriever Unsupervised Pretraining
• Pretraining with independent cropping:
• Positive samples are random subsequences of the same document in the pre-training corpus.
• “We use the random cropping data augmentation, with documents of 256 tokens and span sizes sampled between 5% and 50% of the document length. Documents are simply random piece of text sampled from a mix between Wikipedia and CCNet data, where half the batches are sampled from each source. We also apply token deletion with a probability of 10%.”

Contriever Unsupervised Pretraining
• Negative samples come from:
• In-batch negatives (with a very large batch of 2048).
• The MoCo algorithm (He et al., 2020).

Contriever Unsupervised Pretraining
• MoCo algorithm:
• Store representations from the last N previous batches in a queue and use them as (cheap) negative examples in the loss.
• The gradients for the similarities with these representations are only computed w.r.t. the parameters of the “query” encoder.
• Because the model can change rapidly during training, reusing old representations from the queue may hurt performance.
• Hence the authors define the document encoder as an exponential moving average of the query encoder.
• At every training update step:
• The query network $\theta_q$ is updated via the gradient computed by an SGD-like optimizer.
• The document network $\theta_d$ is updated from the new parameters of the query network:
$$\theta_d \leftarrow m\,\theta_d + (1-m)\,\theta_q$$
• The authors use m = 0.9995.
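A minimal sketch (my illustration, not the Contriever authors' code) of the two MoCo ingredients above: the momentum (EMA) update of the document encoder, and a FIFO queue of past document representations reused as cheap negatives. Here q_enc and d_enc are assumed to be two copies of the same BERT-style encoder module; gradients only flow through the query encoder.

import torch

@torch.no_grad()
def momentum_update(q_enc, d_enc, m=0.9995):
    # theta_d <- m * theta_d + (1 - m) * theta_q
    # (the document encoder is an exponential moving average of the query encoder)
    for p_q, p_d in zip(q_enc.parameters(), d_enc.parameters()):
        p_d.mul_(m).add_(p_q, alpha=1 - m)

class NegativeQueue:
    """FIFO queue holding detached document representations from previous batches."""
    def __init__(self, dim: int, size: int = 131072):
        self.buffer = torch.zeros(size, dim)
        self.size = size
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, reps: torch.Tensor):
        # reps: (B, dim) document embeddings from the momentum encoder (no gradient)
        b = reps.size(0)
        idx = torch.arange(self.ptr, self.ptr + b) % self.size
        self.buffer[idx] = reps
        self.ptr = (self.ptr + b) % self.size

    def negatives(self) -> torch.Tensor:
        # Used as additional negatives in the contrastive loss.
        return self.buffer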
Contriever Unsupervised Pretraining Hyperparameters
• Optimizer: AdamW
• Learning rate: 5e-5
• Batch size: 2048
• Steps: 500,000
• Queue: 131,072 (representations from the last 64 batches are kept)

Contriever Supervised Finetuning
• Batch size 1024
• Learning rate 1e-5
• Negatives:
• First phase: 20k steps with in-batch negatives only.
• Second phase: in-batch negatives in 90% of cases; in 10% of cases, hard negatives mined with the model trained in phase 1 (how many is not documented).

SPLADE (V2)
Formal, Thibault et al. “SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval.” arXiv abs/2109.10086 (2021)
• Learning sparse vectors for retrieval using a BERT-based model.
• A question / passage is a sequence of tokens from vocabulary V:
$$t = (t_1, t_2, \ldots, t_N)$$
• A sequence of contextualized representations is extracted as
$$h_1, h_2, \ldots, h_N = \operatorname{BERT}(t)$$
• Next, reuse the BERT-pretraining (MLM) “head” to compute the importance score of the i-th position representation for the j-th vocabulary token:
$$w_{ij} = \operatorname{transform}(h_i)^\top E_j + b_j$$
• $E_j$ is a row of BERT's token-embedding matrix.
• $b_j$ is a token-embedding bias.
• transform is a non-linear function, $\operatorname{transform}(h_i) = \operatorname{LayerNorm}(\operatorname{GeLU}(W h_i))$.
• Both are learned during BERT pretraining.
• The $w_{ij}$ are exactly the pre-softmax scores (MLM logits) of the i-th representation $h_i$ during pretraining.
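A minimal sketch of obtaining the scores $w_{ij}$ in practice: because they are exactly BERT's masked-language-modelling logits, a pretrained MLM head can be reused directly. This uses the Hugging Face transformers library; the checkpoint name "bert-base-uncased" is just an illustrative choice, not the checkpoint used by SPLADE.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "how much plastic does coca cola produce"
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    # logits[0, i, j] = w_ij: importance of vocabulary token j at input position i
    logits = mlm(**inputs).logits           # shape (1, N, |V|)
print(logits.shape)                         # |V| is ~30522 for bert-base-uncased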
SPLADE (V2)
• Keep only positive scores, dampen large ones, and aggregate over positions:
$$w_j = \sum_{i=1}^{N} \log\big(1 + \operatorname{ReLU}(w_{ij})\big), \qquad \operatorname{ReLU}(x) = \max(x, 0)$$
• Sanity check: what is the size of the vector w?
• |w| = |V| (the same size as the vocabulary).
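A minimal sketch of the aggregation above and of scoring with the resulting |V|-sized vectors. Toy random values stand in for the MLM scores $w_{ij}$; a real implementation would also mask out padding positions.

import torch

def splade_vector(logits: torch.Tensor) -> torch.Tensor:
    """logits: (N, |V|) per-position vocabulary scores w_ij.
    Returns one |V|-dimensional representation:
        w_j = sum_i log(1 + ReLU(w_ij))
    ReLU zeroes out negative scores; log(1 + .) dampens very large ones.
    """
    return torch.log1p(torch.relu(logits)).sum(dim=0)

# Toy example: a 6-token query against a 9-token passage, |V| = 30522.
V = 30522
w_q = splade_vector(torch.randn(6, V))
w_p = splade_vector(torch.randn(9, V))
score = torch.dot(w_q, w_p)                 # ranking score = sparse dot product
print(score, (w_q > 0).float().mean())      # second value = fraction of non-zero dims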
SPLADE Training
• Uses the same style of contrastive loss as DPR,
• combined with sparsity losses for the query and passage representations:
$$L = L_{rank}(q_i, p^+_i, p^-_{i,1}, p^-_{i,2}, \ldots, p^-_{i,n}) + \lambda_q L_{sparse}(w_q) + \lambda_d L_{sparse}(w_d)$$
• $\lambda_q$ and $\lambda_d$ are tuneable hyperparameters that control the sparsity strength.

Sparsity Loss (vanilla)
• Common sparsity losses are the L1/L2 regularization losses, e.g. $L_1(w) = \sum_j |w_j|$.
• Offsetting the loss won't affect the optimization (argmax [f(x)] = argmax [f(x)+c]).

Sparsity Loss (complexity analysis)
• However, different representations might all use the same dimensions, i.e. the set of representations is not sparse across vectors.
• L1/L2-style losses only care about sparsity within each individual vector.
• Example (each row is a passage vector; every vector is individually sparse, but the non-zero entries always fall in the same dimensions):
p1: | 2 | 0 | 0 | 0 | 0 | 3 |
p2: | 4 | 0 | 0 | 0 | 0 | 8 |
p3: | 1 | 0 | 0 | 0 | 0 | 2 |
• In this aligned case, the expected # of FLOPs for sim(w_q, w_p) is d/p, where d = |w| is the dimensionality (|V|) and 1/p is the average proportion of non-zero elements in w.
• How to enforce sparsity across vectors?
Biswajit Paria, Chih-Kuan Yeh, Ian E. H. Yen, Ning Xu, Pradeep Ravikumar, and Barnabás Póczos. 2020. Minimizing FLOPs to Learn Efficient Sparse Representations. arXiv:2004.05665

FLOPS Sparsity Loss
• Estimate the expected $w_j$ across a minibatch of M elements (p_1, p_2, …, p_M).
• Enforce sparsity of this estimate:
$$L_{FLOPS} = \sum_{j=1}^{|V|} \left( \frac{1}{M} \sum_{i=1}^{M} w^{(i)}_j \right)^2$$
• Under this loss, the probability that the n-th dimension is non-zero tends to be the same for all n (uniformly spread sparsity).
• With uniformly spread sparsity, the expected # of FLOPs for sim(w_q, w_p) drops to d/p², where d = |w| is the dimensionality (|V|) and 1/p is the average proportion of non-zero elements in w.
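A minimal sketch of the FLOPS regularizer above (following Paria et al., 2020): average each dimension's activation over the mini-batch, then penalize the squared means. This pushes non-zero entries to be spread uniformly across dimensions rather than concentrated in a few shared ones.

import torch

def flops_loss(w: torch.Tensor) -> torch.Tensor:
    """w: (M, |V|) non-negative sparse representations for a mini-batch of M texts.
    L_FLOPS = sum_j ( mean_i w[i, j] )^2
    Unlike an L1 penalty applied to each vector separately, this is small only
    when each individual dimension j is rarely active across the whole batch.
    """
    mean_per_dim = w.mean(dim=0)            # expected activation of each vocabulary dimension
    return (mean_per_dim ** 2).sum()

# Toy check: vectors that all activate the same two dimensions are penalized more
# than equally sparse vectors whose non-zeros are spread across different dimensions.
aligned = torch.zeros(4, 10)
aligned[:, :2] = 1.0                        # every vector activates dimensions 0 and 1
spread = torch.zeros(4, 10)
for i in range(4):
    spread[i, 2 * i:2 * i + 2] = 1.0        # same per-vector sparsity, different dimensions
print(flops_loss(aligned).item(), flops_loss(spread).item())   # 2.0 vs 0.5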
DistilSPLADE-max Training
1. Get triples for MS-MARCO from a traditional system, picking negatives at random from the top-1000 ranked passages. Construct the first dataset of triplets (D#1).
2. Train a Cross-Encoder on D#1 (CE#1).
3. Train SPLADE#1 on D#1 using distillation from CE#1.
4. Generate new negatives from the top-K passages retrieved by SPLADE#1; construct a new dataset of triplets (D#2).
5. Train a Cross-Encoder on D#2 (CE#2).
6. Train SPLADE#2 on D#2 using distillation from CE#2.

Distillation via Margin-MSE (Margin Mean Squared Error Loss)
• Assume a Teacher Model with score function $M_T$.
• E.g., a BERT output projected to a scalar, as in the Cross-Encoder.
• And a Student Model with score function $M_S$.
• E.g., the DPR/Contriever/SPLADE model's dot product.
• To avoid forcing specific scores from $M_T$ onto $M_S$, only the margins are distilled from Teacher into Student.
• Loss function for triples of queries Q, positive passages P+, and negative passages P−:
$$L = \operatorname{MSE}\big(M_S(Q, P^+) - M_S(Q, P^-),\; M_T(Q, P^+) - M_T(Q, P^-)\big)$$
• The first argument is the Student's error margin, the second the Teacher's error margin.
Notes:
- Model Distillation = transferring information between two models, training the Student Model from the Teacher Model.
- Score function = similarity function.

SPLADE Inference
• Similar to DPR/Contriever.
• Sparse-vector products are efficiently implemented in Numba/NumPy.

COLBERT V2
• Multi-vector query/document representations.
• Middle ground between a cross-encoder and a bi-encoder.
• Can be used for (first-stage) retrieval.
• However, its large passage index makes it better suited for smaller collections.

COLBERT V2
• Punctuation symbol embeddings are removed.
• A query is always represented with Nq (=32) tokens; shorter queries are padded.
• Documents are split so that they contain Nd (=300 on BEIR) representations.
• Representations are low-dimensional (each DPR vector has 768 dimensions; each of ColBERT's vectors has 128).
• Each vector representation is L2-normalized (= unit vectors).

COLBERT MaxSim
• Each token in the question/passage is encoded into a vector using BERT.
• The similarity between query and passage is computed between the two matrices Q and P (cosine similarity, since the vectors are unit-normalized).
• Max-pooled over document representations, then summed over query tokens (a minimal sketch appears at the end of these notes).

COLBERT Training
• Similar to the previous methods.
• V2 uses hard negatives, cross-encoder distillation, and in-batch negatives.

COLBERT Inference (PLAID)
• Covered in tomorrow's practical.

MTEB: Massive Text Embedding Benchmark
• https://huggingface.co/spaces/mteb/leaderboard
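Returning to ColBERT's MaxSim operator described above, a minimal sketch (my illustration, not ColBERT's code; real ColBERT token embeddings come from BERT plus a linear projection down to 128 dimensions):

import torch
import torch.nn.functional as F

def maxsim_score(Q: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
    """Q: (Nq, d) query token embeddings, P: (Nd, d) passage token embeddings.
    Rows are L2-normalized, so dot products are cosine similarities.
    Score = sum over query tokens of the maximum similarity to any passage token.
    """
    Q = F.normalize(Q, dim=-1)
    P = F.normalize(P, dim=-1)
    sim = Q @ P.T                           # (Nq, Nd) token-to-token cosine similarities
    return sim.max(dim=1).values.sum()      # max over passage tokens, sum over query tokens

# Toy usage: a 32-token query against a 300-token passage, 128-dimensional vectors.
score = maxsim_score(torch.randn(32, 128), torch.randn(300, 128))
print(score)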