NEURAL DOCUMENT RETRIEVAL
Martin Fajčík, MUNI, PV211, 2024
Last edited: 15.04.2024

Outline

Dense Retrieval vs Sparse Retrieval Model
(The PV211 final term may ask about this.)
• Dense vector: |1.1|16.3|7.8|-14.1|1.1|-4.3|-0.4|9.2|
• Sparse vector (the majority of elements have the same value): |0|0|0|0|-3.1|0|0|2.8|-4.1|0|-4.3|0|14.1|1.2|0|0|0|0|0|

Dense Passage Retrieval
[Figure: bi-encoder diagram. A Question Encoder embeds "What will the PV211 ask about?" and a Passage Encoder embeds "The 2022 test contained skip-lists, explanation of posting lists, and BM25".]

Dense Passage Retrieval
• The similarity function between the two representations is the dot product between the question embedding and the passage embedding.
• Assume a dataset with m training instances.
• Each instance contains one question and one relevant (positive) passage, along with n irrelevant (negative) passages.

Dense Passage Retrieval Training
• Cross-entropy contrastive loss, i.e. the positive passage must score higher than all n negatives:
$$L(q, p^+, p^-_1, \ldots, p^-_n) = -\log \frac{e^{\operatorname{sim}(q, p^+)}}{e^{\operatorname{sim}(q, p^+)} + \sum_{j=1}^{n} e^{\operatorname{sim}(q, p^-_j)}}$$
• Hyperparameters and model choices:
• The question encoder and passage encoder are both based on BERT pre-trained models.
• 21M 100-word passages from Wikipedia were indexed.
• Each passage is also prepended with the title of the Wikipedia article it comes from, separated with a [SEP] token.
• Batch size 128.
• Learning rate 1e-5, Adam optimizer, linear scheduling with warmup rate 0.1, dropout 0.1.
• The training set of each dataset had around 60k samples.

Dense Passage Retrieval Inference
• Index construction:
• Create embeddings for all (21M) passages using the passage encoder.
• This takes a lot of memory (fp16 of 21M 768-dimensional embeddings ≈ 32 GB).
• Query time:
• Use the query encoder to encode the question.
• Find the nearest neighbors with a full dot product (O(n)) against all 21M embeddings.
• Then compute the arg top-K to find the K nearest passages.
• (Optional) use approximate nearest-neighbor methods with logarithmic expected computational complexity, such as HNSW.

Dense Passage Retrieval Negative Sample Mining
• Top BM25 passage that does not contain the answer string (the original DPR was built for open-domain QA).
• In-batch negatives.
• Pros and cons of both? (discussion)
• BM25 negatives:
•C: The answer list is non-exhaustive, so false negatives are possible.
•C: Unclear how to mine such negatives for non-QA applications.
•P: BM25 is unsupervised.
•P: BM25 provides near-to-relevant negatives.
• In-batch negatives:
•P: Cheaply obtained, no need for extra encoding.
•C: Requires a large batch size.
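A minimal sketch (my illustration, not the DPR authors' code) of the contrastive loss above in the common in-batch-negatives form: every other passage in the batch serves as a negative for a given question. The names q_emb and p_emb are placeholders for the [CLS] embeddings produced by the two BERT encoders.

import torch
import torch.nn.functional as F

def dpr_in_batch_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """Cross-entropy contrastive loss with in-batch negatives.

    q_emb: (B, d) question embeddings from the question encoder.
    p_emb: (B, d) passage embeddings from the passage encoder;
           p_emb[i] is the positive passage for q_emb[i], and the other
           B-1 rows act as negatives for that question.
    """
    # Similarity function: dot product between question and passage embeddings.
    scores = q_emb @ p_emb.T                      # (B, B) similarity matrix
    targets = torch.arange(q_emb.size(0), device=q_emb.device)
    # Softmax over all passages in the batch; maximize the diagonal entries.
    return F.cross_entropy(scores, targets)

# Toy usage with random embeddings standing in for BERT outputs:
q = torch.randn(8, 768)
p = torch.randn(8, 768)
print(dpr_in_batch_loss(q, p))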
Cross-Encoder
• A cross-encoder tends to work better than a bi-encoder.
• A cross-encoder can't be used for (first-stage) retrieval.

Reranking
[Figure: the query "How much plastic Coca-cola produces globally?" is paired with each candidate passage: "…3.43 million metric tons in 2022.", "…admits it produces 3m tonnes of plastic packaging a year.", "Coca Cola and PepsiCo responsible for 25% of packaging pollution found on UK beaches.", "…25 per cent of its packaging globally to be reusable by 2030". The question and each candidate passage are concatenated, fed to BERT, and scored with binary cross-entropy; each passage is classified independently.]

Cross-Encoder
• Reranking dataset (from the original paper):
• Positives: relevant passages from the dataset.
• Negatives: top-1000 non-relevant passages.

Contriever: Parallels with DPR
• Cross-entropy loss similar to DPR; $\tau$ is a temperature hyperparameter (0.05 in both pretraining and finetuning).
• Based on a bi-encoder BERT architecture (same as DPR), but the encoder is shared between queries and passages.
• Inference is the same as with DPR.

Contriever Unsupervised Pretraining
• Pretraining with independent cropping:
• Positive samples are random subsequences of the same document in the pre-training corpus.
• “We use the random cropping data augmentation, with documents of 256 tokens and span sizes sampled between 5% and 50% of the document length. Documents are simply random piece of text sampled from a mix between Wikipedia and CCNet data, where half the batches are sampled from each source. We also apply token deletion with a probability of 10%.”

Contriever Unsupervised Pretraining
• Negative samples come from:
• In-batch negatives (with a very large batch of 2048).
• The MoCo algorithm (He et al., 2020).

Contriever Unsupervised Pretraining
• MoCo algorithm:
• Store representations from the last N previous batches in a queue and use them as (cheap) negative examples in the loss.
• The gradients for the similarities with these representations are only computed w.r.t. the parameters of the “query” encoder.
• Because the model can change rapidly during training, reusing old representations from the queue may hurt performance.
• Hence the authors define the document encoder as an exponential moving average of the query encoder.
• At every training update step:
• The query network $\theta_q$ is updated via the gradient computed by an SGD-like optimizer.
• The document network $\theta_d$ is updated from the new parameters of the query network:
$$\theta_d \leftarrow m\,\theta_d + (1-m)\,\theta_q$$
• The authors use m = 0.9995.
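A minimal sketch (my illustration, not the Contriever authors' code) of the two MoCo ingredients above: the momentum (EMA) update of the document encoder, and a FIFO queue of past document representations reused as cheap negatives. Here q_enc and d_enc are assumed to be two copies of the same BERT-style encoder module; gradients only flow through the query encoder.

import torch

@torch.no_grad()
def momentum_update(q_enc, d_enc, m=0.9995):
    # theta_d <- m * theta_d + (1 - m) * theta_q
    # (the document encoder is an exponential moving average of the query encoder)
    for p_q, p_d in zip(q_enc.parameters(), d_enc.parameters()):
        p_d.mul_(m).add_(p_q, alpha=1 - m)

class NegativeQueue:
    """FIFO queue holding detached document representations from previous batches."""
    def __init__(self, dim: int, size: int = 131072):
        self.buffer = torch.zeros(size, dim)
        self.size = size
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, reps: torch.Tensor):
        # reps: (B, dim) document embeddings from the momentum encoder (no gradient)
        b = reps.size(0)
        idx = torch.arange(self.ptr, self.ptr + b) % self.size
        self.buffer[idx] = reps
        self.ptr = (self.ptr + b) % self.size

    def negatives(self) -> torch.Tensor:
        # Used as additional negatives in the contrastive loss.
        return self.buffer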
Contriever Unsupervised Pretraining Hyperparameters
• Optimizer: AdamW
• Learning rate: 5e-5
• Batch size: 2048
• Steps: 500,000
• Queue: 131,072 (representations from the last 64 batches are kept)

Contriever Supervised Finetuning
• Batch size 1024
• Learning rate 1e-5
• Negatives:
• First phase: 20k steps with in-batch negatives only.
• Second phase: in-batch negatives in 90% of cases; in 10% of cases, hard negatives mined with the model trained in phase 1 (how many is not documented).

SPLADE (V2)
Formal, Thibault et al. “SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval.” arXiv abs/2109.10086 (2021)
• Learning sparse vectors for retrieval using a BERT-based model.
• A question / passage is a sequence of tokens from vocabulary V:
$$t = (t_1, t_2, \ldots, t_N)$$
• A sequence of contextualized representations is extracted as
$$h_1, h_2, \ldots, h_N = \operatorname{BERT}(t)$$
• Next, reuse the BERT-pretraining (MLM) “head” to compute the importance score of the i-th position representation for the j-th vocabulary token:
$$w_{ij} = \operatorname{transform}(h_i)^\top E_j + b_j$$
• $E_j$ is a row of BERT's token-embedding matrix.
• $b_j$ is a token-embedding bias.
• transform is a non-linear function, $\operatorname{transform}(h_i) = \operatorname{LayerNorm}(\operatorname{GeLU}(W h_i))$.
• Both are learned during BERT pretraining.
• The $w_{ij}$ are exactly the pre-softmax scores (MLM logits) of the i-th representation $h_i$ during pretraining.
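A minimal sketch of obtaining the scores $w_{ij}$ in practice: because they are exactly BERT's masked-language-modelling logits, a pretrained MLM head can be reused directly. This uses the Hugging Face transformers library; the checkpoint name "bert-base-uncased" is just an illustrative choice, not the checkpoint used by SPLADE.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "how much plastic does coca cola produce"
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    # logits[0, i, j] = w_ij: importance of vocabulary token j at input position i
    logits = mlm(**inputs).logits           # shape (1, N, |V|)
print(logits.shape)                         # |V| is ~30522 for bert-base-uncased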
SPLADE (V2)
• Keep only positive scores, dampen large ones, and aggregate over positions:
$$w_j = \sum_{i=1}^{N} \log\big(1 + \operatorname{ReLU}(w_{ij})\big), \qquad \operatorname{ReLU}(x) = \max(x, 0)$$
• Sanity check: what is the size of the vector w?
• |w| = |V| (the same size as the vocabulary).
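A minimal sketch of the aggregation above and of scoring with the resulting |V|-sized vectors. Toy random values stand in for the MLM scores $w_{ij}$; a real implementation would also mask out padding positions.

import torch

def splade_vector(logits: torch.Tensor) -> torch.Tensor:
    """logits: (N, |V|) per-position vocabulary scores w_ij.
    Returns one |V|-dimensional representation:
        w_j = sum_i log(1 + ReLU(w_ij))
    ReLU zeroes out negative scores; log(1 + .) dampens very large ones.
    """
    return torch.log1p(torch.relu(logits)).sum(dim=0)

# Toy example: a 6-token query against a 9-token passage, |V| = 30522.
V = 30522
w_q = splade_vector(torch.randn(6, V))
w_p = splade_vector(torch.randn(9, V))
score = torch.dot(w_q, w_p)                 # ranking score = sparse dot product
print(score, (w_q > 0).float().mean())      # second value = fraction of non-zero dims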
SPLADE Training
• Uses the same style of contrastive loss as DPR,
• combined with sparsity losses for the query and passage representations:
$$L = L_{rank}(q_i, p^+_i, p^-_{i,1}, p^-_{i,2}, \ldots, p^-_{i,n}) + \lambda_q L_{sparse}(w_q) + \lambda_d L_{sparse}(w_d)$$
• $\lambda_q$ and $\lambda_d$ are tuneable hyperparameters that control the sparsity strength.

Sparsity Loss (vanilla)
• Common sparsity losses are the L1/L2 regularization losses, e.g. $L_1(w) = \sum_j |w_j|$.
• Offsetting the loss won't affect the optimization (argmax [f(x)] = argmax [f(x)+c]).

Sparsity Loss (complexity analysis)
• However, different representations might all use the same dimensions, i.e. the set of representations is not sparse across vectors.
• L1/L2-style losses only care about sparsity within each individual vector.
• Example (each row is a passage vector; every vector is individually sparse, but the non-zero entries always fall in the same dimensions):
p1: | 2 | 0 | 0 | 0 | 0 | 3 |
p2: | 4 | 0 | 0 | 0 | 0 | 8 |
p3: | 1 | 0 | 0 | 0 | 0 | 2 |
• In this aligned case, the expected # of FLOPs for sim(w_q, w_p) is d/p, where d = |w| is the dimensionality (|V|) and 1/p is the average proportion of non-zero elements in w.
• How to enforce sparsity across vectors?
Biswajit Paria, Chih-Kuan Yeh, Ian E. H. Yen, Ning Xu, Pradeep Ravikumar, and Barnabás Póczos. 2020. Minimizing FLOPs to Learn Efficient Sparse Representations. arXiv:2004.05665

FLOPS Sparsity Loss
• Estimate the expected $w_j$ across a minibatch of M elements (p_1, p_2, …, p_M).
• Enforce sparsity of this estimate:
$$L_{FLOPS} = \sum_{j=1}^{|V|} \left( \frac{1}{M} \sum_{i=1}^{M} w^{(i)}_j \right)^2$$
• Under this loss, the probability that the n-th dimension is non-zero tends to be the same for all n (uniformly spread sparsity).
• With uniformly spread sparsity, the expected # of FLOPs for sim(w_q, w_p) drops to d/p², where d = |w| is the dimensionality (|V|) and 1/p is the average proportion of non-zero elements in w.
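A minimal sketch of the FLOPS regularizer above (following Paria et al., 2020): average each dimension's activation over the mini-batch, then penalize the squared means. This pushes non-zero entries to be spread uniformly across dimensions rather than concentrated in a few shared ones.

import torch

def flops_loss(w: torch.Tensor) -> torch.Tensor:
    """w: (M, |V|) non-negative sparse representations for a mini-batch of M texts.
    L_FLOPS = sum_j ( mean_i w[i, j] )^2
    Unlike an L1 penalty applied to each vector separately, this is small only
    when each individual dimension j is rarely active across the whole batch.
    """
    mean_per_dim = w.mean(dim=0)            # expected activation of each vocabulary dimension
    return (mean_per_dim ** 2).sum()

# Toy check: vectors that all activate the same two dimensions are penalized more
# than equally sparse vectors whose non-zeros are spread across different dimensions.
aligned = torch.zeros(4, 10)
aligned[:, :2] = 1.0                        # every vector activates dimensions 0 and 1
spread = torch.zeros(4, 10)
for i in range(4):
    spread[i, 2 * i:2 * i + 2] = 1.0        # same per-vector sparsity, different dimensions
print(flops_loss(aligned).item(), flops_loss(spread).item())   # 2.0 vs 0.5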
DistilSPLADE-max Training
1. Get triples for MS-MARCO from a traditional system, picking negatives at random from the top-1000 ranked passages. Construct the first dataset of triplets (D#1).
2. Train a Cross-Encoder on D#1 (CE#1).
3. Train SPLADE#1 on D#1 using distillation from CE#1.
4. Generate new negatives from the top-K passages retrieved by SPLADE#1; construct a new dataset of triplets (D#2).
5. Train a Cross-Encoder on D#2 (CE#2).
6. Train SPLADE#2 on D#2 using distillation from CE#2.

Distillation via Margin-MSE (Margin Mean Squared Error Loss)
• Assume a Teacher Model with score function $M_T$.
• E.g., a BERT output projected to a scalar, as in the Cross-Encoder.
• And a Student Model with score function $M_S$.
• E.g., the DPR/Contriever/SPLADE model's dot product.
• To avoid forcing specific scores from $M_T$ onto $M_S$, only the margins are distilled from Teacher into Student.
• Loss function for triples of queries Q, positive passages P+, and negative passages P−:
$$L = \operatorname{MSE}\big(M_S(Q, P^+) - M_S(Q, P^-),\; M_T(Q, P^+) - M_T(Q, P^-)\big)$$
• The first argument is the Student's error margin, the second the Teacher's error margin.
Notes:
- Model Distillation = transferring information between two models, training the Student Model from the Teacher Model.
- Score function = similarity function.

SPLADE Inference
• Similar to DPR/Contriever.
• Sparse-vector products are efficiently implemented in Numba/NumPy.

COLBERT V2
• Multi-vector query/document representations.
• Middle ground between a cross-encoder and a bi-encoder.
• Can be used for (first-stage) retrieval.
• However, its large passage index makes it better suited for smaller collections.

COLBERT V2
• Punctuation symbol embeddings are removed.
• A query is always represented with Nq (=32) tokens; shorter queries are padded.
• Documents are split so that they contain Nd (=300 on BEIR) representations.
• Representations are low-dimensional (each DPR vector has 768 dimensions; each of ColBERT's vectors has 128).
• Each vector representation is L2-normalized (= unit vectors).

COLBERT MaxSim
• Each token in the question/passage is encoded into a vector using BERT.
• The similarity between query and passage is computed between the two matrices Q and P (cosine similarity, since the vectors are unit-normalized).
• Max-pooled over document representations, then summed over query tokens (a minimal sketch appears at the end of these notes).

COLBERT Training
• Similar to the previous methods.
• V2 uses hard negatives, cross-encoder distillation, and in-batch negatives.

COLBERT Inference (PLAID)
• Covered in tomorrow's practical.

MTEB: Massive Text Embedding Benchmark
• https://huggingface.co/spaces/mteb/leaderboard
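Returning to ColBERT's MaxSim operator described above, a minimal sketch (my illustration, not ColBERT's code; real ColBERT token embeddings come from BERT plus a linear projection down to 128 dimensions):

import torch
import torch.nn.functional as F

def maxsim_score(Q: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
    """Q: (Nq, d) query token embeddings, P: (Nd, d) passage token embeddings.
    Rows are L2-normalized, so dot products are cosine similarities.
    Score = sum over query tokens of the maximum similarity to any passage token.
    """
    Q = F.normalize(Q, dim=-1)
    P = F.normalize(P, dim=-1)
    sim = Q @ P.T                           # (Nq, Nd) token-to-token cosine similarities
    return sim.max(dim=1).values.sum()      # max over passage tokens, sum over query tokens

# Toy usage: a 32-token query against a 300-token passage, 128-dimensional vectors.
score = maxsim_score(torch.randn(32, 128), torch.randn(300, 128))
print(score)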