Natural Language Processing with Deep Learning
CS224N/Ling284
John Hewitt
Lecture 8: Self-Attention and Transformers
Adapted from slides by Anna Goldie, John Hewitt

The Transformer Decoder
• A Transformer decoder is how we'll build systems like language models.
• It's a lot like our minimal self-attention architecture, but with a few more components.
• The embeddings and position embeddings are identical.
• We'll next replace our self-attention with multi-head self-attention.
[Figure: Transformer Decoder diagram]

Recall the Self-Attention Hypothetical Example
[Figure: hypothetical self-attention example]

Hypothetical Example of Multi-Head Attention
[Figure: hypothetical multi-head attention example]

Sequence-Stacked form of Attention
• Let's look at how key-query-value attention is computed, in matrices.
• Let X = [x_1; …; x_n] ∈ ℝ^{n×d} be the concatenation of input vectors.
• First, note that XK ∈ ℝ^{n×d}, XQ ∈ ℝ^{n×d}, XV ∈ ℝ^{n×d}.
• The output is defined as output = softmax(XQ(XK)^⊤)XV ∈ ℝ^{n×d}.
• First, take the query-key dot products in one matrix multiplication: XQ(XK)^⊤ = XQK^⊤X^⊤ ∈ ℝ^{n×n}. All pairs of attention scores!
• Next, softmax, and compute the weighted average with another matrix multiplication: softmax(XQK^⊤X^⊤)XV = output ∈ ℝ^{n×d}.

Multi-headed attention
• What if we want to look in multiple places in the sentence at once?
• For word i, self-attention "looks" where x_i^⊤Q^⊤Kx_j is high, but maybe we want to focus on different j for different reasons?
• We'll define multiple attention "heads" through multiple Q, K, V matrices.
• Let Q_ℓ, K_ℓ, V_ℓ ∈ ℝ^{d×d/h}, where h is the number of attention heads, and ℓ ranges from 1 to h.
• Each attention head performs attention independently:
  • output_ℓ = softmax(XQ_ℓK_ℓ^⊤X^⊤)XV_ℓ, where output_ℓ ∈ ℝ^{n×d/h}.
• Then the outputs of all the heads are combined:
  • output = [output_1; …; output_h]Y, where Y ∈ ℝ^{d×d}.
• Each head gets to "look" at different things, and construct value vectors differently.

Multi-head self-attention is computationally efficient
• Even though we compute h many attention heads, it's not really more costly.
• We compute XQ ∈ ℝ^{n×d}, and then reshape to ℝ^{n×h×d/h}. (Likewise for XK, XV.)
• Then we transpose to ℝ^{h×n×d/h}; now the head axis is like a batch axis.
• Almost everything else is identical, and the matrices are the same sizes.
• First, take the query-key dot products in one matrix multiplication per head: XQK^⊤X^⊤ = P ∈ ℝ^{h×n×n} (the figure shows h = 3), giving h sets of all-pairs attention scores.
• Next, softmax, and compute the weighted average with another matrix multiplication, softmax(P)XV, then concatenate the heads: output ∈ ℝ^{n×d}.

Scaled Dot Product [Vaswani et al., 2017]
• "Scaled dot product" attention aids in training.
• When dimensionality d becomes large, dot products between vectors tend to become large.
• Because of this, inputs to the softmax function can be large, making the gradients small.
• Instead of the self-attention function we've seen:
  output_ℓ = softmax(XQ_ℓK_ℓ^⊤X^⊤)XV_ℓ
• We divide the attention scores by √(d/h), to stop the scores from becoming large just as a function of d/h (the dimensionality divided by the number of heads):
  output_ℓ = softmax(XQ_ℓK_ℓ^⊤X^⊤ / √(d/h))XV_ℓ
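To make the preceding slides concrete, here is a minimal PyTorch sketch of sequence-stacked, multi-head, scaled dot-product self-attention. The function and variable names (multi_head_self_attention, X, Q, K, V, Y, h) are illustrative, not from the lecture code; causal masking (needed for a decoder) and the rest of the block are omitted.

```python
import torch
import torch.nn.functional as F

def multi_head_self_attention(X, Q, K, V, Y, h):
    """Minimal multi-head scaled dot-product self-attention sketch.

    X: (n, d) stacked input vectors.
    Q, K, V: (d, d) projection matrices (conceptually h column-blocks of size d x d/h).
    Y: (d, d) output projection that mixes the concatenated heads.
    h: number of attention heads (must divide d).
    """
    n, d = X.shape
    # Project once, then split the feature dimension into h heads of size d/h.
    # Shapes: (n, d) -> (n, h, d/h) -> (h, n, d/h); the head axis acts like a batch axis.
    q = (X @ Q).reshape(n, h, d // h).transpose(0, 1)
    k = (X @ K).reshape(n, h, d // h).transpose(0, 1)
    v = (X @ V).reshape(n, h, d // h).transpose(0, 1)

    # All-pairs attention scores per head: (h, n, n), scaled by sqrt(d/h).
    scores = q @ k.transpose(-2, -1) / (d // h) ** 0.5
    weights = F.softmax(scores, dim=-1)

    # Weighted average of values per head, then concatenate heads: (h, n, d/h) -> (n, d).
    per_head = weights @ v
    concatenated = per_head.transpose(0, 1).reshape(n, d)
    return concatenated @ Y

# Example usage with random weights (illustrative only):
n, d, h = 5, 64, 8
X = torch.randn(n, d)
Q, K, V, Y = [torch.randn(d, d) / d ** 0.5 for _ in range(4)]
out = multi_head_self_attention(X, Q, K, V, Y, h)   # shape: (5, 64)
```

Note that, as on the efficiency slide, all heads are computed with one projection per matrix followed by a reshape, so the cost is essentially the same as single-head attention.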
The Transformer Decoder
• Now that we've replaced self-attention with multi-head self-attention, we'll go through two optimization tricks that end up being important:
  • Residual Connections
  • Layer Normalization
• In most Transformer diagrams, these are often written together as "Add & Norm".
[Figure: Transformer Decoder diagram]

The Transformer Encoder: Residual connections [He et al., 2016]
• Residual connections are a trick to help models train better.
• Instead of X^(i) = Layer(X^(i−1)) (where i represents the layer),
• we let X^(i) = X^(i−1) + Layer(X^(i−1)) (so we only have to learn "the residual" from the previous layer).
• The gradient is great through the residual connection; it's 1!
• Bias towards the identity function!
[Figure: a layer without and with a residual connection; loss landscape visualization, Li et al., 2018, on a ResNet: [no residuals] vs. [residuals]]

The Transformer Encoder: Layer normalization [Ba et al., 2016]
• Layer normalization is a trick to help models train faster.
• Idea: cut down on uninformative variation in hidden vector values by normalizing to unit mean and standard deviation within each layer.
• LayerNorm's success may be due to its normalizing gradients [Xu et al., 2019].
• Let x ∈ ℝ^d be an individual (word) vector in the model.
• Let μ = (1/d) Σ_{j=1}^{d} x_j; this is the mean; μ ∈ ℝ.
• Let σ = √((1/d) Σ_{j=1}^{d} (x_j − μ)²); this is the standard deviation; σ ∈ ℝ.
• Let γ ∈ ℝ^d and β ∈ ℝ^d be learned "gain" and "bias" parameters. (Can omit!)
• Then layer normalization computes:
  output = ((x − μ) / (σ + ε)) ∗ γ + β
  (Normalize by the scalar mean and standard deviation; modulate by the learned elementwise gain and bias.)

The Transformer Decoder
• The Transformer Decoder is a stack of Transformer Decoder Blocks.
• Each Block consists of:
  • Self-attention
  • Add & Norm
  • Feed-Forward
  • Add & Norm
• That's it! We've gone through the Transformer Decoder.
[Figure: Transformer Decoder diagram]

The Transformer Encoder
• The Transformer Decoder constrains to unidirectional context, as for language models.
• What if we want bidirectional context, like in a bidirectional RNN?
• This is the Transformer Encoder. The only difference is that we remove the masking in the self-attention.
[Figure: Transformer Decoder diagram with "No Masking!"]

The Transformer Encoder-Decoder
• Recall that in machine translation, we processed the source sentence with a bidirectional model and generated the target with a unidirectional model.
• For this kind of seq2seq format, we often use a Transformer Encoder-Decoder.
• We use a normal Transformer Encoder.
• Our Transformer Decoder is modified to perform cross-attention to the output of the Encoder.

Cross-attention (details)
• We saw that self-attention is when keys, queries, and values come from the same source.
• In the decoder, we have attention that looks more like what we saw last week.
• Let h_1, …, h_n be output vectors from the Transformer encoder; h_i ∈ ℝ^d.
• Let z_1, …, z_n be input vectors from the Transformer decoder; z_i ∈ ℝ^d.
• Then keys and values are drawn from the encoder (like a memory):
  • k_i = Kh_i, v_i = Vh_i.
• And the queries are drawn from the decoder: q_i = Qz_i.
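Here is a minimal, single-head sketch of the cross-attention equations above. The names (cross_attention, H, Z) are mine; multi-head splitting, the decoder's masked self-attention, and the surrounding Add & Norm layers are omitted, and the √d scaling is added following the scaled-dot-product slide even though this slide does not show it.

```python
import torch
import torch.nn.functional as F

def cross_attention(H, Z, K, V, Q):
    """Cross-attention sketch: queries from the decoder, keys/values from the encoder.

    H: (n, d) encoder output vectors h_1..h_n, used like a memory.
    Z: (m, d) decoder vectors z_1..z_m.
    K, V, Q: (d, d) projection matrices; single head for simplicity.
    """
    d = H.shape[1]
    keys = H @ K       # (n, d): rows are k_i = K h_i
    values = H @ V     # (n, d): rows are v_i = V h_i
    queries = Z @ Q    # (m, d): rows are q_i = Q z_i

    # Each decoder position attends over all encoder positions.
    scores = queries @ keys.T / d ** 0.5   # (m, n)
    weights = F.softmax(scores, dim=-1)    # each row sums to 1
    return weights @ values                # (m, d)

# Example usage (illustrative only):
n, m, d = 7, 4, 64
H, Z = torch.randn(n, d), torch.randn(m, d)
K, V, Q = [torch.randn(d, d) / d ** 0.5 for _ in range(3)]
out = cross_attention(H, Z, K, V, Q)   # shape: (4, 64)
```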
Great Results with Transformers: Machine Translation [Vaswani et al., 2017]
• First, machine translation results from the original Transformer paper!
• Not just better machine translation BLEU scores; also more efficient to train!
• [Test sets: WMT 2014 English-German and English-French]

Great Results with Transformers: SuperGLUE [Wang et al., 2019]
• SuperGLUE is a suite of challenging NLP tasks, including question answering, word sense disambiguation, coreference resolution, and natural language inference.
• [Test sets: SuperGLUE Leaderboard, version 2.0]

Great Results with Transformers: Rise of Large Language Models!
• Today, Transformer-based models dominate the LMSYS Chatbot Arena Leaderboard! [Chiang et al., 2024]

Transformers Even Show Promise Outside of NLP
• Image Classification [Dosovitskiy et al., 2020]: the Vision Transformer (ViT) outperforms ResNet-based baselines with substantially less compute.
• ML for Systems [Zhou et al., 2020]: a Transformer-based compiler model (GO-one) speeds up a Transformer model!
• Protein Folding [Jumper et al., 2021]: AlphaFold2!

Scaling Laws: Are Transformers All We Need?
• With Transformers, language modeling performance improves smoothly as we increase model size, training data, and compute resources in tandem.
• This power-law relationship has been observed over multiple orders of magnitude with no sign of slowing! [Kaplan et al., 2020]
• If we keep scaling up these models (with no change to the architecture), could they eventually match or exceed human-level performance?

What would we like to fix about the Transformer?
• Quadratic compute in self-attention (today):
  • Computing all pairs of interactions means our computation grows quadratically with the sequence length!
  • For recurrent models, it only grew linearly!
• Position representations:
  • Are simple absolute indices the best we can do to represent position?
  • As we learned: relative linear position attention [Shaw et al., 2018]
  • Dependency syntax-based position [Wang et al., 2019]
  • Rotary embeddings [Su et al., 2021]

Recent work on improving on quadratic self-attention cost
• Considerable recent work has gone into the question: can we build models like Transformers without paying the O(T²) all-pairs self-attention cost?
• For example, Linformer [Wang et al., 2020].
• Key idea: map the sequence length dimension to a lower-dimensional space for values and keys. (A rough code sketch follows at the end of this section.)
• [Figure: inference time (s) vs. sequence length / batch size]

Recent work on improving on quadratic self-attention cost
• Another example: BigBird [Zaheer et al., 2021].
• Key idea: replace all-pairs interactions with a family of other interactions, like local windows, looking at everything, and random interactions.

Do Transformer Modifications Transfer?
• "Surprisingly, we find that most modifications do not meaningfully improve performance."
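To make the Linformer key idea above concrete, here is a rough, single-head sketch under my own naming (low_rank_attention, E, F_proj), not the paper's code: learned projections map the length-n keys and values down to k ≪ n rows, so the score matrix is n×k rather than n×n.

```python
import torch
import torch.nn.functional as F

def low_rank_attention(X, Q, K, V, E, F_proj):
    """Linformer-style attention sketch: compress the sequence-length dimension.

    X: (n, d) input vectors; Q, K, V: (d, d) projections.
    E, F_proj: (k, n) learned projections of the length dimension for keys and
    values (named F_proj to avoid clashing with torch.nn.functional as F).
    """
    n, d = X.shape
    queries = X @ Q            # (n, d)
    keys = E @ (X @ K)         # (k, d): n keys compressed into k
    values = F_proj @ (X @ V)  # (k, d): n values compressed into k

    scores = queries @ keys.T / d ** 0.5   # (n, k) -- linear in n, not quadratic
    weights = F.softmax(scores, dim=-1)
    return weights @ values                # (n, d)

# Example usage (illustrative only): k = 16 regardless of sequence length n.
n, d, k = 128, 64, 16
X = torch.randn(n, d)
Q, K, V = [torch.randn(d, d) / d ** 0.5 for _ in range(3)]
E, F_proj = torch.randn(k, n) / n ** 0.5, torch.randn(k, n) / n ** 0.5
out = low_rank_attention(X, Q, K, V, E, F_proj)   # shape: (128, 64)
```

One caveat of this design: the projections E and F_proj are tied to a maximum sequence length n, which is part of the trade-off for avoiding the O(T²) cost.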