Natural Language Processing with Deep Learning
CS224N/Ling284
John Hewitt
Lecture 8: Self-Attention and Transformers
Adapted from slides by Anna Goldie, John Hewitt

The Transformer Decoder
• A Transformer decoder is how we'll build systems like language models.
• It's a lot like our minimal self-attention architecture, but with a few more components.
• The embeddings and position embeddings are identical.
• We'll next replace our self-attention with multi-head self-attention.
[Figure: Transformer Decoder diagram]

Recall the Self-Attention Hypothetical Example
[Figure: hypothetical self-attention example]

Hypothetical Example of Multi-Head Attention
[Figure: hypothetical multi-head attention example]

Sequence-Stacked form of Attention
• Let's look at how key-query-value attention is computed, in matrices.
• Let X = [x_1; …; x_n] ∈ ℝ^{n×d} be the concatenation of input vectors.
• First, note that XK ∈ ℝ^{n×d}, XQ ∈ ℝ^{n×d}, XV ∈ ℝ^{n×d}.
• The output is defined as output = softmax(XQ(XK)^⊤)XV ∈ ℝ^{n×d}.
• First, take the query-key dot products in one matrix multiplication: XQ(XK)^⊤ = XQK^⊤X^⊤ ∈ ℝ^{n×n}. All pairs of attention scores!
• Next, softmax, and compute the weighted average with another matrix multiplication: softmax(XQK^⊤X^⊤)XV = output ∈ ℝ^{n×d}.

Multi-headed attention
• What if we want to look in multiple places in the sentence at once?
• For word i, self-attention "looks" where x_i^⊤Q^⊤Kx_j is high, but maybe we want to focus on different j for different reasons?
• We'll define multiple attention "heads" through multiple Q, K, V matrices.
• Let Q_ℓ, K_ℓ, V_ℓ ∈ ℝ^{d×d/h}, where h is the number of attention heads, and ℓ ranges from 1 to h.
• Each attention head performs attention independently:
  • output_ℓ = softmax(XQ_ℓK_ℓ^⊤X^⊤)XV_ℓ, where output_ℓ ∈ ℝ^{n×d/h}.
• Then the outputs of all the heads are combined:
  • output = [output_1; …; output_h]Y, where Y ∈ ℝ^{d×d}.
• Each head gets to "look" at different things, and construct value vectors differently.

Multi-head self-attention is computationally efficient
• Even though we compute h many attention heads, it's not really more costly.
• We compute XQ ∈ ℝ^{n×d}, and then reshape to ℝ^{n×h×d/h}. (Likewise for XK, XV.)
• Then we transpose to ℝ^{h×n×d/h}; now the head axis is like a batch axis.
• Almost everything else is identical, and the matrices are the same sizes.
• First, take the query-key dot products in one matrix multiplication per head: XQK^⊤X^⊤ = P ∈ ℝ^{h×n×n} (the figure shows h = 3), giving h sets of all-pairs attention scores.
• Next, softmax, and compute the weighted average with another matrix multiplication, softmax(P)XV, then concatenate the heads: output ∈ ℝ^{n×d}.

Scaled Dot Product [Vaswani et al., 2017]
• "Scaled dot product" attention aids in training.
• When dimensionality d becomes large, dot products between vectors tend to become large.
• Because of this, inputs to the softmax function can be large, making the gradients small.
• Instead of the self-attention function we've seen:
  output_ℓ = softmax(XQ_ℓK_ℓ^⊤X^⊤)XV_ℓ
• We divide the attention scores by √(d/h), to stop the scores from becoming large just as a function of d/h (the dimensionality divided by the number of heads):
  output_ℓ = softmax(XQ_ℓK_ℓ^⊤X^⊤ / √(d/h))XV_ℓ
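To make the preceding slides concrete, here is a minimal PyTorch sketch of sequence-stacked, multi-head, scaled dot-product self-attention. The function and variable names (multi_head_self_attention, X, Q, K, V, Y, h) are illustrative, not from the lecture code; causal masking (needed for a decoder) and the rest of the block are omitted.

```python
import torch
import torch.nn.functional as F

def multi_head_self_attention(X, Q, K, V, Y, h):
    """Minimal multi-head scaled dot-product self-attention sketch.

    X: (n, d) stacked input vectors.
    Q, K, V: (d, d) projection matrices (conceptually h column-blocks of size d x d/h).
    Y: (d, d) output projection that mixes the concatenated heads.
    h: number of attention heads (must divide d).
    """
    n, d = X.shape
    # Project once, then split the feature dimension into h heads of size d/h.
    # Shapes: (n, d) -> (n, h, d/h) -> (h, n, d/h); the head axis acts like a batch axis.
    q = (X @ Q).reshape(n, h, d // h).transpose(0, 1)
    k = (X @ K).reshape(n, h, d // h).transpose(0, 1)
    v = (X @ V).reshape(n, h, d // h).transpose(0, 1)

    # All-pairs attention scores per head: (h, n, n), scaled by sqrt(d/h).
    scores = q @ k.transpose(-2, -1) / (d // h) ** 0.5
    weights = F.softmax(scores, dim=-1)

    # Weighted average of values per head, then concatenate heads: (h, n, d/h) -> (n, d).
    per_head = weights @ v
    concatenated = per_head.transpose(0, 1).reshape(n, d)
    return concatenated @ Y

# Example usage with random weights (illustrative only):
n, d, h = 5, 64, 8
X = torch.randn(n, d)
Q, K, V, Y = [torch.randn(d, d) / d ** 0.5 for _ in range(4)]
out = multi_head_self_attention(X, Q, K, V, Y, h)   # shape: (5, 64)
```

Note that, as on the efficiency slide, all heads are computed with one projection per matrix followed by a reshape, so the cost is essentially the same as single-head attention.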
The Transformer Decoder
• Now that we've replaced self-attention with multi-head self-attention, we'll go through two optimization tricks that end up being important:
  • Residual Connections
  • Layer Normalization
• In most Transformer diagrams, these are often written together as "Add & Norm".
[Figure: Transformer Decoder diagram]

The Transformer Encoder: Residual connections [He et al., 2016]
• Residual connections are a trick to help models train better.
• Instead of X^(i) = Layer(X^(i−1)) (where i represents the layer),
• we let X^(i) = X^(i−1) + Layer(X^(i−1)) (so we only have to learn "the residual" from the previous layer).
• The gradient is great through the residual connection; it's 1!
• Bias towards the identity function!
[Figure: a layer without and with a residual connection; loss landscape visualization, Li et al., 2018, on a ResNet: [no residuals] vs. [residuals]]

The Transformer Encoder: Layer normalization [Ba et al., 2016]
• Layer normalization is a trick to help models train faster.
• Idea: cut down on uninformative variation in hidden vector values by normalizing to unit mean and standard deviation within each layer.
• LayerNorm's success may be due to its normalizing gradients [Xu et al., 2019].
• Let x ∈ ℝ^d be an individual (word) vector in the model.
• Let μ = (1/d) Σ_{j=1}^{d} x_j; this is the mean; μ ∈ ℝ.
• Let σ = √((1/d) Σ_{j=1}^{d} (x_j − μ)²); this is the standard deviation; σ ∈ ℝ.
• Let γ ∈ ℝ^d and β ∈ ℝ^d be learned "gain" and "bias" parameters. (Can omit!)
• Then layer normalization computes:
  output = ((x − μ) / (σ + ε)) ∗ γ + β
  (Normalize by the scalar mean and standard deviation; modulate by the learned elementwise gain and bias.)

The Transformer Decoder
• The Transformer Decoder is a stack of Transformer Decoder Blocks.
• Each Block consists of:
  • Self-attention
  • Add & Norm
  • Feed-Forward
  • Add & Norm
• That's it! We've gone through the Transformer Decoder.
[Figure: Transformer Decoder diagram]

The Transformer Encoder
• The Transformer Decoder constrains to unidirectional context, as for language models.
• What if we want bidirectional context, like in a bidirectional RNN?
• This is the Transformer Encoder. The only difference is that we remove the masking in the self-attention.
[Figure: Transformer Decoder diagram with "No Masking!"]

The Transformer Encoder-Decoder
• Recall that in machine translation, we processed the source sentence with a bidirectional model and generated the target with a unidirectional model.
• For this kind of seq2seq format, we often use a Transformer Encoder-Decoder.
• We use a normal Transformer Encoder.
• Our Transformer Decoder is modified to perform cross-attention to the output of the Encoder.

Cross-attention (details)
• We saw that self-attention is when keys, queries, and values come from the same source.
• In the decoder, we have attention that looks more like what we saw last week.
• Let h_1, …, h_n be output vectors from the Transformer encoder; h_i ∈ ℝ^d.
• Let z_1, …, z_n be input vectors from the Transformer decoder; z_i ∈ ℝ^d.
• Then keys and values are drawn from the encoder (like a memory):
  • k_i = Kh_i, v_i = Vh_i.
• And the queries are drawn from the decoder: q_i = Qz_i.
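Here is a minimal, single-head sketch of the cross-attention equations above. The names (cross_attention, H, Z) are mine; multi-head splitting, the decoder's masked self-attention, and the surrounding Add & Norm layers are omitted, and the √d scaling is added following the scaled-dot-product slide even though this slide does not show it.

```python
import torch
import torch.nn.functional as F

def cross_attention(H, Z, K, V, Q):
    """Cross-attention sketch: queries from the decoder, keys/values from the encoder.

    H: (n, d) encoder output vectors h_1..h_n, used like a memory.
    Z: (m, d) decoder vectors z_1..z_m.
    K, V, Q: (d, d) projection matrices; single head for simplicity.
    """
    d = H.shape[1]
    keys = H @ K       # (n, d): rows are k_i = K h_i
    values = H @ V     # (n, d): rows are v_i = V h_i
    queries = Z @ Q    # (m, d): rows are q_i = Q z_i

    # Each decoder position attends over all encoder positions.
    scores = queries @ keys.T / d ** 0.5   # (m, n)
    weights = F.softmax(scores, dim=-1)    # each row sums to 1
    return weights @ values                # (m, d)

# Example usage (illustrative only):
n, m, d = 7, 4, 64
H, Z = torch.randn(n, d), torch.randn(m, d)
K, V, Q = [torch.randn(d, d) / d ** 0.5 for _ in range(3)]
out = cross_attention(H, Z, K, V, Q)   # shape: (4, 64)
```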
Great Results with Transformers: Machine Translation [Vaswani et al., 2017]
• First, machine translation results from the original Transformer paper!
• Not just better machine translation BLEU scores; also more efficient to train!
• [Test sets: WMT 2014 English-German and English-French]

Great Results with Transformers: SuperGLUE [Wang et al., 2019]
• SuperGLUE is a suite of challenging NLP tasks, including question answering, word sense disambiguation, coreference resolution, and natural language inference.
• [Test sets: SuperGLUE Leaderboard, version 2.0]

Great Results with Transformers: Rise of Large Language Models!
• Today, Transformer-based models dominate the LMSYS Chatbot Arena Leaderboard! [Chiang et al., 2024]

Transformers Even Show Promise Outside of NLP
• Image Classification [Dosovitskiy et al., 2020]: the Vision Transformer (ViT) outperforms ResNet-based baselines with substantially less compute.
• ML for Systems [Zhou et al., 2020]: a Transformer-based compiler model (GO-one) speeds up a Transformer model!
• Protein Folding [Jumper et al., 2021]: AlphaFold2!

Scaling Laws: Are Transformers All We Need?
• With Transformers, language modeling performance improves smoothly as we increase model size, training data, and compute resources in tandem.
• This power-law relationship has been observed over multiple orders of magnitude with no sign of slowing! [Kaplan et al., 2020]
• If we keep scaling up these models (with no change to the architecture), could they eventually match or exceed human-level performance?

What would we like to fix about the Transformer?
• Quadratic compute in self-attention (today):
  • Computing all pairs of interactions means our computation grows quadratically with the sequence length!
  • For recurrent models, it only grew linearly!
• Position representations:
  • Are simple absolute indices the best we can do to represent position?
  • As we learned: relative linear position attention [Shaw et al., 2018]
  • Dependency syntax-based position [Wang et al., 2019]
  • Rotary embeddings [Su et al., 2021]

Recent work on improving on quadratic self-attention cost
• Considerable recent work has gone into the question: can we build models like Transformers without paying the O(T²) all-pairs self-attention cost?
• For example, Linformer [Wang et al., 2020].
• Key idea: map the sequence length dimension to a lower-dimensional space for values and keys. (A rough code sketch follows at the end of this section.)
• [Figure: inference time (s) vs. sequence length / batch size]

Recent work on improving on quadratic self-attention cost
• Another example: BigBird [Zaheer et al., 2021].
• Key idea: replace all-pairs interactions with a family of other interactions, like local windows, looking at everything, and random interactions.

Do Transformer Modifications Transfer?
• "Surprisingly, we find that most modifications do not meaningfully improve performance."
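To make the Linformer key idea above concrete, here is a rough, single-head sketch under my own naming (low_rank_attention, E, F_proj), not the paper's code: learned projections map the length-n keys and values down to k ≪ n rows, so the score matrix is n×k rather than n×n.

```python
import torch
import torch.nn.functional as F

def low_rank_attention(X, Q, K, V, E, F_proj):
    """Linformer-style attention sketch: compress the sequence-length dimension.

    X: (n, d) input vectors; Q, K, V: (d, d) projections.
    E, F_proj: (k, n) learned projections of the length dimension for keys and
    values (named F_proj to avoid clashing with torch.nn.functional as F).
    """
    n, d = X.shape
    queries = X @ Q            # (n, d)
    keys = E @ (X @ K)         # (k, d): n keys compressed into k
    values = F_proj @ (X @ V)  # (k, d): n values compressed into k

    scores = queries @ keys.T / d ** 0.5   # (n, k) -- linear in n, not quadratic
    weights = F.softmax(scores, dim=-1)
    return weights @ values                # (n, d)

# Example usage (illustrative only): k = 16 regardless of sequence length n.
n, d, k = 128, 64, 16
X = torch.randn(n, d)
Q, K, V = [torch.randn(d, d) / d ** 0.5 for _ in range(3)]
E, F_proj = torch.randn(k, n) / n ** 0.5, torch.randn(k, n) / n ** 0.5
out = low_rank_attention(X, Q, K, V, E, F_proj)   # shape: (128, 64)
```

One caveat of this design: the projections E and F_proj are tied to a maximum sequence length n, which is part of the trade-off for avoiding the O(T²) cost.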