Machine Translation

Machine Translation (MT) is the task of translating a sentence x from one language (the source language) to a sentence y in another language (the target language).
x: L'homme est né libre, et partout il est dans les fers
y: Man is born free, but everywhere he is in chains – Rousseau

The early history of MT: 1950s
• Machine translation research began in the early 1950s, on machines less powerful than high-school calculators (before the term "A.I." was coined!)
• Concurrent with foundational work on automata, formal languages, probabilities, and information theory
• MT was heavily funded by the military, but the systems were basically just simple rule-based systems doing word substitution
• Human language is more complicated than that, and varies more across languages!
• There was little understanding of natural language syntax, semantics, pragmatics
• The problem soon appeared intractable
• 1-minute video showing 1954 MT: https://youtu.be/K-HfpsHPmvw

1990s–2010s: Statistical Machine Translation
• Core idea: learn a probabilistic model from data
• Suppose we're translating French → English
• We want to find the best English sentence y, given the French sentence x: argmax_y P(y|x)
• Use Bayes' Rule to break this down into two components to be learned separately: argmax_y P(x|y) P(y)
  • Translation Model P(x|y): models how words and phrases should be translated (fidelity). Learned from parallel data.
  • Language Model P(y): models how to write good English (fluency). Learned from monolingual data.

What happens in translation isn't trivial to model!
1519年600名西班牙人在墨西哥登陆,去征服几百万人口的阿兹特克帝国,初次交锋他们损兵三分之二。
(In 1519, six hundred Spaniards landed in Mexico to conquer the Aztec Empire with a population of a few million. They lost two thirds of their soldiers in the first clash.)
translate.google.com (2009): 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the first two-thirds of soldiers against their loss.
translate.google.com (2013): 1519 600 Spaniards landed in Mexico to conquer the Aztec empire, hundreds of millions of people, the initial confrontation loss of soldiers two-thirds.
translate.google.com (2015): 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the first two-thirds of the loss of soldiers they clash.

1990s–2010s: Statistical Machine Translation
• SMT was a huge research field
• The best systems were extremely complex
  • Hundreds of important details
  • Systems had many separately-designed subcomponents
  • Lots of feature engineering: you needed to design features to capture particular language phenomena
  • Required compiling and maintaining extra resources, like tables of equivalent phrases
  • Lots of human effort to maintain
  • Repeated effort for each language pair!

What is Neural Machine Translation?
• Neural Machine Translation (NMT) is a way to do Machine Translation with a single end-to-end neural network
• The neural network architecture is called a sequence-to-sequence model (aka seq2seq) and it involves two RNNs

Neural Machine Translation (NMT): the sequence-to-sequence model
• The source sentence (input), e.g. "il a m' entarté", is read by the Encoder RNN; the Decoder RNN produces the target sentence (output).
• The Encoder RNN produces an encoding of the source sentence, which provides the initial hidden state for the Decoder RNN.
• The Decoder RNN is a Language Model that generates the target sentence, conditioned on the encoding.
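To make the two-RNN structure concrete, here is a minimal sketch of such an encoder-decoder in PyTorch. It is an illustrative simplification, not the course's reference implementation; the module name, dimensions, and single-layer LSTMs are all assumptions.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder sketch: two RNNs, trained end-to-end."""
    def __init__(self, src_vocab, tgt_vocab, d_emb=256, d_hid=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_emb)
        self.encoder = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.decoder = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, tgt_vocab)   # projects to target-vocab logits

    def forward(self, src_ids, tgt_ids):
        # Encoder RNN produces an encoding of the source sentence;
        # its final state initializes the decoder (the conditioning).
        _, enc_state = self.encoder(self.src_emb(src_ids))
        # Decoder RNN is a language model over the target sentence,
        # conditioned on the source encoding.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), enc_state)
        return self.out(dec_out)                 # logits for each target position
```

At test time the decoder is run one step at a time, feeding each predicted word back in as the next input, as described next.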
• The diagram on the slide shows test-time behavior: on each step, the decoder's output word (he, hit, me, with, a, pie) is chosen by argmax over the vocabulary and fed back in as the next step's input.

Sequence-to-sequence is versatile!
• The general notion here is an encoder-decoder model
  • One neural network takes input and produces a neural representation
  • Another network produces output based on that neural representation
  • If the input and output are sequences, we call it a seq2seq model
• Sequence-to-sequence is useful for more than just MT
• Many NLP tasks can be phrased as sequence-to-sequence:
  • Summarization (long text → short text)
  • Dialogue (previous utterances → next utterance)
  • Parsing (input text → output parse as sequence)
  • Code generation (natural language → Python code)

Neural Machine Translation (NMT)
• The sequence-to-sequence model is an example of a Conditional Language Model
  • Language Model because the decoder is predicting the next word of the target sentence y
  • Conditional because its predictions are also conditioned on the source sentence x
• NMT directly calculates P(y|x):
  P(y|x) = P(y_1|x) P(y_2|y_1, x) P(y_3|y_1, y_2, x) … P(y_T|y_1, …, y_{T−1}, x)
  where each factor is the probability of the next target word, given the target words so far and the source sentence x
• Question: how do we train an NMT system?
• (Easy) Answer: get a big parallel corpus…
  • But there is now exciting work on "unsupervised NMT", data augmentation, etc.

Training a Neural Machine Translation system
• Feed the source sentence (from the corpus) through the Encoder RNN, and the target sentence (from the corpus), e.g. "he hit me with a pie", through the Decoder RNN.
• On each decoder step t, the model produces a prediction ŷ_t and incurs a loss J_t, the negative log probability of the true next word (e.g. J_1 = −log P("he"), J_4 = −log P("with")).
• The total loss is the average over the target sentence: J = (1/T) Σ_{t=1}^{T} J_t
• Seq2seq is optimized as a single system; backpropagation operates "end-to-end".

Multi-layer deep encoder-decoder machine translation net [Sutskever et al. 2014; Luong et al. 2015]
(Figure: a deep, stacked encoder-decoder translating "Die Proteste waren am Wochenende eskaliert" → "The protests escalated over the weekend". The encoder builds up the meaning of the source sentence; the conditioning passed to the decoder is a bottleneck; the decoder generates the translation, feeding in the last generated word at each step.)
• The hidden states from RNN layer i are the inputs to RNN layer i+1
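To make the training objective concrete, here is a minimal sketch of one training step with teacher forcing, reusing the hypothetical Seq2Seq module sketched above. The vocabulary sizes and the batch layout (2-D tensors of token ids, targets wrapped in start/end tokens) are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical setup: src_ids and tgt_ids are (batch, length) tensors of token ids,
# where each target row is [<START>, he, hit, me, with, a, pie, <END>].
model = Seq2Seq(src_vocab=32000, tgt_vocab=32000)
loss_fn = nn.CrossEntropyLoss()            # cross-entropy = negative log probability
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(src_ids, tgt_ids):
    # Teacher forcing: the decoder reads the gold prefix y_1..y_{t-1}
    # and is trained to predict the gold next word y_t.
    logits = model(src_ids, tgt_ids[:, :-1])          # predictions for steps 1..T
    # J = (1/T) sum_t J_t with J_t = -log P(y_t | y_<t, x), averaged over the batch.
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt_ids[:, 1:].reshape(-1))
    loss.backward()                                   # backpropagation, end-to-end
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```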
How do we evaluate Machine Translation?

BLEU (Bilingual Evaluation Understudy)
• BLEU compares the machine-written translation to one or several human-written translation(s), and computes a similarity score based on:
  • n-gram precision (usually for 1-, 2-, 3- and 4-grams)
  • Plus a penalty for too-short system translations
• BLEU is useful but imperfect
  • There are many valid ways to translate a sentence
  • So a good translation can get a poor BLEU score because it has low n-gram overlap with the human translation
• You'll see BLEU in detail in Assignment 4!
Source: "BLEU: a Method for Automatic Evaluation of Machine Translation", Papineni et al., 2002. http://aclweb.org/anthology/P02-1040

Example: a machine translation scored against 4 reference translations [Papineni et al. 2002]
Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places.
Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden, which threatens to launch a biochemical attack on such public places as airport. Guam authority has been on alert.
Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia. They said there would be biochemistry air raid to Guam Airport and other public places. Guam needs to be in high precaution about this matter.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.
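As a rough illustration of how the score is computed, here is a minimal sketch of clipped n-gram precision plus a brevity penalty. It is a simplification of the corpus-level formula in Papineni et al. (2002), assumes a single pre-tokenized candidate, and uses uniform n-gram weights.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of clipped n-gram
    precisions for n = 1..4, times a brevity penalty for short outputs."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, cnt in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand.items())
        precisions.append(clipped / max(1, sum(cand.values())))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: compare candidate length to the closest reference length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# e.g. bleu("he hit me with a pie".split(),
#           ["he hit me with a pie".split()])  ->  1.0
```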
MT progress over time
(Chart: BLEU scores, roughly 0–45, over 2013–2019 for phrase-based SMT, syntax-based SMT, and Neural MT. [Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal; NMT 2019 FAIR on newstest2019])
Sources: http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf & http://matrix.statmt.org/

Advantages of NMT
Compared to SMT, NMT has many advantages:
• Better performance
  • More fluent
  • Better use of context
  • Better use of phrase similarities
• A single neural network to be optimized end-to-end
  • No subcomponents to be individually optimized
• Requires much less human engineering effort
  • No feature engineering
  • Same method for all language pairs

Disadvantages of NMT?
Compared to SMT:
• NMT is less interpretable
  • Hard to debug
• NMT is difficult to control
  • For example, you can't easily specify rules or guidelines for translation
  • Safety concerns!

NMT: the first big success story of NLP Deep Learning
Neural Machine Translation went from a fringe research attempt in 2014 to the leading standard method in 2016
• 2014: First seq2seq paper published
• 2016: Google Translate switches from SMT to NMT – and by 2018 everyone has
• This is amazing! SMT systems, built by hundreds of engineers over many years, were outperformed by NMT systems trained by a small group of engineers in a few months

Decoding: greedy decoding
• We saw how to generate (or "decode") the target sentence by taking the argmax on each step of the decoder
• This is greedy decoding (take the most probable word on each step)

Problems with greedy decoding
• Greedy decoding has no way to undo decisions!
  • Input: il a m'entarté (he hit me with a pie)
  • → he ____
  • → he hit ____
  • → he hit a ____ (whoops! no going back now…)
• How do we fix this?

Exhaustive search decoding
• Ideally, we want to find a (length T) translation y that maximizes
  P(y|x) = P(y_1|x) P(y_2|y_1, x) … P(y_T|y_1, …, y_{T−1}, x)
• We could try computing all possible sequences y
  • This means that on each step t of the decoder, we're tracking V^t possible partial translations, where V is the vocab size
  • This O(V^T) complexity is far too expensive!

Beam search decoding
• Core idea: on each step of the decoder, keep track of the k most probable partial translations (which we call hypotheses)
  • k is the beam size (in practice around 5 to 10 in NMT)
• A hypothesis y_1, …, y_t has a score, which is its log probability:
  score(y_1, …, y_t) = log P_LM(y_1, …, y_t | x) = Σ_{i=1}^{t} log P_LM(y_i | y_1, …, y_{i−1}, x)
  • Scores are all negative, and a higher score is better
  • We search for high-scoring hypotheses, tracking the top k on each step
• Beam search is not guaranteed to find the optimal solution
  • But it is much more efficient than exhaustive search!

Beam search decoding: example (beam size k = 2; the blue numbers in the slides are hypothesis scores)
• Start: calculate the probability distribution of the next word after <START>, take the top k words and compute their scores, e.g. he (score = log P_LM(he | <START>) = −0.7) and I (−0.9).
• On each step, for each of the k hypotheses, find the top k next words and calculate their scores, e.g. score(<START> he hit) = log P_LM(hit | <START> he) + (−0.7); of these k² hypotheses, just keep the k with the highest scores.
• In the example, the beam expands step by step through partial translations such as he hit / I was and he hit a / he hit me, and eventually reaches complete hypotheses ending in pie and tart.
• When the top-scoring hypothesis ("he hit me with a pie") is reached, backtrack through the kept hypotheses to obtain the full translation.
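A minimal sketch of the beam-search loop described above. The next_log_probs(prefix) hook is an assumption standing in for the NMT decoder: it should return the model's log-distribution over the next word given the prefix (and, implicitly, the source sentence). Stopping criteria and length normalization are discussed next.

```python
def beam_search(next_log_probs, k=5, max_steps=50, start="<START>", end="<END>"):
    """Core beam-search loop: keep the k highest-scoring partial hypotheses,
    where score = sum of log-probabilities of the words chosen so far."""
    beam = [([start], 0.0)]                       # list of (hypothesis, score)
    completed = []
    for _ in range(max_steps):
        candidates = []
        for hyp, score in beam:
            # Expand each hypothesis with its k most probable next words.
            dist = next_log_probs(hyp)            # {word: log P(word | prefix, x)}
            for word, lp in sorted(dist.items(), key=lambda x: -x[1])[:k]:
                candidates.append((hyp + [word], score + lp))
        # Of the k*k candidates, keep only the k with the highest scores;
        # hypotheses that produced <END> are set aside as complete.
        beam = []
        for hyp, score in sorted(candidates, key=lambda x: -x[1])[:k]:
            (completed if hyp[-1] == end else beam).append((hyp, score))
        if not beam:
            break
    return completed or beam
```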
Beam search decoding: stopping criterion
• In greedy decoding, usually we decode until the model produces an <END> token
  • For example: <START> he hit me with a pie <END>
• In beam search decoding, different hypotheses may produce <END> tokens on different timesteps
  • When a hypothesis produces <END>, that hypothesis is complete
  • Place it aside and continue exploring other hypotheses via beam search
• Usually we continue beam search until:
  • We reach timestep T (where T is some pre-defined cutoff), or
  • We have at least n completed hypotheses (where n is a pre-defined cutoff)

Beam search decoding: finishing up
• We have our list of completed hypotheses. How do we select the top one?
• Each hypothesis y_1, …, y_t on our list has a score:
  score(y_1, …, y_t) = Σ_{i=1}^{t} log P_LM(y_i | y_1, …, y_{i−1}, x)
• Problem with this: longer hypotheses have lower scores
• Fix: normalize by length, and use the length-normalized score to select the top one instead:
  (1/t) Σ_{i=1}^{t} log P_LM(y_i | y_1, …, y_{i−1}, x)
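A small sketch of that final selection step, assuming completed hypotheses are (word-list, summed-log-prob) pairs as in the beam-search sketch above:

```python
def select_best(completed):
    """Pick the completed hypothesis with the highest length-normalized score,
    i.e. (1/t) * sum_i log P(y_i | y_<i, x), so longer hypotheses aren't
    penalized just for having more (negative) log-probability terms."""
    return max(completed, key=lambda pair: pair[1] / len(pair[0]))
```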
See also the discussion of sampling-based decoding in the NLG lecture.

Why attention? Sequence-to-sequence: the bottleneck problem
• In the basic seq2seq model, the only thing the Decoder RNN receives from the Encoder RNN is the encoding of the source sentence (the encoder's final hidden state).
• Problems with this architecture? That single encoding needs to capture all information about the source sentence. Information bottleneck!

Attention
• Attention provides a solution to the bottleneck problem
• Core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence
• First we will show it via diagrams (no equations), then we will show it with equations

Sequence-to-sequence with attention (diagram walkthrough, source sentence "il a m' entarté")
• On each decoder step, compute attention scores as the dot product between the current decoder hidden state and each encoder hidden state.
• Take a softmax to turn the scores into a probability distribution, the attention distribution. (On the first decoder step of the example, we're mostly focusing on the first encoder hidden state, "he".)
• Use the attention distribution to take a weighted sum of the encoder hidden states: the attention output. The attention output mostly contains information from the hidden states that received high attention.
• Concatenate the attention output with the decoder hidden state, then use it to compute ŷ_1 ("he") as before. Repeating this on every decoder step produces ŷ_2 = hit, ŷ_3 = me, ŷ_4 = with, ŷ_5 = a, ŷ_6 = pie.
• Sometimes we take the attention output from the previous step and also feed it into the decoder (along with the usual decoder input). We do this in Assignment 4.
Attention: in equations
• We have encoder hidden states h_1, …, h_N ∈ ℝ^h
• On timestep t, we have decoder hidden state s_t ∈ ℝ^h
• We get the attention scores e^t for this step:
  e^t = [s_t^⊤ h_1, …, s_t^⊤ h_N] ∈ ℝ^N
• We take softmax to get the attention distribution α^t for this step (this is a probability distribution and sums to 1):
  α^t = softmax(e^t) ∈ ℝ^N
• We use α^t to take a weighted sum of the encoder hidden states to get the attention output a_t:
  a_t = Σ_{i=1}^{N} α_i^t h_i ∈ ℝ^h
• Finally we concatenate the attention output with the decoder hidden state, [a_t; s_t] ∈ ℝ^{2h}, and proceed as in the non-attention seq2seq model

Attention is great!
• Attention significantly improves NMT performance
  • It's very useful to allow the decoder to focus on certain parts of the source
• Attention provides a more "human-like" model of the MT process
  • You can look back at the source sentence while translating, rather than needing to remember it all
• Attention solves the bottleneck problem
  • Attention allows the decoder to look directly at the source; bypass the bottleneck
• Attention helps with the vanishing gradient problem
  • Provides a shortcut to faraway states
• Attention provides some interpretability
  • By inspecting the attention distribution, we can see what the decoder was focusing on (e.g. a soft alignment between "il a m' entarté" and "he hit me with a pie")
  • We get (soft) alignment for free! This is cool because we never explicitly trained an alignment system; the network just learned alignment by itself

There are several attention variants
• We have some values h_1, …, h_N ∈ ℝ^{d_1} and a query s ∈ ℝ^{d_2}
• Attention always involves:
  1. Computing the attention scores e ∈ ℝ^N (there are multiple ways to do this)
  2. Taking softmax to get the attention distribution α: α = softmax(e) ∈ ℝ^N
  3. Using the attention distribution to take a weighted sum of the values: a = Σ_{i=1}^{N} α_i h_i ∈ ℝ^{d_1}, thus obtaining the attention output a (sometimes called the context vector)

Attention variants
There are several ways you can compute e ∈ ℝ^N from h_1, …, h_N ∈ ℝ^{d_1} and s ∈ ℝ^{d_2}:
• Basic dot-product attention: e_i = s^⊤ h_i ∈ ℝ
  • Note: this assumes d_1 = d_2. This is the version we saw earlier.
• Multiplicative attention: e_i = s^⊤ W h_i ∈ ℝ [Luong, Pham, and Manning 2015]
  • Where W ∈ ℝ^{d_2 × d_1} is a weight matrix. Perhaps better called "bilinear attention".
• Reduced-rank multiplicative attention: e_i = s^⊤ U^⊤ V h_i = (U s)^⊤ (V h_i)
  • For low-rank matrices U ∈ ℝ^{k × d_2}, V ∈ ℝ^{k × d_1}, with k ≪ d_1, d_2
• Additive attention: e_i = v^⊤ tanh(W_1 h_i + W_2 s) ∈ ℝ [Bahdanau, Cho, and Bengio 2014]
  • Where W_1 ∈ ℝ^{d_3 × d_1} and W_2 ∈ ℝ^{d_3 × d_2} are weight matrices and v ∈ ℝ^{d_3} is a weight vector
  • d_3 (the attention dimensionality) is a hyperparameter
  • "Additive" is a weird/bad name; it's really using a feed-forward neural net layer
• More information: "Deep Learning for NLP Best Practices", Ruder, 2017. http://ruder.io/deep-learning-nlp-best-practices/index.html#attention and "Massive Exploration of Neural Machine Translation Architectures", Britz et al., 2017. https://arxiv.org/pdf/1703.03906.pdf
• You'll think about the relative advantages/disadvantages of these in Assignment 4!
• Remember this when we look at Transformers next week!
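A minimal sketch of basic dot-product attention for one decoder timestep, following the equations above; the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(s_t, H):
    """One step of dot-product attention.
    s_t: decoder hidden state, shape (h,)
    H:   encoder hidden states stacked as rows, shape (N, h)
    Returns the attention output a_t (h,) and the attention distribution (N,)."""
    e_t = H @ s_t                      # attention scores: e_i = s_t . h_i, shape (N,)
    alpha_t = F.softmax(e_t, dim=0)    # attention distribution (sums to 1)
    a_t = alpha_t @ H                  # weighted sum of encoder states, shape (h,)
    return a_t, alpha_t

# In the seq2seq model, [a_t ; s_t] (concatenation) is then used to predict y_t:
# combined = torch.cat([a_t, s_t], dim=-1)
```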
Attention is a general Deep Learning technique
• We've seen that attention is a great way to improve the sequence-to-sequence model for Machine Translation
• However: you can use attention in many architectures (not just seq2seq) and many tasks (not just MT)
• More general definition of attention: given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query
  • We sometimes say that the query attends to the values
  • For example, in the seq2seq + attention model, each decoder hidden state (query) attends to all the encoder hidden states (values)
• Intuition:
  • The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on
  • Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query)
• Upshot: attention has become the powerful, flexible, general way to do pointer and memory manipulation in all deep learning models. A new idea from after 2010, and it came from NMT!

As of last lecture: recurrent models for (most) NLP!
• Circa 2016, the de facto strategy in NLP was to encode sentences with a bidirectional LSTM (for example, the source sentence in a translation)
• Define your output (parse, sentence, summary) as a sequence, and use an LSTM to generate it
• Use attention to allow flexible access to memory

Today: same goals, different building blocks
• Last week, we learned about sequence-to-sequence problems and encoder-decoder models
• Today, we're not trying to motivate entirely new ways of looking at problems (like Machine Translation)
• Instead, we're trying to find the best building blocks to plug into our models and enable broad progress
  • 2014-2017ish: recurrence, plus lots of trial and error; 2021: ??????

Issues with recurrent models: linear interaction distance
• RNNs are unrolled "left-to-right"
• This encodes linear locality: a useful heuristic, since nearby words often affect each other's meanings (e.g. "tasty pizza")
• Problem: RNNs take O(sequence length) steps for distant word pairs to interact (e.g. in "The chef who … was", the information about "chef" has to pass through O(sequence length) many layers before it can interact with "was")
• This means:
  • It is hard to learn long-distance dependencies (because of gradient problems!)
  • The linear order of words is "baked in"; we already know linear order isn't the right way to think about sentences…

Issues with recurrent models: lack of parallelizability
• Forward and backward passes have O(sequence length) unparallelizable operations
  • GPUs can perform a bunch of independent computations at once!
  • But future RNN hidden states can't be computed in full before past RNN hidden states have been computed
  • This inhibits training on very large datasets!
  (Figure: in the unrolled RNN h_1, h_2, …, h_T, the minimum number of steps before each state can be computed grows linearly with its position.)

If not recurrence, then what? How about attention?
• Attention treats each word's representation as a query to access and incorporate information from a set of values
• We saw attention from the decoder to the encoder; today we'll think about attention within a single sentence
• With attention, the number of unparallelizable operations does not increase with sequence length
• Maximum interaction distance: O(1), since all words interact at every layer!
  (Figure: stacked attention layers over word embeddings; all words attend to all words in the previous layer, with most arrows in the diagram omitted.)

Attention as a soft, averaging lookup table
• We can think of attention as performing fuzzy lookup in a key-value store
• In a lookup table, we have a table of keys that map to values. The query matches one of the keys, returning its value.
• In attention, the query matches all keys softly, to a weight between 0 and 1. The keys' values are multiplied by the weights and summed.

Self-attention: keys, queries, values from the same sequence
• Let w_{1:n} be a sequence of words in vocabulary V, like "Zuko made his uncle tea"
• For each w_i, let x_i = E w_i, where E ∈ ℝ^{d×|V|} is an embedding matrix
1. Transform each word embedding with weight matrices Q, K, V, each in ℝ^{d×d}:
   q_i = Q x_i (queries),  k_i = K x_i (keys),  v_i = V x_i (values)
2. Compute pairwise similarities between keys and queries; normalize with softmax:
   e_ij = q_i^⊤ k_j,   α_ij = exp(e_ij) / Σ_{j′} exp(e_ij′)
3. Compute the output for each word as a weighted sum of the values:
   o_i = Σ_j α_ij v_j

Barriers and solutions for self-attention as a building block
• Barrier: self-attention doesn't have an inherent notion of order!

Fixing the first self-attention problem: sequence order
• Since self-attention doesn't build in order information, we need to encode the order of the sentence in our keys, queries, and values
• Consider representing each sequence index as a vector: p_i ∈ ℝ^d, for i ∈ {1, 2, …, n}, are position vectors
  • Don't worry about what the p_i are made of yet!
• It's easy to incorporate this info into our self-attention block: just add the p_i to our inputs!
• Recall that x_i is the embedding of the word at index i. The positioned embedding is:
  x̃_i = x_i + p_i
• In deep self-attention networks, we do this at the first layer! (You could concatenate them as well, but people mostly just add…)

Position representation vectors through sinusoids
• Sinusoidal position representations: concatenate sinusoidal functions of varying periods:
  p_i = [ sin(i / 10000^{2·1/d}), cos(i / 10000^{2·1/d}), …, sin(i / 10000^{2·(d/2)/d}), cos(i / 10000^{2·(d/2)/d}) ]
  (Image: heatmap of the encoding over sequence index and dimension, from https://timodenk.com/blog/linear-relationships-in-the-transformers-positional-encoding/)
• Pros:
  • Periodicity indicates that maybe "absolute position" isn't as important
  • Maybe it can extrapolate to longer sequences as periods restart!
• Cons:
  • Not learnable; also, the extrapolation doesn't really work!

Position representation vectors learned from scratch
• Learned absolute position representations: let all p_i be learnable parameters! Learn a matrix p ∈ ℝ^{d×n}, and let each p_i be a column of that matrix
• Pros:
  • Flexibility: each position gets to be learned to fit the data
• Cons:
  • Definitely can't extrapolate to indices outside 1, …, n
• Most systems use this!
• Sometimes people try more flexible representations of position:
  • Relative linear position attention [Shaw et al., 2018]
  • Dependency syntax-based position [Wang et al., 2019]
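A minimal sketch of the three self-attention steps above, with learned absolute position embeddings added at the input. This is a single head with no masking yet; the class name, dimensions, and maximum length are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head self-attention: keys, queries, and values all come from
    the same sequence, plus learned absolute position embeddings."""
    def __init__(self, vocab_size, d=64, max_len=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)       # x_i = E w_i
        self.pos = nn.Embedding(max_len, d)          # learned p_i
        self.Q = nn.Linear(d, d, bias=False)
        self.K = nn.Linear(d, d, bias=False)
        self.V = nn.Linear(d, d, bias=False)

    def forward(self, word_ids):                     # word_ids: (n,) token indices
        n = word_ids.size(0)
        x = self.emb(word_ids) + self.pos(torch.arange(n))   # x~_i = x_i + p_i
        q, k, v = self.Q(x), self.K(x), self.V(x)    # (n, d) each
        e = q @ k.T                                  # e_ij = q_i . k_j, shape (n, n)
        alpha = F.softmax(e, dim=-1)                 # normalize over j
        return alpha @ v                             # o_i = sum_j alpha_ij v_j
```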
Barriers and solutions for self-attention as a building block
• Barrier: doesn't have an inherent notion of order! → Solution: add position representations to the inputs
• Barrier: no nonlinearities for deep learning magic; it's all just weighted averages

Adding nonlinearities in self-attention
• Note that there are no elementwise nonlinearities in self-attention; stacking more self-attention layers just re-averages value vectors (why? look at the notes!)
• Easy fix: add a feed-forward network to post-process each output vector:
  m_i = MLP(output_i) = W_2 ReLU(W_1 output_i + b_1) + b_2
• Intuition: the FF network processes the result of attention (in the diagram, each word's self-attention output is passed through the same feed-forward layer before the next self-attention layer)

Barriers and solutions for self-attention as a building block
• Barrier: doesn't have an inherent notion of order! → Solution: add position representations to the inputs
• Barrier: no nonlinearities for deep learning magic; it's all just weighted averages → Solution: apply the same feed-forward network to each self-attention output
• Barrier: need to ensure we don't "look at the future" when predicting a sequence (like in machine translation, or language modeling)

Masking the future in self-attention
• To use self-attention in decoders, we need to ensure we can't peek at the future
• At every timestep, we could change the set of keys and queries to include only past words. (Inefficient!)
• To enable parallelization, we mask out attention to future words by setting their attention scores to −∞:
  e_ij = q_i^⊤ k_j if j ≤ i,  and  e_ij = −∞ if j > i
• (In the diagram: when encoding each word of "<START> The chef who …", we can only look at words at or before that position; the greyed-out future positions get score −∞.)

Barriers and solutions for self-attention as a building block
• Barrier: doesn't have an inherent notion of order! → Solution: add position representations to the inputs
• Barrier: no nonlinearities for deep learning magic; it's all just weighted averages → Solution: apply the same feed-forward network to each self-attention output
• Barrier: need to ensure we don't "look at the future" when predicting a sequence → Solution: mask out the future by artificially setting attention weights to 0!

Necessities for a self-attention building block:
• Self-attention: the basis of the method
• Position representations: specify the sequence order, since self-attention is an unordered function of its inputs
• Nonlinearities: at the output of the self-attention block, frequently implemented as a simple feed-forward network
• Masking: in order to parallelize operations while not looking at the future; keeps information about the future from "leaking" to the past
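Putting the pieces together, here is a minimal sketch of a decoder-style self-attention block with the three additions above: position embeddings at the input, a causal mask on the attention scores, and a feed-forward network applied to each output. It is a simplification for illustration, not a full Transformer layer (no multiple heads, scaling, residual connections, or layer norm); the names and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSelfAttentionBlock(nn.Module):
    """Self-attention + position embeddings + future masking + feed-forward net."""
    def __init__(self, vocab_size, d=64, d_ff=256, max_len=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)
        self.pos = nn.Embedding(max_len, d)
        self.Q = nn.Linear(d, d, bias=False)
        self.K = nn.Linear(d, d, bias=False)
        self.V = nn.Linear(d, d, bias=False)
        self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))

    def forward(self, word_ids):                       # word_ids: (n,) token indices
        n = word_ids.size(0)
        x = self.emb(word_ids) + self.pos(torch.arange(n))   # add position vectors
        q, k, v = self.Q(x), self.K(x), self.V(x)
        e = q @ k.T                                    # e_ij = q_i . k_j
        # Mask the future: set e_ij = -inf for j > i so softmax gives weight 0.
        mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
        e = e.masked_fill(mask, float("-inf"))
        o = F.softmax(e, dim=-1) @ v                   # weighted average of values
        return self.ff(o)                              # nonlinearity: m_i = MLP(o_i)
```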