Neural Machine Translation
Philipp Koehn
6 October 2020

Language Models
• Modeling variants
  – feed-forward neural network
  – recurrent neural network
  – long short-term memory neural network
• May include input context

Feed Forward Neural Language Model
[Figure: the history words w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1} are embedded, passed through a feed-forward hidden layer h, and a softmax predicts the output word w_i]

Recurrent Neural Language Model
[Figure: the model unrolled over "the house is big ." — each input word is embedded (E x_j), updates the recurrent state h_j, and a softmax over that state yields the output word prediction t_i]
• Predict the first word of a sentence
• Predict the second word of a sentence, re-using the hidden state from the first word prediction
• Predict the third word of a sentence ... and so on

Recurrent Neural Translation Model
• We predicted the words of a sentence
• Why not also predict their translations?

Encoder-Decoder Model
[Figure: a single chain of recurrent states first reads the input "the house is big ." and then continues to produce the output "das Haus ist groß ."]
• Obviously madness
• Proposed by Google (Sutskever et al. 2014)
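The encoder-decoder model can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the original Sutskever et al. setup: the vocabulary sizes, embedding and hidden dimensions, and the choice of GRU units are assumptions, and the decoder is fed the reference output words rather than its own predictions.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hid=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src, tgt_in):
        # encode the source sentence; keep only the final recurrent state
        _, h = self.encoder(self.src_embed(src))
        # the decoder starts from the encoder's final state and predicts
        # each output word given the previous output word
        dec_states, _ = self.decoder(self.tgt_embed(tgt_in), h)
        return self.out(dec_states)            # unnormalized word scores t_i

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (1, 6))           # e.g. "the house is big . </s>"
tgt_in = torch.randint(0, 1000, (1, 6))        # e.g. "<s> das Haus ist groß ."
probs = torch.softmax(model(src, tgt_in), dim=-1)   # (1, 6, tgt_vocab)
```

The only link between the two languages here is the final encoder state handed to the decoder — which is exactly the weakness the attention mechanism below addresses.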
What is Missing?
• Alignment of input words to output words
⇒ Solution: attention mechanism

neural translation model with attention

Input Encoding
[Figure: the recurrent neural language model run over the input sentence "the house is big ."]
• Inspiration: recurrent neural network language model on the input side

Hidden Language Model States
• This gives us the hidden states of a left-to-right recurrent network
• These encode left context for each word
• Same process in reverse: right context for each word

Input Encoder
[Figure: a left-to-right and a right-to-left encoder RNN over the input word embeddings E x_j, producing states →h_j and ←h_j]
• Input encoder: concatenate bidirectional RNN states
• Each word representation includes full left and right sentence context

Encoder: Math
• Input is a sequence of words x_j, mapped into embedding space Ē x_j
• Bidirectional recurrent neural networks
  ←h_j = f(←h_{j+1}, Ē x_j)
  →h_j = f(→h_{j-1}, Ē x_j)
• Various choices for the function f(): feed-forward layer, GRU, LSTM, ...

Decoder
• We want to have a recurrent neural network predicting output words
[Figure: decoder RNN states s_i, each followed by a softmax that produces the output word prediction t_i]
• We feed decisions on output words back into the decoder state (as output word embeddings E y_i)
• Decoder state is also informed by the input context c_i

More Detail
• Decoder is also a recurrent neural network over a sequence of hidden states s_i
  s_i = f(s_{i-1}, E y_{i-1}, c_i)
• Again, various choices for the function f(): feed-forward layer, GRU, LSTM, ...
• Output word y_i is selected by computing a vector t_i (same size as the vocabulary)
  t_i = W(U s_{i-1} + V E y_{i-1} + C c_i)
  then finding the highest value in the vector t_i
• If we normalize t_i, we can view it as a probability distribution over words
• E y_i is the embedding of the output word y_i
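A NumPy sketch of one decoder step under these formulas; the toy dimensions and random weights are assumptions, and a plain tanh layer stands in for the function f().

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# assumed toy sizes: decoder state 4, word embedding 3, vocabulary 5,
# input context 8 (concatenated left-to-right and right-to-left encoder states)
H, E_DIM, V_SIZE, C_DIM = 4, 3, 5, 8
rng = np.random.default_rng(1)
U, V, C = rng.normal(size=(H, H)), rng.normal(size=(H, E_DIM)), rng.normal(size=(H, C_DIM))
W = rng.normal(size=(V_SIZE, H))
Ws, Wy, Wc = rng.normal(size=(H, H)), rng.normal(size=(H, E_DIM)), rng.normal(size=(H, C_DIM))

def decoder_step(s_prev, Ey_prev, c_i):
    # s_i = f(s_{i-1}, E y_{i-1}, c_i); a plain tanh layer stands in for f()
    s_i = np.tanh(Ws @ s_prev + Wy @ Ey_prev + Wc @ c_i)
    # t_i = W(U s_{i-1} + V E y_{i-1} + C c_i): a vocabulary-sized score vector
    t_i = W @ (U @ s_prev + V @ Ey_prev + C @ c_i)
    return s_i, softmax(t_i)          # normalized t_i is a distribution over words

s_prev  = rng.normal(size=H)          # previous decoder state s_{i-1}
Ey_prev = rng.normal(size=E_DIM)      # embedding of the previous output word
c_i     = rng.normal(size=C_DIM)      # input context from the attention mechanism
s_i, p = decoder_step(s_prev, Ey_prev, c_i)
best_word = int(np.argmax(p))         # pick the highest-scoring word y_i
```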
Attention
[Figure: the decoder state attends over the bidirectional encoder states h_j with attention weights α_ij, which are combined into the input context]
• Given what we have generated so far (decoder hidden state) ...
• ... which words in the input should we pay attention to (encoder states)?

Attention
• Given:
  – the previous hidden state of the decoder s_{i-1}
  – the representation of input words h_j = (←h_j, →h_j)
• Predict an alignment probability a(s_{i-1}, h_j) for each input word j (modeled with a feed-forward neural network layer)
• Normalize attention (softmax)
  α_ij = exp(a(s_{i-1}, h_j)) / Σ_k exp(a(s_{i-1}, h_k))
• Relevant input context: weigh input words according to attention
  c_i = Σ_j α_ij h_j
• Use context to predict next hidden state and output word
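The three attention steps (score, normalize, weighted sum) fit into a short NumPy sketch; the dimensions are toy values, and the additive feed-forward scoring layer with weights Wa, Ua, va is one common choice, assumed here.

```python
import numpy as np

# assumed toy sizes: decoder state 4, encoder state 8 (bidirectional), 6 input words
H_DEC, H_ENC, J = 4, 8, 6
rng = np.random.default_rng(2)
h = rng.normal(size=(J, H_ENC))       # encoder states h_j, one row per input word
s_prev = rng.normal(size=H_DEC)       # previous decoder state s_{i-1}

# feed-forward scoring layer for a(s_{i-1}, h_j) (additive attention; weights assumed)
Wa = rng.normal(size=(H_DEC, H_DEC))
Ua = rng.normal(size=(H_DEC, H_ENC))
va = rng.normal(size=H_DEC)

def attention(s_prev, h):
    # alignment scores a(s_{i-1}, h_j) for every input word j
    scores = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h_j) for h_j in h])
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()               # normalized attention weights α_ij
    c_i = alpha @ h                   # input context c_i = Σ_j α_ij h_j
    return alpha, c_i

alpha, c_i = attention(s_prev, h)     # alpha: (6,) weights, c_i: (8,) context vector
```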
training

Comparing Prediction to Correct Word
[Figure: each predicted distribution t_i is compared to the correct output word y_i ("das", "Haus", "ist", ...), giving an error of −log t_i[y_i]]
• Current model gives some probability t_i[y_i] to the correct word y_i
• We turn this into an error by computing the cross-entropy: −log t_i[y_i]

Computation Graph
• Math behind neural machine translation defines a computation graph
• Forward and backward computation to compute gradients for model training
[Figure: a small worked example — input x is multiplied with W1, summed with b1 and passed through a sigmoid, then multiplied with W2, summed with b2 and passed through a sigmoid, with concrete values at every node]

Unrolled Computation Graph
[Figure: the full attention model unrolled over the sentence pair "the house is big ." → "das Haus ist groß .": bidirectional encoder states, attention, weighted sum, decoder states, softmax predictions, and one cost node −log t_i[y_i] per output word]

Batching
• Already large degree of parallelism
  – most computations on vectors, matrices
  – efficient implementations for CPU and GPU
• Further parallelism by batching
  – processing several sentence pairs at once
  – scalar operation → vector operation
  – vector operation → matrix operation
  – matrix operation → 3d tensor operation
• Typical batch sizes: 50–100 sentence pairs

Batches
• Sentences have different length
• When batching, fill up unneeded cells in tensors
⇒ A lot of wasted computations

Mini-Batches
• Sort sentences by length, break up into mini-batches
• Example: maxi-batch of 1600 sentence pairs, mini-batch of 80 sentence pairs

Overall Organization of Training
• Shuffle corpus
• Break into maxi-batches
• Break up each maxi-batch into mini-batches
• Process mini-batch, update parameters
• Once done, repeat
• Typically 5–15 epochs needed (passes through entire training corpus)
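A plain-Python sketch of this training organization, using the batch sizes from the slides; the corpus format and the final shuffle of mini-batch order are assumptions.

```python
import random

def training_batches(corpus, maxi_size=1600, mini_size=80, seed=0):
    """Yield mini-batches of (source, target) sentence pairs.

    corpus: list of (source_tokens, target_tokens) pairs. The pairs are
    shuffled, grouped into maxi-batches, sorted by length inside each
    maxi-batch, and split into mini-batches, so that each mini-batch
    contains sentences of similar length (little wasted padding).
    """
    rng = random.Random(seed)
    pairs = list(corpus)
    rng.shuffle(pairs)                                   # shuffle corpus
    for start in range(0, len(pairs), maxi_size):        # break into maxi-batches
        maxi = pairs[start:start + maxi_size]
        maxi.sort(key=lambda pair: len(pair[0]))         # sort by source length
        minis = [maxi[i:i + mini_size] for i in range(0, len(maxi), mini_size)]
        rng.shuffle(minis)                               # avoid length-ordered updates
        yield from minis

# usage: one epoch over a toy corpus of already-tokenized sentence pairs
corpus = [(["the", "house"], ["das", "Haus"])] * 5000
for mini_batch in training_batches(corpus):
    pass  # process mini-batch, update parameters
```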
deeper models

Deeper Models
• Encoder and decoder are recurrent neural networks
• We can add additional layers for each step
• Recall shallow and deep language models
[Figure: a shallow language model next to deep stacked and deep transitional variants, which insert hidden layers h_{t,1}, h_{t,2}, h_{t,3} between the input word embedding and the softmax output]
• Adding residual connections (short-cuts through deep layers) helps

Deep Decoder
• Two ways of adding layers
  – deep transitions: several layers on the path to the output
  – deeply stacked recurrent neural networks
• Why not both?
[Figure: a decoder step that combines both — two stacked recurrent blocks over the input context c_t, each followed by feed-forward transition layers, yielding decoder states s_{t,1} = v_{t,1,3} and s_{t,2} = v_{t,2,3}]

Deep Encoder
• Previously proposed encoder already has 2 layers
  – left-to-right recurrent network, to encode left context
  – right-to-left recurrent network, to encode right context
⇒ Third way of adding layers
[Figure: a deep encoder over the input word embeddings — layer 1 runs left-to-right (h_{j,1}) and right-to-left (h_{j,2}), layer 2 again left-to-right (h_{j,3}) and right-to-left (h_{j,4})]
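A PyTorch sketch of such a deep encoder, with assumed dimensions; the built-in stacked bidirectional GRU runs both directions at every layer, which differs in detail from the alternating left-to-right/right-to-left layout in the figure.

```python
import torch
import torch.nn as nn

# assumed toy sizes: vocabulary 1000, embedding 64, hidden 128, two encoder layers
embed = nn.Embedding(1000, 64)
encoder = nn.GRU(64, 128, num_layers=2, bidirectional=True, batch_first=True)

words = torch.randint(0, 1000, (1, 6))    # one sentence, e.g. "the house is big . </s>"
states, _ = encoder(embed(words))         # (1, 6, 256): for each word, the concatenated
                                          # left-to-right and right-to-left states of the top layer
```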