Neural Machine Translation

Philipp Koehn

9 October 2018


Language Models

• Modeling variants
  – feed-forward neural network
  – recurrent neural network
  – long short-term memory neural network
• May include input context


Feed Forward Neural Language Model

[Figure: the previous words (Word 1–4) are each mapped through a shared embedding matrix C into a hidden layer, which predicts Word 5.]


Recurrent Neural Language Model

[Figure: the model unrolled over the sentence "the house is big .", showing for each position the given word, its embedding, the hidden state, and the predicted word.]

• Predict the first word of a sentence (same model as before, just drawn top-down)
• Predict the second word of a sentence, re-using the hidden state from the first word prediction
• Predict the third word of a sentence ... and so on


Recurrent Neural Translation Model

• We predicted the words of a sentence
• Why not also predict their translations?


Encoder-Decoder Model

[Figure: a single recurrent network reads "the house is big ." and then continues by predicting "das Haus ist groß ."]

• Obviously madness
• Proposed by Google (Sutskever et al. 2014)


What is Missing?

• Alignment of input words to output words
⇒ Solution: attention mechanism


neural translation model with attention


Input Encoding

[Figure: same layout as the recurrent language model: given word, embedding, hidden state, predicted word.]

• Inspiration: recurrent neural network language model on the input side


Hidden Language Model States

• This gives us the hidden states H1, H2, H3, H4, H5, H6
• These encode left context for each word
• Same process in reverse gives the right context for each word: Ĥ1, Ĥ2, Ĥ3, Ĥ4, Ĥ5, Ĥ6


Input Encoder

[Figure: input word embeddings feed a left-to-right and a right-to-left recurrent NN.]

• Input encoder: concatenate the bidirectional RNN states
• Each word representation includes full left and right sentence context


Encoder: Math

• Input is a sequence of words x_j, mapped into embedding space \bar{E} x_j
• Bidirectional recurrent neural networks
  \overleftarrow{h}_j = f(\overleftarrow{h}_{j+1}, \bar{E} x_j)
  \overrightarrow{h}_j = f(\overrightarrow{h}_{j-1}, \bar{E} x_j)
• Various choices for the function f(): feed-forward layer, GRU, LSTM, ...
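
The following is a minimal NumPy sketch of the bidirectional encoder equations above, using a simple tanh layer as one possible choice for f(); the parameter names (embed, Wf, Uf, Wb, Ub) are illustrative, not taken from the slides.

```python
import numpy as np

def encode(word_ids, embed, Wf, Uf, Wb, Ub):
    """Bidirectional RNN encoder sketch.

    embed  : |V| x d  embedding matrix (E-bar in the slides)
    Wf, Wb : h x d    input weights of the forward / backward RNN
    Uf, Ub : h x h    recurrent weights of the forward / backward RNN
    Uses f(h, x) = tanh(U h + W x) as one simple choice for f().
    """
    x = embed[word_ids]                      # embeddings E-bar x_j, shape n x d
    h_size = Uf.shape[0]
    n = len(word_ids)

    # left-to-right states: each h_j depends on h_{j-1}
    fwd = np.zeros((n, h_size))
    h = np.zeros(h_size)
    for j in range(n):
        h = np.tanh(Uf @ h + Wf @ x[j])
        fwd[j] = h

    # right-to-left states: each h_j depends on h_{j+1}
    bwd = np.zeros((n, h_size))
    h = np.zeros(h_size)
    for j in reversed(range(n)):
        h = np.tanh(Ub @ h + Wb @ x[j])
        bwd[j] = h

    # word representation = concatenation of both directions
    return np.concatenate([fwd, bwd], axis=1)   # shape n x 2h
```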

Decoder

[Figure: the decoder hidden states, fed by the input context, predict the output words.]

• We want to have a recurrent neural network predicting output words
• We feed decisions on output words back into the decoder state
• The decoder state is also informed by the input context


More Detail

[Figure: one decoder step, showing the input contexts c_{i-1}, c_i, the states s_{i-1}, s_i, the word prediction t_i, the selected word y_i, and its embedding E y_i.]

• The decoder is also a recurrent neural network over a sequence of hidden states s_i
  s_i = f(s_{i-1}, E y_{i-1}, c_i)
• Again, various choices for the function f(): feed-forward layer, GRU, LSTM, ...
• The output word y_i is selected by computing a vector t_i (same size as the vocabulary)
  t_i = W(U s_{i-1} + V E y_{i-1} + C c_i)
  and then finding the highest value in the vector t_i
• If we normalize t_i, we can view it as a probability distribution over words
• E y_i is the embedding of the output word y_i


Attention

• Given what we have generated so far (the decoder hidden state) ...
• ... which words in the input should we pay attention to (the encoder states)?
• Given:
  – the previous hidden state of the decoder s_{i-1}
  – the representation of the input words h_j = (\overleftarrow{h}_j, \overrightarrow{h}_j)
• Predict an alignment probability a(s_{i-1}, h_j) for each input word j
  (modeled with a feed-forward neural network layer)
• Normalize the attention weights (softmax)
  \alpha_{ij} = \frac{\exp(a(s_{i-1}, h_j))}{\sum_k \exp(a(s_{i-1}, h_k))}
• Relevant input context: weigh the input words according to attention
  c_i = \sum_j \alpha_{ij} h_j
• Use the context to predict the next hidden state and output word


Encoder-Decoder with Attention

[Figure: the full model, stacking input word embeddings, the left-to-right and right-to-left recurrent NNs, attention, input context, hidden state, and output words.]


training


Computation Graph

• The math behind neural machine translation defines a computation graph
• Forward and backward computation to compute gradients for model training

[Figure: example computation graph for sigmoid(W2 · sigmoid(W1 · x + b1) + b2), built from prod, sum, and sigmoid nodes.]


Unrolled Computation Graph

[Figure: the full model unrolled for the sentence pair "the house is big ." → "das Haus ist groß .", with layers for the input word embeddings, the left-to-right and right-to-left recurrent NNs, attention, input context, hidden state, output word predictions, given output words, error, and output word embedding.]
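
The unrolled graph above chains together the attention and decoder computations from the earlier slides. As a rough NumPy sketch of one decoder step with attention: the alignment score a(s_{i-1}, h_j) is modeled here as a single tanh feed-forward layer with assumed parameter names (Wa, Ua, va), and f() is again a simple tanh layer; both are one possible choice, not the exact parameterization on the slides.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(s_prev, Ey_prev, H, p):
    """One decoder step with attention (sketch).

    s_prev  : previous decoder state s_{i-1}
    Ey_prev : embedding of the previous output word, E y_{i-1}
    H       : n x 2h matrix of encoder states h_j
    p       : dict of (illustrative) parameter matrices
    """
    # alignment scores a(s_{i-1}, h_j): small feed-forward layer
    scores = np.array([p['va'] @ np.tanh(p['Wa'] @ s_prev + p['Ua'] @ hj) for hj in H])
    alpha = softmax(scores)                  # attention weights alpha_ij
    c = alpha @ H                            # input context c_i = sum_j alpha_ij h_j

    # next decoder state s_i = f(s_{i-1}, E y_{i-1}, c_i); f = tanh layer here
    s = np.tanh(p['Us'] @ s_prev + p['Vs'] @ Ey_prev + p['Cs'] @ c)

    # raw word scores t_i = W (U s_{i-1} + V E y_{i-1} + C c_i)
    t = p['W'] @ (p['U'] @ s_prev + p['V'] @ Ey_prev + p['C'] @ c)
    probs = softmax(t)                       # normalized: distribution over words
    y = int(np.argmax(t))                    # pick the highest-scoring word
    return s, c, y, probs
```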

Batching

• Already a large degree of parallelism
  – most computations are on vectors and matrices
  – efficient implementations for CPU and GPU
• Further parallelism by batching
  – processing several sentence pairs at once
  – scalar operation → vector operation
  – vector operation → matrix operation
  – matrix operation → 3d tensor operation
• Typical batch sizes: 50–100 sentence pairs


Batches

• Sentences have different lengths
• When batching, unneeded cells in the tensors are filled up with padding
⇒ A lot of wasted computation


Mini-Batches

• Sort sentences by length, break up into mini-batches
• Example: maxi-batch of 1600 sentence pairs, mini-batches of 80 sentence pairs


Overall Organization of Training

• Shuffle the corpus
• Break it into maxi-batches
• Break up each maxi-batch into mini-batches
• Process a mini-batch, update the parameters
• Once done, repeat
• Typically 5–15 epochs are needed (passes through the entire training corpus)


deeper models


Deeper Models

• Encoder and decoder are recurrent neural networks
• We can add additional layers for each step
• Recall shallow and deep language models

[Figure: a shallow network with one hidden layer next to a deep network with three hidden layers between input and output.]

• Adding residual connections (short-cuts through deep layers) helps


Deep Decoder

• Two ways of adding layers
  – deep transitions: several layers on the path to the output
  – deeply stacked recurrent neural networks
• Why not both?

[Figure: the context feeds decoder states organized as Stack 1 / Transition 1, Stack 1 / Transition 2, Stack 2 / Transition 1, Stack 2 / Transition 2.]


Deep Encoder

• The previously proposed encoder already has 2 layers
  – a left-to-right recurrent network, to encode left context
  – a right-to-left recurrent network, to encode right context
⇒ Third way of adding layers: alternate directions across stacked encoder layers

[Figure: input word embeddings feed Encoder Layer 1 (left-to-right), Layer 2 (right-to-left), Layer 3 (left-to-right), Layer 4 (right-to-left).]
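
To make the alternating-direction deep encoder and the residual-connection idea concrete, here is a rough NumPy sketch; the simple tanh recurrence, the residual rule, and the parameter layout are assumptions for illustration, not the exact setup on the slides.

```python
import numpy as np

def rnn_layer(X, W, U, reverse=False):
    """One recurrent layer, f(h, x) = tanh(U h + W x), run left-to-right
    or (if reverse=True) right-to-left over the sequence X (n x d)."""
    n = X.shape[0]
    h_size = U.shape[0]
    H = np.zeros((n, h_size))
    h = np.zeros(h_size)
    order = reversed(range(n)) if reverse else range(n)
    for j in order:
        h = np.tanh(U @ h + W @ X[j])
        H[j] = h
    return H

def deep_encoder(embeddings, layers):
    """Stack encoder layers with alternating directions (L2R, R2L, L2R, R2L, ...)
    and add residual connections between layers of equal width."""
    X = embeddings
    for depth, (W, U) in enumerate(layers):
        H = rnn_layer(X, W, U, reverse=(depth % 2 == 1))  # alternate direction
        if H.shape == X.shape:
            H = H + X                                     # residual short-cut
        X = H
    return X
```

With four layers this reproduces the L2R / R2L / L2R / R2L stack shown on the Deep Encoder slide.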
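
Finally, a minimal sketch of the batching and training-loop organization from the training slides above (maxi-batches of 1600 and mini-batches of 80 sentence pairs, as in the example given there; the function and variable names are illustrative).

```python
import random

def mini_batches(corpus, maxi_size=1600, mini_size=80):
    """Yield length-sorted mini-batches of sentence pairs.

    corpus: list of (source_words, target_words) pairs.
    """
    corpus = list(corpus)
    random.shuffle(corpus)                       # shuffle the corpus
    for i in range(0, len(corpus), maxi_size):   # break it into maxi-batches
        maxi = corpus[i:i + maxi_size]
        # sort by source length so each mini-batch needs little padding
        maxi.sort(key=lambda pair: len(pair[0]))
        minis = [maxi[j:j + mini_size] for j in range(0, len(maxi), mini_size)]
        random.shuffle(minis)                    # avoid always seeing short sentences first
        for mini in minis:
            yield mini                           # process mini-batch, update parameters

# one epoch of training would loop over mini_batches(training_corpus)
```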