Neural Machine Translation

Philipp Koehn

9 October 2018


Language Models

• Modeling variants
  – feed-forward neural network
  – recurrent neural network
  – long short-term memory neural network
• May include input context


Feed Forward Neural Language Model

[Figure: the previous words (Word 1–4) are each mapped through a shared embedding matrix C into a hidden layer, which predicts Word 5.]


Recurrent Neural Language Model

[Figure: the model unrolled over the sentence "the house is big .", showing for each position the given word, its embedding, the hidden state, and the predicted word.]

• Predict the first word of a sentence (same model as before, just drawn top-down)
• Predict the second word of a sentence, re-using the hidden state from the first word prediction
• Predict the third word of a sentence ... and so on


Recurrent Neural Translation Model

• We predicted the words of a sentence
• Why not also predict their translations?


Encoder-Decoder Model

[Figure: a single recurrent network reads "the house is big ." and then continues by predicting "das Haus ist groß ."]

• Obviously madness
• Proposed by Google (Sutskever et al. 2014)


What is Missing?

• Alignment of input words to output words
⇒ Solution: attention mechanism


neural translation model with attention


Input Encoding

[Figure: same layout as the recurrent language model: given word, embedding, hidden state, predicted word.]

• Inspiration: recurrent neural network language model on the input side


Hidden Language Model States

• This gives us the hidden states H1, H2, H3, H4, H5, H6
• These encode left context for each word
• Same process in reverse gives the right context for each word: Ĥ1, Ĥ2, Ĥ3, Ĥ4, Ĥ5, Ĥ6


Input Encoder

[Figure: input word embeddings feed a left-to-right and a right-to-left recurrent NN.]

• Input encoder: concatenate the bidirectional RNN states
• Each word representation includes full left and right sentence context


Encoder: Math

• Input is a sequence of words x_j, mapped into embedding space \bar{E} x_j
• Bidirectional recurrent neural networks
  \overleftarrow{h}_j = f(\overleftarrow{h}_{j+1}, \bar{E} x_j)
  \overrightarrow{h}_j = f(\overrightarrow{h}_{j-1}, \bar{E} x_j)
• Various choices for the function f(): feed-forward layer, GRU, LSTM, ...
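
The following is a minimal NumPy sketch of the bidirectional encoder equations above, using a simple tanh layer as one possible choice for f(); the parameter names (embed, Wf, Uf, Wb, Ub) are illustrative, not taken from the slides.

```python
import numpy as np

def encode(word_ids, embed, Wf, Uf, Wb, Ub):
    """Bidirectional RNN encoder sketch.

    embed  : |V| x d  embedding matrix (E-bar in the slides)
    Wf, Wb : h x d    input weights of the forward / backward RNN
    Uf, Ub : h x h    recurrent weights of the forward / backward RNN
    Uses f(h, x) = tanh(U h + W x) as one simple choice for f().
    """
    x = embed[word_ids]                      # embeddings E-bar x_j, shape n x d
    h_size = Uf.shape[0]
    n = len(word_ids)

    # left-to-right states: each h_j depends on h_{j-1}
    fwd = np.zeros((n, h_size))
    h = np.zeros(h_size)
    for j in range(n):
        h = np.tanh(Uf @ h + Wf @ x[j])
        fwd[j] = h

    # right-to-left states: each h_j depends on h_{j+1}
    bwd = np.zeros((n, h_size))
    h = np.zeros(h_size)
    for j in reversed(range(n)):
        h = np.tanh(Ub @ h + Wb @ x[j])
        bwd[j] = h

    # word representation = concatenation of both directions
    return np.concatenate([fwd, bwd], axis=1)   # shape n x 2h
```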

Decoder

[Figure: the decoder hidden states, fed by the input context, predict the output words.]

• We want to have a recurrent neural network predicting output words
• We feed decisions on output words back into the decoder state
• The decoder state is also informed by the input context


More Detail

[Figure: one decoder step, showing the input contexts c_{i-1}, c_i, the states s_{i-1}, s_i, the word prediction t_i, the selected word y_i, and its embedding E y_i.]

• The decoder is also a recurrent neural network over a sequence of hidden states s_i
  s_i = f(s_{i-1}, E y_{i-1}, c_i)
• Again, various choices for the function f(): feed-forward layer, GRU, LSTM, ...
• The output word y_i is selected by computing a vector t_i (same size as the vocabulary)
  t_i = W(U s_{i-1} + V E y_{i-1} + C c_i)
  and then finding the highest value in the vector t_i
• If we normalize t_i, we can view it as a probability distribution over words
• E y_i is the embedding of the output word y_i


Attention

• Given what we have generated so far (the decoder hidden state) ...
• ... which words in the input should we pay attention to (the encoder states)?
• Given:
  – the previous hidden state of the decoder s_{i-1}
  – the representation of the input words h_j = (\overleftarrow{h}_j, \overrightarrow{h}_j)
• Predict an alignment probability a(s_{i-1}, h_j) for each input word j
  (modeled with a feed-forward neural network layer)
• Normalize the attention weights (softmax)
  \alpha_{ij} = \frac{\exp(a(s_{i-1}, h_j))}{\sum_k \exp(a(s_{i-1}, h_k))}
• Relevant input context: weigh the input words according to attention
  c_i = \sum_j \alpha_{ij} h_j
• Use the context to predict the next hidden state and output word


Encoder-Decoder with Attention

[Figure: the full model, stacking input word embeddings, the left-to-right and right-to-left recurrent NNs, attention, input context, hidden state, and output words.]


training


Computation Graph

• The math behind neural machine translation defines a computation graph
• Forward and backward computation to compute gradients for model training

[Figure: example computation graph for sigmoid(W2 · sigmoid(W1 · x + b1) + b2), built from prod, sum, and sigmoid nodes.]


Unrolled Computation Graph

[Figure: the full model unrolled for the sentence pair "the house is big ." → "das Haus ist groß .", with layers for the input word embeddings, the left-to-right and right-to-left recurrent NNs, attention, input context, hidden state, output word predictions, given output words, error, and output word embedding.]
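
The unrolled graph above chains together the attention and decoder computations from the earlier slides. As a rough NumPy sketch of one decoder step with attention: the alignment score a(s_{i-1}, h_j) is modeled here as a single tanh feed-forward layer with assumed parameter names (Wa, Ua, va), and f() is again a simple tanh layer; both are one possible choice, not the exact parameterization on the slides.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(s_prev, Ey_prev, H, p):
    """One decoder step with attention (sketch).

    s_prev  : previous decoder state s_{i-1}
    Ey_prev : embedding of the previous output word, E y_{i-1}
    H       : n x 2h matrix of encoder states h_j
    p       : dict of (illustrative) parameter matrices
    """
    # alignment scores a(s_{i-1}, h_j): small feed-forward layer
    scores = np.array([p['va'] @ np.tanh(p['Wa'] @ s_prev + p['Ua'] @ hj) for hj in H])
    alpha = softmax(scores)                  # attention weights alpha_ij
    c = alpha @ H                            # input context c_i = sum_j alpha_ij h_j

    # next decoder state s_i = f(s_{i-1}, E y_{i-1}, c_i); f = tanh layer here
    s = np.tanh(p['Us'] @ s_prev + p['Vs'] @ Ey_prev + p['Cs'] @ c)

    # raw word scores t_i = W (U s_{i-1} + V E y_{i-1} + C c_i)
    t = p['W'] @ (p['U'] @ s_prev + p['V'] @ Ey_prev + p['C'] @ c)
    probs = softmax(t)                       # normalized: distribution over words
    y = int(np.argmax(t))                    # pick the highest-scoring word
    return s, c, y, probs
```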

Batching

• Already a large degree of parallelism
  – most computations are on vectors and matrices
  – efficient implementations for CPU and GPU
• Further parallelism by batching
  – processing several sentence pairs at once
  – scalar operation → vector operation
  – vector operation → matrix operation
  – matrix operation → 3d tensor operation
• Typical batch sizes: 50–100 sentence pairs


Batches

• Sentences have different lengths
• When batching, unneeded cells in the tensors are filled up with padding
⇒ A lot of wasted computation


Mini-Batches

• Sort sentences by length, break up into mini-batches
• Example: maxi-batch of 1600 sentence pairs, mini-batches of 80 sentence pairs


Overall Organization of Training

• Shuffle the corpus
• Break it into maxi-batches
• Break up each maxi-batch into mini-batches
• Process a mini-batch, update the parameters
• Once done, repeat
• Typically 5–15 epochs are needed (passes through the entire training corpus)


deeper models


Deeper Models

• Encoder and decoder are recurrent neural networks
• We can add additional layers for each step
• Recall shallow and deep language models

[Figure: a shallow network with one hidden layer next to a deep network with three hidden layers between input and output.]

• Adding residual connections (short-cuts through deep layers) helps


Deep Decoder

• Two ways of adding layers
  – deep transitions: several layers on the path to the output
  – deeply stacked recurrent neural networks
• Why not both?

[Figure: the context feeds decoder states organized as Stack 1 / Transition 1, Stack 1 / Transition 2, Stack 2 / Transition 1, Stack 2 / Transition 2.]


Deep Encoder

• The previously proposed encoder already has 2 layers
  – a left-to-right recurrent network, to encode left context
  – a right-to-left recurrent network, to encode right context
⇒ Third way of adding layers: alternate directions across stacked encoder layers

[Figure: input word embeddings feed Encoder Layer 1 (left-to-right), Layer 2 (right-to-left), Layer 3 (left-to-right), Layer 4 (right-to-left).]
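
To make the alternating-direction deep encoder and the residual-connection idea concrete, here is a rough NumPy sketch; the simple tanh recurrence, the residual rule, and the parameter layout are assumptions for illustration, not the exact setup on the slides.

```python
import numpy as np

def rnn_layer(X, W, U, reverse=False):
    """One recurrent layer, f(h, x) = tanh(U h + W x), run left-to-right
    or (if reverse=True) right-to-left over the sequence X (n x d)."""
    n = X.shape[0]
    h_size = U.shape[0]
    H = np.zeros((n, h_size))
    h = np.zeros(h_size)
    order = reversed(range(n)) if reverse else range(n)
    for j in order:
        h = np.tanh(U @ h + W @ X[j])
        H[j] = h
    return H

def deep_encoder(embeddings, layers):
    """Stack encoder layers with alternating directions (L2R, R2L, L2R, R2L, ...)
    and add residual connections between layers of equal width."""
    X = embeddings
    for depth, (W, U) in enumerate(layers):
        H = rnn_layer(X, W, U, reverse=(depth % 2 == 1))  # alternate direction
        if H.shape == X.shape:
            H = H + X                                     # residual short-cut
        X = H
    return X
```

With four layers this reproduces the L2R / R2L / L2R / R2L stack shown on the Deep Encoder slide.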
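
Finally, a minimal sketch of the batching and training-loop organization from the training slides above (maxi-batches of 1600 and mini-batches of 80 sentence pairs, as in the example given there; the function and variable names are illustrative).

```python
import random

def mini_batches(corpus, maxi_size=1600, mini_size=80):
    """Yield length-sorted mini-batches of sentence pairs.

    corpus: list of (source_words, target_words) pairs.
    """
    corpus = list(corpus)
    random.shuffle(corpus)                       # shuffle the corpus
    for i in range(0, len(corpus), maxi_size):   # break it into maxi-batches
        maxi = corpus[i:i + maxi_size]
        # sort by source length so each mini-batch needs little padding
        maxi.sort(key=lambda pair: len(pair[0]))
        minis = [maxi[j:j + mini_size] for j in range(0, len(maxi), mini_size)]
        random.shuffle(minis)                    # avoid always seeing short sentences first
        for mini in minis:
            yield mini                           # process mini-batch, update parameters

# one epoch of training would loop over mini_batches(training_corpus)
```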