Neural Machine Translation
Philipp Koehn
6 October 2020

Language Models
• Modeling variants
  – feed-forward neural network
  – recurrent neural network
  – long short-term memory neural network
• May include input context

Feed Forward Neural Language Model
[Figure: the history words w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1} are embedded, passed through a feed-forward hidden layer h, and a softmax predicts the output word w_i]

Recurrent Neural Language Model
[Figure: the model unrolled over "the house is big ." — each input word is embedded (E x_j), updates the recurrent state h_j, and a softmax over that state yields the output word prediction t_i]
• Predict the first word of a sentence
• Predict the second word of a sentence, re-using the hidden state from the first word prediction
• Predict the third word of a sentence ... and so on

Recurrent Neural Translation Model
• We predicted the words of a sentence
• Why not also predict their translations?

Encoder-Decoder Model
[Figure: a single chain of recurrent states first reads the input "the house is big ." and then continues to produce the output "das Haus ist groß ."]
• Obviously madness
• Proposed by Google (Sutskever et al. 2014)
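The encoder-decoder model can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the original Sutskever et al. setup: the vocabulary sizes, embedding and hidden dimensions, and the choice of GRU units are assumptions, and the decoder is fed the reference output words rather than its own predictions.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hid=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src, tgt_in):
        # encode the source sentence; keep only the final recurrent state
        _, h = self.encoder(self.src_embed(src))
        # the decoder starts from the encoder's final state and predicts
        # each output word given the previous output word
        dec_states, _ = self.decoder(self.tgt_embed(tgt_in), h)
        return self.out(dec_states)            # unnormalized word scores t_i

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (1, 6))           # e.g. "the house is big . </s>"
tgt_in = torch.randint(0, 1000, (1, 6))        # e.g. "<s> das Haus ist groß ."
probs = torch.softmax(model(src, tgt_in), dim=-1)   # (1, 6, tgt_vocab)
```

The only link between the two languages here is the final encoder state handed to the decoder — which is exactly the weakness the attention mechanism below addresses.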
What is Missing?
• Alignment of input words to output words
⇒ Solution: attention mechanism

neural translation model with attention

Input Encoding
[Figure: the recurrent neural language model run over the input sentence "the house is big ."]
• Inspiration: recurrent neural network language model on the input side

Hidden Language Model States
• This gives us the hidden states of a left-to-right recurrent network
• These encode left context for each word
• Same process in reverse: right context for each word

Input Encoder
[Figure: a left-to-right and a right-to-left encoder RNN over the input word embeddings E x_j, producing states →h_j and ←h_j]
• Input encoder: concatenate bidirectional RNN states
• Each word representation includes full left and right sentence context

Encoder: Math
• Input is a sequence of words x_j, mapped into embedding space Ē x_j
• Bidirectional recurrent neural networks
  ←h_j = f(←h_{j+1}, Ē x_j)
  →h_j = f(→h_{j-1}, Ē x_j)
• Various choices for the function f(): feed-forward layer, GRU, LSTM, ...

Decoder
• We want to have a recurrent neural network predicting output words
[Figure: decoder RNN states s_i, each followed by a softmax that produces the output word prediction t_i]
• We feed decisions on output words back into the decoder state (as output word embeddings E y_i)
• Decoder state is also informed by the input context c_i

More Detail
• Decoder is also a recurrent neural network over a sequence of hidden states s_i
  s_i = f(s_{i-1}, E y_{i-1}, c_i)
• Again, various choices for the function f(): feed-forward layer, GRU, LSTM, ...
• Output word y_i is selected by computing a vector t_i (same size as the vocabulary)
  t_i = W(U s_{i-1} + V E y_{i-1} + C c_i)
  then finding the highest value in the vector t_i
• If we normalize t_i, we can view it as a probability distribution over words
• E y_i is the embedding of the output word y_i
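A NumPy sketch of one decoder step under these formulas; the toy dimensions and random weights are assumptions, and a plain tanh layer stands in for the function f().

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# assumed toy sizes: decoder state 4, word embedding 3, vocabulary 5,
# input context 8 (concatenated left-to-right and right-to-left encoder states)
H, E_DIM, V_SIZE, C_DIM = 4, 3, 5, 8
rng = np.random.default_rng(1)
U, V, C = rng.normal(size=(H, H)), rng.normal(size=(H, E_DIM)), rng.normal(size=(H, C_DIM))
W = rng.normal(size=(V_SIZE, H))
Ws, Wy, Wc = rng.normal(size=(H, H)), rng.normal(size=(H, E_DIM)), rng.normal(size=(H, C_DIM))

def decoder_step(s_prev, Ey_prev, c_i):
    # s_i = f(s_{i-1}, E y_{i-1}, c_i); a plain tanh layer stands in for f()
    s_i = np.tanh(Ws @ s_prev + Wy @ Ey_prev + Wc @ c_i)
    # t_i = W(U s_{i-1} + V E y_{i-1} + C c_i): a vocabulary-sized score vector
    t_i = W @ (U @ s_prev + V @ Ey_prev + C @ c_i)
    return s_i, softmax(t_i)          # normalized t_i is a distribution over words

s_prev  = rng.normal(size=H)          # previous decoder state s_{i-1}
Ey_prev = rng.normal(size=E_DIM)      # embedding of the previous output word
c_i     = rng.normal(size=C_DIM)      # input context from the attention mechanism
s_i, p = decoder_step(s_prev, Ey_prev, c_i)
best_word = int(np.argmax(p))         # pick the highest-scoring word y_i
```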
Attention
[Figure: the decoder state attends over the bidirectional encoder states h_j with attention weights α_ij, which are combined into the input context]
• Given what we have generated so far (decoder hidden state) ...
• ... which words in the input should we pay attention to (encoder states)?

Attention
• Given:
  – the previous hidden state of the decoder s_{i-1}
  – the representation of input words h_j = (←h_j, →h_j)
• Predict an alignment probability a(s_{i-1}, h_j) for each input word j (modeled with a feed-forward neural network layer)
• Normalize attention (softmax)
  α_ij = exp(a(s_{i-1}, h_j)) / Σ_k exp(a(s_{i-1}, h_k))
• Relevant input context: weigh input words according to attention
  c_i = Σ_j α_ij h_j
• Use context to predict next hidden state and output word
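The three attention steps (score, normalize, weighted sum) fit into a short NumPy sketch; the dimensions are toy values, and the additive feed-forward scoring layer with weights Wa, Ua, va is one common choice, assumed here.

```python
import numpy as np

# assumed toy sizes: decoder state 4, encoder state 8 (bidirectional), 6 input words
H_DEC, H_ENC, J = 4, 8, 6
rng = np.random.default_rng(2)
h = rng.normal(size=(J, H_ENC))       # encoder states h_j, one row per input word
s_prev = rng.normal(size=H_DEC)       # previous decoder state s_{i-1}

# feed-forward scoring layer for a(s_{i-1}, h_j) (additive attention; weights assumed)
Wa = rng.normal(size=(H_DEC, H_DEC))
Ua = rng.normal(size=(H_DEC, H_ENC))
va = rng.normal(size=H_DEC)

def attention(s_prev, h):
    # alignment scores a(s_{i-1}, h_j) for every input word j
    scores = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h_j) for h_j in h])
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()               # normalized attention weights α_ij
    c_i = alpha @ h                   # input context c_i = Σ_j α_ij h_j
    return alpha, c_i

alpha, c_i = attention(s_prev, h)     # alpha: (6,) weights, c_i: (8,) context vector
```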
training

Comparing Prediction to Correct Word
[Figure: each predicted distribution t_i is compared to the correct output word y_i ("das", "Haus", "ist", ...), giving an error of −log t_i[y_i]]
• Current model gives some probability t_i[y_i] to the correct word y_i
• We turn this into an error by computing the cross-entropy: −log t_i[y_i]

Computation Graph
• Math behind neural machine translation defines a computation graph
• Forward and backward computation to compute gradients for model training
[Figure: a small worked example — input x is multiplied with W1, summed with b1 and passed through a sigmoid, then multiplied with W2, summed with b2 and passed through a sigmoid, with concrete values at every node]

Unrolled Computation Graph
[Figure: the full attention model unrolled over the sentence pair "the house is big ." → "das Haus ist groß .": bidirectional encoder states, attention, weighted sum, decoder states, softmax predictions, and one cost node −log t_i[y_i] per output word]

Batching
• Already large degree of parallelism
  – most computations on vectors, matrices
  – efficient implementations for CPU and GPU
• Further parallelism by batching
  – processing several sentence pairs at once
  – scalar operation → vector operation
  – vector operation → matrix operation
  – matrix operation → 3d tensor operation
• Typical batch sizes: 50–100 sentence pairs

Batches
• Sentences have different length
• When batching, fill up unneeded cells in tensors
⇒ A lot of wasted computations

Mini-Batches
• Sort sentences by length, break up into mini-batches
• Example: maxi-batch of 1600 sentence pairs, mini-batch of 80 sentence pairs

Overall Organization of Training
• Shuffle corpus
• Break into maxi-batches
• Break up each maxi-batch into mini-batches
• Process mini-batch, update parameters
• Once done, repeat
• Typically 5–15 epochs needed (passes through entire training corpus)
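A plain-Python sketch of this training organization, using the batch sizes from the slides; the corpus format and the final shuffle of mini-batch order are assumptions.

```python
import random

def training_batches(corpus, maxi_size=1600, mini_size=80, seed=0):
    """Yield mini-batches of (source, target) sentence pairs.

    corpus: list of (source_tokens, target_tokens) pairs. The pairs are
    shuffled, grouped into maxi-batches, sorted by length inside each
    maxi-batch, and split into mini-batches, so that each mini-batch
    contains sentences of similar length (little wasted padding).
    """
    rng = random.Random(seed)
    pairs = list(corpus)
    rng.shuffle(pairs)                                   # shuffle corpus
    for start in range(0, len(pairs), maxi_size):        # break into maxi-batches
        maxi = pairs[start:start + maxi_size]
        maxi.sort(key=lambda pair: len(pair[0]))         # sort by source length
        minis = [maxi[i:i + mini_size] for i in range(0, len(maxi), mini_size)]
        rng.shuffle(minis)                               # avoid length-ordered updates
        yield from minis

# usage: one epoch over a toy corpus of already-tokenized sentence pairs
corpus = [(["the", "house"], ["das", "Haus"])] * 5000
for mini_batch in training_batches(corpus):
    pass  # process mini-batch, update parameters
```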
deeper models

Deeper Models
• Encoder and decoder are recurrent neural networks
• We can add additional layers for each step
• Recall shallow and deep language models
[Figure: a shallow language model next to deep stacked and deep transitional variants, which insert hidden layers h_{t,1}, h_{t,2}, h_{t,3} between the input word embedding and the softmax output]
• Adding residual connections (short-cuts through deep layers) helps

Deep Decoder
• Two ways of adding layers
  – deep transitions: several layers on the path to the output
  – deeply stacked recurrent neural networks
• Why not both?
[Figure: a decoder step that combines both — two stacked recurrent blocks over the input context c_t, each followed by feed-forward transition layers, yielding decoder states s_{t,1} = v_{t,1,3} and s_{t,2} = v_{t,2,3}]

Deep Encoder
• Previously proposed encoder already has 2 layers
  – left-to-right recurrent network, to encode left context
  – right-to-left recurrent network, to encode right context
⇒ Third way of adding layers
[Figure: a deep encoder over the input word embeddings — layer 1 runs left-to-right (h_{j,1}) and right-to-left (h_{j,2}), layer 2 again left-to-right (h_{j,3}) and right-to-left (h_{j,4})]
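A PyTorch sketch of such a deep encoder, with assumed dimensions; the built-in stacked bidirectional GRU runs both directions at every layer, which differs in detail from the alternating left-to-right/right-to-left layout in the figure.

```python
import torch
import torch.nn as nn

# assumed toy sizes: vocabulary 1000, embedding 64, hidden 128, two encoder layers
embed = nn.Embedding(1000, 64)
encoder = nn.GRU(64, 128, num_layers=2, bidirectional=True, batch_first=True)

words = torch.randint(0, 1000, (1, 6))    # one sentence, e.g. "the house is big . </s>"
states, _ = encoder(embed(words))         # (1, 6, 256): for each word, the concatenated
                                          # left-to-right and right-to-left states of the top layer
```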