Neural Machine Translation

Philipp Koehn

1 October 2024


Language Models

• Modeling variants
  – feed-forward neural network
  – recurrent neural network
  – long short-term memory neural network
• May include input context


Feed Forward Neural Language Model

[Figure: the history words w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1} are each embedded (embedding E w), fed into a feed-forward hidden layer, and a softmax predicts the output word w_i]


Recurrent Neural Language Model

[Figure: each input word x_j is embedded (E x_j), an RNN updates the recurrent state h_j, and a softmax produces the output word prediction t_i for output word y_i]

Predict the first word of a sentence


Recurrent Neural Language Model

Predict the second word of a sentence

Re-use hidden state from first word prediction


Recurrent Neural Language Model

Predict the third word of a sentence ... and so on


Recurrent Neural Language Model

[Figure: the unrolled recurrent language model predicting the full sentence "the house is big ."]


Recurrent Neural Translation Model

• We predicted the words of a sentence
• Why not also predict their translations?


Encoder-Decoder Model

[Figure: a single recurrent network reads the input sentence "the house is big ." and then continues to generate the output sentence "das Haus ist groß ." word by word; a rough code sketch follows below]

• Obviously madness
• Proposed by Google (Sutskever et al. 2014)


What is Missing?

• Alignment of input words to output words
⇒ Solution: attention mechanism
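Before moving on to attention, here is a minimal sketch of the plain encoder-decoder model from the previous slides, written in numpy. It is not the Sutskever et al. implementation: the toy vocabularies, dimensions, parameter names, the simple tanh recurrence (instead of an LSTM), and greedy decoding are all illustrative assumptions. The point is the data flow: the encoder compresses the whole source sentence into one final hidden state, which then initializes the decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabularies and sizes (illustrative assumptions)
src_vocab = {"the": 0, "house": 1, "is": 2, "big": 3, ".": 4}
trg_vocab = {"<s>": 0, "das": 1, "Haus": 2, "ist": 3, "groß": 4, ".": 5, "</s>": 6}
trg_words = {i: w for w, i in trg_vocab.items()}
d = 8  # embedding and hidden size

# Randomly initialized parameters (in a real system these are trained)
E_src = rng.normal(size=(len(src_vocab), d))   # source word embeddings
E_trg = rng.normal(size=(len(trg_vocab), d))   # target word embeddings
W_enc, U_enc = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_dec, U_dec = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_out = rng.normal(size=(len(trg_vocab), d))   # projection to target vocabulary

def rnn_step(W, U, h, x):
    """Simple (Elman) recurrent step; a GRU or LSTM would be used in practice."""
    return np.tanh(W @ h + U @ x)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def translate(src_sentence, max_len=10):
    # Encoder: run the RNN over the source words, keep only the final state
    h = np.zeros(d)
    for w in src_sentence:
        h = rnn_step(W_enc, U_enc, h, E_src[src_vocab[w]])

    # Decoder: initialize with the encoder's final state, generate greedily
    s, y = h, trg_vocab["<s>"]
    output = []
    for _ in range(max_len):
        s = rnn_step(W_dec, U_dec, s, E_trg[y])
        p = softmax(W_out @ s)   # distribution over target words
        y = int(p.argmax())      # greedy choice of the next word
        if trg_words[y] == "</s>":
            break
        output.append(trg_words[y])
    return output

print(translate(["the", "house", "is", "big", "."]))  # untrained, so output is random
```

With untrained parameters the output is meaningless; what the sketch shows is that a single hidden state has to carry the entire source sentence, which is exactly the bottleneck that the attention mechanism of the next section removes.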
neural translation model with attention


Input Encoding

[Figure: a recurrent neural language model run over the input sentence "the house is big .": each input word x_j is embedded (E x_j), an RNN computes the recurrent state h_j, and a softmax predicts the next word]

• Inspiration: recurrent neural network language model on the input side


Hidden Language Model States

• This gives us the hidden states

[Figure: chain of RNN states, running left to right over the sentence]

• These encode left context for each word
• Same process in reverse: right context for each word

[Figure: chain of RNN states, running right to left over the sentence]


Input Encoder

[Figure: a left-to-right encoder and a right-to-left encoder run over the embedded input words x_j; for each word the two RNN states are concatenated into h_j]

• Input encoder: concatenate bidirectional RNN states
• Each word representation includes full left and right sentence context


Encoder: Math

• Input is a sequence of words x_j, mapped into embedding space Ē x_j
• Bidirectional recurrent neural networks

  right-to-left: ←h_j = f(←h_{j+1}, Ē x_j)
  left-to-right: →h_j = f(→h_{j-1}, Ē x_j)

• Various choices for the function f(): feed-forward layer, GRU, LSTM, ...


Decoder

• We want to have a recurrent neural network predicting output words

[Figure: a sequence of decoder states s_i connected by RNN steps, each followed by a softmax that produces the output word prediction t_i]


Decoder

• We want to have a recurrent neural network predicting output words

[Figure: as before, but each predicted output word is embedded (E y_i) and fed back into the next decoder state]

• We feed decisions on output words back into the decoder state


Decoder

• We want to have a recurrent neural network predicting output words

[Figure: as before, with an input context c_i additionally feeding into each decoder state s_i]

• We feed decisions on output words back into the decoder state
• Decoder state is also informed by the input context


More Detail

• Decoder is also a recurrent neural network over a sequence of hidden states s_i

  s_i = f(s_{i-1}, E y_{i-1}, c_i)

• Again, various choices for the function f(): feed-forward layer, GRU, LSTM, ...
• Output word y_i is selected by computing a vector t_i (same size as vocabulary)

  t_i = W(U s_{i-1} + V E y_{i-1} + C c_i)

  then finding the highest value in vector t_i
• If we normalize t_i, we can view it as a probability distribution over words
• E y_i is the embedding of the output word y_i
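A minimal sketch of one decoder step, following the two equations above. All parameter names, sizes, and the choice of a plain tanh layer for f() are illustrative assumptions: the new state s_i is computed from the previous state, the embedding of the previous output word, and the input context; the prediction vector t_i is a linear combination of the same three quantities, turned into a probability distribution by a softmax.

```python
import numpy as np

rng = np.random.default_rng(1)
d, vocab_size = 8, 7   # hidden/embedding size and target vocabulary size (toy values)

# Randomly initialized parameters (trained in a real system)
U = rng.normal(size=(d, d))             # weights on the previous decoder state
V = rng.normal(size=(d, d))             # weights on the previous output word embedding
C = rng.normal(size=(d, d))             # weights on the input context
W = rng.normal(size=(vocab_size, d))    # projection to the vocabulary
W_s = rng.normal(size=(d, d))           # parameters of the state update f()
U_s = rng.normal(size=(d, d))
C_s = rng.normal(size=(d, d))
E_y = rng.normal(size=(vocab_size, d))  # target word embeddings

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(s_prev, y_prev, c_i):
    """One decoder step: returns the new state s_i and a distribution over words."""
    # s_i = f(s_{i-1}, E y_{i-1}, c_i); here f() is a simple tanh layer (GRU/LSTM in practice)
    s_i = np.tanh(W_s @ s_prev + U_s @ E_y[y_prev] + C_s @ c_i)
    # t_i = W(U s_{i-1} + V E y_{i-1} + C c_i), normalized into a probability distribution
    t_i = W @ (U @ s_prev + V @ E_y[y_prev] + C @ c_i)
    return s_i, softmax(t_i)

s, p = decoder_step(np.zeros(d), 0, np.zeros(d))
print(p.sum(), int(p.argmax()))   # probabilities sum to 1; argmax is the predicted word
```

The remaining question is where the input context c_i comes from; that is the job of the attention mechanism on the following slides.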
Attention

[Figure: decoder state s_i alongside the bidirectional encoder states h_j (right-to-left and left-to-right encoders)]

• Given what we have generated so far (decoder hidden state)
• ... which words in the input should we pay attention to (encoder states)?


Attention

[Figure: attention values α_ij connect the decoder state s_i to the encoder states h_j]

• Given:
  – the previous hidden state of the decoder s_{i-1}
  – the representation of input words h_j = (←h_j, →h_j)
• Predict an alignment probability a(s_{i-1}, h_j) for each input word j
  (modeled with a feed-forward neural network layer)


Attention

• Normalize attention (softmax)

  α_ij = exp(a(s_{i-1}, h_j)) / Σ_k exp(a(s_{i-1}, h_k))


Attention

[Figure: the input context as the attention-weighted sum of the encoder states]

• Relevant input context: weigh input words according to attention

  c_i = Σ_j α_ij h_j


Attention

[Figure: the weighted sum c_i feeds into the next decoder state s_i]

• Use context to predict next hidden state and output word


training


Comparing Prediction to Correct Word

[Figure: at each output position the prediction t_i is compared to the correct word ("das", "Haus", "ist", ...) and a cost -log t_i[y_i] is computed]

• Current model gives some probability t_i[y_i] to the correct word y_i
• We turn this into an error by computing the cross-entropy: -log t_i[y_i]
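The sketch below pulls together the attention formulas from the previous section and the cross-entropy error just defined. The encoder states, decoder state, scoring parameters, and the particular feed-forward scoring form are illustrative assumptions; the steps shown are scoring each input position, softmax normalization into α_ij, the weighted sum c_i, and the per-word training error -log t_i[y_i].

```python
import numpy as np

rng = np.random.default_rng(2)
d, src_len, vocab_size = 8, 5, 7   # toy sizes (assumptions)

# Encoder states h_j (random here; normally the concatenated bidirectional RNN states)
H = rng.normal(size=(src_len, 2 * d))
s_prev = rng.normal(size=d)        # previous decoder state s_{i-1}

# Attention scoring a(s_{i-1}, h_j): one common feed-forward form
W_a = rng.normal(size=(d, d))
U_a = rng.normal(size=(d, 2 * d))
v_a = rng.normal(size=d)

def attention(s_prev, H):
    # Unnormalized alignment scores for every input position j
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])
    # Softmax normalization: alpha_ij = exp(a(s, h_j)) / sum_k exp(a(s, h_k))
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Input context: attention-weighted sum of encoder states, c_i = sum_j alpha_ij h_j
    c_i = alpha @ H
    return alpha, c_i

alpha, c_i = attention(s_prev, H)
print(alpha.round(2), alpha.sum())   # weights over the input words, summing to 1

# Training error for one output position: cross-entropy -log t_i[y_i]
t_i = rng.random(vocab_size)
t_i /= t_i.sum()                     # stand-in for the normalized prediction vector
y_i = 3                              # index of the correct word (toy value)
print(-np.log(t_i[y_i]))             # the cost that is backpropagated
```

Summing this cost over all output positions of all sentence pairs in a batch gives the training objective whose gradients update the model parameters.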
Unrolled Computation Graph

[Figure: the full model unrolled over the sentence pair "the house is big ." → "das Haus ist groß .": input word embeddings, left-to-right and right-to-left encoder states h_j, attention α_ij and weighted sum c_i, decoder states s_i, output word predictions t_i, output word embeddings E y_i, and the error -log t_i[y_i] at each output position]


Batching

• Already large degree of parallelism
  – most computations on vectors, matrices
  – efficient implementations for CPU and GPU
• Further parallelism by batching
  – processing several sentence pairs at once
  – scalar operation → vector operation
  – vector operation → matrix operation
  – matrix operation → 3d tensor operation
• Typical batch sizes: 50–100 sentence pairs


Batches

• Sentences have different length
• When batching, fill up unneeded cells in tensors
⇒ A lot of wasted computations


Mini-Batches

• Sort sentences by length, break up into mini-batches
• Example: maxi-batch of 1600 sentence pairs, mini-batch of 80 sentence pairs


Overall Organization of Training

• Shuffle corpus
• Break into maxi-batches
• Break up each maxi-batch into mini-batches
• Process mini-batch, update parameters
• Once done, repeat
• Typically 5–15 epochs needed (passes through entire training corpus)


Deeper Models

• Encoder and decoder are recurrent neural networks
• We can add additional layers for each step
• Recall shallow and deep language models

[Figure: a shallow model (one RNN layer per step) compared to deep stacked and deep transitional models (three hidden layers h_{t,1}, h_{t,2}, h_{t,3} per step)]

• Adding residual connections (short-cuts through deep layers) helps
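A minimal sketch of one time step of a deep stacked recurrent layer with residual connections, under stated assumptions: the layer count, sizes, parameter names, the tanh recurrence, and the exact placement of the short-cuts (around every layer above the first) are all illustrative choices rather than the slide's specific architecture.

```python
import numpy as np

rng = np.random.default_rng(3)
d, depth = 8, 3          # hidden size and number of stacked layers (toy values)

# One recurrent and one input weight matrix per layer, randomly initialized for the sketch
W = [rng.normal(size=(d, d)) for _ in range(depth)]
U = [rng.normal(size=(d, d)) for _ in range(depth)]

def deep_stacked_step(h_prev, x):
    """One time step of a deep stacked RNN with residual connections.

    h_prev: list of hidden states h_{t-1,1..depth} from the previous time step
    x:      input to this time step (e.g. a word embedding)
    """
    h_new = []
    layer_input = x
    for k in range(depth):
        h_k = np.tanh(W[k] @ h_prev[k] + U[k] @ layer_input)
        if k > 0:
            h_k = h_k + layer_input   # residual connection: short-cut around the layer
        h_new.append(h_k)
        layer_input = h_k             # this layer's output feeds the layer above
    return h_new                      # h_new[-1] would feed the softmax / next component

# Run a few steps over random stand-in embeddings
h = [np.zeros(d) for _ in range(depth)]
for _ in range(4):
    h = deep_stacked_step(h, rng.normal(size=d))
print(h[-1].round(2))
```

The residual additions let gradients flow around the extra layers, which is what makes such deeper encoders and decoders trainable in practice.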