Alternative Architectures
Philipp Koehn
10 October 2024

Attention

• Machine translation is a structured prediction task
  – output is not a single label
  – output structure needs to be built, word by word
• Relevant information for each word prediction varies
• Human translators pay attention to different parts of the input sentence when translating
⇒ Attention mechanism

Attention

[Figure: bidirectional RNN encoder (right-to-left and left-to-right encoder states h_j) and RNN decoder state s_i, connected by attention weights α_ij that produce an input context]

• Given what we have generated so far (the decoder hidden state)
• ... which words in the input should we pay attention to (the encoder states)?
• Given:
  – the previous hidden state of the decoder s_{i-1}
  – the representation of input words h_j = (←h_j, →h_j)
• Predict an alignment probability a(s_{i-1}, h_j) for each input word j (modeled with a feed-forward neural network layer)
• Normalize attention weights (softmax):
  α_ij = exp(a(s_{i-1}, h_j)) / Σ_k exp(a(s_{i-1}, h_k))
• Relevant input context: weigh the input words according to attention:
  c_i = Σ_j α_ij h_j
• Use the context to predict the next hidden state and output word

Computing Attention

• Attention mechanism in the neural translation model (Bahdanau et al., 2015)
  – previous hidden state s_{i-1}
  – input word representation h_j
  – trainable parameters b, W_a, U_a, v_a
  a(s_{i-1}, h_j) = v_a^T tanh(W_a s_{i-1} + U_a h_j + b)
• Other ways to compute attention (Luong et al., 2015)
  – Dot product: a(s_{i-1}, h_j) = s_{i-1}^T h_j
  – Scaled dot product: a(s_{i-1}, h_j) = (1/√|h_j|) s_{i-1}^T h_j
  – General: a(s_{i-1}, h_j) = s_{i-1}^T W_a h_j
  – Local: a(s_{i-1}) = W_a s_{i-1}

General View of Dot-Product Attention

• Three elements
  – Query: decoder state
  – Key: encoder state
  – Value: encoder state
• Intuition
  – given a query (the decoder state)
  – we check how well it matches keys in the database (the encoder states)
  – and then use the matching score to scale the retrieved value (also the encoder state)
• Computation
  Attention(Q, K, V) = softmax(QK^T / √d_k) V
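A minimal NumPy sketch of this scaled dot-product attention, computing an input context from one decoder state and a handful of encoder states (all names, shapes, and values are illustrative, not taken from the lecture):

```python
# Minimal sketch of scaled dot-product attention (illustration only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # one score per (query, key) pair
    alpha = softmax(scores, axis=-1)          # attention weights sum to 1 per query
    return alpha @ V                          # weighted sum of the values

# Toy example: 1 decoder state (query) attending over 5 encoder states (keys/values)
H = np.random.randn(5, 8)    # encoder states h_j
s = np.random.randn(1, 8)    # decoder state s_{i-1}
context = dot_product_attention(s, H, H)      # input context c_i
```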
General View of Dot-Product Attention

[Figure: the encoder-decoder attention network, with the decoder state s_i acting as the query and the encoder states h_j acting as keys and values]

• Query: decoder state, Key and Value: encoder states
  Attention(S, H, H)

Self Attention

• Finally, a very different take on attention
• Motivation so far: need for alignment between input words and output words
• Now: refine the representation of input words in the encoder
  – the representation of an input word mostly depends on itself
  – but it is also informed by the surrounding context
  – previously: recurrent neural networks (consider left or right context)
  – now: attention mechanism
• Self attention: which of the surrounding words is most relevant to refine the representation?

Self Attention

[Figure: input word embeddings E x_j, connected by self-attention weights α_ij, producing refined input word representations as weighted sums]

• Given: input word embeddings
• Task: consider how each should be refined in view of the others
• Needed: how much attention to pay to the others
• Computation of attention weights as before
  – Key: the word embedding (or more generally, the encoder state H of the word)
  – Query: the word embedding (or more generally, the encoder state H of the word)
• Again, multiply with weight matrices: Q = HW^Q and K = HW^K
• Attention weights: QK^T
• Full self attention
  self-attention(H) = Attention(HW^Q, HW^K, H)
• The resulting vector is a weighted sum over the context words

Multi-Head Attention

• Add redundancy
  – say, 16 attention heads
  – each based on its own parameters W_i^Q, W_i^K, W_i^V
• Formally:
  head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
  MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
• Multi-head attention is a form of ensembling

Multi-Head Attention

"Many of the attention heads exhibit behaviour that seems related to the structure of the sentence." (Vaswani et al., 2017)

Self Attention: Transformer

• Self-attention in the encoder
  – refine word representations based on relevant context words
  – relevance determined by self attention
• Self-attention in the decoder
  – refine output word predictions based on relevant previous output words
  – relevance determined by self attention
• Also regular attention to the encoder states in the decoder
• Currently the most successful model (some variants use self-attention only in the encoder and keep a regular recurrent decoder)

Self Attention Layer

• Given: input word representations h_j, packed into a matrix H = (h_1, ..., h_j)
• Self attention
  self-attention(H) = MultiHead(H, H, H)
• Shortcut connection
  self-attention(h_j) + h_j
• Layer normalization
  ĥ_j = layer-normalization(self-attention(h_j) + h_j)
• Feed-forward step with ReLU activation function
  relu(W ĥ_j + b)
• Again, shortcut connection and layer normalization
  layer-normalization(relu(W ĥ_j + b) + ĥ_j)
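A minimal sketch of one such encoder layer in NumPy, using a single attention head instead of multi-head attention for brevity and omitting the learnable layer-normalization bias; parameter names and dimensions are assumptions:

```python
# Sketch of one encoder self-attention layer (single head, illustrative parameters).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def self_attention_layer(H, W_Q, W_K, W_V, W_ff, b_ff):
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V   # self-attention(H)
    H_hat = layer_norm(attn + H)                         # shortcut + layer norm
    ff = np.maximum(0, H_hat @ W_ff + b_ff)              # feed-forward step with ReLU
    return layer_norm(ff + H_hat)                        # shortcut + layer norm again

d = 16
H = np.random.randn(6, d)                                # 6 input word representations
W_Q, W_K, W_V, W_ff = (np.random.randn(d, d) * 0.1 for _ in range(4))
refined = self_attention_layer(H, W_Q, W_K, W_V, W_ff, np.zeros(d))
```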
Encoder

[Figure: the encoder as a sequence of self-attention layers. Input words x_j ("the house is big .") and their positions j = 0, 1, ..., 6 are embedded and added (E_w x_j + E_p j), then passed through self attention with a weighted sum (input context), add & norm with a shortcut (ĥ_j), and a feed-forward (FF) layer with add & norm, producing refined encoder states h_j]

• Sequence of self-attention layers

Self-Attention in the Decoder

• Same idea as in the encoder
• Output words are initially encoded by word embeddings s_i = E y_i
• Self attention is computed over the previous output words
  – the association of a word s_i is limited to words s_k (k ≤ i)
  – resulting representation s̃_i
  self-attention(S̃) = MultiHead(S̃, S̃, S̃)

Attention in the Decoder

• Original intuition of the attention mechanism: focus on relevant input words
• Computed with the dot product S̃ H^T
• Compute attention between the decoder states S̃ and the final encoder states H
  attention(S̃, H) = MultiHead(S̃, H, H)
• Note: the attention mechanism formally mirrors self-attention

Full Decoder

• Self-attention
  self-attention(S̃) = MultiHead(S̃, S̃, S̃)
  – shortcut connections
  – layer normalization
• Attention
  attention(S̃, H) = MultiHead(S̃, H, H)
  – shortcut connections
  – layer normalization
  – feed-forward layer
• Multiple stacked layers
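A sketch of the decoder's masked self-attention in NumPy: each position may only attend to itself and earlier positions (k ≤ i), implemented here with a single head and an upper-triangular mask; names and shapes are illustrative:

```python
# Sketch of masked (causal) self-attention over previous output words.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(S):
    """S: matrix of decoder-side representations s_1..s_n (one row per output word)."""
    n, d = S.shape
    scores = S @ S.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores[mask] = -np.inf                            # block attention to future words
    return softmax(scores) @ S                        # weighted sum over words k <= i

S = np.random.randn(5, 8)          # embeddings of the previous output words
S_tilde = masked_self_attention(S)
```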
Decoder

[Figure: the decoder as a stack of layers. Previous output words y_i ("the house is big .") and their positions i = 0, 1, ..., 6 are embedded and added (positional output word embedding s_i), then passed through self attention with a weighted sum (output context) and add & norm, attention over the encoder states h with a weighted sum (context) and add & norm (ŝ_{i+1}), and a feed-forward (FF) layer with add & norm, producing refined decoder states s_{i+1}]

The decoder computes attention-based representations of the output in several layers, initialized with the embeddings of the previous output words.

Multiple Layers

• Stack several transformer layers (say, D = 6)
• Encoder
  – start with the input word embedding h_{0,j} = E x_j
  – stacked layers h_{d,j} = self-attention-layer(h_{d-1,j})
• Same for the decoder

Multiple Layers in Encoder and Decoder

[Figure: input word embeddings pass through a stack of encoder layers; output word embeddings pass through a stack of decoder layers; a softmax and argmax over each final decoder state produce the output word predictions]

Learning Rate

• Gradient computation gives the direction of change
• Scaled by a learning rate → weight updates
• Simplest form: fixed value
• Annealing
  – start with a larger value (big changes at the beginning)
  – reduce it over time (minor adjustments to refine the model)

Ensuring Randomness

• Typical theoretical assumption: independent and identically distributed training examples
• Approximate this ideal
  – avoid undue structure in the training data
  – avoid undue structure in the initial weight setting
• ML approach: maximum entropy training
  – fit the properties of the training data
  – otherwise, the model should be as random as possible (i.e., have maximum entropy)

Shuffling the Training Data

• Typical training data in machine translation
  – different types of corpora
    ∗ European Parliament proceedings
    ∗ collections of movie subtitles
  – temporal structure in each corpus
  – similar sentences next to each other (e.g., same story / debate)
• Online updating: the last examples matter more
• Convergence criterion: no improvement recently
  → a stretch of hard examples following easy examples leads to premature stopping
⇒ randomly shuffle the training data (maybe each epoch)

Weight Initialization

• Initialize weights to random values
• Values are chosen from a uniform distribution
• Ideal weights lead to node values in the transition area of the activation function

For Example: Sigmoid

• Input values in the range [−1, 1]
⇒ Output values in the range [0.269, 0.731]
• Magic formula (n = size of the previous layer):
  [−1/√n, 1/√n]
• Magic formula for hidden layers:
  [−√6/√(n_j + n_{j+1}), √6/√(n_j + n_{j+1})]
  – n_j is the size of the previous layer
  – n_{j+1} is the size of the next layer
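A small NumPy sketch of the two uniform initialization ranges above; the function names and layer sizes are made up for illustration:

```python
# Sketch of uniform weight initialization with the two ranges from the slide.
import numpy as np

def init_simple(n_prev, n_next):
    """Uniform in [-1/sqrt(n), 1/sqrt(n)], n = size of the previous layer."""
    bound = 1.0 / np.sqrt(n_prev)
    return np.random.uniform(-bound, bound, size=(n_prev, n_next))

def init_hidden(n_j, n_j1):
    """Uniform in [-sqrt(6)/sqrt(n_j + n_{j+1}), +sqrt(6)/sqrt(n_j + n_{j+1})]."""
    bound = np.sqrt(6.0) / np.sqrt(n_j + n_j1)
    return np.random.uniform(-bound, bound, size=(n_j, n_j1))

W1 = init_simple(512, 256)   # e.g. a 512-node layer feeding 256 nodes
W2 = init_hidden(256, 256)   # hidden-to-hidden weights
```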
Problem: Overconfident Models

• Predictions of neural machine translation models are surprisingly confident
• Often almost all the probability mass is assigned to a single word (word prediction probabilities of over 99%)
• Problem for decoding and training
  – decoding: sensible alternatives get low scores, bad for beam search
  – training: overfitting is more likely
• Solution: label smoothing
• Jargon notice
  – in classification tasks, we predict a label
  – jargon term for any output
  → here, we smooth the word predictions

Label Smoothing during Decoding

• Common strategy to combat peaked distributions: smooth them
• Recall
  – the prediction layer produces a number s_j for each word
  – converted into probabilities using the softmax
    p(y_i) = exp(s_i) / Σ_j exp(s_j)
• The softmax calculation can be smoothed with a so-called temperature T
    p(y_i) = exp(s_i / T) / Σ_j exp(s_j / T)
• Higher temperature → smoother distribution (i.e., less probability is given to the most likely choice)

Label Smoothing during Training

• Root of the problem: training
• Training objective: assign all probability mass to the single correct word
• Label smoothing
  – the truth gives some probability mass to other words (say, 10% of it)
  – either uniformly distributed over all words
  – or relative to unigram word probabilities (relative counts of each word in the target side of the training data)

Adjusting the Learning Rate

• Gradient descent training: weight updates follow the gradient downhill
• Actual gradients have fairly large values, so they are scaled with a learning rate (a low number, e.g., µ = 0.001)
• Change the learning rate over time
  – starting with larger updates
  – refining weights with smaller updates
  – adjust for other reasons
• Learning rate schedule

Momentum Term

• Consider the case where a weight value is far from its optimum
• Most training examples push the weight value in the same direction
• Small updates take long to accumulate
• Solution: momentum term m_t
  – accumulate weight updates at each time step t
  – some decay rate for the sum (e.g., 0.9)
  – combine the momentum term m_{t-1} with the weight update value Δw_t
  m_t = 0.9 m_{t-1} + Δw_t
  w_t = w_{t-1} − µ m_t

Adapting the Learning Rate per Parameter

• Common strategy: reduce the learning rate µ over time
• Initially parameters are far away from the optimum → change a lot
• Later nuanced refinements are needed → change little
• Now: a different learning rate for each parameter

Adagrad

• Different parameters are at different stages of training
  → different learning rate for each parameter
• Adagrad
  – record gradients for each parameter
  – accumulate their squared values over time
  – use this sum to reduce the learning rate
• Update formula
  – gradient g_t = dE_t/dw of the error E with respect to weight w
  – divide the learning rate µ by the accumulated sum
  Δw_t = µ / √(Σ_{τ=1..t} g_τ²) · g_t
• Big changes in the parameter value (corresponding to big gradients g_t)
  → reduction of the learning rate for that weight parameter
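A toy NumPy sketch of the momentum and Adagrad updates above, each applied to a single weight with a made-up quadratic error; the small eps added for numerical stability is an assumption, not part of the slide formulas:

```python
# Toy comparison of the momentum and Adagrad update rules on one weight.
import numpy as np

mu = 0.001        # learning rate
m = 0.0           # momentum term m_t
g_sq_sum = 0.0    # Adagrad: accumulated squared gradients
w_momentum = w_adagrad = 1.0
eps = 1e-8        # assumed stability constant

for t in range(100):
    g = 2 * (w_momentum - 0.5)          # toy gradient of E = (w - 0.5)^2
    m = 0.9 * m + g                     # m_t = 0.9 m_{t-1} + Δw_t
    w_momentum -= mu * m                # w_t = w_{t-1} - µ m_t

    g2 = 2 * (w_adagrad - 0.5)
    g_sq_sum += g2 ** 2                                 # accumulate g_τ² over time
    w_adagrad -= mu / (np.sqrt(g_sq_sum) + eps) * g2    # Δw_t = µ / sqrt(Σ g²) · g_t
```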
Adam: Elements

• Combines the idea of a momentum term with reducing the parameter update by the accumulated change
• Momentum term idea (e.g., β_1 = 0.9)
  m_t = β_1 m_{t-1} + (1 − β_1) g_t
• Accumulated squared gradients (decay with β_2 = 0.999)
  v_t = β_2 v_{t-1} + (1 − β_2) g_t²

Adam: Technical Correction

• Initially, the values of m_t and v_t are close to their initial value of 0
• Adjustment
  m̂_t = m_t / (1 − β_1^t),  v̂_t = v_t / (1 − β_2^t)
• With t → ∞ this correction goes away
  lim_{t→∞} 1 / (1 − β^t) = 1

Adam

• Given
  – learning rate µ
  – momentum m̂_t
  – accumulated change v̂_t
• Weight update per Adam (e.g., ε = 10⁻⁸)
  Δw_t = µ / (√v̂_t + ε) · m̂_t
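A toy sketch of the full Adam update, combining the momentum term, the accumulated squared gradients, and the bias correction for a single weight (the gradient is a made-up example):

```python
# Toy Adam update for a single weight.
import numpy as np

mu, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
m = v = 0.0
w = 1.0

for t in range(1, 1001):
    g = 2 * (w - 0.5)                        # toy gradient of E = (w - 0.5)^2
    m = beta1 * m + (1 - beta1) * g          # momentum term m_t
    v = beta2 * v + (1 - beta2) * g ** 2     # accumulated squared gradients v_t
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= mu / (np.sqrt(v_hat) + eps) * m_hat # Δw_t = µ / (sqrt(v̂_t) + ε) · m̂_t
```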
Batched Gradient Updates

• Accumulate the weight updates over all training examples → update
  (converges slowly)
• Process each training example → update (stochastic gradient descent)
  (quicker convergence, but the last training examples have a disproportionately higher impact)
• Process data in batches
  – compute the gradients for the individual word prediction errors
  – use the sum over each batch to update the parameters
  → better parallelization on GPUs
• Process data on multiple compute cores
  – batch processing may take different amounts of time
  – asynchronous training: apply updates when they arrive
  – the mismatch between original weights and updates may not matter much

Avoiding Local Optima

• One of the hardest problems when designing neural network architectures and optimization methods
• Ensure that the model converges at least to a set of parameter values that give results close to the optimum on unseen test data
• There is no real solution to this problem
• It requires experimentation and analysis that is more craft than science
• Still, this section presents a number of methods that generally help to avoid getting stuck in local optima

Overfitting and Underfitting

• Neural machine translation models
  – 100s of millions of parameters
  – 100s of millions of training examples (individual word predictions)
• No hard rules for the relationship between these two numbers
• Too many parameters and too few training examples → overfitting
• Too few parameters and many training examples → underfitting

Regularization

• Motivation: prefer as few parameters as possible
• Strategy: set un-needed parameters to a value of 0
• Method
  – adjust the training objective
  – add a cost for any non-zero parameter
  – typically done with the L2 norm
• Practical impact
  – the derivative of the L2 norm is the value of the parameter
  – if there is no signal from training: reduce the value of the parameter
  – also called weight decay
• Not common in deep learning, but other methods can be understood as regularization

Curriculum Learning

• Human learning
  – learn simple concepts first
  – learn more complex material later
• Early epochs: only easy training examples
  – only short sentences
  – create artificial data by extracting smaller segments (similar to phrase pair extraction in statistical machine translation)
• Later epochs: all training data
• Not easy to calibrate

Dropout

• Training may get stuck in local optima
  – some properties of the task have been learned
  – discovering other properties would take it too far out of its comfort zone
• Machine translation example
  – the model has learned the language model aspects
  – but cannot figure out the role of the input sentence
• Dropout: for each batch, eliminate some nodes

Dropout

• Dropout
  – for each batch, a different random set of nodes is removed
  – their values are set to 0 and their weights are not updated
  – 10%, 20% or even 50% of all the nodes
• Why does this work?
  – robustness: redundant nodes play similar roles
  – ensemble learning: different subnetworks are different models
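A minimal sketch of dropout applied to a vector of node values; the rescaling by 1/(1 − rate) ("inverted dropout") is common practice assumed here, not stated on the slide:

```python
# Sketch of (inverted) dropout on a layer's node values during training.
import numpy as np

def dropout(h, rate=0.2, train=True):
    if not train:
        return h                                   # no dropout at inference time
    keep = np.random.rand(*h.shape) >= rate        # random set of surviving nodes
    return np.where(keep, h / (1.0 - rate), 0.0)   # zero out dropped nodes, rescale the rest

h = np.random.randn(8)
print(dropout(h, rate=0.5))   # roughly half the node values set to 0
```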
Gradient Clipping

• Exploding gradients: gradients become too large during the backward pass
⇒ Limit the total value of the gradients for a layer to a threshold τ
• Use the L2 norm of the gradient vector g
  L2(g) = √(Σ_j g_j²)
• Adjust each gradient value g_i for each element i of the vector
  g_i ← g_i × τ / max(τ, L2(g))

Layer Normalization

• During inference, average node values may become too large or too small
• This also has an impact on training (gradients are multiplied with node values)
⇒ Normalize node values
• During training, learn bias layers

Layer Normalization: Math

• Feed-forward layer h^l, weights W, computed sum s^l, activation function
  s^l = W h^{l-1}
  h^l = sigmoid(s^l)
• Compute the mean µ^l and standard deviation σ^l of the sum vector s^l
  µ^l = (1/H) Σ_{i=1..H} s_i^l
  σ^l = √((1/H) Σ_{i=1..H} (s_i^l − µ^l)²)
• Normalize s^l
  ŝ^l = (1/σ^l) (s^l − µ^l)
• Learnable bias vectors g and b
  ŝ^l = (g/σ^l) ⊙ (s^l − µ^l) + b

Shortcuts and Highways

• Deep learning: many layers of processing
⇒ Error propagation has to travel farther
• All parameters in the processing chain have to be adjusted
• Instead of always passing through all layers, add connections from the first layer to the last
• Jargon alert
  – shortcuts
  – residual connections
  – skip connections

Shortcuts

• Feed-forward layer
  y = f(x)
• Pass through the input x
  y = f(x) + x
• Note: the gradient is
  y′ = f′(x) + 1
• Constant 1 → the gradient is passed through unchanged

Highways

• Regulate how much information from f(x) and x should impact the output y
• Gate t(x) (typically computed by a feed-forward layer)
  y = t(x) ⊙ f(x) + (1 − t(x)) ⊙ x

Shortcuts and Highways

[Figure: a basic feed-forward layer, a skip connection (FF output added to the input), and a highway network (FF output and input combined by a gate)]
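A small NumPy sketch contrasting a shortcut (residual) connection with a highway connection around a ReLU feed-forward layer; the sigmoid gate and all parameters are illustrative assumptions:

```python
# Sketch of shortcut vs. highway connections around a feed-forward layer f(x).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ff(x, W, b):
    return np.maximum(0, W @ x + b)          # f(x): feed-forward layer with ReLU

def shortcut(x, W, b):
    return ff(x, W, b) + x                   # y = f(x) + x

def highway(x, W, b, W_t, b_t):
    t = sigmoid(W_t @ x + b_t)               # gate t(x), computed by a feed-forward layer
    return t * ff(x, W, b) + (1 - t) * x     # y = t(x) ⊙ f(x) + (1 − t(x)) ⊙ x

d = 8
x = np.random.randn(d)
W, W_t = np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1
y_res = shortcut(x, W, np.zeros(d))
y_high = highway(x, W, np.zeros(d), W_t, np.zeros(d))
```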