Alternative Architectures
Philipp Koehn
12 October 2023

attention

Attention
• Machine translation is a structured prediction task
  – the output is not a single label
  – the output structure needs to be built, word by word
• The relevant information for each word prediction varies
• Human translators pay attention to different parts of the input sentence when translating
⇒ Attention mechanism

Computing Attention
• Attention mechanism in the neural translation model (Bahdanau et al., 2015)
  – previous hidden state s_{i−1}
  – input word embedding h_j
  – trainable parameters b, W_a, U_a, v_a
    a(s_{i−1}, h_j) = v_a^T tanh(W_a s_{i−1} + U_a h_j + b)
• Other ways to compute attention
  – Dot product: a(s_{i−1}, h_j) = s_{i−1}^T h_j
  – Scaled dot product: a(s_{i−1}, h_j) = (1 / √|h_j|) s_{i−1}^T h_j
  – General: a(s_{i−1}, h_j) = s_{i−1}^T W_a h_j
  – Local: a(s_{i−1}) = W_a s_{i−1}

Attention of Luong et al. (2015)
• Luong et al. (2015) demonstrate good results with the dot product
    a(s_{i−1}, h_j) = s_{i−1}^T h_j
• No trainable parameters
• Additional changes
• Currently more popular

General View of Dot-Product Attention
• Three elements
  – Query: decoder state
  – Key: encoder state
  – Value: encoder state
• Intuition
  – given a query (the decoder state)
  – we check how well it matches keys in the database (the encoder states)
  – and then use the matching score to scale the retrieved value (also the encoder state)
• Computation
    Attention(Q, K, V) = softmax(QK^T / √d_k) V

Scaled Dot-Product Attention
• Refinement of query, key, and value
• Project them down to lower-dimensional vectors (e.g., from 4096 to 512)
• Using a weight matrix for each: QW^Q, KW^K, VW^V

Multi-Head Attention
• Add redundancy
  – say, 16 attention heads
  – each based on its own parameters
• Formally:
    head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
    MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
• Multi-head attention is a form of ensembling
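To make the dot-product attention and multi-head formulas above concrete, here is a minimal NumPy sketch. All names, dimensions, and the random weight matrices are illustrative assumptions, not part of the slides.

```python
# Minimal sketch (illustrative only): scaled dot-product and multi-head attention.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # query-key matching scores
    return softmax(scores) @ V                # weighted sum of the values

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    # head_i = Attention(Q W_Q[i], K W_K[i], V W_V[i]); concatenate heads, project with W_O
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(len(W_Q))]
    return np.concatenate(heads, axis=-1) @ W_O

# Toy usage: 5 encoder states as keys/values, 3 decoder states as queries.
d_model, d_k, n_heads = 16, 4, 4
rng = np.random.default_rng(0)
K = V = rng.normal(size=(5, d_model))
Q = rng.normal(size=(3, d_model))
W_Q, W_K, W_V = (rng.normal(size=(n_heads, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(n_heads * d_k, d_model))
print(multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O).shape)   # (3, 16)
```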
Self Attention
• Finally, a very different take on attention
• Motivation so far: the need for alignment between input words and output words
• Now: refine the representation of input words in the encoder
  – the representation of an input word mostly depends on itself
  – but it is also informed by the surrounding context
  – previously: recurrent neural networks (consider left or right context)
  – now: attention mechanism
• Self attention: which of the surrounding words is most relevant to refine the representation?

Self Attention
• Formal definition (based on a sequence of vectors h_j, packed into a matrix H)
    self-attention(H) = Attention(HW_i^Q, HW_i^K, HW_i^V)
• Association between every word representation h_j and any other context word h_k
• The resulting vector of normalized association values is used to weigh the context words

transformer

Self Attention: Transformer
• Self-attention in the encoder
  – refine the word representation based on relevant context words
  – relevance determined by self attention
• Self-attention in the decoder
  – refine output word predictions based on relevant previous output words
  – relevance determined by self attention
• Also regular attention to the encoder states in the decoder
• Currently the most successful model (maybe with self attention only in the encoder, combined with a regular recurrent decoder)

Encoder
[Figure: the encoder as a sequence of self-attention layers — input words ("the house is big .") are encoded by word and position embeddings E_w x_j + E_p j, then repeatedly refined by self attention, weighted sums, add & norm shortcuts (ĥ_j), and feed-forward layers into encoder states h_j]

Self Attention Layer
• Given: input word representations h_j, packed into a matrix H = (h_1, ..., h_j)
• Self attention
    self-attention(H) = MultiHead(H, H, H)
• Shortcut connection
    self-attention(h_j) + h_j
• Layer normalization
    ĥ_j = layer-normalization(self-attention(h_j) + h_j)
• Feed-forward step with ReLU activation function
    relu(W ĥ_j + b)
• Again, shortcut connection and layer normalization
    layer-normalization(relu(W ĥ_j + b) + ĥ_j)

Stacked Self Attention Layers
• Stack several such layers (say, D = 6)
• Start with the input word embedding
    h_{0,j} = E x_j
• Stacked layers
    h_{d,j} = self-attention-layer(h_{d−1,j})
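The self attention layer just described can be sketched on top of the attention functions above. This is a rough illustration: the attn callable, the feed-forward weights, and the simplified layer normalization (no learned gain/bias) are assumptions, not the slides' exact implementation.

```python
# Illustrative sketch of one encoder self-attention layer:
# self attention -> shortcut + layer norm -> feed-forward (ReLU) -> shortcut + layer norm.
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each vector to zero mean, unit variance (learned gain/bias omitted here)
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def self_attention_layer(H, attn, W1, b1, W2, b2):
    # attn(H, H, H): multi-head self attention over the packed word representations H
    A = layer_norm(attn(H, H, H) + H)             # shortcut connection + layer norm
    F = np.maximum(0, A @ W1 + b1) @ W2 + b2      # feed-forward step with ReLU
    return layer_norm(F + A)                      # again shortcut + layer norm

# Stacking (h_d = self-attention-layer(h_{d-1})), with attn bound e.g. via
# functools.partial(multi_head_attention, W_Q=W_Q, W_K=W_K, W_V=W_V, W_O=W_O).
```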
Decoder
[Figure: the decoder layer stack — output words are encoded by word and position embeddings, refined in each layer by self attention over the previous output words, attention over the encoder states h, and feed-forward steps with add & norm shortcuts (ŝ_i), producing decoder states s_i]
The decoder computes attention-based representations of the output in several layers, initialized with the embeddings of the previous output words.

Self-Attention in the Decoder
• Same idea as in the encoder
• Output words are initially encoded by word embeddings s_i = E y_i
• Self attention is computed over the previous output words
  – the association of a word s_i is limited to words s_k with k ≤ i
  – resulting representation s̃_i
    self-attention(S̃) = MultiHead(S̃, S̃, S̃)

Attention in the Decoder
• Original intuition of the attention mechanism: focus on relevant input words
• Computed with the dot product S̃H^T
• Compute attention between the decoder states S̃ and the final encoder states H
    attention(S̃, H) = MultiHead(S̃, H, H)
• Note: the attention mechanism formally mirrors self-attention

Full Decoder
[Figure: the full model — input words pass through a stack of encoder layers; output word embeddings pass through a stack of decoder layers; softmax and argmax over the final decoder states produce the output word predictions]

Full Decoder
• Self-attention
    self-attention(S̃) = MultiHead(S̃, S̃, S̃)
  – shortcut connections
  – layer normalization
  – feed-forward layer
• Attention
    attention(S̃, H) = MultiHead(S̃, H, H)
  – shortcut connections
  – layer normalization
  – feed-forward layer
• Multiple stacked layers
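The restriction that decoder self attention may only consider previous output words (k ≤ i) is commonly implemented with a causal mask. A small single-head sketch, with made-up names, could look like this:

```python
# Illustrative sketch of masked (causal) self attention in the decoder:
# position i may only attend to positions k <= i.
import numpy as np

def masked_self_attention(S):
    # S: one row per output position (decoder word representations)
    d_k = S.shape[-1]
    scores = S @ S.T / np.sqrt(d_k)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -1e9, scores)                  # block attention to future words
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ S    # weighted sum over the current and previous output words
```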
Learning Rate
• Gradient computation gives the direction of change
• Scaled by a learning rate
• Weight updates
• Simplest form: a fixed value
• Annealing
  – start with a larger value (big changes at the beginning)
  – reduce it over time (minor adjustments to refine the model)

Ensuring Randomness
• Typical theoretical assumption: independent and identically distributed training examples
• Approximate this ideal
  – avoid undue structure in the training data
  – avoid undue structure in the initial weight setting
• ML approach: maximum entropy training
  – fit the properties of the training data
  – otherwise, the model should be as random as possible (i.e., have maximum entropy)

Shuffling the Training Data
• Typical training data in machine translation
  – different types of corpora
    ∗ European Parliament proceedings
    ∗ collection of movie subtitles
  – temporal structure in each corpus
  – similar sentences next to each other (e.g., same story / debate)
• Online updating: the last examples matter more
• Convergence criterion: no improvement recently
  → a stretch of hard examples following easy examples: training is stopped prematurely
⇒ randomly shuffle the training data (maybe each epoch)

Weight Initialization
• Initialize weights to random values
• Values are chosen from a uniform distribution
• Ideal weights lead to node values in the transition area of the activation function

For Example: Sigmoid
• Input values in the range [−1, 1]
⇒ output values in the range [0.269, 0.731]
• Magic formula (n is the size of the previous layer)
    [ −1/√n , 1/√n ]
• Magic formula for hidden layers
    [ −√6 / √(n_j + n_{j+1}) , √6 / √(n_j + n_{j+1}) ]
  – n_j is the size of the previous layer
  – n_{j+1} is the size of the next layer

Problem: Overconfident Models
• Predictions of neural machine translation models are surprisingly confident
• Often almost all the probability mass is assigned to a single word (word prediction probabilities of over 99%)
• Problem for decoding and training
  – decoding: sensible alternatives get low scores, bad for beam search
  – training: overfitting is more likely
• Solution: label smoothing
• Jargon notice
  – in classification tasks, we predict a label
  – a jargon term for any output
  → here, we smooth the word predictions

Label Smoothing during Decoding
• Common strategy to combat peaked distributions: smooth them
• Recall
  – the prediction layer produces a number for each word
  – these are converted into probabilities using the softmax
    p(y_i) = exp(s_i) / Σ_j exp(s_j)
• The softmax calculation can be smoothed with a so-called temperature T
    p(y_i) = exp(s_i / T) / Σ_j exp(s_j / T)
• Higher temperature → smoother distribution (i.e., less probability is given to the most likely choice)
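A quick numerical illustration of the temperature softmax above; the scores are made-up example values, not taken from the slides.

```python
# Illustrative example: softmax with temperature T smooths a peaked distribution.
import numpy as np

def softmax_with_temperature(scores, T=1.0):
    # p(y_i) = exp(s_i / T) / sum_j exp(s_j / T)
    z = np.asarray(scores, dtype=float) / T
    z -= z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = [8.0, 2.0, 1.0, 0.5]        # hypothetical prediction-layer outputs
print(softmax_with_temperature(scores, T=1.0))   # peaked: almost all mass on the first word
print(softmax_with_temperature(scores, T=2.0))   # higher temperature: smoother distribution
```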
Adjusting the Learning Rate
• Gradient descent training: the weight update follows the gradient downhill
• Actual gradients have fairly large values, so they are scaled by a learning rate (a low number, e.g., µ = 0.001)
• Change the learning rate over time
  – starting with larger updates
  – refining weights with smaller updates
  – adjust for other reasons
• Learning rate schedule

Momentum Term
• Consider the case where a weight value is far from its optimum
• Most training examples push the weight value in the same direction
• Small updates take long to accumulate
• Solution: momentum term m_t
  – accumulate weight updates at each time step t
  – with some decay rate for the sum (e.g., 0.9)
  – combine the momentum term m_{t−1} with the weight update value ∆w_t
    m_t = 0.9 m_{t−1} + ∆w_t
    w_t = w_{t−1} − µ m_t

Adapting the Learning Rate per Parameter
• Common strategy: reduce the learning rate µ over time
• Initially parameters are far away from the optimum → change a lot
• Later nuanced refinements are needed → change little
• Now: a different learning rate for each parameter

Adagrad
• Different parameters are at different stages of training → different learning rate for each parameter
• Adagrad
  – record the gradients for each parameter
  – accumulate their squared values over time
  – use this sum to reduce the learning rate
• Update formula
  – gradient g_t = dE_t/dw of the error E with respect to the weight w
  – divide the learning rate µ by the accumulated sum
    ∆w_t = µ / √(Σ_{τ=1}^{t} g_τ²) · g_t
• Big changes in the parameter value (corresponding to big gradients g_t) → reduction of the learning rate for that weight parameter

Adam: Elements
• Combines the idea of a momentum term with reducing the parameter update by the accumulated change
• Momentum term idea (e.g., β_1 = 0.9)
    m_t = β_1 m_{t−1} + (1 − β_1) g_t
• Accumulated gradients (decay with β_2 = 0.999)
    v_t = β_2 v_{t−1} + (1 − β_2) g_t²

Adam: Technical Correction
• Initially, the values for m_t and v_t are close to their initial value of 0
• Adjustment
    m̂_t = m_t / (1 − β_1^t) ,  v̂_t = v_t / (1 − β_2^t)
• With t → ∞ this correction goes away
    lim_{t→∞} 1 / (1 − β^t) = 1

Adam
• Given
  – learning rate µ
  – momentum m̂_t
  – accumulated change v̂_t
• Weight update per Adam (e.g., ε = 10^{−8})
    ∆w_t = µ / (√v̂_t + ε) · m̂_t
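Putting the Adam formulas together, here is a bare-bones sketch for a single scalar weight; the error function and the number of steps are made-up for illustration.

```python
# Illustrative sketch of the Adam update: momentum term m, accumulated squared
# gradients v, bias correction with beta1^t / beta2^t, then the weight update.
import numpy as np

def adam_step(w, g, m, v, t, mu=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g          # momentum term
    v = beta2 * v + (1 - beta2) * g ** 2     # accumulated squared gradients
    m_hat = m / (1 - beta1 ** t)             # technical correction for the
    v_hat = v / (1 - beta2 ** t)             # zero-initialized m and v
    return w - mu / (np.sqrt(v_hat) + eps) * m_hat, m, v

# Toy usage: minimize E(w) = (w - 3)^2, whose gradient is 2 (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    g = 2 * (w - 3)
    w, m, v = adam_step(w, g, m, v, t)
print(round(w, 2))   # close to 3.0
```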
Batched Gradient Updates
• Accumulate the weight updates over all training examples, then update
  (converges slowly)
• Process each training example, then update (stochastic gradient descent)
  (quicker convergence, but the last training examples have a disproportionately higher impact)
• Process the data in batches
  – compute the gradients for the individual word prediction errors
  – use the sum over each batch to update the parameters
  → better parallelization on GPUs
• Process the data on multiple compute cores
  – batch processing may take different amounts of time
  – asynchronous training: apply updates when they arrive
  – the mismatch between the original weights and the updates may not matter much

avoiding local optima

Avoiding Local Optima
• One of the hardest problems in designing neural network architectures and optimization methods
• Ensure that the model converges at least to a set of parameter values that give results close to the optimum on unseen test data
• There is no real solution to this problem
• It requires experimentation and analysis that is more craft than science
• Still, this section presents a number of methods that generally help to avoid getting stuck in local optima

Overfitting and Underfitting
• Neural machine translation models
  – 100s of millions of parameters
  – 100s of millions of training examples (individual word predictions)
• No hard rules for the relationship between these two numbers
• Too many parameters and too few training examples → overfitting
• Too few parameters and many training examples → underfitting

Regularization
• Motivation: prefer as few parameters as possible
• Strategy: set un-needed parameters to a value of 0
• Method
  – adjust the training objective
  – add a cost for any non-zero parameter
  – typically done with the L2 norm
• Practical impact
  – the derivative of the L2 norm is the value of the parameter
  – if there is no signal from training: reduce the value of the parameter
  – also called weight decay
• Not common in deep learning, but other methods can be understood as regularization

Curriculum Learning
• Human learning
  – learn simple concepts first
  – learn more complex material later
• Early epochs: only easy training examples
  – only short sentences
  – create artificial data by extracting smaller segments (similar to phrase pair extraction in statistical machine translation)
• Later epochs: all training data
• Not easy to calibrate

Dropout
• Training may get stuck in a local optimum
  – some properties of the task have been learned
  – discovery of other properties would take it too far out of its comfort zone
• Machine translation example
  – the model has learned the language model aspects
  – but cannot figure out the role of the input sentence
• Dropout: for each batch, eliminate some nodes

Dropout
• Dropout
  – for each batch, a different random set of nodes is removed
  – their values are set to 0 and their weights are not updated
  – 10%, 20%, or even 50% of all the nodes
• Why does this work?
  – robustness: redundant nodes play similar roles
  – ensemble learning: different subnetworks are different models
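A minimal sketch of dropping out nodes for a batch. The rescaling by 1/(1 − p) (so-called inverted dropout) is a common implementation choice and an assumption here, not something stated on the slides.

```python
# Illustrative dropout sketch: randomly set a fraction p of node values to 0
# during training; at inference time the values are passed through unchanged.
import numpy as np

def dropout(values, p=0.2, training=True, rng=np.random.default_rng(0)):
    if not training or p == 0.0:
        return values
    keep = rng.random(values.shape) >= p             # random set of nodes to keep
    return np.where(keep, values, 0.0) / (1 - p)     # rescale so the expected value is unchanged

h = np.array([0.3, -1.2, 0.8, 0.05, 2.1])
print(dropout(h, p=0.4))   # roughly 40% of the node values are zeroed out
```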
Gradient Clipping
• Exploding gradients: gradients become too large during the backward pass
⇒ limit the total value of the gradients for a layer to a threshold τ
• Use the L2 norm of the gradient values g
    L2(g) = √(Σ_j g_j²)
• Adjust each gradient value g_i for each element i of the vector
    g_i = g_i × τ / max(τ, L2(g))

Layer Normalization
• During inference, average node values may become too large or too small
• This also has an impact on training (gradients are multiplied with node values)
⇒ normalize the node values
• During training, learn a bias layer

Layer Normalization: Math
• Feed-forward layer h^l, weights W, computed sum s^l, activation function
    s^l = W h^{l−1}
    h^l = sigmoid(s^l)
• Compute the mean µ^l and standard deviation σ^l of the sum vector s^l
    µ^l = (1/H) Σ_{i=1}^{H} s_i^l
    σ^l = √( (1/H) Σ_{i=1}^{H} (s_i^l − µ^l)² )

Layer Normalization: Math
• Normalize s^l
    ŝ^l = (1/σ^l) (s^l − µ^l)
• Learnable bias vectors g and b
    ŝ^l = (g/σ^l) (s^l − µ^l) + b

Shortcuts and Highways
• Deep learning: many layers of processing
⇒ error propagation has to travel farther
• All parameters in the processing chain have to be adjusted
• Instead of always passing through all layers, add connections from the first to the last
• Jargon alert
  – shortcuts
  – residual connections
  – skip connections

Shortcuts
• Feed-forward layer
    y = f(x)
• Pass through the input x
    y = f(x) + x
• Note: the gradient is
    y′ = f′(x) + 1
• The constant 1 → the gradient is passed through unchanged

Highways
• Regulate how much information from f(x) and from x should impact the output y
• Gate t(x) (typically computed by a feed-forward layer)
    y = t(x) · f(x) + (1 − t(x)) · x

Shortcuts and Highways
[Figure: a basic feed-forward layer, a skip connection (FF output added to the input), and a highway network (a gate mixing FF output and input)]

Batching
• Already a large degree of parallelism
  – most computations are on vectors and matrices
  – efficient implementations for CPU and GPU
• Further parallelism by batching
  – processing several sentence pairs at once
  – scalar operation → vector operation
  – vector operation → matrix operation
  – matrix operation → 3d tensor operation
• Typical batch sizes: 50–100 sentence pairs

Batches
• Sentences have different lengths
• When batching, unneeded cells in the tensors are filled up
⇒ a lot of wasted computation

Mini-Batches
• Sort sentences by length, break them up into mini-batches
• Example: maxi-batch of 1600 sentence pairs, mini-batch of 80 sentence pairs

Overall Organization of Training
• Shuffle the corpus
• Break it into maxi-batches
• Break up each maxi-batch into mini-batches
• Process a mini-batch, update the parameters
• Once done, repeat
• Typically 5–15 epochs are needed (passes through the entire training corpus)
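A small sketch of this overall organization; the corpus format, batch sizes, and the train_on helper are hypothetical placeholders, not part of the slides.

```python
# Illustrative training-loop organization: shuffle the corpus, cut it into
# maxi-batches, sort each by sentence length, and split it into mini-batches.
import random

def mini_batches(corpus, maxi_size=1600, mini_size=80, epochs=10):
    corpus = list(corpus)                               # sentence pairs (source, target)
    for epoch in range(epochs):
        random.shuffle(corpus)                          # shuffle the training data each epoch
        for i in range(0, len(corpus), maxi_size):
            maxi = corpus[i:i + maxi_size]
            maxi.sort(key=lambda pair: len(pair[0]))    # similar lengths -> less padding waste
            for j in range(0, len(maxi), mini_size):
                yield epoch, maxi[j:j + mini_size]

# Hypothetical usage:
#   for epoch, batch in mini_batches(corpus):
#       train_on(batch)    # compute gradients for the batch, update the parameters
```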