Alternative Architectures
Philipp Koehn
10 October 2024
Attention
• Machine translation is a structured prediction task
– output is not a single label
– output structure needs to be built, word by word
• Relevant information for each word prediction varies
• Human translators pay attention to different parts of the input sentence when
translating
⇒ Attention mechanism
Attention
[Figure: left-to-right and right-to-left RNN encoders produce input word representations hj; the decoder RNN maintains states si]
• Given what we have generated so far (decoder hidden state)
• ... which words in the input should we pay attention to (encoder states)?
Attention
[Figure: attention weights αij link the decoder state si to the bidirectional encoder states hj]
• Given:
– the previous hidden state of the decoder $s_{i-1}$
– the representation of input words $h_j = (\overleftarrow{h}_j, \overrightarrow{h}_j)$
• Predict an alignment score $a(s_{i-1}, h_j)$ for each input word $j$
(modeled with a feed-forward neural network layer)
Attention
[Figure: attention weights αij over the encoder states hj are combined into an input context]
• Normalize attention (softmax)
$$\alpha_{ij} = \frac{\exp(a(s_{i-1}, h_j))}{\sum_k \exp(a(s_{i-1}, h_k))}$$
Attention
[Figure: the input context is the attention-weighted sum of the encoder states hj]
• Relevant input context: weigh input words according to attention: $c_i = \sum_j \alpha_{ij} h_j$
Attention
[Figure: the input context ci (weighted sum over encoder states) feeds into the prediction of the next decoder state]
• Use context to predict next hidden state and output word
Computing Attention
• Attention mechanism in neural translation model (Bahdanau et al., 2015)
– previous hidden state si−1
– input word embedding hj
– trainable parameters b, Wa, Ua, va
$$a(s_{i-1}, h_j) = v_a^T \tanh(W_a s_{i-1} + U_a h_j + b)$$
• Other ways to compute attention (Luong et al., 2015)
– Dot product: $a(s_{i-1}, h_j) = s_{i-1}^T h_j$
– Scaled dot product: $a(s_{i-1}, h_j) = \frac{1}{\sqrt{|h_j|}}\, s_{i-1}^T h_j$
– General: $a(s_{i-1}, h_j) = s_{i-1}^T W_a h_j$
– Local: $a(s_{i-1}) = W_a s_{i-1}$
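To make the mechanics concrete, here is a minimal numpy sketch of one decoder step of the additive (Bahdanau) scoring function, combined with the softmax normalization and the context vector from the earlier slides; the function name, array shapes, and parameter names are illustrative assumptions, not the notation of a particular toolkit.

```python
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v_a, b):
    """Bahdanau-style attention for one decoder step (illustrative sketch).

    s_prev : previous decoder state, shape (d_s,)
    H      : encoder states h_j stacked as rows, shape (n, d_h)
    W_a    : (d_a, d_s), U_a : (d_a, d_h), v_a : (d_a,), b : (d_a,)
    """
    # alignment scores a(s_{i-1}, h_j) = v_a^T tanh(W_a s + U_a h_j + b)
    scores = np.tanh(W_a @ s_prev + H @ U_a.T + b) @ v_a   # (n,)
    # softmax normalization -> attention weights alpha_ij
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # context vector c_i = sum_j alpha_ij h_j
    context = alpha @ H                                     # (d_h,)
    return alpha, context
```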
General View of Dot-Product Attention
• Three elements
Query : decoder state
Key : encoder state
Value : encoder state
• Intuition
– given a query (the decoder state)
– we check how well it matches keys in the database (the encoder states)
– and then use the matching score to scale the retrieved value (also the encoder
state)
• Computation
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
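A minimal numpy sketch of this formula, assuming Q, K, V are given as plain matrices:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (numpy sketch).

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (n_q, d_v)
```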
General View of Dot-Product Attention
[Figure: the decoder state serves as the query; each encoder state serves as both key and value]
• Query: decoder state, Key and Value: encoder state
$$\text{Attention}(S, H, H)$$
Self Attention
• Finally, a very different take on attention
• Motivation so far: need for alignment between input words and output words
• Now: refine representation of input words in the encoder
– representation of an input word mostly depends on itself
– but also informed by the surrounding context
– previously: recurrent neural networks (considers left or right context)
– now: attention mechanism
• Self attention:
Which of the surrounding words is most relevant to refine representation?
Self Attention
[Figure: input word embeddings Exj]
• Given: input word embeddings
• Task: consider how each should be refined in view of others
• Needed: how much attention to pay to others
Self Attention
[Figure: self-attention weights αij between the input word embeddings Exj]
• Computation of attention weights as before
– Key: word embedding (or generally: encoder state for word H)
– Query: word embedding (or generally: encoder state for word H)
• Again, multiply with weight matrices: $Q = HW^Q$ and $K = HW^K$
• Attention weights: $QK^T$
Self Attention
[Figure: refined input word representations as attention-weighted sums over the word embeddings]
• Full self attention
$$\text{self-attention}(H) = \text{Attention}(HW^Q, HW^K, H)$$
• Each resulting vector is a weighted combination of the context word representations
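A minimal numpy sketch of this computation, assuming single-head attention and randomly initialized projection matrices (in practice $W^Q$ and $W^K$ are learned):

```python
import numpy as np

def self_attention(H, W_Q, W_K):
    """self-attention(H) = Attention(H W_Q, H W_K, H): refine each word
    representation as a weighted combination of all word representations."""
    scores = (H @ W_Q) @ (H @ W_K).T / np.sqrt(H.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ H

H = np.random.randn(5, 8)                 # 5 words, 8-dimensional representations
W_Q, W_K = np.random.randn(8, 8), np.random.randn(8, 8)
refined = self_attention(H, W_Q, W_K)     # shape (5, 8)
```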
Multi-Head Attention
• Add redundancy
– say, 16 sets of attention weights (heads)
– each based on its own parameters $W_i^Q$, $W_i^K$, $W_i^V$
• Formally:
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)\, W^O$$
• Multi-head attention is a form of ensembling
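A numpy sketch of these two formulas; the per-head projection matrices are passed in as lists, and all shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    """MultiHead(Q,K,V) = Concat(head_1, ..., head_h) W_O.
    W_Q, W_K, W_V: lists with one (d_model, d_head) matrix per head;
    W_O: output projection of shape (h * d_head, d_model)."""
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):
        q, k, v = Q @ W_q, K @ W_k, V @ W_v
        heads.append(softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v)  # one head
    return np.concatenate(heads, axis=-1) @ W_O
```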
Multi-Head Attention
“Many of the attention heads exhibit behaviour that seems related to the structure of the sentence.” (Vaswani et al., 2017)
Transformer
Self Attention: Transformer
• Self-attention in encoder
– refine word representation based on relevant context words
– relevance determined by self attention
• Self-attention in decoder
– refine output word predictions based on relevant previous output words
– relevance determined by self attention
• Also regular attention to encoder states in decoder
• Currently the most successful model
(sometimes with self attention only in the encoder, combined with a regular recurrent decoder)
Self Attention Layer
• Given: input word representations hj, packed into a matrix H = (h1, ..., hj)
• Self attention
$$\text{self-attention}(H) = \text{MultiHead}(H, H, H)$$
• Shortcut connection
$$\text{self-attention}(h_j) + h_j$$
• Layer normalization
$$\hat{h}_j = \text{layer-normalization}(\text{self-attention}(h_j) + h_j)$$
• Feed-forward step with ReLU activation function
$$\text{relu}(W\hat{h}_j + b)$$
• Again, shortcut connection and layer normalization
$$\text{layer-normalization}(\text{relu}(W\hat{h}_j + b) + \hat{h}_j)$$
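Putting these steps together, a single-head numpy sketch of one such layer (the multi-head variant, the learnable layer-normalization parameters, and dropout are omitted for brevity; all names are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def self_attention_layer(H, W_Q, W_K, W, b):
    """Self attention, shortcut + layer norm, feed-forward (ReLU), shortcut + layer norm."""
    att = softmax((H @ W_Q) @ (H @ W_K).T / np.sqrt(H.shape[-1])) @ H
    H_hat = layer_norm(att + H)               # shortcut connection + layer normalization
    ff = np.maximum(0.0, H_hat @ W + b)       # relu(W h + b)
    return layer_norm(ff + H_hat)             # again shortcut + layer normalization
```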
Encoder
[Figure: encoder layer stack for the input "the house is big .": word embeddings Ewxj are added to position embeddings Epj, followed by self attention with weighted sums, add & norm with shortcut connections, and a feed-forward step, yielding refined encoder states hj]
Sequence of self-attention layers
Self-Attention in the Decoder
• Same idea as in the encoder
• Output words are initially encoded by word embeddings $s_i = E y_i$
• Self attention is computed over previous output words
– association of a word $s_i$ is limited to words $s_k$ ($k \le i$)
– resulting representation $\tilde{s}_i$
$$\text{self-attention}(\tilde{S}) = \text{MultiHead}(\tilde{S}, \tilde{S}, \tilde{S})$$
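This restriction is commonly implemented by masking the attention scores of future positions; a single-head numpy sketch, with illustrative names:

```python
import numpy as np

def masked_self_attention(S, W_Q, W_K):
    """Decoder self attention: output word i may only attend to words k <= i."""
    n, d = S.shape
    scores = (S @ W_Q) @ (S @ W_K).T / np.sqrt(d)
    scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf   # mask future words
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ S
```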
Attention in the Decoder
• Original intuition of attention mechanism: focus on relevant input words
• Computed with the dot product $\tilde{S}H^T$
• Compute attention between the decoder states $\tilde{S}$ and the final encoder states $H$
$$\text{attention}(\tilde{S}, H) = \text{MultiHead}(\tilde{S}, H, H)$$
• Note: attention mechanism formally mirrors self-attention
Full Decoder
• Self-attention
$$\text{self-attention}(\tilde{S}) = \text{MultiHead}(\tilde{S}, \tilde{S}, \tilde{S})$$
– shortcut connections
– layer normalization
• Attention
$$\text{attention}(\tilde{S}, H) = \text{MultiHead}(\tilde{S}, H, H)$$
– shortcut connections
– layer normalization
– feed-forward layer
• Multiple stacked layers
Decoder
[Figure: decoder layer stack: output word and position embeddings si, masked self-attention with add & norm, attention over the encoder states h with add & norm, and a feed-forward step, yielding the refined decoder state si+1]
Decoder computes attention-based representations of the output in several layers,
initialized with the embeddings of the previous output words
Multiple Layers
• Stack several transformer layers (say, D = 6)
• Encoder
– Start with input word embedding
$$h_{0,j} = E x_j$$
– Stacked layers
$$h_{d,j} = \text{self-attention-layer}(h_{d-1,j})$$
• Same for decoder
Multiple Layers in Encoder and Decoder
[Figure: stacked encoder layers feed into stacked decoder layers; the final decoder layer predicts each output word via a softmax and argmax]
Machine Learning Tricks
Learning Rate
• Gradient computation gives direction of change
• Scaled by learning rate
• Weight updates
• Simplest form: fixed value
• Annealing
– start with larger value (big changes at beginning)
– reduce over time (minor adjustments to refine model)
Ensuring Randomness
• Typical theoretical assumption: independent and identically distributed training examples
• Approximate this ideal
– avoid undue structure in the training data
– avoid undue structure in initial weight setting
• ML approach: Maximum entropy training
– Fit properties of training data
– Otherwise, model should be as random as possible
(i.e., has maximum entropy)
Shuffling the Training Data
• Typical training data in machine translation
– different types of corpora
∗ European Parliament Proceedings
∗ collection of movie subtitles
– temporal structure in each corpus
– similar sentences next to each other (e.g., same story / debate)
• Online updating: last examples matter more
• Convergence criterion: no improvement recently
→ stretch of hard examples following easy examples: prematurely stopped
⇒ randomly shuffle the training data
(maybe each epoch)
Weight Initialization
• Initialize weights to random values
• Values are chosen from a uniform distribution
• Ideal weights lead to node values in transition area for activation function
For Example: Sigmoid
• Input values in range [−1; 1]
⇒ Output values in range [0.269;0.731]
• Magic formula ($n$ = size of the previous layer)
$$\left[ -\frac{1}{\sqrt{n}},\; \frac{1}{\sqrt{n}} \right]$$
• Magic formula for hidden layers
$$\left[ -\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}},\; \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}} \right]$$
– $n_j$ is the size of the previous layer
– $n_{j+1}$ is the size of the next layer
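A numpy sketch of drawing a hidden-layer weight matrix from the second interval (the function name is an illustrative assumption):

```python
import numpy as np

def init_hidden_weights(n_prev, n_next):
    """Draw weights uniformly from [-sqrt(6)/sqrt(n_prev+n_next), +sqrt(6)/sqrt(n_prev+n_next)]."""
    limit = np.sqrt(6.0) / np.sqrt(n_prev + n_next)
    return np.random.uniform(-limit, limit, size=(n_prev, n_next))

W = init_hidden_weights(512, 512)   # e.g. a 512 -> 512 hidden layer
```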
Problem: Overconfident Models
• Predictions of the neural machine translation models are surprisingly confident
• Often almost all the probability mass is assigned to a single word
(word prediction probabilities of over 99%)
• Problem for decoding and training
– decoding: sensible alternatives get low scores, bad for beam search
– training: overfitting is more likely
• Solution: label smoothing
• Jargon notice
– in classification tasks, we predict a label
– “label” is the jargon term for any output
→ here, we smooth the word predictions
Label Smoothing during Decoding
• Common strategy to combat peaked distributions: smooth them
• Recall
– prediction layer produces numbers for each word
– converted into probabilities using the softmax
$$p(y_i) = \frac{\exp s_i}{\sum_j \exp s_j}$$
• Softmax calculation can be smoothed with a so-called temperature $T$
$$p(y_i) = \frac{\exp(s_i/T)}{\sum_j \exp(s_j/T)}$$
• Higher temperature → distribution smoother
(i.e., less probability is given to most likely choice)
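A small numpy sketch of the temperature-smoothed softmax:

```python
import numpy as np

def softmax_with_temperature(scores, T=1.0):
    """p(y_i) = exp(s_i / T) / sum_j exp(s_j / T); higher T -> smoother distribution."""
    z = np.asarray(scores, dtype=float) / T
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

scores = [5.0, 1.0, 0.5]
print(softmax_with_temperature(scores, T=1.0))   # peaked
print(softmax_with_temperature(scores, T=2.0))   # smoother
```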
Label Smoothing during Training
• Root of problem: training
• Training objective: assign all probability mass to the single correct word
• Label smoothing
– the truth gives some probability mass to other words (say, 10% of it)
– distributed either uniformly over all words
– or relative to unigram word probabilities
(relative counts of each word in the target side of the training data)
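A sketch of the uniform variant of the smoothed target distribution; the 10% mass and the helper name are illustrative assumptions:

```python
import numpy as np

def smoothed_targets(correct_index, vocab_size, epsilon=0.1):
    """Target distribution with label smoothing: 1-epsilon on the correct word,
    epsilon spread uniformly over the whole vocabulary (illustrative sketch)."""
    targets = np.full(vocab_size, epsilon / vocab_size)
    targets[correct_index] += 1.0 - epsilon
    return targets

t = smoothed_targets(correct_index=3, vocab_size=10, epsilon=0.1)
# t[3] ≈ 0.91, every other entry ≈ 0.01, and t.sum() ≈ 1.0
```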
Adjusting the Learning Rate
• Gradient descent training: weight update follows the gradient downhill
• Actual gradients have fairly large values, so they are scaled by a learning rate
(low number, e.g., µ = 0.001)
• Change the learning rate over time
– starting with larger updates
– refining weights with smaller updates
– adjust for other reasons
• Learning rate schedule
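A learning rate schedule can be as simple as a function of the update step; the sketch below assumes step-wise halving, which is only one of many possible schedules:

```python
def learning_rate(step, mu_0=0.001, decay=0.5, decay_every=50000):
    """One simple schedule (assumption: step-wise exponential decay):
    start at mu_0 and halve the learning rate every decay_every updates."""
    return mu_0 * (decay ** (step // decay_every))

for step in (0, 50000, 100000, 200000):
    print(step, learning_rate(step))   # 0.001, 0.0005, 0.00025, 6.25e-05
```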
Momentum Term
• Consider case where weight value far from optimum
• Most training examples push the weight value in the same direction
• Small updates take long to accumulate
• Solution: momentum term mt
– accumulate weight updates at each time step t
– some decay rate for sum (e.g., 0.9)
– combine momentum term mt−1 with weight update value ∆wt
$$m_t = 0.9\, m_{t-1} + \Delta w_t$$
$$w_t = w_{t-1} - \mu\, m_t$$
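A sketch of this update for a single weight (names are illustrative):

```python
def momentum_update(w, m_prev, delta_w, mu=0.001, beta=0.9):
    """m_t = beta * m_{t-1} + delta_w;  w_t = w_{t-1} - mu * m_t  (sketch)."""
    m = beta * m_prev + delta_w
    return w - mu * m, m

w, m = 1.0, 0.0
for g in [0.4, 0.5, 0.45]:             # updates pointing in the same direction
    w, m = momentum_update(w, m, g)    # they accumulate via the momentum term
```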
Adapting Learning Rate per Parameter
• Common strategy: reduce the learning rate µ over time
• Initially parameters are far away from optimum → change a lot
• Later nuanced refinements needed → change little
• Now: different learning rate for each parameter
Adagrad
• Different parameters at different stages of training
→ different learning rate for each parameter
• Adagrad
– record gradients for each parameter
– accumulate their square values over time
– use this sum to reduce learning rate
• Update formula
– gradient $g_t = \frac{dE_t}{dw}$ of error $E$ with respect to weight $w$
– divide the learning rate $\mu$ by the accumulated sum
$$\Delta w_t = \frac{\mu}{\sqrt{\sum_{\tau=1}^{t} g_\tau^2}}\; g_t$$
• Big changes in the parameter value (corresponding to big gradients gt)
→ reduction of the learning rate of the weight parameter
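A numpy sketch of the Adagrad update; the small eps term is a common stability addition that is not part of the formula above:

```python
import numpy as np

def adagrad_update(w, g, accum, mu=0.01, eps=1e-8):
    """Adagrad: divide the learning rate by the root of accumulated squared gradients."""
    accum = accum + g ** 2                      # running sum of g_tau^2
    w = w - mu / (np.sqrt(accum) + eps) * g
    return w, accum

w, accum = np.array([1.0, -2.0]), np.zeros(2)
for g in [np.array([0.5, 0.1]), np.array([0.4, 0.2])]:
    w, accum = adagrad_update(w, g, accum)
```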
Adam: Elements
• Combines the idea of the momentum term with reducing parameter updates by the accumulated change
• Momentum term idea (e.g., β1 = 0.9)
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
• Accumulated gradients (decay with β2 = 0.999)
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
Adam: Technical Correction
• Initially, values for mt and vt are close to initial value of 0
• Adjustment
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
• With t → ∞ this correction goes away
$$\lim_{t \to \infty} \frac{1}{1 - \beta^t} = 1$$
Adam
• Given
– learning rate $\mu$
– momentum $\hat{m}_t$
– accumulated change $\hat{v}_t$
• Weight update per Adam (e.g., $\epsilon = 10^{-8}$)
$$\Delta w_t = \frac{\mu}{\sqrt{\hat{v}_t} + \epsilon}\; \hat{m}_t$$
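Combining the momentum term, the accumulated squared gradients, and the bias correction, a numpy sketch of one Adam step (names are illustrative):

```python
import numpy as np

def adam_update(w, g, m, v, t, mu=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum + accumulated squared gradients + bias correction."""
    m = beta1 * m + (1 - beta1) * g            # momentum term
    v = beta2 * v + (1 - beta2) * g ** 2       # accumulated squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - mu / (np.sqrt(v_hat) + eps) * m_hat
    return w, m, v

w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t, g in enumerate([np.array([0.3]), np.array([0.2])], start=1):
    w, m, v = adam_update(w, g, m, v, t)
```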
Batched Gradient Updates
• Accumulate the weight updates for all training examples → update
(converges slowly)
• Process each training example → update (stochastic gradient descent)
(quicker convergence, but the last training examples have disproportionately higher impact)
• Process data in batches
– compute the gradients for the individual word prediction errors
– use sum over each batch to update parameters
→ better parallelization on GPUs
• Process data on multiple compute cores
– batch processing may take different amounts of time
– asynchronous training: apply updates when they arrive
– mismatch between original weights and updates may not matter much
Avoiding Local Optima
• One of the hardest problems in designing neural network architectures and optimization methods
• Ensure that the model converges at least to a set of parameter values that gives results close to the global optimum on unseen test data.
• There is no real solution to this problem.
• It requires experimentation and analysis that is more craft than science.
• Still, this section presents a number of methods that generally help avoid getting stuck in local optima.
Overfitting and Underfitting
• Neural machine translation models
– 100s of millions of parameters
– 100s of millions of training examples (individual word predictions)
• No hard rules for relationship between these two numbers
• Too many parameters and too few training examples → overfitting
• Too few parameters and many training examples → underfitting
Regularization
• Motivation: prefer as few parameters as possible
• Strategy: set unneeded parameters to a value of 0
• Method
– adjust training objective
– add cost for any non-zero parameter
– typically done with L2 norm
• Practical impact
– the derivative of the L2 norm is the value of the parameter
– if there is no signal from training: reduce the value of the parameter
– also called weight decay
• Not common in deep learning, but other methods can be understood as regularization
Curriculum Learning
• Human learning
– learn simple concepts first
– learn more complex material later
• Early epochs: only easy training examples
– only short sentences
– create artificial data by extracting smaller segments
(similar to phrase pair extraction in statistical machine translation)
• Later epochs: all training data
• Not easy to calibrate
Dropout
• Training may get stuck in local optima
– some properties of task have been learned
– discovery of other properties would take it too far out of its comfort zone.
• Machine translation example
– model learned the language model aspects
– but cannot figure out role of input sentence
• Dropout: for each batch, eliminate some nodes
Dropout
• Dropout
– For each batch, different random set of nodes is removed
– Their values are set to 0 and their weights are not updated
– 10%, 20% or even 50% of all the nodes
• Why does this work?
– robustness: redundant nodes play similar roles
– ensemble learning: different subnetworks are different models
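A numpy sketch of dropping out nodes for one batch; the rescaling by 1/(1−rate) is the common “inverted dropout” convention, which the slides do not spell out:

```python
import numpy as np

def dropout(layer_values, rate=0.2):
    """Randomly zero out a fraction of node values for one batch (training only).
    Surviving values are scaled by 1/(1-rate), as in the common 'inverted dropout'."""
    mask = np.random.rand(*layer_values.shape) >= rate
    return layer_values * mask / (1.0 - rate)

h = np.ones((4, 8))          # node values of one layer for a small batch
h_dropped = dropout(h, 0.2)  # ~20% of values set to 0, rest scaled up
```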
Gradient Clipping
• Exploding gradients: gradients become too large during backward pass
⇒ Limit total value of gradients for a layer to threshold (τ)
• Use the L2 norm of the gradient values $g$
$$L_2(g) = \sqrt{\sum_j g_j^2}$$
• Adjust each gradient value gi for each element i in the vector
$$g_i = g_i \times \frac{\tau}{\max(\tau, L_2(g))}$$
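A numpy sketch of this clipping step:

```python
import numpy as np

def clip_gradients(g, tau=1.0):
    """Scale gradients so their L2 norm does not exceed the threshold tau."""
    norm = np.sqrt(np.sum(g ** 2))              # L2 norm of the gradient vector
    return g * tau / max(tau, norm)             # unchanged if norm <= tau

g = np.array([3.0, 4.0])                        # norm 5.0
print(clip_gradients(g, tau=1.0))               # [0.6 0.8], scaled down to norm 1.0
```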
Layer Normalization
• During inference, average node values may become too large or too small
• Also has an impact on training (gradients are multiplied with node values)
⇒ Normalize node values
• During training, learn bias layer
Layer Normalization: Math
• Feed-forward layer $h^l$, weights $W$, computed sum $s^l$, activation function
$$s^l = W h^{l-1}$$
$$h^l = \text{sigmoid}(s^l)$$
• Compute mean $\mu^l$ and standard deviation $\sigma^l$ of the sum vector $s^l$
$$\mu^l = \frac{1}{H} \sum_{i=1}^{H} s_i^l$$
$$\sigma^l = \sqrt{\frac{1}{H} \sum_{i=1}^{H} (s_i^l - \mu^l)^2}$$
Layer Normalization: Math
• Normalize $s^l$
$$\hat{s}^l = \frac{1}{\sigma^l} (s^l - \mu^l)$$
• Learnable gain vector $g$ and bias vector $b$
$$\hat{s}^l = \frac{g}{\sigma^l} (s^l - \mu^l) + b$$
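A numpy sketch of layer normalization with the learnable vectors g and b; the eps term is a common stability addition not shown in the formula:

```python
import numpy as np

def layer_normalization(s, g, b, eps=1e-6):
    """Normalize the summed inputs s of one layer, then apply learnable gain g and bias b."""
    mu = s.mean()
    sigma = np.sqrt(((s - mu) ** 2).mean())
    return g / (sigma + eps) * (s - mu) + b

s = np.array([2.0, 4.0, 6.0, 8.0])
g, b = np.ones(4), np.zeros(4)
print(layer_normalization(s, g, b))   # zero mean, unit variance (up to eps)
```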
Shortcuts and Highways
• Deep learning: many layers of processing
⇒ Error propagation has to travel farther
• All parameters in the processing chain have to be adjusted
• Instead of always passing through all layers, add connections from first to last
• Jargon alert
– shortcuts
– residual connections
– skip connections
Shortcuts
• Feed-forward layer
y = f(x)
• Pass through input x
y = f(x) + x
• Note: the gradient is
$$y' = f'(x) + 1$$
• Constant 1 → gradient is passed through unchanged
Highways
• Regulate how much information from f(x) and x should impact the output y
• Gate t(x) (typically computed by a feed-forward layer)
y = t(x) f(x) + (1 − t(x)) x
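A numpy sketch of both variants; the gate is a sigmoid over a feed-forward layer, and all names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shortcut(x, f):
    """Skip connection: y = f(x) + x."""
    return f(x) + x

def highway(x, f, W_t, b_t):
    """Highway: y = t(x) * f(x) + (1 - t(x)) * x, with gate t(x) from a feed-forward layer."""
    t = sigmoid(W_t @ x + b_t)          # gate values in (0, 1)
    return t * f(x) + (1.0 - t) * x

x = np.array([0.5, -1.0])
f = lambda v: np.maximum(0.0, v)        # some layer function (illustrative)
y1 = shortcut(x, f)
y2 = highway(x, f, W_t=np.eye(2), b_t=np.zeros(2))
```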
Shortcuts and Highways
[Figure: a basic layer (FF), a skip connection (FF with Add), and a highway network (FF with Gate and Add)]