Alternative Architectures
Philipp Koehn
12 October 2023
attention
Attention
• Machine translation is a structured prediction task
– output is not a single label
– output structure needs to be built, word by word
• Relevant information for each word prediction varies
• Human translators pay attention to different parts of the input sentence when
translating
⇒ Attention mechanism
Computing Attention
• Attention mechanism in neural translation model (Bahdanau et al., 2015)
– previous hidden state s_{i-1}
– input word embedding h_j
– trainable parameters b, W_a, U_a, v_a

  a(s_{i-1}, h_j) = v_a^T tanh(W_a s_{i-1} + U_a h_j + b)
• Other ways to compute attention
– Dot product: a(s_{i-1}, h_j) = s_{i-1}^T h_j
– Scaled dot product: a(s_{i-1}, h_j) = (1/√|h_j|) s_{i-1}^T h_j
– General: a(s_{i-1}, h_j) = s_{i-1}^T W_a h_j
– Local: a(s_{i-1}) = W_a s_{i-1}
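These scoring variants can be written out directly. A minimal NumPy sketch, where the dimensions and the randomly initialized parameters are illustrative stand-ins for trained weights:

```python
import numpy as np

dim = 8                                    # illustrative size; decoder and encoder states match here
s_prev = np.random.randn(dim)              # previous decoder hidden state s_{i-1}
h_j = np.random.randn(dim)                 # encoder state h_j
W_a = np.random.randn(dim, dim)            # trainable parameters (random stand-ins)
U_a = np.random.randn(dim, dim)
v_a = np.random.randn(dim)
b = np.random.randn(dim)

additive   = v_a @ np.tanh(W_a @ s_prev + U_a @ h_j + b)   # Bahdanau et al. (2015)
dot        = s_prev @ h_j                                  # dot product
scaled_dot = (s_prev @ h_j) / np.sqrt(len(h_j))            # scaled dot product
general    = s_prev @ W_a @ h_j                            # general (bilinear) form
local      = W_a @ s_prev                                  # local: scores from the decoder state alone
```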
Attention of Luong et al. (2015)
• Luong et al. (2015) demonstrate good results with the dot product
  a(s_{i-1}, h_j) = s_{i-1}^T h_j
• No trainable parameters
• Additional changes
• Currently more popular
General View of Dot-Product Attention
• Three elements
Query : decoder state
Key : encoder state
Value : encoder state
• Intuition
– given a query (the decoder state)
– we check how well it matches keys in the database (the encoder states)
– and then use the matching score to scale the retrieved value (also the encoder
state)
• Computation
  Attention(Q, K, V) = softmax(QK^T / √d_k) V
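A minimal NumPy sketch of this computation (illustrative shapes; a real system would batch this over sentences):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # one row of matching scores per query
    return softmax(scores, axis=-1) @ V         # weighted sum of the values

# toy example: 3 queries (decoder states), 5 keys/values (encoder states), d_k = 4
Q, K, V = np.random.randn(3, 4), np.random.randn(5, 4), np.random.randn(5, 4)
context = attention(Q, K, V)                    # shape (3, 4)
```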
Scaled Dot-Product Attention
• Refinement of query, key, and value
• Scale them down to lower-dimensional vectors (e.g., from 4096 down to 512)
• Using a weight matrix for each: QW^Q, KW^K, VW^V
Multi-Head Attention
• Add redundancy
– say, 16 attention weights
– each based on its own parameters
• Formally:
  head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
  MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
• Multi-head attention is a form of ensembling
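Continuing the sketch above (reusing `attention`), a hedged illustration of multi-head attention; the sizes and random projection matrices are placeholders for trained parameters.

```python
d_model, n_heads, d_k = 16, 4, 4                # illustrative sizes, d_k = d_model / n_heads
heads = [(np.random.randn(d_model, d_k),        # W_i^Q
          np.random.randn(d_model, d_k),        # W_i^K
          np.random.randn(d_model, d_k))        # W_i^V
         for _ in range(n_heads)]
W_O = np.random.randn(n_heads * d_k, d_model)

def multi_head(Q, K, V):
    # each head attends in its own lower-dimensional projection of Q, K, V
    outs = [attention(Q @ W_Q, K @ W_K, V @ W_V) for (W_Q, W_K, W_V) in heads]
    return np.concatenate(outs, axis=-1) @ W_O  # Concat(head_1, ..., head_h) W^O

H = np.random.randn(5, d_model)                 # e.g., five encoder states
out = multi_head(H, H, H)                       # self-attention use: Q = K = V = H
```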
Self Attention
• Finally, a very different take on attention
• Motivation so far: need for alignment between input words and output words
• Now: refine representation of input words in the encoder
– representation of an input word mostly depends on itself
– but also informed by the surrounding context
– previously: recurrent neural networks (considers left or right context)
– now: attention mechanism
• Self attention:
Which of the surrounding words is most relevant to refine representation?
Self Attention
• Formal definition (based on a sequence of vectors h_j, packed into matrix H)

  self-attention(H) = Attention(HW_i^Q, HW_i^K, HW_i^V)
• Association between every word representation h_j and any other context word h_k
• Resulting vector of normalized association values used to weigh context words
transformer
Self Attention: Transformer
• Self-attention in encoder
– refine word representation based on relevant context words
– relevance determined by self attention
• Self-attention in decoder
– refine output word predictions based on relevant previous output words
– relevance determined by self attention
• Also regular attention to encoder states in decoder
• Currently most successful model
(sometimes with self-attention only in the encoder, combined with a regular recurrent decoder)
Encoder
[Figure: sequence of self-attention layers in the encoder. Each input word x_j is embedded (E_w x_j) and added to a position embedding (E_p j); self attention over the input context produces a weighted sum, followed by add & norm with a shortcut connection (ĥ_j) and a feed-forward (FF) step with another add & norm, yielding refined encoder states h_j.]
Self Attention Layer
• Given: input word representations h_j, packed into a matrix H = (h_1, ..., h_j)
• Self attention
  self-attention(H) = MultiHead(H, H, H)
• Shortcut connection
  self-attention(h_j) + h_j
• Layer normalization
  ĥ_j = layer-normalization(self-attention(h_j) + h_j)
• Feed-forward step with ReLU activation function
  relu(W ĥ_j + b)
• Again, shortcut connection and layer normalization
  layer-normalization(relu(W ĥ_j + b) + ĥ_j)
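Putting these pieces together, a hedged sketch of one such layer, continuing the NumPy code above (reusing `multi_head`; the layer-norm and feed-forward parameters are trivial or random stand-ins for trained values):

```python
g1, b1 = np.ones(d_model), np.zeros(d_model)        # layer-norm gain and bias (learned in practice)
g2, b2 = np.ones(d_model), np.zeros(d_model)
W_ff, b_ff = np.random.randn(d_model, d_model), np.zeros(d_model)

def layer_norm(x, g, b, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return g * (x - mu) / (sigma + eps) + b

def encoder_layer(H):
    # self attention with shortcut connection and layer normalization
    H_hat = layer_norm(multi_head(H, H, H) + H, g1, b1)
    # feed-forward step with ReLU, again with shortcut and layer normalization
    ff = np.maximum(0.0, H_hat @ W_ff + b_ff)
    return layer_norm(ff + H_hat, g2, b2)
```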
Stacked Self Attention Layers
• Stack several such layers (say, D = 6)
• Start with input word embedding
  h_{0,j} = E x_j
• Stacked layers
  h_{d,j} = self-attention-layer(h_{d-1,j})
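Continuing the sketch, stacking is just repeated application of the layer; the embedding matrix and word ids below are made-up placeholders.

```python
vocab_size, D = 100, 6                          # D = number of stacked layers
E = np.random.randn(vocab_size, d_model)        # embedding matrix (stand-in)
input_word_ids = np.array([5, 17, 3, 42, 8])    # toy ids for "the house is big ."

H = E[input_word_ids]                           # h_{0,j} = E x_j
for _ in range(D):
    H = encoder_layer(H)                        # h_{d,j} = self-attention-layer(h_{d-1,j})
```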
Decoder
[Figure: decoder stack. Each output word y_i is embedded with word and position embeddings (s_i); self-attention over the previous output words produces an output context, followed by add & norm with shortcut connections (ŝ_i) and a feed-forward (FF) step; attention over the encoder states h then produces a context, again followed by add & norm and FF, yielding refined decoder states s_i.]
Decoder computes attention-based representations of the output in several layers, initialized with the embeddings of the previous output words.
Self-Attention in the Decoder
• Same idea as in the encoder
• Output words are initially encoded by word embeddings s_i = E y_i
• Self attention is computed over previous output words
  – association of a word s_i is limited to words s_k (k ≤ i)
  – resulting representation s̃_i

  self-attention(S̃) = MultiHead(S̃, S̃, S̃)
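The restriction k ≤ i is typically implemented with a mask on the attention scores. A single-head NumPy sketch (the slides use the multi-head form; `softmax` is the helper defined earlier):

```python
def masked_self_attention(S):
    """Decoder self attention: position i may only attend to positions k <= i."""
    d_k = S.shape[-1]
    scores = S @ S.T / np.sqrt(d_k)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)   # mask out positions k > i
    scores = np.where(future, -1e9, scores)                   # ~zero weight after the softmax
    return softmax(scores, axis=-1) @ S
```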
Attention in the Decoder
• Original intuition of attention mechanism: focus on relevant input words
• Computed with the dot product S̃H^T
• Compute attention between the decoder states S̃ and the final encoder states H

  attention(S̃, H) = MultiHead(S̃, H, H)
• Note: attention mechanism formally mirrors self-attention
Full Decoder
[Figure: full model. Input word embeddings pass through a stack of encoder layers; output word embeddings pass through a stack of decoder layers; softmax and argmax over the final decoder states yield the output word predictions.]
Full Decoder
• Self-attention
  self-attention(S̃) = MultiHead(S̃, S̃, S̃)
– shortcut connections
– layer normalization
– feed-forward layer
• Attention
  attention(S̃, H) = MultiHead(S̃, H, H)
– shortcut connections
– layer normalization
– feed-forward layer
• Multiple stacked layers
Learning Rate
• Gradient computation gives direction of change
• Scaled by learning rate
• Weight updates
• Simplest form: fixed value
• Annealing
– start with larger value (big changes at beginning)
– reduce over time (minor adjustments to refine model)
Ensuring Randomness
• Typical theoretical assumption: independent and identically distributed training examples
• Approximate this ideal
– avoid undue structure in the training data
– avoid undue structure in initial weight setting
• ML approach: Maximum entropy training
– Fit properties of training data
– Otherwise, model should be as random as possible
(i.e., has maximum entropy)
Shuffling the Training Data
• Typical training data in machine translation
– different types of corpora
∗ European Parliament Proceedings
∗ collection of movie subtitles
– temporal structure in each corpus
– similar sentences next to each other (e.g., same story / debate)
• Online updating: last examples matter more
• Convergence criterion: no improvement recently
→ a stretch of hard examples following easy ones can trigger premature stopping
⇒ randomly shuffle the training data
(maybe each epoch)
Weight Initialization
• Initialize weights to random values
• Values are chosen from a uniform distribution
• Ideal weights lead to node values in transition area for activation function
For Example: Sigmoid
• Input values in range [−1; 1]
⇒ Output values in range [0.269;0.731]
• Magic formula (n = size of the previous layer)

  [ −1/√n , 1/√n ]
• Magic formula for hidden layers

  [ −√6 / √(n_j + n_{j+1}) , √6 / √(n_j + n_{j+1}) ]

– n_j is the size of the previous layer
– n_{j+1} is the size of the next layer
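A small sketch of both formulas (the layer sizes are illustrative):

```python
import numpy as np

def init_input_layer(n, shape):
    """Draw uniformly from [-1/sqrt(n), 1/sqrt(n)], n = size of the previous layer."""
    bound = 1.0 / np.sqrt(n)
    return np.random.uniform(-bound, bound, shape)

def init_hidden_layer(n_j, n_j1, shape):
    """Draw uniformly from [-sqrt(6)/sqrt(n_j + n_{j+1}), +sqrt(6)/sqrt(n_j + n_{j+1})]."""
    bound = np.sqrt(6.0) / np.sqrt(n_j + n_j1)
    return np.random.uniform(-bound, bound, shape)

W = init_hidden_layer(512, 512, shape=(512, 512))   # weights between two 512-node layers
```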
Problem: Overconfident Models
• Predictions of the neural machine translation models are surprisingly confident
• Often almost all the probability mass is assigned to a single word
(word prediction probabilities of over 99%)
• Problem for decoding and training
– decoding: sensible alternatives get low scores, bad for beam search
– training: overfitting is more likely
• Solution: label smoothing
• Jargon notice
– in classification tasks, we predict a label
– "label" is the jargon term for any output
→ here, we smooth the word predictions
Label Smoothing during Decoding
• Common strategy to combat peaked distributions: smooth them
• Recall
– prediction layer produces numbers for each word
– converted into probabilities using the softmax
  p(y_i) = exp(s_i) / Σ_j exp(s_j)
• Softmax calculation can be smoothed with so-called temperature T
  p(y_i) = exp(s_i / T) / Σ_j exp(s_j / T)
• Higher temperature → distribution smoother
(i.e., less probability is given to most likely choice)
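A minimal sketch of the smoothed softmax:

```python
import numpy as np

def softmax_with_temperature(s, T=1.0):
    """p(y_i) = exp(s_i / T) / sum_j exp(s_j / T); higher T gives a smoother distribution."""
    z = (s - s.max()) / T              # subtracting the max does not change the result
    e = np.exp(z)
    return e / e.sum()

scores = np.array([9.0, 2.0, 1.0, 0.5])
print(softmax_with_temperature(scores, T=1.0))   # peaked
print(softmax_with_temperature(scores, T=2.0))   # smoother
```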
Adjusting the Learning Rate
• Gradient descent training: weight update follows the gradient downhill
• Actual gradients have fairly large values, scale with a learning rate
(low number, e.g., µ = 0.001)
• Change the learning rate over time
– starting with larger updates
– refining weights with smaller updates
– adjust for other reasons
• Learning rate schedule
Momentum Term
• Consider case where weight value far from optimum
• Most training examples push the weight value in the same direction
• Small updates take long to accumulate
• Solution: momentum term mt
– accumulate weight updates at each time step t
– some decay rate for sum (e.g., 0.9)
– combine momentum term m_{t-1} with weight update value Δw_t

  m_t = 0.9 m_{t-1} + Δw_t
  w_t = w_{t-1} − µ m_t
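As a one-function sketch (works the same for scalar weights or weight vectors):

```python
def momentum_update(w, m_prev, delta_w, mu=0.001, decay=0.9):
    """m_t = decay * m_{t-1} + delta_w_t ;  w_t = w_{t-1} - mu * m_t."""
    m = decay * m_prev + delta_w
    return w - mu * m, m

# usage: w, m = momentum_update(w, m, gradient)
```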
Adapting Learning Rate per Parameter
• Common strategy: reduce the learning rate µ over time
• Initially parameters are far away from optimum → change a lot
• Later nuanced refinements needed → change little
• Now: different learning rate for each parameter
Adagrad
• Different parameters at different stages of training
→ different learning rate for each parameter
• Adagrad
– record gradients for each parameter
– accumulate their square values over time
– use this sum to reduce learning rate
• Update formula
– gradient g_t = dE_t/dw of error E with respect to weight w
– divide the learning rate µ by accumulated sum

  Δw_t = µ / √(Σ_{τ=1}^t g_τ²) · g_t
• Big changes in the parameter value (corresponding to big gradients gt)
→ reduction of the learning rate of the weight parameter
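A hedged sketch of the update; the small constant eps avoids division by zero at the first step (a common implementation detail not shown on the slide):

```python
import numpy as np

def adagrad_update(w, grad, sum_sq, mu=0.1, eps=1e-8):
    """Divide the learning rate by the root of the accumulated squared gradients."""
    sum_sq = sum_sq + grad ** 2
    w = w - mu / (np.sqrt(sum_sq) + eps) * grad
    return w, sum_sq
```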
Adam: Elements
• Combine idea of momentum term and reduce parameter update by accumulated
change
• Momentum term idea (e.g., β_1 = 0.9)

  m_t = β_1 m_{t-1} + (1 − β_1) g_t

• Accumulated gradients (decay with β_2 = 0.999)

  v_t = β_2 v_{t-1} + (1 − β_2) g_t²
Adam: Technical Correction
• Initially, values for mt and vt are close to initial value of 0
• Adjustment
  m̂_t = m_t / (1 − β_1^t),   v̂_t = v_t / (1 − β_2^t)

• With t → ∞ this correction goes away

  lim_{t→∞} 1 / (1 − β^t) → 1
Adam
• Given
– learning rate µ
– momentum ˆmt
– accumulated change ˆvt
• Weight update per Adam (e.g., ε = 10^−8)

  Δw_t = µ / (√v̂_t + ε) · m̂_t
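All elements combined into one update step (a sketch; t counts the update steps starting at 1):

```python
import numpy as np

def adam_update(w, grad, m, v, t, mu=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # momentum term
    v = beta2 * v + (1 - beta2) * grad ** 2     # accumulated squared gradients
    m_hat = m / (1 - beta1 ** t)                # technical correction
    v_hat = v / (1 - beta2 ** t)
    w = w - mu / (np.sqrt(v_hat) + eps) * m_hat
    return w, m, v
```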
Batched Gradient Updates
• Accumulate the weight updates for all training examples → one update
(converges slowly)
• Process each training example → update (stochastic gradient descent)
(quicker convergence, but the last training examples have a disproportionately higher impact)
• Process data in batches
– compute the gradients for the individual word prediction errors
– use sum over each batch to update parameters
→ better parallelization on GPUs
• Process data on multiple compute cores
– batch processing may take different amounts of time
– asynchronous training: apply updates when they arrive
– mismatch between original weights and updates may not matter much
avoiding local optima
Avoiding Local Optima
• One of the hardest problems in designing neural network architectures and
optimization methods
• Goal: ensure that the model converges at least to a set of parameter values that
gives results close to the optimum on unseen test data
• There is no real solution to this problem
• It requires experimentation and analysis that is more craft than science
• Still, this section presents a number of methods that generally help to avoid
getting stuck in local optima
Overfitting and Underfitting
• Neural machine translation models
– 100s of millions of parameters
– 100s of millions of training examples (individual word predictions)
• No hard rules for relationship between these two numbers
• Too many parameters and too few training examples → overfitting
• Too few parameters and many training examples → underfitting
Regularization
• Motivation: prefer as few parameters as possible
• Strategy: set unneeded parameters to a value of 0
• Method
– adjust training objective
– add cost for any non-zero parameter
– typically done with L2 norm
• Practical impact
– derivative of the L2 norm is the value of the parameter
– if no signal from training: reduce the value of the parameter
– also called weight decay
• Not common in deep learning, but other methods can be understood as regularization
Curriculum Learning
• Human learning
– learn simple concepts first
– learn more complex material later
• Early epochs: only easy training examples
– only short sentences
– create artificial data by extracting smaller segments
(similar to phrase pair extraction in statistical machine translation)
• Later epochs: all training data
• Not easy to calibrate
Dropout
• Training may get stuck in local optima
– some properties of task have been learned
– discovery of other properties would take it too far out of its comfort zone.
• Machine translation example
– model learned the language model aspects
– but cannot figure out role of input sentence
• Dropout: for each batch, eliminate some nodes
Dropout
• Dropout
– For each batch, different random set of nodes is removed
– Their values are set to 0 and their weights are not updated
– 10%, 20% or even 50% of all the nodes
• Why does this work?
– robustness: redundant nodes play similar roles
– ensemble learning: different subnetworks are different models
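A minimal sketch of the node removal; the rescaling by 1/(1−rate), which keeps the expected activation unchanged ("inverted dropout"), is a common implementation choice not spelled out on the slide.

```python
import numpy as np

def dropout(h, rate=0.2, training=True):
    """Randomly set a fraction of node values to 0 for each batch."""
    if not training:
        return h                                   # no dropout at inference time
    mask = np.random.rand(*h.shape) >= rate        # keep each node with probability 1 - rate
    return h * mask / (1.0 - rate)                 # rescale so the expectation is unchanged
```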
Gradient Clipping
• Exploding gradients: gradients become too large during backward pass
⇒ Limit total value of gradients for a layer to threshold (τ)
• Use of L2 norm of gradient values g

  L2(g) = √( Σ_j g_j² )

• Adjust each gradient value g_i for each element i in the vector

  g_i = g_i × τ / max(τ, L2(g))
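As code, this is a single rescaling step:

```python
import numpy as np

def clip_gradient(g, tau=1.0):
    """Rescale g so that its L2 norm does not exceed the threshold tau."""
    norm = np.sqrt(np.sum(g ** 2))
    return g * tau / max(tau, norm)
```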
Layer Normalization
• During inference, average node values may become too large or too small
• Has also impact on training (gradients are multiplied with node values)
⇒ Normalize node values
• During training, learn bias layer
Layer Normalization: Math
• Feed-forward layer h^l, weights W, computed sum s^l, activation function

  s^l = W h^{l−1}
  h^l = sigmoid(s^l)

• Compute mean µ^l and standard deviation σ^l of the sum vector s^l

  µ^l = (1/H) Σ_{i=1}^H s_i^l
  σ^l = √( (1/H) Σ_{i=1}^H (s_i^l − µ^l)² )
Layer Normalization: Math
• Normalize s^l

  ŝ^l = (1/σ^l) (s^l − µ^l)

• Learnable bias vectors g and b

  ŝ^l = (g/σ^l) (s^l − µ^l) + b
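A minimal sketch of these two steps for one layer's sum vector s (the small eps guards against division by zero and is an implementation detail, not part of the formula):

```python
import numpy as np

def layer_normalize(s, g, b, eps=1e-6):
    """Normalize the summed inputs s, then apply the learnable vectors g and b."""
    mu = s.mean()
    sigma = np.sqrt(((s - mu) ** 2).mean())
    return g / (sigma + eps) * (s - mu) + b
```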
Shortcuts and Highways
• Deep learning: many layers of processing
⇒ Error propagation has to travel farther
• All parameters in the processing chain have to be adjusted
• Instead of always passing through all layers, add connections from first to last
• Jargon alert
– shortcuts
– residual connections
– skip connections
Shortcuts
• Feed-forward layer
y = f(x)
• Pass through input x
y = f(x) + x
• Note: the gradient is
  y′ = f′(x) + 1
• Constant 1 → gradient is passed through unchanged
Highways
• Regulate how much information from f(x) and x should impact the output y
• Gate t(x) (typically computed by a feed-forward layer)
y = t(x) f(x) + (1 − t(x)) x
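A sketch of a highway step, assuming the gate is a feed-forward layer with a sigmoid output (the slide leaves the exact form of t(x) open):

```python
import numpy as np

def highway(x, f, W_t, b_t):
    """y = t(x) * f(x) + (1 - t(x)) * x, with gate t(x) = sigmoid(W_t x + b_t)."""
    t = 1.0 / (1.0 + np.exp(-(W_t @ x + b_t)))     # gate values between 0 and 1
    return t * f(x) + (1 - t) * x
```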
Shortcuts and Highways
[Figure: a basic feed-forward layer; a skip connection, where the FF output is added to its input; and a highway network, where a gate combines the FF output and the input.]
Batching
• Already large degree of parallelism
– most computations on vectors, matrices
– efficient implementations for CPU and GPU
• Further parallelism by batching
– processing several sentence pairs at once
– scalar operation → vector operation
– vector operation → matrix operation
– matrix operation → 3d tensor operation
• Typical batch sizes 50–100 sentence pairs
Batches
• Sentences have different length
• When batching, fill up unneeded cells in tensors
⇒ A lot of wasted computations
Mini-Batches
• Sort sentences by length, break up into mini-batches
• Example: Maxi-batch 1600 sentence pairs, mini-batch 80 sentence pairs
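A sketch of this bucketing scheme, assuming `sentence_pairs` is a list of (source tokens, target tokens) tuples:

```python
def mini_batches(sentence_pairs, maxi_size=1600, mini_size=80):
    """Sort each maxi-batch by source length, then cut it into mini-batches."""
    for i in range(0, len(sentence_pairs), maxi_size):
        maxi = sorted(sentence_pairs[i:i + maxi_size], key=lambda pair: len(pair[0]))
        for j in range(0, len(maxi), mini_size):
            yield maxi[j:j + mini_size]
```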
Overall Organization of Training
• Shuffle corpus
• Break into maxi-batches
• Break up each maxi-batch into mini-batches
• Process mini-batch, update parameters
• Once done, repeat
• Typically 5-15 epochs needed (passes through entire training corpus)