Alternative Architectures
Philipp Koehn
15 October 2020

Alternative Architectures
• We introduced one translation model
  – attentional seq2seq model
  – core organizing feature: recurrent neural networks
• Other core neural architectures
  – convolutional neural networks
  – attention
• But first: a look at various components of neural architectures

components

Components of Neural Networks
• Neural networks were originally inspired by the brain
  – a neuron receives signals from other neurons
  – if sufficiently activated, it sends signals
  – feed-forward layers are roughly based on this
• Computation graph
  – any function is possible, as long as it is partially differentiable
  – not limited by appeals to biological validity
• Deep learning may be a better name

Feed-Forward Layer
• Classic neural network component
• Given an input vector x: multiplication with a matrix M and addition of a bias vector b
  Mx + b
• Adding a non-linear activation function
  y = activation(Mx + b)
• Notation
  y = FF_activation(x) = a(Mx + b)

Feed-Forward Layer
• Historic neural network designs: several feed-forward layers
  – input layer
  – hidden layers
  – output layer
• Powerful tools for a wide range of machine learning problems
• The mapping Mx + b is also called an affine transform
  – the name appeals to its geometric properties
  – straight lines in the input remain straight lines in the output

Factored Decomposition
• One challenge: very large input and output vectors
• Number of parameters in matrix M = |x| × |y|
⇒ Need to reduce the size of the matrix
• Solution: first reduce to a smaller representation
[Figure: the direct mapping x → y via M is replaced by x → v via matrix A, then v → y via matrix B]

Factored Decomposition: Math
• Intuition
  – given a high-dimensional vector x
  – first map it into a lower-dimensional vector v (matrix A)
  – then map v to the output vector y (matrix B)
  v = Ax
  y = Bv = BAx
• Example
  – |x| = 20,000, |y| = 50,000 → M has 20,000 × 50,000 = 1,000,000,000 parameters
  – |v| = 100 → A has 20,000 × 100 = 2,000,000 parameters, B has 100 × 50,000 = 5,000,000 parameters
  – reduction from 1,000,000,000 to 7,000,000 parameters

Factored Decomposition: Interpretation
• Vector v is a bottleneck feature
• It is forced to capture the salient features of the input
• One example: word embeddings

basic mathematical operations

Concatenation
• A processing step often receives multiple input vectors
• For instance, a recurrent neural network
  – input word
  – previous state
• Combined in a feed-forward layer
  y = activation(M_1 x_1 + M_2 x_2 + b)
• Another view
  x = concat(x_1, x_2)
  y = activation(Mx + b)
• Splitting hairs here, but concatenation is useful generally

Addition
• Adding vectors: very simplistic, but often done
• Example: compute a sentence embedding s from word embeddings w_1, ..., w_n
  s = Σ_{i=1}^{n} w_i
• Reduces a variable-length sentence representation to a fixed-size vector
• Maybe weight the words, e.g., by attention (see the sketch below)
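As a minimal illustration of sentence embedding by addition, here is a numpy sketch that sums word embeddings and, optionally, weights them with a softmax over attention-style scores. The embedding size, number of words, and scoring vector are toy assumptions for illustration, not part of the lecture.

```python
import numpy as np

# Toy setup (assumption): 5 words, embedding size 4
rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(5, 4))   # w_1 ... w_n, one row per word

# Plain addition: s = sum_i w_i  ->  fixed-size sentence vector
sentence_embedding = word_embeddings.sum(axis=0)

# Weighted variant: weight each word, e.g. by an attention-style softmax
scores = word_embeddings @ rng.normal(size=4)      # one relevance score per word (toy scoring)
weights = np.exp(scores) / np.exp(scores).sum()    # softmax over the n words
weighted_embedding = weights @ word_embeddings     # sum_i alpha_i * w_i

print(sentence_embedding.shape, weighted_embedding.shape)  # both (4,)
```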
Multiplication
• Another elementary mathematical operation
• Three ways to multiply vectors
  – element-wise multiplication: v ⊙ u = (v_1 × u_1, v_2 × u_2)^T for v = (v_1, v_2)^T, u = (u_1, u_2)^T
  – dot product: v · u = v^T u = v_1 × u_1 + v_2 × u_2
    used for a simple version of the attention mechanism
  – third possibility: the outer product v u^T, not commonly done

Maximum
• Goal: reduce the dimensionality of a representation
• Example: detect if a face is in an image
  – any region of the image may have a positive match
  – represent the different regions with elements of a vector
  – maximum value: some region has a face
• Max pooling
  – given: an n-dimensional vector
  – goal: reduce it to an n/k-dimensional vector
  – method: break up the vector into blocks of k elements, map each block to a single value

Max Out
• Max out
  – first branch out into multiple feed-forward layers
    W_1 x + b_1
    W_2 x + b_2
  – element-wise maximum
    maxout(x) = max(W_1 x + b_1, W_2 x + b_2)
• The ReLU activation is a maxout layer: the maximum of a feed-forward layer and 0
  ReLU(x) = max(Wx + b, 0)

processing sequences

Recurrent Neural Networks
• Already described recurrent neural networks at length
  – propagate a state s
  – over time steps t
  – receiving an input x_t at each turn
  s_t = f(s_{t−1}, x_t)
  (the state may be computed as a feed-forward layer)
• More successful
  – gated recurrent units (GRU)
  – long short-term memory cells (LSTM)
• Good fit for sequences, like words in a sentence
  – humans also receive them word by word
  – most recent words most relevant → closer to the current state
• But computationally problematic: very long computation chains

Alternative Sequence Processing
• Convolutional neural networks
• Attention

convolutional neural networks

Convolutional Neural Networks (CNN)
• Popular in image processing
• Regions of an image are reduced into increasingly smaller representations
  – a matrix spanning part of the image is reduced to a single value
  – overlapping regions

CNNs for Language
[Figure: word embeddings are combined by layers of feed-forward nodes, each spanning a few neighboring positions, until a single vector remains]
• Map words into a fixed-sized sentence representation

Hierarchical Structure and Language
• Syntactic and semantic theories of language
  – language is recursive
  – central: verb
  – dependents: subject, objects, adjuncts
  – their dependents: adjectives, determiners
  – also nested: relative clauses
• How to compute sentence embeddings is an active research topic

Convolutional Neural Networks
• Key step
  – take a high-dimensional input representation
  – map it to a lower-dimensional representation
• Several repetitions of this step
• Examples
  – map a 50×50 pixel area into a scalar value
  – combine 3 or more neighboring words into a single vector
• Machine translation (see the sketch below)
  – encode the input sentence into a single vector
  – decode this vector into a sentence in the output language
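To make the idea of combining neighboring words concrete, here is a small numpy sketch of a convolution over word embeddings followed by max pooling over time, producing a fixed-size sentence vector. The window size, dimensions, and single-layer setup are illustrative assumptions, not a specific published architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy input (assumption): a sentence of 6 words, embedding size 8
X = rng.normal(size=(6, 8))          # one embedding per word

# Convolution over time: combine each window of 3 neighboring words
window, hidden = 3, 16               # filter width and output size (toy choices)
W = rng.normal(size=(window * 8, hidden))
b = np.zeros(hidden)

features = []
for i in range(X.shape[0] - window + 1):
    span = X[i:i + window].reshape(-1)               # concatenate 3 word embeddings
    features.append(np.maximum(W.T @ span + b, 0))   # feed-forward step with ReLU
features = np.stack(features)        # (number of windows, hidden)

# Max pooling over time: keep the strongest response per feature
sentence_vector = features.max(axis=0)               # fixed size regardless of sentence length
print(sentence_vector.shape)                         # (16,)
```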
attention

Attention
• Machine translation is a structured prediction task
  – the output is not a single label
  – the output structure needs to be built, word by word
• The relevant information for each word prediction varies
• Human translators pay attention to different parts of the input sentence when translating
⇒ Attention mechanism

Computing Attention
• Attention mechanism in the neural translation model (Bahdanau et al., 2015)
  – previous hidden state s_{i−1}
  – input word embedding h_j
  – trainable parameters b, W_a, U_a, v_a
  a(s_{i−1}, h_j) = v_a^T tanh(W_a s_{i−1} + U_a h_j + b)
• Other ways to compute attention
  – Dot product: a(s_{i−1}, h_j) = s_{i−1}^T h_j
  – Scaled dot product: a(s_{i−1}, h_j) = (1/√|h_j|) s_{i−1}^T h_j
  – General: a(s_{i−1}, h_j) = s_{i−1}^T W_a h_j
  – Local: a(s_{i−1}) = W_a s_{i−1}

Attention of Luong et al. (2015)
• Luong et al. (2015) demonstrate good results with the dot product
  a(s_{i−1}, h_j) = s_{i−1}^T h_j
• No trainable parameters
• Additional changes
• Currently more popular

Attention of Luong et al. (2015)
[Figure: side-by-side architectures of Luong et al. (2015) and Bahdanau et al. (2015) — encoder states h_j (RNN), attention weights α_ij, input context c_i via weighted sum, decoder state s_i (RNN), output word prediction t_i via softmax and argmax, output word y_i, output word embedding E y_{i−1}]

Attention of Luong et al. (2015)
• Attention: α_ij = softmax(FF(s_{i−1}, h_j))
• Input context: c_i = Σ_j α_ij h_j
• Output word prediction: p(y_t | y_{<t}, x), computed by a softmax over the output vocabulary
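The following numpy sketch walks through the dot-product variant above for a single decoder step: scores a(s_{i−1}, h_j) = s_{i−1}^T h_j, softmax weights α_ij, and the input context c_i = Σ_j α_ij h_j. The dimensions and values are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup (assumption): 5 encoder states and one decoder state of size 8
H = rng.normal(size=(5, 8))      # encoder states h_1 ... h_5
s_prev = rng.normal(size=8)      # previous decoder hidden state s_{i-1}

# Dot-product attention scores: a(s_{i-1}, h_j) = s_{i-1}^T h_j
scores = H @ s_prev              # one score per input position

# Normalize to attention weights: alpha_ij = softmax_j(a(s_{i-1}, h_j))
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# Input context: c_i = sum_j alpha_ij h_j
c = alpha @ H
print(alpha.round(2), c.shape)   # weights sum to 1, context has the encoder state size
```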
Attention
• The attention mechanism is fundamentally unchanged
• The input context c_i is computed based on the association a(s_{i−1}, h_j) between
  – encoder state h_j
  – decoder state s_{i−1}
• Now, with a deep encoder of D layers and a deep decoder of D̂ layers
  – encoder state h_{D,j}
  – decoder state s_{D̂,i−1}
• Refinement when computing the context vector c_i: a shortcut connection between the encoder state h_{D,j} and the input word embedding x_j

transformer

Self Attention: Transformer
• Self-attention in the encoder
  – refine word representations based on relevant context words
  – relevance determined by self-attention
• Self-attention in the decoder
  – refine output word predictions based on relevant previous output words
  – relevance determined by self-attention
• Also regular attention to encoder states in the decoder
• Currently the most successful model (maybe with self-attention only in the encoder, but a regular recurrent decoder)

Encoder
[Figure: the Transformer encoder for the input "the house is big ." — input words x_j and input word positions j (0, 1, 2, ...) are embedded (E_w x_j, E_p j) and added to form positional input word embeddings E_w x_j + E_p j; self-attention with weighted sum and add & norm yields the input context with shortcut ĥ_j, and a feed-forward layer with add & norm yields the refined encoder state h_j; a sequence of such self-attention layers is stacked]

Self Attention Layer
• Given: input word representations h_j, packed into a matrix H = (h_1, ..., h_j)
• Self attention
  self-attention(H) = softmax(H H^T / √|h|) H
• Shortcut connection
  self-attention(h_j) + h_j
• Layer normalization
  ĥ_j = layer-normalization(self-attention(h_j) + h_j)
• Feed-forward step with ReLU activation function
  relu(W ĥ_j + b)
• Again, shortcut connection and layer normalization
  layer-normalization(relu(W ĥ_j + b) + ĥ_j)

Stacked Self Attention Layers
• Stack several such layers (say, D = 6)
• Start with the input word embedding
  h_{0,j} = E x_j
• Stacked layers
  h_{d,j} = self-attention-layer(h_{d−1,j})

Decoder
[Figure: the Transformer decoder — previous output words y_i and output word positions i are embedded and added to form positional output word embeddings s_i; self-attention with weighted sum, add & norm (normalization with shortcut), and a feed-forward layer refines the output states; attention over the encoder states h with weighted sum, add & norm (normalization with shortcut ŝ_i), and a feed-forward layer produces the refined decoder states s_i. The decoder computes attention-based representations of the output in several layers, initialized with the embeddings of the previous output words]

Self-Attention in the Decoder
• Same idea as in the encoder
• Output words are initially encoded by word embeddings s_i = E y_i
• Self-attention is computed over previous output words
  – the association of a word s_i is limited to words s_k (k ≤ i)
  – resulting representation s̃_i
  S̃ = self-attention(S) = softmax(S S^T / √|h|) S
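Below is a compact numpy sketch of one self-attention layer as described above: scaled dot-product self-attention, a shortcut connection with layer normalization, and a ReLU feed-forward step, plus an optional causal mask for the decoder case (word i attends only to positions k ≤ i). The single-head formulation follows the slides; the concrete sizes and the square feed-forward matrix are toy assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each row to zero mean and unit variance
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention_layer(H, W, b, causal=False):
    """One layer: softmax(H H^T / sqrt(|h|)) H, shortcut + norm, ReLU FF, shortcut + norm."""
    d = H.shape[-1]
    scores = H @ H.T / np.sqrt(d)                 # scaled dot products between all positions
    if causal:                                    # decoder: word i may only attend to k <= i
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    alpha = np.exp(scores - scores.max(-1, keepdims=True))
    alpha /= alpha.sum(-1, keepdims=True)         # softmax over positions
    attended = alpha @ H                          # weighted sum of word representations
    H_hat = layer_norm(attended + H)              # shortcut connection + layer normalization
    ff = np.maximum(H_hat @ W + b, 0)             # feed-forward step with ReLU
    return layer_norm(ff + H_hat)                 # again shortcut + layer normalization

# Toy usage (assumption): 5 words with representations of size 8
rng = np.random.default_rng(3)
H0 = rng.normal(size=(5, 8))
W, b = rng.normal(size=(8, 8)), np.zeros(8)
H1 = self_attention_layer(H0, W, b)               # encoder-style layer
S1 = self_attention_layer(H0, W, b, causal=True)  # decoder-style (masked) layer
print(H1.shape, S1.shape)
```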
Attention in the Decoder
• Original intuition of the attention mechanism: focus on relevant input words
• Computed with the dot product S̃ H^T
• Compute attention between the decoder states S̃ and the final encoder states H
  attention(S̃, H) = softmax(S̃ H^T / √|h|) H
• Note: the attention mechanism formally mirrors self-attention

Full Decoder
[Figure: the full model — the input words pass through a stack of encoder layers; the output word embeddings pass through a stack of decoder layers; softmax and argmax over each decoder state produce the output word predictions and output words]

Full Decoder
• Self-attention
  S̃ = softmax(S S^T / √|h|) S
  – shortcut connections
  – layer normalization
  – feed-forward layer
• Attention
  attention(S̃, H) = softmax(S̃ H^T / √|h|) H
  – shortcut connections
  – layer normalization
  – feed-forward layer
• Multiple stacked layers

Mix and Match
• The encoder may be multiple layers of either
  – recurrent neural networks
  – self-attention layers
• The decoder may be multiple layers of either
  – recurrent neural networks
  – self-attention layers
• Also possible: self-attention encoder, recurrent neural network decoder
• Even better: both self-attention and recurrent neural network, merged at the end
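To close, a short numpy sketch of the decoder's attention to the encoder, attention(S̃, H) = softmax(S̃ H^T / √|h|) H, from the Full Decoder slides above. The decoder states S̃ and final encoder states H here are toy stand-ins for the outputs of the self-attention layers.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy stand-ins (assumption): 4 decoder states S~ and 6 final encoder states H, size 8
S_tilde = rng.normal(size=(4, 8))
H = rng.normal(size=(6, 8))

# attention(S~, H) = softmax(S~ H^T / sqrt(|h|)) H
scores = S_tilde @ H.T / np.sqrt(H.shape[-1])   # each output position scores every input position
alpha = np.exp(scores - scores.max(-1, keepdims=True))
alpha /= alpha.sum(-1, keepdims=True)           # softmax over input positions
context = alpha @ H                             # one input context per output position
print(context.shape)                            # (4, 8): formally mirrors self-attention
```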