Alternative Architectures
Philipp Koehn
15 October 2020

Alternative Architectures
• We introduced one translation model
  – attentional seq2seq model
  – core organizing feature: recurrent neural networks
• Other core neural architectures
  – convolutional neural networks
  – attention
• But first: a look at various components of neural architectures

components

Components of Neural Networks
• Neural networks were originally inspired by the brain
  – a neuron receives signals from other neurons
  – if sufficiently activated, it sends signals
  – feed-forward layers are roughly based on this
• Computation graph
  – any function is possible, as long as it is partially differentiable
  – not limited by appeals to biological validity
• Deep learning may be a better name

Feed-Forward Layer
• Classic neural network component
• Given an input vector x: multiplication with a matrix M and addition of a bias vector b
  Mx + b
• Adding a non-linear activation function
  y = activation(Mx + b)
• Notation
  y = FF_activation(x) = a(Mx + b)

Feed-Forward Layer
• Historic neural network designs: several feed-forward layers
  – input layer
  – hidden layers
  – output layer
• Powerful tools for a wide range of machine learning problems
• The mapping Mx + b is also called an affine transform
  – the name appeals to its geometric properties
  – straight lines in the input remain straight lines in the output

Factored Decomposition
• One challenge: very large input and output vectors
• Number of parameters in matrix M = |x| × |y|
⇒ Need to reduce the size of the matrix
• Solution: first reduce to a smaller representation
[Figure: the direct mapping x → y via M is replaced by x → v via matrix A, then v → y via matrix B]

Factored Decomposition: Math
• Intuition
  – given a high-dimensional vector x
  – first map it into a lower-dimensional vector v (matrix A)
  – then map v to the output vector y (matrix B)
  v = Ax
  y = Bv = BAx
• Example
  – |x| = 20,000, |y| = 50,000 → M has 20,000 × 50,000 = 1,000,000,000 parameters
  – |v| = 100 → A has 20,000 × 100 = 2,000,000 parameters, B has 100 × 50,000 = 5,000,000 parameters
  – reduction from 1,000,000,000 to 7,000,000 parameters

Factored Decomposition: Interpretation
• Vector v is a bottleneck feature
• It is forced to capture the salient features of the input
• One example: word embeddings

basic mathematical operations

Concatenation
• A processing step often receives multiple input vectors
• For instance, a recurrent neural network
  – input word
  – previous state
• Combined in a feed-forward layer
  y = activation(M_1 x_1 + M_2 x_2 + b)
• Another view
  x = concat(x_1, x_2)
  y = activation(Mx + b)
• Splitting hairs here, but concatenation is useful generally

Addition
• Adding vectors: very simplistic, but often done
• Example: compute a sentence embedding s from word embeddings w_1, ..., w_n
  s = Σ_{i=1}^{n} w_i
• Reduces a variable-length sentence representation to a fixed-size vector
• Maybe weight the words, e.g., by attention (see the sketch below)
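As a minimal illustration of sentence embedding by addition, here is a numpy sketch that sums word embeddings and, optionally, weights them with a softmax over attention-style scores. The embedding size, number of words, and scoring vector are toy assumptions for illustration, not part of the lecture.

```python
import numpy as np

# Toy setup (assumption): 5 words, embedding size 4
rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(5, 4))   # w_1 ... w_n, one row per word

# Plain addition: s = sum_i w_i  ->  fixed-size sentence vector
sentence_embedding = word_embeddings.sum(axis=0)

# Weighted variant: weight each word, e.g. by an attention-style softmax
scores = word_embeddings @ rng.normal(size=4)      # one relevance score per word (toy scoring)
weights = np.exp(scores) / np.exp(scores).sum()    # softmax over the n words
weighted_embedding = weights @ word_embeddings     # sum_i alpha_i * w_i

print(sentence_embedding.shape, weighted_embedding.shape)  # both (4,)
```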
Multiplication
• Another elementary mathematical operation
• Three ways to multiply vectors
  – element-wise multiplication: v ⊙ u = (v_1 × u_1, v_2 × u_2)^T for v = (v_1, v_2)^T, u = (u_1, u_2)^T
  – dot product: v · u = v^T u = v_1 × u_1 + v_2 × u_2
    used for a simple version of the attention mechanism
  – third possibility: the outer product v u^T, not commonly done

Maximum
• Goal: reduce the dimensionality of a representation
• Example: detect if a face is in an image
  – any region of the image may have a positive match
  – represent the different regions with elements of a vector
  – maximum value: some region has a face
• Max pooling
  – given: an n-dimensional vector
  – goal: reduce it to an n/k-dimensional vector
  – method: break up the vector into blocks of k elements, map each block to a single value

Max Out
• Max out
  – first branch out into multiple feed-forward layers
    W_1 x + b_1
    W_2 x + b_2
  – element-wise maximum
    maxout(x) = max(W_1 x + b_1, W_2 x + b_2)
• The ReLU activation is a maxout layer: the maximum of a feed-forward layer and 0
  ReLU(x) = max(Wx + b, 0)

processing sequences

Recurrent Neural Networks
• Already described recurrent neural networks at length
  – propagate a state s
  – over time steps t
  – receiving an input x_t at each turn
  s_t = f(s_{t−1}, x_t)
  (the state may be computed as a feed-forward layer)
• More successful
  – gated recurrent units (GRU)
  – long short-term memory cells (LSTM)
• Good fit for sequences, like words in a sentence
  – humans also receive them word by word
  – most recent words most relevant → closer to the current state
• But computationally problematic: very long computation chains

Alternative Sequence Processing
• Convolutional neural networks
• Attention

convolutional neural networks

Convolutional Neural Networks (CNN)
• Popular in image processing
• Regions of an image are reduced into increasingly smaller representations
  – a matrix spanning part of the image is reduced to a single value
  – overlapping regions

CNNs for Language
[Figure: word embeddings are combined by layers of feed-forward nodes, each spanning a few neighboring positions, until a single vector remains]
• Map words into a fixed-sized sentence representation

Hierarchical Structure and Language
• Syntactic and semantic theories of language
  – language is recursive
  – central: verb
  – dependents: subject, objects, adjuncts
  – their dependents: adjectives, determiners
  – also nested: relative clauses
• How to compute sentence embeddings is an active research topic

Convolutional Neural Networks
• Key step
  – take a high-dimensional input representation
  – map it to a lower-dimensional representation
• Several repetitions of this step
• Examples
  – map a 50×50 pixel area into a scalar value
  – combine 3 or more neighboring words into a single vector
• Machine translation (see the sketch below)
  – encode the input sentence into a single vector
  – decode this vector into a sentence in the output language
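To make the idea of combining neighboring words concrete, here is a small numpy sketch of a convolution over word embeddings followed by max pooling over time, producing a fixed-size sentence vector. The window size, dimensions, and single-layer setup are illustrative assumptions, not a specific published architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy input (assumption): a sentence of 6 words, embedding size 8
X = rng.normal(size=(6, 8))          # one embedding per word

# Convolution over time: combine each window of 3 neighboring words
window, hidden = 3, 16               # filter width and output size (toy choices)
W = rng.normal(size=(window * 8, hidden))
b = np.zeros(hidden)

features = []
for i in range(X.shape[0] - window + 1):
    span = X[i:i + window].reshape(-1)               # concatenate 3 word embeddings
    features.append(np.maximum(W.T @ span + b, 0))   # feed-forward step with ReLU
features = np.stack(features)        # (number of windows, hidden)

# Max pooling over time: keep the strongest response per feature
sentence_vector = features.max(axis=0)               # fixed size regardless of sentence length
print(sentence_vector.shape)                         # (16,)
```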
attention

Attention
• Machine translation is a structured prediction task
  – the output is not a single label
  – the output structure needs to be built, word by word
• The relevant information for each word prediction varies
• Human translators pay attention to different parts of the input sentence when translating
⇒ Attention mechanism

Computing Attention
• Attention mechanism in the neural translation model (Bahdanau et al., 2015)
  – previous hidden state s_{i−1}
  – input word embedding h_j
  – trainable parameters b, W_a, U_a, v_a
  a(s_{i−1}, h_j) = v_a^T tanh(W_a s_{i−1} + U_a h_j + b)
• Other ways to compute attention
  – Dot product: a(s_{i−1}, h_j) = s_{i−1}^T h_j
  – Scaled dot product: a(s_{i−1}, h_j) = (1/√|h_j|) s_{i−1}^T h_j
  – General: a(s_{i−1}, h_j) = s_{i−1}^T W_a h_j
  – Local: a(s_{i−1}) = W_a s_{i−1}

Attention of Luong et al. (2015)
• Luong et al. (2015) demonstrate good results with the dot product
  a(s_{i−1}, h_j) = s_{i−1}^T h_j
• No trainable parameters
• Additional changes
• Currently more popular

Attention of Luong et al. (2015)
[Figure: side-by-side architectures of Luong et al. (2015) and Bahdanau et al. (2015) — encoder states h_j (RNN), attention weights α_ij, input context c_i via weighted sum, decoder state s_i (RNN), output word prediction t_i via softmax and argmax, output word y_i, output word embedding E y_{i−1}]

Attention of Luong et al. (2015)
• Attention: α_ij = softmax(FF(s_{i−1}, h_j))
• Input context: c_i = Σ_j α_ij h_j
• Output word prediction: p(y_t | y_{<t}, x), computed by a softmax over the output vocabulary
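The following numpy sketch walks through the dot-product variant above for a single decoder step: scores a(s_{i−1}, h_j) = s_{i−1}^T h_j, softmax weights α_ij, and the input context c_i = Σ_j α_ij h_j. The dimensions and values are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup (assumption): 5 encoder states and one decoder state of size 8
H = rng.normal(size=(5, 8))      # encoder states h_1 ... h_5
s_prev = rng.normal(size=8)      # previous decoder hidden state s_{i-1}

# Dot-product attention scores: a(s_{i-1}, h_j) = s_{i-1}^T h_j
scores = H @ s_prev              # one score per input position

# Normalize to attention weights: alpha_ij = softmax_j(a(s_{i-1}, h_j))
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# Input context: c_i = sum_j alpha_ij h_j
c = alpha @ H
print(alpha.round(2), c.shape)   # weights sum to 1, context has the encoder state size
```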
Attention
• The attention mechanism is fundamentally unchanged
• The input context c_i is computed based on the association a(s_{i−1}, h_j) between
  – encoder state h_j
  – decoder state s_{i−1}
• Now, with a deep encoder of D layers and a deep decoder of D̂ layers
  – encoder state h_{D,j}
  – decoder state s_{D̂,i−1}
• Refinement when computing the context vector c_i: a shortcut connection between the encoder state h_{D,j} and the input word embedding x_j

transformer

Self Attention: Transformer
• Self-attention in the encoder
  – refine word representations based on relevant context words
  – relevance determined by self-attention
• Self-attention in the decoder
  – refine output word predictions based on relevant previous output words
  – relevance determined by self-attention
• Also regular attention to encoder states in the decoder
• Currently the most successful model (maybe with self-attention only in the encoder, but a regular recurrent decoder)

Encoder
[Figure: the Transformer encoder for the input "the house is big ." — input words x_j and input word positions j (0, 1, 2, ...) are embedded (E_w x_j, E_p j) and added to form positional input word embeddings E_w x_j + E_p j; self-attention with weighted sum and add & norm yields the input context with shortcut ĥ_j, and a feed-forward layer with add & norm yields the refined encoder state h_j; a sequence of such self-attention layers is stacked]

Self Attention Layer
• Given: input word representations h_j, packed into a matrix H = (h_1, ..., h_j)
• Self attention
  self-attention(H) = softmax(H H^T / √|h|) H
• Shortcut connection
  self-attention(h_j) + h_j
• Layer normalization
  ĥ_j = layer-normalization(self-attention(h_j) + h_j)
• Feed-forward step with ReLU activation function
  relu(W ĥ_j + b)
• Again, shortcut connection and layer normalization
  layer-normalization(relu(W ĥ_j + b) + ĥ_j)

Stacked Self Attention Layers
• Stack several such layers (say, D = 6)
• Start with the input word embedding
  h_{0,j} = E x_j
• Stacked layers
  h_{d,j} = self-attention-layer(h_{d−1,j})

Decoder
[Figure: the Transformer decoder — previous output words y_i and output word positions i are embedded and added to form positional output word embeddings s_i; self-attention with weighted sum, add & norm (normalization with shortcut), and a feed-forward layer refines the output states; attention over the encoder states h with weighted sum, add & norm (normalization with shortcut ŝ_i), and a feed-forward layer produces the refined decoder states s_i. The decoder computes attention-based representations of the output in several layers, initialized with the embeddings of the previous output words]

Self-Attention in the Decoder
• Same idea as in the encoder
• Output words are initially encoded by word embeddings s_i = E y_i
• Self-attention is computed over previous output words
  – the association of a word s_i is limited to words s_k (k ≤ i)
  – resulting representation s̃_i
  S̃ = self-attention(S) = softmax(S S^T / √|h|) S
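Below is a compact numpy sketch of one self-attention layer as described above: scaled dot-product self-attention, a shortcut connection with layer normalization, and a ReLU feed-forward step, plus an optional causal mask for the decoder case (word i attends only to positions k ≤ i). The single-head formulation follows the slides; the concrete sizes and the square feed-forward matrix are toy assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each row to zero mean and unit variance
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention_layer(H, W, b, causal=False):
    """One layer: softmax(H H^T / sqrt(|h|)) H, shortcut + norm, ReLU FF, shortcut + norm."""
    d = H.shape[-1]
    scores = H @ H.T / np.sqrt(d)                 # scaled dot products between all positions
    if causal:                                    # decoder: word i may only attend to k <= i
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    alpha = np.exp(scores - scores.max(-1, keepdims=True))
    alpha /= alpha.sum(-1, keepdims=True)         # softmax over positions
    attended = alpha @ H                          # weighted sum of word representations
    H_hat = layer_norm(attended + H)              # shortcut connection + layer normalization
    ff = np.maximum(H_hat @ W + b, 0)             # feed-forward step with ReLU
    return layer_norm(ff + H_hat)                 # again shortcut + layer normalization

# Toy usage (assumption): 5 words with representations of size 8
rng = np.random.default_rng(3)
H0 = rng.normal(size=(5, 8))
W, b = rng.normal(size=(8, 8)), np.zeros(8)
H1 = self_attention_layer(H0, W, b)               # encoder-style layer
S1 = self_attention_layer(H0, W, b, causal=True)  # decoder-style (masked) layer
print(H1.shape, S1.shape)
```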
Attention in the Decoder
• Original intuition of the attention mechanism: focus on relevant input words
• Computed with the dot product S̃ H^T
• Compute attention between the decoder states S̃ and the final encoder states H
  attention(S̃, H) = softmax(S̃ H^T / √|h|) H
• Note: the attention mechanism formally mirrors self-attention

Full Decoder
[Figure: the full model — the input words pass through a stack of encoder layers; the output word embeddings pass through a stack of decoder layers; softmax and argmax over each decoder state produce the output word predictions and output words]

Full Decoder
• Self-attention
  S̃ = softmax(S S^T / √|h|) S
  – shortcut connections
  – layer normalization
  – feed-forward layer
• Attention
  attention(S̃, H) = softmax(S̃ H^T / √|h|) H
  – shortcut connections
  – layer normalization
  – feed-forward layer
• Multiple stacked layers

Mix and Match
• The encoder may be multiple layers of either
  – recurrent neural networks
  – self-attention layers
• The decoder may be multiple layers of either
  – recurrent neural networks
  – self-attention layers
• Also possible: self-attention encoder, recurrent neural network decoder
• Even better: both self-attention and recurrent neural network, merged at the end
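To close, a short numpy sketch of the decoder's attention to the encoder, attention(S̃, H) = softmax(S̃ H^T / √|h|) H, from the Full Decoder slides above. The decoder states S̃ and final encoder states H here are toy stand-ins for the outputs of the self-attention layers.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy stand-ins (assumption): 4 decoder states S~ and 6 final encoder states H, size 8
S_tilde = rng.normal(size=(4, 8))
H = rng.normal(size=(6, 8))

# attention(S~, H) = softmax(S~ H^T / sqrt(|h|)) H
scores = S_tilde @ H.T / np.sqrt(H.shape[-1])   # each output position scores every input position
alpha = np.exp(scores - scores.max(-1, keepdims=True))
alpha /= alpha.sum(-1, keepdims=True)           # softmax over input positions
context = alpha @ H                             # one input context per output position
print(context.shape)                            # (4, 8): formally mirrors self-attention
```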