Alternative Architectures
• We introduced one translation model
  – attentional seq2seq model
  – core organizing feature: recurrent neural networks
• Other core neural architectures
  – convolutional neural networks
  – attention
• But first: look at various components of neural architectures

components

Components of Neural Networks
• Neural networks originally inspired by the brain
  – a neuron receives signals from other neurons
  – if sufficiently activated, it sends signals
  – feed-forward layers are roughly based on this
• Computation graph
  – any function possible, as long as it is partially differentiable
  – not limited by appeals to biological validity
• Deep learning may be a better name

Feed-Forward Layer
• Classic neural network component
• Given an input vector $x$: multiplication with a matrix $M$, then adding a bias vector $b$
    $Mx + b$
• Adding a non-linear activation function
    $y = \text{activation}(Mx + b)$
• Notation
    $y = \text{FF}_{\text{activation}}(x) = a(Mx + b)$

Feed-Forward Layer
• Historic neural network designs: several feed-forward layers
  – input layer
  – hidden layers
  – output layer
• Powerful tools for a wide range of machine learning problems
• Matrix multiplication with an added bias is also called an affine transform
  – appeals to its geometrical properties
  – straight lines in input still straight lines in output

Factored Decomposition
• One challenge: very large input and output vectors
• Number of parameters in matrix $M$ = $|x| \times |y|$
  ⇒ Need to reduce size of matrix
• Solution: first reduce to smaller representation
[Figure: direct mapping from x to y with matrix M, next to the factored mapping from x to v with matrix A and from v to y with matrix B]

Factored Decomposition: Math
• Intuition
  – given a high-dimensional vector $x$
  – first map into a lower-dimensional vector $v$ (matrix $A$)
  – then map to output vector $y$ (matrix $B$)
    $v = Ax$
    $y = Bv = BAx$
• Example
  – $|x|$ = 20,000, $|y|$ = 50,000 → $M$ has 20,000 × 50,000 = 1,000,000,000 parameters
  – $|v|$ = 100 → $A$ = 20,000 × 100 = 2,000,000, $B$ = 100 × 50,000 = 5,000,000
  – reduction from 1,000,000,000 to 7,000,000 parameters

Factored Decomposition: Interpretation
• Vector $v$ is a bottleneck feature
• Forced to capture salient features
• One example: word embeddings

basic mathematical operations

Concatenation
• Often multiple input vectors to a processing step
• For instance recurrent neural network
  – input word
  – previous state
• Combined in feed-forward layer
    $y = \text{activation}(M_1 x_1 + M_2 x_2 + b)$
• Another view
    $x = \text{concat}(x_1, x_2)$
    $y = \text{activation}(Mx + b)$
• Splitting hairs here, but concatenation is useful generally

Addition
• Adding vectors: very simplistic, but often done
• Example: compute sentence embedding $s$ from word embeddings $w_1, ..., w_n$
    $s = \sum_{i=1}^{n} w_i$
• Reduces a varying-length sentence representation into a fixed-size vector
• Maybe weight the words, e.g., by attention

Multiplication
• Another elementary mathematical operation
• Three ways to multiply vectors
  – element-wise multiplication
    $v \odot u = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} \odot \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = \begin{pmatrix} v_1 \times u_1 \\ v_2 \times u_2 \end{pmatrix}$
  – dot product
    $v \cdot u = v^T u = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}^T \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = v_1 \times u_1 + v_2 \times u_2$
    used for simple version of attention mechanism
  – third possibility: $v u^T$ (outer product), not commonly done

Maximum
• Goal: reduce the dimensionality of representation
• Example: detect if a face is in image
  – any region of image may have positive match
  – represent different regions with elements in a vector
  – maximum value: any region has a face
• Max pooling
  – given: $n$-dimensional vector
  – goal: reduce to $\frac{n}{k}$-dimensional vector
  – method: break up vector into blocks of $k$ elements, map each block to a single value (its maximum)

Max Out
• Max out
  – first branch out into multiple feed-forward layers
    $W_1 x + b_1$
    $W_2 x + b_2$
  – element-wise maximum
    $\text{maxout}(x) = \max(W_1 x + b_1, W_2 x + b_2)$
• ReLU activation is a maxout layer: maximum of feed-forward layer and 0
    $\text{ReLU}(x) = \max(Wx + b, 0)$

processing sequences

Recurrent Neural Networks
• Already described recurrent neural networks at length
  – propagate state $s$
  – over time steps $t$
  – receiving an input $x_t$ at each turn
    $s_t = f(s_{t-1}, x_t)$
    (state may be computed as a feed-forward layer)
• More successful
  – gated recurrent units (GRU)
  – long short-term memory cells (LSTM)
• Good fit for sequences, like words in a sentence
  – humans also
    receive word by word
  – most recent words most relevant → closer to current state
• But computationally problematic: very long computation chains

Alternative Sequence Processing
• Convolutional neural networks
• Attention

convolutional neural networks

Convolutional Neural Networks (CNN)
• Popular in image processing
• Regions of an image are reduced into increasingly smaller representations
  – matrix spanning part of image reduced to single value
  – overlapping regions

CNNs for Language
[Figure: word embeddings (Embed) combined by successive layers of feed-forward (FF) units]
• Map words into fixed-sized sentence representation

Hierarchical Structure and Language
• Syntactic and semantic theories of language
  – language is recursive
  – central: verb
  – dependents: subject, objects, adjuncts
  – their dependents: adjectives, determiners
  – also nested: relative clauses
• How to compute sentence embeddings is an active research topic

Convolutional Neural Networks
• Key step
  – take a high-dimensional input representation
  – map to lower-dimensional representation
• Several repetitions of this step
• Examples
  – map 50×50 pixel area into scalar value
  – combine 3 or more neighboring words into a single vector
• Machine translation
  – encode input sentence into single vector
  – decode this vector into a sentence in the output language

attention

Attention
• Machine translation is a structured prediction task
  – output is not a single label
  – output structure needs to be built, word by word
• Relevant information for each word prediction varies
• Human translators pay attention to different parts of the input sentence when translating
  ⇒ Attention mechanism

Computing Attention
• Attention mechanism in neural translation model (Bahdanau et al., 2015)
  – previous hidden state $s_{i-1}$
  – input word representation $h_j$ (the encoder state)
  – trainable parameters $b$, $W_a$, $U_a$, $v_a$
    $a(s_{i-1}, h_j) = v_a^T \tanh(W_a s_{i-1} + U_a h_j + b)$
• Other ways to
  compute attention
  – Dot product: $a(s_{i-1}, h_j) = s_{i-1}^T h_j$
  – Scaled dot product: $a(s_{i-1}, h_j) = \frac{1}{\sqrt{|h_j|}} s_{i-1}^T h_j$
  – General: $a(s_{i-1}, h_j) = s_{i-1}^T W_a h_j$
  – Local: $a(s_{i-1}) = W_a s_{i-1}$

Attention of Luong et al. (2015)
• Luong et al. (2015) demonstrate good results with the dot product
    $a(s_{i-1}, h_j) = s_{i-1}^T h_j$
• No trainable parameters
• Additional changes
• Currently more popular

Attention of Luong et al. (2015)
[Figure: side-by-side comparison of Luong et al. (2015) and Bahdanau et al. (2015), showing encoder state $h_j$, attention $\alpha_{ij}$, input context $c_i$, decoder state $s_i$, output word prediction $t_i$, output word $y_i$, and output word embedding $E y_{i-1}$]

Attention of Luong et al. (2015)
• Attention
    $\alpha_{ij} = \text{softmax} \; \text{FF}(s_{i-1}, h_j)$
• Input context
    $c_i = \sum_j \alpha_{ij} h_j$
• Output word
    p(yt|y 0, d ≤ ˆD

Attention
• Attention mechanism fundamentally unchanged
• Input context $c_i$ computed based on association $a(s_{i-1}, h_j)$ between
  – encoder state $h_j$
  – decoder state $s_{i-1}$
• Now
  – encoder state $h_{D,j}$
  – decoder state $s_{\hat{D},i-1}$
• Refinement when computing the context vector $c_i$: shortcut connection between encoder state $h_{D,j}$ and input word embedding $x_j$

transformer

Self Attention: Transformer
• Self-attention in encoder
  – refine word representation based on relevant context words
  – relevance determined by self-attention
• Self-attention in decoder
  – refine output word predictions based on relevant previous output words
  – relevance determined by self-attention
• Also regular attention to encoder states in decoder
• Currently most successful model (maybe only with self-attention in encoder, but regular recurrent decoder)

Encoder
[Figure, encoder layer: word and position embeddings ($E_w x_j$, $E_p j$) of the input words "the house is big ." feed self-attention and a weighted sum, giving the input context]
[Figure, continued: input word $x_j$ and input word position $j$ (0 to 6) are embedded and added into the positional input word embedding $E_w x_j + E_p j$; add & norm gives the input context with shortcut $\hat{h}_j$; feed-forward (FF) with add & norm gives the encoder state refinement $h_j$]
Sequence of self-attention layers

Self Attention Layer
• Given: input word representations $h_j$, packed into a matrix $H = (h_1, ..., h_j)$
• Self attention
    $\text{self-attention}(H) = \text{softmax}\left(\frac{H H^T}{\sqrt{|h|}}\right) H$
• Shortcut connection
    $\text{self-attention}(h_j) + h_j$
• Layer normalization
    $\hat{h}_j = \text{layer-normalization}(\text{self-attention}(h_j) + h_j)$
• Feed-forward step with ReLU activation function
    $\text{relu}(W \hat{h}_j + b)$
• Again, shortcut connection and layer normalization
    $\text{layer-normalization}(\text{relu}(W \hat{h}_j + b) + \hat{h}_j)$

Stacked Self Attention Layers
• Stack several such layers (say, $D = 6$)
• Start with input word embedding
    $h_{0,j} = E x_j$
• Stacked layers
    $h_{d,j} = \text{self-attention-layer}(h_{d-1,j})$

Decoder
[Figure, decoder layer: word and position embeddings of the output words "the house is big ." feed the self-attention]
[Figure, continued: output word $y_i$ and output word position $i$ (0 to 6) are embedded and added into the positional output word embedding $s_i$; a weighted sum gives the output context; add & norm (normalization with shortcut) and feed-forward (FF) give the output state refinement; attention over the encoder states $h$ and a weighted sum give the context; add & norm (normalization with shortcut) gives $\hat{s}_i$; feed-forward (FF) with add & norm gives the decoder state refinement $s_i$]
Decoder computes attention-based representations of the output in several layers, initialized with the embeddings of the previous output words

Self-Attention in the Decoder
• Same idea as in the encoder
• Output words are initially encoded by word embeddings $s_i = E y_i$
• Self attention is computed over previous output words
  – association of a word $s_i$ is limited to words $s_k$ ($k \le i$)
  – resulting representation $\tilde{s}_i$, packed into $\tilde{S}$
    $\tilde{S} = \text{self-attention}(S) = \text{softmax}\left(\frac{S S^T}{\sqrt{|h|}}\right) S$

Attention in the Decoder
• Original intuition of attention mechanism: focus on relevant input words
• Computed with dot product $\tilde{S} H^T$
• Compute attention between the decoder states $\tilde{S}$ and the final encoder states $H$
    $\text{attention}(\tilde{S}, H) = \text{softmax}\left(\frac{\tilde{S} H^T}{\sqrt{|h|}}\right) H$
• Note: attention mechanism formally mirrors self-attention

Full Decoder
[Figure: input words pass through stacked encoder layers; output word embeddings pass through stacked decoder layers, followed by softmax (output word prediction) and argmax (output word)]

Full Decoder
• Self-attention
    $\tilde{S} = \text{self-attention}(S) = \text{softmax}\left(\frac{S S^T}{\sqrt{|h|}}\right) S$
  – shortcut connections
  – layer normalization
  – feed-forward layer
• Attention
    $\text{attention}(\tilde{S}, H) = \text{softmax}\left(\frac{\tilde{S} H^T}{\sqrt{|h|}}\right) H$
  – shortcut connections
  – layer normalization
  – feed-forward layer
• Multiple stacked layers

Mix and Match
• Encoder may be multiple layers of either
  – recurrent neural networks
  – self-attention layers
• Decoder may be multiple layers of either
  – recurrent neural networks
  – self-attention layers
• Also possible: self-attention encoder, recurrent neural network decoder
• Even better: both self-attention and recurrent neural network, merged at the end
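
Sketch: attention steps of a decoder layer
The two decoder formulas above (self-attention over previous output words, then attention over the final encoder states) can be tried out on toy numbers. The following is only a minimal sketch in plain Python, not the implementation behind these slides. It assumes the simplified form used in the formulas: no learned projection matrices, a layer normalization without learned gain or bias, and no feed-forward step. The function names and the toy values for H and S are invented for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, causal=False):
    """Row-wise softmax(Q K^T / sqrt(|h|)) K; the keys double as values.

    causal=True lets row i see only keys k <= i (decoder self-attention).
    """
    width = len(keys[0])
    output = []
    for i, q in enumerate(queries):
        visible = keys[: i + 1] if causal else keys
        # Scaled dot product between the query and every visible key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(width)
                  for k in visible]
        weights = softmax(scores)
        # Weighted sum of the visible vectors.
        output.append([sum(w * k[d] for w, k in zip(weights, visible))
                       for d in range(width)])
    return output


def layer_norm(vec, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (assumed: no learned gain or bias)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def add_and_norm(sublayer_out, inputs):
    """Shortcut connection followed by layer normalization, per position."""
    return [layer_norm([o + x for o, x in zip(out_vec, in_vec)])
            for out_vec, in_vec in zip(sublayer_out, inputs)]


# Toy values, invented for illustration only.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # final encoder states
S = [[0.5, -0.5], [1.0, 2.0]]             # output word embeddings

S_tilde = add_and_norm(attention(S, S, causal=True), S)  # decoder self-attention
context = add_and_norm(attention(S_tilde, H), S_tilde)   # attention over encoder states
```

Calling `attention(H, H)` without the causal flag gives the encoder-side self-attention from the Self Attention Layer slide, which is why a single function covers all three uses in this sketch.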