Recurrent Networks

Contains material from:
Andrej Karpathy's blog: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Christopher Olah's blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Geoffrey Hinton's lecture: https://www.cs.toronto.edu/~hinton/csc2535/notes/lec10new.pdf

Recurrent Neural Network – vector notation

A simple example of a recurrent MLP:
Input: x
Hidden (state): h
Output: y
Matrices U, W, V

h = σ(Ux + Wh)
Here σ is an activation function (applied component-wise), typically sigmoidal or ReLU.

y = σ(Vh)
Here σ is typically softmax (so that we get probabilities) or sigmoidal.

In what follows I will use σ to denote an arbitrary activation function; keep in mind that each neuron may have a different activation function.

Recurrent Neural Network – sequence modeling

Input sequence: x_1, ..., x_T of vectors.
Output sequence: y_1, ..., y_T of vectors obtained by
h_t = σ(U x_t + W h_{t−1})
y_t = σ(V h_t)

RNN – Component-wise

Denote:
x = (x_1, ..., x_M)
h = (h_1, ..., h_H)
y = (y_1, ..., y_N)

For all k = 1, ..., H:
h_k = σ( ∑_{k'=1}^{M} U_{kk'} x_{k'} + ∑_{k'=1}^{H} W_{kk'} h_{k'} )

For all k = 1, ..., N:
y_k = σ( ∑_{k'=1}^{H} V_{kk'} h_{k'} )

RNN – Component-wise – Unfolding

Input sequence: x = x_1, ..., x_T with x_t = (x_{t1}, ..., x_{tM})
Hidden sequence: h = h_0, h_1, ..., h_T with h_t = (h_{t1}, ..., h_{tH})
We have h_0 = (0, ..., 0) and
h_{tk} = σ( ∑_{k'=1}^{M} U_{kk'} x_{tk'} + ∑_{k'=1}^{H} W_{kk'} h_{(t−1)k'} )

Output sequence: y = y_1, ..., y_T with y_t = (y_{t1}, ..., y_{tN})
We have
y_{tk} = σ( ∑_{k'=1}^{H} V_{kk'} h_{tk'} )

RNN – Comments

h_t is the memory of the network; it captures what happened in all previous steps (with decaying quality).
The RNN shares the weights U, V, W across all steps. Note the similarity to convolutional networks, where the weights were shared spatially over images; here they are shared temporally over sequences.
An RNN can deal with sequences of variable length. Compare with the MLP, which accepts only fixed-dimension vectors on input.

RNN – language modelling (toy example)

An RNN generating text character by character. It models the probability distribution of the next character given a sequence, and learns this distribution from a huge number of sequences.
For simplicity, assume a four-letter alphabet: h, e, l, o.
Encode letters using one-hot encoding, e.g. e is (0, 1, 0, 0).
Output layer: softmax
Error: cross-entropy
Training: gradient descent (simply unfold in time, see later)

Deeper RNN

Two hidden layers ... may be an arbitrary number.

... and deeper

Binary addition – another toy

An MLP can be trained to do binary addition, but there are obvious regularities that it cannot capture efficiently:
We must decide in advance the maximum number of digits in each number.
The processing applied to the beginning of a long number does not generalize to the end of the long number because it uses different weights.
As a result, feedforward nets do not generalize well on the binary addition task.

Binary addition – another toy

A finite transducer: in every step it reads a pair of bits from {0, 1}² and prints an output bit from {0, 1}. The network should imitate the activity of this automaton. Three hidden neurons should be enough.

Binary addition – another toy

The network has two input neurons and one output neuron. Three hidden neurons are sufficient for binary addition.

Binary addition – another toy

The RNN learns four distinct patterns of activity for the 3 hidden neurons. These patterns correspond to the nodes in the finite state automaton.
Do not confuse units in a neural network with nodes in a finite state automaton. Nodes are like activity vectors. The automaton is restricted to be in exactly one state at each time; the hidden units are restricted to have exactly one vector of activity at each time.
A recurrent network can emulate a finite state automaton, but it is exponentially more powerful: with N hidden neurons it has 2^N possible binary activity vectors (but only N² weights).
This is important when the input stream has two separate things going on at once. A finite state automaton needs to square its number of states; an RNN needs only to double its number of units.
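To make this concrete, below is a small hand-crafted sketch (my own construction, not taken from the lecture): a recurrent network of the form h_t = σ(U x_t + W h_{t−1}), y_t = σ(V h_t) with three hidden neurons that imitates the binary-addition transducer, reading one pair of bits per step (least significant bits first) and printing the sum bit. I use a hard threshold in place of a smooth σ and add bias terms, which the lecture's equations omit; a trained RNN would arrive at an equivalent solution.

```python
# Hand-crafted sketch (my own construction, not from the slides): an RNN with
# three threshold hidden units that imitates the binary-addition transducer.
# It reads pairs of bits, least significant first, and prints the sum bit;
# the carry lives in hidden unit h2.
import numpy as np

def step(z):                                      # hard threshold standing in for sigma
    return (z > 0).astype(float)

# h_t = step(U x_t + W h_{t-1} + b_h) with h1 = [s >= 1], h2 = [s >= 2], h3 = [s >= 3],
# where s = a_t + b_t + carry and the carry is h2 from the previous step.
U   = np.array([[1., 1.], [1., 1.], [1., 1.]])    # both input bits feed every hidden unit
W   = np.array([[0., 1., 0.], [0., 1., 0.], [0., 1., 0.]])  # recurrent input = old h2 (carry)
b_h = np.array([-0.5, -1.5, -2.5])                # thresholds 1, 2, 3

# y_t = step(V h_t + b_y): the sum bit is 1 iff s is odd, i.e. h1 - h2 + h3 >= 1
V   = np.array([[1., -1., 1.]])
b_y = np.array([-0.5])

def add(a_bits, b_bits):
    """a_bits, b_bits: lists of bits, least significant first, same length."""
    h, out = np.zeros(3), []
    for a, b in zip(a_bits, b_bits):
        x = np.array([a, b], dtype=float)
        h = step(U @ x + W @ h + b_h)             # h_t = sigma(U x_t + W h_{t-1})
        out.append(int(step(V @ h + b_y)[0]))     # y_t = sigma(V h_t)
    return out

# 6 (110) + 7 (111) = 13 (1101); pad with a trailing zero pair for the final carry
print(add([0, 1, 1, 0], [1, 1, 1, 0]))            # -> [1, 0, 1, 1]  (LSB first)
```

The carry is exactly the activity of the second hidden unit, i.e. one of the "patterns of activity" the automaton analogy above refers to.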
Machine translation with RNN

Variants of RNN

one to one: Standard MLP, single vector in, single vector out.
one to many: Single vector in, sequence out. Image captioning: image in, sentence out.
many to one: Sequence in, single vector out. Sentiment analysis: sentence in, sentiment (positive/negative) out.

Variants of RNN

many to many: Sequence in, sequence out. Machine translation: English sentence in, Czech sentence out (the two may have different lengths).
many to many (synced): Synced sequences in and out. Video classification, where we wish to label each frame of the video.

Image recognition: recurrent attention model

The recurrent network tells a "glimpse" network where to look. The state of the recurrent network changes based on the location and the actual perception at that location.

RNN – Learning

We consider a fixed training example (x, d) where
x = x_1, ..., x_T is a given input sequence, here x_t = (x_{t1}, ..., x_{tM}),
d = d_1, ..., d_T is a given sequence of desired values, here d_t = (d_{t1}, ..., d_{tN}).

Unfolding the RNN for x gives a sequence of hidden states h_0, h_1, ..., h_T, each h_t = (h_{t1}, ..., h_{tH}), here h_0 = (0, ..., 0),
and a sequence of output values y_1, ..., y_T, each y_t = (y_{t1}, ..., y_{tN}).

Error function (e.g. squared error): E_{(x,d)} = ∑_{t=1}^{T} ∑_{k=1}^{N} (y_{tk} − d_{tk})²

Learning – backpropagation through time

The RNN training algorithm is easy to obtain:
Unfold the RNN for several time steps.
Consider it to be a (deep) MLP.
Train with gradient descent.
However, one has to keep in mind that U, W, V are shared across all time instants.
To simplify training and to connect with MLPs and convolutional networks, we treat all neurons of the unfolding abstractly and denote them by indices i, j, etc., as before for the MLP.

Recurrent networks – unfolding

Let us make the neurons of the unfolding anonymous. Denote:
X the set of input neurons of the unfolding
Y the set of output neurons of the unfolding
Z the set of all neurons of the unfolding (X, Y ⊆ Z)
Individual neurons of the unfolding are denoted by indices i, j.
ξ_j is the inner potential of neuron j after the computation stops.
σ_j is the activation function of j.
y_j is the output of neuron j after the computation stops.
w_{ji} is the weight of the connection from i to j.
j← is the set of all i such that j is adjacent from i (i.e. there is an arc to j from i).
j→ is the set of all i such that j is adjacent to i (i.e. there is an arc from j to i).
j_share is the set of neurons sharing weights with j; it consists of all incarnations of the same neuron of the RNN in different time instants t.

Gradient descent (single training example)

Consider a single training example (x, d). In the case of SGD, minibatches of such pairs are used and their errors averaged.
The algorithm computes a sequence of weight vectors w^(0), w^(1), w^(2), ...
Do not forget that these are weights of the RNN, possibly shared by several neurons of the unfolding.
The weights in w^(0) are randomly initialized to values close to 0.
In step t + 1 (here t = 0, 1, 2, ...), the weights w^(t+1) are computed as follows:
w^(t+1) = w^(t) + Δw^(t)
where
Δw^(t) = −ε(t) · ∇E_{(x,d)}(w^(t))
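As a sanity check on the "unfold and treat it as a deep MLP" idea, here is a rough sketch (my own illustration, not the lecture's code): it computes E_{(x,d)} over the unfolding and performs the update w^(t+1) = w^(t) − ε · ∇E_{(x,d)}(w^(t)), estimating the gradient by finite differences so that the sharing of U, W, V across time stays explicit. The choice of tanh for σ, the layer sizes, and the random data are my own; the backprop formulas on the following slides compute the same gradient far more efficiently.

```python
# Rough sketch (my own illustration): gradient descent on the unfolded RNN
# with shared U, W, V, using finite-difference gradients of
#   E_(x,d) = sum_t sum_k (y_tk - d_tk)^2.
import numpy as np

def unfold_error(U, W, V, xs, ds):
    """Unfold the RNN over the whole sequence and return E_(x,d)."""
    h = np.zeros(W.shape[0])                  # h_0 = (0, ..., 0)
    E = 0.0
    for x, d in zip(xs, ds):
        h = np.tanh(U @ x + W @ h)            # h_t = sigma(U x_t + W h_{t-1}), sigma = tanh here
        y = np.tanh(V @ h)                    # y_t = sigma(V h_t)
        E += np.sum((y - d) ** 2)
    return E

def numerical_gradient(param, error_fn, eps=1e-5):
    """Finite-difference estimate of dE/dparam (param is perturbed in place and restored)."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps; E_plus = error_fn()
        param[idx] = old - eps; E_minus = error_fn()
        param[idx] = old
        grad[idx] = (E_plus - E_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
M, H, N, T = 2, 3, 1, 5                       # illustrative sizes
U, W, V = (rng.normal(0, 0.3, s) for s in [(H, M), (H, H), (N, H)])
xs = [rng.normal(size=M) for _ in range(T)]   # random input sequence
ds = [np.array([0.5])] * T                    # arbitrary desired outputs

lr = 0.1
print("initial error:", unfold_error(U, W, V, xs, ds))
for _ in range(200):                          # w <- w - eps * grad E, with U, W, V shared over time
    for P in (U, W, V):
        P -= lr * numerical_gradient(P, lambda: unfold_error(U, W, V, xs, ds))
print("final error:", unfold_error(U, W, V, xs, ds))
```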
Backprop

∇E_{(x,d)}(w^(t)) is the vector of all partial derivatives ∂E_{(x,d)}/∂w_{ji}.

First, switch from derivatives w.r.t. w_{ji} to derivatives w.r.t. y_j:
∂E_{(x,d)}/∂w_{ji} = ∑_{r ∈ j_share} ∂E_{(x,d)}/∂y_r · σ'_r(ξ_r) · y_i

For every j ∈ Y:
∂E_{(x,d)}/∂y_j = y_j − d_j
(This holds for the mean squared error; for other error functions the derivative w.r.t. the outputs will be different.)

For every j ∈ Z \ Y:
∂E_{(x,d)}/∂y_j = ∑_{r ∈ j→} ∂E_{(x,d)}/∂y_r · σ'_r(ξ_r) · w_{rj}

In the notation of RNN

∂E_{(x,d)}/∂V_{kk'} = ∑_{t=1}^{T} ∂E_{(x,d)}/∂y_{tk} · σ' · h_{tk'}
∂E_{(x,d)}/∂W_{kk'} = ∑_{t=1}^{T} ∂E_{(x,d)}/∂h_{tk} · σ' · h_{(t−1)k'}
∂E_{(x,d)}/∂U_{kk'} = ∑_{t=1}^{T} ∂E_{(x,d)}/∂h_{tk} · σ' · x_{tk'}

Backprop:
∂E_{(x,d)}/∂y_{tk} = y_{tk} − d_{tk}
∂E_{(x,d)}/∂h_{tk} = ∑_{k'=1}^{N} ∂E_{(x,d)}/∂y_{tk'} · σ' · V_{k'k} + ∑_{k'=1}^{H} ∂E_{(x,d)}/∂h_{(t+1)k'} · σ' · W_{k'k}
(Here σ' denotes the derivative of the activation function evaluated at the corresponding inner potential.)

Long-term dependencies

∂E_{(x,d)}/∂h_{tk} = ∑_{k'=1}^{N} ∂E_{(x,d)}/∂y_{tk'} · σ' · V_{k'k} + ∑_{k'=1}^{H} ∂E_{(x,d)}/∂h_{(t+1)k'} · σ' · W_{k'k}

Unless W_{kk'} · σ' ≈ 1, the gradient either vanishes or explodes.
For large T (long-term dependencies), the "deeper" gradients become either too small or too large.
A solution: LSTM

LSTM

(y_t =) h_t = o_t ◦ σ_h(c_t)                output
c_t = f_t ◦ c_{t−1} + i_t ◦ C̃_t             memory
C̃_t = σ_c(W_C · h_{t−1} + U_C · x_t)        new memory contents
o_t = σ_g(W_o · h_{t−1} + U_o · x_t)         output gate
f_t = σ_g(W_f · h_{t−1} + U_f · x_t)         forget gate
i_t = σ_g(W_i · h_{t−1} + U_i · x_t)         input gate

◦ is the component-wise product.
σ_h, σ_c are originally hyperbolic tangents (but in my opinion they can be whatever you would put into the output and hidden layers, respectively).
σ_g is originally the logistic sigmoid.

RNN vs LSTM

LSTM step by step:
Forget gate: f_t = σ_g(W_f · h_{t−1} + U_f · x_t)
Input gate and new memory contents: i_t = σ_g(W_i · h_{t−1} + U_i · x_t), C̃_t = σ_c(W_C · h_{t−1} + U_C · x_t)
Memory update: c_t = f_t ◦ c_{t−1} + i_t ◦ C̃_t
Output: o_t = σ_g(W_o · h_{t−1} + U_o · x_t), (y_t =) h_t = o_t ◦ σ_h(c_t)

Fun with LSTM – Shakespeare

An LSTM generating new Shakespeare character by character!
All works of Shakespeare concatenated into a single (4.4 MB) file.
A 3-layer RNN with 512 hidden neurons in each layer.

VIOLA:
Why, Salisbury must find his flesh and thought
That which I am not aps, not a man and in fire,
To show the reining of the raven and the wars
To grace my hand reproach within, and not a fair are hand,
That Caesar and my goodly father's world;
When I was heaven of presence and our fleets,
We spare with hours, but cut thy council I am great,
Murdered and by thy master's ready there
My power to give thee but so much as hell:
Some service in the noble bondman here,
Would show him to her wine.

KING LEAR:
O, if you were a feeble sight, the courtesy of your law,
Your sight and several breath, will wear the gods
With his heads, and my hands are wonder'd at the deeds,
So drop upon your lordship's head, and your opinion
Shall be against your honour.
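The Shakespeare generator above stacks three LSTM layers. A single step of one such layer, following the cell equations from the LSTM slides, can be sketched directly (a minimal illustration; the layer sizes and random weights are my own choices, and bias terms, which practical implementations add, are omitted as in the slide equations):

```python
# Minimal sketch of one LSTM step, following the equations above
# (shapes and random initialization are illustrative, not from the lecture).
import numpy as np

def sigmoid(z):                                 # sigma_g: logistic sigmoid
    return 1.0 / (1.0 + np.exp(-z))

H, M = 4, 3                                     # hidden size, input size (arbitrary)
rng = np.random.default_rng(0)
Wf, Wi, Wo, WC = (rng.normal(0, 0.1, (H, H)) for _ in range(4))
Uf, Ui, Uo, UC = (rng.normal(0, 0.1, (H, M)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    f_t = sigmoid(Wf @ h_prev + Uf @ x_t)       # forget gate
    i_t = sigmoid(Wi @ h_prev + Ui @ x_t)       # input gate
    o_t = sigmoid(Wo @ h_prev + Uo @ x_t)       # output gate
    C_tilde = np.tanh(WC @ h_prev + UC @ x_t)   # new memory contents (sigma_c = tanh)
    c_t = f_t * c_prev + i_t * C_tilde          # memory: keep what f allows, add what i allows
    h_t = o_t * np.tanh(c_t)                    # output (sigma_h = tanh), gated by o
    return h_t, c_t

# Run a short random input sequence through the cell:
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, M)):
    h, c = lstm_step(x, h, c)
print(h)
```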
Fun with LSTM – Wikipedia

The Hutter Prize 100 MB dataset of raw Wikipedia data (96 MB for training, the rest for validation).

Naturalism and decision for the majority of Arab countries' capitalide was grounded by the Irish language by [[John Clair]], [[An Imperial Japanese Revolt]], associated with Guangzham's sovereignty. His generals were the powerful ruler of the Portugal in the [[Protestant Immineners]], which could be said to be directly in Cantonese Communication, which followed a ceremony and set inspired prison, training. The emperor travelled back to [[Antioch, Perth, October 25|21]] to note, the Kingdom of Costa Rica, unsuccessful fashioned the [[Thrales]], [[Cynth's Dajoard]], known in western [[Scotland]], near Italy to the conquest of India with the conflict. Copyright was the succession of independence in the slop of Syrian influence that was a famous German movement based on a more popular servicious, non-doctrinal and sexual power post. Many governments recognize the military housing of the [[Civil Liberalization and Infantry Resolution 265 National Party in Hungary]], that is sympathetic to be to the [[Punjab Resolution]] (PJS)[http://www.humah.yahoo.com/guardian.cfm/7754800786d17551963s89.htm

Randomly hallucinated (correct!!) XML:

Antichrist 865 15900676 2002-08-03T18:14:12Z Paris 23 Automated conversion #REDIRECT [[Christianity]]

Fun with LSTM – LaTeX

An RNN trained on an algebraic geometry book (http://stacks.math.columbia.edu/): the raw LaTeX source (a 16 MB file) was used to train a multilayer LSTM.
The resulting sampled LaTeX almost compiles! The authors had to step in and fix a few issues manually, but then you get plausible-looking math.

Linux source code

Trained on all the source and header files found in the Linux repo on GitHub, concatenated into a single giant file (474 MB of C code).
A 3-layer LSTM with approx. 10 million parameters.

Evolution of Shakespeare

Samples after 100, 300, 500, 700, 1200, and 2000 iterations.
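All of the samples above (Shakespeare, Wikipedia, XML, LaTeX, C code) are produced in the same way: the trained network is run one character at a time, the next character is sampled from the softmax output, and the sampled character is fed back in as the next input. Below is a sketch of this sampling loop (my own illustration; the recurrent step is an untrained toy RNN standing in for the trained multilayer LSTM, so the printed output is gibberish).

```python
# Sketch of the character-sampling loop (my own illustration, not the lecture's code):
# feed the last sampled character back into the network and draw the next one
# from the softmax distribution over the alphabet.
import numpy as np

alphabet = list("helo ")                     # tiny illustrative alphabet
M = N = len(alphabet)
H = 16
rng = np.random.default_rng(0)
U = rng.normal(0, 0.3, (H, M))
W = rng.normal(0, 0.3, (H, H))
V = rng.normal(0, 0.3, (N, H))

def softmax(z, temperature=1.0):
    z = z / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def sample(seed_char, length, temperature=1.0):
    h = np.zeros(H)
    c, out = seed_char, [seed_char]
    for _ in range(length):
        x = np.zeros(M)
        x[alphabet.index(c)] = 1.0           # one-hot encoding of the current character
        h = np.tanh(U @ x + W @ h)           # <- an untrained toy RNN; a trained LSTM goes here
        p = softmax(V @ h, temperature)      # distribution over the next character
        c = rng.choice(alphabet, p=p)        # sample rather than take the argmax
        out.append(c)
    return "".join(out)

print(sample("h", 40))
```

Sampling with a temperature below 1 makes the output more conservative (closer to the argmax character), while a higher temperature makes it more diverse and noisier.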