Recurrent Networks

Contains material from:
Andrej Karpathy's blog: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Christopher Olah's blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Geoffrey Hinton's lecture: https://www.cs.toronto.edu/~hinton/csc2535/notes/lec10new.pdf

Recurrent Neural Network – vector notation

A simple example of a recurrent MLP:
Input: x
Hidden (state): h
Output: y
Matrices U, W, V

h = σ(Ux + Wh)
Here σ is an activation function (applied component-wise), typically sigmoidal or ReLU.

y = σ(Vh)
Here σ is typically softmax (so that we get probabilities) or sigmoidal.

In what follows I will use σ to denote an arbitrary activation function; keep in mind that each neuron may have a different activation function.

Recurrent Neural Network – sequence modeling

Input sequence: x_1, ..., x_T of vectors.
Output sequence: y_1, ..., y_T of vectors obtained by
h_t = σ(U x_t + W h_{t−1})
y_t = σ(V h_t)

RNN – Component-wise

Denote:
x = (x_1, ..., x_M)
h = (h_1, ..., h_H)
y = (y_1, ..., y_N)

For all k = 1, ..., H:
h_k = σ( ∑_{k'=1}^{M} U_{kk'} x_{k'} + ∑_{k'=1}^{H} W_{kk'} h_{k'} )

For all k = 1, ..., N:
y_k = σ( ∑_{k'=1}^{H} V_{kk'} h_{k'} )

RNN – Component-wise – Unfolding

Input sequence: x = x_1, ..., x_T with x_t = (x_{t1}, ..., x_{tM})
Hidden sequence: h = h_0, h_1, ..., h_T with h_t = (h_{t1}, ..., h_{tH})
We have h_0 = (0, ..., 0) and
h_{tk} = σ( ∑_{k'=1}^{M} U_{kk'} x_{tk'} + ∑_{k'=1}^{H} W_{kk'} h_{(t−1)k'} )

Output sequence: y = y_1, ..., y_T with y_t = (y_{t1}, ..., y_{tN})
We have
y_{tk} = σ( ∑_{k'=1}^{H} V_{kk'} h_{tk'} )

RNN – Comments

h_t is the memory of the network; it captures what happened in all previous steps (with decaying quality).
The RNN shares the weights U, V, W across all steps. Note the similarity to convolutional networks, where the weights were shared spatially over images; here they are shared temporally over sequences.
An RNN can deal with sequences of variable length. Compare with the MLP, which accepts only fixed-dimension vectors on input.

RNN – language modelling (toy example)

An RNN generating text character by character. It models the probability distribution of the next character given a sequence, and learns this distribution from a huge number of sequences.
For simplicity, assume a four-letter alphabet: h, e, l, o.
Encode letters using one-hot encoding, e.g. e is (0, 1, 0, 0).
Output layer: softmax
Error: cross-entropy
Training: gradient descent (simply unfold in time, see later)

Deeper RNN

Two hidden layers ... may be an arbitrary number.

... and deeper

Binary addition – another toy

An MLP can be trained to do binary addition, but there are obvious regularities that it cannot capture efficiently:
We must decide in advance the maximum number of digits in each number.
The processing applied to the beginning of a long number does not generalize to the end of the long number because it uses different weights.
As a result, feedforward nets do not generalize well on the binary addition task.

Binary addition – another toy

A finite transducer: in every step it reads a pair of bits from {0, 1}² and prints an output bit from {0, 1}. The network should imitate the activity of this automaton. Three hidden neurons should be enough.

Binary addition – another toy

The network has two input neurons and one output neuron. Three hidden neurons are sufficient for binary addition.

Binary addition – another toy

The RNN learns four distinct patterns of activity for the 3 hidden neurons. These patterns correspond to the nodes in the finite state automaton.
Do not confuse units in a neural network with nodes in a finite state automaton. Nodes are like activity vectors. The automaton is restricted to be in exactly one state at each time; the hidden units are restricted to have exactly one vector of activity at each time.
A recurrent network can emulate a finite state automaton, but it is exponentially more powerful: with N hidden neurons it has 2^N possible binary activity vectors (but only N² weights).
This is important when the input stream has two separate things going on at once. A finite state automaton needs to square its number of states; an RNN needs only to double its number of units.
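To make this concrete, below is a small hand-crafted sketch (my own construction, not taken from the lecture): a recurrent network of the form h_t = σ(U x_t + W h_{t−1}), y_t = σ(V h_t) with three hidden neurons that imitates the binary-addition transducer, reading one pair of bits per step (least significant bits first) and printing the sum bit. I use a hard threshold in place of a smooth σ and add bias terms, which the lecture's equations omit; a trained RNN would arrive at an equivalent solution.

```python
# Hand-crafted sketch (my own construction, not from the slides): an RNN with
# three threshold hidden units that imitates the binary-addition transducer.
# It reads pairs of bits, least significant first, and prints the sum bit;
# the carry lives in hidden unit h2.
import numpy as np

def step(z):                                      # hard threshold standing in for sigma
    return (z > 0).astype(float)

# h_t = step(U x_t + W h_{t-1} + b_h) with h1 = [s >= 1], h2 = [s >= 2], h3 = [s >= 3],
# where s = a_t + b_t + carry and the carry is h2 from the previous step.
U   = np.array([[1., 1.], [1., 1.], [1., 1.]])    # both input bits feed every hidden unit
W   = np.array([[0., 1., 0.], [0., 1., 0.], [0., 1., 0.]])  # recurrent input = old h2 (carry)
b_h = np.array([-0.5, -1.5, -2.5])                # thresholds 1, 2, 3

# y_t = step(V h_t + b_y): the sum bit is 1 iff s is odd, i.e. h1 - h2 + h3 >= 1
V   = np.array([[1., -1., 1.]])
b_y = np.array([-0.5])

def add(a_bits, b_bits):
    """a_bits, b_bits: lists of bits, least significant first, same length."""
    h, out = np.zeros(3), []
    for a, b in zip(a_bits, b_bits):
        x = np.array([a, b], dtype=float)
        h = step(U @ x + W @ h + b_h)             # h_t = sigma(U x_t + W h_{t-1})
        out.append(int(step(V @ h + b_y)[0]))     # y_t = sigma(V h_t)
    return out

# 6 (110) + 7 (111) = 13 (1101); pad with a trailing zero pair for the final carry
print(add([0, 1, 1, 0], [1, 1, 1, 0]))            # -> [1, 0, 1, 1]  (LSB first)
```

The carry is exactly the activity of the second hidden unit, i.e. one of the "patterns of activity" the automaton analogy above refers to.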
Machine translation with RNN

Variants of RNN

one to one: Standard MLP, single vector in, single vector out.
one to many: Single vector in, sequence out. Image captioning: image in, sentence out.
many to one: Sequence in, single vector out. Sentiment analysis: sentence in, sentiment (positive/negative) out.

Variants of RNN

many to many: Sequence in, sequence out. Machine translation: English sentence in, Czech sentence out (the two may have different lengths).
many to many (synced): Synced sequences in and out. Video classification, where we wish to label each frame of the video.

Image recognition: recurrent attention model

The recurrent network tells a "glimpse" network where to look. The state of the recurrent network changes based on the location and the actual perception at that location.

RNN – Learning

We consider a fixed training example (x, d) where
x = x_1, ..., x_T is a given input sequence, here x_t = (x_{t1}, ..., x_{tM}),
d = d_1, ..., d_T is a given sequence of desired values, here d_t = (d_{t1}, ..., d_{tN}).

Unfolding the RNN for x gives a sequence of hidden states h_0, h_1, ..., h_T, each h_t = (h_{t1}, ..., h_{tH}), here h_0 = (0, ..., 0),
and a sequence of output values y_1, ..., y_T, each y_t = (y_{t1}, ..., y_{tN}).

Error function (e.g. squared error): E_{(x,d)} = ∑_{t=1}^{T} ∑_{k=1}^{N} (y_{tk} − d_{tk})²

Learning – backpropagation through time

The RNN training algorithm is easy to obtain:
Unfold the RNN for several time steps.
Consider it to be a (deep) MLP.
Train with gradient descent.
However, one has to keep in mind that U, W, V are shared across all time instants.
To simplify training and to connect with MLPs and convolutional networks, we treat all neurons of the unfolding abstractly and denote them by indices i, j, etc., as before for the MLP.

Recurrent networks – unfolding

Let us make the neurons of the unfolding anonymous. Denote:
X the set of input neurons of the unfolding
Y the set of output neurons of the unfolding
Z the set of all neurons of the unfolding (X, Y ⊆ Z)
Individual neurons of the unfolding are denoted by indices i, j.
ξ_j is the inner potential of neuron j after the computation stops.
σ_j is the activation function of j.
y_j is the output of neuron j after the computation stops.
w_{ji} is the weight of the connection from i to j.
j← is the set of all i such that j is adjacent from i (i.e. there is an arc to j from i).
j→ is the set of all i such that j is adjacent to i (i.e. there is an arc from j to i).
j_share is the set of neurons sharing weights with j; it consists of all incarnations of the same neuron of the RNN in different time instants t.

Gradient descent (single training example)

Consider a single training example (x, d). In the case of SGD, minibatches of such pairs are used and their errors averaged.
The algorithm computes a sequence of weight vectors w^(0), w^(1), w^(2), ...
Do not forget that these are weights of the RNN, possibly shared by several neurons of the unfolding.
The weights in w^(0) are randomly initialized to values close to 0.
In step t + 1 (here t = 0, 1, 2, ...), the weights w^(t+1) are computed as follows:
w^(t+1) = w^(t) + Δw^(t)
where
Δw^(t) = −ε(t) · ∇E_{(x,d)}(w^(t))
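As a sanity check on the "unfold and treat it as a deep MLP" idea, here is a rough sketch (my own illustration, not the lecture's code): it computes E_{(x,d)} over the unfolding and performs the update w^(t+1) = w^(t) − ε · ∇E_{(x,d)}(w^(t)), estimating the gradient by finite differences so that the sharing of U, W, V across time stays explicit. The choice of tanh for σ, the layer sizes, and the random data are my own; the backprop formulas on the following slides compute the same gradient far more efficiently.

```python
# Rough sketch (my own illustration): gradient descent on the unfolded RNN
# with shared U, W, V, using finite-difference gradients of
#   E_(x,d) = sum_t sum_k (y_tk - d_tk)^2.
import numpy as np

def unfold_error(U, W, V, xs, ds):
    """Unfold the RNN over the whole sequence and return E_(x,d)."""
    h = np.zeros(W.shape[0])                  # h_0 = (0, ..., 0)
    E = 0.0
    for x, d in zip(xs, ds):
        h = np.tanh(U @ x + W @ h)            # h_t = sigma(U x_t + W h_{t-1}), sigma = tanh here
        y = np.tanh(V @ h)                    # y_t = sigma(V h_t)
        E += np.sum((y - d) ** 2)
    return E

def numerical_gradient(param, error_fn, eps=1e-5):
    """Finite-difference estimate of dE/dparam (param is perturbed in place and restored)."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps; E_plus = error_fn()
        param[idx] = old - eps; E_minus = error_fn()
        param[idx] = old
        grad[idx] = (E_plus - E_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
M, H, N, T = 2, 3, 1, 5                       # illustrative sizes
U, W, V = (rng.normal(0, 0.3, s) for s in [(H, M), (H, H), (N, H)])
xs = [rng.normal(size=M) for _ in range(T)]   # random input sequence
ds = [np.array([0.5])] * T                    # arbitrary desired outputs

lr = 0.1
print("initial error:", unfold_error(U, W, V, xs, ds))
for _ in range(200):                          # w <- w - eps * grad E, with U, W, V shared over time
    for P in (U, W, V):
        P -= lr * numerical_gradient(P, lambda: unfold_error(U, W, V, xs, ds))
print("final error:", unfold_error(U, W, V, xs, ds))
```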
Backprop

∇E_{(x,d)}(w^(t)) is the vector of all partial derivatives ∂E_{(x,d)}/∂w_{ji}.

First, switch from derivatives w.r.t. w_{ji} to derivatives w.r.t. y_j:
∂E_{(x,d)}/∂w_{ji} = ∑_{r ∈ j_share} ∂E_{(x,d)}/∂y_r · σ'_r(ξ_r) · y_i

For every j ∈ Y:
∂E_{(x,d)}/∂y_j = y_j − d_j
(This holds for the mean squared error; for other error functions the derivative w.r.t. the outputs will be different.)

For every j ∈ Z \ Y:
∂E_{(x,d)}/∂y_j = ∑_{r ∈ j→} ∂E_{(x,d)}/∂y_r · σ'_r(ξ_r) · w_{rj}

In the notation of RNN

∂E_{(x,d)}/∂V_{kk'} = ∑_{t=1}^{T} ∂E_{(x,d)}/∂y_{tk} · σ' · h_{tk'}
∂E_{(x,d)}/∂W_{kk'} = ∑_{t=1}^{T} ∂E_{(x,d)}/∂h_{tk} · σ' · h_{(t−1)k'}
∂E_{(x,d)}/∂U_{kk'} = ∑_{t=1}^{T} ∂E_{(x,d)}/∂h_{tk} · σ' · x_{tk'}

Backprop:
∂E_{(x,d)}/∂y_{tk} = y_{tk} − d_{tk}
∂E_{(x,d)}/∂h_{tk} = ∑_{k'=1}^{N} ∂E_{(x,d)}/∂y_{tk'} · σ' · V_{k'k} + ∑_{k'=1}^{H} ∂E_{(x,d)}/∂h_{(t+1)k'} · σ' · W_{k'k}
(Here σ' denotes the derivative of the activation function evaluated at the corresponding inner potential.)

Long-term dependencies

∂E_{(x,d)}/∂h_{tk} = ∑_{k'=1}^{N} ∂E_{(x,d)}/∂y_{tk'} · σ' · V_{k'k} + ∑_{k'=1}^{H} ∂E_{(x,d)}/∂h_{(t+1)k'} · σ' · W_{k'k}

Unless W_{kk'} · σ' ≈ 1, the gradient either vanishes or explodes.
For large T (long-term dependencies), the "deeper" gradients become either too small or too large.
A solution: LSTM

LSTM

(y_t =) h_t = o_t ◦ σ_h(c_t)                output
c_t = f_t ◦ c_{t−1} + i_t ◦ C̃_t             memory
C̃_t = σ_c(W_C · h_{t−1} + U_C · x_t)        new memory contents
o_t = σ_g(W_o · h_{t−1} + U_o · x_t)         output gate
f_t = σ_g(W_f · h_{t−1} + U_f · x_t)         forget gate
i_t = σ_g(W_i · h_{t−1} + U_i · x_t)         input gate

◦ is the component-wise product.
σ_h, σ_c are originally hyperbolic tangents (but in my opinion they can be whatever you would put into the output and hidden layers, respectively).
σ_g is originally the logistic sigmoid.

RNN vs LSTM

LSTM step by step:
Forget gate: f_t = σ_g(W_f · h_{t−1} + U_f · x_t)
Input gate and new memory contents: i_t = σ_g(W_i · h_{t−1} + U_i · x_t), C̃_t = σ_c(W_C · h_{t−1} + U_C · x_t)
Memory update: c_t = f_t ◦ c_{t−1} + i_t ◦ C̃_t
Output: o_t = σ_g(W_o · h_{t−1} + U_o · x_t), (y_t =) h_t = o_t ◦ σ_h(c_t)

Fun with LSTM – Shakespeare

An LSTM generating new Shakespeare character by character!
All works of Shakespeare concatenated into a single (4.4 MB) file.
A 3-layer RNN with 512 hidden neurons in each layer.

VIOLA:
Why, Salisbury must find his flesh and thought
That which I am not aps, not a man and in fire,
To show the reining of the raven and the wars
To grace my hand reproach within, and not a fair are hand,
That Caesar and my goodly father's world;
When I was heaven of presence and our fleets,
We spare with hours, but cut thy council I am great,
Murdered and by thy master's ready there
My power to give thee but so much as hell:
Some service in the noble bondman here,
Would show him to her wine.

KING LEAR:
O, if you were a feeble sight, the courtesy of your law,
Your sight and several breath, will wear the gods
With his heads, and my hands are wonder'd at the deeds,
So drop upon your lordship's head, and your opinion
Shall be against your honour.
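The Shakespeare generator above stacks three LSTM layers. A single step of one such layer, following the cell equations from the LSTM slides, can be sketched directly (a minimal illustration; the layer sizes and random weights are my own choices, and bias terms, which practical implementations add, are omitted as in the slide equations):

```python
# Minimal sketch of one LSTM step, following the equations above
# (shapes and random initialization are illustrative, not from the lecture).
import numpy as np

def sigmoid(z):                                 # sigma_g: logistic sigmoid
    return 1.0 / (1.0 + np.exp(-z))

H, M = 4, 3                                     # hidden size, input size (arbitrary)
rng = np.random.default_rng(0)
Wf, Wi, Wo, WC = (rng.normal(0, 0.1, (H, H)) for _ in range(4))
Uf, Ui, Uo, UC = (rng.normal(0, 0.1, (H, M)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    f_t = sigmoid(Wf @ h_prev + Uf @ x_t)       # forget gate
    i_t = sigmoid(Wi @ h_prev + Ui @ x_t)       # input gate
    o_t = sigmoid(Wo @ h_prev + Uo @ x_t)       # output gate
    C_tilde = np.tanh(WC @ h_prev + UC @ x_t)   # new memory contents (sigma_c = tanh)
    c_t = f_t * c_prev + i_t * C_tilde          # memory: keep what f allows, add what i allows
    h_t = o_t * np.tanh(c_t)                    # output (sigma_h = tanh), gated by o
    return h_t, c_t

# Run a short random input sequence through the cell:
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, M)):
    h, c = lstm_step(x, h, c)
print(h)
```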
Fun with LSTM – Wikipedia

The Hutter Prize 100 MB dataset of raw Wikipedia data (96 MB for training, the rest for validation).

Naturalism and decision for the majority of Arab countries' capitalide was grounded by the Irish language by [[John Clair]], [[An Imperial Japanese Revolt]], associated with Guangzham's sovereignty. His generals were the powerful ruler of the Portugal in the [[Protestant Immineners]], which could be said to be directly in Cantonese Communication, which followed a ceremony and set inspired prison, training. The emperor travelled back to [[Antioch, Perth, October 25|21]] to note, the Kingdom of Costa Rica, unsuccessful fashioned the [[Thrales]], [[Cynth's Dajoard]], known in western [[Scotland]], near Italy to the conquest of India with the conflict. Copyright was the succession of independence in the slop of Syrian influence that was a famous German movement based on a more popular servicious, non-doctrinal and sexual power post. Many governments recognize the military housing of the [[Civil Liberalization and Infantry Resolution 265 National Party in Hungary]], that is sympathetic to be to the [[Punjab Resolution]] (PJS)[http://www.humah.yahoo.com/guardian.cfm/7754800786d17551963s89.htm

Randomly hallucinated (correct!!) XML:

Antichrist 865 15900676 2002-08-03T18:14:12Z Paris 23 Automated conversion #REDIRECT [[Christianity]]

Fun with LSTM – LaTeX

An RNN trained on an algebraic geometry book (http://stacks.math.columbia.edu/): the raw LaTeX source (a 16 MB file) was used to train a multilayer LSTM.
The resulting sampled LaTeX almost compiles! The authors had to step in and fix a few issues manually, but then you get plausible-looking math.

Linux source code

Trained on all the source and header files found in the Linux repo on GitHub, concatenated into a single giant file (474 MB of C code).
A 3-layer LSTM with approx. 10 million parameters.

Evolution of Shakespeare

Samples after 100, 300, 500, 700, 1200, and 2000 iterations.
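All of the samples above (Shakespeare, Wikipedia, XML, LaTeX, C code) are produced in the same way: the trained network is run one character at a time, the next character is sampled from the softmax output, and the sampled character is fed back in as the next input. Below is a sketch of this sampling loop (my own illustration; the recurrent step is an untrained toy RNN standing in for the trained multilayer LSTM, so the printed output is gibberish).

```python
# Sketch of the character-sampling loop (my own illustration, not the lecture's code):
# feed the last sampled character back into the network and draw the next one
# from the softmax distribution over the alphabet.
import numpy as np

alphabet = list("helo ")                     # tiny illustrative alphabet
M = N = len(alphabet)
H = 16
rng = np.random.default_rng(0)
U = rng.normal(0, 0.3, (H, M))
W = rng.normal(0, 0.3, (H, H))
V = rng.normal(0, 0.3, (N, H))

def softmax(z, temperature=1.0):
    z = z / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def sample(seed_char, length, temperature=1.0):
    h = np.zeros(H)
    c, out = seed_char, [seed_char]
    for _ in range(length):
        x = np.zeros(M)
        x[alphabet.index(c)] = 1.0           # one-hot encoding of the current character
        h = np.tanh(U @ x + W @ h)           # <- an untrained toy RNN; a trained LSTM goes here
        p = softmax(V @ h, temperature)      # distribution over the next character
        c = rng.choice(alphabet, p=p)        # sample rather than take the argmax
        out.append(c)
    return "".join(out)

print(sample("h", 40))
```

Sampling with a temperature below 1 makes the output more conservative (closer to the argmax character), while a higher temperature makes it more diverse and noisier.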