Recurrent Neural Networks – LSTM

RNN
Input:  $\mathbf{x} = (x_1, \ldots, x_M)$
Hidden: $\mathbf{h} = (h_1, \ldots, h_H)$
Output: $\mathbf{y} = (y_1, \ldots, y_N)$

RNN example
Activation function: $\sigma(\xi) = 1$ for $\xi \ge 0$ and $\sigma(\xi) = 0$ for $\xi < 0$.
Input sequence:   $x_1 = (0, 0)$,  $x_2 = (1, 0)$,  $x_3 = (1, 1)$
Hidden sequence:  $h_0 = (0, 0)$,  $h_1 = (1, 1)$,  $h_2 = (1, 0)$,  $h_3 = (0, 1)$
Output sequence:  $y_1 = 1$,  $y_2 = 0$,  $y_3 = 1$

RNN – formally
$M$ inputs: $\mathbf{x} = (x_1, \ldots, x_M)$
$H$ hidden neurons: $\mathbf{h} = (h_1, \ldots, h_H)$
$N$ output neurons: $\mathbf{y} = (y_1, \ldots, y_N)$
Weights:
- $U_{kk'}$ from input $x_{k'}$ to hidden $h_k$
- $W_{kk'}$ from hidden $h_{k'}$ to hidden $h_k$
- $V_{kk'}$ from hidden $h_{k'}$ to output $y_k$

Input sequence: $\mathbf{x} = x_1, \ldots, x_T$ where $x_t = (x_{t1}, \ldots, x_{tM})$.
Hidden sequence: $\mathbf{h} = h_0, h_1, \ldots, h_T$ where $h_t = (h_{t1}, \ldots, h_{tH})$.
We have $h_0 = (0, \ldots, 0)$ and
$$h_{tk} = \sigma\Big(\sum_{k'=1}^{M} U_{kk'}\, x_{tk'} + \sum_{k'=1}^{H} W_{kk'}\, h_{(t-1)k'}\Big)$$
Output sequence: $\mathbf{y} = y_1, \ldots, y_T$ where $y_t = (y_{t1}, \ldots, y_{tN})$ and
$$y_{tk} = \sigma\Big(\sum_{k'=1}^{H} V_{kk'}\, h_{tk'}\Big)$$

RNN – in matrix form
Input sequence: $\mathbf{x} = x_1, \ldots, x_T$.
Hidden sequence: $\mathbf{h} = h_0, h_1, \ldots, h_T$ where $h_0 = (0, \ldots, 0)$ and
$$h_t = \sigma(U x_t + W h_{t-1})$$
Output sequence: $\mathbf{y} = y_1, \ldots, y_T$ where
$$y_t = \sigma(V h_t)$$

RNN – Comments
$h_t$ is the memory of the network; it captures what happened in all previous steps (with decaying quality).
The RNN shares the weights $U$, $V$, $W$ along the sequence. Note the similarity to convolutional networks, where the weights are shared spatially over images; here they are shared temporally over sequences.
An RNN can deal with sequences of variable length. Compare with the MLP, which accepts only fixed-dimension input vectors.

RNN – training
Training set
$$\mathcal{T} = \{(\mathbf{x}^1, \mathbf{d}^1), \ldots, (\mathbf{x}^p, \mathbf{d}^p)\}$$
Here each $\mathbf{x}^i = x^i_1, \ldots, x^i_T$ is an input sequence and each $\mathbf{d}^i = d^i_1, \ldots, d^i_T$ is an expected output sequence. Each $x^i_t = (x^i_{t1}, \ldots, x^i_{tM})$ is an input vector and each $d^i_t = (d^i_{t1}, \ldots, d^i_{tN})$ is an expected output vector.

Error function
In what follows I consider a training set with a single element $(\mathbf{x}, \mathbf{d})$, i.e. I drop the index and write
$\mathbf{x} = x_1, \ldots, x_T$ where $x_t = (x_{t1}, \ldots, x_{tM})$,
$\mathbf{d} = d_1, \ldots, d_T$ where $d_t = (d_{t1}, \ldots, d_{tN})$.
The squared error of $(\mathbf{x}, \mathbf{d})$ is defined by
$$E_{(\mathbf{x},\mathbf{d})} = \sum_{t=1}^{T} \sum_{k=1}^{N} \frac{1}{2}\,(y_{tk} - d_{tk})^2$$
Recall that we have a sequence of network outputs $\mathbf{y} = y_1, \ldots, y_T$, and thus $y_{tk}$ is the $k$-th component of $y_t$.
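The matrix form translates directly into code. The following NumPy sketch of the forward pass and the squared error is mine, not part of the slides; the default tanh activation is an assumption (pass `sigma=lambda z: (z >= 0).astype(float)` to reproduce the threshold unit from the example above).

```python
import numpy as np

def rnn_forward(U, W, V, xs, sigma=np.tanh):
    """Forward pass of the simple RNN in matrix form:
    h_t = sigma(U x_t + W h_{t-1}),  y_t = sigma(V h_t),  with h_0 = 0."""
    h = np.zeros(W.shape[0])          # h_0 = (0, ..., 0)
    hs, ys = [h], []
    for x in xs:                      # xs is the input sequence x_1, ..., x_T
        h = sigma(U @ x + W @ h)
        ys.append(sigma(V @ h))
        hs.append(h)
    return hs, ys

def squared_error(ys, ds):
    """E_(x,d) = sum_t sum_k 1/2 (y_tk - d_tk)^2"""
    return 0.5 * sum(np.sum((y - d) ** 2) for y, d in zip(ys, ds))
```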
Gradient descent (single training example)
Consider a single training example $(\mathbf{x}, \mathbf{d})$. The algorithm computes a sequence of weight matrices as follows:
Initialize all weights randomly close to 0.
In the step $\ell + 1$ (here $\ell = 0, 1, 2, \ldots$), compute "new" weights $U^{(\ell+1)}, V^{(\ell+1)}, W^{(\ell+1)}$ from the "old" weights $U^{(\ell)}, V^{(\ell)}, W^{(\ell)}$ as follows:
$$U^{(\ell+1)}_{kk'} = U^{(\ell)}_{kk'} - \varepsilon(\ell) \cdot \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial U_{kk'}}$$
$$V^{(\ell+1)}_{kk'} = V^{(\ell)}_{kk'} - \varepsilon(\ell) \cdot \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial V_{kk'}}$$
$$W^{(\ell+1)}_{kk'} = W^{(\ell)}_{kk'} - \varepsilon(\ell) \cdot \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial W_{kk'}}$$
The above is THE learning algorithm that modifies weights!

Backpropagation
Computes the derivatives of $E$; no weights are modified!
$$\frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial U_{kk'}} = \sum_{t=1}^{T} \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial h_{tk}} \cdot \sigma' \cdot x_{tk'} \qquad k' = 1, \ldots, M$$
$$\frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial V_{kk'}} = \sum_{t=1}^{T} \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial y_{tk}} \cdot \sigma' \cdot h_{tk'} \qquad k' = 1, \ldots, H$$
$$\frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial W_{kk'}} = \sum_{t=1}^{T} \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial h_{tk}} \cdot \sigma' \cdot h_{(t-1)k'} \qquad k' = 1, \ldots, H$$
(Here $\sigma'$ abbreviates the derivative of the activation function, evaluated at the inner potential of the corresponding neuron.)
Backpropagation:
$$\frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial y_{tk}} = y_{tk} - d_{tk} \qquad \text{(assuming the squared error)}$$
$$\frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial h_{tk}} = \sum_{k'=1}^{N} \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial y_{tk'}} \cdot \sigma' \cdot V_{k'k} + \sum_{k'=1}^{H} \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial h_{(t+1)k'}} \cdot \sigma' \cdot W_{k'k}$$

Long-term dependencies
$$\frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial h_{tk}} = \sum_{k'=1}^{N} \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial y_{tk'}} \cdot \sigma' \cdot V_{k'k} + \sum_{k'=1}^{H} \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial h_{(t+1)k'}} \cdot \sigma' \cdot W_{k'k}$$
Unless $\sum_{k'=1}^{H} \sigma' \cdot W_{k'k} \approx 1$, the gradient either vanishes or explodes as it is propagated back through time. For a large $T$ (long-term dependency), the gradient "deeper" in the past tends to be too small (or too large).
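The backpropagation formulas above amount to one backward sweep over the sequence. The sketch below is my own illustration, not part of the slides; it assumes tanh activations in both layers, so $\sigma'$ can be recovered from the stored activations as $1 - \sigma(\cdot)^2$.

```python
import numpy as np

def rnn_bptt(U, W, V, xs, ds):
    """Backpropagation through time for the simple RNN, assuming tanh
    activations in both layers and the squared error. Returns the gradients
    dE/dU, dE/dW, dE/dV for a single training pair (x, d)."""
    T, H = len(xs), W.shape[0]
    # Forward pass, storing the whole hidden and output sequences.
    hs, ys = [np.zeros(H)], []
    for x in xs:
        hs.append(np.tanh(U @ x + W @ hs[-1]))   # h_t = sigma(U x_t + W h_{t-1})
        ys.append(np.tanh(V @ hs[-1]))           # y_t = sigma(V h_t)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dh_next = np.zeros(H)                        # no contribution from beyond t = T
    for t in reversed(range(T)):
        dy = ys[t] - ds[t]                       # dE/dy_t  (squared error)
        dy_pre = dy * (1.0 - ys[t] ** 2)         # ... through the output nonlinearity
        dV += np.outer(dy_pre, hs[t + 1])        # hs[t + 1] is h_t
        dh = V.T @ dy_pre + dh_next              # dE/dh_t: from y_t and from h_{t+1}
        dh_pre = dh * (1.0 - hs[t + 1] ** 2)     # ... through the hidden nonlinearity
        dU += np.outer(dh_pre, xs[t])
        dW += np.outer(dh_pre, hs[t])            # hs[t] is h_{t-1}
        dh_next = W.T @ dh_pre                   # propagated to step t - 1
    return dU, dW, dV
```

One step of the gradient descent from the slide above is then simply `U -= eps * dU`, `V -= eps * dV`, `W -= eps * dW`.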
A solution: LSTM

LSTM
$$h_t = o_t \circ \sigma_h(C_t) \qquad \text{output}$$
$$C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t \qquad \text{memory}$$
$$\tilde{C}_t = \sigma_h(W_C \cdot h_{t-1} + U_C \cdot x_t) \qquad \text{new memory contents}$$
$$o_t = \sigma_g(W_o \cdot h_{t-1} + U_o \cdot x_t) \qquad \text{output gate}$$
$$f_t = \sigma_g(W_f \cdot h_{t-1} + U_f \cdot x_t) \qquad \text{forget gate}$$
$$i_t = \sigma_g(W_i \cdot h_{t-1} + U_i \cdot x_t) \qquad \text{input gate}$$
- $\circ$ is the component-wise product of vectors
- $\cdot$ is the matrix-vector product
- $\sigma_h$ is the hyperbolic tangent (applied component-wise)
- $\sigma_g$ is the logistic sigmoid (applied component-wise)

RNN vs LSTM
[Figure: the repeating module of a standard RNN compared with that of an LSTM.]
Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

LSTM – summary
LSTM (almost) solves the vanishing gradient problem with respect to the "internal" state of the network.
It learns to control its own memory (via the forget gate).
It brought a revolution in machine translation and text processing.
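Read bottom-up, the six equations describe one step of an LSTM layer. Here is a minimal NumPy sketch of a single step, mine rather than the slides', with parameter names mirroring the equations and biases omitted exactly as on the slide.

```python
import numpy as np

def lstm_step(p, h_prev, C_prev, x):
    """One step of the LSTM cell, following the equations above. p is a dict of
    weight matrices W_f, U_f, W_i, U_i, W_C, U_C, W_o, U_o; biases are omitted."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))             # sigma_g
    f = sigmoid(p["W_f"] @ h_prev + p["U_f"] @ x)            # forget gate
    i = sigmoid(p["W_i"] @ h_prev + p["U_i"] @ x)            # input gate
    o = sigmoid(p["W_o"] @ h_prev + p["U_o"] @ x)            # output gate
    C_tilde = np.tanh(p["W_C"] @ h_prev + p["U_C"] @ x)      # new memory contents
    C = f * C_prev + i * C_tilde                             # memory
    h = o * np.tanh(C)                                       # output
    return h, C
```

Because $C_t$ is updated additively ($f_t \circ C_{t-1} + i_t \circ \tilde{C}_t$), gradients flowing through the memory cell avoid the repeated multiplication by $W$ that makes them vanish or explode in the plain RNN.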
Convolutions & LSTM in action – cancer research

Colorectal cancer outcome prediction
The problem: predict the 5-year survival probability from an image of a small region of tumour tissue (1 mm diameter).
Input: digitized haematoxylin-eosin-stained tumour tissue microarray samples.
Output: estimated survival probability.
Data:
- Training set: 420 patients of Helsinki University Central Hospital, diagnosed with colorectal cancer, who underwent primary surgery.
- Test set: 182 patients.
- Follow-up time and outcome are known for each patient.
Human expert comparison:
- Histological grade, assessed at the time of diagnosis.
- Visual risk score: three pathologists classified the samples into high/low-risk categories (by majority vote).
Source: D. Bychkov et al. Deep learning based tissue analysis predicts outcome in colorectal cancer. Scientific Reports, Nature, 2018.

Data & workflow
Input images: 3500 px × 3500 px.
Cut into tiles of 224 px × 224 px ⇒ 256 tiles.
Each tile is passed to a convolutional network (CNN).
Output of the CNN: a 4096-dimensional vector.
The "string" of 256 vectors (each of dimension 4096) is passed into an LSTM.
The LSTM outputs the probability of 5-year survival.
(A sketch of this pipeline in code follows after the training details below.)
The authors also tried to substitute the LSTM on top of the CNN with logistic regression, naive Bayes, or support vector machines.

CNN architecture – VGG-16
(Pre)trained on ImageNet (cats, dogs, chairs, etc.).

LSTM architecture
The LSTM has three layers (264, 128, and 64 cells).

LSTM – training
L1 regularization (0.005) at each hidden layer of the LSTM, i.e. 0.005 times the sum of the absolute values of the weights is added to the error.
L2 regularization (0.005) at each hidden layer of the LSTM, i.e. 0.005 times the sum of the squared values of the weights is added to the error.
Dropout of 5% at the input and the last hidden layer of the LSTM.
Datasets: training 220 samples, validation 60 samples, test 140 samples.
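For concreteness, here is a rough Keras sketch of the tile → VGG-16 features → LSTM pipeline described under "Data & workflow" and "LSTM – training" above. It is my approximation, not the authors' code: the padding of 3500 px images to 3584 px, the optimizer, the loss, and the exact placement of the dropout and regularizers are assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model, layers, regularizers
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# VGG-16 pretrained on ImageNet; the 4096-dimensional "fc2" activations serve as tile features.
vgg = VGG16(weights="imagenet", include_top=True)
feature_extractor = Model(inputs=vgg.input, outputs=vgg.get_layer("fc2").output)

def tile_features(image):
    """Cut a 3584 x 3584 x 3 image (3500 px padded to a multiple of 224, an assumption)
    into 16 x 16 = 256 tiles of 224 x 224 px and extract a 4096-d vector per tile."""
    tiles = [image[r:r + 224, c:c + 224]
             for r in range(0, 3584, 224) for c in range(0, 3584, 224)]
    batch = preprocess_input(np.stack(tiles).astype("float32"))
    return feature_extractor.predict(batch, verbose=0)       # shape (256, 4096)

# LSTM on top of the 256-step feature sequence: three layers (264, 128, 64 cells),
# L1 + L2 regularization (0.005 each) on the hidden layers, 5% dropout at the input
# and the last hidden layer, and a sigmoid output for the 5-year survival probability.
reg = regularizers.l1_l2(l1=0.005, l2=0.005)
survival_model = tf.keras.Sequential([
    layers.Input(shape=(256, 4096)),
    layers.Dropout(0.05),
    layers.LSTM(264, return_sequences=True, kernel_regularizer=reg),
    layers.LSTM(128, return_sequences=True, kernel_regularizer=reg),
    layers.LSTM(64, kernel_regularizer=reg),
    layers.Dropout(0.05),
    layers.Dense(1, activation="sigmoid"),
])
survival_model.compile(optimizer="adam", loss="binary_crossentropy")
```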