Recurrent Neural Networks – LSTM

RNN
Input:  $\mathbf{x} = (x_1, \ldots, x_M)$
Hidden: $\mathbf{h} = (h_1, \ldots, h_H)$
Output: $\mathbf{y} = (y_1, \ldots, y_N)$

RNN example
Activation function: $\sigma(\xi) = 1$ for $\xi \ge 0$ and $\sigma(\xi) = 0$ for $\xi < 0$.
Input sequence:   $x_1 = (0, 0)$,  $x_2 = (1, 0)$,  $x_3 = (1, 1)$
Hidden sequence:  $h_0 = (0, 0)$,  $h_1 = (1, 1)$,  $h_2 = (1, 0)$,  $h_3 = (0, 1)$
Output sequence:  $y_1 = 1$,  $y_2 = 0$,  $y_3 = 1$

RNN – formally
$M$ inputs: $\mathbf{x} = (x_1, \ldots, x_M)$
$H$ hidden neurons: $\mathbf{h} = (h_1, \ldots, h_H)$
$N$ output neurons: $\mathbf{y} = (y_1, \ldots, y_N)$
Weights:
- $U_{kk'}$ from input $x_{k'}$ to hidden $h_k$
- $W_{kk'}$ from hidden $h_{k'}$ to hidden $h_k$
- $V_{kk'}$ from hidden $h_{k'}$ to output $y_k$

Input sequence: $\mathbf{x} = x_1, \ldots, x_T$ where $x_t = (x_{t1}, \ldots, x_{tM})$.
Hidden sequence: $\mathbf{h} = h_0, h_1, \ldots, h_T$ where $h_t = (h_{t1}, \ldots, h_{tH})$.
We have $h_0 = (0, \ldots, 0)$ and
$$h_{tk} = \sigma\Big(\sum_{k'=1}^{M} U_{kk'}\, x_{tk'} + \sum_{k'=1}^{H} W_{kk'}\, h_{(t-1)k'}\Big)$$
Output sequence: $\mathbf{y} = y_1, \ldots, y_T$ where $y_t = (y_{t1}, \ldots, y_{tN})$ and
$$y_{tk} = \sigma\Big(\sum_{k'=1}^{H} V_{kk'}\, h_{tk'}\Big)$$

RNN – in matrix form
Input sequence: $\mathbf{x} = x_1, \ldots, x_T$.
Hidden sequence: $\mathbf{h} = h_0, h_1, \ldots, h_T$ where $h_0 = (0, \ldots, 0)$ and
$$h_t = \sigma(U x_t + W h_{t-1})$$
Output sequence: $\mathbf{y} = y_1, \ldots, y_T$ where
$$y_t = \sigma(V h_t)$$

RNN – Comments
$h_t$ is the memory of the network; it captures what happened in all previous steps (with decaying quality).
The RNN shares the weights $U$, $V$, $W$ along the sequence. Note the similarity to convolutional networks, where the weights are shared spatially over images; here they are shared temporally over sequences.
An RNN can deal with sequences of variable length. Compare with the MLP, which accepts only fixed-dimension input vectors.

RNN – training
Training set
$$\mathcal{T} = \{(\mathbf{x}^1, \mathbf{d}^1), \ldots, (\mathbf{x}^p, \mathbf{d}^p)\}$$
Here each $\mathbf{x}^i = x^i_1, \ldots, x^i_T$ is an input sequence and each $\mathbf{d}^i = d^i_1, \ldots, d^i_T$ is an expected output sequence. Each $x^i_t = (x^i_{t1}, \ldots, x^i_{tM})$ is an input vector and each $d^i_t = (d^i_{t1}, \ldots, d^i_{tN})$ is an expected output vector.

Error function
In what follows I consider a training set with a single element $(\mathbf{x}, \mathbf{d})$, i.e. I drop the index and write
$\mathbf{x} = x_1, \ldots, x_T$ where $x_t = (x_{t1}, \ldots, x_{tM})$,
$\mathbf{d} = d_1, \ldots, d_T$ where $d_t = (d_{t1}, \ldots, d_{tN})$.
The squared error of $(\mathbf{x}, \mathbf{d})$ is defined by
$$E_{(\mathbf{x},\mathbf{d})} = \sum_{t=1}^{T} \sum_{k=1}^{N} \frac{1}{2}\,(y_{tk} - d_{tk})^2$$
Recall that we have a sequence of network outputs $\mathbf{y} = y_1, \ldots, y_T$, and thus $y_{tk}$ is the $k$-th component of $y_t$.
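The matrix form translates directly into code. The following NumPy sketch of the forward pass and the squared error is mine, not part of the slides; the default tanh activation is an assumption (pass `sigma=lambda z: (z >= 0).astype(float)` to reproduce the threshold unit from the example above).

```python
import numpy as np

def rnn_forward(U, W, V, xs, sigma=np.tanh):
    """Forward pass of the simple RNN in matrix form:
    h_t = sigma(U x_t + W h_{t-1}),  y_t = sigma(V h_t),  with h_0 = 0."""
    h = np.zeros(W.shape[0])          # h_0 = (0, ..., 0)
    hs, ys = [h], []
    for x in xs:                      # xs is the input sequence x_1, ..., x_T
        h = sigma(U @ x + W @ h)
        ys.append(sigma(V @ h))
        hs.append(h)
    return hs, ys

def squared_error(ys, ds):
    """E_(x,d) = sum_t sum_k 1/2 (y_tk - d_tk)^2"""
    return 0.5 * sum(np.sum((y - d) ** 2) for y, d in zip(ys, ds))
```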
Gradient descent (single training example)
Consider a single training example $(\mathbf{x}, \mathbf{d})$. The algorithm computes a sequence of weight matrices as follows:
Initialize all weights randomly close to 0.
In the step $\ell + 1$ (here $\ell = 0, 1, 2, \ldots$), compute "new" weights $U^{(\ell+1)}, V^{(\ell+1)}, W^{(\ell+1)}$ from the "old" weights $U^{(\ell)}, V^{(\ell)}, W^{(\ell)}$ as follows:
$$U^{(\ell+1)}_{kk'} = U^{(\ell)}_{kk'} - \varepsilon(\ell) \cdot \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial U_{kk'}}$$
$$V^{(\ell+1)}_{kk'} = V^{(\ell)}_{kk'} - \varepsilon(\ell) \cdot \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial V_{kk'}}$$
$$W^{(\ell+1)}_{kk'} = W^{(\ell)}_{kk'} - \varepsilon(\ell) \cdot \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial W_{kk'}}$$
The above is THE learning algorithm that modifies weights!

Backpropagation
Computes the derivatives of $E$; no weights are modified!
$$\frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial U_{kk'}} = \sum_{t=1}^{T} \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial h_{tk}} \cdot \sigma' \cdot x_{tk'} \qquad k' = 1, \ldots, M$$
$$\frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial V_{kk'}} = \sum_{t=1}^{T} \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial y_{tk}} \cdot \sigma' \cdot h_{tk'} \qquad k' = 1, \ldots, H$$
$$\frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial W_{kk'}} = \sum_{t=1}^{T} \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial h_{tk}} \cdot \sigma' \cdot h_{(t-1)k'} \qquad k' = 1, \ldots, H$$
(Here $\sigma'$ abbreviates the derivative of the activation function, evaluated at the inner potential of the corresponding neuron.)
Backpropagation:
$$\frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial y_{tk}} = y_{tk} - d_{tk} \qquad \text{(assuming the squared error)}$$
$$\frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial h_{tk}} = \sum_{k'=1}^{N} \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial y_{tk'}} \cdot \sigma' \cdot V_{k'k} + \sum_{k'=1}^{H} \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial h_{(t+1)k'}} \cdot \sigma' \cdot W_{k'k}$$

Long-term dependencies
$$\frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial h_{tk}} = \sum_{k'=1}^{N} \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial y_{tk'}} \cdot \sigma' \cdot V_{k'k} + \sum_{k'=1}^{H} \frac{\partial E_{(\mathbf{x},\mathbf{d})}}{\partial h_{(t+1)k'}} \cdot \sigma' \cdot W_{k'k}$$
Unless $\sum_{k'=1}^{H} \sigma' \cdot W_{k'k} \approx 1$, the gradient either vanishes or explodes as it is propagated back through time. For a large $T$ (long-term dependency), the gradient "deeper" in the past tends to be too small (or too large).
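The backpropagation formulas above amount to one backward sweep over the sequence. The sketch below is my own illustration, not part of the slides; it assumes tanh activations in both layers, so $\sigma'$ can be recovered from the stored activations as $1 - \sigma(\cdot)^2$.

```python
import numpy as np

def rnn_bptt(U, W, V, xs, ds):
    """Backpropagation through time for the simple RNN, assuming tanh
    activations in both layers and the squared error. Returns the gradients
    dE/dU, dE/dW, dE/dV for a single training pair (x, d)."""
    T, H = len(xs), W.shape[0]
    # Forward pass, storing the whole hidden and output sequences.
    hs, ys = [np.zeros(H)], []
    for x in xs:
        hs.append(np.tanh(U @ x + W @ hs[-1]))   # h_t = sigma(U x_t + W h_{t-1})
        ys.append(np.tanh(V @ hs[-1]))           # y_t = sigma(V h_t)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dh_next = np.zeros(H)                        # no contribution from beyond t = T
    for t in reversed(range(T)):
        dy = ys[t] - ds[t]                       # dE/dy_t  (squared error)
        dy_pre = dy * (1.0 - ys[t] ** 2)         # ... through the output nonlinearity
        dV += np.outer(dy_pre, hs[t + 1])        # hs[t + 1] is h_t
        dh = V.T @ dy_pre + dh_next              # dE/dh_t: from y_t and from h_{t+1}
        dh_pre = dh * (1.0 - hs[t + 1] ** 2)     # ... through the hidden nonlinearity
        dU += np.outer(dh_pre, xs[t])
        dW += np.outer(dh_pre, hs[t])            # hs[t] is h_{t-1}
        dh_next = W.T @ dh_pre                   # propagated to step t - 1
    return dU, dW, dV
```

One step of the gradient descent from the slide above is then simply `U -= eps * dU`, `V -= eps * dV`, `W -= eps * dW`.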
A solution: LSTM

LSTM
$$h_t = o_t \circ \sigma_h(C_t) \qquad \text{output}$$
$$C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t \qquad \text{memory}$$
$$\tilde{C}_t = \sigma_h(W_C \cdot h_{t-1} + U_C \cdot x_t) \qquad \text{new memory contents}$$
$$o_t = \sigma_g(W_o \cdot h_{t-1} + U_o \cdot x_t) \qquad \text{output gate}$$
$$f_t = \sigma_g(W_f \cdot h_{t-1} + U_f \cdot x_t) \qquad \text{forget gate}$$
$$i_t = \sigma_g(W_i \cdot h_{t-1} + U_i \cdot x_t) \qquad \text{input gate}$$
- $\circ$ is the component-wise product of vectors
- $\cdot$ is the matrix-vector product
- $\sigma_h$ is the hyperbolic tangent (applied component-wise)
- $\sigma_g$ is the logistic sigmoid (applied component-wise)

RNN vs LSTM
[Figure: the repeating module of a standard RNN compared with that of an LSTM.]
Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

LSTM – summary
LSTM (almost) solves the vanishing gradient problem with respect to the "internal" state of the network.
It learns to control its own memory (via the forget gate).
It brought a revolution in machine translation and text processing.
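Read bottom-up, the six equations describe one step of an LSTM layer. Here is a minimal NumPy sketch of a single step, mine rather than the slides', with parameter names mirroring the equations and biases omitted exactly as on the slide.

```python
import numpy as np

def lstm_step(p, h_prev, C_prev, x):
    """One step of the LSTM cell, following the equations above. p is a dict of
    weight matrices W_f, U_f, W_i, U_i, W_C, U_C, W_o, U_o; biases are omitted."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))             # sigma_g
    f = sigmoid(p["W_f"] @ h_prev + p["U_f"] @ x)            # forget gate
    i = sigmoid(p["W_i"] @ h_prev + p["U_i"] @ x)            # input gate
    o = sigmoid(p["W_o"] @ h_prev + p["U_o"] @ x)            # output gate
    C_tilde = np.tanh(p["W_C"] @ h_prev + p["U_C"] @ x)      # new memory contents
    C = f * C_prev + i * C_tilde                             # memory
    h = o * np.tanh(C)                                       # output
    return h, C
```

Because $C_t$ is updated additively ($f_t \circ C_{t-1} + i_t \circ \tilde{C}_t$), gradients flowing through the memory cell avoid the repeated multiplication by $W$ that makes them vanish or explode in the plain RNN.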
Convolutions & LSTM in action – cancer research

Colorectal cancer outcome prediction
The problem: predict the 5-year survival probability from an image of a small region of tumour tissue (1 mm diameter).
Input: digitized haematoxylin-eosin-stained tumour tissue microarray samples.
Output: estimated survival probability.
Data:
- Training set: 420 patients of Helsinki University Central Hospital, diagnosed with colorectal cancer, who underwent primary surgery.
- Test set: 182 patients.
- Follow-up time and outcome are known for each patient.
Human expert comparison:
- Histological grade, assessed at the time of diagnosis.
- Visual risk score: three pathologists classified the samples into high/low-risk categories (by majority vote).
Source: D. Bychkov et al. Deep learning based tissue analysis predicts outcome in colorectal cancer. Scientific Reports, Nature, 2018.

Data & workflow
Input images: 3500 px × 3500 px.
Cut into tiles of 224 px × 224 px ⇒ 256 tiles.
Each tile is passed to a convolutional network (CNN).
Output of the CNN: a 4096-dimensional vector.
The "string" of 256 vectors (each of dimension 4096) is passed into an LSTM.
The LSTM outputs the probability of 5-year survival.
(A sketch of this pipeline in code follows after the training details below.)
The authors also tried to substitute the LSTM on top of the CNN with logistic regression, naive Bayes, or support vector machines.

CNN architecture – VGG-16
(Pre)trained on ImageNet (cats, dogs, chairs, etc.).

LSTM architecture
The LSTM has three layers (264, 128, and 64 cells).

LSTM – training
L1 regularization (0.005) at each hidden layer of the LSTM, i.e. 0.005 times the sum of the absolute values of the weights is added to the error.
L2 regularization (0.005) at each hidden layer of the LSTM, i.e. 0.005 times the sum of the squared values of the weights is added to the error.
Dropout of 5% at the input and the last hidden layer of the LSTM.
Datasets: training 220 samples, validation 60 samples, test 140 samples.
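For concreteness, here is a rough Keras sketch of the tile → VGG-16 features → LSTM pipeline described under "Data & workflow" and "LSTM – training" above. It is my approximation, not the authors' code: the padding of 3500 px images to 3584 px, the optimizer, the loss, and the exact placement of the dropout and regularizers are assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model, layers, regularizers
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# VGG-16 pretrained on ImageNet; the 4096-dimensional "fc2" activations serve as tile features.
vgg = VGG16(weights="imagenet", include_top=True)
feature_extractor = Model(inputs=vgg.input, outputs=vgg.get_layer("fc2").output)

def tile_features(image):
    """Cut a 3584 x 3584 x 3 image (3500 px padded to a multiple of 224, an assumption)
    into 16 x 16 = 256 tiles of 224 x 224 px and extract a 4096-d vector per tile."""
    tiles = [image[r:r + 224, c:c + 224]
             for r in range(0, 3584, 224) for c in range(0, 3584, 224)]
    batch = preprocess_input(np.stack(tiles).astype("float32"))
    return feature_extractor.predict(batch, verbose=0)       # shape (256, 4096)

# LSTM on top of the 256-step feature sequence: three layers (264, 128, 64 cells),
# L1 + L2 regularization (0.005 each) on the hidden layers, 5% dropout at the input
# and the last hidden layer, and a sigmoid output for the 5-year survival probability.
reg = regularizers.l1_l2(l1=0.005, l2=0.005)
survival_model = tf.keras.Sequential([
    layers.Input(shape=(256, 4096)),
    layers.Dropout(0.05),
    layers.LSTM(264, return_sequences=True, kernel_regularizer=reg),
    layers.LSTM(128, return_sequences=True, kernel_regularizer=reg),
    layers.LSTM(64, kernel_regularizer=reg),
    layers.Dropout(0.05),
    layers.Dense(1, activation="sigmoid"),
])
survival_model.compile(optimizer="adam", loss="binary_crossentropy")
```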