MLP training – theory

Architecture – Multilayer Perceptron (MLP)

(Figure: a network with input neurons x_1, x_2, one or more hidden layers, and output neurons y_1, y_2.)

- Neurons are partitioned into layers: one input layer, one output layer, and possibly several hidden layers.
- Layers are numbered from 0; the input layer has number 0. E.g. a three-layer network has two hidden layers and one output layer.
- Every neuron in the ℓ-th layer is connected to all neurons in the (ℓ+1)-st layer.
- The architecture of an MLP is typically described by the numbers of neurons in the individual layers (e.g. 2-4-3-2).

MLP – architecture

Notation:
- Denote by X the set of input neurons, by Y the set of output neurons, and by Z the set of all neurons (X, Y ⊆ Z).
- Individual neurons are denoted by indices i, j, etc.
- ξ_j is the inner potential of neuron j after the computation stops.
- y_j is the output of neuron j after the computation stops (define y_0 = 1 as the value of the formal unit input).
- w_ji is the weight of the connection from i to j (in particular, w_j0 is the weight of the connection from the formal unit input, i.e. w_j0 = −b_j where b_j is the bias of neuron j).
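To make the notation concrete, the weights of, e.g., a 2-4-3-2 network can be held in plain nested lists, one matrix per non-input layer. This is only an illustrative sketch; the helper `init_weights` and the list-of-matrices layout are assumptions of this example, not part of the slides (the small random initial values anticipate the learning algorithm described later):

```python
import random

def init_weights(sizes, seed=0):
    """One weight matrix per non-input layer of an MLP such as [2, 4, 3, 2].

    Entry [j][i] is w_ji; column i = 0 belongs to the formal unit input
    (y_0 = 1), so it stores w_j0 = -b_j, the negated bias of neuron j.
    """
    rng = random.Random(seed)
    # values close to 0, as prescribed for the initial weight vector w(0)
    return [[[rng.uniform(-0.05, 0.05) for _ in range(n_in + 1)]
             for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]

weights = init_weights([2, 4, 3, 2])
print([(len(m), len(m[0])) for m in weights])   # → [(4, 3), (3, 5), (2, 4)]
```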
- j← is the set of all i such that j is adjacent from i (i.e. there is an arc to j from i).
- j→ is the set of all i such that j is adjacent to i (i.e. there is an arc from j to i).

MLP – activity

Activity:
- inner potential of neuron j: ξ_j = ∑_{i∈j←} w_ji · y_i
- activation function σ_j for neuron j (arbitrary differentiable), e.g. the
logistic sigmoid σ_j(ξ) = 1 / (1 + e^(−λ_j ξ)).
- State of a non-input neuron j ∈ Z ∖ X after the computation stops: y_j = σ_j(ξ_j)
  (y_j depends on the configuration w and the input x, so we sometimes write y_j(w, x)).
- The network thus computes a function from ℝ^|X| to ℝ^|Y|.
- Layer-wise computation: first, all input neurons are assigned the values of the input; in the ℓ-th step, all neurons of the ℓ-th layer are evaluated.

MLP – learning

Learning:
- Given a training set T of the form {(x_k, d_k) | k = 1, …, p}. Here every x_k ∈ ℝ^|X| is an input vector and every d_k ∈ ℝ^|Y| is the desired network output. For every j ∈ Y, denote by d_kj the desired output of neuron j for the network input x_k (the vector d_k can be written as (d_kj)_{j∈Y}).
- Error function: E(w) = ∑_{k=1}^p E_k(w), where E_k(w) = ½ ∑_{j∈Y} (y_j(w, x_k) − d_kj)².

MLP – learning algorithm

Batch algorithm (gradient descent):
The algorithm computes a sequence of weight vectors w^(0), w^(1), w^(2), …
- The weights in w^(0) are randomly initialized to values close to 0.
- In step t + 1 (here t = 0, 1, 2, …), the weights w^(t+1) are computed as follows:
    w_ji^(t+1) = w_ji^(t) + Δw_ji^(t),
  where Δw_ji^(t) = −ε(t) · ∂E/∂w_ji(w^(t)) is the weight update of w_ji in step t + 1 and 0 < ε(t) ≤ 1 is the learning rate in step t + 1.
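The layer-wise activity and the per-example error E_k can be sketched as follows. This is an illustration under assumptions: a logistic sigmoid with λ_j = 1 for every neuron, and weights stored as weights[l][j][i] with i = 0 the formal unit; the function names are this example's, not the slides':

```python
import math

def sigmoid(xi):
    # logistic sigmoid with lambda_j = 1
    return 1.0 / (1.0 + math.exp(-xi))

def forward(weights, x):
    """Layer-wise computation: returns the outputs y_j of every layer."""
    ys = [list(x)]                      # step 0: input neurons get the input
    for layer in weights:               # step l: evaluate the l-th layer
        prev = [1.0] + ys[-1]           # prepend the formal unit y_0 = 1
        ys.append([sigmoid(sum(w * y for w, y in zip(row, prev)))
                   for row in layer])   # inner potential, then activation
    return ys

def example_error(weights, x, d):
    """E_k(w) = 1/2 * sum over output neurons of (y_j(w, x_k) - d_kj)^2."""
    y = forward(weights, x)[-1]
    return 0.5 * sum((yj - dj) ** 2 for yj, dj in zip(y, d))

# with all-zero weights every neuron outputs sigmoid(0) = 0.5
w = [[[0.0] * 3] * 2, [[0.0] * 3] * 2]           # a 2-2-2 network
print(example_error(w, [0.0, 1.0], [0.5, 0.5]))  # → 0.0
```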
Note that ∂E/∂w_ji(w^(t)) is a component of the gradient ∇E, i.e. the weight update can be written as w^(t+1) = w^(t) − ε(t) · ∇E(w^(t)).

MLP – error function gradient

For every w_ji we have
    ∂E/∂w_ji = ∑_{k=1}^p ∂E_k/∂w_ji,
where for every k = 1, …, p:
    ∂E_k/∂w_ji = ∂E_k/∂y_j · σ'_j(ξ_j) · y_i
and for every j ∈ Z ∖ X:
    ∂E_k/∂y_j = y_j − d_kj                                  for j ∈ Y,
    ∂E_k/∂y_j = ∑_{r∈j→} ∂E_k/∂y_r · σ'_r(ξ_r) · w_rj       for j ∈ Z ∖ (Y ∪ X).
(Here all y_j are in fact y_j(w, x_k).)
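The backward recursion for ∂E_k/∂y_j can be written down directly. A sketch under assumptions: fully connected layers, logistic sigmoids with λ_r = 1 so that σ'_r(ξ_r) = y_r (1 − y_r), and weights stored as weights[l][j][i] with i = 0 the formal unit; `forward` and `backward` are names chosen for this example:

```python
import math

def forward(weights, x):
    """Layer-wise evaluation with logistic sigmoids (lambda = 1)."""
    ys = [list(x)]
    for layer in weights:
        prev = [1.0] + ys[-1]   # formal unit y_0 = 1
        ys.append([1.0 / (1.0 + math.exp(-sum(w * y for w, y in zip(row, prev))))
                   for row in layer])
    return ys

def backward(weights, ys, d):
    """Return dE_k/dy_j for every non-input layer, output layer last."""
    # output neurons: dE_k/dy_j = y_j - d_kj
    grads = [[yj - dj for yj, dj in zip(ys[-1], d)]]
    # hidden neurons: sum over successors r in j->, using sigma'_r = y_r(1 - y_r)
    for l in range(len(weights) - 1, 0, -1):
        succ_w, succ_y = weights[l], ys[l + 1]
        grads.insert(0, [sum(g * yr * (1.0 - yr) * row[j + 1]  # skip formal unit
                             for g, yr, row in zip(grads[0], succ_y, succ_w))
                         for j in range(len(ys[l]))])
    return grads

w = [[[0.0] * 3] * 2, [[0.0] * 3] * 2]          # a 2-2-2 network, zero weights
g = backward(w, forward(w, [0.0, 1.0]), [1.0, 0.0])
print(g)   # → [[0.0, 0.0], [-0.5, 0.5]]
```

With zero weights every output is 0.5, so the output-layer entries are y_j − d_kj = −0.5 and 0.5, and the hidden entries vanish because all w_rj = 0.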
MLP – error function gradient

If σ_j(ξ) = 1 / (1 + e^(−λ_j ξ)) for all j ∈ Z, then
    σ'_j(ξ_j) = λ_j · y_j · (1 − y_j),
and thus for all j ∈ Z ∖ X:
    ∂E_k/∂y_j = y_j − d_kj                                       for j ∈ Y,
    ∂E_k/∂y_j = ∑_{r∈j→} ∂E_k/∂y_r · λ_r y_r (1 − y_r) · w_rj    for j ∈ Z ∖ (Y ∪ X).

If σ_j(ξ) = a · tanh(b · ξ) for all j ∈ Z, then
    σ'_j(ξ_j) = (b/a) · (a − y_j) · (a + y_j).

MLP – computing the gradient

Compute ∂E/∂w_ji = ∑_{k=1}^p ∂E_k/∂w_ji as follows:
- Initialize E_ji := 0. (By the end of the computation: E_ji = ∂E/∂w_ji.)
- For every k = 1, …, p do:
  1. forward pass: compute y_j = y_j(w, x_k) for all j ∈ Z,
  2. backward pass: compute ∂E_k/∂y_j for all j ∈ Z using backpropagation (see below),
  3.
compute ∂E_k/∂w_ji for all w_ji using ∂E_k/∂w_ji := ∂E_k/∂y_j · σ'_j(ξ_j) · y_i,
  4. E_ji := E_ji + ∂E_k/∂w_ji.
The resulting E_ji equals ∂E/∂w_ji.

MLP – backpropagation

Compute ∂E_k/∂y_j for all j ∈ Z as follows:
- if j ∈ Y, then ∂E_k/∂y_j = y_j − d_kj,
- if j ∈ Z ∖ (Y ∪ X), then, assuming that j is in the ℓ-th layer and that ∂E_k/∂y_r has already been computed for all neurons of the (ℓ+1)-st layer, compute
    ∂E_k/∂y_j = ∑_{r∈j→} ∂E_k/∂y_r · σ'_r(ξ_r) · w_rj.
  (This works because all neurons r ∈ j→ belong to the (ℓ+1)-st layer.)

Complexity of the batch algorithm

The computation of ∂E/∂w_ji(w^(t−1)) stops in time linear in the size of the network plus the size of the training set (assuming unit cost of operations, including the computation of σ'_r(ξ_r) for a given ξ_r).

Proof sketch: The algorithm does the following p times:
1. forward pass, i.e.
computes y_j(w, x_k),
2. backpropagation, i.e. computes ∂E_k/∂y_j,
3. computes ∂E_k/∂w_ji and adds it to E_ji (a constant-time operation in the unit-cost framework).
Steps 1–3 take linear time.
Note that the speed of convergence of the gradient descent cannot be estimated ...
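Putting the pieces together, the whole gradient computation (forward pass, backpropagation, accumulation of E_ji over all p examples) fits in a short routine; each example costs time linear in the number of weights, in line with the complexity bound. A sketch under the same assumptions as before: logistic sigmoids with λ = 1, a weights[l][j][i] layout with i = 0 the formal unit, and names chosen for this example:

```python
import math

def forward(weights, x):
    ys = [list(x)]
    for layer in weights:
        prev = [1.0] + ys[-1]
        ys.append([1.0 / (1.0 + math.exp(-sum(w * y for w, y in zip(row, prev))))
                   for row in layer])
    return ys

def gradient(weights, data):
    """Accumulate E_ji = dE/dw_ji over the whole training set."""
    grads = [[[0.0] * len(row) for row in layer] for layer in weights]
    for x, d in data:
        ys = forward(weights, x)                       # 1. forward pass
        dEdy = [yj - dj for yj, dj in zip(ys[-1], d)]  # 2. backward pass
        for l in range(len(weights) - 1, -1, -1):
            prev = [1.0] + ys[l]
            for j in range(len(weights[l])):
                # dE_k/dw_ji = dE_k/dy_j * sigma'_j(xi_j) * y_i
                delta = dEdy[j] * ys[l + 1][j] * (1.0 - ys[l + 1][j])
                for i, yi in enumerate(prev):
                    grads[l][j][i] += delta * yi       # 3.-4. accumulate E_ji
            if l > 0:                                  # propagate dE_k/dy_j down
                dEdy = [sum(dEdy[r] * ys[l + 1][r] * (1.0 - ys[l + 1][r]) * row[j + 1]
                            for r, row in enumerate(weights[l]))
                        for j in range(len(ys[l]))]
    return grads

w = [[[0.1, 0.2, -0.1], [0.0, -0.2, 0.1]], [[0.05, 0.1, -0.1]]]   # a 2-2-1 net
data = [([0.0, 1.0], [1.0]), ([1.0, 0.0], [0.0])]
g = gradient(w, data)
print(len(g), len(g[0]), len(g[0][0]))   # → 2 2 3
```

One sanity check worth doing on any such implementation is comparing the result against a finite-difference approximation of ∂E/∂w_ji.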
Illustration of the gradient descent – XOR

(Figure omitted. Source: Pattern Classification, 2nd Edition; Richard O. Duda, Peter E. Hart, David G. Stork.)

MLP – learning algorithm

Online algorithm:
The algorithm computes a sequence of weight vectors w^(0), w^(1), w^(2), …
- The weights in w^(0) are randomly initialized to values close to 0.
- In step t + 1 (here t = 0, 1, 2, …), the weights w^(t+1) are computed as follows:
    w_ji^(t+1) = w_ji^(t) + Δw_ji^(t),
  where Δw_ji^(t) = −ε(t) · ∂E_k/∂w_ji(w^(t)) is the weight update of w_ji in step t + 1 and 0 < ε(t) ≤ 1 is the learning rate in step t + 1.
- There are other variants, determined by the selection of the training examples used for the error computation (more on this later).

SGD

- The weights in w^(0) are randomly initialized to values close to 0.
- In step t + 1 (here t = 0, 1, 2, …), the weights w^(t+1) are computed as follows:
  - choose (randomly) a set of training examples T ⊆ {1, …, p},
  - compute w^(t+1) = w^(t) + Δw^(t), where Δw^(t) = −ε(t) · ∑_{k∈T} ∇E_k(w^(t));
  - here 0 < ε(t) ≤ 1 is the learning rate in step t + 1 and ∇E_k(w^(t)) is the gradient of the error of the example k.
- Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially.
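The shuffle-then-slice minibatch scheme just described can be sketched as below. To keep the loop readable, the "network" here is a single linear neuron with squared error, so ∇E_k has a one-line closed form; the function `sgd`, its parameters and the toy data are this example's own, not the slides':

```python
import random

def sgd(data, dim, minibatch=2, epochs=500, eps=0.1, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(-0.05, 0.05) for _ in range(dim)]  # w(0) close to 0
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)                   # shuffle all data once per epoch...
        for s in range(0, len(order), minibatch):
            T = order[s:s + minibatch]       # ...then take minibatches sequentially
            step = [0.0] * dim
            for k in T:                      # sum of grad E_k over k in T
                x, d = data[k]
                err = sum(wi * xi for wi, xi in zip(w, x)) - d
                for i, xi in enumerate(x):   # grad of E_k = 1/2 (w.x_k - d_k)^2
                    step[i] += err * xi
            w = [wi - eps * gi for wi, gi in zip(w, step)]   # w(t+1)
    return w

# four consistent examples of d = 2*x1 - x2; SGD recovers the coefficients
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
print([round(wi, 3) for wi in sgd(data, 2)])   # → [2.0, -1.0]
```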