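ADALINE

Architecture:

[Figure: a single neuron with inputs $x_1, x_2, \dots, x_n$, the formal unit input $x_0 = 1$, weights $w_0, w_1, w_2, \dots, w_n$, and output $y$]

$\vec{w} = (w_0, w_1, \dots, w_n)$ and $\vec{x} = (x_0, x_1, \dots, x_n)$ where $x_0 = 1$.

Activity:
- inner potential: $\xi = w_0 + \sum_{i=1}^{n} w_i x_i = \sum_{i=0}^{n} w_i x_i = \vec{w} \cdot \vec{x}$
- activation function: $\sigma(\xi) = \xi$
- network function: $y[\vec{w}](\vec{x}) = \sigma(\xi) = \vec{w} \cdot \vec{x}$

ADALINE

Learning:
- Given a training set $T = \{(\vec{x}_1, d_1), (\vec{x}_2, d_2), \dots, (\vec{x}_p, d_p)\}$.
  Here $\vec{x}_k = (x_{k0}, x_{k1}, \dots, x_{kn}) \in \mathbb{R}^{n+1}$, $x_{k0} = 1$, is the $k$-th input, and $d_k \in \mathbb{R}$ is the expected output.

Intuition: The network is supposed to compute an affine approximation of the function (some of) whose values are given in the training set.

[Figure: Oaks in Wisconsin]

ADALINE

Error function:

$$E(\vec{w}) = \frac{1}{2} \sum_{k=1}^{p} \left(\vec{w} \cdot \vec{x}_k - d_k\right)^2 = \frac{1}{2} \sum_{k=1}^{p} \left(\sum_{i=0}^{n} w_i x_{ki} - d_k\right)^2$$

The goal is to find $\vec{w}$ which minimizes $E(\vec{w})$.

[Figure: Error function]

Gradient of the error function

Consider the gradient of the error function:

$$\nabla E(\vec{w}) = \left(\frac{\partial E}{\partial w_0}(\vec{w}), \dots, \frac{\partial E}{\partial w_n}(\vec{w})\right)$$

Intuition: $\nabla E(\vec{w})$ is a vector in the weight space which points in the direction of the steepest ascent of the error function. Note that the vectors $\vec{x}_k$ are just parameters of the function $E$, and are thus fixed!

Fact: If $\nabla E(\vec{w}) = \vec{0} = (0, \dots, 0)$, then $\vec{w}$ is a global minimum of $E$.

For ADALINE, the error function $E(\vec{w})$ is a convex paraboloid and thus has a unique global minimum.

[Figure: Gradient - illustration]

Caution! This picture just illustrates the notion of gradient; it is not the convex paraboloid $E(\vec{w})$!

Gradient of the error function (ADALINE)

For every component $w_\ell$ of $\vec{w}$:

$$
\begin{aligned}
\frac{\partial E}{\partial w_\ell}(\vec{w})
&= \frac{1}{2} \sum_{k=1}^{p} \frac{\partial}{\partial w_\ell} \left(\sum_{i=0}^{n} w_i x_{ki} - d_k\right)^2 \\
&= \frac{1}{2} \sum_{k=1}^{p} 2 \left(\sum_{i=0}^{n} w_i x_{ki} - d_k\right) \frac{\partial}{\partial w_\ell} \left(\sum_{i=0}^{n} w_i x_{ki} - d_k\right) \\
&= \frac{1}{2} \sum_{k=1}^{p} 2 \left(\sum_{i=0}^{n} w_i x_{ki} - d_k\right) \left(\sum_{i=0}^{n} \frac{\partial}{\partial w_\ell} (w_i x_{ki}) - \frac{\partial d_k}{\partial w_\ell}\right) \\
&= \sum_{k=1}^{p} \left(\vec{w} \cdot \vec{x}_k - d_k\right) x_{k\ell}
\end{aligned}
$$

Thus

$$\nabla E(\vec{w}) = \left(\frac{\partial E}{\partial w_0}(\vec{w}), \dots, \frac{\partial E}{\partial w_n}(\vec{w})\right) = \sum_{k=1}^{p} \left(\vec{w} \cdot \vec{x}_k - d_k\right) \vec{x}_k$$

ADALINE - learning

Batch algorithm (gradient descent):

Idea: In every step "move" the weights in the direction opposite to the gradient.

The algorithm computes a sequence of weight vectors $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \dots$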
- weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
- in step $t + 1$, the weights $\vec{w}^{(t+1)}$ are computed as follows:

$$\vec{w}^{(t+1)} = \vec{w}^{(t)} - \varepsilon \cdot \nabla E(\vec{w}^{(t)}) = \vec{w}^{(t)} - \varepsilon \cdot \sum_{k=1}^{p} \left(\vec{w}^{(t)} \cdot \vec{x}_k - d_k\right) \cdot \vec{x}_k$$

Here $0 < \varepsilon \le 1$ is the learning rate.

Proposition: For sufficiently small $\varepsilon > 0$ the sequence $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \dots$ converges (componentwise) to the global minimum of $E$ (i.e. to the vector $\vec{w}$ satisfying $\nabla E(\vec{w}) = \vec{0}$).

[Figure: ADALINE – Animation]

ADALINE - learning

Online algorithm (Delta-rule, Widrow-Hoff rule):
- weights in $\vec{w}^{(0)}$ are initialized randomly close to 0
- in step $t + 1$, the weights $\vec{w}^{(t+1)}$ are computed as follows:

$$\vec{w}^{(t+1)} = \vec{w}^{(t)} - \varepsilon(t) \cdot \left(\vec{w}^{(t)} \cdot \vec{x}_k - d_k\right) \cdot \vec{x}_k$$

Here $k = (t \bmod p) + 1$ and $0 < \varepsilon(t) \le 1$ is the learning rate in step $t + 1$.

Note that the algorithm does not work with the complete gradient but only with the part of it determined by the currently considered training example.

Theorem (Widrow & Hoff): If $\varepsilon(t) = \frac{1}{t}$, then $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \dots$ converges to the global minimum of $E$.

ADALINE - classification

How to use the ADALINE for classification?
- The training set is $T = \{(\vec{x}_1, d_1), (\vec{x}_2, d_2), \dots, (\vec{x}_p, d_p)\}$ where $\vec{x}_k = (x_{k0}, x_{k1}, \dots, x_{kn}) \in \mathbb{R}^{n+1}$ and $d_k \in \{1, -1\}$. Here $d_k$ determines a class.
- Train the network using the ADALINE algorithm.
- We may expect the following:
  - if $d_k = 1$, then $\vec{w} \cdot \vec{x}_k \ge 0$
  - if $d_k = -1$, then $\vec{w} \cdot \vec{x}_k < 0$
- This does not always have to hold, but if the training set is reasonably linearly separable, then the algorithm typically gives satisfactory results.

Architecture – Multilayer Perceptron (MLP)

[Figure: a layered network with layers labelled Input, Hidden, Output; input neurons $x_1, x_2$ and output neurons $y_1, y_2$]

- Neurons are partitioned into layers; one input layer, one output layer, possibly several hidden layers
- layers are numbered from 0; the input layer has number 0
  - E.g. a three-layer network has two hidden layers and one output layer
- Neurons in the $i$-th layer are connected with all neurons in the $(i+1)$-st layer
- The architecture of an MLP is typically described by the numbers of neurons in the individual layers (e.g.
2-4-3-2)

MLP – architecture

Notation:
- Denote
  - $X$ a set of input neurons
  - $Y$ a set of output neurons
  - $Z$ a set of all neurons ($X, Y \subseteq Z$)
- individual neurons are denoted by indices $i$, $j$ etc.
  - $\xi_j$ is the inner potential of the neuron $j$ after the computation stops
  - $y_j$ is the output of the neuron $j$ after the computation stops (define $y_0 = 1$ as the value of the formal unit input)
- $w_{ji}$ is the weight of the connection from $i$ to $j$ (in particular, $w_{j0}$ is the weight of the connection from the formal unit input, i.e. $w_{j0} = -b_j$ where $b_j$ is the bias of the neuron $j$)
- $j_{\leftarrow}$ is the set of all $i$ such that $j$ is adjacent from $i$ (i.e. there is an arc to $j$ from $i$)
- $j_{\rightarrow}$ is the set of all $i$ such that $j$ is adjacent to $i$ (i.e. there is an arc from $j$ to $i$)

MLP – activity

Activity:
- inner potential of neuron $j$: $\xi_j = \sum_{i \in j_{\leftarrow}} w_{ji} y_i$
- activation function $\sigma_j$ for neuron $j$ (arbitrary differentiable) [e.g. logistic sigmoid $\sigma_j(\xi) = \frac{1}{1 + e^{-\lambda_j \xi}}$]
- State of a non-input neuron $j \in Z \setminus X$ after the computation stops: $y_j = \sigma_j(\xi_j)$ ($y_j$ depends on the configuration $\vec{w}$ and the input $\vec{x}$, so we sometimes write $y_j(\vec{w}, \vec{x})$)
- The network computes a function from $\mathbb{R}^{|X|}$ to $\mathbb{R}^{|Y|}$. Layer-wise computation: First, all input neurons are assigned values of the input. In the $\ell$-th step, all neurons of the $\ell$-th layer are evaluated.

MLP – learning

Learning:
- Given a training set $T$ of the form $\{(\vec{x}_k, \vec{d}_k) \mid k = 1, \dots, p\}$.
  Here, every $\vec{x}_k \in \mathbb{R}^{|X|}$ is an input vector and every $\vec{d}_k \in \mathbb{R}^{|Y|}$ is the desired network output. For every $j \in Y$, denote by $d_{kj}$ the desired output of the neuron $j$ for a given network input $\vec{x}_k$ (the vector $\vec{d}_k$ can be written as $(d_{kj})_{j \in Y}$).
- Error function:

$$E(\vec{w}) = \sum_{k=1}^{p} E_k(\vec{w}) \quad \text{where} \quad E_k(\vec{w}) = \frac{1}{2} \sum_{j \in Y} \left(y_j(\vec{w}, \vec{x}_k) - d_{kj}\right)^2$$

MLP – learning algorithm

Batch algorithm (gradient descent):

The algorithm computes a sequence of weight vectors $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \dots$
- weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
- in step t + 1 (here t = 0, 1, 2, . .
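.), the weights $\vec{w}^{(t+1)}$ are computed as follows:

$$w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)} \quad \text{where} \quad \Delta w_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t)})$$

is the weight update of $w_{ji}$ in step $t + 1$ and $0 < \varepsilon(t) \le 1$ is the learning rate in step $t + 1$.

Note that $\frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t)})$ is a component of the gradient $\nabla E$, i.e. the weight update can be written as $\vec{w}^{(t+1)} = \vec{w}^{(t)} - \varepsilon(t) \cdot \nabla E(\vec{w}^{(t)})$.

MLP – error function gradient

For every $w_{ji}$ we have

$$\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$$

where for every $k = 1, \dots, p$ the following holds:

$$\frac{\partial E_k}{\partial w_{ji}} = \frac{\partial E_k}{\partial y_j} \cdot \sigma_j'(\xi_j) \cdot y_i$$

and for every $j \in Z \setminus X$ we get

$$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj} \quad \text{for } j \in Y$$

$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj} \quad \text{for } j \in Z \setminus (Y \cup X)$$

(Here all $y_j$ are in fact $y_j(\vec{w}, \vec{x}_k)$.)

MLP – error function gradient

- If $\sigma_j(\xi) = \frac{1}{1 + e^{-\lambda_j \xi}}$ for all $j \in Z$, then $\sigma_j'(\xi_j) = \lambda_j y_j (1 - y_j)$ and thus for all $j \in Z \setminus X$:

$$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj} \quad \text{for } j \in Y$$

$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \lambda_r y_r (1 - y_r) \cdot w_{rj} \quad \text{for } j \in Z \setminus (Y \cup X)$$

- If $\sigma_j(\xi) = a \cdot \tanh(b \cdot \xi)$ for all $j \in Z$, then $\sigma_j'(\xi_j) = \frac{b}{a} (a - y_j)(a + y_j)$

MLP – computing the gradient

Compute $\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$ as follows:

Initialize $\mathcal{E}_{ji} := 0$ (By the end of the computation: $\mathcal{E}_{ji} = \frac{\partial E}{\partial w_{ji}}$)

For every $k = 1, \dots, p$ do:
1. forward pass: compute $y_j = y_j(\vec{w}, \vec{x}_k)$ for all $j \in Z$
2. backward pass: compute $\frac{\partial E_k}{\partial y_j}$ for all $j \in Z$ using backpropagation (see the next slide!)
3. compute $\frac{\partial E_k}{\partial w_{ji}}$ for all $w_{ji}$ using $\frac{\partial E_k}{\partial w_{ji}} := \frac{\partial E_k}{\partial y_j} \cdot \sigma_j'(\xi_j) \cdot y_i$
4. $\mathcal{E}_{ji} := \mathcal{E}_{ji} + \frac{\partial E_k}{\partial w_{ji}}$

The resulting $\mathcal{E}_{ji}$ equals $\frac{\partial E}{\partial w_{ji}}$.

MLP – backpropagation

Compute $\frac{\partial E_k}{\partial y_j}$ for all $j \in Z$ as follows:
- if $j \in Y$, then $\frac{\partial E_k}{\partial y_j} = y_j - d_{kj}$
- if $j \in Z \setminus (Y \cup X)$, then assuming that $j$ is in the $\ell$-th layer and assuming that $\frac{\partial E_k}{\partial y_r}$ has already been computed for all neurons in the $(\ell+1)$-st layer, compute

$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj}$$

(This works because all neurons $r \in j_{\rightarrow}$ belong to the $(\ell+1)$-st layer.)

Complexity of the batch algorithm

Computation of $\frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t-1)})$ stops in time linear in the size of the network plus the size of the training set.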
(assuming unit cost of operations, including the computation of $\sigma_r'(\xi_r)$ for given $\xi_r$)

Proof sketch: The algorithm does the following $p$ times:
1. forward pass, i.e. computes $y_j(\vec{w}, \vec{x}_k)$
2. backpropagation, i.e. computes $\frac{\partial E_k}{\partial y_j}$
3. computes $\frac{\partial E_k}{\partial w_{ji}}$ and adds it to $\mathcal{E}_{ji}$ (a constant time operation in the unit cost framework)

Steps 1. - 3. take linear time.

Note that the speed of convergence of the gradient descent cannot be estimated ...

MLP – learning algorithm

Online algorithm:

The algorithm computes a sequence of weight vectors $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \dots$
- weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
- in step $t + 1$ (here $t = 0, 1, 2, \dots$), the weights $\vec{w}^{(t+1)}$ are computed as follows:

$$w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)} \quad \text{where} \quad \Delta w_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E_k}{\partial w_{ji}}(\vec{w}^{(t)})$$

is the weight update of $w_{ji}$ in step $t + 1$ and $0 < \varepsilon(t) \le 1$ is the learning rate in step $t + 1$. Here $k$ is the index of the training example used in step $t + 1$ (for example $k = (t \bmod p) + 1$, as in the ADALINE online rule).

There are other variants determined by the selection of the training examples used for the error computation (more on this later).

[Figure: Illustration of the gradient descent – XOR. Source: Pattern Classification (2nd Edition); Richard O. Duda, Peter E. Hart, David G. Stork]

[Figure: Animation (sin(x), network 1-5-1); panels: one iteration, 10 iterations]
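The forward pass, backpropagation, and batch weight update from the MLP slides can be sketched in plain Python for a 1-5-1 network like the one in the sin(x) animation. This is only a minimal sketch under assumptions that are not in the slides: the hidden neurons use the logistic sigmoid with λ = 1, the output neuron uses the identity activation (so its derivative is 1), and the function names `forward`, `error`, `gradient`, `train` as well as the learning rate, step count, and initialization range are illustrative choices.

```python
import math
import random

def sigmoid(xi):
    # Logistic sigmoid with lambda = 1 (assumption for this sketch).
    return 1.0 / (1.0 + math.exp(-xi))

def forward(W1, W2, x):
    """Forward pass of a 1-h-1 network; index 0 is the formal unit input y_0 = 1."""
    inp = [1.0, x]
    hidden = [1.0] + [sigmoid(sum(w * v for w, v in zip(row, inp))) for row in W1]
    out = sum(w * v for w, v in zip(W2, hidden))  # identity output activation (assumption)
    return inp, hidden, out

def error(W1, W2, data):
    # E(w) = sum_k E_k(w), with E_k = 1/2 (y - d_k)^2 for the single output neuron.
    return sum(0.5 * (forward(W1, W2, x)[2] - d) ** 2 for x, d in data)

def gradient(W1, W2, data):
    """Accumulate dE/dw_ji over all training examples using backpropagation."""
    G1 = [[0.0, 0.0] for _ in W1]
    G2 = [0.0] * len(W2)
    for x, d in data:
        inp, hidden, out = forward(W1, W2, x)      # 1. forward pass
        dE_dout = out - d                          # 2. dE_k/dy_j for j in Y
        for i, y_i in enumerate(hidden):
            G2[i] += dE_dout * 1.0 * y_i           # output sigma' = 1
        for j, row in enumerate(W1):
            y_j = hidden[j + 1]
            dE_dyj = dE_dout * 1.0 * W2[j + 1]     # sum over r in j-> (one output neuron)
            delta = dE_dyj * y_j * (1.0 - y_j)     # sigma'_j = y_j (1 - y_j)
            for i, y_i in enumerate(inp):
                G1[j][i] += delta * y_i            # 3.-4. add dE_k/dw_ji to the accumulator
    return G1, G2

def train(data, hidden=5, eps=0.01, steps=2000, seed=0):
    """Batch gradient descent: w(t+1) = w(t) - eps * grad E(w(t))."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(hidden)]
    W2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden + 1)]
    for _ in range(steps):
        G1, G2 = gradient(W1, W2, data)
        W1 = [[w - eps * g for w, g in zip(wr, gr)] for wr, gr in zip(W1, G1)]
        W2 = [w - eps * g for w, g in zip(W2, G2)]
    return W1, W2
```

As a usage example, `train([(x / 4, math.sin(x / 4)) for x in range(-8, 9)])` runs the batch algorithm on 17 samples of sin(x). Comparing the output of `gradient` with finite differences of `error` is a cheap way to check that the backward pass matches the formulas above.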