Convolutional networks

Convolutional layers

- Every neuron is connected to a (typically small) receptive field of neurons in the lower layer.
- The neuron itself is "standard": it computes a weighted sum of its inputs and applies an activation function.
- Neurons are grouped into feature maps sharing weights.
- Each feature map represents a property of the input that is supposed to be spatially invariant. Typically, we consider several feature maps in a single layer.

Pooling layers

Neurons in a pooling layer compute simple functions of their receptive fields (the fields are typically disjoint):
- max-pooling: maximum of the inputs
- L2-pooling: square root of the sum of squares
- average-pooling: mean
- ...

Convolutional networks – architecture

Neurons are organized in layers L0, L1, ..., Ln, with connections (typically) only from Lm to Lm+1. Several types of layers:
- input layer L0
- dense layer Lm: each neuron of Lm is connected with each neuron of Lm−1
- convolutional & pooling layer Lm, which contains two sub-layers:
  - convolutional layer: neurons are organized into disjoint feature maps; all neurons of a given feature map share weights (but have different inputs)
  - pooling layer: each (convolutional) feature map F has a corresponding pooling map P; the neurons of P have inputs only from F (typically few of them), compute a simple aggregate function (such as max), and have disjoint inputs

Notation:
- X is the set of input neurons
- Y is the set of output neurons
- Z is the set of all neurons (X, Y ⊆ Z)
- individual neurons are denoted by indices i, j, etc.
- ξj is the inner potential of neuron j after the computation stops
- yj is the output of neuron j after the computation stops (we define y0 = 1 as the value of the formal unit input)
- wji is the weight of the connection from i to j (in particular, wj0 is the weight of the connection from the formal unit input, i.e. wj0 = −bj where bj is the bias of neuron j)
- j← is the set of all i such that j is adjacent from i (i.e. there is an arc to j from i)
- j→ is the set of all i such that j is adjacent to i (i.e. there is an arc from j to i)
- jshare is the set of neurons sharing weights with j, i.e. the neurons that belong to the same feature map as j

Convolutional networks – activity

Neurons of dense and convolutional layers:
- inner potential of neuron j: ξj = ∑_{i∈j←} wji yi
- activation function σj of neuron j (arbitrary differentiable): yj = σj(ξj)

Neurons of pooling layers apply the "pooling" function:
- max-pooling: yj = max_{i∈j←} yi
- avg-pooling: yj = (∑_{i∈j←} yi) / |j←|

A convolutional network is evaluated layer-wise (as an MLP); for each j ∈ Y, yj(w, x) is the value of the output neuron j after evaluating the network with weights w and input x.

Convolutional networks – learning

Learning: given a training set T of the form (xk, dk), k = 1, ..., p. Here, every xk ∈ R^|X| is an input vector and every dk ∈ R^|Y| is the desired network output. For every j ∈ Y, denote by dkj the desired output of neuron j for the network input xk (the vector dk can be written as (dkj)_{j∈Y}).

Error function – mean squared error (for example):

E(w) = (1/p) ∑_{k=1}^{p} Ek(w)   where   Ek(w) = (1/2) ∑_{j∈Y} (yj(w, xk) − dkj)²

Convolutional networks – SGD

The algorithm computes a sequence of weight vectors w(0), w(1), w(2), ....
- The weights in w(0) are randomly initialized to values close to 0.
- In step t + 1 (here t = 0, 1, 2, ...), the weights w(t+1) are computed as follows:
  - Choose (randomly) a set of training examples T ⊆ {1, ..., p}.
  - Compute w(t+1) = w(t) + ∆w(t) where ∆w(t) = −ε(t) · (1/|T|) ∑_{k∈T} ∇Ek(w(t)).

Here T is a minibatch (of a fixed size), 0 < ε(t) ≤ 1 is the learning rate in step t + 1, and ∇Ek(w(t)) is the gradient of the error of example k.

Note that the random choice of the minibatch is typically implemented by randomly shuffling all the data and then choosing the minibatches sequentially.
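The minibatch SGD scheme above can be sketched in a few lines of NumPy. This is a minimal sketch, not the lecture's implementation: the helper name `grad_Ek` and the toy single-neuron example are illustrative assumptions.

```python
import numpy as np

def sgd(grad_Ek, w0, data, epochs=10, batch_size=2, eps=0.1, rng=None):
    """Minibatch SGD as on the slide: w(t+1) = w(t) - eps * mean of per-example gradients.

    grad_Ek(w, x, d) returns the gradient of the error of one example;
    an epoch is one pass through the randomly shuffled data.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.array(w0, dtype=float)
    idx = np.arange(len(data))
    for _ in range(epochs):
        rng.shuffle(idx)                          # shuffle all data once per epoch ...
        for s in range(0, len(idx), batch_size):  # ... then take minibatches sequentially
            batch = idx[s:s + batch_size]
            g = np.mean([grad_Ek(w, *data[k]) for k in batch], axis=0)
            w -= eps * g                          # delta w = -eps * averaged gradient
    return w

# Toy example (illustrative): a single linear neuron with error Ek = 1/2 (w.x - d)^2,
# hence grad Ek = (w.x - d) * x; the targets follow d = 2x + 1 exactly.
grad = lambda w, x, d: (w @ x - d) * x
data = [(np.array([x, 1.0]), 2.0 * x + 1.0) for x in np.linspace(-1, 1, 8)]
w = sgd(grad, [0.0, 0.0], data, epochs=200)
```

Since the toy targets are exactly linear, the learned weight vector approaches (2, 1), the slope and bias of the generating line.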
An epoch consists of one round through all the data.

Backprop

Recall that ∇Ek(w(t)) is the vector of all partial derivatives of the form ∂Ek/∂wji. How to compute ∂Ek/∂wji?

First, switch from derivatives w.r.t. wji to derivatives w.r.t. yj:
- For every wji where j is in a dense layer, i.e. does not share weights:
  ∂Ek/∂wji = ∂Ek/∂yj · σ'j(ξj) · yi
- For every wji where j is in a convolutional layer, i.e. shares wji with the neurons of jshare:
  ∂Ek/∂wji = ∑_{r∈jshare} ∂Ek/∂yr · σ'r(ξr) · y_{i(r)}
  where i(r) is the input neuron of r connected to r by the weight shared with wji.
- Neurons of pooling layers have no weights.

Now compute the derivatives w.r.t. yj:
- for every j ∈ Y:
  ∂Ek/∂yj = yj − dkj
  This holds for the mean squared error; for other error functions the derivative w.r.t. the outputs will be different.
- for every j ∈ Z∖Y such that j→ lies in a dense or a convolutional layer:
  ∂Ek/∂yj = ∑_{r∈j→} ∂Ek/∂yr · σ'r(ξr) · wrj
- for every j ∈ Z∖Y such that j→ lies in a max-pooling layer: then j→ = {i} for a single "max" neuron i, and
  ∂Ek/∂yj = ∂Ek/∂yi if j = argmax_{r∈i←} yr, and 0 otherwise.

I.e. the gradient can be propagated from the output layer downwards as in an MLP.

Convolutional networks – conclusions

- Convolutional networks are nowadays the most used networks in image processing (and also in other areas where the input has some local, "spatially" invariant properties).
- They are typically trained using backpropagation.
- Due to the weight sharing they allow (very) deep architectures.
- They are typically extended with more adjustments and tricks in their topologies.

Recurrent networks – Hopfield network

Auto-associative network: given an input, the network outputs a training example (encoded in its weights) "similar" to the given input.

Hopfield network

Architecture:
- complete topology, i.e. the output of each neuron is an input to all neurons
- all neurons are both input and output
- denote by ξ1, ..., ξn the inner potentials and by y1, ..., yn the outputs (states) of the individual neurons
- denote by wji the weight of the connection from neuron i ∈ {1, ..., n} to neuron j ∈ {1, ..., n}
- assume wjj = 0 for every j = 1, ..., n
- for now: no neuron has a bias

Hopfield network – learning

Training set T = {xk | xk = (xk1, ..., xkn) ∈ {−1, 1}^n, k = 1, ..., p}. The goal is to "store" the training examples of T so that the network is able to associate similar examples.

Hebb's learning rule: If the inputs to a system cause the same pattern of activity to occur repeatedly, the set of active elements constituting that pattern will become increasingly strongly interassociated. That is, each element will tend to turn on every other element and (with negative weights) to turn off the elements that do not form part of the pattern. To put it another way, the pattern as a whole will become "auto-associated".

Mathematically speaking:

wji = ∑_{k=1}^{p} xkj xki   for 1 ≤ j ≠ i ≤ n

Intuition: "Neurons that fire together, wire together."

Note that wji = wij, i.e. the weight matrix is symmetric. Learning can be seen as a poll about the equality of inputs:
- if xkj = xki, the training example votes for "i equals j" by adding one to wji;
- if xkj ≠ xki, the training example votes for "i does not equal j" by subtracting one from wji.

Hopfield network – activity

Initially, the neurons are set to the network input x = (x1, ..., xn), thus y(0)_j = xj for every j = 1, ..., n. Then the states of the neurons are updated cyclically: in step t + 1, the neuron j with j = (t mod n) + 1 is updated (note: cycling over the n neurons, not over the p training examples) as follows:
- compute the inner potential ξ(t)_j = ∑_{i=1}^{n} wji y(t)_i
- then set y(t+1)_j = 1 if ξ(t)_j > 0, y(t+1)_j = y(t)_j if ξ(t)_j = 0, and y(t+1)_j = −1 if ξ(t)_j < 0;
  all the other neurons keep their states.

Hopfield network – activity

The computation stops in a step t* if the network is for the first time in a stable state, i.e.

y(t*+n)_j = y(t*)_j   (j = 1, ..., n)

Theorem. Assuming symmetric weights, the computation of a Hopfield network always stops for every input.
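Hebb's rule and the cyclic update can be sketched as follows. This is a minimal sketch, assuming the slide's conventions (states in {−1, 1}, zero diagonal, no biases); the function names are illustrative.

```python
import numpy as np

def hebb(patterns):
    """Hebb's rule: w_ji = sum_k x_kj * x_ki for j != i, with w_jj = 0."""
    X = np.array(patterns)
    W = X.T @ X
    np.fill_diagonal(W, 0)
    return W

def run(W, x, max_steps=10_000):
    """Cyclic asynchronous updates; stop once the state survives a full round unchanged."""
    y = np.array(x)
    n = len(y)
    unchanged, t = 0, 0
    while unchanged < n and t < max_steps:
        j = t % n                     # update neuron j = (t mod n) + 1 (0-based here)
        xi = W[j] @ y                 # inner potential
        new = 1 if xi > 0 else (-1 if xi < 0 else y[j])  # keep the state when xi == 0
        unchanged = unchanged + 1 if new == y[j] else 0
        y[j] = new
        t += 1
    return y

W = hebb([[1, -1, 1]])                # the three-neuron example from the slides
recalled = run(W, [1, -1, -1])        # input = stored pattern with one flipped bit
```

On the flipped input the network recovers the stored pattern (1, −1, 1), illustrating the auto-associative behaviour.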
This implies that a given Hopfield network computes a function from {−1, 1}^n to {−1, 1}^n (determined by its weights). Denote by y(W, x) = (y(t*)_1, ..., y(t*)_n) the value of the network for a given input x and weight matrix W, and by yj(W, x) = y(t*)_j the component of this value corresponding to neuron j. If W is clear from the context, we write only y(x) and yj(x).

Ising model – an analogy

Simple models of magnetic materials resemble the Hopfield network:
- atomic magnets are organized into a square lattice
- each magnet may have only one of two possible orientations (in the Hopfield network +1 and −1)
- the orientation of each magnet is influenced by an external magnetic field (the input of the network) as well as by the orientations of the other magnets
- the weights in the Hopfield model determine the interaction among the magnets

Energy function

The energy function E assigns to every state y ∈ {−1, 1}^n a (potential) energy:

E(y) = −(1/2) ∑_{j=1}^{n} ∑_{i=1}^{n} wji yj yi

States with low energy are stable (few neurons "want to" change their states); states with high energy are not stable. I.e. a pair with a large (positive) product wji yj yi contributes to stability, and a pair with a small (negative) product wji yj yi contributes to instability. The energy does not increase during the computation: E(y(t)) ≥ E(y(t+1)), and the stable states y(t*) correspond to local minima of E.

Energy landscape (figure)

Hopfield – example

A Hopfield network with three neurons trained on the single training example (1, −1, 1) using Hebb's rule; the resulting weights are w12 = −1, w23 = −1, w13 = 1. Note that (−1, 1, −1) has also been "stored" into the network.

y1  y2  y3 |  E
 1   1   1 |  1
 1   1  −1 |  1
 1  −1   1 | −3
 1  −1  −1 |  1
−1   1   1 |  1
−1   1  −1 | −3
−1  −1   1 |  1
−1  −1  −1 |  1

Hopfield network – convergence

Observe that the energy does not increase during the computation: E(y(t)) ≥ E(y(t+1)), and if the state changes in step t + 1, then E(y(t)) > E(y(t+1)). There are only finitely many states, and thus, eventually, a local minimum of E is reached. This proves that the computation of a Hopfield network always stops.
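The energies in the three-neuron example can be checked directly from the definition. A minimal sketch; the weight matrix below is the one obtained from Hebb's rule on the single pattern (1, −1, 1).

```python
import numpy as np

def energy(W, y):
    """E(y) = -1/2 * sum_j sum_i w_ji * y_j * y_i."""
    y = np.asarray(y)
    return -0.5 * y @ W @ y

# Weights of the three-neuron example: w12 = -1, w23 = -1, w13 = 1, zero diagonal.
W = np.array([[ 0, -1,  1],
              [-1,  0, -1],
              [ 1, -1,  0]])

e_stored = energy(W, [1, -1, 1])    # the stored pattern ...
e_mirror = energy(W, [-1, 1, -1])   # ... and its negation share the lowest energy
```

Both the stored pattern and its negation evaluate to E = −3, the minimum in the table, while every other state has E = 1.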
Hopfield network – phantoms

The energy function E may have local minima that do not correspond to training examples (so-called phantoms). Phantoms can be "unlearned", e.g. using the following rule: given a phantom (x1, ..., xn) ∈ {−1, 1}^n and weights wji, the new weights w'ji are computed by

w'ji = wji − xj xi

(i.e. similar to Hebb's rule but with the opposite sign).

Reproduction – statistical analysis

The capacity of a Hopfield network is defined as the ratio p/n of the number of training examples the network is able to store to the number of neurons. Assume that the training examples are chosen randomly: each component of xk is set to 1 with probability 1/2 and to −1 with probability 1/2. Consider a configuration W obtained by learning using Hebb's rule, and denote

β = P(xk = y(W, xk) for k = 1, ..., p)

Then for n → ∞ and p ≤ n/(4 log n) we have β → 1. I.e. the maximum number of examples that can be effectively stored in a Hopfield network is proportional to n/(4 log n).

Hopfield network – example

Figures of size 12 × 10 pixels (120 neurons, −1 is white and 1 is black); the network has learned 8 figures. The input is generated with 25% noise; the images show the activity of the Hopfield network. (figures)
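The unlearning rule can be sketched as follows. A minimal sketch under assumed names; for illustration only, the mirrored pattern of the three-neuron example is treated as the unwanted minimum to be removed.

```python
import numpy as np

def energy(W, y):
    y = np.asarray(y)
    return -0.5 * y @ W @ y

def unlearn(W, phantom):
    """Unlearning: w'_ji = w_ji - x_j * x_i for j != i (Hebb's rule, opposite sign)."""
    x = np.asarray(phantom)
    W2 = W - np.outer(x, x)
    np.fill_diagonal(W2, 0)   # keep the zero diagonal
    return W2

# Three-neuron example network; (-1, 1, -1) plays the role of the unwanted minimum here.
W = np.array([[ 0, -1,  1],
              [-1,  0, -1],
              [ 1, -1,  0]])
phantom = (-1, 1, -1)
W2 = unlearn(W, phantom)
```

After unlearning, the phantom's energy is strictly higher than before (it rises by (n² − n)/2), so it is no longer a deep minimum; the weight matrix stays symmetric. Note that in this tiny example unlearning the mirrored pattern also erases the stored pattern itself, since the two are negations of each other.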