Vector quantization

Assume we are given a probability density function p(x) on input vectors x ∈ R^n, i.e. assume that the inputs are randomly generated according to p(x). Our goal is to approximate p(x) using finitely many centres w_i ∈ R^n, i = 1, ..., h. Roughly speaking: we want more centres in areas of high density and fewer in areas of low density.

Formally: to every input x we assign its closest centre w_{c(x)}, where

    c(x) = arg min_{i=1,...,h} ‖x − w_i‖,

and then minimize the error

    E = ∫ ‖x − w_{c(x)}‖² p(x) dx.

Caution! c(x) depends on x.

In practice, p(x) is obtained by sampling uniformly from a given training (multi)set

    T = {x_j ∈ R^n | j = 1, ..., ℓ}.

The error then corresponds to

    E = (1/ℓ) Σ_{j=1}^{ℓ} ‖x_j − w_{c(x_j)}‖²

(keep in mind that c(x_j) = arg min_{i=1,...,h} ‖x_j − w_i‖). If T has been randomly selected according to p(x) and ℓ is large enough, then

    (1/ℓ) Σ_{j=1}^{ℓ} ‖x_j − w_{c(x_j)}‖² ≈ ∫ ‖x − w_{c(x)}‖² p(x) dx.

Example – image compression

Every pixel has 256 shades of grey, so each pair of neighbouring pixels is a two-dimensional vector from {0, ..., 255} × {0, ..., 255}. The compression finds a small set of centres that encode the shades of grey of pairs of pixels; the image is then encoded by simply substituting each pair of pixels with its centre.

[Figure: pair distribution, naive quantization, smart quantization]

Lloyd's algorithm

Assume a finite training set T = {x_j ∈ R^n | j = 1, ..., ℓ}. The algorithm moves centres towards the centres of mass of their closest points. In step t it computes w_1^{(t)}, ..., w_h^{(t)} as follows:

- For every k = 1, ..., h compute the set T_k of all vectors of T to which w_k^{(t−1)} is the closest centre:

      T_k = { x_j ∈ T | k = arg min_{i=1,...,h} ‖x_j − w_i^{(t−1)}‖ }.

- Compute w_k^{(t)} as the centre of mass of T_k:

      w_k^{(t)} = (1/|T_k|) Σ_{x ∈ T_k} x.

We may stop the computation when, e.g., the error E is sufficiently small.
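The following is a minimal NumPy sketch of Lloyd's algorithm in the notation above; the names (X for the training set T, W for the centres w_1, ..., w_h, lloyd, quantization_error) are illustrative, not from the original.

```python
import numpy as np

def lloyd(X, h, steps=100, tol=1e-6, seed=None):
    """Lloyd's algorithm: X has shape (l, n); returns h centres of shape (h, n)."""
    rng = np.random.default_rng(seed)
    # initialise the centres with h distinct training vectors
    W = X[rng.choice(len(X), size=h, replace=False)].astype(float)
    for _ in range(steps):
        # assignment step: c(x_j) = arg min_i ||x_j - w_i||
        dists = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)  # shape (l, h)
        c = dists.argmin(axis=1)
        # update step: move each centre to the centre of mass of its set T_k
        W_new = np.array([X[c == k].mean(axis=0) if np.any(c == k) else W[k]
                          for k in range(h)])
        if np.linalg.norm(W_new - W) < tol:  # stop once the centres barely move
            return W_new
        W = W_new
    return W

def quantization_error(X, W):
    """E = (1/l) * sum_j ||x_j - w_{c(x_j)}||^2"""
    dists = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    return float(np.mean(dists.min(axis=1) ** 2))
```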
Kohonen's learning

Disadvantage of Lloyd's algorithm: it is not online!

The following Kohonen's algorithm is online (i.e. the inputs may be generated one by one and the centres are adapted as they arrive). In step t, consider the input x_t and compute w_k^{(t)} as follows:

- If w_k^{(t−1)} is the closest centre to x_t, i.e. k = arg min_i ‖x_t − w_i^{(t−1)}‖, then

      w_k^{(t)} = w_k^{(t−1)} + θ · (x_t − w_k^{(t−1)}),

- otherwise

      w_k^{(t)} = w_k^{(t−1)}.

Here 0 < θ ≤ 1 determines how much to move the centre towards the input.

Let us formulate this algorithm in the language of neural networks.

Kohonen's learning – neural network

Architecture: a single layer of neurons y_1, ..., y_h; neuron k is connected to the inputs x_1, ..., x_n through the weights w_k = (w_{k1}, ..., w_{kn}).

Activity: for an input x ∈ R^n and k = 1, ..., h:

    y_k = 1 if k = arg min_{i=1,...,h} ‖x − w_i‖, and y_k = 0 otherwise.

Learning: exactly the rule above — in step t only the winning neuron (the one whose weight vector is closest to x_t) moves its weights towards the input by the factor θ; all other neurons keep their weights.
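A minimal sketch of this winner-take-all update in the same NumPy conventions (W holds the weight vectors w_1, ..., w_h as rows; the function name is illustrative):

```python
import numpy as np

def kohonen_step(W, x, theta=0.1):
    """One online step: only the winning neuron moves its weights towards x."""
    k = np.linalg.norm(W - x, axis=1).argmin()  # k = arg min_i ||x - w_i||
    W[k] += theta * (x - W[k])                  # w_k <- w_k + theta * (x - w_k)
    return k                                    # index of the winner (y_k = 1)
```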
Kohonen's learning – efficiency

Works well if most input vectors are evenly distributed in a convex area. In the case of two (or more) separated clusters, the density of centres may not correspond to p(x) at all.

Example: two separated areas with the same density. Assume that the centres are initially in one of the areas. The second area then "drags" only one of the centres (the one which always wins the competition there). Result: one of the areas ends up covered by a single centre even though it contains half of the mass of the input examples.

Solution: we tie the centres together so that they have to move together.

Kohonen's map

Architecture: a single layer, as before, with neuron k connected to the inputs through the weights w_k = (w_{k1}, ..., w_{kn}).

Topological structure: the neurons are connected by edges so that they form the nodes of an undirected graph. In most cases this structure is either a one-dimensional sequence or a two-dimensional grid.

[Figure: Kohonen's map – illustration]

[Figure: Kohonen's map – biological motivation. Source: Neural Networks – A Systematic Introduction, Raul Rojas, Springer, 1996]

Activity: given an input vector x ∈ R^n and k = 1, ..., h:

    y_k = 1 if k = arg min_{i=1,...,h} ‖x − w_i‖, and y_k = 0 otherwise.

Learning: we use the topological structure. Denote by d(c, k) the length of the shortest path from neuron c to neuron k in the topological structure. For every neuron c and a given s ∈ N_0 define the topological neighbourhood of c of size s:

    N_s(c) = {k | d(c, k) ≤ s}.

In step t, given the training example x_t, adapt w_k as follows:

    w_k^{(t)} = w_k^{(t−1)} + θ · (x_t − w_k^{(t−1)}) if k ∈ N_s(c(x_t)), and w_k^{(t)} = w_k^{(t−1)} otherwise,

where c(x_t) = arg min_{i=1,...,h} ‖x_t − w_i^{(t−1)}‖, and θ ∈ R and s ∈ N_0 are parameters that may change during training.

Kohonen's map – learning

More general version:

    w_k^{(t)} = w_k^{(t−1)} + Θ(c(x_t), k) · (x_t − w_k^{(t−1)}),

where c(x_t) = arg min_{i=1,...,h} ‖x_t − w_i^{(t−1)}‖. The previous case then corresponds to

    Θ(c(x_t), k) = θ if k ∈ N_s(c(x_t)), and 0 otherwise.

A smoother version:

    Θ(c(x_t), k) = θ_0 · exp(−d(c(x_t), k)² / σ²),

where θ_0 ∈ R is a learning rate and σ ∈ R is the width (both parameters may change during training).
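A minimal sketch of one learning step with the Gaussian neighbourhood, assuming the neurons form a two-dimensional grid; grid_distances precomputes the shortest-path distances d(c, k) (on a grid graph this is the Manhattan distance between grid coordinates). All names are illustrative.

```python
import numpy as np

def grid_distances(rows, cols):
    """d(c, k): shortest-path distances between nodes of a rows x cols grid graph."""
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])
    return np.abs(coords[:, None, :] - coords[None, :, :]).sum(axis=2)  # shape (h, h)

def som_step(W, dist, x, theta0=0.1, sigma=1.0):
    """One Kohonen-map step: every neuron moves towards x, weighted by the
    Gaussian neighbourhood Theta(c(x), k) of the winner c(x)."""
    c = np.linalg.norm(W - x, axis=1).argmin()           # winner c(x)
    Theta = theta0 * np.exp(-dist[c] ** 2 / sigma ** 2)  # Theta(c(x), k) for all k
    W += Theta[:, None] * (x - W)                        # w_k += Theta * (x - w_k)
    return c
```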
Example 1: inputs uniformly distributed in a rectangle.
Example 2: inputs uniformly distributed in a triangle.
Example 3: inputs uniformly distributed in a cuboid.
Example 4: inputs uniformly distributed in a cactus shape.
Example – defect: a topological defect – a twisted network.

Image source for the examples above: Neural Networks – A Systematic Introduction, Raul Rojas, Springer, 1996.

Kohonen's map – practical approach

According to Kohonen's paper: initial weights are not so important, but they should differ from each other.

Two-phase learning (a schedule sketch follows this list):

Coarse phase:
- approx. 1000 steps,
- learning rate θ: start with 0.1 and steadily decrease to 0.01,
- the topological neighbourhood of every neuron (determined by s or by the width σ) should be large at the beginning (i.e. contain most neurons) and should shrink to a few neurons at the end.

Fine tuning:
- number of steps: approx. 500 times the number of neurons,
- θ close to 0.01 (otherwise topological defects are likely to occur),
- the neighbourhood of each neuron should contain just a few other neurons.
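A sketch of the two-phase schedule just described, reusing the som_step and grid_distances sketches above; the linear decay and the exact step counts are illustrative assumptions, not prescriptions from the original.

```python
import numpy as np

def train_som(W, dist, X, seed=None):
    """Coarse phase with a large, shrinking neighbourhood, then fine tuning
    with a small learning rate and a small neighbourhood.
    Assumes som_step from the sketch above."""
    rng = np.random.default_rng(seed)
    h = len(W)
    # coarse phase: ~1000 steps, theta 0.1 -> 0.01, neighbourhood shrinks
    for t in range(1000):
        x = X[rng.integers(len(X))]
        theta = 0.1 - 0.09 * t / 999                  # linear decay 0.1 -> 0.01
        sigma = max(1.0, dist.max() * (1 - t / 999))  # most neurons -> a few
        som_step(W, dist, x, theta0=theta, sigma=sigma)
    # fine tuning: ~500 steps per neuron, theta close to 0.01, tiny neighbourhood
    for _ in range(500 * h):
        x = X[rng.integers(len(X))]
        som_step(W, dist, x, theta0=0.01, sigma=1.0)
    return W
```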
Kohonen's map – theory

Convergence to an "ordered" state has been proved only for one-dimensional maps and for special cases: a uniform distribution p(x), fixed neighbourhoods of size 1, and a fixed learning rate. There are simple counterexamples disproving convergence when these assumptions are not satisfied. In more than one dimension there are no guarantees at all; convergence depends on several factors:
- the initial distribution of the neurons (centres),
- the size of the neighbourhood,
- the learning rate.

What dimension to choose? Typically a one- or two-dimensional map is used (as a coarse form of dimensionality reduction).

LVQ – classification using Kohonen's map

Assume randomly generated training examples of the form (x_t, d_t) where x_t ∈ R^n is a feature vector and d_t ∈ {C_1, ..., C_q} corresponds to one of q classes. Our goal is to classify objects based on our knowledge of their features, i.e. to assign to every x_t a class so that the probability of error is minimized.

Example: a conveyor belt with fruit, apples and oranges. Formally, (x_t, d_t) where
- x_t ∈ R², the first component being the weight and the second the diameter,
- d_t is either A or O depending on whether the given object is an apple or an orange.

We allow apples and oranges with the same features. The goal is to sort the fruit based on weight and diameter.

Classification using Kohonen's map

We use Kohonen's map as follows:
1. Train the map on the feature vectors x_t, t = 1, ..., ℓ (ignore the classes for now).
2. Label the neurons with classes. The class v_c of a given neuron c is determined as follows: for every neuron c and every class C_i count the number #(c, C_i) of training examples x_t with class C_i for which the neuron c returns 1 (i.e. is the closest to them). To c, assign the class v_c = arg max_{C_i} #(c, C_i).
3. Fine tune the network using LVQ (see below).

The trained network is used as follows: given a feature vector x, evaluate the network with x as the input. A single neuron c has value 1; return v_c as the class of x.

LVQ

Iterate over the training examples. For (x_t, d_t) find the closest neuron c:

    c = arg min_{i=1,...,h} ‖x_t − w_i‖.

Adjust the weights of c as follows:

    w_c^{(t)} = w_c^{(t−1)} + α(x_t − w_c^{(t−1)}) if d_t = v_c,
    w_c^{(t)} = w_c^{(t−1)} − α(x_t − w_c^{(t−1)}) if d_t ≠ v_c.

The parameter α should be small right from the beginning (approx. 0.01–0.02) and steadily decrease to 0.
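A minimal sketch of the labelling step and the LVQ update under the notation above; the array d is assumed to hold class indices 0, ..., q−1 and v the labels v_c (all names are illustrative).

```python
import numpy as np

def label_neurons(W, X, d, q):
    """v_c = arg max_{C_i} #(c, C_i): majority class among the examples each neuron wins."""
    counts = np.zeros((len(W), q))
    for x, cls in zip(X, d):
        c = np.linalg.norm(W - x, axis=1).argmin()
        counts[c, cls] += 1                     # #(c, C_i)
    return counts.argmax(axis=1)

def lvq_step(W, v, x, cls, alpha=0.01):
    """Move the winning neuron towards x if its label matches d_t, away otherwise."""
    c = np.linalg.norm(W - x, axis=1).argmin()
    sign = 1.0 if v[c] == cls else -1.0
    W[c] += sign * alpha * (x - W[c])
    return c
```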
According to Kohonen, the border between classes should then be a good approximation of the Bayes decision boundary. What is that?

Bayes classifier

For simplicity, consider two classes C_0 and C_1 (e.g. A and O). Let P(C_i | x) be the probability that an object belongs to C_i given that it has features x. (E.g. P(A | (a, b)) is the probability that a fruit with weight a and diameter b is an apple.)

The Bayes classifier assigns to x the class C_i which satisfies P(C_i | x) ≥ P(C_{1−i} | x). Denote by R_0 the set of all x satisfying P(C_0 | x) ≥ P(C_1 | x), and let R_1 = R^n \ R_0. The Bayes classifier minimizes the error probability

    P(x ∈ R_0 ∧ C_1) + P(x ∈ R_1 ∧ C_0).

The Bayes decision boundary is the boundary between the sets R_0 and R_1.

[Figure: Bayes decision boundary vs LVQ. Image source: The Self-Organizing Map, Teuvo Kohonen, IEEE, 1990]

Oceanographic data

Source: Patterns of ocean current variability on the West Florida Shelf using the self-organizing map. Y. Liu and R. H. Weisberg, Journal of Geophysical Research, 2005.

The study investigates currents in the ocean around Florida.
- 11 measuring stations, 3 depths (surface, bottom, in between),
- data: 2D velocity vectors of the current, measured every hour, for 25585 hours.

Thus we have 25585 data samples of dimension 66. Kohonen's map:
- a 3 × 4 grid,
- neighbourhoods given by Gaussian functions Θ(c, k) = θ_0 · exp(−d(c, k)²/σ²) with shrinking width,
- linearly decreasing learning rate.

[Figure: the resulting map; crosses mark the winning neurons.] The result is influenced by local fluctuations, but there is an observable trend: in winter, neurons 1–6 win (south-east); in summer, neurons 10–12 win (north-west).

Grimm's fairy tales

Source: Contextual Relations of Words in Grimm Tales, Analyzed by Self-Organizing Map. T. Kohonen, T. Honkela and V. Pulkki, ICANN, 1995.

Our goal is to visualize syntactic and semantic categories of words in fairy tales (depending on context).

Input: Grimm's fairy tales, encoded as a stream of 270-dimensional vectors:
- triples of words (predecessor, key, successor),
- every component of a triple encoded using a randomly generated 90-dimensional real vector.

Network: Kohonen's map, 42 × 36 neurons, weights of the form w = (w_p, w_k, w_n) where w_p, w_k, w_n ∈ R^90.

Learning:
- trained on triples of successive words in the fairy tales,
- the training set consisted of the 150 most common words, each with an "average" context,
- coarse training: 600 000 iterations; fine tuning: 400 000 iterations.

In the end, the 150 most common words were used to label the neurons: a word u labels a neuron with weights w = (w_p, w_k, w_n) when w_k is closest to the code of u.

Great summary – models

We have considered several models of neural networks:
- ADALINE (aka linear regression)
- Multilayer Perceptron
- Hopfield Networks
- Restricted Boltzmann Machines and Deep Belief Networks
- Convolutional Networks
- Recurrent Networks (LSTM)
- Kohonen's Maps

Great summary – algorithms

Gradient descent! The only exceptions were Kohonen's maps (Kohonen learning) and Hopfield networks (Hebb's learning). The gradient is computed using backpropagation: MLP, convolutional and recurrent (LSTM) networks. Simulations: RBM.
Deeper thoughts

Most neural network models are universal approximators (i.e. capable of approximating any reasonable function), but it is difficult to find the appropriate configuration → such a configuration can be learned efficiently (without guarantees, of course).

Depth is stronger than size: deep networks are more succinct in their representation but are harder to train. Do not forget the vanishing/exploding gradient problem!

The way backprop is derived: unify all neurons using indices; backprop for the individual models then differs very little, only in the specification of neurons with tied weights!

Weight tying = the single most effective trick in the history of neural networks!