PV021: Neural networks
Tomáš Brázdil

Course organization

Course materials:
- Main: the lecture.
- Neural Networks and Deep Learning by Michael Nielsen, http://neuralnetworksanddeeplearning.com/ (an extremely well written modern online textbook)
- Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville, http://www.deeplearningbook.org/ (a very good overview of the state of the art in neural networks)

Evaluation:
- Project: teams of two students; implementation of a selected model + analysis of given data; implementation in C, C++, or Java, without the use of any specialized libraries for data analysis and machine learning; you need to get over a given accuracy threshold (a gentle one, just to eliminate non-functional implementations).
- Oral exam: I may ask about anything from the lecture, including some proofs that occur only on the whiteboard!
- Application of any deep learning toolset to given (difficult) data: we prefer TensorFlow, but you may use another library (CNTK, Caffe, DeepLearning4j, ...). The goal is to get the best results on increasingly difficult datasets. The team with the best result on the hardest dataset will automatically get a grade better than F at the exam.

FAQ

Q: Why English?
A: A couple of reasons. First, all resources about modern neural nets are in English; it is rather cumbersome to translate everything into Czech (a combination of Czech and English is ugly). Second, to attract non-Czech speaking students to the course.

Q: Why can we not use specialized libraries in the projects?
A: In order to "touch" the low-level implementation details of the algorithms. You should not even use libraries for linear algebra and numerical methods, so that you will be confronted with rounding errors and numerical instabilities.

Machine learning in general

Machine learning = construction of systems that may learn their functionality from data (... and thus do not need to be programmed):
- A spam filter learns to recognize spam from a database of "labelled" emails, and is consequently able to distinguish spam from ham.
- A handwritten text reader learns from a database of handwritten letters (or texts) labelled by their correct meaning, and is consequently able to recognize text.
- ... and lots of much more sophisticated applications.

Basic attributes of learning algorithms:
- representation: the ability to capture the inner structure of the training data
- generalization: the ability to work properly on new data

Machine learning algorithms typically construct mathematical models of given data. The models may subsequently be applied to fresh data. There are many types of models:
- decision trees
- support vector machines
- hidden Markov models
- Bayesian networks and other graphical models
- neural networks
- ...
Neural networks, based on models of a (human) brain, form a natural basis for learning algorithms!

Artificial neural networks

An artificial neuron is a rough mathematical approximation of a biological neuron. An (artificial) neural network (NN) consists of a number of interconnected artificial neurons. The "behavior" of the network is encoded in the connections between the neurons.

[Figure: a neuron with inputs x_1, ..., x_n, inner potential ξ, activation σ, output y. Image source: http://tulane.edu/sse/cmb/people/schrader/]

Why artificial neural networks?

Modelling of biological neural networks (computational neuroscience):
- Simplified mathematical models help to identify important mechanisms: How does a brain receive information? How is the information stored? How does a brain develop? ...
- Neuroscience is strongly multidisciplinary; precise mathematical descriptions help in communication among experts and in the design of new experiments.
I will not spend much time on this area!

Neural networks in machine learning:
- Typically primitive models, far from their biological counterparts (but often inspired by biology).
- Strongly oriented towards concrete application domains:
  - decision making and control: autonomous vehicles, manufacturing processes, control of natural resources
  - games: backgammon, poker, Go
  - finance: stock prices, risk analysis
  - medicine: diagnosis, signal processing (ECG, EEG, ...), image processing (MRI, X-ray, ...)
  - text and speech processing: automatic translation, text generation, speech recognition
  - other signal processing: filtering, radar tracking, noise reduction
  - ...
I will concentrate on this area!

Important features of neural networks

- Massive parallelism: many slow (and "dumb") computational elements work in parallel on several levels of abstraction.
- Learning: a kid learns to recognize a rabbit after seeing several rabbits.
- Generalization: a kid is able to recognize a new rabbit after seeing several (old) rabbits.
- Robustness: a blurred photo of a rabbit may still be classified as a picture of a rabbit.
- Graceful degradation: experiments have shown that a damaged neural network may still work quite well; a damaged network may re-adapt, with the remaining neurons taking over the functionality of the damaged ones.

The aim of the course

We will concentrate on basic techniques and principles of neural networks, fundamental models of neural networks and their applications.
You should learn:
- basic models (multilayer perceptron, convolutional networks, recurrent networks (LSTM), Hopfield and Boltzmann machines and their use in pre-training of deep nets)
- standard applications of these models (image processing, speech and text processing)
- basic learning algorithms (gradient descent & backpropagation, Hebb's rule)
- basic practical training techniques (data preparation, setting various parameters, control of learning)
- basic information about current implementations (TensorFlow, CNTK)

Biological neural network

The human neural network consists of approximately 10^11 (100 billion on the short scale) neurons; a single cubic centimeter of a human brain contains almost 50 million neurons. Each neuron is connected with approx. 10^4 neurons. Neurons themselves are very complex systems.

Rough description of the nervous system:
- An external stimulus is received by sensory receptors (e.g. eye cells).
- The information is further transferred via the peripheral nervous system (PNS) to the central nervous system (CNS), where it is processed (integrated) and, subsequently, an output signal is produced.
- Afterwards, the output signal is transferred via the PNS to effectors (e.g. muscle cells).

[Figure: the nervous system. Source: N. Campbell and J. Reece; Biology, 7th Edition; ISBN 080537146X]

Biological neuron
[Figure. Source: http://www.web-books.com/eLibrary/Medicine/Physiology/Nervous/Nervous.htm]

Synaptic connections, action potential, summation, biological and mathematical neurons: [figures only]

Formal neuron (without bias)

- x_1, ..., x_n ∈ R are inputs
- w_1, ..., w_n ∈ R are weights
- ξ is an inner potential; almost always ξ = ∑_{i=1}^n w_i x_i
- y is an output given by y = σ(ξ), where σ is an activation function, e.g. a unit step function: σ(ξ) = 1 for ξ ≥ h and σ(ξ) = 0 for ξ < h, where h ∈ R is a threshold.

Formal neuron (with bias)

- x_0 = 1, x_1, ..., x_n ∈ R are inputs
- w_0, w_1, ..., w_n ∈ R are weights
- ξ is an inner potential; almost always ξ = w_0 + ∑_{i=1}^n w_i x_i
- y is an output given by y = σ(ξ), where σ is an activation function, e.g. a unit step function: σ(ξ) = 1 for ξ ≥ 0 and σ(ξ) = 0 for ξ < 0.
(The threshold h has been substituted with the new input x_0 = 1 and the weight w_0 = −h. A small code sketch of this neuron follows the XOR example below.)

Neuron and linear separation

The inner potential ξ = w_0 + ∑_{i=1}^n w_i x_i determines a separation hyperplane ξ = 0 in the n-dimensional input space: a line in 2D, a plane in 3D, ...

Example: a single neuron classifying images of letters as A (output 1) or B (output 0). Here n = 8 · 8, i.e. the number of pixels in the images; inputs are binary vectors of dimension n (black pixel ≈ 1, white pixel ≈ 0).

[Figure: points of classes A and B in the plane. The red line w̄_0 + ∑_{i=1}^n w̄_i x_i = 0 classifies incorrectly; the green line w_0 + ∑_{i=1}^n w_i x_i = 0 classifies correctly (it may be the result of a correction by a learning algorithm).]

Neuron and linear separation (XOR)

Consider the points (0,0) and (1,1) with value 0, and (0,1) and (1,0) with value 1. No line separates the ones from the zeros.
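A formal neuron is easy to state in code. The following Python sketch (an illustration added here, not part of the original slides) implements the neuron with bias and the unit step activation; the weight choices shown realize AND and OR:

```python
def step(xi):
    # unit step activation: 1 if the inner potential is >= 0, else 0
    return 1 if xi >= 0 else 0

def neuron(w0, w, x):
    # inner potential: xi = w0 + sum_i w_i * x_i
    xi = w0 + sum(w_i * x_i for w_i, x_i in zip(w, x))
    return step(xi)

# weights (1, 1) with bias -2 (threshold h = 2) implement AND;
# weights (1, 1) with bias -1 (threshold h = 1) implement OR
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, neuron(-2, (1, 1), x), neuron(-1, (1, 1), x))
```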
Neural networks

A neural network consists of formal neurons interconnected in such a way that the output of one neuron is an input of several other neurons. In order to describe a particular type of neural network we need to specify:
- Architecture: how the neurons are connected.
- Activity: how the network transforms inputs to outputs.
- Learning: how the weights are changed during training.

Architecture

The network architecture is given as a digraph whose nodes are neurons and whose edges are connections. We distinguish several categories of neurons:
- output neurons
- hidden neurons
- input neurons
(In general, a neuron may be both input and output; a neuron is hidden if it is neither input nor output.)

Architecture – Cycles

A network is cyclic (recurrent) if its architecture contains a directed cycle; otherwise it is acyclic (feed-forward).

Architecture – Multilayer Perceptron (MLP)

- Neurons are partitioned into layers: one input layer, one output layer, possibly several hidden layers.
- Layers are numbered from 0; the input layer has number 0. E.g. a three-layer network has two hidden layers and one output layer.
- Neurons in the i-th layer are connected with all neurons in the (i+1)-st layer.
- The architecture of an MLP is typically described by the numbers of neurons in the individual layers (e.g. 2-4-3-2).

Activity

Consider a network with n neurons, k of them input and ℓ of them output.
- The state of a network is a vector of the output values of all neurons. (The states of a network with n neurons are vectors of R^n.) The state-space of a network is the set of all states.
- The network input is a vector of k real numbers, i.e. an element of R^k. The network input space is the set of all network inputs. (Sometimes we restrict ourselves to a proper subset of R^k.)
- Initial state: input neurons are set to the values of the network input (each component of the network input corresponds to an input neuron); the values of the remaining neurons are set to 0.

Activity – computation of a network

Computation (typically) proceeds in discrete steps.
In every step the following happens:
1. A set of neurons is selected according to some rule.
2. The selected neurons change their states according to their inputs (they are simply evaluated). (If a neuron does not have any inputs, its value remains constant.)

A computation is finite on a network input x if the state changes only finitely many times (i.e. there is a moment in time after which the state of the network never changes). We also say that the network stops on x.

The network output is a vector of the values of all output neurons in the network (i.e. an element of R^ℓ). Note that the network output keeps changing throughout the computation!

MLP uses the following selection rule: in the i-th step, evaluate all neurons in the i-th layer.

Activity – semantics of a network

Definition: Consider a network with n neurons, k input and ℓ output. Let A ⊆ R^k and B ⊆ R^ℓ. Suppose that the network stops on every input of A. Then we say that the network computes a function F : A → B if for every network input x the vector F(x) ∈ B is the output of the network after the computation on x stops.

Example: the network in the figure computes a function from R^2 to R.
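The MLP selection rule reads directly as an algorithm: evaluate the layers in order, feeding each layer's outputs to the next. A minimal Python sketch (my own rendering; the weight-matrix layout is an assumption, with the bias stored as the first entry of each row):

```python
def step(xi):
    return 1 if xi >= 0 else 0

def mlp_forward(layers, x, activation=step):
    # Evaluate an MLP layer by layer (the MLP selection rule: in the
    # i-th step evaluate all neurons of the i-th layer).  `layers` is a
    # list of weight matrices; row j of a matrix is [w0, w1, ..., wn]
    # for neuron j of that layer, w0 being the bias weight.
    state = list(x)
    for weights in layers:
        state = [activation(row[0] + sum(w * s for w, s in zip(row[1:], state)))
                 for row in weights]
    return state  # final state of the output layer

# the AND neuron from before, viewed as a one-layer 2-1 "network"
print(mlp_forward([[[-2, 1, 1]]], (1, 1)))   # [1]
```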
Activity – inner potential and activation functions

In order to specify the activity of the network, we need to specify how the inner potentials ξ are computed and what the activation functions σ are.

We assume (unless otherwise specified) that
ξ = w_0 + ∑_{i=1}^n w_i · x_i
where x = (x_1, ..., x_n) are the inputs of the neuron and w = (w_1, ..., w_n) are the weights.

There are special types of neural networks where the inner potential is computed differently, e.g. as a "distance" of the input from the weight vector:
ξ = ‖x − w‖
where ‖·‖ is a vector norm, typically Euclidean.

There are many activation functions; typical examples:
- Unit step function: σ(ξ) = 1 for ξ ≥ 0, σ(ξ) = 0 for ξ < 0.
- (Logistic) sigmoid: σ(ξ) = 1 / (1 + e^{−λ·ξ}), where λ ∈ R is a steepness parameter.
- Hyperbolic tangent: σ(ξ) = (1 − e^{−ξ}) / (1 + e^{−ξ}).
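For reference, the three activation functions transcribed into Python (a direct transcription of the formulas above; note that (1 − e^{−ξ})/(1 + e^{−ξ}) equals tanh(ξ/2)):

```python
import math

def unit_step(xi):
    return 1 if xi >= 0 else 0

def logistic_sigmoid(xi, lam=1.0):
    # sigma(xi) = 1 / (1 + e^(-lambda * xi)); lam is the steepness parameter
    return 1.0 / (1.0 + math.exp(-lam * xi))

def hyperbolic_tangent(xi):
    # sigma(xi) = (1 - e^(-xi)) / (1 + e^(-xi)), i.e. tanh(xi / 2)
    return (1.0 - math.exp(-xi)) / (1.0 + math.exp(-xi))
```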
Activity – XOR

Consider the two-layer network from the figure: two hidden neurons, one with weights (w_0, w_1, w_2) = (−1, 2, 2), the other with (3, −2, −2), and an output neuron with weights (−2, 1, 1). The activation function of every neuron is the unit step function σ(ξ) = 1 for ξ ≥ 0, σ(ξ) = 0 for ξ < 0. Stepping through the computation for all four inputs (as the slides do) shows that the network computes XOR(x_1, x_2):

x_1  x_2 | y
 1    1  | 0
 1    0  | 1
 0    1  | 1
 0    0  | 0

Activity – MLP and linear separation

In the input space, the two hidden neurons correspond to two separating lines:
- the line P_1 is given by −1 + 2x_1 + 2x_2 = 0,
- the line P_2 is given by 3 − 2x_1 − 2x_2 = 0.
The output neuron fires exactly for the points on the positive sides of both lines, i.e. for (0,1) and (1,0).
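The XOR computation can be checked mechanically; a self-contained Python sketch with the weights read off the figure:

```python
def step(xi):
    return 1 if xi >= 0 else 0

def xor_net(x1, x2):
    y1 = step(-1 + 2 * x1 + 2 * x2)   # P1: fires unless both inputs are 0 (OR)
    y2 = step(3 - 2 * x1 - 2 * x2)    # P2: fires unless both inputs are 1 (NAND)
    return step(-2 + y1 + y2)         # fires iff both hidden neurons fire (AND)

for x1, x2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(x1, x2, xor_net(x1, x2))    # prints the XOR truth table
```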
Activity – example

Consider the cyclic network from the figure, with a single input (set to 1) and the unit step activation function σ(ξ) = 1 for ξ ≥ 0, σ(ξ) = 0 for ξ < 0. Stepping through the computation, the states of the three neurons evolve as (0,0,0) → (1,0,0) → (1,1,0) → (1,1,1) → (0,1,1) → ...; the computation of a cyclic network may never stop.

Learning

Consider a network with n neurons, k input and ℓ output.
- The configuration of a network is a vector of all the weight values. (The configurations of a network with m connections are elements of R^m.) The weight-space of a network is the set of all configurations.
- Initial configuration: weights can be initialized randomly or using some sophisticated algorithm.

Learning algorithms

A learning rule is a rule for weight adaptation. (The goal is to find a configuration in which the network computes a desired function.)
- Supervised learning: the desired function is described using training examples, which are pairs of the form (input, output). The learning algorithm searches for a configuration which "corresponds" to the training examples, typically by minimizing an error function.
- Unsupervised learning: the training set contains only inputs. The goal is to determine the distribution of the inputs (clustering, deep belief networks, etc.).

Supervised learning – illustration

Classification in the plane using a single neuron:
- training examples are of the form (point, value), where the value is either 1 or 0 depending on whether the point is of class A or B;
- the algorithm considers the examples one after another; whenever an incorrectly classified point is considered, the learning algorithm turns the separating line in the direction of the point.

Unsupervised learning – illustration

We search for two centres of clusters. In the figure, the red crosses correspond to potential centres before the application of the learning algorithm, the green ones after the application.

Summary – Advantages of neural networks

- Massive parallelism: neurons can be evaluated in parallel.
- Learning: many sophisticated learning algorithms are used to "program" neural networks.
- Generalization and robustness: information is encoded in a distributed manner in the weights; "close" inputs typically get similar values.
- Graceful degradation: damage typically causes only a decrease in the precision of results.

Boolean functions

Throughout this part, the activation function is the unit step function σ(ξ) = 1 for ξ ≥ 0, σ(ξ) = 0 for ξ < 0 (recall the formal neuron with bias).

Single neurons implement the basic Boolean functions:
- y = AND(x_1, ..., x_n): weights w_1 = ... = w_n = 1, bias w_0 = −n;
- y = OR(x_1, ..., x_n): weights w_1 = ... = w_n = 1, bias w_0 = −1;
- y = NOT(x_1): weight w_1 = −1, bias w_0 = 0.

Theorem: Let σ be the unit step function. Two-layer MLPs, where each neuron has σ as the activation function, are able to compute all functions of the form F : {0, 1}^n → {0, 1}.

Proof: Given a vector v = (v_1, ..., v_n) ∈ {0, 1}^n, consider a neuron N_v whose output is 1 iff the input is v, with weights
w_i = 1 if v_i = 1, w_i = −1 if v_i = 0, and w_0 = −∑_{i=1}^n v_i w_i.
Now connect the outputs of all neurons N_v satisfying F(v) = 1 using a neuron implementing OR. □ (A code sketch of this construction follows.)
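The proof is constructive and can be turned into code. A Python sketch of the construction (my own rendering of the proof; like the proof, it is exponential in n):

```python
from itertools import product

def step(xi):
    return 1 if xi >= 0 else 0

def two_layer_net(F, n):
    """Build the two-layer network from the proof for F: {0,1}^n -> {0,1}."""
    hidden = []
    for v in product([0, 1], repeat=n):
        if F(v):
            w = [1 if vi == 1 else -1 for vi in v]       # w_i as in the proof
            w0 = -sum(vi * wi for vi, wi in zip(v, w))   # N_v fires exactly on v
            hidden.append((w0, w))
    def net(x):
        ys = [step(w0 + sum(wi * xi for wi, xi in zip(w, x))) for w0, w in hidden]
        return step(-1 + sum(ys))                        # the OR neuron
    return net

xor = two_layer_net(lambda v: v[0] ^ v[1], 2)
print([xor(v) for v in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # [0, 1, 1, 0]
```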
Non-linear separation

Consider a three-layer network in which each neuron has the unit step activation function. The network divides the input space into two subspaces according to the output (0 or 1):
- The first (hidden) layer divides the input space into half-spaces.
- The second layer may e.g. make intersections of the half-spaces ⇒ convex sets.
- The third layer may e.g. make unions of some convex sets.

Non-linear separation – illustration

Three-layer networks (each neuron with the unit step activation) are capable of "approximating" any "reasonable" subset A of the input space R^k:
- Cover A with hypercubes (in 2D squares, in 3D cubes, ...).
- Each hypercube K can be separated using a two-layer network N_K (i.e. the function computed by N_K gives 1 for points in K and 0 for the rest).
- Finally, connect the outputs of the nets N_K satisfying K ∩ A ≠ ∅ using a neuron implementing OR.

Non-linear separation – sigmoid

Theorem (Cybenko 1989 – informal version): Let σ be a continuous function which is sigmoidal, i.e. satisfies σ(x) → 1 for x → +∞ and σ(x) → 0 for x → −∞. For every reasonable set A ⊆ [0, 1]^n there is a two-layer network, where each hidden neuron has the activation function σ (output neurons are linear), that satisfies the following: for most vectors v ∈ [0, 1]^n we have that v ∈ A iff the network output is > 0 for the input v.
For the mathematically oriented:
- "reasonable" means Lebesgue measurable;
- "most" means that the set of incorrectly classified vectors has Lebesgue measure smaller than a given ε > 0.

Non-linear separation – practical illustration

ALVINN drives a car:
- The net has 30 × 32 = 960 inputs (the input space is thus R^960).
- Input values correspond to the shades of gray of pixels.
- Output neurons "classify" images of the road based on their "curvature".
[Image source: http://jmvidal.cse.sc.edu/talks/ann/alvin.html]

Function approximation – three layers

Let σ be the logistic sigmoid, i.e. σ(ξ) = 1 / (1 + e^{−ξ}). For every continuous function f : [0, 1]^n → [0, 1] and every ε > 0 there is a three-layer network computing a function F : [0, 1]^n → [0, 1] such that
- there is a linear activation in the output layer, i.e. the value of the output neuron is its inner potential ξ,
- the remaining neurons have the logistic sigmoid σ as their activation,
- for every v ∈ [0, 1]^n we have |F(v) − f(v)| < ε.

[Figure: a three-layer network computing a weighted sum of localized "spikes"; a "spike" is built from steep sigmoids plus the other two 90-degree rotations; hidden activations σ(ξ) = 1/(1 + e^{−ξ}), output activation ζ(ξ) = ξ. A one-dimensional code illustration follows below.]

Function approximation – two-layer networks

Theorem (Cybenko 1989): Let σ be a continuous function which is sigmoidal, i.e. is increasing and satisfies σ(x) → 1 for x → +∞ and σ(x) → 0 for x → −∞. For every continuous function f : [0, 1]^n → [0, 1] and every ε > 0 there is a function F : [0, 1]^n → [0, 1] computed by a two-layer network, where each hidden neuron has the activation function σ (output neurons are linear), satisfying
|f(v) − F(v)| < ε for every v ∈ [0, 1]^n.
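One way to build intuition for these approximation theorems (an illustration of the "spike" idea from the figure above, not a proof): the difference of two steep sigmoids is a localized bump, and a weighted sum of such bumps traces out a target function. A one-dimensional Python sketch:

```python
import math

def sigmoid(xi):
    return 1.0 / (1.0 + math.exp(-xi))

def bump(x, a, b, steepness=50.0):
    # difference of two steep sigmoids: roughly 1 on [a, b], roughly 0 outside
    return sigmoid(steepness * (x - a)) - sigmoid(steepness * (x - b))

def approx(x, pieces=20):
    # crude approximation of f(x) = x^2 on [0, 1] by a weighted sum of bumps
    total = 0.0
    for j in range(pieces):
        a, b = j / pieces, (j + 1) / pieces
        c = ((a + b) / 2) ** 2          # value of f at the midpoint of the piece
        total += c * bump(x, a, b)
    return total

print(approx(0.5), 0.5 ** 2)   # close to 0.25
```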
Neural networks and computability

Consider recurrent networks (i.e. containing cycles)
- with real weights (in general);
- with one input neuron and one output neuron (the network computes a function F : A → R, where A ⊆ R contains all inputs on which the network stops);
- with the parallel activity rule (the output values of all neurons are recomputed in every step);
- with the saturated linear activation function σ(ξ) = 1 for ξ ≥ 1, σ(ξ) = ξ for 0 ≤ ξ < 1, and σ(ξ) = 0 for ξ < 0.

We encode words ω ∈ {0, 1}^+ into numbers as follows:
δ(ω) = ∑_{i=1}^{|ω|} ω(i)/2^i + 1/2^{|ω|+1}
E.g. ω = 11001 gives δ(ω) = 1/2 + 1/2^2 + 1/2^5 + 1/2^6 (= 0.110011 in binary form; checked in code below).

A network recognizes a language L ⊆ {0, 1}^+ if it computes a function F : A → R (A ⊆ R) such that ω ∈ L iff δ(ω) ∈ A and F(δ(ω)) > 0.

Recurrent networks with rational weights are equivalent to Turing machines:
- For every recursively enumerable language L ⊆ {0, 1}^+ there is a recurrent network with rational weights and fewer than 1000 neurons which recognizes L.
- The halting problem is undecidable for networks with at least 25 neurons and rational weights.
- There is a "universal" network (an equivalent of the universal Turing machine).

Recurrent networks are super-Turing powerful:
- For every language L ⊆ {0, 1}^+ there is a recurrent network (with real weights) with fewer than 1000 neurons which recognizes L.
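The encoding δ used above can be checked exactly with Python fractions (an added illustration of the definition):

```python
from fractions import Fraction

def delta(word):
    # delta(w) = sum_i w(i)/2^i + 1/2^(|w|+1)
    value = sum(Fraction(int(bit), 2 ** (i + 1)) for i, bit in enumerate(word))
    return value + Fraction(1, 2 ** (len(word) + 1))

print(delta("11001"))   # 51/64, i.e. 0.110011 in binary
```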
Summary of theoretical results

Neural networks are very strong from the point of view of theory:
- All Boolean functions can be expressed using two-layer networks.
- Two-layer networks may approximate any continuous function.
- Recurrent networks are at least as strong as Turing machines.

These results are purely theoretical! "Theoretical" networks are extremely huge, and it is very difficult to handcraft them even for the simplest problems. From the practical point of view, the most important advantages of neural networks are: learning, generalization, robustness.

Neural networks vs classical computers

             | Neural networks                               | Classical computers
Data         | implicitly in weights                         | explicitly
Computation  | naturally parallel                            | sequential, localized
Robustness   | robust w.r.t. input corruption & damage       | changing one bit may completely crash the computation
Precision    | imprecise; the network recalls a training example "similar" to the input | (typically) precise
Programming  | learning                                      | manual

History of neurocomputers

1951: SNARC (Minsky et al.)
- the first implementation of a neural network
- a rat strives to exit a maze
- 40 artificial neurons (300 vacuum tubes, engines, etc.)

1957: Mark I Perceptron (Rosenblatt et al.) – the first successful network for image recognition
- single-layer network
- images represented by 20 × 20 photocells
- the intensity of pixels was treated as input to a perceptron (basically the formal neuron), which recognized figures
- weights were implemented using potentiometers, each set by its own engine
- it was possible to arbitrarily reconnect inputs to neurons to demonstrate adaptability

1960: ADALINE (Widrow & Hoff)
- single-layer neural network
- weights stored in a newly invented electronic component, the memistor, which remembers the history of electric current in the form of resistance
- Widrow founded a company, Memistor Corporation, which sold implementations of neural networks

1960-66: several companies concerned with neural networks were founded.

1967-82: dead still after the publication of the book by Minsky & Papert (published 1969, titled Perceptrons).

1983-end of the 90s: revival of neural networks
- many attempts at hardware implementations: application-specific chips (ASIC), programmable hardware (FPGA)
- hw implementations typically not better than "software" implementations on universal computers (problems with weight storage, size, speed, cost of production, etc.)

end of the 90s-cca 2005: NN suppressed by other machine learning methods (support vector machines (SVM))

2006-now: the boom of neural networks!
- deep networks – often better than any other method
- GPU implementations
- ... some specialized hw implementations (Google's TPU)

History in waves ...
[Figure: two of the three historical waves of artificial neural nets research, as measured by the frequency of the phrases "cybernetics" and "connectionism" or "neural networks" according to Google Books (the third wave is too recent to appear).]

Current hardware – What do we face?

Increasing dataset size ... and thus increasing size of neural networks:
2. ADALINE
4. Early back-propagation network (Rumelhart et al., 1986b)
8. Image recognition: LeNet-5 (LeCun et al., 1998b)
10. Dimensionality reduction: deep belief network (Hinton et al., 2006) ... here the third "wave" of neural networks started
15. Digit recognition: GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
18. Image recognition (AlexNet): multi-GPU convolutional network (Krizhevsky et al., 2012)
20. Image recognition: GoogLeNet (Szegedy et al., 2014a)

... as a reward we get this:
[Figure: since deep networks reached the scale necessary to compete in the ImageNet Large Scale Visual Recognition Challenge, they have consistently won the competition every year and yielded lower and lower error rates each time. Data from Russakovsky et al. (2014b) and He et al. (2015).]

Current hardware

In 2012, Google trained a large network with 1.7 billion weights and 9 layers.
- The task was image recognition (10 million YouTube video frames).
- The hw comprised a network of 1,000 computers (16,000 cores); the computation took three days.

In 2014, a similar task was performed on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI.
- Able to train networks with 1 billion parameters on just 3 machines in a couple of days.
- Able to scale to 11 billion weights (approx. 6.5 times larger than the Google model) on 16 GPUs.

Current hardware – NVIDIA DGX Station
- 4x GPU (Tesla V100), 480 TFLOPS
- GPU memory: 64 GB total
- NVIDIA Tensor Cores: 2,560; NVIDIA CUDA Cores: 20,480
- System memory: 256 GB
- Network: dual 10 Gb LAN
- NVIDIA Deep Learning SDK

Current software

- TensorFlow (Google): an open source software library for numerical computation using data flow graphs; allows implementation of most current neural networks; allows computation on multiple devices (CPUs, GPUs, ...); Python API. Keras: a library on top of TensorFlow that allows easy description of most modern neural networks.
- CNTK (Microsoft): functionality similar to TensorFlow; a special input language called BrainScript.
- Theano: the "academic" grand-daddy of deep-learning frameworks, written in Python. Strongly inspired TensorFlow (some people developing Theano moved on to develop TensorFlow).
- There are others: Caffe, Torch (Facebook), Deeplearning4j, ...

Current software – Keras
[The slide shows a Keras code example; an illustrative stand-in follows.]

Other software implementations

Most "mathematical" software packages contain some support for neural networks: MATLAB, STATISTICA, Weka, ... The implementations are typically not on par with the previously mentioned dedicated deep-learning libraries.
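The original Keras slide is a code screenshot that did not survive extraction; as a stand-in, here is a small illustrative Keras model (a generic MLP for MNIST-sized data, not the code from the slide; Keras 2 API):

```python
from keras.models import Sequential
from keras.layers import Dense

# a small MLP: 784 inputs, one hidden layer, 10-class softmax output
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dense(10, activation='softmax'))

model.compile(optimizer='sgd',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# model.fit(x_train, y_train, batch_size=128, epochs=5)  # given suitable data
```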
SyNAPSE (USA)

- A big research project, partially funded by DARPA. Among the main subjects are IBM and HRL, in collaboration with top US universities, e.g. Boston, Stanford.
- The project started in 2008; tens of millions USD have been invested.
- The goal: develop a neural network comparable with the real brain of a mammal. The resulting hw chip should simulate 10 billion neurons and 100 trillion synaptic connections, consume 1 kilowatt (∼ a small heater), with a size of 2 dm³.
- Oriented towards the development of a new parallel computer architecture rather than neuroscience.

SyNAPSE (USA) – some results

A cat brain simulation (2009):
- A simulation of a network with 10^9 neurons and 10^13 synapses.
- Simulated on the supercomputer Dawn (Blue Gene/P): 147,450 CPUs, 144 TB of memory.
- 643 times slower than the real brain.
- The network was modelled according to the real-brain structure (a hierarchical model of a visual cortex, 4 layers).
- The authors claim that they observed some behaviour similar to the behaviour of the real brain (signal propagation, α, γ waves).
- ... the simulation was heavily criticised (see later) ...
- ... in 2012 the number of neurons was increased to 530 billion neurons and 100 trillion synapses.

SyNAPSE (USA) – TrueNorth

- A chip with 5.4 billion elements.
- 4096 neurosynaptic cores connected by a network, implementing 1 million programmable "spiking" neurons and 256 million programmable synaptic connections.
- Global frequency 1 kHz; low energy consumption, approx. 63 mW.
- Offline learning; implements some known algorithms (convolutional networks, RBM, etc.).
- Applied to simple image recognition tasks.

Human Brain Project, HBP (Europe)

- Funded by the EU; budget 10^9 EUR for 10 years.
- Successor of the Blue Brain Project at EPFL Lausanne: Blue Brain started in 2005 and ended in 2012; the Human Brain Project started in 2013.
- The original goal: deeper understanding of the human brain, networking in neuroscience, diagnosis of brain diseases, a thinking machine.
- The approach: study of brain tissue using current technology, modelling of biological neurons, simulation of the models (the program NEURON).

HBP (Europe)

Blue Brain Project (2008):
- A model of a part of the brain cortex of a rat (approx. 10,000 neurons), with a much more complex model of neurons than in SyNAPSE.
- Simulated on a supercomputer of the type Blue Gene/P (provided by IBM at a discount): 16,384 CPUs, 56 teraflops, 16 terabytes of memory, 1 PB of disk space.
- The simulation was 300x slower than the real brain.
Human Brain Project (2015): a simplified model of the nervous system of a rat (approx. 200,000 neurons).

SyNAPSE vs HBP

2011: "IBM Simulates 4.5 percent of the Human Brain, and All of the Cat Brain" (Scientific American)
"... performed the first near real-time cortical simulation of the brain that exceeds the scale of a cat cortex" (IBM)
This announcement was heavily criticised by Dr. Markram, the head of HBP:
- "This is a mega public relations stunt – a clear case of scientific deception of the public."
- "Their so called 'neurons' are the tiniest of points you can imagine, a microscopic dot."
- "Neurons contain 10's of thousands of proteins that form a network with 10's of millions of interactions. These interactions are incredibly complex and will require solving millions of differential equations. They have none of that."
- "Eugene Izhikevich himself already in 2005 ran a simulation with 100 billion such points interacting just for the fun of it (over 60 times larger than Modha's simulation)."
- Why did they get the Gordon Bell Prize? "They seem to have been very successful in influencing the committee with their claim, which technically is not peer-reviewed by the respective community and is neuroscientifically outrageous."
- But is there any innovation here? "The only innovation here is that IBM has built a large supercomputer."
- But did Modha not collaborate with neuroscientists? "I would be very surprised if any neuroscientists that he may have had in his DARPA consortium realized he was going to make such an outrageous claim. I can't imagine that the San Francisco neuroscientists knew he was going to make such a stupid claim. Modha himself is a software engineer with no knowledge of the brain."

... and in the meantime in Europe

In 2014, the European Commission received an open letter signed by more than 130 heads of laboratories demanding a substantial change in the management of the whole project.
Peter Dayan, director of the computational neuroscience unit at UCL:
“The main apparent goal of building the capacity to construct a larger-scale simulation of the human brain is radically premature.”
“We are left with a project that can’t but fail from a scientific perspective. It is a waste of money, it will suck out funds from valuable neuroscience research, and would leave the public, who fund this work, justifiably upset.”
76
... and in 2016
The European Commission and the Human Brain Project Coordinator, the École Polytechnique Fédérale de Lausanne (EPFL), have signed the first Specific Grant Agreement (SGA1), releasing EUR 89 million in funding retroactively from 1st April 2016 until the end of March 2018. The signature of SGA1 means that the HBP and the European Commission have agreed on the HBP Work Plan for this two-year period. The SGA1 work plan will move the Project closer to achieving its aim of establishing a cutting-edge, ICT-based scientific Research Infrastructure for brain research, cognitive neuroscience and brain-inspired computing.
77
ADALINE
Architecture:
(Diagram: inputs $x_1, \ldots, x_n$ together with the formal unit input $x_0 = 1$, weights $w_0, w_1, \ldots, w_n$, and a single output $y$.)
$\vec{w} = (w_0, w_1, \ldots, w_n)$ and $\vec{x} = (x_0, x_1, \ldots, x_n)$ where $x_0 = 1$.
Activity:
inner potential: $\xi = w_0 + \sum_{i=1}^{n} w_i x_i = \sum_{i=0}^{n} w_i x_i = \vec{w} \cdot \vec{x}$
activation function: $\sigma(\xi) = \xi$
network function: $y[\vec{w}](\vec{x}) = \sigma(\xi) = \vec{w} \cdot \vec{x}$
78
ADALINE
Learning: Given a training set $T = \{ (\vec{x}_1, d_1), (\vec{x}_2, d_2), \ldots, (\vec{x}_p, d_p) \}$.
Here $\vec{x}_k = (x_{k0}, x_{k1}, \ldots, x_{kn}) \in \mathbb{R}^{n+1}$, $x_{k0} = 1$, is the k-th input, and $d_k \in \mathbb{R}$ is the expected output.
Intuition: The network is supposed to compute an affine approximation of the function (some of) whose values are given in the training set.
79
Oaks in Wisconsin (figure)
80
ADALINE
Error function:
$E(\vec{w}) = \frac{1}{2} \sum_{k=1}^{p} \left( \vec{w} \cdot \vec{x}_k - d_k \right)^2 = \frac{1}{2} \sum_{k=1}^{p} \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right)^2$
The goal is to find $\vec{w}$ which minimizes $E(\vec{w})$.
81
Error function (figure)
82
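To make the definitions above concrete, here is a minimal sketch of the ADALINE network function and the error $E(\vec{w})$ (Python with NumPy; the function and variable names and the toy data are mine, not from the lecture):

```python
import numpy as np

def adaline_output(w, x):
    # network function y[w](x) = w . x ; x[0] is the formal unit input 1
    return np.dot(w, x)

def error(w, X, d):
    # E(w) = 1/2 * sum_k (w . x_k - d_k)^2
    residuals = X @ w - d            # one residual per training example
    return 0.5 * np.sum(residuals ** 2)

# toy training set: p = 3 examples, n = 2 real inputs, x_{k0} = 1
X = np.array([[1.0, 0.5, 1.0],
              [1.0, 2.0, 0.0],
              [1.0, 1.0, 1.0]])
d = np.array([1.0, 0.0, 1.0])
print(error(np.zeros(3), X, d))      # error of the zero weight vector
```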
Gradient of the error function
Consider the gradient of the error function:
$\nabla E(\vec{w}) = \left( \frac{\partial E}{\partial w_0}(\vec{w}), \ldots, \frac{\partial E}{\partial w_n}(\vec{w}) \right)$
Intuition: $\nabla E(\vec{w})$ is a vector in the weight space which points in the direction of the steepest ascent of the error function. Note that the vectors $\vec{x}_k$ are just parameters of the function $E$ and are thus fixed!
Fact: If $\nabla E(\vec{w}) = \vec{0} = (0, \ldots, 0)$, then $\vec{w}$ is a global minimum of $E$.
For ADALINE, the error function $E(\vec{w})$ is a convex paraboloid and thus has a unique global minimum.
83
Gradient – illustration
Caution! This picture just illustrates the notion of gradient ... it is not the convex paraboloid $E(\vec{w})$!
84
Gradient of the error function (ADALINE)
$\frac{\partial E}{\partial w_\ell}(\vec{w}) = \frac{1}{2} \sum_{k=1}^{p} \frac{\partial}{\partial w_\ell} \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right)^2 = \frac{1}{2} \sum_{k=1}^{p} 2 \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right) \cdot \frac{\partial}{\partial w_\ell} \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right) = \sum_{k=1}^{p} \left( \vec{w} \cdot \vec{x}_k - d_k \right) x_{k\ell}$
Thus
$\nabla E(\vec{w}) = \left( \frac{\partial E}{\partial w_0}(\vec{w}), \ldots, \frac{\partial E}{\partial w_n}(\vec{w}) \right) = \sum_{k=1}^{p} \left( \vec{w} \cdot \vec{x}_k - d_k \right) \vec{x}_k$
85
ADALINE – learning
Batch algorithm (gradient descent):
Idea: In every step, "move" the weights in the direction opposite to the gradient.
The algorithm computes a sequence of weight vectors $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \ldots$
weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
in the step t+1, weights $\vec{w}^{(t+1)}$ are computed as follows:
$\vec{w}^{(t+1)} = \vec{w}^{(t)} - \varepsilon \cdot \nabla E(\vec{w}^{(t)}) = \vec{w}^{(t)} - \varepsilon \cdot \sum_{k=1}^{p} \left( \vec{w}^{(t)} \cdot \vec{x}_k - d_k \right) \cdot \vec{x}_k$
Here $0 < \varepsilon \le 1$ is a learning rate. (The whole training set is used in every step.)
Proposition: For sufficiently small ε > 0, the sequence $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \ldots$ converges (componentwise) to the global minimum of $E$ (i.e. to the vector $\vec{w}$ satisfying $\nabla E(\vec{w}) = \vec{0}$).
86
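A minimal sketch of the batch algorithm (Python/NumPy with my own naming; the learning rate, the number of steps and the toy data are illustrative choices, not prescribed by the lecture):

```python
import numpy as np

def adaline_batch_gd(X, d, eps=0.01, steps=1000):
    """Batch gradient descent for ADALINE.
    X: p x (n+1) inputs with X[:, 0] == 1; d: p desired outputs."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.1, 0.1, size=X.shape[1])  # init close to 0
    for _ in range(steps):
        grad = (X @ w - d) @ X                   # sum_k (w.x_k - d_k) x_k
        w = w - eps * grad                       # move against the gradient
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
d = np.array([1.0, 2.0, 3.0])                    # exactly affine: d = 1 + x
print(adaline_batch_gd(X, d))                    # approaches [1, 1]
```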
ADALINE – Animation
(A sequence of figure slides animating the gradient descent.)
87
ADALINE – learning
Online algorithm (Delta-rule, Widrow-Hoff rule):
weights in $\vec{w}^{(0)}$ are initialized randomly close to 0
in the step t+1, weights $\vec{w}^{(t+1)}$ are computed as follows:
$\vec{w}^{(t+1)} = \vec{w}^{(t)} - \varepsilon(t) \cdot \left( \vec{w}^{(t)} \cdot \vec{x}_k - d_k \right) \cdot \vec{x}_k$
Here $k = (t \bmod p) + 1$ and $0 < \varepsilon(t) \le 1$ is a learning rate in the step t+1.
Note that the algorithm does not work with the complete gradient but only with its part determined by the currently considered training example.
Theorem (Widrow & Hoff): If $\varepsilon(t) = \frac{1}{t}$, then $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \ldots$ converges to the global minimum of $E$.
88
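The online (Widrow-Hoff) rule differs from the batch version only in that each step uses a single training example; a sketch under the same assumptions as before:

```python
import numpy as np

def adaline_online(X, d, steps=5000):
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.1, 0.1, size=X.shape[1])
    p = X.shape[0]
    for t in range(steps):
        k = t % p                    # cycle through examples: k = (t mod p) + 1
        eps = 1.0 / (t + 1)          # Widrow-Hoff rate eps(t) = 1/t
        w = w - eps * (np.dot(w, X[k]) - d[k]) * X[k]
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
d = np.array([1.0, 2.0, 3.0])
print(adaline_online(X, d))
```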
ADALINE – classification
How to use the ADALINE for classification?
The training set is $T = \{ (\vec{x}_1, d_1), (\vec{x}_2, d_2), \ldots, (\vec{x}_p, d_p) \}$ where $\vec{x}_k = (x_{k0}, x_{k1}, \ldots, x_{kn}) \in \mathbb{R}^{n+1}$ and $d_k \in \{1, -1\}$. Here $d_k$ determines a class.
Train the network using the ADALINE algorithm.
We may expect the following:
if $d_k = 1$, then $\vec{w} \cdot \vec{x}_k \ge 0$
if $d_k = -1$, then $\vec{w} \cdot \vec{x}_k < 0$
This does not always have to be true, but if the training set is reasonably linearly separable, then the algorithm typically gives satisfactory results.
89
Architecture – Multilayer Perceptron (MLP)
(Diagram: Input, Hidden, and Output layers; inputs $x_1, x_2$, outputs $y_1, y_2$.)
Neurons are partitioned into layers: one input layer, one output layer, and possibly several hidden layers.
Layers are numbered from 0; the input layer has number 0. E.g. a three-layer network has two hidden layers and one output layer.
Neurons in the i-th layer are connected with all neurons in the (i+1)-st layer.
The architecture of an MLP is typically described by the numbers of neurons in individual layers (e.g. 2-4-3-2).
90
MLP – architecture
Notation: Denote
X a set of input neurons
Y a set of output neurons
Z a set of all neurons (X, Y ⊆ Z)
individual neurons are denoted by indices i, j, etc.
$\xi_j$ is the inner potential of the neuron j after the computation stops
$y_j$ is the output of the neuron j after the computation stops (we define $y_0 = 1$ as the value of the formal unit input)
$w_{ji}$ is the weight of the connection from i to j (in particular, $w_{j0}$ is the weight of the connection from the formal unit input, i.e. $w_{j0} = -b_j$ where $b_j$ is the bias of the neuron j)
$j_{\leftarrow}$ is the set of all i such that j is adjacent from i (i.e. there is an arc to j from i)
$j_{\rightarrow}$ is the set of all i such that j is adjacent to i (i.e. there is an arc from j to i)
91
MLP – activity
Activity:
inner potential of neuron j: $\xi_j = \sum_{i \in j_{\leftarrow}} w_{ji} y_i$
activation function $\sigma_j$ for neuron j (arbitrary differentiable) [e.g. logistic sigmoid $\sigma_j(\xi) = \frac{1}{1 + e^{-\lambda_j \xi}}$]
state of a non-input neuron $j \in Z \setminus X$ after the computation stops: $y_j = \sigma_j(\xi_j)$
($y_j$ depends on the configuration $\vec{w}$ and the input $\vec{x}$, so we sometimes write $y_j(\vec{w}, \vec{x})$)
The network computes a function from $\mathbb{R}^{|X|}$ to $\mathbb{R}^{|Y|}$.
Layer-wise computation: First, all input neurons are assigned the values of the input. In the ℓ-th step, all neurons of the ℓ-th layer are evaluated.
92
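A sketch of the layer-wise forward computation for a fully connected MLP (Python/NumPy; one weight matrix per layer, with the bias folded in as the weight from the formal unit input — the representation is my choice, not the lecture's):

```python
import numpy as np

def sigmoid(xi, lam=1.0):
    # logistic sigmoid sigma(xi) = 1 / (1 + exp(-lambda * xi))
    return 1.0 / (1.0 + np.exp(-lam * xi))

def forward(weights, x):
    """weights[l] has shape (n_{l+1}, n_l + 1); column 0 holds w_{j0} = -b_j."""
    y = np.asarray(x, dtype=float)
    for W in weights:
        y_ext = np.concatenate(([1.0], y))  # prepend the formal unit input y_0 = 1
        y = sigmoid(W @ y_ext)              # inner potentials -> next-layer outputs
    return y

# a 2-4-3-2 network with small random weights
rng = np.random.default_rng(0)
weights = [rng.uniform(-0.1, 0.1, (4, 3)),
           rng.uniform(-0.1, 0.1, (3, 5)),
           rng.uniform(-0.1, 0.1, (2, 4))]
print(forward(weights, [0.5, -1.0]))
```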
MLP – learning
Learning: Given a training set T of the form $\{ (\vec{x}_k, \vec{d}_k) \mid k = 1, \ldots, p \}$.
Here, every $\vec{x}_k \in \mathbb{R}^{|X|}$ is an input vector and every $\vec{d}_k \in \mathbb{R}^{|Y|}$ is the desired network output. For every $j \in Y$, denote by $d_{kj}$ the desired output of the neuron j for a given network input $\vec{x}_k$ (the vector $\vec{d}_k$ can be written as $(d_{kj})_{j \in Y}$).
Error function:
$E(\vec{w}) = \sum_{k=1}^{p} E_k(\vec{w})$ where $E_k(\vec{w}) = \frac{1}{2} \sum_{j \in Y} \left( y_j(\vec{w}, \vec{x}_k) - d_{kj} \right)^2$
93
MLP – learning algorithm
Batch algorithm (gradient descent):
The algorithm computes a sequence of weight vectors $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \ldots$
weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
in the step t+1 (here t = 0, 1, 2, ...), weights $\vec{w}^{(t+1)}$ are computed as follows:
$w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$
where $\Delta w_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t)})$ is the weight update of $w_{ji}$ in step t+1 and $0 < \varepsilon(t) \le 1$ is a learning rate in step t+1.
Note that $\frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t)})$ is a component of the gradient $\nabla E$, i.e. the weight update can be written as $\vec{w}^{(t+1)} = \vec{w}^{(t)} - \varepsilon(t) \cdot \nabla E(\vec{w}^{(t)})$.
94
MLP – error function gradient
For every $w_{ji}$ we have
$\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$
where for every k = 1, ..., p it holds that
$\frac{\partial E_k}{\partial w_{ji}} = \frac{\partial E_k}{\partial y_j} \cdot \sigma_j'(\xi_j) \cdot y_i$
and for every $j \in Z \setminus X$ we get
$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj}$ for $j \in Y$
$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj}$ for $j \in Z \setminus (Y \cup X)$
(Here all $y_j$ are in fact $y_j(\vec{w}, \vec{x}_k)$.)
95
MLP – error function gradient
If $\sigma_j(\xi) = \frac{1}{1 + e^{-\lambda_j \xi}}$ for all $j \in Z$, then $\sigma_j'(\xi_j) = \lambda_j y_j (1 - y_j)$
and thus for all $j \in Z \setminus X$:
$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj}$ for $j \in Y$
$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \lambda_r y_r (1 - y_r) \cdot w_{rj}$ for $j \in Z \setminus (Y \cup X)$
If $\sigma_j(\xi) = a \cdot \tanh(b \cdot \xi)$ for all $j \in Z$, then $\sigma_j'(\xi_j) = \frac{b}{a}(a - y_j)(a + y_j)$
96
MLP – computing the gradient
Compute $\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$ as follows:
Initialize $\mathcal{E}_{ji} := 0$ (by the end of the computation: $\mathcal{E}_{ji} = \frac{\partial E}{\partial w_{ji}}$)
For every k = 1, ..., p do:
1. forward pass: compute $y_j = y_j(\vec{w}, \vec{x}_k)$ for all $j \in Z$
2. backward pass: compute $\frac{\partial E_k}{\partial y_j}$ for all $j \in Z$ using backpropagation (see the next slide!)
3. compute $\frac{\partial E_k}{\partial w_{ji}}$ for all $w_{ji}$ using $\frac{\partial E_k}{\partial w_{ji}} := \frac{\partial E_k}{\partial y_j} \cdot \sigma_j'(\xi_j) \cdot y_i$
4. $\mathcal{E}_{ji} := \mathcal{E}_{ji} + \frac{\partial E_k}{\partial w_{ji}}$
The resulting $\mathcal{E}_{ji}$ equals $\frac{\partial E}{\partial w_{ji}}$.
97
MLP – backpropagation
Compute $\frac{\partial E_k}{\partial y_j}$ for all $j \in Z$ as follows:
if $j \in Y$, then $\frac{\partial E_k}{\partial y_j} = y_j - d_{kj}$
if $j \in Z \setminus (Y \cup X)$, then, assuming that j is in the ℓ-th layer and that $\frac{\partial E_k}{\partial y_r}$ has already been computed for all neurons in the (ℓ+1)-st layer, compute
$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj}$
(This works because all neurons $r \in j_{\rightarrow}$ belong to the (ℓ+1)-st layer.)
98
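Steps 1-4, rendered as a compact vectorized sketch for a fully connected MLP with logistic sigmoid units (λ_j = 1, so σ' = y(1 − y)); this is my own matrix formulation matching the forward pass sketched earlier, not the lecture's pseudocode:

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def gradient(weights, X, D):
    """Accumulate dE/dw_{ji} over all training examples.
    weights[l]: (n_{l+1}, n_l + 1), column 0 is the bias weight w_{j0}.
    X: (p, |X|) inputs, D: (p, |Y|) desired outputs."""
    grads = [np.zeros_like(W) for W in weights]
    for x, d in zip(X, D):
        # 1. forward pass, remembering the outputs y of every layer
        ys = [np.asarray(x, dtype=float)]
        for W in weights:
            ys.append(sigmoid(W @ np.concatenate(([1.0], ys[-1]))))
        # 2. backward pass, from the output layer down
        dEdy = ys[-1] - d                            # dE_k/dy_j for j in Y
        for l in range(len(weights) - 1, -1, -1):
            y_out, y_in = ys[l + 1], ys[l]
            delta = dEdy * y_out * (1.0 - y_out)     # (dE_k/dy_j) * sigma'(xi_j)
            # 3.+4. accumulate dE_k/dw_{ji} = (dE_k/dy_j) * sigma'(xi_j) * y_i
            grads[l] += np.outer(delta, np.concatenate(([1.0], y_in)))
            # backpropagate dE_k/dy to the previous layer (bias column dropped)
            dEdy = weights[l][:, 1:].T @ delta
    return grads
```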
Complexity of the batch algorithm
Computation of the gradient $\frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t-1)})$ stops in time linear in the size of the network plus the size of the training set.
(assuming unit cost of operations, including the computation of $\sigma_r'(\xi_r)$ for a given $\xi_r$)
Proof sketch: The algorithm does the following p times:
1. forward pass, i.e. computes $y_j(\vec{w}, \vec{x}_k)$
2. backpropagation, i.e. computes $\frac{\partial E_k}{\partial y_j}$
3. computes $\frac{\partial E_k}{\partial w_{ji}}$ and adds it to $\mathcal{E}_{ji}$ (a constant time operation in the unit cost framework)
The steps 1.-3. take linear time.
Note that the speed of convergence of the gradient descent cannot be estimated ...
99
MLP – learning algorithm
Online algorithm:
The algorithm computes a sequence of weight vectors $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \ldots$
weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
in the step t+1 (here t = 0, 1, 2, ...), weights $\vec{w}^{(t+1)}$ are computed as follows:
$w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$
where $\Delta w_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E_k}{\partial w_{ji}}(\vec{w}^{(t)})$ is the weight update of $w_{ji}$ in the step t+1 and $0 < \varepsilon(t) \le 1$ is the learning rate in the step t+1.
There are other variants determined by the selection of the training examples used for the error computation (more on this later).
100
Illustration of the gradient descent – XOR
Source: Pattern Classification (2nd Edition); Richard O. Duda, Peter E. Hart, David G. Stork
101
Animation (sin(x), network 1-5-1)
(Figure slides: the network's fit after 1, 10, 20, 40, and 100 iterations.)
102
(Slides 103-105 recap slides 90, 91, and 93: the MLP architecture, the notation, and the training set with the error function $E(\vec{w}) = \sum_{k=1}^{p} E_k(\vec{w})$.)
103-105
MLP – batch learning
The algorithm computes a sequence of weight vectors $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \ldots$
weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
in the step t+1 (here t = 0, 1, 2, ...), weights $\vec{w}^{(t+1)}$ are computed as follows:
$\vec{w}^{(t+1)} = \vec{w}^{(t)} + \Delta \vec{w}^{(t)}$
Here $\Delta \vec{w}^{(t)} = -\varepsilon(t) \cdot \nabla E(\vec{w}^{(t)}) = -\varepsilon(t) \cdot \sum_{k=1}^{p} \nabla E_k(\vec{w}^{(t)})$
$0 < \varepsilon(t) \le 1$ is a learning rate in step t+1
$\nabla E(\vec{w}^{(t)})$ is the gradient of the error function
$\nabla E_k(\vec{w}^{(t)})$ is the gradient of the error function for the training example k
106
MLP – error functions
square error: $E(\vec{w}) = \sum_{k=1}^{p} E_k(\vec{w})$ where $E_k(\vec{w}) = \frac{1}{2} \sum_{j \in Y} \left( y_j(\vec{w}, \vec{x}_k) - d_{kj} \right)^2$
mean square error (mse): $E(\vec{w}) = \frac{1}{p} \sum_{k=1}^{p} E_k(\vec{w})$
I will use mse throughout the rest of this lecture.
107
MLP – mse gradient
For every $w_{ji}$ we have
$\frac{\partial E}{\partial w_{ji}} = \frac{1}{p} \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$
where for every k = 1, ..., p it holds that
$\frac{\partial E_k}{\partial w_{ji}} = \frac{\partial E_k}{\partial y_j} \cdot \sigma_j'(\xi_j) \cdot y_i$
and for every $j \in Z \setminus X$ we get
$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj}$ for $j \in Y$
$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj}$ for $j \in Z \setminus (Y \cup X)$
(Here all $y_j$ are in fact $y_j(\vec{w}, \vec{x}_k)$.)
108
Practical issues of gradient descent
Training efficiency:
What size of a minibatch?
How to choose the learning rate ε(t) and control SGD?
How to pre-process the inputs?
How to initialize weights?
How to choose desired output values of the network?
Quality of the resulting model:
When to stop training?
Regularization techniques.
How large a network?
For simplicity, I will illustrate the reasoning on MLP + mse. Later we will see other topologies and error functions with different but always somewhat related issues.
109
Issues in gradient descent
Lots of local minima where the descent gets stuck:
The model identifiability problem: swapping the incoming weights of neurons i and j leaves the same network topology – weight space symmetry.
Recent studies show that for sufficiently large networks all local minima have low values of the error function.
Saddle points: one can show (by a combinatorial argument) that larger networks have exponentially more saddle points than local minima.
110
Issues in gradient descent – too slow descent
flat regions: e.g. if the inner potentials are too large (in absolute value), then the derivatives of the activation functions are extremely small.
111
Issues in gradient descent – too fast descent
steep cliffs: the gradient is extremely large, the descent skips important weight vectors.
112
Issues in gradient descent – local vs global structure
What if we initialize on the left? (figure)
113
Issues in computing the gradient
vanishing and exploding gradients:
$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj}$ for $j \in Y$
$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj}$ for $j \in Z \setminus (Y \cup X)$
inexact gradient computation: the minibatch gradient is only an estimate of the true gradient. Note that the standard deviation of the estimate is (roughly) $\sigma / \sqrt{m}$, where m is the size of the minibatch and σ is the standard deviation of the gradient estimate for a single training example. (E.g. a minibatch of size 10,000 means 100 times more computation than one of size 100, but gives only a 10 times smaller standard deviation.)
114
Minibatch size
Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
Multicore architectures are usually underutilized by extremely small batches.
If all examples in the batch are to be processed in parallel (as is the typical case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power-of-2 batch sizes to offer better runtime. Typical power-of-2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
Small batches can offer a regularizing effect, perhaps due to the noise they add to the learning process.
115
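A tiny simulation of the square-root trade-off above (pure NumPy; the toy numbers are mine): averaging m single-example values shrinks the spread only by a factor of √m.

```python
import numpy as np

rng = np.random.default_rng(0)
per_example = rng.normal(loc=1.0, scale=4.0, size=1_000_000)  # sigma = 4

for m in (100, 10_000):
    # 200 minibatch estimates of the mean, each from m examples
    est = [rng.choice(per_example, size=m).mean() for _ in range(200)]
    print(m, np.std(est))   # roughly sigma/sqrt(m): ~0.4 vs ~0.04
```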
Momentum
Issue in the gradient descent: $\nabla E(\vec{w}^{(t)})$ constantly changes direction (even though the error steadily decreases).
Solution: In every step, add the change made in the previous step (weighted by a factor α):
$\Delta \vec{w}^{(t)} = -\varepsilon(t) \cdot \sum_{k \in T} \nabla E_k(\vec{w}^{(t)}) + \alpha \cdot \Delta \vec{w}^{(t-1)}$
where 0 < α < 1.
116
Momentum – illustration (figure)
117
SGD with momentum
weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
in the step t+1 (here t = 0, 1, 2, ...), weights $\vec{w}^{(t+1)}$ are computed as follows:
Choose (randomly) a set of training examples $T \subseteq \{1, \ldots, p\}$
Compute $\vec{w}^{(t+1)} = \vec{w}^{(t)} + \Delta \vec{w}^{(t)}$ where $\Delta \vec{w}^{(t)} = -\varepsilon(t) \cdot \sum_{k \in T} \nabla E_k(\vec{w}^{(t)}) + \alpha \Delta \vec{w}^{(t-1)}$
$0 < \varepsilon(t) \le 1$ is a learning rate in step t+1
0 < α < 1 measures the "influence" of the momentum
$\nabla E_k(\vec{w}^{(t)})$ is the gradient of the error of the example k
Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing the minibatches sequentially.
118
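A sketch of the momentum update in isolation (Python/NumPy; the gradient comes from a toy ill-conditioned quadratic error of my own choosing, standing in for $\sum_{k \in T} \nabla E_k$):

```python
import numpy as np

def sgd_momentum(grad, w0, eps=0.01, alpha=0.9, steps=200):
    """grad(w) returns the (minibatch) gradient at w."""
    w = np.asarray(w0, dtype=float)
    dw = np.zeros_like(w)
    for _ in range(steps):
        dw = -eps * grad(w) + alpha * dw   # add the alpha-weighted previous change
        w = w + dw
    return w

# toy error E(w) = 0.5*(100*w1^2 + w2^2): plain gradient descent oscillates
# across the narrow valley; the momentum term damps the oscillation
grad = lambda w: np.array([100.0 * w[0], w[1]])
print(sgd_momentum(grad, [1.0, 1.0]))      # approaches [0, 0]
```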
Learning rate
Generic rules for the adaptation of ε(t):
Start with a larger learning rate (e.g. ε = 0.1); later decrease it as the descent is supposed to settle in a minimum of E. Some tools allow to set a list of learning rates, each rate for one epoch of the descent.
In case you may observe the error evolving: if the error decreases, increase the rate slightly; if the error increases, decrease the rate. Note that the error may increase for a short period without any harm to the convergence of the learning process.
119
AdaGrad
So far we have considered a uniform learning rate. It is better to have larger rates for weights with smaller updates and smaller rates for weights with larger updates.
AdaGrad uses an individually adapting learning rate for each weight.
120
SGD with AdaGrad
weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
in the step t+1 (here t = 0, 1, 2, ...), compute $\vec{w}^{(t+1)}$:
Choose (randomly) a minibatch $T \subseteq \{1, \ldots, p\}$
Compute $w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$
where
$\Delta w_{ji}^{(t)} = -\frac{\eta}{\sqrt{r_{ji}^{(t)}} + \delta} \cdot \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(\vec{w}^{(t)})$
and
$r_{ji}^{(t)} = r_{ji}^{(t-1)} + \left( \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(\vec{w}^{(t)}) \right)^2$
η is a constant expressing the influence of the learning rate, typically 0.01.
δ > 0 is a smoothing term (typically 1e-8) avoiding division by 0.
121
RMSProp
The main disadvantage of AdaGrad is the accumulation of the gradient throughout the whole learning process. In case the learning needs to get over several "hills" before settling in a deep "valley", the weight updates get far too small before getting to it.
RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.
122
SGD with RMSProp
weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
in the step t+1 (here t = 0, 1, 2, ...), compute $\vec{w}^{(t+1)}$:
Choose (randomly) a minibatch $T \subseteq \{1, \ldots, p\}$
Compute $w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$
where
$\Delta w_{ji}^{(t)} = -\frac{\eta}{\sqrt{r_{ji}^{(t)}} + \delta} \cdot \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(\vec{w}^{(t)})$
and
$r_{ji}^{(t)} = \rho \, r_{ji}^{(t-1)} + (1 - \rho) \left( \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(\vec{w}^{(t)}) \right)^2$
η is a constant expressing the influence of the learning rate (Hinton suggests ρ = 0.9 and η = 0.001).
δ > 0 is a smoothing term (typically 1e-8) avoiding division by 0.
123
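The two per-weight rules side by side (NumPy arrays holding all the $r_{ji}$ accumulators at once; g stands for the minibatch gradient $\sum_{k \in T} \partial E_k / \partial w_{ji}$, here faked by the same toy quadratic as before):

```python
import numpy as np

def adagrad_step(w, g, r, eta=0.01, delta=1e-8):
    r = r + g ** 2                        # accumulate squared gradients forever
    return w - eta / (np.sqrt(r) + delta) * g, r

def rmsprop_step(w, g, r, eta=0.001, rho=0.9, delta=1e-8):
    r = rho * r + (1.0 - rho) * g ** 2    # exponentially decaying average
    return w - eta / (np.sqrt(r) + delta) * g, r

grad = lambda w: np.array([100.0 * w[0], w[1]])   # toy gradient
w, r = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(1000):
    w, r = rmsprop_step(w, grad(w), r)
print(w)              # both coordinates shrink at a similar speed
```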
Other optimization methods
There are more methods, such as AdaDelta, Adam (roughly RMSProp combined with momentum), etc.
A natural question: Which algorithm should one choose?
Unfortunately, there is currently no consensus on this point. According to a recent study, the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly; no single best algorithm has emerged.
Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta and Adam. The choice of which algorithm to use, at this point, seems to depend largely on the user's familiarity with the algorithm.
124
Choice of (hidden) activations
Generic requirements imposed on activation functions:
1. differentiability (to do gradient descent)
2. non-linearity (linear multi-layer networks are equivalent to single-layer ones)
3. monotonicity (local extrema of activation functions induce local extrema of the error function)
4. "linearity" (i.e. preserve as much linearity as possible; linear models are easiest to fit; find the "minimum" non-linearity needed to solve a given task)
The choice of activation functions is closely related to input preprocessing and the initial choice of weights. I will illustrate the reasoning on sigmoidal functions and say a few words about other activation functions later.
125
Activation functions – tanh
$\sigma(\xi) = 1.7159 \cdot \tanh\left(\frac{2}{3} \xi\right)$; we have $\lim_{\xi \to \infty} \sigma(\xi) = 1.7159$ and $\lim_{\xi \to -\infty} \sigma(\xi) = -1.7159$; σ is almost linear on [−1, 1].
(Figure slides: plots of σ and of its first derivative.)
126-128
Input preprocessing
Some inputs may be much larger than others. E.g.: height vs weight of a person, maximum speed of a car (in km/h) vs its price (in CZK), etc.
Large inputs have a greater influence on the training than the small ones. In addition, too large inputs may slow down learning (saturation of activation functions).
Typical standardization:
mean = 0 (subtract the mean)
variance = 1 (divide by the standard deviation)
Here the mean and standard deviation may be estimated from the data (the training set). (Illustration of standard deviation.)
129
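A minimal standardization sketch (NumPy; the statistics are estimated on the training set and then reused for fresh data, which is the usual practice; the toy height/weight rows are mine):

```python
import numpy as np

def fit_standardizer(X_train):
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0                # guard against constant columns
    return mean, std

def standardize(X, mean, std):
    return (X - mean) / std              # per input: mean 0, variance 1

X_train = np.array([[180.0, 70.0],       # e.g. height (cm), weight (kg)
                    [160.0, 55.0],
                    [175.0, 80.0]])
mean, std = fit_standardizer(X_train)
print(standardize(X_train, mean, std))
```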
Input preprocessing
Individual inputs should not be correlated. Correlated inputs can be removed as a part of dimensionality reduction. (Dimensionality reduction and decorrelation can be implemented using neural networks. There are also standard methods such as PCA.)
130
Initial weights (for tanh)
Typically, the weights are chosen randomly from an interval [−w, w] where w depends on the number of inputs of a given neuron.
Consider the activation function $\sigma(\xi) = 1.7159 \cdot \tanh\left(\frac{2}{3} \xi\right)$ for all neurons:
σ is almost linear on [−1, 1],
extreme values of σ are close to −1 and 1,
σ saturates outside the interval [−4, 4] (i.e. it is close to its limit values and its derivative is close to 0).
Thus:
for too small w we may get an (almost) linear model;
for too large w (i.e. much larger than 1) the activations may get saturated and the learning will be very slow.
Hence, we want to choose w so that the inner potentials of neurons will be roughly in the interval [−1, 1].
131
Initial weights (for tanh)
Standardization gives mean = 0 and variance = 1 of the input data. Assume that the individual inputs are (almost) uncorrelated.
Consider a neuron j from the first layer with d inputs. Assume that its weights are chosen uniformly from [−w, w].
The rule: choose w so that the standard deviation of $\xi_j$ (denote it by $o_j$) is close to the border of the interval on which $\sigma_j$ is linear. In our case: $o_j \approx 1$.
Our assumptions imply: $o_j = \sqrt{\frac{d}{3}} \cdot w$. Thus we put $w = \frac{\sqrt{3}}{\sqrt{d}}$.
The same works for higher layers; d corresponds to the number of neurons in the layer one level lower.
132
Glorot & Bengio initialization
The previous heuristic for weight initialization ignores the variance of the gradient (i.e. it is concerned only with the "size" of activations in the forward pass).
Glorot & Bengio (2010) presented a normalized initialization, choosing w uniformly from the interval
$\left[ -\sqrt{\frac{6}{m+n}}, \sqrt{\frac{6}{m+n}} \right]$
Here m is the number of inputs to the neuron and n is the number of outputs of the neuron.
This is designed to compromise between the goal of initializing all layers to have the same activation variance and the goal of initializing all layers to have the same gradient variance. The formula is derived using the assumption that the network consists only of a chain of matrix multiplications, with no non-linearities. Real neural networks obviously violate this assumption, but many strategies designed for the linear model perform reasonably well on its non-linear counterparts.
133
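The two initialization rules side by side (NumPy; the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_tanh(d, n_out):
    # w = sqrt(3)/sqrt(d): inner potentials get std ~ 1 on standardized inputs
    w = np.sqrt(3.0) / np.sqrt(d)
    return rng.uniform(-w, w, size=(n_out, d))

def init_glorot(m, n):
    # Glorot & Bengio (2010): uniform on [-sqrt(6/(m+n)), sqrt(6/(m+n))]
    w = np.sqrt(6.0 / (m + n))
    return rng.uniform(-w, w, size=(n, m))

W1 = init_tanh(d=64, n_out=30)    # first hidden layer: 64 inputs, 30 neurons
W2 = init_glorot(m=30, n=10)      # m inputs, n outputs
print(W1.std(), W2.std())
```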
Target values (tanh)
Target values $d_{kj}$ should be chosen in the range of the output activation functions, in our case [−1.716, 1.716].
Target values too close to the extrema of the output activations, in our case ±1.716, may cause the weights to grow indefinitely (which slows down learning). Thus it is good to choose target values from the interval [−1.716 + δ, 1.716 − δ].
As before, ideally [−1.716 + δ, 1.716 − δ] should span the interval on which the activation function is linear, i.e. $d_{kj}$ should be taken from [−1, 1].
134
Modern activation functions
For hidden neurons, sigmoidal functions are often substituted with piece-wise linear activation functions. Most prominent is ReLU: σ(ξ) = max{0, ξ}
THE default activation function recommended for use with most feedforward neural networks: as close to a linear function as possible; very simple; does not saturate for large potentials.
135
Output neurons
The choice of activation functions for output units depends on the concrete application.
For regression (function approximation), the output is typically linear (or sigmoidal).
For classification, the current activation functions of choice are:
logistic sigmoid or tanh – binary classification
softmax: $\sigma_j(\xi_j) = \frac{e^{\xi_j}}{\sum_{i \in Y} e^{\xi_i}}$ – multi-class classification
For some reasons, the error function used with softmax (assuming that the target values $d_{kj}$ are from {0, 1}) is typically cross-entropy:
$-\frac{1}{p} \sum_{k=1}^{p} \sum_{j \in Y} \left[ d_{kj} \ln(y_j) + (1 - d_{kj}) \ln(1 - y_j) \right]$
... which somewhat corresponds to the maximum likelihood principle.
136
Sigmoidal outputs with cross-entropy – in detail
Consider:
binary classification, two classes {0, 1}
one output neuron j, its activation the logistic sigmoid $\sigma_j(\xi_j) = \frac{1}{1 + e^{-\xi_j}}$
The output of the network is $y = \sigma_j(\xi_j)$.
For a training set $T = \{ (\vec{x}_k, d_k) \mid k = 1, \ldots, p \}$ (here $\vec{x}_k \in \mathbb{R}^{|X|}$ and $d_k \in \mathbb{R}$), the cross-entropy looks like this:
$E_{cross} = -\frac{1}{p} \sum_{k=1}^{p} \left[ d_k \ln(y_k) + (1 - d_k) \ln(1 - y_k) \right]$
where $y_k$ is the output of the network for the k-th training input $\vec{x}_k$, and $d_k$ is the k-th desired output.
137
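A numerically careful sketch of softmax and of the binary cross-entropy above (NumPy; the max-subtraction and the clipping constant are standard implementation details, my addition rather than the slides'):

```python
import numpy as np

def softmax(xi):
    e = np.exp(xi - np.max(xi))        # subtracting the max leaves the result unchanged
    return e / np.sum(e)

def binary_cross_entropy(y, d, tiny=1e-12):
    # E_cross = -1/p * sum_k [ d_k ln y_k + (1 - d_k) ln(1 - y_k) ]
    y = np.clip(y, tiny, 1.0 - tiny)   # avoid ln(0)
    return -np.mean(d * np.log(y) + (1.0 - d) * np.log(1.0 - y))

print(softmax(np.array([1.0, 2.0, 3.0])))            # sums to 1
print(binary_cross_entropy(np.array([0.9, 0.2]),
                           np.array([1.0, 0.0])))    # small for good predictions
```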
Generalization
Intuition: Generalization = the ability to cope with new, unseen instances.
Data are mostly noisy, so it is not a good idea to fit them exactly. In the case of function approximation, the network should not return the exact results from the training set.
More formally: It is typically assumed that the training set has been generated as follows:
$d_{kj} = g_j(\vec{x}_k) + \Theta_{kj}$
where $g_j$ is the "underlying" function corresponding to the output neuron $j \in Y$ and $\Theta_{kj}$ is random noise. The network should fit $g_j$, not the noise.
Methods improving generalization are called regularization methods.
138
Regularization
Regularization is a big issue in neural networks, as they typically use a huge number of parameters and thus are very susceptible to overfitting.
von Neumann: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk."
... and I ask you, prof. Neumann: What can you fit with 40 GB of parameters??
139
Early stopping
Early stopping means that we stop learning before it reaches a minimum of the error E. When to stop?
In many applications the error function is not the main thing we want to optimize. E.g. in the case of a trading system, we typically want to maximize our profit, not to minimize (strange) error functions designed to be easily differentiable. Also, as noted before, minimizing E completely is not good for generalization.
For a start: We may employ the standard approach of training on one set and stopping on another one.
140
Early stopping
Divide your dataset into several subsets:
training set (e.g. 60%) – train the network here
validation set (e.g. 20%) – use to stop the training
(possibly) test set (e.g. 20%) – use to compare trained models
What to use as a stopping rule? You may observe E (or any other function of interest) on the validation set; if it does not improve for the last k steps, stop.
Alternatively, you may observe the gradient; if it is small for some time, stop. (Recent studies have shown that this traditional rule is not too good: it may happen that the gradient is larger close to minimum values; on the other hand, E does not have to be evaluated, which saves time.)
To compare models you may use ML techniques such as cross-validation, etc.
141
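The validation-based stopping rule, as a patience-style loop skeleton (pure Python; train_step and validation_error are hypothetical callbacks standing in for one round of training and for evaluating E on the validation set):

```python
def train_with_early_stopping(train_step, validation_error,
                              patience=10, max_steps=10_000):
    """Stop once the validation error has not improved for `patience` steps."""
    best_err, best_step, since_best = float("inf"), 0, 0
    for step in range(max_steps):
        train_step()                      # e.g. one epoch of SGD
        err = validation_error()          # E on the validation set
        if err < best_err:
            best_err, best_step, since_best = err, step, 0
        else:
            since_best += 1
            if since_best >= patience:    # no improvement for the last k steps
                break
    return best_err, best_step

# toy run: a fake validation error that improves and then starts overfitting
errs = iter([0.9, 0.5, 0.3, 0.25, 0.26, 0.27, 0.28] + [0.3] * 100)
print(train_with_early_stopping(lambda: None, lambda: next(errs), patience=3))
```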
Size of the network
A similar problem as in the case of the training duration:
Too small a network is not able to capture the intrinsic properties of the training set.
Large networks overfit faster – bad generalization.
Solution: The optimal number of neurons :-)
there are some (useless) theoretical bounds
there are algorithms dynamically adding/removing neurons (not of much use nowadays)
In practice:
start using a rule of thumb: the number of neurons ≈ ten times less than the number of training instances
experiment, experiment, experiment.
142
Feature extraction
Consider a two-layer network. Hidden neurons are supposed to represent "patterns" in the inputs.
Example: a network 64-2-3 for letter classification (figure).
143
Ensemble methods
Techniques for reducing the generalization error by combining several models. The reason that ensemble methods work is that different models will usually not make all the same errors on the test set.
Idea: Train several different models separately, then have all of the models vote on the output for test examples.
Bagging:
Generate k training sets $T_1, \ldots, T_k$ of the same size by sampling from T uniformly with replacement. If $|T_i| = |T|$, then on average $T_i$ contains $(1 - 1/e) \cdot |T|$ distinct examples of T.
For each i, train a model $M_i$ on $T_i$.
Combine the outputs of the models: for regression by averaging, for classification by (majority) voting.
144
Dropout
The algorithm: In every step of the gradient descent,
choose randomly a set N of neurons; each neuron is included in N independently with probability 1/2 (in practice, different probabilities are used as well),
update the weights of the neurons in N (in the standard way), and leave the weights of the other neurons unchanged (a sketch of one such masked update follows at the end of this section).
Dropout resembles bagging: a large ensemble of neural networks is trained "at once" on parts of the data.
Dropout is not exactly the same as bagging: the models share parameters, with each model inheriting a different subset of parameters from the parent neural network. This parameter sharing makes it possible to represent an exponential number of models with a tractable amount of memory.
In the case of bagging, each model is trained to convergence on its respective training set. This would be infeasible for large networks/training sets.
145
Weight decay
Generalization can be improved by removing "unimportant" weights. Penalizing large weights gives a stronger indication of their importance.
In every step we decrease the weights (multiplicatively) as follows:
$w_{ji}^{(t+1)} = (1 - \zeta)\left( w_{ji}^{(t)} + \Delta w_{ji}^{(t)} \right)$
Intuition: Unimportant weights will be pushed to 0; important weights will survive the decay.
Weight decay is equivalent to the gradient descent with a constant learning rate ε and the following error function:
$E'(\vec{w}) = E(\vec{w}) + \frac{\zeta}{2\varepsilon} (\vec{w} \cdot \vec{w})$
Here $\frac{\zeta}{2\varepsilon} (\vec{w} \cdot \vec{w})$ penalizes large weights.
146
More optimization, regularization ...
There are many more practical tips, optimization methods, and regularization methods. For a very nice survey see http://www.deeplearningbook.org/ ... and also all the other infinitely many urls concerned with deep learning.
147
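Finally, the dropout-masked weight update promised above, as a sketch (NumPy; masking whole rows, i.e. the incoming weights of the neurons chosen into N, is one way to read "leave the weights of the other neurons unchanged"):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_update(W, grad_W, eps=0.1, p_keep=0.5):
    """One gradient step touching only the incoming weights of neurons in N.
    W, grad_W: (n_neurons, n_inputs + 1); row j = incoming weights of neuron j."""
    in_N = rng.random(W.shape[0]) < p_keep   # neuron j in N with probability 1/2
    W_new = W.copy()
    W_new[in_N] -= eps * grad_W[in_N]        # update only the neurons in N
    return W_new

W = rng.uniform(-0.1, 0.1, (4, 3))
gW = np.ones_like(W)
print(dropout_update(W, gW) - W)             # about half the rows changed
```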