PV021: Neural networks Tomáš Brázdil 1 Course organization Course materials: Main: The lecture Neural Networks and Deep Learning by Michael Nielsen http://neuralnetworksanddeeplearning.com/ (Extremely well written modern online textbook.) Deep learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville http://www.deeplearningbook.org/ (A very good overview of the state-of-the-art in neural networks.) Suggested: deeplearning.ai courses by Andrew Ng 2 Course organization Evaluation: Project teams of two students implementation of a selected model + analysis of given data implementation either in C, C++ without use of any specialized libraries for data analysis and machine learning need to get over a given accuracy threshold (a gentle one, just to eliminate non-functional implementations) 3 Course organization Evaluation: Project teams of two students implementation of a selected model + analysis of given data implementation either in C, C++ without use of any specialized libraries for data analysis and machine learning need to get over a given accuracy threshold (a gentle one, just to eliminate non-functional implementations) Oral exam I may ask about anything from the lecture! 3 FAQ Q: Why we cannot use specialized libraries in projects? 4 FAQ Q: Why we cannot use specialized libraries in projects? A: In order to "touch" the low level implementation details of the algorithms. You should not even use libraries for linear algebra and numerical methods, so that you will be confronted with rounding errors and numerical instabilities. 4 FAQ Q: Why we cannot use specialized libraries in projects? A: In order to "touch" the low level implementation details of the algorithms. You should not even use libraries for linear algebra and numerical methods, so that you will be confronted with rounding errors and numerical instabilities. Q: Why should you attend this course when there are infinitely many great reasources elsewhere? A: There are at least two reasons: You may discuss issues with me, my colleagues and other students. I will make you truly learn fundamentals by heart. 4 Notable features of the course Use of mathematical notation and reasoning (contains several proofs that are mandatory for the exam) Sometimes goes deeper into statistical underpinnings of neural networks learning The project demands a complete working solution which must satisfy a prescribed performance specification 5 Notable features of the course Use of mathematical notation and reasoning (contains several proofs that are mandatory for the exam) Sometimes goes deeper into statistical underpinnings of neural networks learning The project demands a complete working solution which must satisfy a prescribed performance specification An unusual exam system! You can repeat the oral exam as many times as needed (only the best grade goes into IS). 5 Notable features of the course Use of mathematical notation and reasoning (contains several proofs that are mandatory for the exam) Sometimes goes deeper into statistical underpinnings of neural networks learning The project demands a complete working solution which must satisfy a prescribed performance specification An unusual exam system! You can repeat the oral exam as many times as needed (only the best grade goes into IS). An example of an instruction email (from another course with the same system): It is typically not sufficient to devote a single afternoon to the preparation for the exam. You have to know _everything_ (which means every single thing) starting with the slide 42 and ending with the slide 245 with notable exceptions of slides: 121 - 123, 137 - 140, 165, 167. Proofs presented on the whiteboard are also mandatory. 5 Machine learning in general Machine learning = construction of systems that may learn their functionality from data (... and thus do not need to be programmed.) 6 Machine learning in general Machine learning = construction of systems that may learn their functionality from data (... and thus do not need to be programmed.) spam filter learns to recognize spam from a database of "labelled" emails consequently is able to distinguish spam from ham 6 Machine learning in general Machine learning = construction of systems that may learn their functionality from data (... and thus do not need to be programmed.) spam filter learns to recognize spam from a database of "labelled" emails consequently is able to distinguish spam from ham handwritten text reader learns from a database of handwritten letters (or text) labelled by their correct meaning consequently is able to recognize text 6 Machine learning in general Machine learning = construction of systems that may learn their functionality from data (... and thus do not need to be programmed.) spam filter learns to recognize spam from a database of "labelled" emails consequently is able to distinguish spam from ham handwritten text reader learns from a database of handwritten letters (or text) labelled by their correct meaning consequently is able to recognize text · · · and lots of much much more sophisticated applications ... 6 Machine learning in general Machine learning = construction of systems that may learn their functionality from data (... and thus do not need to be programmed.) spam filter learns to recognize spam from a database of "labelled" emails consequently is able to distinguish spam from ham handwritten text reader learns from a database of handwritten letters (or text) labelled by their correct meaning consequently is able to recognize text · · · and lots of much much more sophisticated applications ... Basic attributes of learning algorithms: representation: ability to capture the inner structure of training data generalization: ability to work properly on new data 6 Machine learning in general Machine learning algorithms typically construct mathematical models of given data. The models may be subsequently applied to fresh data. 7 Machine learning in general Machine learning algorithms typically construct mathematical models of given data. The models may be subsequently applied to fresh data. There are many types of models: decision trees support vector machines hidden Markov models Bayes networks and other graphical models neural networks · · · Neural networks, based on models of a (human) brain, form a natural basis for learning algorithms! 7 Artificial neural networks Artificial neuron is a rough mathematical approximation of a biological neuron. (Aritificial) neural network (NN) consists of a number of interconnected artificial neurons. "Behavior" of the network is encoded in connections between neurons. σ ξ x1 x2 xn y Zdroj obrázku: http://tulane.edu/sse/cmb/people/schrader/ 8 Why artificial neural networks? Modelling of biological neural networks (computational neuroscience). simplified mathematical models help to identify important mechanisms How a brain receives information? How the information is stored? How a brain develops? · · · 9 Why artificial neural networks? Modelling of biological neural networks (computational neuroscience). simplified mathematical models help to identify important mechanisms How a brain receives information? How the information is stored? How a brain develops? · · · neuroscience is strongly multidisciplinary; precise mathematical descriptions help in communication among experts and in design of new experiments. I will not spend much time on this area! 9 Why artificial neural networks? Neural networks in machine learning. Typically primitive models, far from their biological counterparts (but often inspired by biology). 10 Why artificial neural networks? Neural networks in machine learning. Typically primitive models, far from their biological counterparts (but often inspired by biology). Strongly oriented towards concrete application domains: decision making and control - autonomous vehicles, manufacturing processes, control of natural resources games - backgammon, poker, GO, Starcraft, ... finance - stock prices, risk analysis medicine - diagnosis, signal processing (EKG, EEG, ...), image processing (MRI, roentgen, WSI ...) text and speech processing - automatic translation, text generation, speech recognition other signal processing - filtering, radar tracking, noise reduction · · · I will concentrate on this area! 10 Important features of neural networks Massive parallelism many slow (and "dumb") computational elements work in parallel on several levels of abstraction 11 Important features of neural networks Massive parallelism many slow (and "dumb") computational elements work in parallel on several levels of abstraction Learning a kid learns to recognize a rabbit after seeing several rabbits 11 Important features of neural networks Massive parallelism many slow (and "dumb") computational elements work in parallel on several levels of abstraction Learning a kid learns to recognize a rabbit after seeing several rabbits Generalization a kid is able to recognize a new rabbit after seeing several (old) rabbits 11 Important features of neural networks Massive parallelism many slow (and "dumb") computational elements work in parallel on several levels of abstraction Learning a kid learns to recognize a rabbit after seeing several rabbits Generalization a kid is able to recognize a new rabbit after seeing several (old) rabbits Robustness a blurred photo of a rabbit may still be classified as an image of a rabbit 11 Important features of neural networks Massive parallelism many slow (and "dumb") computational elements work in parallel on several levels of abstraction Learning a kid learns to recognize a rabbit after seeing several rabbits Generalization a kid is able to recognize a new rabbit after seeing several (old) rabbits Robustness a blurred photo of a rabbit may still be classified as an image of a rabbit Graceful degradation Experiments have shown that damaged neural network is still able to work quite well Damaged network may re-adapt, remaining neurons may take on functionality of the damaged ones 11 The aim of the course We will concentrate on basic techniques and principles of neural networks, fundamental models of neural networks and their applications. You should learn basic models (multilayer perceptron, convolutional networks, recurrent network (LSTM), Hopfield and Boltzmann machines and their use in pre-training of deep nets, autoencoders and generative adversarial networks) 12 The aim of the course We will concentrate on basic techniques and principles of neural networks, fundamental models of neural networks and their applications. You should learn basic models (multilayer perceptron, convolutional networks, recurrent network (LSTM), Hopfield and Boltzmann machines and their use in pre-training of deep nets, autoencoders and generative adversarial networks) Standard applications of these models (image processing, a little bit of speech and text processing) 12 The aim of the course We will concentrate on basic techniques and principles of neural networks, fundamental models of neural networks and their applications. You should learn basic models (multilayer perceptron, convolutional networks, recurrent network (LSTM), Hopfield and Boltzmann machines and their use in pre-training of deep nets, autoencoders and generative adversarial networks) Standard applications of these models (image processing, a little bit of speech and text processing) Basic learning algorithms (gradient descent & backpropagation, Hebb’s rule) 12 The aim of the course We will concentrate on basic techniques and principles of neural networks, fundamental models of neural networks and their applications. You should learn basic models (multilayer perceptron, convolutional networks, recurrent network (LSTM), Hopfield and Boltzmann machines and their use in pre-training of deep nets, autoencoders and generative adversarial networks) Standard applications of these models (image processing, a little bit of speech and text processing) Basic learning algorithms (gradient descent & backpropagation, Hebb’s rule) Basic practical training techniques (data preparation, setting various parameters, control of learning) 12 The aim of the course We will concentrate on basic techniques and principles of neural networks, fundamental models of neural networks and their applications. You should learn basic models (multilayer perceptron, convolutional networks, recurrent network (LSTM), Hopfield and Boltzmann machines and their use in pre-training of deep nets, autoencoders and generative adversarial networks) Standard applications of these models (image processing, a little bit of speech and text processing) Basic learning algorithms (gradient descent & backpropagation, Hebb’s rule) Basic practical training techniques (data preparation, setting various parameters, control of learning) Basic information about current implementations (TensorFlow, Keras) 12 Biological neural network Human neural network consists of approximately 1011 (100 billion on the short scale) neurons; a single cubic centimeter of a human brain contains almost 50 million neurons. Each neuron is connected with approx. 104 neurons. Neurons themselves are very complex systems. 13 Biological neural network Human neural network consists of approximately 1011 (100 billion on the short scale) neurons; a single cubic centimeter of a human brain contains almost 50 million neurons. Each neuron is connected with approx. 104 neurons. Neurons themselves are very complex systems. Rough description of nervous system: External stimulus is received by sensory receptors (e.g. eye cells). 13 Biological neural network Human neural network consists of approximately 1011 (100 billion on the short scale) neurons; a single cubic centimeter of a human brain contains almost 50 million neurons. Each neuron is connected with approx. 104 neurons. Neurons themselves are very complex systems. Rough description of nervous system: External stimulus is received by sensory receptors (e.g. eye cells). Information is futher transfered via peripheral nervous system (PNS) to the central nervous systems (CNS) where it is processed (integrated), and subseqently, an output signal is produced. 13 Biological neural network Human neural network consists of approximately 1011 (100 billion on the short scale) neurons; a single cubic centimeter of a human brain contains almost 50 million neurons. Each neuron is connected with approx. 104 neurons. Neurons themselves are very complex systems. Rough description of nervous system: External stimulus is received by sensory receptors (e.g. eye cells). Information is futher transfered via peripheral nervous system (PNS) to the central nervous systems (CNS) where it is processed (integrated), and subseqently, an output signal is produced. Afterwards, the output signal is transfered via PNS to effectors (e.g. muscle cells). 13 Biological neural network Zdroj: N. Campbell and J. Reece; Biology, 7th Edition; ISBN: 080537146X 14 Summation 15 Biological and Mathematical neurons 16 Formal neuron (without bias) σ ξ x1 x2 xn y w1 w2 · · · wn x1, . . . , xn ∈ R are inputs 17 Formal neuron (without bias) σ ξ x1 x2 xn y w1 w2 · · · wn x1, . . . , xn ∈ R are inputs w1, . . . , wn ∈ R are weights 17 Formal neuron (without bias) σ ξ x1 x2 xn y w1 w2 · · · wn x1, . . . , xn ∈ R are inputs w1, . . . , wn ∈ R are weights ξ is an inner potential; almost always ξ = n i=1 wixi 17 Formal neuron (without bias) σ ξ x1 x2 xn y w1 w2 · · · wn x1, . . . , xn ∈ R are inputs w1, . . . , wn ∈ R are weights ξ is an inner potential; almost always ξ = n i=1 wixi y is an output given by y = σ(ξ) where σ is an activation function; e.g. a unit step function σ(ξ) =    1 ξ ≥ h ; 0 ξ < h. where h ∈ R is a threshold. 17 Formal neuron (with bias) σ ξ x1 x2 xn x0 = 1 bias threshold y w1 w2 · · · wn w0 = −h x0 = 1, x1, . . . , xn ∈ R are inputs 18 Formal neuron (with bias) σ ξ x1 x2 xn x0 = 1 bias threshold y w1 w2 · · · wn w0 = −h x0 = 1, x1, . . . , xn ∈ R are inputs w0, w1, . . . , wn ∈ R are weights 18 Formal neuron (with bias) σ ξ x1 x2 xn x0 = 1 bias threshold y w1 w2 · · · wn w0 = −h x0 = 1, x1, . . . , xn ∈ R are inputs w0, w1, . . . , wn ∈ R are weights ξ is an inner potential; almost always ξ = w0 + n i=1 wixi 18 Formal neuron (with bias) σ ξ x1 x2 xn x0 = 1 bias threshold y w1 w2 · · · wn w0 = −h x0 = 1, x1, . . . , xn ∈ R are inputs w0, w1, . . . , wn ∈ R are weights ξ is an inner potential; almost always ξ = w0 + n i=1 wixi y is an output given by y = σ(ξ) where σ is an activation function; e.g. a unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. (The threshold h has been substituted with the new input x0 = 1 and the weight w0 = −h.) 18 Neuron and linear separation ξ = 0 ξ > 0 ξ > 0 ξ < 0 ξ < 0 inner potential ξ = w0 + n i=1 wixi determines a separation hyperplane in the n-dimensional input space in 2d line in 3d plane · · · 19 Neuron and linear separation σ σ( wixi) x1 xn · · · 1/0 by A/B w1 wn n = 8 · 8, i.e. the number of pixels in the images. Inputs are binary vectors of dimension n (black pixel ≈ 1, white pixel ≈ 0). 20 Neuron and linear separation σ x1 xn · · · x0 = 1 1/0 pro A/B w1 wn w0 n = 8 · 8, i.e. the number of pixels in the images. Inputs are binary vectors of dimension n (black pixel ≈ 1, white pixel ≈ 0). 21 Neuron and linear separation ¯w0 + n i=1 ¯wixi = 0 w0 + n i=1 wixi = 0 A A A A B B B Red line classifies incorrectly Green line classifies correctly (may be a result of a correction by a learning algorithm) 22 Neuron and linear separation (XOR) 0 (0, 0) 1 (0, 1) 1 (0, 1) 0 (1, 1) x1 x2 No line separates ones from zeros. 23 Neural networks Neural network consists of formal neurons interconnected in such a way that the output of one neuron is an input of several other neurons. In order to describe a particular type of neural networks we need to specify: Architecture How the neurons are connected. Activity How the network transforms inputs to outputs. Learning How the weights are changed during training. 24 Architecture Network architecture is given as a digraph whose nodes are neurons and edges are connections. We distinguish several categories of neurons: Output neurons Hidden neurons Input neurons (In general, a neuron may be both input and output; a neuron is hidden if it is neither input, nor output.) 25 Architecture – Cycles A network is cyclic (recurrent) if its architecture contains a directed cycle. 26 Architecture – Cycles A network is cyclic (recurrent) if its architecture contains a directed cycle. Otherwise it is acyclic (feed-forward) 26 Architecture – Multilayer Perceptron (MLP) Input Hidden Output x1 x2 y1 y2 Neurons partitioned into layers; one input layer, one output layer, possibly several hidden layers 27 Architecture – Multilayer Perceptron (MLP) Input Hidden Output x1 x2 y1 y2 Neurons partitioned into layers; one input layer, one output layer, possibly several hidden layers layers numbered from 0; the input layer has number 0 E.g. three-layer network has two hidden layers and one output layer 27 Architecture – Multilayer Perceptron (MLP) Input Hidden Output x1 x2 y1 y2 Neurons partitioned into layers; one input layer, one output layer, possibly several hidden layers layers numbered from 0; the input layer has number 0 E.g. three-layer network has two hidden layers and one output layer Neurons in the i-th layer are connected with all neurons in the i + 1-st layer 27 Architecture – Multilayer Perceptron (MLP) Input Hidden Output x1 x2 y1 y2 Neurons partitioned into layers; one input layer, one output layer, possibly several hidden layers layers numbered from 0; the input layer has number 0 E.g. three-layer network has two hidden layers and one output layer Neurons in the i-th layer are connected with all neurons in the i + 1-st layer Architecture of a MLP is typically described by numbers of neurons in individual layers (e.g. 2-4-3-2) 27 Activity Consider a network with n neurons, k input and output. 28 Activity Consider a network with n neurons, k input and output. State of a network is a vector of output values of all neurons. (States of a network with n neurons are vectors of Rn ) State-space of a network is a set of all states. 28 Activity Consider a network with n neurons, k input and output. State of a network is a vector of output values of all neurons. (States of a network with n neurons are vectors of Rn ) State-space of a network is a set of all states. Network input is a vector of k real numbers, i.e. an element of Rk . Network input space is a set of all network inputs. (sometimes we restrict ourselves to a proper subset of Rk ) 28 Activity Consider a network with n neurons, k input and output. State of a network is a vector of output values of all neurons. (States of a network with n neurons are vectors of Rn ) State-space of a network is a set of all states. Network input is a vector of k real numbers, i.e. an element of Rk . Network input space is a set of all network inputs. (sometimes we restrict ourselves to a proper subset of Rk ) Initial state Input neurons set to values from the network input (each component of the network input corresponds to an input neuron) Values of the remaining neurons set to 0. 28 Activity – computation of a network Computation (typically) proceeds in discrete steps. 29 Activity – computation of a network Computation (typically) proceeds in discrete steps. In every step the following happens: 29 Activity – computation of a network Computation (typically) proceeds in discrete steps. In every step the following happens: 1. A set of neurons is selected according to some rule. 2. The selected neurons change their states according to their inputs (they are simply evaluated). (If a neuron does not have any inputs, its value remains constant.) 29 Activity – computation of a network Computation (typically) proceeds in discrete steps. In every step the following happens: 1. A set of neurons is selected according to some rule. 2. The selected neurons change their states according to their inputs (they are simply evaluated). (If a neuron does not have any inputs, its value remains constant.) A computation is finite on a network input x if the state changes only finitely many times (i.e. there is a moment in time after which the state of the network never changes). We also say that the network stops on x. 29 Activity – computation of a network Computation (typically) proceeds in discrete steps. In every step the following happens: 1. A set of neurons is selected according to some rule. 2. The selected neurons change their states according to their inputs (they are simply evaluated). (If a neuron does not have any inputs, its value remains constant.) A computation is finite on a network input x if the state changes only finitely many times (i.e. there is a moment in time after which the state of the network never changes). We also say that the network stops on x. Network output is a vector of values of all output neurons in the network (i.e. an element of R ). Note that the network output keeps changing throughout the computation! 29 Activity – computation of a network Computation (typically) proceeds in discrete steps. In every step the following happens: 1. A set of neurons is selected according to some rule. 2. The selected neurons change their states according to their inputs (they are simply evaluated). (If a neuron does not have any inputs, its value remains constant.) A computation is finite on a network input x if the state changes only finitely many times (i.e. there is a moment in time after which the state of the network never changes). We also say that the network stops on x. Network output is a vector of values of all output neurons in the network (i.e. an element of R ). Note that the network output keeps changing throughout the computation! MLP uses the following selection rule: In the i-th step evaluate all neurons in the i-th layer. 29 Activity – semantics of a network Definition Consider a network with n neurons, k input, output. Let A ⊆ Rk and B ⊆ R . Suppose that the network stops on every input of A. Then we say that the network computes a function F : A → B if for every network input x the vector F(x) ∈ B is the output of the network after the computation on x stops. 30 Activity – semantics of a network Definition Consider a network with n neurons, k input, output. Let A ⊆ Rk and B ⊆ R . Suppose that the network stops on every input of A. Then we say that the network computes a function F : A → B if for every network input x the vector F(x) ∈ B is the output of the network after the computation on x stops. 30 Activity – semantics of a network Definition Consider a network with n neurons, k input, output. Let A ⊆ Rk and B ⊆ R . Suppose that the network stops on every input of A. Then we say that the network computes a function F : A → B if for every network input x the vector F(x) ∈ B is the output of the network after the computation on x stops. Example 1 This network computes a function from R2 to R. 30 Activity – inner potential and activation functions In order to specify activity of the network, we need to specify how the inner potentials ξ are computed and what are the activation functions σ. 31 Activity – inner potential and activation functions In order to specify activity of the network, we need to specify how the inner potentials ξ are computed and what are the activation functions σ. We assume (unless otherwise specified) that ξ = w0 + n i=1 wi · xi here x = (x1, . . . , xn) are inputs of the neuron and w = (w1, . . . , wn) are weights. 31 Activity – inner potential and activation functions In order to specify activity of the network, we need to specify how the inner potentials ξ are computed and what are the activation functions σ. We assume (unless otherwise specified) that ξ = w0 + n i=1 wi · xi here x = (x1, . . . , xn) are inputs of the neuron and w = (w1, . . . , wn) are weights. There are special types of neural network where the inner potential is computed differently, e.g. as a "distance" of an input from the weight vector: ξ = x − w here ||·|| is a vector norm, typically Euclidean. 31 Activity – inner potential and activation functions There are many activation functions, typical examples: Unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. 32 Activity – inner potential and activation functions There are many activation functions, typical examples: Unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. (Logistic) sigmoid σ(ξ) = 1 1 + e−λ·ξ here λ ∈ R is a steepness parameter. Hyperbolic tangens σ(ξ) = 1 − e−ξ 1 + e−ξ ReLU σ(ξ) = max(ξ, 0) 32 Activity – XOR 1 1 σ 01 σ0 1 σ 0 1 −22 2 −2 1 −1 1 3 −2 Activation function is a unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. The network computes XOR(x1, x2) x1 x2 y 1 1 0 1 0 1 0 1 1 0 0 0 33 Activity – XOR 1 1 σ 11 σ0 1 σ 0 1 −22 2 −2 1 −1 1 3 −2 Activation function is a unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. The network computes XOR(x1, x2) x1 x2 y 1 1 0 1 0 1 0 1 1 0 0 0 33 Activity – XOR 0 0 σ 01 σ0 1 σ 0 1 −22 2 −2 1 −1 1 3 −2 Activation function is a unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. The network computes XOR(x1, x2) x1 x2 y 1 1 0 1 0 1 0 1 1 0 0 0 33 Activity – XOR 0 0 σ 01 σ 1 1 σ 0 1 −22 2 −2 1 −1 1 3 −2 Activation function is a unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. The network computes XOR(x1, x2) x1 x2 y 1 1 0 1 0 1 0 1 1 0 0 0 33 Activity – XOR 1 0 σ 01 σ0 1 σ 0 1 −22 2 −2 1 −1 1 3 −2 Activation function is a unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. The network computes XOR(x1, x2) x1 x2 y 1 1 0 1 0 1 0 1 1 0 0 0 33 Activity – XOR 1 0 σ 11 σ 1 1 σ 0 1 −22 2 −2 1 −1 1 3 −2 Activation function is a unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. The network computes XOR(x1, x2) x1 x2 y 1 1 0 1 0 1 0 1 1 0 0 0 33 Activity – XOR 1 0 σ 11 σ 1 1 σ 1 1 −22 2 −2 1 −1 1 3 −2 Activation function is a unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. The network computes XOR(x1, x2) x1 x2 y 1 1 0 1 0 1 0 1 1 0 0 0 33 Activity – XOR 0 1 σ 01 σ0 1 σ 0 1 −22 2 −2 1 −1 1 3 −2 Activation function is a unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. The network computes XOR(x1, x2) x1 x2 y 1 1 0 1 0 1 0 1 1 0 0 0 33 Activity – XOR 0 1 σ 11 σ 1 1 σ 0 1 −22 2 −2 1 −1 1 3 −2 Activation function is a unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. The network computes XOR(x1, x2) x1 x2 y 1 1 0 1 0 1 0 1 1 0 0 0 33 Activity – XOR 0 1 σ 11 σ 1 1 σ 1 1 −22 2 −2 1 −1 1 3 −2 Activation function is a unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. The network computes XOR(x1, x2) x1 x2 y 1 1 0 1 0 1 0 1 1 0 0 0 33 Activity – MLP and linear separation 0 (0, 0) 1 (0, 1) 1 (0, 1) 0 (1, 1) P1 P2 x1 x2 σ1 σ 1 σ1 −22 2 −2 1 −1 1 3 −2 The line P1 is given by −1 + 2x1 + 2x2 = 0 The line P2 is given by 3 − 2x1 − 2x2 = 0 34 Activity – example x1 1 σ 0 1 σ0 1 σ 0 1 1 2 −5 1 −2 11 −2 −1 The activation function is the unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. The input is equal to 1 35 Activity – example x1 1 σ 1 1 σ0 1 σ 0 1 1 2 −5 1 −2 11 −2 −1 The activation function is the unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. The input is equal to 1 35 Activity – example x1 1 σ 1 1 σ 1 1 σ 0 1 1 2 −5 1 −2 11 −2 −1 The activation function is the unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. The input is equal to 1 35 Activity – example x1 1 σ 1 1 σ 1 1 σ 1 1 1 2 −5 1 −2 11 −2 −1 The activation function is the unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. The input is equal to 1 35 Activity – example x1 1 σ 0 1 σ 1 1 σ 1 1 1 2 −5 1 −2 11 −2 −1 The activation function is the unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. The input is equal to 1 35 Learning Consider a network with n neurons, k input and output. 36 Learning Consider a network with n neurons, k input and output. Configuration of a network is a vector of all values of weights. (Configurations of a network with m connections are elements of Rm ) Weight-space of a network is a set of all configurations. 36 Learning Consider a network with n neurons, k input and output. Configuration of a network is a vector of all values of weights. (Configurations of a network with m connections are elements of Rm ) Weight-space of a network is a set of all configurations. initial configuration weights can be initialized randomly or using some sophisticated algorithm 36 Learning algorithms Learning rule for weight adaptation. (the goal is to find a configuration in which the network computes a desired function) 37 Learning algorithms Learning rule for weight adaptation. (the goal is to find a configuration in which the network computes a desired function) Supervised learning The desired function is described using training examples that are pairs of the form (input, output). Learning algorithm searches for a configuration which "corresponds" to the training examples, typically by minimizing an error function. 37 Learning algorithms Learning rule for weight adaptation. (the goal is to find a configuration in which the network computes a desired function) Supervised learning The desired function is described using training examples that are pairs of the form (input, output). Learning algorithm searches for a configuration which "corresponds" to the training examples, typically by minimizing an error function. Unsupervised learning The training set contains only inputs. The goal is to determine distribution of the inputs (clustering, deep belief networks, etc.) 37 Supervised learning – illustration A A A A B B B classification in the plane using a single neuron 38 Supervised learning – illustration A A A A B B B classification in the plane using a single neuron training examples are of the form (point, value) where the value is either 1, or 0 depending on whether the point is either A, or B 38 Supervised learning – illustration A A A A B B B classification in the plane using a single neuron training examples are of the form (point, value) where the value is either 1, or 0 depending on whether the point is either A, or B the algorithm considers examples one after another whenever an incorrectly classified point is considered, the learning algorithm turns the line in the direction of the point 38 Summary – Advantages of neural networks Massive parallelism neurons can be evaluated in parallel 39 Summary – Advantages of neural networks Massive parallelism neurons can be evaluated in parallel Learning many sophisticated learning algorithms used to "program" neural networks 39 Summary – Advantages of neural networks Massive parallelism neurons can be evaluated in parallel Learning many sophisticated learning algorithms used to "program" neural networks generalization and robustness information is encoded in a distributed manned in weights "close" inputs typicaly get similar values 39 Summary – Advantages of neural networks Massive parallelism neurons can be evaluated in parallel Learning many sophisticated learning algorithms used to "program" neural networks generalization and robustness information is encoded in a distributed manned in weights "close" inputs typicaly get similar values Graceful degradation damage typically causes only a decrease in precision of results 39 Expressive power of neural networks 40 Formal neuron (with bias) σ ξ x1 x2 xn x0 = 1 bias threshold y w1 w2 · · · wn w0 = −h x0 = 1, x1, . . . , xn ∈ R are inputs w0, w1, . . . , wn ∈ R are weights ξ is an inner potential; almost always ξ = w0 + n i=1 wixi y is an output given by y = σ(ξ) where σ is an activation function; e.g. a unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. 41 Boolean functions Activation function: unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. 42 Boolean functions Activation function: unit step function σ(ξ) =    1 ξ ≥ 0 ; 0 ξ < 0. σ x1 x2 xn x0 = 1 y = AND(x1, . . . , xn) 1 1 · · · 1 −n σ x1 x2 xn x0 = 1 y = OR(x1, . . . , xn) 1 1 · · · 1 −1 σ x1 x0 = 1 y = NOT(x1) −1 0 42 Boolean functions Theorem Let σ be the unit step function. Two layer MLPs, where each neuron has σ as the activation function, are able to compute all functions of the form F : {0, 1}n → {0, 1}. 43 Boolean functions Theorem Let σ be the unit step function. Two layer MLPs, where each neuron has σ as the activation function, are able to compute all functions of the form F : {0, 1}n → {0, 1}. Proof. Given a vector v = (v1, . . . , vn) ∈ {0, 1}n, consider a neuron Nv whose output is 1 iff the input is v: σ y x1 xi xn x0 = 1 w1 wi · · ·· · · wn w0 w0 = − n i=1 vi wi =    1 vi = 1 −1 vi = 0 Now let us connect all outputs of all neurons Nv satisfying F(v) = 1 using a neuron implementing OR. 43 Non-linear separation x1 x2 y Consider a three layer network; each neuron has the unit step activation function. The network divides the input space in two subspaces according to the output (0 or 1). 44 Non-linear separation x1 x2 y Consider a three layer network; each neuron has the unit step activation function. The network divides the input space in two subspaces according to the output (0 or 1). The first (hidden) layer divides the input space into half-spaces. 44 Non-linear separation x1 x2 y Consider a three layer network; each neuron has the unit step activation function. The network divides the input space in two subspaces according to the output (0 or 1). The first (hidden) layer divides the input space into half-spaces. The second layer may e.g. make intersections of the half-spaces ⇒ convex sets. 44 Non-linear separation x1 x2 y Consider a three layer network; each neuron has the unit step activation function. The network divides the input space in two subspaces according to the output (0 or 1). The first (hidden) layer divides the input space into half-spaces. The second layer may e.g. make intersections of the half-spaces ⇒ convex sets. The third layer may e.g. make unions of some convex sets. 44 Non-linear separation – illustration x1 xk · · · · · · · · · y Consider three layer networks; each neuron has the unit step activation function. Three layer nets are capable of "approximating" any "reasonable" subset A of the input space Rk . 45 Non-linear separation – illustration x1 xk · · · · · · · · · y Consider three layer networks; each neuron has the unit step activation function. Three layer nets are capable of "approximating" any "reasonable" subset A of the input space Rk . Cover A with hypercubes (in 2D squares, in 3D cubes, ...) 45 Non-linear separation – illustration x1 xk · · · · · · · · · y Consider three layer networks; each neuron has the unit step activation function. Three layer nets are capable of "approximating" any "reasonable" subset A of the input space Rk . Cover A with hypercubes (in 2D squares, in 3D cubes, ...) Each hypercube K can be separated using a two layer network NK (i.e. a function computed by NK gives 1 for points in K and 0 for the rest). 45 Non-linear separation – illustration x1 xk · · · · · · · · · y Consider three layer networks; each neuron has the unit step activation function. Three layer nets are capable of "approximating" any "reasonable" subset A of the input space Rk . Cover A with hypercubes (in 2D squares, in 3D cubes, ...) Each hypercube K can be separated using a two layer network NK (i.e. a function computed by NK gives 1 for points in K and 0 for the rest). Finally, connect outputs of the nets NK satisfying K ∩ A ∅ using a neuron implementing OR. 45 Non-linear separation - sigmoid Theorem (Cybenko 1989 - informal version) Let σ be a continuous function which is sigmoidal, i.e. satisfies σ(x) =    1 pro x → +∞ 0 pro x → −∞ For every "reasonable" set A ⊆ [0, 1]n, there is a two layer network where each hidden neuron has the activation function σ (output neurons are linear), that satisfies the following: For "most" vectors v ∈ [0, 1]n we have that v ∈ A iff the network output is > 0 for the input v. For mathematically oriented: "reasonable" means Lebesgue measurable "most" means that the set of incorrectly classified vectors has the Lebesgue measure smaller than a given ε > 0 46 Non-linear separation - practical illustration ALVINN drives a car 47 Non-linear separation - practical illustration ALVINN drives a car The net has 30 × 32 = 960 inputs (the input space is thus R960 ) 47 Non-linear separation - practical illustration ALVINN drives a car The net has 30 × 32 = 960 inputs (the input space is thus R960 ) Input values correspond to shades of gray of pixels. 47 Non-linear separation - practical illustration ALVINN drives a car The net has 30 × 32 = 960 inputs (the input space is thus R960 ) Input values correspond to shades of gray of pixels. Output neurons "classify" images of the road based on their "curvature". Zdroj obrázku: http://jmvidal.cse.sc.edu/talks/ann/alvin.html 47 Function approximation - three layers Let σ be a logistic sigmoid, i.e. σ(ξ) = 1 1 + e−ξ For every continuous function f : [0, 1]n → [0, 1] and ε > 0 there is a three-layer network computing a function F : [0, 1]n → [0, 1] such that there is a linear activation in the output layer, i.e. the value of the output neuron is its inner potential ξ, 48 Function approximation - three layers Let σ be a logistic sigmoid, i.e. σ(ξ) = 1 1 + e−ξ For every continuous function f : [0, 1]n → [0, 1] and ε > 0 there is a three-layer network computing a function F : [0, 1]n → [0, 1] such that there is a linear activation in the output layer, i.e. the value of the output neuron is its inner potential ξ, the remaining neurons have the logistic sigmoid σ as their activation, 48 Function approximation - three layers Let σ be a logistic sigmoid, i.e. σ(ξ) = 1 1 + e−ξ For every continuous function f : [0, 1]n → [0, 1] and ε > 0 there is a three-layer network computing a function F : [0, 1]n → [0, 1] such that there is a linear activation in the output layer, i.e. the value of the output neuron is its inner potential ξ, the remaining neurons have the logistic sigmoid σ as their activation, for every v ∈ [0, 1]n we have that |F(v) − f(v)| < ε. 48 Function approximation – three layer networks x1 x2 σ σ σ σ σ· · · · · · · · · ζ y weighted sum of "spikes" ... + the other two 90 degree rotations a "spike" inner potential the value of the neuron 49 Function approximation - two-layer networks Theorem (Cybenko 1989) Let σ be a continuous function which is sigmoidal, i.e. is increasing and satisfies σ(x) =    1 pro x → +∞ 0 pro x → −∞ For every continuous function f : [0, 1]n → [0, 1] and every ε > 0 there is a function F : [0, 1]n → [0, 1] computed by a two layer network where each hidden neuron has the activation function σ (output neurons are linear), that satisfies the following |f(v) − F(v)| < ε pro každé v ∈ [0, 1]n . 50 Neural networks and computability Consider recurrent networks (i.e. containing cycles) 51 Neural networks and computability Consider recurrent networks (i.e. containing cycles) with real weights (in general); 51 Neural networks and computability Consider recurrent networks (i.e. containing cycles) with real weights (in general); one input neuron and one output neuron (the network computes a function F : A → R where A ⊆ R contains all inputs on which the network stops); 51 Neural networks and computability Consider recurrent networks (i.e. containing cycles) with real weights (in general); one input neuron and one output neuron (the network computes a function F : A → R where A ⊆ R contains all inputs on which the network stops); parallel activity rule (output values of all neurons are recomputed in every step); 51 Neural networks and computability Consider recurrent networks (i.e. containing cycles) with real weights (in general); one input neuron and one output neuron (the network computes a function F : A → R where A ⊆ R contains all inputs on which the network stops); parallel activity rule (output values of all neurons are recomputed in every step); activation function σ(ξ) =    1 ξ ≥ 1 ; ξ 0 ≤ ξ ≤ 1 ; 0 ξ < 0. 51 Neural networks and computability Consider recurrent networks (i.e. containing cycles) with real weights (in general); one input neuron and one output neuron (the network computes a function F : A → R where A ⊆ R contains all inputs on which the network stops); parallel activity rule (output values of all neurons are recomputed in every step); activation function σ(ξ) =    1 ξ ≥ 1 ; ξ 0 ≤ ξ ≤ 1 ; 0 ξ < 0. We encode words ω ∈ {0, 1}+ into numbers as follows: δ(ω) = |ω| i=1 ω(i) 2i + 1 2|ω|+1 E.g. ω = 11001 gives δ(ω) = 1 2 + 1 22 + 1 25 + 1 26 (= 0.110011 in binary form). 51 Neural networks and computability A network recognizes a language L ⊆ {0, 1}+ if it computes a function F : A → R (A ⊆ R) such that ω ∈ L iff δ(ω) ∈ A and F(δ(ω)) > 0. 52 Neural networks and computability A network recognizes a language L ⊆ {0, 1}+ if it computes a function F : A → R (A ⊆ R) such that ω ∈ L iff δ(ω) ∈ A and F(δ(ω)) > 0. Recurrent networks with rational weights are equivalent to Turing machines For every recursively enumerable language L ⊆ {0, 1}+ there is a recurrent network with rational weights and less than 1000 neurons, which recognizes L. The halting problem is undecidable for networks with at least 25 neurons and rational weights. There is "universal" network (equivalent of the universal Turing machine) 52 Neural networks and computability A network recognizes a language L ⊆ {0, 1}+ if it computes a function F : A → R (A ⊆ R) such that ω ∈ L iff δ(ω) ∈ A and F(δ(ω)) > 0. Recurrent networks with rational weights are equivalent to Turing machines For every recursively enumerable language L ⊆ {0, 1}+ there is a recurrent network with rational weights and less than 1000 neurons, which recognizes L. The halting problem is undecidable for networks with at least 25 neurons and rational weights. There is "universal" network (equivalent of the universal Turing machine) Recurrent networks are super-Turing powerful 52 Neural networks and computability A network recognizes a language L ⊆ {0, 1}+ if it computes a function F : A → R (A ⊆ R) such that ω ∈ L iff δ(ω) ∈ A and F(δ(ω)) > 0. Recurrent networks with rational weights are equivalent to Turing machines For every recursively enumerable language L ⊆ {0, 1}+ there is a recurrent network with rational weights and less than 1000 neurons, which recognizes L. The halting problem is undecidable for networks with at least 25 neurons and rational weights. There is "universal" network (equivalent of the universal Turing machine) Recurrent networks are super-Turing powerful For every language L ⊆ {0, 1}+ there is a recurrent network with less than 1000 nerons which recognizes L. 52 Summary of theoretical results Neural networks are very strong from the point of view of theory: All Boolean functions can be expressed using two-layer networks. Two-layer networks may approximate any continuous function. Recurrent networks are at least as strong as Turing machines. 53 Summary of theoretical results Neural networks are very strong from the point of view of theory: All Boolean functions can be expressed using two-layer networks. Two-layer networks may approximate any continuous function. Recurrent networks are at least as strong as Turing machines. These results are purely theoretical! "Theoretical" networks are extremely huge. It is very difficult to handcraft them even for simplest problems. From practical point of view, the most important advantage of neural networks are: learning, generalization, robustness. 53 Neural networks vs classical computers Neural networks "Classical" computers Data implicitly in weights explicitly Computation naturally parallel sequential, localized Robustness robust w.r.t. input corruption & damage changing one bit may completely crash the computation Precision imprecise, network recalls a training example "similar" to the input (typically) precise Programming learning manual 54 History & implementations 55 History of neurocomputers 1951: SNARC (Minski et al) the first implementation of neural network a rat strives to exit a maze 40 artificial neurons (300 vacuum tubes, engines, etc.) 56 History of neurocomputers 1957: Mark I Perceptron (Rosenblatt et al) - the first successful network for image recognition single layer network image represented by 20 × 20 photocells intensity of pixels was treated as the input to a perceptron (basically the formal neuron), which recognized figures weights were implemented using potentiometers, each set by its own engine it was possible to arbitrarily reconnect inputs to neurons to demonstrate adaptability 57 History of neurocomputers 1960: ADALINE (Widrow & Hof) single layer neural network weights stored in a newly invented electronic component memistor, which remembers history of electric current in the form of resistance. Widrow founded a company Memistor Corporation, which sold implementations of neural networks. 1960-66: several companies concerned with neural networks were founded. 58 History of neurocomputers 1967-82: dead still after publication of a book by Minski & Papert (published 1969, title Perceptrons) 1983-end of 90s: revival of neural networks many attempts at hardware implementations application specific chips (ASIC) programmable hardware (FPGA) hw implementations typically not better than "software" implementations on universal computers (problems with weight storage, size, speed, cost of production etc.) 59 History of neurocomputers 1967-82: dead still after publication of a book by Minski & Papert (published 1969, title Perceptrons) 1983-end of 90s: revival of neural networks many attempts at hardware implementations application specific chips (ASIC) programmable hardware (FPGA) hw implementations typically not better than "software" implementations on universal computers (problems with weight storage, size, speed, cost of production etc.) end of 90s-cca 2005: NN suppressed by other machine learning methods (support vector machines (SVM)) 2006-now: The boom of neural networks! deep networks – often better than any other method GPU implementations ... some specialized hw implementations (Google’s TPU) 59 Some highlights Breakthrough in image recognition. Accuracy of image recognition improved by an order of magnitude in 5 years. Breakthrough in game playing. Superhuman results in Go and Chess almost without any human intervention. Master level in Starcraft, poker, etc. Breakthrough in machine translation. Switching to deep learning produced a 60% increase in translation accuracy compared to the phrase-based approach previously used in Google Translate (in human evaluation) Breakthrough in speech processing. 60 History in waves ... Figure: The figure shows two of the three historical waves of artificial neural nets research, as measured by the frequency of the phrases "cybernetics" and "connectionism" or "neural networks" according to Google Books (the third wave is too recent to appear). 61 Current hardware – What do we face? Increasing dataset size ... 62 Current hardware – What do we face? ... and thus increasing size of neural networks ... 2. ADALINE 4. Early back-propagation network (Rumelhart et al., 1986b) 8. Image recognition: LeNet-5 (LeCun et al., 1998b) 10. Dimensionality reduction: Deep belief network (Hinton et al., 2006) ... here the third "wave" of neural networks started 15. Digit recognition: GPU-accelerated multilayer perceptron (Ciresan et al., 2010) 18. Image recognition (AlexNet): Multi-GPU convolutional network (Krizhevsky et al., 2012) 20. Image recognition: GoogLeNet (Szegedy et al., 2014a) 63 Current hardware – What do we face? ... as a reward we get this ... Figure: Since deep networks reached the scale necessary to compete in the ImageNetLarge Scale Visual Recognition Challenge, they have consistently won the competition every year, and yielded lower and lower error rates each time. Data from Russakovsky et al. (2014b) and He et al. (2015). 64 Current hardware In 2012, Google trained a large network of 1.7 billion weights and 9 layers The task was image recognition (10 million youtube video frames) The hw comprised a 1000 computer network (16 000 cores), computation took three days. 65 Current hardware In 2012, Google trained a large network of 1.7 billion weights and 9 layers The task was image recognition (10 million youtube video frames) The hw comprised a 1000 computer network (16 000 cores), computation took three days. In 2014, similar task performed on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI. Able to train 1 billion parameter networks on just 3 machines in a couple of days. Able to scale to 11 billion weights (approx. 6.5 times larger than the Google model) on 16 GPUs. 65 Current hardware – NVIDIA DGX-1 Station 8x GPU (Tesla V100) TFLOPS = 1000 GPU memory 256GB total NVIDIA Tensor Cores: 5,120 NVIDIA CUDA Cores: 40,960 System memory: 512 GB Network: Dual 10 Gb LAN NVIDIA Deep Learning SDK 66 Deep learning in clouds Several companies offer cloud services for deep learning: Amazon Web Services Google Cloud Deep Cognition ... Advantages: Do not have to care (too much) about technical problems. Do not have to buy and optimize highend hw/sw, networks etc. Scaling & virtually limitless storage. Disadvatages: Do not have full control. Performance can vary, connectivity problems. Have to pay for services. Privacy issues. 67 Current software TensorFlow (Google) open source software library for numerical computation using data flow graphs allows implementation of most current neural networks allows computation on multiple devices (CPUs, GPUs, ...) Python API Keras: a part of TensorFlow that allows easy description of most modern neural networks PyTorch (Facebook) similar to TensorFlow object oriented Theano (dead): The "academic" grand-daddy of deep-learning frameworks, written in Python. Strongly inspired TensorFlow (some people developing Theano moved on to develop TensorFlow). There are others: Caffe, Deeplearning4j, ... 68 Current software – Keras 69 Current software – Keras functional API 70 Current software – TensorFlow 71 Current software – TensorFlow 72 Current software – PyTorch 73 Other software implementations Most "mathematical" software packages contain some support of neural networks: MATLAB R STATISTICA Weka ... The implementations are typically not on par with the previously mentioned dedicated deep-learning libraries. 74 Training linear models 75 Linear regression (ADALINE) Architecture: x1 x2 xn · · · y x0 = 1 w0 w1 w2 wn w = (w0, w1, . . . , wn) and x = (x0, x1, . . . , xn) where x0 = 1. Activity: inner potential: ξ = w0 + n i=1 wixi = n i=0 wixi = w · x activation function: σ(ξ) = ξ network function: y[w](x) = σ(ξ) = w · x 76 Linear regression (ADALINE) Learning: Given a training dataset T = x1, d1 , x2, d2 , . . . , xp, dp Here xk = (xk0, xk1 . . . , xkn) ∈ Rn+1, xk0 = 1, is the k-th input, and dk ∈ R is the expected output. Intuition: The network is supposed to compute an affine approximation of the function (some of) whose values are given in the training set. 77 Oaks in Wisconsin 78 Linear regression (ADALINE) Error function: E(w) = 1 2 p k=1 w · xk − dk 2 = 1 2 p k=1   n i=0 wixki − dk   2 The goal is to find w which minimizes E(w). 79 Error function 80 Gradient of the error function Consider gradient of the error function: E(w) = ∂E ∂w0 (w), . . . , ∂E ∂wn (w) Intuition: E(w) is a vector in the weight space which points in the direction of the steepest ascent of the error function. Note that the vectors xk are just parameters of the function E, and are thus fixed! 81 Gradient of the error function Consider gradient of the error function: E(w) = ∂E ∂w0 (w), . . . , ∂E ∂wn (w) Intuition: E(w) is a vector in the weight space which points in the direction of the steepest ascent of the error function. Note that the vectors xk are just parameters of the function E, and are thus fixed! Fact If E(w) = 0 = (0, . . . , 0), then w is a global minimum of E. For ADALINE, the error function E(w) is a convex paraboloid and thus has the unique global minimum. 81 Gradient - illustration Caution! This picture just illustrates the notion of gradient ... it is not the convex paraboloid E(w) ! 82 Gradient of the error function First, consider n = 1. Then the model is y = w0 + w1 · x. 83 Gradient of the error function First, consider n = 1. Then the model is y = w0 + w1 · x. Consider a concrete training set: T = {((1, 2), 1), ((1, 3), 2), ((1, 4), 5)} = ((x10, x11), d1), ((x20, x21), d2), ((x30, x31), d3) 83 Gradient of the error function First, consider n = 1. Then the model is y = w0 + w1 · x. Consider a concrete training set: T = {((1, 2), 1), ((1, 3), 2), ((1, 4), 5)} = ((x10, x11), d1), ((x20, x21), d2), ((x30, x31), d3) E(w0, w1) = 1 2 [(w0+w1·2−1)2+(w0+w1·3−2)2+(w0+w1·4−5)2] 83 Gradient of the error function First, consider n = 1. Then the model is y = w0 + w1 · x. Consider a concrete training set: T = {((1, 2), 1), ((1, 3), 2), ((1, 4), 5)} = ((x10, x11), d1), ((x20, x21), d2), ((x30, x31), d3) E(w0, w1) = 1 2 [(w0+w1·2−1)2+(w0+w1·3−2)2+(w0+w1·4−5)2] δE δw0 83 Gradient of the error function First, consider n = 1. Then the model is y = w0 + w1 · x. Consider a concrete training set: T = {((1, 2), 1), ((1, 3), 2), ((1, 4), 5)} = ((x10, x11), d1), ((x20, x21), d2), ((x30, x31), d3) E(w0, w1) = 1 2 [(w0+w1·2−1)2+(w0+w1·3−2)2+(w0+w1·4−5)2] δE δw0 = (w0 +w1 ·2−1)·1+(w0 +w1 ·3−2)·1+(w0 +w1 ·4−5)·1 83 Gradient of the error function First, consider n = 1. Then the model is y = w0 + w1 · x. Consider a concrete training set: T = {((1, 2), 1), ((1, 3), 2), ((1, 4), 5)} = ((x10, x11), d1), ((x20, x21), d2), ((x30, x31), d3) E(w0, w1) = 1 2 [(w0+w1·2−1)2+(w0+w1·3−2)2+(w0+w1·4−5)2] δE δw0 = (w0 +w1 ·2−1)·1+(w0 +w1 ·3−2)·1+(w0 +w1 ·4−5)·1 δE δw1 83 Gradient of the error function First, consider n = 1. Then the model is y = w0 + w1 · x. Consider a concrete training set: T = {((1, 2), 1), ((1, 3), 2), ((1, 4), 5)} = ((x10, x11), d1), ((x20, x21), d2), ((x30, x31), d3) E(w0, w1) = 1 2 [(w0+w1·2−1)2+(w0+w1·3−2)2+(w0+w1·4−5)2] δE δw0 = (w0 +w1 ·2−1)·1+(w0 +w1 ·3−2)·1+(w0 +w1 ·4−5)·1 δE δw1 = (w0 +w1 ·2−1)·2+(w0 +w1 ·3−2)·3+(w0 +w1 ·4−5)·4 83 Gradient of the error function ∂E ∂w (w) = 1 2 p k=1 δ δw   n i=0 wixki − dk   2 84 Gradient of the error function ∂E ∂w (w) = 1 2 p k=1 δ δw   n i=0 wixki − dk   2 = 1 2 p k=1 2   n i=0 wixki − dk   δ δw   n i=0 wixki − dk   84 Gradient of the error function ∂E ∂w (w) = 1 2 p k=1 δ δw   n i=0 wixki − dk   2 = 1 2 p k=1 2   n i=0 wixki − dk   δ δw   n i=0 wixki − dk   = 1 2 p k=1 2   n i=0 wixki − dk     n i=0 δ δw wixki − δE δw dk   84 Gradient of the error function ∂E ∂w (w) = 1 2 p k=1 δ δw   n i=0 wixki − dk   2 = 1 2 p k=1 2   n i=0 wixki − dk   δ δw   n i=0 wixki − dk   = 1 2 p k=1 2   n i=0 wixki − dk     n i=0 δ δw wixki − δE δw dk   = p k=1 w · xk − dk xk 84 Gradient of the error function ∂E ∂w (w) = 1 2 p k=1 δ δw   n i=0 wixki − dk   2 = 1 2 p k=1 2   n i=0 wixki − dk   δ δw   n i=0 wixki − dk   = 1 2 p k=1 2   n i=0 wixki − dk     n i=0 δ δw wixki − δE δw dk   = p k=1 w · xk − dk xk Thus E(w) = ∂E ∂w0 (w), . . . , ∂E ∂wn (w) = p k=1 w · xk − dk xk 84 Linear regression - learning Batch algorithm (gradient descent): Idea: In every step "move" the weights in the direction opposite to the gradient. 85 Linear regression - learning Batch algorithm (gradient descent): Idea: In every step "move" the weights in the direction opposite to the gradient. The algorithm computes a sequence of weight vectors w(0), w(1), w(2), . . .. weights in w(0) are randomly initialized to values close to 0 85 Linear regression - learning Batch algorithm (gradient descent): Idea: In every step "move" the weights in the direction opposite to the gradient. The algorithm computes a sequence of weight vectors w(0), w(1), w(2), . . .. weights in w(0) are randomly initialized to values close to 0 in the step t + 1, weights w(t+1) are computed as follows: w(t+1) = w(t) − ε · E(w(t) ) = w(t) − ε · p k=1 w(t) · xk − dk · xk Here k = (t mod p) + 1 and 0 < ε ≤ 1 is a learning rate. 85 Linear regression - learning Batch algorithm (gradient descent): Idea: In every step "move" the weights in the direction opposite to the gradient. The algorithm computes a sequence of weight vectors w(0), w(1), w(2), . . .. weights in w(0) are randomly initialized to values close to 0 in the step t + 1, weights w(t+1) are computed as follows: w(t+1) = w(t) − ε · E(w(t) ) = w(t) − ε · p k=1 w(t) · xk − dk · xk Here k = (t mod p) + 1 and 0 < ε ≤ 1 is a learning rate. Proposition For sufficiently small ε > 0 the sequence w(0), w(1), w(2), . . . converges (componentwise) to the global minimum of E (i.e. to the vector w satisfying E(w) = 0). 85 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 Linear regression - animation 86 ADALINE - learning Online algorithm (Delta-rule, Widrow-Hoff rule): weights in w(0) initialized randomly close to 0 in the step t + 1, weights w(t+1) are computed as follows: w(t+1) = w(t) − ε(t) · w(t) · xk − dk · xk Here k = t mod p + 1 and 0 < ε(t) ≤ 1 is a learning rate in the step t + 1. Note that the algorithm does not work with the complete gradient but only with its part determined by the currently considered training example. 87 ADALINE - learning Online algorithm (Delta-rule, Widrow-Hoff rule): weights in w(0) initialized randomly close to 0 in the step t + 1, weights w(t+1) are computed as follows: w(t+1) = w(t) − ε(t) · w(t) · xk − dk · xk Here k = t mod p + 1 and 0 < ε(t) ≤ 1 is a learning rate in the step t + 1. Note that the algorithm does not work with the complete gradient but only with its part determined by the currently considered training example. Theorem (Widrow & Hoff) If ε(t) = 1 t , then w(0), w(1), w(2), . . . converges to the global minimum of E. 87