PV021: Neural networks
Tomáš Brázdil

Course organization

Course materials:
▶ Main: The lecture
▶ Neural Networks and Deep Learning by Michael Nielsen
  http://neuralnetworksanddeeplearning.com/
  (Extremely well-written online textbook (a little outdated))
▶ Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
  http://www.deeplearningbook.org/
  ("Classical" overview of the theory of neural networks (a little outdated))
▶ Probabilistic Machine Learning: An Introduction by Kevin Murphy
  https://probml.github.io/pml-book/book1.html
  (An advanced ML textbook with (almost) up-to-date coverage of basic neural networks)
▶ Infinitely many online tutorials on everything (to build intuition)
Suggested: deeplearning.ai courses by Andrew Ng

Course organization

Evaluation:
▶ Project (Dr. Tomáš Foltýnek)
  ▶ implementation of a selected model + analysis of given data
  ▶ implementation in C/C++/Java/Rust without the use of any specialized libraries for data analysis and machine learning
  ▶ you need to get over a given accuracy threshold (a gentle one, just to eliminate non-functional implementations)
▶ Oral exam
  ▶ I may ask about anything from the lecture! You will get a detailed manual specifying the mandatory knowledge.

FAQ

Q: Why can we not use specialized libraries in projects?
A: In order to "touch" the low-level implementation details of the algorithms. You should not even use libraries for linear algebra and numerical methods, so that you are confronted with rounding errors and numerical instabilities.

Q: Why should you attend this course when there are infinitely many great resources elsewhere?
A: There are at least two reasons:
▶ You may discuss issues with me, my colleagues and other students.
▶ I will make you truly learn the fundamentals by heart.

Notable features of the course
▶ Use of mathematical notation and reasoning (mandatory for the exam)
▶ Sometimes goes deeper into the statistical underpinnings of neural network learning
▶ The project demands a complete working solution which must satisfy a prescribed performance specification

An unusual exam system!
You can repeat the oral exam as many times as needed (only the best grade goes into IS).

An example of an instruction email (from another course with the same system): It is typically not sufficient to devote a single afternoon to the preparation for the exam. You have to know _everything_ (which means every single thing) starting with the slide 42 and ending with the slide 245 with notable exceptions of slides: 121 - 123, 137 - 140, 165, 167. Proofs presented on the whiteboard are also mandatory.

Machine learning in general
▶ Machine learning = construction of systems that learn their functionality from data
  (... and thus do not need to be programmed.)
▶ spam filter
  ▶ learns to recognize spam from a database of "labeled" emails
  ▶ consequently can distinguish spam from ham
▶ handwritten text reader
  ▶ learns from a database of handwritten letters (or text) labeled by their correct meaning
  ▶ consequently is able to recognize text
▶ · · ·
▶ and lots of much, much more sophisticated applications ...
▶ Basic attributes of learning algorithms:
  ▶ representation: ability to capture the inner structure of training data
  ▶ generalization: ability to work properly on new data

Machine learning in general
Machine learning algorithms typically construct mathematical models of given data. The models may be subsequently applied to fresh data.
There are many types of models:
▶ decision trees
▶ support vector machines
▶ hidden Markov models
▶ Bayes networks and other graphical models
▶ neural networks
▶ · · ·
Neural networks, based on models of a (human) brain, form a natural basis for learning algorithms!

Artificial neural networks
▶ An artificial neuron is a rough mathematical approximation of a biological neuron.
▶ An (artificial) neural network (NN) consists of a number of interconnected artificial neurons. "Behavior" of the network is encoded in the connections between neurons.
[Figure: an artificial neuron with inputs x1, x2, . . . , xn, inner potential ξ, activation σ and output y. Image source: http://tulane.edu/sse/cmb/people/schrader/]

Why artificial neural networks?
Modelling of biological neural networks (computational neuroscience).
▶ simplified mathematical models help to identify important mechanisms
  ▶ How does the brain receive information?
  ▶ How is the information stored?
  ▶ How does the brain develop?
  ▶ · · ·
▶ neuroscience is strongly multidisciplinary; precise mathematical descriptions help in communication among experts and in the design of new experiments.
I will not spend much time on this area!

Why artificial neural networks?
Neural networks in machine learning.
▶ Typically primitive models, far from their biological counterparts (but often inspired by biology).
▶ Strongly oriented towards concrete application domains:
  ▶ decision making and control - autonomous vehicles, manufacturing processes, control of natural resources
  ▶ games - backgammon, poker, GO, Starcraft, ...
  ▶ finance - stock prices, risk analysis
  ▶ medicine - diagnosis, signal processing (EKG, EEG, ...), image processing (MRI, CT, WSI, ...)
  ▶ text and speech processing - machine translation, text generation, speech recognition
  ▶ other signal processing - filtering, radar tracking, noise reduction
  ▶ art - music and painting generation, deepfakes
  ▶ · · ·
I will concentrate on this area!

Important features of neural networks
▶ Massive parallelism
  ▶ many slow (and "dumb") computational elements work in parallel on several levels of abstraction
▶ Learning
  ▶ a kid learns to recognize a rabbit after seeing several rabbits
▶ Generalization
  ▶ a kid is able to recognize a new rabbit after seeing several (old) rabbits
▶ Robustness
  ▶ a blurred photo of a rabbit may still be classified as an image of a rabbit
▶ Graceful degradation
  ▶ experiments have shown that a damaged neural network may still work quite well
  ▶ a damaged network may re-adapt; remaining neurons may take on the functionality of the damaged ones
The aim of the course
▶ We will concentrate on
  ▶ basic techniques and principles of neural networks,
  ▶ fundamental models of neural networks and their applications.
▶ You should learn
  ▶ basic models (multilayer perceptron, convolutional networks, recurrent networks, transformers, autoencoders and generative adversarial networks)
  ▶ simple applications of these models (image processing, a little bit of text processing)
  ▶ basic learning algorithms (gradient descent with backpropagation)
  ▶ basic practical training techniques (data preparation, setting various hyper-parameters, control of learning, improving generalization)
  ▶ basic information about current implementations (TensorFlow-Keras, PyTorch)

Biological neural network
▶ The human neural network consists of approximately 10^11 (100 billion on the short scale) neurons; a single cubic centimeter of a human brain contains almost 50 million neurons.
▶ Each neuron is connected with approx. 10^4 neurons.
▶ Neurons themselves are very complex systems.
Rough description of the nervous system:
▶ An external stimulus is received by sensory receptors (e.g. eye cells).
▶ The information is further transferred via the peripheral nervous system (PNS) to the central nervous system (CNS), where it is processed (integrated), and subsequently an output signal is produced.
▶ Afterwards, the output signal is transferred via the PNS to effectors (e.g. muscle cells).
Biological neural network
[Figure. Source: N. Campbell and J. Reece; Biology, 7th Edition; ISBN: 080537146X]

Summation
[Figure.]

Biological and Mathematical neurons
[Figure.]

Formal neuron (without bias)
[Figure: a neuron with inputs x1, . . . , xn, weights w1, . . . , wn, inner potential ξ, activation σ and output y.]
▶ x1, . . . , xn ∈ R are inputs
▶ w1, . . . , wn ∈ R are weights
▶ ξ is an inner potential; almost always ξ = ∑_{i=1}^{n} w_i x_i
▶ y is an output given by y = σ(ξ) where σ is an activation function; e.g. a unit step function
  σ(ξ) = 1 if ξ ≥ h; 0 if ξ < h,
  where h ∈ R is a threshold.

Formal neuron (with bias)
[Figure: the same neuron extended with the input x0 = 1 (bias/threshold) and the weight w0 = −h.]
▶ x0 = 1, x1, . . . , xn ∈ R are inputs
▶ w0, w1, . . . , wn ∈ R are weights
▶ ξ is an inner potential; almost always ξ = w0 + ∑_{i=1}^{n} w_i x_i
▶ y is an output given by y = σ(ξ) where σ is an activation function; e.g. a unit step function
  σ(ξ) = 1 if ξ ≥ 0; 0 if ξ < 0.
(The threshold h has been substituted with the new input x0 = 1 and the weight w0 = −h.)

Neuron and linear separation
[Figure: the hyperplane ξ = 0 separating the regions ξ > 0 and ξ < 0.]
▶ the inner potential ξ = w0 + ∑_{i=1}^{n} w_i x_i determines a separation hyperplane in the n-dimensional input space
  ▶ in 2D a line
  ▶ in 3D a plane
  ▶ · · ·

Neuron geometry
[Figure: geometry of a single neuron.]

Neuron and linear separation
[Figure: a single neuron, first without and then with the bias input x0 = 1, outputting 1/0 for A/B and classifying 8 × 8 images of the letters A and B.]
n = 8 · 8, i.e. the number of pixels in the images. Inputs are binary vectors of dimension n (black pixel ≈ 1, white pixel ≈ 0).
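To make the definition concrete, here is a minimal Python sketch of a formal neuron with bias and the unit step activation. The weights in the toy example are made up for illustration; for the A/B image classifier above, the input would instead be the 64 pixel values and the weights would have to be learned.

```python
def unit_step(xi):
    """Unit step activation: 1 if the inner potential is >= 0, else 0."""
    return 1 if xi >= 0 else 0

def neuron_output(weights, inputs):
    """Formal neuron with bias: weights[0] is w0 (the bias weight);
    the inner potential is xi = w0 + sum_i w_i * x_i."""
    xi = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return unit_step(xi)

# Toy example: a made-up neuron with two inputs; it fires iff x1 + x2 >= 1.
w = [-1.0, 1.0, 1.0]              # w0 = -1 (threshold h = 1), w1 = w2 = 1
print(neuron_output(w, [0, 0]))   # 0
print(neuron_output(w, [1, 0]))   # 1
print(neuron_output(w, [1, 1]))   # 1
```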
Neuron and linear separation
[Figure: two candidate separating lines, w̄0 + ∑_{i=1}^{n} w̄_i x_i = 0 and w0 + ∑_{i=1}^{n} w_i x_i = 0, for points labeled A and B.]
▶ The red line classifies incorrectly.
▶ The green line classifies correctly (it may be a result of a correction by a learning algorithm).

Neuron and linear separation (XOR)
[Figure: the points (0, 0) and (1, 1) labeled 0, and (0, 1) and (1, 0) labeled 1, in the x1-x2 plane.]
▶ No line separates the ones from the zeros.

Neural networks
A neural network consists of formal neurons interconnected in such a way that the output of one neuron is an input of several other neurons.
In order to describe a particular type of neural network we need to specify:
▶ Architecture
  How the neurons are connected.
▶ Activity
  How the network transforms inputs to outputs.
▶ Learning
  How the weights are changed during training.

Architecture
Network architecture is given as a digraph whose nodes are neurons and edges are connections.
We distinguish several categories of neurons:
▶ Output neurons
▶ Hidden neurons
▶ Input neurons
(In general, a neuron may be both input and output; a neuron is hidden if it is neither input, nor output.)

Architecture – Cycles
▶ A network is cyclic (recurrent) if its architecture contains a directed cycle.
▶ Otherwise it is acyclic (feed-forward).

Architecture – Multilayer Perceptron (MLP)
[Figure: an MLP with input neurons x1, x2, a hidden layer, and output neurons y1, y2.]
▶ Neurons are partitioned into layers; one input layer, one output layer, possibly several hidden layers
▶ layers are numbered from 0; the input layer has number 0
  ▶ E.g. a three-layer network has two hidden layers and one output layer
▶ Neurons in the i-th layer are connected with all neurons in the (i + 1)-st layer
▶ Architecture of an MLP is typically described by the numbers of neurons in individual layers (e.g. 2-4-3-2)
Activity
Consider a network with n neurons, k input and ℓ output.
▶ A state of a network is a vector of output values of all neurons.
  (States of a network with n neurons are vectors of R^n.)
▶ The state-space of a network is the set of all states.
▶ A network input is a vector of k real numbers, i.e. an element of R^k.
▶ The network input space is the set of all network inputs.
  (Sometimes we restrict ourselves to a proper subset of R^k.)
▶ Initial state
  Input neurons are set to values from the network input (each component of the network input corresponds to an input neuron).
  Values of the remaining neurons are set to 0.

Activity – computation of a network
▶ Computation (typically) proceeds in discrete steps.
In every step the following happens:
1. A set of neurons is selected according to some rule.
2. The selected neurons change their states according to their inputs (they are simply evaluated).
(If a neuron does not have any inputs, its value remains constant.)
A computation is finite on a network input ⃗x if the state changes only finitely many times (i.e. there is a moment in time after which the state of the network never changes). We also say that the network stops on ⃗x.
▶ The network output is a vector of values of all output neurons in the network (i.e., an element of R^ℓ). Note that the network output keeps changing throughout the computation!
MLP uses the following selection rule: In the i-th step evaluate all neurons in the i-th layer.

Activity – semantics of a network
Definition
Consider a network with n neurons, k input, ℓ output. Let A ⊆ R^k and B ⊆ R^ℓ. Suppose that the network stops on every input of A.
Then we say that the network computes a function F : A → B if for every network input ⃗x the vector F(⃗x) ∈ B is the output of the network after the computation on ⃗x stops.

Example
[Figure: a small network.] This network computes a function from R^2 to R.

Activity – inner potential and activation functions
In order to specify the activity of the network, we need to specify how the inner potentials ξ are computed and what the activation functions σ are.
We assume (unless otherwise specified) that
  ξ = w0 + ∑_{i=1}^{n} w_i · x_i
where ⃗x = (x1, . . . , xn) are the inputs of the neuron and ⃗w = (w1, . . . , wn) are the weights.
There are special types of neural networks where the inner potential is computed differently, e.g., as a "distance" of the input from the weight vector:
  ξ = ||⃗x − ⃗w||
where ||·|| is a vector norm, typically Euclidean.

Activity – inner potential and activation functions
There are many activation functions; typical examples:
▶ Unit step function
  σ(ξ) = 1 if ξ ≥ 0; 0 if ξ < 0.
▶ (Logistic) sigmoid
  σ(ξ) = 1 / (1 + e^{−λ·ξ})
  where λ ∈ R is a steepness parameter.
▶ Hyperbolic tangent
  σ(ξ) = (1 − e^{−ξ}) / (1 + e^{−ξ})
▶ ReLU
  σ(ξ) = max(ξ, 0)

Activity – XOR
[Figure: a 2-2-1 MLP evaluated step by step on the inputs (1, 1), (0, 0), (1, 0) and (0, 1); the hidden neurons test the half-planes −1 + 2x1 + 2x2 ≥ 0 and 3 − 2x1 − 2x2 ≥ 0, and the output neuron fires iff both hidden neurons output 1.]
▶ The activation function is the unit step function
  σ(ξ) = 1 if ξ ≥ 0; 0 if ξ < 0.
▶ The network computes XOR(x1, x2):
  x1 x2 y
  1  1  0
  1  0  1
  0  1  1
  0  0  0
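The XOR evaluation above is easy to replay in code. The hidden weights below are read off the lines P1 and P2 on the next slide (−1 + 2x1 + 2x2 and 3 − 2x1 − 2x2); the output neuron is assumed to act as an AND of the two hidden outputs (weights 1, 1 and bias weight −2), which is my reading of the figure rather than something stated explicitly in the text.

```python
def step(xi):
    return 1 if xi >= 0 else 0

def xor_net(x1, x2):
    """2-2-1 MLP with unit step activations computing XOR."""
    h1 = step(-1 + 2 * x1 + 2 * x2)   # line P1: OR-like half-plane
    h2 = step(3 - 2 * x1 - 2 * x2)    # line P2: NAND-like half-plane
    y = step(-2 + 1 * h1 + 1 * h2)    # fires iff both hidden neurons fire (AND)
    return y

for x1, x2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(x1, x2, xor_net(x1, x2))    # reproduces the truth table: 0, 1, 1, 0
```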
Activity – MLP and linear separation
[Figure: the XOR network together with the points (0, 0), (1, 1) labeled 0 and (0, 1), (1, 0) labeled 1, and the two lines P1 and P2 computed by the hidden neurons.]
▶ The line P1 is given by −1 + 2x1 + 2x2 = 0
▶ The line P2 is given by 3 − 2x1 − 2x2 = 0

Activity – example
[Figure: a small network with a single input x1 and three further neurons, evaluated step by step; the figure shows how the states of the neurons change during the computation.]
The activation function is the unit step function
  σ(ξ) = 1 if ξ ≥ 0; 0 if ξ < 0.
The input is equal to 1.

Learning
Consider a network with n neurons, k input and ℓ output.
▶ A configuration of a network is a vector of all values of weights.
  (Configurations of a network with m connections are elements of R^m.)
▶ The weight-space of a network is the set of all configurations.
▶ initial configuration
  weights can be initialized randomly or using some sophisticated algorithm

Learning algorithms
Learning rule for weight adaptation.
(The goal is to find a configuration in which the network computes a desired function.)
▶ Supervised learning
  ▶ The desired function is described using training examples that are pairs of the form (input, output).
  ▶ The learning algorithm searches for a configuration which "corresponds" to the training examples, typically by minimizing an error function.
▶ Unsupervised learning
  ▶ The training set contains only inputs.
  ▶ The goal is to determine the distribution of the inputs (clustering, deep belief networks, etc.)

Supervised learning – illustration
[Figure: points labeled A and B in the plane and a separating line.]
▶ classification in the plane using a single neuron
▶ training examples are of the form (point, value) where the value is either 1, or 0, depending on whether the point is A, or B
▶ the algorithm considers examples one after another
▶ whenever an incorrectly classified point is considered, the learning algorithm turns the line in the direction of the point

Summary – Advantages of neural networks
▶ Massive parallelism
  ▶ neurons can be evaluated in parallel
▶ Learning
  ▶ many sophisticated learning algorithms used to "program" neural networks
▶ Generalization and robustness
  ▶ information is encoded in a distributed manner in weights
  ▶ "close" inputs typically get similar values
▶ Graceful degradation
  ▶ damage typically causes only a decrease in the precision of results

Expressive power of neural networks

Formal neuron (with bias)
[Figure: the formal neuron with the bias input x0 = 1 and weight w0 = −h.]
▶ x0 = 1, x1, . . . , xn ∈ R are inputs
▶ w0, w1, . . . , wn ∈ R are weights
▶ ξ is an inner potential; almost always ξ = w0 + ∑_{i=1}^{n} w_i x_i
▶ y is an output given by y = σ(ξ) where σ is an activation function; e.g. a unit step function
  σ(ξ) = 1 if ξ ≥ 0; 0 if ξ < 0.

Boolean functions
Activation function: the unit step function
  σ(ξ) = 1 if ξ ≥ 0; 0 if ξ < 0.
▶ AND(x1, . . . , xn): a single neuron with weights w1 = · · · = wn = 1 and bias weight w0 = −n.
▶ OR(x1, . . . , xn): a single neuron with weights w1 = · · · = wn = 1 and bias weight w0 = −1.
▶ NOT(x1): a single neuron with weight w1 = −1 and bias weight w0 = 0.
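The three gate constructions translate directly into code; a quick Python check (the helper names are mine):

```python
def step(xi):
    return 1 if xi >= 0 else 0

def AND(*xs):
    return step(-len(xs) + sum(xs))   # all weights 1, bias weight w0 = -n

def OR(*xs):
    return step(-1 + sum(xs))         # all weights 1, bias weight w0 = -1

def NOT(x):
    return step(0 - x)                # weight -1, bias weight w0 = 0

print(AND(1, 1, 1), AND(1, 0, 1))     # 1 0
print(OR(0, 0, 0), OR(0, 1, 0))       # 0 1
print(NOT(0), NOT(1))                 # 1 0
```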
Boolean functions
Theorem
Let σ be the unit step function. Two-layer MLPs, where each neuron has σ as the activation function, are able to compute all functions of the form F : {0, 1}^n → {0, 1}.
Proof.
▶ Given a vector ⃗v = (v1, . . . , vn) ∈ {0, 1}^n, consider a neuron N⃗v whose output is 1 iff the input is ⃗v; its weights are
  w_i = 1 if v_i = 1, w_i = −1 if v_i = 0, and w_0 = −∑_{i=1}^{n} v_i.
▶ Now connect the outputs of all neurons N⃗v satisfying F(⃗v) = 1 using a neuron implementing OR. □

Non-linear separation
[Figure: a three-layer network with inputs x1, x2 and output y.]
▶ Consider a three-layer network; each neuron has the unit step activation function.
▶ The network divides the input space into two subspaces according to the output (0 or 1).
▶ The first (hidden) layer divides the input space into half-spaces.
▶ The second layer may e.g. make intersections of the half-spaces ⇒ convex sets.
▶ The third layer may e.g. make unions of some convex sets.

Non-linear separation – illustration
[Figure: a three-layer network with inputs x1, . . . , xk and output y, and a region A covered by hypercubes.]
▶ Consider three-layer networks; each neuron has the unit step activation function.
▶ Three-layer nets are capable of "approximating" any "reasonable" subset A of the input space R^k.
  ▶ Cover A with hypercubes (in 2D squares, in 3D cubes, ...).
  ▶ Each hypercube K can be separated using a two-layer network N_K (i.e. the function computed by N_K gives 1 for points in K and 0 for the rest).
  ▶ Finally, connect the outputs of the nets N_K satisfying K ∩ A ≠ ∅ using a neuron implementing OR.
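The proof of the theorem on Boolean functions above is constructive and small enough to run. A minimal sketch, assuming the unit step activation: one hidden neuron N_v per vector v with F(v) = 1, combined by an OR output neuron (the function names and the parity example are mine):

```python
from itertools import product

def step(xi):
    return 1 if xi >= 0 else 0

def build_two_layer_net(F, n):
    """Weights of a two-layer MLP computing F : {0,1}^n -> {0,1}.
    One hidden neuron N_v per v with F(v) = 1, exactly as in the proof."""
    hidden = []
    for v in product([0, 1], repeat=n):
        if F(v):
            w0 = -sum(v)
            ws = [1 if vi == 1 else -1 for vi in v]
            hidden.append((w0, ws))
    return hidden

def evaluate(hidden, x):
    hs = [step(w0 + sum(w * xi for w, xi in zip(ws, x))) for w0, ws in hidden]
    return step(-1 + sum(hs))          # OR neuron over the hidden outputs

# Example: the three-input parity function.
parity = lambda v: sum(v) % 2
net = build_two_layer_net(parity, 3)
for x in product([0, 1], repeat=3):
    assert evaluate(net, x) == parity(x)
print("two-layer network reproduces parity on all 8 inputs")
```

Note that the construction needs up to 2^n hidden neurons, which is exactly the "extremely huge theoretical networks" caveat mentioned later in the summary.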
Power of ReLU
[Figure: a two-layer network with a single input x, several hidden ReLU neurons and a single output y.]
Consider a two-layer network
▶ with a single input and a single output;
▶ hidden neurons with the ReLU activation: σ(ξ) = max(ξ, 0);
▶ the output neuron with the identity activation: σ(ξ) = ξ (a linear model).
For every continuous function f : [0, 1] → [0, 1] and ε > 0 there is a network of the above type computing a function F : [0, 1] → R such that |f(x) − F(x)| ≤ ε for all x ∈ [0, 1].
For every open subset A ⊆ [0, 1] there is a network of the above type such that for "most" x ∈ [0, 1] we have that x ∈ A iff the network's output is > 0 for the input x.
Just consider a continuous function f where f(x) is the minimum difference between x and a point on the boundary of A. Then uniformly approximate f using the networks.

Non-linear separation - sigmoid
Theorem (Cybenko 1989 - informal version)
Let σ be a continuous function which is sigmoidal, i.e. satisfies
  σ(x) → 1 for x → +∞ and σ(x) → 0 for x → −∞.
For every "reasonable" set A ⊆ [0, 1]^n there is a two-layer network, where each hidden neuron has the activation function σ (output neurons are linear), that satisfies the following: for "most" vectors ⃗v ∈ [0, 1]^n we have that ⃗v ∈ A iff the network output is > 0 for the input ⃗v.
For the mathematically oriented:
▶ "reasonable" means Lebesgue measurable
▶ "most" means that the set of incorrectly classified vectors has Lebesgue measure smaller than a given ε > 0

Non-linear separation - practical illustration
▶ ALVINN drives a car
▶ The net has 30 × 32 = 960 inputs (the input space is thus R^960)
▶ Input values correspond to shades of gray of pixels.
▶ Output neurons "classify" images of the road based on their "curvature".
Image source: http://jmvidal.cse.sc.edu/talks/ann/alvin.html

Function approximation - two-layer networks
Theorem (Cybenko 1989)
Let σ be a continuous function which is sigmoidal, i.e., is increasing and satisfies
  σ(x) → 1 for x → +∞ and σ(x) → 0 for x → −∞.
For every continuous function f : [0, 1]^n → [0, 1] and every ε > 0 there is a function F : [0, 1]^n → [0, 1] computed by a two-layer network, where each hidden neuron has the activation function σ (output neurons are linear), that satisfies
  |f(⃗v) − F(⃗v)| < ε for every ⃗v ∈ [0, 1]^n.
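As a sanity check of the "Power of ReLU" claim above, the following sketch builds a two-layer ReLU network with a linear output that piecewise-linearly interpolates a continuous f on [0, 1]. The particular construction (one ReLU unit per grid point, with coefficients chosen so the slopes match) is mine; the slides only assert that such networks exist.

```python
import math

def relu(xi):
    return max(xi, 0.0)

def relu_approximator(f, m):
    """Two-layer network: m hidden ReLU units, linear output.
    Returns a function F that piecewise-linearly interpolates f
    on the grid 0, 1/m, ..., 1."""
    ts = [i / m for i in range(m + 1)]
    slopes = [(f(ts[i + 1]) - f(ts[i])) / (ts[i + 1] - ts[i]) for i in range(m)]
    # c_0 = s_0 and c_i = s_i - s_{i-1}, so that sum_{i<=k} c_i = s_k on [t_k, t_{k+1}]
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, m)]
    bias = f(0.0)
    def F(x):
        return bias + sum(c * relu(x - t) for c, t in zip(coeffs, ts))
    return F

f = lambda x: 0.5 * math.sin(3 * x) + 0.5          # a continuous function on [0, 1]
F = relu_approximator(f, m=50)
err = max(abs(f(x / 1000) - F(x / 1000)) for x in range(1001))
print("max error on a fine grid:", err)            # shrinks as m grows
```

Increasing the number m of hidden units shrinks the achievable ε, which is the quantitative content of the statement above.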
Neural networks and computability
▶ Consider recurrent networks (i.e., containing cycles)
  ▶ with real weights (in general);
  ▶ one input neuron and one output neuron (the network computes a function F : A → R where A ⊆ R contains all inputs on which the network stops);
  ▶ a parallel activity rule (output values of all neurons are recomputed in every step);
  ▶ the activation function
    σ(ξ) = 1 if ξ ≥ 1; ξ if 0 ≤ ξ ≤ 1; 0 if ξ < 0.
▶ We encode words ω ∈ {0, 1}+ into numbers as follows:
  δ(ω) = ∑_{i=1}^{|ω|} ω(i)/2^i + 1/2^{|ω|+1}
  E.g. ω = 11001 gives δ(ω) = 1/2 + 1/2^2 + 1/2^5 + 1/2^6 (= 0.110011 in binary form).
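The encoding δ is easy to check in a few lines of Python (assuming the word ω is given as a string of '0's and '1's):

```python
def delta(omega):
    """Encode a binary word: sum_i omega(i)/2^i + 1/2^(|omega|+1)."""
    value = sum(int(bit) / 2 ** i for i, bit in enumerate(omega, start=1))
    return value + 1 / 2 ** (len(omega) + 1)

print(delta("11001"))                               # 0.796875
print(delta("11001") == 1/2 + 1/4 + 1/32 + 1/64)    # True, matching the example
```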
Neural networks and computability
A network recognizes a language L ⊆ {0, 1}+ if it computes a function F : A → R (A ⊆ R) such that ω ∈ L iff δ(ω) ∈ A and F(δ(ω)) > 0.
▶ Recurrent networks with rational weights are equivalent to Turing machines
  ▶ For every recursively enumerable language L ⊆ {0, 1}+ there is a recurrent network with rational weights and less than 1000 neurons which recognizes L.
  ▶ The halting problem is undecidable for networks with at least 25 neurons and rational weights.
  ▶ There is a "universal" network (an equivalent of the universal Turing machine).
▶ Recurrent networks are super-Turing powerful
  ▶ For every language L ⊆ {0, 1}+ there is a recurrent network with less than 1000 neurons which recognizes L.

Summary of theoretical results
▶ Neural networks are very strong from the point of view of theory:
  ▶ All Boolean functions can be expressed using two-layer networks.
  ▶ Two-layer networks may approximate any continuous function.
  ▶ Recurrent networks are at least as strong as Turing machines.
▶ These results are purely theoretical!
  ▶ "Theoretical" networks are extremely huge.
  ▶ It is very difficult to handcraft them even for the simplest problems.
▶ From the practical point of view, the most important advantages of neural networks are: learning, generalization, robustness.

Neural networks vs classical computers
              Neural networks                              "Classical" computers
Data          implicitly in weights                        explicitly
Computation   naturally parallel                           sequential, localized
Robustness    robust w.r.t. input corruption & damage      changing one bit may completely crash the computation
Precision     imprecise, the network recalls a training    (typically) precise
              example "similar" to the input
Programming   learning                                     manual

History & implementations

History of neurocomputers
▶ 1951: SNARC (Minsky et al)
  ▶ the first implementation of a neural network
  ▶ a rat strives to exit a maze
  ▶ 40 artificial neurons (300 vacuum tubes, engines, etc.)

History of neurocomputers
▶ 1957: Mark I Perceptron (Rosenblatt et al) - the first successful network for image recognition
  ▶ single-layer network
  ▶ image represented by 20 × 20 photocells
  ▶ intensity of pixels was treated as the input to a perceptron (basically the formal neuron), which recognized figures
  ▶ weights were implemented using potentiometers, each set by its own engine
  ▶ it was possible to arbitrarily reconnect inputs to neurons to demonstrate adaptability

History of neurocomputers
▶ 1960: ADALINE (Widrow & Hoff)
  ▶ single-layer neural network
  ▶ weights stored in a newly invented electronic component, the memistor, which remembers the history of electric current in the form of resistance
  ▶ Widrow founded a company, Memistor Corporation, which sold implementations of neural networks
▶ 1960-66: several companies concerned with neural networks were founded
History of neurocomputers
▶ 1967-82: a standstill after the publication of the book Perceptrons by Minsky & Papert (published 1969)
▶ 1983-end of the 90s: revival of neural networks
  ▶ many attempts at hardware implementations
    ▶ application specific chips (ASIC)
    ▶ programmable hardware (FPGA)
  ▶ hw implementations typically not better than "software" implementations on universal computers (problems with weight storage, size, speed, cost of production etc.)
▶ end of the 90s-ca. 2005: NN suppressed by other machine learning methods (support vector machines (SVM))
▶ 2006-now: the boom of neural networks!
  ▶ deep networks – often better than any other method
  ▶ GPU implementations
  ▶ ... specialized hw implementations (Google's TPU)

Some highlights
▶ Breakthrough in image recognition. Accuracy of image recognition improved by an order of magnitude in 5 years.
▶ Breakthrough in game playing. Superhuman results in Go and Chess almost without any human intervention. Master level in Starcraft, poker, etc.
▶ Breakthrough in machine translation. Switching to deep learning produced a 60% increase in translation accuracy compared to the phrase-based approach previously used in Google Translate (in human evaluation).
▶ Breakthrough in speech processing.
▶ Breakthrough in text generation. GPT-4 generates pretty realistic articles; short plays (for a theatre) have been successfully generated, etc.

Example
This slide was automatically generated by asking GPT-4 "Give me a beamer slide with complexity of Steepest descent, Newton's method and BFGS".
[Figure: the generated slide and its source.]

History in waves ...
Figure: The figure shows two of the three historical waves of artificial neural nets research, as measured by the frequency of the phrases "cybernetics" and "connectionism" or "neural networks" according to Google Books (the third wave is too recent to appear).

Current hardware – What do we face?
Increasing dataset size ...
... weakly-supervised pre-training using hashtags from Instagram uses 3.6 × 10^9 images.
Revisiting Weakly Supervised Pre-Training of Visual Perception Models. Singh et al. https://arxiv.org/pdf/2201.08371.pdf, 2022

GPT-3 Training Dataset
45 TB of text data from multiple sources.
Source: Kindra Cooper. OpenAI GPT-3: Everything You Need to Know. Springboard. 2023

Current hardware – What do we face?
... and thus increasing size of neural networks ...
2. ADALINE
4. Early back-propagation network (Rumelhart et al., 1986b)
8. Image recognition: LeNet-5 (LeCun et al., 1998b)
10. Dimensionality reduction: Deep belief network (Hinton et al., 2006) ... here the third "wave" of neural networks started
15. Digit recognition: GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
18. Image recognition (AlexNet): Multi-GPU convolutional network (Krizhevsky et al., 2012)
20. Image recognition: GoogLeNet (Szegedy et al., 2014a)

GPT-4's Scale: GPT-4 has 1.8 trillion parameters across 120 layers, which is over 10 times larger than GPT-3.

Current hardware – What do we face?
... as a reward we get this ...
Figure: Since deep networks reached the scale necessary to compete in the ImageNet Large Scale Visual Recognition Challenge, they have consistently won the competition every year, and yielded lower and lower error rates each time. Data from Russakovsky et al. (2014b) and He et al. (2015).
Current hardware
In 2012, Google trained a large network with 1.7 billion weights and 9 layers.
The task was image recognition (10 million YouTube video frames).
The hw comprised a network of 1,000 computers (16,000 cores); the computation took three days.

In 2014, a similar task was performed on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI. Able to train networks with 1 billion parameters on just 3 machines in a couple of days. Able to scale to 11 billion weights (approx. 6.5 times larger than the Google model) on 16 GPUs.

Current hardware – NVIDIA DGX Station
▶ 8x GPU (Nvidia A100 80GB Tensor Core)
▶ 5 petaFLOPS
▶ System memory: 2 TB
▶ Network: 200 Gb/s InfiniBand

Deep learning in clouds
Big companies offer cloud services for deep learning:
▶ Amazon Web Services
▶ Google Cloud
▶ Deep Cognition
▶ ...
Advantages:
▶ Do not have to care (too much) about technical problems.
▶ Do not have to buy and optimize high-end hw/sw, networks etc.
▶ Scaling & virtually limitless storage.
Disadvantages:
▶ Do not have full control.
▶ Performance can vary, connectivity problems.
▶ Have to pay for services.
▶ Privacy issues.

Current software
▶ TensorFlow (Google)
  ▶ open source software library for numerical computation using data flow graphs
  ▶ allows implementation of most current neural networks
  ▶ allows computation on multiple devices (CPUs, GPUs, ...)
  ▶ Python API
  ▶ Keras: a part of TensorFlow that allows easy description of most modern neural networks
▶ PyTorch (Facebook)
  ▶ similar to TensorFlow
  ▶ object oriented
  ▶ ... the majority of new models in research papers are implemented in PyTorch
    https://www.cioinsight.com/big-data/pytorch-vs-tensorflow/
▶ Theano (dead):
  ▶ The "academic" grand-daddy of deep-learning frameworks, written in Python. Strongly inspired TensorFlow (some people developing Theano moved on to develop TensorFlow).
▶ There are others: Caffe, Deeplearning4j, ...

Current software – Keras
[Figure: a Keras code example.]

Current software – Keras functional API
[Figure: a Keras functional API code example.]

Current software – TensorFlow
[Figure: TensorFlow code examples.]

Current software – PyTorch
[Figure: a PyTorch code example.]

Other software implementations
Most "mathematical" software packages contain some support for neural networks:
▶ MATLAB
▶ R
▶ STATISTICA
▶ Weka
▶ ...
The implementations are typically not on par with the previously mentioned dedicated deep-learning libraries.
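The Keras, TensorFlow and PyTorch slides above are code screenshots; as a stand-in, here is my own minimal Keras sketch of the kind of model definition they show (a small dense network for 28 × 28 inputs and 10 classes). The layer sizes, optimizer and dataset are illustrative assumptions, not the code from the slides.

```python
from tensorflow import keras

# A small MLP for 28x28 grayscale images (e.g. MNIST), 10 output classes.
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.summary()
# model.fit(x_train, y_train, epochs=5)   # x_train, y_train would come from a dataset
```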
MLP training – theory

Architecture – Multilayer Perceptron (MLP)
[Figure: an MLP with input neurons x1, x2, a hidden layer, and output neurons y1, y2.]
▶ Neurons are partitioned into layers; one input layer, one output layer, possibly several hidden layers
▶ layers are numbered from 0; the input layer has number 0
  ▶ E.g., a three-layer network has two hidden layers and one output layer
▶ Neurons in the i-th layer are connected with all neurons in the (i + 1)-st layer
▶ Architecture of an MLP is typically described by the numbers of neurons in individual layers (e.g., 2-4-3-2)

MLP – architecture
Notation:
▶ Denote
  ▶ X a set of input neurons
  ▶ Y a set of output neurons
  ▶ Z a set of all neurons (X, Y ⊆ Z)
▶ individual neurons are denoted by indices i, j etc.
▶ ξ_j is the inner potential of the neuron j after the computation stops
▶ y_j is the output of the neuron j after the computation stops (define y_0 = 1 as the value of the formal unit input)
▶ w_ji is the weight of the connection from i to j
  (in particular, w_j0 is the weight of the connection from the formal unit input, i.e., w_j0 = −b_j where b_j is the bias of the neuron j)
▶ j← is the set of all i such that j is adjacent from i (i.e. there is an arc to j from i)
▶ j→ is the set of all i such that j is adjacent to i (i.e. there is an arc from j to i)

MLP – activity
▶ inner potential of neuron j:
  ξ_j = ∑_{i∈j←} w_ji · y_i
▶ activation function σ_j for neuron j (arbitrary differentiable)
▶ state of a non-input neuron j ∈ Z ∖ X after the computation stops:
  y_j = σ_j(ξ_j)
  (y_j depends on the configuration ⃗w and the input ⃗x, so we sometimes write y_j(⃗w, ⃗x))
▶ The network computes a function from R^|X| to R^|Y|. Layer-wise computation: First, all input neurons are assigned the values of the input. In the ℓ-th step, all neurons of the ℓ-th layer are evaluated.
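A minimal Python sketch of the layer-wise forward computation just described. Representing the network as a list of layers of weight vectors [w_j0, w_j1, . . .] (bias weight first) is my own simplification of the per-neuron notation; the weights in the example are made up.

```python
import math

def sigmoid(xi):
    return 1.0 / (1.0 + math.exp(-xi))

def forward(layers, x):
    """Layer-wise evaluation of an MLP.
    Each layer is a list of weight vectors [w_j0, w_j1, ..., w_jn]: the bias
    weight w_j0 followed by the weights from all neurons of the previous layer."""
    y = list(x)                                   # outputs of the input layer
    for layer in layers:
        y = [sigmoid(w[0] + sum(wi * yi for wi, yi in zip(w[1:], y)))
             for w in layer]                      # xi_j = w_j0 + sum_i w_ji * y_i
    return y                                      # outputs of the output layer

# A 2-2-1 network with made-up weights.
layers = [
    [[-1.0, 2.0, 2.0], [3.0, -2.0, -2.0]],        # hidden layer
    [[-2.0, 4.0, 4.0]],                           # output layer
]
print(forward(layers, [1.0, 0.0]))
```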
For every j ∈ Y, denote by dkj the desired output of the neuron j for a given network input ⃗xk (the vector ⃗dk can be written as dkj j∈Y ). 84 MLP – learning ▶ Given a training dataset T of the form ⃗xk , ⃗dk k = 1, . . . , p Here, every ⃗xk ∈ R|X| is an input vector end every ⃗dk ∈ R|Y| is the desired network output. For every j ∈ Y, denote by dkj the desired output of the neuron j for a given network input ⃗xk (the vector ⃗dk can be written as dkj j∈Y ). ▶ Error function: E(⃗w) = p k=1 Ek (⃗w) where Ek (⃗w) = 1 2 j∈Y yj(⃗w,⃗xk ) − dkj 2 This is just an example of an error function; we shall see other error functions later. 84 MLP – learning algorithm Batch algorithm (gradient descent): The algorithm computes a sequence of weight vectors ⃗w(0), ⃗w(1), ⃗w(2), . . .. ▶ weights in ⃗w(0) are randomly initialized to values close to 0 ▶ in the step t + 1 (here t = 0, 1, 2 . . .), weights ⃗w(t+1) are computed as follows: w (t+1) ji = w (t) ji + ∆w (t) ji 85 MLP – learning algorithm Batch algorithm (gradient descent): The algorithm computes a sequence of weight vectors ⃗w(0), ⃗w(1), ⃗w(2), . . .. ▶ weights in ⃗w(0) are randomly initialized to values close to 0 ▶ in the step t + 1 (here t = 0, 1, 2 . . .), weights ⃗w(t+1) are computed as follows: w (t+1) ji = w (t) ji + ∆w (t) ji where ∆w (t) ji = −ε(t) · ∂E ∂wji (⃗w(t) ) is a weight update of wji in step t + 1 and 0 < ε(t) ≤ 1 is a learning rate in step t + 1. 85 MLP – learning algorithm Batch algorithm (gradient descent): The algorithm computes a sequence of weight vectors ⃗w(0), ⃗w(1), ⃗w(2), . . .. ▶ weights in ⃗w(0) are randomly initialized to values close to 0 ▶ in the step t + 1 (here t = 0, 1, 2 . . .), weights ⃗w(t+1) are computed as follows: w (t+1) ji = w (t) ji + ∆w (t) ji where ∆w (t) ji = −ε(t) · ∂E ∂wji (⃗w(t) ) is a weight update of wji in step t + 1 and 0 < ε(t) ≤ 1 is a learning rate in step t + 1. Note that ∂E ∂wji (⃗w(t) ) is a component of the gradient ∇E, i.e. the weight update can be written as ⃗w(t+1) = ⃗w(t) − ε(t) · ∇E(⃗w(t) ). https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop- 85 MLP – error function gradient For every wji we have ∂E ∂wji = p k=1 ∂Ek ∂wji 86 MLP – error function gradient For every wji we have ∂E ∂wji = p k=1 ∂Ek ∂wji where for every k = 1, . . . , p holds ∂Ek ∂wji = ∂Ek ∂yj · σ′ j (ξj) · yi 86 MLP – error function gradient For every wji we have ∂E ∂wji = p k=1 ∂Ek ∂wji where for every k = 1, . . . , p holds ∂Ek ∂wji = ∂Ek ∂yj · σ′ j (ξj) · yi and for every j ∈ Z ∖ X we get ∂Ek ∂yj = yj − dkj for j ∈ Y 86 MLP – error function gradient For every wji we have ∂E ∂wji = p k=1 ∂Ek ∂wji where for every k = 1, . . . , p holds ∂Ek ∂wji = ∂Ek ∂yj · σ′ j (ξj) · yi and for every j ∈ Z ∖ X we get ∂Ek ∂yj = yj − dkj for j ∈ Y ∂Ek ∂yj = r∈j→ ∂Ek ∂yr · σ′ r (ξr ) · wrj for j ∈ Z ∖ (Y ∪ X) (Here all yj are in fact yj(⃗w,⃗xk )). 86 Derivation of backprop. Consider k = 1, . . . , p and a weight wji. By the chain rule: ∂Ek ∂wji = 87 Derivation of backprop. Consider k = 1, . . . , p and a weight wji. By the chain rule: ∂Ek ∂wji = ∂Ek ∂yj · ∂yj ∂wji = 87 Derivation of backprop. Consider k = 1, . . . , p and a weight wji. By the chain rule: ∂Ek ∂wji = ∂Ek ∂yj · ∂yj ∂wji = ∂Ek ∂yj · ∂yj ∂ξj · ∂ξj ∂wji = 87 Derivation of backprop. Consider k = 1, . . . , p and a weight wji. 
By the chain rule:
∂Ek/∂wji = ∂Ek/∂yj · ∂yj/∂wji = ∂Ek/∂yj · ∂yj/∂ξj · ∂ξj/∂wji = ∂Ek/∂yj · σ′j(ξj) · yi
since
∂yj/∂ξj = ∂(σj(ξj))/∂ξj = σ′j(ξj)
∂ξj/∂wji = ∂(Σr∈j← wjr yr)/∂wji = yi 87
Derivation of backprop. (cont.)
For j ∈ Y:
∂Ek/∂yj = ∂(1/2 · Σr∈Y (yr − dkr)²)/∂yj = yj − dkj
... and another application of the chain rule: For j ∉ Y:
∂Ek/∂yj = Σr∈j→ ∂Ek/∂yr · ∂yr/∂yj = Σr∈j→ ∂Ek/∂yr · ∂yr/∂ξr · ∂ξr/∂yj = Σr∈j→ ∂Ek/∂yr · σ′r(ξr) · wrj
since
∂yr/∂ξr = ∂(σr(ξr))/∂ξr = σ′r(ξr)
∂ξr/∂yj = ∂(Σs∈r← wrs ys)/∂yj = wrj 88
MLP – error function gradient (history)
▶ If yj = σj(ξj) = 1/(1 + e^(−ξj)) for all j ∈ Z, then σ′j(ξj) = yj(1 − yj) and thus for all j ∈ Z ∖ X:
∂Ek/∂yj = yj − dkj for j ∈ Y
∂Ek/∂yj = Σr∈j→ ∂Ek/∂yr · yr(1 − yr) · wrj for j ∈ Z ∖ (Y ∪ X) 89
MLP – computing the gradient
Compute ∂E/∂wji = Σ_{k=1,...,p} ∂Ek/∂wji as follows:
Initialize Eji := 0 (by the end of the computation: Eji = ∂E/∂wji)
For every k = 1, . . . , p do:
1. forward pass: compute yj = yj(⃗w,⃗xk) for all j ∈ Z
2. backward pass: compute ∂Ek/∂yj for all j ∈ Z using backpropagation (see the next slide!)
3. compute ∂Ek/∂wji for all wji using ∂Ek/∂wji := ∂Ek/∂yj · σ′j(ξj) · yi
4. Eji := Eji + ∂Ek/∂wji
The resulting Eji equals ∂E/∂wji.
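The gradient computation just described can be sketched in a few dozen lines. The sketch below assumes logistic sigmoid units everywhere (so that σ′(ξ) = y(1 − y), as on the "history" slide), stores each weight matrix with the bias weights wj0 in column 0, and uses a made-up toy training set; it illustrates the forward pass / backward pass / accumulation scheme and is not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

sizes = [2, 4, 3, 2]                        # a 2-4-3-2 network (illustrative)
# W[l] has one row per neuron of layer l+1; column 0 is the bias weight w_{j0}
# belonging to the formal unit input y_0 = 1.
W = [rng.normal(0.0, 0.1, (sizes[l + 1], sizes[l] + 1))
     for l in range(len(sizes) - 1)]

def gradient(W, data):
    """Return E_ji = dE/dw_ji accumulated over the whole training set,
    for the squared error E_k = 1/2 * sum_j (y_j - d_kj)^2 and sigmoid units."""
    grads = [np.zeros_like(Wl) for Wl in W]
    for x, d in data:
        # 1. forward pass: remember the extended inputs z and outputs y of each layer
        y = np.asarray(x, dtype=float)
        zs, ys = [], []
        for Wl in W:
            z = np.concatenate(([1.0], y))          # prepend the formal unit input
            y = sigmoid(Wl @ z)                     # y_j = sigma(xi_j), sigma' = y(1-y)
            zs.append(z)
            ys.append(y)
        # 2. backward pass: dE_k/dy_j, starting from the output layer
        dEdy = ys[-1] - np.asarray(d, dtype=float)
        for l in reversed(range(len(W))):
            delta = dEdy * ys[l] * (1.0 - ys[l])    # dE_k/dy_j * sigma'(xi_j)
            # 3.-4. accumulate dE_k/dw_ji = delta_j * y_i  (y_i taken from the layer below)
            grads[l] += np.outer(delta, zs[l])
            # propagate dE_k/dy to the layer below (skip the bias column)
            dEdy = W[l][:, 1:].T @ delta
        # (after the last iteration dEdy refers to input neurons and is simply unused)
    return grads

data = [([0.0, 0.0], [0.0, 1.0]), ([1.0, 0.0], [1.0, 0.0])]   # toy training set
grads = gradient(W, data)
print([g.shape for g in grads])             # same shapes as the weight matrices
```

A useful sanity check (not shown) is to compare the result with a finite-difference estimate of ∂E/∂wji.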
90 MLP – backpropagation Compute ∂Ek ∂yj for all j ∈ Z as follows: 91 MLP – backpropagation Compute ∂Ek ∂yj for all j ∈ Z as follows: ▶ if j ∈ Y, then ∂Ek ∂yj = yj − dkj 91 MLP – backpropagation Compute ∂Ek ∂yj for all j ∈ Z as follows: ▶ if j ∈ Y, then ∂Ek ∂yj = yj − dkj ▶ if j ∈ Z ∖ Y ∪ X, then assuming that j is in the ℓ-th layer and assuming that ∂Ek ∂yr has already been computed for all neurons in the ℓ + 1-st layer, compute ∂Ek ∂yj = r∈j→ ∂Ek ∂yr · σ′ r (ξr ) · wrj (This works because all neurons of r ∈ j→ belong to the ℓ + 1-st layer.) 91 Complexity of the batch algorithm Computation of ∂E ∂wji (⃗w(t−1)) stops in time linear in the size of the network plus the size of the training set. (assuming unit cost of operations including computation of σ′ r (ξr ) for given ξr ) 92 Complexity of the batch algorithm Computation of ∂E ∂wji (⃗w(t−1)) stops in time linear in the size of the network plus the size of the training set. (assuming unit cost of operations including computation of σ′ r (ξr ) for given ξr ) Proof sketch: The algorithm does the following p times: 92 Complexity of the batch algorithm Computation of ∂E ∂wji (⃗w(t−1)) stops in time linear in the size of the network plus the size of the training set. (assuming unit cost of operations including computation of σ′ r (ξr ) for given ξr ) Proof sketch: The algorithm does the following p times: 1. forward pass, i.e. computes yj(⃗w,⃗xk ) 92 Complexity of the batch algorithm Computation of ∂E ∂wji (⃗w(t−1)) stops in time linear in the size of the network plus the size of the training set. (assuming unit cost of operations including computation of σ′ r (ξr ) for given ξr ) Proof sketch: The algorithm does the following p times: 1. forward pass, i.e. computes yj(⃗w,⃗xk ) 2. backpropagation, i.e. computes ∂Ek ∂yj 92 Complexity of the batch algorithm Computation of ∂E ∂wji (⃗w(t−1)) stops in time linear in the size of the network plus the size of the training set. (assuming unit cost of operations including computation of σ′ r (ξr ) for given ξr ) Proof sketch: The algorithm does the following p times: 1. forward pass, i.e. computes yj(⃗w,⃗xk ) 2. backpropagation, i.e. computes ∂Ek ∂yj 3. computes ∂Ek ∂wji and adds it to Eji (a constant time operation in the unit cost framework) 92 Complexity of the batch algorithm Computation of ∂E ∂wji (⃗w(t−1)) stops in time linear in the size of the network plus the size of the training set. (assuming unit cost of operations including computation of σ′ r (ξr ) for given ξr ) Proof sketch: The algorithm does the following p times: 1. forward pass, i.e. computes yj(⃗w,⃗xk ) 2. backpropagation, i.e. computes ∂Ek ∂yj 3. computes ∂Ek ∂wji and adds it to Eji (a constant time operation in the unit cost framework) The steps 1. - 3. take linear time w.r.t. the number of network weights. 92 Complexity of the batch algorithm Computation of ∂E ∂wji (⃗w(t−1)) stops in time linear in the size of the network plus the size of the training set. (assuming unit cost of operations including computation of σ′ r (ξr ) for given ξr ) Proof sketch: The algorithm does the following p times: 1. forward pass, i.e. computes yj(⃗w,⃗xk ) 2. backpropagation, i.e. computes ∂Ek ∂yj 3. computes ∂Ek ∂wji and adds it to Eji (a constant time operation in the unit cost framework) The steps 1. - 3. take linear time w.r.t. the number of network weights. Note that the speed of convergence of the gradient descent cannot be estimated ... 92 Illustration of the gradient descent – XOR Source: Pattern Classification (2nd Edition); Richard O. 
Duda, Peter E. Hart, David G. Stork 93 MLP – learning algorithm Online algorithm: The algorithm computes a sequence of weight vectors ⃗w(0), ⃗w(1), ⃗w(2), . . .. ▶ weights in ⃗w(0) are randomly initialized to values close to 0 ▶ in the step t + 1 (here t = 0, 1, 2 . . .), weights ⃗w(t+1) are computed as follows: w (t+1) ji = w (t) ji + ∆w (t) ji where ∆w (t) ji = −ε(t) · ∂Ek ∂wji (w (t) ji ) is the weight update of wji in the step t + 1 and 0 < ε(t) ≤ 1 is the learning rate in the step t + 1. There are other variants determined by the selection of the training examples used for the error computation (more on this later). 94 SGD ▶ weights in ⃗w(0) are randomly initialized to values close to 0 ▶ in the step t + 1 (here t = 0, 1, 2 . . .), weights ⃗w(t+1) are computed as follows: ▶ Choose (randomly) a set of training examples T ⊆ {1, . . . , p} ▶ Compute ⃗w(t+1) = ⃗w(t) + ∆⃗w(t) where ∆⃗w(t) = −ε(t) · k∈T ∇Ek (⃗w(t) ) ▶ 0 < ε(t) ≤ 1 is a learning rate in step t + 1 ▶ ∇Ek (⃗w(t) ) is the gradient of the error of the example k Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially. 95 Regression: Output and Error ▶ For regression, the output activation is typically the identity, i.e., yi = σ(ξi) = ξi for i ∈ Y. 96 Regression: Output and Error ▶ For regression, the output activation is typically the identity, i.e., yi = σ(ξi) = ξi for i ∈ Y. ▶ A training dataset ⃗xk , ⃗dk k = 1, . . . , p Here, every ⃗xk ∈ R|X| is an input vector end every ⃗dk ∈ R|Y| is the desired network output. For every i ∈ Y, denote by dki the desired output of the neuron i for a given network input ⃗xk (the vector ⃗dk can be written as (dki)i∈Y ). 96 Regression: Output and Error ▶ For regression, the output activation is typically the identity, i.e., yi = σ(ξi) = ξi for i ∈ Y. ▶ A training dataset ⃗xk , ⃗dk k = 1, . . . , p Here, every ⃗xk ∈ R|X| is an input vector end every ⃗dk ∈ R|Y| is the desired network output. For every i ∈ Y, denote by dki the desired output of the neuron i for a given network input ⃗xk (the vector ⃗dk can be written as (dki)i∈Y ). ▶ The error function mean squared error (mse): E(⃗w) = 1 p p k=1 Ek (⃗w) where Ek (⃗w) = 1 2 i∈Y yi(⃗w,⃗xk ) − dki 2 96 Maximum Likelihood vs Least Squares Fix a training set D = (x1, d1) , (x2, d2) , . . . , xp, dp , dk ∈ R. Consider a single output neuron o. 97 Maximum Likelihood vs Least Squares Fix a training set D = (x1, d1) , (x2, d2) , . . . , xp, dp , dk ∈ R. Consider a single output neuron o. Assume that each dk was generated randomly as follows dk = yo(⃗w,⃗xk ) + ϵk ▶ ⃗w are unknown constants ▶ ϵk are normally distributed with mean 0 and an unknown variance σ2 97 Maximum Likelihood vs Least Squares Fix a training set D = (x1, d1) , (x2, d2) , . . . , xp, dp , dk ∈ R. Consider a single output neuron o. Assume that each dk was generated randomly as follows dk = yo(⃗w,⃗xk ) + ϵk ▶ ⃗w are unknown constants ▶ ϵk are normally distributed with mean 0 and an unknown variance σ2 Assume that ϵ1, . . . , ϵp have been generated independently. 97 Maximum Likelihood vs Least Squares Fix a training set D = (x1, d1) , (x2, d2) , . . . , xp, dp , dk ∈ R. Consider a single output neuron o. Assume that each dk was generated randomly as follows dk = yo(⃗w,⃗xk ) + ϵk ▶ ⃗w are unknown constants ▶ ϵk are normally distributed with mean 0 and an unknown variance σ2 Assume that ϵ1, . . . , ϵp have been generated independently. Denote by p(d1, . . . 
, dp | ⃗w, σ2 ) the probability density of the values d1, . . . , dn assuming fixed x1, . . . , xp, ⃗w, σ2 . (For the interested: The independence and definition of dk ’s imply p(d1, . . . , dp | ⃗w, σ2 ) = p k=1 N[yo(⃗w,⃗xk ), σ2 ](dk ) N[yo(⃗w,⃗xk ), σ2 ](dk ) is a normal dist. with the mean yo(⃗w,⃗xk ) and var. σ2 .) 97 Maximum Likelihood vs Least Squares Our goal is to find the weights ⃗w that maximize the likelihood L(⃗w, σ2 ) := p(d1, . . . , dp | ⃗w, σ2 ) But now with the fixed values d1, . . . , dn from the training set! 98 Maximum Likelihood vs Least Squares Our goal is to find the weights ⃗w that maximize the likelihood L(⃗w, σ2 ) := p(d1, . . . , dp | ⃗w, σ2 ) But now with the fixed values d1, . . . , dn from the training set! Theorem The unique ⃗w that minimize the least squares error E[⃗w] maximize L(⃗w, σ2) for an arbitrary variance σ2. 98 Classification: Output and Error ▶ The output activation function softmax: yi = σi(ξj1 , . . . , ξjk ) = eξi j∈Y eξj Here Y = {j1, . . . , jk } 99 Classification: Output and Error ▶ The output activation function softmax: yi = σi(ξj1 , . . . , ξjk ) = eξi j∈Y eξj Here Y = {j1, . . . , jk } ▶ A training dataset ⃗xk , ⃗dk k = 1, . . . , p Here, every ⃗xk ∈ R|X| is an input vector end every ⃗dk ∈ {0, 1}|Y| is the desired network output. For every i ∈ Y, denote by dki the desired output of the neuron i for a given network input ⃗xk (the vector ⃗dk can be written as (dki)i∈Y ). 99 Classification: Output and Error ▶ The output activation function softmax: yi = σi(ξj1 , . . . , ξjk ) = eξi j∈Y eξj Here Y = {j1, . . . , jk } ▶ A training dataset ⃗xk , ⃗dk k = 1, . . . , p Here, every ⃗xk ∈ R|X| is an input vector end every ⃗dk ∈ {0, 1}|Y| is the desired network output. For every i ∈ Y, denote by dki the desired output of the neuron i for a given network input ⃗xk (the vector ⃗dk can be written as (dki)i∈Y ). ▶ The error function (categorical) cross entropy: E(⃗w) = − 1 p p k=1 i∈Y dki log(yi(⃗w,⃗xk )) 99 Gradient with Softmax & Cross-Entropy Assume that V is the layer just below the output layer Y. E(⃗w) = − 1 p p k=1 i∈Y dki log(yi(⃗w,⃗xk )) = − 1 p p k=1 i∈Y dki log   eξi j∈Y eξj   = − 1 p p k=1 i∈Y dki   ξi − log   j∈Y eξj     = − 1 p p k=1 i∈Y dki   ℓ∈V wiℓyℓ − log   j∈Y e ℓ∈V wjℓyℓ     Now compute the derivatives δE δyℓ for ℓ ∈ V. 100 Binary Classification: Output and Error Assume a single output neuron o ∈ Y = {o}. ▶ The output activation function logistic sigmoid: σo(ξo) = eξo eξo + 1 = 1 1 + e−ξo 101 Binary Classification: Output and Error Assume a single output neuron o ∈ Y = {o}. ▶ The output activation function logistic sigmoid: σo(ξo) = eξo eξo + 1 = 1 1 + e−ξo ▶ A training dataset T = ⃗x1, d1 , ⃗x2, d2 , . . . , ⃗xp, dp Here ⃗xk = (xk0, xk1 . . . , xkn) ∈ Rn+1, xk0 = 1, is the k-th input, and dk ∈ {0, 1} is the desired output. 101 Binary Classification: Output and Error Assume a single output neuron o ∈ Y = {o}. ▶ The output activation function logistic sigmoid: σo(ξo) = eξo eξo + 1 = 1 1 + e−ξo ▶ A training dataset T = ⃗x1, d1 , ⃗x2, d2 , . . . , ⃗xp, dp Here ⃗xk = (xk0, xk1 . . . , xkn) ∈ Rn+1, xk0 = 1, is the k-th input, and dk ∈ {0, 1} is the desired output. ▶ The error function (Binary) cross-entropy: E(⃗w) = − p k=1 dk log(yo(⃗w,⃗xk ))+(1−dk ) log(1−yo(⃗w,⃗xk )) 101 Cross-entropy vs max likelihood Consider our model giving a probability yo(⃗w,⃗x) given input ⃗x. 
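Before turning to the binary case below, a small numerical illustration of the softmax output and the categorical cross-entropy: for a one-hot desired vector, Ek reduces to the negative log-probability assigned to the correct class, and a well-known identity (not derived explicitly on the slides) says that the derivative of Ek with respect to the output potentials ξi equals yi − dki. The values below are made up; subtracting the maximum inside softmax is a standard numerical-stability trick, not part of the definition.

```python
import numpy as np

def softmax(xi):
    # subtract the max for numerical stability; this does not change the result
    e = np.exp(xi - np.max(xi))
    return e / e.sum()

def cross_entropy(y, d):
    # categorical cross-entropy for one training example: -sum_i d_i * log y_i
    return -np.sum(d * np.log(y))

xi = np.array([1.0, -0.5, 2.0])      # inner potentials of the output neurons (illustrative)
d = np.array([0.0, 0.0, 1.0])        # one-hot desired output
y = softmax(xi)

print("E_k =", cross_entropy(y, d))  # equals -log y_i for the correct class i,
                                     # i.e. the negative log-likelihood of that class

# numerical check of the identity dE_k/dxi_i = y_i - d_i (central differences):
eps = 1e-6
num_grad = np.array([
    (cross_entropy(softmax(xi + eps * np.eye(3)[i]), d)
     - cross_entropy(softmax(xi - eps * np.eye(3)[i]), d)) / (2 * eps)
    for i in range(3)])
print(num_grad, y - d)               # the two vectors agree up to roundoff
```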
102 Cross-entropy vs max likelihood Consider our model giving a probability yo(⃗w,⃗x) given input ⃗x. Recall that the training dataset is T = ⃗x1, d1 , ⃗x2, d2 , . . . , ⃗xp, dp Here ⃗xk = (xk0, xk1 . . . , xkn) ∈ Rn+1, xk0 = 1, is the k-th input, and dk ∈ {0, 1} is the expected output. 102 Cross-entropy vs max likelihood Consider our model giving a probability yo(⃗w,⃗x) given input ⃗x. Recall that the training dataset is T = ⃗x1, d1 , ⃗x2, d2 , . . . , ⃗xp, dp Here ⃗xk = (xk0, xk1 . . . , xkn) ∈ Rn+1, xk0 = 1, is the k-th input, and dk ∈ {0, 1} is the expected output. The likelihood: L(⃗w) = p k=1 yo(⃗w,⃗xk ) dk · 1 − yo(⃗w,⃗xk ) (1−dk ) log(L) = p k=1 dk · log(yo(⃗w,⃗xk )) + (1 − dk ) · log(1 − yo(⃗w,⃗xk )) and thus − log(L) = the cross-entropy. Minimizing the cross-entropy maximizes the log-likelihood (and vice versa). 102 Squared Error vs Logistic Output Activation Consider a single neuron model y = σ(w · x) = 1/(1 + e−w·x) where w ∈ R is the weight (ignore the bias). A training dataset T = {(x, d)} where x ∈ R and d ∈ {0, 1}. 103 Squared Error vs Logistic Output Activation Consider a single neuron model y = σ(w · x) = 1/(1 + e−w·x) where w ∈ R is the weight (ignore the bias). A training dataset T = {(x, d)} where x ∈ R and d ∈ {0, 1}. Squared error E(w) = 1 2 (y − d)2. δE δw = (y − d) · y · (1 − y) · x 103 Squared Error vs Logistic Output Activation Consider a single neuron model y = σ(w · x) = 1/(1 + e−w·x) where w ∈ R is the weight (ignore the bias). A training dataset T = {(x, d)} where x ∈ R and d ∈ {0, 1}. Squared error E(w) = 1 2 (y − d)2. δE δw = (y − d) · y · (1 − y) · x Thus ▶ If d = 1 and y ≈ 0, then δE δw ≈ 0 ▶ If d = 0 and y ≈ 1, then δE δw ≈ 0 The gradient of E is small even though the model is wrong! 103 Squared Error vs Logistic Output Activation Consider a single neuron model y = σ(w · x) = 1/(1 + e−w·x) where w ∈ R is the weight (ignore the bias). A training dataset T = {(x, d)} where x ∈ R and d ∈ {0, 1}. Cross-entropy error E(w) = −d · log(y) − (1 − d) · log(1 − y). 103 Squared Error vs Logistic Output Activation Consider a single neuron model y = σ(w · x) = 1/(1 + e−w·x) where w ∈ R is the weight (ignore the bias). A training dataset T = {(x, d)} where x ∈ R and d ∈ {0, 1}. Cross-entropy error E(w) = −d · log(y) − (1 − d) · log(1 − y). For d = 1 δE δw = − 1 y · y · (1 − y) · x = −(1 − y) · x which is close to −x for y ≈ 0. 103 Squared Error vs Logistic Output Activation Consider a single neuron model y = σ(w · x) = 1/(1 + e−w·x) where w ∈ R is the weight (ignore the bias). A training dataset T = {(x, d)} where x ∈ R and d ∈ {0, 1}. Cross-entropy error E(w) = −d · log(y) − (1 − d) · log(1 − y). For d = 1 δE δw = − 1 y · y · (1 − y) · x = −(1 − y) · x which is close to −x for y ≈ 0. For d = 0 δE δw = − 1 1 − y · (−y) · (1 − y) · x = y · x which is close to x for y ≈ 1. 103 MLP training – practical issues 104 Practical issues of gradient descent ▶ Training efficiency: ▶ What size of a minibatch? ▶ How to choose the learning rate ε(t) and control SGD ? ▶ How to pre-process the inputs? ▶ How to initialize weights? ▶ How to choose desired output values of the network? 105 Practical issues of gradient descent ▶ Training efficiency: ▶ What size of a minibatch? ▶ How to choose the learning rate ε(t) and control SGD ? ▶ How to pre-process the inputs? ▶ How to initialize weights? ▶ How to choose desired output values of the network? ▶ Quality of the resulting model: ▶ When to stop training? ▶ Regularization techniques. ▶ How large network? 
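Returning for a moment to the single-neuron comparison above, before the practical issues are developed further: a few lines, under the same assumptions as the slide (one weight w, no bias, logistic output y = σ(wx), d ∈ {0, 1}), showing how the squared-error gradient almost vanishes for a confidently wrong model while the cross-entropy gradient stays large. The particular numbers are illustrative.

```python
import math

def demo(w=-10.0, x=1.0, d=1.0):
    # a confidently wrong model: y = sigma(w*x) is close to 0 while d = 1
    y = 1.0 / (1.0 + math.exp(-w * x))
    dE_mse = (y - d) * y * (1.0 - y) * x   # gradient of 1/2*(y-d)^2 w.r.t. w
    dE_ce = (y - d) * x                    # gradient of the cross-entropy w.r.t. w
    print(f"y = {y:.2e}   dE_mse/dw = {dE_mse:.2e}   dE_ce/dw = {dE_ce:.2f}")

demo()   # y ~ 4.5e-05, squared-error gradient ~ -4.5e-05, cross-entropy gradient ~ -1.0
```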
For simplicity, I will illustrate the reasoning on MLP + mse. Later we will see other topologies and error functions with different but always somewhat related issues. 105 Issues in gradient descent ▶ Small networks: Lots of local minima where the descent gets stuck. ▶ The model identifiability problem: Swapping incoming weights of neurons i and j leaves the same network topology – weight space symmetry. ▶ Recent studies show that for sufficiently large networks, all local minima have low values of the error function. 106 Issues in gradient descent ▶ Small networks: Lots of local minima where the descent gets stuck. ▶ The model identifiability problem: Swapping incoming weights of neurons i and j leaves the same network topology – weight space symmetry. ▶ Recent studies show that for sufficiently large networks, all local minima have low values of the error function. Saddle points One can show (by a combinatorial argument) that larger networks have exponentially more saddle points than local minima. 106 Issues in gradient descent – too slow descent ▶ flat regions 107 Issues in gradient descent – too fast descent ▶ steep cliffs: the gradient is extremely large, descent skips important weight vectors 108 Issues in gradient descent – local vs global structure What if we initialize on the left? 109 Gradient Descent in Large Networks Theorem Assume (roughly), ▶ activation functions: "smooth" ReLU (softplus) σ(z) = log(1 + exp(z)) In general: Smooth, non-polynomial, analytic, Lipschitz continuous. ▶ inputs ⃗xk of Euclidean norm equal to 1, desired values dk such that all |dk | are bounded by a constant, ▶ the number of hidden neurons per layer sufficiently large (polynomial in certain numerical characteristics of inputs roughly measuring their similarity, and exponential in the depth of the network), ▶ the learning rate constant and sufficiently small. The gradient descent converges (with high probability w.r.t. random initialization) to a global minimum with zero error at a linear rate. Later, we get to a special type of network called ResNet where the above result demands only polynomially many neurons per layer (w.r.t. depth). 110 Issues in computing the gradient ▶ vanishing and exploding gradients ∂Ek ∂yj = yj − dkj for j ∈ Y ∂Ek ∂yj = r∈j→ ∂Ek ∂yr · σ′ r (ξr ) · wrj for j ∈ Z ∖ (Y ∪ X) 111 Issues in computing the gradient ▶ vanishing and exploding gradients ∂Ek ∂yj = yj − dkj for j ∈ Y ∂Ek ∂yj = r∈j→ ∂Ek ∂yr · σ′ r (ξr ) · wrj for j ∈ Z ∖ (Y ∪ X) ▶ inexact gradient computation: ▶ Minibatch gradient is only an estimate of the true gradient. ▶ Note that the standard deviation of the estimate is (roughly) σ/ √ m where m is the size of the minibatch and σ is the variance of the gradient estimate for a single training example. (E.g. minibatch size 10 000 means 100 times more computation than the size 100 but gives only 10 times less deviation.) 111 Minibatch size ▶ Larger batches provide a more accurate estimate of the gradient but with less than linear returns. 112 Minibatch size ▶ Larger batches provide a more accurate estimate of the gradient but with less than linear returns. ▶ Multicore architectures are usually underutilized by extremely small batches. 112 Minibatch size ▶ Larger batches provide a more accurate estimate of the gradient but with less than linear returns. ▶ Multicore architectures are usually underutilized by extremely small batches. ▶ If all examples in the batch are to be processed in parallel (as is the typical case), then the amount of memory scales with the batch size. 
For many hardware setups, this is the limiting factor in batch size. 112 Minibatch size ▶ Larger batches provide a more accurate estimate of the gradient but with less than linear returns. ▶ Multicore architectures are usually underutilized by extremely small batches. ▶ If all examples in the batch are to be processed in parallel (as is the typical case), then the amount of memory scales with the batch size. For many hardware setups, this is the limiting factor in batch size. ▶ It is common (especially when using GPUs) for power of 2 batch sizes to offer better runtime. The typical power of 2 batch sizes ranges from 32 to 256, with 16 sometimes being attempted for large models. 112 Minibatch size ▶ Larger batches provide a more accurate estimate of the gradient but with less than linear returns. ▶ Multicore architectures are usually underutilized by extremely small batches. ▶ If all examples in the batch are to be processed in parallel (as is the typical case), then the amount of memory scales with the batch size. For many hardware setups, this is the limiting factor in batch size. ▶ It is common (especially when using GPUs) for power of 2 batch sizes to offer better runtime. The typical power of 2 batch sizes ranges from 32 to 256, with 16 sometimes being attempted for large models. ▶ Small batches can offer a regularizing effect, perhaps due to the noise they add to the learning process. It has been observed in practice that when using a larger batch, there is a degradation in the quality of the model, as measured by its ability to generalize. ("On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima." Keskar et al, ICLR’17) 112 Momentum The issue in the gradient descent: ▶ ∇E(⃗w(t)) constantly changes direction (but the error steadily decreases). 113 Momentum The issue in the gradient descent: ▶ ∇E(⃗w(t)) constantly changes direction (but the error steadily decreases). Solution: In every step, add the change made in the previous step (weighted by a factor α): ∆⃗w(t) = −ε(t) · k∈T ∇Ek (⃗w(t) ) + α · ∆⃗w(t−1) where 0 < α < 1. 113 Momentum – illustration 114 SGD with momentum ▶ weights in ⃗w(0) are randomly initialized to values close to 0 ▶ in the step t + 1 (here t = 0, 1, 2 . . .), weights ⃗w(t+1) are computed as follows: ▶ Choose (randomly) a set of training examples T ⊆ {1, . . . , p} ▶ Compute ⃗w(t+1) = ⃗w(t) + ∆⃗w(t) where ∆⃗w(t) = −ε(t) · k∈T ∇Ek (⃗w(t) ) + α∆⃗w(t−1) ▶ 0 < ε(t) ≤ 1 is a learning rate in step t + 1 ▶ 0 < α < 1 measures the "influence" of the momentum ▶ ∇Ek (⃗w(t) ) is the gradient of the error of the example k Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially. 115 Learning rate 116 Search for the learning rate ▶ Use settings from a successful solution of a similar problem as a baseline. ▶ Search for the learning rate using the learning monitoring: ▶ Search through values from small (e.g. 0.001) to (0.1), possibly multiplying by 2. ▶ Train for several epochs, observe the learning curves (see cross-validation later). 117 Adaptive learning rate ▶ Power scheduling: Set ϵ(t) = ϵ0/(1 + t/s) where ϵ0 is an initial learning rate and s is a constant number (after s steps the learning rate is ϵ0/2, after 2s it is ϵ0/3 etc.) 118 Adaptive learning rate ▶ Power scheduling: Set ϵ(t) = ϵ0/(1 + t/s) where ϵ0 is an initial learning rate and s is a constant number (after s steps the learning rate is ϵ0/2, after 2s it is ϵ0/3 etc.) 
▶ Exponential scheduling: Set ϵ(t) = ϵ0 · 0.1t/s . (the learning rate decays faster than in the power scheduling) 118 Adaptive learning rate ▶ Power scheduling: Set ϵ(t) = ϵ0/(1 + t/s) where ϵ0 is an initial learning rate and s is a constant number (after s steps the learning rate is ϵ0/2, after 2s it is ϵ0/3 etc.) ▶ Exponential scheduling: Set ϵ(t) = ϵ0 · 0.1t/s . (the learning rate decays faster than in the power scheduling) ▶ Piecewise constant scheduling: A constant learning rate for a number of steps/epochs, then a smaller learning rate, and so on. 118 Adaptive learning rate ▶ Power scheduling: Set ϵ(t) = ϵ0/(1 + t/s) where ϵ0 is an initial learning rate and s is a constant number (after s steps the learning rate is ϵ0/2, after 2s it is ϵ0/3 etc.) ▶ Exponential scheduling: Set ϵ(t) = ϵ0 · 0.1t/s . (the learning rate decays faster than in the power scheduling) ▶ Piecewise constant scheduling: A constant learning rate for a number of steps/epochs, then a smaller learning rate, and so on. ▶ 1cycle scheduling: Start by increasing the initial learning rate from ϵ0 linearly to ϵ1 (approx. ϵ1 = 10ϵ0) halfway through training. Then decrease from ϵ1 linearly to ϵ0. Finish by dropping the learning rate by several orders of magnitude (still linearly). According to a 2018 paper by Leslie Smith, this may converge much faster (100 epochs vs 800 epochs on the CIFAR10 dataset). For a comparison of some methods, see: AN EMPIRICAL STUDY OF LEARNING RATES IN DEEP NEURAL NETWORKS FOR SPEECH RECOGNITION, Senior et al 118 AdaGrad So far, we have considered fixed schedules for learning rates. It is better to have ▶ larger rates for weights with smaller updates, ▶ smaller rates for weights with larger updates. AdaGrad uses individually adapting learning rates for each weight. 119 SGD with AdaGrad ▶ weights in ⃗w(0) are randomly initialized to values close to 0 ▶ in the step t + 1 (here t = 0, 1, 2 . . .), compute ⃗w(t+1) : ▶ Choose (randomly) a minibatch T ⊆ {1, . . . , p} ▶ Compute w (t+1) ji = w (t) ji + ∆w (t) ji 120 SGD with AdaGrad ▶ weights in ⃗w(0) are randomly initialized to values close to 0 ▶ in the step t + 1 (here t = 0, 1, 2 . . .), compute ⃗w(t+1) : ▶ Choose (randomly) a minibatch T ⊆ {1, . . . , p} ▶ Compute w (t+1) ji = w (t) ji + ∆w (t) ji where ∆w (t) ji = − η r (t) ji + δ · k∈T ∂Ek ∂wji (⃗w(t) ) and r (t) ji = r (t−1) ji +   k∈T ∂Ek ∂wji (⃗w(t) )   2 ▶ η is a constant expressing the influence of the learning rate, typically 0.01. ▶ δ > 0 is a smoothing term (typically 1e-8) avoiding division by 0. 120 RMSProp The main disadvantage of AdaGrad is the accumulation of gradients throughout the learning process. In case the learning needs to get over several "hills" before settling in a deep "valley," the weight updates get far too small before getting to it. RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly after finding a convex bowl as if it were an instance of the AdaGrad algorithm initialized within that bowl. 121 SGD with RMSProp ▶ weights in ⃗w(0) are randomly initialized to values close to 0 ▶ in the step t + 1 (here t = 0, 1, 2 . . .), compute ⃗w(t+1) : ▶ Choose (randomly) a minibatch T ⊆ {1, . . . , p} ▶ Compute w (t+1) ji = w (t) ji + ∆w (t) ji 122 SGD with RMSProp ▶ weights in ⃗w(0) are randomly initialized to values close to 0 ▶ in the step t + 1 (here t = 0, 1, 2 . . .), compute ⃗w(t+1) : ▶ Choose (randomly) a minibatch T ⊆ {1, . . . 
, p} ▶ Compute w (t+1) ji = w (t) ji + ∆w (t) ji where ∆w (t) ji = − η r (t) ji + δ · k∈T ∂Ek ∂wji (⃗w(t) ) and r (t) ji = ρr (t−1) ji + (1 − ρ)   k∈T ∂Ek ∂wji (⃗w(t) )   2 ▶ η is a constant expressing the influence of the learning rate (Hinton suggests ρ = 0.9 and η = 0.001). ▶ δ > 0 is a smoothing term (typically 1e-8) avoiding division by 0. 122 Other optimization methods There are more methods, such as AdaDelta and Adam (RMSProp combined with momentum). A natural question: Which algorithm should one choose? 123 Other optimization methods There are more methods, such as AdaDelta and Adam (RMSProp combined with momentum). A natural question: Which algorithm should one choose? Unfortunately, there is currently no consensus on this point. According to a recent study, the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly, no single best algorithm has emerged. 123 Other optimization methods There are more methods, such as AdaDelta and Adam (RMSProp combined with momentum). A natural question: Which algorithm should one choose? Unfortunately, there is currently no consensus on this point. According to a recent study, the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly, no single best algorithm has emerged. Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta, and Adam. The choice of which algorithm to use, at this point, seems to depend largely on the user’s familiarity with the algorithm. 123 Choice of (hidden) activations Generic requirements imposed on activation functions: 1. differentiability (to do gradient descent) 2. non-linearity (linear multi-layer networks are equivalent to single-layer) 3. monotonicity (local extrema of activation functions induce local extrema of the error function) 4. "linearity" (i.e. preserve as much linearity as possible; linear models are easiest to fit; find the "minimum" non-linearity needed to solve a given task) The choice of activation functions is closely related to input preprocessing and the initial choice of weights. 124 Input preprocessing ▶ Some inputs may be much larger than others. For example, the height vs. weight of a person, the max. speed of a car (in km/h) vs. its price (in CZK), etc. 125 Input preprocessing ▶ Some inputs may be much larger than others. For example, the height vs. weight of a person, the max. speed of a car (in km/h) vs. its price (in CZK), etc. ▶ Large inputs have a greater influence on the training than the small ones. Also, too large inputs may slow down learning (saturation of some activation functions). 125 Input preprocessing ▶ Some inputs may be much larger than others. For example, the height vs. weight of a person, the max. speed of a car (in km/h) vs. its price (in CZK), etc. ▶ Large inputs have a greater influence on the training than the small ones. Also, too large inputs may slow down learning (saturation of some activation functions). ▶ Typical standardization: ▶ average = 0 (subtract the mean) ▶ variance = 1 (divide by the standard deviation) Here, the mean and standard deviation may be estimated from the data (the training set). (illustration of standard deviation) 125 Initial weights - intuition ▶ Assume weights are chosen randomly. What distribution? 126 Initial weights - intuition ▶ Assume weights are chosen randomly. What distribution? 
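The initialization question just raised is answered on the following slides. As a recap of the optimizers discussed above, here is a minimal sketch of the SGD-with-momentum, AdaGrad, and RMSProp updates on a single scalar weight. The hyperparameters follow the values mentioned on the slides (η = 0.01 for AdaGrad, ρ = 0.9 and η = 0.001 for RMSProp, δ = 1e-8); the momentum factor α = 0.9, the square root over the accumulator r, and the toy objective are standard choices assumed here for illustration only.

```python
import math

def sgd_momentum_step(w, dw_prev, g, eps=0.1, alpha=0.9):
    # Delta w(t) = -eps * g + alpha * Delta w(t-1)
    dw = -eps * g + alpha * dw_prev
    return w + dw, dw

def adagrad_step(w, r, g, eta=0.01, delta=1e-8):
    # accumulate the squared gradients over the whole history
    r = r + g * g
    return w - eta / (math.sqrt(r) + delta) * g, r

def rmsprop_step(w, r, g, eta=0.001, rho=0.9, delta=1e-8):
    # exponentially decaying average of squared gradients
    r = rho * r + (1.0 - rho) * g * g
    return w - eta / (math.sqrt(r) + delta) * g, r

# toy objective E(w) = 1/2 * w^2, so the (minibatch) gradient is simply g = w
w_m, dw = 1.0, 0.0
w_a, r_a = 1.0, 0.0
w_r, r_r = 1.0, 0.0
for t in range(100):
    w_m, dw = sgd_momentum_step(w_m, dw, w_m)
    w_a, r_a = adagrad_step(w_a, r_a, w_a)
    w_r, r_r = rmsprop_step(w_r, r_r, w_r)
print(w_m, w_a, w_r)    # all three move toward the minimum at w = 0, at different speeds
```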
Consider the behavior of a deep network: ▶ Small weights make the values of inner potentials vanish. ▶ Large weights make the values of inner potentials explode. Hence, we want to choose weights so that the inner potentials of neurons are stable (similar in all layers of the network). 126 Normal LeCun initialization ▶ Assume the input data have the mean = 0 and the variance = 1. Consider a neuron j from the first layer with n inputs. Assume its weights are chosen randomly by the normal distribution N(0, w2 ). Assume that all random choices are independent of each other. ▶ The rule: Choose the standard deviation of weights w so that the standard deviation of ξj (denote by oj) satisfies oj ≈ 1. 127 Normal LeCun initialization ▶ Assume the input data have the mean = 0 and the variance = 1. Consider a neuron j from the first layer with n inputs. Assume its weights are chosen randomly by the normal distribution N(0, w2 ). Assume that all random choices are independent of each other. ▶ The rule: Choose the standard deviation of weights w so that the standard deviation of ξj (denote by oj) satisfies oj ≈ 1. ▶ Basic properties of the variance of independent variables give oj = √ n · w. Thus by putting w = 1 n we obtain oj = 1. 127 Normal LeCun initialization ▶ Assume the input data have the mean = 0 and the variance = 1. Consider a neuron j from the first layer with n inputs. Assume its weights are chosen randomly by the normal distribution N(0, w2 ). Assume that all random choices are independent of each other. ▶ The rule: Choose the standard deviation of weights w so that the standard deviation of ξj (denote by oj) satisfies oj ≈ 1. ▶ Basic properties of the variance of independent variables give oj = √ n · w. Thus by putting w = 1 n we obtain oj = 1. ▶ The same works for higher layers; n corresponds to the number of neurons in the layer one level lower. This gives normal LeCun initialization: wi ∼ N 0, 1 n 127 Derivation of the LeCun initialization Consider a single neuron without bias with the inner potential ξ = n i=1 wixi 128 Derivation of the LeCun initialization Consider a single neuron without bias with the inner potential ξ = n i=1 wixi Consider all wi and xi as independent random variables (hence also ξ is a random variable) where ▶ wi ∈ N(0, w2) for i = 1, . . . , n where w is a constant, ▶ Exi = 0 and Var[xi] = E[(xi − Exi)2] = 1 for i = 1, . . . , n 128 Derivation of the LeCun initialization Consider a single neuron without bias with the inner potential ξ = n i=1 wixi Consider all wi and xi as independent random variables (hence also ξ is a random variable) where ▶ wi ∈ N(0, w2) for i = 1, . . . , n where w is a constant, ▶ Exi = 0 and Var[xi] = E[(xi − Exi)2] = 1 for i = 1, . . . , n We prove that Var[ξ] = n · w2 as follows: 128 Derivation of the LeCun initialization Consider a single neuron without bias with the inner potential ξ = n i=1 wixi Consider all wi and xi as independent random variables (hence also ξ is a random variable) where ▶ wi ∈ N(0, w2) for i = 1, . . . , n where w is a constant, ▶ Exi = 0 and Var[xi] = E[(xi − Exi)2] = 1 for i = 1, . . . , n We prove that Var[ξ] = n · w2 as follows: Eξ = E n i=1 wixi = n i=1 Ewixi ind. = n i=1 EwiExi = 0 128 Derivation of the LeCun initialization Consider a single neuron without bias with the inner potential ξ = n i=1 wixi Consider all wi and xi as independent random variables (hence also ξ is a random variable) where ▶ wi ∈ N(0, w2) for i = 1, . . . 
, n where w is a constant, ▶ Exi = 0 and Var[xi] = E[(xi − Exi)2] = 1 for i = 1, . . . , n We prove that Var[ξ] = n · w2 as follows: Eξ = E n i=1 wixi = n i=1 Ewixi ind. = n i=1 EwiExi = 0 and Var[wixi] = E[w2 i x2 i ] − E[wixi]2 ind. = E[w2 i ]E[x2 i ] − 0 = w2 128 Derivation of the LeCun initialization Consider a single neuron without bias with the inner potential ξ = n i=1 wixi Consider all wi and xi as independent random variables (hence also ξ is a random variable) where ▶ wi ∈ N(0, w2) for i = 1, . . . , n where w is a constant, ▶ Exi = 0 and Var[xi] = E[(xi − Exi)2] = 1 for i = 1, . . . , n We prove that Var[ξ] = n · w2 as follows: Eξ = E n i=1 wixi = n i=1 Ewixi ind. = n i=1 EwiExi = 0 and Var[wixi] = E[w2 i x2 i ] − E[wixi]2 ind. = E[w2 i ]E[x2 i ] − 0 = w2 implies Var[ξ] = Var[ n i=1 wixi] ind. = n i=1 Var[wixi] = n i=1 w2 = n · w2 128 Normal Glorot initialization The previous heuristic for weight initialization ignores the variance of the gradient (i.e., it is concerned only with the "size" of activations in the forward pass). 129 Normal Glorot initialization The previous heuristic for weight initialization ignores the variance of the gradient (i.e., it is concerned only with the "size" of activations in the forward pass). Glorot & Bengio (2010) presented a normalized initialization by choosing weights randomly from the following normal distribution: N 0, 2 m + n = N 0, 1 (m + n)/2 Here n is the number of inputs to the layer, m is the number of neurons in the layer above. 129 Normal Glorot initialization The previous heuristic for weight initialization ignores the variance of the gradient (i.e., it is concerned only with the "size" of activations in the forward pass). Glorot & Bengio (2010) presented a normalized initialization by choosing weights randomly from the following normal distribution: N 0, 2 m + n = N 0, 1 (m + n)/2 Here n is the number of inputs to the layer, m is the number of neurons in the layer above. This is designed to compromise between the goal of initializing all layers to have the same activation variance and the goal of initializing all layers to have the same gradient variance. This gives normal Glorot initialization (also called normal Xavier initialization): wi ∼ N (0, 2 m + n 129 Uniform LeCun initialization ▶ Assume that the input data have mean = 0 and variance = 1. Consider a neuron j from the first layer with n inputs. Assume its weights are chosen randomly by the uniform distribution U(−w, w). Assume that all random choices are independent of each other. ▶ As before, we want the standard deviation oj of the inner potential ξj to be approximately 1. ▶ Basic properties of the variance of independent variables give oj = n 3 · w. Thus by putting w = 3 n we obtain oj = 1. We obtain uniform LeCun initialization: wi ∼ U  − 3 n , 3 n   130 Uniform Glorot initialization Similarly to the normal case, we want to normalize the initialization w.r.t. both forward and backward passes. We obtain uniform Glorot initialization (aka uniform Xavier init.): wi ∼ U  − 6 m + n , 6 m + n   = U   − 3 (m + n)/2 , 3 (m + n)/2   Here n is the number of inputs to the layer, m is the number of neurons in the layer above. 131 Modern activation functions For hidden neurons, sigmoidal functions are often substituted with piece-wise linear activation functions. Most prominent is ReLU: σ(ξ) = max{0, ξ} ▶ THE default activation function recommended for most feedforward neural networks. 
▶ As close to linear function as possible; very simple; does not saturate for large potentials. ▶ Dead for negative potentials. 132 Normal He initialization ▶ The ReLU is not as sensitive to the large variance of the inner potential as sigmoidal functions (large variance does not matter as much). 133 Normal He initialization ▶ The ReLU is not as sensitive to the large variance of the inner potential as sigmoidal functions (large variance does not matter as much). ▶ Still, the variance is good to be constant (at least due to the output layer). 133 Normal He initialization ▶ The ReLU is not as sensitive to the large variance of the inner potential as sigmoidal functions (large variance does not matter as much). ▶ Still, the variance is good to be constant (at least due to the output layer). ▶ LeCun initialization cannot be justified for ReLU due to the following reason: The ReLU is not a symmetric function. So even if the inner potential ξj has variance = 1, it is not true of the output (the variance is halved). 133 Normal He initialization ▶ The ReLU is not as sensitive to the large variance of the inner potential as sigmoidal functions (large variance does not matter as much). ▶ Still, the variance is good to be constant (at least due to the output layer). ▶ LeCun initialization cannot be justified for ReLU due to the following reason: The ReLU is not a symmetric function. So even if the inner potential ξj has variance = 1, it is not true of the output (the variance is halved). Modifying the normal LeCun initialization to take the halving variance into account, we obtain normal He initialization: wi ∈ N 0, 2 n LeCun is wi ∈ N 0, 1 n 133 More modern activation functions ▶ Leaky ReLU (green board): ▶ Generalizes ReLU, not dead for negative potentials. ▶ Experimentally not much better than ReLU. 134 More modern activation functions ▶ Leaky ReLU (green board): ▶ Generalizes ReLU, not dead for negative potentials. ▶ Experimentally not much better than ReLU. ▶ ELU: "Smoothed" ReLU: σ(ξ) =    α(exp(ξ) − 1) for ξ < 0 ξ for ξ ≥ 0 Here α is a parameter, ELU converges to −α as ξ → −∞. As opposed to ReLU: Smooth, always non-zero gradient (but saturates), slower to compute. 134 More modern activation functions ▶ Leaky ReLU (green board): ▶ Generalizes ReLU, not dead for negative potentials. ▶ Experimentally not much better than ReLU. ▶ ELU: "Smoothed" ReLU: σ(ξ) =    α(exp(ξ) − 1) for ξ < 0 ξ for ξ ≥ 0 Here α is a parameter, ELU converges to −α as ξ → −∞. As opposed to ReLU: Smooth, always non-zero gradient (but saturates), slower to compute. ▶ SELU: Scaled variant of ELU: : σ(ξ) = λ    α(exp(ξ) − 1) for ξ < 0 ξ for ξ ≥ 0 Self-normalizing, i.e. output of each layer will tend to preserve a mean (close to) 0 and a standard deviation (close to) 1 for λ ≈ 1.050 and α ≈ 1.673, properly initialized weights (see below) and normalized inputs (zero mean, standard deviation 1). 134 135 Initializing with Normal Distribution Denote by n the number of inputs to the initialized layer, and m the number of neurons in the layer. ▶ normal Glorot: wi ∼ N (0, 2 m + n Suitable for none, tanh, logistic, softmax 136 Initializing with Normal Distribution Denote by n the number of inputs to the initialized layer, and m the number of neurons in the layer. 
▶ normal Glorot: wi ∼ N (0, 2 m + n Suitable for none, tanh, logistic, softmax ▶ normal He: wi ∈ N 0, 2 n Suitable for ReLU, leaky ReLU 136 Initializing with Normal Distribution Denote by n the number of inputs to the initialized layer, and m the number of neurons in the layer. ▶ normal Glorot: wi ∼ N (0, 2 m + n Suitable for none, tanh, logistic, softmax ▶ normal He: wi ∈ N 0, 2 n Suitable for ReLU, leaky ReLU ▶ normal LeCun: wi ∼ N 0, 1 n Suitable for SELU (by the authors) 136 How to choose activation of hidden neurons ▶ The default is ReLU. ▶ According to Aurélien Géron: SELU > ELU > leakyReLU > ReLU > tanh > logistic For discussion see: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, Aurélien Géron 137 Batch normalization (roughly) Intuition: Instead of keeping mean = 0 and variance = 1 implicitly due to a clever weight initialization, we may renormalize values of neurons throughout the layers. 138 Batch normalization (roughly) Intuition: Instead of keeping mean = 0 and variance = 1 implicitly due to a clever weight initialization, we may renormalize values of neurons throughout the layers. Consider the ℓ-th layer of the network. Note that the output values of neurons in the ℓ-th layer can be seen as inputs to the sub-network consisting of all layers above the ℓ-th one. 138 Batch normalization (roughly) Intuition: Instead of keeping mean = 0 and variance = 1 implicitly due to a clever weight initialization, we may renormalize values of neurons throughout the layers. Consider the ℓ-th layer of the network. Note that the output values of neurons in the ℓ-th layer can be seen as inputs to the sub-network consisting of all layers above the ℓ-th one. What if we standardize the values of the ℓ-th layer as we did with the input data? For this we need to form a "dataset" of values of the ℓ-th layer. 138 Batch normalization (roughly) Let us consider the ℓ-th layer with n neurons. Consider a batch of training examples: {(⃗xk , ⃗dk ) | k = 1, . . . , p} (This is typically a minibatch.) 139 Batch normalization (roughly) Let us consider the ℓ-th layer with n neurons. Consider a batch of training examples: {(⃗xk , ⃗dk ) | k = 1, . . . , p} (This is typically a minibatch.) ▶ For every k = 1, . . . , p: Compute the values of neurons in the ℓ-th layer for the input ⃗xk and obtain a vector ⃗zk = (zk1, . . . , zkn) 139 Batch normalization (roughly) Let us consider the ℓ-th layer with n neurons. Consider a batch of training examples: {(⃗xk , ⃗dk ) | k = 1, . . . , p} (This is typically a minibatch.) ▶ For every k = 1, . . . , p: Compute the values of neurons in the ℓ-th layer for the input ⃗xk and obtain a vector ⃗zk = (zk1, . . . , zkn) ▶ Set all components of all vectors ⃗zk to the mean = 0 and the variance = 1 and obtain normalized vectors: ˆz1, . . . , ˆzp. 139 Batch normalization (roughly) Let us consider the ℓ-th layer with n neurons. Consider a batch of training examples: {(⃗xk , ⃗dk ) | k = 1, . . . , p} (This is typically a minibatch.) ▶ For every k = 1, . . . , p: Compute the values of neurons in the ℓ-th layer for the input ⃗xk and obtain a vector ⃗zk = (zk1, . . . , zkn) ▶ Set all components of all vectors ⃗zk to the mean = 0 and the variance = 1 and obtain normalized vectors: ˆz1, . . . , ˆzp. ▶ For every k = 1, . . . , p give ⃗γ · ˆzk + ⃗δ as the output of the ℓ-th layer instead of ⃗zk . Here ⃗γ and ⃗δ are new trainable weights. 139 Normalization During the training, the normalized vectors ˆz1, . . . 
, ˆzp are computed as follows: ˆzki = zki − µi si Here µi = 1 p p k=1 zki si = 1 p p k=1 (zki − µi)2 During inference, where we have just a single value ⃗z of the layer ℓ for an input ⃗x, we use µi and si estimated on a population (e.g., a larger sample of the training set). 140 Generalization 141 Generalization Intuition: Generalization = ability to cope with new unseen instances. Data are mostly noisy, so it is not good idea to fit exactly. In case of function approximation, the network should not return exact results as in the training set. 142 Generalization Intuition: Generalization = ability to cope with new unseen instances. Data are mostly noisy, so it is not good idea to fit exactly. In case of function approximation, the network should not return exact results as in the training set. More formally: It is typically assumed that the training set has been generated as follows: dkj = gj(⃗xk ) + Θkj where gj is the "underlying" function corresponding to the output neuron j ∈ Y and Θkj is random noise. The network should fit gj not the noise. Methods improving generalization are called regularization methods. 142 Regularization Regularization is a big issue in neural networks, as they typically use a huge amount of parameters and thus are very susceptible to overfitting. 143 Regularization Regularization is a big issue in neural networks, as they typically use a huge amount of parameters and thus are very susceptible to overfitting. von Neumann: "With four parameters, I can fit an elephant, and with five, I can make him wiggle his trunk." 143 Elephant x(t) = −60 cos(t) + 30 sin(t) − 8 sin(2t) + 10 sin(3t) y(t) = 50 sin(t) + 18 sin(2t) − 12 cos(3t) + 14 cos(5t) The four parameters are complex numbers (e.g., −60 + 50i). Mayer, Jurgen; Khairy, Khaled; Howard, Jonathon (May 12, 2010). "Drawing an elephant with four complex parameters". American Journal of Physics. 78 (6) 144 Fifth Elephant 145 Regularization Regularization is a big issue in neural networks, as they typically use a huge amount of parameters and thus are very susceptible to overfitting. 146 Regularization Regularization is a big issue in neural networks, as they typically use a huge amount of parameters and thus are very susceptible to overfitting. von Neumann: "With four parameters, I can fit an elephant, and with five, I can make him wiggle his trunk." ... and I ask you, prof. Neumann: What can you fit with 40GB of parameters?? 146 Early stopping Early stopping means that we stop learning before it reaches a minimum of the error E. When to stop? 147 Early stopping Early stopping means that we stop learning before it reaches a minimum of the error E. When to stop? In many applications the error function is not the main thing we want to optimize. E.g. in the case of a trading system, we typically want to maximize our profit not to minimize (strange) error functions designed to be easily differentiable. Also, as noted before, minimizing E completely is not good for generalization. For start: We may employ standard approach of training on one set and stopping on another one. 147 Early stopping Divide your dataset into several subsets: ▶ training set (e.g. 60%) – train the network here ▶ validation set (e.g. 20%) – use to stop the training ▶ test set (e.g. 20%) – use to evaluate the final model What to use as a stopping rule? 148 Early stopping Divide your dataset into several subsets: ▶ training set (e.g. 60%) – train the network here ▶ validation set (e.g. 20%) – use to stop the training ▶ test set (e.g. 
20%) – use to evaluate the final model What to use as a stopping rule? You may observe E (or any other function of interest) on the validation set, if it does not improve for last k steps, stop. Alternatively, you may observe the gradient, if it is small for some time, stop. (some studies shown that this traditional rule is not too good: it may happen that the gradient is larger close to minimum values; on the other hand, E does not have to be evaluated which saves time.) To compare models you may use ML techniques such as various types of cross-validation etc. 148 Size of the network Similar problem as in the case of the training duration: ▶ Too small network is not able to capture intrinsic properties of the training set. ▶ Large networks overfit faster. Solution: Optimal number of neurons :-) 149 Size of the network Similar problem as in the case of the training duration: ▶ Too small network is not able to capture intrinsic properties of the training set. ▶ Large networks overfit faster. Solution: Optimal number of neurons :-) ▶ there are some (useless) theoretical bounds ▶ there are algorithms dynamically adding/removing neurons (not much use nowadays) ▶ In practice: Start with an existing network solving similar problem. If you are trully desperate trying to solve a brand new problem, you may try an ancient rule of thumb: the number of neurons ≈ ten times less than the number of training instances. Experiment, experiment, experiment. 149 Feature extraction Consider a two-layer network. Hidden neurons are supposed to represent "patterns" in the inputs. Example: Network 64-2-3 for letter classification: 150 Ensemble methods Techniques for reducing generalization error by combining several models. The reason that ensemble methods work is that different models will usually not make all the same errors on the test set. Idea: Train several different models separately, then have all of the models vote on the output for test examples. 151 Ensemble methods Techniques for reducing generalization error by combining several models. The reason that ensemble methods work is that different models will usually not make all the same errors on the test set. Idea: Train several different models separately, then have all of the models vote on the output for test examples. Bagging: ▶ Generate k training sets T1, ..., Tk by sampling from T uniformly with replacement. If the number of samples is |T |, then on average |Ti| = (1 − 1/e)|T |. ▶ For each i, train a model Mi on Ti. ▶ Combine outputs of the models: for regression by averaging, for classification by (majority) voting. 151 Dropout The algorithm: In every step of the gradient descent ▶ choose randomly a set N of neurons, each neuron is included independently with probability 1/2, (in practice, different probabilities are used as well). ▶ do forward and backward propagations only using the selected neurons (i.e. leave weights of the other neurons unchanged) 152 Dropout The algorithm: In every step of the gradient descent ▶ choose randomly a set N of neurons, each neuron is included independently with probability 1/2, (in practice, different probabilities are used as well). ▶ do forward and backward propagations only using the selected neurons (i.e. leave weights of the other neurons unchanged) Dropout resembles bagging: Large ensemble of neural networks is trained "at once" on parts of the data. 
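A minimal sketch of the masking just described (the comparison with bagging continues right below): during training, each neuron of a layer is kept independently with probability 1/2 and dropped neurons contribute nothing; at inference time no neurons are dropped and, as detailed on the next slide, the outputs are multiplied by 1/2. Only the forward computation of one layer is shown; the weights, the layer size, and the keep probability are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def hidden_layer(W, y_below, train=True, keep_prob=0.5):
    # y_below: outputs of the layer below, W: weight matrix (bias omitted for brevity)
    y = 1.0 / (1.0 + np.exp(-(W @ y_below)))       # sigmoid units
    if train:
        mask = rng.random(y.shape) < keep_prob     # each neuron kept independently
        return y * mask                            # dropped neurons output 0
    return y * keep_prob                           # inference: no dropout, scale by 1/2

W = rng.normal(0.0, 0.1, (4, 3))
x = np.array([0.2, -0.7, 1.0])
print(hidden_layer(W, x, train=True))    # roughly half of the outputs are zeroed
print(hidden_layer(W, x, train=False))   # all outputs, scaled by keep_prob
```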
Dropout is not exactly the same as bagging: The models share parameters, with each model inheriting a different subset of parameters from the parent neural network. This parameter sharing makes it possible to represent an exponential number of models with a tractable amount of memory. In the case of bagging, each model is trained to convergence on its respective training set. This would be infeasible for large networks/training sets. 152 Dropout – details ▶ The inner potential of a neuron j without dropout: ξj = i∈j← wjiyi ▶ The inner potential of a neuron j with dropout: ri ∼ Bernoulli(1/2) for all i ∈ j← ∖ {0} ξj = i∈j← wji(riyi) (Intuitively, randomly chosen neurons are masked out.) ▶ During inference do not drop out neurons and multiply values of neurons with 1/2. This compensates for the fact that without the drop out there are twice as many neurons. 153 Weight decay and L2 regularization Generalization can be improved by removing "unimportant" weights. Penalising large weights gives stronger indication about their importance. 154 Weight decay and L2 regularization Generalization can be improved by removing "unimportant" weights. Penalising large weights gives stronger indication about their importance. In every step we decrease weights (multiplicatively) as follows: w (t+1) ji = (1 − ζ)w (t) ji − ε · ∂E ∂wji (⃗w(t) ) Intuition: Unimportant weights will be pushed to 0, important weights will survive the decay. 154 Weight decay and L2 regularization Generalization can be improved by removing "unimportant" weights. Penalising large weights gives stronger indication about their importance. In every step we decrease weights (multiplicatively) as follows: w (t+1) ji = (1 − ζ)w (t) ji − ε · ∂E ∂wji (⃗w(t) ) Intuition: Unimportant weights will be pushed to 0, important weights will survive the decay. Weight decay is equivalent to the gradient descent with a constant learning rate ε and the following error function: E′ (⃗w) = E(⃗w) + ζ 2ε (⃗w · ⃗w) Here ζ 2ε (⃗w · ⃗w) is the L2 regularization that penalizes large weights. We use the gradient descent with a constant learning rate to illustrate the equivalence between L2 regularization and the weight decay. Both methods can be combined with other learning algorithnms (AdaGrad, etc.). 154 More optimization, regularization ... There are many more practical tips, optimization methods, regularization methods, etc. For a very nice survey see http://www.deeplearningbook.org/ ... and also all other infinitely many urls concerned with deep learning. 155 Some application(s) 156 MLP applications ▶ MLP is the basic network used when it is unclear what topology is appropriate. 157 MLP applications ▶ MLP is the basic network used when it is unclear what topology is appropriate. ▶ MLP is often used as a component of larger deep models (especially at the head of the network, interpreting the features extracted by the lower, more specialized layers.) 157 MLP applications ▶ MLP is the basic network used when it is unclear what topology is appropriate. ▶ MLP is often used as a component of larger deep models (especially at the head of the network, interpreting the features extracted by the lower, more specialized layers.) ▶ The most prominent NN architecture in the 80s and 90s. 157 MLP applications ▶ MLP is the basic network used when it is unclear what topology is appropriate. ▶ MLP is often used as a component of larger deep models (especially at the head of the network, interpreting the features extracted by the lower, more specialized layers.) 
▶ The most prominent NN architecture in the 80s and 90s. ▶ Various applications: ▶ ALVINN - autonomous driving car ▶ Characters/digits recognition ▶ Table data processing ▶ ... 157 MLP applications ▶ MLP is the basic network used when it is unclear what topology is appropriate. ▶ MLP is often used as a component of larger deep models (especially at the head of the network, interpreting the features extracted by the lower, more specialized layers.) ▶ The most prominent NN architecture in the 80s and 90s. ▶ Various applications: ▶ ALVINN - autonomous driving car ▶ Characters/digits recognition ▶ Table data processing ▶ ... ▶ Two-layer MLPs are usually not much better than non-neural models. 157 MNIST – handwritten digits recognition ▶ Database of labeled images of handwritten digits: 60 000 training examples, 10 000 testing. ▶ Dimensions: 28 x 28, digits are centered to the "center of gravity" of pixel values and normalized to a fixed size. ▶ More at http: //yann.lecun.com/exdb/mnist/ The database is used as a standard benchmark in lots of publications. 158 MNIST – handwritten digits recognition ▶ Database of labeled images of handwritten digits: 60 000 training examples, 10 000 testing. ▶ Dimensions: 28 x 28, digits are centered to the "center of gravity" of pixel values and normalized to a fixed size. ▶ More at http: //yann.lecun.com/exdb/mnist/ The database is used as a standard benchmark in lots of publications. Allows comparison of various methods. 158 MNIST One of the best "old" results is the following: 6-layer NN 784-2500-2000-1500-1000-500-10 (on GPU) (Ciresan et al. 2010) Abstract: Good old on-line back-propagation for plain multi-layer perceptrons yields a very low 0.35 error rate on the famous MNIST handwritten digits benchmark. All we need to achieve this best result so far are many hidden layers, many neurons per layer, numerous deformed training images, and graphics cards to greatly speed up learning. A famous application of a learning convolutional network LeNet-1 in 1998. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998 159 MNIST – LeNet1 160 MNIST – LeNet1 Interpretation of output: ▶ the output neuron with the highest value identifies the digit. ▶ the same, but if the two largest neuron values are too close together, the input is rejected (i.e. no answer). Learning: Inputs: ▶ training on 7291 samples, tested on 2007 samples Results: ▶ error on test set without rejection: 5% ▶ error on test set with rejection: 1% (12% rejected) ▶ compare with dense MLP with 40 hidden neurons: error 1% (19.4% rejected) 161 Convolutional Networks Some parts of the lecture are based on the online book Neural Networks and Deep Learning by Michael Nielsen. http://neuralnetworksanddeeplearning.com/index.html 162 Convolutional networks - local receptive fields Every neuron is connected with a field of k × k (in this case 5 × 5) neurons in the lower layer (this field is receptive field). The neuron is "standard": Computes a weighted sum of its inputs and applies an activation function. 163 Convolutional networks - stride length Then we slide the local receptive field over by one pixel to the right (i.e., by one neuron) to connect to a second hidden neuron: The "size" of the slide is called stride length. The group of all such neurons is feature map. all these neurons share weights and biases! 164 Feature maps Each feature map represents a property of the input that is supposed to be spatially invariant. 
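A minimal NumPy sketch of how one such feature map is computed – a single k × k weight matrix and bias shared by all neurons of the map, stride 1 (the function and variable names and the ReLU activation are illustrative choices):

import numpy as np

def feature_map(image, w, b, stride=1):
    # one convolutional feature map: every hidden neuron applies the SAME
    # weights w and bias b to its own k x k receptive field of the input
    k = w.shape[0]
    rows, cols = image.shape
    out_h = (rows - k) // stride + 1
    out_w = (cols - k) // stride + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            field = image[r*stride:r*stride + k, c*stride:c*stride + k]
            out[r, c] = np.sum(w * field) + b   # weighted sum over the receptive field
    return np.maximum(out, 0.0)                 # activation function, e.g. ReLU

# a 28 x 28 input and a 5 x 5 receptive field give a 24 x 24 feature map
fm = feature_map(np.random.rand(28, 28), w=0.1 * np.random.randn(5, 5), b=0.0)
assert fm.shape == (24, 24)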
Typically, we consider several feature maps in a single layer. 165 Pooling Neurons in the pooling layer compute functions of their receptive fields: ▶ Max-pooling : maximum of inputs ▶ L2-pooling : square root of the sum of squres ▶ Average-pooling : mean ▶ · · · 166 Trained receptive fields (20 feature maps, receptive fields 5 × 5) 167 Trained feature maps 168 Simple convolutional network 28 × 28 input image, 3 feature maps, each feature map has its own max-pooling (field 5 × 5, stride = 1), 10 output neurons. Each neuron in the output layer gets input from each neuron in the pooling layer. Trained using backprop, which can be easily adapted to convolutional networks. 169 Convolutional network 170 Simple convolutional network vs MNIST two convolutional-pooling layers, one 20, second 40 feature maps, two dense (MLP) layers (1000-1000), outputs (10) ▶ Activation functions of the feature maps and dense layers: ReLU ▶ max-pooling ▶ output layer: soft-max ▶ Error function: negative log-likelihood (= cross-entropy) ▶ Training: SGD, mini-batch size 10 ▶ learning rate 0.03 ▶ L2 regularization with "weight" λ = 0.1 + dropout with prob. 1/2 ▶ training for 40 epochs (i.e., every training example is considered 40 times) ▶ Expanded dataset: displacement by one pixel to an arbitrary direction. ▶ Committee voting of 5 networks. 171 MNIST Out of 10,000 images in the test set, only these 33 have been incorrectly classified: 172 More complex convolutional networks Convolutional networks have been used for the classification of images from the ImageNet database (16 million color images, 20 thousand classes) 173 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) Competition in classification over a subset of images from ImageNet. Started in 2010, assisted in a breakthrough in image recognition. The training set 1.2 million images, 1000 classes. Validation set: 50 000, test set: 150 000. Many images contain more than one object ⇒ model is allowed to choose five classes, the correct label must be among the five. (top-5 criterion). 174 ILSVRC 2014 The same set as in 2012, top-5 criterion. GoogLeNet: deep convolutional network, 22 layers Results: ▶ Accuracy 93.33% top-5 175 ILSVRC 2015 ▶ Deep convolutional network - ResNet ▶ Various numbers of layers, the winner has 152 layers ▶ Skip connections implementing residual learning ▶ Error 3.57% in top-5. 176 Further development of CNN architectures Convolutional networks made a breakthrough in image recognition. See IB031 for discussion of ILSVRC challenge. 177 Further development of CNN architectures Convolutional networks made a breakthrough in image recognition. See IB031 for discussion of ILSVRC challenge. There are myriads of CNNs for ▶ Image classification Binary, or many classes, 2D, 3D, ... nD "images" and movies. Various architectures. ▶ Image segmentation I.e., partitioning pixels of images, a HUGE amount of CNN variants (U-Net, fully convolutional, etc.) ▶ Object detection Bounding box prediction, R-CNN architectures ▶ ... 177 Further development of CNN architectures Convolutional networks made a breakthrough in image recognition. See IB031 for discussion of ILSVRC challenge. There are myriads of CNNs for ▶ Image classification Binary, or many classes, 2D, 3D, ... nD "images" and movies. Various architectures. ▶ Image segmentation I.e., partitioning pixels of images, a HUGE amount of CNN variants (U-Net, fully convolutional, etc.) ▶ Object detection Bounding box prediction, R-CNN architectures ▶ ... 
For more details, see PA228 Machine Learning in Image Processing. 177 Convolutional networks – learning theory 178 Convolutional networks – architecture Neurons organized in layers, L0, L1, . . . , Ln, connections (typically) only from Lm to Lm+1. 179 Convolutional networks – architecture Neurons organized in layers, L0, L1, . . . , Ln, connections (typically) only from Lm to Lm+1. Several types of layers: ▶ input layer L0 179 Convolutional networks – architecture Neurons organized in layers, L0, L1, . . . , Ln, connections (typically) only from Lm to Lm+1. Several types of layers: ▶ input layer L0 ▶ dense layer Lm: Each neuron of Lm connected with each neuron of Lm−1. 179 Convolutional networks – architecture Neurons organized in layers, L0, L1, . . . , Ln, connections (typically) only from Lm to Lm+1. Several types of layers: ▶ input layer L0 ▶ dense layer Lm: Each neuron of Lm connected with each neuron of Lm−1. ▶ convolutional layer Lm: Neurons organized into disjoint feature maps, all neurons of a given feature map share weights (but have different inputs) 179 Convolutional networks – architecture Neurons organized in layers, L0, L1, . . . , Ln, connections (typically) only from Lm to Lm+1. Several types of layers: ▶ input layer L0 ▶ dense layer Lm: Each neuron of Lm connected with each neuron of Lm−1. ▶ convolutional layer Lm: Neurons organized into disjoint feature maps, all neurons of a given feature map share weights (but have different inputs) ▶ pooling layer: "Neurons" organized into pooling maps, all neurons ▶ compute a simple aggregate function (such as max), ▶ have disjoint inputs. Pooling after convolution is usually applied to each feature map separately. I.e., a single pooling map after each feature map. 179 Convolutional networks – architecture ▶ Denote ▶ X a set of input neurons ▶ Y a set of output neurons ▶ Z a set of all neurons (X, Y ⊆ Z) ▶ individual neurons denoted by indices i, j etc. ▶ ξj is the inner potential of the neuron j after the computation stops ▶ yj is the output of the neuron j after the computation stops (define y0 = 1 is the value of the formal unit input) ▶ wji is the weight of the connection from i to j (in particular, wj0 is the weight of the connection from the formal unit input, i.e. wj0 = −bj where bj is the bias of the neuron j) ▶ j← is a set of all i such that j is adjacent from i (i.e. there is an arc to j from i) ▶ j→ is a set of all i such that j is adjacent to i (i.e. there is an arc from j to i) ▶ [ji] is a set of all connections (i.e. pairs of neurons) sharing the weight wji. 180 Convolutional networks – activity ▶ neurons of dense and convolutional layers: ▶ inner potential of neuron j: ξj = i∈j← wjiyi ▶ activation function σj for neuron j (arbitrary differentiable): yj = σj(ξj) 181 Convolutional networks – activity ▶ neurons of dense and convolutional layers: ▶ inner potential of neuron j: ξj = i∈j← wjiyi ▶ activation function σj for neuron j (arbitrary differentiable): yj = σj(ξj) ▶ Neurons of pooling layers: Apply the "pooling" function: ▶ max-pooling: yj = max i∈j← yi ▶ avg-pooling: yj = i∈j← yi |j←| A convolutional network is evaluated layer-wise (as MLP), for each j ∈ Y we have that yj(⃗w,⃗x) is the value of the output neuron j after evaluating the network with weights ⃗w and input ⃗x. 181 Convolutional networks – learning Learning: ▶ Given a training set T of the form ⃗xk , ⃗dk k = 1, . . . , p Here, every ⃗xk ∈ R|X| is an input vector end every ⃗dk ∈ R|Y| is the desired network output. 
For every j ∈ Y, denote by dkj the desired output of the neuron j for a given network input ⃗xk (the vector ⃗dk can be written as dkj j∈Y ). ▶ Error function – squared error (for example): E(⃗w) = p k=1 Ek (⃗w) where Ek (⃗w) = 1 2 j∈Y yj(⃗w,⃗xk ) − dkj 2 182 Convolutional networks – SGD The algorithm computes a sequence of weight vectors ⃗w(0), ⃗w(1), ⃗w(2), . . .. ▶ weights in ⃗w(0) are randomly initialized to values close to 0 ▶ in the step t + 1 (here t = 0, 1, 2 . . .), weights ⃗w(t+1) are computed as follows: ▶ Choose (randomly) a set of training examples T ⊆ {1, . . . , p} ▶ Compute ⃗w(t+1) = ⃗w(t) + ∆⃗w(t) where ∆⃗w(t) = −ε(t) · 1 |T| k∈T ∇Ek (⃗w(t) ) Here T is a minibatch (of a fixed size), ▶ 0 < ε(t) ≤ 1 is a learning rate in step t + 1 ▶ ∇Ek (⃗w(t)) is the gradient of the error of the example k Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially. Epoch consists of one round through all data. 183 Backprop Recall that ∇Ek (⃗w(t)) is a vector of all partial derivatives of the form ∂Ek ∂wji . How to compute ∂Ek ∂wji ? 184 Backprop Recall that ∇Ek (⃗w(t)) is a vector of all partial derivatives of the form ∂Ek ∂wji . How to compute ∂Ek ∂wji ? First, switch from derivatives w.r.t. wji to derivatives w.r.t. yj: ▶ Recall that for every wji where j is in a dense layer, i.e. does not share weights: ∂Ek ∂wji = ∂Ek ∂yj · σ′ j (ξj) · yi 184 Backprop Recall that ∇Ek (⃗w(t)) is a vector of all partial derivatives of the form ∂Ek ∂wji . How to compute ∂Ek ∂wji ? First, switch from derivatives w.r.t. wji to derivatives w.r.t. yj: ▶ Recall that for every wji where j is in a dense layer, i.e. does not share weights: ∂Ek ∂wji = ∂Ek ∂yj · σ′ j (ξj) · yi ▶ Now for every wji where j is in a convolutional layer: ∂Ek ∂wji = rℓ∈[ji] ∂Ek ∂yr · σ′ r (ξr ) · yℓ ▶ Neurons of pooling layers do not have weights. 184 Backprop Now compute derivatives w.r.t. yj: ▶ for every j ∈ Y: ∂Ek ∂yj = yj − dkj This holds for the squared error, for other error functions the derivative w.r.t. outputs will be different. 185 Backprop Now compute derivatives w.r.t. yj: ▶ for every j ∈ Y: ∂Ek ∂yj = yj − dkj This holds for the squared error, for other error functions the derivative w.r.t. outputs will be different. ▶ for every j ∈ Z ∖ Y such that j→ is either a dense layer, or a convolutional layer: ∂Ek ∂yj = r∈j→ ∂Ek ∂yr · σ′ r (ξr ) · wrj 185 Backprop Now compute derivatives w.r.t. yj: ▶ for every j ∈ Y: ∂Ek ∂yj = yj − dkj This holds for the squared error, for other error functions the derivative w.r.t. outputs will be different. ▶ for every j ∈ Z ∖ Y such that j→ is either a dense layer, or a convolutional layer: ∂Ek ∂yj = r∈j→ ∂Ek ∂yr · σ′ r (ξr ) · wrj ▶ for every j ∈ Z ∖ Y such that j→ is max-pooling: Then j→ = {i} for a single "max" neuron and we have ∂Ek ∂yj =    ∂Ek ∂yi if j = arg maxr∈i← yr 0 otherwise The gradient can be propagated from the output layer downwards as in MLP. 185 Convolutional networks – summary ▶ Conv. nets. are nowadays the most used networks in image processing (and also in other areas where input has some local, "spatially" invariant properties) 2024 update: Vision transformers are gradually taking over. ▶ Typically trained using the gradient descent and its modifications (such as Adam). ▶ Due to the weight sharing allow (very) deep architectures. ▶ Typically extended with more adjustments and tricks in their topologies. 186 The problem of cancer detection in WSI The problem: Detect cancer in this image. 
187 The problem of cancer detection in WSI ▶ WSI annotated by pathologists, not pixel-level precise! 188 Input data A WSI is too large: 105,185 px × 221,772 px. Cut into patches of size 512 px × 512 px. A patch is positive iff its inner square intersects the annotation. 189 Training on WSI Our dataset from the Masaryk Memorial Cancer Institute: ▶ 785 WSI from 166 patients (698 WSI for training, 87 WSI for testing) ▶ Cut into 7,878,675 patches for training, 193,235 patches for testing. 190 Training on WSI Our dataset from the Masaryk Memorial Cancer Institute: ▶ 785 WSI from 166 patients (698 WSI for training, 87 WSI for testing) ▶ Cut into 7,878,675 patches for training, 193,235 patches for testing. Dataset augmentation: ▶ random vertical and horizontal flips ▶ random color perturbations 190 Training on WSI Our dataset from the Masaryk Memorial Cancer Institute: ▶ 785 WSI from 166 patients (698 WSI for training, 87 WSI for testing) ▶ Cut into 7,878,675 patches for training, 193,235 patches for testing. Dataset augmentation: ▶ random vertical and horizontal flips ▶ random color perturbations ▶ Training data: three-step sampling: 1. randomly select a label 2. randomly select a slide containing at least a single patch with the label 3. randomly select a patch with the label from the slide 190 VGG16 3 × 3 convolutions, stride 1, padding 1. Max pooling 2 × 2, stride 2. 191 Training VGG16 on WSI ▶ VGG16 pretrained on ImageNet (off-the-shelf solution). Top fully connected parts removed, substituted with global max-pooling and a single dense layer. 192 Training VGG16 on WSI ▶ VGG16 pretrained on ImageNet (off-the-shelf solution). Top fully connected parts removed, substituted with global max-pooling and a single dense layer. ▶ The network has a single logistic output – the probability of cancer in the patch 192 Training VGG16 on WSI ▶ VGG16 pretrained on ImageNet (off-the-shelf solution). Top fully connected parts removed, substituted with global max-pooling and a single dense layer. ▶ The network has a single logistic output – the probability of cancer in the patch ▶ The error E = cross-entropy 192 Training VGG16 on WSI ▶ VGG16 pretrained on ImageNet (off-the-shelf solution). Top fully connected parts removed, substituted with global max-pooling and a single dense layer. ▶ The network has a single logistic output – the probability of cancer in the patch ▶ The error E = cross-entropy ▶ Training: ▶ RMSprop optimizer ▶ The "forgetting" hyperparameter: ρ = 0.9 ▶ The initial learning rate 5 × 10⁻⁵ ▶ If no improvement in E on validation data for 3 consecutive epochs ⇒ halve the learning rate ▶ If no improvement in ROCAUC on validation data for 5 consecutive epochs ⇒ terminate ▶ Momentum with the weight α = 0.9 192 Prediction 193 Model evaluation – attempt 2 Can we detect cancer in patches? Predict I as positive iff F(I) ≥ 0.75 OK, does it detect cancer? 194 Model evaluation – attempt 3 – FROC Can we detect particular tumors? How to evaluate the quality of tumor detection? 195 Model evaluation – attempt 3 – FROC sensitivity ≈ the proportion of tumors containing at least one patch I with F(I) ≥ t w.r.t. all tumors in all slides AvgFP ≈ the average number of patches I with F(I) ≥ t in each non-cancerous slide 196 Is it good? ▶ The tile-based metric does not tell us whether the cancer will be detected ▶ The FROC curve captures tumor detection – however, a WSI contains only cuts across tumors ▶ WSI-level evaluation needs huge amounts of WSIs Our later results showed that the system has approx.
0.98 AUC for WSI level tumor detection on thousands of WSI from MMCI ▶ The system broke down when we used WSIs from a completely different scanner and different hospital (they used slightly different colors). We have corrected this by appropriate color normalization. But more data from more hospitals is needed! 197 Explainable methods (XAI) 198 199 What is XAI? IBM: "Explainable artificial intelligence (XAI) is a set of processes and methods that allows human users to comprehend and trust the results and output created by machine learning algorithms." https://www.ibm.com/topics/explainable-ai ISO/IEC TR 29119-11:2020(en): ▶ Interpretability: Level of understanding how the underlying (AI) technology works. ▶ Explainability: Level of understanding how the AI-based system came up with a given result. 200 XAI methods The goal is to understand how and why the network does what it does. We will consider classification models only. 201 202 XAI methods Many methods are available. We consider only a few representative methods for interpreting learning models. Methods based on various principles: ▶ Visualize weights and feature maps ▶ Visualize the most important inputs for a given class ▶ Visualize the effect of input perturbations on the output ▶ Construct an interpretable surrogate model 203 Alex-net - filters of the first convolutional layer ▶ 64 filters of depth 3 (RGB) ▶ Combined each filter RGB channels into one RGB image of size 11x11x3. 204 CNN - feature maps 205 CNN - feature maps - radar target classification Synthetic-aperture radar (SAR) – used to create two-dimensional images or three-dimensional reconstructions of objects, such as landscapes. 206 Maximizing input Now what if we try to find the most "representative" input vector for a given class? 207 Maximizing input Now what if we try to find the most "representative" input vector for a given class? Assume a trained model giving a score for each class given an input vector. 207 Maximizing input Now what if we try to find the most "representative" input vector for a given class? Assume a trained model giving a score for each class given an input vector. ▶ Denote by ξi(⃗x) the inner potential of the output neuron i ∈ Y given a network input vector ⃗x. 207 Maximizing input Now what if we try to find the most "representative" input vector for a given class? Assume a trained model giving a score for each class given an input vector. ▶ Denote by ξi(⃗x) the inner potential of the output neuron i ∈ Y given a network input vector ⃗x. ▶ Maximize ξi(⃗x) − λ ⃗x 2 2 over all input vectors ⃗x. 207 Maximizing input Now what if we try to find the most "representative" input vector for a given class? Assume a trained model giving a score for each class given an input vector. ▶ Denote by ξi(⃗x) the inner potential of the output neuron i ∈ Y given a network input vector ⃗x. ▶ Maximize ξi(⃗x) − λ ⃗x 2 2 over all input vectors ⃗x. ▶ A maximizing input vector computed using the gradient ascent. ▶ Gives the most "representative" input vector of the class represented by the neuron i. 207 Maximizing input - example 208 Input specific saliency maps The goal: Label features in a given input that are "most important" for the output of the network. 209 Input specific saliency maps The goal: Label features in a given input that are "most important" for the output of the network. 
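The simplest way to do this is to backpropagate the output score ξ_i to the input, which is the idea behind the gradient-based approaches listed next. A minimal sketch, assuming a helper grad_wrt_input(model, x, i) (an assumed name) that returns the vector of derivatives ∂ξ_i/∂y_k computed by backpropagation:

import numpy as np

def gradient_saliency(grad_wrt_input, model, x, i):
    # importance of each input feature = magnitude of the derivative of the
    # class score xi_i with respect to that feature, at the fixed input x
    g = grad_wrt_input(model, x, i)
    return np.abs(g)                 # larger magnitude = more important feature

SmoothGrad, described below, simply averages such maps over several noisy copies of x; occlusion-based methods replace the derivative by the change of the score when a part of the input is covered.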
Various approaches: ▶ gradient based ▶ Gradient saliency maps ▶ GradCAM ▶ SmoothGrad ▶ · · · ▶ occlusion based ▶ Simple occlusion maps ▶ LIME ▶ · · · 209 Gradient based saliency ▶ Let us fix an output neuron i and an input vector ⃗x. 210 Gradient based saliency ▶ Let us fix an output neuron i and an input vector ⃗x. ▶ Idea: Rank every input neuron k ∈ X based on its influence on the value ξi(⃗x). Note that the vector of input values is fixed. 210 Gradient based saliency ▶ Let us fix an output neuron i and an input vector ⃗x. ▶ Idea: Rank every input neuron k ∈ X based on its influence on the value ξi(⃗x). Note that the vector of input values is fixed. For every input neuron k ∈ X we consider ∂ξi ∂yk (⃗x) to measure the importance of the input yk for the output potential ξi with respect to the particular input vector ⃗x. 210 Gradient based saliency ▶ Let us fix an output neuron i and an input vector ⃗x. ▶ Idea: Rank every input neuron k ∈ X based on its influence on the value ξi(⃗x). Note that the vector of input values is fixed. For every input neuron k ∈ X we consider ∂ξi ∂yk (⃗x) to measure the importance of the input yk for the output potential ξi with respect to the particular input vector ⃗x. ▶ Note that saliency comes from a surrogate local linear model given by the first-order Taylor approximation: ξi(⃗x′ ) ≈ ξi(⃗x) + ∂ξi ∂X (⃗x) (⃗x′ − ⃗x) Here ∂ξi ∂X is the vector of all partial derivatives ∂ξi ∂yk where k ∈ X. 210 Saliency maps - example 211 Saliency maps - example Quite noisy, the signal is spread and does not say much about the perception of the owl. 212 Saliency maps - example SmoothGrad: ▶ Do the following several times: ▶ Add noise to the input image ▶ Compute a saliency map ▶ Average the resulting saliency maps. 213 GradCAM ▶ Consider a convolutional network and fix an input image I of the network. ALL values of all neurons yj are computed on the input I. 214 GradCAM ▶ Consider a convolutional network and fix an input image I of the network. ALL values of all neurons yj are computed on the input I. ▶ Fix a convolutional layer L consisting of convolutional feature maps F1, . . . , Fk . Each Fℓ is a set of neurons that belong to the feature map Fℓ . Slightly abusing notation, we write Fℓ(I) to denote the tensor of all values of all neurons in Fℓ(I). 214 GradCAM ▶ Consider a convolutional network and fix an input image I of the network. ALL values of all neurons yj are computed on the input I. ▶ Fix a convolutional layer L consisting of convolutional feature maps F1, . . . , Fk . Each Fℓ is a set of neurons that belong to the feature map Fℓ . Slightly abusing notation, we write Fℓ(I) to denote the tensor of all values of all neurons in Fℓ(I). ▶ Fix an output neuron i ∈ Y with the inner potential ξi. Compute the average importance of Fℓ(I): αℓ i = 1 |Fℓ| j∈Fℓ ∂ξi ∂yj (Fℓ (I)) 214 GradCAM ▶ Consider a convolutional network and fix an input image I of the network. ALL values of all neurons yj are computed on the input I. ▶ Fix a convolutional layer L consisting of convolutional feature maps F1, . . . , Fk . Each Fℓ is a set of neurons that belong to the feature map Fℓ . Slightly abusing notation, we write Fℓ(I) to denote the tensor of all values of all neurons in Fℓ(I). ▶ Fix an output neuron i ∈ Y with the inner potential ξi. 
Compute the average importance of Fℓ(I): αℓ i = 1 |Fℓ| j∈Fℓ ∂ξi ∂yj (Fℓ (I)) and the final gradCAM heat map for L is obtained using ML i = ReLU   k ℓ=1 αℓ i Fℓ (I)   214 GradCAM on VGG16 215 GradCAM on VGG16 Consider the last convolutional layer of the VGG16 (Block5, 215 GradCAM on VGG16 From left to right: ▶ An image of a cat (has to be resized to 224 × 224 to fit VGG16) ▶ The gradCAM heat map for the last convolutional layer and the class "cat" ▶ Rescaled and smoothed gradCAM heat map. ▶ The gradCAM overlay. 216 Occlusion ▶ Systematically cover parts of the input image. ▶ Observe the effect on the output value. ▶ Find regions with the largest effect. 217 Occlusion - example 218 Occlusion - example 219 LIME - for images Let us fix an image I to be explained. 220 LIME - for images Let us fix an image I to be explained. Outline: ▶ Consider superpixels of I as interpretable components. 220 LIME - for images Let us fix an image I to be explained. Outline: ▶ Consider superpixels of I as interpretable components. ▶ Construct a linear model approximating the network around the image I with weights corresponding to the superpixels. 220 LIME - for images Let us fix an image I to be explained. Outline: ▶ Consider superpixels of I as interpretable components. ▶ Construct a linear model approximating the network around the image I with weights corresponding to the superpixels. ▶ Select the superpixels with weights of large magnitude as the important ones. 220 Superpixels as interpretable components Denote by P1, . . . , Pℓ all superpixels of I. 221 Superpixels as interpretable components Denote by P1, . . . , Pℓ all superpixels of I. Consider binary vectors ⃗x = (x1, . . . , xℓ) ∈ {0, 1}ℓ . 221 Superpixels as interpretable components Denote by P1, . . . , Pℓ all superpixels of I. Consider binary vectors ⃗x = (x1, . . . , xℓ) ∈ {0, 1}ℓ . Each such vector ⃗x determines a "subimage" I[⃗x] of I obtained by removing all Pk with xk = 0. 221 LIME ▶ Let us fix an output neuron i, we denote by ξi(J) the inner potential of the output neuron i for the input image J. 222 LIME ▶ Let us fix an output neuron i, we denote by ξi(J) the inner potential of the output neuron i for the input image J. ▶ Given the image I to be interpreted, consider the following training set: T = (⃗x1, ξi(I[⃗x1])), . . . , (⃗xp, ξi(I[⃗xp]) Here ⃗xh = (xh1, . . . , xhℓ) are (some) binary vectors of {0, 1}. E.g., randomly selected. 222 LIME ▶ Let us fix an output neuron i, we denote by ξi(J) the inner potential of the output neuron i for the input image J. ▶ Given the image I to be interpreted, consider the following training set: T = (⃗x1, ξi(I[⃗x1])), . . . , (⃗xp, ξi(I[⃗xp]) Here ⃗xh = (xh1, . . . , xhℓ) are (some) binary vectors of {0, 1}. E.g., randomly selected. ▶ Train a linear model (ADALINE) with weights w0, w1, . . . , wℓ on T . Intuitively, the linear model approximates the network on "subimages" of I obtained by removing some superpixels. ▶ Inspect the weights (magnitude and sign). 222 LIME More precisely, we train a linear model (ADALINE) F with weights ⃗w = w0, w1, . . . , wℓ on T minimizing the weighted mean-squared error E(⃗w) = 1 p p k=1 πk · (F(⃗xk ) − ξi(I[⃗xk ]))2 + Ω(⃗w) where ▶ the weights are defined by πk = exp   −( 1 − 1 − (sk /ℓ) )2 2ν2   Here sk is the number of elements in ⃗xk equal to zero, ℓ is the number of superpixels, ν determines how much perturbed images are taken into account in the error. Small ν means that πk is close to zero for ⃗xk with many zeros. 
▶ Ω(⃗w) is a regularization term making the number of non-zero weights as small as possible. 223 LIME - example 224 LIME - example 225 LIME - example 226 LIME - example 227 Recurrent Neural Networks 228 RNN ▶ Input: ⃗x = (x1, . . . , xM) ▶ Hidden: ⃗h = (h1, . . . , hH) ▶ Output: ⃗y = (y1, . . . , yN) 229 RNN example Activation function: σ(ξ) =    1 ξ ≥ 0 0 ξ < 0 y 1 0 1 h (0, 0) (1, 1) (1, 0) (0, 1) · · · x (0, 0) (1, 0) (1, 1) 230 RNN example Activation function: σ(ξ) =    1 ξ ≥ 0 0 ξ < 0 y ⃗y1 = 1 ⃗y2 = 0 ⃗y3 = 1 h ⃗h0 = (0, 0) ⃗h1 = (1, 1) ⃗h2 = (1, 0) ⃗h3 = (0, 1) · · · x ⃗x1 = (0, 0) ⃗x2 = (1, 0) ⃗x3 = (1, 1) 230 RNN example y ⃗y1 = 1 ⃗y2 = 0 ⃗y3 = 1 h ⃗h0 = (0, 0) ⃗h1 = (1, 1) ⃗h2 = (1, 0) ⃗h3 = (0, 1) · · · x ⃗x1 = (0, 0) ⃗x2 = (1, 0) ⃗x3 = (1, 1) 230 RNN – formally ▶ M inputs: ⃗x = (x1, . . . , xM) ▶ H hidden neurons: ⃗h = (h1, . . . , hH) ▶ N output neurons: ⃗y = (y1, . . . , yN) ▶ Weights: ▶ Ukk′ from input xk′ to hidden hk ▶ Wkk′ from hidden hk′ to hidden hk ▶ Vkk′ from hidden hk′ to output yk 231 RNN – formally ▶ Input sequence: x = ⃗x1, . . . ,⃗xT ⃗xt = (xt1, . . . , xtM) 232 RNN – formally ▶ Input sequence: x = ⃗x1, . . . ,⃗xT ⃗xt = (xt1, . . . , xtM) ▶ Hidden sequence: h = ⃗h0, ⃗h1, . . . , ⃗hT ⃗ht = (ht1, . . . , htH) We have ⃗h0 = (0, . . . , 0) and ⃗htk = σ   M k′=1 Ukk′ xtk′ + H k′=1 Wkk′ h(t−1)k′   232 RNN – formally ▶ Input sequence: x = ⃗x1, . . . ,⃗xT ⃗xt = (xt1, . . . , xtM) ▶ Hidden sequence: h = ⃗h0, ⃗h1, . . . , ⃗hT ⃗ht = (ht1, . . . , htH) We have ⃗h0 = (0, . . . , 0) and ⃗htk = σ   M k′=1 Ukk′ xtk′ + H k′=1 Wkk′ h(t−1)k′   ▶ Output sequence: y = ⃗y1, . . . ,⃗yT ⃗yt = (yt1, . . . , ytN) where ytk = σ H k′=1 Vkk′ htk′ . 232 RNN – in matrix form ▶ Input sequence: x = ⃗x1, . . . ,⃗xT 233 RNN – in matrix form ▶ Input sequence: x = ⃗x1, . . . ,⃗xT ▶ Hidden sequence: h = ⃗h0, ⃗h1, . . . , ⃗hT where ⃗h0 = (0, . . . , 0) and ⃗ht = σ(U⃗xt + W⃗ht−1) 233 RNN – in matrix form ▶ Input sequence: x = ⃗x1, . . . ,⃗xT ▶ Hidden sequence: h = ⃗h0, ⃗h1, . . . , ⃗hT where ⃗h0 = (0, . . . , 0) and ⃗ht = σ(U⃗xt + W⃗ht−1) ▶ Output sequence: y = ⃗y1, . . . ,⃗yT where yt = σ(Vht ) 233 RNN – Comments ▶ ⃗ht is the memory of the network, captures what happened in all previous steps (with decaying quality). ▶ RNN shares weights U, V, W along the sequence. Note the similarity to convolutional networks where the weights were shared spatially over images, here they are shared temporally over sequences. ▶ RNN can deal with sequences of variable length. Compare with MLP which accepts only fixed-dimension vectors on input. 234 Binary adder The Task: Design a recurrent network with a single hidden layer which works as a binary adder. Example of behavior: Input two binary numbers, e.g., 111 and 101 (we assume that the least significant bit is on the left). The input of the network will be: (1, 1), (1, 0), (1, 1) The output is supposed to be: 0, 0, 1 (we ignore the carry at the end). 235 RNN – training Training set T = (x1, d1), . . . , (xp, dp) here ▶ each xℓ = ⃗xℓ1, . . . ,⃗xℓTℓ is an input sequence, ▶ each dℓ = ⃗dℓ1, . . . , ⃗dℓTℓ is an expected output sequence. Here each ⃗xℓt = (xℓt1, . . . , xℓtM) is an input vector and each ⃗dℓt = (dℓt1, . . . , dℓtN) is an expected output vector. 236 Error function In what follows I will consider a training set with a single element (x, d). I.e. drop the index ℓ and have ▶ x = ⃗x1, . . . ,⃗xT where ⃗xt = (xt1, . . . , xtM) ▶ d = ⃗d1, . . . , ⃗dT where ⃗dt = (dt1, . . . 
, dtN) The squared error of (x, d) is defined by E(x,d) = T t=1 N k=1 1 2 (ytk − dtk )2 Recall that we have a sequence of network outputs y = ⃗y1, . . . ,⃗yT and thus ytk is the k-th component of ⃗yt 237 Gradient descent (single training example) Consider a single training example (x, d). The algorithm computes a sequence of weight matrices as follows: 238 Gradient descent (single training example) Consider a single training example (x, d). The algorithm computes a sequence of weight matrices as follows: ▶ Initialize all weights randomly close to 0. 238 Gradient descent (single training example) Consider a single training example (x, d). The algorithm computes a sequence of weight matrices as follows: ▶ Initialize all weights randomly close to 0. ▶ In the step ℓ + 1 (here ℓ = 0, 1, 2, . . .) compute "new" weights U(ℓ+1), V(ℓ+1), W(ℓ+1) from the "old" weights U(ℓ), V(ℓ), W(ℓ) as follows: U (ℓ+1) kk′ = U (ℓ) kk′ − ε(ℓ) · δE(x,d) δUkk′ V (ℓ+1) kk′ = V (ℓ) kk′ − ε(ℓ) · δE(x,d) δVkk′ W (ℓ+1) kk′ = W (ℓ) kk′ − ε(ℓ) · δE(x,d) δWkk′ 238 Gradient descent (single training example) Consider a single training example (x, d). The algorithm computes a sequence of weight matrices as follows: ▶ Initialize all weights randomly close to 0. ▶ In the step ℓ + 1 (here ℓ = 0, 1, 2, . . .) compute "new" weights U(ℓ+1), V(ℓ+1), W(ℓ+1) from the "old" weights U(ℓ), V(ℓ), W(ℓ) as follows: U (ℓ+1) kk′ = U (ℓ) kk′ − ε(ℓ) · δE(x,d) δUkk′ V (ℓ+1) kk′ = V (ℓ) kk′ − ε(ℓ) · δE(x,d) δVkk′ W (ℓ+1) kk′ = W (ℓ) kk′ − ε(ℓ) · δE(x,d) δWkk′ The above is THE learning algorithm that modifies weights! 238 Backpropagation Computes the derivatives of E, no weights are modified! 239 Backpropagation Computes the derivatives of E, no weights are modified! δE(x,d) δUkk′ = T t=1 δE(x,d) δhtk · σ′ · xtk′ k′ = 1, . . . , M δE(x,d) δVkk′ = T t=1 δE(x,d) δytk · σ′ · htk′ k′ = 1, . . . , H δE(x,d) δWkk′ = T t=1 δE(x,d) δhtk · σ′ · h(t−1)k′ k′ = 1, . . . , H 239 Backpropagation Computes the derivatives of E, no weights are modified! δE(x,d) δUkk′ = T t=1 δE(x,d) δhtk · σ′ · xtk′ k′ = 1, . . . , M δE(x,d) δVkk′ = T t=1 δE(x,d) δytk · σ′ · htk′ k′ = 1, . . . , H δE(x,d) δWkk′ = T t=1 δE(x,d) δhtk · σ′ · h(t−1)k′ k′ = 1, . . . , H Backpropagation: δE(x,d) δytk = ytk − dtk (assuming squared error) δE(x,d) δhtk = N k′=1 δE(x,d) δytk′ · σ′ · Vk′k + H k′=1 δE(x,d) δh(t+1)k′ · σ′ · Wk′k 239 Long-term dependencies δE(x,d) δhtk = N k′=1 δE(x,d) δytk′ · σ′ · Vk′k + H k′=1 δE(x,d) δh(t+1)k′ · σ′ · Wk′k ▶ Unless H k′=1 σ′ · Wk′k ≈ 1, the gradient either vanishes, or explodes. ▶ For a large T (long-term dependency), the gradient "deeper" in the past tends to be too small (large). ▶ A solution: LSTM etc. LSTM is currently a bit obsolete. The main idea is to decompose W into several matrices, each responsible for a different task. One is concerned about memory, one is concerned about the output at each step, etc. 
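A tiny numerical illustration of this effect, under simplifying assumptions: an orthogonal recurrent matrix W and a constant scalar c standing in for the activation derivative σ′ (both are assumptions made only for this sketch):

import numpy as np

rng = np.random.default_rng(0)
H, T = 16, 50                                     # hidden dimension, number of time steps
W, _ = np.linalg.qr(rng.standard_normal((H, H)))  # orthogonal W, so its 2-norm is exactly 1

def backprop_factor_norm(c):
    # norm of the product of T per-step backprop factors c * W
    # (one factor per application of the chain rule through a time step)
    J = np.eye(H)
    for _ in range(T):
        J = (c * W) @ J
    return np.linalg.norm(J, 2)

print(backprop_factor_norm(0.9))   # 0.9**50 ~ 0.005 : the gradient vanishes
print(backprop_factor_norm(1.1))   # 1.1**50 ~ 117   : the gradient explodes

With c slightly below 1 the contribution of inputs far in the past becomes negligible; with c slightly above 1 it blows up. This is exactly the long-term dependency problem that LSTM tries to address.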
https://arxiv.org/pdf/2205.13504.pdf 240 LSTM ⃗ht = ⃗ot ◦ σh(⃗Ct ) output ⃗Ct = ⃗ft ◦ ⃗Ct−1 +⃗it ◦ ˜Ct memory ˜Ct = σh(WC · ⃗ht−1 + UC · ⃗xt ) new memory contents ⃗ot = σg(Wo · ⃗ht−1 + Uo · ⃗xt ) output gate ⃗ft = σg(Wf · ⃗ht−1 + Uf · ⃗xt ) forget gate ⃗it = σg(Wi · ⃗ht−1 + Ui · ⃗xt ) input gate ▶ ◦ is the component-wise product of vectors ▶ · is the matrix-vector product ▶ σh hyperbolic tangents (applied component-wise) ▶ σg logistic sigmoid (applied component-wise) 241 RNN vs LSTM Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ 242 LSTM ⇒ ⃗ht = ⃗ot ◦ σh(⃗Ct ) ⇒ ⃗Ct = ⃗ft ◦ ⃗Ct−1 +⃗it ◦ ˜Ct ⇒ ˜Ct = σh(WC · ⃗ht−1 + UC · ⃗xt ) ⇒ ⃗ot = σg(Wo · ⃗ht−1 + Uo · ⃗xt ) ⇒ ⃗ft = σg(Wf · ⃗ht−1 + Uf · ⃗xt ) ⇒⃗it = σg(Wi · ⃗ht−1 + Ui · ⃗xt ) Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ 243 LSTM ⇒ ⃗ht = ⃗ot ◦ σh(⃗Ct ) ⇒ ⃗Ct = ⃗ft ◦ ⃗Ct−1 +⃗it ◦ ˜Ct ⇒ ˜Ct = σh(WC · ⃗ht−1 + UC · ⃗xt ) ⇒ ⃗ot = σg(Wo · ⃗ht−1 + Uo · ⃗xt ) ⇒ ⃗ft = σg(Wf · ⃗ht−1 + Uf · ⃗xt ) ⇒⃗it = σg(Wi · ⃗ht−1 + Ui · ⃗xt ) Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ 243 LSTM ⇒ ⃗ht = ⃗ot ◦ σh(⃗Ct ) ⇒ ⃗Ct = ⃗ft ◦ ⃗Ct−1 +⃗it ◦ ˜Ct ⇒ ˜Ct = σh(WC · ⃗ht−1 + UC · ⃗xt ) ⇒ ⃗ot = σg(Wo · ⃗ht−1 + Uo · ⃗xt ) ⇒ ⃗ft = σg(Wf · ⃗ht−1 + Uf · ⃗xt ) ⇒⃗it = σg(Wi · ⃗ht−1 + Ui · ⃗xt ) Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ 243 LSTM ⇒ ⃗ht = ⃗ot ◦ σh(⃗Ct ) ⇒ ⃗Ct = ⃗ft ◦ ⃗Ct−1 +⃗it ◦ ˜Ct ⇒ ˜Ct = σh(WC · ⃗ht−1 + UC · ⃗xt ) ⇒ ⃗ot = σg(Wo · ⃗ht−1 + Uo · ⃗xt ) ⇒ ⃗ft = σg(Wf · ⃗ht−1 + Uf · ⃗xt ) ⇒⃗it = σg(Wi · ⃗ht−1 + Ui · ⃗xt ) Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ 243 LSTM ⇒ ⃗ht = ⃗ot ◦ σh(⃗Ct ) ⇒ ⃗Ct = ⃗ft ◦ ⃗Ct−1 +⃗it ◦ ˜Ct ⇒ ˜Ct = σh(WC · ⃗ht−1 + UC · ⃗xt ) ⇒ ⃗ot = σg(Wo · ⃗ht−1 + Uo · ⃗xt ) ⇒ ⃗ft = σg(Wf · ⃗ht−1 + Uf · ⃗xt ) ⇒⃗it = σg(Wi · ⃗ht−1 + Ui · ⃗xt ) Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ 243 LSTM – summary ▶ LSTM (almost) solves the vanishing gradient problem w.r.t. the "internal" state of the network. ▶ Learns to control its own memory (via forget gate). ▶ Revolution in machine translation and text processing. ... but the development goes on ... 244 Time-series Forecasting with LSTM (see: https://www.tensorflow.org/tutorials/structured_data/ time_series) ▶ Weather time series dataset ▶ 14 different features such as air temperature, atmospheric pressure, and humidity ▶ collected every 10 minutes, beginning in 2003 (only 2009 - 2016 considered in the example) The Task: Predict the temperature for the next hour. 245 Weather Data 246 Preprocessing (omitted) Before applying any prediction model, proper preprocessing is essential for time series data. ▶ Train-validation-test Split ▶ Data Normalization ▶ Detrending ▶ Seasonal Adjustment ▶ Smoothing 247 Baseline model The baseline: Predict that the temperature stays constant. 248 Simple linear model The linear model: Consider the current values of all variables and predict the temperature using linear regression. 249 Linear explained - weights 250 Data Windowing Assume that the samples are taken hourly (subsample the 10-minute samples). Consider windowed inputs to the model. E.g., predict one hour given 6 hours from the past: 251 Data Windowing 252 LSTM Given 24 hours in the past, predict the next hour with LSTM. A possible LSTM architecture: The used LSTM had the memory dimension equal to 32. 253 LSTM forecasting 254 Model comparison MAE = mean absolute error 255 Time-series prediction summary ▶ The presented approach is very basic! 
▶ Omitted lots of important ideas: ▶ Preprocessing - extremely important! ▶ Cross-validation - tricky! ▶ Classical models ARIMA etc. - very deep and advanced area! ▶ Proper evaluation, explainability, ... (a whole new course possible!) ▶ Read books, e.g. ▶ Hyndman and Athanasopoulos. Forecasting: Principles and Practice. Online: https://otexts.com/fpp2/ ▶ Manu Joseph. Modern Time Series Forecasting with Python. Packt Publishing. 2022 256 RNN text generator (a little obsolete example) Generating texts letter by letter. 257 Shakespeare ▶ Generating Shakespeare letter by letter. ▶ Trained on Shakespeare’s plays (4.4MB). VIOLA: Why, Salisbury must find his flesh and thought That which I am not aps, not a man and in fire, To show the reining of the raven and the wars To grace my hand reproach within, and not a fair are hand, That Caesar and my goodly father’s world; When I was heaven of presence and our fleets, We spare with hours, but cut thy council I am great, Murdered and by thy master’s ready there My power to give thee but so much as hell: Some service in the noble bondman here, Would show him to her wine. KING LEAR: O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder’d at the deeds, So drop upon your lordship’s head, and your opinion Shall be against your honour. 258 Wikipedia ▶ Hutter Prize 100MB dataset from Wikipedia (96MB) Naturalism and decision for the majority of Arab countries’ capitalide was grounded by the Irish language by [[John Clair]], [[An Imperial Japanese Revolt]], associated with Guangzham’s sovereignty. His generals were the powerful ruler of the Portugal in the [[Protestant Immineners]], which could be said to be directly in Cantonese Communication, which followed a ceremony and set inspired prison, training. The emperor travelled back to [[Antioch, Perth, October 25|21]] to note, the Kingdom of Costa Rica, unsuccessful fashioned the [[Thrales]], [[Cynth’s Dajoard]], known in western [[Scotland]], near Italy to the conquest of India with the conflict. Copyright was the succession of independence in the slop of Syrian influence that was a famous German movement based on a more popular servicious, non-doctrinal and sexual power post. Many governments recognize the military housing of the [[Civil Liberalization and Infantry Resolution 265 National Party in Hungary]], that is sympathetic to be to the [[Punjab Resolution]] (PJS)[http: //www.humah.yahoo.com/guardian.cfm/7754800786d17551963s89.htm Official economics Adjoint for the Nazism, Montgomery was swear to 259 Xml halucination: Antichrist 865 15900676 2002-08-03T18:14:12Z Paris 23 Automated conversion #REDIRECT [[Christianity]] 260 LaTeX ▶ Algebraic geometry textbook. ▶ LaTeX source (16MB). ▶ Almost compilable. 261 262 Linux source code ▶ Trained on all source files of Linux kernel concatenated into a single file (474MB of C code). 263 264 265 Evolution of Shakespeare 100 iter.: 300 iter.: 500 iter.: 700 iter.: 1200 iter.: 2000 iter.: 266 Attention Consider the following task: Given a sequence of vectors x = ⃗x1, . . . ,⃗xT generate a new sequence y = ⃗y1, . . . ,⃗yT′ of possibly different lengths (i.e., possibly T T′). E.g., a machine translation task, x is an embedding of an English sentence, y is a sequence of probability distributions on a German vocabulary. 267 Attention Consider two recurrent networks: ▶ Enc the encoder ▶ Hidden state ⃗h0 initialized by standard methods for recurrent networks ▶ Reads ⃗x1, . . . 
,⃗xT , does not output anything but produces a sequence of hidden states ⃗h1, . . . , ⃗hT 268 Attention Consider two recurrent networks: ▶ Enc the encoder ▶ Hidden state ⃗h0 initialized by standard methods for recurrent networks ▶ Reads ⃗x1, . . . ,⃗xT , does not output anything but produces a sequence of hidden states ⃗h1, . . . , ⃗hT ▶ Dec the decoder ▶ The initial hidden state is ⃗hT ▶ Does not read anything but outputs the sequence ⃗y1, . . . ,⃗yT′ This is a simplification. Typically, Dec reads ⃗y0,⃗y1, . . . ,⃗yT′−1 where ⃗y0 is a special vector embedding a separator. 268 Attention Consider two recurrent networks: ▶ Enc the encoder ▶ Hidden state ⃗h0 initialized by standard methods for recurrent networks ▶ Reads ⃗x1, . . . ,⃗xT , does not output anything but produces a sequence of hidden states ⃗h1, . . . , ⃗hT ▶ Dec the decoder ▶ The initial hidden state is ⃗hT ▶ Does not read anything but outputs the sequence ⃗y1, . . . ,⃗yT′ This is a simplification. Typically, Dec reads ⃗y0,⃗y1, . . . ,⃗yT′−1 where ⃗y0 is a special vector embedding a separator. Trained on pairs of sentences, able to learn a fine translation between major languages (if the recurrent networks are LSTM). Is not perfect because all info about x = ⃗x1, . . . ,⃗xT is squeezed into the single state vector ⃗hT . In particular, the network tends to forget the context of each word. 268 Attention in Recurrent Networks What if we provide the decoder with an information about the relevant context of the generated word? 269 Attention in Recurrent Networks What if we provide the decoder with an information about the relevant context of the generated word? We use the same encoder Enc producing the sequence of hidden states: ⃗h1, . . . , ⃗hT 269 Attention in Recurrent Networks What if we provide the decoder with an information about the relevant context of the generated word? We use the same encoder Enc producing the sequence of hidden states: ⃗h1, . . . , ⃗hT The decoder Dec is still a recurrent network but ▶ the hidden state ⃗h′ 0 initialized by ⃗hT and a sequence of hidden states ⃗h′ 0 , . . . , ⃗h′ T′ is computed, 269 Attention in Recurrent Networks What if we provide the decoder with an information about the relevant context of the generated word? We use the same encoder Enc producing the sequence of hidden states: ⃗h1, . . . , ⃗hT The decoder Dec is still a recurrent network but ▶ the hidden state ⃗h′ 0 initialized by ⃗hT and a sequence of hidden states ⃗h′ 0 , . . . , ⃗h′ T′ is computed, ▶ reads a sequence of context vectors ⃗c1, . . . ,⃗cT′ where ⃗ci = T j=1 αij ⃗hj where αij = exp(eij) T k=1 exp(eik ) where eij = MLP(⃗h′ i−1 , ⃗hj) ▶ outputs the sequence ⃗y1, . . . ,⃗yT′ 269 Do We Still Need the Recurrence? ▶ The attention mechanism extracts the information from the sequence quite well. 270 Do We Still Need the Recurrence? ▶ The attention mechanism extracts the information from the sequence quite well. ▶ Is there a reason for reading the input sequence sequentially? 270 Do We Still Need the Recurrence? ▶ The attention mechanism extracts the information from the sequence quite well. ▶ Is there a reason for reading the input sequence sequentially? ▶ Could we remove the recurrent network itself and preserve only the attention? 270 271 Self-Attention Layer (is all you need) Fix an input sequence: ⃗x1, . . . ,⃗xT Consider three learnable matrices: Wq, Wk , Wv Generate sequences of queries, keys, and values: ▶ ⃗q1, . . . ,⃗qT where ⃗qi = Wq⃗xi for all i = 1, . . . , T ▶ ⃗k1, . . . , ⃗kT where ⃗ki = Wk ⃗xi for all i = 1, . . . 
, T ▶ ⃗v1, . . . ,⃗vT where ⃗vi = Wv⃗xi for all i = 1, . . . , T 272 Self-Attention Layer (is all you need) Fix an input sequence: ⃗x1, . . . ,⃗xT Consider three learnable matrices: Wq, Wk , Wv Generate sequences of queries, keys, and values: ▶ ⃗q1, . . . ,⃗qT where ⃗qi = Wq⃗xi for all i = 1, . . . , T ▶ ⃗k1, . . . , ⃗kT where ⃗ki = Wk ⃗xi for all i = 1, . . . , T ▶ ⃗v1, . . . ,⃗vT where ⃗vi = Wv⃗xi for all i = 1, . . . , T Define a vector score for all i, j ∈ {1, . . . , T} by eij = ⃗qi · ⃗kj Intuitively, eij measures how much the input at the position i is related to the input at the position j, in other words, how much the query fits the key. Define αij = exp(eij / √ dattn) T k=1 exp(eik / √ dattn) dattn is the dimension of ⃗vi I.e., we apply the good old softmax to (ei1, . . . , eiT ) / √ dattn 272 Self-Attention Layer (is all you need) Define a vector score for all i, j ∈ {1, . . . , T} by eij = ⃗qi · ⃗kj Intuitively, eij measures how much the input at the position i is related to the input at the position j, in other words, how much the query fits the key. Define αij = exp(eij / √ dattn) T k=1 exp(eik / √ dattn) dattn is the dimension of ⃗vi I.e., we apply the good old softmax to (ei1, . . . , eiT ) / √ dattn Define a sequence of outputs ⃗y1, . . . ,⃗yT by ⃗yi = T j=1 αij · ⃗vj 272 Explainability and Self-Attention BertViz (https://arxiv.org/pdf/1904.02679) The intensity of the blue lines corresponds to αij 273 Bias Detection The visualization can be used to detect bias in the model: Here, according to the model, ▶ He ≈ Doctor ▶ She ≈ Nurse 274 Neuron View 275 Language Model A sequence of tokens a1, . . . , aT ∈ Σ∗ E.g. words from a vocabulary Σ. The goal: Maximize T k=1 P(ak | a1, . . . , ak−1; W) (= P(a1, . . . , aT ; W)) where ▶ P is the conditional probability measure over Σ modeled using a neural network with weights W. 276 Language Model A sequence of tokens a1, . . . , aT ∈ Σ∗ E.g. words from a vocabulary Σ. The goal: Maximize T k=1 P(ak | a1, . . . , ak−1; W) (= P(a1, . . . , aT ; W)) where ▶ P is the conditional probability measure over Σ modeled using a neural network with weights W. Can be used to generate text: Given a1, . . . , ak , sample ak+1 from P(ak+1 | a1, . . . , ak ; W) 276 GPT 277 GPT 278 Masked Self-Attention Layer (is all you need) Assume an attention mechanism which given an input sequence ⃗x1, . . . ,⃗xT generates ⃗y1, . . . ,⃗yT . The Problem: How to generate ⃗yk only based on ⃗x1, . . . ,⃗xk−1 ? 279 Masked Self-Attention Layer (is all you need) Assume an attention mechanism which given an input sequence ⃗x1, . . . ,⃗xT generates ⃗y1, . . . ,⃗yT . The Problem: How to generate ⃗yk only based on ⃗x1, . . . ,⃗xk−1 ? Define a vector score for all i, j ∈ {1, . . . , T} by eij =    ⃗qi · ⃗kj if j < i −∞ otherwise. This means that αij =    exp(eij / √ dattn) T k=1 exp(eik / √ dattn) if j < i 0 otherwise. 279 Masked Self-Attention Layer (is all you need) Assume an attention mechanism which given an input sequence ⃗x1, . . . ,⃗xT generates ⃗y1, . . . ,⃗yT . The Problem: How to generate ⃗yk only based on ⃗x1, . . . ,⃗xk−1 ? Define a vector score for all i, j ∈ {1, . . . , T} by eij =    ⃗qi · ⃗kj if j < i −∞ otherwise. This means that αij =    exp(eij / √ dattn) T k=1 exp(eik / √ dattn) if j < i 0 otherwise. Define a sequence of outputs ⃗y1, . . . ,⃗yT by ⃗yi = T j=1 αij · ⃗vj 279 Multi-head Self-Attention Layer (is all you need) Assume the number of heads is H. For h = 1, . . . 
, H the h-th head is an attention mechanism which given the input ⃗x1, . . . ,⃗xT produces ⃗yh 1 , . . . ,⃗yh T Note that the output may be different, which means that, in particular, the matrices Wq, Wk , Wv may be different for each head. Assume that all vectors ⃗yh k are of the same dimension dmid and consider a learnable matrix Wout of dimensions dout × (H · dmid). 280 Multi-head Self-Attention Layer (is all you need) Assume the number of heads is H. For h = 1, . . . , H the h-th head is an attention mechanism which given the input ⃗x1, . . . ,⃗xT produces ⃗yh 1 , . . . ,⃗yh T Note that the output may be different, which means that, in particular, the matrices Wq, Wk , Wv may be different for each head. Assume that all vectors ⃗yh k are of the same dimension dmid and consider a learnable matrix Wout of dimensions dout × (H · dmid). The multi-head attention produces the following output: ⃗y1, . . . ,⃗yT where ⃗yk = Wout · ⃗y1 k ⊙ ⃗y2 k ⊙ · · ·⃗yH k Here, ⊙ is a concatenation of vectors. 280 Multi-head Self-Attention Summary Input: A sequence ⃗x1, . . . ,⃗xT Output: A sequence ⃗y1, . . . ,⃗yT I.e., a sequence of the same length. The dimensions of ⃗yk and ⃗xk do not have to be equal. 281 Multi-head Self-Attention Summary Input: A sequence ⃗x1, . . . ,⃗xT Output: A sequence ⃗y1, . . . ,⃗yT I.e., a sequence of the same length. The dimensions of ⃗yk and ⃗xk do not have to be equal. Attention: Learnable parameters: Matrices Wq, Wk , Wv. These matrices are used to compute queries, keys, and values from ⃗x1, . . . ,⃗xT . Output ⃗y1, . . . ,⃗yT is computed using values "scaled" by the query-key attention. 281 Multi-head Self-Attention Summary Input: A sequence ⃗x1, . . . ,⃗xT Output: A sequence ⃗y1, . . . ,⃗yT I.e., a sequence of the same length. The dimensions of ⃗yk and ⃗xk do not have to be equal. Attention: Learnable parameters: Matrices Wq, Wk , Wv. These matrices are used to compute queries, keys, and values from ⃗x1, . . . ,⃗xT . Output ⃗y1, . . . ,⃗yT is computed using values "scaled" by the query-key attention. Multi-head attention: Learnable parameters: ▶ Matrices Wh q , Wh k , Wh v where h = 1, . . . , H and H is the number of heads. Each attention head operates independently on the input ⃗x1, . . . ,⃗xT . ▶ Matrix Wout . Linearly transforms the concatenated results of the attention heads. 281 GPT - transformer 282 Positional encoding The Goal: To encode a position (index) k ∈ {1, . . . , T} into a vector ⃗Pk of real numbers. 283 Positional encoding The Goal: To encode a position (index) k ∈ {1, . . . , T} into a vector ⃗Pk of real numbers. Assume that ⃗Pk should have a dimension d. Given a position k ∈ {1, . . . , T} and i ∈ {0, . . . , d/2} define Pk,2i = sin k n2i/d Pk,(2i+1) = cos k n2i/d Here n = 10000. A user defined constant, the original paper suggests n = 10000. 283 Positional encoding The Goal: To encode a position (index) k ∈ {1, . . . , T} into a vector ⃗Pk of real numbers. Assume that ⃗Pk should have a dimension d. Given a position k ∈ {1, . . . , T} and i ∈ {0, . . . , d/2} define Pk,2i = sin k n2i/d Pk,(2i+1) = cos k n2i/d Here n = 10000. A user defined constant, the original paper suggests n = 10000. Given an input sequence ⃗x1, . . . ,⃗xT we add the position embedding to each ⃗xk obtaining a new input sequence ⃗x′ 1 , . . . ,⃗x′ T where ⃗x′ k = ⃗xk + ⃗Pk 283 Positional encoding/embedding 284 Positional encoding/embedding ▶ Vertically: Sinusoidal functions ▶ Horizontally: Decreasing frequency For any offset o ∈ {1, . . . 
, T} there is a linear transformation M such that for any k ∈ {1, . . . , T − o} we have M⃗Pk = ⃗Pk+o. Intuitively, just rotate each component of the ⃗Pk appropriately. 285 GPT-2 - transformer 286 Layer normalization Given a vector ⃗x ∈ Rd, the layer normalization computes: ⃗x′ = γ · (⃗x − µ) σ + β Here ▶ µ = 1 d d i=1 xi and σ2 = 1 d d i=1(xi − µ)2 ▶ γ, β ∈ Rd are vectors of trainable parameters 287 Layer normalization Given a vector ⃗x ∈ Rd, the layer normalization computes: ⃗x′ = γ · (⃗x − µ) σ + β Here ▶ µ = 1 d d i=1 xi and σ2 = 1 d d i=1(xi − µ)2 ▶ γ, β ∈ Rd are vectors of trainable parameters In Transformer: The input to the layer normalization is a sequence of vectors: ⃗x1, . . . ,⃗xT . The layer normalization is applied to each ⃗xk , producing a sequence of "normalized" vectors. 287 GPT - learning A sequence of tokens a1, . . . , aT ∈ Σ and their one-hot encodings ⃗u1, . . . ,⃗uT ∈ {0, 1}|Σ| We assume that a1 is a special token marking the start of the sequence. Embed to vectors and add the position encoding (We is an embedding matrix): ⃗xk = We · ⃗uk + Pk ∈ Rsetd 288 GPT - learning A sequence of tokens a1, . . . , aT ∈ Σ and their one-hot encodings ⃗u1, . . . ,⃗uT ∈ {0, 1}|Σ| We assume that a1 is a special token marking the start of the sequence. Embed to vectors and add the position encoding (We is an embedding matrix): ⃗xk = We · ⃗uk + Pk ∈ Rsetd Apply the network (with the transformer block repeated 12x) to ⃗x1, . . . ,⃗xT and obtain ⃗y1, . . . ,⃗yT (Here assume that each ⃗yk ∈ [0, 1]Σ is a probability distribution on Σ) 288 GPT - learning A sequence of tokens a1, . . . , aT ∈ Σ and their one-hot encodings ⃗u1, . . . ,⃗uT ∈ {0, 1}|Σ| We assume that a1 is a special token marking the start of the sequence. Embed to vectors and add the position encoding (We is an embedding matrix): ⃗xk = We · ⃗uk + Pk ∈ Rsetd Apply the network (with the transformer block repeated 12x) to ⃗x1, . . . ,⃗xT and obtain ⃗y1, . . . ,⃗yT (Here assume that each ⃗yk ∈ [0, 1]Σ is a probability distribution on Σ) Compute the error: − T−1 ℓ=1 log ⃗yℓ[aℓ+1] Here ⃗yℓ[ak+1] is the probability of ak+1 in the distribution ⃗yk . 288 GPT - inference A sequence of tokens a1, . . . , aℓ ∈ Σ and their one-hot encodings ⃗u1, . . . ,⃗uℓ ∈ {0, 1}|Σ| Embed to vectors and add the position encoding: ⃗xk = We · ⃗uk + Pk ∈ Rsetd Apply the network to ⃗x1, . . . ,⃗xℓ and obtain ⃗y1, . . . ,⃗yℓ (Assume that each ⃗yk ∈ [0, 1]Σ is a probability distribution on Σ) Sample the next token from aℓ+1 ∼ ⃗yℓ https://transformer.huggingface.co/doc/distil-gpt2 289 Neural networks summary Architectures: ▶ Multi-layer perceptron (MLP): ▶ dense connections between layers ▶ Convolutional networks (CNN): ▶ local receptors, feature maps ▶ pooling ▶ Recurrent networks (RNN): ▶ self-loops but still feed-forward through time ▶ Transformer ▶ Attention, query-key-value Training: ▶ gradient descent algorithm + heuristics 290 Autoencoders An autoencoder consists of two parts: ▶ ϕ : Rn → Rm the encoder ▶ ψ : Rm → Rn the decoder The goal is to find ϕ, ψ so that ψ ◦ ϕ is (almost) identity. The value ⃗h = ϕ(⃗x) is called the latent representation of ⃗x. 291 Autoencoders – training Assume T = {⃗x1, . . . ,⃗xp} where ⃗xi ∈ Rn for all i ∈ {1, . . . , n}. Minimize the reconstruction error E = p i=1 (⃗xi − ψ(ϕ(⃗xi)))2 292 Autoencoders – neural networks Both ϕ and ψ can be represented using MLP Mϕ and Mψ, respectively. Mϕ and Mψ can be connected into a single network. 293 Autoencoders – Usage ▶ Compression – from ⃗x to ⃗h. 
▶ Dimensionality reduction – the latent representation ⃗h has a smaller dimension. ▶ Pretraining (next slides) ▶ Generative versions – (roughly) generate ⃗h from a known distribution, let Mψ generate realistic inputs ⃗x 294 Application – dimensionality reduction ▶ Dimensionality reduction: A mapping R from Rn to Rm where ▶ m < n, ▶ for every example ⃗x we have that ⃗x can be "reconstructed" from R(⃗x). 295 Application – dimensionality reduction ▶ Dimensionality reduction: A mapping R from Rn to Rm where ▶ m < n, ▶ for every example ⃗x we have that ⃗x can be "reconstructed" from R(⃗x). ▶ Standard method: PCA (there are many linear as well as non-linear variants) 295 Reconstruction – PCA 1024 pixels compressed to 100 dimensions (i.e. 100 numbers). 296 PCA vs Autoencoders 297 Autoencoders – Pretraining ▶ An autoencoder is (pre)trained on input data ⃗xi without desired outputs (unsupervised) typically much larger datasets of unlabelled data ▶ the encoder Mϕ computes a latent representation for every input vector, it is supposed to extract important features (controversial) ▶ A new part of the model Mtop is added on top of Mϕ (e.g. a MLP taking the output of Mϕ as an input). ▶ Subsequently, labels are added and the whole model (composed of Mϕ and Mtop) is trained on labelled data. 298 Autoencoders – Pretraining 299 Generative adversarial networks Generative adversarial Nets, Goodfellow et al, NIPS 2014 An unsupervised generative model. Two networks: ▶ Generator: A network computing a function G : Rk → Rn which takes a random input ⃗z with a distribution p⃗z (e.g., multivariate normal distribution) and returns G(⃗z) which should follow the target probability distribution. E.g., G(⃗z) could be realistically looking faces. ▶ Discriminator: A network computing a function D : Rn → [0, 1] that given ⃗x ∈ Rn gives a probability D(x) that ⃗x is not "generated" by G. E.g., ⃗x can be an image, D(⃗x) is a probability that it is a true face of an existing person. What error function will "motivate" G to generate realistically and D to discriminate appropriately? 300 Generative adversarial networks – error function Let T = {⃗x1, . . . ,⃗xp} be a training multiset (or a minibatch). Intuition: G should produce outputs similar to elements of T . D should recognize that its input is not from T . 301 Generative adversarial networks – error function Let T = {⃗x1, . . . ,⃗xp} be a training multiset (or a minibatch). Intuition: G should produce outputs similar to elements of T . D should recognize that its input is not from T . Generate a multiset of noise samples: F = {⃗z1, . . . ,⃗zp} from the distribution p⃗z. ET ,F (G, D) = − 1 p p i=1 ln D(⃗xi) + ln(1 − D(G(⃗zi))) This is just the binary cross entropy error of D which classifies the input as either real, or fake. The problem can be seen as a game: The discriminator wants to minimize E, the generator wants to maximize E! 301 The learning algorithm Denote by WG and WD the weights of G and D, respectively. In every iteration of the training, modify weights of the discriminator and the generator as follows: 302 The learning algorithm Denote by WG and WD the weights of G and D, respectively. In every iteration of the training, modify weights of the discriminator and the generator as follows: For k steps (here k is a hyperparameter) update the discriminator as follows: ▶ Sample a minibatch T = {⃗x1, . . . ,⃗xm} from the training set T . ▶ Sample a minibatch F = {⃗z1, . . . ,⃗zm} from the distribution pz. ▶ Update WD using the gradient descent w.r.t. 
E: W_D := W_D − α · ∇_{W_D} E_{T,F}(G, D) 302 The learning algorithm Denote by W_G and W_D the weights of G and D, respectively. In every iteration of the training, modify the weights of the discriminator and the generator as follows: For k steps (here k is a hyperparameter) update the discriminator as follows: ▶ Sample a minibatch T = {⃗x_1, . . . ,⃗x_m} from the training set T. ▶ Sample a minibatch F = {⃗z_1, . . . ,⃗z_m} from the distribution p_z. ▶ Update W_D using gradient descent w.r.t. E: W_D := W_D − α · ∇_{W_D} E_{T,F}(G, D) Now update the generator: ▶ Sample a minibatch F = {⃗z_1, . . . ,⃗z_m} from the distribution p_z. ▶ Update the generator by gradient ascent w.r.t. E, i.e., by descending on its own term of the loss: W_G := W_G − α · ∇_{W_G} ( (1/m) ∑_{i=1}^{m} ln(1 − D(G(⃗z_i))) ) (The updates may also use momentum, an adaptive learning rate, etc.) 302 GAN MNIST 303 GAN faces ... from the original paper. 304 GAN refined ... after some refinements. ... this was the start of deepfakes. 305
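Putting the pieces together, the alternating updates of the learning algorithm above can be sketched as follows (illustrative only; sample_data, sample_noise, grad_D and grad_G are assumed helpers standing for minibatch sampling and backpropagation through D and G, and D.W, G.W stand for their weight vectors):

def train_gan(D, G, sample_data, sample_noise, grad_D, grad_G,
              iterations=10_000, k=1, m=64, alpha=1e-3):
    # grad_D(D, G, x, z) : gradient of E_{T,F}(G, D) w.r.t. W_D
    # grad_G(D, G, z)    : gradient of (1/m) sum ln(1 - D(G(z_i))) w.r.t. W_G
    for _ in range(iterations):
        for _ in range(k):                           # k discriminator updates per iteration
            x = sample_data(m)                       # minibatch of real examples
            z = sample_noise(m)                      # minibatch of noise samples from p_z
            D.W = D.W - alpha * grad_D(D, G, x, z)   # descend on E
        z = sample_noise(m)
        G.W = G.W - alpha * grad_G(D, G, z)          # descend on ln(1 - D(G(z))), i.e. ascend on E
    return D, G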