PV021: Neural networks
Tomáš Brázdil

Course organization

Course materials:
- Main: the lecture.
- Neural Networks and Deep Learning by Michael Nielsen, http://neuralnetworksanddeeplearning.com/ (an extremely well written modern online textbook)
- Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville, http://www.deeplearningbook.org/ (a very good overview of the state of the art in neural networks)

Course organization

Evaluation:
- Project
  - teams of two students
  - implementation of a selected model + analysis of real-world data
  - implementation either in C++ or in Java, without the use of any specialized libraries for data analysis and machine learning
  - real-world data means unprepared; cleaning and preparation of the data is part of the project!
- Oral exam
  - I may ask about anything from the lecture, including some proofs that occur only on the whiteboard!
- (Optional) This year we will try to organize a simple data analysis competition (to try the concept). It is up to you whether to participate or not. During the competition, you may use whatever tools for training neural networks you want.

FAQ

Q: Why English?
A: A couple of reasons. First, all resources about modern neural nets are in English; it is rather cumbersome to translate everything into Czech (and a combination of Czech and English is ugly). Second, to attract non-Czech-speaking students to the course.

Q: Why are the lectures not recorded?
A: Apart from my personal reasons, I want you to participate actively in the lectures, i.e. to communicate with me and other students during the lectures. Also, in my opinion, online lectures should be prepared in a completely different way. I will not discuss this issue any further.

Q: Why can we not use specialized libraries in the projects?
A: In order to "touch" the low-level implementation details of the algorithms. You should not even use libraries for linear algebra and numerical methods, so that you will be confronted with rounding errors and numerical instabilities.

Machine learning in general

Machine learning = construction of systems that may learn their functionality from data (... and thus do not need to be programmed).
- A spam filter
  - learns to recognize spam from a database of "labelled" emails
  - consequently is able to distinguish spam from ham
- A handwritten text reader
  - learns from a database of handwritten letters (or texts) labelled by their correct meaning
  - consequently is able to recognize text
- ... and lots of much more sophisticated applications ...

Basic attributes of learning algorithms:
- representation: the ability to capture the inner structure of training data
- generalization: the ability to work properly on new data

Machine learning in general

Machine learning algorithms typically construct mathematical models of given data. The models may subsequently be applied to fresh data.

There are many types of models:
- decision trees
- support vector machines
- hidden Markov models
- Bayesian networks and other graphical models
- neural networks
- ...

Neural networks, based on models of a (human) brain, form a natural basis for learning algorithms!

Artificial neural networks

An artificial neuron is a rough mathematical approximation of a biological neuron. An (artificial) neural network (NN) consists of a number of interconnected artificial neurons. The "behavior" of the network is encoded in the connections between the neurons.

[Figure: an artificial neuron with inputs x1, x2, ..., xn, inner potential ξ, activation function σ, and output y.]
Image source: http://tulane.edu/sse/cmb/people/schrader/

Why artificial neural networks?

Modelling of biological neural networks (computational neuroscience).
- Simplified mathematical models help to identify important mechanisms:
  - How does a brain receive information?
  - How is the information stored?
  - How does a brain develop?
  - ...
- Neuroscience is strongly multidisciplinary; precise mathematical descriptions help in communication among experts and in the design of new experiments.

I will not spend much time on this area!

Why artificial neural networks?

Neural networks in machine learning.
- Typically primitive models, far from their biological counterparts (but often inspired by biology).
- Strongly oriented towards concrete application domains:
  - decision making and control: autonomous vehicles, manufacturing processes, control of natural resources
  - games: backgammon, poker, Go
  - finance: stock prices, risk analysis
  - medicine: diagnosis, signal processing (EKG, EEG, ...), image processing (MRI, X-ray, ...)
  - text and speech processing: automatic translation, text generation, speech recognition
  - other signal processing: filtering, radar tracking, noise reduction
  - ...

I will concentrate on this area!

Important features of neural networks

- Massive parallelism: many slow (and "dumb") computational elements work in parallel on several levels of abstraction.
- Learning: a kid learns to recognize a rabbit after seeing several rabbits.
- Generalization: a kid is able to recognize a new rabbit after seeing several (old) rabbits.
- Robustness: a blurred photo of a rabbit may still be classified as a picture of a rabbit.
- Graceful degradation: experiments have shown that a damaged neural network may still work quite well; a damaged network may also re-adapt, with the remaining neurons taking over the functionality of the damaged ones.

The aim of the course

We will concentrate on basic techniques and principles of neural networks, on fundamental models of neural networks, and on their applications. You should learn:
- basic models (the multilayer perceptron; convolutional networks; recurrent networks (LSTM); Hopfield and Boltzmann machines and their use in pre-training of deep nets)
- standard applications of these models (image processing, speech and text processing)
- basic learning algorithms (gradient descent & backpropagation, Hebb's rule)
- basic practical training techniques (data preparation, setting various parameters, control of learning)
- basic information about current implementations (TensorFlow, CNTK)

Biological neural network

The human neural network consists of approximately 10^11 (100 billion on the short scale) neurons; a single cubic centimeter of a human brain contains almost 50 million neurons. Each neuron is connected with approximately 10^4 other neurons. Neurons themselves are very complex systems.

Rough description of the nervous system:
- An external stimulus is received by sensory receptors (e.g. eye cells).
- The information is further transferred via the peripheral nervous system (PNS) to the central nervous system (CNS), where it is processed (integrated) and, subsequently, an output signal is produced.
- Afterwards, the output signal is transferred via the PNS to effectors (e.g. muscle cells).

[Figure slides:]
- Biological neural network (Source: N. Campbell and J. Reece; Biology, 7th Edition; ISBN: 080537146X)
- Cerebral cortex
- Biological neuron (Source: http://www.web-books.com/eLibrary/Medicine/Physiology/Nervous/Nervous.htm)
- Synaptic connections
- Resting potential
- Action potential
- Spreading action in axon (Source: D. A. Tamarkin; STCC Foundation Press)
- Chemical synapse
- Summation
- Biological and mathematical neurons

Formal neuron (without bias)

[Figure: a neuron with inputs x1, ..., xn, weights w1, ..., wn, inner potential ξ, activation function σ, and output y.]
- x1, ..., xn ∈ R are inputs
- w1, ..., wn ∈ R are weights
- ξ is the inner potential; almost always ξ = ∑_{i=1}^{n} wi·xi
- y is the output, given by y = σ(ξ), where σ is an activation function, e.g. the unit step function
  σ(ξ) = 1 if ξ ≥ h, and 0 if ξ < h,
  where h ∈ R is a threshold.

Formal neuron (with bias)

[Figure: the same neuron with an additional input x0 = 1 (bias) whose weight is w0 = −h, where h is the threshold.]
- x0 = 1, x1, ..., xn ∈ R are inputs
- w0, w1, ..., wn ∈ R are weights
- ξ is the inner potential; almost always ξ = w0 + ∑_{i=1}^{n} wi·xi
- y is the output, given by y = σ(ξ), where σ is an activation function, e.g. the unit step function
  σ(ξ) = 1 if ξ ≥ 0, and 0 if ξ < 0.
(The threshold h has been substituted by the new input x0 = 1 with the weight w0 = −h.)
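To make the definition concrete, here is a minimal sketch of a formal neuron with bias and unit step activation, written in Java (one of the two project languages). The class, the method names, and the AND example are illustrative, not part of the lecture:

```java
public class FormalNeuron {
    private final double[] w; // w[0] is the bias w0 = -h; w[1..n] are the weights

    public FormalNeuron(double... weights) { this.w = weights; }

    // Inner potential: xi = w0 + sum_{i=1}^{n} w_i * x_i (with bias input x0 = 1)
    public double innerPotential(double[] x) {
        double xi = w[0];
        for (int i = 0; i < x.length; i++) xi += w[i + 1] * x[i];
        return xi;
    }

    // Unit step activation: y = 1 if xi >= 0, else 0
    public int output(double[] x) {
        return innerPotential(x) >= 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        // Example: a neuron computing logical AND (threshold h = 1.5, i.e. bias w0 = -1.5)
        FormalNeuron and = new FormalNeuron(-1.5, 1.0, 1.0);
        System.out.println(and.output(new double[]{1, 1})); // prints 1
        System.out.println(and.output(new double[]{1, 0})); // prints 0
    }
}
```

Representing the bias as the weight w[0] of a constant input x0 = 1 mirrors the substitution described on the slide.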
Neuron and linear separation

[Figure: the hyperplane ξ = 0 separates the region with ξ > 0 from the region with ξ < 0.]
The inner potential
  ξ = w0 + ∑_{i=1}^{n} wi·xi
determines a separation hyperplane in the n-dimensional input space:
- in 2D a line,
- in 3D a plane,
- ...

Neuron and linear separation

[Figure: a single neuron with inputs x1, ..., xn, bias input x0 = 1, and weights w0, w1, ..., wn, outputting 1/0 for A/B.]
Here n = 8 · 8, i.e. the number of pixels in the images. The inputs are binary vectors of dimension n (black pixel ≈ 1, white pixel ≈ 0).

Neuron and linear separation

[Figure: points of classes A and B in the plane and two candidate separating lines, w̄0 + ∑_{i=1}^{n} w̄i·xi = 0 (red) and w0 + ∑_{i=1}^{n} wi·xi = 0 (green).]
The red line classifies incorrectly; the green line classifies correctly (it may be the result of a correction by a learning algorithm).

Neuron and linear separation (XOR)

[Figure: the four points (0, 0) ↦ 0, (0, 1) ↦ 1, (1, 0) ↦ 1, (1, 1) ↦ 0 in the plane.]
No line separates the ones from the zeros.

Neural networks

A neural network consists of formal neurons interconnected in such a way that the output of one neuron is an input of several other neurons. In order to describe a particular type of neural network we need to specify:
- Architecture: how the neurons are connected.
- Activity: how the network transforms inputs to outputs.
- Learning: how the weights are changed during training.

Architecture

The network architecture is given as a digraph whose nodes are neurons and whose edges are connections. We distinguish several categories of neurons:
- output neurons
- hidden neurons
- input neurons
(In general, a neuron may be both input and output; a neuron is hidden if it is neither input nor output.)

Architecture – Cycles

A network is cyclic (recurrent) if its architecture contains a directed cycle. Otherwise it is acyclic (feed-forward).

Architecture – Multilayer Perceptron (MLP)

[Figure: an MLP with an input layer (x1, x2), a hidden layer, and an output layer (y1, y2).]
- The neurons are partitioned into layers: one input layer, one output layer, and possibly several hidden layers.
- The layers are numbered from 0; the input layer has number 0. E.g. a three-layer network has two hidden layers and one output layer.
- Neurons in the i-th layer are connected with all neurons in the (i+1)-st layer.
- The architecture of an MLP is typically described by the numbers of neurons in the individual layers (e.g. 2-4-3-2).

Activity

Consider a network with n neurons, k of them input and ℓ of them output.
- A state of the network is a vector of the output values of all neurons. (States of a network with n neurons are vectors of R^n.)
- The state-space of the network is the set of all states.
- A network input is a vector of k real numbers, i.e. an element of R^k.
- The network input space is the set of all network inputs. (Sometimes we restrict ourselves to a proper subset of R^k.)
- Initial state: the input neurons are set to the values of the network input (each component of the network input corresponds to an input neuron); the values of the remaining neurons are set to 0.

Activity – computation of a network

Computation (typically) proceeds in discrete steps. In every step the following happens:
1. A set of neurons is selected according to some rule.
2. The selected neurons change their states according to their inputs (they are simply evaluated). (If a neuron does not have any inputs, its value remains constant.)

A computation is finite on a network input x if the state changes only finitely many times (i.e. there is a moment in time after which the state of the network never changes). We also say that the network stops on x.
The network output is the vector of the values of all output neurons in the network (i.e. an element of R^ℓ). Note that the network output keeps changing throughout the computation!
An MLP uses the following selection rule: in the i-th step, evaluate all neurons in the i-th layer.

Activity – semantics of a network

Definition. Consider a network with n neurons, k of them input and ℓ of them output. Let A ⊆ R^k and B ⊆ R^ℓ. Suppose that the network stops on every input of A. Then we say that the network computes a function F : A → B if for every network input x the vector F(x) ∈ B is the output of the network after the computation on x stops.

Example. [Figure: a small example network.] This network computes a function from R^2 to R.

Activity – inner potential and activation functions

In order to specify the activity of the network, we need to specify how the inner potentials ξ are computed and what the activation functions σ are. We assume (unless otherwise specified) that
  ξ = w0 + ∑_{i=1}^{n} wi·xi,
where x = (x1, ..., xn) are the inputs of the neuron and w = (w1, ..., wn) are the weights.
There are special types of neural networks where the inner potential is computed differently, e.g. as a "distance" of the input from the weight vector:
  ξ = ||x − w||,
where ||·|| is a vector norm, typically Euclidean.

Activity – inner potential and activation functions

There are many activation functions; typical examples:
- the unit step function
  σ(ξ) = 1 if ξ ≥ 0, and 0 if ξ < 0;
- the (logistic) sigmoid
  σ(ξ) = 1 / (1 + e^(−λ·ξ)),
  where λ ∈ R is a steepness parameter;
- the hyperbolic tangent
  σ(ξ) = (1 − e^(−ξ)) / (1 + e^(−ξ)).
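These three activation functions translate directly into code; a minimal Java sketch (the class and method names are mine):

```java
public final class Activations {
    // Unit step: 1 if xi >= 0, else 0
    static double step(double xi) { return xi >= 0 ? 1.0 : 0.0; }

    // Logistic sigmoid with steepness lambda: 1 / (1 + e^(-lambda * xi))
    static double sigmoid(double xi, double lambda) {
        return 1.0 / (1.0 + Math.exp(-lambda * xi));
    }

    // Hyperbolic tangent in the form used on the slide:
    // (1 - e^(-xi)) / (1 + e^(-xi)), which equals tanh(xi / 2)
    static double tanhLike(double xi) {
        return (1.0 - Math.exp(-xi)) / (1.0 + Math.exp(-xi));
    }

    public static void main(String[] args) {
        System.out.println(step(0.3));         // 1.0
        System.out.println(sigmoid(0.3, 1.0)); // ~0.574
        System.out.println(tanhLike(0.3));     // ~0.149 (= tanh(0.15))
    }
}
```

The step function is the one used in the formal neuron above; the sigmoid's λ controls how closely it approximates the step.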
Activity – XOR

[Figure: a two-layer MLP computing XOR; the digit sequences next to the neurons in the original figure show the successive states of the individual neurons during the computation.]
The activation function is the unit step function
  σ(ξ) = 1 if ξ ≥ 0, and 0 if ξ < 0.
The network computes XOR(x1, x2):

  x1  x2  y
  1   1   0
  1   0   1
  0   1   1
  0   0   0

Activity – MLP and linear separation

[Figure: the four XOR points and the two separating lines P1 and P2; the hidden neurons implement P1 and P2, and the output neuron combines their outputs.]
- The line P1 is given by −1 + 2x1 + 2x2 = 0.
- The line P2 is given by 3 − 2x1 − 2x2 = 0.
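Reading the weights off the figures (the hidden neurons implement P1 and P2; taking the output neuron to have bias −2 and weights 1, 1, i.e. an AND of the two half-planes, which is my reading of the figure), a short runnable Java sketch reproduces the XOR table:

```java
public class XorNetwork {
    // Unit step activation
    static int step(double xi) { return xi >= 0 ? 1 : 0; }

    static int xor(int x1, int x2) {
        int h1 = step(-1 + 2 * x1 + 2 * x2); // 1 iff the point lies on the positive side of P1
        int h2 = step(3 - 2 * x1 - 2 * x2);  // 1 iff the point lies on the positive side of P2
        return step(-2 + h1 + h2);           // 1 iff both hidden neurons fire (an AND)
    }

    public static void main(String[] args) {
        for (int x1 = 0; x1 <= 1; x1++)
            for (int x2 = 0; x2 <= 1; x2++)
                System.out.printf("XOR(%d,%d) = %d%n", x1, x2, xor(x1, x2));
    }
}
```

Running main prints exactly the four rows of the truth table above.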
Activity – example

[Figure: a small network of unit step neurons with the weights shown in the original figure; the digit sequences show the successive states of the neurons during the computation.]
The activation function is the unit step function
  σ(ξ) = 1 if ξ ≥ 0, and 0 if ξ < 0.
The input is equal to 1.

Learning

Consider a network with n neurons, k of them input and ℓ of them output.
- A configuration of the network is a vector of the values of all weights. (Configurations of a network with m connections are elements of R^m.)
- The weight-space of the network is the set of all configurations.
- Initial configuration: the weights can be initialized randomly or using some sophisticated algorithm.

Learning algorithms

Learning rule for weight adaptation. (The goal is to find a configuration in which the network computes a desired function.)
- Supervised learning: The desired function is described using training examples, which are pairs of the form (input, output). The learning algorithm searches for a configuration which "corresponds" to the training examples, typically by minimizing an error function.
- Unsupervised learning: The training set contains only inputs. The goal is to determine the distribution of the inputs (clustering, deep belief networks, etc.).

Supervised learning – illustration

[Figure: points of classes A and B in the plane and a separating line.]
- Classification in the plane using a single neuron.
- Training examples are of the form (point, value), where the value is 1 or 0 depending on whether the point belongs to class A or to class B.
- The algorithm considers the examples one after another. Whenever an incorrectly classified point is considered, the learning algorithm turns the line in the direction of that point (a code sketch of this rule follows at the end of this section).

Unsupervised learning – illustration

[Figure: two clusters of points (classes A and B) and candidate cluster centres marked by crosses.]
- We search for two centres of the clusters.
- The red crosses correspond to potential centres before the application of the learning algorithm, the green ones to the centres after the application.

Summary – Advantages of neural networks

- Massive parallelism: neurons can be evaluated in parallel.
- Learning: many sophisticated learning algorithms can be used to "program" neural networks.
- Generalization and robustness: information is encoded in a distributed manner in the weights; "close" inputs typically get similar values.
- Graceful degradation: damage typically causes only a decrease in the precision of results.
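The "turning of the line" in the supervised learning illustration is the classical perceptron learning rule. A minimal Java sketch, assuming desired outputs 0/1, a bias input x0 = 1, and a fixed learning rate (the names and the AND training data are illustrative, not from the lecture):

```java
import java.util.Arrays;

public class PerceptronLearning {
    // Perceptron rule: whenever an example is misclassified, turn the
    // separating line towards it: w <- w + rate * (desired - actual) * x.
    static void train(double[] w, double[][] xs, int[] desired, double rate, int epochs) {
        for (int e = 0; e < epochs; e++)
            for (int t = 0; t < xs.length; t++) {
                double xi = w[0]; // bias input x0 = 1
                for (int i = 0; i < xs[t].length; i++) xi += w[i + 1] * xs[t][i];
                int actual = xi >= 0 ? 1 : 0;
                int err = desired[t] - actual;
                w[0] += rate * err;
                for (int i = 0; i < xs[t].length; i++) w[i + 1] += rate * err * xs[t][i];
            }
    }

    public static void main(String[] args) {
        double[][] xs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        int[] desired = {0, 0, 0, 1}; // logical AND: linearly separable, unlike XOR
        double[] w = new double[3];   // w[0] is the bias
        train(w, xs, desired, 0.1, 50);
        System.out.println(Arrays.toString(w)); // a separating configuration for AND
    }
}
```

For linearly separable data such as AND the rule eventually stops misclassifying; for XOR (see the earlier slides) no single neuron can succeed.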