Formal neuron (with bias)

Figure: a formal neuron with inputs x_1, ..., x_n, a special input x_0 = 1 with weight w_0 = −h (the bias, i.e. the negative threshold h), inner potential ξ, activation function σ and output y.

- x_0 = 1, x_1, ..., x_n ∈ R are inputs,
- w_0, w_1, ..., w_n ∈ R are weights,
- ξ is the inner potential; almost always ξ = w_0 + Σ_{i=1}^{n} w_i x_i,
- y is the output given by y = σ(ξ), where σ is an activation function, e.g. the unit step function σ(ξ) = 1 if ξ ≥ 0, and 0 if ξ < 0.

Boolean functions

Activation function: the unit step function σ(ξ) = 1 if ξ ≥ 0, and 0 if ξ < 0.

- y = AND(x_1, ..., x_n): weights w_1 = · · · = w_n = 1, bias w_0 = −n.
- y = OR(x_1, ..., x_n): weights w_1 = · · · = w_n = 1, bias w_0 = −1.
- y = NOT(x_1): weight w_1 = −1, bias w_0 = 0.

Theorem. Let σ be the unit step function. Two-layer MLPs, where each neuron has σ as its activation function, are able to compute all functions of the form F : {0,1}^n → {0,1}.

Proof. Given a vector v = (v_1, ..., v_n) ∈ {0,1}^n, consider a neuron N_v whose output is 1 iff the input is v; its weights are
  w_0 = −Σ_{i=1}^{n} v_i,  w_i = 1 if v_i = 1, and w_i = −1 if v_i = 0.
Now connect the outputs of all neurons N_v satisfying F(v) = 1 to a neuron implementing OR.

Non-linear separation

Consider a three-layer network; each neuron has the unit step activation function. The network divides the input space into two subspaces according to the output (0 or 1).

- The first (hidden) layer divides the input space into half-spaces.
- The second layer may, e.g., form intersections of the half-spaces ⇒ convex sets.
- The third layer may, e.g., form unions of some of the convex sets.

Non-linear separation – illustration

Consider three-layer networks; each neuron has the unit step activation function. Three-layer nets are capable of "approximating" any "reasonable" subset A of the input space R^k:

- Cover A with hypercubes (in 2D squares, in 3D cubes, ...).
- Each hypercube K can be separated by a two-layer network N_K (i.e. the function computed by N_K gives 1 for points in K and 0 for the rest).
- Finally, connect the outputs of the nets N_K satisfying K ∩ A ≠ ∅ to a neuron implementing OR.

Non-linear separation – sigmoid

Theorem (Cybenko 1989 – informal version). Let σ be a continuous function which is sigmoidal, i.e. satisfies σ(x) → 1 for x → +∞ and σ(x) → 0 for x → −∞. For every reasonable set A ⊆ [0,1]^n there is a two-layer network, where each hidden neuron has the activation function σ (output neurons are linear), that satisfies the following: for most vectors v ∈ [0,1]^n we have that v ∈ A iff the network output is > 0 for the input v.

For the mathematically oriented:
- "reasonable" means Lebesgue measurable,
- "most" means that the set of incorrectly classified vectors has Lebesgue measure smaller than a given ε > 0.

Non-linear separation – practical illustration

ALVINN drives a car.
- The net has 30 × 32 = 960 inputs (the input space is thus R^960).
- Input values correspond to the shades of gray of pixels.
- Output neurons "classify" images of the road according to their "curvature".

Image source: http://jmvidal.cse.sc.edu/talks/ann/alvin.html

Function approximation – three-layer networks

Let σ be the logistic sigmoid, i.e. σ(ξ) = 1 / (1 + e^{−ξ}). For every continuous function f : [0,1]^n → [0,1] and every ε > 0 there is a three-layer network computing a function F : [0,1]^n → [0,1] such that
- the output layer has linear activation, i.e. the value of the output neuron is its inner potential ξ,
- the remaining neurons have the logistic sigmoid σ as their activation function,
- for every v ∈ [0,1]^n we have |F(v) − f(v)| < ε.

Figure: a three-layer network with inputs x_1 and x_2: the hidden sigmoid neurons (σ(ξ) = 1 / (1 + e^{−ξ})) build localized "spikes" (a ridge plus its 90-degree rotations), and the linear output neuron ζ (with ζ(ξ) = ξ) computes a weighted sum of these "spikes".
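To make the two-layer construction from the Boolean-function theorem above concrete, here is a minimal numpy sketch; the helper names (step, two_layer_boolean_net, evaluate) and the XOR test case are illustrative choices, not taken from the slides.

```python
import itertools
import numpy as np

def step(xi):
    """Unit step activation: 1 if xi >= 0, else 0."""
    return (np.asarray(xi) >= 0).astype(int)

def two_layer_boolean_net(F, n):
    """Hidden layer from the proof: for each v with F(v) = 1, a neuron N_v with
    weights w_i = 1 if v_i = 1, w_i = -1 if v_i = 0, and bias w_0 = -sum_i v_i."""
    ones = [v for v in itertools.product([0, 1], repeat=n) if F(v) == 1]
    W = np.array([[1 if vi == 1 else -1 for vi in v] for v in ones])
    b = np.array([-sum(v) for v in ones])
    return W, b

def evaluate(W, b, x):
    hidden = step(W @ x + b)        # N_v outputs 1 iff the input equals v
    return step(hidden.sum() - 1)   # output neuron implementing OR: weights 1, bias -1

# Example: the two-layer network computing XOR on two inputs.
xor = lambda v: v[0] ^ v[1]
W, b = two_layer_boolean_net(xor, 2)
for x in itertools.product([0, 1], repeat=2):
    print(x, evaluate(W, b, np.array(x)))   # (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0
```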
Function approximation – two-layer networks

Theorem (Cybenko 1989). Let σ be a continuous function which is sigmoidal, i.e. is increasing and satisfies σ(x) → 1 for x → +∞ and σ(x) → 0 for x → −∞. For every continuous function f : [0,1]^n → [0,1] and every ε > 0 there is a function F : [0,1]^n → [0,1] computed by a two-layer network, where each hidden neuron has the activation function σ (output neurons are linear), that satisfies
  |f(v) − F(v)| < ε for every v ∈ [0,1]^n.

Neural networks and computability

Consider recurrent networks (i.e. networks containing cycles)
- with real weights (in general);
- with one input neuron and one output neuron (the network computes a function F : A → R, where A ⊆ R contains all inputs on which the network halts);
- with the parallel activity rule (the output values of all neurons are recomputed in every step);
- with the activation function σ(ξ) = 1 if ξ ≥ 1, ξ if 0 ≤ ξ ≤ 1, and 0 if ξ < 0.

We encode words ω ∈ {0,1}^+ into numbers as follows:
  δ(ω) = Σ_{i=1}^{|ω|} ω(i) / 2^i + 1 / 2^{|ω|+1}
E.g. ω = 11001 gives δ(ω) = 1/2 + 1/2^2 + 1/2^5 + 1/2^6 (= 0.110011 in binary).

A network recognizes a language L ⊆ {0,1}^+ if it computes a function F : A → R (A ⊆ R) such that ω ∈ L iff δ(ω) ∈ A and F(δ(ω)) > 0.

Recurrent networks with rational weights are equivalent to Turing machines:
- For every recursively enumerable language L ⊆ {0,1}^+ there is a recurrent network with rational weights and fewer than 1000 neurons which recognizes L.
- The halting problem is undecidable for networks with at least 25 neurons and rational weights.
- There is a "universal" network (an equivalent of the universal Turing machine).

Recurrent networks with real weights are super-Turing powerful:
- For every language L ⊆ {0,1}^+ there is a recurrent network with fewer than 1000 neurons which recognizes L.

Summary of theoretical results

Neural networks are very strong from the point of view of theory:
- All Boolean functions can be expressed using two-layer networks.
- Two-layer networks may approximate any continuous function.
- Recurrent networks are at least as strong as Turing machines.

These results are purely theoretical!
- "Theoretical" networks are extremely huge.
- It is very difficult to handcraft them even for the simplest problems.

From the practical point of view, the most important advantages of neural networks are learning, generalization and robustness.

Neural networks vs classical computers
- Data: neural networks store it implicitly in the weights; classical computers store it explicitly.
- Computation: naturally parallel vs. sequential and localized.
- Robustness: neural networks are robust w.r.t. input corruption and damage; in classical computers, changing one bit may completely crash the computation.
- Precision: neural networks are imprecise, the network recalls a training example "similar" to the input; classical computers are (typically) precise.
- Programming: learning vs. manual programming.
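To make the word encoding δ(ω) used in the computability results above concrete, here is a small sketch; the function name delta is an illustrative choice, not from the slides.

```python
from fractions import Fraction

def delta(omega: str) -> Fraction:
    """delta(omega) = sum_{i=1}^{|omega|} omega(i)/2^i + 1/2^(|omega|+1).
    The trailing term 1/2^(|omega|+1) marks the end of the word, so that e.g.
    "1" and "10" receive different codes and the encoding is injective."""
    value = sum(Fraction(int(bit), 2 ** (i + 1)) for i, bit in enumerate(omega))
    return value + Fraction(1, 2 ** (len(omega) + 1))

print(delta("11001"), float(delta("11001")))   # 51/64 0.796875, i.e. 0.110011 in binary
```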
History of neurocomputers

1951: SNARC (Minsky et al.)
- the first implementation of a neural network
- a rat strives to exit a maze
- 40 artificial neurons (300 vacuum tubes, motors, etc.)

1957: Mark I Perceptron (Rosenblatt et al.) – the first successful network for image recognition
- a single-layer network
- the image was represented by 20 × 20 photocells
- the intensity of the pixels was treated as the input to a perceptron (basically the formal neuron), which recognized figures
- the weights were implemented using potentiometers, each set by its own motor
- it was possible to arbitrarily reconnect inputs to neurons to demonstrate adaptability

1960: ADALINE (Widrow & Hoff)
- a single-layer neural network
- the weights were stored in a newly invented electronic component, the memistor, which remembers the history of electric current in the form of resistance
- Widrow founded the company Memistor Corporation, which sold implementations of neural networks

1960-66: several companies concerned with neural networks were founded.

1967-82: dead still after the publication of the book Perceptrons by Minsky & Papert (1969).

1983 - end of the 90s: revival of neural networks
- many attempts at hardware implementations: application-specific chips (ASIC), programmable hardware (FPGA)
- hardware implementations were typically not better than "software" implementations on universal computers (problems with weight storage, size, speed, cost of production, etc.)

End of the 90s - ca. 2005: neural networks were overshadowed by other machine learning methods (support vector machines (SVM)).

2006 - now: the boom of neural networks!
- deep networks – often better than any other method
- GPU implementations
- some specialized hardware implementations (not yet widespread)

History in waves ...

Figure: The figure shows two of the three historical waves of artificial neural nets research, as measured by the frequency of the phrases "cybernetics" and "connectionism" or "neural networks" according to Google Books (the third wave is too recent to appear).

Current hardware – What do we face?

Increasing dataset size ...

... and thus increasing size of neural networks (milestones, numbering taken from the source figure):
2. ADALINE
4. Early back-propagation network (Rumelhart et al., 1986b)
8. Image recognition: LeNet-5 (LeCun et al., 1998b)
10. Dimensionality reduction: deep belief network (Hinton et al., 2006) ... here the third "wave" of neural networks started
15. Digit recognition: GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
18. Image recognition (AlexNet): multi-GPU convolutional network (Krizhevsky et al., 2012)
20. Image recognition: GoogLeNet (Szegedy et al., 2014a)

... and as a reward we get this:

Figure: Since deep networks reached the scale necessary to compete in the ImageNet Large Scale Visual Recognition Challenge, they have consistently won the competition every year, and yielded lower and lower error rates each time. Data from Russakovsky et al. (2014b) and He et al. (2015).

Current hardware

In 2012, Google trained a large network with 1.7 billion weights and 9 layers.
- The task was image recognition (10 million YouTube video frames).
- The hardware comprised a network of 1,000 computers (16,000 cores); the computation took three days.

In 2014, a similar task was performed on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with InfiniBand interconnects and MPI.
- Able to train networks with 1 billion parameters on just 3 machines in a couple of days.
- Able to scale to 11 billion weights (approx. 6.5 times larger than the Google model) on 16 GPUs.
Current hardware – NVIDIA DGX-1 (example)
- 8x GPU (Tesla GP100)
- 170 TFLOPS
- GPU memory: 16 GB per GPU
- NVIDIA CUDA cores: 28,672
- system memory: 512 GB
- network: dual 10 GbE
- NVIDIA Deep Learning SDK

Current software

TensorFlow (Google)
- an open-source software library for numerical computation using data flow graphs
- allows implementation of most current neural networks
- allows computation on multiple devices (CPUs, GPUs, ...)
- Python API
- Keras: a library on top of TensorFlow that allows easy description of most modern neural networks (a minimal code sketch is given below)

CNTK (Microsoft)
- functionality similar to TensorFlow
- a special input language called BrainScript

Theano: the "academic" grand-daddy of deep-learning frameworks, written in Python. It strongly inspired TensorFlow (some of the people developing Theano moved on to develop TensorFlow).

There are others: Caffe, Torch (Facebook), Deeplearning4j, ...

Other software implementations

Most "mathematical" software packages contain some support for neural networks: MATLAB, R, STATISTICA, Weka, ... These implementations are typically not on par with the dedicated deep-learning libraries mentioned above.
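To give a flavour of the Keras style mentioned above, here is a minimal sketch of a small feed-forward classifier; the layer sizes, optimizer and data shapes are illustrative assumptions, not taken from the slides.

```python
from tensorflow import keras   # with the standalone package: `import keras`

# A small classifier for 28x28 grayscale images with 10 output classes.
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),      # 28 x 28 = 784 inputs
    keras.layers.Dense(128, activation="sigmoid"),   # hidden layer
    keras.layers.Dense(10, activation="softmax"),    # output layer
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

# Training is then a single call, given an array x_train of shape (N, 28, 28)
# and integer labels y_train of shape (N,):
# model.fit(x_train, y_train, epochs=5, batch_size=32)
```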
SyNAPSE (USA)
- A big research project, partially funded by DARPA.
- Among the main participants are IBM and HRL, in collaboration with top US universities, e.g. Boston and Stanford.
- The project started in 2008; tens of millions of USD have been invested.
- The goal: develop a neural network comparable with the real brain of a mammal. The resulting hardware chip should simulate 10 billion neurons and 100 trillion synaptic connections, consume 1 kilowatt (∼ a small heater) and have a volume of 2 dm^3.
- Oriented towards the development of a new parallel computer architecture rather than towards neuroscience.

SyNAPSE (USA) – some results

A cat brain simulation (2009):
- a simulation of a network with 10^9 neurons and 10^13 synapses
- simulated on the supercomputer Dawn (Blue Gene/P): 147,450 CPUs, 144 TB of memory
- 643 times slower than the real brain
- the network was modelled according to the structure of the real brain (a hierarchical model of the visual cortex, 4 layers)
- the authors claim that they observed behaviour similar to that of the real brain (signal propagation, α and γ waves)
- ... the simulation was heavily criticised (see below)
- ... in 2012 the simulation was scaled up to 530 billion neurons and 100 trillion synapses

SyNAPSE (USA) – TrueNorth
- a chip with 5.4 billion transistors
- 4,096 neurosynaptic cores connected by a network, implementing 1 million programmable "spiking" neurons and 256 million programmable synaptic connections
- global clock frequency of 1 kHz
- low energy consumption, approx. 63 mW
- offline learning; some known algorithms have been implemented (convolutional networks, RBMs, etc.)
- applied to simple image recognition tasks

Human Brain Project, HBP (Europe)
- Funded by the EU, with a budget of 10^9 EUR over 10 years.
- Successor of the Blue Brain Project at EPFL Lausanne: Blue Brain started in 2005 and ended in 2012; the Human Brain Project started in 2013.
- The original goals: a deeper understanding of human brain networking in neuroscience, diagnosis of brain diseases, a thinking machine.
- The approach: study of brain tissue using current technology, modelling of biological neurons, simulation of the models (the NEURON program).

HBP (Europe)

Blue Brain Project (2008):
- a model of a part of the brain cortex of a rat (approx. 10,000 neurons), with a much more complex model of neurons than in SyNAPSE
- simulated on a supercomputer of the Blue Gene/P type (provided by IBM at a discount): 16,384 CPUs, 56 teraflops, 16 terabytes of memory, 1 PB of disk space
- the simulation was 300x slower than the real brain

Human Brain Project (2015): a simplified model of the nervous system of a rat (approx. 200,000 neurons).

SyNAPSE vs HBP

2011: "IBM Simulates 4.5 percent of the Human Brain, and All of the Cat Brain" (Scientific American)

"... performed the first near real-time cortical simulation of the brain that exceeds the scale of a cat cortex" (IBM)

This announcement was heavily criticised by Dr. Markram (head of HBP):
- "This is a mega public relations stunt – a clear case of scientific deception of the public."
- "Their so called 'neurons' are the tiniest of points you can imagine, a microscopic dot."
- "Neurons contain 10's of thousands of proteins that form a network with 10's of millions of interactions. These interactions are incredibly complex and will require solving millions of differential equations. They have none of that."
- "Eugene Izhikevich himself already in 2005 ran a simulation with 100 billion such points interacting just for the fun of it (over 60 times larger than Modha's simulation)."
- Why did they get the Gordon Bell Prize? "They seem to have been very successful in influencing the committee with their claim, which technically is not peer-reviewed by the respective community and is neuroscientifically outrageous."
- But is there any innovation here? "The only innovation here is that IBM has built a large supercomputer."
- But did Modha not collaborate with neuroscientists? "I would be very surprised if any neuroscientists that he may have had in his DARPA consortium realized he was going to make such an outrageous claim. I can't imagine that the San Francisco neuroscientists knew he was going to make such a stupid claim. Modha himself is a software engineer with no knowledge of the brain."

... and in the meantime in Europe

In 2014, the European Commission received an open letter signed by more than 130 heads of laboratories demanding a substantial change in the management of the whole project.

Peter Dayan, director of the computational neuroscience unit at UCL:
- "The main apparent goal of building the capacity to construct a larger-scale simulation of the human brain is radically premature."
- "We are left with a project that can't but fail from a scientific perspective. It is a waste of money, it will suck out funds from valuable neuroscience research, and would leave the public, who fund this work, justifiably upset."

... and in 2016

The European Commission and the Human Brain Project Coordinator, the École Polytechnique Fédérale de Lausanne (EPFL), signed the first Specific Grant Agreement (SGA1), releasing EUR 89 million in funding retroactively from 1 April 2016 until the end of March 2018. The signature of SGA1 means that the HBP and the European Commission have agreed on the HBP Work Plan for this two-year period. The SGA1 work plan will move the Project closer to achieving its aim of establishing a cutting-edge, ICT-based scientific research infrastructure for brain research, cognitive neuroscience and brain-inspired computing.