PV021: Neural networks
Tomáš Brázdil

Course organization

Course materials:
- Main: the lecture.
- Neural Networks and Deep Learning by Michael Nielsen, http://neuralnetworksanddeeplearning.com/ (an extremely well written modern online textbook)
- Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville, http://www.deeplearningbook.org/ (a very good overview of the state of the art in neural networks)

Evaluation:
- Project: teams of two students; implementation of a selected model + analysis of given data; implementation in C, C++, or Java, without the use of any specialized libraries for data analysis and machine learning; you need to get over a given accuracy threshold (a gentle one, just to eliminate non-functional implementations).
- Oral exam: I may ask about anything from the lecture, including some proofs that occur only on the whiteboard!
- Application of any deep learning toolset to given (difficult) data: we prefer TensorFlow, but you may use another library (CNTK, Caffe, DeepLearning4j, ...). The goal is to get the best results on increasingly difficult datasets. The team with the best result on the hardest dataset will automatically get a grade better than F at the exam.

FAQ

Q: Why English?
A: A couple of reasons. First, all resources about modern neural nets are in English; it is rather cumbersome to translate everything into Czech (a combination of Czech and English is ugly). Second, to attract non-Czech speaking students to the course.

Q: Why can we not use specialized libraries in the projects?
A: In order to "touch" the low-level implementation details of the algorithms. You should not even use libraries for linear algebra and numerical methods, so that you will be confronted with rounding errors and numerical instabilities.

Machine learning in general

Machine learning = construction of systems that may learn their functionality from data (... and thus do not need to be programmed):
- A spam filter learns to recognize spam from a database of "labelled" emails, and is consequently able to distinguish spam from ham.
- A handwritten text reader learns from a database of handwritten letters (or texts) labelled by their correct meaning, and is consequently able to recognize text.
- ... and lots of much more sophisticated applications.

Basic attributes of learning algorithms:
- representation: the ability to capture the inner structure of the training data
- generalization: the ability to work properly on new data

Machine learning algorithms typically construct mathematical models of given data. The models may subsequently be applied to fresh data. There are many types of models:
- decision trees
- support vector machines
- hidden Markov models
- Bayesian networks and other graphical models
- neural networks
- ...
Neural networks, based on models of a (human) brain, form a natural basis for learning algorithms!

Artificial neural networks

An artificial neuron is a rough mathematical approximation of a biological neuron. An (artificial) neural network (NN) consists of a number of interconnected artificial neurons. The "behavior" of the network is encoded in the connections between the neurons.

[Figure: a neuron with inputs x_1, ..., x_n, inner potential ξ, activation σ, output y. Image source: http://tulane.edu/sse/cmb/people/schrader/]

Why artificial neural networks?

Modelling of biological neural networks (computational neuroscience):
- Simplified mathematical models help to identify important mechanisms: How does a brain receive information? How is the information stored? How does a brain develop? ...
- Neuroscience is strongly multidisciplinary; precise mathematical descriptions help in communication among experts and in the design of new experiments.
I will not spend much time on this area!

Neural networks in machine learning:
- Typically primitive models, far from their biological counterparts (but often inspired by biology).
- Strongly oriented towards concrete application domains:
  - decision making and control: autonomous vehicles, manufacturing processes, control of natural resources
  - games: backgammon, poker, Go
  - finance: stock prices, risk analysis
  - medicine: diagnosis, signal processing (ECG, EEG, ...), image processing (MRI, X-ray, ...)
  - text and speech processing: automatic translation, text generation, speech recognition
  - other signal processing: filtering, radar tracking, noise reduction
  - ...
I will concentrate on this area!

Important features of neural networks

- Massive parallelism: many slow (and "dumb") computational elements work in parallel on several levels of abstraction.
- Learning: a kid learns to recognize a rabbit after seeing several rabbits.
- Generalization: a kid is able to recognize a new rabbit after seeing several (old) rabbits.
- Robustness: a blurred photo of a rabbit may still be classified as a picture of a rabbit.
- Graceful degradation: experiments have shown that a damaged neural network may still work quite well; a damaged network may re-adapt, with the remaining neurons taking over the functionality of the damaged ones.

The aim of the course

We will concentrate on basic techniques and principles of neural networks, fundamental models of neural networks and their applications.
You should learn:
- basic models (multilayer perceptron, convolutional networks, recurrent networks (LSTM), Hopfield and Boltzmann machines and their use in pre-training of deep nets)
- standard applications of these models (image processing, speech and text processing)
- basic learning algorithms (gradient descent & backpropagation, Hebb's rule)
- basic practical training techniques (data preparation, setting various parameters, control of learning)
- basic information about current implementations (TensorFlow, CNTK)

Biological neural network

The human neural network consists of approximately 10^11 (100 billion on the short scale) neurons; a single cubic centimeter of a human brain contains almost 50 million neurons. Each neuron is connected with approx. 10^4 neurons. Neurons themselves are very complex systems.

Rough description of the nervous system:
- An external stimulus is received by sensory receptors (e.g. eye cells).
- The information is further transferred via the peripheral nervous system (PNS) to the central nervous system (CNS), where it is processed (integrated) and, subsequently, an output signal is produced.
- Afterwards, the output signal is transferred via the PNS to effectors (e.g. muscle cells).

[Figure: the nervous system. Source: N. Campbell and J. Reece; Biology, 7th Edition; ISBN 080537146X]

Biological neuron
[Figure. Source: http://www.web-books.com/eLibrary/Medicine/Physiology/Nervous/Nervous.htm]

Synaptic connections, action potential, summation, biological and mathematical neurons: [figures only]

Formal neuron (without bias)

- x_1, ..., x_n ∈ R are inputs
- w_1, ..., w_n ∈ R are weights
- ξ is an inner potential; almost always ξ = ∑_{i=1}^n w_i x_i
- y is an output given by y = σ(ξ), where σ is an activation function, e.g. a unit step function: σ(ξ) = 1 for ξ ≥ h and σ(ξ) = 0 for ξ < h, where h ∈ R is a threshold.

Formal neuron (with bias)

- x_0 = 1, x_1, ..., x_n ∈ R are inputs
- w_0, w_1, ..., w_n ∈ R are weights
- ξ is an inner potential; almost always ξ = w_0 + ∑_{i=1}^n w_i x_i
- y is an output given by y = σ(ξ), where σ is an activation function, e.g. a unit step function: σ(ξ) = 1 for ξ ≥ 0 and σ(ξ) = 0 for ξ < 0.
(The threshold h has been substituted with the new input x_0 = 1 and the weight w_0 = −h. A small code sketch of this neuron follows the XOR example below.)

Neuron and linear separation

The inner potential ξ = w_0 + ∑_{i=1}^n w_i x_i determines a separation hyperplane ξ = 0 in the n-dimensional input space: a line in 2D, a plane in 3D, ...

Example: a single neuron classifying images of letters as A (output 1) or B (output 0). Here n = 8 · 8, i.e. the number of pixels in the images; inputs are binary vectors of dimension n (black pixel ≈ 1, white pixel ≈ 0).

[Figure: points of classes A and B in the plane. The red line w̄_0 + ∑_{i=1}^n w̄_i x_i = 0 classifies incorrectly; the green line w_0 + ∑_{i=1}^n w_i x_i = 0 classifies correctly (it may be the result of a correction by a learning algorithm).]

Neuron and linear separation (XOR)

Consider the points (0,0) and (1,1) with value 0, and (0,1) and (1,0) with value 1. No line separates the ones from the zeros.
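A formal neuron is easy to state in code. The following Python sketch (an illustration added here, not part of the original slides) implements the neuron with bias and the unit step activation; the weight choices shown realize AND and OR:

```python
def step(xi):
    # unit step activation: 1 if the inner potential is >= 0, else 0
    return 1 if xi >= 0 else 0

def neuron(w0, w, x):
    # inner potential: xi = w0 + sum_i w_i * x_i
    xi = w0 + sum(w_i * x_i for w_i, x_i in zip(w, x))
    return step(xi)

# weights (1, 1) with bias -2 (threshold h = 2) implement AND;
# weights (1, 1) with bias -1 (threshold h = 1) implement OR
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, neuron(-2, (1, 1), x), neuron(-1, (1, 1), x))
```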
Neural networks

A neural network consists of formal neurons interconnected in such a way that the output of one neuron is an input of several other neurons. In order to describe a particular type of neural network we need to specify:
- Architecture: how the neurons are connected.
- Activity: how the network transforms inputs to outputs.
- Learning: how the weights are changed during training.

Architecture

The network architecture is given as a digraph whose nodes are neurons and whose edges are connections. We distinguish several categories of neurons:
- output neurons
- hidden neurons
- input neurons
(In general, a neuron may be both input and output; a neuron is hidden if it is neither input nor output.)

Architecture – Cycles

A network is cyclic (recurrent) if its architecture contains a directed cycle; otherwise it is acyclic (feed-forward).

Architecture – Multilayer Perceptron (MLP)

- Neurons are partitioned into layers: one input layer, one output layer, possibly several hidden layers.
- Layers are numbered from 0; the input layer has number 0. E.g. a three-layer network has two hidden layers and one output layer.
- Neurons in the i-th layer are connected with all neurons in the (i+1)-st layer.
- The architecture of an MLP is typically described by the numbers of neurons in the individual layers (e.g. 2-4-3-2).

Activity

Consider a network with n neurons, k of them input and ℓ of them output.
- The state of a network is a vector of the output values of all neurons. (The states of a network with n neurons are vectors of R^n.) The state-space of a network is the set of all states.
- The network input is a vector of k real numbers, i.e. an element of R^k. The network input space is the set of all network inputs. (Sometimes we restrict ourselves to a proper subset of R^k.)
- Initial state: input neurons are set to the values of the network input (each component of the network input corresponds to an input neuron); the values of the remaining neurons are set to 0.

Activity – computation of a network

Computation (typically) proceeds in discrete steps.
In every step the following happens:
1. A set of neurons is selected according to some rule.
2. The selected neurons change their states according to their inputs (they are simply evaluated). (If a neuron does not have any inputs, its value remains constant.)

A computation is finite on a network input x if the state changes only finitely many times (i.e. there is a moment in time after which the state of the network never changes). We also say that the network stops on x.

The network output is a vector of the values of all output neurons in the network (i.e. an element of R^ℓ). Note that the network output keeps changing throughout the computation!

MLP uses the following selection rule: in the i-th step, evaluate all neurons in the i-th layer.

Activity – semantics of a network

Definition: Consider a network with n neurons, k input and ℓ output. Let A ⊆ R^k and B ⊆ R^ℓ. Suppose that the network stops on every input of A. Then we say that the network computes a function F : A → B if for every network input x the vector F(x) ∈ B is the output of the network after the computation on x stops.

Example: the network in the figure computes a function from R^2 to R.
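The MLP selection rule reads directly as an algorithm: evaluate the layers in order, feeding each layer's outputs to the next. A minimal Python sketch (my own rendering; the weight-matrix layout is an assumption, with the bias stored as the first entry of each row):

```python
def step(xi):
    return 1 if xi >= 0 else 0

def mlp_forward(layers, x, activation=step):
    # Evaluate an MLP layer by layer (the MLP selection rule: in the
    # i-th step evaluate all neurons of the i-th layer).  `layers` is a
    # list of weight matrices; row j of a matrix is [w0, w1, ..., wn]
    # for neuron j of that layer, w0 being the bias weight.
    state = list(x)
    for weights in layers:
        state = [activation(row[0] + sum(w * s for w, s in zip(row[1:], state)))
                 for row in weights]
    return state  # final state of the output layer

# the AND neuron from before, viewed as a one-layer 2-1 "network"
print(mlp_forward([[[-2, 1, 1]]], (1, 1)))   # [1]
```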
Activity – inner potential and activation functions

In order to specify the activity of the network, we need to specify how the inner potentials ξ are computed and what the activation functions σ are.

We assume (unless otherwise specified) that
ξ = w_0 + ∑_{i=1}^n w_i · x_i
where x = (x_1, ..., x_n) are the inputs of the neuron and w = (w_1, ..., w_n) are the weights.

There are special types of neural networks where the inner potential is computed differently, e.g. as a "distance" of the input from the weight vector:
ξ = ‖x − w‖
where ‖·‖ is a vector norm, typically Euclidean.

There are many activation functions; typical examples:
- Unit step function: σ(ξ) = 1 for ξ ≥ 0, σ(ξ) = 0 for ξ < 0.
- (Logistic) sigmoid: σ(ξ) = 1 / (1 + e^{−λ·ξ}), where λ ∈ R is a steepness parameter.
- Hyperbolic tangent: σ(ξ) = (1 − e^{−ξ}) / (1 + e^{−ξ}).
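For reference, the three activation functions transcribed into Python (a direct transcription of the formulas above; note that (1 − e^{−ξ})/(1 + e^{−ξ}) equals tanh(ξ/2)):

```python
import math

def unit_step(xi):
    return 1 if xi >= 0 else 0

def logistic_sigmoid(xi, lam=1.0):
    # sigma(xi) = 1 / (1 + e^(-lambda * xi)); lam is the steepness parameter
    return 1.0 / (1.0 + math.exp(-lam * xi))

def hyperbolic_tangent(xi):
    # sigma(xi) = (1 - e^(-xi)) / (1 + e^(-xi)), i.e. tanh(xi / 2)
    return (1.0 - math.exp(-xi)) / (1.0 + math.exp(-xi))
```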
Activity – XOR

Consider the two-layer network from the figure: two hidden neurons, one with weights (w_0, w_1, w_2) = (−1, 2, 2), the other with (3, −2, −2), and an output neuron with weights (−2, 1, 1). The activation function of every neuron is the unit step function σ(ξ) = 1 for ξ ≥ 0, σ(ξ) = 0 for ξ < 0. Stepping through the computation for all four inputs (as the slides do) shows that the network computes XOR(x_1, x_2):

x_1  x_2 | y
 1    1  | 0
 1    0  | 1
 0    1  | 1
 0    0  | 0

Activity – MLP and linear separation

In the input space, the two hidden neurons correspond to two separating lines:
- the line P_1 is given by −1 + 2x_1 + 2x_2 = 0,
- the line P_2 is given by 3 − 2x_1 − 2x_2 = 0.
The output neuron fires exactly for the points on the positive sides of both lines, i.e. for (0,1) and (1,0).
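The XOR computation can be checked mechanically; a self-contained Python sketch with the weights read off the figure:

```python
def step(xi):
    return 1 if xi >= 0 else 0

def xor_net(x1, x2):
    y1 = step(-1 + 2 * x1 + 2 * x2)   # P1: fires unless both inputs are 0 (OR)
    y2 = step(3 - 2 * x1 - 2 * x2)    # P2: fires unless both inputs are 1 (NAND)
    return step(-2 + y1 + y2)         # fires iff both hidden neurons fire (AND)

for x1, x2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(x1, x2, xor_net(x1, x2))    # prints the XOR truth table
```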
Activity – example

Consider the cyclic network from the figure, with a single input (set to 1) and the unit step activation function σ(ξ) = 1 for ξ ≥ 0, σ(ξ) = 0 for ξ < 0. Stepping through the computation, the states of the three neurons evolve as (0,0,0) → (1,0,0) → (1,1,0) → (1,1,1) → (0,1,1) → ...; the computation of a cyclic network may never stop.

Learning

Consider a network with n neurons, k input and ℓ output.
- The configuration of a network is a vector of all the weight values. (The configurations of a network with m connections are elements of R^m.) The weight-space of a network is the set of all configurations.
- Initial configuration: weights can be initialized randomly or using some sophisticated algorithm.

Learning algorithms

A learning rule is a rule for weight adaptation. (The goal is to find a configuration in which the network computes a desired function.)
- Supervised learning: the desired function is described using training examples, which are pairs of the form (input, output). The learning algorithm searches for a configuration which "corresponds" to the training examples, typically by minimizing an error function.
- Unsupervised learning: the training set contains only inputs. The goal is to determine the distribution of the inputs (clustering, deep belief networks, etc.).

Supervised learning – illustration

Classification in the plane using a single neuron:
- training examples are of the form (point, value), where the value is either 1 or 0 depending on whether the point is of class A or B;
- the algorithm considers the examples one after another; whenever an incorrectly classified point is considered, the learning algorithm turns the separating line in the direction of the point.

Unsupervised learning – illustration

We search for two centres of clusters. In the figure, the red crosses correspond to potential centres before the application of the learning algorithm, the green ones after the application.

Summary – Advantages of neural networks

- Massive parallelism: neurons can be evaluated in parallel.
- Learning: many sophisticated learning algorithms are used to "program" neural networks.
- Generalization and robustness: information is encoded in a distributed manner in the weights; "close" inputs typically get similar values.
- Graceful degradation: damage typically causes only a decrease in the precision of results.

Boolean functions

Throughout this part, the activation function is the unit step function σ(ξ) = 1 for ξ ≥ 0, σ(ξ) = 0 for ξ < 0 (recall the formal neuron with bias).

Single neurons implement the basic Boolean functions:
- y = AND(x_1, ..., x_n): weights w_1 = ... = w_n = 1, bias w_0 = −n;
- y = OR(x_1, ..., x_n): weights w_1 = ... = w_n = 1, bias w_0 = −1;
- y = NOT(x_1): weight w_1 = −1, bias w_0 = 0.

Theorem: Let σ be the unit step function. Two-layer MLPs, where each neuron has σ as the activation function, are able to compute all functions of the form F : {0, 1}^n → {0, 1}.

Proof: Given a vector v = (v_1, ..., v_n) ∈ {0, 1}^n, consider a neuron N_v whose output is 1 iff the input is v, with weights
w_i = 1 if v_i = 1, w_i = −1 if v_i = 0, and w_0 = −∑_{i=1}^n v_i w_i.
Now connect the outputs of all neurons N_v satisfying F(v) = 1 using a neuron implementing OR. □ (A code sketch of this construction follows.)
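The proof is constructive and can be turned into code. A Python sketch of the construction (my own rendering of the proof; like the proof, it is exponential in n):

```python
from itertools import product

def step(xi):
    return 1 if xi >= 0 else 0

def two_layer_net(F, n):
    """Build the two-layer network from the proof for F: {0,1}^n -> {0,1}."""
    hidden = []
    for v in product([0, 1], repeat=n):
        if F(v):
            w = [1 if vi == 1 else -1 for vi in v]       # w_i as in the proof
            w0 = -sum(vi * wi for vi, wi in zip(v, w))   # N_v fires exactly on v
            hidden.append((w0, w))
    def net(x):
        ys = [step(w0 + sum(wi * xi for wi, xi in zip(w, x))) for w0, w in hidden]
        return step(-1 + sum(ys))                        # the OR neuron
    return net

xor = two_layer_net(lambda v: v[0] ^ v[1], 2)
print([xor(v) for v in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # [0, 1, 1, 0]
```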
Non-linear separation

Consider a three-layer network in which each neuron has the unit step activation function. The network divides the input space into two subspaces according to the output (0 or 1):
- The first (hidden) layer divides the input space into half-spaces.
- The second layer may e.g. make intersections of the half-spaces ⇒ convex sets.
- The third layer may e.g. make unions of some convex sets.

Non-linear separation – illustration

Three-layer networks (each neuron with the unit step activation) are capable of "approximating" any "reasonable" subset A of the input space R^k:
- Cover A with hypercubes (in 2D squares, in 3D cubes, ...).
- Each hypercube K can be separated using a two-layer network N_K (i.e. the function computed by N_K gives 1 for points in K and 0 for the rest).
- Finally, connect the outputs of the nets N_K satisfying K ∩ A ≠ ∅ using a neuron implementing OR.

Non-linear separation – sigmoid

Theorem (Cybenko 1989 – informal version): Let σ be a continuous function which is sigmoidal, i.e. satisfies σ(x) → 1 for x → +∞ and σ(x) → 0 for x → −∞. For every reasonable set A ⊆ [0, 1]^n there is a two-layer network, where each hidden neuron has the activation function σ (output neurons are linear), that satisfies the following: for most vectors v ∈ [0, 1]^n we have that v ∈ A iff the network output is > 0 for the input v.
For the mathematically oriented:
- "reasonable" means Lebesgue measurable;
- "most" means that the set of incorrectly classified vectors has Lebesgue measure smaller than a given ε > 0.

Non-linear separation – practical illustration

ALVINN drives a car:
- The net has 30 × 32 = 960 inputs (the input space is thus R^960).
- Input values correspond to the shades of gray of pixels.
- Output neurons "classify" images of the road based on their "curvature".
[Image source: http://jmvidal.cse.sc.edu/talks/ann/alvin.html]

Function approximation – three layers

Let σ be the logistic sigmoid, i.e. σ(ξ) = 1 / (1 + e^{−ξ}). For every continuous function f : [0, 1]^n → [0, 1] and every ε > 0 there is a three-layer network computing a function F : [0, 1]^n → [0, 1] such that
- there is a linear activation in the output layer, i.e. the value of the output neuron is its inner potential ξ,
- the remaining neurons have the logistic sigmoid σ as their activation,
- for every v ∈ [0, 1]^n we have |F(v) − f(v)| < ε.

[Figure: a three-layer network computing a weighted sum of localized "spikes"; a "spike" is built from steep sigmoids plus the other two 90-degree rotations; hidden activations σ(ξ) = 1/(1 + e^{−ξ}), output activation ζ(ξ) = ξ. A one-dimensional code illustration follows below.]

Function approximation – two-layer networks

Theorem (Cybenko 1989): Let σ be a continuous function which is sigmoidal, i.e. is increasing and satisfies σ(x) → 1 for x → +∞ and σ(x) → 0 for x → −∞. For every continuous function f : [0, 1]^n → [0, 1] and every ε > 0 there is a function F : [0, 1]^n → [0, 1] computed by a two-layer network, where each hidden neuron has the activation function σ (output neurons are linear), satisfying
|f(v) − F(v)| < ε for every v ∈ [0, 1]^n.
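One way to build intuition for these approximation theorems (an illustration of the "spike" idea from the figure above, not a proof): the difference of two steep sigmoids is a localized bump, and a weighted sum of such bumps traces out a target function. A one-dimensional Python sketch:

```python
import math

def sigmoid(xi):
    return 1.0 / (1.0 + math.exp(-xi))

def bump(x, a, b, steepness=50.0):
    # difference of two steep sigmoids: roughly 1 on [a, b], roughly 0 outside
    return sigmoid(steepness * (x - a)) - sigmoid(steepness * (x - b))

def approx(x, pieces=20):
    # crude approximation of f(x) = x^2 on [0, 1] by a weighted sum of bumps
    total = 0.0
    for j in range(pieces):
        a, b = j / pieces, (j + 1) / pieces
        c = ((a + b) / 2) ** 2          # value of f at the midpoint of the piece
        total += c * bump(x, a, b)
    return total

print(approx(0.5), 0.5 ** 2)   # close to 0.25
```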
Neural networks and computability

Consider recurrent networks (i.e. containing cycles)
- with real weights (in general);
- with one input neuron and one output neuron (the network computes a function F : A → R, where A ⊆ R contains all inputs on which the network stops);
- with the parallel activity rule (the output values of all neurons are recomputed in every step);
- with the saturated linear activation function σ(ξ) = 1 for ξ ≥ 1, σ(ξ) = ξ for 0 ≤ ξ < 1, and σ(ξ) = 0 for ξ < 0.

We encode words ω ∈ {0, 1}^+ into numbers as follows:
δ(ω) = ∑_{i=1}^{|ω|} ω(i)/2^i + 1/2^{|ω|+1}
E.g. ω = 11001 gives δ(ω) = 1/2 + 1/2^2 + 1/2^5 + 1/2^6 (= 0.110011 in binary form; checked in code below).

A network recognizes a language L ⊆ {0, 1}^+ if it computes a function F : A → R (A ⊆ R) such that ω ∈ L iff δ(ω) ∈ A and F(δ(ω)) > 0.

Recurrent networks with rational weights are equivalent to Turing machines:
- For every recursively enumerable language L ⊆ {0, 1}^+ there is a recurrent network with rational weights and fewer than 1000 neurons which recognizes L.
- The halting problem is undecidable for networks with at least 25 neurons and rational weights.
- There is a "universal" network (an equivalent of the universal Turing machine).

Recurrent networks are super-Turing powerful:
- For every language L ⊆ {0, 1}^+ there is a recurrent network (with real weights) with fewer than 1000 neurons which recognizes L.
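The encoding δ used above can be checked exactly with Python fractions (an added illustration of the definition):

```python
from fractions import Fraction

def delta(word):
    # delta(w) = sum_i w(i)/2^i + 1/2^(|w|+1)
    value = sum(Fraction(int(bit), 2 ** (i + 1)) for i, bit in enumerate(word))
    return value + Fraction(1, 2 ** (len(word) + 1))

print(delta("11001"))   # 51/64, i.e. 0.110011 in binary
```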
Summary of theoretical results

Neural networks are very strong from the point of view of theory:
- All Boolean functions can be expressed using two-layer networks.
- Two-layer networks may approximate any continuous function.
- Recurrent networks are at least as strong as Turing machines.

These results are purely theoretical! "Theoretical" networks are extremely huge, and it is very difficult to handcraft them even for the simplest problems. From the practical point of view, the most important advantages of neural networks are: learning, generalization, robustness.

Neural networks vs classical computers

             | Neural networks                               | Classical computers
Data         | implicitly in weights                         | explicitly
Computation  | naturally parallel                            | sequential, localized
Robustness   | robust w.r.t. input corruption & damage       | changing one bit may completely crash the computation
Precision    | imprecise; the network recalls a training example "similar" to the input | (typically) precise
Programming  | learning                                      | manual

History of neurocomputers

1951: SNARC (Minsky et al.)
- the first implementation of a neural network
- a rat strives to exit a maze
- 40 artificial neurons (300 vacuum tubes, engines, etc.)

1957: Mark I Perceptron (Rosenblatt et al.) – the first successful network for image recognition
- single-layer network
- images represented by 20 × 20 photocells
- the intensity of pixels was treated as input to a perceptron (basically the formal neuron), which recognized figures
- weights were implemented using potentiometers, each set by its own engine
- it was possible to arbitrarily reconnect inputs to neurons to demonstrate adaptability

1960: ADALINE (Widrow & Hoff)
- single-layer neural network
- weights stored in a newly invented electronic component, the memistor, which remembers the history of electric current in the form of resistance
- Widrow founded a company, Memistor Corporation, which sold implementations of neural networks

1960-66: several companies concerned with neural networks were founded.

1967-82: dead still after the publication of the book by Minsky & Papert (published 1969, titled Perceptrons).

1983-end of the 90s: revival of neural networks
- many attempts at hardware implementations: application-specific chips (ASIC), programmable hardware (FPGA)
- hw implementations typically not better than "software" implementations on universal computers (problems with weight storage, size, speed, cost of production, etc.)

end of the 90s-cca 2005: NN suppressed by other machine learning methods (support vector machines (SVM))

2006-now: the boom of neural networks!
- deep networks – often better than any other method
- GPU implementations
- ... some specialized hw implementations (Google's TPU)

History in waves ...
[Figure: two of the three historical waves of artificial neural nets research, as measured by the frequency of the phrases "cybernetics" and "connectionism" or "neural networks" according to Google Books (the third wave is too recent to appear).]

Current hardware – What do we face?

Increasing dataset size ... and thus increasing size of neural networks:
2. ADALINE
4. Early back-propagation network (Rumelhart et al., 1986b)
8. Image recognition: LeNet-5 (LeCun et al., 1998b)
10. Dimensionality reduction: deep belief network (Hinton et al., 2006) ... here the third "wave" of neural networks started
15. Digit recognition: GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
18. Image recognition (AlexNet): multi-GPU convolutional network (Krizhevsky et al., 2012)
20. Image recognition: GoogLeNet (Szegedy et al., 2014a)

... as a reward we get this:
[Figure: since deep networks reached the scale necessary to compete in the ImageNet Large Scale Visual Recognition Challenge, they have consistently won the competition every year and yielded lower and lower error rates each time. Data from Russakovsky et al. (2014b) and He et al. (2015).]

Current hardware

In 2012, Google trained a large network with 1.7 billion weights and 9 layers.
- The task was image recognition (10 million YouTube video frames).
- The hw comprised a network of 1,000 computers (16,000 cores); the computation took three days.

In 2014, a similar task was performed on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI.
- Able to train networks with 1 billion parameters on just 3 machines in a couple of days.
- Able to scale to 11 billion weights (approx. 6.5 times larger than the Google model) on 16 GPUs.

Current hardware – NVIDIA DGX Station
- 4x GPU (Tesla V100), 480 TFLOPS
- GPU memory: 64 GB total
- NVIDIA Tensor Cores: 2,560; NVIDIA CUDA Cores: 20,480
- System memory: 256 GB
- Network: dual 10 Gb LAN
- NVIDIA Deep Learning SDK

Current software

- TensorFlow (Google): an open source software library for numerical computation using data flow graphs; allows implementation of most current neural networks; allows computation on multiple devices (CPUs, GPUs, ...); Python API. Keras: a library on top of TensorFlow that allows easy description of most modern neural networks.
- CNTK (Microsoft): functionality similar to TensorFlow; a special input language called BrainScript.
- Theano: the "academic" grand-daddy of deep-learning frameworks, written in Python. Strongly inspired TensorFlow (some people developing Theano moved on to develop TensorFlow).
- There are others: Caffe, Torch (Facebook), Deeplearning4j, ...

Current software – Keras
[The slide shows a Keras code example; an illustrative stand-in follows.]

Other software implementations

Most "mathematical" software packages contain some support for neural networks: MATLAB, STATISTICA, Weka, ... The implementations are typically not on par with the previously mentioned dedicated deep-learning libraries.
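The original Keras slide is a code screenshot that did not survive extraction; as a stand-in, here is a small illustrative Keras model (a generic MLP for MNIST-sized data, not the code from the slide; Keras 2 API):

```python
from keras.models import Sequential
from keras.layers import Dense

# a small MLP: 784 inputs, one hidden layer, 10-class softmax output
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dense(10, activation='softmax'))

model.compile(optimizer='sgd',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# model.fit(x_train, y_train, batch_size=128, epochs=5)  # given suitable data
```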
SyNAPSE (USA)

- A big research project, partially funded by DARPA. Among the main subjects are IBM and HRL, in collaboration with top US universities, e.g. Boston, Stanford.
- The project started in 2008; tens of millions USD have been invested.
- The goal: develop a neural network comparable with the real brain of a mammal. The resulting hw chip should simulate 10 billion neurons and 100 trillion synaptic connections, consume 1 kilowatt (∼ a small heater), with a size of 2 dm³.
- Oriented towards the development of a new parallel computer architecture rather than neuroscience.

SyNAPSE (USA) – some results

A cat brain simulation (2009):
- A simulation of a network with 10^9 neurons and 10^13 synapses.
- Simulated on the supercomputer Dawn (Blue Gene/P): 147,450 CPUs, 144 TB of memory.
- 643 times slower than the real brain.
- The network was modelled according to the real-brain structure (a hierarchical model of a visual cortex, 4 layers).
- The authors claim that they observed some behaviour similar to the behaviour of the real brain (signal propagation, α, γ waves).
- ... the simulation was heavily criticised (see later) ...
- ... in 2012 the number of neurons was increased to 530 billion neurons and 100 trillion synapses.

SyNAPSE (USA) – TrueNorth

- A chip with 5.4 billion elements.
- 4096 neurosynaptic cores connected by a network, implementing 1 million programmable "spiking" neurons and 256 million programmable synaptic connections.
- Global frequency 1 kHz; low energy consumption, approx. 63 mW.
- Offline learning; implements some known algorithms (convolutional networks, RBM, etc.).
- Applied to simple image recognition tasks.

Human Brain Project, HBP (Europe)

- Funded by the EU; budget 10^9 EUR for 10 years.
- Successor of the Blue Brain Project at EPFL Lausanne: Blue Brain started in 2005 and ended in 2012; the Human Brain Project started in 2013.
- The original goal: deeper understanding of the human brain, networking in neuroscience, diagnosis of brain diseases, a thinking machine.
- The approach: study of brain tissue using current technology, modelling of biological neurons, simulation of the models (the program NEURON).

HBP (Europe)

Blue Brain Project (2008):
- A model of a part of the brain cortex of a rat (approx. 10,000 neurons), with a much more complex model of neurons than in SyNAPSE.
- Simulated on a supercomputer of the type Blue Gene/P (provided by IBM at a discount): 16,384 CPUs, 56 teraflops, 16 terabytes of memory, 1 PB of disk space.
- The simulation was 300x slower than the real brain.
Human Brain Project (2015): a simplified model of the nervous system of a rat (approx. 200,000 neurons).

SyNAPSE vs HBP

2011: "IBM Simulates 4.5 percent of the Human Brain, and All of the Cat Brain" (Scientific American)
"... performed the first near real-time cortical simulation of the brain that exceeds the scale of a cat cortex" (IBM)
This announcement was heavily criticised by Dr. Markram, the head of HBP:
- "This is a mega public relations stunt – a clear case of scientific deception of the public."
- "Their so called 'neurons' are the tiniest of points you can imagine, a microscopic dot."
- "Neurons contain 10's of thousands of proteins that form a network with 10's of millions of interactions. These interactions are incredibly complex and will require solving millions of differential equations. They have none of that."
- "Eugene Izhikevich himself already in 2005 ran a simulation with 100 billion such points interacting just for the fun of it (over 60 times larger than Modha's simulation)."
- Why did they get the Gordon Bell Prize? "They seem to have been very successful in influencing the committee with their claim, which technically is not peer-reviewed by the respective community and is neuroscientifically outrageous."
- But is there any innovation here? "The only innovation here is that IBM has built a large supercomputer."
- But did Modha not collaborate with neuroscientists? "I would be very surprised if any neuroscientists that he may have had in his DARPA consortium realized he was going to make such an outrageous claim. I can't imagine that the San Francisco neuroscientists knew he was going to make such a stupid claim. Modha himself is a software engineer with no knowledge of the brain."

... and in the meantime in Europe

In 2014, the European Commission received an open letter signed by more than 130 heads of laboratories demanding a substantial change in the management of the whole project.
Peter Dayan, director of the computational neuroscience unit at UCL:
“The main apparent goal of building the capacity to construct a larger-scale simulation of the human brain is radically premature.”
“We are left with a project that can’t but fail from a scientific perspective. It is a waste of money, it will suck out funds from valuable neuroscience research, and would leave the public, who fund this work, justifiably upset.”
76
... and in 2016
The European Commission and the Human Brain Project Coordinator, the École Polytechnique Fédérale de Lausanne (EPFL), have signed the first Specific Grant Agreement (SGA1), releasing EUR 89 million in funding retroactively from 1st April 2016 until the end of March 2018. The signature of SGA1 means that the HBP and the European Commission have agreed on the HBP Work Plan for this two-year period. The SGA1 work plan will move the Project closer to achieving its aim of establishing a cutting-edge, ICT-based scientific Research Infrastructure for brain research, cognitive neuroscience and brain-inspired computing.
77
ADALINE
Architecture:
(Diagram: inputs $x_1, \ldots, x_n$ together with the formal unit input $x_0 = 1$, weights $w_0, w_1, \ldots, w_n$, and a single output $y$.)
$\vec{w} = (w_0, w_1, \ldots, w_n)$ and $\vec{x} = (x_0, x_1, \ldots, x_n)$ where $x_0 = 1$.
Activity:
inner potential: $\xi = w_0 + \sum_{i=1}^{n} w_i x_i = \sum_{i=0}^{n} w_i x_i = \vec{w} \cdot \vec{x}$
activation function: $\sigma(\xi) = \xi$
network function: $y[\vec{w}](\vec{x}) = \sigma(\xi) = \vec{w} \cdot \vec{x}$
78
ADALINE
Learning: Given a training set $T = \{ (\vec{x}_1, d_1), (\vec{x}_2, d_2), \ldots, (\vec{x}_p, d_p) \}$.
Here $\vec{x}_k = (x_{k0}, x_{k1}, \ldots, x_{kn}) \in \mathbb{R}^{n+1}$, $x_{k0} = 1$, is the k-th input, and $d_k \in \mathbb{R}$ is the expected output.
Intuition: The network is supposed to compute an affine approximation of the function (some of) whose values are given in the training set.
79
Oaks in Wisconsin (figure)
80
ADALINE
Error function:
$E(\vec{w}) = \frac{1}{2} \sum_{k=1}^{p} \left( \vec{w} \cdot \vec{x}_k - d_k \right)^2 = \frac{1}{2} \sum_{k=1}^{p} \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right)^2$
The goal is to find $\vec{w}$ which minimizes $E(\vec{w})$.
81
Error function (figure)
82
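To make the definitions above concrete, here is a minimal sketch of the ADALINE network function and the error $E(\vec{w})$ (Python with NumPy; the function and variable names and the toy data are mine, not from the lecture):

```python
import numpy as np

def adaline_output(w, x):
    # network function y[w](x) = w . x ; x[0] is the formal unit input 1
    return np.dot(w, x)

def error(w, X, d):
    # E(w) = 1/2 * sum_k (w . x_k - d_k)^2
    residuals = X @ w - d            # one residual per training example
    return 0.5 * np.sum(residuals ** 2)

# toy training set: p = 3 examples, n = 2 real inputs, x_{k0} = 1
X = np.array([[1.0, 0.5, 1.0],
              [1.0, 2.0, 0.0],
              [1.0, 1.0, 1.0]])
d = np.array([1.0, 0.0, 1.0])
print(error(np.zeros(3), X, d))      # error of the zero weight vector
```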
Gradient of the error function
Consider the gradient of the error function:
$\nabla E(\vec{w}) = \left( \frac{\partial E}{\partial w_0}(\vec{w}), \ldots, \frac{\partial E}{\partial w_n}(\vec{w}) \right)$
Intuition: $\nabla E(\vec{w})$ is a vector in the weight space which points in the direction of the steepest ascent of the error function. Note that the vectors $\vec{x}_k$ are just parameters of the function $E$ and are thus fixed!
Fact: If $\nabla E(\vec{w}) = \vec{0} = (0, \ldots, 0)$, then $\vec{w}$ is a global minimum of $E$.
For ADALINE, the error function $E(\vec{w})$ is a convex paraboloid and thus has a unique global minimum.
83
Gradient – illustration
Caution! This picture just illustrates the notion of gradient ... it is not the convex paraboloid $E(\vec{w})$!
84
Gradient of the error function (ADALINE)
$\frac{\partial E}{\partial w_\ell}(\vec{w}) = \frac{1}{2} \sum_{k=1}^{p} \frac{\partial}{\partial w_\ell} \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right)^2 = \frac{1}{2} \sum_{k=1}^{p} 2 \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right) \cdot \frac{\partial}{\partial w_\ell} \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right) = \sum_{k=1}^{p} \left( \vec{w} \cdot \vec{x}_k - d_k \right) x_{k\ell}$
Thus
$\nabla E(\vec{w}) = \left( \frac{\partial E}{\partial w_0}(\vec{w}), \ldots, \frac{\partial E}{\partial w_n}(\vec{w}) \right) = \sum_{k=1}^{p} \left( \vec{w} \cdot \vec{x}_k - d_k \right) \vec{x}_k$
85
ADALINE – learning
Batch algorithm (gradient descent):
Idea: In every step, "move" the weights in the direction opposite to the gradient.
The algorithm computes a sequence of weight vectors $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \ldots$
weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
in the step t+1, weights $\vec{w}^{(t+1)}$ are computed as follows:
$\vec{w}^{(t+1)} = \vec{w}^{(t)} - \varepsilon \cdot \nabla E(\vec{w}^{(t)}) = \vec{w}^{(t)} - \varepsilon \cdot \sum_{k=1}^{p} \left( \vec{w}^{(t)} \cdot \vec{x}_k - d_k \right) \cdot \vec{x}_k$
Here $0 < \varepsilon \le 1$ is a learning rate. (The whole training set is used in every step.)
Proposition: For sufficiently small ε > 0, the sequence $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \ldots$ converges (componentwise) to the global minimum of $E$ (i.e. to the vector $\vec{w}$ satisfying $\nabla E(\vec{w}) = \vec{0}$).
86
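A minimal sketch of the batch algorithm (Python/NumPy with my own naming; the learning rate, the number of steps and the toy data are illustrative choices, not prescribed by the lecture):

```python
import numpy as np

def adaline_batch_gd(X, d, eps=0.01, steps=1000):
    """Batch gradient descent for ADALINE.
    X: p x (n+1) inputs with X[:, 0] == 1; d: p desired outputs."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.1, 0.1, size=X.shape[1])  # init close to 0
    for _ in range(steps):
        grad = (X @ w - d) @ X                   # sum_k (w.x_k - d_k) x_k
        w = w - eps * grad                       # move against the gradient
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
d = np.array([1.0, 2.0, 3.0])                    # exactly affine: d = 1 + x
print(adaline_batch_gd(X, d))                    # approaches [1, 1]
```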
ADALINE – Animation
(A sequence of figure slides animating the gradient descent.)
87
ADALINE – learning
Online algorithm (Delta-rule, Widrow-Hoff rule):
weights in $\vec{w}^{(0)}$ are initialized randomly close to 0
in the step t+1, weights $\vec{w}^{(t+1)}$ are computed as follows:
$\vec{w}^{(t+1)} = \vec{w}^{(t)} - \varepsilon(t) \cdot \left( \vec{w}^{(t)} \cdot \vec{x}_k - d_k \right) \cdot \vec{x}_k$
Here $k = (t \bmod p) + 1$ and $0 < \varepsilon(t) \le 1$ is a learning rate in the step t+1.
Note that the algorithm does not work with the complete gradient but only with its part determined by the currently considered training example.
Theorem (Widrow & Hoff): If $\varepsilon(t) = \frac{1}{t}$, then $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \ldots$ converges to the global minimum of $E$.
88
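The online (Widrow-Hoff) rule differs from the batch version only in that each step uses a single training example; a sketch under the same assumptions as before:

```python
import numpy as np

def adaline_online(X, d, steps=5000):
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.1, 0.1, size=X.shape[1])
    p = X.shape[0]
    for t in range(steps):
        k = t % p                    # cycle through examples: k = (t mod p) + 1
        eps = 1.0 / (t + 1)          # Widrow-Hoff rate eps(t) = 1/t
        w = w - eps * (np.dot(w, X[k]) - d[k]) * X[k]
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
d = np.array([1.0, 2.0, 3.0])
print(adaline_online(X, d))
```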
ADALINE – classification
How to use the ADALINE for classification?
The training set is $T = \{ (\vec{x}_1, d_1), (\vec{x}_2, d_2), \ldots, (\vec{x}_p, d_p) \}$ where $\vec{x}_k = (x_{k0}, x_{k1}, \ldots, x_{kn}) \in \mathbb{R}^{n+1}$ and $d_k \in \{1, -1\}$. Here $d_k$ determines a class.
Train the network using the ADALINE algorithm.
We may expect the following:
if $d_k = 1$, then $\vec{w} \cdot \vec{x}_k \ge 0$
if $d_k = -1$, then $\vec{w} \cdot \vec{x}_k < 0$
This does not always have to be true, but if the training set is reasonably linearly separable, then the algorithm typically gives satisfactory results.
89
Architecture – Multilayer Perceptron (MLP)
(Diagram: Input, Hidden, and Output layers; inputs $x_1, x_2$, outputs $y_1, y_2$.)
Neurons are partitioned into layers: one input layer, one output layer, and possibly several hidden layers.
Layers are numbered from 0; the input layer has number 0. E.g. a three-layer network has two hidden layers and one output layer.
Neurons in the i-th layer are connected with all neurons in the (i+1)-st layer.
The architecture of an MLP is typically described by the numbers of neurons in individual layers (e.g. 2-4-3-2).
90
MLP – architecture
Notation: Denote
X a set of input neurons
Y a set of output neurons
Z a set of all neurons (X, Y ⊆ Z)
individual neurons are denoted by indices i, j, etc.
$\xi_j$ is the inner potential of the neuron j after the computation stops
$y_j$ is the output of the neuron j after the computation stops (we define $y_0 = 1$ as the value of the formal unit input)
$w_{ji}$ is the weight of the connection from i to j (in particular, $w_{j0}$ is the weight of the connection from the formal unit input, i.e. $w_{j0} = -b_j$ where $b_j$ is the bias of the neuron j)
$j_{\leftarrow}$ is the set of all i such that j is adjacent from i (i.e. there is an arc to j from i)
$j_{\rightarrow}$ is the set of all i such that j is adjacent to i (i.e. there is an arc from j to i)
91
MLP – activity
Activity:
inner potential of neuron j: $\xi_j = \sum_{i \in j_{\leftarrow}} w_{ji} y_i$
activation function $\sigma_j$ for neuron j (arbitrary differentiable) [e.g. logistic sigmoid $\sigma_j(\xi) = \frac{1}{1 + e^{-\lambda_j \xi}}$]
state of a non-input neuron $j \in Z \setminus X$ after the computation stops: $y_j = \sigma_j(\xi_j)$
($y_j$ depends on the configuration $\vec{w}$ and the input $\vec{x}$, so we sometimes write $y_j(\vec{w}, \vec{x})$)
The network computes a function from $\mathbb{R}^{|X|}$ to $\mathbb{R}^{|Y|}$.
Layer-wise computation: First, all input neurons are assigned the values of the input. In the ℓ-th step, all neurons of the ℓ-th layer are evaluated.
92
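A sketch of the layer-wise forward computation for a fully connected MLP (Python/NumPy; one weight matrix per layer, with the bias folded in as the weight from the formal unit input — the representation is my choice, not the lecture's):

```python
import numpy as np

def sigmoid(xi, lam=1.0):
    # logistic sigmoid sigma(xi) = 1 / (1 + exp(-lambda * xi))
    return 1.0 / (1.0 + np.exp(-lam * xi))

def forward(weights, x):
    """weights[l] has shape (n_{l+1}, n_l + 1); column 0 holds w_{j0} = -b_j."""
    y = np.asarray(x, dtype=float)
    for W in weights:
        y_ext = np.concatenate(([1.0], y))  # prepend the formal unit input y_0 = 1
        y = sigmoid(W @ y_ext)              # inner potentials -> next-layer outputs
    return y

# a 2-4-3-2 network with small random weights
rng = np.random.default_rng(0)
weights = [rng.uniform(-0.1, 0.1, (4, 3)),
           rng.uniform(-0.1, 0.1, (3, 5)),
           rng.uniform(-0.1, 0.1, (2, 4))]
print(forward(weights, [0.5, -1.0]))
```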
MLP – learning
Learning: Given a training set T of the form $\{ (\vec{x}_k, \vec{d}_k) \mid k = 1, \ldots, p \}$.
Here, every $\vec{x}_k \in \mathbb{R}^{|X|}$ is an input vector and every $\vec{d}_k \in \mathbb{R}^{|Y|}$ is the desired network output. For every $j \in Y$, denote by $d_{kj}$ the desired output of the neuron j for a given network input $\vec{x}_k$ (the vector $\vec{d}_k$ can be written as $(d_{kj})_{j \in Y}$).
Error function:
$E(\vec{w}) = \sum_{k=1}^{p} E_k(\vec{w})$ where $E_k(\vec{w}) = \frac{1}{2} \sum_{j \in Y} \left( y_j(\vec{w}, \vec{x}_k) - d_{kj} \right)^2$
93
MLP – learning algorithm
Batch algorithm (gradient descent):
The algorithm computes a sequence of weight vectors $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \ldots$
weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
in the step t+1 (here t = 0, 1, 2, ...), weights $\vec{w}^{(t+1)}$ are computed as follows:
$w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$
where $\Delta w_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t)})$ is the weight update of $w_{ji}$ in step t+1 and $0 < \varepsilon(t) \le 1$ is a learning rate in step t+1.
Note that $\frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t)})$ is a component of the gradient $\nabla E$, i.e. the weight update can be written as $\vec{w}^{(t+1)} = \vec{w}^{(t)} - \varepsilon(t) \cdot \nabla E(\vec{w}^{(t)})$.
94
MLP – error function gradient
For every $w_{ji}$ we have
$\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$
where for every k = 1, ..., p it holds that
$\frac{\partial E_k}{\partial w_{ji}} = \frac{\partial E_k}{\partial y_j} \cdot \sigma_j'(\xi_j) \cdot y_i$
and for every $j \in Z \setminus X$ we get
$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj}$ for $j \in Y$
$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj}$ for $j \in Z \setminus (Y \cup X)$
(Here all $y_j$ are in fact $y_j(\vec{w}, \vec{x}_k)$.)
95
MLP – error function gradient
If $\sigma_j(\xi) = \frac{1}{1 + e^{-\lambda_j \xi}}$ for all $j \in Z$, then $\sigma_j'(\xi_j) = \lambda_j y_j (1 - y_j)$
and thus for all $j \in Z \setminus X$:
$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj}$ for $j \in Y$
$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \lambda_r y_r (1 - y_r) \cdot w_{rj}$ for $j \in Z \setminus (Y \cup X)$
If $\sigma_j(\xi) = a \cdot \tanh(b \cdot \xi)$ for all $j \in Z$, then $\sigma_j'(\xi_j) = \frac{b}{a}(a - y_j)(a + y_j)$
96
MLP – computing the gradient
Compute $\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$ as follows:
Initialize $\mathcal{E}_{ji} := 0$ (by the end of the computation: $\mathcal{E}_{ji} = \frac{\partial E}{\partial w_{ji}}$)
For every k = 1, ..., p do:
1. forward pass: compute $y_j = y_j(\vec{w}, \vec{x}_k)$ for all $j \in Z$
2. backward pass: compute $\frac{\partial E_k}{\partial y_j}$ for all $j \in Z$ using backpropagation (see the next slide!)
3. compute $\frac{\partial E_k}{\partial w_{ji}}$ for all $w_{ji}$ using $\frac{\partial E_k}{\partial w_{ji}} := \frac{\partial E_k}{\partial y_j} \cdot \sigma_j'(\xi_j) \cdot y_i$
4. $\mathcal{E}_{ji} := \mathcal{E}_{ji} + \frac{\partial E_k}{\partial w_{ji}}$
The resulting $\mathcal{E}_{ji}$ equals $\frac{\partial E}{\partial w_{ji}}$.
97
MLP – backpropagation
Compute $\frac{\partial E_k}{\partial y_j}$ for all $j \in Z$ as follows:
if $j \in Y$, then $\frac{\partial E_k}{\partial y_j} = y_j - d_{kj}$
if $j \in Z \setminus (Y \cup X)$, then, assuming that j is in the ℓ-th layer and that $\frac{\partial E_k}{\partial y_r}$ has already been computed for all neurons in the (ℓ+1)-st layer, compute
$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj}$
(This works because all neurons $r \in j_{\rightarrow}$ belong to the (ℓ+1)-st layer.)
98
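Steps 1-4, rendered as a compact vectorized sketch for a fully connected MLP with logistic sigmoid units (λ_j = 1, so σ' = y(1 − y)); this is my own matrix formulation matching the forward pass sketched earlier, not the lecture's pseudocode:

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def gradient(weights, X, D):
    """Accumulate dE/dw_{ji} over all training examples.
    weights[l]: (n_{l+1}, n_l + 1), column 0 is the bias weight w_{j0}.
    X: (p, |X|) inputs, D: (p, |Y|) desired outputs."""
    grads = [np.zeros_like(W) for W in weights]
    for x, d in zip(X, D):
        # 1. forward pass, remembering the outputs y of every layer
        ys = [np.asarray(x, dtype=float)]
        for W in weights:
            ys.append(sigmoid(W @ np.concatenate(([1.0], ys[-1]))))
        # 2. backward pass, from the output layer down
        dEdy = ys[-1] - d                            # dE_k/dy_j for j in Y
        for l in range(len(weights) - 1, -1, -1):
            y_out, y_in = ys[l + 1], ys[l]
            delta = dEdy * y_out * (1.0 - y_out)     # (dE_k/dy_j) * sigma'(xi_j)
            # 3.+4. accumulate dE_k/dw_{ji} = (dE_k/dy_j) * sigma'(xi_j) * y_i
            grads[l] += np.outer(delta, np.concatenate(([1.0], y_in)))
            # backpropagate dE_k/dy to the previous layer (bias column dropped)
            dEdy = weights[l][:, 1:].T @ delta
    return grads
```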
Complexity of the batch algorithm
Computation of the gradient $\frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t-1)})$ stops in time linear in the size of the network plus the size of the training set.
(assuming unit cost of operations, including the computation of $\sigma_r'(\xi_r)$ for a given $\xi_r$)
Proof sketch: The algorithm does the following p times:
1. forward pass, i.e. computes $y_j(\vec{w}, \vec{x}_k)$
2. backpropagation, i.e. computes $\frac{\partial E_k}{\partial y_j}$
3. computes $\frac{\partial E_k}{\partial w_{ji}}$ and adds it to $\mathcal{E}_{ji}$ (a constant time operation in the unit cost framework)
The steps 1.-3. take linear time.
Note that the speed of convergence of the gradient descent cannot be estimated ...
99
MLP – learning algorithm
Online algorithm:
The algorithm computes a sequence of weight vectors $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \ldots$
weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
in the step t+1 (here t = 0, 1, 2, ...), weights $\vec{w}^{(t+1)}$ are computed as follows:
$w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$
where $\Delta w_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E_k}{\partial w_{ji}}(\vec{w}^{(t)})$ is the weight update of $w_{ji}$ in the step t+1 and $0 < \varepsilon(t) \le 1$ is the learning rate in the step t+1.
There are other variants determined by the selection of the training examples used for the error computation (more on this later).
100
Illustration of the gradient descent – XOR
Source: Pattern Classification (2nd Edition); Richard O. Duda, Peter E. Hart, David G. Stork
101
Animation (sin(x), network 1-5-1)
(Figure slides: the network's fit after 1, 10, 20, 40, and 100 iterations.)
102
(Slides 103-105 recap slides 90, 91, and 93: the MLP architecture, the notation, and the training set with the error function $E(\vec{w}) = \sum_{k=1}^{p} E_k(\vec{w})$.)
103-105
MLP – batch learning
The algorithm computes a sequence of weight vectors $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \ldots$
weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
in the step t+1 (here t = 0, 1, 2, ...), weights $\vec{w}^{(t+1)}$ are computed as follows:
$\vec{w}^{(t+1)} = \vec{w}^{(t)} + \Delta \vec{w}^{(t)}$
Here $\Delta \vec{w}^{(t)} = -\varepsilon(t) \cdot \nabla E(\vec{w}^{(t)}) = -\varepsilon(t) \cdot \sum_{k=1}^{p} \nabla E_k(\vec{w}^{(t)})$
$0 < \varepsilon(t) \le 1$ is a learning rate in step t+1
$\nabla E(\vec{w}^{(t)})$ is the gradient of the error function
$\nabla E_k(\vec{w}^{(t)})$ is the gradient of the error function for the training example k
106
MLP – error functions
square error: $E(\vec{w}) = \sum_{k=1}^{p} E_k(\vec{w})$ where $E_k(\vec{w}) = \frac{1}{2} \sum_{j \in Y} \left( y_j(\vec{w}, \vec{x}_k) - d_{kj} \right)^2$
mean square error (mse): $E(\vec{w}) = \frac{1}{p} \sum_{k=1}^{p} E_k(\vec{w})$
I will use mse throughout the rest of this lecture.
107
MLP – mse gradient
For every $w_{ji}$ we have
$\frac{\partial E}{\partial w_{ji}} = \frac{1}{p} \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$
where for every k = 1, ..., p it holds that
$\frac{\partial E_k}{\partial w_{ji}} = \frac{\partial E_k}{\partial y_j} \cdot \sigma_j'(\xi_j) \cdot y_i$
and for every $j \in Z \setminus X$ we get
$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj}$ for $j \in Y$
$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj}$ for $j \in Z \setminus (Y \cup X)$
(Here all $y_j$ are in fact $y_j(\vec{w}, \vec{x}_k)$.)
108
Practical issues of gradient descent
Training efficiency:
What size of a minibatch?
How to choose the learning rate ε(t) and control SGD?
How to pre-process the inputs?
How to initialize weights?
How to choose desired output values of the network?
Quality of the resulting model:
When to stop training?
Regularization techniques.
How large a network?
For simplicity, I will illustrate the reasoning on MLP + mse. Later we will see other topologies and error functions with different but always somewhat related issues.
109
Issues in gradient descent
Lots of local minima where the descent gets stuck:
The model identifiability problem: swapping the incoming weights of neurons i and j leaves the same network topology – weight space symmetry.
Recent studies show that for sufficiently large networks all local minima have low values of the error function.
Saddle points: one can show (by a combinatorial argument) that larger networks have exponentially more saddle points than local minima.
110
Issues in gradient descent – too slow descent
flat regions: e.g. if the inner potentials are too large (in absolute value), then the derivatives of the activation functions are extremely small.
111
Issues in gradient descent – too fast descent
steep cliffs: the gradient is extremely large, the descent skips important weight vectors.
112
Issues in gradient descent – local vs global structure
What if we initialize on the left? (figure)
113
Issues in computing the gradient
vanishing and exploding gradients:
$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj}$ for $j \in Y$
$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma_r'(\xi_r) \cdot w_{rj}$ for $j \in Z \setminus (Y \cup X)$
inexact gradient computation: the minibatch gradient is only an estimate of the true gradient. Note that the standard deviation of the estimate is (roughly) $\sigma / \sqrt{m}$, where m is the size of the minibatch and σ is the standard deviation of the gradient estimate for a single training example. (E.g. a minibatch of size 10,000 means 100 times more computation than one of size 100, but gives only a 10 times smaller standard deviation.)
114
Minibatch size
Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
Multicore architectures are usually underutilized by extremely small batches.
If all examples in the batch are to be processed in parallel (as is the typical case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power-of-2 batch sizes to offer better runtime. Typical power-of-2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
Small batches can offer a regularizing effect, perhaps due to the noise they add to the learning process.
115
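A tiny simulation of the square-root trade-off above (pure NumPy; the toy numbers are mine): averaging m single-example values shrinks the spread only by a factor of √m.

```python
import numpy as np

rng = np.random.default_rng(0)
per_example = rng.normal(loc=1.0, scale=4.0, size=1_000_000)  # sigma = 4

for m in (100, 10_000):
    # 200 minibatch estimates of the mean, each from m examples
    est = [rng.choice(per_example, size=m).mean() for _ in range(200)]
    print(m, np.std(est))   # roughly sigma/sqrt(m): ~0.4 vs ~0.04
```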
Momentum
Issue in the gradient descent: $\nabla E(\vec{w}^{(t)})$ constantly changes direction (even though the error steadily decreases).
Solution: In every step, add the change made in the previous step (weighted by a factor α):
$\Delta \vec{w}^{(t)} = -\varepsilon(t) \cdot \sum_{k \in T} \nabla E_k(\vec{w}^{(t)}) + \alpha \cdot \Delta \vec{w}^{(t-1)}$
where 0 < α < 1.
116
Momentum – illustration (figure)
117
SGD with momentum
weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
in the step t+1 (here t = 0, 1, 2, ...), weights $\vec{w}^{(t+1)}$ are computed as follows:
Choose (randomly) a set of training examples $T \subseteq \{1, \ldots, p\}$
Compute $\vec{w}^{(t+1)} = \vec{w}^{(t)} + \Delta \vec{w}^{(t)}$ where $\Delta \vec{w}^{(t)} = -\varepsilon(t) \cdot \sum_{k \in T} \nabla E_k(\vec{w}^{(t)}) + \alpha \Delta \vec{w}^{(t-1)}$
$0 < \varepsilon(t) \le 1$ is a learning rate in step t+1
0 < α < 1 measures the "influence" of the momentum
$\nabla E_k(\vec{w}^{(t)})$ is the gradient of the error of the example k
Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing the minibatches sequentially.
118
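A sketch of the momentum update in isolation (Python/NumPy; the gradient comes from a toy ill-conditioned quadratic error of my own choosing, standing in for $\sum_{k \in T} \nabla E_k$):

```python
import numpy as np

def sgd_momentum(grad, w0, eps=0.01, alpha=0.9, steps=200):
    """grad(w) returns the (minibatch) gradient at w."""
    w = np.asarray(w0, dtype=float)
    dw = np.zeros_like(w)
    for _ in range(steps):
        dw = -eps * grad(w) + alpha * dw   # add the alpha-weighted previous change
        w = w + dw
    return w

# toy error E(w) = 0.5*(100*w1^2 + w2^2): plain gradient descent oscillates
# across the narrow valley; the momentum term damps the oscillation
grad = lambda w: np.array([100.0 * w[0], w[1]])
print(sgd_momentum(grad, [1.0, 1.0]))      # approaches [0, 0]
```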
Learning rate
Generic rules for the adaptation of ε(t):
Start with a larger learning rate (e.g. ε = 0.1); later decrease it as the descent is supposed to settle in a minimum of E. Some tools allow to set a list of learning rates, each rate for one epoch of the descent.
In case you may observe the error evolving: if the error decreases, increase the rate slightly; if the error increases, decrease the rate. Note that the error may increase for a short period without any harm to the convergence of the learning process.
119
AdaGrad
So far we have considered a uniform learning rate. It is better to have larger rates for weights with smaller updates and smaller rates for weights with larger updates.
AdaGrad uses an individually adapting learning rate for each weight.
120
SGD with AdaGrad
weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
in the step t+1 (here t = 0, 1, 2, ...), compute $\vec{w}^{(t+1)}$:
Choose (randomly) a minibatch $T \subseteq \{1, \ldots, p\}$
Compute $w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$
where
$\Delta w_{ji}^{(t)} = -\frac{\eta}{\sqrt{r_{ji}^{(t)}} + \delta} \cdot \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(\vec{w}^{(t)})$
and
$r_{ji}^{(t)} = r_{ji}^{(t-1)} + \left( \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(\vec{w}^{(t)}) \right)^2$
η is a constant expressing the influence of the learning rate, typically 0.01.
δ > 0 is a smoothing term (typically 1e-8) avoiding division by 0.
121
RMSProp
The main disadvantage of AdaGrad is the accumulation of the gradient throughout the whole learning process. In case the learning needs to get over several "hills" before settling in a deep "valley", the weight updates get far too small before getting to it.
RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.
122
SGD with RMSProp
weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
in the step t+1 (here t = 0, 1, 2, ...), compute $\vec{w}^{(t+1)}$:
Choose (randomly) a minibatch $T \subseteq \{1, \ldots, p\}$
Compute $w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$
where
$\Delta w_{ji}^{(t)} = -\frac{\eta}{\sqrt{r_{ji}^{(t)}} + \delta} \cdot \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(\vec{w}^{(t)})$
and
$r_{ji}^{(t)} = \rho \, r_{ji}^{(t-1)} + (1 - \rho) \left( \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(\vec{w}^{(t)}) \right)^2$
η is a constant expressing the influence of the learning rate (Hinton suggests ρ = 0.9 and η = 0.001).
δ > 0 is a smoothing term (typically 1e-8) avoiding division by 0.
123
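The two per-weight rules side by side (NumPy arrays holding all the $r_{ji}$ accumulators at once; g stands for the minibatch gradient $\sum_{k \in T} \partial E_k / \partial w_{ji}$, here faked by the same toy quadratic as before):

```python
import numpy as np

def adagrad_step(w, g, r, eta=0.01, delta=1e-8):
    r = r + g ** 2                        # accumulate squared gradients forever
    return w - eta / (np.sqrt(r) + delta) * g, r

def rmsprop_step(w, g, r, eta=0.001, rho=0.9, delta=1e-8):
    r = rho * r + (1.0 - rho) * g ** 2    # exponentially decaying average
    return w - eta / (np.sqrt(r) + delta) * g, r

grad = lambda w: np.array([100.0 * w[0], w[1]])   # toy gradient
w, r = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(1000):
    w, r = rmsprop_step(w, grad(w), r)
print(w)              # both coordinates shrink at a similar speed
```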
Other optimization methods
There are more methods, such as AdaDelta, Adam (roughly RMSProp combined with momentum), etc.
A natural question: Which algorithm should one choose?
Unfortunately, there is currently no consensus on this point. According to a recent study, the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly; no single best algorithm has emerged.
Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta and Adam. The choice of which algorithm to use, at this point, seems to depend largely on the user's familiarity with the algorithm.
124
Choice of (hidden) activations
Generic requirements imposed on activation functions:
1. differentiability (to do gradient descent)
2. non-linearity (linear multi-layer networks are equivalent to single-layer ones)
3. monotonicity (local extrema of activation functions induce local extrema of the error function)
4. "linearity" (i.e. preserve as much linearity as possible; linear models are easiest to fit; find the "minimum" non-linearity needed to solve a given task)
The choice of activation functions is closely related to input preprocessing and the initial choice of weights. I will illustrate the reasoning on sigmoidal functions and say a few words about other activation functions later.
125
Activation functions – tanh
$\sigma(\xi) = 1.7159 \cdot \tanh\left(\frac{2}{3} \xi\right)$; we have $\lim_{\xi \to \infty} \sigma(\xi) = 1.7159$ and $\lim_{\xi \to -\infty} \sigma(\xi) = -1.7159$; σ is almost linear on [−1, 1].
(Figure slides: plots of σ and of its first derivative.)
126-128
Input preprocessing
Some inputs may be much larger than others. E.g.: height vs weight of a person, maximum speed of a car (in km/h) vs its price (in CZK), etc.
Large inputs have a greater influence on the training than the small ones. In addition, too large inputs may slow down learning (saturation of activation functions).
Typical standardization:
mean = 0 (subtract the mean)
variance = 1 (divide by the standard deviation)
Here the mean and standard deviation may be estimated from the data (the training set). (Illustration of standard deviation.)
129
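A minimal standardization sketch (NumPy; the statistics are estimated on the training set and then reused for fresh data, which is the usual practice; the toy height/weight rows are mine):

```python
import numpy as np

def fit_standardizer(X_train):
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0                # guard against constant columns
    return mean, std

def standardize(X, mean, std):
    return (X - mean) / std              # per input: mean 0, variance 1

X_train = np.array([[180.0, 70.0],       # e.g. height (cm), weight (kg)
                    [160.0, 55.0],
                    [175.0, 80.0]])
mean, std = fit_standardizer(X_train)
print(standardize(X_train, mean, std))
```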
Input preprocessing
Individual inputs should not be correlated. Correlated inputs can be removed as a part of dimensionality reduction. (Dimensionality reduction and decorrelation can be implemented using neural networks. There are also standard methods such as PCA.)
130
Initial weights (for tanh)
Typically, the weights are chosen randomly from an interval [−w, w] where w depends on the number of inputs of a given neuron.
Consider the activation function $\sigma(\xi) = 1.7159 \cdot \tanh\left(\frac{2}{3} \xi\right)$ for all neurons:
σ is almost linear on [−1, 1],
extreme values of σ are close to −1 and 1,
σ saturates outside the interval [−4, 4] (i.e. it is close to its limit values and its derivative is close to 0).
Thus:
for too small w we may get an (almost) linear model;
for too large w (i.e. much larger than 1) the activations may get saturated and the learning will be very slow.
Hence, we want to choose w so that the inner potentials of neurons will be roughly in the interval [−1, 1].
131
Initial weights (for tanh)
Standardization gives mean = 0 and variance = 1 of the input data. Assume that the individual inputs are (almost) uncorrelated.
Consider a neuron j from the first layer with d inputs. Assume that its weights are chosen uniformly from [−w, w].
The rule: choose w so that the standard deviation of $\xi_j$ (denote it by $o_j$) is close to the border of the interval on which $\sigma_j$ is linear. In our case: $o_j \approx 1$.
Our assumptions imply: $o_j = \sqrt{\frac{d}{3}} \cdot w$. Thus we put $w = \frac{\sqrt{3}}{\sqrt{d}}$.
The same works for higher layers; d corresponds to the number of neurons in the layer one level lower.
132
Glorot & Bengio initialization
The previous heuristic for weight initialization ignores the variance of the gradient (i.e. it is concerned only with the "size" of activations in the forward pass).
Glorot & Bengio (2010) presented a normalized initialization, choosing w uniformly from the interval
$\left[ -\sqrt{\frac{6}{m+n}}, \sqrt{\frac{6}{m+n}} \right]$
Here m is the number of inputs to the neuron and n is the number of outputs of the neuron.
This is designed to compromise between the goal of initializing all layers to have the same activation variance and the goal of initializing all layers to have the same gradient variance. The formula is derived using the assumption that the network consists only of a chain of matrix multiplications, with no non-linearities. Real neural networks obviously violate this assumption, but many strategies designed for the linear model perform reasonably well on its non-linear counterparts.
133
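The two initialization rules side by side (NumPy; the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_tanh(d, n_out):
    # w = sqrt(3)/sqrt(d): inner potentials get std ~ 1 on standardized inputs
    w = np.sqrt(3.0) / np.sqrt(d)
    return rng.uniform(-w, w, size=(n_out, d))

def init_glorot(m, n):
    # Glorot & Bengio (2010): uniform on [-sqrt(6/(m+n)), sqrt(6/(m+n))]
    w = np.sqrt(6.0 / (m + n))
    return rng.uniform(-w, w, size=(n, m))

W1 = init_tanh(d=64, n_out=30)    # first hidden layer: 64 inputs, 30 neurons
W2 = init_glorot(m=30, n=10)      # m inputs, n outputs
print(W1.std(), W2.std())
```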
Target values (tanh)
Target values $d_{kj}$ should be chosen in the range of the output activation functions, in our case [−1.716, 1.716].
Target values too close to the extrema of the output activations, in our case ±1.716, may cause the weights to grow indefinitely (which slows down learning). Thus it is good to choose target values from the interval [−1.716 + δ, 1.716 − δ].
As before, ideally [−1.716 + δ, 1.716 − δ] should span the interval on which the activation function is linear, i.e. $d_{kj}$ should be taken from [−1, 1].
134
Modern activation functions
For hidden neurons, sigmoidal functions are often substituted with piece-wise linear activation functions. Most prominent is ReLU: σ(ξ) = max{0, ξ}
THE default activation function recommended for use with most feedforward neural networks: as close to a linear function as possible; very simple; does not saturate for large potentials.
135
Output neurons
The choice of activation functions for output units depends on the concrete application.
For regression (function approximation), the output is typically linear (or sigmoidal).
For classification, the current activation functions of choice are:
logistic sigmoid or tanh – binary classification
softmax: $\sigma_j(\xi_j) = \frac{e^{\xi_j}}{\sum_{i \in Y} e^{\xi_i}}$ – multi-class classification
For some reasons, the error function used with softmax (assuming that the target values $d_{kj}$ are from {0, 1}) is typically cross-entropy:
$-\frac{1}{p} \sum_{k=1}^{p} \sum_{j \in Y} \left[ d_{kj} \ln(y_j) + (1 - d_{kj}) \ln(1 - y_j) \right]$
... which somewhat corresponds to the maximum likelihood principle.
136
Sigmoidal outputs with cross-entropy – in detail
Consider:
binary classification, two classes {0, 1}
one output neuron j, its activation the logistic sigmoid $\sigma_j(\xi_j) = \frac{1}{1 + e^{-\xi_j}}$
The output of the network is $y = \sigma_j(\xi_j)$.
For a training set $T = \{ (\vec{x}_k, d_k) \mid k = 1, \ldots, p \}$ (here $\vec{x}_k \in \mathbb{R}^{|X|}$ and $d_k \in \mathbb{R}$), the cross-entropy looks like this:
$E_{cross} = -\frac{1}{p} \sum_{k=1}^{p} \left[ d_k \ln(y_k) + (1 - d_k) \ln(1 - y_k) \right]$
where $y_k$ is the output of the network for the k-th training input $\vec{x}_k$, and $d_k$ is the k-th desired output.
137
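A numerically careful sketch of softmax and of the binary cross-entropy above (NumPy; the max-subtraction and the clipping constant are standard implementation details, my addition rather than the slides'):

```python
import numpy as np

def softmax(xi):
    e = np.exp(xi - np.max(xi))        # subtracting the max leaves the result unchanged
    return e / np.sum(e)

def binary_cross_entropy(y, d, tiny=1e-12):
    # E_cross = -1/p * sum_k [ d_k ln y_k + (1 - d_k) ln(1 - y_k) ]
    y = np.clip(y, tiny, 1.0 - tiny)   # avoid ln(0)
    return -np.mean(d * np.log(y) + (1.0 - d) * np.log(1.0 - y))

print(softmax(np.array([1.0, 2.0, 3.0])))            # sums to 1
print(binary_cross_entropy(np.array([0.9, 0.2]),
                           np.array([1.0, 0.0])))    # small for good predictions
```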
Generalization
Intuition: Generalization = the ability to cope with new, unseen instances.
Data are mostly noisy, so it is not a good idea to fit them exactly. In the case of function approximation, the network should not return the exact results from the training set.
More formally: It is typically assumed that the training set has been generated as follows:
$d_{kj} = g_j(\vec{x}_k) + \Theta_{kj}$
where $g_j$ is the "underlying" function corresponding to the output neuron $j \in Y$ and $\Theta_{kj}$ is random noise. The network should fit $g_j$, not the noise.
Methods improving generalization are called regularization methods.
138
Regularization
Regularization is a big issue in neural networks, as they typically use a huge number of parameters and thus are very susceptible to overfitting.
von Neumann: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk."
... and I ask you, prof. Neumann: What can you fit with 40 GB of parameters??
139
Early stopping
Early stopping means that we stop learning before it reaches a minimum of the error E. When to stop?
In many applications the error function is not the main thing we want to optimize. E.g. in the case of a trading system, we typically want to maximize our profit, not to minimize (strange) error functions designed to be easily differentiable. Also, as noted before, minimizing E completely is not good for generalization.
For a start: We may employ the standard approach of training on one set and stopping on another one.
140
Early stopping
Divide your dataset into several subsets:
training set (e.g. 60%) – train the network here
validation set (e.g. 20%) – use to stop the training
(possibly) test set (e.g. 20%) – use to compare trained models
What to use as a stopping rule? You may observe E (or any other function of interest) on the validation set; if it does not improve for the last k steps, stop.
Alternatively, you may observe the gradient; if it is small for some time, stop. (Recent studies have shown that this traditional rule is not too good: it may happen that the gradient is larger close to minimum values; on the other hand, E does not have to be evaluated, which saves time.)
To compare models you may use ML techniques such as cross-validation, etc.
141
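The validation-based stopping rule, as a patience-style loop skeleton (pure Python; train_step and validation_error are hypothetical callbacks standing in for one round of training and for evaluating E on the validation set):

```python
def train_with_early_stopping(train_step, validation_error,
                              patience=10, max_steps=10_000):
    """Stop once the validation error has not improved for `patience` steps."""
    best_err, best_step, since_best = float("inf"), 0, 0
    for step in range(max_steps):
        train_step()                      # e.g. one epoch of SGD
        err = validation_error()          # E on the validation set
        if err < best_err:
            best_err, best_step, since_best = err, step, 0
        else:
            since_best += 1
            if since_best >= patience:    # no improvement for the last k steps
                break
    return best_err, best_step

# toy run: a fake validation error that improves and then starts overfitting
errs = iter([0.9, 0.5, 0.3, 0.25, 0.26, 0.27, 0.28] + [0.3] * 100)
print(train_with_early_stopping(lambda: None, lambda: next(errs), patience=3))
```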
Size of the network
A similar problem as in the case of the training duration:
Too small a network is not able to capture the intrinsic properties of the training set.
Large networks overfit faster – bad generalization.
Solution: The optimal number of neurons :-)
there are some (useless) theoretical bounds
there are algorithms dynamically adding/removing neurons (not of much use nowadays)
In practice:
start using a rule of thumb: the number of neurons ≈ ten times less than the number of training instances
experiment, experiment, experiment.
142
Feature extraction
Consider a two-layer network. Hidden neurons are supposed to represent "patterns" in the inputs.
Example: a network 64-2-3 for letter classification (figure).
143
Ensemble methods
Techniques for reducing the generalization error by combining several models. The reason that ensemble methods work is that different models will usually not make all the same errors on the test set.
Idea: Train several different models separately, then have all of the models vote on the output for test examples.
Bagging:
Generate k training sets $T_1, \ldots, T_k$ of the same size by sampling from T uniformly with replacement. If $|T_i| = |T|$, then on average $T_i$ contains $(1 - 1/e) \cdot |T|$ distinct examples of T.
For each i, train a model $M_i$ on $T_i$.
Combine the outputs of the models: for regression by averaging, for classification by (majority) voting.
144
Dropout
The algorithm: In every step of the gradient descent,
choose randomly a set N of neurons; each neuron is included in N independently with probability 1/2 (in practice, different probabilities are used as well),
update the weights of the neurons in N (in the standard way), and leave the weights of the other neurons unchanged (a sketch of one such masked update follows at the end of this section).
Dropout resembles bagging: a large ensemble of neural networks is trained "at once" on parts of the data.
Dropout is not exactly the same as bagging: the models share parameters, with each model inheriting a different subset of parameters from the parent neural network. This parameter sharing makes it possible to represent an exponential number of models with a tractable amount of memory.
In the case of bagging, each model is trained to convergence on its respective training set. This would be infeasible for large networks/training sets.
145
Weight decay
Generalization can be improved by removing "unimportant" weights. Penalizing large weights gives a stronger indication of their importance.
In every step we decrease the weights (multiplicatively) as follows:
$w_{ji}^{(t+1)} = (1 - \zeta)\left( w_{ji}^{(t)} + \Delta w_{ji}^{(t)} \right)$
Intuition: Unimportant weights will be pushed to 0; important weights will survive the decay.
Weight decay is equivalent to the gradient descent with a constant learning rate ε and the following error function:
$E'(\vec{w}) = E(\vec{w}) + \frac{\zeta}{2\varepsilon} (\vec{w} \cdot \vec{w})$
Here $\frac{\zeta}{2\varepsilon} (\vec{w} \cdot \vec{w})$ penalizes large weights.
146
More optimization, regularization ...
There are many more practical tips, optimization methods, and regularization methods. For a very nice survey see http://www.deeplearningbook.org/ ... and also all the other infinitely many urls concerned with deep learning.
147
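Finally, the dropout-masked weight update promised above, as a sketch (NumPy; masking whole rows, i.e. the incoming weights of the neurons chosen into N, is one way to read "leave the weights of the other neurons unchanged"):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_update(W, grad_W, eps=0.1, p_keep=0.5):
    """One gradient step touching only the incoming weights of neurons in N.
    W, grad_W: (n_neurons, n_inputs + 1); row j = incoming weights of neuron j."""
    in_N = rng.random(W.shape[0]) < p_keep   # neuron j in N with probability 1/2
    W_new = W.copy()
    W_new[in_N] -= eps * grad_W[in_N]        # update only the neurons in N
    return W_new

W = rng.uniform(-0.1, 0.1, (4, 3))
gW = np.ones_like(W)
print(dropout_update(W, gW) - W)             # about half the rows changed
```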