PV021: Neural networks
Tomáš Brázdil
1
Course organization
Course materials:
Main: The lecture
Neural Networks and Deep Learning by Michael Nielsen
http://neuralnetworksanddeeplearning.com/
(Extremely well written modern online textbook.)
Deep learning by Ian Goodfellow, Yoshua Bengio and Aaron
Courville
http://www.deeplearningbook.org/
(A very good overview of the state-of-the-art in neural networks.)
Suggested: deeplearning.ai courses by Andrew Ng
2
Course organization
Evaluation:
Project
teams of two students
implementation of a selected model + analysis of given data
implementation either in C, C++ without use of any
specialized libraries for data analysis and machine
learning
need to get over a given accuracy threshold (a gentle one,
just to eliminate non-functional implementations)
3
Course organization
Evaluation:
Project
teams of two students
implementation of a selected model + analysis of given data
implementation either in C, C++ without use of any
specialized libraries for data analysis and machine
learning
need to get over a given accuracy threshold (a gentle one,
just to eliminate non-functional implementations)
Oral exam
I may ask about anything from the lecture!
3
FAQ
Q: Why we cannot use specialized libraries in projects?
4
FAQ
Q: Why we cannot use specialized libraries in projects?
A: In order to "touch" the low level implementation details of the
algorithms. You should not even use libraries for linear algebra
and numerical methods, so that you will be confronted with
rounding errors and numerical instabilities.
4
FAQ
Q: Why we cannot use specialized libraries in projects?
A: In order to "touch" the low level implementation details of the
algorithms. You should not even use libraries for linear algebra
and numerical methods, so that you will be confronted with
rounding errors and numerical instabilities.
Q: Why should you attend this course when there are inﬁnitely
many great reasources elsewhere?
A: There are at least two reasons:
You may discuss issues with me, my colleagues and other
students.
I will make you truly learn fundamentals by heart.
4
Notable features of the course
Use of mathematical notation and reasoning (contains several
proofs that are mandatory for the exam)
Sometimes goes deeper into statistical underpinnings of neural
networks learning
The project demands a complete working solution which must
satisfy a prescribed performance speciﬁcation
5
Notable features of the course
Use of mathematical notation and reasoning (contains several
proofs that are mandatory for the exam)
Sometimes goes deeper into statistical underpinnings of neural
networks learning
The project demands a complete working solution which must
satisfy a prescribed performance speciﬁcation
An unusual exam system! You can repeat the oral exam as many
times as needed (only the best grade goes into IS).
5
Notable features of the course
Use of mathematical notation and reasoning (contains several
proofs that are mandatory for the exam)
Sometimes goes deeper into statistical underpinnings of neural
networks learning
The project demands a complete working solution which must
satisfy a prescribed performance speciﬁcation
An unusual exam system! You can repeat the oral exam as many
times as needed (only the best grade goes into IS).
An example of an instruction email (from another course with the
same system):
It is typically not sufficient to devote a single
afternoon to the preparation for the exam.
You have to know _everything_ (which means every
single thing) starting with the slide 42
and ending with the slide 245 with notable exceptions
of slides: 121 - 123, 137 - 140, 165, 167.
Proofs presented on the whiteboard are also mandatory.
5
Machine learning in general
Machine learning = construction of systems that may learn their
functionality from data
(... and thus do not need to be programmed.)
6
Machine learning in general
Machine learning = construction of systems that may learn their
functionality from data
(... and thus do not need to be programmed.)
spam ﬁlter
learns to recognize spam from a database of "labelled"
emails
consequently is able to distinguish spam from ham
6
Machine learning in general
Machine learning = construction of systems that may learn their
functionality from data
(... and thus do not need to be programmed.)
spam ﬁlter
learns to recognize spam from a database of "labelled"
emails
consequently is able to distinguish spam from ham
handwritten text reader
learns from a database of handwritten
letters (or text) labelled by their correct
meaning
consequently is able to recognize text
6
Machine learning in general
Machine learning = construction of systems that may learn their
functionality from data
(... and thus do not need to be programmed.)
spam ﬁlter
learns to recognize spam from a database of "labelled"
emails
consequently is able to distinguish spam from ham
handwritten text reader
learns from a database of handwritten
letters (or text) labelled by their correct
meaning
consequently is able to recognize text
· · ·
and lots of much much more sophisticated applications ...
6
Machine learning in general
Machine learning = construction of systems that may learn their
functionality from data
(... and thus do not need to be programmed.)
spam ﬁlter
learns to recognize spam from a database of "labelled"
emails
consequently is able to distinguish spam from ham
handwritten text reader
learns from a database of handwritten
letters (or text) labelled by their correct
meaning
consequently is able to recognize text
· · ·
and lots of much much more sophisticated applications ...
Basic attributes of learning algorithms:
representation: ability to capture the inner structure of
training data
generalization: ability to work properly on new data
6
Machine learning in general
Machine learning algorithms typically construct mathematical
models of given data. The models may be subsequently
applied to fresh data.
7
Machine learning in general
Machine learning algorithms typically construct mathematical
models of given data. The models may be subsequently
applied to fresh data.
There are many types of models:
decision trees
support vector machines
hidden Markov models
Bayes networks and other graphical models
neural networks
· · ·
Neural networks, based on models of a (human) brain, form
a natural basis for learning algorithms!
7
Artiﬁcial neural networks
Artiﬁcial neuron is a rough mathematical approximation
of a biological neuron.
(Aritiﬁcial) neural network (NN) consists of a number of
interconnected artiﬁcial neurons. "Behavior" of the network
is encoded in connections between neurons.
σ
ξ
x1 x2 xn
y
Zdroj obrázku: http://tulane.edu/sse/cmb/people/schrader/
8
Why artiﬁcial neural networks?
Modelling of biological neural networks (computational
neuroscience).
simpliﬁed mathematical models help to identify important
mechanisms
How a brain receives information?
How the information is stored?
How a brain develops?
· · ·
9
Why artiﬁcial neural networks?
Modelling of biological neural networks (computational
neuroscience).
simpliﬁed mathematical models help to identify important
mechanisms
How a brain receives information?
How the information is stored?
How a brain develops?
· · ·
neuroscience is strongly multidisciplinary; precise
mathematical descriptions help in communication among
experts and in design of new experiments.
I will not spend much time on this area!
9
Why artiﬁcial neural networks?
Neural networks in machine learning.
Typically primitive models, far from their biological
counterparts (but often inspired by biology).
10
Why artiﬁcial neural networks?
Neural networks in machine learning.
Typically primitive models, far from their biological
counterparts (but often inspired by biology).
Strongly oriented towards concrete application domains:
decision making and control - autonomous vehicles,
manufacturing processes, control of natural resources
games - backgammon, poker, GO, Starcraft, ...
ﬁnance - stock prices, risk analysis
medicine - diagnosis, signal processing (EKG, EEG, ...), image
processing (MRI, roentgen, WSI ...)
text and speech processing - automatic translation, text
generation, speech recognition
other signal processing - ﬁltering, radar tracking, noise
reduction
· · ·
I will concentrate on this area!
10
Important features of neural networks
Massive parallelism
many slow (and "dumb") computational elements work in
parallel on several levels of abstraction
11
Important features of neural networks
Massive parallelism
many slow (and "dumb") computational elements work in
parallel on several levels of abstraction
Learning
a kid learns to recognize a rabbit after seeing several
rabbits
11
Important features of neural networks
Massive parallelism
many slow (and "dumb") computational elements work in
parallel on several levels of abstraction
Learning
a kid learns to recognize a rabbit after seeing several
rabbits
Generalization
a kid is able to recognize a new rabbit after seeing several
(old) rabbits
11
Important features of neural networks
Massive parallelism
many slow (and "dumb") computational elements work in
parallel on several levels of abstraction
Learning
a kid learns to recognize a rabbit after seeing several
rabbits
Generalization
a kid is able to recognize a new rabbit after seeing several
(old) rabbits
Robustness
a blurred photo of a rabbit may still be classiﬁed as an
image of a rabbit
11
Important features of neural networks
Massive parallelism
many slow (and "dumb") computational elements work in
parallel on several levels of abstraction
Learning
a kid learns to recognize a rabbit after seeing several
rabbits
Generalization
a kid is able to recognize a new rabbit after seeing several
(old) rabbits
Robustness
a blurred photo of a rabbit may still be classiﬁed as an
image of a rabbit
Graceful degradation
Experiments have shown that damaged neural network is
still able to work quite well
Damaged network may re-adapt, remaining neurons may
take on functionality of the damaged ones
11
The aim of the course
We will concentrate on
basic techniques and principles of neural networks,
fundamental models of neural networks and their
applications.
You should learn
basic models
(multilayer perceptron, convolutional networks, recurrent network
(LSTM), Hopﬁeld and Boltzmann machines and their use in
pre-training of deep nets, autoencoders and generative
adversarial networks)
12
The aim of the course
We will concentrate on
basic techniques and principles of neural networks,
fundamental models of neural networks and their
applications.
You should learn
basic models
(multilayer perceptron, convolutional networks, recurrent network
(LSTM), Hopﬁeld and Boltzmann machines and their use in
pre-training of deep nets, autoencoders and generative
adversarial networks)
Standard applications of these models
(image processing, a little bit of speech and text processing)
12
The aim of the course
We will concentrate on
basic techniques and principles of neural networks,
fundamental models of neural networks and their
applications.
You should learn
basic models
(multilayer perceptron, convolutional networks, recurrent network
(LSTM), Hopﬁeld and Boltzmann machines and their use in
pre-training of deep nets, autoencoders and generative
adversarial networks)
Standard applications of these models
(image processing, a little bit of speech and text processing)
Basic learning algorithms
(gradient descent & backpropagation, Hebb’s rule)
12
The aim of the course
We will concentrate on
basic techniques and principles of neural networks,
fundamental models of neural networks and their
applications.
You should learn
basic models
(multilayer perceptron, convolutional networks, recurrent network
(LSTM), Hopﬁeld and Boltzmann machines and their use in
pre-training of deep nets, autoencoders and generative
adversarial networks)
Standard applications of these models
(image processing, a little bit of speech and text processing)
Basic learning algorithms
(gradient descent & backpropagation, Hebb’s rule)
Basic practical training techniques
(data preparation, setting various parameters, control of learning)
12
The aim of the course
We will concentrate on
basic techniques and principles of neural networks,
fundamental models of neural networks and their
applications.
You should learn
basic models
(multilayer perceptron, convolutional networks, recurrent network
(LSTM), Hopﬁeld and Boltzmann machines and their use in
pre-training of deep nets, autoencoders and generative
adversarial networks)
Standard applications of these models
(image processing, a little bit of speech and text processing)
Basic learning algorithms
(gradient descent & backpropagation, Hebb’s rule)
Basic practical training techniques
(data preparation, setting various parameters, control of learning)
Basic information about current implementations
(TensorFlow, Keras)
12
Biological neural network
Human neural network consists of approximately 1011 (100
billion on the short scale) neurons; a single cubic
centimeter of a human brain contains almost 50 million
neurons.
Each neuron is connected with approx. 104 neurons.
Neurons themselves are very complex systems.
13
Biological neural network
Human neural network consists of approximately 1011 (100
billion on the short scale) neurons; a single cubic
centimeter of a human brain contains almost 50 million
neurons.
Each neuron is connected with approx. 104 neurons.
Neurons themselves are very complex systems.
Rough description of nervous system:
External stimulus is received by sensory receptors (e.g.
eye cells).
13
Biological neural network
Human neural network consists of approximately 1011 (100
billion on the short scale) neurons; a single cubic
centimeter of a human brain contains almost 50 million
neurons.
Each neuron is connected with approx. 104 neurons.
Neurons themselves are very complex systems.
Rough description of nervous system:
External stimulus is received by sensory receptors (e.g.
eye cells).
Information is futher transfered via peripheral nervous
system (PNS) to the central nervous systems (CNS) where
it is processed (integrated), and subseqently, an output
signal is produced.
13
Biological neural network
Human neural network consists of approximately 1011 (100
billion on the short scale) neurons; a single cubic
centimeter of a human brain contains almost 50 million
neurons.
Each neuron is connected with approx. 104 neurons.
Neurons themselves are very complex systems.
Rough description of nervous system:
External stimulus is received by sensory receptors (e.g.
eye cells).
Information is futher transfered via peripheral nervous
system (PNS) to the central nervous systems (CNS) where
it is processed (integrated), and subseqently, an output
signal is produced.
Afterwards, the output signal is transfered via PNS to
effectors (e.g. muscle cells).
13
Biological neural network
Zdroj: N. Campbell and J. Reece; Biology, 7th Edition; ISBN: 080537146X
14
Summation
15
Biological and Mathematical neurons
16
Formal neuron (without bias)
σ
ξ
x1 x2 xn
y
w1 w2
· · ·
wn
x1, . . . , xn ∈ R are inputs
17
Formal neuron (without bias)
σ
ξ
x1 x2 xn
y
w1 w2
· · ·
wn
x1, . . . , xn ∈ R are inputs
w1, . . . , wn ∈ R are weights
17
Formal neuron (without bias)
σ
ξ
x1 x2 xn
y
w1 w2
· · ·
wn
x1, . . . , xn ∈ R are inputs
w1, . . . , wn ∈ R are weights
ξ is an inner potential;
almost always ξ = n
i=1 wixi
17
Formal neuron (without bias)
σ
ξ
x1 x2 xn
y
w1 w2
· · ·
wn
x1, . . . , xn ∈ R are inputs
w1, . . . , wn ∈ R are weights
ξ is an inner potential;
almost always ξ = n
i=1 wixi
y is an output given by y = σ(ξ)
where σ is an activation function;
e.g. a unit step function
σ(ξ) =



1 ξ ≥ h ;
0 ξ < h.
where h ∈ R is a threshold.
17
Formal neuron (with bias)
σ
ξ
x1 x2 xn
x0 = 1
bias
threshold
y
w1 w2
· · ·
wn
w0 = −h
x0 = 1, x1, . . . , xn ∈ R are inputs
18
Formal neuron (with bias)
σ
ξ
x1 x2 xn
x0 = 1
bias
threshold
y
w1 w2
· · ·
wn
w0 = −h
x0 = 1, x1, . . . , xn ∈ R are inputs
w0, w1, . . . , wn ∈ R are weights
18
Formal neuron (with bias)
σ
ξ
x1 x2 xn
x0 = 1
bias
threshold
y
w1 w2
· · ·
wn
w0 = −h
x0 = 1, x1, . . . , xn ∈ R are inputs
w0, w1, . . . , wn ∈ R are weights
ξ is an inner potential;
almost always ξ = w0 + n
i=1 wixi
18
Formal neuron (with bias)
σ
ξ
x1 x2 xn
x0 = 1
bias
threshold
y
w1 w2
· · ·
wn
w0 = −h
x0 = 1, x1, . . . , xn ∈ R are inputs
w0, w1, . . . , wn ∈ R are weights
ξ is an inner potential;
almost always ξ = w0 + n
i=1 wixi
y is an output given by y = σ(ξ)
where σ is an activation
function;
e.g. a unit step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
(The threshold h has been substituted
with the new input x0 = 1 and the weight
w0 = −h.)
18
Neuron and linear separation
ξ = 0
ξ > 0
ξ > 0
ξ < 0
ξ < 0
inner potential
ξ = w0 +
n
i=1
wixi
determines a separation
hyperplane in
the n-dimensional input space
in 2d line
in 3d plane
· · ·
19
Neuron and linear separation
σ σ( wixi)
x1 xn
· · ·
1/0 by A/B
w1 wn
n = 8 · 8, i.e. the number of pixels in the images. Inputs are
binary vectors of dimension n (black pixel ≈ 1, white pixel ≈ 0).
20
Neuron and linear separation
σ
x1 xn
· · ·
x0 = 1
1/0 pro A/B
w1 wn
w0
n = 8 · 8, i.e. the number of pixels in the images. Inputs are
binary vectors of dimension n (black pixel ≈ 1, white pixel ≈ 0).
21
Neuron and linear separation
¯w0 + n
i=1 ¯wixi = 0
w0 + n
i=1 wixi = 0
A
A
A A
B
B
B
Red line classiﬁes incorrectly
Green line classiﬁes correctly
(may be a result of
a correction by a learning
algorithm)
22
Neuron and linear separation (XOR)
0
(0, 0)
1
(0, 1)
1
(0, 1)
0
(1, 1)
x1
x2
No line separates ones from
zeros.
23
Neural networks
Neural network consists of formal neurons interconnected in
such a way that the output of one neuron is an input of several
other neurons.
In order to describe a particular type of neural networks we
need to specify:
Architecture
How the neurons are connected.
Activity
How the network transforms inputs to outputs.
Learning
How the weights are changed during training.
24
Architecture
Network architecture is given as a digraph whose nodes are
neurons and edges are connections.
We distinguish several categories of
neurons:
Output neurons
Hidden neurons
Input neurons
(In general, a neuron may be both input and
output; a neuron is hidden if it is neither input,
nor output.)
25
Architecture – Cycles
A network is cyclic (recurrent) if its architecture contains a
directed cycle.
26
Architecture – Cycles
A network is cyclic (recurrent) if its architecture contains a
directed cycle.
Otherwise it is acyclic (feed-forward)
26
Architecture – Multilayer Perceptron (MLP)
Input
Hidden
Output
x1 x2
y1 y2
Neurons partitioned into layers;
one input layer, one output layer,
possibly several hidden layers
27
Architecture – Multilayer Perceptron (MLP)
Input
Hidden
Output
x1 x2
y1 y2
Neurons partitioned into layers;
one input layer, one output layer,
possibly several hidden layers
layers numbered from 0; the
input layer has number 0
E.g. three-layer network has
two hidden layers and one
output layer
27
Architecture – Multilayer Perceptron (MLP)
Input
Hidden
Output
x1 x2
y1 y2
Neurons partitioned into layers;
one input layer, one output layer,
possibly several hidden layers
layers numbered from 0; the
input layer has number 0
E.g. three-layer network has
two hidden layers and one
output layer
Neurons in the i-th layer are
connected with all neurons in
the i + 1-st layer
27
Architecture – Multilayer Perceptron (MLP)
Input
Hidden
Output
x1 x2
y1 y2
Neurons partitioned into layers;
one input layer, one output layer,
possibly several hidden layers
layers numbered from 0; the
input layer has number 0
E.g. three-layer network has
two hidden layers and one
output layer
Neurons in the i-th layer are
connected with all neurons in
the i + 1-st layer
Architecture of a MLP is typically
described by numbers of neurons
in individual layers (e.g. 2-4-3-2)
27
Activity
Consider a network with n neurons, k input and output.
28
Activity
Consider a network with n neurons, k input and output.
State of a network is a vector of output values of all
neurons.
(States of a network with n neurons are vectors of Rn
)
State-space of a network is a set of all states.
28
Activity
Consider a network with n neurons, k input and output.
State of a network is a vector of output values of all
neurons.
(States of a network with n neurons are vectors of Rn
)
State-space of a network is a set of all states.
Network input is a vector of k real numbers, i.e.
an element of Rk .
Network input space is a set of all network inputs.
(sometimes we restrict ourselves to a proper subset of Rk
)
28
Activity
Consider a network with n neurons, k input and output.
State of a network is a vector of output values of all
neurons.
(States of a network with n neurons are vectors of Rn
)
State-space of a network is a set of all states.
Network input is a vector of k real numbers, i.e.
an element of Rk .
Network input space is a set of all network inputs.
(sometimes we restrict ourselves to a proper subset of Rk
)
Initial state
Input neurons set to values from the network input
(each component of the network input corresponds to an input
neuron)
Values of the remaining neurons set to 0.
28
Activity – computation of a network
Computation (typically) proceeds in discrete steps.
29
Activity – computation of a network
Computation (typically) proceeds in discrete steps.
In every step the following happens:
29
Activity – computation of a network
Computation (typically) proceeds in discrete steps.
In every step the following happens:
1. A set of neurons is selected according to some rule.
2. The selected neurons change their states according to their
inputs (they are simply evaluated).
(If a neuron does not have any inputs, its value remains constant.)
29
Activity – computation of a network
Computation (typically) proceeds in discrete steps.
In every step the following happens:
1. A set of neurons is selected according to some rule.
2. The selected neurons change their states according to their
inputs (they are simply evaluated).
(If a neuron does not have any inputs, its value remains constant.)
A computation is ﬁnite on a network input x if the state
changes only ﬁnitely many times (i.e. there is a moment in
time after which the state of the network never changes).
We also say that the network stops on x.
29
Activity – computation of a network
Computation (typically) proceeds in discrete steps.
In every step the following happens:
1. A set of neurons is selected according to some rule.
2. The selected neurons change their states according to their
inputs (they are simply evaluated).
(If a neuron does not have any inputs, its value remains constant.)
A computation is ﬁnite on a network input x if the state
changes only ﬁnitely many times (i.e. there is a moment in
time after which the state of the network never changes).
We also say that the network stops on x.
Network output is a vector of values of all output neurons
in the network (i.e. an element of R ).
Note that the network output keeps changing throughout
the computation!
29
Activity – computation of a network
Computation (typically) proceeds in discrete steps.
In every step the following happens:
1. A set of neurons is selected according to some rule.
2. The selected neurons change their states according to their
inputs (they are simply evaluated).
(If a neuron does not have any inputs, its value remains constant.)
A computation is ﬁnite on a network input x if the state
changes only ﬁnitely many times (i.e. there is a moment in
time after which the state of the network never changes).
We also say that the network stops on x.
Network output is a vector of values of all output neurons
in the network (i.e. an element of R ).
Note that the network output keeps changing throughout
the computation!
MLP uses the following selection rule:
In the i-th step evaluate all neurons in the i-th layer.
29
Activity – semantics of a network
Deﬁnition
Consider a network with n neurons, k input, output.
Let A ⊆ Rk and B ⊆ R . Suppose that the network stops on
every input of A.
Then we say that the network computes a function F : A → B if
for every network input x the vector F(x) ∈ B is the output of
the network after the computation on x stops.
30
Activity – semantics of a network
Deﬁnition
Consider a network with n neurons, k input, output.
Let A ⊆ Rk and B ⊆ R . Suppose that the network stops on
every input of A.
Then we say that the network computes a function F : A → B if
for every network input x the vector F(x) ∈ B is the output of
the network after the computation on x stops.
30
Activity – semantics of a network
Deﬁnition
Consider a network with n neurons, k input, output.
Let A ⊆ Rk and B ⊆ R . Suppose that the network stops on
every input of A.
Then we say that the network computes a function F : A → B if
for every network input x the vector F(x) ∈ B is the output of
the network after the computation on x stops.
Example 1
This network computes a function
from R2 to R.
30
Activity – inner potential and activation functions
In order to specify activity of the network, we need to specify
how the inner potentials ξ are computed and what are
the activation functions σ.
31
Activity – inner potential and activation functions
In order to specify activity of the network, we need to specify
how the inner potentials ξ are computed and what are
the activation functions σ.
We assume (unless otherwise speciﬁed) that
ξ = w0 +
n
i=1
wi · xi
here x = (x1, . . . , xn) are inputs of the neuron and
w = (w1, . . . , wn) are weights.
31
Activity – inner potential and activation functions
In order to specify activity of the network, we need to specify
how the inner potentials ξ are computed and what are
the activation functions σ.
We assume (unless otherwise speciﬁed) that
ξ = w0 +
n
i=1
wi · xi
here x = (x1, . . . , xn) are inputs of the neuron and
w = (w1, . . . , wn) are weights.
There are special types of neural network where the inner
potential is computed differently, e.g. as a "distance" of an input
from the weight vector:
ξ = x − w
here ||·|| is a vector norm, typically Euclidean.
31
Activity – inner potential and activation functions
There are many activation functions, typical examples:
Unit step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
32
Activity – inner potential and activation functions
There are many activation functions, typical examples:
Unit step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
(Logistic) sigmoid
σ(ξ) =
1
1 + e−λ·ξ
here λ ∈ R is a steepness parameter.
Hyperbolic tangens
σ(ξ) =
1 − e−ξ
1 + e−ξ
ReLU
σ(ξ) = max(ξ, 0)
32
Activity – XOR
1 1
σ 01 σ0 1
σ
0
1
−22 2 −2
1
−1
1
3
−2
Activation function is a unit
step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
The network computes
XOR(x1, x2)
x1 x2 y
1 1 0
1 0 1
0 1 1
0 0 0
33
Activity – XOR
1 1
σ
11 σ0 1
σ
0
1
−22 2 −2
1
−1
1
3
−2
Activation function is a unit
step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
The network computes
XOR(x1, x2)
x1 x2 y
1 1 0
1 0 1
0 1 1
0 0 0
33
Activity – XOR
0 0
σ 01 σ0 1
σ
0
1
−22 2 −2
1
−1
1
3
−2
Activation function is a unit
step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
The network computes
XOR(x1, x2)
x1 x2 y
1 1 0
1 0 1
0 1 1
0 0 0
33
Activity – XOR
0 0
σ 01 σ
1 1
σ
0
1
−22 2 −2
1
−1
1
3
−2
Activation function is a unit
step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
The network computes
XOR(x1, x2)
x1 x2 y
1 1 0
1 0 1
0 1 1
0 0 0
33
Activity – XOR
1 0
σ 01 σ0 1
σ
0
1
−22 2 −2
1
−1
1
3
−2
Activation function is a unit
step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
The network computes
XOR(x1, x2)
x1 x2 y
1 1 0
1 0 1
0 1 1
0 0 0
33
Activity – XOR
1 0
σ
11 σ
1 1
σ
0
1
−22 2 −2
1
−1
1
3
−2
Activation function is a unit
step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
The network computes
XOR(x1, x2)
x1 x2 y
1 1 0
1 0 1
0 1 1
0 0 0
33
Activity – XOR
1 0
σ
11 σ
1 1
σ
1
1
−22 2 −2
1
−1
1
3
−2
Activation function is a unit
step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
The network computes
XOR(x1, x2)
x1 x2 y
1 1 0
1 0 1
0 1 1
0 0 0
33
Activity – XOR
0 1
σ 01 σ0 1
σ
0
1
−22 2 −2
1
−1
1
3
−2
Activation function is a unit
step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
The network computes
XOR(x1, x2)
x1 x2 y
1 1 0
1 0 1
0 1 1
0 0 0
33
Activity – XOR
0 1
σ
11 σ
1 1
σ
0
1
−22 2 −2
1
−1
1
3
−2
Activation function is a unit
step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
The network computes
XOR(x1, x2)
x1 x2 y
1 1 0
1 0 1
0 1 1
0 0 0
33
Activity – XOR
0 1
σ
11 σ
1 1
σ
1
1
−22 2 −2
1
−1
1
3
−2
Activation function is a unit
step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
The network computes
XOR(x1, x2)
x1 x2 y
1 1 0
1 0 1
0 1 1
0 0 0
33
Activity – MLP and linear separation
0
(0, 0)
1
(0, 1)
1
(0, 1)
0
(1, 1)
P1 P2
x1
x2
σ1 σ 1
σ1
−22 2 −2
1
−1
1
3
−2
The line P1 is given by
−1 + 2x1 + 2x2 = 0
The line P2 is given by
3 − 2x1 − 2x2 = 0
34
Activity – example
x1
1
σ
0
1
σ0 1
σ
0
1
1
2
−5
1
−2
11
−2
−1
The activation function is
the unit step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
The input is equal to 1
35
Activity – example
x1
1
σ
1
1
σ0 1
σ
0
1
1
2
−5
1
−2
11
−2
−1
The activation function is
the unit step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
The input is equal to 1
35
Activity – example
x1
1
σ
1
1
σ
1 1
σ
0
1
1
2
−5
1
−2
11
−2
−1
The activation function is
the unit step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
The input is equal to 1
35
Activity – example
x1
1
σ
1
1
σ
1 1
σ
1
1
1
2
−5
1
−2
11
−2
−1
The activation function is
the unit step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
The input is equal to 1
35
Activity – example
x1
1
σ
0
1
σ
1 1
σ
1
1
1
2
−5
1
−2
11
−2
−1
The activation function is
the unit step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
The input is equal to 1
35
Learning
Consider a network with n neurons, k input and output.
36
Learning
Consider a network with n neurons, k input and output.
Conﬁguration of a network is a vector of all values of
weights.
(Conﬁgurations of a network with m connections are elements of Rm
)
Weight-space of a network is a set of all conﬁgurations.
36
Learning
Consider a network with n neurons, k input and output.
Conﬁguration of a network is a vector of all values of
weights.
(Conﬁgurations of a network with m connections are elements of Rm
)
Weight-space of a network is a set of all conﬁgurations.
initial conﬁguration
weights can be initialized randomly or using some sophisticated
algorithm
36
Learning algorithms
Learning rule for weight adaptation.
(the goal is to ﬁnd a conﬁguration in which the network computes
a desired function)
37
Learning algorithms
Learning rule for weight adaptation.
(the goal is to ﬁnd a conﬁguration in which the network computes
a desired function)
Supervised learning
The desired function is described using training examples
that are pairs of the form (input, output).
Learning algorithm searches for a conﬁguration which
"corresponds" to the training examples, typically by
minimizing an error function.
37
Learning algorithms
Learning rule for weight adaptation.
(the goal is to ﬁnd a conﬁguration in which the network computes
a desired function)
Supervised learning
The desired function is described using training examples
that are pairs of the form (input, output).
Learning algorithm searches for a conﬁguration which
"corresponds" to the training examples, typically by
minimizing an error function.
Unsupervised learning
The training set contains only inputs.
The goal is to determine distribution of the inputs
(clustering, deep belief networks, etc.)
37
Supervised learning – illustration
A
A
A A
B
B
B
classiﬁcation in the plane using
a single neuron
38
Supervised learning – illustration
A
A
A A
B
B
B
classiﬁcation in the plane using
a single neuron
training examples are of the form
(point, value) where the value is
either 1, or 0 depending on whether
the point is either A, or B
38
Supervised learning – illustration
A
A
A A
B
B
B
classiﬁcation in the plane using
a single neuron
training examples are of the form
(point, value) where the value is
either 1, or 0 depending on whether
the point is either A, or B
the algorithm considers examples
one after another
whenever an incorrectly classiﬁed
point is considered, the learning
algorithm turns the line in
the direction of the point
38
Summary – Advantages of neural networks
Massive parallelism
neurons can be evaluated in parallel
39
Summary – Advantages of neural networks
Massive parallelism
neurons can be evaluated in parallel
Learning
many sophisticated learning algorithms used to "program"
neural networks
39
Summary – Advantages of neural networks
Massive parallelism
neurons can be evaluated in parallel
Learning
many sophisticated learning algorithms used to "program"
neural networks
generalization and robustness
information is encoded in a distributed manned in weights
"close" inputs typicaly get similar values
39
Summary – Advantages of neural networks
Massive parallelism
neurons can be evaluated in parallel
Learning
many sophisticated learning algorithms used to "program"
neural networks
generalization and robustness
information is encoded in a distributed manned in weights
"close" inputs typicaly get similar values
Graceful degradation
damage typically causes only a decrease in precision of
results
39
Expressive power of neural networks
40
Formal neuron (with bias)
σ
ξ
x1 x2 xn
x0 = 1
bias
threshold
y
w1 w2
· · ·
wn
w0 = −h
x0 = 1, x1, . . . , xn ∈ R are inputs
w0, w1, . . . , wn ∈ R are weights
ξ is an inner potential;
almost always ξ = w0 + n
i=1 wixi
y is an output given by y = σ(ξ)
where σ is an activation
function;
e.g. a unit step function
σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
41
Boolean functions
Activation function: unit step function σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
42
Boolean functions
Activation function: unit step function σ(ξ) =



1 ξ ≥ 0 ;
0 ξ < 0.
σ
x1 x2 xn
x0 = 1
y = AND(x1, . . . , xn)
1 1
· · ·
1
−n
σ
x1 x2 xn
x0 = 1
y = OR(x1, . . . , xn)
1 1
· · ·
1
−1
σ
x1
x0 = 1
y = NOT(x1)
−1
0
42
Boolean functions
Theorem
Let σ be the unit step function. Two layer MLPs, where each
neuron has σ as the activation function, are able to compute all
functions of the form F : {0, 1}n → {0, 1}.
43
Boolean functions
Theorem
Let σ be the unit step function. Two layer MLPs, where each
neuron has σ as the activation function, are able to compute all
functions of the form F : {0, 1}n → {0, 1}.
Proof.
Given a vector v = (v1, . . . , vn) ∈ {0, 1}n, consider a neuron
Nv whose output is 1 iff the input is v:
σ
y
x1 xi xn
x0 = 1
w1 wi
· · ·· · ·
wn
w0 w0 = − n
i=1 vi
wi =



1 vi = 1
−1 vi = 0
Now let us connect all outputs of all neurons Nv satisfying
F(v) = 1 using a neuron implementing OR.
43
Non-linear separation
x1 x2
y
Consider a three layer network; each neuron
has the unit step activation function.
The network divides the input space in two
subspaces according to the output (0 or 1).
44
Non-linear separation
x1 x2
y
Consider a three layer network; each neuron
has the unit step activation function.
The network divides the input space in two
subspaces according to the output (0 or 1).
The ﬁrst (hidden) layer divides the input
space into half-spaces.
44
Non-linear separation
x1 x2
y
Consider a three layer network; each neuron
has the unit step activation function.
The network divides the input space in two
subspaces according to the output (0 or 1).
The ﬁrst (hidden) layer divides the input
space into half-spaces.
The second layer may e.g. make
intersections of the half-spaces ⇒ convex
sets.
44
Non-linear separation
x1 x2
y
Consider a three layer network; each neuron
has the unit step activation function.
The network divides the input space in two
subspaces according to the output (0 or 1).
The ﬁrst (hidden) layer divides the input
space into half-spaces.
The second layer may e.g. make
intersections of the half-spaces ⇒ convex
sets.
The third layer may e.g. make unions of some
convex sets.
44
Non-linear separation – illustration
x1 xk
· · ·
· · ·
· · ·
y Consider three layer networks; each neuron
has the unit step activation function.
Three layer nets are capable of
"approximating" any "reasonable" subset A of
the input space Rk .
45
Non-linear separation – illustration
x1 xk
· · ·
· · ·
· · ·
y Consider three layer networks; each neuron
has the unit step activation function.
Three layer nets are capable of
"approximating" any "reasonable" subset A of
the input space Rk .
Cover A with hypercubes (in 2D squares, in
3D cubes, ...)
45
Non-linear separation – illustration
x1 xk
· · ·
· · ·
· · ·
y Consider three layer networks; each neuron
has the unit step activation function.
Three layer nets are capable of
"approximating" any "reasonable" subset A of
the input space Rk .
Cover A with hypercubes (in 2D squares, in
3D cubes, ...)
Each hypercube K can be separated using
a two layer network NK
(i.e. a function computed by NK gives 1 for
points in K and 0 for the rest).
45
Non-linear separation – illustration
x1 xk
· · ·
· · ·
· · ·
y Consider three layer networks; each neuron
has the unit step activation function.
Three layer nets are capable of
"approximating" any "reasonable" subset A of
the input space Rk .
Cover A with hypercubes (in 2D squares, in
3D cubes, ...)
Each hypercube K can be separated using
a two layer network NK
(i.e. a function computed by NK gives 1 for
points in K and 0 for the rest).
Finally, connect outputs of the nets NK
satisfying K ∩ A ∅ using a neuron
implementing OR.
45
Non-linear separation - sigmoid
Theorem (Cybenko 1989 - informal version)
Let σ be a continuous function which is sigmoidal, i.e. satisﬁes
σ(x) =



1 pro x → +∞
0 pro x → −∞
For every "reasonable" set A ⊆ [0, 1]n, there is a two layer
network where each hidden neuron has the activation function
σ (output neurons are linear), that satisﬁes the following:
For "most" vectors v ∈ [0, 1]n we have that v ∈ A iff the network
output is > 0 for the input v.
For mathematically oriented:
"reasonable" means Lebesgue measurable
"most" means that the set of incorrectly classiﬁed vectors has
the Lebesgue measure smaller than a given ε > 0
46
Non-linear separation - practical illustration
ALVINN drives a car
47
Non-linear separation - practical illustration
ALVINN drives a car
The net has 30 × 32 = 960 inputs
(the input space is thus R960
)
47
Non-linear separation - practical illustration
ALVINN drives a car
The net has 30 × 32 = 960 inputs
(the input space is thus R960
)
Input values correspond to
shades of gray of pixels.
47
Non-linear separation - practical illustration
ALVINN drives a car
The net has 30 × 32 = 960 inputs
(the input space is thus R960
)
Input values correspond to
shades of gray of pixels.
Output neurons "classify" images
of the road based on their
"curvature".
Zdroj obrázku: http://jmvidal.cse.sc.edu/talks/ann/alvin.html
47
Function approximation - three layers
Let σ be a logistic sigmoid, i.e.
σ(ξ) =
1
1 + e−ξ
For every continuous function f : [0, 1]n → [0, 1] and ε > 0 there
is a three-layer network computing a function F : [0, 1]n → [0, 1]
such that
there is a linear activation in the output layer, i.e. the value
of the output neuron is its inner potential ξ,
48
Function approximation - three layers
Let σ be a logistic sigmoid, i.e.
σ(ξ) =
1
1 + e−ξ
For every continuous function f : [0, 1]n → [0, 1] and ε > 0 there
is a three-layer network computing a function F : [0, 1]n → [0, 1]
such that
there is a linear activation in the output layer, i.e. the value
of the output neuron is its inner potential ξ,
the remaining neurons have the logistic sigmoid σ as their
activation,
48
Function approximation - three layers
Let σ be a logistic sigmoid, i.e.
σ(ξ) =
1
1 + e−ξ
For every continuous function f : [0, 1]n → [0, 1] and ε > 0 there
is a three-layer network computing a function F : [0, 1]n → [0, 1]
such that
there is a linear activation in the output layer, i.e. the value
of the output neuron is its inner potential ξ,
the remaining neurons have the logistic sigmoid σ as their
activation,
for every v ∈ [0, 1]n we have that |F(v) − f(v)| < ε.
48
Function approximation – three layer networks
x1 x2
σ σ σ σ
σ· · ·
· · · · · ·
ζ
y weighted sum of "spikes"
... + the other two 90 degree rotations
a "spike"
inner potential
the value of the neuron
49
Function approximation - two-layer networks
Theorem (Cybenko 1989)
Let σ be a continuous function which is sigmoidal, i.e. is
increasing and satisﬁes
σ(x) =



1 pro x → +∞
0 pro x → −∞
For every continuous function f : [0, 1]n → [0, 1] and every ε > 0
there is a function F : [0, 1]n → [0, 1] computed by a two layer
network where each hidden neuron has the activation function
σ (output neurons are linear), that satisﬁes the following
|f(v) − F(v)| < ε pro každé v ∈ [0, 1]n
.
50
Neural networks and computability
Consider recurrent networks (i.e. containing cycles)
51
Neural networks and computability
Consider recurrent networks (i.e. containing cycles)
with real weights (in general);
51
Neural networks and computability
Consider recurrent networks (i.e. containing cycles)
with real weights (in general);
one input neuron and one output neuron (the network
computes a function F : A → R where A ⊆ R contains all
inputs on which the network stops);
51
Neural networks and computability
Consider recurrent networks (i.e. containing cycles)
with real weights (in general);
one input neuron and one output neuron (the network
computes a function F : A → R where A ⊆ R contains all
inputs on which the network stops);
parallel activity rule (output values of all neurons are
recomputed in every step);
51
Neural networks and computability
Consider recurrent networks (i.e. containing cycles)
with real weights (in general);
one input neuron and one output neuron (the network
computes a function F : A → R where A ⊆ R contains all
inputs on which the network stops);
parallel activity rule (output values of all neurons are
recomputed in every step);
activation function
σ(ξ) =



1 ξ ≥ 1 ;
ξ 0 ≤ ξ ≤ 1 ;
0 ξ < 0.
51
Neural networks and computability
Consider recurrent networks (i.e. containing cycles)
with real weights (in general);
one input neuron and one output neuron (the network
computes a function F : A → R where A ⊆ R contains all
inputs on which the network stops);
parallel activity rule (output values of all neurons are
recomputed in every step);
activation function
σ(ξ) =



1 ξ ≥ 1 ;
ξ 0 ≤ ξ ≤ 1 ;
0 ξ < 0.
We encode words ω ∈ {0, 1}+ into numbers as follows:
δ(ω) =
|ω|
i=1
ω(i)
2i
+
1
2|ω|+1
E.g. ω = 11001 gives δ(ω) = 1
2 + 1
22 + 1
25 + 1
26
(= 0.110011 in binary form).
51
Neural networks and computability
A network recognizes a language L ⊆ {0, 1}+ if it computes a
function F : A → R (A ⊆ R) such that
ω ∈ L iff δ(ω) ∈ A and F(δ(ω)) > 0.
52
Neural networks and computability
A network recognizes a language L ⊆ {0, 1}+ if it computes a
function F : A → R (A ⊆ R) such that
ω ∈ L iff δ(ω) ∈ A and F(δ(ω)) > 0.
Recurrent networks with rational weights are equivalent to
Turing machines
For every recursively enumerable language L ⊆ {0, 1}+
there is a recurrent network with rational weights and less
than 1000 neurons, which recognizes L.
The halting problem is undecidable for networks with at
least 25 neurons and rational weights.
There is "universal" network (equivalent of the universal
Turing machine)
52
Neural networks and computability
A network recognizes a language L ⊆ {0, 1}+ if it computes a
function F : A → R (A ⊆ R) such that
ω ∈ L iff δ(ω) ∈ A and F(δ(ω)) > 0.
Recurrent networks with rational weights are equivalent to
Turing machines
For every recursively enumerable language L ⊆ {0, 1}+
there is a recurrent network with rational weights and less
than 1000 neurons, which recognizes L.
The halting problem is undecidable for networks with at
least 25 neurons and rational weights.
There is "universal" network (equivalent of the universal
Turing machine)
Recurrent networks are super-Turing powerful
52
Neural networks and computability
A network recognizes a language L ⊆ {0, 1}+ if it computes a
function F : A → R (A ⊆ R) such that
ω ∈ L iff δ(ω) ∈ A and F(δ(ω)) > 0.
Recurrent networks with rational weights are equivalent to
Turing machines
For every recursively enumerable language L ⊆ {0, 1}+
there is a recurrent network with rational weights and less
than 1000 neurons, which recognizes L.
The halting problem is undecidable for networks with at
least 25 neurons and rational weights.
There is "universal" network (equivalent of the universal
Turing machine)
Recurrent networks are super-Turing powerful
For every language L ⊆ {0, 1}+
there is a recurrent network
with less than 1000 nerons which recognizes L.
52
Summary of theoretical results
Neural networks are very strong from the point of view of
theory:
All Boolean functions can be expressed using two-layer
networks.
Two-layer networks may approximate any continuous
function.
Recurrent networks are at least as strong as Turing
machines.
53
Summary of theoretical results
Neural networks are very strong from the point of view of
theory:
All Boolean functions can be expressed using two-layer
networks.
Two-layer networks may approximate any continuous
function.
Recurrent networks are at least as strong as Turing
machines.
These results are purely theoretical!
"Theoretical" networks are extremely huge.
It is very difﬁcult to handcraft them even for simplest
problems.
From practical point of view, the most important advantage
of neural networks are: learning, generalization,
robustness.
53
Neural networks vs classical computers
Neural networks "Classical" computers
Data implicitly in weights explicitly
Computation naturally parallel sequential, localized
Robustness robust w.r.t. input corruption
& damage
changing one bit may
completely crash the
computation
Precision imprecise, network recalls a
training example "similar" to
the input
(typically) precise
Programming learning manual
54
History & implementations
55
History of neurocomputers
1951: SNARC (Minski et al)
the ﬁrst implementation of neural network
a rat strives to exit a maze
40 artiﬁcial neurons (300 vacuum tubes, engines, etc.)
56
History of neurocomputers
1957: Mark I Perceptron (Rosenblatt et al) - the ﬁrst
successful network for image recognition
single layer network
image represented by 20 × 20 photocells
intensity of pixels was treated as the input to a perceptron
(basically the formal neuron), which recognized ﬁgures
weights were implemented using potentiometers, each set
by its own engine
it was possible to arbitrarily reconnect inputs to neurons to
demonstrate adaptability
57
History of neurocomputers
1960: ADALINE (Widrow & Hof)
single layer neural network
weights stored in a newly invented electronic component
memistor, which remembers history of electric current in
the form of resistance.
Widrow founded a company Memistor Corporation, which
sold implementations of neural networks.
1960-66: several companies concerned with neural
networks were founded.
58
History of neurocomputers
1967-82: dead still after publication of a book by Minski &
Papert (published 1969, title Perceptrons)
1983-end of 90s: revival of neural networks
many attempts at hardware implementations
application speciﬁc chips (ASIC)
programmable hardware (FPGA)
hw implementations typically not better than "software"
implementations on universal computers (problems with
weight storage, size, speed, cost of production etc.)
59
History of neurocomputers
1967-82: dead still after publication of a book by Minski &
Papert (published 1969, title Perceptrons)
1983-end of 90s: revival of neural networks
many attempts at hardware implementations
application speciﬁc chips (ASIC)
programmable hardware (FPGA)
hw implementations typically not better than "software"
implementations on universal computers (problems with
weight storage, size, speed, cost of production etc.)
end of 90s-cca 2005: NN suppressed by other machine
learning methods (support vector machines (SVM))
2006-now: The boom of neural networks!
deep networks – often better than any other method
GPU implementations
... some specialized hw implementations (Google’s TPU)
59
Some highlights
Breakthrough in image recognition.
Accuracy of image recognition improved by an order of magnitude in 5
years.
Breakthrough in game playing.
Superhuman results in Go and Chess almost without any human
intervention. Master level in Starcraft, poker, etc.
Breakthrough in machine translation.
Switching to deep learning produced a 60% increase in translation
accuracy compared to the phrase-based approach previously used in
Google Translate (in human evaluation)
Breakthrough in speech processing.
60
History in waves ...
Figure: The ﬁgure shows two of the three historical waves of artiﬁcial
neural nets research, as measured by the frequency of the phrases
"cybernetics" and "connectionism" or "neural networks" according to
Google Books (the third wave is too recent to appear).
61
Current hardware – What do we face?
Increasing dataset size ...
62
Current hardware – What do we face?
... and thus increasing size of neural networks ...
2. ADALINE
4. Early back-propagation network (Rumelhart et al., 1986b)
8. Image recognition: LeNet-5 (LeCun et al., 1998b)
10. Dimensionality reduction: Deep belief network (Hinton et al., 2006)
... here the third "wave" of neural networks started
15. Digit recognition: GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
18. Image recognition (AlexNet): Multi-GPU convolutional network (Krizhevsky et al., 2012)
20. Image recognition: GoogLeNet (Szegedy et al., 2014a)
63
Current hardware – What do we face?
... as a reward we get this ...
Figure: Since deep networks reached the scale necessary to
compete in the ImageNetLarge Scale Visual Recognition Challenge,
they have consistently won the competition every year, and yielded
lower and lower error rates each time. Data from Russakovsky et al.
(2014b) and He et al. (2015).
64
Current hardware
In 2012, Google trained a large network of 1.7
billion weights and 9 layers
The task was image recognition (10 million
youtube video frames)
The hw comprised a 1000 computer network
(16 000 cores), computation took three days.
65
Current hardware
In 2012, Google trained a large network of 1.7
billion weights and 9 layers
The task was image recognition (10 million
youtube video frames)
The hw comprised a 1000 computer network
(16 000 cores), computation took three days.
In 2014, similar task performed on Commodity
Off-The-Shelf High Performance Computing
(COTS HPC) technology: a cluster of GPU
servers with Inﬁniband interconnects and MPI.
Able to train 1 billion parameter networks on
just 3 machines in a couple of days.
Able to scale to 11 billion weights (approx. 6.5
times larger than the Google model) on 16
GPUs. 65
Current hardware – NVIDIA DGX-1 Station
8x GPU (Tesla V100)
TFLOPS = 1000
GPU memory 256GB total
NVIDIA Tensor Cores: 5,120
NVIDIA CUDA Cores: 40,960
System memory: 512 GB
Network: Dual 10 Gb LAN
NVIDIA Deep Learning SDK
66
Deep learning in clouds
Several companies offer cloud services for deep learning:
Amazon Web Services
Google Cloud
Deep Cognition
...
Advantages:
Do not have to care (too much) about technical problems.
Do not have to buy and optimize highend hw/sw, networks etc.
Scaling & virtually limitless storage.
Disadvatages:
Do not have full control.
Performance can vary, connectivity problems.
Have to pay for services.
Privacy issues.
67
Current software
TensorFlow (Google)
open source software library for numerical computation
using data ﬂow graphs
allows implementation of most current neural networks
allows computation on multiple devices (CPUs, GPUs, ...)
Python API
Keras: a part of TensorFlow that allows easy description of
most modern neural networks
PyTorch (Facebook)
similar to TensorFlow
object oriented
Theano (dead):
The "academic" grand-daddy of deep-learning frameworks,
written in Python. Strongly inspired TensorFlow (some
people developing Theano moved on to develop
TensorFlow).
There are others: Caffe, Deeplearning4j, ...
68
Current software – Keras
69
Current software – Keras functional API
70
Current software – TensorFlow
71
Current software – TensorFlow
72
Current software – PyTorch
73
Other software implementations
Most "mathematical" software packages contain some support
of neural networks:
MATLAB
R
STATISTICA
Weka
...
The implementations are typically not on par with the previously
mentioned dedicated deep-learning libraries.
74
Training linear models
75
Linear regression (ADALINE)
Architecture:
x1 x2 xn
· · ·
y
x0 = 1
w0
w1 w2 wn
w = (w0, w1, . . . , wn) and x = (x0, x1, . . . , xn) where x0 = 1.
Activity:
inner potential: ξ = w0 + n
i=1 wixi = n
i=0 wixi = w · x
activation function: σ(ξ) = ξ
network function: y[w](x) = σ(ξ) = w · x
76
Linear regression (ADALINE)
Learning:
Given a training dataset
T = x1, d1 , x2, d2 , . . . , xp, dp
Here xk = (xk0, xk1 . . . , xkn) ∈ Rn+1, xk0 = 1, is the k-th
input, and dk ∈ R is the expected output.
Intuition: The network is supposed to compute an afﬁne approximation of the
function (some of) whose values are given in the training set.
77
Oaks in Wisconsin
78
Linear regression (ADALINE)
Error function:
E(w) =
1
2
p
k=1
w · xk − dk
2
=
1
2
p
k=1


n
i=0
wixki − dk


2
The goal is to ﬁnd w which minimizes E(w).
79
Error function
80
Gradient of the error function
Consider gradient of the error function:
E(w) =
∂E
∂w0
(w), . . . ,
∂E
∂wn
(w)
Intuition: E(w) is a vector in the weight space which points in
the direction of the steepest ascent of the error function.
Note that the vectors xk are just parameters of the function E, and are thus
ﬁxed!
81
Gradient of the error function
Consider gradient of the error function:
E(w) =
∂E
∂w0
(w), . . . ,
∂E
∂wn
(w)
Intuition: E(w) is a vector in the weight space which points in
the direction of the steepest ascent of the error function.
Note that the vectors xk are just parameters of the function E, and are thus
ﬁxed!
Fact
If E(w) = 0 = (0, . . . , 0), then w is a global minimum of E.
For ADALINE, the error function E(w) is a convex paraboloid and thus has
the unique global minimum.
81
Gradient - illustration
Caution! This picture just illustrates the notion of gradient ... it is not
the convex paraboloid E(w) !
82
Gradient of the error function
First, consider n = 1.
Then the model is y = w0 + w1 · x.
83
Gradient of the error function
First, consider n = 1.
Then the model is y = w0 + w1 · x.
Consider a concrete training set:
T = {((1, 2), 1), ((1, 3), 2), ((1, 4), 5)}
= ((x10, x11), d1), ((x20, x21), d2), ((x30, x31), d3)
83
Gradient of the error function
First, consider n = 1.
Then the model is y = w0 + w1 · x.
Consider a concrete training set:
T = {((1, 2), 1), ((1, 3), 2), ((1, 4), 5)}
= ((x10, x11), d1), ((x20, x21), d2), ((x30, x31), d3)
E(w0, w1) = 1
2 [(w0+w1·2−1)2+(w0+w1·3−2)2+(w0+w1·4−5)2]
83
Gradient of the error function
First, consider n = 1.
Then the model is y = w0 + w1 · x.
Consider a concrete training set:
T = {((1, 2), 1), ((1, 3), 2), ((1, 4), 5)}
= ((x10, x11), d1), ((x20, x21), d2), ((x30, x31), d3)
E(w0, w1) = 1
2 [(w0+w1·2−1)2+(w0+w1·3−2)2+(w0+w1·4−5)2]
δE
δw0
83
Gradient of the error function
First, consider n = 1.
Then the model is y = w0 + w1 · x.
Consider a concrete training set:
T = {((1, 2), 1), ((1, 3), 2), ((1, 4), 5)}
= ((x10, x11), d1), ((x20, x21), d2), ((x30, x31), d3)
E(w0, w1) = 1
2 [(w0+w1·2−1)2+(w0+w1·3−2)2+(w0+w1·4−5)2]
δE
δw0
= (w0 +w1 ·2−1)·1+(w0 +w1 ·3−2)·1+(w0 +w1 ·4−5)·1
83
Gradient of the error function
First, consider n = 1.
Then the model is y = w0 + w1 · x.
Consider a concrete training set:
T = {((1, 2), 1), ((1, 3), 2), ((1, 4), 5)}
= ((x10, x11), d1), ((x20, x21), d2), ((x30, x31), d3)
E(w0, w1) = 1
2 [(w0+w1·2−1)2+(w0+w1·3−2)2+(w0+w1·4−5)2]
δE
δw0
= (w0 +w1 ·2−1)·1+(w0 +w1 ·3−2)·1+(w0 +w1 ·4−5)·1
δE
δw1
83
Gradient of the error function
First, consider n = 1.
Then the model is y = w0 + w1 · x.
Consider a concrete training set:
T = {((1, 2), 1), ((1, 3), 2), ((1, 4), 5)}
= ((x10, x11), d1), ((x20, x21), d2), ((x30, x31), d3)
E(w0, w1) = 1
2 [(w0+w1·2−1)2+(w0+w1·3−2)2+(w0+w1·4−5)2]
δE
δw0
= (w0 +w1 ·2−1)·1+(w0 +w1 ·3−2)·1+(w0 +w1 ·4−5)·1
δE
δw1
= (w0 +w1 ·2−1)·2+(w0 +w1 ·3−2)·3+(w0 +w1 ·4−5)·4
83
Gradient of the error function
∂E
∂w
(w) =
1
2
p
k=1
δ
δw


n
i=0
wixki − dk


2
84
Gradient of the error function
∂E
∂w
(w) =
1
2
p
k=1
δ
δw


n
i=0
wixki − dk


2
=
1
2
p
k=1
2


n
i=0
wixki − dk


δ
δw


n
i=0
wixki − dk


84
Gradient of the error function
∂E
∂w
(w) =
1
2
p
k=1
δ
δw


n
i=0
wixki − dk


2
=
1
2
p
k=1
2


n
i=0
wixki − dk


δ
δw


n
i=0
wixki − dk


=
1
2
p
k=1
2


n
i=0
wixki − dk




n
i=0
δ
δw
wixki −
δE
δw
dk


84
Gradient of the error function
∂E
∂w
(w) =
1
2
p
k=1
δ
δw


n
i=0
wixki − dk


2
=
1
2
p
k=1
2


n
i=0
wixki − dk


δ
δw


n
i=0
wixki − dk


=
1
2
p
k=1
2


n
i=0
wixki − dk




n
i=0
δ
δw
wixki −
δE
δw
dk


=
p
k=1
w · xk − dk xk
84
Gradient of the error function
∂E
∂w
(w) =
1
2
p
k=1
δ
δw


n
i=0
wixki − dk


2
=
1
2
p
k=1
2


n
i=0
wixki − dk


δ
δw


n
i=0
wixki − dk


=
1
2
p
k=1
2


n
i=0
wixki − dk




n
i=0
δ
δw
wixki −
δE
δw
dk


=
p
k=1
w · xk − dk xk
Thus
E(w) =
∂E
∂w0
(w), . . . ,
∂E
∂wn
(w) =
p
k=1
w · xk − dk xk
84
Linear regression - learning
Batch algorithm (gradient descent):
Idea: In every step "move" the weights in the direction opposite
to the gradient.
85
Linear regression - learning
Batch algorithm (gradient descent):
Idea: In every step "move" the weights in the direction opposite
to the gradient.
The algorithm computes a sequence of weight vectors
w(0), w(1), w(2), . . ..
weights in w(0) are randomly initialized to values close to 0
85
Linear regression - learning
Batch algorithm (gradient descent):
Idea: In every step "move" the weights in the direction opposite
to the gradient.
The algorithm computes a sequence of weight vectors
w(0), w(1), w(2), . . ..
weights in w(0) are randomly initialized to values close to 0
in the step t + 1, weights w(t+1) are computed as follows:
w(t+1)
= w(t)
− ε · E(w(t)
)
= w(t)
− ε ·
p
k=1
w(t)
· xk − dk · xk
Here k = (t mod p) + 1 and 0 < ε ≤ 1 is a learning rate.
85
Linear regression - learning
Batch algorithm (gradient descent):
Idea: In every step "move" the weights in the direction opposite
to the gradient.
The algorithm computes a sequence of weight vectors
w(0), w(1), w(2), . . ..
weights in w(0) are randomly initialized to values close to 0
in the step t + 1, weights w(t+1) are computed as follows:
w(t+1)
= w(t)
− ε · E(w(t)
)
= w(t)
− ε ·
p
k=1
w(t)
· xk − dk · xk
Here k = (t mod p) + 1 and 0 < ε ≤ 1 is a learning rate.
Proposition
For sufﬁciently small ε > 0 the sequence w(0), w(1), w(2), . . .
converges (componentwise) to the global minimum of E (i.e. to
the vector w satisfying E(w) = 0).
85
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
Linear regression - animation
86
ADALINE - learning
Online algorithm (Delta-rule, Widrow-Hoff rule):
weights in w(0) initialized randomly close to 0
in the step t + 1, weights w(t+1) are computed as follows:
w(t+1)
= w(t)
− ε(t) · w(t)
· xk − dk · xk
Here k = t mod p + 1 and 0 < ε(t) ≤ 1 is a learning rate in
the step t + 1.
Note that the algorithm does not work with the complete gradient but
only with its part determined by the currently considered training
example.
87
ADALINE - learning
Online algorithm (Delta-rule, Widrow-Hoff rule):
weights in w(0) initialized randomly close to 0
in the step t + 1, weights w(t+1) are computed as follows:
w(t+1)
= w(t)
− ε(t) · w(t)
· xk − dk · xk
Here k = t mod p + 1 and 0 < ε(t) ≤ 1 is a learning rate in
the step t + 1.
Note that the algorithm does not work with the complete gradient but
only with its part determined by the currently considered training
example.
Theorem (Widrow & Hoff)
If ε(t) = 1
t , then w(0), w(1), w(2), . . . converges to the global
minimum of E.
87