PV021: Neural networks
Tomáš Brázdil
Course organization
Course materials:
▶ Main: The lecture
▶ Neural Networks and Deep Learning by Michael Nielsen
http://neuralnetworksanddeeplearning.com/
(Extremely well written modern online textbook.)
▶ Deep learning by Ian Goodfellow, Yoshua Bengio and Aaron
Courville
http://www.deeplearningbook.org/
(A very good overview of the state-of-the-art in neural networks.)
▶ Infinitely many online tutorials on everything (to build intuition)
Suggested: deeplearning.ai courses by Andrew Ng
Course organization
Evaluation:
▶ Project
▶ teams of two students
▶ implementation of a selected model + analysis of given data
▶ implementation in either C or C++, without the use of any
specialized libraries for data analysis and machine
learning
▶ need to get over a given accuracy threshold (a gentle one,
just to eliminate non-functional implementations)
▶ Oral exam
▶ I may ask about anything from the lecture! You will get
a detailed manual specifying the mandatory knowledge.
FAQ
Q: Why can't we use specialized libraries in projects?
A: In order to "touch" the low-level implementation details of the
algorithms. You should not even use libraries for linear algebra
and numerical methods, so that you will be confronted with
rounding errors and numerical instabilities.
Q: Why should you attend this course when there are infinitely
many great resources elsewhere?
A: There are at least two reasons:
▶ You may discuss issues with me, my colleagues and other
students.
▶ I will make you truly learn fundamentals by heart.
Notable features of the course
▶ Use of mathematical notation and reasoning (contains several
proofs that are mandatory for the exam)
▶ Sometimes goes deeper into the statistical underpinnings of neural
network learning
▶ The project demands a complete working solution which must
satisfy a prescribed performance specification
An unusual exam system! You can repeat the oral exam as many
times as needed (only the best grade goes into IS).
An example of an instruction email (from another course with the
same system):
It is typically not sufficient to devote a single
afternoon to the preparation for the exam.
You have to know _everything_ (which means every
single thing) starting with the slide 42
and ending with the slide 245 with notable exceptions
of slides: 121 - 123, 137 - 140, 165, 167.
Proofs presented on the whiteboard are also mandatory.
Machine learning in general
▶ Machine learning = construction of systems that may learn their
functionality from data
(... and thus do not need to be programmed.)
▶ spam filter
▶ learns to recognize spam from a database of "labelled"
emails
▶ consequently is able to distinguish spam from ham
▶ handwritten text reader
▶ learns from a database of handwritten
letters (or text) labelled by their correct
meaning
▶ consequently is able to recognize text
▶ · · ·
▶ and many far more sophisticated applications ...
▶ Basic attributes of learning algorithms:
▶ representation: ability to capture the inner structure of
training data
▶ generalization: ability to work properly on new data
Machine learning in general
Machine learning algorithms typically construct mathematical
models of given data. The models may be subsequently
applied to fresh data.
There are many types of models:
▶ decision trees
▶ support vector machines
▶ hidden Markov models
▶ Bayes networks and other graphical models
▶ neural networks
▶ · · ·
Neural networks, based on models of a (human) brain, form
a natural basis for learning algorithms!
Artificial neural networks
▶ An artificial neuron is a rough mathematical approximation
of a biological neuron.
▶ An (artificial) neural network (NN) consists of a number of
interconnected artificial neurons. The "behavior" of the network
is encoded in the connections between neurons.
[Figure: an artificial neuron with inputs x1, x2, ..., xn, inner potential ξ, activation function σ, and output y.]
Image source: http://tulane.edu/sse/cmb/people/schrader/
Why artificial neural networks?
Modelling of biological neural networks (computational
neuroscience).
▶ simplified mathematical models help to identify important
mechanisms
▶ How does a brain receive information?
▶ How is the information stored?
▶ How does a brain develop?
▶ · · ·
▶ neuroscience is strongly multidisciplinary; precise
mathematical descriptions help in communication among
experts and in the design of new experiments.
I will not spend much time on this area!
Why artificial neural networks?
Neural networks in machine learning.
▶ Typically primitive models, far from their biological
counterparts (but often inspired by biology).
▶ Strongly oriented towards concrete application domains:
▶ decision making and control - autonomous vehicles,
manufacturing processes, control of natural resources
▶ games - backgammon, poker, Go, StarCraft, ...
▶ finance - stock prices, risk analysis
▶ medicine - diagnosis, signal processing (EKG, EEG, ...), image
processing (MRI, CT, WSI ...)
▶ text and speech processing - machine translation, text
generation, speech recognition
▶ other signal processing - filtering, radar tracking, noise
reduction
▶ art - music and painting generation, deepfakes
▶ · · ·
I will concentrate on this area!
Important features of neural networks
▶ Massive parallelism
▶ many slow (and "dumb") computational elements work in
parallel on several levels of abstraction
▶ Learning
▶ a kid learns to recognize a rabbit after seeing several
rabbits
▶ Generalization
▶ a kid is able to recognize a new rabbit after seeing several
(old) rabbits
▶ Robustness
▶ a blurred photo of a rabbit may still be classified as an
image of a rabbit
▶ Graceful degradation
▶ Experiments have shown that a damaged neural network is
still able to work quite well
▶ A damaged network may re-adapt; the remaining neurons may
take on the functionality of the damaged ones
The aim of the course
▶ We will concentrate on
▶ basic techniques and principles of neural networks,
▶ fundamental models of neural networks and their
applications.
▶ You should learn
▶ basic models
(multilayer perceptron, convolutional networks, recurrent networks,
transformers, autoencoders and generative adversarial networks)
▶ Simple applications of these models
(image processing, a little bit of speech and text processing)
▶ Basic learning algorithms
(gradient descent with backpropagation)
▶ Basic practical training techniques
(data preparation, setting various hyper-parameters, control of
learning, improving generalization)
▶ Basic information about current implementations
(TensorFlow-Keras, PyTorch)
Biological neural network
▶ The human neural network consists of approximately $10^{11}$ (100
billion on the short scale) neurons; a single cubic
centimeter of a human brain contains almost 50 million
neurons.
▶ Each neuron is connected with approx. $10^4$ neurons.
▶ Neurons themselves are very complex systems.
Rough description of the nervous system:
▶ An external stimulus is received by sensory receptors (e.g.
eye cells).
▶ Information is further transferred via the peripheral nervous
system (PNS) to the central nervous system (CNS), where
it is processed (integrated), and subsequently an output
signal is produced.
▶ Afterwards, the output signal is transferred via the PNS to
effectors (e.g. muscle cells).
Biological neural network
Source: N. Campbell and J. Reece; Biology, 7th Edition; ISBN: 080537146X
Summation
Biological and Mathematical neurons
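Formal neuron (without bias)
[Figure: a neuron with inputs x1, x2, ..., xn, weights w1, w2, ..., wn, inner potential ξ, activation function σ, and output y.]
▶ $x_1, \ldots, x_n \in \mathbb{R}$ are inputs
▶ $w_1, \ldots, w_n \in \mathbb{R}$ are weights
▶ $\xi$ is an inner potential;
almost always $\xi = \sum_{i=1}^{n} w_i x_i$
▶ $y$ is an output given by $y = \sigma(\xi)$,
where $\sigma$ is an activation function;
e.g. a unit step function
$$\sigma(\xi) = \begin{cases} 1 & \xi \ge h; \\ 0 & \xi < h, \end{cases}$$
where $h \in \mathbb{R}$ is a threshold.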
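Formal neuron (with bias)
[Figure: a neuron with inputs x1, x2, ..., xn and weights w1, w2, ..., wn, plus a constant input x0 = 1 whose weight w0 = −h (the bias) replaces the threshold; inner potential ξ, activation function σ, output y.]
▶ $x_0 = 1, x_1, \ldots, x_n \in \mathbb{R}$ are inputs
▶ $w_0, w_1, \ldots, w_n \in \mathbb{R}$ are weights
▶ $\xi$ is an inner potential;
almost always $\xi = w_0 + \sum_{i=1}^{n} w_i x_i$
▶ $y$ is an output given by $y = \sigma(\xi)$,
where $\sigma$ is an activation function;
e.g. a unit step function
$$\sigma(\xi) = \begin{cases} 1 & \xi \ge 0; \\ 0 & \xi < 0. \end{cases}$$
(The threshold h has been substituted
with the new input $x_0 = 1$ and the weight
$w_0 = -h$.)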
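The equivalence of the threshold and bias formulations can be checked with a minimal sketch. The function names and the example weights below are hypothetical and chosen only for illustration; the two definitions themselves follow the slides.

```python
def unit_step(xi):
    """Unit step activation: 1 if xi >= 0, else 0."""
    return 1 if xi >= 0 else 0


def neuron_with_threshold(x, w, h):
    """Neuron without bias: outputs 1 when sum_i w_i * x_i >= h."""
    xi = sum(wi * xv for wi, xv in zip(w, x))
    return 1 if xi >= h else 0


def neuron_with_bias(x, w, w0):
    """Neuron with bias: xi = w0 + sum_i w_i * x_i, output = step(xi)."""
    xi = w0 + sum(wi * xv for wi, xv in zip(w, x))
    return unit_step(xi)


# Hypothetical example values (not taken from the slides).
w = [2.0, -1.0, 0.5]
h = 1.0
inputs = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]

for x in inputs:
    # Setting w0 = -h makes both formulations agree on every input.
    assert neuron_with_threshold(x, w, h) == neuron_with_bias(x, w, -h)
```

Folding the threshold into the weight w0 means a learning algorithm can treat it like any other weight.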
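Neuron and linear separation
[Figure: the hyperplane ξ = 0 splits the input space into a half-space with ξ > 0 and a half-space with ξ < 0.]
▶ the inner potential
$$\xi = w_0 + \sum_{i=1}^{n} w_i x_i$$
determines a separation hyperplane in
the n-dimensional input space
▶ in 2D a line
▶ in 3D a plane
▶ · · ·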
Neuron geometry
Neuron and linear separation
[Figure: a neuron with inputs x1, ..., xn and weights w1, ..., wn computes $\sigma(\sum_i w_i x_i)$ and outputs 1/0 for A/B.]
$n = 8 \cdot 8$, i.e. the number of pixels in the images. Inputs are
binary vectors of dimension n (black pixel ≈ 1, white pixel ≈ 0).
Neuron and linear separation
[Figure: the same neuron with an additional constant input x0 = 1 and bias weight w0; it outputs 1/0 for A/B.]
$n = 8 \cdot 8$, i.e. the number of pixels in the images. Inputs are
binary vectors of dimension n (black pixel ≈ 1, white pixel ≈ 0).
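As a rough sketch of how an 8 × 8 image becomes the input of a single neuron: the image is flattened row by row into a vector of n = 64 binary values and fed to a neuron with bias. The image, the weights, and the row-by-row pixel order below are hypothetical assumptions, not values from the lecture.

```python
def unit_step(xi):
    return 1 if xi >= 0 else 0


def flatten(image):
    """Turn an 8x8 grid of 0/1 pixels into a vector of 64 inputs, row by row."""
    return [pixel for row in image for pixel in row]


def classify(image, w, w0):
    """Output 1 (read as class A) or 0 (class B) for a binary image."""
    x = flatten(image)
    xi = w0 + sum(wi * xv for wi, xv in zip(w, x))
    return unit_step(xi)


# Hypothetical image: black pixels (1) on the main diagonal, white (0) elsewhere.
image = [[1 if r == c else 0 for c in range(8)] for r in range(8)]

# Hypothetical weights: the neuron outputs 1 when at least 4 pixels are black.
w = [1.0] * 64
w0 = -4.0

x = flatten(image)
label = classify(image, w, w0)
```

A trained neuron would of course use learned weights; the point here is only the shape of the input.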
Neuron and linear separation
[Figure: points of classes A and B in the plane together with two lines, $\bar{w}_0 + \sum_{i=1}^{n} \bar{w}_i x_i = 0$ and $w_0 + \sum_{i=1}^{n} w_i x_i = 0$.]
▶ Red line classifies incorrectly
▶ Green line classifies correctly
(may be a result of
a correction by a learning
algorithm)
Neuron and linear separation (XOR)
[Figure: the four XOR points in the (x1, x2) plane: (0, 0) ↦ 0, (0, 1) ↦ 1, (1, 0) ↦ 1, (1, 1) ↦ 0.]
▶ No line separates ones from
zeros.
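A short argument for the claim, written as a sketch: assume some line $w_0 + w_1 x_1 + w_2 x_2 = 0$ puts exactly the inputs with XOR value 1 on the side $\xi \ge 0$, and derive a contradiction.

```latex
\begin{align*}
(0,0) \mapsto 0:\quad & w_0 < 0 \\
(1,1) \mapsto 0:\quad & w_0 + w_1 + w_2 < 0 \\
(0,1) \mapsto 1:\quad & w_0 + w_2 \ge 0 \\
(1,0) \mapsto 1:\quad & w_0 + w_1 \ge 0 \\[4pt]
\text{sum of the first two:}\quad & 2 w_0 + w_1 + w_2 < 0 \\
\text{sum of the last two:}\quad & 2 w_0 + w_1 + w_2 \ge 0
\end{align*}
```

The last two lines contradict each other, so no such weights exist. Swapping the roles of the two sides (ones on $\xi < 0$, zeros on $\xi \ge 0$) leads to the same contradiction with the inequalities reversed.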
Neural networks
A neural network consists of formal neurons interconnected in
such a way that the output of one neuron is an input of several
other neurons.
In order to describe a particular type of neural network, we
need to specify:
▶ Architecture
How the neurons are connected.
▶ Activity
How the network transforms inputs to outputs.
▶ Learning
How the weights are changed during training.
Architecture
Network architecture is given as a digraph whose nodes are
neurons and edges are connections.
We distinguish several categories of
neurons:
▶ Output neurons
▶ Hidden neurons
▶ Input neurons
(In general, a neuron may be both input and
output; a neuron is hidden if it is neither input,
nor output.)
Architecture – Cycles
▶ A network is cyclic (recurrent) if its architecture contains a
directed cycle.
▶ Otherwise it is acyclic (feed-forward)
Architecture – Multilayer Perceptron (MLP)
[Figure: an MLP with an input layer (x1, x2), hidden layers, and an output layer (y1, y2).]
▶ Neurons are partitioned into layers:
one input layer, one output layer,
possibly several hidden layers
▶ layers are numbered from 0; the
input layer has number 0
▶ E.g. a three-layer network has
two hidden layers and one
output layer
▶ Neurons in the i-th layer are
connected with all neurons in
the (i + 1)-st layer
▶ The architecture of an MLP is typically
described by the numbers of neurons
in individual layers (e.g. 2-4-3-2)
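To make the 2-4-3-2 notation concrete, the following sketch lists the shape of the weight matrix between each pair of consecutive layers and counts the weights. It assumes every non-input neuron also has a bias weight w0, as in the formal neuron with bias; the function name is hypothetical.

```python
def mlp_weight_shapes(layer_sizes):
    """For consecutive layers return (neurons in the next layer, neurons in this layer + 1 bias)."""
    return [(n_out, n_in + 1) for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]


layer_sizes = [2, 4, 3, 2]  # the 2-4-3-2 architecture from the slide
shapes = mlp_weight_shapes(layer_sizes)
num_weights = sum(rows * cols for rows, cols in shapes)

print(shapes)       # [(4, 3), (3, 5), (2, 4)]
print(num_weights)  # 35
```

Without bias weights the count would be 2·4 + 4·3 + 3·2 = 26.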
Activity
Consider a network with n neurons, k of which are input and ℓ output.
▶ State of a network is a vector of output values of all
neurons.
(States of a network with n neurons are vectors of $\mathbb{R}^n$.)
▶ State-space of a network is a set of all states.
▶ Network input is a vector of k real numbers, i.e.
an element of $\mathbb{R}^k$.
▶ Network input space is a set of all network inputs.
(Sometimes we restrict ourselves to a proper subset of $\mathbb{R}^k$.)
▶ Initial state
Input neurons are set to values from the network input
(each component of the network input corresponds to an input
neuron).
Values of the remaining neurons are set to 0.
Activity – computation of a network
▶ Computation (typically) proceeds in discrete steps.
In every step the following happens:
1. A set of neurons is selected according to some rule.
2. The selected neurons change their states according to their
inputs (they are simply evaluated).
(If a neuron does not have any inputs, its value remains constant.)
A computation is finite on a network input $\vec{x}$ if the state
changes only finitely many times (i.e. there is a moment in
time after which the state of the network never changes).
We also say that the network stops on $\vec{x}$.
▶ Network output is a vector of values of all output neurons
in the network (i.e. an element of $\mathbb{R}^\ell$).
Note that the network output keeps changing throughout
the computation!
MLP uses the following selection rule:
In the i-th step evaluate all neurons in the i-th layer.
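The computation described above can be sketched as a small simulation: a state maps each neuron to its value, and every step evaluates the neurons chosen by a selection rule. The three-neuron chain, its weights, and all names below are hypothetical; only the step structure and the layer-wise MLP rule come from the slides.

```python
def unit_step(xi):
    return 1 if xi >= 0 else 0


def evaluate(neuron, state, weights, biases):
    """New value of one neuron: step(w0 + weighted sum of its input neurons)."""
    xi = biases[neuron] + sum(w * state[src] for src, w in weights[neuron].items())
    return unit_step(xi)


def run(state, weights, biases, selection_rule, steps):
    """Run the given number of steps and return the list of visited states."""
    history = [dict(state)]
    for step in range(1, steps + 1):
        selected = selection_rule(step)
        # All selected neurons read the state from before this step.
        new_values = {n: evaluate(n, state, weights, biases) for n in selected}
        state = {**state, **new_values}
        history.append(dict(state))
    return history


# Hypothetical chain a -> b -> c: "a" is the input neuron (layer 0),
# "b" is in layer 1 and "c" (the output neuron) is in layer 2.
weights = {"b": {"a": 1}, "c": {"b": 1}}
biases = {"b": -1, "c": -1}
layers = {1: ["b"], 2: ["c"]}


def mlp_rule(step):
    """MLP selection rule: in the i-th step evaluate all neurons in the i-th layer."""
    return layers.get(step, [])


initial_state = {"a": 1, "b": 0, "c": 0}  # input set to 1, other neurons set to 0
history = run(initial_state, weights, biases, mlp_rule, steps=3)
for state in history:
    print(state)
```

In this run the output neuron "c" changes from 0 to 1 in the second step and the state no longer changes afterwards, which matches the remark that the network output keeps changing until the computation stops.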
Activity – semantics of a network
Definition
Consider a network with n neurons, k input, ℓ output.
Let $A \subseteq \mathbb{R}^k$ and $B \subseteq \mathbb{R}^\ell$. Suppose that the network stops on
every input of A.
Then we say that the network computes a function $F : A \to B$ if
for every network input $\vec{x} \in A$ the vector $F(\vec{x}) \in B$ is the output of
the network after the computation on $\vec{x}$ stops.
Example 1
[Figure: an example network with two input neurons and one output neuron.]
This network computes a function
from $\mathbb{R}^2$ to $\mathbb{R}$.
Activity – inner potential and activation functions
In order to specify the activity of the network, we need to specify
how the inner potentials ξ are computed and what
the activation functions σ are.
We assume (unless otherwise specified) that
$$\xi = w_0 + \sum_{i=1}^{n} w_i \cdot x_i$$
where $\vec{x} = (x_1, \ldots, x_n)$ are inputs of the neuron and
$\vec{w} = (w_1, \ldots, w_n)$ are weights.
There are special types of neural networks where the inner
potential is computed differently, e.g., as a "distance" of
an input from the weight vector:
$$\xi = \lVert \vec{x} - \vec{w} \rVert$$
where $\lVert \cdot \rVert$ is a vector norm, typically Euclidean.
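Both ways of computing the inner potential fit in a few lines. This is a minimal sketch with hypothetical function names and example values; the distance variant uses the Euclidean norm.

```python
import math


def inner_potential_linear(x, w, w0):
    """xi = w0 + sum_i w_i * x_i (the default form)."""
    return w0 + sum(wi * xi for wi, xi in zip(w, x))


def inner_potential_distance(x, w):
    """xi = ||x - w||, the Euclidean distance of the input from the weight vector."""
    return math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))


# Hypothetical example values.
x = [3.0, 4.0]
print(inner_potential_linear(x, [1.0, 2.0], 0.5))  # 11.5
print(inner_potential_distance(x, [0.0, 0.0]))     # 5.0
```

With the distance form, the inner potential is smallest when the input coincides with the weight vector.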
Activity – inner potential and activation functions
There are many activation functions, typical examples:
▶ Unit step function
$$\sigma(\xi) = \begin{cases} 1 & \xi \ge 0; \\ 0 & \xi < 0. \end{cases}$$
▶ (Logistic) sigmoid
$$\sigma(\xi) = \frac{1}{1 + e^{-\lambda \cdot \xi}}$$
where $\lambda \in \mathbb{R}$ is a steepness parameter.
▶ Hyperbolic tangent
$$\sigma(\xi) = \tanh(\xi) = \frac{1 - e^{-2\xi}}{1 + e^{-2\xi}}$$
▶ ReLU
σ(ξ) = max(ξ, 0)
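The activation functions listed above can be sketched in plain Python. The function names are hypothetical. The sigmoid is written in a numerically careful form, because evaluating `math.exp(-z)` directly raises an `OverflowError` for strongly negative z; this is the kind of issue the project is meant to expose.

```python
import math


def unit_step(xi):
    return 1.0 if xi >= 0 else 0.0


def sigmoid(xi, lam=1.0):
    """Logistic sigmoid 1 / (1 + exp(-lam * xi)), avoiding overflow in exp."""
    z = lam * xi
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)


def tanh(xi):
    return math.tanh(xi)


def relu(xi):
    return max(xi, 0.0)


# The hyperbolic tangent is a rescaled sigmoid: tanh(xi) = 2 * sigmoid(2 * xi) - 1.
assert abs(tanh(0.5) - (2.0 * sigmoid(2.0 * 0.5) - 1.0)) < 1e-12
```

The steepness parameter λ is exposed as `lam`; larger values make the sigmoid approach the unit step function.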
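Activity – XOR
[Figure: a network with inputs x1, x2, two hidden neurons and one output neuron, each with a constant bias input 1. Hidden neuron 1 has weights 2, 2 and bias −1; hidden neuron 2 has weights −2, −2 and bias 3; the output neuron has weights 1, 1 and bias −2. Successive frames of the slide evaluate the network layer by layer on each input.]
▶ Activation function is a unit
step function
$$\sigma(\xi) = \begin{cases} 1 & \xi \ge 0; \\ 0 & \xi < 0. \end{cases}$$
▶ The network computes
XOR(x1, x2)
x1 x2 y
1  1  0
1  0  1
0  1  1
0  0  0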
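The XOR network can be evaluated layer by layer in a few lines. The hidden-layer weights below correspond to the lines −1 + 2x1 + 2x2 = 0 and 3 − 2x1 − 2x2 = 0 given later in the lecture; the output neuron's weights (1, 1) and bias −2 are read from the figure and should be treated as an assumption. Names are hypothetical.

```python
def unit_step(xi):
    return 1 if xi >= 0 else 0


def layer(inputs, weights, biases):
    """Evaluate one layer: each neuron computes step(bias + sum of weight * input)."""
    return [unit_step(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]


HIDDEN_W = [[2, 2], [-2, -2]]  # one row of weights per hidden neuron
HIDDEN_B = [-1, 3]
OUTPUT_W = [[1, 1]]            # assumed from the figure
OUTPUT_B = [-2]                # assumed from the figure


def xor_net(x1, x2):
    hidden = layer([x1, x2], HIDDEN_W, HIDDEN_B)  # step 1: hidden layer
    output = layer(hidden, OUTPUT_W, OUTPUT_B)    # step 2: output layer
    return output[0]


for x1, x2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(x1, x2, xor_net(x1, x2))
```

The first hidden neuron fires when at least one input is 1, the second when at most one input is 1, and the output neuron fires only when both hidden neurons do.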
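Activity – MLP and linear separation
[Figure: the XOR points (0, 0) ↦ 0, (0, 1) ↦ 1, (1, 0) ↦ 1, (1, 1) ↦ 0 in the (x1, x2) plane with the lines P1 and P2, next to the XOR network whose two hidden neurons correspond to P1 and P2.]
▶ The line P1 is given by
$-1 + 2x_1 + 2x_2 = 0$
▶ The line P2 is given by
$3 - 2x_1 - 2x_2 = 0$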
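Activity – example
[Figure: a small network with one input x1 and three neurons, each with a constant bias input 1; successive frames of the slide evaluate the neurons step by step.]
The activation function is
the unit step function
$$\sigma(\xi) = \begin{cases} 1 & \xi \ge 0; \\ 0 & \xi < 0. \end{cases}$$
The input is equal to 1.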
Learning
Consider a network with n neurons, k input and ℓ output.
▶ Configuration of a network is a vector of all values of
weights.
(Configurations of a network with m connections are elements of $\mathbb{R}^m$.)
▶ Weight-space of a network is a set of all configurations.
▶ Initial configuration
weights can be initialized randomly or using some sophisticated
algorithm
Learning algorithms
Learning rule for weight adaptation.
(the goal is to find a configuration in which the network computes
a desired function)
▶ Supervised learning
▶ The desired function is described using training examples
that are pairs of the form (input, output).
▶ Learning algorithm searches for a configuration which
"corresponds" to the training examples, typically by
minimizing an error function.
▶ Unsupervised learning
▶ The training set contains only inputs.
▶ The goal is to determine the distribution of the inputs
(clustering, deep belief networks, etc.)
38
Supervised learning – illustration
[Figure: points labelled A and B in the plane, with a separating line]
▶ classification in the plane using a single neuron
▶ training examples are of the form (point, value), where the value is 1 if the point is labelled A and 0 if it is labelled B
▶ the algorithm considers examples one after another
▶ whenever an incorrectly classified point is considered, the learning algorithm turns the line in the direction of the point
39
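The procedure in the illustration can be sketched in a few lines of Python. The slide does not name a concrete update rule, so the classic perceptron update is used here as an assumption; the data points, the learning rate `lr` and the function names are made up for illustration only.

```python
def step(xi):
    """Unit step activation: 1 for xi >= 0, otherwise 0."""
    return 1 if xi >= 0 else 0


def train_single_neuron(examples, epochs=100, lr=1.0):
    """Assumed perceptron-style rule: examples are ((x1, x2), value) pairs."""
    w = [0.0, 0.0, 0.0]  # w0 (weight of the formal unit input), w1, w2
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), d in examples:        # examples one after another
            x = (1.0, x1, x2)               # x0 = 1 is the formal unit input
            y = step(sum(wi * xi for wi, xi in zip(w, x)))
            if y != d:                      # incorrectly classified point
                mistakes += 1
                # turn the separating line towards (or away from) the point
                w = [wi + lr * (d - y) * xi for wi, xi in zip(w, x)]
        if mistakes == 0:                   # every example is classified correctly
            break
    return w


# Hypothetical training set: value 1 for points labelled A, 0 for points labelled B.
examples = [((2.0, 3.0), 1), ((3.0, 2.5), 1), ((2.5, 4.0), 1), ((3.5, 3.5), 1),
            ((0.0, 0.5), 0), ((1.0, 0.0), 0), ((0.5, 1.0), 0)]
w = train_single_neuron(examples)
predictions = [step(w[0] + w[1] * x1 + w[2] * x2) for (x1, x2), _ in examples]
```

On linearly separable data such as this toy set, the loop stops as soon as one full pass produces no mistakes.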
Summary – Advantages of neural networks
▶ Massive parallelism
▶ neurons can be evaluated in parallel
▶ Learning
▶ many sophisticated learning algorithms used to "program"
neural networks
▶ Generalization and robustness
▶ information is encoded in a distributed manner in weights
▶ "close" inputs typically get similar values
▶ Graceful degradation
▶ damage typically causes only a decrease in precision of
results
40
Expressive power of neural networks
41
Formal neuron (with bias)
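[Figure: a formal neuron with inputs x1, x2, ..., xn weighted by w1, w2, ..., wn, the formal unit input x0 = 1 weighted by the bias w0 = −h (h is the threshold), inner potential ξ, activation function σ and output y]
▶ $x_0 = 1, x_1, \ldots, x_n \in \mathbb{R}$ are inputs
▶ $w_0, w_1, \ldots, w_n \in \mathbb{R}$ are weights
▶ $\xi$ is an inner potential; almost always $\xi = w_0 + \sum_{i=1}^{n} w_i x_i$
▶ $y$ is an output given by $y = \sigma(\xi)$, where $\sigma$ is an activation function; e.g. a unit step function
\[ \sigma(\xi) = \begin{cases} 1 & \xi \ge 0 ; \\ 0 & \xi < 0. \end{cases} \]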
42
Boolean functions
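Activation function: unit step function
\[ \sigma(\xi) = \begin{cases} 1 & \xi \ge 0 ; \\ 0 & \xi < 0. \end{cases} \]
[Figures: three single neurons, each with the formal unit input $x_0 = 1$]
▶ $y = \mathrm{AND}(x_1, \ldots, x_n)$: weights $w_1 = \cdots = w_n = 1$ and $w_0 = -n$
▶ $y = \mathrm{OR}(x_1, \ldots, x_n)$: weights $w_1 = \cdots = w_n = 1$ and $w_0 = -1$
▶ $y = \mathrm{NOT}(x_1)$: weight $w_1 = -1$ and $w_0 = 0$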
43
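As a quick sanity check, the three neurons can be evaluated on all Boolean inputs. This is a minimal sketch; the helper names `neuron`, `AND`, `OR` and `NOT` are chosen here for illustration.

```python
from itertools import product


def step(xi):
    """Unit step activation: 1 for xi >= 0, otherwise 0."""
    return 1 if xi >= 0 else 0


def neuron(w0, w, x):
    """Formal neuron: inner potential w0 + sum_i w_i * x_i, then unit step."""
    return step(w0 + sum(wi * xi for wi, xi in zip(w, x)))


def AND(x):
    return neuron(-len(x), [1] * len(x), x)   # w0 = -n, all other weights 1


def OR(x):
    return neuron(-1, [1] * len(x), x)        # w0 = -1, all other weights 1


def NOT(x1):
    return neuron(0, [-1], [x1])              # w0 = 0, w1 = -1


n = 3
inputs = list(product([0, 1], repeat=n))
and_ok = all(AND(x) == int(all(x)) for x in inputs)
or_ok = all(OR(x) == int(any(x)) for x in inputs)
not_ok = NOT(0) == 1 and NOT(1) == 0
```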
Boolean functions
Theorem
Let σ be the unit step function. Two-layer MLPs, where each
neuron has σ as the activation function, are able to compute all
functions of the form $F : \{0, 1\}^n \to \{0, 1\}$.
Proof.
▶ Given a vector $\vec{v} = (v_1, \ldots, v_n) \in \{0, 1\}^n$, consider a neuron
$N_{\vec{v}}$ (with inputs $x_1, \ldots, x_n$, formal unit input $x_0 = 1$ and output y) whose output is 1 iff the input is $\vec{v}$:
\[ w_0 = -\sum_{i=1}^{n} v_i, \qquad w_i = \begin{cases} 1 & v_i = 1 \\ -1 & v_i = 0 \end{cases} \]
▶ Now let us connect all outputs of all neurons $N_{\vec{v}}$ satisfying
$F(\vec{v}) = 1$ using a neuron implementing OR. □
44
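The construction from the proof can be turned into a short program. The following is a sketch under the assumption that F is given as a Python function on 0/1 tuples; `build_network` and `evaluate` are illustrative names, and parity is just one example of F.

```python
from itertools import product


def step(xi):
    """Unit step activation: 1 for xi >= 0, otherwise 0."""
    return 1 if xi >= 0 else 0


def build_network(F, n):
    """One hidden neuron N_v for every v in {0,1}^n with F(v) = 1."""
    hidden = []
    for v in product([0, 1], repeat=n):
        if F(v) == 1:
            w0 = -sum(v)                              # w0 = -(v1 + ... + vn)
            w = [1 if vi == 1 else -1 for vi in v]    # wi = 1 if vi = 1, else -1
            hidden.append((w0, w))
    return hidden


def evaluate(hidden, x):
    """Hidden layer of neurons N_v followed by a single OR neuron."""
    h = [step(w0 + sum(wi * xi for wi, xi in zip(w, x))) for w0, w in hidden]
    return step(-1 + sum(h))                          # OR: weights 1, w0 = -1


def parity(v):
    """Example Boolean function: 1 iff the number of ones is odd."""
    return sum(v) % 2


n = 3
net = build_network(parity, n)
agrees = all(evaluate(net, x) == parity(x) for x in product([0, 1], repeat=n))
```

Note that the hidden layer has one neuron per input vector with F = 1, so its size can grow exponentially in n.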
Non-linear separation
[Figure: a three-layer network with inputs x1, x2 and output y]
▶ Consider a three-layer network; each neuron has the unit step activation function.
▶ The network divides the input space into two subsets according to the output (0 or 1).
▶ The first (hidden) layer divides the input space into half-spaces.
▶ The second layer may, e.g., make intersections of the half-spaces ⇒ convex sets.
▶ The third layer may, e.g., make unions of some convex sets.
45
Non-linear separation – illustration
[Figure: a three-layer network with inputs x1, ..., xk and output y]
▶ Consider three-layer networks; each neuron has the unit step activation function.
▶ Three-layer nets are capable of "approximating" any "reasonable" subset A of the input space $\mathbb{R}^k$.
▶ Cover A with hypercubes (in 2D squares, in 3D cubes, ...)
▶ Each hypercube K can be separated using a two-layer network $N_K$
(i.e. the function computed by $N_K$ gives 1 for points in K and 0 for the rest).
▶ Finally, connect outputs of the nets $N_K$ satisfying $K \cap A \neq \emptyset$ using a neuron implementing OR.
46
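A two-dimensional sketch of this covering argument follows. The set A (a disc around the origin), the grid resolution `cell` and all function names are assumptions made for the example, not part of the lecture.

```python
def step(xi):
    """Unit step activation: 1 for xi >= 0, otherwise 0."""
    return 1 if xi >= 0 else 0


def square_net(x_lo, x_hi, y_lo, y_hi):
    """Two-layer net N_K for the square K = [x_lo, x_hi] x [y_lo, y_hi]."""
    half_planes = [            # (w0, w1, w2) of the four hidden neurons
        (-x_lo, 1.0, 0.0),     # x >= x_lo
        (x_hi, -1.0, 0.0),     # x <= x_hi
        (-y_lo, 0.0, 1.0),     # y >= y_lo
        (y_hi, 0.0, -1.0),     # y <= y_hi
    ]

    def net(x, y):
        hidden = [step(w0 + w1 * x + w2 * y) for w0, w1, w2 in half_planes]
        return step(sum(hidden) - len(hidden))   # AND neuron: weights 1, w0 = -4

    return net


def intersects_disc(x_lo, x_hi, y_lo, y_hi, radius):
    """True iff the square has a point within `radius` of the origin."""
    nx = min(max(0.0, x_lo), x_hi)               # point of the square nearest to 0
    ny = min(max(0.0, y_lo), y_hi)
    return nx * nx + ny * ny <= radius * radius


def build_classifier(radius=1.0, cell=0.25, grid=range(-4, 4)):
    """Third layer: OR over the nets N_K of all squares K that intersect A."""
    nets = []
    for i in grid:
        for j in grid:
            box = (i * cell, (i + 1) * cell, j * cell, (j + 1) * cell)
            if intersects_disc(*box, radius):
                nets.append(square_net(*box))

    def classify(x, y):
        return step(sum(net(x, y) for net in nets) - 1)   # OR neuron

    return classify


classify = build_classifier()
```

A finer grid (smaller `cell`) gives a tighter approximation of A at the cost of more hidden neurons.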
Power of ReLU
[Figure: a two-layer network with a single input x, a layer of hidden neurons and a single output y]
Consider a two-layer network
▶ with a single input and single output;
▶ hidden neurons with the ReLU activation: $\sigma(\xi) = \max(\xi, 0)$;
▶ the output neuron with identity activation: $\sigma(\xi) = \xi$ (linear model)
For every continuous function $f : [0, 1] \to [0, 1]$ and $\varepsilon > 0$ there
is a network of the above type computing a function
$F : [0, 1] \to \mathbb{R}$ such that $|f(x) - F(x)| \le \varepsilon$ for all $x \in [0, 1]$.
For every open subset $A \subseteq [0, 1]$ there is a network of the
above type such that for "most" $x \in [0, 1]$ we have that $x \in A$ iff
the network’s output is > 0 for the input x.
Just consider a continuous function f where f(x) is the minimum distance
between x and a point on the boundary of A. Then uniformly approximate f
using the networks.
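One way to see why such a network exists is piecewise-linear interpolation: each hidden ReLU neuron adds one "kink". The sketch below is an illustration under that assumption, not the construction used in the lecture; the target function `f`, the number of hidden neurons `m` and the helper names are made up.

```python
import math


def relu(xi):
    """ReLU activation: max(xi, 0)."""
    return max(xi, 0.0)


def build_relu_net(f, m):
    """Two-layer ReLU net that interpolates f at the knots 0, 1/m, ..., 1."""
    knots = [i / m for i in range(m + 1)]
    values = [f(t) for t in knots]
    slopes = [(values[i + 1] - values[i]) * m for i in range(m)]
    # Hidden neuron i computes relu(x - knots[i]) (input weight 1, bias -knots[i]).
    # Its output weight is the change of slope at knots[i].
    out_w = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, m)]
    out_bias = values[0]

    def F(x):
        return out_bias + sum(c * relu(x - t) for c, t in zip(out_w, knots[:m]))

    return F


def f(x):
    """Hypothetical continuous target function [0, 1] -> [0, 1]."""
    return 0.5 + 0.5 * math.sin(2 * math.pi * x)


F = build_relu_net(f, 50)
max_err = max(abs(f(i / 1000) - F(i / 1000)) for i in range(1001))
```

Increasing `m` shrinks the error for any fixed continuous f, which is the content of the statement above.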
Non-linear separation - sigmoid
Theorem (Cybenko 1989 - informal version)
Let σ be a continuous function which is sigmoidal, i.e. satisfies
\[ \sigma(x) \to \begin{cases} 1 & \text{as } x \to +\infty \\ 0 & \text{as } x \to -\infty \end{cases} \]
For every "reasonable" set $A \subseteq [0, 1]^n$, there is a two-layer
network where each hidden neuron has the activation function
σ (output neurons are linear), that satisfies the following:
For "most" vectors $\vec{v} \in [0, 1]^n$ we have that $\vec{v} \in A$ iff the network
output is > 0 for the input $\vec{v}$.
For mathematically oriented:
▶ "reasonable" means Lebesgue measurable
▶ "most" means that the set of incorrectly classiﬁed vectors has
the Lebesgue measure smaller than a given ε > 0
48
Non-linear separation - practical illustration
▶ ALVINN drives a car
▶ The net has 30 × 32 = 960 inputs
(the input space is thus $\mathbb{R}^{960}$)
▶ Input values correspond to
shades of gray of pixels.
▶ Output neurons "classify" images
of the road based on their
"curvature".
Image source: http://jmvidal.cse.sc.edu/talks/ann/alvin.html
49
Function approximation - two-layer networks
Theorem (Cybenko 1989)
Let σ be a continuous function which is sigmoidal, i.e. is
increasing and satisfies
\[ \sigma(x) \to \begin{cases} 1 & \text{as } x \to +\infty \\ 0 & \text{as } x \to -\infty \end{cases} \]
For every continuous function $f : [0, 1]^n \to [0, 1]$ and every $\varepsilon > 0$
there is a function $F : [0, 1]^n \to [0, 1]$ computed by a two-layer
network where each hidden neuron has the activation function
σ (output neurons are linear), that satisfies the following:
\[ |f(\vec{v}) - F(\vec{v})| < \varepsilon \quad \text{for every } \vec{v} \in [0, 1]^n. \]
50
Neural networks and computability
▶ Consider recurrent networks (i.e., containing cycles)
▶ with real weights (in general);
▶ one input neuron and one output neuron (the network
computes a function F : A → R where A ⊆ R contains all
inputs on which the network stops);
▶ parallel activity rule (output values of all neurons are
recomputed in every step);
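▶ activation function
\[ \sigma(\xi) = \begin{cases} 1 & \xi \ge 1 ; \\ \xi & 0 \le \xi \le 1 ; \\ 0 & \xi < 0. \end{cases} \]
▶ We encode words $\omega \in \{0, 1\}^+$ into numbers as follows:
\[ \delta(\omega) = \sum_{i=1}^{|\omega|} \frac{\omega(i)}{2^i} + \frac{1}{2^{|\omega|+1}} \]
E.g. $\omega = 11001$ gives $\delta(\omega) = \frac{1}{2} + \frac{1}{2^2} + \frac{1}{2^5} + \frac{1}{2^6}$
(= 0.110011 in binary form).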
51
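The encoding δ and the activation function are easy to compute exactly with rational arithmetic. A small sketch follows; the function names `delta` and `saturated_linear` are chosen here for illustration.

```python
from fractions import Fraction


def delta(word):
    """Encode a non-empty word over {0, 1} as a rational number in (0, 1)."""
    if not word or any(c not in "01" for c in word):
        raise ValueError("word must be a non-empty string over {0, 1}")
    total = sum(Fraction(int(bit), 2 ** i) for i, bit in enumerate(word, start=1))
    return total + Fraction(1, 2 ** (len(word) + 1))   # trailing end marker


def saturated_linear(xi):
    """The activation function above: 0 below 0, identity on [0, 1], 1 above 1."""
    return min(max(xi, 0), 1)


code = delta("11001")   # 1/2 + 1/4 + 1/32 + 1/64
```

The final term $1/2^{|\omega|+1}$ marks the end of the word, so that, for example, the words 1 and 10 receive different codes.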
Neural networks and computability
A network recognizes a language $L \subseteq \{0, 1\}^+$ if it computes a
function $F : A \to \mathbb{R}$ ($A \subseteq \mathbb{R}$) such that
$\omega \in L$ iff $\delta(\omega) \in A$ and $F(\delta(\omega)) > 0$.
▶ Recurrent networks with rational weights are equivalent to
Turing machines
▶ For every recursively enumerable language $L \subseteq \{0, 1\}^+$
there is a recurrent network with rational weights and fewer
than 1000 neurons which recognizes L.
▶ The halting problem is undecidable for networks with at
least 25 neurons and rational weights.
▶ There is a "universal" network (an equivalent of the universal
Turing machine)
▶ Recurrent networks with real weights are super-Turing powerful
▶ For every language $L \subseteq \{0, 1\}^+$ there is a recurrent network
with fewer than 1000 neurons which recognizes L.
52
Summary of theoretical results
▶ Neural networks are very strong from the point of view of
theory:
▶ All Boolean functions can be expressed using two-layer
networks.
▶ Two-layer networks may approximate any continuous
function.
▶ Recurrent networks are at least as strong as Turing
machines.
▶ These results are purely theoretical!
▶ "Theoretical" networks are extremely large.
▶ It is very difficult to handcraft them even for the simplest
problems.
▶ From a practical point of view, the most important advantages
of neural networks are learning, generalization, and
robustness.
53
Neural networks vs classical computers
Aspect | Neural networks | "Classical" computers
Data | implicitly in weights | explicitly
Computation | naturally parallel | sequential, localized
Robustness | robust w.r.t. input corruption & damage | changing one bit may completely crash the computation
Precision | imprecise, network recalls a training example "similar" to the input | (typically) precise
Programming | learning | manual
54
History & implementations
55
History of neurocomputers
▶ 1951: SNARC (Minsky et al.)
▶ the first implementation of a neural network
▶ a rat strives to exit a maze
▶ 40 artificial neurons (300 vacuum tubes, motors, etc.)
56
History of neurocomputers
▶ 1957: Mark I Perceptron (Rosenblatt et al) - the ﬁrst
successful network for image recognition
▶ single layer network
▶ image represented by 20 × 20 photocells
▶ intensity of pixels was treated as the input to a perceptron
(basically the formal neuron), which recognized ﬁgures
▶ weights were implemented using potentiometers, each set
by its own motor
▶ it was possible to arbitrarily reconnect inputs to neurons to
demonstrate adaptability
57
History of neurocomputers
▶ 1960: ADALINE (Widrow & Hoff)
▶ single layer neural network
▶ weights stored in a newly invented electronic component, the
memistor, which remembers the history of electric current in
the form of resistance.
▶ Widrow founded the company Memistor Corporation, which
sold implementations of neural networks.
▶ 1960-66: several companies concerned with neural
networks were founded.
58
History of neurocomputers
▶ 1967-82: standstill after the publication of the book Perceptrons by
Minsky & Papert (published 1969)
▶ 1983-end of 90s: revival of neural networks
▶ many attempts at hardware implementations
▶ application speciﬁc chips (ASIC)
▶ programmable hardware (FPGA)
▶ hw implementations typically not better than "software"
implementations on universal computers (problems with
weight storage, size, speed, cost of production etc.)
▶ end of 90s-ca. 2005: NNs overshadowed by other machine
learning methods (e.g. support vector machines (SVM))
▶ 2006-now: The boom of neural networks!
▶ deep networks – often better than any other method
▶ GPU implementations
▶ ... specialized hw implementations (Google’s TPU)
59
Some highlights
▶ Breakthrough in image recognition.
Accuracy of image recognition improved by an order of magnitude in 5
years.
▶ Breakthrough in game playing.
Superhuman results in Go and Chess almost without any human
intervention. Master level in Starcraft, poker, etc.
▶ Breakthrough in machine translation.
Switching to deep learning produced a 60% increase in translation
accuracy compared to the phrase-based approach previously used in
Google Translate (in human evaluation)
▶ Breakthrough in speech processing.
▶ Breakthrough in text generation.
GPT-3 generates pretty realistic articles, short plays (for a theatre) have
been successfully generated, etc.
60
History in waves ...
Figure: The ﬁgure shows two of the three historical waves of artiﬁcial
neural nets research, as measured by the frequency of the phrases
"cybernetics" and "connectionism" or "neural networks" according to
Google Books (the third wave is too recent to appear).
61
Current hardware – What do we face?
Increasing dataset size ...
... weakly-supervised pre-training using hashtags from
Instagram uses $3.6 \times 10^9$ images.
Revisiting Weakly Supervised Pre-Training of Visual Perception Models. Singh et al.
https://arxiv.org/pdf/2201.08371.pdf, 2022
62
Current hardware – What do we face?
... and thus increasing size of neural networks ...
2. ADALINE
4. Early back-propagation network (Rumelhart et al., 1986b)
8. Image recognition: LeNet-5 (LeCun et al., 1998b)
10. Dimensionality reduction: Deep belief network (Hinton et al., 2006)
... here the third "wave" of neural networks started
15. Digit recognition: GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
18. Image recognition (AlexNet): Multi-GPU convolutional network (Krizhevsky et al., 2012)
20. Image recognition: GoogLeNet (Szegedy et al., 2014a)
63
Current hardware - What do we face?
64
Current hardware – What do we face?
... as a reward we get this ...
Figure: Since deep networks reached the scale necessary to
compete in the ImageNet Large Scale Visual Recognition Challenge,
they have consistently won the competition every year, and yielded
lower and lower error rates each time. Data from Russakovsky et al.
(2014b) and He et al. (2015).
65
Current hardware
In 2012, Google trained a large network of 1.7
billion weights and 9 layers
The task was image recognition (10 million
YouTube video frames)
The hw comprised a network of 1000 computers
(16 000 cores), computation took three days.
In 2014, similar task performed on Commodity
Off-The-Shelf High Performance Computing
(COTS HPC) technology: a cluster of GPU
servers with Inﬁniband interconnects and MPI.
Able to train 1 billion parameter networks on
just 3 machines in a couple of days.
Able to scale to 11 billion weights (approx. 6.5
times larger than the Google model) on 16
GPUs.
Current hardware – NVIDIA DGX Station
▶ 8x GPU (Nvidia A100 80GB
Tensor Core)
▶ 5 petaFLOPS
▶ System memory: 2 TB
▶ Network: 200 Gb/s InﬁniBand
67
Deep learning in clouds
Big companies offer cloud services for deep learning:
▶ Amazon Web Services
▶ Google Cloud
▶ Deep Cognition
▶ ...
Advantages:
▶ Do not have to care (too much) about technical problems.
▶ Do not have to buy and optimize high-end hw/sw, networks etc.
▶ Scaling & virtually limitless storage.
Disadvantages:
▶ Do not have full control.
▶ Performance can vary, connectivity problems.
▶ Have to pay for services.
▶ Privacy issues.
68
Current software
▶ TensorFlow (Google)
▶ open source software library for numerical computation
using data ﬂow graphs
▶ allows implementation of most current neural networks
▶ allows computation on multiple devices (CPUs, GPUs, ...)
▶ Python API
▶ Keras: a part of TensorFlow that allows easy description of
most modern neural networks
▶ PyTorch (Facebook)
▶ similar to TensorFlow
▶ object oriented
▶ ... majority of new models in research papers implemented
in PyTorch
https://www.cioinsight.com/big-data/pytorch-vs-tensorflow/
▶ Theano (dead):
▶ The "academic" grand-daddy of deep-learning frameworks,
written in Python. Strongly inspired TensorFlow (some
people developing Theano moved on to develop
TensorFlow).
▶ There are others: Caffe, Deeplearning4j, ...
Current software – Keras
70
Current software – Keras functional API
71
Current software – TensorFlow
72
Current software – TensorFlow
73
Current software – PyTorch
74
Other software implementations
Most "mathematical" software packages contain some support
of neural networks:
▶ MATLAB
▶ R
▶ STATISTICA
▶ Weka
▶ ...
The implementations are typically not on par with the previously
mentioned dedicated deep-learning libraries.
75
Training linear models
76
Linear regression (ADALINE)
Architecture:
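[Figure: a single neuron with inputs x1, x2, ..., xn weighted by w1, w2, ..., wn, the formal unit input x0 = 1 weighted by w0, and output y]
$\vec{w} = (w_0, w_1, \ldots, w_n)$ and $\vec{x} = (x_0, x_1, \ldots, x_n)$ where $x_0 = 1$.
Activity:
▶ inner potential: $\xi = w_0 + \sum_{i=1}^{n} w_i x_i = \sum_{i=0}^{n} w_i x_i = \vec{w} \cdot \vec{x}$
▶ activation function: $\sigma(\xi) = \xi$
▶ network function: $y[\vec{w}](\vec{x}) = \sigma(\xi) = \vec{w} \cdot \vec{x}$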
77
Linear regression (ADALINE)
Learning:
▶ Given a training dataset
\[ T = \{ (\vec{x}_1, d_1), (\vec{x}_2, d_2), \ldots, (\vec{x}_p, d_p) \} \]
Here $\vec{x}_k = (x_{k0}, x_{k1}, \ldots, x_{kn}) \in \mathbb{R}^{n+1}$, $x_{k0} = 1$, is the k-th
input, and $d_k \in \mathbb{R}$ is the expected output.
Intuition: The network is supposed to compute an afﬁne approximation of the
function (some of) whose values are given in the training set.
78
Oaks in Wisconsin
79
Linear regression (ADALINE)
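▶ Error function:
\[ E(\vec{w}) = \frac{1}{2} \sum_{k=1}^{p} \left( \vec{w} \cdot \vec{x}_k - d_k \right)^2 = \frac{1}{2} \sum_{k=1}^{p} \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right)^2 \]
▶ The goal is to find $\vec{w}$ which minimizes $E(\vec{w})$.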
80
Error function
81
Gradient of the error function
Consider the gradient of the error function:
\[ \nabla E(\vec{w}) = \left( \frac{\partial E}{\partial w_0}(\vec{w}), \ldots, \frac{\partial E}{\partial w_n}(\vec{w}) \right) \]
Intuition: $\nabla E(\vec{w})$ is a vector in the weight space which points in
the direction of the steepest ascent of the error function.
Note that the vectors $\vec{x}_k$ are just parameters of the function E, and are thus
fixed!
Fact
If $\nabla E(\vec{w}) = \vec{0} = (0, \ldots, 0)$, then $\vec{w}$ is a global minimum of E.
For ADALINE, the error function $E(\vec{w})$ is a convex paraboloid and thus has
a unique global minimum.
82
Gradient - illustration
Caution! This picture just illustrates the notion of gradient ... it is not
the convex paraboloid E(⃗w) !
83
Gradient of the error function
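\[
\begin{aligned}
\frac{\partial E}{\partial w_\ell}(\vec{w})
&= \frac{1}{2} \sum_{k=1}^{p} \frac{\partial}{\partial w_\ell} \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right)^2 \\
&= \frac{1}{2} \sum_{k=1}^{p} 2 \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right) \frac{\partial}{\partial w_\ell} \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right) \\
&= \frac{1}{2} \sum_{k=1}^{p} 2 \left( \sum_{i=0}^{n} w_i x_{ki} - d_k \right) \left( \sum_{i=0}^{n} \frac{\partial}{\partial w_\ell} w_i x_{ki} - \frac{\partial}{\partial w_\ell} d_k \right) \\
&= \sum_{k=1}^{p} \left( \vec{w} \cdot \vec{x}_k - d_k \right) x_{k\ell}
\end{aligned}
\]
Thus
\[ \nabla E(\vec{w}) = \left( \frac{\partial E}{\partial w_0}(\vec{w}), \ldots, \frac{\partial E}{\partial w_n}(\vec{w}) \right) = \sum_{k=1}^{p} \left( \vec{w} \cdot \vec{x}_k - d_k \right) \vec{x}_k \]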
84
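The closed form of the gradient can be checked numerically against central finite differences. This is a minimal sketch; the toy dataset, the test point `w` and the helper names are assumptions made for illustration.

```python
def dot(w, x):
    """Scalar product w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def error(w, data):
    """E(w) = 1/2 * sum_k (w . x_k - d_k)^2."""
    return 0.5 * sum((dot(w, x) - d) ** 2 for x, d in data)


def gradient(w, data):
    """Closed form: grad E(w) = sum_k (w . x_k - d_k) * x_k."""
    g = [0.0] * len(w)
    for x, d in data:
        residual = dot(w, x) - d
        for l in range(len(w)):
            g[l] += residual * x[l]
    return g


def numerical_gradient(w, data, h=1e-6):
    """Central finite differences, used only to cross-check the closed form."""
    g = []
    for l in range(len(w)):
        up = list(w)
        down = list(w)
        up[l] += h
        down[l] -= h
        g.append((error(up, data) - error(down, data)) / (2 * h))
    return g


# Hypothetical dataset; the first component of every input is x_k0 = 1.
data = [((1.0, 0.0, 2.0), 1.0), ((1.0, 1.0, -1.0), 0.5), ((1.0, 2.0, 0.5), -2.0)]
w = [0.3, -0.7, 1.1]
analytic = gradient(w, data)
numeric = numerical_gradient(w, data)
max_diff = max(abs(a - b) for a, b in zip(analytic, numeric))
```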
Linear regression - learning
Batch algorithm (gradient descent):
Idea: In every step "move" the weights in the direction opposite
to the gradient.
The algorithm computes a sequence of weight vectors
$\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \ldots$
▶ weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
▶ in the step t + 1, weights $\vec{w}^{(t+1)}$ are computed as follows:
\[ \vec{w}^{(t+1)} = \vec{w}^{(t)} - \varepsilon \cdot \nabla E(\vec{w}^{(t)}) = \vec{w}^{(t)} - \varepsilon \cdot \sum_{k=1}^{p} \left( \vec{w}^{(t)} \cdot \vec{x}_k - d_k \right) \cdot \vec{x}_k \]
Here $0 < \varepsilon \le 1$ is a learning rate.
Proposition
For sufficiently small $\varepsilon > 0$ the sequence $\vec{w}^{(0)}, \vec{w}^{(1)}, \vec{w}^{(2)}, \ldots$
converges (componentwise) to the global minimum of E (i.e. to
the vector $\vec{w}$ satisfying $\nabla E(\vec{w}) = \vec{0}$).
85
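A minimal sketch of the batch algorithm on a toy problem follows. The dataset (points on the line d = 2x + 1), the learning rate `eps`, the number of steps and the random seed are all assumptions made for the example.

```python
import random


def dot(w, x):
    """Scalar product w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def gradient(w, data):
    """grad E(w) = sum_k (w . x_k - d_k) * x_k."""
    g = [0.0] * len(w)
    for x, d in data:
        residual = dot(w, x) - d
        for l in range(len(w)):
            g[l] += residual * x[l]
    return g


def train(data, eps, steps, seed=0):
    """Batch gradient descent: w(t+1) = w(t) - eps * grad E(w(t))."""
    rng = random.Random(seed)
    # weights are randomly initialized to values close to 0
    w = [rng.uniform(-0.01, 0.01) for _ in range(len(data[0][0]))]
    for _ in range(steps):
        g = gradient(w, data)               # uses the whole training set
        w = [wi - eps * gi for wi, gi in zip(w, g)]
    return w


# Hypothetical training set sampled from d = 2 * x + 1, with x_k0 = 1.
data = [((1.0, x), 2.0 * x + 1.0) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
w = train(data, eps=0.05, steps=2000)
```

With this data the learning rate 0.05 is small enough for convergence; a much larger value (for this dataset, above roughly 0.17) makes the iteration diverge, which illustrates the "sufficiently small ε" condition of the proposition.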
Linear regression - animation
86
MLP training – theory
87
Architecture – Multilayer Perceptron (MLP)
[Figure: a layered network with an input layer (neurons x1, x2), hidden layers, and an output layer (neurons y1, y2)]
▶ Neurons are partitioned into layers;
one input layer, one output layer,
possibly several hidden layers
▶ Layers are numbered from 0; the
input layer has number 0
▶ E.g. a three-layer network has
two hidden layers and one
output layer
▶ Neurons in the i-th layer are
connected with all neurons in
the (i + 1)-st layer
▶ The architecture of an MLP is typically
described by the numbers of neurons
in individual layers (e.g. 2-4-3-2)
88
MLP – architecture
Notation:
▶ Denote
▶ X a set of input neurons
▶ Y a set of output neurons
▶ Z a set of all neurons (X, Y ⊆ Z)
▶ individual neurons are denoted by indices i, j etc.
▶ $\xi_j$ is the inner potential of the neuron j after the computation
stops
▶ $y_j$ is the output of the neuron j after the computation stops
(define $y_0 = 1$ as the value of the formal unit input)
▶ $w_{ji}$ is the weight of the connection from i to j
(in particular, $w_{j0}$ is the weight of the connection from the formal unit
input, i.e. $w_{j0} = -b_j$ where $b_j$ is the bias of the neuron j)
▶ $j_{\leftarrow}$ is the set of all i such that j is adjacent from i
(i.e. there is an arc to j from i)
▶ $j^{\rightarrow}$ is the set of all i such that j is adjacent to i
(i.e. there is an arc from j to i)
89
MLP – activity
Activity:
▶ inner potential of neuron j:
\[ \xi_j = \sum_{i \in j_{\leftarrow}} w_{ji} y_i \]
▶ activation function $\sigma_j$ for neuron j (arbitrary differentiable)
▶ State of non-input neuron $j \in Z \setminus X$ after the computation
stops:
\[ y_j = \sigma_j(\xi_j) \]
($y_j$ depends on the configuration $\vec{w}$ and the input $\vec{x}$, so we sometimes
write $y_j(\vec{w}, \vec{x})$)
▶ The network computes a function from $\mathbb{R}^{|X|}$ to $\mathbb{R}^{|Y|}$. Layer-wise computation:
First, all input neurons are assigned values of the input. In the ℓ-th step,
all neurons of the ℓ-th layer are evaluated.
90
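The layer-wise computation can be sketched as follows. The logistic sigmoid is used for every neuron as an assumption (the slides only require a differentiable activation), and the nested-list weight layout and the name `forward` are chosen here for illustration.

```python
import math


def sigmoid(xi):
    """Logistic sigmoid, one possible differentiable activation function."""
    return 1.0 / (1.0 + math.exp(-xi))


def forward(layers, x):
    """Layer-wise evaluation of an MLP.

    layers[l][j] is the weight row [w_j0, w_j1, ...] of the j-th neuron in the
    (l + 1)-st layer; w_j0 belongs to the formal unit input y_0 = 1.
    """
    y = list(x)                       # layer 0: input neurons take the input values
    for W in layers:                  # step l evaluates all neurons of layer l
        y_ext = [1.0] + y             # prepend the formal unit input y_0 = 1
        y = [sigmoid(sum(w_ji * y_i for w_ji, y_i in zip(row, y_ext))) for row in W]
    return y


# Hypothetical 2-3-2 network with small hand-picked weights.
layers = [
    [[0.1, 0.4, -0.2], [0.0, -0.3, 0.5], [-0.1, 0.2, 0.2]],   # hidden layer
    [[0.2, 0.1, -0.4, 0.3], [-0.2, 0.5, 0.1, -0.1]],          # output layer
]
output = forward(layers, [1.0, 2.0])
```

Because the layers are fully connected, the set $j_{\leftarrow}$ of a neuron j is simply the whole previous layer together with the formal unit input.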
MLP – learning
Learning:
▶ Given a training dataset T of the form
\[ \{ (\vec{x}_k, \vec{d}_k) \mid k = 1, \ldots, p \} \]
Here, every $\vec{x}_k \in \mathbb{R}^{|X|}$ is an input vector and every $\vec{d}_k \in \mathbb{R}^{|Y|}$
is the desired network output. For every $j \in Y$, denote by
$d_{kj}$ the desired output of the neuron j for a given network
input $\vec{x}_k$ (the vector $\vec{d}_k$ can be written as $(d_{kj})_{j \in Y}$).
▶ Error function:
\[ E(\vec{w}) = \sum_{k=1}^{p} E_k(\vec{w}) \]
where
\[ E_k(\vec{w}) = \frac{1}{2} \sum_{j \in Y} \left( y_j(\vec{w}, \vec{x}_k) - d_{kj} \right)^2 \]
91
MLP – learning algorithm
Batch algorithm (gradient descent):
The algorithm computes a sequence of weight vectors ⃗w(0), ⃗w(1), ⃗w(2), . . ..
▶ weights in ⃗w(0) are randomly initialized to values close to 0
▶ in the step t + 1 (here t = 0, 1, 2 . . .), weights ⃗w(t+1) are computed as follows:
$$w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$$
where
$$\Delta w_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E}{\partial w_{ji}}\left(\vec{w}^{(t)}\right)$$
is a weight update of wji in step t + 1 and 0 < ε(t) ≤ 1 is a learning rate in step t + 1.
Note that $\frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t)})$ is a component of the gradient ∇E, i.e. the weight update can be written as $\vec{w}^{(t+1)} = \vec{w}^{(t)} - \varepsilon(t) \cdot \nabla E(\vec{w}^{(t)})$.
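The update rule itself does not depend on the network. A minimal sketch, assuming a hypothetical function `grad_E` that returns ∇E(⃗w) as a list and a learning rate given as a function of t; the toy error function below is made up for illustration and is not from the lecture.

```python
def gradient_descent(grad_E, w0, eps=lambda t: 0.1, steps=100):
    """Iterate w(t+1) = w(t) - eps(t) * grad E(w(t))."""
    w = list(w0)
    for t in range(steps):
        g = grad_E(w)
        w = [w_i - eps(t) * g_i for w_i, g_i in zip(w, g)]
    return w

# Toy error E(w) = (w_0 - 3)^2 + (w_1 + 1)^2 with its minimum at (3, -1).
def grad_toy(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

w_final = gradient_descent(grad_toy, [0.0, 0.0])
```

On this quadratic the distance to the minimum shrinks by the factor 0.8 in every step, so `w_final` ends up very close to (3, −1). For a real network no such guarantee is available.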
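MLP – error function gradient
For every wji we have
$$\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$$
where for every k = 1, . . . , p it holds that
$$\frac{\partial E_k}{\partial w_{ji}} = \frac{\partial E_k}{\partial y_j} \cdot \sigma'_j(\xi_j) \cdot y_i$$
and for every j ∈ Z ∖ X we get
$$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj} \quad \text{for } j \in Y$$
$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma'_r(\xi_r) \cdot w_{rj} \quad \text{for } j \in Z \setminus (Y \cup X)$$
(Here all yj are in fact yj(⃗w,⃗xk).)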
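MLP – error function gradient (history)
▶ If $y_j = \sigma_j(\xi_j) = \frac{1}{1 + e^{-\xi_j}}$ for all j ∈ Z, then
$$\sigma'_j(\xi_j) = y_j (1 - y_j)$$
and thus for all j ∈ Z ∖ X:
$$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj} \quad \text{for } j \in Y$$
$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot y_r (1 - y_r) \cdot w_{rj} \quad \text{for } j \in Z \setminus (Y \cup X)$$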
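MLP – computing the gradient
Compute $\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{p} \frac{\partial E_k}{\partial w_{ji}}$ as follows:
Initialize Eji := 0
(By the end of the computation: $E_{ji} = \frac{\partial E}{\partial w_{ji}}$)
For every k = 1, . . . , p do:
1. forward pass: compute yj = yj(⃗w,⃗xk) for all j ∈ Z
2. backward pass: compute $\frac{\partial E_k}{\partial y_j}$ for all j ∈ Z using backpropagation (see the next slide!)
3. compute $\frac{\partial E_k}{\partial w_{ji}}$ for all wji using
$$\frac{\partial E_k}{\partial w_{ji}} := \frac{\partial E_k}{\partial y_j} \cdot \sigma'_j(\xi_j) \cdot y_i$$
4. $E_{ji} := E_{ji} + \frac{\partial E_k}{\partial w_{ji}}$
The resulting Eji equals $\frac{\partial E}{\partial w_{ji}}$.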
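MLP – backpropagation
Compute $\frac{\partial E_k}{\partial y_j}$ for all j ∈ Z as follows:
▶ if j ∈ Y, then $\frac{\partial E_k}{\partial y_j} = y_j - d_{kj}$
▶ if j ∈ Z ∖ (Y ∪ X), then assuming that j is in the ℓ-th layer and assuming that $\frac{\partial E_k}{\partial y_r}$ has already been computed for all neurons in the (ℓ + 1)-st layer, compute
$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma'_r(\xi_r) \cdot w_{rj}$$
(This works because all neurons $r \in j_{\rightarrow}$ belong to the (ℓ + 1)-st layer.)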
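A compact sketch of the forward and backward pass for one training example may help here. It rests on assumptions beyond the slides: a fully connected layered network stored as a hypothetical nested list `weights` (entry 0 of each neuron's list is wj0), logistic sigmoid in every neuron so that σ′(ξ) = y(1 − y), and the squared error Ek. The function names are made up for illustration.

```python
import math

def sigmoid(xi):
    return 1.0 / (1.0 + math.exp(-xi))

def forward(weights, x):
    """Outputs of all layers, the input layer first."""
    ys = [list(x)]
    for layer in weights:
        prev = [1.0] + ys[-1]  # formal unit input y_0 = 1
        ys.append([sigmoid(sum(w * y for w, y in zip(w_j, prev)))
                   for w_j in layer])
    return ys

def error(weights, x, d):
    """E_k = 1/2 * sum over outputs of (y_j - d_kj)^2."""
    y = forward(weights, x)[-1]
    return 0.5 * sum((y_j - d_j) ** 2 for y_j, d_j in zip(y, d))

def backprop(weights, x, d):
    """Return dE_k/dw_ji in the same nested shape as weights."""
    ys = forward(weights, x)
    # Output layer: dE_k/dy_j = y_j - d_kj
    dE_dy = [y_j - d_j for y_j, d_j in zip(ys[-1], d)]
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        prev = [1.0] + ys[l]
        # dE_k/dy_j * sigma'_j(xi_j), using sigma' = y * (1 - y)
        delta = [g * y * (1.0 - y) for g, y in zip(dE_dy, ys[l + 1])]
        # dE_k/dw_ji = dE_k/dy_j * sigma'_j(xi_j) * y_i
        grads[l] = [[dl * y_i for y_i in prev] for dl in delta]
        # dE_k/dy_i for the layer below (index i + 1 skips w_r0)
        dE_dy = [sum(delta[r] * weights[l][r][i + 1]
                     for r in range(len(delta)))
                 for i in range(len(ys[l]))]
    return grads
```

Comparing the result with a finite-difference estimate (E(w + h) − E(w − h)) / 2h for each weight is a cheap way to catch indexing mistakes in a from-scratch implementation.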
Complexity of the batch algorithm
Computation of $\frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t-1)})$ stops in time linear in the size of the network plus the size of the training set.
(assuming unit cost of operations including computation of $\sigma'_r(\xi_r)$ for given $\xi_r$)
Proof sketch: The algorithm does the following p times:
1. forward pass, i.e. computes yj(⃗w,⃗xk)
2. backpropagation, i.e. computes $\frac{\partial E_k}{\partial y_j}$
3. computes $\frac{\partial E_k}{\partial w_{ji}}$ and adds it to Eji (a constant time operation in the unit cost framework)
Each of the steps 1. - 3. takes time linear in the size of the network.
Note that the speed of convergence of the gradient descent cannot be estimated ...
Illustration of the gradient descent – XOR
Source: Pattern Classiﬁcation (2nd Edition); Richard O. Duda, Peter E. Hart, David G. Stork
MLP – learning algorithm
Online algorithm:
The algorithm computes a sequence of weight vectors ⃗w(0), ⃗w(1), ⃗w(2), . . ..
▶ weights in ⃗w(0) are randomly initialized to values close to 0
▶ in the step t + 1 (here t = 0, 1, 2 . . .), weights ⃗w(t+1) are computed as follows:
$$w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$$
where
$$\Delta w_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E_k}{\partial w_{ji}}\left(\vec{w}^{(t)}\right)$$
is the weight update of wji in the step t + 1 and 0 < ε(t) ≤ 1 is the learning rate in the step t + 1. Here Ek is the error on a single training example k.
There are other variants determined by selection of the training examples used for the error computation (more on this later).
SGD
▶ weights in ⃗w(0) are randomly initialized to values close to 0
▶ in the step t + 1 (here t = 0, 1, 2 . . .), weights ⃗w(t+1) are computed as follows:
▶ Choose (randomly) a set of training examples T ⊆ {1, . . . , p}
▶ Compute
$$\vec{w}^{(t+1)} = \vec{w}^{(t)} + \Delta \vec{w}^{(t)}$$
where
$$\Delta \vec{w}^{(t)} = -\varepsilon(t) \cdot \sum_{k \in T} \nabla E_k\left(\vec{w}^{(t)}\right)$$
▶ 0 < ε(t) ≤ 1 is a learning rate in step t + 1
▶ $\nabla E_k(\vec{w}^{(t)})$ is the gradient of the error of the example k
Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially.
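The shuffle-then-slice implementation of minibatches can be sketched as follows. Everything concrete here is an assumption for illustration: the helper names, the constant learning rate, the fixed seed, and the toy problem (a linear neuron y = w0 + w1·x fitted to four made-up points on the line d = 2x + 1 with Ek = (y − d)²/2).

```python
import random

def minibatches(p, batch_size, rng):
    """Shuffle the indices 0..p-1 and yield consecutive minibatches."""
    idx = list(range(p))
    rng.shuffle(idx)
    for start in range(0, p, batch_size):
        yield idx[start:start + batch_size]

def sgd(grad_Ek, w0, p, batch_size=2, eps=0.05, epochs=500, seed=0):
    """Minibatch SGD; grad_Ek(w, k) returns the gradient of E_k at w."""
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(epochs):
        for T in minibatches(p, batch_size, rng):
            g = [0.0] * len(w)
            for k in T:  # sum of the gradients over the minibatch
                g = [a + b for a, b in zip(g, grad_Ek(w, k))]
            w = [w_i - eps * g_i for w_i, g_i in zip(w, g)]
    return w

xs = [0.0, 1.0, 2.0, 3.0]
ds = [1.0, 3.0, 5.0, 7.0]

def grad_Ek(w, k):
    err = w[0] + w[1] * xs[k] - ds[k]
    return [err, err * xs[k]]

w_fit = sgd(grad_Ek, [0.0, 0.0], p=len(xs))
```

Because the toy data lie exactly on a line, the weights approach (1, 2). With noisy data and a constant learning rate the iterates would keep fluctuating around the minimum.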
Output activations and error functions
Regression:
▶ The output activation is typically the identity yi = σ(ξi) = ξi.
▶ A training dataset
$$\left\{ \left(\vec{x}_k, \vec{d}_k\right) \mid k = 1, \ldots, p \right\}$$
Here, every $\vec{x}_k \in \mathbb{R}^{|X|}$ is an input vector and every $\vec{d}_k \in \mathbb{R}^{|Y|}$ is the desired network output. For every i ∈ Y, denote by dki the desired output of the neuron i for a given network input ⃗xk (the vector ⃗dk can be written as $(d_{ki})_{i \in Y}$).
▶ The error function mean squared error (mse):
$$E(\vec{w}) = \frac{1}{p} \sum_{k=1}^{p} E_k(\vec{w})$$
where
$$E_k(\vec{w}) = \frac{1}{2} \sum_{i \in Y} \left( y_i(\vec{w}, \vec{x}_k) - d_{ki} \right)^2$$
Output activations and error functions
Classification
▶ The output activation function softmax:
$$y_i = \sigma_i(\xi_i) = \frac{e^{\xi_i}}{\sum_{j \in Y} e^{\xi_j}}$$
▶ A training dataset
$$\left\{ \left(\vec{x}_k, \vec{d}_k\right) \mid k = 1, \ldots, p \right\}$$
Here, every $\vec{x}_k \in \mathbb{R}^{|X|}$ is an input vector and every $\vec{d}_k \in \{0, 1\}^{|Y|}$ is the desired network output. For every i ∈ Y, denote by dki the desired output of the neuron i for a given network input ⃗xk (the vector ⃗dk can be written as $(d_{ki})_{i \in Y}$).
▶ The error function (categorical) cross entropy:
$$E(\vec{w}) = -\frac{1}{p} \sum_{k=1}^{p} \sum_{i \in Y} d_{ki} \log\left(y_i(\vec{w}, \vec{x}_k)\right)$$
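Softmax is a typical place where a from-scratch implementation runs into overflow, since e^ξ exceeds the floating-point range already for moderately large ξ. A common remedy, shown here as a sketch and not as the lecture's prescribed method, is to subtract the largest inner potential before exponentiating, which leaves the result unchanged. The function names are illustrative.

```python
import math

def softmax(xi):
    """Softmax of a list of inner potentials, shifted for stability."""
    m = max(xi)  # e^(xi - m) / sum e^(xj - m) equals e^xi / sum e^xj
    exps = [math.exp(v - m) for v in xi]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(ys, ds):
    """Mean categorical cross-entropy over p examples.

    ys[k] is the network output and ds[k] the desired 0/1 vector
    for example k; terms with d_ki = 0 contribute nothing.
    """
    p = len(ys)
    total = sum(d_i * math.log(y_i)
                for y, d in zip(ys, ds)
                for y_i, d_i in zip(y, d) if d_i != 0)
    return -total / p

probs = softmax([1000.0, 1000.0])  # would overflow without the shift
loss = cross_entropy([[0.5, 0.5]], [[1, 0]])
```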
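Gradient with Softmax & Cross-Entropy
Assume that V is the layer just below the output layer Y.
$$
\begin{aligned}
E(\vec{w}) &= -\frac{1}{p} \sum_{k=1}^{p} \sum_{i \in Y} d_{ki} \log\left(y_i(\vec{w}, \vec{x}_k)\right) \\
&= -\frac{1}{p} \sum_{k=1}^{p} \sum_{i \in Y} d_{ki} \log\left( \frac{e^{\xi_i}}{\sum_{j \in Y} e^{\xi_j}} \right) \\
&= -\frac{1}{p} \sum_{k=1}^{p} \sum_{i \in Y} d_{ki} \left( \xi_i - \log\left( \sum_{j \in Y} e^{\xi_j} \right) \right) \\
&= -\frac{1}{p} \sum_{k=1}^{p} \sum_{i \in Y} d_{ki} \left( \sum_{\ell \in V} w_{i\ell} y_\ell - \log\left( \sum_{j \in Y} e^{\sum_{\ell \in V} w_{j\ell} y_\ell} \right) \right)
\end{aligned}
$$
Now compute the derivatives $\frac{\partial E}{\partial y_\ell}$ for ℓ ∈ V.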
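One way to carry out this computation, given as a sketch: it looks at the contribution Ek of a single example (dropping the factor 1/p) and assumes that every ⃗dk is one-hot, so that the dki sum to 1 over i ∈ Y.

```latex
E_k = -\sum_{i \in Y} d_{ki} \left( \xi_i - \log \sum_{j \in Y} e^{\xi_j} \right)

\frac{\partial E_k}{\partial \xi_i}
  = -d_{ki} + \left( \sum_{i' \in Y} d_{ki'} \right)
    \frac{e^{\xi_i}}{\sum_{j \in Y} e^{\xi_j}}
  = y_i - d_{ki}

\frac{\partial E_k}{\partial y_\ell}
  = \sum_{i \in Y} \frac{\partial E_k}{\partial \xi_i} \cdot w_{i\ell}
  = \sum_{i \in Y} (y_i - d_{ki}) \, w_{i\ell}
  \qquad \text{for } \ell \in V
```

The last line uses $\xi_i = \sum_{\ell \in V} w_{i\ell} y_\ell$, so $\partial \xi_i / \partial y_\ell = w_{i\ell}$.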
Output activations and error functions
Binary classification
Assume a single output neuron o ∈ Y = {o}.
▶ The output activation function logistic sigmoid:
$$\sigma_o(\xi_o) = \frac{e^{\xi_o}}{e^{\xi_o} + 1} = \frac{1}{1 + e^{-\xi_o}}$$
▶ A training dataset
$$T = \left\{ (\vec{x}_1, d_1), (\vec{x}_2, d_2), \ldots, (\vec{x}_p, d_p) \right\}$$
Here $\vec{x}_k = (x_{k0}, x_{k1}, \ldots, x_{kn}) \in \mathbb{R}^{n+1}$, $x_{k0} = 1$, is the k-th input, and dk ∈ {0, 1} is the desired output.
▶ The error function (Binary) cross-entropy:
$$E(\vec{w}) = \sum_{k=1}^{p} -\left( d_k \log\left(y_o(\vec{w}, \vec{x}_k)\right) + (1 - d_k) \log\left(1 - y_o(\vec{w}, \vec{x}_k)\right) \right)$$
But what is the meaning of the sigmoid?
The model gives a probability yo of the class 1 given an input ⃗x.
But why do we model such a probability using $1/(1 + e^{-\xi_o})$?
Let ȳ be the "true" probability of the class 1 to be modeled.
What about odds of the class 1?
odds(ȳ) = ȳ/(1 − ȳ)
... stretches from 0 to ∞
What about log odds (aka logit) of the class 1?
logit(ȳ) = log(ȳ/(1 − ȳ))
... stretches from −∞ to ∞
But what is the meaning of the sigmoid?
Assume that ȳ is the probability of the class 1. Put
log(ȳ/(1 − ȳ)) = ξo
(here ξo is the inner potential of the output neuron). Then
log((1 − ȳ)/ȳ) = −ξo
and
$(1 - \bar{y})/\bar{y} = e^{-\xi_o}$
and
$$\bar{y} = \frac{1}{1 + e^{-\xi_o}}$$
That is, modeling the probability using the classification model (with the logistic output activation) corresponds to modeling log-odds using the regression model (with the identity output activation).
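The derivation says that the logistic sigmoid is the inverse of the logit. A short numerical check, as a sketch with illustrative function names:

```python
import math

def sigmoid(xi):
    """Logistic sigmoid: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-xi))

def logit(y):
    """Log-odds of a probability 0 < y < 1."""
    return math.log(y / (1.0 - y))

# Going from a probability to log-odds and back recovers the probability.
round_trip = sigmoid(logit(0.3))
```

Even odds (probability 0.5) correspond to log-odds 0, which is where the sigmoid crosses 0.5.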
Log likelihood is your friend!
What is the statistical meaning of the cross-entropy?
▶ Let's have a "coin" (sides 0 and 1).
▶ The probability of 1 is ȳ and is unknown!
▶ You have tossed the coin 5 times and got a training dataset:
T = {1, 1, 0, 0, 1} = {d1, . . . , d5}
Consider this to be a very special case where the input dimension is 0
▶ What is the best model y of ȳ based on the data?
Answer: The one that generates the data with maximum probability!
Log likelihood is your friend!
Keep in mind our dataset:
T = {1, 1, 0, 0, 1} = {d1, . . . , d5}
Assume that the data was generated by independent trials,
then the probability of getting exactly T from our model is
L = y · y · (1 − y) · (1 − y) · y
How to maximize this w.r.t. y?
Maximize
LL = log(L) = log(y) + log(y) + log(1 − y) + log(1 − y) + log(y)
But then
−LL = −1 · log(y) − 1 · log(y) − (1 − 0) · log(1 − y) − (1 − 0) · log(1 − y) − 1 · log(y)
i.e. −LL is the cross-entropy.
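For the five tosses above, the maximization can be checked numerically. The sketch below evaluates −LL on a grid of candidate values of y (the grid search is only for illustration; setting the derivative of LL to zero gives the same answer in closed form).

```python
import math

T = [1, 1, 0, 0, 1]

def neg_log_likelihood(y, data):
    """-LL = -sum of d * log(y) + (1 - d) * log(1 - y): the cross-entropy."""
    return -sum(d * math.log(y) + (1 - d) * math.log(1 - y) for d in data)

# Candidate models y = 0.01, 0.02, ..., 0.99
candidates = [i / 100 for i in range(1, 100)]
best = min(candidates, key=lambda y: neg_log_likelihood(y, T))
```

The minimizer is y = 0.6, the fraction of ones in T.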
Let the coin depend on the input
Consider our model giving a probability yo(⃗w,⃗x) given input ⃗x.
Recall that the training dataset is
$$T = \left\{ (\vec{x}_1, d_1), (\vec{x}_2, d_2), \ldots, (\vec{x}_p, d_p) \right\}$$
Here $\vec{x}_k = (x_{k0}, x_{k1}, \ldots, x_{kn}) \in \mathbb{R}^{n+1}$, $x_{k0} = 1$, is the k-th input, and dk ∈ {0, 1} is the expected output.
The likelihood:
$$L(\vec{w}) = \prod_{k=1}^{p} y_o(\vec{w}, \vec{x}_k)^{d_k} \cdot \left(1 - y_o(\vec{w}, \vec{x}_k)\right)^{(1 - d_k)}$$
$$\log(L) = \sum_{k=1}^{p} \left( d_k \cdot \log\left(y_o(\vec{w}, \vec{x}_k)\right) + (1 - d_k) \cdot \log\left(1 - y_o(\vec{w}, \vec{x}_k)\right) \right)$$
and thus − log(L) = the cross-entropy.
Minimizing the cross-entropy maximizes the log-likelihood (and vice versa).
Squared Error vs Logistic Output Activation
Consider a single neuron model $y = \sigma(w \cdot x) = 1/(1 + e^{-w \cdot x})$ where w ∈ R is the weight (ignore the bias).
A training dataset T = {(x, d)} where x ∈ R and d ∈ {0, 1}.
Squared error $E(w) = \frac{1}{2}(y - d)^2$.
$$\frac{\partial E}{\partial w} = (y - d) \cdot y \cdot (1 - y) \cdot x$$
Thus
▶ If d = 1 and y ≈ 0, then $\frac{\partial E}{\partial w} \approx 0$
▶ If d = 0 and y ≈ 1, then $\frac{\partial E}{\partial w} \approx 0$
The gradient of E is small even though the model is wrong!
Squared Error vs Logistic Output Activation
Consider a single neuron model $y = \sigma(w \cdot x) = 1/(1 + e^{-w \cdot x})$ where w ∈ R is the weight (ignore the bias).
A training dataset T = {(x, d)} where x ∈ R and d ∈ {0, 1}.
Cross-entropy error E(w) = −d · log(y) − (1 − d) · log(1 − y).
For d = 1
$$\frac{\partial E}{\partial w} = -\frac{1}{y} \cdot y \cdot (1 - y) \cdot x = -(1 - y) \cdot x$$
which is close to −x for y ≈ 0.
For d = 0
$$\frac{\partial E}{\partial w} = -\frac{1}{1 - y} \cdot (-y) \cdot (1 - y) \cdot x = y \cdot x$$
which is close to x for y ≈ 1.
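The contrast between the two error functions shows up directly in numbers. A small sketch for the single-neuron model; the function names and the chosen values of w and x are illustrative. Both cross-entropy cases (d = 1 and d = 0) collapse to the single expression (y − d) · x.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_squared(w, x, d):
    """dE/dw for E = 1/2 * (y - d)^2 with y = sigmoid(w * x)."""
    y = sigmoid(w * x)
    return (y - d) * y * (1.0 - y) * x

def grad_cross_entropy(w, x, d):
    """dE/dw for E = -d * log(y) - (1 - d) * log(1 - y)."""
    y = sigmoid(w * x)
    return (y - d) * x

# A badly wrong model: d = 1 but y = sigmoid(-10) is almost 0.
g_sq = grad_squared(-10.0, 1.0, 1)
g_ce = grad_cross_entropy(-10.0, 1.0, 1)
```

Here `g_sq` is nearly zero while `g_ce` is close to −x = −1, so only the cross-entropy gives gradient descent a strong signal to correct the weight.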
MLP training – practical issues
Practical issues of gradient descent
▶ Training efﬁciency:
▶ What size of a minibatch?
▶ How to choose the learning rate ε(t) and control SGD ?
▶ How to pre-process the inputs?
▶ How to initialize weights?
▶ How to choose desired output values of the network?
▶ Quality of the resulting model:
▶ When to stop training?
▶ Regularization techniques.
▶ How large should the network be?
For simplicity, I will illustrate the reasoning on MLP + mse. Later we will see other topologies and error functions with different but always somewhat related issues.
Issues in gradient descent
▶ Small networks: Lots of local minima where the descent
gets stuck.
▶ The model identiﬁability problem: Swapping incoming
weights of neurons i and j leaves the same network
topology – weight space symmetry.
▶ Recent studies show that for sufﬁciently large networks all
local minima have low values of the error function.
Saddle points
One can show (by a combinatorial
argument) that larger networks
have exponentially more saddle
points than local minima.
Issues in gradient descent – too slow descent
▶ flat regions
E.g. if the inner potentials are too large (in absolute value), then the derivative of the activation function is extremely small.
Issues in gradient descent – too fast descent
▶ steep cliffs: the gradient is extremely large, descent skips important weight vectors
Issues in gradient descent – local vs global structure
What if we initialize on the left?
Gradient Descent in Large Networks
Theorem
Assume (roughly),
▶ activation functions: "smooth" ReLU (softplus)
σ(z) = log(1 + exp(z))
In general: Smooth, non-polynomial, analytic, Lipschitz continuous.
▶ inputs ⃗xk of Euclidean norm equal to 1, desired values dk
satisfying |dk | ∈ O(1),
▶ the number of hidden neurons per layer sufﬁciently large
(polynomial in certain numerical characteristics of inputs roughly
measuring their similarity, and exponential in the depth of the network),
▶ the learning rate constant and sufﬁciently small.
The gradient descent converges (with high probability w.r.t. random
initialization) to a global minimum with zero error at linear rate.
Later we get to a special type of networks called ResNet where the above
result demands only polynomially many neurons per layer (w.r.t. depth).
Issues in computing the gradient
▶ vanishing and exploding gradients
$$\frac{\partial E_k}{\partial y_j} = y_j - d_{kj} \quad \text{for } j \in Y$$
$$\frac{\partial E_k}{\partial y_j} = \sum_{r \in j_{\rightarrow}} \frac{\partial E_k}{\partial y_r} \cdot \sigma'_r(\xi_r) \cdot w_{rj} \quad \text{for } j \in Z \setminus (Y \cup X)$$
▶ inexact gradient computation:
▶ Minibatch gradient is only an estimate of the true gradient.
▶ Note that the standard deviation of the estimate is (roughly) $\sigma/\sqrt{m}$ where m is the size of the minibatch and σ is the standard deviation of the gradient estimate for a single training example.
(E.g. minibatch size 10 000 means 100 times more computation than the size 100 but gives only 10 times less deviation.)
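The 1/√m behaviour is easy to observe in a simulation. The sketch below is a deliberately simplified stand-in: per-example gradients are replaced by independent scalar samples from a standard normal distribution, and the minibatch estimate is their average. The function name, trial count, and seed are arbitrary choices.

```python
import random
import statistics

def std_of_minibatch_mean(m, trials=2000, seed=0):
    """Empirical standard deviation of the mean of m N(0, 1) samples."""
    rng = random.Random(seed)
    means = [sum(rng.gauss(0.0, 1.0) for _ in range(m)) / m
             for _ in range(trials)]
    return statistics.pstdev(means)

# 16 times more samples per batch, but only about 4 times less deviation.
ratio = std_of_minibatch_mean(4) / std_of_minibatch_mean(64)
```

The ratio comes out close to √(64/4) = 4, matching the rule of thumb on the slide.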
Minibatch size
▶ Larger batches provide a more accurate estimate of the
gradient, but with less than linear returns.
▶ Multicore architectures are usually underutilized by extremely
small batches.
▶ If all examples in the batch are to be processed in parallel (as is
the typical case), then the amount of memory scales with the
batch size. For many hardware setups this is the limiting factor in
batch size.
▶ It is common (especially when using GPUs) for power of 2 batch
sizes to offer better runtime. Typical power of 2 batch sizes
range from 32 to 256, with 16 sometimes being attempted for
large models.
▶ Small batches can offer a regularizing effect, perhaps due to the
noise they add to the learning process.
It has been observed in practice that when using a larger batch
there is a degradation in the quality of the model, as measured
by its ability to generalize.
("On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima". Keskar et al, ICLR’17)
Momentum
Issue in the gradient descent:
▶ $\nabla E(\vec{w}^{(t)})$ constantly changes direction (but the error steadily decreases).
Solution: In every step add the change made in the previous step (weighted by a factor α):
$$\Delta \vec{w}^{(t)} = -\varepsilon(t) \cdot \sum_{k \in T} \nabla E_k\left(\vec{w}^{(t)}\right) + \alpha \cdot \Delta \vec{w}^{(t-1)}$$
where 0 < α < 1.
Momentum – illustration
SGD with momentum
▶ weights in ⃗w(0) are randomly initialized to values close to 0
▶ in the step t + 1 (here t = 0, 1, 2 . . .), weights ⃗w(t+1) are computed as follows:
▶ Choose (randomly) a set of training examples T ⊆ {1, . . . , p}
▶ Compute
$$\vec{w}^{(t+1)} = \vec{w}^{(t)} + \Delta \vec{w}^{(t)}$$
where
$$\Delta \vec{w}^{(t)} = -\varepsilon(t) \cdot \sum_{k \in T} \nabla E_k\left(\vec{w}^{(t)}\right) + \alpha \Delta \vec{w}^{(t-1)}$$
▶ 0 < ε(t) ≤ 1 is a learning rate in step t + 1
▶ 0 < α < 1 measures the "influence" of the momentum
▶ $\nabla E_k(\vec{w}^{(t)})$ is the gradient of the error of the example k
Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially.
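The momentum update needs only one extra vector of state. A minimal sketch under stated assumptions: the previous update is taken to be zero in the first step (the slides do not specify this), the gradient function stands for the summed minibatch gradient, and the elongated quadratic below is a made-up test problem.

```python
def momentum_descent(grad, w0, eps=0.05, alpha=0.9, steps=300):
    """Iterate dw(t) = -eps * grad(w(t)) + alpha * dw(t-1); w += dw(t)."""
    w = list(w0)
    delta = [0.0] * len(w)  # assumed initial value of the previous update
    for _ in range(steps):
        g = grad(w)
        delta = [-eps * g_i + alpha * d_i for g_i, d_i in zip(g, delta)]
        w = [w_i + d_i for w_i, d_i in zip(w, delta)]
    return w

# Toy error E(w) = 1/2 * (w_0^2 + 25 * w_1^2): a narrow valley.
def grad_valley(w):
    return [w[0], 25.0 * w[1]]

w_end = momentum_descent(grad_valley, [10.0, 1.0])
```

With these settings both coordinates shrink at roughly the same rate, even though the curvatures differ by a factor of 25. Plain gradient descent with the same learning rate would crawl along the flat direction.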
Learning rate
Search for the learning rate
▶ Use settings from a successful solution of a similar problem as a baseline.
▶ Search for the learning rate by monitoring the learning:
▶ Search through values from small (e.g. 0.001) to large (e.g. 0.1), possibly multiplying by 2.
▶ Train for several epochs, observe the learning curves (see cross-validation later).
Adaptive learning rate
▶ Power scheduling: Set ϵ(t) = ϵ0/(1 + t/s) where ϵ0 is an initial learning rate and s a number of steps
(after s steps the learning rate is ϵ0/2, after 2s it is ϵ0/3 etc.)
▶ Exponential scheduling: Set $\epsilon(t) = \epsilon_0 \cdot 0.1^{t/s}$.
(the learning rate decays faster than in the power scheduling)
▶ Piecewise constant scheduling: A constant learning rate for a number of steps/epochs, then a smaller learning rate, and so on.
▶ 1cycle scheduling: Start by increasing the initial learning rate from ϵ0 linearly to ϵ1 (approx. ϵ1 = 10ϵ0) halfway through training. Then decrease from ϵ1 linearly to ϵ0. Finish by dropping the learning rate by several orders of magnitude (still linearly).
According to a 2018 paper by Leslie Smith this may converge much faster (100 epochs vs 800 epochs on CIFAR10 dataset).
For comparison of some methods see: "An Empirical Study of Learning Rates in Deep Neural Networks for Speech Recognition", Senior et al.
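The first two schedules are one-liners. A sketch with illustrative function names, where `eps0` is the initial learning rate and `s` the number of steps from the formulas above:

```python
def power_schedule(t, eps0, s):
    """eps(t) = eps0 / (1 + t / s)"""
    return eps0 / (1.0 + t / s)

def exponential_schedule(t, eps0, s):
    """eps(t) = eps0 * 0.1^(t / s): a factor of 10 drop every s steps."""
    return eps0 * 0.1 ** (t / s)

# Learning rates after s and 2s steps for eps0 = 0.1 and s = 1000
power_rates = [power_schedule(t, 0.1, 1000) for t in (1000, 2000)]
exp_rates = [exponential_schedule(t, 0.1, 1000) for t in (1000, 2000)]
```

After 2s steps the power schedule has only reached ϵ0/3, while the exponential schedule is already at ϵ0/100.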
AdaGrad
So far we have considered ﬁxed schedules for learning rates.
It is better to have
▶ larger rates for weights with smaller updates,
▶ smaller rates for weights with larger updates.
AdaGrad uses individually adapting learning rate for each
weight.
SGD with AdaGrad
▶ weights in ⃗w(0) are randomly initialized to values close to 0
▶ in the step t + 1 (here t = 0, 1, 2 . . .), compute ⃗w(t+1):
▶ Choose (randomly) a minibatch T ⊆ {1, . . . , p}
▶ Compute
$$w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$$
where
$$\Delta w_{ji}^{(t)} = -\frac{\eta}{\sqrt{r_{ji}^{(t)} + \delta}} \cdot \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}\left(\vec{w}^{(t)}\right)$$
and
$$r_{ji}^{(t)} = r_{ji}^{(t-1)} + \left( \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}\left(\vec{w}^{(t)}\right) \right)^2$$
▶ η is a constant expressing the influence of the learning rate, typically 0.01.
▶ δ > 0 is a smoothing term (typically 1e-8) avoiding division by 0.
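One AdaGrad step for a flat list of weights can be sketched as follows. It assumes the standard form of AdaGrad in which the accumulated sum of squared gradients r sits under a square root together with δ; the function name and the example numbers are illustrative.

```python
import math

def adagrad_step(w, r, g, eta=0.01, delta=1e-8):
    """One AdaGrad update.

    w: weights, r: accumulated squared gradients so far,
    g: summed minibatch gradient. Returns the new (w, r).
    """
    r_new = [r_i + g_i ** 2 for r_i, g_i in zip(r, g)]
    w_new = [w_i - eta / math.sqrt(r_i + delta) * g_i
             for w_i, r_i, g_i in zip(w, r_new, g)]
    return w_new, r_new

# Two weights whose gradients differ by a factor of 100.
w1, r1 = adagrad_step([1.0, 1.0], [0.0, 0.0], [10.0, 0.1])
```

In the very first step both weights move by almost exactly η, regardless of the size of their gradients. This is the per-weight rescaling that the method is built around.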
RMSProp
The main disadvantage of AdaGrad is the accumulation of the
gradient throughout the whole learning process.
In case the learning needs to get over several "hills" before
settling in a deep "valley", the weight updates get far too small
before getting to it.
RMSProp uses an exponentially decaying average to discard
history from the extreme past so that it can converge rapidly
after ﬁnding a convex bowl, as if it were an instance of the
AdaGrad algorithm initialized within that bowl.
SGD with RMSProp
▶ weights in $\vec{w}^{(0)}$ are randomly initialized to values close to 0
▶ in the step t + 1 (here t = 0, 1, 2, ...), compute $\vec{w}^{(t+1)}$:
▶ Choose (randomly) a minibatch T ⊆ {1, ..., p}
▶ Compute
$$w_{ji}^{(t+1)} = w_{ji}^{(t)} + \Delta w_{ji}^{(t)}$$
where
$$\Delta w_{ji}^{(t)} = -\frac{\eta}{\sqrt{r_{ji}^{(t)} + \delta}} \cdot \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(\vec{w}^{(t)})$$
and
$$r_{ji}^{(t)} = \rho \, r_{ji}^{(t-1)} + (1 - \rho) \left( \sum_{k \in T} \frac{\partial E_k}{\partial w_{ji}}(\vec{w}^{(t)}) \right)^2$$
▶ η is a constant expressing the influence of the learning rate
(Hinton suggests ρ = 0.9 and η = 0.001).
▶ δ > 0 is a smoothing term (typically 1e-8) avoiding division by 0.
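A minimal Python sketch of the RMSProp step follows. As before, the flat-list representation, the zero initial value of r and the name `rmsprop_step` are assumptions made for illustration.

```python
import math


def rmsprop_step(w, r, grad, eta=0.001, rho=0.9, delta=1e-8):
    """Apply one RMSProp update to a flat list of weights.

    r is an exponentially decaying average of squared minibatch gradients
    (assumed to start at 0); grad is the sum over k in T of dE_k/dw.
    Returns the new (w, r).
    """
    new_w, new_r = [], []
    for w_i, r_i, g_i in zip(w, r, grad):
        r_i = rho * r_i + (1.0 - rho) * g_i * g_i  # old history decays
        new_w.append(w_i - eta / math.sqrt(r_i + delta) * g_i)
        new_r.append(r_i)
    return new_w, new_r
```

With a constant gradient g, the average r approaches g² instead of growing without bound, so the effective step size settles near η rather than shrinking towards 0.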
129
Other optimization methods
There are more methods such as AdaDelta, Adam (roughly
RMSProp combined with momentum), etc.
A natural question: Which algorithm should one choose?
Unfortunately, there is currently no consensus on this point.
According to a recent study, the family of algorithms with
adaptive learning rates (represented by RMSProp and
AdaDelta) performed fairly robustly, but no single best algorithm
has emerged.
Currently, the most popular optimization algorithms actively in
use include SGD, SGD with momentum, RMSProp, RMSProp
with momentum, AdaDelta and Adam.
The choice of which algorithm to use, at this point, seems to
depend largely on the user's familiarity with the algorithm.
130
Choice of (hidden) activations
Generic requirements imposed on activation functions:
1. differentiability
(to do gradient descent)
2. non-linearity
(linear multi-layer networks are equivalent to single-layer)
3. monotonicity
(local extrema of activation functions induce local extrema of the error
function)
4. "linearity"
(i.e. preserve as much linearity as possible; linear models are easiest to
fit; find the "minimum" non-linearity needed to solve a given task)
The choice of activation functions is closely related to input
preprocessing and the initial choice of weights. I will illustrate the
reasoning on sigmoidal functions and say a few words about other
activation functions later.
131
Activation functions – tanh
$\sigma(\xi) = 1.7159 \cdot \tanh(\tfrac{2}{3} \cdot \xi)$; we have $\lim_{\xi \to \infty} \sigma(\xi) = 1.7159$ and
$\lim_{\xi \to -\infty} \sigma(\xi) = -1.7159$
132
Activation functions – tanh
$\sigma(\xi) = 1.7159 \cdot \tanh(\tfrac{2}{3} \cdot \xi)$ is almost linear on [−1, 1]
133
Activation functions – tanh
first derivative of $\sigma(\xi) = 1.7159 \cdot \tanh(\tfrac{2}{3} \cdot \xi)$
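The scaled tanh and its first derivative can be written down directly. The sketch below uses the chain rule, σ′(ξ) = 1.7159 · (2/3) · (1 − tanh²(2/3 · ξ)); the function names are hypothetical.

```python
import math

A = 1.7159      # output scale from the slide
B = 2.0 / 3.0   # input scale from the slide


def scaled_tanh(xi):
    """sigma(xi) = 1.7159 * tanh(2/3 * xi)."""
    return A * math.tanh(B * xi)


def scaled_tanh_derivative(xi):
    """Chain rule: sigma'(xi) = A * B * (1 - tanh(B * xi)^2)."""
    t = math.tanh(B * xi)
    return A * B * (1.0 - t * t)
```

With these constants σ(1) is very close to 1 and σ(−1) very close to −1, while the derivative at ξ = 4 has already dropped to roughly 0.02, which matches the saturation described for potentials outside [−4, 4].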
134
Input preprocessing
▶ Some inputs may be much larger than others.
E.g.: height vs weight of a person, maximum speed of
a car (in km/h) vs its price (in CZK), etc.
▶ Large inputs have greater influence on the training than
small ones. In addition, inputs that are too large may slow down
learning (saturation of activation functions).
▶ Typical standardization:
▶ average = 0 (subtract the mean)
▶ variance = 1 (divide by the standard deviation)
Here the mean and standard deviation may be estimated
from data (the training set).
(illustration of standard deviation)
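A minimal sketch of the standardization step, assuming the data set is a list of rows with one number per feature. The helper names and the guard for constant features are assumptions, not part of the lecture.

```python
import math


def fit_standardizer(rows):
    """Estimate the per-feature mean and standard deviation from the training set."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    stds = []
    for j in range(dims):
        var = sum((row[j] - means[j]) ** 2 for row in rows) / n
        # Assumption: a constant feature is left unscaled to avoid division by 0.
        stds.append(math.sqrt(var) or 1.0)
    return means, stds


def standardize(rows, means, stds):
    """Subtract the mean and divide by the standard deviation, feature by feature."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]
```

The statistics are estimated on the training set only and then reused unchanged for validation and test inputs, e.g. `standardize(test_rows, means, stds)`.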
135
Initial weights (for tanh)
▶ Assume the weights are chosen at random. From what distribution?
▶ Consider the activation function $\sigma(\xi) = 1.7159 \cdot \tanh(\tfrac{2}{3} \cdot \xi)$
for all neurons.
▶ σ is almost linear on [−1, 1],
▶ σ saturates outside the interval [−4, 4] (i.e. it is close to its
limit values and its derivative is close to 0).
Thus
▶ for too small weights we may get an (almost) linear model.
▶ for too large weights the activations may get saturated and
the learning will be very slow.
Hence, we want to choose weights so that the inner
potentials of neurons will be roughly in the interval [−1, 1].
136
Normal LeCun initialization
▶ Assume the input data have mean = 0 and variance = 1.
Consider a neuron j from the first layer with n inputs. Assume its
weights are chosen randomly from the normal distribution $N(0, w^2)$.
Assume that all random choices are independent of each other.
▶ The rule: Choose the standard deviation of weights w so that
the standard deviation of $\xi_j$ (denote by $o_j$) satisfies $o_j \approx 1$.
▶ Basic properties of the variance of independent variables give
$o_j = \sqrt{n} \cdot w$.
Thus by putting $w = \frac{1}{\sqrt{n}}$ we obtain $o_j = 1$.
▶ The same works for higher layers, where n corresponds to the number
of neurons in the layer one level lower.
This gives normal LeCun initialization:
$$w_i \sim N\left(0, \frac{1}{n}\right)$$
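The "basic properties of the variance" step can be spelled out as follows. This is a sketch under the slide's assumptions (zero-mean weights independent of the inputs, inputs with mean 0 and variance 1), with the bias ignored, which is an additional assumption made here.

```latex
\begin{align*}
\xi_j &= \sum_{i=1}^{n} w_i x_i
  && \text{(bias ignored)} \\
\operatorname{Var}(\xi_j) &= \sum_{i=1}^{n} \operatorname{Var}(w_i x_i)
  = \sum_{i=1}^{n} E[w_i^2] \, E[x_i^2]
  = n \cdot w^2 \cdot 1 \\
o_j &= \sqrt{\operatorname{Var}(\xi_j)} = \sqrt{n} \cdot w
\end{align*}
```

The first equality in the second line holds because the terms $w_i x_i$ are uncorrelated (the weights are independent and have zero mean); the second uses $E[w_i x_i] = 0$ and the independence of $w_i$ and $x_i$.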
137
Normal Glorot initialization
The previous heuristic for weight initialization ignores the variance of the
gradient (i.e. it is concerned only with the "size" of activations in the
forward pass).
Glorot & Bengio (2010) presented a normalized initialization by
choosing weights randomly from the following normal distribution:
$$N\left(0, \frac{2}{m + n}\right) = N\left(0, \frac{1}{(m + n)/2}\right)$$
Here n is the number of inputs to the layer, m is the number of
neurons in the layer above.
This is designed to compromise between the goal of initializing all
layers to have the same activation variance and the goal of initializing
all layers to have the same gradient variance.
This gives normal Glorot initialization (also called normal Xavier
initialization):
$$w_i \sim N\left(0, \frac{2}{m + n}\right)$$
138
Uniform LeCun initialization
▶ Assume that the input data have mean = 0 and variance = 1.
Consider a neuron j from the first layer with n inputs. Assume its
weights are chosen randomly from the uniform distribution U(−w, w).
Assume that all random choices are independent of each other.
▶ As before, we want the standard deviation $o_j$ of the inner
potential $\xi_j$ to be approximately 1.
▶ Since the variance of U(−w, w) is $w^2/3$, basic properties of the
variance of independent variables give
$o_j = \sqrt{\frac{n}{3}} \cdot w$.
Thus by putting $w = \sqrt{\frac{3}{n}}$ we obtain $o_j = 1$.
We obtain uniform LeCun initialization:
$$w_i \sim U\left(-\sqrt{\frac{3}{n}}, \sqrt{\frac{3}{n}}\right)$$
139
Uniform Glorot initialization
Similarly to the normal case, we want to normalize the initialization
w.r.t. both forward and backward passes.
We obtain uniform Glorot initialization (aka uniform Xavier init.):
$$w_i \sim U\left(-\sqrt{\frac{6}{m + n}}, \sqrt{\frac{6}{m + n}}\right) = U\left(-\sqrt{\frac{3}{(m + n)/2}}, \sqrt{\frac{3}{(m + n)/2}}\right)$$
Here n is the number of inputs to the layer, m is the number of
neurons in the layer above.
140
Modern activation functions
For hidden neurons, sigmoidal functions are often substituted with
piecewise linear activation functions. The most prominent is ReLU:
σ(ξ) = max{0, ξ}
▶ THE default activation function recommended for use with most
feedforward neural networks.
▶ As close to a linear function as possible; very simple; does not
saturate for large potentials.
▶ Dead for negative potentials.
141
Normal He initialization
▶ The ReLU is not as sensitive to a large variance of
the inner potential as sigmoidal functions (large variance
does not matter as much).
▶ Still, it is good to keep the variance constant (at least due to the
output layer).
▶ LeCun initialization cannot be justified for ReLU for
the following reason:
The ReLU is not a symmetric function. So even if the inner
potential $\xi_j$ has mean = 0 and variance = 1, this is not true of
the output (the variance is halved).
Modifying the normal LeCun initialization to take the halving of the
variance into account, we obtain normal He initialization:
$$w_i \sim N\left(0, \frac{2}{n}\right)$$
(LeCun is $w_i \sim N\left(0, \frac{1}{n}\right)$.)
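The halving can be checked empirically. The sketch below assumes a standard normal inner potential and compares the mean square of the ReLU output with that of the potential; the mean square is the quantity that enters the variance of the next layer's inner potential when the weights have zero mean. The function name, sample size and seed are arbitrary choices.

```python
import random


def relu(xi):
    return max(0.0, xi)


def relu_mean_square_ratio(num_samples=100_000, seed=0):
    """Estimate E[relu(xi)^2] / E[xi^2] for xi ~ N(0, 1) by sampling."""
    rng = random.Random(seed)  # fixed seed, for reproducibility only
    potentials = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    mean_sq_in = sum(xi * xi for xi in potentials) / num_samples
    mean_sq_out = sum(relu(xi) ** 2 for xi in potentials) / num_samples
    return mean_sq_out / mean_sq_in
```

The ratio comes out close to 1/2, because a potential that is symmetric around 0 loses exactly its negative half. Doubling the weight variance from 1/n to 2/n compensates for this factor.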
142
More modern activation functions
▶ Leaky ReLU (greenboard):
▶ Generalizes ReLU, not dead for negative potentials.
▶ Experimentally not much better than ReLU.
▶ ELU: "Smoothed" ReLU:
$$\sigma(\xi) = \begin{cases} \alpha(\exp(\xi) - 1) & \text{for } \xi < 0 \\ \xi & \text{for } \xi \geq 0 \end{cases}$$
Here α is a parameter; ELU converges to −α as ξ → −∞. As
opposed to ReLU: smooth, always non-zero gradient (but
saturates), slower to compute.
▶ SELU: Scaled variant of ELU:
$$\sigma(\xi) = \lambda \begin{cases} \alpha(\exp(\xi) - 1) & \text{for } \xi < 0 \\ \xi & \text{for } \xi \geq 0 \end{cases}$$
Self-normalizing, i.e. the output of each layer will tend to preserve
a mean (close to) 0 and a standard deviation (close to) 1 for
λ ≈ 1.050 and α ≈ 1.673, properly initialized weights (see below)
and normalized inputs (zero mean, standard deviation 1).
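The activation functions above are short enough to sketch directly. The ELU and SELU definitions follow the formulas on the slide with the rounded constants quoted there; the leaky ReLU form and its slope 0.01 are assumptions, since the slide leaves the definition to the greenboard.

```python
import math

SELU_LAMBDA = 1.050  # rounded constant quoted on the slide
SELU_ALPHA = 1.673   # rounded constant quoted on the slide


def relu(xi):
    return max(0.0, xi)


def leaky_relu(xi, slope=0.01):
    # Assumed form: identity for xi >= 0, a small linear slope otherwise.
    return xi if xi >= 0.0 else slope * xi


def elu(xi, alpha=1.0):
    return xi if xi >= 0.0 else alpha * (math.exp(xi) - 1.0)


def selu(xi):
    return SELU_LAMBDA * elu(xi, SELU_ALPHA)
```

For strongly negative potentials `elu(xi, alpha)` approaches `-alpha`, so the unit saturates there but keeps a non-zero gradient, unlike ReLU.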
143
Initializing with Normal Distribution
Denote by n the number of inputs to the initialized layer, and m the
number of neurons in the layer.
▶ normal Glorot:
$$w_i \sim N\left(0, \frac{2}{m + n}\right)$$
Suitable for none, tanh, logistic, softmax
▶ normal He:
$$w_i \sim N\left(0, \frac{2}{n}\right)$$
Suitable for ReLU, leaky ReLU
▶ normal LeCun:
$$w_i \sim N\left(0, \frac{1}{n}\right)$$
Suitable for SELU (by the authors)
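The three normal initializers differ only in the variance. A minimal sketch with the standard library follows; note that `random.gauss` takes the standard deviation, so the variance from the slide is passed through a square root. The function names and the list-of-lists weight layout are assumptions.

```python
import math
import random


def normal_glorot(n, m, rng=random):
    """Sample one weight from N(0, 2 / (m + n))."""
    return rng.gauss(0.0, math.sqrt(2.0 / (m + n)))


def normal_he(n, rng=random):
    """Sample one weight from N(0, 2 / n)."""
    return rng.gauss(0.0, math.sqrt(2.0 / n))


def normal_lecun(n, rng=random):
    """Sample one weight from N(0, 1 / n)."""
    return rng.gauss(0.0, math.sqrt(1.0 / n))


def init_layer(n, m, sampler):
    """Weights of a layer with n inputs and m neurons: m rows of n weights each."""
    return [[sampler() for _ in range(n)] for _ in range(m)]
```

For example, `init_layer(784, 100, lambda: normal_he(784))` would give a hypothetical ReLU layer with 784 inputs and 100 neurons.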
144
How to choose activation of hidden neurons
▶ The default is ReLU.
▶ According to Aurélien Géron:
SELU > ELU > leakyReLU > ReLU > tanh > logistic
For discussion see: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and
Techniques to Build Intelligent Systems, Aurélien Géron
145
Batch normalization (roughly)
Intuition: Instead of keeping mean = 0 and variance = 1
implicitly due to a clever weight initialization, we may
renormalize values of neurons throughout the layers.
Consider the ℓ-th layer of the network.
Note that the output values of neurons in the ℓ-th layer can be
seen as inputs to the sub-network consisting of all layers above
the ℓ-th one.
What if we standardize the values of the ℓ-th layer as we did
with the input data?
For this we need to form a "dataset" of values of the ℓ-th layer.
146
Batch normalization (roughly)
Let us consider the ℓ-th layer with n neurons.
Consider a batch of training examples:
$\{(\vec{x}_k, \vec{d}_k) \mid k = 1, \ldots, p\}$
(This is typically a minibatch.)
▶ For every k = 1, ..., p: Compute the values of neurons in
the ℓ-th layer for the input $\vec{x}_k$ and obtain a vector
$\vec{z}_k = (z_{k1}, \ldots, z_{kn})$
▶ Standardize each component across the batch to mean = 0 and
variance = 1 and obtain normalized vectors $\hat{z}_1, \ldots, \hat{z}_p$.
▶ For every k = 1, ..., p give
$\vec{\gamma} \cdot \hat{z}_k + \vec{\delta}$
(with the product taken componentwise) as the output of the ℓ-th layer
instead of $\vec{z}_k$. Here $\vec{\gamma}$ and $\vec{\delta}$
are new trainable weights.
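The forward pass of this rough version of batch normalization can be sketched as follows. The small constant `eps` inside the square root is an assumption added for numerical stability; it does not appear on the slide. The function name is hypothetical.

```python
import math


def batch_norm(batch, gamma, delta, eps=1e-5):
    """Normalize each neuron of a layer over a batch, then scale and shift.

    batch        -- list of p vectors z_k, each with n components
    gamma, delta -- trainable scale and shift, one value per neuron
    eps          -- assumed stabilizer, avoids division by 0
    """
    p = len(batch)
    n = len(batch[0])
    means = [sum(z[i] for z in batch) / p for i in range(n)]
    variances = [sum((z[i] - means[i]) ** 2 for z in batch) / p for i in range(n)]
    out = []
    for z in batch:
        z_hat = [(z[i] - means[i]) / math.sqrt(variances[i] + eps) for i in range(n)]
        out.append([gamma[i] * z_hat[i] + delta[i] for i in range(n)])
    return out
```

With `gamma` set to all ones and `delta` to all zeros the layer output is simply the standardized batch; training can move away from this if a different mean and variance suit the layers above better.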
147
Generalization
Intuition: Generalization = ability to cope with new unseen
instances.
Data are mostly noisy, so it is not a good idea to fit them exactly.
In case of function approximation, the network should not
return exactly the results from the training set.
More formally: It is typically assumed that the training set has
been generated as follows:
$$d_{kj} = g_j(\vec{x}_k) + \Theta_{kj}$$
where $g_j$ is the "underlying" function corresponding to
the output neuron j ∈ Y and $\Theta_{kj}$ is random noise.
The network should fit $g_j$, not the noise.
Methods improving generalization are called regularization
methods.
148
Regularization
Regularization is a big issue in neural networks, as they
typically use a huge number of parameters and thus are very
susceptible to overfitting.
von Neumann: "With four parameters I can fit an elephant,
and with five I can make him wiggle his trunk."
... and I ask you, Prof. Neumann:
What can you fit with 40 GB of parameters??
149
Early stopping
Early stopping means that we stop learning before it reaches
a minimum of the error E.
When to stop?
In many applications the error function is not the main thing we
want to optimize.
E.g. in the case of a trading system, we typically want to maximize our profit,
not to minimize (strange) error functions designed to be easily differentiable.
Also, as noted before, minimizing E completely is not good for
generalization.
For a start: We may employ the standard approach of training on one
set and stopping on another one.
150
Early stopping
Divide your dataset into several subsets:
▶ training set (e.g. 60%) – train the network here
▶ validation set (e.g. 20%) – use to stop the training
▶ test set (e.g. 20%) – use to evaluate the final model
What to use as a stopping rule?
You may observe E (or any other function of interest) on the
validation set; if it does not improve for the last k steps, stop.
Alternatively, you may observe the gradient; if it is small for
some time, stop.
(Recent studies have shown that this traditional rule is not too good: it may happen
that the gradient is larger close to minimum values. On the other hand, E
does not have to be evaluated, which saves time.)
To compare models you may use ML techniques such as
various types of cross-validation etc.
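The validation-based stopping rule can be sketched as a small function over a recorded sequence of validation errors. The name `early_stopping_step` and the interface (a list of already computed errors rather than a live training loop) are assumptions made to keep the example self-contained.

```python
def early_stopping_step(validation_errors, patience):
    """Return (stop_step, best_step) for the rule "stop after `patience`
    steps without improvement of the validation error".

    If the rule never fires, stop_step is the last recorded step.
    """
    best_error = float("inf")
    best_step = -1
    for step, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_step = error, step
        elif step - best_step >= patience:
            return step, best_step
    return len(validation_errors) - 1, best_step
```

In a real training loop one would typically also keep a copy of the weights from `best_step` and return to them after stopping.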
151
Size of the network
Similar problem as in the case of the training duration:
▶ A network that is too small is not able to capture intrinsic properties
of the training set.
▶ Large networks overfit faster.
Solution: Optimal number of neurons :-)
▶ there are some (useless) theoretical bounds
▶ there are algorithms dynamically adding/removing neurons
(not much used nowadays)
▶ In practice: Start with an existing network solving a similar
problem.
If you are truly desperate trying to solve a brand new problem, you may
try an ancient rule of thumb: the number of neurons ≈ ten times less
than the number of training instances.
Experiment, experiment, experiment.
152
Feature extraction
Consider a two layer network. Hidden neurons are supposed to
represent "patterns" in the inputs.
Example: Network 64-2-3 for letter classification:
153
Ensemble methods
Techniques for reducing generalization error by combining
several models.
The reason that ensemble methods work is that different models will usually
not make all the same errors on the test set.
Idea: Train several different models separately, then have all of
the models vote on the output for test examples.
Bagging:
▶ Generate k training sets T1, ..., Tk by sampling from T
uniformly with replacement.
If each Ti is drawn with |T| samples, then on average Ti contains about (1 − 1/e)|T| ≈ 0.63 |T| distinct examples.
▶ For each i, train a model Mi on Ti.
▶ Combine outputs of the models: for regression by
averaging, for classification by (majority) voting.
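The sampling and combination steps of bagging can be sketched as below. Training the individual models is left out; the helper names are hypothetical, and ties in the vote are broken in favor of the prediction that appears first, which is an arbitrary choice.

```python
import random
from collections import Counter


def bootstrap_sample(training_set, rng):
    """Draw |T| examples from T uniformly with replacement."""
    return [rng.choice(training_set) for _ in range(len(training_set))]


def majority_vote(predictions):
    """Combine class predictions of the models M_1, ..., M_k."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Combine regression outputs of the models M_1, ..., M_k."""
    return sum(predictions) / len(predictions)
```

Sampling a set of 10 000 examples this way leaves roughly 63% of them distinct, in line with the (1 − 1/e) estimate.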
154
Dropout
The algorithm: In every step of the gradient descent
▶ choose randomly a set N of neurons, each neuron is included in
N independently with probability 1/2,
(in practice, different probabilities are used as well).
▶ do forward and backward propagations only using the selected
neurons
(i.e. leave weights of the other neurons unchanged)
Dropout resembles bagging: A large ensemble of neural networks is
trained "at once" on parts of the data.
Dropout is not exactly the same as bagging: The models share
parameters, with each model inheriting a different subset of
parameters from the parent neural network. This parameter sharing
makes it possible to represent an exponential number of models with
a tractable amount of memory.
In the case of bagging, each model is trained to convergence on its respective
training set. This would be infeasible for large networks/training sets.
155
Dropout – details
▶ The inner potential of a neuron j without dropout:
$$\xi_j = \sum_{i \in j_{\leftarrow}} w_{ji} y_i$$
▶ The inner potential of a neuron j with dropout:
$$r_i \sim \mathrm{Bernoulli}(1/2) \quad \text{for all } i \in j_{\leftarrow} \setminus \{0\}$$
$$\xi_j = \sum_{i \in j_{\leftarrow}} w_{ji} (r_i y_i)$$
where $r_0 = 1$, i.e. the bias input is never dropped.
(Intuitively, randomly chosen neurons are masked out.)
▶ During inference do not drop out neurons and multiply
values of neurons by 1/2.
This compensates for the fact that without dropout there are twice
as many neurons.
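The following sketch computes the inner potential of a single neuron with and without dropout and checks the compensation argument exactly, by averaging over all possible masks. The bias is passed separately and is never masked; all function names are hypothetical.

```python
import itertools
import random


def potential(bias, weights, inputs, mask):
    """Inner potential with a dropout mask r_i in {0, 1}; the bias is never masked."""
    return bias + sum(w * r * y for w, r, y in zip(weights, mask, inputs))


def sample_mask(size, rng):
    """Draw r_i ~ Bernoulli(1/2) independently for every non-bias input."""
    return [1 if rng.random() < 0.5 else 0 for _ in range(size)]


def inference_potential(bias, weights, inputs):
    """No neuron is dropped; the neuron values are multiplied by 1/2 instead."""
    return bias + sum(w * (0.5 * y) for w, y in zip(weights, inputs))


def average_over_all_masks(bias, weights, inputs):
    """Exact mean of the training-time potential over all 2^n masks."""
    masks = list(itertools.product([0, 1], repeat=len(inputs)))
    return sum(potential(bias, weights, inputs, m) for m in masks) / len(masks)
```

For any weights and inputs, `average_over_all_masks` and `inference_potential` agree, which is the sense in which scaling by 1/2 compensates for the missing masks at inference time (for the inner potential of one neuron; the argument is only approximate once non-linear activations are stacked).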
156
Weight decay and L2 regularization
Generalization can be improved by removing "unimportant" weights.
Penalising large weights gives stronger indication about their
importance.
In every step we decrease weights (multiplicatively) as follows:
$$w_{ji}^{(t+1)} = (1 - \zeta) w_{ji}^{(t)} - \varepsilon \cdot \frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t)})$$
Intuition: Unimportant weights will be pushed to 0, important weights
will survive the decay.
Weight decay is equivalent to the gradient descent with a constant
learning rate ε and the following error function:
$$E'(\vec{w}) = E(\vec{w}) + \frac{\zeta}{2\varepsilon} (\vec{w} \cdot \vec{w})$$
Here $\frac{\zeta}{2\varepsilon} (\vec{w} \cdot \vec{w})$ is the L2 regularization that penalizes large weights.
We use the gradient descent with a constant learning rate to illustrate
the equivalence between L2 regularization and the weight decay. Both
methods can be combined with other learning algorithms (AdaGrad, etc.).
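The equivalence can be verified in two lines: differentiate the regularized error and substitute it into the plain gradient descent step with learning rate ε.

```latex
\begin{align*}
\frac{\partial E'}{\partial w_{ji}}(\vec{w})
  &= \frac{\partial E}{\partial w_{ji}}(\vec{w}) + \frac{\zeta}{\varepsilon} w_{ji} \\
w_{ji}^{(t+1)}
  &= w_{ji}^{(t)} - \varepsilon \cdot \frac{\partial E'}{\partial w_{ji}}(\vec{w}^{(t)}) \\
  &= (1 - \zeta) w_{ji}^{(t)} - \varepsilon \cdot \frac{\partial E}{\partial w_{ji}}(\vec{w}^{(t)})
\end{align*}
```

The factor 1/2 in the penalty cancels the 2 that comes from differentiating $\vec{w} \cdot \vec{w}$, and the factor 1/ε cancels the learning rate.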
157
More optimization, regularization ...
There are many more practical tips, optimization methods,
regularization methods, etc.
For a very nice survey see
http://www.deeplearningbook.org/
... and also all the other infinitely many URLs concerned with deep
learning.
158
Some applications
159
ALVINN (history)
160
ALVINN
Architecture:
▶ MLP, 960 − 4 − 30 (also 960 − 5 − 30)
▶ inputs correspond to pixels
Activity:
▶ activation functions: logistic sigmoid
▶ Steering wheel position determined by "center of mass" of
neuron values.
161
ALVINN
Learning: Trained during (live) drive.
▶ Front window view captured by a camera, 25 images per
second.
▶ Training samples of the form (⃗xk , ⃗dk ) where
▶ ⃗xk = image of the road
▶ ⃗dk = corresponding position of the steering wheel
▶ position of the steering wheel "blurred" by Gaussian
distribution:
$$d_{ki} = e^{-D_i^2 / 10}$$
where $D_i$ is the distance of the i-th output from the one
which corresponds to the correct position of the wheel.
(The authors claim that this was better than the binary
output.)
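A small sketch of the blurred target vector and of the "center of mass" read-out follows. It assumes that the distance D_i is measured in output-neuron indices and that the center of mass is taken over all 30 outputs; neither detail is specified on the slides, and the function names are hypothetical.

```python
import math


def blurred_target(correct_index, num_outputs=30):
    """Desired outputs d_i = exp(-D_i^2 / 10), with D_i = |i - correct_index|."""
    return [math.exp(-((i - correct_index) ** 2) / 10.0) for i in range(num_outputs)]


def center_of_mass(values):
    """Read out a steering position as the value-weighted mean output index."""
    total = sum(values)
    return sum(i * v for i, v in enumerate(values)) / total
```

Applied to its own target, `center_of_mass(blurred_target(15))` returns a position very close to 15, and neighbouring outputs get partial credit instead of being treated as plain errors.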
162
ALVINN – Selection of training samples
Naive approach: take images directly from the camera and
adapt accordingly.
Problems:
▶ If the driver is gentle enough, the car never learns how to
get out of dangerous situations. Possible solutions:
▶ turn off learning for a moment, then suddenly switch it on,
and let the net catch on,
▶ let the driver drive as if being insane (dangerous, possibly
expensive).
▶ The real view out of the front window is repetitive and
boring; the net would overfit on a few examples.
163
ALVINN – Selection of training examples
Problem with a "good" driver is solved as follows:
▶ 15 distorted copies of each image:
▶ desired output generated for each copy
"Boring" images solved as follows:
▶ a buffer of 200 images (including 15 copies of the original), in
every step the system trains on the buffer
▶ after several updates a new image is captured, 15 copies are
made and they will substitute 15 images in the buffer (5 chosen
randomly, 10 with the smallest error).
164
ALVINN - learning
▶ pure backpropagation
▶ constant learning rate
▶ momentum, slowly increasing.
Results:
▶ Trained for 5 minutes, speed 4 miles per hour.
▶ ALVINN was able to drive well on a new road it had never
seen (in different weather conditions).
▶ The maximum speed was limited by the hydraulic controller
of the steering wheel, not the learning algorithm.
165
ALVINN - weight development
(figure: weight development of h1, ..., h5 after rounds 0, 10, 20 and 50)
Here h1, ..., h5 are hidden neurons.
166
MNIST – handwritten digits recognition
▶ Database of labelled images of handwritten digits: 60 000
training examples, 10 000 testing.
▶ Dimensions: 28 x 28, digits are centered to the "center of gravity"
of pixel values and normalized to fixed size.
▶ More at http://yann.lecun.com/exdb/mnist/
The database is used as a standard benchmark in lots of publications.
Allows comparison of various methods.
167
MNIST
One of the best "old" results is the following:
6-layer NN 784-2500-2000-1500-1000-500-10 (on GPU)
(Ciresan et al. 2010)
Abstract: Good old on-line back-propagation for plain multi-layer
perceptrons yields a very low 0.35% error rate on the famous MNIST
handwritten digits benchmark. All we need to achieve this best result so far
are many hidden layers, many neurons per layer, numerous deformed
training images, and graphics cards to greatly speed up learning.
A famous application of a learning convolutional network LeNet-1 in
1998.
168
MNIST – LeNet1
169
MNIST – LeNet1
Interpretation of output:
▶ the output neuron with the highest value identifies the digit.
▶ the same, but if the two largest neuron values are too close
together, the input is rejected (i.e. no answer).
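Both interpretations of the output can be sketched with one function. The parameter `min_margin`, which decides when the two largest values are "too close", is a hypothetical name; the slides do not say how the threshold was chosen.

```python
def classify_with_rejection(outputs, min_margin):
    """Return the index of the largest output, or None if the input is rejected.

    The input is rejected when the two largest output values differ by less
    than min_margin; min_margin = 0 gives the plain "highest value wins" rule.
    """
    ranked = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    best, second = ranked[0], ranked[1]
    if outputs[best] - outputs[second] < min_margin:
        return None
    return best
```

Raising `min_margin` trades a higher rejection rate for a lower error on the accepted inputs, which is the trade-off visible in the results below.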
Learning:
Inputs:
▶ training on 7291 samples, tested on 2007 samples
Results:
▶ error on test set without rejection: 5%
▶ error on test set with rejection: 1% (12% rejected)
▶ compare with dense MLP with 40 hidden neurons: error
1% (19.4% rejected)
170
Modern convolutional networks
The rest of the lecture is based on the online book Neural
Networks and Deep Learning by Michael Nielsen.
http://neuralnetworksanddeeplearning.com/index.html
▶ Convolutional networks are currently the best networks for
image classification.
▶ Their common ancestor is LeNet-5 (and other LeNets)
from the nineties.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 1998
171
AlexNet
In 2012 this network made a breakthrough in the ILSVRC
competition, taking the classification error from around 28% to
16%:
A convolutional network, trained on two GPUs.
172
Convolutional networks - local receptive ﬁelds
Every neuron is connected with a field of k × k (in this case
5 × 5) neurons in the lower layer (this field is called the receptive field).
The neuron is "standard": it computes a weighted sum of its inputs and
applies an activation function.
173
Convolutional networks - stride length
Then we slide the local receptive field over by one pixel to the right
(i.e., by one neuron), to connect to a second hidden neuron:
The "size" of the slide is called the stride length.
The group of all such neurons is called a feature map.
All these neurons share weights and biases!
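A feature map with shared weights can be sketched as a pair of nested loops over the positions of the receptive field. The sketch assumes a square k × k kernel, no padding, and ReLU as the activation; these choices and the function names are assumptions made for illustration.

```python
def relu(xi):
    return max(0.0, xi)


def feature_map(image, kernel, bias, stride=1, activation=relu):
    """Slide one k x k receptive field over the image.

    Every output neuron uses the same kernel and the same bias
    (shared weights); the image is a list of rows of numbers.
    """
    k = len(kernel)
    height, width = len(image), len(image[0])
    out = []
    for top in range(0, height - k + 1, stride):
        row = []
        for left in range(0, width - k + 1, stride):
            xi = bias + sum(kernel[a][b] * image[top + a][left + b]
                            for a in range(k) for b in range(k))
            row.append(activation(xi))
        out.append(row)
    return out
```

For a 28 × 28 input, a 5 × 5 field and stride 1 this gives a 24 × 24 feature map that is described by only 25 weights and one bias.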
174
Feature maps
Each feature map represents a property of the input that is
supposed to be spatially invariant.
Typically, we consider several feature maps in a single layer.
175
Pooling
Neurons in the pooling layer compute functions of their
receptive fields:
▶ Max-pooling: maximum of inputs
▶ L2-pooling: square root of the sum of squares
▶ Average-pooling: mean
▶ ...
176
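The three pooling functions differ only in how they reduce a receptive field to one number. The sketch below assumes non-overlapping square fields, which is a common choice but is not stated on the slide; the function names are hypothetical.

```python
import math


def max_pool(values):
    return max(values)


def l2_pool(values):
    return math.sqrt(sum(v * v for v in values))


def average_pool(values):
    return sum(values) / len(values)


def pool(feature_map, size, reduce_fn):
    """Apply reduce_fn to every non-overlapping size x size field of a feature map."""
    rows, cols = len(feature_map), len(feature_map[0])
    out = []
    for top in range(0, rows - size + 1, size):
        out.append([
            reduce_fn([feature_map[top + a][left + b]
                       for a in range(size) for b in range(size)])
            for left in range(0, cols - size + 1, size)
        ])
    return out
```

For example, `pool(fm, 2, max_pool)` halves both dimensions of a feature map `fm` and keeps the strongest response in each 2 × 2 field.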
Trained feature maps
(20 feature maps, receptive fields 5 × 5)
177
Trained feature maps
178
Simple convolutional network
28 × 28 input image, 3 feature maps, each feature map has its
own max-pooling (field 5 × 5, stride = 1), 10 output neurons.
Each neuron in the output layer gets input from each neuron in
the pooling layer.
Trained using backprop, which can be easily adapted to
convolutional networks.
179
Convolutional network
180
Simple convolutional network vs MNIST
Two convolutional-pooling layers (the first with 20, the second with 40 feature
maps), two dense (MLP) layers (1000-1000), outputs (10)
▶ Activation functions of the feature maps and dense layers:
ReLU
▶ max-pooling
▶ output layer: soft-max
▶ Error function: negative log-likelihood (= cross-entropy)
▶ Training: SGD, mini-batch size 10
▶ learning rate 0.03
▶ L2 regularization with "weight" λ = 0.1 + dropout with prob.
1/2
▶ training for 40 epochs (i.e. every training example is
considered 40 times)
▶ Expanded dataset: displacement by one pixel to an
arbitrary direction.
▶ Committee voting of 5 networks.
181
MNIST
Out of 10 000 images in the test set, only these 33 have been
incorrectly classified:
182
More complex convolutional networks
Convolutional networks have been used for classification of
images from the ImageNet database (16 million color images,
20 thousand classes)
183
ImageNet Large-Scale Visual Recognition
Challenge (ILSVRC)
Competition in classification over a subset of images from
ImageNet.
Started in 2010; it helped bring about a breakthrough in image recognition.
Training set: 1.2 million images, 1000 classes. Validation set: 50 000,
test set: 150 000.
Many images contain more than one object ⇒ the model is allowed
to choose five classes, and the correct label must be among the
five (top-5 criterion).
184
AlexNet
ImageNet classiﬁcation with deep convolutional neural networks, by Alex
Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton (2012).
Trained on two GPUs (NVIDIA GeForce GTX 580)
Results:
▶ accuracy 84.7% in top-5 (second best algorithm at the time
73.8%)
▶ 63.3% "perfect" (top-1) classiﬁcation
185
ILSVRC 2014
The same set as in 2012, top-5 criterion.
GoogLeNet: deep convolutional network, 22 layers
Results:
▶ Accuracy 93.33% top-5
186
ILSVRC 2015
▶ Deep convolutional network
▶ Various numbers of layers, the winner has
152 layers
▶ Skip connections implementing residual
learning
▶ Error 3.57% in top-5.
187
ILSVRC 2016
Trimps-Soushen (The Third Research Institute of Ministry of
Public Security)
There is no new innovative technology or novelty by
Trimps-Soushen.
Ensemble of the pretrained models from Inception-v3,
Inception-v4, Inception-ResNet-v2, Pre-Activation ResNet-200,
and Wide ResNet (WRN-68-2).
Each model is strong at classifying some categories but weak at
classifying others.
Test error: 2.99%
188
Top-k accuracy analyzed
https://towardsdatascience.com/review-trimps-soushen-winner-in-ilsvrc-2016-image-classification-dfbc423111dd
189
Top-20 typical errors
Out of 1458 misclassiﬁed images in Top-20:
https://towardsdatascience.com/review-trimps-soushen-winner-in-ilsvrc-2016-image-classification-dfbc423111dd
190
Superhuman convolutional nets?!
Andrej Karpathy: ...the task of labeling images with 5 out of 1000
categories quickly turned out to be extremely challenging, even for some
friends in the lab who have been working on ILSVRC and its classes for a
while. First we thought we would put it up on [Amazon Mechanical Turk].
Then we thought we could recruit paid undergrads. Then I organized a
labeling party of intense labeling effort only among the (expert labelers) in
our lab. Then I developed a modiﬁed interface that used GoogLeNet
predictions to prune the number of categories from 1000 to only about 100. It
was still too hard - people kept missing categories and getting up to ranges of
13-15% error rates. In the end I realized that to get anywhere competitively
close to GoogLeNet, it was most efﬁcient if I sat down and went through the
painfully long training process and the subsequent careful annotation process
myself... The labeling happened at a rate of about 1 per minute, but this
decreased over time... Some images are easily recognized, while some
images (such as those of ﬁne-grained breeds of dogs, birds, or monkeys) can
require multiple minutes of concentrated effort. I became very good at
identifying breeds of dogs... Based on the sample of images I worked on, the
GoogLeNet classiﬁcation error turned out to be 6.8%... My own error in the
end turned out to be 5.1%, approximately 1.7% better.
195
Does it really work?
196
Convolutional networks – theory
197
Convolutional network
198
Convolutional layers
Every neuron is connected with a (typically small) receptive
ﬁeld of neurons in the lower layer.
Neuron is "standard": Computes a weighted sum of its inputs,
applies an activation function.
199
Convolutional layers
Neurons grouped into
feature maps sharing
weights.
200
Convolutional layers
Each feature map represents a property of the input that is
supposed to be spatially invariant.
Typically, we consider several feature maps in a single layer.
201
Pooling layers
Neurons in the pooling layer compute simple functions of their
receptive ﬁelds (the ﬁelds are typically disjoint):
▶ Max-pooling : maximum of inputs
▶ L2-pooling : square root of the sum of squares
▶ Average-pooling : mean
▶ · · ·
202
Convolutional networks – architecture
Neurons organized in layers, L0, L1, . . . , Ln, connections
(typically) only from Lm to Lm+1.
Several types of layers:
▶ input layer L0
▶ dense layer Lm: Each neuron of Lm connected with each
neuron of Lm−1.
▶ convolutional layer Lm: Neurons organized into disjoint
feature maps, all neurons of a given feature map share
weights (but have different inputs)
▶ pooling layer: "Neurons" organized into pooling maps, all
neurons
▶ compute a simple aggregate function (such as max),
▶ have disjoint inputs.
Pooling after convolution is applied to each feature map separately.
I.e. a single pooling map after each feature map.
203
Convolutional networks – architecture
▶ Denote
▶ X a set of input neurons
▶ Y a set of output neurons
▶ Z a set of all neurons (X, Y ⊆ Z)
▶ individual neurons denoted by indices i, j etc.
▶ ξj is the inner potential of the neuron j after the computation
stops
▶ yj is the output of the neuron j after the computation stops
(define y0 = 1 as the value of the formal unit input)
▶ wji is the weight of the connection from i to j
(in particular, wj0 is the weight of the connection from the formal unit
input, i.e. wj0 = −bj where bj is the bias of the neuron j)
▶ j← is a set of all i such that j is adjacent from i
(i.e. there is an arc to j from i)
▶ j→ is a set of all i such that j is adjacent to i
(i.e. there is an arc from j to i)
▶ [ji] is a set of all connections (i.e. pairs of neurons) sharing
the weight wji.
204
Convolutional networks – activity
▶ neurons of dense and convolutional layers:
▶ inner potential of neuron j:
\[ \xi_j = \sum_{i \in j_\leftarrow} w_{ji} y_i \]
▶ activation function $\sigma_j$ for neuron j (arbitrary differentiable):
\[ y_j = \sigma_j(\xi_j) \]
▶ Neurons of pooling layers: Apply the "pooling" function:
▶ max-pooling:
\[ y_j = \max_{i \in j_\leftarrow} y_i \]
▶ avg-pooling:
\[ y_j = \frac{\sum_{i \in j_\leftarrow} y_i}{|j_\leftarrow|} \]
A convolutional network is evaluated layer-wise (as MLP), for each j ∈ Y we
have that yj(⃗w,⃗x) is the value of the output neuron j after evaluating the
network with weights ⃗w and input ⃗x.
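The activity rules above can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not in the lecture: a single 2-D feature map with a 2 × 2 shared kernel, stride 1, no padding, ReLU activation, and max-pooling over disjoint 2 × 2 fields. The names `conv_feature_map` and `max_pool` are hypothetical.

```python
def conv_feature_map(image, kernel, bias, activation):
    """One feature map: every neuron applies the same (shared) kernel and bias."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            # inner potential: weighted sum over the receptive field
            xi = bias + sum(kernel[a][b] * image[r + a][c + b]
                            for a in range(kh) for b in range(kw))
            row.append(activation(xi))
        out.append(row)
    return out


def max_pool(fmap, size):
    """Max-pooling over disjoint size x size receptive fields."""
    return [[max(fmap[r + a][c + b] for a in range(size) for b in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


def relu(x):
    return max(0.0, x)


image = [[1, 0, 2, 1, 0],
         [0, 1, 3, 0, 1],
         [2, 1, 0, 1, 2],
         [1, 0, 1, 2, 0],
         [0, 2, 1, 0, 1]]
kernel = [[1, 0], [0, -1]]          # illustrative shared weights
fmap = conv_feature_map(image, kernel, 0.0, relu)   # 4 x 4 feature map
pooled = max_pool(fmap, 2)                          # 2 x 2 pooling map
```

The feature map is evaluated first and the pooling map second, which mirrors the layer-wise evaluation described above.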
205
Convolutional networks – learning
Learning:
▶ Given a training set T of the form
\[ T = \{ (\vec{x}_k, \vec{d}_k) \mid k = 1, \ldots, p \} \]
Here, every $\vec{x}_k \in \mathbb{R}^{|X|}$ is an input vector and every $\vec{d}_k \in \mathbb{R}^{|Y|}$
is the desired network output. For every j ∈ Y, denote by
$d_{kj}$ the desired output of the neuron j for a given network
input $\vec{x}_k$ (the vector $\vec{d}_k$ can be written as $(d_{kj})_{j \in Y}$).
▶ Error function – mean squared error (for example):
\[ E(\vec{w}) = \frac{1}{p} \sum_{k=1}^{p} E_k(\vec{w}) \]
where
\[ E_k(\vec{w}) = \frac{1}{2} \sum_{j \in Y} \left( y_j(\vec{w}, \vec{x}_k) - d_{kj} \right)^2 \]
206
Convolutional networks – SGD
The algorithm computes a sequence of weight vectors
⃗w(0), ⃗w(1), ⃗w(2), . . ..
▶ weights in ⃗w(0) are randomly initialized to values close to 0
▶ in the step t + 1 (here t = 0, 1, 2 . . .), weights ⃗w(t+1) are
computed as follows:
▶ Choose (randomly) a set of training examples T ⊆ {1, . . . , p}
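▶ Compute
\[ \vec{w}^{(t+1)} = \vec{w}^{(t)} + \Delta \vec{w}^{(t)} \]
where
\[ \Delta \vec{w}^{(t)} = -\varepsilon(t) \cdot \frac{1}{|T|} \sum_{k \in T} \nabla E_k(\vec{w}^{(t)}) \]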
Here T is a minibatch (of a ﬁxed size),
▶ 0 < ε(t) ≤ 1 is a learning rate in step t + 1
▶ ∇Ek (⃗w(t)) is the gradient of the error of the example k
Note that the random choice of the minibatch is typically implemented by
randomly shuffling all data and then choosing minibatches sequentially.
An epoch consists of one pass through all data.
207
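A minimal sketch of the minibatch loop described above, assuming (for illustration only) a single linear neuron with squared error in place of a convolutional network. The function names, the toy data, the constant learning rate, and the batch size are all hypothetical choices.

```python
import random


def sgd(examples, grad_ek, w, lr, batch_size, epochs, seed=0):
    """Shuffle all data each epoch, then take minibatches sequentially."""
    rng = random.Random(seed)
    order = list(range(len(examples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            avg = [0.0] * len(w)
            for k in batch:
                g = grad_ek(w, examples[k])
                # average of the gradients of E_k over the minibatch
                avg = [a + gi / len(batch) for a, gi in zip(avg, g)]
            w = [wi - lr * gi for wi, gi in zip(w, avg)]
    return w


def grad_linear(w, example):
    """Gradient of E_k = 1/2 (w.x - d)^2 for a linear neuron (assumed model)."""
    x, d = example
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [(y - d) * xi for xi in x]


# Toy data generated by d = 0.5 + 2*a - 1*b; the leading 1.0 is the formal unit input.
examples = [((1.0, a, b), 0.5 + 2.0 * a - 1.0 * b)
            for a in (0.0, 1.0, 2.0) for b in (0.0, 1.0, 2.0)]
w = sgd(examples, grad_linear, [0.0, 0.0, 0.0], lr=0.1, batch_size=3, epochs=500)
```

With noise-free data such as this, the weights should approach (0.5, 2, -1); on real data the learning rate usually has to decay over time.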
Backprop
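Recall that $\nabla E_k(\vec{w}^{(t)})$ is a vector of all partial derivatives of
the form $\frac{\partial E_k}{\partial w_{ji}}$.
How to compute $\frac{\partial E_k}{\partial w_{ji}}$?
First, switch from derivatives w.r.t. $w_{ji}$ to derivatives w.r.t. $y_j$:
▶ Recall that for every $w_{ji}$ where j is in a dense layer, i.e.
does not share weights:
\[ \frac{\partial E_k}{\partial w_{ji}} = \frac{\partial E_k}{\partial y_j} \cdot \sigma'_j(\xi_j) \cdot y_i \]
▶ Now for every $w_{ji}$ where j is in a convolutional layer:
\[ \frac{\partial E_k}{\partial w_{ji}} = \sum_{r\ell \in [ji]} \frac{\partial E_k}{\partial y_r} \cdot \sigma'_r(\xi_r) \cdot y_\ell \]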
▶ Neurons of pooling layers do not have weights.
208
Backprop
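Now compute derivatives w.r.t. $y_j$:
▶ for every j ∈ Y:
\[ \frac{\partial E_k}{\partial y_j} = y_j - d_{kj} \]
This holds for the squared error, for other error functions the derivative
w.r.t. outputs will be different.
▶ for every j ∈ Z ∖ Y such that $j_\rightarrow$ is either a dense layer, or a
convolutional layer:
\[ \frac{\partial E_k}{\partial y_j} = \sum_{r \in j_\rightarrow} \frac{\partial E_k}{\partial y_r} \cdot \sigma'_r(\xi_r) \cdot w_{rj} \]
▶ for every j ∈ Z ∖ Y such that $j_\rightarrow$ is max-pooling: Then $j_\rightarrow = \{i\}$ for
a single "max" neuron and we have
\[ \frac{\partial E_k}{\partial y_j} = \begin{cases} \frac{\partial E_k}{\partial y_i} & \text{if } j = \arg\max_{r \in i_\leftarrow} y_r \\ 0 & \text{otherwise} \end{cases} \]
I.e. gradient can be propagated from the output layer downwards as in MLP.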
209
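The two rules that differ from MLP backprop (summing over connections that share a weight, and routing the gradient only to the "max" neuron) can be checked on a toy network. The sketch below assumes a hypothetical 1-D setup that is not in the lecture: five inputs, four logistic convolutional neurons sharing the weights `w = (w0, w1)`, and two max-pooling neurons over disjoint fields that also serve as the outputs with squared error.

```python
import math


def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, x):
    """Conv neurons r = 0..3 see inputs (x[r], x[r+1]) and share weights w."""
    xi = [w[0] * x[r] + w[1] * x[r + 1] for r in range(4)]
    y = [sigma(v) for v in xi]
    pooled = [max(y[0], y[1]), max(y[2], y[3])]
    return xi, y, pooled


def error(w, x, d):
    _, _, pooled = forward(w, x)
    return 0.5 * sum((p - t) ** 2 for p, t in zip(pooled, d))


def gradient(w, x, d):
    _, y, pooled = forward(w, x)
    dE_dy = [0.0] * 4
    for j, field in enumerate([(0, 1), (2, 3)]):
        winner = max(field, key=lambda r: y[r])   # arg max of the pooling field
        dE_dy[winner] = pooled[j] - d[j]          # every other neuron gets 0
    grad = [0.0, 0.0]
    for r in range(4):
        # logistic derivative: sigma'(xi_r) = y_r (1 - y_r)
        delta = dE_dy[r] * y[r] * (1.0 - y[r])
        grad[0] += delta * x[r]        # sum over connections sharing w0
        grad[1] += delta * x[r + 1]    # sum over connections sharing w1
    return grad


x = [0.5, -1.0, 2.0, 0.3, 1.5]
d = [1.0, 0.0]
w = [0.4, -0.7]
grad = gradient(w, x, d)
```

Comparing `grad` against a finite-difference estimate of `error` is a cheap sanity check that is worth keeping in any from-scratch implementation.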
Convolutional networks – summary
▶ Conv. nets are nowadays the most used networks in
image processing (and also in other areas where input has
some local, "spatially" invariant properties)
▶ Typically trained using gradient descent.
▶ Thanks to weight sharing, they allow (very) deep architectures.
▶ Typically extended with more adjustments and tricks in
their topologies.
210
The problem of cancer detection in WSI
The problem: Detect cancer in this image.
211
The problem of cancer detection in WSI
▶ WSI annotated by pathologists, not pixel level precise!
212
Input data
WSI too large, 105,185 px × 221,772 px
Cut into patches of size 512 px × 512 px
Patch positive iff the inner square intersects the annotation
213
Training on WSI
Our dataset from Masaryk Memorial Cancer Institute:
▶ 785 WSI from 166 patients
(698 WSI for training, 87 WSI for testing)
▶ Cut into 7,878,675 patches for training, 193,235 patches
for testing.
Dataset augmentation:
▶ random vertical and horizontal ﬂips
▶ random color perturbations
▶ Training data three step sampling:
1. randomly select a label
2. randomly select a slide containing at least a single patch
with the label
3. randomly select a patch with the label from the slide
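The three-step sampling can be sketched as follows. The nested dictionary layout (label → slide → list of patch ids) and all identifiers are assumptions made for this illustration; the lecture does not specify how the patches are indexed.

```python
import random


def sample_patch(index, rng):
    """index: label -> slide id -> list of patch ids (hypothetical layout)."""
    label = rng.choice(sorted(index))                       # 1. random label
    slides = sorted(s for s, patches in index[label].items() if patches)
    slide = rng.choice(slides)                              # 2. slide with that label
    patch = rng.choice(index[label][slide])                 # 3. patch with that label
    return label, slide, patch


index = {
    "positive": {"slide_a": ["a1"], "slide_b": []},
    "negative": {"slide_a": ["a2", "a3", "a4"], "slide_b": ["b1", "b2"]},
}
rng = random.Random(0)
draws = [sample_patch(index, rng) for _ in range(1000)]
```

Sampling the label first balances the classes, and sampling the slide second keeps slides with many patches from dominating the training data.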
214
VGG16
3 × 3 convolutions, stride 1, padding 1. Max pooling 2 × 2,
stride 2.
215
Training VGG16 on WSI
▶ VGG16 pretrained on ImageNet (off-the-shelf solution).
Top fully connected parts removed, substituted with global
max-pooling and a single dense layer.
▶ The network has single logistic output - the probability of
cancer in the patch
▶ The error E = cross-entropy
▶ Training:
▶ RMSprop optimizer
▶ The "forgetting" hyperparameter: ρ = 0.9
▶ The initial learning rate 5 × 10^−5
▶ If no improvement in E on validation data for 3 consecutive
epochs ⇒ halve the learning rate
▶ If no improvement in ROCAUC on validation data for 5
consecutive epochs ⇒ terminate
▶ Momentum with the weight α = 0.9
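The two validation-based rules (halving the learning rate and early termination) can be sketched independently of the optimizer. The function below replays per-epoch validation metrics; how the patience counters are reset after a halving is an assumption of this sketch, as are the function name and the sample numbers.

```python
def run_schedule(val_errors, val_aucs, lr=5e-5, lr_patience=3, stop_patience=5):
    """Replay validation metrics epoch by epoch; return (epochs_run, final_lr)."""
    best_err, best_auc = float("inf"), float("-inf")
    err_wait = auc_wait = 0
    epochs = 0
    for err, auc in zip(val_errors, val_aucs):
        epochs += 1
        if err < best_err:
            best_err, err_wait = err, 0
        else:
            err_wait += 1
            if err_wait == lr_patience:
                lr, err_wait = lr / 2.0, 0     # halve the learning rate
        if auc > best_auc:
            best_auc, auc_wait = auc, 0
        else:
            auc_wait += 1
            if auc_wait == stop_patience:
                break                          # terminate training
    return epochs, lr


# Made-up validation curves, for illustration only.
val_errors = [1.0, 0.9, 0.95, 0.95, 0.95, 0.8, 0.85, 0.85, 0.85, 0.85]
val_aucs = [0.5, 0.6, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7]
epochs_run, final_lr = run_schedule(val_errors, val_aucs)
```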
216
Prediction
217
Model evaluation - attempt 1
Can we detect cancer somewhere in WSI?
Denote by F the function computed by our model. I.e., given a patch I, F(I) is the
output value of the single output neuron with logistic activation function.
Interpret F(I) as the probability of cancer in the patch.
Predict WSI positive iff at least one patch I satisﬁes F(I) ≥ t for
a ﬁxed threshold t ∈ [0, 1].
Choosing t close to 1, we have achieved 100% accuracy, i.e.,
slide positive iff predicted positive. Problem solved ... No?
218
Model evaluation - attempt 2
Can we detect cancer in patches?
Predict I positive iff F(I) ≥ 0.75
Ok, does it detect cancer?
219
Model evaluation – attempt 3 – FROC
Detect particular tumors ?
How to evaluate the quality of tumor detection?
220
Model evaluation – attempt 3 – FROC
sensitivity ≈ the proportion of tumors containing at least one
patch I with F(I) ≥ t w.r.t. all tumors in all slides
AvgFP ≈ average number of patches I with F(I) ≥ t in each
non-cancerous slide
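One point of the FROC curve, following the two approximate definitions above, might be computed like this. The input layout (a list of patch scores per tumor and per non-cancerous slide) and the scores themselves are assumptions for illustration.

```python
def froc_point(tumors, healthy_slides, t):
    """tumors: one list of patch scores F(I) per tumor.
    healthy_slides: one list of patch scores per non-cancerous slide."""
    detected = sum(1 for scores in tumors if any(s >= t for s in scores))
    sensitivity = detected / len(tumors)
    false_pos = sum(sum(1 for s in scores if s >= t) for scores in healthy_slides)
    avg_fp = false_pos / len(healthy_slides)
    return sensitivity, avg_fp


# Made-up scores: three tumors and two non-cancerous slides.
tumors = [[0.9, 0.2], [0.4, 0.6], [0.1]]
healthy = [[0.1, 0.8, 0.7], [0.2]]
point = froc_point(tumors, healthy, 0.5)
```

Sweeping the threshold t over [0, 1] and plotting sensitivity against AvgFP gives the whole curve.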
221
Explainable methods (XAI)
222
XAI methods
The goal is to understand how and why the network does what
it does.
We will consider classiﬁcation models only.
Methods based on various principles:
▶ Visualize weights and feature maps
▶ Visualize most important inputs for a given class
▶ Visualize the effect of input perturbations on the output
▶ Construct an interpretable surrogate model
223
Alex-net - ﬁlters of the ﬁrst convolutional layer
▶ 64 ﬁlters of depth 3 (RGB)
▶ Each filter's RGB channels are combined into one RGB image
of size 11x11x3.
224
CNN - feature maps
225
CNN - feature maps - radar target classiﬁcation
Synthetic-aperture radar (SAR) – used to create two-dimensional images or
three-dimensional reconstructions of objects, such as landscapes.
226
Maximizing input
Now what if we try to ﬁnd the most "representative" input vector
for a given class?
Assume a trained model giving a score for each class given
an input vector.
▶ Denote by ξi(⃗x) the inner potential of the output neuron
i ∈ Y given a network input vector ⃗x.
▶ Maximize
\[ \xi_i(\vec{x}) - \lambda \|\vec{x}\|_2^2 \]
over all input vectors $\vec{x}$.
▶ A maximizing input vector is computed using gradient
ascent.
▶ Gives the most "representative" input vector of the class
represented by the neuron i.
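A minimal sketch of the gradient ascent, assuming for illustration that the score is linear, $\xi_i(\vec{x}) = \vec{w} \cdot \vec{x}$, so its gradient is simply $\vec{w}$. For a real network the gradient w.r.t. the input would come from backpropagation. The function name, step size, and numbers are hypothetical.

```python
def maximize_input(grad_score, dim, lam, lr=0.1, steps=500):
    """Gradient ascent on  xi_i(x) - lam * ||x||_2^2  starting from x = 0."""
    x = [0.0] * dim
    for _ in range(steps):
        g = grad_score(x)
        # gradient of the objective: grad xi_i(x) - 2 * lam * x
        x = [xk + lr * (gk - 2.0 * lam * xk) for xk, gk in zip(x, g)]
    return x


w = [1.0, -2.0, 0.5]                       # assumed linear score xi(x) = w . x
x_star = maximize_input(lambda x: w, dim=3, lam=0.25)
```

In this linear case the maximizer is known in closed form, $\vec{x} = \vec{w} / (2\lambda)$, which makes the loop easy to verify.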
227
Maximizing input - example
228
Input speciﬁc saliency maps
The goal: Label features in a given input that are "most
important" for the output of the network.
Various approaches:
▶ gradient based
▶ Gradient saliency maps
▶ GradCAM
▶ · · ·
▶ occlusion based
▶ Simple occlusion maps
▶ LIME
▶ · · ·
229
Gradient based saliency
▶ Let us ﬁx an output neuron i and an input vector ⃗x.
▶ Idea: Rank every input neuron k ∈ X based on its
inﬂuence on the value ξi(⃗x).
Note that the vector of input values is ﬁxed.
For every input neuron k ∈ X we consider
\[ \frac{\partial \xi_i}{\partial y_k}(\vec{x}) \]
to measure the importance of the input $y_k$ for the output
potential $\xi_i$ with respect to the particular input vector $\vec{x}$.
▶ Note that saliency comes from a surrogate local linear
model given by the first-order Taylor approximation:
\[ \xi_i(\vec{x}') \approx \xi_i(\vec{x}) + \frac{\partial \xi_i}{\partial X}(\vec{x}) \, (\vec{x}' - \vec{x}) \]
Here $\frac{\partial \xi_i}{\partial X}$ is the vector of all partial derivatives $\frac{\partial \xi_i}{\partial y_k}$ where
k ∈ X.
230
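As an illustration of gradient saliency, the sketch below uses a hypothetical two-layer network $\xi(\vec{x}) = \sum_h v_h \,\mathrm{ReLU}(\sum_k W_{hk} x_k)$, for which the partial derivatives can be written down directly. The weights and the input are made up.

```python
def potential(W, v, x):
    """Output potential of the assumed toy network."""
    hidden = [max(0.0, sum(wk * xk for wk, xk in zip(row, x))) for row in W]
    return sum(vh * h for vh, h in zip(v, hidden))


def saliency(W, v, x):
    """Vector of partial derivatives of the potential w.r.t. every input."""
    grad = [0.0] * len(x)
    for row, vh in zip(W, v):
        pre = sum(wk * xk for wk, xk in zip(row, x))
        if pre > 0.0:                      # ReLU passes gradient only when active
            for k, wk in enumerate(row):
                grad[k] += vh * wk
    return grad


W = [[1.0, -1.0, 0.0], [0.5, 0.5, 2.0]]
v = [2.0, -1.0]
x = [1.0, 0.2, -0.5]
sal = saliency(W, v, x)
```

Here the second hidden neuron is inactive on `x`, so the third input gets zero saliency even though it carries a large weight: the ranking is specific to the fixed input.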
Saliency maps - example
231
Saliency maps - example
Quite noisy, the signal is spread and does not say much about
the perception of the owl.
232
Saliency maps - example
SmoothGrad:
▶ Do the following several times:
▶ Add noise to the input image
▶ Compute a saliency map
▶ Average the resulting saliency maps.
233
GradCAM
▶ Consider a convolutional network and ﬁx an input image I
of the network.
ALL values of all neurons yj are computed on the input I.
▶ Fix a convolutional layer L consisting of convolutional
feature maps F1, . . . , Fk .
Each $F_\ell$ is a set of neurons that belong to the feature map.
Slightly abusing notation, we write $F_\ell(I)$ to denote
the tensor of all values of all neurons in $F_\ell$ on the input I.
▶ Fix an output neuron i ∈ Y with the inner potential $\xi_i$.
Compute the average importance of $F_\ell(I)$:
\[ \alpha^\ell_i = \frac{1}{|F_\ell|} \sum_{j \in F_\ell} \frac{\partial \xi_i}{\partial y_j}(I) \]
and the final gradCAM heat map for L is obtained using
\[ M^L_i = \mathrm{ReLU}\left( \sum_{\ell=1}^{k} \alpha^\ell_i F_\ell(I) \right) \]
234
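Given the feature-map values and the gradients of the chosen output potential w.r.t. those values (which would come from backpropagation), the heat map is a few lines of code. The nested-list layout and the toy numbers are assumptions of this sketch.

```python
def grad_cam(feature_maps, gradients):
    """feature_maps[l][r][c]: value of a neuron of the l-th feature map on I.
    gradients[l][r][c]: derivative of xi_i w.r.t. that neuron's value."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    heat = [[0.0] * cols for _ in range(rows)]
    for fmap, grad in zip(feature_maps, gradients):
        # average importance alpha of this feature map
        alpha = sum(sum(row) for row in grad) / (rows * cols)
        for r in range(rows):
            for c in range(cols):
                heat[r][c] += alpha * fmap[r][c]
    return [[max(0.0, h) for h in row] for row in heat]   # ReLU


feature_maps = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]
gradients = [[[1, 1], [1, 1]], [[-4, -4], [-4, -4]]]
heat_map = grad_cam(feature_maps, gradients)
```

The resulting map has the spatial resolution of the chosen layer, which is why it has to be rescaled before being overlaid on the input image.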
GradCAM on VGG16
Consider the last convolutional layer of the VGG16 (Block5,
Conv3)
235
GradCAM on VGG16
From left to right:
▶ An image of a cat (has to be resized to 224 × 224 to ﬁt
VGG16)
▶ The gradCAM heat map for the last convolutional layer and
the class "cat"
▶ Rescaled and smoothed gradCAM heat map.
▶ The gradCAM overlay.
236
Occlusion
▶ Systematically cover parts of the input image.
▶ Observe the effect on the output value.
▶ Find regions with the largest effect.
237
Occlusion - example
238
Occlusion - example
239
LIME - for images
Let us ﬁx an image I to be explained.
Outline:
▶ Consider superpixels of I as interpretable components.
▶ Construct a linear model approximating the network around
the image I with weights corresponding to the superpixels.
▶ Select the superpixels with weights of large magnitude as
the important ones.
240
Superpixels as interpretable components
Denote by P1, . . . , Pℓ all superpixels of I.
Consider binary vectors $\vec{x} = (x_1, \ldots, x_\ell) \in \{0, 1\}^\ell$.
Each such vector $\vec{x}$ determines a "subimage" $I[\vec{x}]$ of
I obtained by removing all $P_k$ with $x_k = 0$.
241
LIME
▶ Let us ﬁx an output neuron i, we denote by ξi(J) the inner
potential of the output neuron i for the input image J.
▶ Given the image I to be interpreted, consider the following
training set:
\[ T = \{ (\vec{x}_1, \xi_i(I[\vec{x}_1])), \ldots, (\vec{x}_p, \xi_i(I[\vec{x}_p])) \} \]
Here $\vec{x}_h = (x_{h1}, \ldots, x_{h\ell})$ are (some) binary vectors from $\{0, 1\}^\ell$,
e.g., randomly selected.
▶ Train a linear model (ADALINE) with weights w0, w1, . . . , wℓ
on T .
Intuitively, the linear model approximates the network on "subimages" of
I obtained by removing some superpixels.
▶ Inspect the weights (magnitude and sign).
242
LIME
More precisely, we train a linear model (ADALINE) F with weights
$\vec{w} = (w_0, w_1, \ldots, w_\ell)$ on T minimizing the weighted mean-squared error
\[ E(\vec{w}) = \frac{1}{p} \sum_{k=1}^{p} \pi_k \cdot \left( F(\vec{x}_k) - \xi_i(I[\vec{x}_k]) \right)^2 + \Omega(\vec{w}) \]
where
▶ the weights are defined by
\[ \pi_k = \exp\left( \frac{ -\left( 1 - \sqrt{1 - (s_k/\ell)} \right)^2 }{ 2\nu^2 } \right) \]
Here $s_k$ is the number of elements in $\vec{x}_k$ equal to zero, ℓ is
the number of superpixels, ν determines how much perturbed
images are taken into account in the error.
Small ν means that $\pi_k$ is close to zero for $\vec{x}_k$ with many zeros.
▶ $\Omega(\vec{w})$ is a regularization term making the number of non-zero
weights as small as possible.
243
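The surrogate fit can be sketched as plain gradient descent on the weighted error. Several things here are assumptions made only for illustration: the regularization term Ω is omitted, the "network" is a hypothetical black box that happens to be linear in three superpixels, and all $2^3$ binary vectors are used instead of a random sample.

```python
import math
from itertools import product


def kernel_weight(x, nu):
    """pi_k for a binary vector x, with s = number of removed superpixels."""
    s = x.count(0)
    dist = 1.0 - math.sqrt(1.0 - s / len(x))
    return math.exp(-dist ** 2 / (2.0 * nu ** 2))


def fit_surrogate(samples, targets, nu, lr=0.3, steps=2000):
    """Minimize the weighted mean-squared error (without Omega) by gradient descent."""
    l = len(samples[0])
    w = [0.0] * (l + 1)                     # w[0] is the bias weight w_0
    pis = [kernel_weight(x, nu) for x in samples]
    for _ in range(steps):
        grad = [0.0] * (l + 1)
        for x, target, pi in zip(samples, targets, pis):
            xt = (1,) + tuple(x)            # prepend the formal unit input
            residual = sum(wi * xi for wi, xi in zip(w, xt)) - target
            for i, xi in enumerate(xt):
                grad[i] += 2.0 * pi * residual * xi / len(samples)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def black_box(x):
    """Stand-in for xi_i(I[x]); a real use would evaluate the network on I[x]."""
    return 0.5 + 3.0 * x[0] - 1.0 * x[2]


samples = list(product((0, 1), repeat=3))
targets = [black_box(x) for x in samples]
w = fit_surrogate(samples, targets, nu=1.0)
```

The recovered weights mark the first superpixel as strongly supporting the class, the third as opposing it, and the second as irrelevant.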
LIME - example
244
Recurrent Neural Networks - LSTM
248
RNN
▶ Input:
⃗x = (x1, . . . , xM)
▶ Hidden:
⃗h = (h1, . . . , hH)
▶ Output:
⃗y = (y1, . . . , yN)
249
RNN example
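Activation function:
\[ \sigma(\xi) = \begin{cases} 1 & \xi \ge 0 \\ 0 & \xi < 0 \end{cases} \]
y: $\vec{y}_1 = 1$, $\vec{y}_2 = 0$, $\vec{y}_3 = 1$
h: $\vec{h}_0 = (0, 0)$, $\vec{h}_1 = (1, 1)$, $\vec{h}_2 = (1, 0)$, $\vec{h}_3 = (0, 1)$, · · ·
x: $\vec{x}_1 = (0, 0)$, $\vec{x}_2 = (1, 0)$, $\vec{x}_3 = (1, 1)$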
250
RNN – formally
▶ M inputs: ⃗x = (x1, . . . , xM)
▶ H hidden neurons: ⃗h = (h1, . . . , hH)
▶ N output neurons: ⃗y = (y1, . . . , yN)
▶ Weights:
▶ Ukk′ from input xk′ to hidden hk
▶ Wkk′ from hidden hk′ to hidden hk
▶ Vkk′ from hidden hk′ to output yk
251
RNN – formally
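▶ Input sequence: $x = \vec{x}_1, \ldots, \vec{x}_T$
$\vec{x}_t = (x_{t1}, \ldots, x_{tM})$
▶ Hidden sequence: $h = \vec{h}_0, \vec{h}_1, \ldots, \vec{h}_T$
$\vec{h}_t = (h_{t1}, \ldots, h_{tH})$
We have $\vec{h}_0 = (0, \ldots, 0)$ and
\[ h_{tk} = \sigma\left( \sum_{k'=1}^{M} U_{kk'} x_{tk'} + \sum_{k'=1}^{H} W_{kk'} h_{(t-1)k'} \right) \]
▶ Output sequence: $y = \vec{y}_1, \ldots, \vec{y}_T$
$\vec{y}_t = (y_{t1}, \ldots, y_{tN})$
where $y_{tk} = \sigma\left( \sum_{k'=1}^{H} V_{kk'} h_{tk'} \right)$.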
252
RNN – in matrix form
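▶ Input sequence: $x = \vec{x}_1, \ldots, \vec{x}_T$
▶ Hidden sequence: $h = \vec{h}_0, \vec{h}_1, \ldots, \vec{h}_T$ where
$\vec{h}_0 = (0, \ldots, 0)$
and
\[ \vec{h}_t = \sigma(U \vec{x}_t + W \vec{h}_{t-1}) \]
▶ Output sequence: $y = \vec{y}_1, \ldots, \vec{y}_T$ where
\[ \vec{y}_t = \sigma(V \vec{h}_t) \]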
253
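The matrix form translates directly into a short forward pass. The weights below are not taken from the lecture (in the slides they appear only in a figure); they are hypothetical values chosen so that, with the threshold activation, the network reproduces the hidden and output sequences of the earlier RNN example.

```python
def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def rnn_forward(U, W, V, xs, sigma):
    h = [0.0] * len(W)                        # h_0 = (0, ..., 0)
    hs, ys = [h], []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, h))]
        h = [sigma(p) for p in pre]           # h_t = sigma(U x_t + W h_{t-1})
        hs.append(h)
        ys.append([sigma(p) for p in matvec(V, h)])   # y_t = sigma(V h_t)
    return hs, ys


def step(xi):
    return 1.0 if xi >= 0 else 0.0


# Hypothetical weights (M = 2 inputs, H = 2 hidden neurons, N = 1 output).
U = [[1.0, -2.0], [-1.0, 1.0]]
W = [[0.5, -0.5], [1.0, -1.0]]
V = [[-1.0, 1.0]]
xs = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
hs, ys = rnn_forward(U, W, V, xs, step)
```

The same three matrices are reused at every time step, so the loop works for input sequences of any length.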
RNN – Comments
▶ ⃗ht is the memory of the network, captures what happened
in all previous steps (with decaying quality).
▶ RNN shares weights U, V, W along the sequence.
Note the similarity to convolutional networks where the weights were
shared spatially over images, here they are shared temporally over
sequences.
▶ RNN can deal with sequences of variable length.
Compare with MLP which accepts only ﬁxed-dimension vectors on
input.
254
RNN – training
Training set
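\[ T = \{ (x_1, d_1), \ldots, (x_p, d_p) \} \]
where
▶ each $x_\ell = \vec{x}_{\ell 1}, \ldots, \vec{x}_{\ell T_\ell}$ is an input sequence,
▶ each $d_\ell = \vec{d}_{\ell 1}, \ldots, \vec{d}_{\ell T_\ell}$ is an expected output sequence.
Here each $\vec{x}_{\ell t} = (x_{\ell t 1}, \ldots, x_{\ell t M})$ is an input vector and each
$\vec{d}_{\ell t} = (d_{\ell t 1}, \ldots, d_{\ell t N})$ is an expected output vector.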
255
Error function
In what follows I will consider a training set with a single
element (x, d). I.e. drop the index ℓ and have
▶ x = ⃗x1, . . . ,⃗xT where ⃗xt = (xt1, . . . , xtM)
▶ d = ⃗d1, . . . , ⃗dT where ⃗dt = (dt1, . . . , dtN)
The squared error of (x, d) is deﬁned by
\[ E_{(x,d)} = \sum_{t=1}^{T} \sum_{k=1}^{N} \frac{1}{2} (y_{tk} - d_{tk})^2 \]
Recall that we have a sequence of network outputs
$y = \vec{y}_1, \ldots, \vec{y}_T$ and thus $y_{tk}$ is the k-th component of $\vec{y}_t$.
256
Gradient descent (single training example)
Consider a single training example (x, d).
The algorithm computes a sequence of weight matrices as
follows:
▶ Initialize all weights randomly close to 0.
▶ In the step ℓ + 1 (here ℓ = 0, 1, 2, . . .) compute "new"
weights U(ℓ+1), V(ℓ+1), W(ℓ+1) from the "old" weights
U(ℓ), V(ℓ), W(ℓ) as follows:
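\[ U^{(\ell+1)}_{kk'} = U^{(\ell)}_{kk'} - \varepsilon(\ell) \cdot \frac{\delta E_{(x,d)}}{\delta U_{kk'}} \]
\[ V^{(\ell+1)}_{kk'} = V^{(\ell)}_{kk'} - \varepsilon(\ell) \cdot \frac{\delta E_{(x,d)}}{\delta V_{kk'}} \]
\[ W^{(\ell+1)}_{kk'} = W^{(\ell)}_{kk'} - \varepsilon(\ell) \cdot \frac{\delta E_{(x,d)}}{\delta W_{kk'}} \]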
The above is THE learning algorithm that modiﬁes weights!
257
Backpropagation
Computes the derivatives of E, no weights are modiﬁed!
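\[ \frac{\delta E_{(x,d)}}{\delta U_{kk'}} = \sum_{t=1}^{T} \frac{\delta E_{(x,d)}}{\delta h_{tk}} \cdot \sigma' \cdot x_{tk'} \qquad k' = 1, \ldots, M \]
\[ \frac{\delta E_{(x,d)}}{\delta V_{kk'}} = \sum_{t=1}^{T} \frac{\delta E_{(x,d)}}{\delta y_{tk}} \cdot \sigma' \cdot h_{tk'} \qquad k' = 1, \ldots, H \]
\[ \frac{\delta E_{(x,d)}}{\delta W_{kk'}} = \sum_{t=1}^{T} \frac{\delta E_{(x,d)}}{\delta h_{tk}} \cdot \sigma' \cdot h_{(t-1)k'} \qquad k' = 1, \ldots, H \]
Backpropagation:
\[ \frac{\delta E_{(x,d)}}{\delta y_{tk}} = y_{tk} - d_{tk} \quad \text{(assuming squared error)} \]
\[ \frac{\delta E_{(x,d)}}{\delta h_{tk}} = \sum_{k'=1}^{N} \frac{\delta E_{(x,d)}}{\delta y_{tk'}} \cdot \sigma' \cdot V_{k'k} + \sum_{k'=1}^{H} \frac{\delta E_{(x,d)}}{\delta h_{(t+1)k'}} \cdot \sigma' \cdot W_{k'k} \]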
258
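The formulas above can be sketched as a backward loop over time. The example assumes logistic activations everywhere (so that $\sigma' = y(1 - y)$), a tiny network with made-up weights, and the squared error. It only computes derivatives, in line with the remark that backpropagation itself modifies no weights.

```python
import math


def sig(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(U, W, V, xs):
    H = len(W)
    hs, ys = [[0.0] * H], []
    for x in xs:
        prev = hs[-1]
        h = [sig(sum(U[k][i] * x[i] for i in range(len(x))) +
                 sum(W[k][j] * prev[j] for j in range(H))) for k in range(H)]
        hs.append(h)
        ys.append([sig(sum(V[n][j] * h[j] for j in range(H))) for n in range(len(V))])
    return hs, ys


def error(U, W, V, xs, ds):
    _, ys = forward(U, W, V, xs)
    return sum(0.5 * (y - d) ** 2 for yt, dt in zip(ys, ds) for y, d in zip(yt, dt))


def bptt(U, W, V, xs, ds):
    hs, ys = forward(U, W, V, xs)
    H, M, N, T = len(W), len(xs[0]), len(V), len(xs)
    gU = [[0.0] * M for _ in range(H)]
    gW = [[0.0] * H for _ in range(H)]
    gV = [[0.0] * H for _ in range(N)]
    next_delta = [0.0] * H        # dE/dh_{t+1,k} * sigma'; zero beyond t = T
    for t in range(T, 0, -1):
        h, y = hs[t], ys[t - 1]
        # dE/dy_{tk} * sigma' for the logistic output neurons
        out_delta = [(y[n] - ds[t - 1][n]) * y[n] * (1.0 - y[n]) for n in range(N)]
        dE_dh = [sum(out_delta[n] * V[n][k] for n in range(N)) +
                 sum(next_delta[j] * W[j][k] for j in range(H)) for k in range(H)]
        delta = [dE_dh[k] * h[k] * (1.0 - h[k]) for k in range(H)]
        for n in range(N):
            for k in range(H):
                gV[n][k] += out_delta[n] * h[k]
        for k in range(H):
            for i in range(M):
                gU[k][i] += delta[k] * xs[t - 1][i]
            for j in range(H):
                gW[k][j] += delta[k] * hs[t - 1][j]
        next_delta = delta
    return gU, gW, gV


# Made-up weights and a single training example (x, d) with T = 3.
U = [[0.3, -0.2], [0.1, 0.4]]
W = [[0.5, -0.3], [0.2, 0.1]]
V = [[0.7, -0.6]]
xs = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
ds = [(1.0,), (0.0,), (1.0,)]
gU, gW, gV = bptt(U, W, V, xs, ds)
```

Because the weights are shared along the sequence, each gradient entry accumulates one term per time step, exactly as the sums over t indicate.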
Long-term dependencies
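\[ \frac{\delta E_{(x,d)}}{\delta h_{tk}} = \sum_{k'=1}^{N} \frac{\delta E_{(x,d)}}{\delta y_{tk'}} \cdot \sigma' \cdot V_{k'k} + \sum_{k'=1}^{H} \frac{\delta E_{(x,d)}}{\delta h_{(t+1)k'}} \cdot \sigma' \cdot W_{k'k} \]
▶ Unless $\sum_{k'=1}^{H} \sigma' \cdot W_{k'k} \approx 1$, the gradient either vanishes, or
explodes.
▶ For a large T (long-term dependency), the gradient
"deeper" in the past tends to be too small (large).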
▶ A solution: LSTM
LSTM is currently a bit obsolete. The main idea is to decompose W into
several matrices, each responsible for a different task. One is
concerned about memory, one is concerned about the output at each
step, etc.
https://arxiv.org/pdf/2205.13504.pdf
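The effect of repeatedly multiplying by the recurrent factor can be seen in a scalar toy calculation. It assumes a single hidden neuron and, as a simplification, a per-step factor $\sigma' \cdot W$ that stays constant along the sequence.

```python
def backprop_factor(step_factor, T):
    """Product of T copies of the per-step factor sigma' * W (scalar case)."""
    f = 1.0
    for _ in range(T):
        f *= step_factor
    return f


vanishing = backprop_factor(0.9, 100)   # factor slightly below 1
exploding = backprop_factor(1.1, 100)   # factor slightly above 1
stable = backprop_factor(1.0, 100)      # factor exactly 1
```

Even a factor within 10% of 1 changes the gradient by more than four orders of magnitude over 100 steps.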
259
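LSTM
$\vec{h}_t = \vec{o}_t \circ \sigma_h(\vec{C}_t)$ (output)
$\vec{C}_t = \vec{f}_t \circ \vec{C}_{t-1} + \vec{i}_t \circ \tilde{C}_t$ (memory)
$\tilde{C}_t = \sigma_h(W_C \cdot \vec{h}_{t-1} + U_C \cdot \vec{x}_t)$ (new memory contents)
$\vec{o}_t = \sigma_g(W_o \cdot \vec{h}_{t-1} + U_o \cdot \vec{x}_t)$ (output gate)
$\vec{f}_t = \sigma_g(W_f \cdot \vec{h}_{t-1} + U_f \cdot \vec{x}_t)$ (forget gate)
$\vec{i}_t = \sigma_g(W_i \cdot \vec{h}_{t-1} + U_i \cdot \vec{x}_t)$ (input gate)
▶ $\circ$ is the component-wise product of vectors
▶ $\cdot$ is the matrix-vector product
▶ $\sigma_h$ is the hyperbolic tangent (applied component-wise)
▶ $\sigma_g$ is the logistic sigmoid (applied component-wise)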
260
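One LSTM step following the six equations above might look as follows. The parameter dictionary with keys such as `"Wo"` and `"Uo"` is a hypothetical layout chosen for this sketch, and the all-zero weights are used only to make the result easy to check by hand.

```python
import math


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(p, h_prev, c_prev, x):
    """p maps "W<name>" / "U<name>" to matrices for name in o, f, i, C."""
    def gate(name, act):
        pre = [a + b for a, b in zip(matvec(p["W" + name], h_prev),
                                     matvec(p["U" + name], x))]
        return [act(z) for z in pre]

    o = gate("o", logistic)            # output gate
    f = gate("f", logistic)            # forget gate
    i = gate("i", logistic)            # input gate
    c_new = gate("C", math.tanh)       # new memory contents
    c = [fk * ck + ik * nk for fk, ck, ik, nk in zip(f, c_prev, i, c_new)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# One hidden unit, one input, all weights zero: every gate equals 0.5.
params = {key: [[0.0]] for key in ("Wo", "Uo", "Wf", "Uf", "Wi", "Ui", "WC", "UC")}
h, c = lstm_step(params, [0.0], [2.0], [1.0])
```

With these weights the memory is simply halved by the forget gate, which shows how the gate values directly scale what is carried over from $\vec{C}_{t-1}$.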
RNN vs LSTM
Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
261
LSTM – summary
▶ LSTM (almost) solves the vanishing gradient problem w.r.t.
the "internal" state of the network.
▶ Learns to control its own memory (via forget gate).
▶ Revolution in machine translation and text processing.
... but the development goes on ...
RNN text generator
Generating texts letter by letter.
Shakespeare
▶ Generating Shakespeare letter by letter.
▶ Trained on Shakespeare’s plays (4.4MB).
VIOLA: Why, Salisbury must ﬁnd his ﬂesh and thought That which I
am not aps, not a man and in ﬁre, To show the reining of the raven
and the wars To grace my hand reproach within, and not a fair are
hand, That Caesar and my goodly father’s world; When I was heaven
of presence and our ﬂeets, We spare with hours, but cut thy council I
am great, Murdered and by thy master’s ready there My power to give
thee but so much as hell: Some service in the noble bondman here,
Would show him to her wine.
KING LEAR: O, if you were a feeble sight, the courtesy of your law,
Your sight and several breath, will wear the gods With his heads, and
my hands are wonder’d at the deeds, So drop upon your lordship’s
head, and your opinion Shall be against your honour.
Wikipedia
▶ Hutter Prize 100MB dataset from Wikipedia (96MB)
Naturalism and decision for the majority of Arab countries’ capitalide was
grounded by the Irish language by [[John Clair]], [[An Imperial Japanese
Revolt]], associated with Guangzham’s sovereignty. His generals were the
powerful ruler of the Portugal in the [[Protestant Immineners]], which could be
said to be directly in Cantonese Communication, which followed a ceremony
and set inspired prison, training. The emperor travelled back to [[Antioch,
Perth, October 25|21]] to note, the Kingdom of Costa Rica, unsuccessful
fashioned the [[Thrales]], [[Cynth’s Dajoard]], known in western [[Scotland]],
near Italy to the conquest of India with the conﬂict. Copyright was the
succession of independence in the slop of Syrian inﬂuence that was a famous
German movement based on a more popular servicious, non-doctrinal and
sexual power post. Many governments recognize the military housing of the
[[Civil Liberalization and Infantry Resolution 265 National Party in Hungary]],
that is sympathetic to be to the [[Punjab Resolution]] (PJS)[http:
//www.humah.yahoo.com/guardian.cfm/7754800786d17551963s89.htm
Ofﬁcial economics Adjoint for the Nazism, Montgomery was swear to
XML hallucination:
<page>
<title>Antichrist</title>
<id>865</id>
<revision>
<id>15900676</id>
<timestamp>2002-08-03T18:14:12Z</timestamp>
<contributor>
<username>Paris</username>
<id>23</id>
</contributor>
<minor />
<comment>Automated conversion</comment>
<text xml:space="preserve">
#REDIRECT [[Christianity]]</text>
</revision>
</page>
LaTeX
▶ Algebraic geometry textbook.
▶ LaTeX source (16MB).
▶ Almost compilable.
Linux source code
▶ Trained on all source ﬁles of Linux kernel concatenated
into a single ﬁle (474MB of C code).
Evolution of Shakespeare
[Figure: text samples generated after 100, 300, 500, 700, 1200, and 2000 training iterations]
Attention
Consider the following task: Given a sequence of vectors
$$x = \vec{x}_1, \ldots, \vec{x}_T$$
generate a new sequence
$$y = \vec{y}_1, \ldots, \vec{y}_{T'}$$
of possibly different length (i.e., possibly $T \neq T'$).
E.g., in a machine translation task, x is an embedding of an
English sentence and y is a sequence of probability distributions on
a German vocabulary.
Attention
Consider two recurrent networks:
▶ Enc, the encoder
  ▶ The hidden state $\vec{h}_0$ is initialized by standard methods for
    recurrent networks.
  ▶ Reads $\vec{x}_1, \ldots, \vec{x}_T$, does not output anything, but produces
    a sequence of hidden states $\vec{h}_1, \ldots, \vec{h}_T$.
▶ Dec, the decoder
  ▶ The initial hidden state is $\vec{h}_T$.
  ▶ Does not read anything but outputs the sequence $\vec{y}_1, \ldots, \vec{y}_{T'}$.
    This is a simplification. Typically, Dec reads $\vec{y}_0, \vec{y}_1, \ldots, \vec{y}_{T'-1}$, where
    $\vec{y}_0$ is a special vector embedding a separator.
Trained on pairs of sentences, this architecture is able to learn a fine translation
between major languages (if the recurrent networks are LSTMs).
It is not perfect, because all information about $x = \vec{x}_1, \ldots, \vec{x}_T$ is squeezed
into the single state vector $\vec{h}_T$.
In particular, the network tends to forget the context of each word.
Attention in Recurrent Networks
What if we provide the decoder with information about
the relevant context of the generated word?
We use the same encoder Enc producing the sequence of
hidden states $\vec{h}_1, \ldots, \vec{h}_T$.
The decoder Dec is still a recurrent network, but
▶ its hidden state $\vec{h}'_0$ is initialized by $\vec{h}_T$, and a sequence of
hidden states $\vec{h}'_0, \ldots, \vec{h}'_{T'}$ is computed,
▶ it reads a sequence of context vectors $\vec{c}_1, \ldots, \vec{c}_{T'}$ where
$$\vec{c}_i = \sum_{j=1}^{T} \alpha_{ij} \vec{h}_j \quad \text{where} \quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})} \quad \text{where} \quad e_{ij} = \mathrm{MLP}(\vec{h}'_{i-1}, \vec{h}_j)$$
▶ it outputs the sequence $\vec{y}_1, \ldots, \vec{y}_{T'}$.
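The computation of one context vector can be sketched as follows. The scoring function is passed in as a parameter because the slides only say it is an MLP; the `dot_score` used here is a plain dot product and is purely a stand-in assumption, as are all function names.

```python
import math


def softmax(scores):
    """Numerically stable softmax of a list of real numbers."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(dec_state, enc_states, score):
    """Context vector c_i = sum_j alpha_ij * h_j for one decoder step.

    `dec_state` plays the role of h'_{i-1}, `enc_states` is [h_1, ..., h_T],
    and `score(dec_state, h_j)` returns e_ij (the MLP in the slides).
    """
    alphas = softmax([score(dec_state, h) for h in enc_states])
    dim = len(enc_states[0])
    context = [
        sum(a * h[d] for a, h in zip(alphas, enc_states)) for d in range(dim)
    ]
    return context, alphas


def dot_score(a, b):
    """Stand-in for the learned MLP: a plain dot product (an assumption)."""
    return sum(x * y for x, y in zip(a, b))
```

The weights `alphas` are non-negative and sum to 1, so the context vector is a weighted average of the encoder states, dominated by the states that score highest against the current decoder state.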
Do We Still Need the Recurrence?
▶ The attention mechanism extracts the information from
the sequence quite well.
▶ Is there a reason for reading the input sequence
sequentially?
▶ Could we remove the recurrent network itself and preserve
only the attention?
Self-Attention Layer (is all you need)
Fix an input sequence: $\vec{x}_1, \ldots, \vec{x}_T$
Consider three learnable matrices: $W_q, W_k, W_v$
Generate sequences of queries, keys, and values:
▶ $\vec{q}_1, \ldots, \vec{q}_T$ where $\vec{q}_k = W_q \vec{x}_k$ for all $k = 1, \ldots, T$
▶ $\vec{k}_1, \ldots, \vec{k}_T$ where $\vec{k}_k = W_k \vec{x}_k$ for all $k = 1, \ldots, T$
▶ $\vec{v}_1, \ldots, \vec{v}_T$ where $\vec{v}_k = W_v \vec{x}_k$ for all $k = 1, \ldots, T$
Define a score for all $i, j \in \{1, \ldots, T\}$ by
$$e_{ij} = \vec{q}_i \cdot \vec{k}_j$$
Intuitively, $e_{ij}$ measures how much the input at position i is related to the
input at position j, in other words, how well the query fits the key.
Define
$$\alpha_{ij} = \frac{\exp(e_{ij} / \sqrt{d_{attn}})}{\sum_{k=1}^{T} \exp(e_{ik} / \sqrt{d_{attn}})}$$
$d_{attn}$ is the dimension of $\vec{v}_i$ (the queries and keys are taken to have the same dimension; the original paper scales by the dimension of the keys).
I.e., we apply the good old softmax to $(e_{i1}, \ldots, e_{iT}) / \sqrt{d_{attn}}$.
Define a sequence of outputs $\vec{y}_1, \ldots, \vec{y}_T$ by
$$\vec{y}_i = \sum_{j=1}^{T} \alpha_{ij} \cdot \vec{v}_j$$
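A minimal single-head self-attention sketch in plain Python follows. It assumes vectors are lists of floats, matrices are lists of rows, and that queries, keys and values share the dimension d_attn; the function names are chosen for this illustration only.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax of a list of real numbers."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(xs, w_q, w_k, w_v):
    """Single-head self-attention over the sequence xs = [x_1, ..., x_T]."""
    queries = [matvec(w_q, x) for x in xs]
    keys = [matvec(w_k, x) for x in xs]
    values = [matvec(w_v, x) for x in xs]
    dim = len(values[0])
    scale = math.sqrt(dim)  # sqrt(d_attn)
    outputs = []
    for q in queries:
        # Scaled scores e_ij / sqrt(d_attn) of this query against all keys.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        alphas = softmax(scores)
        # Output y_i: attention-weighted sum of the values.
        outputs.append(
            [sum(a * v[d] for a, v in zip(alphas, values)) for d in range(dim)]
        )
    return outputs
```

Note that every output position is computed from the whole input sequence at once, with no recurrence; the two nested passes over the sequence make the cost quadratic in T.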
Language Model
A sequence of tokens $a_1, \ldots, a_T \in \Sigma^*$
E.g. words from a vocabulary $\Sigma$.
The goal: Maximize
$$\prod_{k=1}^{T} P(a_k \mid a_1, \ldots, a_{k-1}; W) \quad (= P(a_1, \ldots, a_T; W))$$
where
▶ P is the conditional probability measure over $\Sigma$ modeled
using a neural network with weights W.
Can be used to generate text:
Given $a_1, \ldots, a_k$, sample $a_{k+1}$ from $P(a_{k+1} \mid a_1, \ldots, a_k; W)$
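The factorization of the sequence probability into conditionals can be illustrated with a toy model. The `toy_model` function is a hypothetical stand-in for the neural network: it returns a distribution over the two-token vocabulary {"a", "b"} and looks only at the last token of the prefix. In practice one sums logarithms instead of multiplying probabilities, which avoids numerical underflow for long sequences.

```python
import math


def toy_model(prefix):
    """Hypothetical stand-in for the network: P(. | prefix) over {"a", "b"}.

    It only looks at the last token; a real model would use the whole prefix.
    """
    if prefix and prefix[-1] == "a":
        return {"a": 0.25, "b": 0.75}
    return {"a": 0.5, "b": 0.5}


def sequence_log_prob(tokens, model):
    """log P(a_1, ..., a_T) = sum_k log P(a_k | a_1, ..., a_{k-1})."""
    return sum(
        math.log(model(tokens[:k])[tokens[k]]) for k in range(len(tokens))
    )


if __name__ == "__main__":
    log_p = sequence_log_prob(["a", "b", "a"], toy_model)
    print(math.exp(log_p))  # product of the three conditional probabilities
```

For the sequence a, b, a the three conditionals are 0.5, 0.75 and 0.5, so the sequence probability is their product.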
GPT
Masked Self-Attention Layer (is all you need)
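Assume an attention mechanism which, given an input
sequence $\vec{x}_1, \ldots, \vec{x}_T$, generates $\vec{y}_1, \ldots, \vec{y}_T$.
The Problem: How to generate $\vec{y}_k$ only based on $\vec{x}_1, \ldots, \vec{x}_{k-1}$?
Define a score for all $i, j \in \{1, \ldots, T\}$ by
$$e_{ij} = \begin{cases} \vec{q}_i \cdot \vec{k}_j & \text{if } j < i \\ -\infty & \text{otherwise.} \end{cases}$$
This means that
$$\alpha_{ij} = \begin{cases} \dfrac{\exp(e_{ij} / \sqrt{d_{attn}})}{\sum_{k=1}^{T} \exp(e_{ik} / \sqrt{d_{attn}})} & \text{if } j < i \\ 0 & \text{otherwise.} \end{cases}$$
Define a sequence of outputs $\vec{y}_1, \ldots, \vec{y}_T$ by
$$\vec{y}_i = \sum_{j=1}^{T} \alpha_{ij} \cdot \vec{v}_j$$
Note: with the strict condition $j < i$, the first position ($i = 1$) has no admissible $j$, so its softmax is undefined. Implementations usually mask with $j \le i$, which also matches the GPT setup where $\vec{y}_\ell$ predicts $a_{\ell+1}$ from $a_1, \ldots, a_\ell$.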
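A sketch of the masking step in plain Python follows. It assumes the scaled scores are already given as a T x T matrix and uses the non-strict mask j ≤ i (each position may attend to itself), which is the variant commonly used in practice because it keeps the first row well defined; with the strict mask j < i the first position would have nothing to attend to. The function names are illustrative.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax of a T x T score matrix with a causal mask.

    Position i may attend to positions j <= i only; masked scores are
    treated as -infinity, so their weights are exactly 0.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        top = max(visible)
        exps = [math.exp(s - top) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


def attend(weights, values):
    """Outputs y_i = sum_j alpha_ij * v_j."""
    dim = len(values[0])
    return [
        [sum(a * v[d] for a, v in zip(row, values)) for d in range(dim)]
        for row in weights
    ]
```

Because the weights of all later positions are exactly zero, changing a later value vector cannot affect an earlier output, which is what makes the layer usable for left-to-right generation.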
Multi-head Self-Attention Layer (is all you need)
Assume the number of heads is H.
For $h = 1, \ldots, H$, the h-th head is an attention mechanism which,
given the input $\vec{x}_1, \ldots, \vec{x}_T$, produces
$$\vec{y}^h_1, \ldots, \vec{y}^h_T$$
Note that the outputs of different heads may differ; in particular, the
matrices $W_q, W_k, W_v$ may be different for each head.
Assume that all vectors $\vec{y}^h_k$ are of the same dimension $d_{mid}$ and
consider a learnable matrix $W_{out}$ of dimensions $d_{out} \times (H \cdot d_{mid})$.
The multi-head attention produces the following output:
$$\vec{y}_1, \ldots, \vec{y}_T$$
where
$$\vec{y}_k = W_{out} \cdot (\vec{y}^1_k \odot \vec{y}^2_k \odot \cdots \odot \vec{y}^H_k)$$
Here $\odot$ is the concatenation of vectors.
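The combination step can be sketched on its own, assuming the per-head outputs have already been computed. In this illustration `head_outputs[h][k]` holds the vector for head h at position k, and `w_out` is stored as a list of rows; both names are assumptions.

```python
def combine_heads(head_outputs, w_out):
    """Concatenate per-head outputs position-wise and apply W_out.

    `head_outputs[h][k]` is the output of head h at position k (dimension
    d_mid); `w_out` is a d_out x (H * d_mid) matrix stored as a list of rows.
    """
    seq_len = len(head_outputs[0])
    result = []
    for k in range(seq_len):
        # Concatenation of the H head outputs at position k.
        concatenated = [value for head in head_outputs for value in head[k]]
        result.append(
            [sum(w * c for w, c in zip(row, concatenated)) for row in w_out]
        )
    return result
```

The sequence length is preserved, while the output dimension d_out is set by the number of rows of `w_out` and need not equal the input dimension.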
Multi-head Self-Attention Summary
Input: A sequence $\vec{x}_1, \ldots, \vec{x}_T$
Output: A sequence $\vec{y}_1, \ldots, \vec{y}_T$
I.e., a sequence of the same length. The dimensions of $\vec{y}_k$ and $\vec{x}_k$ do not have
to be equal.
Attention:
Learnable parameters: Matrices $W_q, W_k, W_v$.
These matrices are used to compute queries, keys, and values from
$\vec{x}_1, \ldots, \vec{x}_T$. Output $\vec{y}_1, \ldots, \vec{y}_T$ is computed using values "scaled" by
the query-key attention.
Multi-head attention:
Learnable parameters:
▶ Matrices $W^h_q, W^h_k, W^h_v$ where $h = 1, \ldots, H$ and H is
the number of heads.
  Each attention head operates independently on the input $\vec{x}_1, \ldots, \vec{x}_T$.
▶ Matrix $W_{out}$.
  Linearly transforms the concatenated results of the attention heads.
GPT - transformer
Positional encoding
The Goal: To encode a position (index) $k \in \{1, \ldots, T\}$ into
a vector $\vec{P}_k$ of real numbers.
Assume that $\vec{P}_k$ should have an even dimension d.
Given a position $k \in \{1, \ldots, T\}$ and $i \in \{0, \ldots, d/2 - 1\}$ define
$$P_{k,2i} = \sin\left(\frac{k}{n^{2i/d}}\right) \qquad P_{k,(2i+1)} = \cos\left(\frac{k}{n^{2i/d}}\right)$$
Here n is a user-defined constant; the original paper suggests n = 10000.
Given an input sequence $\vec{x}_1, \ldots, \vec{x}_T$ we add the position
embedding to each $\vec{x}_k$, obtaining a new input sequence
$$\vec{x}'_1, \ldots, \vec{x}'_T$$
where
$$\vec{x}'_k = \vec{x}_k + \vec{P}_k$$
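The definition can be written down directly. This is a minimal sketch assuming an even dimension d and zero-based component indices; the function name is illustrative.

```python
import math


def positional_encoding(k: int, d: int, n: float = 10000.0):
    """Sinusoidal encoding of position k as a list of d reals (d even)."""
    if d % 2 != 0:
        raise ValueError("d must be even")
    encoding = []
    for i in range(d // 2):
        angle = k / n ** (2 * i / d)
        encoding.append(math.sin(angle))  # component 2i
        encoding.append(math.cos(angle))  # component 2i + 1
    return encoding


if __name__ == "__main__":
    for k in range(1, 4):
        print(k, positional_encoding(k, d=4))
```

The first sine/cosine pair oscillates fastest, and each further pair has a lower frequency, so nearby positions differ mostly in the leading components.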
Positional encoding/embedding
▶ Vertically: Sinusoidal functions
▶ Horizontally: Decreasing frequency
For any offset o ∈ {1, . . . , T} there is a linear transformation M
such that for any k ∈ {1, . . . , T − o} we have M⃗Pk = ⃗Pk+o.
Intuitively, just rotate each component of the ⃗Pk appropriately.
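The linear transformation M can be made explicit: it is block-diagonal, with one 2 x 2 rotation per sine/cosine pair, and it depends on the offset o but not on the position k. The sketch below applies these rotations directly instead of building the matrix; it assumes the sinusoidal encoding with n = 10000 and an even dimension, and the function names are illustrative.

```python
import math


def positional_encoding(k, d, n=10000.0):
    """Sinusoidal encoding of position k (d even)."""
    enc = []
    for i in range(d // 2):
        angle = k / n ** (2 * i / d)
        enc += [math.sin(angle), math.cos(angle)]
    return enc


def shift(encoding, offset, n=10000.0):
    """Apply the block-diagonal rotation that maps P_k to P_{k+offset}."""
    d = len(encoding)
    shifted = []
    for i in range(d // 2):
        theta = offset / n ** (2 * i / d)
        s, c = encoding[2 * i], encoding[2 * i + 1]
        # Angle-addition formulas for sin(a + theta) and cos(a + theta).
        shifted.append(s * math.cos(theta) + c * math.sin(theta))
        shifted.append(c * math.cos(theta) - s * math.sin(theta))
    return shifted
```

Since `shift` never looks at k, the same transformation moves every position by the same offset, which is what lets attention reason about relative positions.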
GPT-2 - transformer
Layer normalization
Given a vector $\vec{x} \in \mathbb{R}^d$, the layer normalization computes:
$$\vec{x}' = \gamma \cdot \frac{(\vec{x} - \mu)}{\sigma} + \beta$$
Here
▶ $\mu = \frac{1}{d} \sum_{i=1}^{d} x_i$ and $\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$
▶ $\gamma, \beta \in \mathbb{R}^d$ are vectors of trainable parameters
In Transformer:
The input to the layer normalization is a sequence of vectors
$\vec{x}_1, \ldots, \vec{x}_T$. The layer normalization is applied to each $\vec{x}_k$,
producing a sequence of "normalized" vectors.
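A minimal sketch of layer normalization for a single vector follows, with γ and β applied component-wise. The `eps` parameter is not part of the formula; it is an assumption added here because implementations usually add a small positive constant to the variance to avoid division by zero for constant vectors.

```python
import math


def layer_norm(x, gamma, beta, eps=0.0):
    """Layer normalization of one vector x with per-component gamma, beta.

    `eps` is an optional stabiliser added to the variance (an assumption,
    not part of the formula in the slides).
    """
    d = len(x)
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d
    sigma = math.sqrt(var + eps)
    return [g * (xi - mu) / sigma + b for xi, g, b in zip(x, gamma, beta)]


if __name__ == "__main__":
    print(layer_norm([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
```

In a transformer the function would be called once per position, on each vector of the sequence independently.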
GPT - learning
A sequence of tokens $a_1, \ldots, a_T \in \Sigma$ and their
one-hot encodings $\vec{u}_1, \ldots, \vec{u}_T \in \{0, 1\}^{|\Sigma|}$
We assume that $a_1$ is a special token marking the start of
the sequence.
Embed to vectors and add the position
encoding ($W_e$ is an embedding matrix):
$$\vec{x}_k = W_e \cdot \vec{u}_k + \vec{P}_k \in \mathbb{R}^d$$
Apply the network (with the transformer block repeated 12x) to
$\vec{x}_1, \ldots, \vec{x}_T$ and obtain $\vec{y}_1, \ldots, \vec{y}_T$
(Here assume that each $\vec{y}_k \in [0, 1]^{|\Sigma|}$
is a probability distribution on $\Sigma$)
Compute the error:
$$-\sum_{\ell=1}^{T-1} \log \vec{y}_\ell[a_{\ell+1}]$$
Here $\vec{y}_\ell[a_{\ell+1}]$ is the probability of $a_{\ell+1}$ in the distribution $\vec{y}_\ell$.
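The error is the negative log-likelihood of each next token under the distribution predicted at the previous position. A minimal sketch, assuming the network outputs are given as dictionaries mapping tokens to probabilities (this representation and the function name are assumptions for illustration):

```python
import math


def next_token_loss(distributions, tokens):
    """Negative log-likelihood -sum_{l=1}^{T-1} log y_l[a_{l+1}].

    `distributions[pos]` is the predicted distribution at position pos + 1
    (Python indices start at 0), given as a dict token -> probability;
    `tokens` is the training sequence a_1, ..., a_T.
    """
    return -sum(
        math.log(distributions[pos][tokens[pos + 1]])
        for pos in range(len(tokens) - 1)
    )


if __name__ == "__main__":
    tokens = ["<s>", "a", "b"]
    distributions = [
        {"a": 0.5, "b": 0.5},    # prediction after "<s>"
        {"a": 0.25, "b": 0.75},  # prediction after "<s>", "a"
        {"a": 0.5, "b": 0.5},    # last prediction, not used by the loss
    ]
    print(next_token_loss(distributions, tokens))
```

The last distribution does not contribute, since the training sequence contains no token after position T.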
GPT - inference
A sequence of tokens
$a_1, \ldots, a_\ell \in \Sigma$ and their one-hot
encodings $\vec{u}_1, \ldots, \vec{u}_\ell \in \{0, 1\}^{|\Sigma|}$
Embed to vectors and add
the position encoding:
$$\vec{x}_k = W_e \cdot \vec{u}_k + \vec{P}_k \in \mathbb{R}^d$$
Apply the network to $\vec{x}_1, \ldots, \vec{x}_\ell$ and
obtain $\vec{y}_1, \ldots, \vec{y}_\ell$
(Assume that each $\vec{y}_k \in [0, 1]^{|\Sigma|}$ is
a probability distribution on $\Sigma$)
Sample the next token:
$$a_{\ell+1} \sim \vec{y}_\ell$$
https://transformer.huggingface.co/doc/distil-gpt2
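The inference procedure is a loop: run the network on the tokens so far, sample from the last output distribution, append the sampled token, and repeat. In the sketch below `model` is a hypothetical callable returning one distribution (a dict from token to probability) per input position, and `toy_model` is a trivial stand-in used only to make the example runnable.

```python
import random


def sample_next_token(model, tokens, rng):
    """Sample the next token from the last distribution the model returns."""
    last = model(tokens)[-1]
    symbols = list(last)
    weights = [last[s] for s in symbols]
    return rng.choices(symbols, weights=weights, k=1)[0]


def generate(model, prompt, steps, rng):
    """Autoregressive generation: append one sampled token per step."""
    tokens = list(prompt)
    for _ in range(steps):
        tokens.append(sample_next_token(model, tokens, rng))
    return tokens


def toy_model(tokens):
    """Toy stand-in for GPT: always predicts the token "b" after anything."""
    return [{"a": 0.0, "b": 1.0} for _ in tokens]


if __name__ == "__main__":
    print(generate(toy_model, ["<s>"], steps=3, rng=random.Random(0)))
```

Passing the random generator explicitly keeps the sampling reproducible; a real system would typically also stop when an end-of-sequence token is sampled.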
Feed-forward networks summary
Architectures:
▶ Multi-layer perceptron (MLP):
▶ dense connections between layers
▶ Convolutional networks (CNN):
▶ local receptors, feature maps
▶ pooling
▶ Recurrent networks (RNN):
▶ self-loops but still feed-forward through time
▶ Transformer
▶ Attention, query-key-value
Training:
▶ gradient descent algorithm + heuristics