MLP training – practical issues

Architecture – Multilayer Perceptron (MLP)
(figure: an MLP with input neurons x1, x2, one hidden layer, and output neurons y1, y2)
- Neurons are partitioned into layers: one input layer, one output layer, possibly several hidden layers.
- Layers are numbered from 0; the input layer has number 0. E.g. a three-layer network has two hidden layers and one output layer.
- Neurons in the i-th layer are connected with all neurons in the (i+1)-st layer.
- The architecture of an MLP is typically described by the numbers of neurons in the individual layers (e.g. 2-4-3-2).

MLP – architecture
Notation:
- X is the set of input neurons
- Y is the set of output neurons
- Z is the set of all neurons (X, Y ⊆ Z)
- individual neurons are denoted by indices i, j, etc.
- ξ_j is the inner potential of neuron j after the computation stops
- y_j is the output of neuron j after the computation stops (define y_0 = 1 as the value of the formal unit input)
- w_{ji} is the weight of the connection from i to j (in particular, w_{j0} is the weight of the connection from the formal unit input, i.e. w_{j0} = −b_j where b_j is the bias of neuron j)
- j← is the set of all i such that j is adjacent from i (i.e. there is an arc to j from i)
- j→ is the set of all i such that j is adjacent to i (i.e. there is an arc from j to i)

MLP – learning
Given a training set T of the form {(x_k, d_k) | k = 1, ..., p}. Here every x_k ∈ R^{|X|} is an input vector and every d_k ∈ R^{|Y|} is the desired network output. For every j ∈ Y, denote by d_{kj} the desired output of neuron j for the network input x_k (the vector d_k can be written as (d_{kj})_{j∈Y}).
Error function: E(w) = ∑_{k=1}^{p} E_k(w)

SGD
- Weights in w^{(0)} are randomly initialized to values close to 0.
- In step t+1 (here t = 0, 1, 2, ...), weights w^{(t+1)} are computed as follows:
  - choose (randomly) a set of training examples T ⊆ {1, ..., p},
  - compute w^{(t+1)} = w^{(t)} + Δw^{(t)} where Δw^{(t)} = −ε(t) · ∑_{k∈T} ∇E_k(w^{(t)}).
- 0 < ε(t) ≤ 1 is the learning rate in step t+1.
- ∇E_k(w^{(t)}) is the gradient of the error of example k.
Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially.

MLP – mse gradient
For every w_{ji} we have
∂E/∂w_{ji} = (1/p) ∑_{k=1}^{p} ∂E_k/∂w_{ji}
where for every k = 1, ..., p
∂E_k/∂w_{ji} = ∂E_k/∂y_j · σ'_j(ξ_j) · y_i
and for every j ∈ Z \ X we get (for the squared error)
∂E_k/∂y_j = y_j − d_{kj} for j ∈ Y
∂E_k/∂y_j = ∑_{r∈j→} ∂E_k/∂y_r · σ'_r(ξ_r) · w_{rj} for j ∈ Z \ (Y ∪ X)
(Here all y_j are in fact y_j(w, x_k).)

(Some) error functions
- squared error: E(w) = ∑_{k=1}^{p} E_k(w) where E_k(w) = (1/2) ∑_{j∈Y} (y_j(w, x_k) − d_{kj})²
- mean squared error (mse): E(w) = (1/p) ∑_{k=1}^{p} E_k(w)
- (categorical) cross-entropy: E(w) = −(1/p) ∑_{k=1}^{p} ∑_{j∈Y} d_{kj} ln(y_j)

Practical issues of gradient descent
Training efficiency:
- What size of a minibatch?
- How to choose the learning rate ε(t) and control SGD?
- How to pre-process the inputs?
- How to initialize weights?
- How to choose desired output values of the network?
Quality of the resulting model:
- When to stop training?
- Regularization techniques.
- How large a network?
For simplicity, I will illustrate the reasoning on MLP + mse. Later we will see other topologies and error functions with different but always somewhat related issues.
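To make the SGD and backprop formulas above concrete, here is a minimal NumPy sketch of one pass over shuffled minibatches for a small 3-4-2 MLP with tanh hidden units, a linear output and the squared error. The data, layer sizes, learning rate and batch count are made up for illustration, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: p = 100 examples, |X| = 3 inputs, |Y| = 2 outputs (made-up values)
X = rng.normal(size=(100, 3))
D = rng.normal(size=(100, 2))

# weights of a 3-4-2 MLP, initialized close to 0 (biases kept separately)
W1, b1 = 0.1 * rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = 0.1 * rng.normal(size=(2, 4)), np.zeros(2)

def sgd_step(W1, b1, W2, b2, batch, eps=0.05):
    x, d = X[batch], D[batch]              # minibatch T
    # forward pass: inner potentials xi and outputs y of each layer
    xi1 = x @ W1.T + b1
    y1 = np.tanh(xi1)                      # hidden layer, sigma = tanh
    y2 = y1 @ W2.T + b2                    # linear output layer
    # backward pass: dE_k/dy_j as in the slides (squared error)
    dE_dxi2 = y2 - d                       # y_j - d_kj, sigma' = 1 for linear outputs
    dE_dy1 = dE_dxi2 @ W2                  # sum over r in j->
    dE_dxi1 = dE_dy1 * (1.0 - y1 ** 2)     # tanh'(xi) = 1 - tanh(xi)^2
    # accumulate gradients over the minibatch and update the weights
    W2 -= eps * dE_dxi2.T @ y1
    b2 -= eps * dE_dxi2.sum(axis=0)
    W1 -= eps * dE_dxi1.T @ x
    b1 -= eps * dE_dxi1.sum(axis=0)
    return W1, b1, W2, b2

perm = rng.permutation(len(X))             # shuffle, then take minibatches sequentially
for batch in np.split(perm, 10):
    W1, b1, W2, b2 = sgd_step(W1, b1, W2, b2, batch)
```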
Issues in gradient descent
- Small networks: lots of local minima where the descent gets stuck.
- The model identifiability problem: swapping incoming weights of neurons i and j yields a network with the same topology – weight-space symmetry.
- Recent studies show that for sufficiently large networks all local minima have low values of the error function.
- Saddle points: one can show (by a combinatorial argument) that larger networks have exponentially more saddle points than local minima.

Issues in gradient descent – too slow descent
Flat regions: e.g. if the inner potentials are too large (in absolute value), then the derivative of the activation function is extremely small.

Issues in gradient descent – too fast descent
Steep cliffs: the gradient is extremely large, and the descent skips over important weight vectors.

Issues in gradient descent – local vs. global structure
What if we initialize on the left? (figure)

Gradient descent in large networks
Theorem. Assume (roughly):
- activation functions: "smooth" ReLU (softplus) σ(z) = log(1 + exp(z)); in general: smooth, non-polynomial, analytic, Lipschitz,
- inputs x_k of Euclidean norm equal to 1, desired values d_k satisfying |d_k| ∈ O(1),
- the number of hidden neurons per layer sufficiently large (polynomial in certain numerical characteristics of inputs, roughly measuring their similarity, and exponential in the depth of the network),
- the learning rate constant and sufficiently small.
Then gradient descent converges (with high probability) to a global minimum with zero error at a linear rate.
Later we will get to a special type of networks called ResNets, for which the above result demands only polynomially many neurons per layer (w.r.t. depth).

Issues in computing the gradient
Vanishing and exploding gradients:
∂E_k/∂y_j = y_j − d_{kj} for j ∈ Y
∂E_k/∂y_j = ∑_{r∈j→} ∂E_k/∂y_r · σ'_r(ξ_r) · w_{rj} for j ∈ Z \ (Y ∪ X)
Inexact gradient computation: the minibatch gradient is only an estimate of the true gradient. Note that the standard error of the estimate is (roughly) σ/√m where m is the size of the minibatch and σ is the standard deviation of the gradient estimate for a single training example. (E.g. minibatch size 10 000 means 100 times more computation than size 100, but gives only 10 times smaller standard error.)
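A quick way to see the σ/√m behaviour is to simulate it. The per-example "gradients" below are just random numbers standing in for ∇E_k (a made-up distribution), so only the scaling of the spread matters, not the values.

```python
import numpy as np

rng = np.random.default_rng(1)
grads = rng.normal(loc=0.3, scale=2.0, size=100_000)  # stand-ins for per-example gradients

for m in (100, 10_000):
    # draw many minibatches of size m and look at the spread of their mean gradient
    means = rng.choice(grads, size=(1000, m)).mean(axis=1)
    print(m, means.std())  # shrinks roughly as sigma/sqrt(m): ~0.2 vs ~0.02
```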
Minibatch size
- Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
- Multicore architectures are usually underutilized by extremely small batches.
- If all examples in the batch are to be processed in parallel (as is the typical case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor on batch size.
- It is common (especially when using GPUs) for power-of-2 batch sizes to offer better runtime. Typical power-of-2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
- Small batches can offer a regularizing effect, perhaps due to the noise they add to the learning process. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. ("On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima", Keskar et al., ICLR'17)

Momentum
Issue in gradient descent: ∇E(w^{(t)}) constantly changes direction (but the error steadily decreases).
Solution: in every step add the change made in the previous step, weighted by a factor α:
Δw^{(t)} = −ε(t) · ∑_{k∈T} ∇E_k(w^{(t)}) + α · Δw^{(t−1)}
where 0 < α < 1.

Momentum – illustration
(figure)

SGD with momentum
- Weights in w^{(0)} are randomly initialized to values close to 0.
- In step t+1 (here t = 0, 1, 2, ...), weights w^{(t+1)} are computed as follows:
  - choose (randomly) a set of training examples T ⊆ {1, ..., p},
  - compute w^{(t+1)} = w^{(t)} + Δw^{(t)} where Δw^{(t)} = −ε(t) · ∑_{k∈T} ∇E_k(w^{(t)}) + α · Δw^{(t−1)}.
- 0 < ε(t) ≤ 1 is the learning rate in step t+1.
- 0 < α < 1 measures the "influence" of the momentum.
- ∇E_k(w^{(t)}) is the gradient of the error of example k.
Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially.

Learning rate
(figure)

Search for the learning rate
- Use settings from a successful solution of a similar problem as a baseline.
- Search for the learning rate while monitoring the learning: search through values from small (e.g. 0.001) to large (e.g. 0.1), possibly multiplying by 2; train for several epochs and observe the learning curves (see cross-validation later).
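A sketch of the learning-rate search loop just described. `build_model`, `train_epochs` and `validation_error` are hypothetical stand-ins for your own model construction, training loop and validation-set evaluation; the grid of rates is just the "multiply by 2 from 0.001 to 0.1" suggestion above.

```python
def search_learning_rate(build_model, train_epochs, validation_error,
                         lrs=(0.001, 0.002, 0.004, 0.008, 0.016, 0.032, 0.064, 0.1)):
    curves = {}
    for lr in lrs:                      # from small to large, roughly multiplying by 2
        model = build_model()
        curve = []
        for _ in range(5):              # train for a few epochs, record the learning curve
            model = train_epochs(model, epochs=1, learning_rate=lr)
            curve.append(validation_error(model))
        curves[lr] = curve
    # inspect the curves and keep the largest rate whose curve still decreases smoothly
    return curves
```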
Adaptive learning rate
- Power scheduling: set ε(t) = ε_0 / (1 + t/s) where ε_0 is an initial learning rate and s a number of steps (after s steps the learning rate is ε_0/2, after 2s steps it is ε_0/3, etc.).
- Exponential scheduling: set ε(t) = ε_0 · 0.1^{t/s} (the learning rate decays faster than in power scheduling).
- Piecewise constant scheduling: a constant learning rate for a number of steps/epochs, then a smaller learning rate, and so on.
- 1cycle scheduling: start by increasing the learning rate from ε_0 linearly to ε_1 (approx. ε_1 = 10·ε_0) halfway through training, then decrease it from ε_1 linearly back to ε_0, and finish by dropping the learning rate by several orders of magnitude (still linearly). According to a 2018 paper by Leslie Smith this may converge much faster (100 epochs vs. 800 epochs on the CIFAR-10 dataset).
For a comparison of some methods see: "An Empirical Study of Learning Rates in Deep Neural Networks for Speech Recognition", Senior et al.

AdaGrad
So far we have considered fixed schedules for learning rates. It is better to have larger rates for weights with smaller updates, and smaller rates for weights with larger updates. AdaGrad uses an individually adapting learning rate for each weight.

SGD with AdaGrad
- Weights in w^{(0)} are randomly initialized to values close to 0.
- In step t+1 (here t = 0, 1, 2, ...), compute w^{(t+1)}:
  - choose (randomly) a minibatch T ⊆ {1, ..., p},
  - compute w_{ji}^{(t+1)} = w_{ji}^{(t)} + Δw_{ji}^{(t)}
    where Δw_{ji}^{(t)} = −(η / (δ + √(r_{ji}^{(t)}))) · ∑_{k∈T} ∂E_k/∂w_{ji}(w^{(t)})
    and r_{ji}^{(t)} = r_{ji}^{(t−1)} + (∑_{k∈T} ∂E_k/∂w_{ji}(w^{(t)}))².
- η is a constant expressing the influence of the learning rate, typically 0.01.
- δ > 0 is a smoothing term (typically 1e-8) avoiding division by 0.

RMSProp
The main disadvantage of AdaGrad is the accumulation of the gradient throughout the whole learning process. In case the learning needs to get over several "hills" before settling in a deep "valley", the weight updates get far too small before reaching it. RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.

SGD with RMSProp
- Weights in w^{(0)} are randomly initialized to values close to 0.
- In step t+1 (here t = 0, 1, 2, ...), compute w^{(t+1)}:
  - choose (randomly) a minibatch T ⊆ {1, ..., p},
  - compute w_{ji}^{(t+1)} = w_{ji}^{(t)} + Δw_{ji}^{(t)}
    where Δw_{ji}^{(t)} = −(η / (δ + √(r_{ji}^{(t)}))) · ∑_{k∈T} ∂E_k/∂w_{ji}(w^{(t)})
    and r_{ji}^{(t)} = ρ·r_{ji}^{(t−1)} + (1 − ρ)·(∑_{k∈T} ∂E_k/∂w_{ji}(w^{(t)}))².
- η is a constant expressing the influence of the learning rate (Hinton suggests ρ = 0.9 and η = 0.001).
- δ > 0 is a smoothing term (typically 1e-8) avoiding division by 0.
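The three update rules above differ only in how the step is scaled. A minimal per-weight sketch, assuming `g` is the summed minibatch gradient ∑_{k∈T} ∂E_k/∂w for the given weight array and `state` holds the per-weight accumulators between steps:

```python
import numpy as np

def momentum_step(w, g, state, eps=0.01, alpha=0.9):
    # delta_w(t) = -eps * g + alpha * delta_w(t-1)
    state["delta"] = -eps * g + alpha * state.get("delta", 0.0)
    return w + state["delta"]

def adagrad_step(w, g, state, eta=0.01, delta=1e-8):
    # r(t) = r(t-1) + g^2, accumulated over the whole run
    state["r"] = state.get("r", 0.0) + g ** 2
    return w - eta / (delta + np.sqrt(state["r"])) * g

def rmsprop_step(w, g, state, eta=0.001, rho=0.9, delta=1e-8):
    # r(t) = rho * r(t-1) + (1 - rho) * g^2, an exponentially decaying average
    state["r"] = rho * state.get("r", 0.0) + (1.0 - rho) * g ** 2
    return w - eta / (delta + np.sqrt(state["r"])) * g
```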
Other optimization methods
There are more methods such as AdaDelta, Adam (roughly RMSProp combined with momentum), etc.
A natural question: which algorithm should one choose? Unfortunately, there is currently no consensus on this point. According to a recent study, the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly, but no single best algorithm has emerged.
Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta and Adam. The choice of which algorithm to use, at this point, seems to depend largely on the user's familiarity with the algorithm.

Choice of (hidden) activations
Generic requirements imposed on activation functions:
1. differentiability (to do gradient descent)
2. non-linearity (linear multi-layer networks are equivalent to single-layer ones)
3. monotonicity (local extrema of activation functions induce local extrema of the error function)
4. "linearity" (i.e. preserve as much linearity as possible; linear models are easiest to fit; find the "minimum" non-linearity needed to solve a given task)
The choice of activation functions is closely related to input preprocessing and the initial choice of weights. I will illustrate the reasoning on sigmoidal functions and say a few words about other activation functions later.

Activation functions – tanh
σ(ξ) = 1.7159 · tanh((2/3) · ξ); we have lim_{ξ→∞} σ(ξ) = 1.7159 and lim_{ξ→−∞} σ(ξ) = −1.7159.
σ is almost linear on [−1, 1].
(figure: σ and its first derivative)

Input preprocessing
Some inputs may be much larger than others. E.g. height vs. weight of a person, maximum speed of a car (in km/h) vs. its price (in CZK), etc.
Large inputs have greater influence on the training than the small ones. In addition, too large inputs may slow down learning (saturation of activation functions).
Typical standardization:
- mean = 0 (subtract the mean)
- variance = 1 (divide by the standard deviation)
Here the mean and standard deviation may be estimated from data (the training set).
(illustration of standard deviation)
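A minimal sketch of the standardization step. The statistics are estimated on the training set only and then reused for any later data; the zero-variance guard is an extra assumption, not from the slide.

```python
import numpy as np

def fit_standardizer(X_train):
    # estimate the per-feature mean and standard deviation from the training set
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0          # guard against constant features (assumption)
    return mean, std

def standardize(X, mean, std):
    # mean = 0 (subtract the mean), variance = 1 (divide by the standard deviation)
    return (X - mean) / std
```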
Initial weights (for tanh)
Assume weights are chosen uniformly at random from an interval [−w, w] where w depends on the number of inputs of a given neuron.
Consider the activation function σ(ξ) = 1.7159 · tanh((2/3) · ξ) for all neurons. σ is almost linear on [−1, 1]; σ saturates outside the interval [−4, 4] (i.e. it is close to its limit values and its derivative is close to 0).
Thus
- for too small w we may get an (almost) linear model,
- for too large w (i.e. much larger than 1) the activations may get saturated and the learning will be very slow.
Hence, we want to choose w so that the inner potentials of neurons will be roughly in the interval [−1, 1].

Standardization gives mean = 0 and variance = 1 of the input data. Consider a neuron j from the first layer with n inputs, and assume that its weights are chosen uniformly from [−w, w].
The rule: choose w so that the standard deviation of ξ_j (denote it by o_j) is close to the border of the interval on which σ_j is linear. In our case: o_j ≈ 1.
Our assumptions imply: o_j = √(n/3) · w. Thus we put w = √3 / √n.
The same works for higher layers; n then corresponds to the number of neurons in the layer one level lower.
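The w = √3/√n rule written out as a sketch (one uniform weight matrix per layer, n = number of inputs of the neuron); the layer sizes below are made up for illustration.

```python
import numpy as np

def init_tanh_layer(n_inputs, n_neurons, rng):
    # uniform in [-w, w] with w = sqrt(3)/sqrt(n), so that std(xi_j) = w * sqrt(n/3) = 1
    w = np.sqrt(3.0) / np.sqrt(n_inputs)
    return rng.uniform(-w, w, size=(n_neurons, n_inputs))

rng = np.random.default_rng(0)
W1 = init_tanh_layer(784, 500, rng)   # first layer: n = number of (standardized) inputs
W2 = init_tanh_layer(500, 10, rng)    # higher layer: n = number of neurons one level lower
```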
Glorot & Bengio initialization
The previous heuristic for weight initialization ignores the variance of the gradient (i.e. it is concerned only with the "size" of activations in the forward pass).
Glorot & Bengio (2010) presented a normalized initialization, choosing w uniformly from the interval
[ −√(6/(m + n)), √(6/(m + n)) ] = [ −√3/√((m + n)/2), √3/√((m + n)/2) ]
Here n is the number of inputs to the layer and m is the number of outputs of the layer (i.e. the number of neurons in the layer).
This is designed to compromise between the goal of initializing all layers to have the same activation variance and the goal of initializing all layers to have the same gradient variance. The formula is derived using the assumption that the network consists only of a chain of matrix multiplications, with no non-linearities. Real neural networks obviously violate this assumption, but many strategies designed for the linear model perform reasonably well on its non-linear counterparts.

Modern activation functions
For hidden neurons, sigmoidal functions are often substituted with piecewise linear activation functions. Most prominent is ReLU: σ(ξ) = max{0, ξ}.
- THE default activation function recommended for use with most feedforward neural networks.
- As close to a linear function as possible; very simple; does not saturate for large potentials.
- Dead for negative potentials.

More modern activation functions
- Leaky ReLU (greenboard): generalizes ReLU, not dead for negative potentials. Experimentally not much better than ReLU.
- ELU, a "smoothed" ReLU:
  σ(ξ) = α(exp(ξ) − 1) for ξ < 0, and σ(ξ) = ξ for ξ ≥ 0.
  Here α is a parameter; ELU converges to −α as ξ → −∞. As opposed to ReLU: smooth, always non-zero gradient (but saturates), slower to compute.
- SELU, a scaled variant of ELU:
  σ(ξ) = λ·α(exp(ξ) − 1) for ξ < 0, and σ(ξ) = λ·ξ for ξ ≥ 0.
  Self-normalizing, i.e. the output of each layer will tend to preserve a mean (close to) 0 and a standard deviation (close to) 1, for λ ≈ 1.050 and α ≈ 1.673, properly initialized weights (see below) and normalized inputs (zero mean, standard deviation 1).
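The piecewise-linear activations above, written out in NumPy. The leaky-ReLU slope 0.01 is just a common choice (the slide leaves it unspecified); the SELU constants are the λ and α quoted above.

```python
import numpy as np

def relu(xi):
    return np.maximum(0.0, xi)

def leaky_relu(xi, a=0.01):            # small positive slope for negative potentials
    return np.where(xi >= 0.0, xi, a * xi)

def elu(xi, alpha=1.0):                # smoothed ReLU, converges to -alpha as xi -> -inf
    return np.where(xi >= 0.0, xi, alpha * (np.exp(xi) - 1.0))

def selu(xi, lam=1.050, alpha=1.673):  # scaled ELU, self-normalizing with proper init
    return lam * elu(xi, alpha)
```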
Initializing with the normal distribution
Denote by n the number of inputs to the initialized layer, and by m the number of neurons in the layer.
- Glorot & Bengio (2010): choose weights randomly from the normal distribution with mean 0 and variance 2/(n + m). Suitable for activation functions: none, tanh, logistic, softmax.
- He (2015): choose weights randomly from the normal distribution with mean 0 and variance 2/n. Designed for ReLU and leaky ReLU.
- LeCun (1990): choose weights randomly from the normal distribution with mean 0 and variance 1/n. Suitable for SELU.

How to choose activations of hidden neurons
The default is ReLU. According to Aurélien Géron: SELU > ELU > leaky ReLU > ReLU > tanh > logistic.
For discussion see: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, Aurélien Géron.

Output neurons
The choice of activation functions for output units depends on the concrete application.
- For regression (function approximation) the output is typically linear.
- For classification, the current activation functions of choice are
  - logistic sigmoid – binary classification,
  - softmax – multi-class classification: for an output neuron j ∈ Y, y_j = σ_j(ξ_j) = e^{ξ_j} / ∑_{i∈Y} e^{ξ_i}.
The error function used with softmax (assuming that the target values d_{kj} are from {0, 1}) is typically the cross-entropy:
−(1/p) ∑_{k=1}^{p} ∑_{j∈Y} d_{kj} ln(y_j)
... which somewhat corresponds to the maximum likelihood principle.

Sigmoidal outputs with cross-entropy – in detail
Consider binary classification with two classes {0, 1} and one output neuron j with the logistic sigmoid activation
σ_j(ξ_j) = 1 / (1 + e^{−ξ_j})
The output of the network is y = σ_j(ξ_j).
For a training set T = {(x_k, d_k) | k = 1, ..., p} (here x_k ∈ R^{|X|} and d_k ∈ R), the cross-entropy looks like this:
E_cross = −(1/p) ∑_{k=1}^{p} [d_k ln(y_k) + (1 − d_k) ln(1 − y_k)]
where y_k is the output of the network for the k-th training input x_k, and d_k is the k-th desired output.
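A sketch of the output activations and the two cross-entropy errors above. `xi` is a vector of output inner potentials; `Y` and `D` are network outputs and one-hot targets for p examples (all values are assumed inputs, not from the slides); the max-subtraction in softmax is the usual numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(xi):
    e = np.exp(xi - xi.max())              # subtract max for numerical stability
    return e / e.sum()

def logistic(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def categorical_cross_entropy(Y, D):
    # -(1/p) * sum_k sum_j d_kj * ln(y_kj), with D one-hot over the output neurons
    return -np.mean(np.sum(D * np.log(Y), axis=1))

def binary_cross_entropy(y, d):
    # -(1/p) * sum_k [d_k ln(y_k) + (1 - d_k) ln(1 - y_k)]
    return -np.mean(d * np.log(y) + (1.0 - d) * np.log(1.0 - y))
```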
Generalization
Intuition: generalization = the ability to cope with new, unseen instances.
Data are mostly noisy, so it is not a good idea to fit them exactly. In the case of function approximation, the network should not return the exact results from the training set.
More formally: it is typically assumed that the training set has been generated as
d_{kj} = g_j(x_k) + Θ_{kj}
where g_j is the "underlying" function corresponding to the output neuron j ∈ Y and Θ_{kj} is random noise. The network should fit g_j, not the noise.
Methods improving generalization are called regularization methods.

Regularization
Regularization is a big issue in neural networks, as they typically use a huge number of parameters and thus are very susceptible to overfitting.
von Neumann: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk."
... and I ask you, prof. von Neumann: what can you fit with 40 GB of parameters?

Early stopping
Early stopping means that we stop learning before it reaches a minimum of the error E. When to stop?
In many applications the error function is not the main thing we want to optimize. E.g. in the case of a trading system, we typically want to maximize our profit, not to minimize (strange) error functions designed to be easily differentiable. Also, as noted before, minimizing E completely is not good for generalization.
For a start, we may employ the standard approach of training on one set and stopping on another one.
Divide your dataset into several subsets:
- training set (e.g. 60%) – train the network here
- validation set (e.g. 20%) – use to stop the training
- (possibly) test set (e.g. 20%) – use to compare trained models
What to use as a stopping rule?
- You may observe E (or any other function of interest) on the validation set; if it does not improve for the last k steps, stop.
- Alternatively, you may observe the gradient; if it is small for some time, stop. (Recent studies have shown that this traditional rule is not too good: it may happen that the gradient is still large close to the minimum; on the other hand, E does not have to be evaluated, which saves time.)
To compare models you may use ML techniques such as various types of cross-validation, etc.
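A sketch of the "no improvement for the last k steps" rule above. `train_one_epoch` and `validation_error` are hypothetical stand-ins for your own training and validation-set evaluation, and the model is assumed to expose `get_weights`/`set_weights`; keeping and restoring the best weights is the usual practice, not stated explicitly on the slide.

```python
def train_with_early_stopping(model, train_one_epoch, validation_error,
                              patience=10, max_epochs=1000):
    best_error, best_weights, since_best = float("inf"), model.get_weights(), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)              # uses the training set (e.g. 60% of the data)
        error = validation_error(model)     # monitored on the validation set (e.g. 20%)
        if error < best_error:
            best_error, best_weights, since_best = error, model.get_weights(), 0
        else:
            since_best += 1
            if since_best >= patience:      # no improvement for the last k epochs
                break
    model.set_weights(best_weights)         # keep the best model seen so far
    return model
```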
Size of the network
A similar problem as in the case of the training duration:
- Too small a network is not able to capture the intrinsic properties of the training set.
- Large networks overfit faster.
Solution: the optimal number of neurons :-)
- There are some (useless) theoretical bounds.
- There are algorithms dynamically adding/removing neurons (not of much use nowadays).
In practice:
- Start with a rule of thumb: the number of neurons ≈ ten times less than the number of training instances.
- Experiment, experiment, experiment.

Feature extraction
Consider a two-layer network. Hidden neurons are supposed to represent "patterns" in the inputs.
Example: a network 64-2-3 for letter classification.

Ensemble methods
Techniques for reducing the generalization error by combining several models. The reason that ensemble methods work is that different models will usually not make all the same errors on the test set.
Idea: train several different models separately, then have all of the models vote on the output for test examples.
Bagging:
- Generate k training sets T_1, ..., T_k by sampling from T uniformly with replacement. If the number of samples is |T|, then on average each T_i contains (1 − 1/e)|T| distinct examples of T.
- For each i, train a model M_i on T_i.
- Combine the outputs of the models: for regression by averaging, for classification by (majority) voting.

Dropout
The algorithm: in every step of the gradient descent
- choose randomly a set N of neurons; each neuron is included in N independently with probability 1/2 (in practice, different probabilities are used as well),
- do the forward and backward propagations only using the selected neurons (i.e. leave the weights of the other neurons unchanged).
Dropout resembles bagging: a large ensemble of neural networks is trained "at once" on parts of the data. Dropout is not exactly the same as bagging: the models share parameters, with each model inheriting a different subset of parameters from the parent neural network. This parameter sharing makes it possible to represent an exponential number of models with a tractable amount of memory. In the case of bagging, each model is trained to convergence on its respective training set; this would be infeasible for large networks/training sets.

Weight decay and L2 regularization
Generalization can be improved by removing "unimportant" weights. Penalizing large weights gives a stronger indication of their importance.
In every step we decrease the weights (multiplicatively) as follows:
w_{ji}^{(t+1)} = (1 − ζ)(w_{ji}^{(t)} + Δw_{ji}^{(t)})
Intuition: unimportant weights will be pushed to 0, important weights will survive the decay.
Weight decay is equivalent to gradient descent with a constant learning rate ε and the following error function:
E'(w) = E(w) + (ζ/(2ε)) · (w · w)
Here (ζ/(2ε)) · (w · w) is the L2 regularization term that penalizes large weights.
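Two small sketches of the rules above. `hidden_activations` is just a NumPy array of hidden-layer outputs; the test-time scaling by the keep probability is the usual convention (the slide does not discuss it), and the decay constant ζ = 1e-4 is a made-up example value.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(hidden_activations, p_keep=0.5, training=True):
    if not training:
        # at test time all neurons are used; scale to match the training-time expectation
        return hidden_activations * p_keep
    # each neuron is selected independently with probability p_keep;
    # dropped neurons take no part in the forward (and backward) pass
    mask = rng.random(hidden_activations.shape) < p_keep
    return hidden_activations * mask

def weight_decay_step(w, delta_w, zeta=1e-4):
    # w(t+1) = (1 - zeta) * (w(t) + delta_w(t))
    return (1.0 - zeta) * (w + delta_w)
```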
More optimization, regularization ...
There are many more practical tips, optimization methods, regularization methods, etc. For a very nice survey see http://www.deeplearningbook.org/ ... and also all the other infinitely many URLs concerned with deep learning.

Some applications

ALVINN (history)
(figure)

ALVINN
Architecture: MLP, 960 − 4 − 30 (also 960 − 5 − 30); inputs correspond to pixels.
Activity: activation functions: logistic sigmoid. The steering wheel position is determined by the "center of mass" of the output neuron values.
Learning: trained during a (live) drive. The front window view is captured by a camera, 25 images per second. Training samples of the form (x_k, d_k) where
- x_k = an image of the road,
- d_k = the corresponding position of the steering wheel.
The position of the steering wheel is "blurred" by a Gaussian distribution: d_{ki} = e^{−D_i²/10} where D_i is the distance of the i-th output from the one which corresponds to the correct position of the wheel. (The authors claim that this was better than a binary output.)

ALVINN – selection of training samples
Naive approach: take images directly from the camera and adapt accordingly.
Problems:
- If the driver is gentle enough, the car never learns how to get out of dangerous situations. A solution may be to turn off learning for a moment, then suddenly switch it on and let the net catch on, while the driver drives as if insane (dangerous, possibly expensive).
- The real view out of the front window is repetitive and boring; the net would overfit on a few examples.
The problem with a "good" driver is solved as follows: 15 distorted copies of each image are made, and a desired output is generated for each copy.
"Boring" images are solved as follows: a buffer of 200 images (including the 15 copies of the original) is kept, and in every step the system trains on the buffer; after several updates a new image is captured, 15 copies are made, and they substitute 15 images in the buffer (5 chosen randomly, 10 with the smallest error).

ALVINN – learning
Pure backpropagation, constant learning rate, momentum slowly increasing.
Results: trained for 5 minutes at a speed of 4 miles per hour, ALVINN was able to drive well on a new road it had never seen (in different weather conditions). The maximum speed was limited by the hydraulic controller of the steering wheel, not by the learning algorithm.

ALVINN – weight development
(figure: weights of the hidden neurons h1, ..., h5 after rounds 0, 10, 20 and 50)

MNIST – handwritten digits recognition
A database of labelled images of handwritten digits: 60 000 training examples, 10 000 testing. Dimensions: 28 × 28; digits are centered to the "center of gravity" of the pixel values and normalized to a fixed size. More at http://yann.lecun.com/exdb/mnist/
The database is used as a standard benchmark in lots of publications and allows comparison of various methods.
MNIST
One of the best "old" results is the following: a 6-layer NN 784-2500-2000-1500-1000-500-10 (on GPU) (Ciresan et al. 2010). Abstract: "Good old on-line back-propagation for plain multi-layer perceptrons yields a very low 0.35% error rate on the famous MNIST handwritten digits benchmark. All we need to achieve this best result so far are many hidden layers, many neurons per layer, numerous deformed training images, and graphics cards to greatly speed up learning."
A famous application of a learning convolutional network: LeNet-1 in 1998.

MNIST – LeNet-1
(figure: LeNet-1 architecture)
Interpretation of the output:
- the output neuron with the highest value identifies the digit;
- the same, but if the two largest neuron values are too close together, the input is rejected (i.e. no answer).
Learning: trained on 7291 samples, tested on 2007 samples.
Results:
- error on the test set without rejection: 5%
- error on the test set with rejection: 1% (12% rejected)
- compare with a dense MLP with 40 hidden neurons: error 1% (19.4% rejected)

Modern convolutional networks
The rest of the lecture is based on the online book Neural Networks and Deep Learning by Michael Nielsen, http://neuralnetworksanddeeplearning.com/index.html
Convolutional networks are currently the best networks for image classification. Their common ancestor is LeNet-5 (and the other LeNets) from the nineties.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

AlexNet
In 2012 this network made a breakthrough in the ILSVRC competition, taking the classification error from around 28% to 16%. A convolutional network, trained on two GPUs.

Convolutional networks – local receptive fields
Every neuron is connected with a field of k × k (in this case 5 × 5) neurons in the lower layer (this field is its receptive field). The neuron is "standard": it computes a weighted sum of its inputs and applies an activation function.

Convolutional networks – stride length
Then we slide the local receptive field over by one pixel to the right (i.e., by one neuron), to connect to a second hidden neuron. The "size" of the slide is called the stride length. The group of all such neurons is a feature map. All these neurons share weights and biases!

Feature maps
Each feature map represents a property of the input that is supposed to be spatially invariant. Typically, we consider several feature maps in a single layer.

Pooling
Neurons in the pooling layer compute functions of their receptive fields:
- max-pooling: maximum of the inputs
- L2-pooling: square root of the sum of squares
- average-pooling: mean
- ...

Trained feature maps
(figure: 20 feature maps with receptive fields 5 × 5)

Simple convolutional network
A 28 × 28 input image, 3 feature maps, each with its own max-pooling (field 5 × 5, stride = 1), 10 output neurons. Each neuron in the output layer gets input from each neuron in the pooling layer. Trained using backprop, which can be easily adapted to convolutional networks.

Convolutional network
(figure)

Simple convolutional network vs MNIST
- two convolutional-pooling layers, the first with 20 and the second with 40 feature maps,
- two dense (MLP) layers (1000-1000), 10 outputs,
- activation functions of the feature maps and dense layers: ReLU; max-pooling; output layer: softmax,
- error function: negative log-likelihood (= cross-entropy),
- training: SGD, minibatch size 10, learning rate 0.03, L2 regularization with "weight" λ = 0.1 plus dropout with probability 1/2, training for 40 epochs (i.e. every training example is considered 40 times),
- expanded dataset: displacement by one pixel in an arbitrary direction,
- committee voting of 5 networks.
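A sketch of one feature map computed with a shared 5 × 5 weight matrix (stride 1) followed by max-pooling. The 28 × 28 input, the 5 × 5 shared kernel and stride 1 follow the slides; the random values, ReLU choice and the 2 × 2 pooling field are just illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))              # toy 28 x 28 input
kernel = 0.1 * rng.normal(size=(5, 5))    # shared weights of one feature map
bias = 0.0                                # shared bias

def feature_map(image, kernel, bias, stride=1):
    k = kernel.shape[0]
    h = (image.shape[0] - k) // stride + 1
    w = (image.shape[1] - k) // stride + 1
    out = np.empty((h, w))
    for r in range(h):                    # slide the receptive field over the image
        for c in range(w):
            patch = image[r*stride:r*stride+k, c*stride:c*stride+k]
            out[r, c] = np.sum(kernel * patch) + bias   # same weights at every position
    return np.maximum(0.0, out)           # ReLU activation

def max_pool(fmap, size=2):
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    # disjoint size x size fields; each pooling neuron outputs the maximum of its field
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

fmap = feature_map(image, kernel, bias)   # 24 x 24 feature map
pooled = max_pool(fmap)                   # 12 x 12 after max-pooling
```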
MNIST
Out of the 10 000 images in the test set, only these 33 have been incorrectly classified: (figure)

More complex convolutional networks
Convolutional networks have been used for classification of images from the ImageNet database (16 million color images, 20 thousand classes).

ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
A competition in classification over a subset of images from ImageNet. Started in 2010, it assisted in the breakthrough in image recognition. Training set: 1.2 million images, 1000 classes. Validation set: 50 000 images, test set: 150 000 images.
Many images contain more than one object ⇒ the model is allowed to choose five classes, and the correct label must be among the five (the top-5 criterion).

AlexNet
ImageNet classification with deep convolutional neural networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton (2012). Trained on two GPUs (NVIDIA GeForce GTX 580).
Results: accuracy 84.7% in top-5 (the second best algorithm at the time: 73.8%); 63.3% "perfect" (top-1) classification.

ILSVRC 2014
The same setting as in 2012, top-5 criterion. GoogLeNet: a deep convolutional network, 22 layers. Results: accuracy 93.33% top-5.

ILSVRC 2015
Deep convolutional networks with various numbers of layers; the winner has 152 layers. Skip connections implementing residual learning. Error 3.57% in top-5.

ILSVRC 2016
Trimps-Soushen (The Third Research Institute of the Ministry of Public Security). There is no new innovative technology or novelty by Trimps-Soushen: an ensemble of the pretrained models Inception-v3, Inception-v4, Inception-ResNet-v2, Pre-Activation ResNet-200, and Wide ResNet (WRN-68–2). Each of the models is strong at classifying some categories, but also weak at classifying others. Test error: 2.99%.

Top-k accuracy analyzed
(several figures from https://towardsdatascience.com/review-trimps-soushen-winner-in-ilsvrc-2016-image-classification-dfbc423111dd)

Top-20 typical errors
Out of 1458 misclassified images in Top-20: (figures from the same source)
Superhuman convolutional nets?!
Andrej Karpathy: "...the task of labeling images with 5 out of 1000 categories quickly turned out to be extremely challenging, even for some friends in the lab who have been working on ILSVRC and its classes for a while. First we thought we would put it up on [Amazon Mechanical Turk]. Then we thought we could recruit paid undergrads. Then I organized a labeling party of intense labeling effort only among the (expert labelers) in our lab. Then I developed a modified interface that used GoogLeNet predictions to prune the number of categories from 1000 to only about 100. It was still too hard – people kept missing categories and getting up to ranges of 13-15% error rates. In the end I realized that to get anywhere competitively close to GoogLeNet, it was most efficient if I sat down and went through the painfully long training process and the subsequent careful annotation process myself... The labeling happened at a rate of about 1 per minute, but this decreased over time... Some images are easily recognized, while some images (such as those of fine-grained breeds of dogs, birds, or monkeys) can require multiple minutes of concentrated effort. I became very good at identifying breeds of dogs... Based on the sample of images I worked on, the GoogLeNet classification error turned out to be 6.8%... My own error in the end turned out to be 5.1%, approximately 1.7% better."

Does it really work?
(figure)

Convolutional networks – theory

Convolutional network
(figure)

Convolutional layers
Every neuron is connected with a (typically small) receptive field of neurons in the lower layer. The neuron is "standard": it computes a weighted sum of its inputs and applies an activation function.
Neurons are grouped into feature maps sharing weights. Each feature map represents a property of the input that is supposed to be spatially invariant. Typically, we consider several feature maps in a single layer.

Pooling layers
Neurons in the pooling layer compute simple functions of their receptive fields (the fields are typically disjoint):
- max-pooling: maximum of the inputs
- L2-pooling: square root of the sum of squares
- average-pooling: mean
- ...

Convolutional networks – architecture
Neurons are organized in layers L_0, L_1, ..., L_n, with connections (typically) only from L_m to L_{m+1}.
Several types of layers:
- input layer L_0,
- dense layer L_m: each neuron of L_m is connected with each neuron of L_{m−1},
- convolutional & pooling layer L_m, which contains two sub-layers:
  - convolutional layer: neurons organized into disjoint feature maps; all neurons of a given feature map share weights (but have different inputs),
  - pooling layer: each (convolutional) feature map F has a corresponding pooling map P; neurons of P have inputs only from F (typically few of them), compute a simple aggregate function (such as max), and have disjoint inputs.

Convolutional networks – notation
- X is a set of input neurons
- Y is a set of output neurons
- Z is a set of all neurons (X, Y ⊆ Z)
- individual neurons are denoted by indices i, j, etc.
- ξ_j is the inner potential of neuron j after the computation stops
- y_j is the output of neuron j after the computation stops (define y_0 = 1 as the value of the formal unit input)
- w_{ji} is the weight of the connection from i to j (in particular, w_{j0} is the weight of the connection from the formal unit input, i.e. w_{j0} = −b_j where b_j is the bias of neuron j)
- j← is the set of all i such that j is adjacent from i (i.e. there is an arc to j from i)
- j→ is the set of all i such that j is adjacent to i (i.e. there is an arc from j to i)
- [ji] is the set of all connections (i.e. pairs of neurons) sharing the weight w_{ji}.
Convolutional networks – activity
Neurons of dense and convolutional layers:
- inner potential of neuron j: ξ_j = ∑_{i∈j←} w_{ji}·y_i
- activation function σ_j for neuron j (arbitrary differentiable): y_j = σ_j(ξ_j)
Neurons of pooling layers apply the "pooling" function:
- max-pooling: y_j = max_{i∈j←} y_i
- avg-pooling: y_j = (∑_{i∈j←} y_i) / |j←|
A convolutional network is evaluated layer-wise (as an MLP); for each j ∈ Y, y_j(w, x) is the value of the output neuron j after evaluating the network with weights w and input x.

Convolutional networks – learning
Given a training set T of the form {(x_k, d_k) | k = 1, ..., p}. Here every x_k ∈ R^{|X|} is an input vector and every d_k ∈ R^{|Y|} is the desired network output. For every j ∈ Y, denote by d_{kj} the desired output of neuron j for the network input x_k (the vector d_k can be written as (d_{kj})_{j∈Y}).
Error function – mean squared error (for example):
E(w) = (1/p) ∑_{k=1}^{p} E_k(w) where E_k(w) = (1/2) ∑_{j∈Y} (y_j(w, x_k) − d_{kj})²

Convolutional networks – SGD
The algorithm computes a sequence of weight vectors w^{(0)}, w^{(1)}, w^{(2)}, ...
- Weights in w^{(0)} are randomly initialized to values close to 0.
- In step t+1 (here t = 0, 1, 2, ...), weights w^{(t+1)} are computed as follows:
  - choose (randomly) a set of training examples T ⊆ {1, ..., p},
  - compute w^{(t+1)} = w^{(t)} + Δw^{(t)} where Δw^{(t)} = −ε(t) · (1/|T|) ∑_{k∈T} ∇E_k(w^{(t)}).
- Here T is a minibatch (of a fixed size), 0 < ε(t) ≤ 1 is the learning rate in step t+1, and ∇E_k(w^{(t)}) is the gradient of the error of example k.
Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially. An epoch consists of one round through all data.

Backprop
Recall that ∇E_k(w^{(t)}) is the vector of all partial derivatives of the form ∂E_k/∂w_{ji}. How to compute ∂E_k/∂w_{ji}?
First, switch from derivatives w.r.t. w_{ji} to derivatives w.r.t. y_j:
- for every w_{ji} where j is in a dense layer (i.e. does not share weights):
  ∂E_k/∂w_{ji} = ∂E_k/∂y_j · σ'_j(ξ_j) · y_i
- for every w_{ji} where j is in a convolutional layer:
  ∂E_k/∂w_{ji} = ∑_{(r,s)∈[ji]} ∂E_k/∂y_r · σ'_r(ξ_r) · y_s
Neurons of pooling layers do not have weights.
Now compute the derivatives w.r.t. y_j:
- for every j ∈ Y:
  ∂E_k/∂y_j = y_j − d_{kj}
  (this holds for the squared error; for other error functions the derivative w.r.t. the outputs will be different),
- for every j ∈ Z \ Y such that j→ is either a dense layer or a convolutional layer:
  ∂E_k/∂y_j = ∑_{r∈j→} ∂E_k/∂y_r · σ'_r(ξ_r) · w_{rj}
- for every j ∈ Z \ Y such that j→ is max-pooling: then j→ = {i} for a single "max" neuron, and we have
  ∂E_k/∂y_j = ∂E_k/∂y_i if j = arg max_{r∈i←} y_r, and 0 otherwise.
I.e. the gradient can be propagated from the output layer downwards as in an MLP.

Convolutional networks – summary
- Convolutional networks are nowadays the most used networks in image processing (and also in other areas where the input has some local, "spatially invariant" properties).
- Typically trained using backpropagation.
- Due to the weight sharing they allow (very) deep architectures.
- Typically extended with more adjustments and tricks in their topologies.
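To close, a sketch of the two convolution-specific backprop rules above: the gradient of a shared weight is summed over all connections in [ji], and max-pooling routes ∂E_k/∂y back only to the arg-max input. `image`, `fmap` and `d_pooled` are assumed arrays matching the forward-pass sketch earlier, and `delta` stands for ∂E_k/∂ξ of the feature-map neurons (already multiplied by σ').

```python
import numpy as np

def shared_weight_grad(image, delta, k=5):
    # dE/dw for a k x k shared kernel: sum over all receptive-field positions (the set [ji])
    grad = np.zeros((k, k))
    h, w = delta.shape
    for r in range(h):
        for c in range(w):
            grad += delta[r, c] * image[r:r+k, c:c+k]
    return grad

def max_pool_backward(fmap, d_pooled, size=2):
    # dE/dy_j equals d_pooled at the arg-max position of each pooling field, 0 elsewhere
    d_fmap = np.zeros_like(fmap)
    for r in range(d_pooled.shape[0]):
        for c in range(d_pooled.shape[1]):
            field = fmap[r*size:(r+1)*size, c*size:(c+1)*size]
            i, j = np.unravel_index(np.argmax(field), field.shape)
            d_fmap[r*size + i, c*size + j] = d_pooled[r, c]
    return d_fmap
```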