MLP training – theory

Architecture – Multilayer Perceptron (MLP)

(Figure: a network with input neurons x_1, x_2, one or more hidden layers, and output neurons y_1, y_2.)

- Neurons are partitioned into layers: one input layer, one output layer, and possibly several hidden layers.
- Layers are numbered from 0; the input layer has number 0. E.g. a three-layer network has two hidden layers and one output layer.
- Every neuron in the ℓ-th layer is connected to all neurons in the (ℓ+1)-st layer.
- The architecture of an MLP is typically described by the numbers of neurons in the individual layers (e.g. 2-4-3-2).

MLP – architecture

Notation:
- Denote by X the set of input neurons, by Y the set of output neurons, and by Z the set of all neurons (X, Y ⊆ Z).
- Individual neurons are denoted by indices i, j, etc.
- ξ_j is the inner potential of neuron j after the computation stops.
- y_j is the output of neuron j after the computation stops (define y_0 = 1 as the value of the formal unit input).
- w_ji is the weight of the connection from i to j (in particular, w_j0 is the weight of the connection from the formal unit input, i.e. w_j0 = −b_j where b_j is the bias of neuron j).
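To make the notation concrete, the weights of, e.g., a 2-4-3-2 network can be held in plain nested lists, one matrix per non-input layer. This is only an illustrative sketch; the helper `init_weights` and the list-of-matrices layout are assumptions of this example, not part of the slides (the small random initial values anticipate the learning algorithm described later):

```python
import random

def init_weights(sizes, seed=0):
    """One weight matrix per non-input layer of an MLP such as [2, 4, 3, 2].

    Entry [j][i] is w_ji; column i = 0 belongs to the formal unit input
    (y_0 = 1), so it stores w_j0 = -b_j, the negated bias of neuron j.
    """
    rng = random.Random(seed)
    # values close to 0, as prescribed for the initial weight vector w(0)
    return [[[rng.uniform(-0.05, 0.05) for _ in range(n_in + 1)]
             for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]

weights = init_weights([2, 4, 3, 2])
print([(len(m), len(m[0])) for m in weights])   # → [(4, 3), (3, 5), (2, 4)]
```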
- j← is the set of all i such that j is adjacent from i (i.e. there is an arc to j from i).
- j→ is the set of all i such that j is adjacent to i (i.e. there is an arc from j to i).

MLP – activity

Activity:
- inner potential of neuron j: ξ_j = ∑_{i∈j←} w_ji · y_i
- activation function σ_j for neuron j (arbitrary differentiable), e.g. the
logistic sigmoid σ_j(ξ) = 1 / (1 + e^(−λ_j ξ)).
- State of a non-input neuron j ∈ Z ∖ X after the computation stops: y_j = σ_j(ξ_j)
  (y_j depends on the configuration w and the input x, so we sometimes write y_j(w, x)).
- The network thus computes a function from ℝ^|X| to ℝ^|Y|.
- Layer-wise computation: first, all input neurons are assigned the values of the input; in the ℓ-th step, all neurons of the ℓ-th layer are evaluated.

MLP – learning

Learning:
- Given a training set T of the form {(x_k, d_k) | k = 1, …, p}. Here every x_k ∈ ℝ^|X| is an input vector and every d_k ∈ ℝ^|Y| is the desired network output. For every j ∈ Y, denote by d_kj the desired output of neuron j for the network input x_k (the vector d_k can be written as (d_kj)_{j∈Y}).
- Error function: E(w) = ∑_{k=1}^p E_k(w), where E_k(w) = ½ ∑_{j∈Y} (y_j(w, x_k) − d_kj)².

MLP – learning algorithm

Batch algorithm (gradient descent):
The algorithm computes a sequence of weight vectors w^(0), w^(1), w^(2), …
- The weights in w^(0) are randomly initialized to values close to 0.
- In step t + 1 (here t = 0, 1, 2, …), the weights w^(t+1) are computed as follows:
    w_ji^(t+1) = w_ji^(t) + Δw_ji^(t),
  where Δw_ji^(t) = −ε(t) · ∂E/∂w_ji(w^(t)) is the weight update of w_ji in step t + 1 and 0 < ε(t) ≤ 1 is the learning rate in step t + 1.
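The layer-wise activity and the per-example error E_k can be sketched as follows. This is an illustration under assumptions: a logistic sigmoid with λ_j = 1 for every neuron, and weights stored as weights[l][j][i] with i = 0 the formal unit; the function names are this example's, not the slides':

```python
import math

def sigmoid(xi):
    # logistic sigmoid with lambda_j = 1
    return 1.0 / (1.0 + math.exp(-xi))

def forward(weights, x):
    """Layer-wise computation: returns the outputs y_j of every layer."""
    ys = [list(x)]                      # step 0: input neurons get the input
    for layer in weights:               # step l: evaluate the l-th layer
        prev = [1.0] + ys[-1]           # prepend the formal unit y_0 = 1
        ys.append([sigmoid(sum(w * y for w, y in zip(row, prev)))
                   for row in layer])   # inner potential, then activation
    return ys

def example_error(weights, x, d):
    """E_k(w) = 1/2 * sum over output neurons of (y_j(w, x_k) - d_kj)^2."""
    y = forward(weights, x)[-1]
    return 0.5 * sum((yj - dj) ** 2 for yj, dj in zip(y, d))

# with all-zero weights every neuron outputs sigmoid(0) = 0.5
w = [[[0.0] * 3] * 2, [[0.0] * 3] * 2]           # a 2-2-2 network
print(example_error(w, [0.0, 1.0], [0.5, 0.5]))  # → 0.0
```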
Note that ∂E/∂w_ji(w^(t)) is a component of the gradient ∇E, i.e. the weight update can be written as w^(t+1) = w^(t) − ε(t) · ∇E(w^(t)).

MLP – error function gradient

For every w_ji we have
    ∂E/∂w_ji = ∑_{k=1}^p ∂E_k/∂w_ji,
where for every k = 1, …, p:
    ∂E_k/∂w_ji = ∂E_k/∂y_j · σ'_j(ξ_j) · y_i
and for every j ∈ Z ∖ X:
    ∂E_k/∂y_j = y_j − d_kj                                  for j ∈ Y,
    ∂E_k/∂y_j = ∑_{r∈j→} ∂E_k/∂y_r · σ'_r(ξ_r) · w_rj       for j ∈ Z ∖ (Y ∪ X).
(Here all y_j are in fact y_j(w, x_k).)
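The backward recursion for ∂E_k/∂y_j can be written down directly. A sketch under assumptions: fully connected layers, logistic sigmoids with λ_r = 1 so that σ'_r(ξ_r) = y_r (1 − y_r), and weights stored as weights[l][j][i] with i = 0 the formal unit; `forward` and `backward` are names chosen for this example:

```python
import math

def forward(weights, x):
    """Layer-wise evaluation with logistic sigmoids (lambda = 1)."""
    ys = [list(x)]
    for layer in weights:
        prev = [1.0] + ys[-1]   # formal unit y_0 = 1
        ys.append([1.0 / (1.0 + math.exp(-sum(w * y for w, y in zip(row, prev))))
                   for row in layer])
    return ys

def backward(weights, ys, d):
    """Return dE_k/dy_j for every non-input layer, output layer last."""
    # output neurons: dE_k/dy_j = y_j - d_kj
    grads = [[yj - dj for yj, dj in zip(ys[-1], d)]]
    # hidden neurons: sum over successors r in j->, using sigma'_r = y_r(1 - y_r)
    for l in range(len(weights) - 1, 0, -1):
        succ_w, succ_y = weights[l], ys[l + 1]
        grads.insert(0, [sum(g * yr * (1.0 - yr) * row[j + 1]  # skip formal unit
                             for g, yr, row in zip(grads[0], succ_y, succ_w))
                         for j in range(len(ys[l]))])
    return grads

w = [[[0.0] * 3] * 2, [[0.0] * 3] * 2]          # a 2-2-2 network, zero weights
g = backward(w, forward(w, [0.0, 1.0]), [1.0, 0.0])
print(g)   # → [[0.0, 0.0], [-0.5, 0.5]]
```

With zero weights every output is 0.5, so the output-layer entries are y_j − d_kj = −0.5 and 0.5, and the hidden entries vanish because all w_rj = 0.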
MLP – error function gradient

If σ_j(ξ) = 1 / (1 + e^(−λ_j ξ)) for all j ∈ Z, then
    σ'_j(ξ_j) = λ_j · y_j · (1 − y_j),
and thus for all j ∈ Z ∖ X:
    ∂E_k/∂y_j = y_j − d_kj                                       for j ∈ Y,
    ∂E_k/∂y_j = ∑_{r∈j→} ∂E_k/∂y_r · λ_r y_r (1 − y_r) · w_rj    for j ∈ Z ∖ (Y ∪ X).

If σ_j(ξ) = a · tanh(b · ξ) for all j ∈ Z, then
    σ'_j(ξ_j) = (b/a) · (a − y_j) · (a + y_j).

MLP – computing the gradient

Compute ∂E/∂w_ji = ∑_{k=1}^p ∂E_k/∂w_ji as follows:
- Initialize E_ji := 0. (By the end of the computation: E_ji = ∂E/∂w_ji.)
- For every k = 1, …, p do:
  1. forward pass: compute y_j = y_j(w, x_k) for all j ∈ Z,
  2. backward pass: compute ∂E_k/∂y_j for all j ∈ Z using backpropagation (see below),
  3.
compute ∂E_k/∂w_ji for all w_ji using ∂E_k/∂w_ji := ∂E_k/∂y_j · σ'_j(ξ_j) · y_i,
  4. E_ji := E_ji + ∂E_k/∂w_ji.
The resulting E_ji equals ∂E/∂w_ji.

MLP – backpropagation

Compute ∂E_k/∂y_j for all j ∈ Z as follows:
- if j ∈ Y, then ∂E_k/∂y_j = y_j − d_kj,
- if j ∈ Z ∖ (Y ∪ X), then, assuming that j is in the ℓ-th layer and that ∂E_k/∂y_r has already been computed for all neurons of the (ℓ+1)-st layer, compute
    ∂E_k/∂y_j = ∑_{r∈j→} ∂E_k/∂y_r · σ'_r(ξ_r) · w_rj.
  (This works because all neurons r ∈ j→ belong to the (ℓ+1)-st layer.)

Complexity of the batch algorithm

The computation of ∂E/∂w_ji(w^(t−1)) stops in time linear in the size of the network plus the size of the training set (assuming unit cost of operations, including the computation of σ'_r(ξ_r) for a given ξ_r).

Proof sketch: The algorithm does the following p times:
1. forward pass, i.e.
computes y_j(w, x_k),
2. backpropagation, i.e. computes ∂E_k/∂y_j,
3. computes ∂E_k/∂w_ji and adds it to E_ji (a constant-time operation in the unit-cost framework).
Steps 1–3 take linear time.
Note that the speed of convergence of the gradient descent cannot be estimated ...
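Putting the pieces together, the whole gradient computation (forward pass, backpropagation, accumulation of E_ji over all p examples) fits in a short routine; each example costs time linear in the number of weights, in line with the complexity bound. A sketch under the same assumptions as before: logistic sigmoids with λ = 1, a weights[l][j][i] layout with i = 0 the formal unit, and names chosen for this example:

```python
import math

def forward(weights, x):
    ys = [list(x)]
    for layer in weights:
        prev = [1.0] + ys[-1]
        ys.append([1.0 / (1.0 + math.exp(-sum(w * y for w, y in zip(row, prev))))
                   for row in layer])
    return ys

def gradient(weights, data):
    """Accumulate E_ji = dE/dw_ji over the whole training set."""
    grads = [[[0.0] * len(row) for row in layer] for layer in weights]
    for x, d in data:
        ys = forward(weights, x)                       # 1. forward pass
        dEdy = [yj - dj for yj, dj in zip(ys[-1], d)]  # 2. backward pass
        for l in range(len(weights) - 1, -1, -1):
            prev = [1.0] + ys[l]
            for j in range(len(weights[l])):
                # dE_k/dw_ji = dE_k/dy_j * sigma'_j(xi_j) * y_i
                delta = dEdy[j] * ys[l + 1][j] * (1.0 - ys[l + 1][j])
                for i, yi in enumerate(prev):
                    grads[l][j][i] += delta * yi       # 3.-4. accumulate E_ji
            if l > 0:                                  # propagate dE_k/dy_j down
                dEdy = [sum(dEdy[r] * ys[l + 1][r] * (1.0 - ys[l + 1][r]) * row[j + 1]
                            for r, row in enumerate(weights[l]))
                        for j in range(len(ys[l]))]
    return grads

w = [[[0.1, 0.2, -0.1], [0.0, -0.2, 0.1]], [[0.05, 0.1, -0.1]]]   # a 2-2-1 net
data = [([0.0, 1.0], [1.0]), ([1.0, 0.0], [0.0])]
g = gradient(w, data)
print(len(g), len(g[0]), len(g[0][0]))   # → 2 2 3
```

One sanity check worth doing on any such implementation is comparing the result against a finite-difference approximation of ∂E/∂w_ji.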
Illustration of the gradient descent – XOR

(Figure omitted. Source: Pattern Classification, 2nd Edition; Richard O. Duda, Peter E. Hart, David G. Stork.)

MLP – learning algorithm

Online algorithm:
The algorithm computes a sequence of weight vectors w^(0), w^(1), w^(2), …
- The weights in w^(0) are randomly initialized to values close to 0.
- In step t + 1 (here t = 0, 1, 2, …), the weights w^(t+1) are computed as follows:
    w_ji^(t+1) = w_ji^(t) + Δw_ji^(t),
  where Δw_ji^(t) = −ε(t) · ∂E_k/∂w_ji(w^(t)) is the weight update of w_ji in step t + 1 and 0 < ε(t) ≤ 1 is the learning rate in step t + 1.
- There are other variants, determined by the selection of the training examples used for the error computation (more on this later).

SGD

- The weights in w^(0) are randomly initialized to values close to 0.
- In step t + 1 (here t = 0, 1, 2, …), the weights w^(t+1) are computed as follows:
  - choose (randomly) a set of training examples T ⊆ {1, …, p},
  - compute w^(t+1) = w^(t) + Δw^(t), where Δw^(t) = −ε(t) · ∑_{k∈T} ∇E_k(w^(t));
  - here 0 < ε(t) ≤ 1 is the learning rate in step t + 1 and ∇E_k(w^(t)) is the gradient of the error of the example k.
- Note that the random choice of the minibatch is typically implemented by randomly shuffling all data and then choosing minibatches sequentially.
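The shuffle-then-slice minibatch scheme just described can be sketched as below. To keep the loop readable, the "network" here is a single linear neuron with squared error, so ∇E_k has a one-line closed form; the function `sgd`, its parameters and the toy data are this example's own, not the slides':

```python
import random

def sgd(data, dim, minibatch=2, epochs=500, eps=0.1, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(-0.05, 0.05) for _ in range(dim)]  # w(0) close to 0
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)                   # shuffle all data once per epoch...
        for s in range(0, len(order), minibatch):
            T = order[s:s + minibatch]       # ...then take minibatches sequentially
            step = [0.0] * dim
            for k in T:                      # sum of grad E_k over k in T
                x, d = data[k]
                err = sum(wi * xi for wi, xi in zip(w, x)) - d
                for i, xi in enumerate(x):   # grad of E_k = 1/2 (w.x_k - d_k)^2
                    step[i] += err * xi
            w = [wi - eps * gi for wi, gi in zip(w, step)]   # w(t+1)
    return w

# four consistent examples of d = 2*x1 - x2; SGD recovers the coefficients
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
print([round(wi, 3) for wi in sgd(data, 2)])   # → [2.0, -1.0]
```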