Convolutional networks

Convolutional layers

- Every neuron is connected to a (typically small) receptive field of neurons in the lower layer.
- The neuron itself is "standard": it computes a weighted sum of its inputs and applies an activation function.
- Neurons are grouped into feature maps sharing weights.
- Each feature map represents a property of the input that is supposed to be spatially invariant. Typically, we consider several feature maps in a single layer.

Pooling layers

Neurons in a pooling layer compute simple functions of their receptive fields (the fields are typically disjoint):
- max-pooling: maximum of the inputs
- L2-pooling: square root of the sum of squares
- average-pooling: mean
- ...

Convolutional networks – architecture

Neurons are organized in layers L0, L1, ..., Ln, with connections (typically) only from Lm to Lm+1. Several types of layers:
- input layer L0
- dense layer Lm: each neuron of Lm is connected with each neuron of Lm−1
- convolutional & pooling layer Lm, which contains two sub-layers:
  - convolutional layer: neurons are organized into disjoint feature maps; all neurons of a given feature map share weights (but have different inputs)
  - pooling layer: each (convolutional) feature map F has a corresponding pooling map P; the neurons of P have inputs only from F (typically few of them), compute a simple aggregate function (such as max), and have disjoint inputs

Notation:
- X is the set of input neurons
- Y is the set of output neurons
- Z is the set of all neurons (X, Y ⊆ Z)
- individual neurons are denoted by indices i, j, etc.
- ξj is the inner potential of neuron j after the computation stops
- yj is the output of neuron j after the computation stops (we define y0 = 1 as the value of the formal unit input)
- wji is the weight of the connection from i to j (in particular, wj0 is the weight of the connection from the formal unit input, i.e. wj0 = −bj where bj is the bias of neuron j)
- j← is the set of all i such that j is adjacent from i (i.e. there is an arc to j from i)
- j→ is the set of all i such that j is adjacent to i (i.e. there is an arc from j to i)
- jshare is the set of neurons sharing weights with j, i.e. the neurons that belong to the same feature map as j

Convolutional networks – activity

Neurons of dense and convolutional layers:
- inner potential of neuron j: ξj = ∑_{i∈j←} wji yi
- activation function σj of neuron j (arbitrary differentiable): yj = σj(ξj)

Neurons of pooling layers apply the "pooling" function:
- max-pooling: yj = max_{i∈j←} yi
- avg-pooling: yj = (∑_{i∈j←} yi) / |j←|

A convolutional network is evaluated layer-wise (as an MLP); for each j ∈ Y, yj(w, x) is the value of the output neuron j after evaluating the network with weights w and input x.

Convolutional networks – learning

Learning: given a training set T of the form (xk, dk), k = 1, ..., p. Here, every xk ∈ R^|X| is an input vector and every dk ∈ R^|Y| is the desired network output. For every j ∈ Y, denote by dkj the desired output of neuron j for the network input xk (the vector dk can be written as (dkj)_{j∈Y}).

Error function – mean squared error (for example):

E(w) = (1/p) ∑_{k=1}^{p} Ek(w)   where   Ek(w) = (1/2) ∑_{j∈Y} (yj(w, xk) − dkj)²

Convolutional networks – SGD

The algorithm computes a sequence of weight vectors w(0), w(1), w(2), ....
- The weights in w(0) are randomly initialized to values close to 0.
- In step t + 1 (here t = 0, 1, 2, ...), the weights w(t+1) are computed as follows:
  - Choose (randomly) a set of training examples T ⊆ {1, ..., p}.
  - Compute w(t+1) = w(t) + ∆w(t) where ∆w(t) = −ε(t) · (1/|T|) ∑_{k∈T} ∇Ek(w(t)).

Here T is a minibatch (of a fixed size), 0 < ε(t) ≤ 1 is the learning rate in step t + 1, and ∇Ek(w(t)) is the gradient of the error of example k.

Note that the random choice of the minibatch is typically implemented by randomly shuffling all the data and then choosing the minibatches sequentially.
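The minibatch SGD scheme above can be sketched in a few lines of NumPy. This is a minimal sketch, not the lecture's implementation: the helper name `grad_Ek` and the toy single-neuron example are illustrative assumptions.

```python
import numpy as np

def sgd(grad_Ek, w0, data, epochs=10, batch_size=2, eps=0.1, rng=None):
    """Minibatch SGD as on the slide: w(t+1) = w(t) - eps * mean of per-example gradients.

    grad_Ek(w, x, d) returns the gradient of the error of one example;
    an epoch is one pass through the randomly shuffled data.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.array(w0, dtype=float)
    idx = np.arange(len(data))
    for _ in range(epochs):
        rng.shuffle(idx)                          # shuffle all data once per epoch ...
        for s in range(0, len(idx), batch_size):  # ... then take minibatches sequentially
            batch = idx[s:s + batch_size]
            g = np.mean([grad_Ek(w, *data[k]) for k in batch], axis=0)
            w -= eps * g                          # delta w = -eps * averaged gradient
    return w

# Toy example (illustrative): a single linear neuron with error Ek = 1/2 (w.x - d)^2,
# hence grad Ek = (w.x - d) * x; the targets follow d = 2x + 1 exactly.
grad = lambda w, x, d: (w @ x - d) * x
data = [(np.array([x, 1.0]), 2.0 * x + 1.0) for x in np.linspace(-1, 1, 8)]
w = sgd(grad, [0.0, 0.0], data, epochs=200)
```

Since the toy targets are exactly linear, the learned weight vector approaches (2, 1), the slope and bias of the generating line.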
An epoch consists of one round through all the data.

Backprop

Recall that ∇Ek(w(t)) is the vector of all partial derivatives of the form ∂Ek/∂wji. How to compute ∂Ek/∂wji?

First, switch from derivatives w.r.t. wji to derivatives w.r.t. yj:
- For every wji where j is in a dense layer, i.e. does not share weights:
  ∂Ek/∂wji = ∂Ek/∂yj · σ'j(ξj) · yi
- For every wji where j is in a convolutional layer, i.e. shares wji with the neurons of jshare:
  ∂Ek/∂wji = ∑_{r∈jshare} ∂Ek/∂yr · σ'r(ξr) · y_{i(r)}
  where i(r) is the input neuron of r connected to r by the weight shared with wji.
- Neurons of pooling layers have no weights.

Now compute the derivatives w.r.t. yj:
- for every j ∈ Y:
  ∂Ek/∂yj = yj − dkj
  This holds for the mean squared error; for other error functions the derivative w.r.t. the outputs will be different.
- for every j ∈ Z∖Y such that j→ lies in a dense or a convolutional layer:
  ∂Ek/∂yj = ∑_{r∈j→} ∂Ek/∂yr · σ'r(ξr) · wrj
- for every j ∈ Z∖Y such that j→ lies in a max-pooling layer: then j→ = {i} for a single "max" neuron i, and
  ∂Ek/∂yj = ∂Ek/∂yi if j = argmax_{r∈i←} yr, and 0 otherwise.

I.e. the gradient can be propagated from the output layer downwards as in an MLP.

Convolutional networks – conclusions

- Convolutional networks are nowadays the most used networks in image processing (and also in other areas where the input has some local, "spatially" invariant properties).
- They are typically trained using backpropagation.
- Due to the weight sharing they allow (very) deep architectures.
- They are typically extended with more adjustments and tricks in their topologies.

Recurrent networks – Hopfield network

Auto-associative network: given an input, the network outputs a training example (encoded in its weights) "similar" to the given input.

Hopfield network

Architecture:
- complete topology, i.e. the output of each neuron is an input to all neurons
- all neurons are both input and output
- denote by ξ1, ..., ξn the inner potentials and by y1, ..., yn the outputs (states) of the individual neurons
- denote by wji the weight of the connection from neuron i ∈ {1, ..., n} to neuron j ∈ {1, ..., n}
- assume wjj = 0 for every j = 1, ..., n
- for now: no neuron has a bias

Hopfield network – learning

Training set T = {xk | xk = (xk1, ..., xkn) ∈ {−1, 1}^n, k = 1, ..., p}. The goal is to "store" the training examples of T so that the network is able to associate similar examples.

Hebb's learning rule: If the inputs to a system cause the same pattern of activity to occur repeatedly, the set of active elements constituting that pattern will become increasingly strongly interassociated. That is, each element will tend to turn on every other element and (with negative weights) to turn off the elements that do not form part of the pattern. To put it another way, the pattern as a whole will become "auto-associated".

Mathematically speaking:

wji = ∑_{k=1}^{p} xkj xki   for 1 ≤ j ≠ i ≤ n

Intuition: "Neurons that fire together, wire together."

Note that wji = wij, i.e. the weight matrix is symmetric. Learning can be seen as a poll about the equality of inputs:
- if xkj = xki, the training example votes for "i equals j" by adding one to wji;
- if xkj ≠ xki, the training example votes for "i does not equal j" by subtracting one from wji.

Hopfield network – activity

Initially, the neurons are set to the network input x = (x1, ..., xn), thus y(0)_j = xj for every j = 1, ..., n. Then the states of the neurons are updated cyclically: in step t + 1, the neuron j with j = (t mod n) + 1 is updated (note: cycling over the n neurons, not over the p training examples) as follows:
- compute the inner potential ξ(t)_j = ∑_{i=1}^{n} wji y(t)_i
- then set y(t+1)_j = 1 if ξ(t)_j > 0, y(t+1)_j = y(t)_j if ξ(t)_j = 0, and y(t+1)_j = −1 if ξ(t)_j < 0;
  all the other neurons keep their states.

Hopfield network – activity

The computation stops in a step t* if the network is for the first time in a stable state, i.e.

y(t*+n)_j = y(t*)_j   (j = 1, ..., n)

Theorem. Assuming symmetric weights, the computation of a Hopfield network always stops for every input.
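Hebb's rule and the cyclic update can be sketched as follows. This is a minimal sketch, assuming the slide's conventions (states in {−1, 1}, zero diagonal, no biases); the function names are illustrative.

```python
import numpy as np

def hebb(patterns):
    """Hebb's rule: w_ji = sum_k x_kj * x_ki for j != i, with w_jj = 0."""
    X = np.array(patterns)
    W = X.T @ X
    np.fill_diagonal(W, 0)
    return W

def run(W, x, max_steps=10_000):
    """Cyclic asynchronous updates; stop once the state survives a full round unchanged."""
    y = np.array(x)
    n = len(y)
    unchanged, t = 0, 0
    while unchanged < n and t < max_steps:
        j = t % n                     # update neuron j = (t mod n) + 1 (0-based here)
        xi = W[j] @ y                 # inner potential
        new = 1 if xi > 0 else (-1 if xi < 0 else y[j])  # keep the state when xi == 0
        unchanged = unchanged + 1 if new == y[j] else 0
        y[j] = new
        t += 1
    return y

W = hebb([[1, -1, 1]])                # the three-neuron example from the slides
recalled = run(W, [1, -1, -1])        # input = stored pattern with one flipped bit
```

On the flipped input the network recovers the stored pattern (1, −1, 1), illustrating the auto-associative behaviour.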
This implies that a given Hopfield network computes a function from {−1, 1}^n to {−1, 1}^n (determined by its weights). Denote by y(W, x) = (y(t*)_1, ..., y(t*)_n) the value of the network for a given input x and weight matrix W, and by yj(W, x) = y(t*)_j the component of this value corresponding to neuron j. If W is clear from the context, we write only y(x) and yj(x).

Ising model – an analogy

Simple models of magnetic materials resemble the Hopfield network:
- atomic magnets are organized into a square lattice
- each magnet may have only one of two possible orientations (in the Hopfield network +1 and −1)
- the orientation of each magnet is influenced by an external magnetic field (the input of the network) as well as by the orientations of the other magnets
- the weights in the Hopfield model determine the interaction among the magnets

Energy function

The energy function E assigns to every state y ∈ {−1, 1}^n a (potential) energy:

E(y) = −(1/2) ∑_{j=1}^{n} ∑_{i=1}^{n} wji yj yi

States with low energy are stable (few neurons "want to" change their states); states with high energy are not stable. I.e. a pair with a large (positive) product wji yj yi contributes to stability, and a pair with a small (negative) product wji yj yi contributes to instability. The energy does not increase during the computation: E(y(t)) ≥ E(y(t+1)), and the stable states y(t*) correspond to local minima of E.

Energy landscape (figure)

Hopfield – example

A Hopfield network with three neurons trained on the single training example (1, −1, 1) using Hebb's rule; the resulting weights are w12 = −1, w23 = −1, w13 = 1. Note that (−1, 1, −1) has also been "stored" into the network.

y1  y2  y3 |  E
 1   1   1 |  1
 1   1  −1 |  1
 1  −1   1 | −3
 1  −1  −1 |  1
−1   1   1 |  1
−1   1  −1 | −3
−1  −1   1 |  1
−1  −1  −1 |  1

Hopfield network – convergence

Observe that the energy does not increase during the computation: E(y(t)) ≥ E(y(t+1)), and if the state changes in step t + 1, then E(y(t)) > E(y(t+1)). There are only finitely many states, and thus, eventually, a local minimum of E is reached. This proves that the computation of a Hopfield network always stops.
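The energies in the three-neuron example can be checked directly from the definition. A minimal sketch; the weight matrix below is the one obtained from Hebb's rule on the single pattern (1, −1, 1).

```python
import numpy as np

def energy(W, y):
    """E(y) = -1/2 * sum_j sum_i w_ji * y_j * y_i."""
    y = np.asarray(y)
    return -0.5 * y @ W @ y

# Weights of the three-neuron example: w12 = -1, w23 = -1, w13 = 1, zero diagonal.
W = np.array([[ 0, -1,  1],
              [-1,  0, -1],
              [ 1, -1,  0]])

e_stored = energy(W, [1, -1, 1])    # the stored pattern ...
e_mirror = energy(W, [-1, 1, -1])   # ... and its negation share the lowest energy
```

Both the stored pattern and its negation evaluate to E = −3, the minimum in the table, while every other state has E = 1.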
Hopfield network – phantoms

The energy function E may have local minima that do not correspond to training examples (so-called phantoms). Phantoms can be "unlearned", e.g. using the following rule: given a phantom (x1, ..., xn) ∈ {−1, 1}^n and weights wji, the new weights w'ji are computed by

w'ji = wji − xj xi

(i.e. similar to Hebb's rule but with the opposite sign).

Reproduction – statistical analysis

The capacity of a Hopfield network is defined as the ratio p/n of the number of training examples the network is able to store to the number of neurons. Assume that the training examples are chosen randomly: each component of xk is set to 1 with probability 1/2 and to −1 with probability 1/2. Consider a configuration W obtained by learning using Hebb's rule, and denote

β = P(xk = y(W, xk) for k = 1, ..., p)

Then for n → ∞ and p ≤ n/(4 log n) we have β → 1. I.e. the maximum number of examples that can be effectively stored in a Hopfield network is proportional to n/(4 log n).

Hopfield network – example

Figures of size 12 × 10 pixels (120 neurons, −1 is white and 1 is black); the network has learned 8 figures. The input is generated with 25% noise; the images show the activity of the Hopfield network. (figures)
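The unlearning rule can be sketched as follows. A minimal sketch under assumed names; for illustration only, the mirrored pattern of the three-neuron example is treated as the unwanted minimum to be removed.

```python
import numpy as np

def energy(W, y):
    y = np.asarray(y)
    return -0.5 * y @ W @ y

def unlearn(W, phantom):
    """Unlearning: w'_ji = w_ji - x_j * x_i for j != i (Hebb's rule, opposite sign)."""
    x = np.asarray(phantom)
    W2 = W - np.outer(x, x)
    np.fill_diagonal(W2, 0)   # keep the zero diagonal
    return W2

# Three-neuron example network; (-1, 1, -1) plays the role of the unwanted minimum here.
W = np.array([[ 0, -1,  1],
              [-1,  0, -1],
              [ 1, -1,  0]])
phantom = (-1, 1, -1)
W2 = unlearn(W, phantom)
```

After unlearning, the phantom's energy is strictly higher than before (it rises by (n² − n)/2), so it is no longer a deep minimum; the weight matrix stays symmetric. Note that in this tiny example unlearning the mirrored pattern also erases the stored pattern itself, since the two are negations of each other.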