Hopfield network – local minima
We look for "deep" minima of E, but we may get stuck in a shallow minimum.
Solution: In every state we allow transitions to states with higher
energy. Such a transition has a small probability (which is higher at
the beginning and decreases throughout the computation).
Boltzmann activity
Activity: States of neurons initially set to values of {−1, 1}, i.e.,
$y_j^{(0)} \in \{-1, 1\}$ for $j \in \{1, \dots, n\}$.
In the step t + 1 update the value of a randomly chosen neuron
$j \in \{1, \dots, n\}$ as follows: Compute the inner potential
$$\xi_j^{(t)} = \sum_{i=1}^{n} w_{ji} y_i^{(t)}$$
and choose $y_j^{(t+1)} \in \{-1, 1\}$ randomly so that
$$P\left[y_j^{(t+1)} = 1\right] = \sigma\left(\xi_j^{(t)}\right)$$
where
$$\sigma(\xi) = \frac{1}{1 + e^{-2\xi/T(t)}}$$
The parameter T(t) is called the temperature at time t.
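The update rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `sigma` and `boltzmann_step`, the 3-neuron weight matrix and the fixed seed are made up for the example, and the weights are assumed symmetric with a zero diagonal.

```python
import math
import random

def sigma(xi, temperature):
    """P[y_j = 1] for inner potential xi, written to avoid overflow in exp."""
    z = 2.0 * xi / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def boltzmann_step(weights, state, temperature, rng):
    """Update one uniformly chosen neuron in place and return its index."""
    n = len(state)
    j = rng.randrange(n)
    xi = sum(weights[j][i] * state[i] for i in range(n))  # inner potential
    state[j] = 1 if rng.random() < sigma(xi, temperature) else -1
    return j

rng = random.Random(0)  # fixed seed, only to make the run repeatable
weights = [[0.0, 1.0, -0.5],
           [1.0, 0.0, 0.3],
           [-0.5, 0.3, 0.0]]  # illustrative symmetric weights
state = [rng.choice((-1, 1)) for _ in range(3)]
for _ in range(10):
    boltzmann_step(weights, state, temperature=1.0, rng=rng)
print(state)
```

The two-branch form of `sigma` is a numerical safeguard chosen for this sketch; mathematically both branches equal 1 / (1 + e^(−2ξ/T)).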
Temperature and energy
High temperature T(t) implies that $P\left[y_j^{(t+1)} = 1\right] \approx \frac{1}{2}$ and thus
the network behaves almost randomly.
Very low temperature T(t) implies that either $P\left[y_j^{(t+1)} = 1\right] \approx 1$
or $P\left[y_j^{(t+1)} = 1\right] \approx 0$ depending on whether $\xi_j^{(t)} > 0$ or $\xi_j^{(t)} < 0$.
Thus the network behaves almost deterministically (as in the
original activity of the Hopfield network).
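The two temperature regimes can be checked numerically. The snippet below is a small sketch (the function name `prob_plus_one` and the chosen temperatures are assumptions for illustration) that evaluates P[y_j = 1] for ξ = −1 and ξ = 1 at a high, a moderate and a very low temperature.

```python
import math

def prob_plus_one(xi, temperature):
    """sigma(xi) = 1 / (1 + exp(-2 xi / T)), in an overflow-safe form."""
    z = 2.0 * xi / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

for temperature in (100.0, 1.0, 0.01):
    row = [prob_plus_one(xi, temperature) for xi in (-1.0, 1.0)]
    print(temperature, row)
```

At T = 100 both probabilities are close to 1/2, while at T = 0.01 they are practically 0 and 1, i.e. the sign of ξ decides the new state.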
Notes:
Boltzmann activity = Hopfield activity + random noise,
energy $E(y) = -\frac{1}{2} \sum_{j=1}^{n} \sum_{i=1}^{n} w_{ji} y_j y_i$ may jump to higher levels
(with probability depending on the temperature),
the probability of transition to higher energy decreases
exponentially with the size of the "energy jump".
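The last note can be made concrete. Assuming symmetric weights with a zero diagonal, setting neuron j to +1 instead of −1 changes the energy by ΔE = −2ξ_j, so P[y_j = 1] = 1 / (1 + e^(ΔE/T)), which is at most e^(−ΔE/T) for an uphill move. The sketch below checks this identity on a made-up 3-neuron example; all names and numbers are illustrative assumptions.

```python
import math

def energy(weights, state):
    """E(y) = -1/2 * sum_j sum_i w_ji y_j y_i."""
    n = len(state)
    return -0.5 * sum(weights[j][i] * state[j] * state[i]
                      for j in range(n) for i in range(n))

def prob_plus_one(xi, temperature):
    return 1.0 / (1.0 + math.exp(-2.0 * xi / temperature))

weights = [[0.0, 1.0, -0.5],
           [1.0, 0.0, 0.3],
           [-0.5, 0.3, 0.0]]  # symmetric, zero diagonal (assumed)
state = [1, 1, -1]
j, temperature = 2, 1.0

xi = sum(weights[j][i] * state[i] for i in range(3))  # inner potential of j
up = state[:j] + [1] + state[j + 1:]     # neuron j set to +1
down = state[:j] + [-1] + state[j + 1:]  # neuron j set to -1
delta_e = energy(weights, up) - energy(weights, down)

p_up = prob_plus_one(xi, temperature)
print(delta_e, -2.0 * xi)                                   # the two agree
print(p_up, 1.0 / (1.0 + math.exp(delta_e / temperature)))  # the two agree
print(p_up <= math.exp(-delta_e / temperature))             # exponential bound
```

Here moving neuron 2 to +1 raises the energy, so the move is still possible but its probability is bounded by e^(−ΔE/T).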
Simulated annealing
The following approach may help to reach deep minima of E:
Start with a high temperature T(0).
Gradually reduce the temperature, e.g. as follows:
$T(t) = \eta^t \cdot T(0)$ where $\eta < 1$ is close to 1
or $T(t) = T(0)/\log(1 + t)$
This process resembles annealing used in metallurgy that alters
the physical and sometimes chemical properties of a material to
increase its ductility and reduce its hardness.
It also extends the physical motivation of Hopfield networks: magnet
orientation is now, in addition, influenced by thermal fluctuations.
... and it gets us close to Boltzmann machines.
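A minimal sketch of simulated annealing with the two cooling schedules from this slide follows. The function names, the floor `t_min` (a safeguard against division by a vanishing temperature), the 4-neuron network with all weights equal to 1, and the seed are assumptions introduced for the example only.

```python
import math
import random

def geometric_schedule(t, t0, eta=0.99):
    """T(t) = eta^t * T(0)."""
    return (eta ** t) * t0

def logarithmic_schedule(t, t0):
    """T(t) = T(0) / log(1 + t); undefined at t = 0, so use it for t >= 1."""
    return t0 / math.log(1.0 + t)

def energy(weights, state):
    n = len(state)
    return -0.5 * sum(weights[j][i] * state[j] * state[i]
                      for j in range(n) for i in range(n))

def anneal(weights, state, steps, t0, schedule, rng, t_min=1e-9):
    """Run Boltzmann updates while the temperature follows `schedule`."""
    n = len(state)
    for t in range(1, steps + 1):
        temperature = max(schedule(t, t0), t_min)  # t_min: safeguard added here
        j = rng.randrange(n)
        xi = sum(weights[j][i] * state[i] for i in range(n))
        z = 2.0 * xi / temperature
        if z >= 0:
            p = 1.0 / (1.0 + math.exp(-z))
        else:
            p = math.exp(z) / (1.0 + math.exp(z))
        state[j] = 1 if rng.random() < p else -1
    return state

rng = random.Random(1)
n = 4
weights = [[0.0 if i == j else 1.0 for i in range(n)] for j in range(n)]
state = [rng.choice((-1, 1)) for _ in range(n)]
final = anneal(weights, state, steps=2000, t0=10.0,
               schedule=geometric_schedule, rng=rng)
print(final, energy(weights, final))
```

For this toy network the deepest minima are the two states in which all neurons agree; once the temperature is very low the updates are practically deterministic and the run settles in one of them.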
Boltzmann machine
Architecture:
Neural network with cycles and symmetric connections
(i.e. arbitrary graph)
N is the set of all neurons.
Denote by $\xi_j$ the inner potential and by $y_j$ the output (i.e.
state) of neuron j.
State of the machine: $y \in \{-1, 1\}^{|N|}$.
Denote by $w_{ji} \in \mathbb{R}$ the weight of the connection from i to j
(and thus also from j to i).
There is no bias, and we assume $w_{jj} = 0$ for all $j \in N$.
Boltzmann machine
Activity: States of neurons initially set to values of {−1, 1}, i.e.
$y_j^{(0)} \in \{-1, 1\}$ for $j \in N$.
In the step t + 1 do the following:
Choose a neuron $j \in N$ uniformly at random.
Compute the inner potential of j:
$$\xi_j^{(t)} = \sum_{i \in j_\leftarrow} w_{ji} y_i^{(t)}$$
(the sum ranges over the neurons i connected to j).
Choose $y_j^{(t+1)} \in \{-1, 1\}$ randomly so that
$$P\left[y_j^{(t+1)} = 1\right] = \sigma\left(\xi_j^{(t)}\right)$$
where
$$\sigma(\xi) = \frac{1}{1 + e^{-2\xi/T(t)}}$$
(T(t) is the temperature at time t.)
Boltzmann machine
High temperature T(t) implies that $P\left[y_j^{(t+1)} = 1\right] \approx \frac{1}{2}$ and
thus the machine behaves almost randomly.
Low temperature T(t) means that either $P\left[y_j^{(t+1)} = 1\right] \approx 1$
or $P\left[y_j^{(t+1)} = 1\right] \approx 0$ depending on whether $\xi_j^{(t)} > 0$ or
$\xi_j^{(t)} < 0$. Then the machine behaves almost
deterministically (as the Hopfield network).
Boltzmann machine represents probability
Goal: Construct a network representing a distribution on the set
of vectors $\{-1, 1\}^{|N|}$.
Rough idea: A Boltzmann machine has states in $\{-1, 1\}^{|N|}$ and
moves randomly from state to state during computation.
If we let the machine run for a sufficiently long time (with a fixed
temperature), the relative frequencies of visits to states will be
independent of the initial state.
We consider these frequencies as probabilities of the states.
This gives a probability distribution on $\{-1, 1\}^{|N|}$ represented by
the machine.
During learning, a probability distribution on states of $\{-1, 1\}^{|N|}$
will be given, and we adapt weights so that the frequencies
match the given probabilities.
Equilibrium
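Fix a temperature T (i.e. T(t) = T for t = 1, 2, . . .).
Theorem
For every $\gamma^* \in \{-1, 1\}^{|N|}$ we have that
$$\lim_{t \to \infty} P\left[y^{(t)} = \gamma^*\right] = \frac{1}{Z} e^{-E(\gamma^*)/T}$$
where
$$Z = \sum_{\gamma \in \{-1,1\}^{|N|}} e^{-E(\gamma)/T}, \qquad E(\gamma) = -\frac{1}{2} \sum_{i,j} w_{ij} y_i^{\gamma} y_j^{\gamma}$$
This is the Boltzmann distribution.
Define $p_N(\gamma^*) := \lim_{t \to \infty} P\left[y^{(t)} = \gamma^*\right]$ for every $\gamma^* \in \{-1, 1\}^{|N|}$.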
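For a very small machine the Boltzmann distribution of the theorem can be computed directly by enumerating all states. The sketch below does this for a hypothetical 2-neuron machine; the function names and the weight value are assumptions for illustration, and the enumeration is exponential in |N|, so it is only usable for tiny examples.

```python
import itertools
import math

def energy(weights, state):
    """E(gamma) = -1/2 * sum_{i,j} w_ij y_i y_j."""
    n = len(state)
    return -0.5 * sum(weights[i][j] * state[i] * state[j]
                      for i in range(n) for j in range(n))

def boltzmann_distribution(weights, temperature):
    """Return {state: exp(-E(state)/T) / Z} over all states in {-1, 1}^n."""
    n = len(weights)
    states = list(itertools.product((-1, 1), repeat=n))
    unnormalised = [math.exp(-energy(weights, s) / temperature) for s in states]
    z = sum(unnormalised)  # partition function Z
    return {s: u / z for s, u in zip(states, unnormalised)}

weights = [[0.0, 1.0],
           [1.0, 0.0]]  # one symmetric connection of weight 1 (assumed)
p_n = boltzmann_distribution(weights, temperature=1.0)
for s, p in sorted(p_n.items()):
    print(s, p)
```

With a positive weight the two aligned states have lower energy and therefore a higher equilibrium probability than the two anti-aligned ones.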
Equilibrium probabilities
Note that
$p_N$ is a probability distribution on $\{-1, 1\}^{|N|}$ represented by
the machine,
for a state $\gamma^*$, we have that $p_N(\gamma^*)$ is the probability of $\gamma^*$ in
the thermal equilibrium,
$p_N(\gamma^*)$ can be estimated by $P\left[y^{(t^*)} = \gamma^*\right]$ for sufficiently
large $t^*$.
That is, in order to estimate $p_N(\gamma^*)$ it is sufficient to simulate the
computation several times for $t^*$ steps and then compute the relative
frequency of stopping in $\gamma^*$.
By the theory of Markov chains, $p_N(\gamma^*)$ is also the long-run frequency
of visits to $\gamma^*$.
This gives an alternative procedure for estimating $p_N(\gamma^*)$: run the
machine for a very long time and compute the relative frequency of visits to $\gamma^*$
along the computation.
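The second procedure (long-run visit frequencies) can be compared against the exact Boltzmann distribution on a tiny example. The following is a sketch under assumptions: a 2-neuron machine with one weight equal to 1, T = 1, a fixed seed and 200000 steps are all arbitrary choices made for illustration.

```python
import itertools
import math
import random
from collections import Counter

def energy(weights, state):
    n = len(state)
    return -0.5 * sum(weights[i][j] * state[i] * state[j]
                      for i in range(n) for j in range(n))

def exact_distribution(weights, temperature):
    states = list(itertools.product((-1, 1), repeat=len(weights)))
    unnormalised = [math.exp(-energy(weights, s) / temperature) for s in states]
    z = sum(unnormalised)
    return {s: u / z for s, u in zip(states, unnormalised)}

def visit_frequencies(weights, temperature, steps, rng):
    """Run the machine at a fixed temperature and count visits to each state."""
    n = len(weights)
    state = [rng.choice((-1, 1)) for _ in range(n)]
    visits = Counter()
    for _ in range(steps):
        j = rng.randrange(n)
        xi = sum(weights[j][i] * state[i] for i in range(n))
        p = 1.0 / (1.0 + math.exp(-2.0 * xi / temperature))
        state[j] = 1 if rng.random() < p else -1
        visits[tuple(state)] += 1
    return {s: c / steps for s, c in visits.items()}

weights = [[0.0, 1.0],
           [1.0, 0.0]]
exact = exact_distribution(weights, temperature=1.0)
freq = visit_frequencies(weights, temperature=1.0, steps=200000,
                         rng=random.Random(0))
for s in sorted(exact):
    print(s, exact[s], freq.get(s, 0.0))
```

The estimates are random, so they agree with the exact values only approximately; longer runs tighten the agreement.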
Boltzmann machine – learning
To be able to capture more probability distributions, we
introduce hidden neurons.
Divide N into two disjoint sets:
visible neurons V
hidden neurons H
For $\alpha \in \{-1, 1\}^{|V|}$ denote
$$p_V(\alpha) = \sum_{\beta \in \{-1,1\}^{|H|}} p_N(\alpha, \beta)$$
the probability that the state of visible neurons in the thermal
equilibrium is $\alpha$.
Our goal is to adapt weights so that $p_V$ corresponds to a given
probability distribution on $\{-1, 1\}^{|V|}$.
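The marginal p_V can be computed exactly for a tiny machine by summing the joint equilibrium distribution over the hidden states. In the sketch below the 3-neuron machine (neurons 0 and 1 visible, neuron 2 hidden), its weights and all function names are assumptions chosen for illustration.

```python
import itertools
import math

def energy(weights, state):
    n = len(state)
    return -0.5 * sum(weights[i][j] * state[i] * state[j]
                      for i in range(n) for j in range(n))

def joint_distribution(weights, temperature):
    """Equilibrium distribution p_N over all states of the machine."""
    states = list(itertools.product((-1, 1), repeat=len(weights)))
    unnormalised = [math.exp(-energy(weights, s) / temperature) for s in states]
    z = sum(unnormalised)
    return {s: u / z for s, u in zip(states, unnormalised)}

def visible_marginal(p_n, visible):
    """p_V(alpha) = sum over hidden states beta of p_N(alpha, beta)."""
    p_v = {}
    for state, prob in p_n.items():
        alpha = tuple(state[k] for k in visible)
        p_v[alpha] = p_v.get(alpha, 0.0) + prob
    return p_v

# The hidden neuron 2 is connected to both visible neurons; the visible
# neurons are not connected to each other.
weights = [[0.0, 0.0, 1.0],
           [0.0, 0.0, 1.0],
           [1.0, 1.0, 0.0]]
p_n = joint_distribution(weights, temperature=1.0)
p_v = visible_marginal(p_n, visible=(0, 1))
for alpha, p in sorted(p_v.items()):
    print(alpha, p)
```

Although the two visible neurons share no direct connection, the hidden neuron makes their aligned states more probable, which is the kind of extra expressiveness hidden neurons provide.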
Boltzmann machine – learning
Learning:
Let $p_d$ be a probability distribution on the states of visible
neurons, i.e. on $\{-1, 1\}^{|V|}$.
The distribution $p_d$ can be determined by a sequence of training
examples:
$$\mathcal{T} = x_1, x_2, \dots, x_m$$
then
$$p_d(\alpha) = \#(\alpha, \mathcal{T})/m$$
where $\#(\alpha, \mathcal{T})$ is the number of occurrences of $\alpha$ in $\mathcal{T}$.
Our goal is to find a configuration of the network W such that
$p_V \approx p_d$.
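Computing p_d from a training sequence is a simple counting exercise. A minimal sketch, with a made-up training sequence over two visible neurons and an assumed function name:

```python
from collections import Counter

def empirical_distribution(training_examples):
    """p_d(alpha) = (number of occurrences of alpha) / m."""
    m = len(training_examples)
    counts = Counter(tuple(x) for x in training_examples)
    return {alpha: c / m for alpha, c in counts.items()}

training = [(1, 1), (1, 1), (-1, -1), (1, -1)]  # illustrative data
p_d = empirical_distribution(training)
print(p_d)
```

States that never occur in the training sequence are simply absent from the dictionary, which corresponds to p_d(α) = 0.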
Boltzmann machine – learning
A suitable measure of difference between probability
distributions $p_V$ and $p_d$ is relative entropy weighted by
probabilities of states (Kullback-Leibler divergence):
$$E(W) = \sum_{\alpha \in \{-1,1\}^{|V|}} p_d(\alpha) \ln \frac{p_d(\alpha)}{p_V(\alpha)}$$
For $p_d$ given by a training set $\mathcal{T} = x_1, x_2, \dots, x_m$ we have that
minimizing $E(W)$ is equivalent to maximizing the likelihood of $\mathcal{T}$.
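The equivalence follows from splitting the logarithm: E(W) = Σ_α p_d(α) ln p_d(α) − (1/m) Σ_k ln p_V(x_k), where the first term does not depend on W and the second is the average log-likelihood of the training set. The sketch below checks this identity numerically; the training data and the model distribution `p_v` are hypothetical values chosen for illustration.

```python
import math
from collections import Counter

def kl_divergence(p_d, p_v):
    """sum_alpha p_d(alpha) * ln(p_d(alpha) / p_V(alpha)), skipping p_d = 0."""
    return sum(p * math.log(p / p_v[a]) for a, p in p_d.items() if p > 0.0)

training = [(1, 1), (1, 1), (-1, -1), (1, -1)]  # illustrative data
m = len(training)
p_d = {a: c / m for a, c in Counter(training).items()}

# Hypothetical model distribution over the visible states.
p_v = {(1, 1): 0.4, (-1, -1): 0.3, (1, -1): 0.2, (-1, 1): 0.1}

divergence = kl_divergence(p_d, p_v)
avg_log_likelihood = sum(math.log(p_v[x]) for x in training) / m
neg_entropy = sum(p * math.log(p) for p in p_d.values())

print(divergence)
print(neg_entropy - avg_log_likelihood)  # same value up to rounding
```

Since `neg_entropy` is fixed by the data, any change of the weights that lowers the divergence raises the likelihood by the same amount.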
Boltzmann machine – learning
Minimize $E(W)$ using gradient descent, i.e. compute a
sequence of weight matrices: $W^{(0)}, W^{(1)}, \dots$
initialise $W^{(0)}$ randomly, close to 0
in step t + 1 compute $W^{(t+1)}$ as follows:
$$W_{ji}^{(t+1)} = W_{ji}^{(t)} + \Delta W_{ji}^{(t)}$$
where
$$\Delta W_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial E}{\partial w_{ji}}\left(W^{(t)}\right)$$
is the update of the weight $w_{ji}$ in the step t + 1 and
$0 < \varepsilon(t) \le 1$ is the learning rate in the step t + 1.
It remains to compute $\frac{\partial E}{\partial w_{ji}}(W)$.
Boltzmann machine – learning
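For sufficiently large $t^*$ (i.e. in thermal equilibrium) we have
$$\frac{\partial E}{\partial w_{ji}} \approx -\frac{1}{T} \left( \left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{fixed}} - \left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{free}} \right)$$
$\left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{fixed}}$ is the expected value of $y_j^{(t^*)} y_i^{(t^*)}$ in the
thermal equilibrium assuming that values of visible neurons
are fixed at the beginning of computation according to $p_d$.
$\left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{free}}$ is the expected value of $y_j^{(t^*)} y_i^{(t^*)}$ in the
thermal equilibrium (no values fixed).
Thus
$$\Delta w_{ji}^{(\ell)} = -\varepsilon(\ell) \cdot \frac{\partial E}{\partial w_{ji}}\left(W^{(\ell-1)}\right) \approx \frac{\varepsilon(\ell)}{T} \left( \left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{fixed}} - \left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{free}} \right)$$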
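For readers who want to see where the two expectations come from, here is a sketch of the derivation in exact thermal equilibrium. It assumes that $w_{ji} = w_{ij}$ is treated as a single parameter, so that the derivative of the state energy $E(\gamma)$ with respect to $w_{ji}$ is $-y_j^{\gamma} y_i^{\gamma}$; the symbol $E(W)$ denotes the divergence being minimised and $E(\alpha, \beta)$ the energy of the state $(\alpha, \beta)$.

```latex
\begin{align*}
E(W) &= \sum_{\alpha} p_d(\alpha) \ln p_d(\alpha)
        - \sum_{\alpha} p_d(\alpha) \ln p_V(\alpha), \\
\ln p_V(\alpha) &= \ln \sum_{\beta} e^{-E(\alpha,\beta)/T} - \ln Z, \\
\frac{\partial}{\partial w_{ji}} \ln \sum_{\beta} e^{-E(\alpha,\beta)/T}
     &= \frac{1}{T} \sum_{\beta} \frac{p_N(\alpha,\beta)}{p_V(\alpha)}
        \, y_j^{\alpha\beta} y_i^{\alpha\beta}, \\
\frac{\partial}{\partial w_{ji}} \ln Z
     &= \frac{1}{T} \sum_{\gamma} p_N(\gamma) \, y_j^{\gamma} y_i^{\gamma}, \\
\frac{\partial E}{\partial w_{ji}}(W)
     &= -\frac{1}{T} \Big( \langle y_j y_i \rangle_{\text{fixed}}
        - \langle y_j y_i \rangle_{\text{free}} \Big).
\end{align*}
```

The last line is obtained by weighting the third line with $p_d(\alpha)$, summing over $\alpha$, and subtracting the fourth line; the approximation sign on the slide accounts for estimating both expectations after finitely many steps $t^*$.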
Boltzmann machine – learning
Compute $\left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{fixed}}$ as follows:
Let Y := 0 and do the following q times:
1. choose $\alpha \in \{-1, 1\}^{|V|}$ randomly according to $p_d$,
2. fix values of visible neurons to $\alpha$ and do not update them
throughout the remaining steps 3. and 4.,
3. simulate $t^*$ steps; now the current values of neurons j and i
are $y_j^{(t^*)}$ and $y_i^{(t^*)}$, respectively,
4. add $y_j^{(t^*)} y_i^{(t^*)}$ to Y.
For sufficiently large q, the value Y/q will be a good
estimate of $\left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{fixed}}$.
$\left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{free}}$ can be estimated similarly, the only difference is
that the steps 1. and 2. are omitted.
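The sampling procedure above can be sketched as follows. Several details are assumptions made for this example rather than part of the slides: the function names, the random initial state of the non-fixed neurons, choosing the updated neuron uniformly among the non-fixed neurons only, and the toy machine (neurons 0 and 1 visible, neuron 2 hidden).

```python
import math
import random

def run_steps(weights, state, temperature, steps, rng, frozen=()):
    """Run `steps` Boltzmann updates in place, never updating `frozen` neurons."""
    free = [j for j in range(len(state)) if j not in frozen]
    if not free:
        return
    for _ in range(steps):
        j = rng.choice(free)
        xi = sum(weights[j][i] * state[i] for i in range(len(state)))
        p = 1.0 / (1.0 + math.exp(-2.0 * xi / temperature))
        state[j] = 1 if rng.random() < p else -1

def estimate_correlation(weights, j, i, temperature, t_star, q, rng,
                         visible=(), p_d=None):
    """Estimate <y_j y_i>: fixed phase if p_d is given, free phase otherwise."""
    n = len(weights)
    total = 0
    for _ in range(q):
        state = [rng.choice((-1, 1)) for _ in range(n)]
        frozen = ()
        if p_d is not None:
            alphas = list(p_d)
            # Step 1: draw alpha according to p_d.
            alpha = rng.choices(alphas, weights=[p_d[a] for a in alphas])[0]
            # Step 2: fix the visible neurons to alpha.
            for k, value in zip(visible, alpha):
                state[k] = value
            frozen = tuple(visible)
        run_steps(weights, state, temperature, t_star, rng, frozen)  # step 3
        total += state[j] * state[i]                                 # step 4
    return total / q

weights = [[0.0, 0.0, 1.0],
           [0.0, 0.0, 1.0],
           [1.0, 1.0, 0.0]]
p_d = {(1, 1): 0.5, (-1, -1): 0.5}  # illustrative target distribution
rng = random.Random(0)
fixed_est = estimate_correlation(weights, 0, 2, 1.0, t_star=20, q=2000,
                                 rng=rng, visible=(0, 1), p_d=p_d)
free_est = estimate_correlation(weights, 0, 2, 1.0, t_star=100, q=2000, rng=rng)
print(fixed_est, free_est)
```

For this toy machine the exact values are tanh(2) for the fixed phase and tanh(1) for the free phase, so the positive difference would push the weight between neurons 0 and 2 upwards.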
Boltzmann machine – learning
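For completeness, the analytic version:
$$\left\langle y_i^{(t^*)} y_j^{(t^*)} \right\rangle_{\text{fixed}} = \sum_{\alpha \in \{-1,1\}^{|V|}} p_d(\alpha) \sum_{\beta \in \{-1,1\}^{|H|}} \frac{p_N(\alpha, \beta)}{p_V(\alpha)} \, y_j^{\alpha\beta} y_i^{\alpha\beta}$$
here $y_j^{\alpha\beta}$ is the output of the neuron j in the state $(\alpha, \beta)$.
$$\left\langle y_i^{(t^*)} y_j^{(t^*)} \right\rangle_{\text{free}} = \sum_{\gamma \in \{-1,1\}^{|N|}} p_N(\gamma) \, y_j^{\gamma} y_i^{\gamma}$$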
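For tiny machines the analytic expressions can be evaluated by enumeration and plugged into the gradient-descent update. The sketch below does so; it is an illustration under assumptions, not the procedure used in practice. The function names, a constant learning rate of 0.1, 200 steps, zero initial weights and a machine with two visible and no hidden neurons are all choices made here (with hidden neurons the initial weights should be random, since zero weights leave the hidden connections unchanged).

```python
import itertools
import math

def energy(weights, state):
    n = len(state)
    return -0.5 * sum(weights[j][i] * state[j] * state[i]
                      for j in range(n) for i in range(n))

def correlations(weights, temperature, p_d, n_visible):
    """Exact <y_j y_i>_fixed, <y_j y_i>_free and p_V; visible neurons come first."""
    n = len(weights)
    states = list(itertools.product((-1, 1), repeat=n))
    unnorm = [math.exp(-energy(weights, s) / temperature) for s in states]
    z = sum(unnorm)
    p_n = {s: u / z for s, u in zip(states, unnorm)}
    p_v = {}
    for s, p in p_n.items():
        p_v[s[:n_visible]] = p_v.get(s[:n_visible], 0.0) + p
    fixed = [[0.0] * n for _ in range(n)]
    free = [[0.0] * n for _ in range(n)]
    for s, p in p_n.items():
        alpha = s[:n_visible]
        clamped = p_d.get(alpha, 0.0) * p / p_v[alpha]  # p_d(a) p_N(a,b) / p_V(a)
        for j in range(n):
            for i in range(n):
                fixed[j][i] += clamped * s[j] * s[i]
                free[j][i] += p * s[j] * s[i]
    return fixed, free, p_v

def kl(p_d, p_v):
    return sum(p * math.log(p / p_v[a]) for a, p in p_d.items() if p > 0.0)

def train(weights, p_d, n_visible, temperature=1.0, rate=0.1, steps=200):
    """Gradient descent with exact expectations; returns a new weight matrix."""
    weights = [row[:] for row in weights]
    n = len(weights)
    for _ in range(steps):
        fixed, free, _ = correlations(weights, temperature, p_d, n_visible)
        for j in range(n):
            for i in range(n):
                if i != j:  # keep w_jj = 0
                    weights[j][i] += rate / temperature * (fixed[j][i] - free[j][i])
    return weights

p_d = {(1, 1): 0.4, (-1, -1): 0.4, (1, -1): 0.1, (-1, 1): 0.1}
initial = [[0.0, 0.0], [0.0, 0.0]]
kl_before = kl(p_d, correlations(initial, 1.0, p_d, 2)[2])
trained = train(initial, p_d, n_visible=2)
kl_after = kl(p_d, correlations(trained, 1.0, p_d, 2)[2])
print(trained[0][1], kl_before, kl_after)
```

In this example the fixed-phase correlation is 0.6 and the free-phase correlation equals tanh(w/T), so the weight should approach atanh(0.6) and the divergence should approach 0.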