Hopfield network – local minima

We look for "deep" minima of $E$, but the network may get stuck in a shallow minimum.

Solution: in every state we allow a transition to states with higher energy. This transition has a small probability, which is higher at the beginning and decreases throughout the computation.

Boltzmann activity

Activity:
- The states of neurons are initially set to values in $\{-1, 1\}$, i.e., $y_j^{(0)} \in \{-1, 1\}$ for $j \in \{1, \ldots, n\}$.
- In step $t + 1$, update the value of a randomly chosen neuron $j \in \{1, \ldots, n\}$ as follows:
  - Compute the inner potential
    $$\xi_j^{(t)} = \sum_{i=1}^{n} w_{ji} \, y_i^{(t)}$$
  - Choose $y_j^{(t+1)} \in \{-1, 1\}$ randomly so that
    $$P\left[ y_j^{(t+1)} = 1 \right] = \sigma\left( \xi_j^{(t)} \right) \quad \text{where} \quad \sigma(\xi) = \frac{1}{1 + e^{-2\xi / T(t)}}$$

The parameter $T(t)$ is called the temperature at time $t$.

Temperature and energy

A high temperature $T(t)$ implies that $P\left[ y_j^{(t+1)} = 1 \right] \approx \frac{1}{2}$, and thus the network behaves almost randomly.

A very low temperature $T(t)$ implies that either $P\left[ y_j^{(t+1)} = 1 \right] \approx 1$ or $P\left[ y_j^{(t+1)} = 1 \right] \approx 0$, depending on whether $\xi_j^{(t)} > 0$ or $\xi_j^{(t)} < 0$. Thus the network behaves almost deterministically (as in the original activity of the Hopfield network).

Notes:
- Boltzmann activity = Hopfield activity + random noise,
- the energy $E(y) = -\frac{1}{2} \sum_{j=1}^{n} \sum_{i=1}^{n} w_{ji} \, y_j \, y_i$ may jump to higher levels (with probability depending on the temperature),
- the probability of a transition to higher energy decreases exponentially with the size of the "energy jump".

Simulated annealing

The following approach may help to reach deep minima of $E$:
- Start with a high temperature $T(0)$.
- Gradually reduce the temperature, e.g. as follows:
  - $T(t) = \eta^t \cdot T(0)$ where $\eta < 1$ is close to 1, or
  - $T(t) = T(0) / \log(1 + t)$ for $t \geq 1$.

This process resembles annealing in metallurgy, which alters the physical and sometimes chemical properties of a material to increase its ductility and reduce its hardness.

It also extends the physical motivation of Hopfield networks: magnet orientation is now, in addition, influenced by thermal fluctuations. It also brings us close to Boltzmann machines.
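The Boltzmann activity with a geometric cooling schedule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`energy`, `sigma`, `anneal`), the default values `temp0=5.0` and `eta=0.995`, and the small floor on the temperature are choices made for this sketch, not part of the lecture.

```python
import math
import random


def energy(w, y):
    """E(y) = -1/2 * sum_j sum_i w[j][i] * y[j] * y[i]."""
    n = len(y)
    return -0.5 * sum(w[j][i] * y[j] * y[i] for j in range(n) for i in range(n))


def sigma(xi, temp):
    """P[y_j = 1] = 1 / (1 + exp(-2 * xi / temp)), evaluated without overflow."""
    z = 2.0 * xi / temp
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def anneal(w, y0, steps, temp0=5.0, eta=0.995, rng=None):
    """Boltzmann activity with the schedule T(t) = eta**t * temp0.

    temp0 and eta are illustrative defaults (assumptions of this sketch).
    """
    rng = rng or random.Random()
    y = list(y0)
    n = len(y)
    for t in range(steps):
        # Floor added only to avoid division by zero for very long runs.
        temp = max(temp0 * eta ** t, 1e-9)
        j = rng.randrange(n)                        # randomly chosen neuron
        xi = sum(w[j][i] * y[i] for i in range(n))  # inner potential
        y[j] = 1 if rng.random() < sigma(xi, temp) else -1
    return y
```

For example, with weights $w_{ji} = x_j x_i$ (and $w_{jj} = 0$) storing a single pattern $x$, a sufficiently long run should end in $x$ or $-x$, the two deepest minima of $E$.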
Boltzmann machine

Architecture: a neural network with cycles and symmetric connections (i.e. an arbitrary graph).
- $N$ is the set of all neurons.
- Denote by $\xi_j$ the inner potential and by $y_j$ the output (i.e. state) of neuron $j$. State of the machine: $y \in \{-1, 1\}^{|N|}$.
- Denote by $w_{ji} \in \mathbb{R}$ the weight of the connection from $i$ to $j$ (and thus also from $j$ to $i$).
- There is no bias, and we assume $w_{jj} = 0$ for all $j \in N$.

Activity:
- The states of neurons are initially set to values in $\{-1, 1\}$, i.e. $y_j^{(0)} \in \{-1, 1\}$ for $j \in N$.
- In step $t + 1$ do the following:
  - Choose a neuron $j \in N$ uniformly at random.
  - Compute the inner potential of $j$:
    $$\xi_j^{(t)} = \sum_{i \in j_{\leftarrow}} w_{ji} \, y_i^{(t)}$$
    where $j_{\leftarrow}$ denotes the set of neurons connected to $j$.
  - Choose $y_j^{(t+1)} \in \{-1, 1\}$ randomly so that
    $$P\left[ y_j^{(t+1)} = 1 \right] = \sigma\left( \xi_j^{(t)} \right) \quad \text{where} \quad \sigma(\xi) = \frac{1}{1 + e^{-2\xi / T(t)}}$$
    ($T(t)$ is the temperature at time $t$.)

A high temperature $T(t)$ implies that $P\left[ y_j^{(t+1)} = 1 \right] \approx \frac{1}{2}$, and thus the machine behaves almost randomly.

A low temperature $T(t)$ means that either $P\left[ y_j^{(t+1)} = 1 \right] \approx 1$ or $P\left[ y_j^{(t+1)} = 1 \right] \approx 0$, depending on whether $\xi_j^{(t)} > 0$ or $\xi_j^{(t)} < 0$. Then the machine behaves almost deterministically (as the Hopfield network does).

Boltzmann machine represents probability

Goal: construct a network representing a distribution on the set of vectors $\{-1, 1\}^{|N|}$.

Rough idea:
- A Boltzmann machine has states in $\{-1, 1\}^{|N|}$ and moves randomly from state to state during computation.
- If we let the machine run for a sufficiently long time (with a fixed temperature), the relative frequencies of visits to states will be independent of the initial state.
- We consider these frequencies to be the probabilities of the states. This gives a probability distribution on $\{-1, 1\}^{|N|}$ represented by the machine.
- During learning, a probability distribution on the states in $\{-1, 1\}^{|N|}$ is given, and we adapt the weights so that the frequencies match the given probabilities.

Equilibrium

Fix a temperature $T$ (i.e. $T(t) = T$ for $t = 1, 2, \ldots$).
Theorem: For every $\gamma^* \in \{-1, 1\}^{|N|}$ we have that
$$\lim_{t \to \infty} P\left[ y^{(t)} = \gamma^* \right] = \frac{1}{Z} e^{-E(\gamma^*)/T}$$
where
$$Z = \sum_{\gamma \in \{-1, 1\}^{|N|}} e^{-E(\gamma)/T}, \qquad E(\gamma) = -\frac{1}{2} \sum_{i,j} w_{ij} \, y_i^{\gamma} \, y_j^{\gamma}$$
and $y_i^{\gamma}$ is the output of neuron $i$ in the state $\gamma$. This limit distribution is the Boltzmann distribution.

Define $p_N(\gamma^*) := \lim_{t \to \infty} P\left[ y^{(t)} = \gamma^* \right]$ for every $\gamma^* \in \{-1, 1\}^{|N|}$.

Equilibrium probabilities

Note that
- $p_N$ is a probability distribution on $\{-1, 1\}^{|N|}$ represented by the machine,
- for a state $\gamma^*$, $p_N(\gamma^*)$ is the probability of $\gamma^*$ in thermal equilibrium,
- $p_N(\gamma^*)$ can be estimated by $P\left[ y^{(t^*)} = \gamma^* \right]$ for a sufficiently large $t^*$.

That is, in order to estimate $p_N(\gamma^*)$ it is sufficient to simulate the computation several times for $t^*$ steps and then compute the relative frequency of stopping in $\gamma^*$.

By the theory of Markov chains, $p_N(\gamma^*)$ is also the long-run frequency of visits to $\gamma^*$. This gives an alternative procedure for estimating $p_N(\gamma^*)$: run the machine for a very long time and compute the relative frequency of visits to $\gamma^*$ along the computation.

Boltzmann machine – learning

To be able to capture more probability distributions, we introduce hidden neurons. Divide $N$ into two disjoint sets:
- visible neurons $V$,
- hidden neurons $H$.

For $\alpha \in \{-1, 1\}^{|V|}$ denote by
$$p_V(\alpha) = \sum_{\beta \in \{-1, 1\}^{|H|}} p_N(\alpha, \beta)$$
the probability that the state of the visible neurons in thermal equilibrium is $\alpha$.

Our goal is to adapt the weights so that $p_V$ corresponds to a given probability distribution on $\{-1, 1\}^{|V|}$.

Learning: let $p_d$ be a probability distribution on the states of the visible neurons, i.e. on $\{-1, 1\}^{|V|}$.

The distribution $p_d$ can be determined by a sequence of training examples
$$\mathcal{T} = x_1, x_2, \ldots, x_m$$
in which case
$$p_d(\alpha) = \#(\alpha, \mathcal{T}) / m$$
where $\#(\alpha, \mathcal{T})$ is the number of occurrences of $\alpha$ in $\mathcal{T}$.

Our goal is to find a configuration $W$ of the network such that $p_V \approx p_d$.
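For a very small machine, the quantities defined so far can be computed directly, which makes the definitions concrete. The following Python sketch enumerates all states to get $p_N$, estimates the same distribution from long-run visit frequencies at a fixed temperature, marginalises to $p_V$, and builds $p_d$ from a training sequence. All function names are hypothetical, and the sketch assumes that the visible neurons are the first `num_visible` components of a state.

```python
import itertools
import math
import random
from collections import Counter


def energy(w, y):
    """E(y) = -1/2 * sum_{i,j} w[i][j] * y[i] * y[j]."""
    n = len(y)
    return -0.5 * sum(w[i][j] * y[i] * y[j] for i in range(n) for j in range(n))


def sigma(xi, temp):
    """1 / (1 + exp(-2 * xi / temp)), evaluated without overflow."""
    z = 2.0 * xi / temp
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def boltzmann_distribution(w, temp):
    """Exact p_N over all 2^n states (feasible only for tiny n)."""
    states = list(itertools.product((-1, 1), repeat=len(w)))
    weights = [math.exp(-energy(w, s) / temp) for s in states]
    z = sum(weights)  # the normalising constant Z
    return {s: wt / z for s, wt in zip(states, weights)}


def visit_frequencies(w, temp, steps, rng):
    """Relative frequencies of visited states along one long run."""
    n = len(w)
    y = [rng.choice((-1, 1)) for _ in range(n)]
    counts = Counter()
    for _ in range(steps):
        j = rng.randrange(n)
        xi = sum(w[j][i] * y[i] for i in range(n))
        y[j] = 1 if rng.random() < sigma(xi, temp) else -1
        counts[tuple(y)] += 1
    return {s: c / steps for s, c in counts.items()}


def visible_marginal(p_n, num_visible):
    """p_V(alpha) = sum over hidden beta of p_N(alpha, beta)."""
    p_v = Counter()
    for s, p in p_n.items():
        p_v[s[:num_visible]] += p
    return dict(p_v)


def empirical_distribution(training):
    """p_d(alpha) = #(alpha, T) / m for a training sequence T of length m."""
    m = len(training)
    return {a: c / m for a, c in Counter(map(tuple, training)).items()}
```

Comparing `visit_frequencies` with `boltzmann_distribution` for the same weights and temperature is a quick sanity check of the equilibrium theorem on small examples; the enumeration itself does not scale, since it touches all $2^{|N|}$ states.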
A suitable measure of the difference between the probability distributions $p_V$ and $p_d$ is the relative entropy weighted by the probabilities of states (Kullback-Leibler divergence):
$$\mathcal{E}(W) = \sum_{\alpha \in \{-1, 1\}^{|V|}} p_d(\alpha) \ln \frac{p_d(\alpha)}{p_V(\alpha)}$$

For $p_d$ given by a training set $\mathcal{T} = x_1, x_2, \ldots, x_m$, minimizing $\mathcal{E}(W)$ is equivalent to maximizing the likelihood of $\mathcal{T}$.

Minimize $\mathcal{E}(W)$ using gradient descent, i.e. compute a sequence of weight matrices $W^{(0)}, W^{(1)}, \ldots$
- Initialise $W^{(0)}$ randomly, close to 0.
- In step $t + 1$ compute $W^{(t+1)}$ as follows:
  $$W_{ji}^{(t+1)} = W_{ji}^{(t)} + \Delta W_{ji}^{(t)}$$
  where
  $$\Delta W_{ji}^{(t)} = -\varepsilon(t) \cdot \frac{\partial \mathcal{E}}{\partial w_{ji}}\left( W^{(t)} \right)$$
  is the update of the weight $w_{ji}$ in step $t + 1$ and $0 < \varepsilon(t) \leq 1$ is the learning rate in step $t + 1$.

It remains to compute $\frac{\partial \mathcal{E}}{\partial w_{ji}}(W)$.

For a sufficiently large $t^*$ (i.e. in thermal equilibrium) we have
$$\frac{\partial \mathcal{E}}{\partial w_{ji}} \approx -\frac{1}{T} \left( \left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{fixed}} - \left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{free}} \right)$$
- $\left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{fixed}}$ is the expected value of $y_j^{(t^*)} y_i^{(t^*)}$ in thermal equilibrium, assuming that the values of the visible neurons are fixed at the beginning of the computation according to $p_d$.
- $\left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{free}}$ is the expected value of $y_j^{(t^*)} y_i^{(t^*)}$ in thermal equilibrium (no values fixed).

Thus, writing $\ell$ for the index of the learning step (to keep it apart from the simulation time $t^*$),
$$\Delta w_{ji}^{(\ell)} = -\varepsilon(\ell) \cdot \frac{\partial \mathcal{E}}{\partial w_{ji}}\left( W^{(\ell - 1)} \right) = \frac{\varepsilon(\ell)}{T} \left( \left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{fixed}} - \left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{free}} \right)$$

Compute $\left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{fixed}}$ as follows. Let $Y := 0$ and do the following $q$ times:
1. choose $\alpha \in \{-1, 1\}^{|V|}$ randomly according to $p_d$,
2. fix the values of the visible neurons to $\alpha$ and do not update them during steps 3 and 4,
3. simulate $t^*$ steps; the current values of neurons $j$ and $i$ are now $y_j^{(t^*)}$ and $y_i^{(t^*)}$, respectively,
4. add $y_j^{(t^*)} y_i^{(t^*)}$ to $Y$.

For a sufficiently large $q$, the value $Y/q$ is a good estimate of $\left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{fixed}}$.

$\left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{free}}$ can be estimated similarly; the only difference is that steps 1 and 2 are omitted.
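The sampling procedure and the resulting weight update can be sketched as follows. This is a minimal Python illustration with hypothetical names and several assumptions: the visible neurons are the first `num_visible` components of the state, the training set is passed as a list (so choosing an element uniformly samples from $p_d$), each run starts from a random state, and in the clamped phase only hidden neurons are selected for update, which is equivalent to never changing the visible ones.

```python
import math
import random


def sigma(xi, temp):
    """1 / (1 + exp(-2 * xi / temp)), evaluated without overflow."""
    z = 2.0 * xi / temp
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def run(w, y, free_idx, temp, steps, rng):
    """Simulate `steps` updates in place, touching only neurons in free_idx."""
    if not free_idx:          # nothing to update (e.g. no hidden neurons)
        return
    n = len(y)
    for _ in range(steps):
        j = rng.choice(free_idx)
        xi = sum(w[j][i] * y[i] for i in range(n))
        y[j] = 1 if rng.random() < sigma(xi, temp) else -1


def estimate_correlations(w, num_visible, temp, q, t_star, rng, training=None):
    """Estimate <y_j y_i> for all pairs as Y/q; clamp visibles if training is given."""
    n = len(w)
    total = [[0.0] * n for _ in range(n)]
    for _ in range(q):
        y = [rng.choice((-1, 1)) for _ in range(n)]
        if training is not None:
            y[:num_visible] = rng.choice(training)  # steps 1 and 2: alpha ~ p_d, fixed
            free_idx = range(num_visible, n)
        else:
            free_idx = range(n)
        run(w, y, free_idx, temp, t_star, rng)      # step 3: simulate t* steps
        for j in range(n):
            for i in range(n):
                total[j][i] += y[j] * y[i]          # step 4: accumulate Y
    return [[v / q for v in row] for row in total]


def learning_step(w, training, num_visible, temp, eps, q, t_star, rng):
    """One update: w_ji += eps / T * (<y_j y_i>_fixed - <y_j y_i>_free)."""
    fixed = estimate_correlations(w, num_visible, temp, q, t_star, rng, training)
    free = estimate_correlations(w, num_visible, temp, q, t_star, rng)
    n = len(w)
    for j in range(n):
        for i in range(n):
            if i != j:        # keep w_jj = 0
                w[j][i] += eps / temp * (fixed[j][i] - free[j][i])
```

Because $y_j y_i = y_i y_j$ and both estimates are accumulated from the same samples, the update is symmetric in $j$ and $i$, so the weight matrix stays symmetric. The estimates are noisy, so in practice `q` and `t_star` would have to be chosen large enough for the difference of the two averages to be meaningful.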
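For completeness, the analytic version:
$$\left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{fixed}} = \sum_{\alpha \in \{-1, 1\}^{|V|}} p_d(\alpha) \sum_{\beta \in \{-1, 1\}^{|H|}} \frac{p_N(\alpha, \beta)}{p_V(\alpha)} \, y_j^{\alpha\beta} \, y_i^{\alpha\beta}$$
where $y_j^{\alpha\beta}$ is the output of neuron $j$ in the state $(\alpha, \beta)$, and
$$\left\langle y_j^{(t^*)} y_i^{(t^*)} \right\rangle_{\text{free}} = \sum_{\gamma \in \{-1, 1\}^{|N|}} p_N(\gamma) \, y_j^{\gamma} \, y_i^{\gamma}$$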
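On a tiny machine, the analytic expressions can be evaluated by brute force and compared with a finite-difference derivative of the Kullback-Leibler divergence. The Python sketch below does this; the function names are hypothetical, visible neurons are assumed to be the first `num_visible` components, $p_d$ is a dictionary containing only states with non-zero probability, and $w_{ji}$ and $w_{ij}$ are perturbed together because they are the same symmetric weight.

```python
import itertools
import math


def energy(w, y):
    """E(y) = -1/2 * sum_{i,j} w[i][j] * y[i] * y[j]."""
    n = len(y)
    return -0.5 * sum(w[i][j] * y[i] * y[j] for i in range(n) for j in range(n))


def p_n(w, temp):
    """Exact Boltzmann distribution over all states (tiny machines only)."""
    states = list(itertools.product((-1, 1), repeat=len(w)))
    weights = [math.exp(-energy(w, s) / temp) for s in states]
    z = sum(weights)
    return {s: wt / z for s, wt in zip(states, weights)}


def kl(w, p_d, num_visible, temp):
    """Sum over alpha of p_d(alpha) * ln(p_d(alpha) / p_V(alpha))."""
    p = p_n(w, temp)
    total = 0.0
    for alpha, pd in p_d.items():
        pv = sum(prob for s, prob in p.items() if s[:num_visible] == alpha)
        total += pd * math.log(pd / pv)
    return total


def corr_free(w, temp, j, i):
    """<y_j y_i>_free = sum over gamma of p_N(gamma) * y_j * y_i."""
    return sum(prob * s[j] * s[i] for s, prob in p_n(w, temp).items())


def corr_fixed(w, p_d, num_visible, temp, j, i):
    """<y_j y_i>_fixed with visible neurons clamped according to p_d."""
    p = p_n(w, temp)
    total = 0.0
    for alpha, pd in p_d.items():
        joint = {s: prob for s, prob in p.items() if s[:num_visible] == alpha}
        pv = sum(joint.values())  # p_V(alpha)
        total += pd * sum(prob / pv * s[j] * s[i] for s, prob in joint.items())
    return total


def analytic_grad(w, p_d, num_visible, temp, j, i):
    """-(1/T) * (<y_j y_i>_fixed - <y_j y_i>_free)."""
    return -(corr_fixed(w, p_d, num_visible, temp, j, i) - corr_free(w, temp, j, i)) / temp


def numeric_grad(w, p_d, num_visible, temp, j, i, h=1e-6):
    """Central difference of the KL divergence in the symmetric weight w_ji."""
    def shifted(delta):
        w2 = [row[:] for row in w]
        w2[j][i] += delta
        w2[i][j] += delta  # keep the matrix symmetric
        return kl(w2, p_d, num_visible, temp)
    return (shifted(h) - shifted(-h)) / (2.0 * h)
```

With exact equilibrium expectations the two gradients should agree up to the finite-difference error, which gives a simple way to check an implementation of the learning rule before switching to sampled estimates.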