Vector quantization

Assume we are given a probability density function p(x) on input vectors $x \in \mathbb{R}^n$, i.e. assume that the inputs are randomly generated according to p(x). Our goal is to approximate p(x) using finitely many centres $w_i \in \mathbb{R}^n$, $i = 1, \dots, h$. Roughly speaking, we want more centres in areas of high density and fewer in areas of low density.

Formally: to every input x we assign its closest centre $w_{c(x)}$, where
$$c(x) = \arg\min_{i=1,\dots,h} \|x - w_i\|,$$
and then minimize the error
$$E = \int \|x - w_{c(x)}\|^2\, p(x)\, dx.$$
Caution: c(x) depends on x.

Vector quantization in practice

In practice, p(x) is approximated by sampling uniformly from a given training (multi)set
$$T = \{x_j \in \mathbb{R}^n \mid j = 1, \dots, \ell\}.$$
The error then corresponds to
$$E = \frac{1}{\ell} \sum_{j=1}^{\ell} \|x_j - w_{c(x_j)}\|^2$$
(keep in mind that $c(x_j) = \arg\min_{i=1,\dots,h} \|x_j - w_i\|$). If T has been randomly sampled according to p(x) and is large enough, then
$$\frac{1}{\ell} \sum_{j=1}^{\ell} \|x_j - w_{c(x_j)}\|^2 \approx \int \|x - w_{c(x)}\|^2\, p(x)\, dx.$$

Example – image compression

Every pixel has 256 shades of grey, so each pair of neighbouring pixels is a two-dimensional vector from {0, ..., 255} × {0, ..., 255}. The compression finds a small set of centres that encode the shades of grey of pairs of pixels; the image is then encoded by simply substituting each pair of pixels with its centre.

[Figure: pair distribution, naive quantization, smart quantization.]

Lloyd's algorithm

Assume a finite training set $T = \{x_j \in \mathbb{R}^n \mid j = 1, \dots, \ell\}$. The algorithm moves the centres towards the centres of mass of their closest points. In step t it computes $w_1^{(t)}, \dots, w_h^{(t)}$ as follows:
- for every k = 1, ..., h compute the set $T_k$ of all vectors of T to which $w_k^{(t-1)}$ is the closest centre:
  $$T_k = \{x_j \in T \mid k = \arg\min_{i=1,\dots,h} \|x_j - w_i^{(t-1)}\|\},$$
- compute $w_k^{(t)}$ as the centre of mass of $T_k$:
  $$w_k^{(t)} = \frac{1}{|T_k|} \sum_{x \in T_k} x.$$
We may stop the computation when, e.g., the error E is sufficiently small. (A code sketch is given at the end of this part.)

Kohonen's learning

A disadvantage of Lloyd's algorithm is that it is not online. The following Kohonen learning rule is online, i.e. the inputs may be generated one by one and the centres are adapted on the fly. In step t, consider the input $x_t$ and compute $w_k^{(t)}$ as follows. If $w_k^{(t-1)}$ is the closest centre to $x_t$, i.e. $k = \arg\min_i \|x_t - w_i^{(t-1)}\|$, then
$$w_k^{(t)} = w_k^{(t-1)} + \theta \cdot (x_t - w_k^{(t-1)}),$$
otherwise $w_k^{(t)} = w_k^{(t-1)}$. Here $0 < \theta \le 1$ determines how much to move the centre towards the input. (See also the code sketch at the end of this part.) Let us formulate this algorithm in the language of neural networks.

Kohonen's learning – neural network

Architecture: a single layer of neurons $y_1, \dots, y_h$ fully connected to the inputs $x_1, \dots, x_n$ by weights $w_{ki}$.

Activity: for an input $x \in \mathbb{R}^n$ and $k = 1, \dots, h$:
$$y_k = \begin{cases} 1 & k = \arg\min_{i=1,\dots,h} \|x - w_i\| \\ 0 & \text{otherwise.} \end{cases}$$
Learning then applies exactly the Kohonen update rule above to the weight vector $w_k$ of the winning neuron.

Kohonen's learning – efficiency

The rule works well if the input vectors are roughly evenly distributed over a convex area. In the case of two (or more) separated clusters, the resulting density of centres may not correspond to p(x) at all.

Example: two separated areas with the same density. Assume that the centres initially all lie in one of the areas. The second area then "drags" only one of the centres (the one which always wins the competition for its inputs). Result: one of the areas is covered by a single centre even though it contains half of the mass of the input examples.

Solution: we tie the centres together so that they have to move together.
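To make Lloyd's algorithm concrete, here is a minimal NumPy sketch of the batch procedure described above. The function name, the initialisation by randomly chosen training vectors, and the stopping criteria are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def lloyd(T, h, steps=100, seed=0):
    """Minimal sketch of Lloyd's algorithm (batch vector quantization).

    T : array of shape (l, n) holding the training vectors x_1, ..., x_l
    h : number of centres
    """
    rng = np.random.default_rng(seed)
    # Initialise the centres as h randomly chosen training vectors.
    w = T[rng.choice(len(T), size=h, replace=False)].astype(float)

    def assign(w):
        # c(x_j): index of the closest centre for every training vector.
        dists = np.linalg.norm(T[:, None, :] - w[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(steps):
        c = assign(w)
        # Move every centre to the centre of mass of its set T_k
        # (a centre whose T_k is empty is left where it is).
        new_w = np.array([T[c == k].mean(axis=0) if np.any(c == k) else w[k]
                          for k in range(h)])
        if np.allclose(new_w, w):
            break
        w = new_w

    c = assign(w)
    # Empirical error E = (1/l) * sum_j ||x_j - w_{c(x_j)}||^2
    E = np.mean(np.sum((T - w[c]) ** 2, axis=1))
    return w, E
```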
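The online Kohonen learning rule from the same part, again as a hedched sketch in NumPy; the function name, the random initialisation, and the constant learning rate theta are illustrative choices.

```python
import numpy as np

def kohonen_vq(stream, h, n, theta=0.1, seed=0):
    """Online competitive learning: centres are adapted one input at a time.

    stream : iterable of input vectors x_t in R^n
    h      : number of centres
    theta  : learning rate, 0 < theta <= 1
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((h, n))                      # initial centres
    for x_t in stream:
        # Winner: the centre closest to the current input.
        k = np.argmin(np.linalg.norm(w - x_t, axis=1))
        # Move only the winner towards the input; all other centres stay put.
        w[k] += theta * (x_t - w[k])
    return w
```

Note that only the winning centre moves, which is exactly the behaviour that causes the separated-clusters problem discussed above.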
Kohonen's map

Architecture: a single layer of neurons $y_1, \dots, y_h$ connected to the inputs $x_1, \dots, x_n$ by weights $w_{ki}$, as before. In addition there is a topological structure: the neurons are connected by edges so that they form the nodes of an undirected graph. In most cases this structure is either a one-dimensional sequence or a two-dimensional grid.

Kohonen's map – illustration

[Figure.]

Kohonen's map – biological motivation

[Figure. Source: Neural Networks – A Systematic Introduction, Raul Rojas, Springer, 1996.]

Kohonen's map

Activity: given an input vector $x \in \mathbb{R}^n$ and $k = 1, \dots, h$:
$$y_k = \begin{cases} 1 & k = \arg\min_{i=1,\dots,h} \|x - w_i\| \\ 0 & \text{otherwise.} \end{cases}$$

Learning uses the topological structure. Denote by $d(c, k)$ the length of the shortest path from neuron c to neuron k in the topological structure. For every neuron c and a given $s \in \mathbb{N}_0$ define the topological neighbourhood of c of size s:
$$N_s(c) = \{k \mid d(c, k) \le s\}.$$
In step t, given a training example $x_t$, adapt $w_k$ as follows:
$$w_k^{(t)} = \begin{cases} w_k^{(t-1)} + \theta \cdot (x_t - w_k^{(t-1)}) & k \in N_s(c(x_t)) \\ w_k^{(t-1)} & \text{otherwise,} \end{cases}$$
where $c(x_t) = \arg\min_{i=1,\dots,h} \|x_t - w_i^{(t-1)}\|$, and $\theta \in \mathbb{R}$ and $s \in \mathbb{N}_0$ are parameters that may change during training.

Kohonen's map – learning

More general version:
$$w_k^{(t)} = w_k^{(t-1)} + \Theta(c(x_t), k) \cdot (x_t - w_k^{(t-1)}),$$
where $c(x_t) = \arg\min_{i=1,\dots,h} \|x_t - w_i^{(t-1)}\|$. The previous case then corresponds to
$$\Theta(c(x_t), k) = \begin{cases} \theta & k \in N_s(c(x_t)) \\ 0 & \text{otherwise.} \end{cases}$$
A smoother version:
$$\Theta(c(x_t), k) = \theta_0 \cdot \exp\left(-\frac{d(c(x_t), k)^2}{\sigma^2}\right),$$
where $\theta_0 \in \mathbb{R}$ is a learning rate and $\sigma \in \mathbb{R}$ is the width (both parameters may change during training). A code sketch of this training scheme is given at the end of this part.

Examples

- Example 1: inputs uniformly distributed in a rectangle. [Figure. Image source: Neural Networks – A Systematic Introduction, Raul Rojas, Springer, 1996.]
- Example 2: inputs uniformly distributed in a triangle. [Figure. Image source: ibid.]
- Example 3: inputs uniformly distributed in a cuboid. [Figure. Image source: ibid.]
- Example 4: inputs uniformly distributed in a cactus shape. [Figure. Image source: ibid.]
- Example – defect: a topological defect, i.e. a twisted network. [Figure. Image source: ibid.]

Kohonen's map – practical approach

According to Kohonen's paper, the initial weights are not particularly important; they should just differ from each other. Learning proceeds in two phases:
- Coarse phase: approximately 1000 steps. The learning rate θ starts around 0.1 and steadily decreases to 0.01. The topological neighbourhood of every neuron (determined by s, or by the width σ) should be large at the beginning (i.e. contain most neurons) and should shrink to a few neurons at the end.
- Fine tuning: approximately 500 times the number of neurons steps, with θ close to 0.01 (otherwise topological defects are likely to occur) and the neighbourhood of each neuron containing just a few other neurons.

Kohonen's map – theory

Convergence to an "ordered" state has been proved only for one-dimensional maps with special cases of the distribution p(x) (uniform), fixed neighbourhoods of size 1, and a fixed learning rate. There are simple counterexamples disproving convergence when these assumptions are not satisfied. In more than one dimension there are no guarantees at all; convergence depends on several factors:
- the initial distribution of the neurons (centres),
- the size of the neighbourhood,
- the learning rate.

What dimension to choose? Typically a one- or two-dimensional map is used (as a coarse form of dimensionality reduction).
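As a concrete illustration of the general update rule with the Gaussian neighbourhood $\Theta$, here is a minimal NumPy sketch of training a two-dimensional Kohonen map. The grid-distance computation, the linear decay schedules for $\theta_0$ and $\sigma$, and all names (som_train, grid_shape, etc.) are illustrative assumptions, not prescribed by the lecture.

```python
import numpy as np

def som_train(X, grid_shape=(10, 10), steps=10000,
              theta0=0.1, theta1=0.01, sigma0=5.0, sigma1=0.5, seed=0):
    """Minimal sketch: Kohonen map on a 2D grid with a Gaussian neighbourhood."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    h, n = rows * cols, X.shape[1]
    # Grid coordinates of every neuron; on a grid graph the shortest-path
    # distance d(c, k) equals the Manhattan distance between coordinates.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])
    w = X.mean(axis=0) + 0.01 * rng.standard_normal((h, n))  # initial weights

    for t in range(steps):
        x_t = X[rng.integers(len(X))]
        # Winner c(x_t): the neuron whose weight vector is closest to x_t.
        c = np.argmin(np.linalg.norm(w - x_t, axis=1))
        # d(c, k) for all neurons k.
        d = np.abs(coords - coords[c]).sum(axis=1)
        # Linearly decaying learning rate and neighbourhood width.
        frac = t / steps
        theta = theta0 + frac * (theta1 - theta0)
        sigma = sigma0 + frac * (sigma1 - sigma0)
        # Gaussian neighbourhood Theta(c, k) = theta * exp(-d^2 / sigma^2).
        Theta = theta * np.exp(-(d ** 2) / sigma ** 2)
        # Move every neuron towards x_t, weighted by its neighbourhood value.
        w += Theta[:, None] * (x_t - w)
    return w, coords
```

On inputs drawn uniformly from a rectangle this sketch should reproduce the "unfolding grid" behaviour of Example 1, provided the neighbourhood starts wide (coarse phase) and then shrinks, as recommended in the practical approach above.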
LVQ – classification using Kohonen's map

Assume randomly generated training examples of the form $(x_t, d_t)$, where $x_t \in \mathbb{R}^n$ is a feature vector and $d_t \in \{C_1, \dots, C_q\}$ is one of q classes. Our goal is to classify objects based on knowledge of their features, i.e. to assign to every $x_t$ a class so that the probability of error is minimized.

Example: a conveyor belt with fruit, apples and oranges. Formally, $(x_t, d_t)$ where $x_t \in \mathbb{R}^2$; the first component is the weight and the second the diameter. $d_t$ is either A or O, depending on whether the given object is an apple or an orange. We allow apples and oranges with the same features. The goal is to sort the fruit based on weight and diameter.

Classification using Kohonen's map

We use a Kohonen map as follows:
1. Train the map on the feature vectors $x_t$, $t = 1, \dots, \ell$ (ignoring the classes for now).
2. Label the neurons with classes. The class $v_c$ of a neuron c is determined as follows: for every neuron c and every class $C_i$, count the number $\#(c, C_i)$ of training examples $x_t$ with class $C_i$ for which the neuron c returns 1 (i.e. is the closest to them). To c, assign the class $v_c = \arg\max_{C_i} \#(c, C_i)$.
3. Fine-tune the network using LVQ (see below).

The trained network is used as follows: given a feature vector x, evaluate the network with x as the input; a single neuron c has value 1, and we return $v_c$ as the class of x. (A code sketch of steps 2 and 3 is given at the end of this part.)

LVQ

Iterate over the training examples. For $(x_t, d_t)$ find the closest neuron c:
$$c = \arg\min_{i=1,\dots,h} \|x_t - w_i\|.$$
Adjust the weights of c as follows:
$$w_c^{(t)} = \begin{cases} w_c^{(t-1)} + \alpha (x_t - w_c^{(t-1)}) & d_t = v_c \\ w_c^{(t-1)} - \alpha (x_t - w_c^{(t-1)}) & d_t \ne v_c. \end{cases}$$
The parameter α should be small right from the beginning (approximately 0.01–0.02) and steadily decrease to 0.

According to Kohonen, the resulting border between classes should be a good approximation of the Bayes decision boundary. What is that?

Bayes classifier

For simplicity, consider two classes $C_0$ and $C_1$ (e.g. A and O). Let $P(C_i \mid x)$ be the probability that an object belongs to $C_i$ given that it has features x. (E.g. $P(A \mid (a, b))$ is the probability that a fruit with weight a and diameter b is an apple.) The Bayes classifier assigns to x the class $C_i$ satisfying $P(C_i \mid x) \ge P(C_{1-i} \mid x)$. Denote by $R_0$ the set of all x satisfying $P(C_0 \mid x) \ge P(C_1 \mid x)$, and let $R_1 = \mathbb{R}^n \setminus R_0$. The Bayes classifier minimizes the error probability
$$P(x \in R_0 \wedge C_1) + P(x \in R_1 \wedge C_0).$$
The Bayes decision boundary is the boundary between the sets $R_0$ and $R_1$. (A small numerical sketch is given at the end of this part.)

Bayes decision boundary vs LVQ

[Figure. Image source: The Self-Organizing Map, Teuvo Kohonen, IEEE, 1990.]

Oceanographic data

Source: Patterns of ocean current variability on the West Florida Shelf using the self-organizing map. Y. Liu and R. H. Weisberg, Journal of Geophysical Research, 2005. The study investigates currents in the ocean around Florida.

- 11 measuring stations, 3 depths (surface, bottom, in between).
- Data: 2D velocity vectors of the current, measured every hour for 25585 hours. Thus we have 25585 data samples of dimension 66.
- Kohonen map: a 3 × 4 grid; neighbourhoods given by the Gaussian function $\Theta(c, k) = \theta_0 \cdot \exp(-d(c, k)^2 / \sigma^2)$ with shrinking width and a linearly decreasing learning rate.

[Figure.]

[Figure: winning neurons over time; the crosses are the winning neurons.] The winners are influenced by local fluctuations, but there is an observable trend: in winter, neurons 1–6 (south-east); in summer, neurons 10–12 (north-west).
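A minimal NumPy sketch of the neuron-labelling step and the LVQ fine-tuning rule described above. It assumes the map weights w have already been trained (e.g. by a routine such as the som_train sketch earlier); all function names and the halving schedule for alpha are illustrative.

```python
import numpy as np

def label_neurons(w, X, d, classes):
    """Assign to each neuron c the majority class v_c among the examples it wins."""
    counts = np.zeros((len(w), len(classes)), dtype=int)
    for x_t, d_t in zip(X, d):
        c = np.argmin(np.linalg.norm(w - x_t, axis=1))   # winning neuron c
        counts[c, classes.index(d_t)] += 1                # #(c, C_i)
    return [classes[i] for i in counts.argmax(axis=1)]    # v_c = argmax_i #(c, C_i)

def lvq_finetune(w, v, X, d, alpha=0.02, epochs=5):
    """LVQ: pull the winner towards correctly classified inputs, push it away otherwise."""
    w = w.copy()
    for _ in range(epochs):
        for x_t, d_t in zip(X, d):
            c = np.argmin(np.linalg.norm(w - x_t, axis=1))
            sign = 1.0 if d_t == v[c] else -1.0
            w[c] += sign * alpha * (x_t - w[c])
        alpha *= 0.5          # alpha should steadily decrease towards 0
    return w

def classify(w, v, x):
    """Return the class label of the winning neuron for feature vector x."""
    return v[np.argmin(np.linalg.norm(w - x, axis=1))]
```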
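To make the Bayes classifier concrete, here is a small sketch for two classes with known class-conditional densities and priors. The Gaussian densities and all numbers are purely illustrative assumptions (they are not fitted to the fruit example); SciPy is assumed.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative class-conditional densities p(x | C_i) and priors P(C_i).
p_x_given = [
    multivariate_normal(mean=[150.0, 7.0], cov=[[100.0, 0.0], [0.0, 1.0]]),   # class C_0
    multivariate_normal(mean=[180.0, 8.0], cov=[[150.0, 0.0], [0.0, 1.5]]),   # class C_1
]
prior = [0.5, 0.5]

def bayes_classify(x):
    """Assign x to the class with the larger posterior P(C_i | x).

    By Bayes' rule, P(C_i | x) is proportional to p(x | C_i) * P(C_i),
    so comparing posteriors reduces to comparing these products.
    """
    scores = [p_x_given[i].pdf(x) * prior[i] for i in (0, 1)]
    return 0 if scores[0] >= scores[1] else 1   # x lies in R_0 iff P(C_0|x) >= P(C_1|x)
```

The set of points where the two scores are equal is the Bayes decision boundary that LVQ is meant to approximate.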
Grimm's fairy tales

Source: Contextual Relations of Words in Grimm Tales, Analyzed by Self-Organizing Map. T. Kohonen, T. Honkela and V. Pulkki, ICANN, 1995.

Our goal is to visualize syntactic and semantic categories of words in fairy tales, depending on context.

Input: Grimm's fairy tales, encoded as a stream of 270-dimensional vectors:
- triples of words (predecessor, key, successor),
- every component of the triple encoded by a randomly generated 90-dimensional real vector.

Network: a Kohonen map with 42 × 36 neurons and weights of the form $w = (w_p, w_k, w_n)$, where $w_p, w_k, w_n \in \mathbb{R}^{90}$.

Grimm's fairy tales – learning

The map was trained on triples of successive words in the fairy tales. The training set consisted of the 150 most common words, each with an "average" context. Coarse training: 600000 iterations; fine tuning: 400000 iterations. In the end, the 150 most common words were used to label the neurons: a word u labels a neuron with weights $w = (w_p, w_k, w_n)$ when $w_k$ is the closest to the code of u.

[Figure: the resulting labelled map.]

Great summary – models

We have considered several models of neural networks:
- ADALINE (aka linear regression)
- multilayer perceptron (MLP)
- Hopfield networks
- restricted Boltzmann machines and deep belief networks
- convolutional networks
- recurrent networks (LSTM)
- Kohonen maps

Great summary – algorithms

Gradient descent! The only exceptions were Kohonen maps (Kohonen learning) and Hopfield networks (Hebbian learning). The gradient is computed using backpropagation in MLPs, convolutional networks and recurrent networks (LSTM), and by simulation in RBMs.

Deeper thoughts

Most neural network models are universal approximators (i.e. capable of approximating any reasonable function), but it is difficult to find the appropriate configuration by hand; instead, such a configuration can be learned efficiently (without guarantees, of course).

Depth is stronger than size: deep networks are more succinct in their representation, but they are harder to train. Do not forget the vanishing/exploding gradient problem!

Because backpropagation was derived with all neurons unified using indices, backprop for the individual models differs very little: only in the specification of neurons with tied weights. Weight tying is the single most effective trick in the history of neural networks!