Autoencoders

An autoencoder consists of two parts:
φ : R^n → R^m   the encoder
ψ : R^m → R^n   the decoder
The goal is to find φ, ψ so that ψ ◦ φ is (almost) the identity. The value h = φ(x) is called the latent representation of x.

Autoencoders – training

Assume a training set T = {x_1, ..., x_p} where x_i ∈ R^n for all i ∈ {1, ..., p}. Minimize the reconstruction error
E = ∑_{i=1}^{p} ||x_i − ψ(φ(x_i))||^2

Autoencoders – neural networks

Both φ and ψ can be represented by MLPs M_φ and M_ψ, respectively. M_φ and M_ψ can be connected into a single network.

Autoencoders – Usage

Compression – from x to h.
Dimensionality reduction – the latent representation h has a smaller dimension.
Pretraining (see below).
Generative versions – (roughly) generate h from a known distribution and let M_ψ generate realistic inputs x.

Autoencoder – compression – historical implementation

Architecture: MLP 64 − 16 − 64
Activity: the activation function is the hyperbolic tangent with limits −1 and 1.
Data: images 256 × 256, 8 bits per pixel.
Samples: the input (and desired output) is an 8 × 8 frame randomly selected from the image. Inputs normalized to [−1, 1].
The goal was to compress images to a smaller data size.
An 8 × 8 frame passes over the 256 × 256 image (no overlap):
(A) original
(B) compression
(C) compression + rounding to 6 bits (1.5 bits per pixel)
(D) compression + rounding to 4 bits (1 bit per pixel)

Dimensionality reduction – compression

A new image (the network was trained on the previous one):
(A) original
(B) compression
(C) compression + rounding to 6 bits (1.5 bits per pixel)
(D) compression + rounding to 4 bits (1 bit per pixel)

Application – dimensionality reduction

Dimensionality reduction: a mapping R from R^n to R^m, where m < n, such that every example x can be "reconstructed" from R(x).
Standard method: PCA (there are many linear as well as non-linear variants).

Reconstruction – PCA

1024 pixels compressed to 100 dimensions (i.e. 100 numbers).

PCA vs Autoencoders

[figure]

Autoencoders – Pretraining

An autoencoder is (pre)trained on input data x_i without desired outputs (unsupervised); typically much larger datasets of unlabelled data are available.
The encoder M_φ computes a latent representation for every input vector; it is supposed to extract important features (controversial).
A new part of the model, M_top, is added on top of M_φ (e.g. an MLP taking the output of M_φ as its input).
Subsequently, labels are added and the whole model (composed of M_φ and M_top) is trained on the labelled data.

[figure]
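As a concrete illustration of the autoencoder and the pretraining scheme above, here is a minimal sketch assuming PyTorch; the layer sizes, the names encoder/decoder/top and the synthetic stand-in data are illustrative choices, not part of the lecture.

```python
# A minimal autoencoder + pretraining sketch. PyTorch is assumed; layer sizes,
# names (encoder, decoder, top) and the synthetic data are illustrative only.
import torch
import torch.nn as nn

n, m = 784, 30                                      # input dimension n, latent dimension m < n
encoder = nn.Sequential(nn.Linear(n, 256), nn.ReLU(), nn.Linear(256, m))   # M_phi
decoder = nn.Sequential(nn.Linear(m, 256), nn.ReLU(), nn.Linear(256, n))   # M_psi

# Synthetic stand-ins for the (large) unlabelled set and the (small) labelled set.
unlabelled = [torch.randn(32, n) for _ in range(20)]
labelled = [(torch.randn(32, n), torch.randint(0, 10, (32,))) for _ in range(5)]

# Phase 1 (unsupervised): minimize the reconstruction error E = sum_i ||x_i - psi(phi(x_i))||^2.
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
mse = nn.MSELoss()
for x in unlabelled:
    h = encoder(x)                                  # latent representation h = phi(x)
    loss = mse(decoder(h), x)                       # squared reconstruction error (mean)
    opt.zero_grad(); loss.backward(); opt.step()

# Phase 2 (supervised): add M_top on top of M_phi and train the whole model on labels.
top = nn.Sequential(nn.Linear(m, 64), nn.ReLU(), nn.Linear(64, 10))        # M_top, e.g. a small MLP
model = nn.Sequential(encoder, top)                 # M_phi followed by M_top
opt2 = torch.optim.Adam(model.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()
for x, y in labelled:
    loss = ce(model(x), y)
    opt2.zero_grad(); loss.backward(); opt2.step()
```

Phase 2 follows the slide and fine-tunes the encoder together with M_top; freezing the encoder and training only the top part is a common, cheaper variant.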
Deep MLP – dimensionality reduction

Hinton, G. E., Osindero, S. and Teh, Y. (2006) A fast learning algorithm for deep belief nets. Neural Computation, 18, pp. 1527-1554.
Hinton, G. E. and Salakhutdinov, R. R. (2006) Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 28 July 2006.
This basically started all the deep learning craze ...

[figure]

Images – pretraining

Data: 165 600 black-and-white images, 25 × 25, mean intensity 0, variance 1. The images were obtained from the Olivetti Faces database of 64 × 64 images using standard transformations. 103 500 training, 20 700 validation, 41 400 test.
Network: 2000-1000-500-30, trained using layered RBMs.
Notes:
Training of the lowest layer (2000 neurons): values of pixels distorted using Gaussian noise, low learning rate 0.001, 200 iterations.
Training of all hidden layers: values of the neurons are binary.
Training of the output layer: values computed directly using the sigmoid activation function + noise; that is, values of the output neurons are from the interval [0, 1].

Images – fine-tuning

The stochastic activations are substituted with deterministic ones; that is, the value of a hidden neuron is not chosen randomly but computed directly by applying the sigmoid to its inner potential (this gives the mean activation).
Backpropagation. Error function: cross-entropy
−∑_i p_i ln p̂_i − ∑_i (1 − p_i) ln(1 − p̂_i)
where p_i is the intensity of the i-th pixel of the input and p̂_i that of the reconstruction.

Results

1. Original
2. Reconstruction using deep networks (reduction to 30 dimensions)
3. Reconstruction using PCA (reduction to 30 dimensions)

Generative adversarial networks

Generative Adversarial Nets, Goodfellow et al., NIPS 2014.
An unsupervised generative model consisting of two networks:
Generator: a network computing a function G : R^k → R^n which takes a random input z with a distribution p_z (e.g. a multivariate normal distribution) and returns G(z), which should follow the target probability distribution. E.g. G(z) could be realistic-looking faces.
Discriminator: a network computing a function D : R^n → [0, 1] that, given x ∈ R^n, gives the probability D(x) that x was not "generated" by G. E.g. x can be an image and D(x) the probability that it is a true face of an existing person.
What error function will "motivate" G to generate realistically and D to discriminate appropriately?

Generative adversarial networks – error function

Let T = {x_1, ..., x_p} be a training multiset (or a minibatch).
Intuition: G should produce outputs similar to the elements of T; D should recognize that its input is not from T.
Generate a multiset of noise samples F = {z_1, ..., z_p} from the distribution p_z and set
E_{T,F}(G, D) = −(1/p) ∑_{i=1}^{p} [ ln D(x_i) + ln(1 − D(G(z_i))) ]
This is just the binary cross-entropy error of D, which classifies its input as either real or fake.
The problem can be seen as a game: the discriminator wants to minimize E, the generator wants to maximize E!
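To make the link to binary cross-entropy explicit, the following sketch (PyTorch assumed; G, D and all dimensions are placeholders) evaluates E_{T,F}(G, D) on one minibatch.

```python
# E_{T,F}(G, D) for one minibatch is the binary cross-entropy of D, with real
# samples labelled 1 and generated samples labelled 0. PyTorch is assumed;
# G, D and all dimensions are illustrative placeholders.
import torch
import torch.nn as nn

k_dim, n_dim, p = 16, 64, 32                        # noise dim k, data dim n, minibatch size p
G = nn.Sequential(nn.Linear(k_dim, 64), nn.ReLU(), nn.Linear(64, n_dim))
D = nn.Sequential(nn.Linear(n_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

x = torch.randn(p, n_dim)                           # stand-in for real samples x_1, ..., x_p from T
z = torch.randn(p, k_dim)                           # noise samples z_1, ..., z_p from p_z
bce = nn.BCELoss()                                  # mean over the minibatch gives the 1/p factor

E = bce(D(x), torch.ones(p, 1)) + bce(D(G(z)), torch.zeros(p, 1))
# E equals -(1/p) * sum_i [ ln D(x_i) + ln(1 - D(G(z_i))) ] = E_{T,F}(G, D):
# the discriminator takes gradient steps to decrease E, the generator to increase it.
```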
The learning algorithm

Denote by W_G and W_D the weights of G and D, respectively. In every iteration of the training, modify the weights of the discriminator and the generator as follows:

For k steps (here k is a hyperparameter) update the discriminator:
Sample a minibatch T = {x_1, ..., x_m} from the training data.
Sample a minibatch F = {z_1, ..., z_m} from the distribution p_z.
Update W_D by gradient descent w.r.t. E:
W_D := W_D − α · ∇_{W_D} E_{T,F}(G, D)

Now update the generator:
Sample a minibatch F = {z_1, ..., z_m} from the distribution p_z.
Update the generator by gradient descent:
W_G := W_G − α · ∇_{W_G} [ (1/m) ∑_{i=1}^{m} ln(1 − D(G(z_i))) ]

(The updates may also use momentum, an adaptive learning rate, etc.) An implementation sketch of this loop is given below, after the example results.

GAN MNIST

[figure]

GAN faces

... from the original paper.

GAN refined

... after some refinements.
... none of these people ever lived.
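Finally, a minimal sketch of the training loop from the learning algorithm above, assuming PyTorch; the architectures of G and D, the use of plain SGD, the synthetic minibatches and all hyperparameters are illustrative, not the setup of the original paper.

```python
# A minimal sketch of the GAN training loop described above.
# PyTorch is assumed; architectures, optimizers, data and hyperparameters are illustrative.
import random
import torch
import torch.nn as nn

k_dim, n_dim, m = 16, 64, 32                        # noise dim, data dim, minibatch size m
G = nn.Sequential(nn.Linear(k_dim, 128), nn.ReLU(), nn.Linear(128, n_dim))
D = nn.Sequential(nn.Linear(n_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_D = torch.optim.SGD(D.parameters(), lr=0.01)    # plain gradient descent, as on the slide
opt_G = torch.optim.SGD(G.parameters(), lr=0.01)

data = [torch.randn(m, n_dim) for _ in range(100)]  # stand-in minibatches of the training data
k = 1                                               # k discriminator steps per generator step

for iteration in range(200):
    # Discriminator: k steps of gradient descent on E_{T,F}(G, D).
    for _ in range(k):
        x = random.choice(data)                     # minibatch {x_1, ..., x_m} of training examples
        z = torch.randn(m, k_dim)                   # minibatch {z_1, ..., z_m} from p_z
        # detach(): the discriminator step must not propagate gradients into G
        loss_D = -(torch.log(D(x)) + torch.log(1 - D(G(z).detach()))).mean()
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator: one step of gradient descent on (1/m) * sum_i ln(1 - D(G(z_i))).
    z = torch.randn(m, k_dim)
    loss_G = torch.log(1 - D(G(z))).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```

In practice (already suggested in the original paper) the generator step often maximizes ln D(G(z_i)) instead of minimizing ln(1 − D(G(z_i))), which gives stronger gradients early in training when D easily rejects the generated samples.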