Autoencoders

An autoencoder consists of two parts:
φ : R^n → R^m   the encoder
ψ : R^m → R^n   the decoder
The goal is to find φ, ψ so that ψ ◦ φ is (almost) the identity. The value h = φ(x) is called the latent representation of x.

Autoencoders – training

Assume a training set T = {x_1, ..., x_p} where x_i ∈ R^n for all i ∈ {1, ..., p}. Minimize the reconstruction error
E = ∑_{i=1}^{p} ||x_i − ψ(φ(x_i))||^2

Autoencoders – neural networks

Both φ and ψ can be represented by MLPs M_φ and M_ψ, respectively. M_φ and M_ψ can be connected into a single network.

Autoencoders – Usage

Compression – from x to h.
Dimensionality reduction – the latent representation h has a smaller dimension.
Pretraining (see below).
Generative versions – (roughly) generate h from a known distribution and let M_ψ generate realistic inputs x.

Autoencoder – compression – historical implementation

Architecture: MLP 64 − 16 − 64
Activity: the activation function is the hyperbolic tangent with limits −1 and 1.
Data: images 256 × 256, 8 bits per pixel.
Samples: the input (and desired output) is an 8 × 8 frame randomly selected from the image. Inputs normalized to [−1, 1].
The goal was to compress images to a smaller data size.
An 8 × 8 frame passes over the 256 × 256 image (no overlap):
(A) original
(B) compression
(C) compression + rounding to 6 bits (1.5 bits per pixel)
(D) compression + rounding to 4 bits (1 bit per pixel)

Dimensionality reduction – compression

A new image (the network was trained on the previous one):
(A) original
(B) compression
(C) compression + rounding to 6 bits (1.5 bits per pixel)
(D) compression + rounding to 4 bits (1 bit per pixel)

Application – dimensionality reduction

Dimensionality reduction: a mapping R from R^n to R^m, where m < n, such that every example x can be "reconstructed" from R(x).
Standard method: PCA (there are many linear as well as non-linear variants).

Reconstruction – PCA

1024 pixels compressed to 100 dimensions (i.e. 100 numbers).

PCA vs Autoencoders

[figure]

Autoencoders – Pretraining

An autoencoder is (pre)trained on input data x_i without desired outputs (unsupervised); typically much larger datasets of unlabelled data are available.
The encoder M_φ computes a latent representation for every input vector; it is supposed to extract important features (controversial).
A new part of the model, M_top, is added on top of M_φ (e.g. an MLP taking the output of M_φ as its input).
Subsequently, labels are added and the whole model (composed of M_φ and M_top) is trained on the labelled data.

[figure]
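As a concrete illustration of the autoencoder and the pretraining scheme above, here is a minimal sketch assuming PyTorch; the layer sizes, the names encoder/decoder/top and the synthetic stand-in data are illustrative choices, not part of the lecture.

```python
# A minimal autoencoder + pretraining sketch. PyTorch is assumed; layer sizes,
# names (encoder, decoder, top) and the synthetic data are illustrative only.
import torch
import torch.nn as nn

n, m = 784, 30                                      # input dimension n, latent dimension m < n
encoder = nn.Sequential(nn.Linear(n, 256), nn.ReLU(), nn.Linear(256, m))   # M_phi
decoder = nn.Sequential(nn.Linear(m, 256), nn.ReLU(), nn.Linear(256, n))   # M_psi

# Synthetic stand-ins for the (large) unlabelled set and the (small) labelled set.
unlabelled = [torch.randn(32, n) for _ in range(20)]
labelled = [(torch.randn(32, n), torch.randint(0, 10, (32,))) for _ in range(5)]

# Phase 1 (unsupervised): minimize the reconstruction error E = sum_i ||x_i - psi(phi(x_i))||^2.
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
mse = nn.MSELoss()
for x in unlabelled:
    h = encoder(x)                                  # latent representation h = phi(x)
    loss = mse(decoder(h), x)                       # squared reconstruction error (mean)
    opt.zero_grad(); loss.backward(); opt.step()

# Phase 2 (supervised): add M_top on top of M_phi and train the whole model on labels.
top = nn.Sequential(nn.Linear(m, 64), nn.ReLU(), nn.Linear(64, 10))        # M_top, e.g. a small MLP
model = nn.Sequential(encoder, top)                 # M_phi followed by M_top
opt2 = torch.optim.Adam(model.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()
for x, y in labelled:
    loss = ce(model(x), y)
    opt2.zero_grad(); loss.backward(); opt2.step()
```

Phase 2 follows the slide and fine-tunes the encoder together with M_top; freezing the encoder and training only the top part is a common, cheaper variant.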
Deep MLP – dimensionality reduction

Hinton, G. E., Osindero, S. and Teh, Y. (2006) A fast learning algorithm for deep belief nets. Neural Computation, 18, pp. 1527-1554.
Hinton, G. E. and Salakhutdinov, R. R. (2006) Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 28 July 2006.
This basically started all the deep learning craze ...

[figure]

Images – pretraining

Data: 165 600 black-and-white images, 25 × 25, mean intensity 0, variance 1. The images were obtained from the Olivetti Faces database of 64 × 64 images using standard transformations. 103 500 training, 20 700 validation, 41 400 test.
Network: 2000-1000-500-30, trained using layered RBMs.
Notes:
Training of the lowest layer (2000 neurons): values of pixels distorted using Gaussian noise, low learning rate 0.001, 200 iterations.
Training of all hidden layers: values of the neurons are binary.
Training of the output layer: values computed directly using the sigmoid activation function + noise; that is, values of the output neurons are from the interval [0, 1].

Images – fine-tuning

The stochastic activations are substituted with deterministic ones; that is, the value of a hidden neuron is not chosen randomly but computed directly by applying the sigmoid to its inner potential (this gives the mean activation).
Backpropagation. Error function: cross-entropy
−∑_i p_i ln p̂_i − ∑_i (1 − p_i) ln(1 − p̂_i)
where p_i is the intensity of the i-th pixel of the input and p̂_i that of the reconstruction.

Results

1. Original
2. Reconstruction using deep networks (reduction to 30 dimensions)
3. Reconstruction using PCA (reduction to 30 dimensions)

Generative adversarial networks

Generative Adversarial Nets, Goodfellow et al., NIPS 2014.
An unsupervised generative model consisting of two networks:
Generator: a network computing a function G : R^k → R^n which takes a random input z with a distribution p_z (e.g. a multivariate normal distribution) and returns G(z), which should follow the target probability distribution. E.g. G(z) could be realistic-looking faces.
Discriminator: a network computing a function D : R^n → [0, 1] that, given x ∈ R^n, gives the probability D(x) that x was not "generated" by G. E.g. x can be an image and D(x) the probability that it is a true face of an existing person.
What error function will "motivate" G to generate realistically and D to discriminate appropriately?

Generative adversarial networks – error function

Let T = {x_1, ..., x_p} be a training multiset (or a minibatch).
Intuition: G should produce outputs similar to the elements of T; D should recognize that its input is not from T.
Generate a multiset of noise samples F = {z_1, ..., z_p} from the distribution p_z and set
E_{T,F}(G, D) = −(1/p) ∑_{i=1}^{p} [ ln D(x_i) + ln(1 − D(G(z_i))) ]
This is just the binary cross-entropy error of D, which classifies its input as either real or fake.
The problem can be seen as a game: the discriminator wants to minimize E, the generator wants to maximize E!
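To make the link to binary cross-entropy explicit, the following sketch (PyTorch assumed; G, D and all dimensions are placeholders) evaluates E_{T,F}(G, D) on one minibatch.

```python
# E_{T,F}(G, D) for one minibatch is the binary cross-entropy of D, with real
# samples labelled 1 and generated samples labelled 0. PyTorch is assumed;
# G, D and all dimensions are illustrative placeholders.
import torch
import torch.nn as nn

k_dim, n_dim, p = 16, 64, 32                        # noise dim k, data dim n, minibatch size p
G = nn.Sequential(nn.Linear(k_dim, 64), nn.ReLU(), nn.Linear(64, n_dim))
D = nn.Sequential(nn.Linear(n_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

x = torch.randn(p, n_dim)                           # stand-in for real samples x_1, ..., x_p from T
z = torch.randn(p, k_dim)                           # noise samples z_1, ..., z_p from p_z
bce = nn.BCELoss()                                  # mean over the minibatch gives the 1/p factor

E = bce(D(x), torch.ones(p, 1)) + bce(D(G(z)), torch.zeros(p, 1))
# E equals -(1/p) * sum_i [ ln D(x_i) + ln(1 - D(G(z_i))) ] = E_{T,F}(G, D):
# the discriminator takes gradient steps to decrease E, the generator to increase it.
```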
The learning algorithm

Denote by W_G and W_D the weights of G and D, respectively. In every iteration of the training, modify the weights of the discriminator and the generator as follows:

For k steps (here k is a hyperparameter) update the discriminator:
Sample a minibatch T = {x_1, ..., x_m} from the training data.
Sample a minibatch F = {z_1, ..., z_m} from the distribution p_z.
Update W_D by gradient descent w.r.t. E:
W_D := W_D − α · ∇_{W_D} E_{T,F}(G, D)

Now update the generator:
Sample a minibatch F = {z_1, ..., z_m} from the distribution p_z.
Update the generator by gradient descent:
W_G := W_G − α · ∇_{W_G} [ (1/m) ∑_{i=1}^{m} ln(1 − D(G(z_i))) ]

(The updates may also use momentum, an adaptive learning rate, etc.) An implementation sketch of this loop is given below, after the example results.

GAN MNIST

[figure]

GAN faces

... from the original paper.

GAN refined

... after some refinements.
... none of these people ever lived.
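Finally, a minimal sketch of the training loop from the learning algorithm above, assuming PyTorch; the architectures of G and D, the use of plain SGD, the synthetic minibatches and all hyperparameters are illustrative, not the setup of the original paper.

```python
# A minimal sketch of the GAN training loop described above.
# PyTorch is assumed; architectures, optimizers, data and hyperparameters are illustrative.
import random
import torch
import torch.nn as nn

k_dim, n_dim, m = 16, 64, 32                        # noise dim, data dim, minibatch size m
G = nn.Sequential(nn.Linear(k_dim, 128), nn.ReLU(), nn.Linear(128, n_dim))
D = nn.Sequential(nn.Linear(n_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_D = torch.optim.SGD(D.parameters(), lr=0.01)    # plain gradient descent, as on the slide
opt_G = torch.optim.SGD(G.parameters(), lr=0.01)

data = [torch.randn(m, n_dim) for _ in range(100)]  # stand-in minibatches of the training data
k = 1                                               # k discriminator steps per generator step

for iteration in range(200):
    # Discriminator: k steps of gradient descent on E_{T,F}(G, D).
    for _ in range(k):
        x = random.choice(data)                     # minibatch {x_1, ..., x_m} of training examples
        z = torch.randn(m, k_dim)                   # minibatch {z_1, ..., z_m} from p_z
        # detach(): the discriminator step must not propagate gradients into G
        loss_D = -(torch.log(D(x)) + torch.log(1 - D(G(z).detach()))).mean()
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator: one step of gradient descent on (1/m) * sum_i ln(1 - D(G(z_i))).
    z = torch.randn(m, k_dim)
    loss_G = torch.log(1 - D(G(z))).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```

In practice (already suggested in the original paper) the generator step often maximizes ln D(G(z_i)) instead of minimizing ln(1 − D(G(z_i))), which gives stronger gradients early in training when D easily rejects the generated samples.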