Agenda
► Biological and Artificial Neurons
► Neural Networks
► Multi-Layer Perceptron (fully-connected layers)
► Backpropagation

Biological and Artificial Neurons
[Figure: a biological neuron, showing the dendrites and nucleus in the cell body (impulses carried toward the cell body), and the axon, its branches, and the axon terminals (impulses carried away from the cell body).]

An Artificial Neural Network (Multi-Layer Perceptron)
Idea:
► Use a simplified (mathematical) model of a neuron as a building block.
► Connect the neurons together into layers.
[Figure: a network with an input layer, hidden layer 1, and hidden layer 2.]
► An input layer: feeds in the input features (e.g. like the retinal cells in your eyes).
► A number of hidden layers: the hidden units don't have a specific meaning.
► An output layer: its output is interpreted like a "grandmother cell".

Modeling Individual Neurons
► $x_1, x_2, \dots$ = inputs to the neuron
► $w_1, w_2, \dots$ = the neuron's weights
► $b$ = the neuron's bias
► $f$ = an activation function
► $f\left(\sum_i x_i w_i + b\right)$ = the neuron's activation (output)

Activation Functions: common choices
► Sigmoid activation
► Tanh activation
► ReLU activation
Rule of thumb: start with the ReLU activation. If necessary, try tanh.

Activation Function: Sigmoid
► somewhat problematic due to its gradient signal (the gradient is close to zero away from the origin)
► all activations are positive

Activation Function: Tanh
► a scaled version of the sigmoid activation
► also somewhat problematic due to its gradient signal
► activations can be positive or negative

Activation Function: ReLU
[Figure: plot of the ReLU activation on the interval $[-10, 10]$.]
► most often used nowadays
► all activations are non-negative
► easy to compute gradients
► can be problematic if the bias is too large and negative, so that the activations are always 0 (a "dead" unit)

Linear Regression as a Single Neuron
► $x_1, x_2, \dots$: inputs
► $w_1, w_2, \dots$: components of the weight vector $w$
► $b$: the bias
► $f$: the identity function
► $y = \sum_j x_j w_j + b = w^T x + b$

Binary Classification (Logistic Regression) as a Single Neuron
► $x_1, x_2, \dots$: inputs
► $w_1, w_2, \dots$: components of the weight vector $w$
► $b$: the bias
► $f = \sigma$
► $y = \sigma\left(\sum_j x_j w_j + b\right) = \sigma(w^T x + b)$

MNIST Digit Recognition
[Figure: a grid of sample 28x28 handwritten-digit images from MNIST.]
► Input: a 28x28 pixel image
► $x$ is a vector of length 784
► Target: the digit represented in the image
► $t$ is a one-hot vector of length 10
► Model (from tutorial 4):
► $y = \mathrm{softmax}(Wx + b)$

Adding a Hidden Layer
Two-layer neural network
[Figure: an input layer fully connected to a hidden layer, which is fully connected to the output layer.]
► Input size: 784 (number of features)
► Hidden size: 50 (we choose this number)
► Output size: 10 (number of classes)

Side note about machine learning models
When discussing machine learning models, we usually
► first talk about how to make predictions, assuming the weights are already trained,
► then talk about how to train the weights.
Often the second step requires gradient descent or some other optimization method.

Making Predictions: computing the hidden layer
$h_i = f\left(\sum_{j=1}^{784} w^{(1)}_{ij} x_j + b^{(1)}_i\right)$, for $i = 1, \dots, 50$

Making Predictions: computing the output (pre-activation)
$z_k = \sum_{i=1}^{50} w^{(2)}_{ki} h_i + b^{(2)}_k$, for $k = 1, \dots, 10$

Making Predictions: applying the output activation
$y = \mathrm{softmax}(z)$

Making Predictions: Vectorized
$h = f(W^{(1)} x + b^{(1)})$
$z = W^{(2)} h + b^{(2)}$
$y = \mathrm{softmax}(z)$
(A code sketch of these equations appears after these slides.)

Expressive Power: Linear Layers (No Activation Function)
► We've seen that there are some functions that linear classifiers can't represent. Are deep networks any better?
► Any sequence of linear layers (with no activation function) can be equivalently represented with a single linear layer:
$y = W^{(3)} W^{(2)} W^{(1)} x = W' x$
► Deep linear networks are no more expressive than linear regression!
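The following is a minimal NumPy sketch of the vectorized prediction equations from the "Making Predictions" slides, assuming the 784-50-10 sizes above, a ReLU hidden activation, and small random weights; the function names and initialization are illustrative choices, not the course's reference code.

```python
import numpy as np

def relu(x):
    # ReLU activation: max(0, x), applied elementwise
    return np.maximum(0.0, x)

def softmax(z):
    # Subtract the max before exponentiating for numerical stability
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

def forward(x, W1, b1, W2, b2):
    """Vectorized forward pass: h = f(W1 x + b1), z = W2 h + b2, y = softmax(z)."""
    h = relu(W1 @ x + b1)   # hidden activations, shape (50,)
    z = W2 @ h + b2         # output pre-activations, shape (10,)
    return softmax(z)       # predicted class probabilities, shape (10,)

# Illustrative sizes from the slides: input 784, hidden 50, output 10
rng = np.random.default_rng(0)
W1, b1 = 0.01 * rng.standard_normal((50, 784)), np.zeros(50)
W2, b2 = 0.01 * rng.standard_normal((10, 50)), np.zeros(10)

x = rng.random(784)          # stand-in for a flattened 28x28 MNIST image
y = forward(x, W1, b1, W2, b2)
print(y.shape, y.sum())      # (10,) and probabilities summing to ~1.0
```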
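As a quick numerical illustration of the claim that stacked linear layers collapse into one, the hedged sketch below (layer sizes and random matrices chosen arbitrarily) checks that applying three linear layers one after another matches the single collapsed matrix $W' = W^{(3)} W^{(2)} W^{(1)}$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(784)

# Three linear layers with no activation function (biases omitted for brevity)
W1 = rng.standard_normal((50, 784))
W2 = rng.standard_normal((50, 50))
W3 = rng.standard_normal((10, 50))

deep_output = W3 @ (W2 @ (W1 @ x))   # "deep" linear network, layer by layer
W_prime = W3 @ W2 @ W1               # collapse into a single weight matrix
single_output = W_prime @ x          # equivalent single linear layer

print(np.allclose(deep_output, single_output))  # True (up to floating point)
```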
Expressive Power: MLP (nonlinear activation)
► Multilayer feed-forward neural nets with nonlinear activation functions are universal approximators: they can approximate any function arbitrarily well.
► This has been shown for various activation functions (thresholds, logistic, ReLU, etc.).
► Even though ReLU is "almost" linear, it's nonlinear enough!

Universality for binary inputs and targets
► Hard threshold hidden units, linear output.
► Strategy: $2^D$ hidden units, each of which responds to one particular input configuration.
► Only requires one hidden layer, though it needs to be extremely wide!

Limits of universality
► You may need to represent an exponentially large network.
► If you can learn any function, you'll just overfit.
► Really, we desire a compact representation!

Backpropagation

Training Neural Networks
► How do we find good weights for the neural network?
► We can continue to use the loss functions we already know:
► cross-entropy loss for classification
► square loss for regression
► The neural network operations we used (weighted sums, activations, etc.) are continuous.
► We can use gradient descent!

Gradient Descent Recap
► Start with a set of parameters (initialized to some value).
► Compute the gradient of the loss with respect to each weight (and also each bias).
► This computation can often be vectorized.
► Update the parameters in the negative direction of the gradient.

Gradient Descent for Neural Networks
► Conceptually, the exact same idea!
► However, we have more parameters than before:
► higher dimensional
► harder to visualize
► more "steps"
► Since $\partial \mathcal{E} / \partial \theta$ (the gradient of the cost) is the average of $\partial \mathcal{L} / \partial \theta$ across training examples, we'll focus on computing $\partial \mathcal{L} / \partial \theta$ for a single example.

Univariate Chain Rule
Recall: if $f(x)$ and $x(t)$ are univariate functions, then
$\frac{d}{dt} f(x(t)) = \frac{df}{dx} \frac{dx}{dt}$

Univariate Chain Rule for Logistic Least Squares
Recall the univariate logistic least squares model:
$z = wx + b$, $\quad y = \sigma(z)$
[Figure: a two-layer network with weights such as $w_{11}$ and $w_{22}$ and cross-entropy loss $\mathcal{L} = -\sum_k t_k \log y_k$.]
► In a multilayer network, there are multiple paths through which a weight like $w_{11}$ affects the loss $\mathcal{L}$.

Multivariate Chain Rule
Suppose we have a function $f(x, y)$ and functions $x(t)$ and $y(t)$. (All the variables here are scalar-valued.) Then
$\frac{d}{dt} f(x(t), y(t)) = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}$

Multivariate Chain Rule Example
If $f(x, y) = y + e^{xy}$, $x(t) = \cos t$, and $y(t) = t^2$, then
$\frac{d}{dt} f(x(t), y(t)) = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt} = (y e^{xy}) \cdot (-\sin t) + (1 + x e^{xy}) \cdot 2t$
(A small numerical check of this example appears at the end of this section.)

Multivariate Chain Rule Notation
Mathematical expressions to be evaluated:
$\frac{df}{dt} = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}$
Values already computed by our program, in our notation:
$\bar{t} = \bar{x}\, \frac{dx}{dt} + \bar{y}\, \frac{dy}{dt}$
Here $\bar{v}$ denotes the computed value of the derivative of the final output with respect to the variable $v$ (the error signal).

The Backpropagation Algorithm
► Backpropagation is an algorithm to compute gradients efficiently.
► Forward pass: compute the predictions (and save intermediate values).
► Backward pass: compute the gradients.
► The idea behind backpropagation is very similar to dynamic programming.
► Use the chain rule, and be careful about the order in which we compute the derivatives.

Backpropagation for a MLP
[Figure: computation graph of a two-layer MLP with inputs $x_1, x_2$, first-layer weights $w^{(1)}_{11}, \dots, w^{(1)}_{22}$, hidden units $z_i, h_i$, second-layer weights $w^{(2)}_{11}, \dots, w^{(2)}_{22}$, outputs $y_k$, and loss $\mathcal{L}$.]
Forward pass:
$z_i = \sum_j w^{(1)}_{ij} x_j + b^{(1)}_i$
$h_i = \sigma(z_i)$
$y_k = \sum_i w^{(2)}_{ki} h_i + b^{(2)}_k$
$\mathcal{L} = \frac{1}{2} \sum_k (y_k - t_k)^2$
Backward pass:
$\bar{\mathcal{L}} = 1$
$\bar{y}_k = \bar{\mathcal{L}}\,(y_k - t_k)$
$\bar{w}^{(2)}_{ki} = \bar{y}_k h_i$
$\bar{b}^{(2)}_k = \bar{y}_k$
$\bar{h}_i = \sum_k \bar{y}_k w^{(2)}_{ki}$
$\bar{z}_i = \bar{h}_i\, \sigma'(z_i)$
$\bar{w}^{(1)}_{ij} = \bar{z}_i x_j$
$\bar{b}^{(1)}_i = \bar{z}_i$

Backpropagation for a MLP (Vectorized)
Forward pass:
$z = W^{(1)} x + b^{(1)}$
$h = \sigma(z)$
$y = W^{(2)} h + b^{(2)}$
$\mathcal{L} = \frac{1}{2} \|y - t\|^2$
Backward pass:
$\bar{\mathcal{L}} = 1$
$\bar{y} = \bar{\mathcal{L}}\,(y - t)$
$\bar{W}^{(2)} = \bar{y} h^T$
$\bar{b}^{(2)} = \bar{y}$
$\bar{h} = W^{(2)\,T} \bar{y}$
$\bar{z} = \bar{h} \circ \sigma'(z)$
$\bar{W}^{(1)} = \bar{z} x^T$
$\bar{b}^{(1)} = \bar{z}$
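The sketch below is a hedged NumPy rendering of the vectorized forward and backward passes above, using the same sigmoid hidden layer and squared-error loss; the tiny layer sizes, random seed, and the finite-difference check on a single weight are illustrative additions, not part of the slides.

```python
import numpy as np

def sigma(z):
    # Sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, t, params):
    W1, b1, W2, b2 = params
    z = W1 @ x + b1                      # hidden pre-activation
    h = sigma(z)                         # hidden activation
    y = W2 @ h + b2                      # output (no output nonlinearity here)
    L = 0.5 * np.sum((y - t) ** 2)       # squared-error loss
    return L, (z, h, y)

def backward(x, t, params, cache):
    W1, b1, W2, b2 = params
    z, h, y = cache
    # Backward pass, mirroring the slide's "bar" (error signal) equations
    y_bar = y - t                                # dL/dy
    W2_bar = np.outer(y_bar, h)                  # dL/dW2 = y_bar h^T
    b2_bar = y_bar                               # dL/db2
    h_bar = W2.T @ y_bar                         # dL/dh
    z_bar = h_bar * sigma(z) * (1 - sigma(z))    # dL/dz = h_bar * sigma'(z)
    W1_bar = np.outer(z_bar, x)                  # dL/dW1 = z_bar x^T
    b1_bar = z_bar                               # dL/db1
    return W1_bar, b1_bar, W2_bar, b2_bar

# Tiny illustrative sizes so the check runs instantly
rng = np.random.default_rng(0)
x, t = rng.standard_normal(4), rng.standard_normal(3)
params = (rng.standard_normal((5, 4)), np.zeros(5),
          rng.standard_normal((3, 5)), np.zeros(3))

L, cache = forward(x, t, params)
grads = backward(x, t, params, cache)

# Finite-difference check on one entry of W1
i, j, eps = 2, 1, 1e-6
W1p = params[0].copy(); W1p[i, j] += eps
W1m = params[0].copy(); W1m[i, j] -= eps
Lp, _ = forward(x, t, (W1p, *params[1:]))
Lm, _ = forward(x, t, (W1m, *params[1:]))
print(grads[0][i, j], (Lp - Lm) / (2 * eps))  # the two values should match closely
```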
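As referenced in the multivariate chain rule example, here is a small numerical check of that example: the analytic derivative $(y e^{xy})(-\sin t) + (1 + x e^{xy})(2t)$ is compared against a central finite difference of $t \mapsto f(\cos t, t^2)$; the chosen value of $t$ and the step size are arbitrary.

```python
import numpy as np

def f(x, y):
    return y + np.exp(x * y)

def df_dt_analytic(t):
    # Multivariate chain rule: df/dt = (df/dx) dx/dt + (df/dy) dy/dt
    x, y = np.cos(t), t ** 2
    dfdx, dfdy = y * np.exp(x * y), 1 + x * np.exp(x * y)
    dxdt, dydt = -np.sin(t), 2 * t
    return dfdx * dxdt + dfdy * dydt

def df_dt_numeric(t, eps=1e-6):
    # Central finite difference of t -> f(x(t), y(t))
    g = lambda s: f(np.cos(s), s ** 2)
    return (g(t + eps) - g(t - eps)) / (2 * eps)

t = 0.7
print(df_dt_analytic(t), df_dt_numeric(t))  # the two values should agree closely
```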