Agenda
► Biological and Artificial Neurons
► Neural Networks
► Multi-Layer Perceptron (fully-connected layers)
► Backpropagation

Biological and Artificial Neurons
[Figure: a biological neuron, showing the dendrites and nucleus in the cell body (impulses carried toward the cell body), and the axon, its branches, and the axon terminals (impulses carried away from the cell body).]

An Artificial Neural Network (Multi-Layer Perceptron)
Idea:
► Use a simplified (mathematical) model of a neuron as a building block.
► Connect the neurons together into layers.
[Figure: a network with an input layer, hidden layer 1, and hidden layer 2.]
► An input layer: feeds in the input features (e.g. like the retinal cells in your eyes).
► A number of hidden layers: the hidden units don't have a specific meaning.
► An output layer: its output is interpreted like a "grandmother cell".

Modeling Individual Neurons
► $x_1, x_2, \dots$ = inputs to the neuron
► $w_1, w_2, \dots$ = the neuron's weights
► $b$ = the neuron's bias
► $f$ = an activation function
► $f\left(\sum_i x_i w_i + b\right)$ = the neuron's activation (output)

Activation Functions: common choices
► Sigmoid activation
► Tanh activation
► ReLU activation
Rule of thumb: start with the ReLU activation. If necessary, try tanh.

Activation Function: Sigmoid
► somewhat problematic due to its gradient signal (the gradient is close to zero away from the origin)
► all activations are positive

Activation Function: Tanh
► a scaled version of the sigmoid activation
► also somewhat problematic due to its gradient signal
► activations can be positive or negative

Activation Function: ReLU
[Figure: plot of the ReLU activation on the interval $[-10, 10]$.]
► most often used nowadays
► all activations are non-negative
► easy to compute gradients
► can be problematic if the bias is too large and negative, so that the activations are always 0 (a "dead" unit)

Linear Regression as a Single Neuron
► $x_1, x_2, \dots$: inputs
► $w_1, w_2, \dots$: components of the weight vector $w$
► $b$: the bias
► $f$: the identity function
► $y = \sum_j x_j w_j + b = w^T x + b$

Binary Classification (Logistic Regression) as a Single Neuron
► $x_1, x_2, \dots$: inputs
► $w_1, w_2, \dots$: components of the weight vector $w$
► $b$: the bias
► $f = \sigma$
► $y = \sigma\left(\sum_j x_j w_j + b\right) = \sigma(w^T x + b)$

MNIST Digit Recognition
[Figure: a grid of sample 28x28 handwritten-digit images from MNIST.]
► Input: a 28x28 pixel image
► $x$ is a vector of length 784
► Target: the digit represented in the image
► $t$ is a one-hot vector of length 10
► Model (from tutorial 4):
► $y = \mathrm{softmax}(Wx + b)$

Adding a Hidden Layer
Two-layer neural network
[Figure: an input layer fully connected to a hidden layer, which is fully connected to the output layer.]
► Input size: 784 (number of features)
► Hidden size: 50 (we choose this number)
► Output size: 10 (number of classes)

Side note about machine learning models
When discussing machine learning models, we usually
► first talk about how to make predictions, assuming the weights are already trained,
► then talk about how to train the weights.
Often the second step requires gradient descent or some other optimization method.

Making Predictions: computing the hidden layer
$h_i = f\left(\sum_{j=1}^{784} w^{(1)}_{ij} x_j + b^{(1)}_i\right)$, for $i = 1, \dots, 50$

Making Predictions: computing the output (pre-activation)
$z_k = \sum_{i=1}^{50} w^{(2)}_{ki} h_i + b^{(2)}_k$, for $k = 1, \dots, 10$

Making Predictions: applying the output activation
$y = \mathrm{softmax}(z)$

Making Predictions: Vectorized
$h = f(W^{(1)} x + b^{(1)})$
$z = W^{(2)} h + b^{(2)}$
$y = \mathrm{softmax}(z)$
(A code sketch of these equations appears after these slides.)

Expressive Power: Linear Layers (No Activation Function)
► We've seen that there are some functions that linear classifiers can't represent. Are deep networks any better?
► Any sequence of linear layers (with no activation function) can be equivalently represented with a single linear layer:
$y = W^{(3)} W^{(2)} W^{(1)} x = W' x$
► Deep linear networks are no more expressive than linear regression!
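The following is a minimal NumPy sketch of the vectorized prediction equations from the "Making Predictions" slides, assuming the 784-50-10 sizes above, a ReLU hidden activation, and small random weights; the function names and initialization are illustrative choices, not the course's reference code.

```python
import numpy as np

def relu(x):
    # ReLU activation: max(0, x), applied elementwise
    return np.maximum(0.0, x)

def softmax(z):
    # Subtract the max before exponentiating for numerical stability
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

def forward(x, W1, b1, W2, b2):
    """Vectorized forward pass: h = f(W1 x + b1), z = W2 h + b2, y = softmax(z)."""
    h = relu(W1 @ x + b1)   # hidden activations, shape (50,)
    z = W2 @ h + b2         # output pre-activations, shape (10,)
    return softmax(z)       # predicted class probabilities, shape (10,)

# Illustrative sizes from the slides: input 784, hidden 50, output 10
rng = np.random.default_rng(0)
W1, b1 = 0.01 * rng.standard_normal((50, 784)), np.zeros(50)
W2, b2 = 0.01 * rng.standard_normal((10, 50)), np.zeros(10)

x = rng.random(784)          # stand-in for a flattened 28x28 MNIST image
y = forward(x, W1, b1, W2, b2)
print(y.shape, y.sum())      # (10,) and probabilities summing to ~1.0
```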
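As a quick numerical illustration of the claim that stacked linear layers collapse into one, the hedged sketch below (layer sizes and random matrices chosen arbitrarily) checks that applying three linear layers one after another matches the single collapsed matrix $W' = W^{(3)} W^{(2)} W^{(1)}$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(784)

# Three linear layers with no activation function (biases omitted for brevity)
W1 = rng.standard_normal((50, 784))
W2 = rng.standard_normal((50, 50))
W3 = rng.standard_normal((10, 50))

deep_output = W3 @ (W2 @ (W1 @ x))   # "deep" linear network, layer by layer
W_prime = W3 @ W2 @ W1               # collapse into a single weight matrix
single_output = W_prime @ x          # equivalent single linear layer

print(np.allclose(deep_output, single_output))  # True (up to floating point)
```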
Expressive Power: MLP (nonlinear activation)
► Multilayer feed-forward neural nets with nonlinear activation functions are universal approximators: they can approximate any function arbitrarily well.
► This has been shown for various activation functions (thresholds, logistic, ReLU, etc.).
► Even though ReLU is "almost" linear, it's nonlinear enough!

Universality for binary inputs and targets
► Hard threshold hidden units, linear output.
► Strategy: $2^D$ hidden units, each of which responds to one particular input configuration.
► Only requires one hidden layer, though it needs to be extremely wide!

Limits of universality
► You may need to represent an exponentially large network.
► If you can learn any function, you'll just overfit.
► Really, we desire a compact representation!

Backpropagation

Training Neural Networks
► How do we find good weights for the neural network?
► We can continue to use the loss functions we already know:
► cross-entropy loss for classification
► square loss for regression
► The neural network operations we used (weighted sums, activations, etc.) are continuous.
► We can use gradient descent!

Gradient Descent Recap
► Start with a set of parameters (initialized to some value).
► Compute the gradient of the loss with respect to each weight (and also each bias).
► This computation can often be vectorized.
► Update the parameters in the negative direction of the gradient.

Gradient Descent for Neural Networks
► Conceptually, the exact same idea!
► However, we have more parameters than before:
► higher dimensional
► harder to visualize
► more "steps"
► Since $\partial \mathcal{E} / \partial \theta$ (the gradient of the cost) is the average of $\partial \mathcal{L} / \partial \theta$ across training examples, we'll focus on computing $\partial \mathcal{L} / \partial \theta$ for a single example.

Univariate Chain Rule
Recall: if $f(x)$ and $x(t)$ are univariate functions, then
$\frac{d}{dt} f(x(t)) = \frac{df}{dx} \frac{dx}{dt}$

Univariate Chain Rule for Logistic Least Squares
Recall the univariate logistic least squares model:
$z = wx + b$, $\quad y = \sigma(z)$
[Figure: a two-layer network with weights such as $w_{11}$ and $w_{22}$ and cross-entropy loss $\mathcal{L} = -\sum_k t_k \log y_k$.]
► In a multilayer network, there are multiple paths through which a weight like $w_{11}$ affects the loss $\mathcal{L}$.

Multivariate Chain Rule
Suppose we have a function $f(x, y)$ and functions $x(t)$ and $y(t)$. (All the variables here are scalar-valued.) Then
$\frac{d}{dt} f(x(t), y(t)) = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}$

Multivariate Chain Rule Example
If $f(x, y) = y + e^{xy}$, $x(t) = \cos t$, and $y(t) = t^2$, then
$\frac{d}{dt} f(x(t), y(t)) = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt} = (y e^{xy}) \cdot (-\sin t) + (1 + x e^{xy}) \cdot 2t$
(A small numerical check of this example appears at the end of this section.)

Multivariate Chain Rule Notation
Mathematical expressions to be evaluated:
$\frac{df}{dt} = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}$
Values already computed by our program, in our notation:
$\bar{t} = \bar{x}\, \frac{dx}{dt} + \bar{y}\, \frac{dy}{dt}$
Here $\bar{v}$ denotes the computed value of the derivative of the final output with respect to the variable $v$ (the error signal).

The Backpropagation Algorithm
► Backpropagation is an algorithm to compute gradients efficiently.
► Forward pass: compute the predictions (and save intermediate values).
► Backward pass: compute the gradients.
► The idea behind backpropagation is very similar to dynamic programming.
► Use the chain rule, and be careful about the order in which we compute the derivatives.

Backpropagation for a MLP
[Figure: computation graph of a two-layer MLP with inputs $x_1, x_2$, first-layer weights $w^{(1)}_{11}, \dots, w^{(1)}_{22}$, hidden units $z_i, h_i$, second-layer weights $w^{(2)}_{11}, \dots, w^{(2)}_{22}$, outputs $y_k$, and loss $\mathcal{L}$.]
Forward pass:
$z_i = \sum_j w^{(1)}_{ij} x_j + b^{(1)}_i$
$h_i = \sigma(z_i)$
$y_k = \sum_i w^{(2)}_{ki} h_i + b^{(2)}_k$
$\mathcal{L} = \frac{1}{2} \sum_k (y_k - t_k)^2$
Backward pass:
$\bar{\mathcal{L}} = 1$
$\bar{y}_k = \bar{\mathcal{L}}\,(y_k - t_k)$
$\bar{w}^{(2)}_{ki} = \bar{y}_k h_i$
$\bar{b}^{(2)}_k = \bar{y}_k$
$\bar{h}_i = \sum_k \bar{y}_k w^{(2)}_{ki}$
$\bar{z}_i = \bar{h}_i\, \sigma'(z_i)$
$\bar{w}^{(1)}_{ij} = \bar{z}_i x_j$
$\bar{b}^{(1)}_i = \bar{z}_i$

Backpropagation for a MLP (Vectorized)
Forward pass:
$z = W^{(1)} x + b^{(1)}$
$h = \sigma(z)$
$y = W^{(2)} h + b^{(2)}$
$\mathcal{L} = \frac{1}{2} \|y - t\|^2$
Backward pass:
$\bar{\mathcal{L}} = 1$
$\bar{y} = \bar{\mathcal{L}}\,(y - t)$
$\bar{W}^{(2)} = \bar{y} h^T$
$\bar{b}^{(2)} = \bar{y}$
$\bar{h} = W^{(2)\,T} \bar{y}$
$\bar{z} = \bar{h} \circ \sigma'(z)$
$\bar{W}^{(1)} = \bar{z} x^T$
$\bar{b}^{(1)} = \bar{z}$
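The sketch below is a hedged NumPy rendering of the vectorized forward and backward passes above, using the same sigmoid hidden layer and squared-error loss; the tiny layer sizes, random seed, and the finite-difference check on a single weight are illustrative additions, not part of the slides.

```python
import numpy as np

def sigma(z):
    # Sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, t, params):
    W1, b1, W2, b2 = params
    z = W1 @ x + b1                      # hidden pre-activation
    h = sigma(z)                         # hidden activation
    y = W2 @ h + b2                      # output (no output nonlinearity here)
    L = 0.5 * np.sum((y - t) ** 2)       # squared-error loss
    return L, (z, h, y)

def backward(x, t, params, cache):
    W1, b1, W2, b2 = params
    z, h, y = cache
    # Backward pass, mirroring the slide's "bar" (error signal) equations
    y_bar = y - t                                # dL/dy
    W2_bar = np.outer(y_bar, h)                  # dL/dW2 = y_bar h^T
    b2_bar = y_bar                               # dL/db2
    h_bar = W2.T @ y_bar                         # dL/dh
    z_bar = h_bar * sigma(z) * (1 - sigma(z))    # dL/dz = h_bar * sigma'(z)
    W1_bar = np.outer(z_bar, x)                  # dL/dW1 = z_bar x^T
    b1_bar = z_bar                               # dL/db1
    return W1_bar, b1_bar, W2_bar, b2_bar

# Tiny illustrative sizes so the check runs instantly
rng = np.random.default_rng(0)
x, t = rng.standard_normal(4), rng.standard_normal(3)
params = (rng.standard_normal((5, 4)), np.zeros(5),
          rng.standard_normal((3, 5)), np.zeros(3))

L, cache = forward(x, t, params)
grads = backward(x, t, params, cache)

# Finite-difference check on one entry of W1
i, j, eps = 2, 1, 1e-6
W1p = params[0].copy(); W1p[i, j] += eps
W1m = params[0].copy(); W1m[i, j] -= eps
Lp, _ = forward(x, t, (W1p, *params[1:]))
Lm, _ = forward(x, t, (W1m, *params[1:]))
print(grads[0][i, j], (Lp - Lm) / (2 * eps))  # the two values should match closely
```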
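As referenced in the multivariate chain rule example, here is a small numerical check of that example: the analytic derivative $(y e^{xy})(-\sin t) + (1 + x e^{xy})(2t)$ is compared against a central finite difference of $t \mapsto f(\cos t, t^2)$; the chosen value of $t$ and the step size are arbitrary.

```python
import numpy as np

def f(x, y):
    return y + np.exp(x * y)

def df_dt_analytic(t):
    # Multivariate chain rule: df/dt = (df/dx) dx/dt + (df/dy) dy/dt
    x, y = np.cos(t), t ** 2
    dfdx, dfdy = y * np.exp(x * y), 1 + x * np.exp(x * y)
    dxdt, dydt = -np.sin(t), 2 * t
    return dfdx * dxdt + dfdy * dydt

def df_dt_numeric(t, eps=1e-6):
    # Central finite difference of t -> f(x(t), y(t))
    g = lambda s: f(np.cos(s), s ** 2)
    return (g(t + eps) - g(t - eps)) / (2 * eps)

t = 0.7
print(df_dt_analytic(t), df_dt_numeric(t))  # the two values should agree closely
```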