Supervised Learning Idea

► We have some data (x^(1), t^(1)), (x^(2), t^(2)), ..., (x^(N), t^(N))
► We want to be able to make a prediction y (of an unseen t) for a new value of x
► For example, predict the exam grade of a person who missed their exam
► How can we build a model to solve the prediction problem?

Supervised Learning Task: Exam Grade Prediction

(Definitely not real data from last term)

Task: Predict Exam Grade given Assignment Grade

[Scatter plot of the data; x-axis: Assignment Grade, y-axis: Exam Grade]

► Data: (x^(1), t^(1)), (x^(2), t^(2)), ..., (x^(N), t^(N))
► The x^(i) are called inputs
► The t^(i) are called targets

Linear Regression Model

A model is a set of assumptions about the underlying nature of the data we wish to learn about. The model, or architecture, defines the set of allowable hypotheses. In linear regression, our model will look like this:

y = Σ_j w_j x_j + b

where y is a prediction for t, and the w_j and b are parameters of the model, to be determined based on the data.

Linear Regression for Exam Grade Prediction

For the exam prediction problem, we only have a single feature, so we can simplify our model to:

y = wx + b

Our hypothesis space includes all functions of the form y = wx + b. Here are some examples:

► y = 0.4x + 0.2
► y = 0.9x + 0.2
► y = 0.1x + 0.7
► y = -x - 1

The variables w and b are called weights or parameters of our model. (Sometimes w and b are referred to as coefficients and intercept, respectively.)

Which hypothesis is better suited to the data?

[Four plots of candidate hypotheses over the data, including y = 0.4x + 0.2, y = 0.9x + 0.2, and y = 0.1x + 0.7; x-axis: Assignment Grade]

Hypothesis Space

We can visualize the hypothesis space or weight space:

[Side-by-side plots: data space and weight space]

Each point in the weight space represents a hypothesis.

Quantifying the "badness" of a hypothesis

Idea: A good hypothesis should make good predictions about our labeled data (x^(1), t^(1)), (x^(2), t^(2)), ..., (x^(N), t^(N)).

Square loss:

E(w, b) = (1/N) Σ_i (y^(i) - t^(i))^2

Goal: Find w, b that minimize E(w, b).

Minimizing the Loss Function

Task: Find w and b that minimize the loss function.

Potential Strategy: Direct Solution

► Find a critical point by setting ∂E/∂w = 0 and ∂E/∂b = 0
► Possible for our hypothesis space, and covered in the notes... and the prerequisite quiz! (See what we did there?) A sketch of the result appears at the end of this section.
► However, let's use a technique that can also be applied to more general models.

Strategy: Gradient Descent

Minimizing a scalar function f(x):

Gradient Descent is an iterative method used to find the minima of a function. We'll start by thinking about a scalar (1D) function. To minimize a function f(x), we start with a random point x_0 and iterate an update rule that we will derive.

Deriving the Gradient Descent Update

Consider this function f(x):

[Plot of f(x) for -3 ≤ x ≤ 3, with a red point and a green point marked]

Q: If we want to move the red point closer to the minimum, do we move left or right?
Q: At the red point x_0, is the derivative f'(x) positive or negative?

We want to move x in the negative direction of the gradient!

How much do we move?

Q: Should we make a larger jump at the red point or the green point?

The larger |f'(x)| is, the more we should move. We slow down close to a minimum. This gives the update rule:

x ← x - α f'(x)

The term α is the learning rate.

Gradient Descent for Linear Regression (2D)

The same idea holds in higher dimensions:

w ← w - α ∂E/∂w
b ← b - α ∂E/∂b

Gradient Descent for Linear Regression (high dimensional)

Or, in general:

w ← w - α ∂E/∂w,  where ∂E/∂w = [∂E/∂w_1, ..., ∂E/∂w_D]^T

It turns out that the negative gradient is the direction of steepest descent.
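The direct solution itself is deferred to the notes; for reference, here is a sketch of what it works out to in the single-feature case, assuming the average square loss E(w, b) = (1/N) Σ_i (y^(i) - t^(i))^2 written above (a standard result, not taken from the slides):

```latex
% Setting the partial derivatives of E(w, b) to zero and solving,
% with \bar{x} and \bar{t} denoting the sample means:
\begin{align*}
  \frac{\partial E}{\partial w} = 0,\quad \frac{\partial E}{\partial b} = 0
  \quad\Longrightarrow\quad
  w = \frac{\sum_i \bigl(x^{(i)} - \bar{x}\bigr)\bigl(t^{(i)} - \bar{t}\bigr)}
           {\sum_i \bigl(x^{(i)} - \bar{x}\bigr)^2},
  \qquad
  b = \bar{t} - w\,\bar{x}
\end{align*}
```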
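To make the scalar update rule x ← x - α f'(x) concrete, here is a minimal sketch (not from the slides; the example function f(x) = x², the learning rate, and the step count are arbitrary choices):

```python
def gradient_descent_1d(f_prime, x0, alpha=0.1, steps=100):
    """Iterate x <- x - alpha * f'(x), starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - alpha * f_prime(x)  # move against the gradient
    return x

# Example: f(x) = x^2 has derivative f'(x) = 2x and a minimum at x = 0.
x_min = gradient_descent_1d(lambda x: 2 * x, x0=3.0)
print(x_min)  # very close to 0
```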
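And here is the same idea for the two-parameter exam-grade model y = wx + b, again as a hedged sketch: the grades are made up, the learning rate and step count are arbitrary, and the gradients are those of the average square loss E(w, b) defined above.

```python
import numpy as np

# Made-up (assignment grade, exam grade) pairs; not the slides' data.
x = np.array([0.3, 0.5, 0.6, 0.8, 0.9])  # inputs x^(i)
t = np.array([0.4, 0.5, 0.7, 0.7, 0.9])  # targets t^(i)

w, b = 0.0, 0.0   # initial hypothesis y = 0*x + 0
alpha = 0.1       # learning rate (arbitrary choice)

for _ in range(1000):
    y = w * x + b                      # predictions of the current hypothesis
    grad_w = 2 * np.mean((y - t) * x)  # dE/dw for E = mean((y - t)^2)
    grad_b = 2 * np.mean(y - t)        # dE/db
    w -= alpha * grad_w                # w <- w - alpha * dE/dw
    b -= alpha * grad_b                # b <- b - alpha * dE/db

print(w, b)  # approaches the least-squares fit for this data
```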
Gradient Descent: when to stop?

In theory:
► Stop when w and b stop changing (convergence)

In practice:
► Stop when E almost stops changing (another notion of convergence)
► Stop when we're tired of waiting

What are neural networks?

• While neural nets originally drew inspiration from the brain, nowadays we mostly think about math, statistics, etc.

[Diagram of a single unit: the i'th input x_i and the i'th weight w_i for inputs x_1, x_2, x_3, a bias b, and a nonlinearity g, producing the output y = g(b + Σ_i x_i w_i)]

• Neural networks are collections of thousands (or millions) of these simple processing units that together perform useful computations.
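As a concrete version of the unit pictured above, which computes y = g(b + Σ_i x_i w_i): a minimal sketch with tanh as an example nonlinearity and made-up inputs, weights, and bias.

```python
import numpy as np

def unit(x, w, b, g=np.tanh):
    """One processing unit: y = g(b + sum_i x_i * w_i)."""
    return g(b + np.dot(x, w))

# Three inputs with made-up weights and bias.
x = np.array([0.5, -1.0, 2.0])   # inputs x_1, x_2, x_3
w = np.array([0.1, 0.4, -0.2])   # weights w_1, w_2, w_3
print(unit(x, w, b=0.3))         # the unit's output y
```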