Supervised Learning Idea

► We have some data (x^(1), t^(1)), (x^(2), t^(2)), ..., (x^(N), t^(N))
► We want to be able to make a prediction y (of an unseen t) for a new value of x
► For example, predict the exam grade of a person who missed their exam
► How can we build a model to solve the prediction problem?

Supervised Learning Task: Exam Grade Prediction

(Definitely not real data from last term)

Task: Predict Exam Grade given Assignment Grade

[Scatter plot of the data; x-axis: Assignment Grade, y-axis: Exam Grade]

► Data: (x^(1), t^(1)), (x^(2), t^(2)), ..., (x^(N), t^(N))
► The x^(i) are called inputs
► The t^(i) are called targets

Linear Regression Model

A model is a set of assumptions about the underlying nature of the data we wish to learn about. The model, or architecture, defines the set of allowable hypotheses. In linear regression, our model will look like this:

y = Σ_j w_j x_j + b

where y is a prediction for t, and the w_j and b are parameters of the model, to be determined based on the data.

Linear Regression for Exam Grade Prediction

For the exam prediction problem, we only have a single feature, so we can simplify our model to:

y = wx + b

Our hypothesis space includes all functions of the form y = wx + b. Here are some examples:

► y = 0.4x + 0.2
► y = 0.9x + 0.2
► y = 0.1x + 0.7
► y = -x - 1

The variables w and b are called weights or parameters of our model. (Sometimes w and b are referred to as coefficients and intercept, respectively.)

Which hypothesis is better suited to the data?

[Four plots of candidate hypotheses over the data, including y = 0.4x + 0.2, y = 0.9x + 0.2, and y = 0.1x + 0.7; x-axis: Assignment Grade]

Hypothesis Space

We can visualize the hypothesis space or weight space:

[Side-by-side plots: data space and weight space]

Each point in the weight space represents a hypothesis.

Quantifying the "badness" of a hypothesis

Idea: A good hypothesis should make good predictions about our labeled data (x^(1), t^(1)), (x^(2), t^(2)), ..., (x^(N), t^(N)).

Square loss:

E(w, b) = (1/N) Σ_i (y^(i) - t^(i))^2

Goal: Find w, b that minimize E(w, b).

Minimizing the Loss Function

Task: Find w and b that minimize the loss function.

Potential Strategy: Direct Solution

► Find a critical point by setting ∂E/∂w = 0 and ∂E/∂b = 0
► Possible for our hypothesis space, and covered in the notes... and the prerequisite quiz! (See what we did there?) A sketch of the result appears at the end of this section.
► However, let's use a technique that can also be applied to more general models.

Strategy: Gradient Descent

Minimizing a scalar function f(x):

Gradient Descent is an iterative method used to find the minima of a function. We'll start by thinking about a scalar (1D) function. To minimize a function f(x), we start with a random point x_0 and iterate an update rule that we will derive.

Deriving the Gradient Descent Update

Consider this function f(x):

[Plot of f(x) for -3 ≤ x ≤ 3, with a red point and a green point marked]

Q: If we want to move the red point closer to the minimum, do we move left or right?
Q: At the red point x_0, is the derivative f'(x) positive or negative?

We want to move x in the negative direction of the gradient!

How much do we move?

Q: Should we make a larger jump at the red point or the green point?

The larger |f'(x)| is, the more we should move. We slow down close to a minimum. This gives the update rule:

x ← x - α f'(x)

The term α is the learning rate.

Gradient Descent for Linear Regression (2D)

The same idea holds in higher dimensions:

w ← w - α ∂E/∂w
b ← b - α ∂E/∂b

Gradient Descent for Linear Regression (high dimensional)

Or, in general:

w ← w - α ∂E/∂w,  where ∂E/∂w = [∂E/∂w_1, ..., ∂E/∂w_D]^T

It turns out that the negative gradient is the direction of steepest descent.
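The direct solution itself is deferred to the notes; for reference, here is a sketch of what it works out to in the single-feature case, assuming the average square loss E(w, b) = (1/N) Σ_i (y^(i) - t^(i))^2 written above (a standard result, not taken from the slides):

```latex
% Setting the partial derivatives of E(w, b) to zero and solving,
% with \bar{x} and \bar{t} denoting the sample means:
\begin{align*}
  \frac{\partial E}{\partial w} = 0,\quad \frac{\partial E}{\partial b} = 0
  \quad\Longrightarrow\quad
  w = \frac{\sum_i \bigl(x^{(i)} - \bar{x}\bigr)\bigl(t^{(i)} - \bar{t}\bigr)}
           {\sum_i \bigl(x^{(i)} - \bar{x}\bigr)^2},
  \qquad
  b = \bar{t} - w\,\bar{x}
\end{align*}
```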
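To make the scalar update rule x ← x - α f'(x) concrete, here is a minimal sketch (not from the slides; the example function f(x) = x², the learning rate, and the step count are arbitrary choices):

```python
def gradient_descent_1d(f_prime, x0, alpha=0.1, steps=100):
    """Iterate x <- x - alpha * f'(x), starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - alpha * f_prime(x)  # move against the gradient
    return x

# Example: f(x) = x^2 has derivative f'(x) = 2x and a minimum at x = 0.
x_min = gradient_descent_1d(lambda x: 2 * x, x0=3.0)
print(x_min)  # very close to 0
```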
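And here is the same idea for the two-parameter exam-grade model y = wx + b, again as a hedged sketch: the grades are made up, the learning rate and step count are arbitrary, and the gradients are those of the average square loss E(w, b) defined above.

```python
import numpy as np

# Made-up (assignment grade, exam grade) pairs; not the slides' data.
x = np.array([0.3, 0.5, 0.6, 0.8, 0.9])  # inputs x^(i)
t = np.array([0.4, 0.5, 0.7, 0.7, 0.9])  # targets t^(i)

w, b = 0.0, 0.0   # initial hypothesis y = 0*x + 0
alpha = 0.1       # learning rate (arbitrary choice)

for _ in range(1000):
    y = w * x + b                      # predictions of the current hypothesis
    grad_w = 2 * np.mean((y - t) * x)  # dE/dw for E = mean((y - t)^2)
    grad_b = 2 * np.mean(y - t)        # dE/db
    w -= alpha * grad_w                # w <- w - alpha * dE/dw
    b -= alpha * grad_b                # b <- b - alpha * dE/db

print(w, b)  # approaches the least-squares fit for this data
```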
Gradient Descent: when to stop?

In theory:
► Stop when w and b stop changing (convergence)

In practice:
► Stop when E almost stops changing (another notion of convergence)
► Stop when we're tired of waiting

What are neural networks?

• While neural nets originally drew inspiration from the brain, nowadays we mostly think about math, statistics, etc.

[Diagram of a single unit: the i'th input x_i and the i'th weight w_i for inputs x_1, x_2, x_3, a bias b, and a nonlinearity g, producing the output y = g(b + Σ_i x_i w_i)]

• Neural networks are collections of thousands (or millions) of these simple processing units that together perform useful computations.
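As a concrete version of the unit pictured above, which computes y = g(b + Σ_i x_i w_i): a minimal sketch with tanh as an example nonlinearity and made-up inputs, weights, and bias.

```python
import numpy as np

def unit(x, w, b, g=np.tanh):
    """One processing unit: y = g(b + sum_i x_i * w_i)."""
    return g(b + np.dot(x, w))

# Three inputs with made-up weights and bias.
x = np.array([0.5, -1.0, 2.0])   # inputs x_1, x_2, x_3
w = np.array([0.1, 0.4, -0.2])   # weights w_1, w_2, w_3
print(unit(x, w, b=0.3))         # the unit's output y
```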