Probabilistic Classification

Based on the ML lecture by Raymond J. Mooney, University of Texas at Austin

Probabilistic Classification — Idea

Imagine that
► I look out of a window and see a bird,
► it is black, approx. 25 cm long, and has a rather yellow beak.
My daughter asks: What kind of bird is this?
My usual answer: This is probably a kind of blackbird (kos černý in Czech).
Here probably means that out of my extensive catalogue of four kinds of birds that I am able to recognize, "blackbird" gets the highest degree of belief based on the features of this particular bird. Frequentists might say that the largest proportion of birds with similar features I have ever seen were blackbirds.
The degree of belief (Bayesians), or the relative frequency (frequentists), is the probability.

Basic Discrete Probability Theory

► A finite or countably infinite set Ω of possible outcomes; Ω is called the sample space.
  Experiment: Roll one die once. Sample space: Ω = {1, ..., 6}.
► Each element ω of Ω is assigned a "probability" value f(ω); here f must satisfy
  ► f(ω) ∈ [0, 1] for all ω ∈ Ω,
  ► Σ_{ω∈Ω} f(ω) = 1.
  If the die is fair, then f(ω) = 1/6 for all ω ∈ {1, ..., 6}.
► An event is any subset E of Ω.
► The probability of a given event E ⊆ Ω is defined as
  P(E) = Σ_{ω∈E} f(ω)
  Let E be the event that an odd number is rolled, i.e., E = {1, 3, 5}. Then P(E) = 1/2.
► Basic laws: P(Ω) = 1, P(∅) = 0; for disjoint sets A, B we have P(A ∪ B) = P(A) + P(B), and P(Ω ∖ A) = 1 − P(A).

Conditional Probability and Independence

► P(A | B) is the probability of A given B (assume P(B) > 0), defined by
  P(A | B) = P(A ∩ B) / P(B)
  (We assume that B is all and only the information known.)
  A fair die: what is the probability that 3 is rolled given that an odd number is rolled? ... and given that an even number is rolled?
► The law of total probability: Let A be an event and B_1, ..., B_n pairwise disjoint events such that Ω = ∪_{i=1}^n B_i. Then
  P(A) = Σ_{i=1}^n P(A ∩ B_i) = Σ_{i=1}^n P(A | B_i) · P(B_i)
► A and B are independent if P(A ∩ B) = P(A) · P(B). It is easy to show that if P(B) > 0, then A, B are independent iff P(A | B) = P(A).

Random Variables

► A random variable X is a function X : Ω → ℝ.
  A die: X : {1, ..., 6} → {0, 1} such that X(n) = n mod 2.
► A probability mass function (pmf) of X is the function p defined by
  p(x) := P(X = x)
  Often P(X) is used to denote the pmf of X.

Random Vectors

► A random vector is a function X : Ω → ℝ^d. We write X = (X_1, ..., X_d), where X_i is the random variable returning the i-th component of X.
► The joint probability mass function of X is
  P_X(x_1, ..., x_d) := P(X_1 = x_1 ∧ ··· ∧ X_d = x_d)

Bayes Classifier

Consider a random vector of features X = (X_1, ..., X_n) and a random variable Y ranging over categories {y_1, ..., y_m}. A classifier is a function C assigning a category C(x_k) to every vector of feature values x_k. The probability of error of C is
E = Σ_{x_k} P(X = x_k) · P(Y ≠ C(x_k) | X = x_k) = Σ_{x_k} P(X = x_k) · (1 − P(Y = C(x_k) | X = x_k))
(Here the last equality follows from the fact that C is determined by x_k.)
Choosing C(x_k) = C_Bayes(x_k) = y_i where
i = argmax_{i∈{1,...,m}} P(Y = y_i | X = x_k)
maximizes P(Y = C(x_k) | X = x_k) and thus minimizes the error E.

Practical Use of Bayes Classifier

The crucial problem: How to compute P(Y = y_i | X = x_k)?
Given no other assumptions, this requires a table giving the probability of each category for each possible vector of feature values, which is impossible to estimate accurately from a reasonably-sized training set.
Concretely, if Y and all X_1, ..., X_n are binary, we need 2^n numbers to specify P(Y = 0 | X = x_k) for each possible x_k. (Note that we do not need to specify P(Y = 1 | X = x_k) = 1 − P(Y = 0 | X = x_k).)
This is a bit better than the 2^{n+1} − 1 entries needed to specify the complete joint pmf P(Y, X_1, ..., X_n). However, it is still too large for most classification problems.
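To make the definitions concrete, here is a tiny sketch (mine, not part of the original slides) that checks the fair-die computations above in Python:

    # minimal sketch: events as sets, probabilities as sums of the pmf f
    from fractions import Fraction

    omega = range(1, 7)
    f = {w: Fraction(1, 6) for w in omega}        # pmf of a fair die

    def P(event):
        return sum(f[w] for w in event)

    odd, even = {1, 3, 5}, {2, 4, 6}
    print(P(odd))                                  # 1/2
    # conditional probability: P({3} | odd) = P({3} & odd) / P(odd)
    print(P({3} & odd) / P(odd))                   # 1/3
    print(P({3} & even) / P(even))                 # 0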
Let's Look at It the Other Way Round

Theorem (Bayes, 1764)
P(A | B) = P(B | A) · P(A) / P(B)

Proof.
P(A | B) = P(A ∩ B) / P(B) = P(B | A) · P(A) / P(B). □

Bayesian Classification

Determine the category for x_k by finding y_i maximizing
P(Y = y_i | X = x_k) = P(Y = y_i) · P(X = x_k | Y = y_i) / P(X = x_k)
So in order to build the classifier we need to compute:
► the prior P(Y = y_i) for every y_i, and
► the conditionals P(X = x_k | Y = y_i) for every x_k and y_i.

Estimating the Prior and Conditionals

► P(Y = y_i) can easily be estimated from data:
  ► given a set of p training examples where
  ► n_i of the examples are in the category y_i,
  ► we set P(Y = y_i) = n_i / p.
► If the dimension of the features is small, P(X = x_k | Y = y_i) can be estimated from data similarly to P(Y = y_i). Unfortunately, for higher-dimensional data too many examples are needed to estimate all P(X = x_k | Y = y_i) (there are too many x_k's).
So where is the advantage of using the Bayes theorem? We introduce independence assumptions about the features!

Naive Bayes

► We assume that the features of an instance are (conditionally) independent given the category:
  P(X | Y) = P(X_1, ..., X_n | Y) = Π_{i=1}^n P(X_i | Y)
► Therefore, we only need to specify P(X_i | Y), that is, P(X_i = x_ij | Y = y_k) for each possible pair of a feature value x_ij and a class y_k.
  Note that if Y and all X_i are binary (values in {0, 1}), this requires specifying only 2n parameters: P(X_i = 1 | Y = 1) and P(X_i = 1 | Y = 0) for each X_i, since P(X_i = 0 | Y) = 1 − P(X_i = 1 | Y). Compare this with the 2^n parameters needed without any independence assumptions.

Naive Bayes — Example

                   positive   negative
  P(Y)                0.5        0.5
  P(small | Y)        0.4        0.4
  P(medium | Y)       0.1        0.2
  P(large | Y)        0.5        0.4
  P(red | Y)          0.9        0.3
  P(blue | Y)         0.05       0.3
  P(green | Y)        0.05       0.4
  P(square | Y)       0.05       0.4
  P(triangle | Y)     0.05       0.3
  P(circle | Y)       0.9        0.3

Is (medium, red, circle) positive?

Denote x_k = (medium, red, circle).
P(pos | X = x_k)
  = P(pos) · P(medium | pos) · P(red | pos) · P(circle | pos) / P(X = x_k)
  = 0.5 · 0.1 · 0.9 · 0.9 / P(X = x_k)
  = 0.0405 / P(X = x_k)
P(neg | X = x_k)
  = P(neg) · P(medium | neg) · P(red | neg) · P(circle | neg) / P(X = x_k)
  = 0.5 · 0.2 · 0.3 · 0.3 / P(X = x_k)
  = 0.009 / P(X = x_k)
Apparently,
P(pos | X = x_k) = 0.0405 / P(X = x_k) > 0.009 / P(X = x_k) = P(neg | X = x_k)
So we classify x_k as positive.
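The classification above can be reproduced mechanically. A minimal sketch (not from the lecture; the numbers are those of the table):

    # naive Bayes scoring: P(Y) * prod_i P(X_i | Y); the common factor
    # 1 / P(X = x_k) is omitted since it does not affect the argmax
    priors = {"pos": 0.5, "neg": 0.5}
    cond = {  # P(feature value | class)
        "pos": {"medium": 0.1, "red": 0.9, "circle": 0.9},
        "neg": {"medium": 0.2, "red": 0.3, "circle": 0.3},
    }

    def score(cls, features):
        s = priors[cls]
        for feat in features:
            s *= cond[cls][feat]
        return s

    x = ("medium", "red", "circle")
    print({c: score(c, x) for c in priors})        # {'pos': 0.0405, 'neg': 0.009}
    print(max(priors, key=lambda c: score(c, x)))  # 'pos'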
Estimating Probabilities (In General)

Normally, probabilities are estimated from observed frequencies in the training data (see the previous example). Let us have
► n_k training examples in class y_k,
► n_ijk of these examples having the value of X_i equal to x_ij.
Then we put P(X_i = x_ij | Y = y_k) = n_ijk / n_k.
A problem: If, by chance, a rare value x_ij of a feature X_i never occurs in the training data, we get
P(X_i = x_ij | Y = y_k) = 0 for all k ∈ {1, ..., m}
But then P(X = x_k) = 0 for every x_k containing the value x_ij for X_i, and thus P(Y = y_k | X = x_k) is not well defined. Moreover, P(Y = y_k) · P(X = x_k | Y = y_k) = 0 (for all y_k), so even this cannot be used for classification.

Probability Estimation Example

Training data:

  Size    Color   Shape      Class
  small   red     circle     pos
  large   red     circle     pos
  small   red     triangle   neg
  large   blue    circle     neg

Learned probabilities:

                   positive   negative
  P(Y)                0.5        0.5
  P(small | Y)        0.5        0.5
  P(medium | Y)       0          0
  P(large | Y)        0.5        0.5
  P(red | Y)          1          0.5
  P(blue | Y)         0          0.5
  P(green | Y)        0          0
  P(square | Y)       0          0
  P(triangle | Y)     0          0.5
  P(circle | Y)       1          0.5

Note that P(medium ∧ red ∧ circle) = 0. So what is P(pos | medium ∧ red ∧ circle)?

Smoothing

To account for estimation from small samples, probability estimates are adjusted, or smoothed. Laplace smoothing using an m-estimate works as if
► each feature value had been observed with a prior probability p
► in a virtual sample of size m added to the actual observations (m is the size of the virtual sample, a smoothing parameter).
We get
P(X_i = x_ij | Y = y_k) = (n_ijk + m·p) / (n_k + m)
(Recall that n_k is the number of training examples of class y_k, and n_ijk is the number of training examples of class y_k for which the i-th feature X_i has the value x_ij.)

Laplace Smoothing Example

► Assume the training set contains 10 positive examples:
  ► 4 small
  ► 0 medium
  ► 6 large
► Estimate the parameters as follows (with m = 2 and p = 1/3):
  ► P(small | positive) = (4 + 2/3)/(10 + 2) = 0.389
  ► P(medium | positive) = (0 + 2/3)/(10 + 2) = 0.056
  ► P(large | positive) = (6 + 2/3)/(10 + 2) = 0.556

Continuous Features

► Ω may be (potentially) continuous, and X_i may take on a continuum of values in ℝ.
► The probabilities are then computed using a probability density p : ℝ → ℝ⁺ instead of a pmf. A random variable X : Ω → ℝ has a density p : ℝ → ℝ⁺ if for every interval [a, b] we have
  P(a ≤ X ≤ b) = ∫_a^b p(x) dx
  Usually, P(X_i | Y = y_k) is used to denote the density of X_i conditioned on Y = y_k.
► The densities P(X_i | Y = y_k) are usually estimated using Gaussian densities as follows:
  ► Estimate the mean μ_ik and the standard deviation σ_ik from the training data.
  ► Then put
    P(X_i = x | Y = y_k) = (1/(√(2π) σ_ik)) · exp(−(x − μ_ik)² / (2σ_ik²))

Comments on Naive Bayes

► Tends to work well despite the rather strong assumption of conditional independence of the features.
► Experiments show it to be quite competitive with other classification methods. Even if the probabilities are not accurately estimated, it often picks the correct maximum-probability category.
► Directly constructs a hypothesis from parameter estimates that are calculated from the training data.
► Typically handles noise well.
► Missing values are easy to deal with (simply average over all missing values in feature vectors).

Bayes Classifier vs MAP vs MLE

Recall that the Bayes classifier chooses the category as follows:
C_Bayes(x_k) = argmax_{i∈{1,...,m}} P(Y = y_i | X = x_k)
             = argmax_{i∈{1,...,m}} P(Y = y_i) · P(X = x_k | Y = y_i) / P(X = x_k)
As the denominator P(X = x_k) is not influenced by i, the Bayes classifier is equivalent to the Maximum A Posteriori Probability rule:
C_MAP(x_k) = argmax_{i∈{1,...,m}} P(Y = y_i) · P(X = x_k | Y = y_i)
If we do not care about the prior (or assume it uniform), we may use the Maximum Likelihood Estimate rule:
C_MLE(x_k) = argmax_{i∈{1,...,m}} P(X = x_k | Y = y_i)
(Intuitively, we maximize the probability that the data x_k have been generated into the category y_i.)

Bayesian Networks (Basic Information)

In Naive Bayes we have assumed that all features X_1, ..., X_n are independent given the category. This is usually not realistic. E.g., the variables "rain" and "grass wet" are (usually) strongly dependent. What if we return some dependencies back? (But now in a well-defined sense.)
Bayesian networks are a graphical model that uses a directed acyclic graph to specify dependencies among variables.

Bayesian Networks — Example

[Figure: a DAG over the variables C, S, W, B, A (Ache), with arcs C→W, C→B, S→B, and B→A.]

  P(C = T) = 0.8    P(C = F) = 0.2
  P(S = T) = 0.02   P(S = F) = 0.98

  C    P(W = T | C)   P(W = F | C)
  T    0.9            0.1
  F    0.01           0.99

  B    P(A = T | B)   P(A = F | B)
  T    0.7            0.3
  F    0.1            0.9

  C    S    P(B = T | C,S)   P(B = F | C,S)
  T    T    0.9              0.1
  T    F    0.2              0.8
  F    T    0.9              0.1
  F    F    0.01             0.99

Now, e.g.,
P(C, S, W, B, A) = P(C) · P(S) · P(W | C) · P(B | C, S) · P(A | B)
Now we may, e.g., infer the probability P(C = T | A = T) that we sit in a bad chair given that our back aches.
We have to store only 10 numbers (1 + 1 + 2 + 4 + 2), as opposed to 2^5 − 1 = 31 if the whole joint pmf were stored.
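The query P(C = T | A = T) can be answered by brute-force enumeration over the factorized joint pmf. A small sketch (mine; the variable names and numbers follow the tables above):

    # exact inference by enumeration in the ache network
    from itertools import product

    P_C = {True: 0.8, False: 0.2}
    P_S = {True: 0.02, False: 0.98}
    P_W_given_C = {True: 0.9, False: 0.01}                      # P(W=T | C)
    P_B_given_CS = {(True, True): 0.9, (True, False): 0.2,
                    (False, True): 0.9, (False, False): 0.01}   # P(B=T | C,S)
    P_A_given_B = {True: 0.7, False: 0.1}                       # P(A=T | B)

    def bern(p, value):
        return p if value else 1.0 - p

    def joint(c, s, w, b, a):
        # P(C,S,W,B,A) = P(C) P(S) P(W|C) P(B|C,S) P(A|B)
        return (P_C[c] * P_S[s] * bern(P_W_given_C[c], w)
                * bern(P_B_given_CS[(c, s)], b) * bern(P_A_given_B[b], a))

    # P(C=T | A=T) = P(C=T, A=T) / P(A=T), summing out S, W, B
    num = sum(joint(True, s, w, b, True) for s, w, b in product([True, False], repeat=3))
    den = sum(joint(c, s, w, b, True) for c, s, w, b in product([True, False], repeat=4))
    print(num / den)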
Bayesian Networks — Learning & Naive Bayes

Many algorithms have been developed for learning:
► the structure of the graph of the network,
► the conditional probability tables.
The methods are based on maximum-likelihood estimation, gradient descent, etc. Automatic procedures are usually combined with expert knowledge.
Can you express the naive Bayes model for Y, X_1, ..., X_n using a Bayesian network?

Numerical Features

► Throughout this lecture we assume that all features are numerical, i.e., feature vectors belong to ℝⁿ.
► Most non-numerical features can be conveniently transformed to numerical ones. For example:
  ► Colors {blue, red, yellow} can be represented by {(1,0,0), (0,1,0), (0,0,1)} (one-hot encoding).
  ► A black-and-white picture of x × y pixels can be encoded as a vector of xy numbers that capture the shades of gray of the pixels.

Basic Problems

We consider two basic problems:
► (Binary) classification. Our goal: classify inputs into two categories.
► Function approximation (regression). Our goal: find a (hypothesized) functional dependency in the data.
[Figure: sample classification data and sample regression data with a fitted curve.]

Binary Classification in ℝⁿ

► Assume
  ► a set of instances X ⊆ ℝⁿ,
  ► an unknown categorization function c : X → {0, 1}.
► Our goal:
  ► Given a set D of training examples of the form (x, c(x)) where x ∈ X,
  ► construct a hypothesized categorization function h ∈ H that is consistent with c on the training examples, i.e.,
    h(x) = c(x) for all training examples (x, c(x)) ∈ D.
Comments:
► In practice, we often do not strictly demand h(x) = c(x) for all training examples (x, c(x)) ∈ D (often it is impossible).
► We are more interested in good generalization, that is, how well h classifies new instances that do not belong to D.
► Recall that we usually evaluate the accuracy of the resulting hypothesized function h on a test set.

Hypothesis Spaces

We consider two kinds of hypothesis spaces:
► linear (affine) classifiers (this lecture),
► classifiers based on combinations of linear and sigmoidal functions (classical neural networks) (next lecture).

Length and Scalar Product of Vectors

► We consider vectors x = (x_1, ..., x_m) ∈ ℝᵐ.
► Typically, we use the Euclidean metric on vectors: ‖x‖ = √(Σ_{i=1}^m x_i²). The distance between two vectors (points) x, y is ‖x − y‖.
► We use the scalar product x · y of vectors x = (x_1, ..., x_m) and y = (y_1, ..., y_m) defined by
  x · y = Σ_{i=1}^m x_i y_i
► Recall that x · y = ‖x‖ ‖y‖ cos θ, where θ is the angle between x and y. That is, x · y is the length of the projection of y onto x, multiplied by ‖x‖.
► Note that x · x = ‖x‖².

Linear Classifier — Example

[Figure: points of two classes in the plane separated by a line.]
► Classification in the plane using a linear classifier.
► If a point is incorrectly classified, the learning algorithm turns the line (hyperplane) so as to improve the classification.

Linear Classifier

A linear classifier h[w] is determined by a vector of weights w = (w_0, w_1, ..., w_n) ∈ ℝ^{n+1} as follows:
Given x = (x_1, ..., x_n) ∈ X ⊆ ℝⁿ,
h[w](x) := 1 if w_0 + Σ_{i=1}^n w_i · x_i ≥ 0, and 0 otherwise.
More succinctly: h[w](x) = sgn(w_0 + Σ_{i=1}^n w_i · x_i), where
sgn(y) = 1 for y ≥ 0 and sgn(y) = 0 for y < 0.

Linear Classifier — Geometry

[Figure: the decision boundary w_0 + w · x = 0, a hyperplane whose normal vector is (w_1, ..., w_n).]

Linear Classifier — Notation

Given x = (x_1, ..., x_n) ∈ ℝⁿ, we define the augmented feature vector
x̃ = (x_0, x_1, ..., x_n) where x_0 = 1
This makes the notation for the linear classifier more succinct:
h[w](x) = sgn(w · x̃)
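A minimal sketch (mine) of the linear classifier with the augmented feature vector:

    # h[w](x) = sgn(w . x~) with x~ = (1, x_1, ..., x_n)
    def sgn(y):
        return 1 if y >= 0 else 0

    def h(w, x):
        # w = (w_0, w_1, ..., w_n), x = (x_1, ..., x_n)
        x_aug = (1.0,) + tuple(x)
        return sgn(sum(wi * xi for wi, xi in zip(w, x_aug)))

    print(h((0, -1, 1), (2, -1)))   # 0, since (0,-1,1).(1,2,-1) = -3 < 0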
Perceptron Learning

► Given a training set
  D = {(x_1, c(x_1)), (x_2, c(x_2)), ..., (x_p, c(x_p))}
  Here x_k = (x_{k1}, ..., x_{kn}) ∈ X ⊆ ℝⁿ and c(x_k) ∈ {0, 1}. We write c_k instead of c(x_k).
  Note that x̃_k = (x_{k0}, x_{k1}, ..., x_{kn}) where x_{k0} = 1.
► A weight vector w ∈ ℝ^{n+1} is consistent with D if
  h[w](x_k) = sgn(w · x̃_k) = c_k for all k = 1, ..., p.
  D is linearly separable if there is a vector w ∈ ℝ^{n+1} which is consistent with D.
► Our goal is to find a consistent w, assuming that D is linearly separable.

Perceptron — Learning Algorithm

Online learning algorithm:
Idea: Cyclically go through the training examples in D and adapt the weights. Whenever an example is incorrectly classified, turn the hyperplane so that the example becomes closer to its correct half-space.
Compute a sequence of weight vectors w^(0), w^(1), w^(2), ....
► w^(0) is randomly initialized close to 0 = (0, ..., 0).
► In the (t+1)-th step, w^(t+1) is computed as follows:
  w^(t+1) = w^(t) − ε · (sgn(w^(t) · x̃_k) − c_k) · x̃_k
  Here k = (t mod p) + 1, i.e., the examples are considered cyclically, and 0 < ε ≤ 1 is the learning speed.

Theorem (Rosenblatt)
If D is linearly separable, then there is t* such that w^(t*) is consistent with D.

Example

Training set:
D = {((2, −1), 1), ((2, 1), 1), ((1, 3), 0)}
That is
  x_1 = (2, −1)   x̃_1 = (1, 2, −1)   c_1 = 1
  x_2 = (2, 1)    x̃_2 = (1, 2, 1)    c_2 = 1
  x_3 = (1, 3)    x̃_3 = (1, 1, 3)    c_3 = 0
Assume that the initial vector w^(0) is w^(0) = (0, −1, 1). Consider ε = 1.

Example: Separating by w^(0)

[Figure: the points x_1, x_2, x_3 and the line given by w^(0).]
Denoting w^(0) = (w_0, w_1, w_2) = (0, −1, 1), the blue separating line is given by w_0 + w_1 x_1 + w_2 x_2 = 0. The red vector normal to the blue line is (w_1, w_2). The points on the side of (w_1, w_2) are assigned 1 by the classifier, the others 0. (In this case x_3 is assigned one and x_1, x_2 are assigned zero, all of which is inconsistent with c_1 = 1, c_2 = 1, c_3 = 0.)

Example: w^(1)

We have
w^(0) · x̃_1 = (0, −1, 1) · (1, 2, −1) = 0 − 2 − 1 = −3
thus sgn(w^(0) · x̃_1) = 0, and thus sgn(w^(0) · x̃_1) − c_1 = 0 − 1 = −1.
(This means that x_1 is not well classified, and w^(0) is not consistent with D.)
Hence,
w^(1) = w^(0) − (sgn(w^(0) · x̃_1) − c_1) · x̃_1 = (0, −1, 1) + (1, 2, −1) = (1, 1, 0)

Example: Separating by w^(1)

[Figure: the points and the line given by w^(1).]

Example: w^(2)

We have
w^(1) · x̃_2 = (1, 1, 0) · (1, 2, 1) = 1 + 2 = 3
thus sgn(w^(1) · x̃_2) = 1, and thus sgn(w^(1) · x̃_2) − c_2 = 1 − 1 = 0.
(This means that x_2 is currently well classified by w^(1). However, as we will see, x_3 is not.)
Hence, w^(2) = w^(1) = (1, 1, 0).

Example: w^(3)

We have
w^(2) · x̃_3 = (1, 1, 0) · (1, 1, 3) = 1 + 1 = 2
thus sgn(w^(2) · x̃_3) = 1, and thus sgn(w^(2) · x̃_3) − c_3 = 1 − 0 = 1.
(This means that x_3 is not well classified, and w^(2) is not consistent with D.)
Hence,
w^(3) = w^(2) − (sgn(w^(2) · x̃_3) − c_3) · x̃_3 = w^(2) − x̃_3 = (1, 1, 0) − (1, 1, 3) = (0, 0, −3)

Example: Separating by w^(3)

[Figure: the points and the line given by w^(3).]

Example: w^(4)

We have
w^(3) · x̃_1 = (0, 0, −3) · (1, 2, −1) = 3
thus sgn(w^(3) · x̃_1) = 1, and thus sgn(w^(3) · x̃_1) − c_1 = 1 − 1 = 0.
(This means that x_1 is currently well classified by w^(3). However, as we will see, x_2 is not.)
Hence, w^(4) = w^(3) = (0, 0, −3).

Example: w^(5)

We have
w^(4) · x̃_2 = (0, 0, −3) · (1, 2, 1) = −3
thus sgn(w^(4) · x̃_2) = 0, and thus sgn(w^(4) · x̃_2) − c_2 = 0 − 1 = −1.
(This means that x_2 is not well classified, and w^(4) is not consistent with D.)
Hence,
w^(5) = w^(4) + x̃_2 = (0, 0, −3) + (1, 2, 1) = (1, 2, −2)

Example: Separating by w^(5)

[Figure: the points and the line given by w^(5).]

Example: The Result

The vector w^(5) is consistent with D:
sgn(w^(5) · x̃_1) = sgn((1, 2, −2) · (1, 2, −1)) = sgn(7) = 1 = c_1
sgn(w^(5) · x̃_2) = sgn((1, 2, −2) · (1, 2, 1)) = sgn(3) = 1 = c_2
sgn(w^(5) · x̃_3) = sgn((1, 2, −2) · (1, 1, 3)) = sgn(−3) = 0 = c_3

Perceptron — Learning Algorithm

Batch learning algorithm:
Compute a sequence of weight vectors w^(0), w^(1), w^(2), ....
► w^(0) is randomly initialized close to 0 = (0, ..., 0).
► In the (t+1)-th step, w^(t+1) is computed as follows:
  w^(t+1) = w^(t) − ε · Σ_{k=1}^p (sgn(w^(t) · x̃_k) − c_k) · x̃_k
  Here all training examples are considered in every step, and 0 < ε ≤ 1 is the learning speed.
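Both variants fit in a few lines. A sketch (mine) of the online version with learning speed ε = 1 and, unlike the slides, deterministic initialization at 0 (so its run differs from the example above, but the result is also consistent with D):

    # online perceptron learning; loops until w is consistent with D
    def sgn(y):
        return 1 if y >= 0 else 0

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def perceptron(D, eps=1.0, max_steps=10_000):
        n = len(D[0][0])
        w = [0.0] * (n + 1)
        for t in range(max_steps):
            x, c = D[t % len(D)]                 # examples considered cyclically
            x_aug = [1.0] + list(x)
            w = [wi - eps * (sgn(dot(w, x_aug)) - c) * xi
                 for wi, xi in zip(w, x_aug)]
            if all(sgn(dot(w, [1.0] + list(x))) == c for x, c in D):
                return w
        return None                              # D is probably not linearly separable

    D = [((2, -1), 1), ((2, 1), 1), ((1, 3), 0)]  # the training set from the example
    print(perceptron(D))                          # [0.0, 1.0, -2.0], consistent with D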
Function Approximation — Oaks in Wisconsin

This example is from How to Lie with Statistics by Darrell Huff (1954).

  Age (years)   DBH (inches)
  97            12.5
  93            12.5
  88            8.0

[Scatter plot: Oak Diameter vs. Age, diameter at breast height (inches) against age (years), with a fitted line.]

Function Approximation

► Assume
  ► a set X ⊆ ℝⁿ of instances,
  ► an unknown function f : X → ℝ.
► Our goal:
  ► Given a set D of training examples of the form (x, f(x)) where x ∈ X,
  ► construct a hypothesized function h ∈ H such that h(x) ≈ f(x) for all training examples (x, f(x)) ∈ D.

Linear Function Approximation

► We approximate f by a linear (affine) function: given w = (w_0, w_1, ..., w_n), put
  h[w](x) = w_0 + Σ_{i=1}^n w_i x_i = w · x̃
► Given D = {(x_1, f_1), ..., (x_p, f_p)}, the quality of the fit is measured by the least squares error
  E(w) = (1/2) Σ_{k=1}^p (w · x̃_k − f_k)²

Gradient Descent

A sequence of weight vectors w^(0), w^(1), w^(2), ... is computed as follows:
► w^(0) is randomly initialized close to 0 = (0, ..., 0).
► In the (t+1)-th step, w^(t+1) is computed as follows:
  w^(t+1) = w^(t) − ε · ∇E(w^(t)) = w^(t) − ε · Σ_{k=1}^p (w^(t) · x̃_k − f_k) · x̃_k
  Here 0 < ε ≤ 1 is the learning speed.
Note that the algorithm is almost identical to the batch perceptron algorithm!

Claim
For sufficiently small ε > 0 the sequence w^(0), w^(1), w^(2), ... converges (component-wise) to the global minimum of E.

Finding the Minimum in Dimension One

Assume n = 1. Then the error function E is
E(w_0, w_1) = (1/2) Σ_{k=1}^p (w_0 + w_1 x_k − f_k)²
Minimize E w.r.t. w_0 and w_1:
∂E/∂w_0 = 0 ⟺ w_0 = f̄ − w_1 x̄ ⟺ f̄ = w_0 + w_1 x̄
  where x̄ = (1/p) Σ_{k=1}^p x_k and f̄ = (1/p) Σ_{k=1}^p f_k
∂E/∂w_1 = 0 ⟺ w_1 = (Σ_{k=1}^p (x_k − x̄)(f_k − f̄)) / (Σ_{k=1}^p (x_k − x̄)²), i.e., w_1 = cov(f, x) / var(x)

Finding the Minimum in Arbitrary Dimension

Let A be the p × (n+1) matrix (p rows, n+1 columns) whose k-th row is the vector x̃_k. Let f = (f_1, ..., f_p)ᵀ be the column vector formed by the values of f in the training set. Then
∇E(w) = 0 ⟺ w = (AᵀA)⁻¹Aᵀf, provided (AᵀA)⁻¹ exists.
(Then (AᵀA)⁻¹Aᵀ is the so-called Moore-Penrose pseudoinverse of A.)

Normal Distribution — Reminder

A distribution of continuous random variables. Its density (one-dimensional, that is, over ℝ):
p(x) = (1/(√(2π) σ)) · exp(−(x − μ)² / (2σ²))
μ is the expected value (the mean), σ² is the variance.

Maximum Likelihood vs Least Squares (Dim 1)

Fix a training set D = {(x_1, f_1), (x_2, f_2), ..., (x_p, f_p)}.
Assume that each f_k has been generated randomly by
f_k = w_0 + w_1 x_k + ε_k
Here
► w_0, w_1 are unknown numbers,
► ε_k are normally distributed with mean 0 and an unknown variance σ².
Assume that ε_1, ..., ε_p were generated independently.
Denote by p(f_1, ..., f_p | w_0, w_1, σ²) the probability density according to which the values f_1, ..., f_p were generated, assuming fixed w_0, w_1, σ², x_1, ..., x_p.
(For the interested: independence and normality imply
p(f_1, ..., f_p | w_0, w_1, σ²) = Π_{k=1}^p N[w_0 + w_1 x_k, σ²](f_k)
where N[w_0 + w_1 x_k, σ²](f_k) is the normal density with mean w_0 + w_1 x_k and variance σ².)

Maximum Likelihood vs Least Squares

Our goal is to find (w_0, w_1) that maximizes the likelihood that the training set D with fixed values f_1, ..., f_p has been generated:
L(w_0, w_1, σ²) := p(f_1, ..., f_p | w_0, w_1, σ²)

Theorem
(w_0, w_1) maximizes L(w_0, w_1, σ²) (for any fixed σ²) iff (w_0, w_1) minimizes the least squares error E(w_0, w_1) = (1/2) Σ_{k=1}^p (w_0 + w_1 x_k − f_k)².
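Both the closed-form solution and gradient descent are easy to try out. A sketch (mine, on invented data points):

    # normal-equation solution via the Moore-Penrose pseudoinverse,
    # and batch gradient descent on E(w) = 1/2 ||A w - f||^2
    import numpy as np

    xs = np.array([1.0, 2.0, 3.0, 4.0])          # made-up inputs
    fs = np.array([1.9, 4.1, 5.9, 8.2])          # made-up targets, roughly f = 2x

    A = np.column_stack([np.ones_like(xs), xs])  # k-th row is x~_k = (1, x_k)
    w_exact = np.linalg.pinv(A) @ fs             # (A^T A)^{-1} A^T f

    w = np.zeros(2)
    eps = 0.01                                   # small learning speed
    for _ in range(20_000):
        w -= eps * A.T @ (A @ w - fs)            # gradient of E(w)
    print(w_exact, w)                            # the two solutions (nearly) coincide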
Formal Neuron

► A formal neuron with inputs x_1, ..., x_n and weights w_0, w_1, ..., w_n computes
  ► the inner potential ξ = w_0 + Σ_{i=1}^n w_i x_i,
  ► the output y = σ(ξ), where σ is an activation function, e.g., the linear threshold function
    σ(ξ) = 1 for ξ ≥ 0, and σ(ξ) = 0 for ξ < 0.

Formal Neuron vs Linear Models

Both the linear classifier and the linear (affine) function are special cases of the formal neuron.
► If σ is the linear threshold function
  σ(ξ) = 1 for ξ ≥ 0, 0 for ξ < 0,
  we obtain a linear classifier.
► If σ is the identity, i.e., σ(ξ) = ξ, we obtain a linear (affine) function.
Many more activation functions are used in neural networks!

Sigmoid Functions

[Figure: graphs of sigmoidal activation functions, e.g., the logistic sigmoid.]

Multilayer Perceptron (MLP)

[Figure: a layered network, with the input layer at the bottom, hidden layer(s) in the middle, and the output layer at the top.]
► Neurons are organized in layers (input layer, output layer, possibly several hidden layers).
► Layers are numbered from 0; the input layer is the 0-th.
► Neurons in the ℓ-th layer are connected with all neurons in the (ℓ+1)-th layer.
Intuition: The network computes a function as follows: assign input values to the input neurons and 0 to the rest. Proceed upwards through the layers, one layer per step. In the ℓ-th step, consider the output values of the neurons in the (ℓ−1)-th layer as inputs to the neurons of the ℓ-th layer, and compute the output values of the neurons in the ℓ-th layer.

Example

[Figure: a small network of threshold units computing a Boolean function.]
► Activation function: linear threshold
  σ(ξ) = 1 for ξ ≥ 0, 0 for ξ < 0.

Classical Example — ALVINN

[Figure: ALVINN's architecture, a 30×32 sensor input retina, 4 hidden units, and 30 output units ranging from Sharp Left through Straight Ahead to Sharp Right.]
► One of the first autonomous car driving systems (in the 90s).
► ALVINN drives a car.
► The net has 30 × 32 = 960 input neurons (the input space is ℝ⁹⁶⁰).
► The value of each input captures the shade of gray of the corresponding pixel.
► Output neurons indicate where to turn (via the center of gravity).
Source: http://jmvidal.cse.sc.edu/talks/ann/alvin.html

A Bit of History

► Perceptron (Rosenblatt et al., 1957)
  ► Single layer (i.e., no hidden layers); the activation function is the linear threshold (i.e., a slightly more general linear classifier).
  ► Perceptron learning algorithm.
  ► Used to recognize numbers.
► Adaline (Widrow & Hoff, 1960)
  ► Single layer; the activation function is the identity (i.e., a slightly more general linear function).
  ► Online version of the gradient descent.
  ► Used a new circuitry element called the memristor, which was able to "remember" the history of current in the form of resistance.
In both cases, the expressive power is rather limited - they can express only linear decision boundaries and linear (affine) functions.

A Bit of History

One of the famous (counter)examples: XOR
[Figure: the four points (0,0) and (1,1) labeled 0, and (0,1) and (1,0) labeled 1, in the x_1, x_2 plane.]
No perceptron can distinguish between the ones and the zeros.

XOR vs Multilayer Perceptron

[Figure: a two-layer network computing XOR with hidden neurons P_1 and P_2; here σ is the linear threshold function.]
P_1: −1 + 2x_1 + 2x_2 = 0
P_2: 3 − 2x_1 − 2x_2 = 0
The output neuron performs an intersection of the half-spaces.

Expressive Power of MLP

Cybenko's theorem: Two-layer networks with a single output neuron and a single layer of hidden neurons (with the logistic sigmoid as the activation function) are able to
► approximate with arbitrarily small error any "reasonable" decision boundary: a given input is classified as 1 iff the output value of the network is ≥ 1/2;
► approximate with arbitrarily small error any "reasonable" function with range (0, 1).
Here "reasonable" means that it is pretty tough to find a function that is not reasonable.
So multilayer perceptrons are sufficiently powerful for any application.
But for a long time, at least throughout the 60s and 70s, nobody well-known knew any efficient method for training multilayer networks!
... then backpropagation appeared in 1986!
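The XOR construction above can be checked directly. A sketch (mine; the output weights, which the slides do not spell out, are chosen so that the output neuron computes an AND of P_1 and P_2):

    # two-layer threshold network computing XOR
    def step(xi):
        return 1 if xi >= 0 else 0

    def xor_net(x1, x2):
        p1 = step(-1 + 2 * x1 + 2 * x2)     # P1: at least one input is 1
        p2 = step(3 - 2 * x1 - 2 * x2)      # P2: not both inputs are 1
        return step(-3 + 2 * p1 + 2 * p2)   # output: AND of the two half-spaces

    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(a, b, xor_net(a, b))          # 0, 1, 1, 0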
MLP — Notation

► X is the set of input neurons.
► Y is the set of output neurons.
► Z is the set of all neurons (thus X, Y ⊆ Z).
► Individual neurons are denoted by indices, e.g., i, j.
► ξ_j is the inner potential of the neuron j when the computation is finished.
► y_j is the output value of the neuron j when the computation is finished. (We formally assume y_0 = 1.)
► w_ji is the weight of the arc from the neuron i to the neuron j.
► j← is the set of all neurons from which there are edges to j (i.e., j← is the layer directly below j).
► j→ is the set of all neurons to which there are edges from j (i.e., j→ is the layer directly above j).
► Inner potential of a neuron j:
  ξ_j = Σ_{i ∈ j←} w_ji y_i
► For simplicity, the activation function of every neuron will be the logistic sigmoid σ(ξ) = 1/(1 + e^{−ξ}). (We may of course consider logistic sigmoids with different steepness parameters, or other sigmoidal functions; more in PV021.)
► The value of a non-input neuron j ∈ Z ∖ X when the computation is finished is y_j = σ(ξ_j). (y_j is determined by the weights w and a given input x, so it is sometimes written as y_j[w](x).)
► Fixing the weights of all neurons, the network computes a function F[w] : ℝ^{|X|} → ℝ^{|Y|} as follows: assign the values of a given vector x ∈ ℝ^{|X|} to the input neurons and evaluate the network; then F[w](x) is the vector of values of the output neurons. Here we implicitly assume fixed orderings on input and output vectors.

MLP — Learning

► Given a set D of training examples:
  D = {(x_k, d_k) | k = 1, ..., p}
  Here x_k ∈ ℝ^{|X|} and d_k ∈ ℝ^{|Y|}. We write d_kj to denote the value in d_k corresponding to the output neuron j.
► Least squares error function: let w be a vector of all weights in the network.
  E(w) = Σ_{k=1}^p E_k(w)
  where
  E_k(w) = (1/2) Σ_{j∈Y} (y_j[w](x_k) − d_kj)²

MLP — Learning Algorithm

Batch learning — gradient descent:
The algorithm computes a sequence of weights w^(0), w^(1), ....
► The weights are initialized randomly close to 0.
► In step t+1 (here t = 0, 1, 2, ...), w^(t+1) is computed as follows:
  w_ji^(t+1) = w_ji^(t) + Δw_ji^(t)
  where
  Δw_ji^(t) = −ε(t) · (∂E/∂w_ji)(w^(t))
  is the change of the weight w_ji, and 0 < ε(t) ≤ 1 is the learning speed in step t+1.
Note that (∂E/∂w_ji)(w^(t)) is a component of ∇E, i.e., the weight change in step t+1 can be written as
w^(t+1) = w^(t) − ε(t) · ∇E(w^(t)).

MLP — Gradient Computation

For every weight w_ji we have (obviously)
∂E/∂w_ji = Σ_{k=1}^p ∂E_k/∂w_ji
So now it suffices to compute ∂E_k/∂w_ji, that is, the error for a fixed training example (x_k, d_k). It holds that
∂E_k/∂w_ji = δ_j · y_j(1 − y_j) · y_i
where
δ_j = y_j − d_kj for j ∈ Y
δ_j = Σ_{r ∈ j→} δ_r · y_r(1 − y_r) · w_rj for j ∈ Z ∖ (Y ∪ X)
(Here y_r = y_r[w](x_k), where w are the current weights and x_k is the input of the k-th training example.)

Multilayer Perceptron — Backpropagation

So to compute all ∂E/∂w_ji = Σ_{k=1}^p ∂E_k/∂w_ji:
Compute all δ_j · y_j(1 − y_j) · y_i for every training example (x_k, d_k):
► Evaluate all values y_i of neurons using the standard bottom-up procedure with the input x_k.
► Compute δ_j using backpropagation through the layers, top-down:
  ► Assign δ_j = y_j − d_kj for all j ∈ Y.
  ► In the layer ℓ, assuming that δ_r has been computed for all neurons r in the layer ℓ+1, compute
    δ_j = Σ_{r ∈ j→} δ_r · y_r(1 − y_r) · w_rj
    for all j in the ℓ-th layer.

Example

[Figure: a network with input neurons 1, 2, hidden neurons 3, 4, and output neuron 5.]
The initial weights are
w_30^(0) = w_31^(0) = w_32^(0) = w_40^(0) = w_41^(0) = w_42^(0) = w_50^(0) = 0 and w_53^(0) = w_54^(0) = 1.
Consider the training set {((1, 0), 1)}. Then
y_1 = 1, y_2 = 0,
y_3 = σ(w_30 + w_31 y_1 + w_32 y_2) = σ(0) = 0.5,
y_4 = 0.5,
y_5 = σ(w_50 + w_53 y_3 + w_54 y_4) = σ(1) = 0.731058.
Backpropagation:
δ_5 = y_5 − 1 = −0.268942,
δ_3 = δ_5 · y_5(1 − y_5) · w_53 = −0.052877,
δ_4 = δ_5 · y_5(1 − y_5) · w_54 = −0.052877.
Some of the partial derivatives:
∂E_1/∂w_50 = δ_5 · y_5 · (1 − y_5) · 1 = −0.052877,
∂E_1/∂w_53 = δ_5 · y_5 · (1 − y_5) · y_3 = −0.026438,
∂E_1/∂w_54 = δ_5 · y_5 · (1 − y_5) · y_4 = −0.026438,
∂E_1/∂w_30 = δ_3 · y_3 · (1 − y_3) · 1 = −0.013219.

Illustration of Gradient Descent — XOR

[Figure: the error surface and a gradient-descent trajectory for a network learning XOR. Source: Pattern Classification (2nd Edition); Richard O. Duda, Peter E. Hart, David G. Stork.]
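The numbers in the example can be verified with a short script (mine):

    # forward and backward pass for the 2-2-1 example network above
    import math

    def sigma(x):
        return 1.0 / (1.0 + math.exp(-x))

    # initial weights: all zero except w53 = w54 = 1
    w30 = w31 = w32 = w40 = w41 = w42 = w50 = 0.0
    w53 = w54 = 1.0
    y1, y2, d = 1.0, 0.0, 1.0                   # training example ((1, 0), 1)

    # forward pass
    y3 = sigma(w30 + w31 * y1 + w32 * y2)       # 0.5
    y4 = sigma(w40 + w41 * y1 + w42 * y2)       # 0.5
    y5 = sigma(w50 + w53 * y3 + w54 * y4)       # 0.731058...

    # backward pass
    d5 = y5 - d                                 # -0.268942
    d3 = d5 * y5 * (1 - y5) * w53               # -0.052877
    d4 = d5 * y5 * (1 - y5) * w54               # -0.052877

    # partial derivatives dE1/dw_ji = delta_j * y_j * (1 - y_j) * y_i
    print(d5 * y5 * (1 - y5) * 1.0)             # dE1/dw50 = -0.052877
    print(d5 * y5 * (1 - y5) * y3)              # dE1/dw53 = -0.026438
    print(d3 * y3 * (1 - y3) * 1.0)             # dE1/dw30 = -0.013219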
Comments on Training Algorithm

► Not guaranteed to converge to zero training error; it may converge to local optima or oscillate indefinitely.
► In practice, it does converge to low error for many large networks on real data.
► Many epochs (thousands) may be required; hours or days of training for large networks.
► To avoid local-minima problems, run several trials starting with different random weights (random restarts):
  ► Take the result of the trial with the lowest training set error.
  ► Or build a committee of results from multiple trials (possibly weighting votes by training set accuracy).
There are many more issues concerning learning efficiency (data normalization, selection of activation functions, weight initialization, training speed, efficiency of the gradient descent itself, etc.) - see PV021.

Hidden Neurons Representations

Trained hidden neurons can be seen as newly constructed features. E.g., in a two-layer network used for classification, the hidden layer transforms the input so that important features become explicit (and hence the result may become linearly separable).
Consider a two-layer MLP, 64-2-3, for classification of letters (three output neurons, each corresponding to one of the letters).
[Figure: sample training patterns and the learned input-to-hidden weights.]

Overfitting

► Due to their expressive power, neural networks are quite sensitive to overfitting.
[Plot: error vs. the number of training epochs; the error on training data keeps decreasing, while the error on test data eventually starts to grow.]
► Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase the validation error.

Overfitting — The Number of Hidden Neurons

► Too few hidden neurons prevent the network from adequately fitting the data.
► Too many hidden units can result in overfitting. (There are advanced methods that prevent overfitting even for rich models, such as regularization, where the error function penalizes overfitting - see PV021.)
[Plot: error vs. the number of hidden units; training error decreases with more units, while test error eventually grows.]
► Use cross-validation to empirically determine an optimal number of hidden units. There are also methods that automatically construct the structure of the network based on data, but they are not much used.

Applications

► Text to speech and vice versa
► Fraud detection
► Finance & business predictions
► Game playing (backgammon is a classical example, AlphaGo is the modern one)
► Image recognition; this is the main area in which the current state-of-the-art deep networks excel
► (Artificial brain and intelligence)
► ...

ALVINN

[Figure: the ALVINN vehicle.]
► Two-layer MLP, 960 - 4 - 30 (sometimes 960 - 5 - 30).
► Inputs correspond to pixels.
► Sigmoidal activation function (logistic sigmoid).
► Direction corresponds to the center of gravity. I.e., output neurons are considered as points of mass evenly distributed along a line; the weight of each neuron corresponds to its value.

ALVINN — Training

Trained while driving.
► A camera captured the road from the front window, approx. 25 pictures per second.
► Training examples (x_k, d_k), where
  ► x_k = an image of the road,
  ► d_k ≈ the corresponding direction of the steering wheel set by the driver,
  ► the values d_ki are computed using a Gaussian distribution centered at the output neuron corresponding to the real direction of the steering wheel, so that d_ki decreases with the distance D_i of the i-th output neuron from that neuron.
  (This is better than a binary output, because similar road directions induce similar reactions of the driver.)
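A hedged sketch of this soft-target encoding: the exact Gaussian used in ALVINN is not given in the slides, so the width parameter below is invented for illustration.

    # Gaussian soft targets over the 30 output neurons, and the
    # center-of-gravity decoding of the network's output
    import math

    def soft_targets(true_index, n_outputs=30, width=10.0):  # width is hypothetical
        return [math.exp(-(i - true_index) ** 2 / width) for i in range(n_outputs)]

    def center_of_gravity(outputs):
        total = sum(outputs)
        return sum(i * o for i, o in enumerate(outputs)) / total

    d = soft_targets(12)
    print(center_of_gravity(d))   # close to 12, the true direction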
Selection of Training Examples

Naive approach: just take images from the camera.
Problems:
► A too-good driver never teaches the network how to recover from deviations from the right track. A couple of harsh solutions:
  ► Turn the learning off for a moment, deviate from the right track, then turn the learning on and let the network learn how to solve the situation.
  ► Let the driver go crazy! (A bit dangerous, expensive, unreliable.)
► The images are very similar (the network basically sees the road from the right lane), so the network may be overtrained.

Selection of Training Examples

The problem with the too-good driver was solved as follows:
► every image of the road has been transformed into 15 slightly different copies.
[Figure: the original image and its transformed copies.]
The repetitiveness of images was solved as follows:
► the system keeps a buffer of 200 images (including the 15 copies of the current one), and in every round it trains on these images;
► afterwards, a new image is captured, 15 copies are made, and these new 15 substitute 15 images selected from the buffer (10 with the smallest training error, 5 randomly).

ALVINN — Training

► standard backpropagation,
► constant learning speed (possibly different for each neuron - see PV021),
► some other optimizations (see PV021).
The result:
► Training took 5 minutes; the speed was 4 miles per hour. (The speed was limited by the hydraulic controller of the steering wheel, not by the learning algorithm.)
► ALVINN was able to go through roads it had never "seen" and in different weather.

ALVINN — Weight Learning

[Figure: development of the input-to-hidden weights of the hidden neurons h1, ..., h5 during training, after rounds 0, 10, 20, and 50.]

Extensions and Directions (PV021)

► Other types of learning inspired by neuroscience - Hebbian learning.
► More biologically plausible models of neural networks - spiking neurons. This goes in the direction of the HUGE area of (computational) neuroscience, only very lightly touched in PV021.
► Unsupervised learning - self-organizing maps.
► Reinforcement learning:
  ► learning to make decisions, or to play games, sequentially;
  ► neural networks have been used - temporal difference learning.

Deep Learning

► Cybenko's theorem shows that two-layer networks are omnipotent - such results nearly killed NN research when support vector machines were found to be easier to train in the 00s.
► Later, it has been shown (experimentally) that deep networks (with many layers) have better representational properties.
► ... but how to train them? Backpropagation suffers from the so-called vanishing gradient; intuitively, updates of the weights in the lower layers are very slow.
► In 2006 a solution was found by Hinton et al.:
  ► Use unsupervised methods to initialize the weights so that they capture important features in the data. More precisely: the lowest hidden layer learns patterns in the data, the second lowest learns patterns in the data transformed through the first layer, and so on.
  ► Then use a supervised learning algorithm to only fine-tune the weights to the desired input-output behavior.
A rather heavy machinery is needed to develop this, but you will be rewarded by insight into a very modern and expensive technology.

ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)

ImageNet database (16,000,000 color images, 20,000 categories).
Competition in classification over a subset of images from ImageNet.
In 2012: Training set of 1,200,000 images in 1,000 categories. Validation set of 50,000 images, test set of 150,000.
Many images contain several objects → the typical rule is top-5: the net scores if the correct category is among the 5 categories to which it assigns the highest probability.

The KSH Network

ImageNet classification with deep convolutional neural networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton (2012).
[Figure: the network architecture; convolutional layers with max pooling, followed by dense layers.]
Trained on two GPUs (NVIDIA GeForce GTX 580).
Results:
► accuracy 84.7% in top-5 (the second best algorithm at the time achieved 73.8%),
► 63.3% in "perfect" classification (top-1).

ILSVRC 2014

The same set of images as in 2012, top-5 criterion.
GoogLeNet: a deep convolutional net with 22 layers.
Results:
► 93.33% in top-5.
Superhuman power?

Superhuman GoogLeNet?!
Andrej Karpathy: ...the task of labeling images with 5 out of 1000 categories quickly turned out to be extremely challenging, even for some friends in the lab who have been working on ILSVRC and its classes for a while. First we thought we would put it up on [Amazon Mechanical Turk]. Then we thought we could recruit paid undergrads. Then I organized a labeling party of intense labeling effort only among the (expert labelers) in our lab. Then I developed a modified interface that used GoogLeNet predictions to prune the number of categories from 1000 to only about 100. It was still too hard - people kept missing categories and getting up to ranges of 13-15% error rates. In the end I realized that to get anywhere competitively close to GoogLeNet, it was most efficient if I sat down and went through the painfully long training process and the subsequent careful annotation process myself... The labeling happened at a rate of about 1 per minute, but this decreased over time... Some images are easily recognized, while some images (such as those of fine-grained breeds of dogs, birds, or monkeys) can require multiple minutes of concentrated effort. I became very good at identifying breeds of dogs... Based on the sample of images I worked on, the GoogLeNet classification error turned out to be 6.8%... My own error in the end turned out to be 5.1%, approximately 1.7% better.

ILSVRC 2015

► Microsoft's network ResNet: 152 layers, complex architecture.
► Trained on 8 GPUs.
► 96.43% accuracy in top-5.
[Figure: a 34-layer plain network vs. a 34-layer residual network.]

ILSVRC

[Figure: overview of ILSVRC results over the years.]

Deeper Insight into the Logistic Sigmoid

Consider a perceptron (that is, a linear classifier):
ξ = w_0 + Σ_{i=1}^n w_i · x_i and y = sgn(ξ) = 1 for ξ ≥ 0, 0 for ξ < 0.
Recall that ξ is proportional to the signed distance of x from the decision boundary determined by w. Assume that, within each of the two classes, the values of ξ are normally distributed, with means ξ¹ (class 1) and ξ⁰ (class 0) and the same variance.
Now, can we decide what is the probability of c = 1 given a distance?

Deeper Insight into the Logistic Sigmoid

For simplicity, assume that ξ¹ = −ξ⁰ = 1/2 and unit variance. Then, by Bayes' theorem,
P(1 | ξ) = p(ξ | 1)P(1) / (p(ξ | 1)P(1) + p(ξ | 0)P(0)) = LR / (LR + 1/clr)
where
LR = p(ξ | 1) / p(ξ | 0) = exp(−(ξ − 1/2)²/2) / exp(−(ξ + 1/2)²/2) = exp(ξ)
and
clr = P(1)/P(0), which we assume (for simplicity) to be = 1.
So
P(1 | ξ) = exp(ξ) / (exp(ξ) + 1) = 1 / (1 + e^{−ξ})
Thus the logistic sigmoid applied to ξ = w_0 + Σ_{i=1}^n w_i · x_i gives the probability of c = 1 given the input!

Deeper Insight into the Logistic Sigmoid

So if we use the logistic sigmoid as the activation function, and turn the neuron into a classifier by classifying an input as 1 iff the output σ(ξ) is ≥ 1/2, then the neuron basically works as the Bayes classifier!
This is the basis of logistic regression. Given training data, we may compute the weights w that maximize the likelihood of the training data (w.r.t. the probabilities returned by the neuron).
An extremely interesting observation is that such a w maximizing the likelihood coincides with the minimum of least squares for the corresponding linear function (that is, the same neuron but with the identity as the activation function).
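The claim can be checked numerically. A sketch (mine): for two unit-variance Gaussians with means ±1/2 and equal priors, the posterior P(1 | ξ) coincides with the logistic sigmoid.

    # posterior of class 1 under the two-Gaussian model vs. sigmoid(xi)
    import math

    def gauss(x, mu):
        return math.exp(-(x - mu) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

    def posterior(xi):
        num = gauss(xi, 0.5) * 0.5           # p(xi | 1) P(1)
        return num / (num + gauss(xi, -0.5) * 0.5)

    for xi in (-2.0, 0.0, 1.0, 3.0):
        print(posterior(xi), 1.0 / (1.0 + math.exp(-xi)))   # the pairs coincide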
Kernel Methods & SVM

Partially based on the ML lecture by Raymond J. Mooney, University of Texas at Austin.

Back to the Linear Classifier (Slightly Modified)

A linear classifier h[w] is determined by a vector of weights w = (w_0, w_1, ..., w_n) ∈ ℝ^{n+1} as follows: given x = (x_1, ..., x_n) ∈ X ⊆ ℝⁿ,
h[w](x) := 1 if w_0 + Σ_{i=1}^n w_i · x_i ≥ 0, and −1 otherwise.
For convenience, we use the values {−1, 1} instead of {0, 1}. Note that this is not a principal modification; the linear classifier works exactly as the original one.
Recall that given x = (x_1, ..., x_n) ∈ ℝⁿ, the augmented feature vector is
x̃ = (x_0, x_1, ..., x_n) where x_0 = 1
This makes the notation for the linear classifier more succinct:
h[w](x) = sig(w · x̃), where sig(y) = 1 for y ≥ 0 and sig(y) = −1 for y < 0.

Perceptron Learning Revisited

► Given a training set
  D = {(x_1, y(x_1)), (x_2, y(x_2)), ..., (x_p, y(x_p))}
  Here x_k = (x_{k1}, ..., x_{kn}) ∈ X ⊆ ℝⁿ and y(x_k) ∈ {−1, 1}. We write y_k instead of y(x_k).
  Note that x̃_k = (x_{k0}, x_{k1}, ..., x_{kn}) where x_{k0} = 1.
► A weight vector w ∈ ℝ^{n+1} is consistent with D if
  h[w](x_k) = sig(w · x̃_k) = y_k for all k = 1, ..., p.
  D is linearly separable if there is a vector w ∈ ℝ^{n+1} which is consistent with D.

Perceptron Learning Revisited

Perceptron learning algorithm (slightly modified):
Compute a sequence of weight vectors w^(0), w^(1), w^(2), ....
► w^(0) is initialized to 0 = (0, ..., 0). (This is a slight but harmless modification of the traditional algorithm.)
► In the (t+1)-th step, w^(t+1) is computed as follows:
  ► If sig(w^(t) · x̃_k) ≠ y_k, then w^(t+1) = w^(t) + y_k · x̃_k.
  ► Otherwise, w^(t+1) = w^(t).
  Here k = (t mod p) + 1, i.e., the examples are considered cyclically.
(Note that this algorithm corresponds to the perceptron learning with the learning speed ε = 1.)
We know: if D is linearly separable, then there is t* such that w^(t*) is consistent with D. But what can we do if D is not linearly separable?

Quadratic Decision Boundary

[Figure: left, the original training set in the plane with an elliptic decision boundary; right, the same set transformed using the squares of the features, where the boundary becomes linear.]
Right: the green line is the decision boundary learned using the perceptron algorithm. (The red boundary corresponds to another learning algorithm.) Left: the green ellipse maps exactly to the green line.
How to classify (in the original space): first transform a given feature vector by squaring the features, then use the linear classifier.

Do We Need to Map Explicitly?

In general, mapping to a (much) higher-dimensional feature space helps (there are more "degrees of freedom", so linear separability might get a chance). However, the complexity of learning grows (quickly) with the dimension.
Sometimes it is even beneficial to map to infinite-dimensional spaces. To avoid the explicit construction of the higher-dimensional feature space, we use the so-called kernel trick. But first we need to dualize our learning algorithm.

Perceptron Learning Revisited

Perceptron learning algorithm once more:
Compute a sequence of weight vectors w^(0), w^(1), w^(2), ....
► w^(0) is initialized to 0 = (0, ..., 0).
► In the (t+1)-th step, w^(t+1) is computed as follows:
  ► If sig(w^(t) · x̃_k) ≠ y_k, then w^(t+1) = w^(t) + y_k · x̃_k.
  ► Otherwise, w^(t+1) = w^(t).
  Here k = (t mod p) + 1, i.e., the examples are considered cyclically.
Crucial observation: Note that w^(t) = Σ_{ℓ=1}^p n_ℓ^(t) · y_ℓ · x̃_ℓ for suitable n_1^(t), ..., n_p^(t) ∈ ℕ. Intuitively, n_ℓ^(t) counts how many times x̃_ℓ was added to (if y_ℓ = 1), or subtracted from (if y_ℓ = −1), the weights.

Dual Perceptron Learning

Dual perceptron learning algorithm:
Compute a sequence of vectors of numbers n^(0), n^(1), ..., where each n^(t) = (n_1^(t), ..., n_p^(t)) ∈ ℕᵖ.
► n^(0) is initialized to 0 = (0, ..., 0).
► In the (t+1)-th step, (n_1^(t+1), ..., n_p^(t+1)) is computed as follows:
  ► If sig(Σ_{ℓ=1}^p n_ℓ^(t) · y_ℓ · x̃_ℓ · x̃_k) ≠ y_k, then n_k^(t+1) := n_k^(t) + 1; else n_k^(t+1) := n_k^(t).
  ► n_ℓ^(t+1) := n_ℓ^(t) for all ℓ ≠ k.
  Here k = (t mod p) + 1; the examples are considered cyclically.
If D is linearly separable, there exists t* such that Σ_{ℓ=1}^p n_ℓ^(t*) · y_ℓ · x̃_ℓ is consistent with D. The algorithm stops at such t* and returns (n_1^(t*), ..., n_p^(t*)), so that Σ_{ℓ=1}^p n_ℓ^(t*) · y_ℓ · x̃_ℓ is consistent with D.
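A sketch (mine) of the dual algorithm; it returns the vector (n_1, ..., n_p):

    # dual perceptron learning; feature vectors appear only in scalar products
    def sig(y):
        return 1 if y >= 0 else -1

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def dual_perceptron(D, max_steps=10_000):
        xs = [(1.0,) + tuple(x) for x, _ in D]   # augmented feature vectors
        ys = [y for _, y in D]
        p = len(D)
        n = [0] * p

        def predict(k):
            return sig(sum(n[i] * ys[i] * dot(xs[i], xs[k]) for i in range(p)))

        for t in range(max_steps):
            k = t % p                            # examples considered cyclically
            if predict(k) != ys[k]:
                n[k] += 1
            if all(predict(j) == ys[j] for j in range(p)):
                return n
        return None

    D = [((2, -1), 1), ((2, 1), 1), ((1, 3), -1)]
    print(dual_perceptron(D))                    # [0, 1, 1], as in the example below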
Example

Training set:
D = {((2, −1), 1), ((2, 1), 1), ((1, 3), −1)}
That is
  x_1 = (2, −1)   x̃_1 = (1, 2, −1)   y_1 = 1
  x_2 = (2, 1)    x̃_2 = (1, 2, 1)    y_2 = 1
  x_3 = (1, 3)    x̃_3 = (1, 1, 3)    y_3 = −1
The initial values are n_1^(0) = n_2^(0) = n_3^(0) = 0.

► Σ_ℓ n_ℓ^(0) · y_ℓ · x̃_ℓ · x̃_1 = 0, thus sig(Σ_ℓ n_ℓ^(0) · y_ℓ · x̃_ℓ · x̃_1) = 1 = y_1. Hence, n^(1) = (0, 0, 0).
► Σ_ℓ n_ℓ^(1) · y_ℓ · x̃_ℓ · x̃_2 = 0, thus sig(Σ_ℓ n_ℓ^(1) · y_ℓ · x̃_ℓ · x̃_2) = 1 = y_2. Hence, n^(2) = (0, 0, 0).
► Σ_ℓ n_ℓ^(2) · y_ℓ · x̃_ℓ · x̃_3 = 0, thus sig(Σ_ℓ n_ℓ^(2) · y_ℓ · x̃_ℓ · x̃_3) = 1 ≠ y_3. Hence, n^(3) = (0, 0, 1).
► Σ_ℓ n_ℓ^(3) · y_ℓ · x̃_ℓ · x̃_1 = −1 · x̃_3 · x̃_1 = −(1, 1, 3) · (1, 2, −1) = −0 = 0, thus sig(...) = 1 = y_1. Hence, n^(4) = (0, 0, 1).
► Σ_ℓ n_ℓ^(4) · y_ℓ · x̃_ℓ · x̃_2 = −1 · x̃_3 · x̃_2 = −(1, 1, 3) · (1, 2, 1) = −6, thus sig(...) = −1 ≠ y_2. Hence, n^(5) = (0, 1, 1).
► Σ_ℓ n_ℓ^(5) · y_ℓ · x̃_ℓ · x̃_3 = 1 · x̃_2 · x̃_3 − 1 · x̃_3 · x̃_3 = 6 − 11 = −5, thus n^(6) = (0, 1, 1). This is OK.
► Σ_ℓ n_ℓ^(6) · y_ℓ · x̃_ℓ · x̃_1 = 1 · x̃_2 · x̃_1 − 1 · x̃_3 · x̃_1 = 4 − 0 = 4, thus n^(7) = (0, 1, 1). This is OK.
► Σ_ℓ n_ℓ^(7) · y_ℓ · x̃_ℓ · x̃_2 = 1 · x̃_2 · x̃_2 − 1 · x̃_3 · x̃_2 = 6 − 6 = 0, thus n^(8) = (0, 1, 1). This is OK.
The result: (n_1, n_2, n_3) = (0, 1, 1), i.e., the consistent weight vector is x̃_2 − x̃_3.

Dual Perceptron Learning — Output

Let Σ_{ℓ=1}^p n_ℓ · y_ℓ · x̃_ℓ result from the dual perceptron learning algorithm, i.e., each n_ℓ = n_ℓ^(t*) ∈ ℕ for a suitable t* at which the algorithm found a consistent vector. This vector of weights determines a linear classifier that, for a given x ∈ ℝⁿ, gives
h[w](x) = sig(Σ_{ℓ=1}^p n_ℓ · y_ℓ · x̃_ℓ · x̃)
(Here x̃ is the augmented feature vector obtained from x.)
Crucial observation: The (augmented) feature vectors x̃_ℓ and x̃ occur only in scalar products!

Kernel Trick

For simplicity, assume bivariate data: x̃_k = (1, x_{k1}, x_{k2}). The corresponding instance in the quadratic feature space is (1, x_{k1}², x_{k2}²).
Consider two instances x̃_k = (1, x_{k1}, x_{k2}) and x̃_ℓ = (1, x_{ℓ1}, x_{ℓ2}). Then the scalar product of their corresponding instances (1, x_{k1}², x_{k2}²) and (1, x_{ℓ1}², x_{ℓ2}²), resp., in the quadratic feature space is
1 + x_{k1}²x_{ℓ1}² + x_{k2}²x_{ℓ2}²
which resembles (but is not equal to)
(x̃_k · x̃_ℓ)² = (1 + x_{k1}x_{ℓ1} + x_{k2}x_{ℓ2})² = 1 + x_{k1}²x_{ℓ1}² + x_{k2}²x_{ℓ2}² + 2x_{k1}x_{ℓ1}x_{k2}x_{ℓ2} + 2x_{k1}x_{ℓ1} + 2x_{k2}x_{ℓ2}
But now consider the mapping to ℝ⁶ defined by
φ(x̃_k) = (1, x_{k1}², x_{k2}², √2·x_{k1}x_{k2}, √2·x_{k1}, √2·x_{k2})
Then
φ(x̃_k) · φ(x̃_ℓ) = (x̃_k · x̃_ℓ)²
THE idea: Define a kernel κ(x̃_k, x̃_ℓ) = (x̃_k · x̃_ℓ)² and replace x̃_k · x̃_ℓ in the dual perceptron algorithm with κ(x̃_k, x̃_ℓ).

Kernel Perceptron Learning

Kernel perceptron learning algorithm:
Compute a sequence of vectors of numbers n^(0), n^(1), ..., where each n^(t) = (n_1^(t), ..., n_p^(t)) ∈ ℕᵖ.
► n^(0) is initialized to 0 = (0, ..., 0).
► In the (t+1)-th step, (n_1^(t+1), ..., n_p^(t+1)) is computed as follows:
  ► If sig(Σ_{ℓ=1}^p n_ℓ^(t) · y_ℓ · κ(x̃_ℓ, x̃_k)) ≠ y_k, then n_k^(t+1) := n_k^(t) + 1; else n_k^(t+1) := n_k^(t).
  ► n_ℓ^(t+1) := n_ℓ^(t) for all ℓ ≠ k.
  Here k = (t mod p) + 1; the examples are considered cyclically.
Intuition: The algorithm computes a linear classifier in ℝ⁶ for the training examples transformed using φ. The trick is that the transformation φ itself never has to be explicitly computed!

Kernel Perceptron Learning — Output

Let n = (n_1, ..., n_p) result from the kernel perceptron learning algorithm, i.e., each n_ℓ = n_ℓ^(t*) ∈ ℕ for a suitable t* such that sig(Σ_{ℓ=1}^p n_ℓ^(t*) · y_ℓ · κ(x̃_ℓ, x̃_k)) = y_k for all k = 1, ..., p.
We obtain a non-linear classifier that, for a given x ∈ ℝⁿ, gives
h(x) = sig(Σ_{ℓ=1}^p n_ℓ · y_ℓ · κ(x̃_ℓ, x̃))
(Here x̃ is the augmented feature vector obtained from x.)
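Replacing the scalar product with a kernel gives the kernel perceptron. A sketch (mine), using the quadratic kernel κ(x̃, x̃') = (x̃ · x̃')² from the kernel-trick discussion above:

    # kernel perceptron learning over a precomputed Gram matrix
    def sig(y):
        return 1 if y >= 0 else -1

    def quad_kernel(x, z):
        s = sum(a * b for a, b in zip(x, z))
        return s * s

    def kernel_perceptron(D, kernel, max_steps=10_000):
        xs = [(1.0,) + tuple(x) for x, _ in D]       # augmented feature vectors
        ys = [y for _, y in D]
        p = len(D)
        K = [[kernel(a, b) for b in xs] for a in xs]  # Gram matrix
        n = [0] * p

        def predict(k):
            return sig(sum(n[i] * ys[i] * K[i][k] for i in range(p)))

        for t in range(max_steps):
            k = t % p
            if predict(k) != ys[k]:
                n[k] += 1
            if all(predict(j) == ys[j] for j in range(p)):
                return n
        return None

    D = [((2, -1), 1), ((2, 1), 1), ((1, 3), -1)]
    print(kernel_perceptron(D, quad_kernel))  # multiplicities of a consistent classifier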
Are there other kernels that correspond to scalar products in higher-dimensional spaces?

Kernels

Given a (potential) kernel κ(x̃_ℓ, x̃_k), we need to check whether κ(x̃_ℓ, x̃_k) = φ(x̃_ℓ) · φ(x̃_k) for some function φ. This might be very difficult.

Theorem (Mercer)
κ is a kernel if the corresponding Gram matrix K of the training set D, whose ℓk-th element is κ(x̃_ℓ, x̃_k), is positive semi-definite for all possible choices of the training set D.

Kernels can be constructed from existing kernels by several operations:
► linear combination (i.e., multiplication by a constant, or sum),
► multiplication,
► exponentiation,
► multiplication by a polynomial with non-negative coefficients,
► ... (see, e.g., "Pattern Recognition and Machine Learning" by Bishop).

Examples of Kernels

► Linear: κ(x̃_ℓ, x̃_k) = x̃_ℓ · x̃_k
  The corresponding mapping φ(x̃) = x̃ is the identity (no transformation).
► Polynomial of power m: κ(x̃_ℓ, x̃_k) = (1 + x̃_ℓ · x̃_k)^m
  The corresponding mapping assigns to x̃ ∈ ℝ^{n+1} a vector φ(x̃) in a space of dimension (n+m choose m).
► Gaussian (radial-basis function): κ(x̃_ℓ, x̃_k) = exp(−‖x̃_ℓ − x̃_k‖² / (2σ²))
  The corresponding mapping φ maps x̃ to an infinite-dimensional vector φ(x̃) which is, in fact, a Gaussian function; a combination of such functions for the support vectors then forms the separating hypersurface.
► ...
Choosing kernels remains the black magic of kernel methods. They are usually chosen based on trial and error (of course, experience and additional insight into the data help).
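Straightforward sketches (mine) of the kernels listed above:

    # the three standard kernels; sigma and m are free parameters
    import math

    def linear_kernel(x, z):
        return sum(a * b for a, b in zip(x, z))

    def polynomial_kernel(x, z, m=2):
        return (1.0 + linear_kernel(x, z)) ** m

    def gaussian_kernel(x, z, sigma=1.0):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
        return math.exp(-sq_dist / (2.0 * sigma ** 2))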
Now let's go on to the main area where kernel methods are used: to enhance support vector machines.

SVM Idea — Which Linear Classifier is the Best?

[Figure: several separating lines for the same data; the maximum-margin line lies farthest from both classes.]
Benefits of the maximum margin:
► Intuitively, the maximum margin is good w.r.t. generalization.
► Only the support vectors (those on the margin) matter; the others can, in principle, be ignored.

Support Vector Machines (SVM)

Notation:
► w = (w_0, w_1, ..., w_n) is a vector of weights,
► w_ = (w_1, ..., w_n) is the vector of all weights except w_0,
► x = (x_1, ..., x_n) is a (generic) feature vector.
Consider a linear classifier:
h[w](x) := 1 if w_0 + Σ_{i=1}^n w_i · x_i ≥ 0, and −1 otherwise.
The signed distance of x from the decision boundary determined by w is
d[w](x) = (w_0 + w_ · x) / ‖w_‖
Here ‖w_‖ = √(Σ_{i=1}^n w_i²) is the Euclidean norm of w_.
|d[w](x)| is the distance of x from the decision boundary. d[w](x) is positive for x on the side to which w_ points, and negative on the opposite side.

Support Vectors & Margin

► Given a training set
  D = {(x_1, y(x_1)), (x_2, y(x_2)), ..., (x_p, y(x_p))}
  Here x_k = (x_{k1}, ..., x_{kn}) ∈ X ⊆ ℝⁿ and y(x_k) ∈ {−1, 1}. We write y_k instead of y(x_k).
► Assume that D is linearly separable, and let w be consistent with D such that the distance of the decision boundary from the nearest examples on both sides is the same (if not, it suffices to adjust w_0).
► Support vectors are those x_k that minimize |d[w](x_k)|.
► The margin ρ of w is twice the distance between the support vectors and the decision boundary.
[Figure: the maximum-margin separating hyperplane with the support vectors on both sides.]
Our goal is to find a classifier that maximizes the margin.

Maximizing the Margin

For w consistent with D (such that no x_k lies on the decision boundary) we have
ρ = 2 · |w_0 + w_ · x_k| / ‖w_‖ = 2 · y_k · (w_0 + w_ · x_k) / ‖w_‖
where x_k is a support vector. We may safely consider only w such that y_k · (w_0 + w_ · x_k) = 1 for the support vectors: just adjust the length of w so that y_k · (w_0 + w_ · x_k) = 1; the denominator will compensate. Then maximizing ρ is equivalent to maximizing 2/‖w_‖.
(In what follows we use a slightly looser constraint: y_k · (w_0 + w_ · x_k) ≥ 1 for all x_k. However, the result is the same, since even with this looser condition the support vectors always satisfy y_k · (w_0 + w_ · x_k) = 1 whenever 2/‖w_‖ is maximal.)

SVM — Optimization

Margin maximization can be formulated as a quadratic optimization problem:
Find w = (w_0, ..., w_n) such that
ρ = 2/‖w_‖ is maximized
and for all (x_k, y_k) ∈ D we have y_k · (w_0 + w_ · x_k) ≥ 1,
which can be reformulated as:
Find w such that
Φ(w) = ‖w_‖² = w_ · w_ is minimized
and for all (x_k, y_k) ∈ D we have y_k · (w_0 + w_ · x_k) ≥ 1.

SVM — Optimization

► We need to optimize a quadratic function subject to linear constraints.
► Quadratic optimization problems are a well-known class of mathematical programming problems for which efficient methods (and tools) exist.
► The solution usually involves the construction of a dual problem where Lagrange multipliers α_ℓ are associated with every inequality (constraint) in the original problem:
Find α = (α_1, ..., α_p) such that
Ψ(α) = Σ_{ℓ=1}^p α_ℓ − (1/2) Σ_{ℓ=1}^p Σ_{k=1}^p α_ℓ · α_k · y_ℓ · y_k · x_ℓ · x_k is maximized
so that the following constraints are satisfied:
► Σ_{ℓ=1}^p α_ℓ y_ℓ = 0,
► α_ℓ ≥ 0 for all 1 ≤ ℓ ≤ p.

The Optimization Problem Solution

► Given a solution α_1, ..., α_p to the dual problem, the solution w = (w_0, w_1, ..., w_n) to the original one is:
  w_ = (w_1, ..., w_n) = Σ_{ℓ=1}^p α_ℓ · y_ℓ · x_ℓ
  w_0 = y_k − w_ · x_k for any k with α_k > 0.
  Note that α_k > 0 iff x_k is a support vector. Hence it does not matter which α_k > 0 is chosen in the above definition of w_0.
► The classifier is then
  h(x) = sig(w_0 + w_ · x) = sig(y_k − Σ_ℓ α_ℓ · y_ℓ · x_ℓ · x_k + Σ_ℓ α_ℓ · y_ℓ · x_ℓ · x)
Note that both the dual optimization problem and the classifier contain the training feature vectors only in scalar products! We may apply the kernel trick!

Kernel SVM

► Choose your favorite kernel κ.
► Solve the dual problem with the kernel κ:
Find α = (α_1, ..., α_p) such that
Ψ(α) = Σ_{ℓ=1}^p α_ℓ − (1/2) Σ_{ℓ=1}^p Σ_{k=1}^p α_ℓ · α_k · y_ℓ · y_k · κ(x_ℓ, x_k) is maximized
so that the following constraints are satisfied:
► Σ_{ℓ=1}^p α_ℓ y_ℓ = 0,
► α_ℓ ≥ 0 for all 1 ≤ ℓ ≤ p.
► Then use the classifier:
h(x) = sig(y_k − Σ_ℓ α_ℓ · y_ℓ · κ(x_ℓ, x_k) + Σ_ℓ α_ℓ · y_ℓ · κ(x_ℓ, x))
► Note that the optimization techniques remain the same as for the linear SVM without kernels!

Comments on Algorithms

► The main bottleneck of SVMs is the complexity of quadratic programming (QP). A naive QP solver has cubic complexity.
► For small problems, any general-purpose optimization algorithm can be used.
► For large problems, this is usually not possible; many methods avoiding a direct solution have been devised. These methods usually decompose the optimization problem into a sequence of smaller ones. Intuitively:
  ► start with a (smaller) subset of training examples;
  ► find an optimal solution using any solver;
  ► afterwards, only the support vectors matter in the solution: leave only them in the training set, and add new training examples;
  ► this iterative procedure decreases the (general) cost function.

SVM in Applications (Mooney's lecture)

► SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
► SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
► SVMs can be applied to complex data types beyond feature vectors (e.g., graphs, sequences, relational data) by designing kernel functions for such data.
► SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc.
► The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the α_ℓ's at a time, e.g., SMO [Platt '99] and [Joachims '99].
► Tuning SVMs remains a black art: selecting a specific kernel and parameters is usually done in a try-and-see manner.
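In practice one rarely implements the QP solver by hand. A hedged usage sketch: scikit-learn's SVC solves the dual problem above with a chosen kernel (here the Gaussian/RBF kernel); the data set below is invented for illustration.

    # kernel SVM via scikit-learn on an XOR-like toy set
    from sklearn.svm import SVC

    X = [[0, 0], [1, 1], [1, 0], [0, 1]]
    y = [-1, -1, 1, 1]

    clf = SVC(kernel="rbf", gamma=1.0, C=10.0)   # gamma = 1/(2 sigma^2)
    clf.fit(X, y)
    print(clf.support_)                          # indices of the support vectors
    print(clf.predict([[0.9, 0.2]]))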