Probabilistic Classification

Based on the ML lecture by Raymond J. Mooney, University of Texas at Austin

Probabilistic Classification — Idea

Imagine that
► I look out of a window and see a bird,
► it is black, approx. 25 cm long, and has a rather yellow beak.
My daughter asks: What kind of bird is this?
My usual answer: This is probably a kind of blackbird (kos černý in Czech).
Here probably means that out of my extensive catalogue of four kinds of birds that I am able to recognize, "blackbird" gets the highest degree of belief based on the features of this particular bird. Frequentists might say that the largest proportion of birds with similar features I have ever seen were blackbirds.
The degree of belief (Bayesians), or the relative frequency (frequentists), is the probability.

Basic Discrete Probability Theory

► A finite or countably infinite set Ω of possible outcomes; Ω is called the sample space.
  Experiment: Roll one die once. Sample space: Ω = {1, ..., 6}.
► Each element ω of Ω is assigned a "probability" value f(ω); here f must satisfy
  ► f(ω) ∈ [0, 1] for all ω ∈ Ω,
  ► Σ_{ω∈Ω} f(ω) = 1.
  If the die is fair, then f(ω) = 1/6 for all ω ∈ {1, ..., 6}.
► An event is any subset E of Ω.
► The probability of a given event E ⊆ Ω is defined as
  P(E) = Σ_{ω∈E} f(ω)
  Let E be the event that an odd number is rolled, i.e., E = {1, 3, 5}. Then P(E) = 1/2.
► Basic laws: P(Ω) = 1, P(∅) = 0; for disjoint sets A, B we have P(A ∪ B) = P(A) + P(B), and P(Ω ∖ A) = 1 − P(A).

Conditional Probability and Independence

► P(A | B) is the probability of A given B (assume P(B) > 0), defined by
  P(A | B) = P(A ∩ B) / P(B)
  (We assume that B is all and only the information known.)
  A fair die: what is the probability that 3 is rolled given that an odd number is rolled? ... and given that an even number is rolled?
► The law of total probability: Let A be an event and B_1, ..., B_n pairwise disjoint events such that Ω = ∪_{i=1}^n B_i. Then
  P(A) = Σ_{i=1}^n P(A ∩ B_i) = Σ_{i=1}^n P(A | B_i) · P(B_i)
► A and B are independent if P(A ∩ B) = P(A) · P(B). It is easy to show that if P(B) > 0, then A, B are independent iff P(A | B) = P(A).

Random Variables

► A random variable X is a function X : Ω → ℝ.
  A die: X : {1, ..., 6} → {0, 1} such that X(n) = n mod 2.
► A probability mass function (pmf) of X is the function p defined by
  p(x) := P(X = x)
  Often P(X) is used to denote the pmf of X.

Random Vectors

► A random vector is a function X : Ω → ℝ^d. We write X = (X_1, ..., X_d), where X_i is the random variable returning the i-th component of X.
► The joint probability mass function of X is
  P_X(x_1, ..., x_d) := P(X_1 = x_1 ∧ ··· ∧ X_d = x_d)

Bayes Classifier

Consider a random vector of features X = (X_1, ..., X_n) and a random variable Y ranging over categories {y_1, ..., y_m}. A classifier is a function C assigning a category C(x_k) to every vector of feature values x_k. The probability of error of C is
E = Σ_{x_k} P(X = x_k) · P(Y ≠ C(x_k) | X = x_k) = Σ_{x_k} P(X = x_k) · (1 − P(Y = C(x_k) | X = x_k))
(Here the last equality follows from the fact that C is determined by x_k.)
Choosing C(x_k) = C_Bayes(x_k) = y_i where
i = argmax_{i∈{1,...,m}} P(Y = y_i | X = x_k)
maximizes P(Y = C(x_k) | X = x_k) and thus minimizes the error E.

Practical Use of Bayes Classifier

The crucial problem: How to compute P(Y = y_i | X = x_k)?
Given no other assumptions, this requires a table giving the probability of each category for each possible vector of feature values, which is impossible to estimate accurately from a reasonably-sized training set.
Concretely, if Y and all X_1, ..., X_n are binary, we need 2^n numbers to specify P(Y = 0 | X = x_k) for each possible x_k. (Note that we do not need to specify P(Y = 1 | X = x_k) = 1 − P(Y = 0 | X = x_k).)
This is a bit better than the 2^{n+1} − 1 entries needed to specify the complete joint pmf P(Y, X_1, ..., X_n). However, it is still too large for most classification problems.
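To make the definitions concrete, here is a tiny sketch (mine, not part of the original slides) that checks the fair-die computations above in Python:

    # minimal sketch: events as sets, probabilities as sums of the pmf f
    from fractions import Fraction

    omega = range(1, 7)
    f = {w: Fraction(1, 6) for w in omega}        # pmf of a fair die

    def P(event):
        return sum(f[w] for w in event)

    odd, even = {1, 3, 5}, {2, 4, 6}
    print(P(odd))                                  # 1/2
    # conditional probability: P({3} | odd) = P({3} & odd) / P(odd)
    print(P({3} & odd) / P(odd))                   # 1/3
    print(P({3} & even) / P(even))                 # 0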
Let's Look at It the Other Way Round

Theorem (Bayes, 1764)
P(A | B) = P(B | A) · P(A) / P(B)

Proof.
P(A | B) = P(A ∩ B) / P(B) = P(B | A) · P(A) / P(B). □

Bayesian Classification

Determine the category for x_k by finding y_i maximizing
P(Y = y_i | X = x_k) = P(Y = y_i) · P(X = x_k | Y = y_i) / P(X = x_k)
So in order to build the classifier we need to compute:
► the prior P(Y = y_i) for every y_i, and
► the conditionals P(X = x_k | Y = y_i) for every x_k and y_i.

Estimating the Prior and Conditionals

► P(Y = y_i) can easily be estimated from data:
  ► given a set of p training examples where
  ► n_i of the examples are in the category y_i,
  ► we set P(Y = y_i) = n_i / p.
► If the dimension of the features is small, P(X = x_k | Y = y_i) can be estimated from data similarly to P(Y = y_i). Unfortunately, for higher-dimensional data too many examples are needed to estimate all P(X = x_k | Y = y_i) (there are too many x_k's).
So where is the advantage of using the Bayes theorem? We introduce independence assumptions about the features!

Naive Bayes

► We assume that the features of an instance are (conditionally) independent given the category:
  P(X | Y) = P(X_1, ..., X_n | Y) = Π_{i=1}^n P(X_i | Y)
► Therefore, we only need to specify P(X_i | Y), that is, P(X_i = x_ij | Y = y_k) for each possible pair of a feature value x_ij and a class y_k.
  Note that if Y and all X_i are binary (values in {0, 1}), this requires specifying only 2n parameters: P(X_i = 1 | Y = 1) and P(X_i = 1 | Y = 0) for each X_i, since P(X_i = 0 | Y) = 1 − P(X_i = 1 | Y). Compare this with the 2^n parameters needed without any independence assumptions.

Naive Bayes — Example

                   positive   negative
  P(Y)                0.5        0.5
  P(small | Y)        0.4        0.4
  P(medium | Y)       0.1        0.2
  P(large | Y)        0.5        0.4
  P(red | Y)          0.9        0.3
  P(blue | Y)         0.05       0.3
  P(green | Y)        0.05       0.4
  P(square | Y)       0.05       0.4
  P(triangle | Y)     0.05       0.3
  P(circle | Y)       0.9        0.3

Is (medium, red, circle) positive?

Denote x_k = (medium, red, circle).
P(pos | X = x_k)
  = P(pos) · P(medium | pos) · P(red | pos) · P(circle | pos) / P(X = x_k)
  = 0.5 · 0.1 · 0.9 · 0.9 / P(X = x_k)
  = 0.0405 / P(X = x_k)
P(neg | X = x_k)
  = P(neg) · P(medium | neg) · P(red | neg) · P(circle | neg) / P(X = x_k)
  = 0.5 · 0.2 · 0.3 · 0.3 / P(X = x_k)
  = 0.009 / P(X = x_k)
Apparently,
P(pos | X = x_k) = 0.0405 / P(X = x_k) > 0.009 / P(X = x_k) = P(neg | X = x_k)
So we classify x_k as positive.
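The classification above can be reproduced mechanically. A minimal sketch (not from the lecture; the numbers are those of the table):

    # naive Bayes scoring: P(Y) * prod_i P(X_i | Y); the common factor
    # 1 / P(X = x_k) is omitted since it does not affect the argmax
    priors = {"pos": 0.5, "neg": 0.5}
    cond = {  # P(feature value | class)
        "pos": {"medium": 0.1, "red": 0.9, "circle": 0.9},
        "neg": {"medium": 0.2, "red": 0.3, "circle": 0.3},
    }

    def score(cls, features):
        s = priors[cls]
        for feat in features:
            s *= cond[cls][feat]
        return s

    x = ("medium", "red", "circle")
    print({c: score(c, x) for c in priors})        # {'pos': 0.0405, 'neg': 0.009}
    print(max(priors, key=lambda c: score(c, x)))  # 'pos'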
Estimating Probabilities (In General)

Normally, probabilities are estimated from observed frequencies in the training data (see the previous example). Let us have
► n_k training examples in class y_k,
► n_ijk of these examples having the value of X_i equal to x_ij.
Then we put P(X_i = x_ij | Y = y_k) = n_ijk / n_k.
A problem: If, by chance, a rare value x_ij of a feature X_i never occurs in the training data, we get
P(X_i = x_ij | Y = y_k) = 0 for all k ∈ {1, ..., m}
But then P(X = x_k) = 0 for every x_k containing the value x_ij for X_i, and thus P(Y = y_k | X = x_k) is not well defined. Moreover, P(Y = y_k) · P(X = x_k | Y = y_k) = 0 (for all y_k), so even this cannot be used for classification.

Probability Estimation Example

Training data:

  Size    Color   Shape      Class
  small   red     circle     pos
  large   red     circle     pos
  small   red     triangle   neg
  large   blue    circle     neg

Learned probabilities:

                   positive   negative
  P(Y)                0.5        0.5
  P(small | Y)        0.5        0.5
  P(medium | Y)       0          0
  P(large | Y)        0.5        0.5
  P(red | Y)          1          0.5
  P(blue | Y)         0          0.5
  P(green | Y)        0          0
  P(square | Y)       0          0
  P(triangle | Y)     0          0.5
  P(circle | Y)       1          0.5

Note that P(medium ∧ red ∧ circle) = 0. So what is P(pos | medium ∧ red ∧ circle)?

Smoothing

To account for estimation from small samples, probability estimates are adjusted, or smoothed. Laplace smoothing using an m-estimate works as if
► each feature value had been observed with a prior probability p
► in a virtual sample of size m added to the actual observations (m is the size of the virtual sample, a smoothing parameter).
We get
P(X_i = x_ij | Y = y_k) = (n_ijk + m·p) / (n_k + m)
(Recall that n_k is the number of training examples of class y_k, and n_ijk is the number of training examples of class y_k for which the i-th feature X_i has the value x_ij.)

Laplace Smoothing Example

► Assume the training set contains 10 positive examples:
  ► 4 small
  ► 0 medium
  ► 6 large
► Estimate the parameters as follows (with m = 2 and p = 1/3):
  ► P(small | positive) = (4 + 2/3)/(10 + 2) = 0.389
  ► P(medium | positive) = (0 + 2/3)/(10 + 2) = 0.056
  ► P(large | positive) = (6 + 2/3)/(10 + 2) = 0.556

Continuous Features

► Ω may be (potentially) continuous, and X_i may take on a continuum of values in ℝ.
► The probabilities are then computed using a probability density p : ℝ → ℝ⁺ instead of a pmf. A random variable X : Ω → ℝ has a density p : ℝ → ℝ⁺ if for every interval [a, b] we have
  P(a ≤ X ≤ b) = ∫_a^b p(x) dx
  Usually, P(X_i | Y = y_k) is used to denote the density of X_i conditioned on Y = y_k.
► The densities P(X_i | Y = y_k) are usually estimated using Gaussian densities as follows:
  ► Estimate the mean μ_ik and the standard deviation σ_ik from the training data.
  ► Then put
    P(X_i = x | Y = y_k) = (1/(√(2π) σ_ik)) · exp(−(x − μ_ik)² / (2σ_ik²))

Comments on Naive Bayes

► Tends to work well despite the rather strong assumption of conditional independence of the features.
► Experiments show it to be quite competitive with other classification methods. Even if the probabilities are not accurately estimated, it often picks the correct maximum-probability category.
► Directly constructs a hypothesis from parameter estimates that are calculated from the training data.
► Typically handles noise well.
► Missing values are easy to deal with (simply average over all missing values in feature vectors).

Bayes Classifier vs MAP vs MLE

Recall that the Bayes classifier chooses the category as follows:
C_Bayes(x_k) = argmax_{i∈{1,...,m}} P(Y = y_i | X = x_k)
             = argmax_{i∈{1,...,m}} P(Y = y_i) · P(X = x_k | Y = y_i) / P(X = x_k)
As the denominator P(X = x_k) is not influenced by i, the Bayes classifier is equivalent to the Maximum A Posteriori Probability rule:
C_MAP(x_k) = argmax_{i∈{1,...,m}} P(Y = y_i) · P(X = x_k | Y = y_i)
If we do not care about the prior (or assume it uniform), we may use the Maximum Likelihood Estimate rule:
C_MLE(x_k) = argmax_{i∈{1,...,m}} P(X = x_k | Y = y_i)
(Intuitively, we maximize the probability that the data x_k have been generated into the category y_i.)

Bayesian Networks (Basic Information)

In Naive Bayes we have assumed that all features X_1, ..., X_n are independent given the category. This is usually not realistic. E.g., the variables "rain" and "grass wet" are (usually) strongly dependent. What if we return some dependencies back? (But now in a well-defined sense.)
Bayesian networks are a graphical model that uses a directed acyclic graph to specify dependencies among variables.

Bayesian Networks — Example

[Figure: a DAG over the variables C, S, W, B, A (Ache), with arcs C→W, C→B, S→B, and B→A.]

  P(C = T) = 0.8    P(C = F) = 0.2
  P(S = T) = 0.02   P(S = F) = 0.98

  C    P(W = T | C)   P(W = F | C)
  T    0.9            0.1
  F    0.01           0.99

  B    P(A = T | B)   P(A = F | B)
  T    0.7            0.3
  F    0.1            0.9

  C    S    P(B = T | C,S)   P(B = F | C,S)
  T    T    0.9              0.1
  T    F    0.2              0.8
  F    T    0.9              0.1
  F    F    0.01             0.99

Now, e.g.,
P(C, S, W, B, A) = P(C) · P(S) · P(W | C) · P(B | C, S) · P(A | B)
Now we may, e.g., infer the probability P(C = T | A = T) that we sit in a bad chair given that our back aches.
We have to store only 10 numbers (1 + 1 + 2 + 4 + 2), as opposed to 2^5 − 1 = 31 if the whole joint pmf were stored.
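The query P(C = T | A = T) can be answered by brute-force enumeration over the factorized joint pmf. A small sketch (mine; the variable names and numbers follow the tables above):

    # exact inference by enumeration in the ache network
    from itertools import product

    P_C = {True: 0.8, False: 0.2}
    P_S = {True: 0.02, False: 0.98}
    P_W_given_C = {True: 0.9, False: 0.01}                      # P(W=T | C)
    P_B_given_CS = {(True, True): 0.9, (True, False): 0.2,
                    (False, True): 0.9, (False, False): 0.01}   # P(B=T | C,S)
    P_A_given_B = {True: 0.7, False: 0.1}                       # P(A=T | B)

    def bern(p, value):
        return p if value else 1.0 - p

    def joint(c, s, w, b, a):
        # P(C,S,W,B,A) = P(C) P(S) P(W|C) P(B|C,S) P(A|B)
        return (P_C[c] * P_S[s] * bern(P_W_given_C[c], w)
                * bern(P_B_given_CS[(c, s)], b) * bern(P_A_given_B[b], a))

    # P(C=T | A=T) = P(C=T, A=T) / P(A=T), summing out S, W, B
    num = sum(joint(True, s, w, b, True) for s, w, b in product([True, False], repeat=3))
    den = sum(joint(c, s, w, b, True) for c, s, w, b in product([True, False], repeat=4))
    print(num / den)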
Bayesian Networks — Learning & Naive Bayes

Many algorithms have been developed for learning:
► the structure of the graph of the network,
► the conditional probability tables.
The methods are based on maximum-likelihood estimation, gradient descent, etc. Automatic procedures are usually combined with expert knowledge.
Can you express the naive Bayes model for Y, X_1, ..., X_n using a Bayesian network?

Numerical Features

► Throughout this lecture we assume that all features are numerical, i.e., feature vectors belong to ℝⁿ.
► Most non-numerical features can be conveniently transformed to numerical ones. For example:
  ► Colors {blue, red, yellow} can be represented by {(1,0,0), (0,1,0), (0,0,1)} (one-hot encoding).
  ► A black-and-white picture of x × y pixels can be encoded as a vector of xy numbers that capture the shades of gray of the pixels.

Basic Problems

We consider two basic problems:
► (Binary) classification. Our goal: classify inputs into two categories.
► Function approximation (regression). Our goal: find a (hypothesized) functional dependency in the data.
[Figure: sample classification data and sample regression data with a fitted curve.]

Binary Classification in ℝⁿ

► Assume
  ► a set of instances X ⊆ ℝⁿ,
  ► an unknown categorization function c : X → {0, 1}.
► Our goal:
  ► Given a set D of training examples of the form (x, c(x)) where x ∈ X,
  ► construct a hypothesized categorization function h ∈ H that is consistent with c on the training examples, i.e.,
    h(x) = c(x) for all training examples (x, c(x)) ∈ D.
Comments:
► In practice, we often do not strictly demand h(x) = c(x) for all training examples (x, c(x)) ∈ D (often it is impossible).
► We are more interested in good generalization, that is, how well h classifies new instances that do not belong to D.
► Recall that we usually evaluate the accuracy of the resulting hypothesized function h on a test set.

Hypothesis Spaces

We consider two kinds of hypothesis spaces:
► linear (affine) classifiers (this lecture),
► classifiers based on combinations of linear and sigmoidal functions (classical neural networks) (next lecture).

Length and Scalar Product of Vectors

► We consider vectors x = (x_1, ..., x_m) ∈ ℝᵐ.
► Typically, we use the Euclidean metric on vectors: ‖x‖ = √(Σ_{i=1}^m x_i²). The distance between two vectors (points) x, y is ‖x − y‖.
► We use the scalar product x · y of vectors x = (x_1, ..., x_m) and y = (y_1, ..., y_m) defined by
  x · y = Σ_{i=1}^m x_i y_i
► Recall that x · y = ‖x‖ ‖y‖ cos θ, where θ is the angle between x and y. That is, x · y is the length of the projection of y onto x, multiplied by ‖x‖.
► Note that x · x = ‖x‖².

Linear Classifier — Example

[Figure: points of two classes in the plane separated by a line.]
► Classification in the plane using a linear classifier.
► If a point is incorrectly classified, the learning algorithm turns the line (hyperplane) so as to improve the classification.

Linear Classifier

A linear classifier h[w] is determined by a vector of weights w = (w_0, w_1, ..., w_n) ∈ ℝ^{n+1} as follows:
Given x = (x_1, ..., x_n) ∈ X ⊆ ℝⁿ,
h[w](x) := 1 if w_0 + Σ_{i=1}^n w_i · x_i ≥ 0, and 0 otherwise.
More succinctly: h[w](x) = sgn(w_0 + Σ_{i=1}^n w_i · x_i), where
sgn(y) = 1 for y ≥ 0 and sgn(y) = 0 for y < 0.

Linear Classifier — Geometry

[Figure: the decision boundary w_0 + w · x = 0, a hyperplane whose normal vector is (w_1, ..., w_n).]

Linear Classifier — Notation

Given x = (x_1, ..., x_n) ∈ ℝⁿ, we define the augmented feature vector
x̃ = (x_0, x_1, ..., x_n) where x_0 = 1
This makes the notation for the linear classifier more succinct:
h[w](x) = sgn(w · x̃)
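A minimal sketch (mine) of the linear classifier with the augmented feature vector:

    # h[w](x) = sgn(w . x~) with x~ = (1, x_1, ..., x_n)
    def sgn(y):
        return 1 if y >= 0 else 0

    def h(w, x):
        # w = (w_0, w_1, ..., w_n), x = (x_1, ..., x_n)
        x_aug = (1.0,) + tuple(x)
        return sgn(sum(wi * xi for wi, xi in zip(w, x_aug)))

    print(h((0, -1, 1), (2, -1)))   # 0, since (0,-1,1).(1,2,-1) = -3 < 0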
Perceptron Learning

► Given a training set
  D = {(x_1, c(x_1)), (x_2, c(x_2)), ..., (x_p, c(x_p))}
  Here x_k = (x_{k1}, ..., x_{kn}) ∈ X ⊆ ℝⁿ and c(x_k) ∈ {0, 1}. We write c_k instead of c(x_k).
  Note that x̃_k = (x_{k0}, x_{k1}, ..., x_{kn}) where x_{k0} = 1.
► A weight vector w ∈ ℝ^{n+1} is consistent with D if
  h[w](x_k) = sgn(w · x̃_k) = c_k for all k = 1, ..., p.
  D is linearly separable if there is a vector w ∈ ℝ^{n+1} which is consistent with D.
► Our goal is to find a consistent w, assuming that D is linearly separable.

Perceptron — Learning Algorithm

Online learning algorithm:
Idea: Cyclically go through the training examples in D and adapt the weights. Whenever an example is incorrectly classified, turn the hyperplane so that the example becomes closer to its correct half-space.
Compute a sequence of weight vectors w^(0), w^(1), w^(2), ....
► w^(0) is randomly initialized close to 0 = (0, ..., 0).
► In the (t+1)-th step, w^(t+1) is computed as follows:
  w^(t+1) = w^(t) − ε · (sgn(w^(t) · x̃_k) − c_k) · x̃_k
  Here k = (t mod p) + 1, i.e., the examples are considered cyclically, and 0 < ε ≤ 1 is the learning speed.

Theorem (Rosenblatt)
If D is linearly separable, then there is t* such that w^(t*) is consistent with D.

Example

Training set:
D = {((2, −1), 1), ((2, 1), 1), ((1, 3), 0)}
That is
  x_1 = (2, −1)   x̃_1 = (1, 2, −1)   c_1 = 1
  x_2 = (2, 1)    x̃_2 = (1, 2, 1)    c_2 = 1
  x_3 = (1, 3)    x̃_3 = (1, 1, 3)    c_3 = 0
Assume that the initial vector w^(0) is w^(0) = (0, −1, 1). Consider ε = 1.

Example: Separating by w^(0)

[Figure: the points x_1, x_2, x_3 and the line given by w^(0).]
Denoting w^(0) = (w_0, w_1, w_2) = (0, −1, 1), the blue separating line is given by w_0 + w_1 x_1 + w_2 x_2 = 0. The red vector normal to the blue line is (w_1, w_2). The points on the side of (w_1, w_2) are assigned 1 by the classifier, the others 0. (In this case x_3 is assigned one and x_1, x_2 are assigned zero, all of which is inconsistent with c_1 = 1, c_2 = 1, c_3 = 0.)

Example: w^(1)

We have
w^(0) · x̃_1 = (0, −1, 1) · (1, 2, −1) = 0 − 2 − 1 = −3
thus sgn(w^(0) · x̃_1) = 0, and thus sgn(w^(0) · x̃_1) − c_1 = 0 − 1 = −1.
(This means that x_1 is not well classified, and w^(0) is not consistent with D.)
Hence,
w^(1) = w^(0) − (sgn(w^(0) · x̃_1) − c_1) · x̃_1 = (0, −1, 1) + (1, 2, −1) = (1, 1, 0)

Example: Separating by w^(1)

[Figure: the points and the line given by w^(1).]

Example: w^(2)

We have
w^(1) · x̃_2 = (1, 1, 0) · (1, 2, 1) = 1 + 2 = 3
thus sgn(w^(1) · x̃_2) = 1, and thus sgn(w^(1) · x̃_2) − c_2 = 1 − 1 = 0.
(This means that x_2 is currently well classified by w^(1). However, as we will see, x_3 is not.)
Hence, w^(2) = w^(1) = (1, 1, 0).

Example: w^(3)

We have
w^(2) · x̃_3 = (1, 1, 0) · (1, 1, 3) = 1 + 1 = 2
thus sgn(w^(2) · x̃_3) = 1, and thus sgn(w^(2) · x̃_3) − c_3 = 1 − 0 = 1.
(This means that x_3 is not well classified, and w^(2) is not consistent with D.)
Hence,
w^(3) = w^(2) − (sgn(w^(2) · x̃_3) − c_3) · x̃_3 = w^(2) − x̃_3 = (1, 1, 0) − (1, 1, 3) = (0, 0, −3)

Example: Separating by w^(3)

[Figure: the points and the line given by w^(3).]

Example: w^(4)

We have
w^(3) · x̃_1 = (0, 0, −3) · (1, 2, −1) = 3
thus sgn(w^(3) · x̃_1) = 1, and thus sgn(w^(3) · x̃_1) − c_1 = 1 − 1 = 0.
(This means that x_1 is currently well classified by w^(3). However, as we will see, x_2 is not.)
Hence, w^(4) = w^(3) = (0, 0, −3).

Example: w^(5)

We have
w^(4) · x̃_2 = (0, 0, −3) · (1, 2, 1) = −3
thus sgn(w^(4) · x̃_2) = 0, and thus sgn(w^(4) · x̃_2) − c_2 = 0 − 1 = −1.
(This means that x_2 is not well classified, and w^(4) is not consistent with D.)
Hence,
w^(5) = w^(4) + x̃_2 = (0, 0, −3) + (1, 2, 1) = (1, 2, −2)

Example: Separating by w^(5)

[Figure: the points and the line given by w^(5).]

Example: The Result

The vector w^(5) is consistent with D:
sgn(w^(5) · x̃_1) = sgn((1, 2, −2) · (1, 2, −1)) = sgn(7) = 1 = c_1
sgn(w^(5) · x̃_2) = sgn((1, 2, −2) · (1, 2, 1)) = sgn(3) = 1 = c_2
sgn(w^(5) · x̃_3) = sgn((1, 2, −2) · (1, 1, 3)) = sgn(−3) = 0 = c_3

Perceptron — Learning Algorithm

Batch learning algorithm:
Compute a sequence of weight vectors w^(0), w^(1), w^(2), ....
► w^(0) is randomly initialized close to 0 = (0, ..., 0).
► In the (t+1)-th step, w^(t+1) is computed as follows:
  w^(t+1) = w^(t) − ε · Σ_{k=1}^p (sgn(w^(t) · x̃_k) − c_k) · x̃_k
  Here all training examples are considered in every step, and 0 < ε ≤ 1 is the learning speed.
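Both variants fit in a few lines. A sketch (mine) of the online version with learning speed ε = 1 and, unlike the slides, deterministic initialization at 0 (so its run differs from the example above, but the result is also consistent with D):

    # online perceptron learning; loops until w is consistent with D
    def sgn(y):
        return 1 if y >= 0 else 0

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def perceptron(D, eps=1.0, max_steps=10_000):
        n = len(D[0][0])
        w = [0.0] * (n + 1)
        for t in range(max_steps):
            x, c = D[t % len(D)]                 # examples considered cyclically
            x_aug = [1.0] + list(x)
            w = [wi - eps * (sgn(dot(w, x_aug)) - c) * xi
                 for wi, xi in zip(w, x_aug)]
            if all(sgn(dot(w, [1.0] + list(x))) == c for x, c in D):
                return w
        return None                              # D is probably not linearly separable

    D = [((2, -1), 1), ((2, 1), 1), ((1, 3), 0)]  # the training set from the example
    print(perceptron(D))                          # [0.0, 1.0, -2.0], consistent with D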
Function Approximation — Oaks in Wisconsin

This example is from How to Lie with Statistics by Darrell Huff (1954).

  Age (years)   DBH (inches)
  97            12.5
  93            12.5
  88            8.0

[Scatter plot: Oak Diameter vs. Age, diameter at breast height (inches) against age (years), with a fitted line.]

Function Approximation

► Assume
  ► a set X ⊆ ℝⁿ of instances,
  ► an unknown function f : X → ℝ.
► Our goal:
  ► Given a set D of training examples of the form (x, f(x)) where x ∈ X,
  ► construct a hypothesized function h ∈ H such that h(x) ≈ f(x) for all training examples (x, f(x)) ∈ D.

Linear Function Approximation

► We approximate f by a linear (affine) function: given w = (w_0, w_1, ..., w_n), put
  h[w](x) = w_0 + Σ_{i=1}^n w_i x_i = w · x̃
► Given D = {(x_1, f_1), ..., (x_p, f_p)}, the quality of the fit is measured by the least squares error
  E(w) = (1/2) Σ_{k=1}^p (w · x̃_k − f_k)²

Gradient Descent

A sequence of weight vectors w^(0), w^(1), w^(2), ... is computed as follows:
► w^(0) is randomly initialized close to 0 = (0, ..., 0).
► In the (t+1)-th step, w^(t+1) is computed as follows:
  w^(t+1) = w^(t) − ε · ∇E(w^(t)) = w^(t) − ε · Σ_{k=1}^p (w^(t) · x̃_k − f_k) · x̃_k
  Here 0 < ε ≤ 1 is the learning speed.
Note that the algorithm is almost identical to the batch perceptron algorithm!

Claim
For sufficiently small ε > 0 the sequence w^(0), w^(1), w^(2), ... converges (component-wise) to the global minimum of E.

Finding the Minimum in Dimension One

Assume n = 1. Then the error function E is
E(w_0, w_1) = (1/2) Σ_{k=1}^p (w_0 + w_1 x_k − f_k)²
Minimize E w.r.t. w_0 and w_1:
∂E/∂w_0 = 0 ⟺ w_0 = f̄ − w_1 x̄ ⟺ f̄ = w_0 + w_1 x̄
  where x̄ = (1/p) Σ_{k=1}^p x_k and f̄ = (1/p) Σ_{k=1}^p f_k
∂E/∂w_1 = 0 ⟺ w_1 = (Σ_{k=1}^p (x_k − x̄)(f_k − f̄)) / (Σ_{k=1}^p (x_k − x̄)²), i.e., w_1 = cov(f, x) / var(x)

Finding the Minimum in Arbitrary Dimension

Let A be the p × (n+1) matrix (p rows, n+1 columns) whose k-th row is the vector x̃_k. Let f = (f_1, ..., f_p)ᵀ be the column vector formed by the values of f in the training set. Then
∇E(w) = 0 ⟺ w = (AᵀA)⁻¹Aᵀf, provided (AᵀA)⁻¹ exists.
(Then (AᵀA)⁻¹Aᵀ is the so-called Moore-Penrose pseudoinverse of A.)

Normal Distribution — Reminder

A distribution of continuous random variables. Its density (one-dimensional, that is, over ℝ):
p(x) = (1/(√(2π) σ)) · exp(−(x − μ)² / (2σ²))
μ is the expected value (the mean), σ² is the variance.

Maximum Likelihood vs Least Squares (Dim 1)

Fix a training set D = {(x_1, f_1), (x_2, f_2), ..., (x_p, f_p)}.
Assume that each f_k has been generated randomly by
f_k = w_0 + w_1 x_k + ε_k
Here
► w_0, w_1 are unknown numbers,
► ε_k are normally distributed with mean 0 and an unknown variance σ².
Assume that ε_1, ..., ε_p were generated independently.
Denote by p(f_1, ..., f_p | w_0, w_1, σ²) the probability density according to which the values f_1, ..., f_p were generated, assuming fixed w_0, w_1, σ², x_1, ..., x_p.
(For the interested: independence and normality imply
p(f_1, ..., f_p | w_0, w_1, σ²) = Π_{k=1}^p N[w_0 + w_1 x_k, σ²](f_k)
where N[w_0 + w_1 x_k, σ²](f_k) is the normal density with mean w_0 + w_1 x_k and variance σ².)

Maximum Likelihood vs Least Squares

Our goal is to find (w_0, w_1) that maximizes the likelihood that the training set D with fixed values f_1, ..., f_p has been generated:
L(w_0, w_1, σ²) := p(f_1, ..., f_p | w_0, w_1, σ²)

Theorem
(w_0, w_1) maximizes L(w_0, w_1, σ²) (for any fixed σ²) iff (w_0, w_1) minimizes the least squares error E(w_0, w_1) = (1/2) Σ_{k=1}^p (w_0 + w_1 x_k − f_k)².
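Both the closed-form solution and gradient descent are easy to try out. A sketch (mine, on invented data points):

    # normal-equation solution via the Moore-Penrose pseudoinverse,
    # and batch gradient descent on E(w) = 1/2 ||A w - f||^2
    import numpy as np

    xs = np.array([1.0, 2.0, 3.0, 4.0])          # made-up inputs
    fs = np.array([1.9, 4.1, 5.9, 8.2])          # made-up targets, roughly f = 2x

    A = np.column_stack([np.ones_like(xs), xs])  # k-th row is x~_k = (1, x_k)
    w_exact = np.linalg.pinv(A) @ fs             # (A^T A)^{-1} A^T f

    w = np.zeros(2)
    eps = 0.01                                   # small learning speed
    for _ in range(20_000):
        w -= eps * A.T @ (A @ w - fs)            # gradient of E(w)
    print(w_exact, w)                            # the two solutions (nearly) coincide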
Formal Neuron

► A formal neuron with inputs x_1, ..., x_n and weights w_0, w_1, ..., w_n computes
  ► the inner potential ξ = w_0 + Σ_{i=1}^n w_i x_i,
  ► the output y = σ(ξ), where σ is an activation function, e.g., the linear threshold function
    σ(ξ) = 1 for ξ ≥ 0, and σ(ξ) = 0 for ξ < 0.

Formal Neuron vs Linear Models

Both the linear classifier and the linear (affine) function are special cases of the formal neuron.
► If σ is the linear threshold function
  σ(ξ) = 1 for ξ ≥ 0, 0 for ξ < 0,
  we obtain a linear classifier.
► If σ is the identity, i.e., σ(ξ) = ξ, we obtain a linear (affine) function.
Many more activation functions are used in neural networks!

Sigmoid Functions

[Figure: graphs of sigmoidal activation functions, e.g., the logistic sigmoid.]

Multilayer Perceptron (MLP)

[Figure: a layered network, with the input layer at the bottom, hidden layer(s) in the middle, and the output layer at the top.]
► Neurons are organized in layers (input layer, output layer, possibly several hidden layers).
► Layers are numbered from 0; the input layer is the 0-th.
► Neurons in the ℓ-th layer are connected with all neurons in the (ℓ+1)-th layer.
Intuition: The network computes a function as follows: assign input values to the input neurons and 0 to the rest. Proceed upwards through the layers, one layer per step. In the ℓ-th step, consider the output values of the neurons in the (ℓ−1)-th layer as inputs to the neurons of the ℓ-th layer, and compute the output values of the neurons in the ℓ-th layer.

Example

[Figure: a small network of threshold units computing a Boolean function.]
► Activation function: linear threshold
  σ(ξ) = 1 for ξ ≥ 0, 0 for ξ < 0.

Classical Example — ALVINN

[Figure: ALVINN's architecture, a 30×32 sensor input retina, 4 hidden units, and 30 output units ranging from Sharp Left through Straight Ahead to Sharp Right.]
► One of the first autonomous car driving systems (in the 90s).
► ALVINN drives a car.
► The net has 30 × 32 = 960 input neurons (the input space is ℝ⁹⁶⁰).
► The value of each input captures the shade of gray of the corresponding pixel.
► Output neurons indicate where to turn (via the center of gravity).
Source: http://jmvidal.cse.sc.edu/talks/ann/alvin.html

A Bit of History

► Perceptron (Rosenblatt et al., 1957)
  ► Single layer (i.e., no hidden layers); the activation function is the linear threshold (i.e., a slightly more general linear classifier).
  ► Perceptron learning algorithm.
  ► Used to recognize numbers.
► Adaline (Widrow & Hoff, 1960)
  ► Single layer; the activation function is the identity (i.e., a slightly more general linear function).
  ► Online version of the gradient descent.
  ► Used a new circuitry element called the memristor, which was able to "remember" the history of current in the form of resistance.
In both cases, the expressive power is rather limited - they can express only linear decision boundaries and linear (affine) functions.

A Bit of History

One of the famous (counter)examples: XOR
[Figure: the four points (0,0) and (1,1) labeled 0, and (0,1) and (1,0) labeled 1, in the x_1, x_2 plane.]
No perceptron can distinguish between the ones and the zeros.

XOR vs Multilayer Perceptron

[Figure: a two-layer network computing XOR with hidden neurons P_1 and P_2; here σ is the linear threshold function.]
P_1: −1 + 2x_1 + 2x_2 = 0
P_2: 3 − 2x_1 − 2x_2 = 0
The output neuron performs an intersection of the half-spaces.

Expressive Power of MLP

Cybenko's theorem: Two-layer networks with a single output neuron and a single layer of hidden neurons (with the logistic sigmoid as the activation function) are able to
► approximate with arbitrarily small error any "reasonable" decision boundary: a given input is classified as 1 iff the output value of the network is ≥ 1/2;
► approximate with arbitrarily small error any "reasonable" function with range (0, 1).
Here "reasonable" means that it is pretty tough to find a function that is not reasonable.
So multilayer perceptrons are sufficiently powerful for any application.
But for a long time, at least throughout the 60s and 70s, nobody well-known knew any efficient method for training multilayer networks!
... then backpropagation appeared in 1986!
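The XOR construction above can be checked directly. A sketch (mine; the output weights, which the slides do not spell out, are chosen so that the output neuron computes an AND of P_1 and P_2):

    # two-layer threshold network computing XOR
    def step(xi):
        return 1 if xi >= 0 else 0

    def xor_net(x1, x2):
        p1 = step(-1 + 2 * x1 + 2 * x2)     # P1: at least one input is 1
        p2 = step(3 - 2 * x1 - 2 * x2)      # P2: not both inputs are 1
        return step(-3 + 2 * p1 + 2 * p2)   # output: AND of the two half-spaces

    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(a, b, xor_net(a, b))          # 0, 1, 1, 0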
MLP — Notation

► X is the set of input neurons.
► Y is the set of output neurons.
► Z is the set of all neurons (thus X, Y ⊆ Z).
► Individual neurons are denoted by indices, e.g., i, j.
► ξ_j is the inner potential of the neuron j when the computation is finished.
► y_j is the output value of the neuron j when the computation is finished. (We formally assume y_0 = 1.)
► w_ji is the weight of the arc from the neuron i to the neuron j.
► j← is the set of all neurons from which there are edges to j (i.e., j← is the layer directly below j).
► j→ is the set of all neurons to which there are edges from j (i.e., j→ is the layer directly above j).
► Inner potential of a neuron j:
  ξ_j = Σ_{i ∈ j←} w_ji y_i
► For simplicity, the activation function of every neuron will be the logistic sigmoid σ(ξ) = 1/(1 + e^{−ξ}). (We may of course consider logistic sigmoids with different steepness parameters, or other sigmoidal functions; more in PV021.)
► The value of a non-input neuron j ∈ Z ∖ X when the computation is finished is y_j = σ(ξ_j). (y_j is determined by the weights w and a given input x, so it is sometimes written as y_j[w](x).)
► Fixing the weights of all neurons, the network computes a function F[w] : ℝ^{|X|} → ℝ^{|Y|} as follows: assign the values of a given vector x ∈ ℝ^{|X|} to the input neurons and evaluate the network; then F[w](x) is the vector of values of the output neurons. Here we implicitly assume fixed orderings on input and output vectors.

MLP — Learning

► Given a set D of training examples:
  D = {(x_k, d_k) | k = 1, ..., p}
  Here x_k ∈ ℝ^{|X|} and d_k ∈ ℝ^{|Y|}. We write d_kj to denote the value in d_k corresponding to the output neuron j.
► Least squares error function: let w be a vector of all weights in the network.
  E(w) = Σ_{k=1}^p E_k(w)
  where
  E_k(w) = (1/2) Σ_{j∈Y} (y_j[w](x_k) − d_kj)²

MLP — Learning Algorithm

Batch learning — gradient descent:
The algorithm computes a sequence of weights w^(0), w^(1), ....
► The weights are initialized randomly close to 0.
► In step t+1 (here t = 0, 1, 2, ...), w^(t+1) is computed as follows:
  w_ji^(t+1) = w_ji^(t) + Δw_ji^(t)
  where
  Δw_ji^(t) = −ε(t) · (∂E/∂w_ji)(w^(t))
  is the change of the weight w_ji, and 0 < ε(t) ≤ 1 is the learning speed in step t+1.
Note that (∂E/∂w_ji)(w^(t)) is a component of ∇E, i.e., the weight change in step t+1 can be written as
w^(t+1) = w^(t) − ε(t) · ∇E(w^(t)).

MLP — Gradient Computation

For every weight w_ji we have (obviously)
∂E/∂w_ji = Σ_{k=1}^p ∂E_k/∂w_ji
So now it suffices to compute ∂E_k/∂w_ji, that is, the error for a fixed training example (x_k, d_k). It holds that
∂E_k/∂w_ji = δ_j · y_j(1 − y_j) · y_i
where
δ_j = y_j − d_kj for j ∈ Y
δ_j = Σ_{r ∈ j→} δ_r · y_r(1 − y_r) · w_rj for j ∈ Z ∖ (Y ∪ X)
(Here y_r = y_r[w](x_k), where w are the current weights and x_k is the input of the k-th training example.)

Multilayer Perceptron — Backpropagation

So to compute all ∂E/∂w_ji = Σ_{k=1}^p ∂E_k/∂w_ji:
Compute all δ_j · y_j(1 − y_j) · y_i for every training example (x_k, d_k):
► Evaluate all values y_i of neurons using the standard bottom-up procedure with the input x_k.
► Compute δ_j using backpropagation through the layers, top-down:
  ► Assign δ_j = y_j − d_kj for all j ∈ Y.
  ► In the layer ℓ, assuming that δ_r has been computed for all neurons r in the layer ℓ+1, compute
    δ_j = Σ_{r ∈ j→} δ_r · y_r(1 − y_r) · w_rj
    for all j in the ℓ-th layer.

Example

[Figure: a network with input neurons 1, 2, hidden neurons 3, 4, and output neuron 5.]
The initial weights are
w_30^(0) = w_31^(0) = w_32^(0) = w_40^(0) = w_41^(0) = w_42^(0) = w_50^(0) = 0 and w_53^(0) = w_54^(0) = 1.
Consider the training set {((1, 0), 1)}. Then
y_1 = 1, y_2 = 0,
y_3 = σ(w_30 + w_31 y_1 + w_32 y_2) = σ(0) = 0.5,
y_4 = 0.5,
y_5 = σ(w_50 + w_53 y_3 + w_54 y_4) = σ(1) = 0.731058.
Backpropagation:
δ_5 = y_5 − 1 = −0.268942,
δ_3 = δ_5 · y_5(1 − y_5) · w_53 = −0.052877,
δ_4 = δ_5 · y_5(1 − y_5) · w_54 = −0.052877.
Some of the partial derivatives:
∂E_1/∂w_50 = δ_5 · y_5 · (1 − y_5) · 1 = −0.052877,
∂E_1/∂w_53 = δ_5 · y_5 · (1 − y_5) · y_3 = −0.026438,
∂E_1/∂w_54 = δ_5 · y_5 · (1 − y_5) · y_4 = −0.026438,
∂E_1/∂w_30 = δ_3 · y_3 · (1 − y_3) · 1 = −0.013219.

Illustration of Gradient Descent — XOR

[Figure: the error surface and a gradient-descent trajectory for a network learning XOR. Source: Pattern Classification (2nd Edition); Richard O. Duda, Peter E. Hart, David G. Stork.]
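The numbers in the example can be verified with a short script (mine):

    # forward and backward pass for the 2-2-1 example network above
    import math

    def sigma(x):
        return 1.0 / (1.0 + math.exp(-x))

    # initial weights: all zero except w53 = w54 = 1
    w30 = w31 = w32 = w40 = w41 = w42 = w50 = 0.0
    w53 = w54 = 1.0
    y1, y2, d = 1.0, 0.0, 1.0                   # training example ((1, 0), 1)

    # forward pass
    y3 = sigma(w30 + w31 * y1 + w32 * y2)       # 0.5
    y4 = sigma(w40 + w41 * y1 + w42 * y2)       # 0.5
    y5 = sigma(w50 + w53 * y3 + w54 * y4)       # 0.731058...

    # backward pass
    d5 = y5 - d                                 # -0.268942
    d3 = d5 * y5 * (1 - y5) * w53               # -0.052877
    d4 = d5 * y5 * (1 - y5) * w54               # -0.052877

    # partial derivatives dE1/dw_ji = delta_j * y_j * (1 - y_j) * y_i
    print(d5 * y5 * (1 - y5) * 1.0)             # dE1/dw50 = -0.052877
    print(d5 * y5 * (1 - y5) * y3)              # dE1/dw53 = -0.026438
    print(d3 * y3 * (1 - y3) * 1.0)             # dE1/dw30 = -0.013219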
Comments on Training Algorithm

► Not guaranteed to converge to zero training error; it may converge to local optima or oscillate indefinitely.
► In practice, it does converge to low error for many large networks on real data.
► Many epochs (thousands) may be required; hours or days of training for large networks.
► To avoid local-minima problems, run several trials starting with different random weights (random restarts):
  ► Take the result of the trial with the lowest training set error.
  ► Or build a committee of results from multiple trials (possibly weighting votes by training set accuracy).
There are many more issues concerning learning efficiency (data normalization, selection of activation functions, weight initialization, training speed, efficiency of the gradient descent itself, etc.) - see PV021.

Hidden Neurons Representations

Trained hidden neurons can be seen as newly constructed features. E.g., in a two-layer network used for classification, the hidden layer transforms the input so that important features become explicit (and hence the result may become linearly separable).
Consider a two-layer MLP, 64-2-3, for classification of letters (three output neurons, each corresponding to one of the letters).
[Figure: sample training patterns and the learned input-to-hidden weights.]

Overfitting

► Due to their expressive power, neural networks are quite sensitive to overfitting.
[Plot: error vs. the number of training epochs; the error on training data keeps decreasing, while the error on test data eventually starts to grow.]
► Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase the validation error.

Overfitting — The Number of Hidden Neurons

► Too few hidden neurons prevent the network from adequately fitting the data.
► Too many hidden units can result in overfitting. (There are advanced methods that prevent overfitting even for rich models, such as regularization, where the error function penalizes overfitting - see PV021.)
[Plot: error vs. the number of hidden units; training error decreases with more units, while test error eventually grows.]
► Use cross-validation to empirically determine an optimal number of hidden units. There are also methods that automatically construct the structure of the network based on data, but they are not much used.

Applications

► Text to speech and vice versa
► Fraud detection
► Finance & business predictions
► Game playing (backgammon is a classical example, AlphaGo is the modern one)
► Image recognition; this is the main area in which the current state-of-the-art deep networks excel
► (Artificial brain and intelligence)
► ...

ALVINN

[Figure: the ALVINN vehicle.]
► Two-layer MLP, 960 - 4 - 30 (sometimes 960 - 5 - 30).
► Inputs correspond to pixels.
► Sigmoidal activation function (logistic sigmoid).
► Direction corresponds to the center of gravity. I.e., output neurons are considered as points of mass evenly distributed along a line; the weight of each neuron corresponds to its value.

ALVINN — Training

Trained while driving.
► A camera captured the road from the front window, approx. 25 pictures per second.
► Training examples (x_k, d_k), where
  ► x_k = an image of the road,
  ► d_k ≈ the corresponding direction of the steering wheel set by the driver,
  ► the values d_ki are computed using a Gaussian distribution centered at the output neuron corresponding to the real direction of the steering wheel, so that d_ki decreases with the distance D_i of the i-th output neuron from that neuron.
  (This is better than a binary output, because similar road directions induce similar reactions of the driver.)
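A hedged sketch of this soft-target encoding: the exact Gaussian used in ALVINN is not given in the slides, so the width parameter below is invented for illustration.

    # Gaussian soft targets over the 30 output neurons, and the
    # center-of-gravity decoding of the network's output
    import math

    def soft_targets(true_index, n_outputs=30, width=10.0):  # width is hypothetical
        return [math.exp(-(i - true_index) ** 2 / width) for i in range(n_outputs)]

    def center_of_gravity(outputs):
        total = sum(outputs)
        return sum(i * o for i, o in enumerate(outputs)) / total

    d = soft_targets(12)
    print(center_of_gravity(d))   # close to 12, the true direction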
Selection of Training Examples

Naive approach: just take images from the camera.
Problems:
► A too-good driver never teaches the network how to recover from deviations from the right track. A couple of harsh solutions:
  ► Turn the learning off for a moment, deviate from the right track, then turn the learning on and let the network learn how to solve the situation.
  ► Let the driver go crazy! (A bit dangerous, expensive, unreliable.)
► The images are very similar (the network basically sees the road from the right lane), so the network may be overtrained.

Selection of Training Examples

The problem with the too-good driver was solved as follows:
► every image of the road has been transformed into 15 slightly different copies.
[Figure: the original image and its transformed copies.]
The repetitiveness of images was solved as follows:
► the system keeps a buffer of 200 images (including the 15 copies of the current one), and in every round it trains on these images;
► afterwards, a new image is captured, 15 copies are made, and these new 15 substitute 15 images selected from the buffer (10 with the smallest training error, 5 randomly).

ALVINN — Training

► standard backpropagation,
► constant learning speed (possibly different for each neuron - see PV021),
► some other optimizations (see PV021).
The result:
► Training took 5 minutes; the speed was 4 miles per hour. (The speed was limited by the hydraulic controller of the steering wheel, not by the learning algorithm.)
► ALVINN was able to go through roads it had never "seen" and in different weather.

ALVINN — Weight Learning

[Figure: development of the input-to-hidden weights of the hidden neurons h1, ..., h5 during training, after rounds 0, 10, 20, and 50.]

Extensions and Directions (PV021)

► Other types of learning inspired by neuroscience - Hebbian learning.
► More biologically plausible models of neural networks - spiking neurons. This goes in the direction of the HUGE area of (computational) neuroscience, only very lightly touched in PV021.
► Unsupervised learning - self-organizing maps.
► Reinforcement learning:
  ► learning to make decisions, or to play games, sequentially;
  ► neural networks have been used - temporal difference learning.

Deep Learning

► Cybenko's theorem shows that two-layer networks are omnipotent - such results nearly killed NN research when support vector machines were found to be easier to train in the 00s.
► Later, it has been shown (experimentally) that deep networks (with many layers) have better representational properties.
► ... but how to train them? Backpropagation suffers from the so-called vanishing gradient; intuitively, updates of the weights in the lower layers are very slow.
► In 2006 a solution was found by Hinton et al.:
  ► Use unsupervised methods to initialize the weights so that they capture important features in the data. More precisely: the lowest hidden layer learns patterns in the data, the second lowest learns patterns in the data transformed through the first layer, and so on.
  ► Then use a supervised learning algorithm to only fine-tune the weights to the desired input-output behavior.
A rather heavy machinery is needed to develop this, but you will be rewarded by insight into a very modern and expensive technology.

ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)

ImageNet database (16,000,000 color images, 20,000 categories).
Competition in classification over a subset of images from ImageNet.
In 2012: Training set of 1,200,000 images in 1,000 categories. Validation set of 50,000 images, test set of 150,000.
Many images contain several objects → the typical rule is top-5: the net scores if the correct category is among the 5 categories to which it assigns the highest probability.

The KSH Network

ImageNet classification with deep convolutional neural networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton (2012).
[Figure: the network architecture; convolutional layers with max pooling, followed by dense layers.]
Trained on two GPUs (NVIDIA GeForce GTX 580).
Results:
► accuracy 84.7% in top-5 (the second best algorithm at the time achieved 73.8%),
► 63.3% in "perfect" classification (top-1).

ILSVRC 2014

The same set of images as in 2012, top-5 criterion.
GoogLeNet: a deep convolutional net with 22 layers.
Results:
► 93.33% in top-5.
Superhuman power?

Superhuman GoogLeNet?!
Andrej Karpathy: ...the task of labeling images with 5 out of 1000 categories quickly turned out to be extremely challenging, even for some friends in the lab who have been working on ILSVRC and its classes for a while. First we thought we would put it up on [Amazon Mechanical Turk]. Then we thought we could recruit paid undergrads. Then I organized a labeling party of intense labeling effort only among the (expert labelers) in our lab. Then I developed a modified interface that used GoogLeNet predictions to prune the number of categories from 1000 to only about 100. It was still too hard - people kept missing categories and getting up to ranges of 13-15% error rates. In the end I realized that to get anywhere competitively close to GoogLeNet, it was most efficient if I sat down and went through the painfully long training process and the subsequent careful annotation process myself... The labeling happened at a rate of about 1 per minute, but this decreased over time... Some images are easily recognized, while some images (such as those of fine-grained breeds of dogs, birds, or monkeys) can require multiple minutes of concentrated effort. I became very good at identifying breeds of dogs... Based on the sample of images I worked on, the GoogLeNet classification error turned out to be 6.8%... My own error in the end turned out to be 5.1%, approximately 1.7% better.

ILSVRC 2015

► Microsoft's network ResNet: 152 layers, complex architecture.
► Trained on 8 GPUs.
► 96.43% accuracy in top-5.
[Figure: a 34-layer plain network vs. a 34-layer residual network.]

ILSVRC

[Figure: overview of ILSVRC results over the years.]

Deeper Insight into the Logistic Sigmoid

Consider a perceptron (that is, a linear classifier):
ξ = w_0 + Σ_{i=1}^n w_i · x_i and y = sgn(ξ) = 1 for ξ ≥ 0, 0 for ξ < 0.
Recall that ξ is proportional to the signed distance of x from the decision boundary determined by w. Assume that, within each of the two classes, the values of ξ are normally distributed, with means ξ¹ (class 1) and ξ⁰ (class 0) and the same variance.
Now, can we decide what is the probability of c = 1 given a distance?

Deeper Insight into the Logistic Sigmoid

For simplicity, assume that ξ¹ = −ξ⁰ = 1/2 and unit variance. Then, by Bayes' theorem,
P(1 | ξ) = p(ξ | 1)P(1) / (p(ξ | 1)P(1) + p(ξ | 0)P(0)) = LR / (LR + 1/clr)
where
LR = p(ξ | 1) / p(ξ | 0) = exp(−(ξ − 1/2)²/2) / exp(−(ξ + 1/2)²/2) = exp(ξ)
and
clr = P(1)/P(0), which we assume (for simplicity) to be = 1.
So
P(1 | ξ) = exp(ξ) / (exp(ξ) + 1) = 1 / (1 + e^{−ξ})
Thus the logistic sigmoid applied to ξ = w_0 + Σ_{i=1}^n w_i · x_i gives the probability of c = 1 given the input!

Deeper Insight into the Logistic Sigmoid

So if we use the logistic sigmoid as the activation function, and turn the neuron into a classifier by classifying an input as 1 iff the output σ(ξ) is ≥ 1/2, then the neuron basically works as the Bayes classifier!
This is the basis of logistic regression. Given training data, we may compute the weights w that maximize the likelihood of the training data (w.r.t. the probabilities returned by the neuron).
An extremely interesting observation is that such a w maximizing the likelihood coincides with the minimum of least squares for the corresponding linear function (that is, the same neuron but with the identity as the activation function).
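The claim can be checked numerically. A sketch (mine): for two unit-variance Gaussians with means ±1/2 and equal priors, the posterior P(1 | ξ) coincides with the logistic sigmoid.

    # posterior of class 1 under the two-Gaussian model vs. sigmoid(xi)
    import math

    def gauss(x, mu):
        return math.exp(-(x - mu) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

    def posterior(xi):
        num = gauss(xi, 0.5) * 0.5           # p(xi | 1) P(1)
        return num / (num + gauss(xi, -0.5) * 0.5)

    for xi in (-2.0, 0.0, 1.0, 3.0):
        print(posterior(xi), 1.0 / (1.0 + math.exp(-xi)))   # the pairs coincide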
Kernel Methods & SVM

Partially based on the ML lecture by Raymond J. Mooney, University of Texas at Austin.

Back to the Linear Classifier (Slightly Modified)

A linear classifier h[w] is determined by a vector of weights w = (w_0, w_1, ..., w_n) ∈ ℝ^{n+1} as follows: given x = (x_1, ..., x_n) ∈ X ⊆ ℝⁿ,
h[w](x) := 1 if w_0 + Σ_{i=1}^n w_i · x_i ≥ 0, and −1 otherwise.
For convenience, we use the values {−1, 1} instead of {0, 1}. Note that this is not a principal modification; the linear classifier works exactly as the original one.
Recall that given x = (x_1, ..., x_n) ∈ ℝⁿ, the augmented feature vector is
x̃ = (x_0, x_1, ..., x_n) where x_0 = 1
This makes the notation for the linear classifier more succinct:
h[w](x) = sig(w · x̃), where sig(y) = 1 for y ≥ 0 and sig(y) = −1 for y < 0.

Perceptron Learning Revisited

► Given a training set
  D = {(x_1, y(x_1)), (x_2, y(x_2)), ..., (x_p, y(x_p))}
  Here x_k = (x_{k1}, ..., x_{kn}) ∈ X ⊆ ℝⁿ and y(x_k) ∈ {−1, 1}. We write y_k instead of y(x_k).
  Note that x̃_k = (x_{k0}, x_{k1}, ..., x_{kn}) where x_{k0} = 1.
► A weight vector w ∈ ℝ^{n+1} is consistent with D if
  h[w](x_k) = sig(w · x̃_k) = y_k for all k = 1, ..., p.
  D is linearly separable if there is a vector w ∈ ℝ^{n+1} which is consistent with D.

Perceptron Learning Revisited

Perceptron learning algorithm (slightly modified):
Compute a sequence of weight vectors w^(0), w^(1), w^(2), ....
► w^(0) is initialized to 0 = (0, ..., 0). (This is a slight but harmless modification of the traditional algorithm.)
► In the (t+1)-th step, w^(t+1) is computed as follows:
  ► If sig(w^(t) · x̃_k) ≠ y_k, then w^(t+1) = w^(t) + y_k · x̃_k.
  ► Otherwise, w^(t+1) = w^(t).
  Here k = (t mod p) + 1, i.e., the examples are considered cyclically.
(Note that this algorithm corresponds to the perceptron learning with the learning speed ε = 1.)
We know: if D is linearly separable, then there is t* such that w^(t*) is consistent with D. But what can we do if D is not linearly separable?

Quadratic Decision Boundary

[Figure: left, the original training set in the plane with an elliptic decision boundary; right, the same set transformed using the squares of the features, where the boundary becomes linear.]
Right: the green line is the decision boundary learned using the perceptron algorithm. (The red boundary corresponds to another learning algorithm.) Left: the green ellipse maps exactly to the green line.
How to classify (in the original space): first transform a given feature vector by squaring the features, then use the linear classifier.

Do We Need to Map Explicitly?

In general, mapping to a (much) higher-dimensional feature space helps (there are more "degrees of freedom", so linear separability might get a chance). However, the complexity of learning grows (quickly) with the dimension.
Sometimes it is even beneficial to map to infinite-dimensional spaces. To avoid the explicit construction of the higher-dimensional feature space, we use the so-called kernel trick. But first we need to dualize our learning algorithm.

Perceptron Learning Revisited

Perceptron learning algorithm once more:
Compute a sequence of weight vectors w^(0), w^(1), w^(2), ....
► w^(0) is initialized to 0 = (0, ..., 0).
► In the (t+1)-th step, w^(t+1) is computed as follows:
  ► If sig(w^(t) · x̃_k) ≠ y_k, then w^(t+1) = w^(t) + y_k · x̃_k.
  ► Otherwise, w^(t+1) = w^(t).
  Here k = (t mod p) + 1, i.e., the examples are considered cyclically.
Crucial observation: Note that w^(t) = Σ_{ℓ=1}^p n_ℓ^(t) · y_ℓ · x̃_ℓ for suitable n_1^(t), ..., n_p^(t) ∈ ℕ. Intuitively, n_ℓ^(t) counts how many times x̃_ℓ was added to (if y_ℓ = 1), or subtracted from (if y_ℓ = −1), the weights.

Dual Perceptron Learning

Dual perceptron learning algorithm:
Compute a sequence of vectors of numbers n^(0), n^(1), ..., where each n^(t) = (n_1^(t), ..., n_p^(t)) ∈ ℕᵖ.
► n^(0) is initialized to 0 = (0, ..., 0).
► In the (t+1)-th step, (n_1^(t+1), ..., n_p^(t+1)) is computed as follows:
  ► If sig(Σ_{ℓ=1}^p n_ℓ^(t) · y_ℓ · x̃_ℓ · x̃_k) ≠ y_k, then n_k^(t+1) := n_k^(t) + 1; else n_k^(t+1) := n_k^(t).
  ► n_ℓ^(t+1) := n_ℓ^(t) for all ℓ ≠ k.
  Here k = (t mod p) + 1; the examples are considered cyclically.
If D is linearly separable, there exists t* such that Σ_{ℓ=1}^p n_ℓ^(t*) · y_ℓ · x̃_ℓ is consistent with D. The algorithm stops at such t* and returns (n_1^(t*), ..., n_p^(t*)), so that Σ_{ℓ=1}^p n_ℓ^(t*) · y_ℓ · x̃_ℓ is consistent with D.
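A sketch (mine) of the dual algorithm; it returns the vector (n_1, ..., n_p):

    # dual perceptron learning; feature vectors appear only in scalar products
    def sig(y):
        return 1 if y >= 0 else -1

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def dual_perceptron(D, max_steps=10_000):
        xs = [(1.0,) + tuple(x) for x, _ in D]   # augmented feature vectors
        ys = [y for _, y in D]
        p = len(D)
        n = [0] * p

        def predict(k):
            return sig(sum(n[i] * ys[i] * dot(xs[i], xs[k]) for i in range(p)))

        for t in range(max_steps):
            k = t % p                            # examples considered cyclically
            if predict(k) != ys[k]:
                n[k] += 1
            if all(predict(j) == ys[j] for j in range(p)):
                return n
        return None

    D = [((2, -1), 1), ((2, 1), 1), ((1, 3), -1)]
    print(dual_perceptron(D))                    # [0, 1, 1], as in the example below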
Example

Training set:
D = {((2, −1), 1), ((2, 1), 1), ((1, 3), −1)}
That is
  x_1 = (2, −1)   x̃_1 = (1, 2, −1)   y_1 = 1
  x_2 = (2, 1)    x̃_2 = (1, 2, 1)    y_2 = 1
  x_3 = (1, 3)    x̃_3 = (1, 1, 3)    y_3 = −1
The initial values are n_1^(0) = n_2^(0) = n_3^(0) = 0.

► Σ_ℓ n_ℓ^(0) · y_ℓ · x̃_ℓ · x̃_1 = 0, thus sig(Σ_ℓ n_ℓ^(0) · y_ℓ · x̃_ℓ · x̃_1) = 1 = y_1. Hence, n^(1) = (0, 0, 0).
► Σ_ℓ n_ℓ^(1) · y_ℓ · x̃_ℓ · x̃_2 = 0, thus sig(Σ_ℓ n_ℓ^(1) · y_ℓ · x̃_ℓ · x̃_2) = 1 = y_2. Hence, n^(2) = (0, 0, 0).
► Σ_ℓ n_ℓ^(2) · y_ℓ · x̃_ℓ · x̃_3 = 0, thus sig(Σ_ℓ n_ℓ^(2) · y_ℓ · x̃_ℓ · x̃_3) = 1 ≠ y_3. Hence, n^(3) = (0, 0, 1).
► Σ_ℓ n_ℓ^(3) · y_ℓ · x̃_ℓ · x̃_1 = −1 · x̃_3 · x̃_1 = −(1, 1, 3) · (1, 2, −1) = −0 = 0, thus sig(...) = 1 = y_1. Hence, n^(4) = (0, 0, 1).
► Σ_ℓ n_ℓ^(4) · y_ℓ · x̃_ℓ · x̃_2 = −1 · x̃_3 · x̃_2 = −(1, 1, 3) · (1, 2, 1) = −6, thus sig(...) = −1 ≠ y_2. Hence, n^(5) = (0, 1, 1).
► Σ_ℓ n_ℓ^(5) · y_ℓ · x̃_ℓ · x̃_3 = 1 · x̃_2 · x̃_3 − 1 · x̃_3 · x̃_3 = 6 − 11 = −5, thus n^(6) = (0, 1, 1). This is OK.
► Σ_ℓ n_ℓ^(6) · y_ℓ · x̃_ℓ · x̃_1 = 1 · x̃_2 · x̃_1 − 1 · x̃_3 · x̃_1 = 4 − 0 = 4, thus n^(7) = (0, 1, 1). This is OK.
► Σ_ℓ n_ℓ^(7) · y_ℓ · x̃_ℓ · x̃_2 = 1 · x̃_2 · x̃_2 − 1 · x̃_3 · x̃_2 = 6 − 6 = 0, thus n^(8) = (0, 1, 1). This is OK.
The result: (n_1, n_2, n_3) = (0, 1, 1), i.e., the consistent weight vector is x̃_2 − x̃_3.

Dual Perceptron Learning — Output

Let Σ_{ℓ=1}^p n_ℓ · y_ℓ · x̃_ℓ result from the dual perceptron learning algorithm, i.e., each n_ℓ = n_ℓ^(t*) ∈ ℕ for a suitable t* at which the algorithm found a consistent vector. This vector of weights determines a linear classifier that, for a given x ∈ ℝⁿ, gives
h[w](x) = sig(Σ_{ℓ=1}^p n_ℓ · y_ℓ · x̃_ℓ · x̃)
(Here x̃ is the augmented feature vector obtained from x.)
Crucial observation: The (augmented) feature vectors x̃_ℓ and x̃ occur only in scalar products!

Kernel Trick

For simplicity, assume bivariate data: x̃_k = (1, x_{k1}, x_{k2}). The corresponding instance in the quadratic feature space is (1, x_{k1}², x_{k2}²).
Consider two instances x̃_k = (1, x_{k1}, x_{k2}) and x̃_ℓ = (1, x_{ℓ1}, x_{ℓ2}). Then the scalar product of their corresponding instances (1, x_{k1}², x_{k2}²) and (1, x_{ℓ1}², x_{ℓ2}²), resp., in the quadratic feature space is
1 + x_{k1}²x_{ℓ1}² + x_{k2}²x_{ℓ2}²
which resembles (but is not equal to)
(x̃_k · x̃_ℓ)² = (1 + x_{k1}x_{ℓ1} + x_{k2}x_{ℓ2})² = 1 + x_{k1}²x_{ℓ1}² + x_{k2}²x_{ℓ2}² + 2x_{k1}x_{ℓ1}x_{k2}x_{ℓ2} + 2x_{k1}x_{ℓ1} + 2x_{k2}x_{ℓ2}
But now consider the mapping to ℝ⁶ defined by
φ(x̃_k) = (1, x_{k1}², x_{k2}², √2·x_{k1}x_{k2}, √2·x_{k1}, √2·x_{k2})
Then
φ(x̃_k) · φ(x̃_ℓ) = (x̃_k · x̃_ℓ)²
THE idea: Define a kernel κ(x̃_k, x̃_ℓ) = (x̃_k · x̃_ℓ)² and replace x̃_k · x̃_ℓ in the dual perceptron algorithm with κ(x̃_k, x̃_ℓ).

Kernel Perceptron Learning

Kernel perceptron learning algorithm:
Compute a sequence of vectors of numbers n^(0), n^(1), ..., where each n^(t) = (n_1^(t), ..., n_p^(t)) ∈ ℕᵖ.
► n^(0) is initialized to 0 = (0, ..., 0).
► In the (t+1)-th step, (n_1^(t+1), ..., n_p^(t+1)) is computed as follows:
  ► If sig(Σ_{ℓ=1}^p n_ℓ^(t) · y_ℓ · κ(x̃_ℓ, x̃_k)) ≠ y_k, then n_k^(t+1) := n_k^(t) + 1; else n_k^(t+1) := n_k^(t).
  ► n_ℓ^(t+1) := n_ℓ^(t) for all ℓ ≠ k.
  Here k = (t mod p) + 1; the examples are considered cyclically.
Intuition: The algorithm computes a linear classifier in ℝ⁶ for the training examples transformed using φ. The trick is that the transformation φ itself never has to be explicitly computed!

Kernel Perceptron Learning — Output

Let n = (n_1, ..., n_p) result from the kernel perceptron learning algorithm, i.e., each n_ℓ = n_ℓ^(t*) ∈ ℕ for a suitable t* such that sig(Σ_{ℓ=1}^p n_ℓ^(t*) · y_ℓ · κ(x̃_ℓ, x̃_k)) = y_k for all k = 1, ..., p.
We obtain a non-linear classifier that, for a given x ∈ ℝⁿ, gives
h(x) = sig(Σ_{ℓ=1}^p n_ℓ · y_ℓ · κ(x̃_ℓ, x̃))
(Here x̃ is the augmented feature vector obtained from x.)
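Replacing the scalar product with a kernel gives the kernel perceptron. A sketch (mine), using the quadratic kernel κ(x̃, x̃') = (x̃ · x̃')² from the kernel-trick discussion above:

    # kernel perceptron learning over a precomputed Gram matrix
    def sig(y):
        return 1 if y >= 0 else -1

    def quad_kernel(x, z):
        s = sum(a * b for a, b in zip(x, z))
        return s * s

    def kernel_perceptron(D, kernel, max_steps=10_000):
        xs = [(1.0,) + tuple(x) for x, _ in D]       # augmented feature vectors
        ys = [y for _, y in D]
        p = len(D)
        K = [[kernel(a, b) for b in xs] for a in xs]  # Gram matrix
        n = [0] * p

        def predict(k):
            return sig(sum(n[i] * ys[i] * K[i][k] for i in range(p)))

        for t in range(max_steps):
            k = t % p
            if predict(k) != ys[k]:
                n[k] += 1
            if all(predict(j) == ys[j] for j in range(p)):
                return n
        return None

    D = [((2, -1), 1), ((2, 1), 1), ((1, 3), -1)]
    print(kernel_perceptron(D, quad_kernel))  # multiplicities of a consistent classifier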
Are there other kernels that correspond to scalar products in higher-dimensional spaces?

Kernels

Given a (potential) kernel κ(x̃_ℓ, x̃_k), we need to check whether κ(x̃_ℓ, x̃_k) = φ(x̃_ℓ) · φ(x̃_k) for some function φ. This might be very difficult.

Theorem (Mercer)
κ is a kernel if the corresponding Gram matrix K of the training set D, whose ℓk-th element is κ(x̃_ℓ, x̃_k), is positive semi-definite for all possible choices of the training set D.

Kernels can be constructed from existing kernels by several operations:
► linear combination (i.e., multiplication by a constant, or sum),
► multiplication,
► exponentiation,
► multiplication by a polynomial with non-negative coefficients,
► ... (see, e.g., "Pattern Recognition and Machine Learning" by Bishop).

Examples of Kernels

► Linear: κ(x̃_ℓ, x̃_k) = x̃_ℓ · x̃_k
  The corresponding mapping φ(x̃) = x̃ is the identity (no transformation).
► Polynomial of power m: κ(x̃_ℓ, x̃_k) = (1 + x̃_ℓ · x̃_k)^m
  The corresponding mapping assigns to x̃ ∈ ℝ^{n+1} a vector φ(x̃) in a space of dimension (n+m choose m).
► Gaussian (radial-basis function): κ(x̃_ℓ, x̃_k) = exp(−‖x̃_ℓ − x̃_k‖² / (2σ²))
  The corresponding mapping φ maps x̃ to an infinite-dimensional vector φ(x̃) which is, in fact, a Gaussian function; a combination of such functions for the support vectors then forms the separating hypersurface.
► ...
Choosing kernels remains the black magic of kernel methods. They are usually chosen based on trial and error (of course, experience and additional insight into the data help).
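Straightforward sketches (mine) of the kernels listed above:

    # the three standard kernels; sigma and m are free parameters
    import math

    def linear_kernel(x, z):
        return sum(a * b for a, b in zip(x, z))

    def polynomial_kernel(x, z, m=2):
        return (1.0 + linear_kernel(x, z)) ** m

    def gaussian_kernel(x, z, sigma=1.0):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
        return math.exp(-sq_dist / (2.0 * sigma ** 2))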
Now let's go on to the main area where kernel methods are used: to enhance support vector machines.

SVM Idea — Which Linear Classifier is the Best?

[Figure: several separating lines for the same data; the maximum-margin line lies farthest from both classes.]
Benefits of the maximum margin:
► Intuitively, the maximum margin is good w.r.t. generalization.
► Only the support vectors (those on the margin) matter; the others can, in principle, be ignored.

Support Vector Machines (SVM)

Notation:
► w = (w_0, w_1, ..., w_n) is a vector of weights,
► w_ = (w_1, ..., w_n) is the vector of all weights except w_0,
► x = (x_1, ..., x_n) is a (generic) feature vector.
Consider a linear classifier:
h[w](x) := 1 if w_0 + Σ_{i=1}^n w_i · x_i ≥ 0, and −1 otherwise.
The signed distance of x from the decision boundary determined by w is
d[w](x) = (w_0 + w_ · x) / ‖w_‖
Here ‖w_‖ = √(Σ_{i=1}^n w_i²) is the Euclidean norm of w_.
|d[w](x)| is the distance of x from the decision boundary. d[w](x) is positive for x on the side to which w_ points, and negative on the opposite side.

Support Vectors & Margin

► Given a training set
  D = {(x_1, y(x_1)), (x_2, y(x_2)), ..., (x_p, y(x_p))}
  Here x_k = (x_{k1}, ..., x_{kn}) ∈ X ⊆ ℝⁿ and y(x_k) ∈ {−1, 1}. We write y_k instead of y(x_k).
► Assume that D is linearly separable, and let w be consistent with D such that the distance of the decision boundary from the nearest examples on both sides is the same (if not, it suffices to adjust w_0).
► Support vectors are those x_k that minimize |d[w](x_k)|.
► The margin ρ of w is twice the distance between the support vectors and the decision boundary.
[Figure: the maximum-margin separating hyperplane with the support vectors on both sides.]
Our goal is to find a classifier that maximizes the margin.

Maximizing the Margin

For w consistent with D (such that no x_k lies on the decision boundary) we have
ρ = 2 · |w_0 + w_ · x_k| / ‖w_‖ = 2 · y_k · (w_0 + w_ · x_k) / ‖w_‖
where x_k is a support vector. We may safely consider only w such that y_k · (w_0 + w_ · x_k) = 1 for the support vectors: just adjust the length of w so that y_k · (w_0 + w_ · x_k) = 1; the denominator will compensate. Then maximizing ρ is equivalent to maximizing 2/‖w_‖.
(In what follows we use a slightly looser constraint: y_k · (w_0 + w_ · x_k) ≥ 1 for all x_k. However, the result is the same, since even with this looser condition the support vectors always satisfy y_k · (w_0 + w_ · x_k) = 1 whenever 2/‖w_‖ is maximal.)

SVM — Optimization

Margin maximization can be formulated as a quadratic optimization problem:
Find w = (w_0, ..., w_n) such that
ρ = 2/‖w_‖ is maximized
and for all (x_k, y_k) ∈ D we have y_k · (w_0 + w_ · x_k) ≥ 1,
which can be reformulated as:
Find w such that
Φ(w) = ‖w_‖² = w_ · w_ is minimized
and for all (x_k, y_k) ∈ D we have y_k · (w_0 + w_ · x_k) ≥ 1.

SVM — Optimization

► We need to optimize a quadratic function subject to linear constraints.
► Quadratic optimization problems are a well-known class of mathematical programming problems for which efficient methods (and tools) exist.
► The solution usually involves the construction of a dual problem where Lagrange multipliers α_ℓ are associated with every inequality (constraint) in the original problem:
Find α = (α_1, ..., α_p) such that
Ψ(α) = Σ_{ℓ=1}^p α_ℓ − (1/2) Σ_{ℓ=1}^p Σ_{k=1}^p α_ℓ · α_k · y_ℓ · y_k · x_ℓ · x_k is maximized
so that the following constraints are satisfied:
► Σ_{ℓ=1}^p α_ℓ y_ℓ = 0,
► α_ℓ ≥ 0 for all 1 ≤ ℓ ≤ p.

The Optimization Problem Solution

► Given a solution α_1, ..., α_p to the dual problem, the solution w = (w_0, w_1, ..., w_n) to the original one is:
  w_ = (w_1, ..., w_n) = Σ_{ℓ=1}^p α_ℓ · y_ℓ · x_ℓ
  w_0 = y_k − w_ · x_k for any k with α_k > 0.
  Note that α_k > 0 iff x_k is a support vector. Hence it does not matter which α_k > 0 is chosen in the above definition of w_0.
► The classifier is then
  h(x) = sig(w_0 + w_ · x) = sig(y_k − Σ_ℓ α_ℓ · y_ℓ · x_ℓ · x_k + Σ_ℓ α_ℓ · y_ℓ · x_ℓ · x)
Note that both the dual optimization problem and the classifier contain the training feature vectors only in scalar products! We may apply the kernel trick!

Kernel SVM

► Choose your favorite kernel κ.
► Solve the dual problem with the kernel κ:
Find α = (α_1, ..., α_p) such that
Ψ(α) = Σ_{ℓ=1}^p α_ℓ − (1/2) Σ_{ℓ=1}^p Σ_{k=1}^p α_ℓ · α_k · y_ℓ · y_k · κ(x_ℓ, x_k) is maximized
so that the following constraints are satisfied:
► Σ_{ℓ=1}^p α_ℓ y_ℓ = 0,
► α_ℓ ≥ 0 for all 1 ≤ ℓ ≤ p.
► Then use the classifier:
h(x) = sig(y_k − Σ_ℓ α_ℓ · y_ℓ · κ(x_ℓ, x_k) + Σ_ℓ α_ℓ · y_ℓ · κ(x_ℓ, x))
► Note that the optimization techniques remain the same as for the linear SVM without kernels!

Comments on Algorithms

► The main bottleneck of SVMs is the complexity of quadratic programming (QP). A naive QP solver has cubic complexity.
► For small problems, any general-purpose optimization algorithm can be used.
► For large problems, this is usually not possible; many methods avoiding a direct solution have been devised. These methods usually decompose the optimization problem into a sequence of smaller ones. Intuitively:
  ► start with a (smaller) subset of training examples;
  ► find an optimal solution using any solver;
  ► afterwards, only the support vectors matter in the solution: leave only them in the training set, and add new training examples;
  ► this iterative procedure decreases the (general) cost function.

SVM in Applications (Mooney's lecture)

► SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
► SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
► SVMs can be applied to complex data types beyond feature vectors (e.g., graphs, sequences, relational data) by designing kernel functions for such data.
► SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc.
► The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the α_ℓ's at a time, e.g., SMO [Platt '99] and [Joachims '99].
► Tuning SVMs remains a black art: selecting a specific kernel and parameters is usually done in a try-and-see manner.
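In practice one rarely implements the QP solver by hand. A hedged usage sketch: scikit-learn's SVC solves the dual problem above with a chosen kernel (here the Gaussian/RBF kernel); the data set below is invented for illustration.

    # kernel SVM via scikit-learn on an XOR-like toy set
    from sklearn.svm import SVC

    X = [[0, 0], [1, 1], [1, 0], [0, 1]]
    y = [-1, -1, 1, 1]

    clf = SVC(kernel="rbf", gamma=1.0, C=10.0)   # gamma = 1/(2 sigma^2)
    clf.fit(X, y)
    print(clf.support_)                          # indices of the support vectors
    print(clf.predict([[0.9, 0.2]]))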