Kernel Methods & SVM

Partially based on the ML lecture by Raymond J. Mooney, University of Texas at Austin.

Back to Linear Classifier (Slightly Modified)

A linear classifier $h[w]$ is determined by a vector of weights $w = (w_0, w_1, \dots, w_n) \in \mathbb{R}^{n+1}$ as follows: given $x = (x_1, \dots, x_n) \in X \subseteq \mathbb{R}^n$,

$$h[w](x) := \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^{n} w_i \cdot x_i \ge 0 \\ -1 & \text{if } w_0 + \sum_{i=1}^{n} w_i \cdot x_i < 0 \end{cases}$$

For convenience, we use the values $\{-1, 1\}$ instead of $\{0, 1\}$. Note that this is not a principal modification; the linear classifier works exactly as the original one.

Recall that given $x = (x_1, \dots, x_n) \in \mathbb{R}^n$, the augmented feature vector is $\tilde{x} = (x_0, x_1, \dots, x_n)$ where $x_0 = 1$. This makes the notation for the linear classifier more succinct:

$$h[w](x) = \mathrm{sig}(w \cdot \tilde{x}) \qquad \text{where } \mathrm{sig}(y) = \begin{cases} 1 & \text{if } y \ge 0 \\ -1 & \text{if } y < 0 \end{cases}$$

Perceptron Learning Revisited

► Given a training set
  $D = \{(x_1, y(x_1)), (x_2, y(x_2)), \dots, (x_p, y(x_p))\}$.
  Here $x_k = (x_{k1}, \dots, x_{kn}) \in X \subseteq \mathbb{R}^n$ and $y(x_k) \in \{-1, 1\}$.
  We write $y_k$ instead of $y(x_k)$. Note that $\tilde{x}_k = (x_{k0}, x_{k1}, \dots, x_{kn})$ where $x_{k0} = 1$.
► A weight vector $w \in \mathbb{R}^{n+1}$ is consistent with $D$ if $h[w](x_k) = \mathrm{sig}(w \cdot \tilde{x}_k) = y_k$ for all $k = 1, \dots, p$.
  $D$ is linearly separable if there is a vector $w \in \mathbb{R}^{n+1}$ which is consistent with $D$.

Perceptron learning algorithm (slightly modified): Consider the training examples cyclically and compute a sequence of weight vectors $w^{(0)}, w^{(1)}, w^{(2)}, \dots$
► $w^{(0)}$ is initialized to $(0, \dots, 0)$. (This is a slight but harmless modification of the traditional algorithm.)
► In the $(t+1)$-th step, $w^{(t+1)}$ is computed as follows:
  ► If $\mathrm{sig}(w^{(t)} \cdot \tilde{x}_k) \ne y_k$, then $w^{(t+1)} = w^{(t)} + y_k \cdot \tilde{x}_k$.
  ► Otherwise, $w^{(t+1)} = w^{(t)}$.
  Here $k = (t \bmod p) + 1$, i.e. the examples are considered cyclically. (Note that this algorithm corresponds to perceptron learning with the learning speed $\varepsilon = 1$.) A code sketch of this update rule is given below.
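The following is a minimal sketch of the modified perceptron algorithm above, not part of the original lecture, assuming a NumPy environment; the data set at the bottom, the function names, and the `max_epochs` cutoff are illustrative choices of mine.

```python
import numpy as np

def sig(y):
    """Sign function as defined above: +1 for y >= 0, -1 otherwise."""
    return 1 if y >= 0 else -1

def perceptron(X, y, max_epochs=100):
    """Modified perceptron learning. X is a (p, n) array of feature vectors,
    y a (p,) array of labels in {-1, +1}. Returns a weight vector of length
    n+1 (bias first), or None if no consistent vector is found in time."""
    p, n = X.shape
    X_aug = np.hstack([np.ones((p, 1)), X])   # augmented vectors, x_0 = 1
    w = np.zeros(n + 1)                       # w^(0) = (0, ..., 0)
    for _ in range(max_epochs):
        mistakes = 0
        for k in range(p):                    # examples considered cyclically
            if sig(w @ X_aug[k]) != y[k]:
                w = w + y[k] * X_aug[k]       # w^(t+1) = w^(t) + y_k * x~_k
                mistakes += 1
        if mistakes == 0:                     # w is consistent with D
            return w
    return None

# Tiny illustrative training set (it reappears in the dual example later):
X = np.array([[2.0, -1.0], [2.0, 1.0], [1.0, 3.0]])
y = np.array([1, 1, -1])
print(perceptron(X, y))                       # -> [ 0.  1. -2.]
```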
We know: if $D$ is linearly separable, then there is $t^*$ such that $w^{(t^*)}$ is consistent with $D$.

But what can we do if $D$ is not linearly separable?

Quadratic Decision Boundary

[Figure: two scatter plots of a two-class data set. Left: the original set. Right: the same set transformed using the square of the features.]

Left: the original set. Right: transformed using the square of the features.
Right: the green line is the decision boundary learned using the perceptron algorithm. (The red boundary corresponds to another learning algorithm.)
Left: the green ellipse maps exactly to the green line.

How to classify (in the original space): first transform a given feature vector by squaring the features, then use the linear classifier.
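As a concrete (made-up) illustration of this explicit transformation, the sketch below generates points inside and outside an ellipse, squares the two features, and reuses the `perceptron()` function from the sketch above to learn a linear boundary in the squared-feature space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: class +1 inside an axis-aligned ellipse, class -1 outside
# (points too close to the boundary are dropped so that a margin exists).
X = rng.uniform(-3, 3, size=(200, 2))
r = (X[:, 0] / 2.0) ** 2 + (X[:, 1] / 1.5) ** 2
keep = np.abs(r - 1.0) > 0.3
X, r = X[keep], r[keep]
y = np.where(r <= 1.0, 1, -1)

# Explicit quadratic transformation: square each feature.
# In the (x1^2, x2^2) space the ellipse becomes a straight line,
# so the plain perceptron from the previous sketch can separate the data.
X_sq = X ** 2
w = perceptron(X_sq, y, max_epochs=1000)   # increase max_epochs if None is returned
print("weights in the squared-feature space:", w)

# To classify a new point x, square its features first, then apply sig(w . x~).
```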
Do We Need to Map Explicitly?

In general, mapping to a (much) higher-dimensional feature space helps (there are more "degrees of freedom", so linear separability gets a chance). However, the complexity of learning grows (quickly) with the dimension. Sometimes it is even beneficial to map to infinite-dimensional spaces.

To avoid explicit construction of the higher-dimensional feature space, we use the so-called kernel trick. But first we need to dualize our learning algorithm.

Perceptron Learning Revisited

Perceptron learning algorithm once more: compute a sequence of weight vectors $w^{(0)}, w^{(1)}, w^{(2)}, \dots$
► $w^{(0)}$ is initialized to $(0, \dots, 0)$.
► In the $(t+1)$-th step, $w^{(t+1)}$ is computed as follows:
  ► If $\mathrm{sig}(w^{(t)} \cdot \tilde{x}_k) \ne y_k$, then $w^{(t+1)} = w^{(t)} + y_k \cdot \tilde{x}_k$.
  ► Otherwise, $w^{(t+1)} = w^{(t)}$.
  Here $k = (t \bmod p) + 1$, i.e. the examples are considered cyclically.

Crucial observation: Note that $w^{(t)} = \sum_{\ell=1}^{p} n_\ell^{(t)} \cdot y_\ell \cdot \tilde{x}_\ell$ for suitable $n_1^{(t)}, \dots, n_p^{(t)} \in \mathbb{N}$. Intuitively, $n_\ell^{(t)}$ counts how many times $\tilde{x}_\ell$ was added to (if $y_\ell = 1$) or subtracted from (if $y_\ell = -1$) the weights.

Dual Perceptron Learning

Dual perceptron learning algorithm: compute a sequence of vectors of numbers $n^{(0)}, n^{(1)}, \dots$ where each $n^{(t)} = (n_1^{(t)}, \dots, n_p^{(t)}) \in \mathbb{N}^p$.
► $n^{(0)}$ is initialized to $(0, \dots, 0)$.
► In the $(t+1)$-th step, $(n_1^{(t+1)}, \dots, n_p^{(t+1)})$ is computed as follows:
  ► If $\mathrm{sig}\left(\sum_{\ell=1}^{p} n_\ell^{(t)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_k\right) \ne y_k$, then $n_k^{(t+1)} := n_k^{(t)} + 1$; else $n_k^{(t+1)} := n_k^{(t)}$.
  ► $n_\ell^{(t+1)} := n_\ell^{(t)}$ for all $\ell \ne k$.
  Here $k = (t \bmod p) + 1$, i.e. the examples are considered cyclically. A code sketch of the dual update is given below, followed by a worked example.
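A minimal sketch of the dual formulation, again my own and under the same assumptions as the earlier sketches; note that the training vectors enter only through the precomputed matrix of dot products, which is exactly what the kernel trick will exploit.

```python
import numpy as np

def dual_perceptron(X, y, max_epochs=100):
    """Dual perceptron: returns counts n (one per training example) such that
    w = sum_l n[l] * y[l] * x~_l is consistent with the data, or None."""
    p = X.shape[0]
    X_aug = np.hstack([np.ones((p, 1)), X])     # augmented vectors
    G = X_aug @ X_aug.T                         # all dot products x~_l . x~_k
    n = np.zeros(p)                             # n^(0) = (0, ..., 0)
    for _ in range(max_epochs):
        mistakes = 0
        for k in range(p):
            s = np.sum(n * y * G[:, k])         # sum_l n_l y_l (x~_l . x~_k)
            if (1 if s >= 0 else -1) != y[k]:
                n[k] += 1                       # n_k^(t+1) = n_k^(t) + 1
                mistakes += 1
        if mistakes == 0:
            return n
    return None

# The same toy data as before: x1 = (2,-1), x2 = (2,1), x3 = (1,3)
X = np.array([[2.0, -1.0], [2.0, 1.0], [1.0, 3.0]])
y = np.array([1, 1, -1])
print(dual_perceptron(X, y))   # -> [0. 1. 1.], matching the worked example below
```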
Dual Perceptron Learning — Example

Consider the training set with augmented feature vectors $\tilde{x}_1 = (1, 2, -1)$, $\tilde{x}_2 = (1, 2, 1)$, $\tilde{x}_3 = (1, 1, 3)$ and labels $y_1 = y_2 = 1$, $y_3 = -1$ (these values can be read off from the dot products computed below). The algorithm proceeds as follows:

► $\sum_{\ell=1}^{3} n_\ell^{(0)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_1 = 0$, thus $\mathrm{sig}(\cdot) = 1 = y_1$. Hence $n^{(1)} = (0, 0, 0)$.
► $\sum_{\ell=1}^{3} n_\ell^{(1)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_2 = 0$, thus $\mathrm{sig}(\cdot) = 1 = y_2$. Hence $n^{(2)} = (0, 0, 0)$.
► $\sum_{\ell=1}^{3} n_\ell^{(2)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_3 = 0$, thus $\mathrm{sig}(\cdot) = 1 \ne y_3$. Hence $n^{(3)} = (0, 0, 1)$.
► $\sum_{\ell=1}^{3} n_\ell^{(3)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_1 = -1 \cdot \tilde{x}_3 \cdot \tilde{x}_1 = -1 \cdot (1,1,3) \cdot (1,2,-1) = -1 \cdot 0 = 0$, thus $\mathrm{sig}(\cdot) = 1 = y_1$. Hence $n^{(4)} = (0, 0, 1)$.
► $\sum_{\ell=1}^{3} n_\ell^{(4)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_2 = -1 \cdot \tilde{x}_3 \cdot \tilde{x}_2 = -1 \cdot (1,1,3) \cdot (1,2,1) = -1 \cdot 6 = -6$, thus $\mathrm{sig}(\cdot) = -1 \ne y_2$. Hence $n^{(5)} = (0, 1, 1)$.
► $\sum_{\ell=1}^{3} n_\ell^{(5)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_3 = 1 \cdot \tilde{x}_2 \cdot \tilde{x}_3 - 1 \cdot \tilde{x}_3 \cdot \tilde{x}_3 = -5$, thus $\mathrm{sig}(\cdot) = -1 = y_3$. This is OK; $n^{(6)} = (0, 1, 1)$.
► $\sum_{\ell=1}^{3} n_\ell^{(6)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_1 = 1 \cdot \tilde{x}_2 \cdot \tilde{x}_1 - 1 \cdot \tilde{x}_3 \cdot \tilde{x}_1 = 4$, thus $\mathrm{sig}(\cdot) = 1 = y_1$. This is OK; $n^{(7)} = (0, 1, 1)$.
► $\sum_{\ell=1}^{3} n_\ell^{(7)} \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}_2 = 1 \cdot \tilde{x}_2 \cdot \tilde{x}_2 - 1 \cdot \tilde{x}_3 \cdot \tilde{x}_2 = 0$, thus $\mathrm{sig}(\cdot) = 1 = y_2$. This is OK; $n^{(8)} = (0, 1, 1)$.

The result: $w = \tilde{x}_2 - \tilde{x}_3$.

Dual Perceptron Learning — Output

Let $w = \sum_{\ell=1}^{p} n_\ell \cdot y_\ell \cdot \tilde{x}_\ell$ result from the dual perceptron learning algorithm, i.e., each $n_\ell = n_\ell^{(t^*)} \in \mathbb{N}$ for a suitable $t^*$ in which the algorithm found a consistent vector. This vector of weights determines a linear classifier that for a given $x \in \mathbb{R}^n$ gives

$$h[w](x) = \mathrm{sig}\left(\sum_{\ell=1}^{p} n_\ell \cdot y_\ell \cdot \tilde{x}_\ell \cdot \tilde{x}\right)$$

(Here $\tilde{x}$ is the augmented feature vector obtained from $x$.)

Crucial observation: The (augmented) feature vectors $\tilde{x}_\ell$ and $\tilde{x}$ occur only in scalar products!
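As a quick numerical check of the worked example (my own addition), the counts $(n_1, n_2, n_3) = (0, 1, 1)$ give $w = \tilde{x}_2 - \tilde{x}_3 = (0, 1, -2)$, and the dual-form classifier indeed reproduces all three labels:

```python
import numpy as np

X_aug = np.array([[1, 2, -1], [1, 2, 1], [1, 1, 3]], dtype=float)  # x~_1, x~_2, x~_3
y = np.array([1, 1, -1])
n = np.array([0, 1, 1])                  # counts found by the dual perceptron

w = (n * y) @ X_aug                      # w = sum_l n_l y_l x~_l = x~_2 - x~_3
print(w)                                 # -> [ 0.  1. -2.]

# Dual-form prediction sig(sum_l n_l y_l (x~_l . x~)) on the training points:
scores = X_aug @ w
print(np.where(scores >= 0, 1, -1))      # -> [ 1  1 -1], equal to y
```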
Kernel Trick

For simplicity, assume bivariate data: $\tilde{x}_k = (1, x_{k1}, x_{k2})$. The corresponding instance in the quadratic feature space is $(1, x_{k1}^2, x_{k2}^2)$.

Consider two instances $\tilde{x}_k = (1, x_{k1}, x_{k2})$ and $\tilde{x}_\ell = (1, x_{\ell 1}, x_{\ell 2})$. Then the scalar product of their corresponding instances $(1, x_{k1}^2, x_{k2}^2)$ and $(1, x_{\ell 1}^2, x_{\ell 2}^2)$, resp., in the quadratic feature space is

$$1 + x_{k1}^2 x_{\ell 1}^2 + x_{k2}^2 x_{\ell 2}^2$$

which resembles (but is not equal to)

$$(\tilde{x}_k \cdot \tilde{x}_\ell)^2 = (1 + x_{k1} x_{\ell 1} + x_{k2} x_{\ell 2})^2 = 1 + x_{k1}^2 x_{\ell 1}^2 + x_{k2}^2 x_{\ell 2}^2 + 2 x_{k1} x_{\ell 1} x_{k2} x_{\ell 2} + 2 x_{k1} x_{\ell 1} + 2 x_{k2} x_{\ell 2}$$

But now consider a mapping to $\mathbb{R}^6$ defined by

$$\phi(\tilde{x}_k) = \left(1,\; x_{k1}^2,\; x_{k2}^2,\; \sqrt{2}\, x_{k1} x_{k2},\; \sqrt{2}\, x_{k1},\; \sqrt{2}\, x_{k2}\right)$$

Then $\phi(\tilde{x}_k) \cdot \phi(\tilde{x}_\ell) = (\tilde{x}_k \cdot \tilde{x}_\ell)^2$.

THE Idea: Define a kernel $\kappa(\tilde{x}_k, \tilde{x}_\ell) = (\tilde{x}_k \cdot \tilde{x}_\ell)^2$ and replace $\tilde{x}_k \cdot \tilde{x}_\ell$ in the dual perceptron algorithm with $\kappa(\tilde{x}_k, \tilde{x}_\ell)$.

Kernel Perceptron Learning

Kernel perceptron learning algorithm: compute a sequence of vectors of numbers $n^{(0)}, n^{(1)}, \dots$ where each $n^{(t)} = (n_1^{(t)}, \dots, n_p^{(t)}) \in \mathbb{N}^p$.
► $n^{(0)}$ is initialized to $(0, \dots, 0)$.
► In the $(t+1)$-th step, $(n_1^{(t+1)}, \dots, n_p^{(t+1)})$ is computed as follows:
  ► If $\mathrm{sig}\left(\sum_{\ell=1}^{p} n_\ell^{(t)} \cdot y_\ell \cdot \kappa(\tilde{x}_k, \tilde{x}_\ell)\right) \ne y_k$, then $n_k^{(t+1)} := n_k^{(t)} + 1$; else $n_k^{(t+1)} := n_k^{(t)}$.
  ► $n_\ell^{(t+1)} := n_\ell^{(t)}$ for all $\ell \ne k$.
  Here $k = (t \bmod p) + 1$, i.e. the examples are considered cyclically.

Intuition: The algorithm computes a linear classifier in $\mathbb{R}^6$ for training examples transformed using $\phi$. The trick is that the transformation itself does not have to be explicitly computed!

Kernel Perceptron Learning — Output

Let $n = (n_1, \dots, n_p)$ result from the kernel perceptron learning algorithm, i.e., each $n_\ell = n_\ell^{(t^*)} \in \mathbb{N}$ for a suitable $t^*$ such that $\mathrm{sig}\left(\sum_{\ell=1}^{p} n_\ell^{(t^*)} \cdot y_\ell \cdot \kappa(\tilde{x}_k, \tilde{x}_\ell)\right) = y_k$ for all $k = 1, \dots, p$.

We obtain a non-linear classifier that for a given $x \in \mathbb{R}^n$ gives

$$h(x) = \mathrm{sig}\left(\sum_{\ell=1}^{p} n_\ell \cdot y_\ell \cdot \kappa(\tilde{x}_\ell, \tilde{x})\right)$$

(Here $\tilde{x}$ is the augmented feature vector obtained from $x$.)

Are there other kernels that correspond to the scalar product in higher-dimensional spaces?
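Before turning to that question, here is a sketch (mine, not from the lecture) of the kernel perceptron with the quadratic kernel $\kappa(u, v) = (u \cdot v)^2$, preceded by a numerical check of the identity $\phi(u) \cdot \phi(v) = (u \cdot v)^2$ for the explicit map $\phi$ above.

```python
import numpy as np

def quad_kernel(u, v):
    """Quadratic kernel on augmented vectors: kappa(u, v) = (u . v)^2."""
    return float(np.dot(u, v)) ** 2

def phi(x):
    """Explicit quadratic feature map for augmented bivariate x = (1, x1, x2)."""
    _, x1, x2 = x
    return np.array([1, x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2])

# Sanity check of the kernel trick identity phi(u) . phi(v) == (u . v)^2
u, v = np.array([1.0, 2.0, -1.0]), np.array([1.0, 1.0, 3.0])
assert np.isclose(phi(u) @ phi(v), quad_kernel(u, v))

def kernel_perceptron(X, y, kernel, max_epochs=100):
    """Kernel perceptron: identical to the dual perceptron, with the dot
    products x~_l . x~_k replaced by kernel(x~_l, x~_k)."""
    p = X.shape[0]
    X_aug = np.hstack([np.ones((p, 1)), X])
    K = np.array([[kernel(X_aug[l], X_aug[k]) for k in range(p)] for l in range(p)])
    n = np.zeros(p)
    for _ in range(max_epochs):
        mistakes = 0
        for k in range(p):
            s = np.sum(n * y * K[:, k])        # sum_l n_l y_l kappa(x~_l, x~_k)
            if (1 if s >= 0 else -1) != y[k]:
                n[k] += 1
                mistakes += 1
        if mistakes == 0:
            return n
    return None
```

The Gram matrix `K` is computed once up front; classifying a new point only requires kernel evaluations against the training examples, and the six-dimensional map $\phi$ is never constructed explicitly.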
Kernels

Given a (potential) kernel $\kappa(\tilde{x}_\ell, \tilde{x}_k)$, we need to check whether $\kappa(\tilde{x}_\ell, \tilde{x}_k) = \phi(\tilde{x}_\ell) \cdot \phi(\tilde{x}_k)$ for some function $\phi$. This might be very difficult.

Theorem (Mercer)
$\kappa$ is a kernel if the corresponding Gram matrix $K$ of the training set $D$, whose $\ell k$-th element is $\kappa(\tilde{x}_\ell, \tilde{x}_k)$, is positive semi-definite for all possible choices of the training set $D$.

Kernels can be constructed from existing kernels by several operations:
► linear combination (i.e. multiplication by a constant, or sum),
► multiplication,
► exponentiation,
► multiplication by a polynomial with non-negative coefficients,
► ... (see e.g. "Pattern Recognition and Machine Learning" by Bishop)

Examples of Kernels

► Linear: $\kappa(\tilde{x}_\ell, \tilde{x}_k) = \tilde{x}_\ell \cdot \tilde{x}_k$
  The corresponding mapping $\phi(\tilde{x}) = \tilde{x}$ is the identity (no transformation).
► Polynomial of power $m$: $\kappa(\tilde{x}_\ell, \tilde{x}_k) = (1 + \tilde{x}_\ell \cdot \tilde{x}_k)^m$
  The corresponding mapping assigns to $\tilde{x} \in \mathbb{R}^n$ a vector $\phi(\tilde{x})$ in $\mathbb{R}^{\binom{n+m}{m}}$.
► Gaussian (radial-basis function): $\kappa(\tilde{x}_\ell, \tilde{x}_k) = e^{-\frac{\|\tilde{x}_\ell - \tilde{x}_k\|^2}{2\sigma^2}}$
  The corresponding mapping $\phi$ maps $\tilde{x}$ to an infinite-dimensional vector $\phi(\tilde{x})$ which is, in fact, a Gaussian function; a combination of such functions for the support vectors is then the separating hypersurface.
► ...

Choosing kernels remains the black magic of kernel methods. They are usually chosen based on trial and error (of course, experience and additional insight into the data help). The standard kernels above, together with a Gram-matrix check in the spirit of Mercer's condition, are sketched in code below.
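To make the examples concrete, here is a sketch (my own; all names are illustrative) of the three kernels above as plain functions, together with a rough numerical check in the spirit of Mercer's condition: the Gram matrix on one random sample must be positive semi-definite. This is evidence, not a proof, since the condition quantifies over all possible training sets.

```python
import numpy as np

def linear_kernel(u, v):
    return float(np.dot(u, v))

def polynomial_kernel(u, v, m=2):
    return float(1.0 + np.dot(u, v)) ** m

def rbf_kernel(u, v, sigma=1.0):
    return float(np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2)))

def gram_matrix(X, kernel):
    """Gram matrix K with K[l, k] = kernel(x_l, x_k)."""
    p = len(X)
    return np.array([[kernel(X[l], X[k]) for k in range(p)] for l in range(p)])

def looks_psd(K, tol=1e-9):
    """True if all eigenvalues of the symmetric matrix K are >= -tol."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
for kernel in (linear_kernel, polynomial_kernel, rbf_kernel):
    print(kernel.__name__, looks_psd(gram_matrix(X, kernel)))   # all True
```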
Now let's go on to the main area where kernel methods are used: to enhance support vector machines.

SVM Idea — Which Linear Classifier is the Best?

Benefits of maximum margin:
► Intuitively, maximum margin is good w.r.t. generalization.
► Only the support vectors (those on the margin) matter; others can, in principle, be ignored.

Support Vector Machines (SVM)

Notation:
► $w = (w_0, w_1, \dots, w_n)$ a vector of weights,
► $\underline{w} = (w_1, \dots, w_n)$ a vector of all weights except $w_0$,
► $x = (x_1, \dots, x_n)$ a (generic) feature vector.

Consider a linear classifier:

$$h[w](x) := \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^{n} w_i \cdot x_i = w_0 + \underline{w} \cdot x \ge 0 \\ -1 & \text{if } w_0 + \sum_{i=1}^{n} w_i \cdot x_i = w_0 + \underline{w} \cdot x < 0 \end{cases}$$

The signed distance of $x$ from the decision boundary determined by $w$ is

$$d[w](x) = \frac{w_0 + \underline{w} \cdot x}{\|\underline{w}\|}$$

Here $\|\underline{w}\| = \sqrt{\sum_{i=1}^{n} w_i^2}$ is the Euclidean norm of $\underline{w}$.

$|d[w](x)|$ is the distance of $x$ from the decision boundary. $d[w](x)$ is positive for $x$ on the side to which $\underline{w}$ points and negative on the opposite side.
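A tiny numeric illustration of the signed distance, with made-up weights $w = (w_0, w_1, w_2) = (-1, 1, 1)$ so that the decision boundary is the line $x_1 + x_2 = 1$:

```python
import numpy as np

w0, w_ = -1.0, np.array([1.0, 1.0])            # toy weights: boundary x1 + x2 = 1

def signed_distance(x):
    return (w0 + w_ @ x) / np.linalg.norm(w_)

print(signed_distance(np.array([2.0, 2.0])))   # positive: side w_ points to (~ 2.12)
print(signed_distance(np.array([0.0, 0.0])))   # negative: opposite side (~ -0.71)
```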
Support Vectors & Margin

► Given a training set
  $D = \{(x_1, y(x_1)), (x_2, y(x_2)), \dots, (x_p, y(x_p))\}$.
  Here $x_k = (x_{k1}, \dots, x_{kn}) \in X \subseteq \mathbb{R}^n$ and $y(x_k) \in \{-1, 1\}$; we write $y_k$ for $y(x_k)$ and assume that $D$ is linearly separable.
► The support vectors are the examples closest to the decision boundary; the margin $\rho$ is twice the distance of the support vectors from the boundary.
► We rescale $w$ so that $y_k \cdot (w_0 + \underline{w} \cdot x_k) \ge 1$ for all $x_k$; the margin is then $\rho = 2 / \|\underline{w}\|$.
  (We only demand the looser condition $\ge 1$ for all $x_k$. However, the result is the same since even with this looser condition, the support vectors always satisfy $y_k \cdot (w_0 + \underline{w} \cdot x_k) = 1$ whenever $2/\|\underline{w}\|$ is maximal.)

SVM — Optimization

Margin maximization can be formulated as a quadratic optimization problem:

Find $w = (w_0, w_1, \dots, w_n)$ such that
$$\rho = \frac{2}{\|\underline{w}\|} \text{ is maximized}$$
and for all $(x_k, y_k) \in D$ we have $y_k \cdot (w_0 + \underline{w} \cdot x_k) \ge 1$.

This can be reformulated as: find $w$ such that
$$\|\underline{w}\|^2 = \underline{w} \cdot \underline{w} \text{ is minimized}$$
and for all $(x_k, y_k) \in D$ we have $y_k \cdot (w_0 + \underline{w} \cdot x_k) \ge 1$.

► We need to optimize a quadratic function subject to linear constraints.
► Quadratic optimization problems are a well-known class of mathematical programming problems for which efficient methods (and tools) exist.
► The solution usually involves construction of a dual problem where Lagrange multipliers $\alpha_\ell$ are associated with every inequality (constraint) in the original problem:

Find $\alpha = (\alpha_1, \dots, \alpha_p)$ such that
$$\Psi(\alpha) = \sum_{\ell=1}^{p} \alpha_\ell - \frac{1}{2} \sum_{\ell=1}^{p} \sum_{k=1}^{p} \alpha_\ell \cdot \alpha_k \cdot y_\ell \cdot y_k \cdot x_\ell \cdot x_k \text{ is maximized}$$
subject to the constraints:
► $\sum_{\ell=1}^{p} \alpha_\ell y_\ell = 0$
► $\alpha_\ell \ge 0$ for all $1 \le \ell \le p$

The Optimization Problem Solution

► Given a solution $\alpha_1, \dots, \alpha_p$ to the dual problem, the solution $w = (w_0, w_1, \dots, w_n)$ to the original one is:
$$\underline{w} = (w_1, \dots, w_n) = \sum_{\ell=1}^{p} \alpha_\ell \cdot y_\ell \cdot x_\ell \qquad\quad w_0 = y_k - \sum_{\ell=1}^{p} \alpha_\ell \cdot y_\ell \cdot x_\ell \cdot x_k \ \text{ for an arbitrary } \alpha_k > 0$$
  Note that $\alpha_k > 0$ iff $x_k$ is a support vector. Hence it does not matter which $\alpha_k > 0$ is chosen in the above definition of $w_0$.
► The classifier is then
$$h(x) = \mathrm{sig}(w_0 + \underline{w} \cdot x) = \mathrm{sig}\left(y_k - \sum_{\ell} \alpha_\ell \cdot y_\ell \cdot x_\ell \cdot x_k + \sum_{\ell} \alpha_\ell \cdot y_\ell \cdot x_\ell \cdot x\right)$$

Note that both the dual optimization problem and the classifier contain the training feature vectors only in scalar products! We may apply the kernel trick! (A naive numerical sketch of this quadratic program follows.)
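The following is a naive numerical sketch (not a real SVM implementation) of the hard-margin dual above, using SciPy's general-purpose SLSQP solver on a made-up four-point data set; the recovered $w$, $w_0$ and predictions follow the formulas from the slide.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (made up for the example).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
p = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)      # Q[l,k] = y_l y_k (x_l . x_k)

def neg_dual(alpha):                           # minimize -Psi(alpha)
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(p),
    method="SLSQP",
    bounds=[(0.0, None)] * p,                                 # alpha_l >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],     # sum_l alpha_l y_l = 0
)
alpha = res.x

w = (alpha * y) @ X                            # w = sum_l alpha_l y_l x_l
k = int(np.argmax(alpha))                      # any support vector (alpha_k > 0)
w0 = y[k] - (alpha * y) @ (X @ X[k])           # w0 = y_k - sum_l alpha_l y_l (x_l . x_k)

print("alpha:", np.round(alpha, 3))            # roughly (0.4, 0, 0, 0.4)
print("w:", np.round(w, 3), "w0:", round(float(w0), 3))
print("predictions:", np.sign(X @ w + w0))     # expected to equal y
```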
Kernel SVM

► Choose your favourite kernel $\kappa$.
► Solve the dual problem with the kernel $\kappa$:

Find $\alpha = (\alpha_1, \dots, \alpha_p)$ such that
$$\Psi(\alpha) = \sum_{\ell=1}^{p} \alpha_\ell - \frac{1}{2} \sum_{\ell=1}^{p} \sum_{k=1}^{p} \alpha_\ell \cdot \alpha_k \cdot y_\ell \cdot y_k \cdot \kappa(x_\ell, x_k) \text{ is maximized}$$
subject to the constraints:
► $\sum_{\ell=1}^{p} \alpha_\ell y_\ell = 0$
► $\alpha_\ell \ge 0$ for all $1 \le \ell \le p$

► Then use the classifier:
$$h(x) = \mathrm{sig}\left(y_k - \sum_{\ell} \alpha_\ell \cdot y_\ell \cdot \kappa(x_\ell, x_k) + \sum_{\ell} \alpha_\ell \cdot y_\ell \cdot \kappa(x_\ell, x)\right)$$
  (here $x_k$ is an arbitrary support vector, i.e. $\alpha_k > 0$, as before)
► Note that the optimization techniques remain the same as for the linear SVM without kernels! (A kernelized version of the earlier QP sketch follows.)
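To illustrate that only the Gram matrix changes, here is the previous naive QP sketch adapted to the polynomial kernel of power 2 on a made-up data set that is not linearly separable in the original space (negative points surrounding positive points near the origin); again a sketch, not a production solver.

```python
import numpy as np
from scipy.optimize import minimize

def kappa(u, v):
    return float(1.0 + np.dot(u, v)) ** 2      # polynomial kernel of power 2

# Class +1 near the origin, class -1 further out: separable only by a curved boundary.
X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
              [2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0])
p = len(y)

K = np.array([[kappa(X[l], X[k]) for k in range(p)] for l in range(p)])
Q = (y[:, None] * y[None, :]) * K              # only the Gram matrix changes

res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
               x0=np.zeros(p), method="SLSQP",
               bounds=[(0.0, None)] * p,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
k = int(np.argmax(alpha))                      # a support vector, alpha_k > 0

def h(x):
    """Kernel SVM classifier: sig(y_k - sum_l a_l y_l k(x_l, x_k) + sum_l a_l y_l k(x_l, x))."""
    b = y[k] - sum(alpha[l] * y[l] * kappa(X[l], X[k]) for l in range(p))
    s = b + sum(alpha[l] * y[l] * kappa(X[l], x) for l in range(p))
    return 1 if s >= 0 else -1

print([h(x) for x in X])                       # expected to reproduce the labels
```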
Comments on Algorithms

► The main bottleneck of SVMs is the complexity of quadratic programming (QP). A naive QP solver has cubic complexity.
► For small problems, any general-purpose optimization algorithm can be used.
► For large problems this is usually not possible; many methods avoiding the direct solution have been devised.
► These methods usually decompose the optimization problem into a sequence of smaller ones. Intuitively (a naive sketch of this procedure follows the list):
  ► Start with a (smaller) subset of training examples.
  ► Find an optimal solution using any solver.
  ► Afterwards, only the support vectors matter in the solution! Leave only them in the training set, and add new training examples.
  ► This iterative procedure decreases the (general) cost function.
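The sketch below (my own, a deliberately naive illustration rather than SMO or a real chunking algorithm) mimics this decomposition: it solves the hard-margin dual on a working set with an SLSQP helper like the one in the earlier sketch, keeps only the support vectors, adds the next chunk of examples, and repeats.

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual(X, y):
    """Hard-margin dual on a (small) working set, as in the earlier sketch."""
    p = len(y)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)
    res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
                   x0=np.zeros(p), method="SLSQP",
                   bounds=[(0.0, None)] * p,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    return res.x

def chunking_svm(X, y, chunk=20, tol=1e-6):
    """Naive chunking: solve the dual on a working set, keep only its support
    vectors, add the next chunk of examples, repeat until data is exhausted."""
    idx = np.arange(min(chunk, len(y)))         # initial working set
    start = len(idx)
    while True:
        alpha = solve_dual(X[idx], y[idx])
        if start >= len(y):
            return idx, alpha
        sv = idx[alpha > tol]                   # only support vectors are kept
        nxt = np.arange(start, min(start + chunk, len(y)))
        idx = np.unique(np.concatenate([sv, nxt]))
        start += chunk

# Usage on made-up, well-separated data:
rng = np.random.default_rng(2)
Xp = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(40, 2))
Xn = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(40, 2))
X = np.vstack([Xp, Xn]); y = np.hstack([np.ones(40), -np.ones(40)])
order = rng.permutation(len(y))                 # shuffle the two classes together
idx, alpha = chunking_svm(X[order], y[order])
print("final working set:", len(idx), "support vectors:", int(np.sum(alpha > 1e-6)))
```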
SVM in Applications (Mooney's lecture)

► SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
► SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
► SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.
► SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc.
► Most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the $\alpha_\ell$'s at a time, e.g. SMO [Platt '99] and [Joachims '99].
► Tuning SVMs remains a black art: selecting a specific kernel and parameters is usually done in a try-and-see manner.