PA196: Pattern Recognition
06. Support vector machines
Dr. Vlad Popovici
popovici@iba.muni.cz
Institute of Biostatistics and Analyses, Masaryk University, Brno

Outline
1 Basic statistical learning theory
  Introduction
  VC bounds on generalization error
  The VC dimension
2 Support vector machines
  Linear SVMs
  Nonlinear SVM
  The VC dimension of SVM
  Posterior probabilities for SVM
3 Other kernel methods

1 Basic statistical learning theory

Introduction

- Statistical learning theory (SLT) views the problem of "learning" from a statistical perspective.
- Its aim (as for any theory) is to model some phenomena so that we can make predictions about them.
- Other equally valid theories exist: Bayesian inference, inductive inference, statistical physics, "traditional" statistical analysis, etc.
- Some assumptions need to be made, and these may determine which approach is more suitable in a given context.

In SLT:
- we assume the data is generated by some underlying (unknown) distribution P(x, y)
- a sample of n i.i.d. observations drawn from P is available to the learner:
  S = {(xi, yi) ∈ R^d × {±1} | i = 1, ..., n}
- there is a learning algorithm A that chooses a function f = A_F(S) from a function space F as a result of training on S
- generalization error (expected error):
  ε(S, A, F) = E_{(x,y)}[ l(A_F(S), x, y) ]
  where l is a loss function

- We are interested not only in E_S[ε(S, A, F)], but also in the distribution of ε(S, A, F).
- Classifier consistency:
  lim_{n→∞} E_S[ε(S, A, F)] = ε_Bayes
  where ε_Bayes is the Bayes risk.
- The distribution of ε(S, A, F) depends on the algorithm, on F and on n.

- Classical statistics investigates mostly the mean value of the distribution of ε.
- SLT looks also at the tails; it derives probabilistic bounds on the generalization error.
- Hence PAC: probably approximately correct - bound the probability of being "deceived" and set it equal to some δ.
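The following is a small simulation sketch (not part of the original slides) of this setup: a distribution P(x, y), a sample S, an algorithm A that minimizes the empirical error over a simple function class (thresholds on the real line), and the resulting distribution of the generalization error over repeated samples. The concrete P(x, y), the threshold family and all names are illustrative assumptions.

```python
# The generalization error eps(S, A, F) is a random variable (it depends on the
# random sample S); we can therefore look at its whole distribution, not only
# at its mean.
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def gauss_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def true_error(t):
    # Assumed P(x, y): y = +/-1 with prob. 1/2, x | y ~ N(y, 1);
    # classifier f_t(x) = sign(x - t).
    return 0.5 * (gauss_cdf(t - 1.0) + (1.0 - gauss_cdf(t + 1.0)))

def learn_threshold(x, y):
    # The "algorithm" A: pick the threshold with the smallest empirical error.
    candidates = np.linspace(-3, 3, 301)
    errs = [np.mean(np.sign(x - t) != y) for t in candidates]
    return candidates[int(np.argmin(errs))]

n = 50
errors = []
for _ in range(2000):                      # 2000 independent samples S of size n
    y = rng.choice([-1, 1], size=n)
    x = y + rng.standard_normal(n)
    errors.append(true_error(learn_threshold(x, y)))

errors = np.array(errors)
print("Bayes risk      :", round(true_error(0.0), 4))
print("mean error      :", round(errors.mean(), 4))
print("95th percentile :", round(np.percentile(errors, 95), 4))
```

The gap between the mean and the upper percentile is exactly the kind of tail behaviour that the PAC bounds below try to control.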
VC bounds on generalization error

What is the probability of being deceived by a "bad" function f, i.e. of having a perfect training error while the true error is above some ε? For a fixed f with Err(f) > ε,

  P_S{ Err_S(f) = 0, Err(f) > ε } = (1 − Err(f))^n ≤ (1 − ε)^n ≤ exp(−εn)

Taking ε = (1/n) ln(1/δ) leads to

  P_S{ Err_S(f) = 0, Err(f) > (1/n) ln(1/δ) } ≤ δ

Now consider a (countable) set of functions F = {f1, ..., fk, ...} and require the probability of being misled by fk to be less than qk·δ (with Σ_k qk ≤ 1). Then

  P_S{ ∃ fk : Err_S(fk) = 0, Err(fk) > (1/n) ln(1/(qk δ)) } ≤ δ

Theorem
Given a countable set of functions F and Σ_k qk ≤ 1, with probability at least 1 − δ over random samples of size n, the generalization error of a function fk ∈ F with zero training error is bounded by

  Err(fk) ≤ (1/n) ( ln(1/qk) + ln(1/δ) )

Note: ln(1/qk) can be thought of as a "complexity" (description length) of the function fk.

- Use the 0-1 loss: (1/2)|yi − f(xi, α)| ∈ {0, 1}.
- The expected risk (actual risk) is
  R(α) = ∫ (1/2)|y − f(x, α)| dP(x, y)
- The empirical risk is measured over an observed set (here of size n):
  Remp(α) = (1/(2n)) Σ_{i=1}^n |yi − f(xi, α)|

For such losses, the following bound holds (Vapnik, 1995): for η ∈ [0, 1], with probability 1 − η,

  R(α) ≤ Remp(α) + sqrt( [ h (log(2n/h) + 1) − log(η/4) ] / n )

- h is a non-negative integer called the Vapnik-Chervonenkis (VC) dimension; it is a measure of the capacity of the set of functions {f(·, α)}.
- The second term on the right-hand side (the square root) is called the VC confidence.

Notes:
- the bound is independent of P(x, y)
- if we knew h, we could compute the right-hand side
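A small helper (not from the slides) that evaluates the VC confidence term for given h, n and η illustrates how the guarantee degrades as the capacity h grows relative to n; the chosen values of n, η and h are arbitrary examples.

```python
# Evaluate the VC confidence term of Vapnik's bound for given h, n and eta.
import numpy as np

def vc_confidence(h, n, eta):
    """sqrt( [ h*(log(2n/h) + 1) - log(eta/4) ] / n )"""
    return np.sqrt((h * (np.log(2.0 * n / h) + 1.0) - np.log(eta / 4.0)) / n)

n, eta = 10_000, 0.05
for h in (10, 100, 1000, 5000):
    print(f"h = {h:5d}  ->  VC confidence = {vc_confidence(h, n, eta):.3f}")

# The guaranteed risk is R <= Remp + vc_confidence(h, n, eta): for a fixed n,
# a larger VC dimension h makes the guarantee weaker (eventually vacuous).
```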
The VC dimension

- The VC dimension is a characteristic of a set of functions F = {f(x, α)}; here we restrict the analysis to functions with values in {±1}.
- n points can be labeled in 2^n distinct ways.
- If for any labeling of the set of points a function f(x, α) can be found in F that realizes it, we say that F shatters the set of points.
- The VC dimension h of F is the maximum number of points that can be shattered by F.
- If the VC dimension of F is h, it means that there exists at least one set of h points that can be shattered - not that all such sets can be shattered.

Shattering points with oriented hyperplanes in R^d
The VC dimension of the set of oriented hyperplanes in R^d is d + 1.

Notes:
- h does not depend on the number of parameters a family of functions has
- for two machines having null empirical risk, the one with smaller h has better generalization guarantees
- a k-NN classifier with k = 1 has h = ∞ and null empirical risk → the bound becomes useless
- h depends on the class of functions F, while R and Remp depend on the particular function selected by the learning machine
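A brute-force check (not from the slides, assuming numpy and scipy are available) of the hyperplane result in R^2: three points in general position can be shattered, while four points cannot, consistent with h = d + 1 = 3. Linear separability of a labeling is tested as a linear-programming feasibility problem; the point sets are arbitrary examples.

```python
# Brute-force shattering check for oriented hyperplanes (linear classifiers) in R^2.
# A labeling y is realizable iff the LP "find w, b with y_i*(w.x_i + b) >= 1" is feasible.
from itertools import product
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    # variables: (w1, w2, b); constraints: -y_i*(w.x_i + b) <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3)
    return res.success

def shattered(X):
    return all(separable(X, np.array(y))
               for y in product([-1, 1], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])                 # general position
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])      # contains the XOR labeling

print("3 points shattered:", shattered(three))   # True  -> h >= 3
print("4 points shattered:", shattered(four))    # False -> consistent with h = d + 1 = 3
```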
Structural risk minimization

(Figure: nested subsets of the function class, with VC dimensions h1, h2, h3, ..., hk.)

- We introduce a structure over the set of functions, such that h1 < h2 < ... < hk < ...
- Idea: find the subset of functions which minimizes the empirical risk, while controlling the complexity (the VC dimension of the subset).

2 Support vector machines

Linear SVMs

(Reminder)
(Figure: separating hyperplane with normal vector w, margin width 2/‖w‖, distance to the origin −w0/‖w‖, and margin violations at distance ξ/‖w‖.)

  minimize_{w, w0, ξ}  (1/2)⟨w, w⟩ + Ω(ξ)
  subject to  yi(⟨w, xi⟩ + w0) ≥ 1 − ξi,  i = 1, ..., n
              ξi ≥ 0,  i = 1, ..., n

- Ω(ξ) = C Σ_i ξi^p; p = 1 → 1-norm (L1) soft margin SVM, p = 2 → 2-norm (L2) soft margin SVM.
- w0 can be computed from w0 = yi − ⟨w, xi⟩ (for a support vector xi); a more stable solution is obtained by averaging over all support vectors (SVs):
  w0 = (1/|SV|) Σ_{i∈SV} (yi − ⟨w, xi⟩)

L1 SVM
Dual optimization problem (from the KKT conditions):

  maximize_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
  subject to  Σ_{i=1}^n yi αi = 0
              0 ≤ αi ≤ C,  i = 1, ..., n   (box constraints)

Notes:
- if αi = 0 then ξi = 0 and it follows that xi is correctly classified
- if 0 < αi < C then yi(⟨w, xi⟩ + w0) − 1 + ξi = 0 and ξi = 0, meaning that xi is an unbounded support vector
- if αi = C then yi(⟨w, xi⟩ + w0) − 1 + ξi = 0 and ξi > 0, meaning that xi is a bounded support vector; moreover, if ξi < 1 then xi is still correctly classified, otherwise it is misclassified
- w0 is obtained as before, but averaging over the unbounded SVs
- the discriminant function is
  h(x) = Σ_{i∈SV} αi yi ⟨xi, x⟩ + w0;   predict y = +1 if h(x) > 0, y = −1 if h(x) < 0

L2 SVM
For convenience, we take Ω(ξ) = (C/2) Σ_i ξi^2, which leads to the dual optimization problem

  maximize_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n yi yj αi αj ( ⟨xi, xj⟩ + δij/C )
  subject to  Σ_{i=1}^n yi αi = 0
              αi ≥ 0,  i = 1, ..., n

where δij is Kronecker's delta.

Notes:
- w0 is computed by averaging over terms of the form
  yi − Σ_{j=1}^n αj yj ( ⟨xj, xi⟩ + δij/C )
- the decision function remains the same
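A minimal sketch (not from the slides) of the L1 soft-margin case using scikit-learn's SVC, assuming scikit-learn is available and using a synthetic two-class dataset: it recovers w from the dual coefficients and w0 by averaging yi − ⟨w, xi⟩ over the unbounded support vectors, as described above, and compares with the solver's own intercept.

```python
# Linear L1 soft-margin SVM: recover w = sum_i alpha_i y_i x_i and
# w0 = average of (y_i - <w, x_i>) over unbounded SVs (0 < alpha_i < C).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, C = 200, 1.0
y = rng.choice([-1, 1], size=n)
X = y[:, None] * [2.0, 1.0] + rng.standard_normal((n, 2))   # two overlapping classes

clf = SVC(kernel="linear", C=C).fit(X, y)

alpha_y = clf.dual_coef_[0]                  # = y_i * alpha_i for the support vectors
sv = clf.support_vectors_
w = alpha_y @ sv                             # w = sum_i alpha_i y_i x_i
unbounded = np.abs(np.abs(alpha_y) - C) > 1e-6   # alpha_i strictly below C
y_sv = np.sign(alpha_y)                      # labels of the support vectors
w0 = np.mean(y_sv[unbounded] - sv[unbounded] @ w)

print("w  (from dual):", w, " vs coef_:", clf.coef_[0])
print("w0 (avg over unbounded SVs):", round(w0, 4),
      " vs intercept_:", round(float(clf.intercept_[0]), 4))
```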
Nonlinear SVM

The kernel trick
- The SVM problem was formulated entirely in terms of inner products.
- Let Φ : R^d → H be a mapping from the input space into a feature space, and suppose that there exists a "kernel function" such that K(xi, xj) = ⟨Φ(xi), Φ(xj)⟩.
- H may be infinite-dimensional, e.g. for
  K(xi, xj) = exp( −‖xi − xj‖^2 / (2σ^2) )
- If we replace ⟨xi, xj⟩ with K(xi, xj) in the linear SVM, we obtain a nonlinear SVM!

(Figure: the mapping Φ(x) from the input space to the feature space.)

Discriminant function:
  h(x) = Σ_{i∈SV} αi yi K(xi, x) + w0

Which functions can be used as kernels?
For some kernels it is easy to find the corresponding mapping Φ: for example, K(xi, xj) = ⟨xi, xj⟩^2 corresponds to
  Φ : R^2 → R^3,   Φ(x) = (x1^2, √2 x1 x2, x2^2)
In general, for a given kernel there may exist several possible mappings Φ.
(Example from Burges: A tutorial on support vector machines for pattern recognition.)
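A quick numerical check (not from the slides) of this explicit feature map: the kernel value ⟨x, z⟩^2 computed in the input space coincides with the inner product of the mapped vectors.

```python
# Check that K(x, z) = <x, z>^2 equals <Phi(x), Phi(z)>
# with Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) on R^2.
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
x, z = rng.standard_normal(2), rng.standard_normal(2)

k_direct = np.dot(x, z) ** 2
k_mapped = np.dot(phi(x), phi(z))
print(k_direct, k_mapped, np.isclose(k_direct, k_mapped))   # the two values coincide
```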
(Theoretical conditions for kernels)

Mercer's condition
There exist a mapping Φ and an expansion K(x, z) = Σ_i Φi(x) Φi(z) if and only if, for any g(x) such that ∫ g(x)^2 dx < ∞,

  ∫∫ K(x, z) g(x) g(z) dx dz ≥ 0

Notes:
- if Mercer's condition is not satisfied, there might exist cases for which the optimization problem has no solution
- the space generated by the kernel is called a Reproducing Kernel Hilbert Space (RKHS)
- kernel matrix (Gram matrix): Kij = K(xi, xj); Hessian matrix: Hij = yi yj K(xi, xj)
- K is positive semi-definite
- in the L2 SVM, the diagonal of K is augmented by 1/C, thus potentially transforming K into a positive definite matrix
- all the information about the data is concentrated in K
- K can be seen as defining a similarity between samples

Commonly used kernels:
- linear kernel: K(x, z) = ⟨x, z⟩
- polynomial kernel: K(x, z) = (⟨x, z⟩ + 1)^p
- radial basis function (RBF) kernel: K(x, z) = exp( −‖x − z‖^2 / (2σ^2) )
- sigmoid kernel: K(x, z) = tanh(κ⟨x, z⟩ − δ); this kernel does not always satisfy Mercer's condition!

Kernels - closure properties
If K1 and K2 are kernels, a ∈ R+, f is a real-valued function, φ : R^d → R^m with K3 a kernel over R^m × R^m, and B is a symmetric positive semi-definite d × d matrix, then the following are also kernels:
- K1(x, z) + K2(x, z)
- aK1(x, z)
- K1(x, z) K2(x, z)
- K(x, z) = f(x) f(z)
- K(x, z) = K3(φ(x), φ(z))
- K(x, z) = x^t B z

The solution of the optimization problem is
- global: any local solution of a convex optimization problem is also a global solution
- unique: if the Hessian matrix is positive definite, the solution is guaranteed to be unique

In case the solution is not unique, it is still global: if w1 and w2 are solutions, then every point on the path w(τ) = τw1 + (1 − τ)w2, 0 ≤ τ ≤ 1, is also a solution.
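A short numerical illustration (not from the slides) of two of the points above: the Gram matrix of the RBF kernel is positive semi-definite, and adding 1/C to its diagonal, as in the L2 SVM, shifts all eigenvalues up by 1/C and makes it positive definite. The data and the values of σ and C are arbitrary examples.

```python
# PSD check of an RBF Gram matrix and of its L2-SVM diagonal augmentation.
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
C = 10.0

K = rbf_kernel_matrix(X)
eig = np.linalg.eigvalsh(K)
eig_aug = np.linalg.eigvalsh(K + np.eye(len(X)) / C)

print("min eigenvalue of K       :", eig.min())      # >= 0 (up to round-off)
print("min eigenvalue of K + I/C :", eig_aug.min())  # >= 1/C, strictly positive
```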
The VC dimension of SVM

- For a Mercer kernel K, the VC dimension of the SVM is dim(H) + 1.
- The VC dimension of the RKHS generated by the polynomial kernel is the binomial coefficient C(d + p − 1, p), where p is the degree of the polynomial.
- The VC dimension in the case of an RBF kernel is infinite.

How come SVM can have very good generalization performance, even in the case of an infinite VC dimension? Hint: it has to do with the large margin...

Another bound on the generalization error:

  E[P(error)] ≤ E[no. of SVs] / n

where E[no. of SVs] is the expected number of support vectors over all training sets of size n.

Posterior probabilities for SVM

Platt scaling
Idea: apply a logistic transformation to the classifier score (margin):

  P(y = +1 | x) = 1 / (1 + exp(αh(x) + β))

The parameters α and β are found by optimization.
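A rough sketch (not from the slides) of this idea, assuming scikit-learn is available: an SVM is trained on one half of a synthetic dataset, and a one-parameter-per-coefficient logistic model is fitted on the decision scores of a held-out calibration half, playing the role of the (α, β) fit. Platt's original procedure uses a dedicated maximum-likelihood fit of exactly these two parameters; the plain LogisticRegression used here is a stand-in, and scikit-learn's SVC(probability=True) performs a similar calibration internally.

```python
# Sketch of Platt scaling: fit a sigmoid on held-out SVM scores h(x)
# to obtain P(y = +1 | x) ~= 1 / (1 + exp(a*h(x) + b)).
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 400
y = rng.choice([-1, 1], size=n)
X = y[:, None] * [1.0, 0.5] + rng.standard_normal((n, 2))

X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

svm = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
scores_cal = svm.decision_function(X_cal).reshape(-1, 1)

# Logistic regression on the 1-D score plays the role of fitting alpha and beta.
platt = LogisticRegression().fit(scores_cal, y_cal)

x_new = np.array([[1.0, 0.5], [-1.0, -0.5]])
s_new = svm.decision_function(x_new).reshape(-1, 1)
pos_col = list(platt.classes_).index(1)
print("P(y = +1 | x):", platt.predict_proba(s_new)[:, pos_col])
```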
Some remarks
- SVM has a good overall performance on a large number of problems - but it is not the "Swiss army knife" of pattern recognition.
- One key ingredient: choosing the right kernel.
- Another key ingredient: choosing the right formulation of the problem.
- In general, there are a number of parameters (e.g. C and p or σ) that need to be tuned.
- C can be used to re-balance the classes: split C = C+ + C− and assign different weights to each class.
- There also exist support vector regression and support vector density estimation.

3 Other kernel methods

- Why not replace the inner product with kernels in other methods as well? The same reasoning applies, for instance, in regression...
- This leads to Kernel LDA, Kernel PCA, Kernel Perceptron, etc.

Kernel LDA
(Mika et al., Fisher Discriminant Analysis with Kernels, 1999)

Fisher criterion:

  w* = argmax_w  (w^t Sb w) / (w^t Sw w)

Suppose now that this is carried out in the feature space: the means and the scatter matrices are computed on the transformed data.

(Sketch) This can still be expressed in terms of operations in the input space. Let μΦ = (1/n) Σ_i Φ(xi) be the mean in the feature space (with a similar mean for each class). The weight vector has the form w = Σ_i αi Φ(xi), so the product ⟨w, μΦ⟩ is of the form

  ⟨w, μΦ⟩ = (1/n) Σ_{i,j} αj K(xi, xj)
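A numerical check (not from the slides) of this last identity, reusing the explicit quadratic feature map from the kernel-trick example so that Φ can be evaluated directly; the coefficients α are drawn at random purely to verify the identity (in Kernel LDA they would be obtained by solving the kernelized Fisher problem).

```python
# Verify <w, mu_Phi> = (1/n) sum_{i,j} alpha_j K(x_i, x_j)
# for w = sum_j alpha_j Phi(x_j), mu_Phi = (1/n) sum_i Phi(x_i),
# with Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) and K(x, z) = <x, z>^2.
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
n = 20
X = rng.standard_normal((n, 2))
alpha = rng.standard_normal(n)

PhiX = np.array([phi(x) for x in X])     # mapped data, shape (n, 3)
w = alpha @ PhiX                         # w = sum_j alpha_j Phi(x_j)
mu = PhiX.mean(axis=0)                   # mean in the feature space

K = (X @ X.T) ** 2                       # K(x_i, x_j) = <x_i, x_j>^2
lhs = np.dot(w, mu)
rhs = np.sum(K @ alpha) / n              # (1/n) sum_{i,j} K(x_i, x_j) alpha_j
print(lhs, rhs, np.isclose(lhs, rhs))
```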