Support Vector Machines

Perceptron Revisited: Linear Separators
• Binary classification can be viewed as the task of separating classes in feature space.

Linear Separators
• Which of the linear separators is optimal?

Classification Margin
• Distance from example x_i to the separator is r = y_i (w^T x_i + b) / ||w||.
• Examples closest to the hyperplane are support vectors.
• Margin ρ of the separator is the distance between support vectors.

Maximum Margin Classification
• Maximizing the margin is good according to intuition and PAC theory.
• It implies that only the support vectors matter; the other training examples are ignorable.

Soft Margin Classification
• What if the training set is not linearly separable?
• Slack variables ξ_i can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.

Linear SVMs: Overview
• The classifier is a separating hyperplane.
• The most “important” training points are the support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points x_i are support vectors, i.e. those with non-zero Lagrangian multipliers α_i.
• Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
  – Dual problem: maximize Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j subject to Σ_i α_i y_i = 0 and α_i ≥ 0 (0 ≤ α_i ≤ C in the soft-margin case).
  – Solution: f(x) = Σ_i α_i y_i x_i^T x + b.

Non-linear SVMs
• Datasets that are linearly separable with some noise work out great.
• But what are we going to do if the dataset is just too hard?
• How about… mapping data to a higher-dimensional space?

Non-linear SVMs: Feature spaces
• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable.

The “Kernel Trick”
• The linear classifier relies on the inner product between vectors: K(x_i, x_j) = x_i^T x_j.
• If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes K(x_i, x_j) = φ(x_i)^T φ(x_j).
• A kernel function is a function that is equivalent to an inner product in some feature space.
• Example: 2-dimensional vectors x = [x_1 x_2]; let K(x_i, x_j) = (1 + x_i^T x_j)^2.
  We need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j):
  K(x_i, x_j) = (1 + x_i^T x_j)^2
              = 1 + x_i1^2 x_j1^2 + 2 x_i1 x_j1 x_i2 x_j2 + x_i2^2 x_j2^2 + 2 x_i1 x_j1 + 2 x_i2 x_j2
              = [1  x_i1^2  √2 x_i1 x_i2  x_i2^2  √2 x_i1  √2 x_i2]^T [1  x_j1^2  √2 x_j1 x_j2  x_j2^2  √2 x_j1  √2 x_j2]
              = φ(x_i)^T φ(x_j),  where φ(x) = [1  x_1^2  √2 x_1 x_2  x_2^2  √2 x_1  √2 x_2]
• Thus, a kernel function implicitly maps data to a high-dimensional space (without the need to compute each φ(x) explicitly); the example above is verified numerically in the first sketch below.

Examples of Kernel Functions
• Linear: K(x_i, x_j) = x_i^T x_j
  – Mapping Φ: x → φ(x), where φ(x) is x itself.
• Polynomial of power p: K(x_i, x_j) = (1 + x_i^T x_j)^p
  – Mapping Φ: x → φ(x), where φ(x) has C(d+p, p) dimensions (d is the dimensionality of the original space).
• Gaussian (radial-basis function): K(x_i, x_j) = exp(−||x_i − x_j||^2 / (2σ^2))
  – Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian); a combination of these functions for the support vectors is the separator.
• The higher-dimensional space still has intrinsic dimensionality d (the mapping is not onto), but linear separators in it correspond to non-linear separators in the original space (see the RBF-kernel sketch below).
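As a quick numeric check of the kernel-trick derivation above, the following sketch (plain NumPy; the helper name phi and the sample vectors are illustrative choices, not part of the original slides) evaluates K(x_i, x_j) = (1 + x_i^T x_j)^2 directly in the original 2-D space and compares it with the explicit inner product φ(x_i)^T φ(x_j) in the 6-dimensional feature space.

```python
import numpy as np

def phi(x):
    """Explicit 6-D feature map for K(x, z) = (1 + x^T z)^2 with 2-D inputs,
    taken from the expansion on the slide above."""
    x1, x2 = x
    return np.array([1.0,
                     x1 ** 2,
                     np.sqrt(2) * x1 * x2,
                     x2 ** 2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2])

# two arbitrary 2-D example points (illustrative values only)
xi = np.array([0.7, -1.2])
xj = np.array([2.0, 0.5])

k_direct = (1.0 + xi @ xj) ** 2   # kernel evaluated in the original 2-D space
k_mapped = phi(xi) @ phi(xj)      # inner product in the mapped 6-D space

print(k_direct, k_mapped)         # the two numbers agree up to floating-point error
assert np.isclose(k_direct, k_mapped)
```

The same check passes for any pair of 2-D points, which is the point of the kernel trick: the SVM only ever needs K, never an explicit φ(x).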
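The second sketch ties the soft margin, the RBF kernel, and the support vectors together on a toy problem. It is a minimal illustration assuming scikit-learn is available; the dataset (make_circles) and the parameter values C and gamma are arbitrary choices for demonstration, with gamma playing the role of 1/(2σ^2) in the Gaussian kernel above.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# a toy dataset that is NOT linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

# soft-margin SVM with a Gaussian (RBF) kernel:
#   C penalizes the slack variables xi_i, gamma corresponds to 1/(2*sigma^2)
clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)
# only the support vectors (points with non-zero alpha_i) define the separator
print("first few support vectors:\n", clf.support_vectors_[:5])
print("their signed coefficients alpha_i * y_i:\n", clf.dual_coef_[0, :5])
```

A linear separator in the (infinite-dimensional) RBF feature space shows up here as a non-linear boundary around the inner circle, and only the points with non-zero α_i are retained as support vectors.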
SVM applications
• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
• SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
• SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.
• SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. ’97], principal component analysis [Schölkopf et al. ’99], etc.
• The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the α_i’s at a time, e.g. SMO [Platt ’99] and SVMlight [Joachims ’99].
• Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done in a try-and-see manner (see the sketch below).
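One common way to organize that try-and-see search is cross-validated grid search over kernels and their parameters. This is a minimal sketch assuming scikit-learn (whose SVC wraps an SMO-style decomposition solver); the synthetic dataset and the candidate grid values are illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# synthetic stand-in data; replace with the task's real feature vectors
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# candidate kernels and parameters to "try and see", scored by 5-fold cross-validation
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.01, 0.1, 1], "C": [0.1, 1, 10]},
]

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("best kernel/parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
print("held-out accuracy:", search.score(X_test, y_test))
```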