PV211: Introduction to Information Retrieval
https://www.fi.muni.cz/~sojka/PV211
IIR 15-1: Support Vector Machines
Handout version
Petr Sojka, Hinrich Schütze et al.
Faculty of Informatics, Masaryk University, Brno
Center for Information and Language Processing, University of Munich
2023-04-19 (compiled on 2023-04-12 10:40:17)

Overview
1 SVM intro
2 SVM details
3 Classification in the real world

Take-away today
Support vector machines: state-of-the-art text classification methods (linear and nonlinear)
Introduction to SVMs
Formalization
Soft margin case for nonseparable problems
Discussion: Which classifier should I use for my problem?

Support vector machines
Machine-learning research in the last two decades has improved classifier effectiveness.
New generation of state-of-the-art classifiers: support vector machines (SVMs), boosted decision trees, regularized logistic regression, maximum entropy, neural networks, and random forests.
As we saw in IIR: applications to IR problems, particularly text classification.

What is a support vector machine – first take
Vector space classification (similar to Rocchio, kNN, linear classifiers)
Difference from previous methods: large-margin classifier
We aim to find a separating hyperplane (decision boundary) that is maximally far from any point in the training data.
In case of non-linear separability: we may have to discount some points as outliers or noise.

(Linear) Support Vector Machines
Binary classification problem
The decision boundary is a linear separator.
Criterion: being maximally far away from any data point → this determines the classifier margin.
Vectors on the margin lines are called support vectors.
The set of support vectors is a complete specification of the classifier.
(Figure: support vectors on the margin lines, maximized margin, maximum margin decision hyperplane.)

Why maximize the margin?
Points near the decision surface represent uncertain classification decisions.
A classifier with a large margin makes no low-certainty classification decisions (on the training set).
This gives a classification safety margin with respect to errors and random variation.
(Figure: support vectors, maximized margin, maximum margin decision hyperplane.)

Why maximize the margin?
SVM classification = large margin around the decision boundary
We can think of the margin as a "fat separator" – a fatter version of our regular decision hyperplane.
Unique solution
Decreased memory capacity
Increased ability to correctly generalize to test data
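To make the large-margin idea concrete, here is a minimal sketch (not part of the original slides) that fits a linear SVM on a tiny made-up 2-D data set with scikit-learn and reports which training points end up as support vectors and how wide the resulting margin is. The toy coordinates and the use of sklearn.svm.SVC are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny made-up, linearly separable 2-D training set (two classes).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],   # class -1
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])   # class +1
y = np.array([-1, -1, -1, +1, +1, +1])

# A linear SVM with a very large C approximates the hard-margin classifier.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]          # normal vector of the separating hyperplane
b = clf.intercept_[0]     # intercept term
print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)
# Margin width 2/|w|, as formalized later in the lecture.
print("margin width =", 2.0 / np.linalg.norm(w))
```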
Separating hyperplane: Recap
A hyperplane is an n-dimensional generalization of a plane (point in 1-D space, line in 2-D space, ordinary plane in 3-D space).
The decision hyperplane can be defined by:
an intercept term b (we were calling this θ before)
a normal vector w (the weight vector), which is perpendicular to the hyperplane
All points x on the hyperplane satisfy: w^T x + b = 0

Exercise
(Figure: a few points plotted on a grid with both axes running from 0 to 3.)
Draw the maximum margin separator. Which vectors are the support vectors?
Coordinates of dots: (3,3), (-1,1). Coordinates of triangle: (3,0)

Formalization of SVMs
Training set: consider a binary classification problem, where the x_i are the input vectors and the y_i are the labels.
For SVMs, the two classes are y_i = +1 and y_i = −1.
The linear classifier is then: f(x) = sign(w^T x + b)
A value of −1 indicates one class, and a value of +1 the other class.

Functional margin of a point
The SVM makes its decision based on the score w^T x + b.
Clearly, the larger |w^T x + b| is, the more confidence we can have that the decision is correct.
Functional margin: the functional margin of the vector x_i w.r.t. the hyperplane (w, b) is y_i (w^T x_i + b).
The functional margin of a data set w.r.t. a decision surface is twice the functional margin of the point in the data set with minimal functional margin.
The factor 2 comes from measuring across the whole width of the margin.
Problem: we can increase the functional margin arbitrarily by scaling w and b.
→ We need to place some constraint on the size of w.

Geometric margin
Geometric margin of the classifier: the maximum width of the band that can be drawn separating the support vectors of the two classes.
To compute the geometric margin, we need to compute the distance of a vector x from the hyperplane:
r = y (w^T x + b) / |w|
(Why? We will see this graphically in a few moments.)
The distance is of course invariant to scaling: if we replace w by 5w and b by 5b, then the distance is the same because it is normalized by the length of w.
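The functional and geometric margins above are easy to compute directly. The following sketch (with made-up values for w, b, and the points, chosen only for illustration) evaluates both and checks the scaling-invariance claim: multiplying w and b by 5 rescales the functional margin but leaves the geometric margin unchanged.

```python
import numpy as np

def functional_margin(w, b, x, y):
    """Functional margin y * (w^T x + b) of a labeled point."""
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    """Geometric margin: functional margin normalized by |w|."""
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

# Made-up hyperplane and labeled points, for illustration only.
w, b = np.array([1.0, 2.0]), -3.0
points = [(np.array([2.0, 2.0]), +1), (np.array([0.0, 0.0]), -1)]

for x, y in points:
    print(functional_margin(w, b, x, y), geometric_margin(w, b, x, y))

# Scaling w and b by 5 rescales the functional margin ...
print(functional_margin(5 * w, 5 * b, *points[0]))
# ... but leaves the geometric margin (the actual distance) unchanged.
print(geometric_margin(5 * w, 5 * b, *points[0]))
```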
Optimization problem solved by SVMs
Assume the canonical "functional margin" distance: assume that every data point has at least distance 1 from the hyperplane, i.e.,
y_i (w^T x_i + b) ≥ 1
Since each example's distance from the hyperplane is r_i = y_i (w^T x_i + b) / |w|, the margin is ρ = 2/|w|.
We want to maximize this margin. That is, we want to find w and b such that:
for all (x_i, y_i) ∈ D, y_i (w^T x_i + b) ≥ 1
ρ = 2/|w| is maximized

Support Vector Machines
(Figure: support vectors shown in red; the margin around the maximum margin decision hyperplane is maximized. The three parallel lines are
w^T x + b = 1, w^T x + b = 0, w^T x + b = −1,
which in the example drawn are
0.5x + 0.5y − 2 = 1, 0.5x + 0.5y − 2 = 0, 0.5x + 0.5y − 2 = −1.)

Since w' lies on the hyperplane (it is the point of the hyperplane closest to the origin, along the direction of w):
w^T w' + b = 0
b = −w^T w'
b/|w| = −w^T w'/|w|
Distance of a support vector x from the separator = (length of the projection of x onto w) minus (length of w'):
w^T x/|w| − w^T w'/|w| = w^T x/|w| + b/|w| = (w^T x + b)/|w|

Distance of the support vector x = (1, 5)^T from the separator = (length of the projection of x onto w) minus (length of w'):
w^T x/|w| − w^T w'/|w|
= (0.5 · 1 + 0.5 · 5)/(1/√2) − (0.5 · 2 + 0.5 · 2)/(1/√2)
= 3/(1/√2) − 2/(1/√2)
Equivalently, via w^T x/|w| + b/|w|:
= 3/(1/√2) + (−2)/(1/√2)
= (3 − 2)/(1/√2)
= √2

Optimization problem solved by SVMs (2)
Maximizing 2/|w| is the same as minimizing |w|/2. This gives the final standard formulation of an SVM as a minimization problem:
Find w and b such that:
(1/2) w^T w is minimized (because |w| = √(w^T w)), and
for all {(x_i, y_i)}, y_i (w^T x_i + b) ≥ 1.
We are now optimizing a quadratic function subject to linear constraints. Quadratic optimization problems are standard mathematical optimization problems, and many algorithms exist for solving them (e.g., quadratic programming libraries).

Recap
We start with a training set.
The data set defines the maximum-margin separating hyperplane (if it is separable).
We use quadratic optimization to find this plane.
Given a new point x to classify, the classification function f(x) computes the functional margin of the point (= normalized distance).
The sign of this function determines the class to assign to the point.
If the point is within the margin of the classifier, the classifier can return "don't know" rather than one of the two classes.
The value of f(x) may also be transformed into a probability of classification.

Exercise
(Figure: a few points plotted on a grid with both axes running from 0 to 3.)
Which vectors are the support vectors? Draw the maximum margin separator.
What values of w1, w2 and b (for w1 x + w2 y + b = 0) describe this separator?
Recall that we must have w1 x + w2 y + b ∈ {1, −1} for the support vectors.
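As a numerical check of the worked figure above (hyperplane 0.5x + 0.5y − 2 = 0 with support vector x = (1, 5)), this small sketch reproduces both routes to the distance and should print √2 ≈ 1.414 in each case. The code itself is only an illustration added here, not part of the lecture.

```python
import numpy as np

w = np.array([0.5, 0.5])     # normal vector of 0.5x + 0.5y - 2 = 0
b = -2.0                     # intercept term
x = np.array([1.0, 5.0])     # support vector from the figure

norm_w = np.linalg.norm(w)   # |w| = 1/sqrt(2)
# w': point of the hyperplane closest to the origin, here (2, 2).
w_prime = -b * w / norm_w**2

# Route 1: projection of x onto w minus the length of w'.
d1 = np.dot(w, x) / norm_w - np.dot(w, w_prime) / norm_w
# Route 2: directly via (w^T x + b) / |w|.
d2 = (np.dot(w, x) + b) / norm_w

print(d1, d2, np.sqrt(2))    # all three should agree: ~1.4142
```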
Walkthrough example
Working geometrically:
The maximum margin weight vector will be parallel to the shortest line connecting points of the two classes, that is, the line between (1, 1) and (2, 3), giving a weight vector of (1, 2).
The optimal decision surface is orthogonal to that line and intersects it at the halfway point. Therefore, it passes through (1.5, 2).
The SVM decision boundary is:
0 = x + 2y − (1 · 1.5 + 2 · 2) ⇔ 0 = (2/5) x + (4/5) y − 11/5
(Figure: the two classes and the separator on a grid with both axes running from 0 to 3.)

Walkthrough example
Working algebraically:
With the constraint y_i (w^T x_i + b) ≥ 1, we seek to minimize |w|.
We know that the solution is w = (a, 2a) for some a. So:
a + 2a + b = −1
2a + 6a + b = 1
Hence, a = 2/5 and b = −11/5. So the optimal hyperplane is given by w = (2/5, 4/5) and b = −11/5.
The margin ρ is 2/|w| = 2/√(4/25 + 16/25) = 2/(2√5/5) = √5 = √((1 − 2)² + (1 − 3)²).
(Figure: the same points and separator as above.)

Soft margin classification
What happens if the data is not linearly separable?
Standard approach: allow the fat decision margin to make a few mistakes – some points (outliers, noisy examples) are inside or on the wrong side of the margin.
Pay a cost for each misclassified example, depending on how far it is from meeting the margin requirement.
Slack variable ξ_i: a non-zero value for ξ_i allows x_i to not meet the margin requirement, at a cost proportional to the value of ξ_i.
Optimization problem: trade off how fat the margin can be made against how many points have to be moved around to allow this margin.
The sum of the ξ_i gives an upper bound on the number of training errors.
Soft-margin SVMs minimize training error traded off against margin size.
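The walkthrough solution above can be checked with an off-the-shelf solver. The sketch below, assuming scikit-learn is available, fits a linear SVM on just the two walkthrough points; a very large soft-margin penalty C drives the slack variables ξ_i towards zero, so the result should approximate the hard-margin optimum w = (2/5, 4/5), b = −11/5 derived above.

```python
import numpy as np
from sklearn.svm import SVC

# The two training points from the walkthrough example.
X = np.array([[1.0, 1.0],     # class -1
              [2.0, 3.0]])    # class +1
y = np.array([-1, +1])

# In the soft-margin formulation, C penalizes the slack variables xi_i;
# a huge C forces them towards zero, approximating the hard-margin SVM.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("w =", clf.coef_[0])          # expected: approx (0.4, 0.8)
print("b =", clf.intercept_[0])     # expected: approx -2.2
print("margin 2/|w| =", 2.0 / np.linalg.norm(clf.coef_[0]))  # approx sqrt(5)
```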
Using SVM for one-of classification
Recall how to use binary linear classifiers (k classes) for one-of classification: train and run k classifiers and then select the class with the highest confidence.
Another strategy used with SVMs: build k(k − 1)/2 one-versus-one classifiers, and choose the class that is selected by the most classifiers. While this involves building a very large number of classifiers, the time for training may actually decrease, since the training data set for each classifier is much smaller.
Yet another possibility: structured prediction – a generalization of classification where the classes are not just a set of independent, categorical labels, but may be arbitrary structured objects with relationships defined between them.

Text classification
Many commercial applications: there are many applications of text classification for corporate intranets, government departments, and Internet publishers.
Often there are greater performance gains from exploiting domain-specific text features than from changing from one machine learning method to another.
Understanding the data is one of the keys to successful categorization, yet this is an area in which many categorization tool vendors are weak.

Choosing what kind of classifier to use
When building a text classifier, the first question is: how much training data is currently available? None? Very little? Quite a lot? A huge amount, growing every day?
Practical challenge: creating or obtaining enough training data. Hundreds or thousands of examples from each class are required to produce a high-performance classifier, and many real-world contexts involve large sets of categories.

If you have no labeled training data
Use hand-written rules!
Example: IF (wheat OR grain) AND NOT (whole OR bread) THEN c = grain
(See the short code sketch at the end of this part for this rule written out as a program.)
In practice, rules get a lot bigger than this, and can be phrased using more sophisticated query languages than just Boolean expressions, including the use of numeric scores.
With careful crafting, the accuracy of such rules can become very high (precision in the high 90s, recall in the high 80s).
Nevertheless, the amount of work to create such well-tuned rules is very large. A reasonable estimate is 2 days per class, and extra time has to go into maintenance of rules as the content of documents in classes drifts over time.

A Verity topic (a complex classification rule)
(Figure: an example of a Verity topic definition.)

Westlaw: Example queries
Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company
Query: "trade secret" /s disclos! /s prevent /s employe!
Information need: Requirements for disabled people to be able to access a workplace
Query: disab! /p access! /s work-site work-place (employment /3 place)
Information need: Cases about a host's responsibility for drunk guests
Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest

If you have fairly little data and you are going to train a supervised classifier
Work out how to get more labeled data as quickly as you can.
Best way: insert yourself into a process where humans will be willing to label data for you as part of their natural tasks.
Example: often humans will sort or route email for their own purposes, and these actions give information about classes.
Active learning: a system is built which decides which documents a human should label. Usually these are the ones on which a classifier is uncertain of the correct classification.

If you have labeled data
A good amount of labeled data, but not huge: use everything that we have presented about text classification, and consider a hybrid approach (overlay a Boolean classifier).
A huge amount of labeled data: the choice of classifier probably has little effect on your results; choose the classifier based on the scalability of training or runtime efficiency.
Rule of thumb: each doubling of the training data size produces a linear increase in classifier performance, but with very large amounts of data, the improvement becomes sub-linear.
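To make the hand-written-rules idea concrete, here is a tiny sketch implementing the grain rule from the "no labeled training data" slide as plain Python. The tokenization and the example documents are made up for illustration; real rule systems (such as the Verity topic shown above) are far more elaborate.

```python
import re

def is_grain(doc: str) -> bool:
    """IF (wheat OR grain) AND NOT (whole OR bread) THEN c = grain."""
    tokens = set(re.findall(r"[a-z]+", doc.lower()))
    return bool(tokens & {"wheat", "grain"}) and not (tokens & {"whole", "bread"})

# Made-up example documents.
print(is_grain("Wheat futures rose sharply on the grain market."))  # True
print(is_grain("Recipe: whole wheat bread with honey."))            # False
```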
Large and difficult category taxonomies
If you have a small number of well-separated categories, then many classification algorithms are likely to work well.
But often: a very large number of very similar categories.
Example: web directories (e.g., the Yahoo! Directory with over 200,000 categories, or the Open Directory Project), library classification schemes (Dewey Decimal or Library of Congress), and the classification schemes used in legal or medical applications.
Accurate classification over large sets of closely related classes is inherently difficult – there is no general high-accuracy solution.

Recap
Is there a learning method that is optimal for all text classification problems?
No, because there is a trade-off between bias and variance.
Factors to take into account:
How much training data is available?
How simple/complex is the problem? (linear vs. nonlinear decision boundary)
How noisy is the problem?
How stable is the problem over time? For an unstable problem, it is better to use a simple and robust classifier.

Take-away today
Support vector machines: state-of-the-art text classification methods (linear and nonlinear)
Introduction to SVMs
Formalization
Soft margin case for nonseparable problems
Discussion: Which classifier should I use for my problem?

Resources
Chapter 14 of IIR (basic vector space classification)
Chapter 15-1 of IIR
Resources at https://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and FI MU library
Discussion of "how to select the right classifier for my problem" in Russell and Norvig