PV211: Introduction to Information Retrieval
https://www.fi.muni.cz/~sojka/PV211
IIR 15-1: Support Vector Machines (handout version)
Petr Sojka, Hinrich Schütze et al.
Faculty of Informatics, Masaryk University, Brno
Center for Information and Language Processing, University of Munich
2024-04-24 (compiled on 2024-06-10 08:54:14)

Overview
1. SVM intro
2. SVM details
3. Classification in the real world

Take-away today
- Support vector machines: state-of-the-art text classification methods (linear and nonlinear)
- Introduction to SVMs
- Formalization
- Soft margin case for nonseparable problems
- Discussion: which classifier should I use for my problem?

Support vector machines
- Machine-learning research in the last two decades has improved classifier effectiveness.
- New generation of state-of-the-art classifiers: support vector machines (SVMs), boosted decision trees, regularized logistic regression, maximum entropy, neural networks, and random forests.
- As we saw in IIR: applications to IR problems, particularly text classification.

What is a support vector machine – first take
- Vector space classification (similar to Rocchio, kNN, linear classifiers)
- Difference from previous methods: it is a large margin classifier.
- We aim to find a separating hyperplane (decision boundary) that is maximally far from any point in the training data.
- In case of non-linear separability: we may have to discount some points as outliers or noise.

(Linear) Support Vector Machines
- Binary classification problem.
- The decision boundary is a linear separator.
- Criterion: being maximally far away from any data point → this determines the classifier margin.
- Vectors on the margin lines are called support vectors.
- The set of support vectors is a complete specification of the classifier.
[Figure: support vectors; margin is maximized; maximum margin decision hyperplane]

Why maximize the margin?
- Points near the decision surface represent uncertain classification decisions.
- A classifier with a large margin makes no low-certainty classification decisions (on the training set).
- This gives a classification safety margin with respect to errors and random variation.
[Figure: support vectors; margin is maximized; maximum margin decision hyperplane]

Why maximize the margin? (cont.)
- SVM classification = a large margin around the decision boundary.
- We can think of the margin as a "fat separator" – a fatter version of our regular decision hyperplane.
- Unique solution
- Decreased memory capacity
- Increased ability to correctly generalize to test data

Separating hyperplane: Recap
- Hyperplane: an n-dimensional generalization of a plane (point in 1-D space, line in 2-D space, ordinary plane in 3-D space).
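To make this concrete, here is a minimal sketch (not part of the original slides; the points and the hyperplane are made-up toy numbers) of representing a separating hyperplane by a normal vector w and an intercept b and reading off the side on which a point falls. The pair (w, b) anticipates the formal definition of the decision hyperplane given just below.

```python
# Sketch with toy numbers: a hyperplane in 2-D given by w (normal vector)
# and b (intercept); the sign of w^T x + b tells us on which side x lies.
import numpy as np

w = np.array([1.0, 1.0])   # normal vector, perpendicular to the hyperplane
b = -4.0                   # intercept term

X = np.array([[3.0, 3.0],  # some points to classify
              [1.0, 1.0],
              [3.0, 0.0]])

# w^T x + b > 0 on one side, < 0 on the other, = 0 exactly on the hyperplane.
print(np.sign(X @ w + b))  # -> [ 1. -1. -1.]
```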
Decision hyperplane
- Can be defined by:
  - the intercept term b (we were calling this θ before)
  - the normal vector w (weight vector), which is perpendicular to the hyperplane
- All points x on the hyperplane satisfy wᵀx + b = 0.

Exercise
[Figure: points plotted on a grid from 0 to 3]
- Draw the maximum margin separator. Which vectors are the support vectors?
- Coordinates of the dots: (3, 3), (−1, 1). Coordinates of the triangle: (3, 0).

Formalization of SVMs
- Training set: consider a binary classification problem, where the xᵢ are the input vectors and the yᵢ are the labels.
- For SVMs, the two classes are yᵢ = +1 and yᵢ = −1.
- The linear classifier is then: f(x) = sign(wᵀx + b).
- A value of −1 indicates one class, and a value of +1 the other class.

Functional margin of a point
- The SVM makes its decision based on the score wᵀx + b.
- Clearly, the larger |wᵀx + b| is, the more confidence we can have that the decision is correct.
- Functional margin: the functional margin of the vector xᵢ w.r.t. the hyperplane ⟨w, b⟩ is yᵢ(wᵀxᵢ + b).
- The functional margin of a data set w.r.t. a decision surface is twice the functional margin of any of the points in the data set with minimal functional margin; the factor 2 comes from measuring across the whole width of the margin.
- Problem: we can increase the functional margin simply by scaling w and b. → We need to place some constraint on the size of w.

Geometric margin
- Geometric margin of the classifier: the maximum width of the band that can be drawn separating the support vectors of the two classes.
- To compute the geometric margin, we need the distance of a vector x from the hyperplane:
  r = y (wᵀx + b) / |w|
  (Why? We will see that this is so graphically in a few moments.)
- The distance is of course invariant to scaling: if we replace w by 5w and b by 5b, the distance is the same because it is normalized by the length of w.

Optimization problem solved by SVMs
- Assume the canonical "functional margin" distance: every data point has at least distance 1 from the hyperplane, i.e. yᵢ(wᵀxᵢ + b) ≥ 1.
- Since each example's distance from the hyperplane is rᵢ = yᵢ(wᵀxᵢ + b)/|w|, the margin is ρ = 2/|w|.
- We want to maximize this margin.
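As a quick numeric aside (a sketch with made-up numbers, not from the lecture), the snippet below computes functional and geometric margins for a fixed hyperplane and confirms that rescaling w and b inflates the functional margins while leaving the geometric ones unchanged.

```python
# Sketch with toy numbers: functional vs. geometric margins for a fixed
# hyperplane w^T x + b = 0 and a small labeled data set.
import numpy as np

w = np.array([2.0, 1.0])
b = -4.0
X = np.array([[3.0, 1.0], [4.0, 2.0], [1.0, 1.0], [0.0, 2.0]])
y = np.array([+1, +1, -1, -1])

functional = y * (X @ w + b)                  # y_i (w^T x_i + b)
geometric = functional / np.linalg.norm(w)    # r_i = y_i (w^T x_i + b) / |w|

print("functional margins:", functional)
print("geometric margins: ", geometric)
print("functional margin of the data set (2 * min):", 2 * functional.min())

# Scaling w and b by 5 multiplies the functional margins by 5 ...
print(y * (X @ (5 * w) + 5 * b))
# ... but the geometric margins stay the same (normalized by |5w|).
print(y * (X @ (5 * w) + 5 * b) / np.linalg.norm(5 * w))
```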
That is, we want to find w and b such that:
- for all (xᵢ, yᵢ) ∈ D: yᵢ(wᵀxᵢ + b) ≥ 1, and
- ρ = 2/|w| is maximized.

Support Vector Machines
[Figure: support vectors shown in red; the margin around the maximum margin decision hyperplane wᵀx + b = 0 is maximized, with the margin lines wᵀx + b = 1 and wᵀx + b = −1; in the drawn example these are 0.5x + 0.5y − 2 = 0, 0.5x + 0.5y − 2 = 1, and 0.5x + 0.5y − 2 = −1.]

Distance of a support vector from the separator
- Let w′ be the point where the hyperplane crosses the direction of w, so that |w′| is the distance of the hyperplane from the origin. Since w′ lies on the hyperplane, wᵀw′ + b = 0, hence b = −wᵀw′ and b/|w| = −wᵀw′/|w|.
- Distance of a support vector x from the separator = (length of the projection of x onto w) minus (length of w′):
  wᵀx/|w| − wᵀw′/|w| = wᵀx/|w| + b/|w| = (wᵀx + b)/|w|
- Worked instance: distance of the support vector x = (1, 5)ᵀ from the separator 0.5x + 0.5y − 2 = 0, i.e. w = (0.5, 0.5), b = −2, |w| = 1/√2, and w′ = (2, 2):
  wᵀx/|w| − wᵀw′/|w| = (1 · 0.5 + 5 · 0.5)/(1/√2) − (0.5 · 2 + 0.5 · 2)/(1/√2) = 3/(1/√2) − 2/(1/√2)
  wᵀx/|w| + b/|w| = 3/(1/√2) + (−2)/(1/√2) = (3 − 2)/(1/√2) = √2

Optimization problem solved by SVMs (2)
- Maximizing 2/|w| is the same as minimizing |w|/2. This gives the final standard formulation of an SVM as a minimization problem:
- Find w and b such that (1/2)wᵀw is minimized (because |w| = √(wᵀw)), and for all {(xᵢ, yᵢ)}: yᵢ(wᵀxᵢ + b) ≥ 1.
- We are now optimizing a quadratic function subject to linear constraints. Quadratic optimization problems are standard mathematical optimization problems, and many algorithms exist for solving them (e.g. quadratic programming libraries).

Recap
- We start with a training set.
- The data set defines the maximum-margin separating hyperplane (if it is separable).
- We use quadratic optimization to find this plane.
- Given a new point x to classify, the classification function f(x) computes the functional margin of the point (= normalized distance).
- The sign of this function determines the class to assign to the point.
- If the point is within the margin of the classifier, the classifier can return "don't know" rather than one of the two classes.
- The value of f(x) may also be transformed into a probability of classification.

Exercise
[Figure: points plotted on a grid from 0 to 3]
- Which vectors are the support vectors? Draw the maximum margin separator.
- What values of w₁, w₂ and b (for w₁x + w₂y + b = 0) describe this separator?
- Recall that we must have w₁x + w₂y + b ∈ {1, −1} for the support vectors.

Walkthrough example
Working geometrically:
- The maximum margin weight vector will be parallel to the shortest line connecting points of the two classes, that is, the line between (1, 1) and (2, 3), giving a weight vector of (1, 2).
- The optimal decision surface is orthogonal to that line and intersects it at the halfway point. Therefore, it passes through (1.5, 2). (A short numeric check of this construction follows below.)
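The following sketch (NumPy only, not part of the slides) checks the geometric construction numerically; it assumes, as in the walkthrough, that (1, 1) and (2, 3) are the closest pair of points from the two classes.

```python
# Numeric check of the geometric walkthrough.
import numpy as np

p_neg = np.array([1.0, 1.0])   # closest point of the -1 class
p_pos = np.array([2.0, 3.0])   # closest point of the +1 class

w_dir = p_pos - p_neg          # direction of the weight vector: (1, 2)
mid = (p_pos + p_neg) / 2      # the boundary passes through the midpoint: (1.5, 2)

# Unnormalized decision surface: w_dir^T (x - mid) = 0, i.e. x + 2y - 5.5 = 0.
b_unnormalized = -w_dir @ mid
print("w direction:", w_dir, "midpoint:", mid, "b (unnormalized):", b_unnormalized)

# Rescaling so that the two support vectors get functional margins -1 and +1
# recovers the canonical (w, b) derived algebraically in the next step.
scale = 2 / (w_dir @ (p_pos - p_neg))      # = 2 / |p_pos - p_neg|^2 = 2/5
w = scale * w_dir
b = scale * b_unnormalized
print("w =", w, "b =", b)                  # (0.4, 0.8) and -2.2
print("margins:", w @ p_neg + b, w @ p_pos + b)   # -1.0 and +1.0
```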
- The SVM decision boundary is therefore 0 = x + 2y − (1 · 1.5 + 2 · 2), i.e. (after rescaling so that the support vectors have functional margin ±1) 0 = (2/5)x + (4/5)y − 11/5.

Walkthrough example (continued)
Working algebraically:
- With the constraint yᵢ(wᵀxᵢ + b) ≥ 1, we seek to minimize |w|.
- We know that the solution is w = (a, 2a) for some a. So: a + 2a + b = −1 and 2a + 6a + b = 1.
- Hence a = 2/5 and b = −11/5, so the optimal hyperplane is given by w = (2/5, 4/5) and b = −11/5.
- The margin ρ is 2/|w| = 2/√(4/25 + 16/25) = 2/(2√5/5) = √5 = √((1 − 2)² + (1 − 3)²).

Soft margin classification
- What happens if the data are not linearly separable?
- Standard approach: allow the fat decision margin to make a few mistakes; some points (outliers, noisy examples) are inside or on the wrong side of the margin.
- Pay a cost for each misclassified example, depending on how far it is from meeting the margin requirement.
- Slack variable ξᵢ: a non-zero value for ξᵢ allows xᵢ to not meet the margin requirement, at a cost proportional to the value of ξᵢ.
- Optimization problem: trading off how fat it can make the margin vs. how many points have to be moved around to allow this margin.
- The sum of the ξᵢ gives an upper bound on the number of training errors.
- Soft-margin SVMs minimize training error traded off against margin (a code sketch a few slides below illustrates this trade-off in practice).

Using SVM for one-of classification
- Recall how to use binary linear classifiers (k classes) for one-of classification: train and run k classifiers and then select the class with the highest confidence.
- Another strategy used with SVMs: build k(k − 1)/2 one-versus-one classifiers, and choose the class that is selected by the most classifiers. While this involves building a very large number of classifiers, the time for training classifiers may actually decrease, since the training data set for each classifier is much smaller.
- Yet another possibility: structured prediction, a generalization of classification where the classes are not just a set of independent, categorical labels, but may be arbitrary structured objects with relationships defined between them.

Text classification
- Many commercial applications: there are many applications of text classification for corporate intranets, government departments, and Internet publishers.
- Often there are greater performance gains from exploiting domain-specific text features than from changing from one machine learning method to another.
- Understanding the data is one of the keys to successful categorization, yet this is an area in which many categorization tool vendors are weak.

Choosing what kind of classifier to use
- When building a text classifier, the first question is: how much training data is currently available?
- Practical challenge: creating or obtaining enough training data. Hundreds or thousands of examples from each class are required to produce a high-performance classifier, and many real-world contexts involve large sets of categories.
- None? Very little? Quite a lot? A huge amount, growing every day?
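Picking up the soft-margin discussion from above: the sketch below (scikit-learn and NumPy; the overlapping point clouds are randomly generated toy data, not from the lecture) illustrates how the cost attached to the slack variables, exposed as the parameter C of scikit-learn's SVC, trades margin width against margin violations.

```python
# Sketch: effect of the slack cost C on a linear soft-margin SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping (not linearly separable) clouds labeled -1 and +1.
X = np.vstack([rng.normal(loc=(0.0, 0.0), scale=1.0, size=(50, 2)),
               rng.normal(loc=(2.0, 2.0), scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])   # rho = 2 / |w|
    n_sv = clf.n_support_.sum()                 # points on or inside the margin
    print(f"C={C:>6}: margin width {margin:.2f}, {n_sv} support vectors")
# Typically, a smaller C gives a fatter margin with more slack (more support
# vectors), while a very large C approximates the hard-margin SVM.
```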
If you have no labeled training data
- Use hand-written rules!
- Example: IF (wheat OR grain) AND NOT (whole OR bread) THEN c = grain
- In practice, rules get a lot bigger than this, and can be phrased using more sophisticated query languages than just Boolean expressions, including the use of numeric scores.
- With careful crafting, the accuracy of such rules can become very high (precision in the high 90s, recall in the high 80s).
- Nevertheless, the amount of work needed to create such well-tuned rules is very large. A reasonable estimate is 2 days per class, and extra time has to go into maintenance of the rules, as the content of documents in the classes drifts over time.

A Verity topic (a complex classification rule)
[Figure: a Verity topic, an example of a complex hand-written classification rule]

Westlaw: Example queries
- Information need: information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company.
  Query: "trade secret" /s disclos! /s prevent /s employe!
- Information need: requirements for disabled people to be able to access a workplace.
  Query: disab! /p access! /s work-site work-place (employment /3 place)
- Information need: cases about a host's responsibility for drunk guests.
  Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest

If you have fairly little data and you are going to train a supervised classifier
- Work out how to get more labeled data as quickly as you can. The best way: insert yourself into a process where humans will be willing to label data for you as part of their natural tasks.
- Example: often humans will sort or route email for their own purposes, and these actions give information about classes.
- Active learning: a system is built which decides which documents a human should label. Usually these are the ones on which the classifier is uncertain of the correct classification.

If you have labeled data
- A good amount of labeled data, but not huge: use everything that we have presented about text classification, and consider a hybrid approach (overlay a Boolean classifier).
- A huge amount of labeled data: the choice of classifier probably has little effect on your results; choose the classifier based on the scalability of training or runtime efficiency.
- Rule of thumb: each doubling of the training data size produces a linear increase in classifier performance, but with very large amounts of data, the improvement becomes sub-linear.

Large and difficult category taxonomies
- If you have a small number of well-separated categories, then many classification algorithms are likely to work well.
- But often: a very large number of very similar categories.
- Example: web directories (e.g. the Yahoo! Directory with over 200,000 categories, or the Open Directory Project), library classification schemes (Dewey Decimal or Library of Congress), and the classification schemes used in legal or medical applications.
- Accurate classification over large sets of closely related classes is inherently difficult; there is no general high-accuracy solution.

Recap
- Is there a learning method that is optimal for all text classification problems? No, because there is a trade-off between bias and variance.
- Factors to take into account:
  - How much training data is available?
  - How simple/complex is the problem? (linear vs. nonlinear decision boundary)
  - How noisy is the problem?
  - How stable is the problem over time? For an unstable problem, it is better to use a simple and robust classifier.

Take-away today
- Support vector machines: state-of-the-art text classification methods (linear and nonlinear)
- Introduction to SVMs
- Formalization
- Soft margin case for nonseparable problems
- Discussion: which classifier should I use for my problem?

Resources
- Chapter 14 of IIR (basic vector space classification)
- Chapter 15-1 of IIR
- Resources at https://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and the FI MU library
- Discussion of "how to select the right classifier for my problem" in Russell and Norvig