PA196: Pattern Recognition
06. Support vector machines
Dr. Vlad Popovici
popovici@iba.muni.cz
Institute of Biostatistics and Analyses, Masaryk University, Brno

Outline
1 Basic statistical learning theory
  Introduction
  VC bounds on generalization error
  The VC dimension
2 Support vector machines
  Linear SVMs
  Nonlinear SVM
  The VC dimension of SVM
  Posterior probabilities for SVM
3 Other kernel methods

1 Basic statistical learning theory

Introduction

- Statistical learning theory (SLT) views the problem of "learning" from a statistical perspective.
- Its aim (as for any theory) is to model some phenomena so that we can make predictions about them.
- Other equally valid theories exist: Bayesian inference, inductive inference, statistical physics, "traditional" statistical analysis, etc.
- Some assumptions need to be made, and these may determine which approach is more suitable in a given context.

In SLT:
- we assume the data is generated by some underlying (unknown) distribution P(x, y)
- a sample of n i.i.d. observations drawn from P is available to the learner:
  S = {(xi, yi) ∈ R^d × {±1} | i = 1, ..., n}
- there is a learning algorithm A that chooses a function f = A_F(S) from a function space F as a result of training on S
- generalization error (expected error):
  ε(S, A, F) = E_{(x,y)}[ l(A_F(S), x, y) ]
  where l is a loss function

- We are interested not only in E_S[ε(S, A, F)], but also in the distribution of ε(S, A, F).
- Classifier consistency:
  lim_{n→∞} E_S[ε(S, A, F)] = ε_Bayes
  where ε_Bayes is the Bayes risk.
- The distribution of ε(S, A, F) depends on the algorithm, on F and on n.

- Classical statistics investigates mostly the mean value of the distribution of ε.
- SLT looks also at the tails; it derives probabilistic bounds on the generalization error.
- Hence PAC: probably approximately correct - bound the probability of being "deceived" and set it equal to some δ.
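The following is a small simulation sketch (not part of the original slides) of this setup: a distribution P(x, y), a sample S, an algorithm A that minimizes the empirical error over a simple function class (thresholds on the real line), and the resulting distribution of the generalization error over repeated samples. The concrete P(x, y), the threshold family and all names are illustrative assumptions.

```python
# The generalization error eps(S, A, F) is a random variable (it depends on the
# random sample S); we can therefore look at its whole distribution, not only
# at its mean.
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def gauss_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def true_error(t):
    # Assumed P(x, y): y = +/-1 with prob. 1/2, x | y ~ N(y, 1);
    # classifier f_t(x) = sign(x - t).
    return 0.5 * (gauss_cdf(t - 1.0) + (1.0 - gauss_cdf(t + 1.0)))

def learn_threshold(x, y):
    # The "algorithm" A: pick the threshold with the smallest empirical error.
    candidates = np.linspace(-3, 3, 301)
    errs = [np.mean(np.sign(x - t) != y) for t in candidates]
    return candidates[int(np.argmin(errs))]

n = 50
errors = []
for _ in range(2000):                      # 2000 independent samples S of size n
    y = rng.choice([-1, 1], size=n)
    x = y + rng.standard_normal(n)
    errors.append(true_error(learn_threshold(x, y)))

errors = np.array(errors)
print("Bayes risk      :", round(true_error(0.0), 4))
print("mean error      :", round(errors.mean(), 4))
print("95th percentile :", round(np.percentile(errors, 95), 4))
```

The gap between the mean and the upper percentile is exactly the kind of tail behaviour that the PAC bounds below try to control.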
VC bounds on generalization error

What is the probability of being deceived by a "bad" function f, i.e. of having a perfect training error while the true error is above some ε? For a fixed f with Err(f) > ε,

  P_S{ Err_S(f) = 0, Err(f) > ε } = (1 − Err(f))^n ≤ (1 − ε)^n ≤ exp(−εn)

Taking ε = (1/n) ln(1/δ) leads to

  P_S{ Err_S(f) = 0, Err(f) > (1/n) ln(1/δ) } ≤ δ

Now consider a (countable) set of functions F = {f1, ..., fk, ...} and require the probability of being misled by fk to be less than qk·δ (with Σ_k qk ≤ 1). Then

  P_S{ ∃ fk : Err_S(fk) = 0, Err(fk) > (1/n) ln(1/(qk δ)) } ≤ δ

Theorem
Given a countable set of functions F and Σ_k qk ≤ 1, with probability at least 1 − δ over random samples of size n, the generalization error of a function fk ∈ F with zero training error is bounded by

  Err(fk) ≤ (1/n) ( ln(1/qk) + ln(1/δ) )

Note: ln(1/qk) can be thought of as a "complexity" (description length) of the function fk.

- Use the 0-1 loss: (1/2)|yi − f(xi, α)| ∈ {0, 1}.
- The expected risk (actual risk) is
  R(α) = ∫ (1/2)|y − f(x, α)| dP(x, y)
- The empirical risk is measured over an observed set (here of size n):
  Remp(α) = (1/(2n)) Σ_{i=1}^n |yi − f(xi, α)|

For such losses, the following bound holds (Vapnik, 1995): for η ∈ [0, 1], with probability 1 − η,

  R(α) ≤ Remp(α) + sqrt( [ h (log(2n/h) + 1) − log(η/4) ] / n )

- h is a non-negative integer called the Vapnik-Chervonenkis (VC) dimension; it is a measure of the capacity of the set of functions {f(·, α)}.
- The second term on the right-hand side (the square root) is called the VC confidence.

Notes:
- the bound is independent of P(x, y)
- if we knew h, we could compute the right-hand side
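A small helper (not from the slides) that evaluates the VC confidence term for given h, n and η illustrates how the guarantee degrades as the capacity h grows relative to n; the chosen values of n, η and h are arbitrary examples.

```python
# Evaluate the VC confidence term of Vapnik's bound for given h, n and eta.
import numpy as np

def vc_confidence(h, n, eta):
    """sqrt( [ h*(log(2n/h) + 1) - log(eta/4) ] / n )"""
    return np.sqrt((h * (np.log(2.0 * n / h) + 1.0) - np.log(eta / 4.0)) / n)

n, eta = 10_000, 0.05
for h in (10, 100, 1000, 5000):
    print(f"h = {h:5d}  ->  VC confidence = {vc_confidence(h, n, eta):.3f}")

# The guaranteed risk is R <= Remp + vc_confidence(h, n, eta): for a fixed n,
# a larger VC dimension h makes the guarantee weaker (eventually vacuous).
```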
The VC dimension

- The VC dimension is a characteristic of a set of functions F = {f(x, α)}; here we restrict the analysis to functions with values in {±1}.
- n points can be labeled in 2^n distinct ways.
- If for any labeling of the set of points a function f(x, α) can be found in F that realizes it, we say that F shatters the set of points.
- The VC dimension h of F is the maximum number of points that can be shattered by F.
- If the VC dimension of F is h, it means that there exists at least one set of h points that can be shattered - not that all such sets can be shattered.

Shattering points with oriented hyperplanes in R^d
The VC dimension of the set of oriented hyperplanes in R^d is d + 1.

Notes:
- h does not depend on the number of parameters a family of functions has
- for two machines having null empirical risk, the one with smaller h has better generalization guarantees
- a k-NN classifier with k = 1 has h = ∞ and null empirical risk → the bound becomes useless
- h depends on the class of functions F, while R and Remp depend on the particular function selected by the learning machine
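A brute-force check (not from the slides, assuming numpy and scipy are available) of the hyperplane result in R^2: three points in general position can be shattered, while four points cannot, consistent with h = d + 1 = 3. Linear separability of a labeling is tested as a linear-programming feasibility problem; the point sets are arbitrary examples.

```python
# Brute-force shattering check for oriented hyperplanes (linear classifiers) in R^2.
# A labeling y is realizable iff the LP "find w, b with y_i*(w.x_i + b) >= 1" is feasible.
from itertools import product
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    # variables: (w1, w2, b); constraints: -y_i*(w.x_i + b) <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3)
    return res.success

def shattered(X):
    return all(separable(X, np.array(y))
               for y in product([-1, 1], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])                 # general position
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])      # contains the XOR labeling

print("3 points shattered:", shattered(three))   # True  -> h >= 3
print("4 points shattered:", shattered(four))    # False -> consistent with h = d + 1 = 3
```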
Structural risk minimization

(Figure: nested subsets of the function class, with VC dimensions h1, h2, h3, ..., hk.)

- We introduce a structure over the set of functions, such that h1 < h2 < ... < hk < ...
- Idea: find the subset of functions which minimizes the empirical risk, while controlling the complexity (the VC dimension of the subset).

2 Support vector machines

Linear SVMs

(Reminder)
(Figure: separating hyperplane with normal vector w, margin width 2/‖w‖, distance to the origin −w0/‖w‖, and margin violations at distance ξ/‖w‖.)

  minimize_{w, w0, ξ}  (1/2)⟨w, w⟩ + Ω(ξ)
  subject to  yi(⟨w, xi⟩ + w0) ≥ 1 − ξi,  i = 1, ..., n
              ξi ≥ 0,  i = 1, ..., n

- Ω(ξ) = C Σ_i ξi^p; p = 1 → 1-norm (L1) soft margin SVM, p = 2 → 2-norm (L2) soft margin SVM.
- w0 can be computed from w0 = yi − ⟨w, xi⟩ (for a support vector xi); a more stable solution is obtained by averaging over all support vectors (SVs):
  w0 = (1/|SV|) Σ_{i∈SV} (yi − ⟨w, xi⟩)

L1 SVM
Dual optimization problem (from the KKT conditions):

  maximize_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
  subject to  Σ_{i=1}^n yi αi = 0
              0 ≤ αi ≤ C,  i = 1, ..., n   (box constraints)

Notes:
- if αi = 0 then ξi = 0 and it follows that xi is correctly classified
- if 0 < αi < C then yi(⟨w, xi⟩ + w0) − 1 + ξi = 0 and ξi = 0, meaning that xi is an unbounded support vector
- if αi = C then yi(⟨w, xi⟩ + w0) − 1 + ξi = 0 and ξi > 0, meaning that xi is a bounded support vector; moreover, if ξi < 1 then xi is still correctly classified, otherwise it is misclassified
- w0 is obtained as before, but averaging over the unbounded SVs
- the discriminant function is
  h(x) = Σ_{i∈SV} αi yi ⟨xi, x⟩ + w0;   predict y = +1 if h(x) > 0, y = −1 if h(x) < 0

L2 SVM
For convenience, we take Ω(ξ) = (C/2) Σ_i ξi^2, which leads to the dual optimization problem

  maximize_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n yi yj αi αj ( ⟨xi, xj⟩ + δij/C )
  subject to  Σ_{i=1}^n yi αi = 0
              αi ≥ 0,  i = 1, ..., n

where δij is Kronecker's delta.

Notes:
- w0 is computed by averaging over terms of the form
  yi − Σ_{j=1}^n αj yj ( ⟨xj, xi⟩ + δij/C )
- the decision function remains the same
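A minimal sketch (not from the slides) of the L1 soft-margin case using scikit-learn's SVC, assuming scikit-learn is available and using a synthetic two-class dataset: it recovers w from the dual coefficients and w0 by averaging yi − ⟨w, xi⟩ over the unbounded support vectors, as described above, and compares with the solver's own intercept.

```python
# Linear L1 soft-margin SVM: recover w = sum_i alpha_i y_i x_i and
# w0 = average of (y_i - <w, x_i>) over unbounded SVs (0 < alpha_i < C).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, C = 200, 1.0
y = rng.choice([-1, 1], size=n)
X = y[:, None] * [2.0, 1.0] + rng.standard_normal((n, 2))   # two overlapping classes

clf = SVC(kernel="linear", C=C).fit(X, y)

alpha_y = clf.dual_coef_[0]                  # = y_i * alpha_i for the support vectors
sv = clf.support_vectors_
w = alpha_y @ sv                             # w = sum_i alpha_i y_i x_i
unbounded = np.abs(np.abs(alpha_y) - C) > 1e-6   # alpha_i strictly below C
y_sv = np.sign(alpha_y)                      # labels of the support vectors
w0 = np.mean(y_sv[unbounded] - sv[unbounded] @ w)

print("w  (from dual):", w, " vs coef_:", clf.coef_[0])
print("w0 (avg over unbounded SVs):", round(w0, 4),
      " vs intercept_:", round(float(clf.intercept_[0]), 4))
```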
Nonlinear SVM

The kernel trick
- The SVM problem was formulated entirely in terms of inner products.
- Let Φ : R^d → H be a mapping from the input space into a feature space, and suppose that there exists a "kernel function" such that K(xi, xj) = ⟨Φ(xi), Φ(xj)⟩.
- H may be infinite-dimensional, e.g. for
  K(xi, xj) = exp( −‖xi − xj‖^2 / (2σ^2) )
- If we replace ⟨xi, xj⟩ with K(xi, xj) in the linear SVM, we obtain a nonlinear SVM!

(Figure: the mapping Φ(x) from the input space to the feature space.)

Discriminant function:
  h(x) = Σ_{i∈SV} αi yi K(xi, x) + w0

Which functions can be used as kernels?
For some kernels it is easy to find the corresponding mapping Φ: for example, K(xi, xj) = ⟨xi, xj⟩^2 corresponds to
  Φ : R^2 → R^3,   Φ(x) = (x1^2, √2 x1 x2, x2^2)
In general, for a given kernel there may exist several possible mappings Φ.
(Example from Burges: A tutorial on support vector machines for pattern recognition.)
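A quick numerical check (not from the slides) of this explicit feature map: the kernel value ⟨x, z⟩^2 computed in the input space coincides with the inner product of the mapped vectors.

```python
# Check that K(x, z) = <x, z>^2 equals <Phi(x), Phi(z)>
# with Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) on R^2.
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
x, z = rng.standard_normal(2), rng.standard_normal(2)

k_direct = np.dot(x, z) ** 2
k_mapped = np.dot(phi(x), phi(z))
print(k_direct, k_mapped, np.isclose(k_direct, k_mapped))   # the two values coincide
```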
(Theoretical conditions for kernels)

Mercer's condition
There exist a mapping Φ and an expansion K(x, z) = Σ_i Φi(x) Φi(z) if and only if, for any g(x) such that ∫ g(x)^2 dx < ∞,

  ∫∫ K(x, z) g(x) g(z) dx dz ≥ 0

Notes:
- if Mercer's condition is not satisfied, there might exist cases for which the optimization problem has no solution
- the space generated by the kernel is called a Reproducing Kernel Hilbert Space (RKHS)
- kernel matrix (Gram matrix): Kij = K(xi, xj); Hessian matrix: Hij = yi yj K(xi, xj)
- K is positive semi-definite
- in the L2 SVM, the diagonal of K is augmented by 1/C, thus potentially transforming K into a positive definite matrix
- all the information about the data is concentrated in K
- K can be seen as defining a similarity between samples

Commonly used kernels:
- linear kernel: K(x, z) = ⟨x, z⟩
- polynomial kernel: K(x, z) = (⟨x, z⟩ + 1)^p
- radial basis function (RBF) kernel: K(x, z) = exp( −‖x − z‖^2 / (2σ^2) )
- sigmoid kernel: K(x, z) = tanh(κ⟨x, z⟩ − δ); this kernel does not always satisfy Mercer's condition!

Kernels - closure properties
If K1 and K2 are kernels, a ∈ R+, f is a real-valued function, φ : R^d → R^m with K3 a kernel over R^m × R^m, and B is a symmetric positive semi-definite d × d matrix, then the following are also kernels:
- K1(x, z) + K2(x, z)
- aK1(x, z)
- K1(x, z) K2(x, z)
- K(x, z) = f(x) f(z)
- K(x, z) = K3(φ(x), φ(z))
- K(x, z) = x^t B z

The solution of the optimization problem is
- global: any local solution of a convex optimization problem is also a global solution
- unique: if the Hessian matrix is positive definite, the solution is guaranteed to be unique

In case the solution is not unique, it is still global: if w1 and w2 are solutions, then every point on the path w(τ) = τw1 + (1 − τ)w2, 0 ≤ τ ≤ 1, is also a solution.
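A short numerical illustration (not from the slides) of two of the points above: the Gram matrix of the RBF kernel is positive semi-definite, and adding 1/C to its diagonal, as in the L2 SVM, shifts all eigenvalues up by 1/C and makes it positive definite. The data and the values of σ and C are arbitrary examples.

```python
# PSD check of an RBF Gram matrix and of its L2-SVM diagonal augmentation.
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
C = 10.0

K = rbf_kernel_matrix(X)
eig = np.linalg.eigvalsh(K)
eig_aug = np.linalg.eigvalsh(K + np.eye(len(X)) / C)

print("min eigenvalue of K       :", eig.min())      # >= 0 (up to round-off)
print("min eigenvalue of K + I/C :", eig_aug.min())  # >= 1/C, strictly positive
```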
The VC dimension of SVM

- For a Mercer kernel K, the VC dimension of the SVM is dim(H) + 1.
- The VC dimension of the RKHS generated by the polynomial kernel is the binomial coefficient C(d + p − 1, p), where p is the degree of the polynomial.
- The VC dimension in the case of an RBF kernel is infinite.

How come SVM can have very good generalization performance, even in the case of an infinite VC dimension? Hint: it has to do with the large margin...

Another bound on the generalization error:

  E[P(error)] ≤ E[no. of SVs] / n

where E[no. of SVs] is the expected number of support vectors over all training sets of size n.

Posterior probabilities for SVM

Platt scaling
Idea: apply a logistic transformation to the classifier score (margin):

  P(y = +1 | x) = 1 / (1 + exp(αh(x) + β))

The parameters α and β are found by optimization.
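A rough sketch (not from the slides) of this idea, assuming scikit-learn is available: an SVM is trained on one half of a synthetic dataset, and a one-parameter-per-coefficient logistic model is fitted on the decision scores of a held-out calibration half, playing the role of the (α, β) fit. Platt's original procedure uses a dedicated maximum-likelihood fit of exactly these two parameters; the plain LogisticRegression used here is a stand-in, and scikit-learn's SVC(probability=True) performs a similar calibration internally.

```python
# Sketch of Platt scaling: fit a sigmoid on held-out SVM scores h(x)
# to obtain P(y = +1 | x) ~= 1 / (1 + exp(a*h(x) + b)).
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 400
y = rng.choice([-1, 1], size=n)
X = y[:, None] * [1.0, 0.5] + rng.standard_normal((n, 2))

X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

svm = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
scores_cal = svm.decision_function(X_cal).reshape(-1, 1)

# Logistic regression on the 1-D score plays the role of fitting alpha and beta.
platt = LogisticRegression().fit(scores_cal, y_cal)

x_new = np.array([[1.0, 0.5], [-1.0, -0.5]])
s_new = svm.decision_function(x_new).reshape(-1, 1)
pos_col = list(platt.classes_).index(1)
print("P(y = +1 | x):", platt.predict_proba(s_new)[:, pos_col])
```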
Some remarks
- SVM has a good overall performance on a large number of problems - but it is not the "Swiss army knife" of pattern recognition.
- One key ingredient: choosing the right kernel.
- Another key ingredient: choosing the right formulation of the problem.
- In general, there are a number of parameters (e.g. C and p or σ) that need to be tuned.
- C can be used to re-balance the classes: split C = C+ + C− and assign different weights to each class.
- There also exist support vector regression and support vector density estimation.

3 Other kernel methods

- Why not replace the inner product with kernels in other methods as well? The same reasoning applies, for instance, in regression...
- This leads to Kernel LDA, Kernel PCA, Kernel Perceptron, etc.

Kernel LDA
(Mika et al., Fisher Discriminant Analysis with Kernels, 1999)

Fisher criterion:

  w* = argmax_w  (w^t Sb w) / (w^t Sw w)

Suppose now that this is carried out in the feature space: the means and the scatter matrices are computed on the transformed data.

(Sketch) This can still be expressed in terms of operations in the input space. Let μΦ = (1/n) Σ_i Φ(xi) be the mean in the feature space (with a similar mean for each class). The weight vector has the form w = Σ_i αi Φ(xi), so the product ⟨w, μΦ⟩ is of the form

  ⟨w, μΦ⟩ = (1/n) Σ_{i,j} αj K(xi, xj)
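A numerical check (not from the slides) of this last identity, reusing the explicit quadratic feature map from the kernel-trick example so that Φ can be evaluated directly; the coefficients α are drawn at random purely to verify the identity (in Kernel LDA they would be obtained by solving the kernelized Fisher problem).

```python
# Verify <w, mu_Phi> = (1/n) sum_{i,j} alpha_j K(x_i, x_j)
# for w = sum_j alpha_j Phi(x_j), mu_Phi = (1/n) sum_i Phi(x_i),
# with Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) and K(x, z) = <x, z>^2.
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
n = 20
X = rng.standard_normal((n, 2))
alpha = rng.standard_normal(n)

PhiX = np.array([phi(x) for x in X])     # mapped data, shape (n, 3)
w = alpha @ PhiX                         # w = sum_j alpha_j Phi(x_j)
mu = PhiX.mean(axis=0)                   # mean in the feature space

K = (X @ X.T) ** 2                       # K(x_i, x_j) = <x_i, x_j>^2
lhs = np.dot(w, mu)
rhs = np.sum(K @ alpha) / n              # (1/n) sum_{i,j} K(x_i, x_j) alpha_j
print(lhs, rhs, np.isclose(lhs, rhs))
```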