PA196: Pattern Recognition
3. Linear discriminants (cont'd)
Dr. Vlad Popovici
popovici@recetox.muni.cz
RECETOX, Masaryk University, Brno

Outline
1 Linear Discriminant Analysis (cont'd)
  LDA, QDA, RDA
  LD subspace
  LDA: wrap-up
2 Logistic regression
3 Large margin (linear) classifiers
  Linearly separable case
  Non-linearly separable case: soft margins

Linear Discriminant Analysis (cont'd): LDA, QDA, RDA

LDA
Remember (first lecture):
• Bayes decision: assign x to the class with maximum a posteriori probability
• let there be K classes denoted $g_1, \dots, g_K$, with corresponding priors $P(g_i)$
• the posteriors are
$$P(g_i|x) = \frac{p(x|g_i)P(g_i)}{\sum_{j=1}^{K} p(x|g_j)P(g_j)} \propto p(x|g_i)P(g_i)$$
• the decision function (for class $g_i$ vs. class $g_j$) arises from the log odds-ratio (for example):
$$\log\frac{P(g_i|x)}{P(g_j|x)} = \log\frac{p(x|g_i)}{p(x|g_j)} + \log\frac{P(g_i)}{P(g_j)} \quad \begin{cases} > 0, & \text{predict } g_i \\ < 0, & \text{predict } g_j \end{cases}$$

Under the assumption of Gaussian class-conditional densities,
$$p(x|g) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_g|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_g)^t \Sigma_g^{-1} (x-\mu_g)\right)$$
($|\Sigma|$ is the determinant of the covariance matrix $\Sigma$), the decision function becomes
$$h_{ij}(x) = \log\frac{P(g_i|x)}{P(g_j|x)} = (x^t W_i x + w_i^t x + w_{i0}) - (x^t W_j x + w_j^t x + w_{j0})$$
where $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$ and $w_{i0} = -\frac{1}{2}\mu_i^t \Sigma_i^{-1}\mu_i - \frac{1}{2}\log|\Sigma_i| + \log P(g_i)$.

Simplest LDA
If $\Sigma_i = \Sigma_j = \sigma^2 I$ ("spherical" covariance matrices),
$$h_{ij}(x) = w_{ij}^t(x - x_0)$$
where
$$w_{ij} = \mu_i - \mu_j, \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2}\,\log\frac{P(g_i)}{P(g_j)}\,(\mu_i - \mu_j)$$
[Figure: three class means $\mu_1, \mu_2, \mu_3$ and the pairwise linear boundaries $h_{12}, h_{13}, h_{23}$]

Classical LDA
If all classes share a common covariance matrix, $\Sigma_i = \Sigma$, the decision function becomes
$$h_{ij}(x) = w^t(x - x_0)$$
where
$$w = \Sigma^{-1}(\mu_i - \mu_j), \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{1}{(\mu_i - \mu_j)^t \Sigma^{-1}(\mu_i - \mu_j)}\,\log\frac{P(g_i)}{P(g_j)}\,(\mu_i - \mu_j)$$
[Figure: three class means $\mu_1, \mu_2, \mu_3$ and the pairwise linear boundaries $h_{12}, h_{13}, h_{23}$]
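To make the classical LDA rule concrete, here is a minimal numpy sketch of the pairwise decision function $h_{ij}(x) = w^t(x - x_0)$ for a shared covariance matrix; the means, covariance, and priors are assumed to be given (their estimation is discussed next), and all values in the usage example are purely illustrative.

```python
import numpy as np

def lda_pairwise_decision(x, mu_i, mu_j, Sigma, prior_i, prior_j):
    """Classical LDA decision function h_ij(x) = w^t (x - x0) for a
    common covariance matrix Sigma; h_ij(x) > 0 favours class i."""
    diff = mu_i - mu_j
    w = np.linalg.solve(Sigma, diff)   # w = Sigma^{-1} (mu_i - mu_j)
    # x0 = (mu_i + mu_j)/2 - log(P_i/P_j) / ((mu_i - mu_j)^t Sigma^{-1} (mu_i - mu_j)) * (mu_i - mu_j)
    x0 = 0.5 * (mu_i + mu_j) - np.log(prior_i / prior_j) / (diff @ w) * diff
    return w @ (x - x0)

# toy usage with illustrative parameter values
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
print(lda_pairwise_decision(np.array([1.5, 0.5]), mu1, mu2, Sigma, 0.5, 0.5))
```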
Estimation of LDA parameters
• we are given $\{(x_i, g_i),\ i = 1, \dots, n\}$ with $x_i \in \mathbb{R}^d$ and $g_i \in \{g_1, \dots, g_K\}$
• priors: $\hat{P}(g_i) = n_i/n$, where $n_i$ is the number of elements of class $g_i$ in the training set
• mean vectors: $\hat{\mu}_i = \frac{1}{n_i}\sum_{x \in g_i} x$
• covariance matrix: $\hat{\Sigma} = \sum_{k=1}^{K}\sum_{x \in g_k}(x - \hat{\mu}_k)(x - \hat{\mu}_k)^t / (n - K)$

Quadratic Discriminant Analysis
The class-conditional densities are general Gaussians and the decision function keeps the general (quadratic) form derived above:
$$h_{ij}(x) = \log\frac{P(g_i|x)}{P(g_j|x)} = (x^t W_i x + w_i^t x + w_{i0}) - (x^t W_j x + w_j^t x + w_{j0})$$
where $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$ and $w_{i0} = -\frac{1}{2}\mu_i^t \Sigma_i^{-1}\mu_i - \frac{1}{2}\log|\Sigma_i| + \log P(g_i)$.

LDA and QDA
[Figure: LDA and QDA decision boundaries on a three-class problem; Hastie et al., The Elements of Statistical Learning, chapter 4]
Note: a boundary similar to QDA's could be obtained by applying LDA in an augmented space with axes $x_1, \dots, x_d,\ x_1 x_2, \dots, x_{d-1}x_d,\ x_1^2, \dots, x_d^2$.
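As a small illustration of the LDA/QDA comparison and of the note above, the sketch below (assuming scikit-learn is available) fits LDA, QDA, and LDA on quadratically augmented features to a synthetic three-class dataset; the data and the resulting accuracies are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import PolynomialFeatures
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

# synthetic three-class data with unequal spreads (illustrative only)
X, y = make_blobs(n_samples=600, centers=3, cluster_std=[1.0, 1.5, 0.8],
                  random_state=0)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)

# LDA in the augmented space x1, x2, x1^2, x1*x2, x2^2:
# a linear boundary there is quadratic in the original space
X_aug = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
lda_aug = LinearDiscriminantAnalysis().fit(X_aug, y)

print("LDA accuracy:            ", lda.score(X, y))
print("QDA accuracy:            ", qda.score(X, y))
print("LDA (augmented) accuracy:", lda_aug.score(X_aug, y))
```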
Regularized DA: between LDA and QDA
Combine the pooled covariance with the class-specific covariance matrices, and allow the pooled covariance to be more spherical or more general:
$$\hat{\Sigma}_k(\alpha, \gamma) = \alpha\hat{\Sigma}_k + (1 - \alpha)\left[\gamma\hat{\Sigma} + (1 - \gamma)\hat{\sigma}^2 I\right]$$
• $\alpha = 1$: QDA; $\alpha = 0$: LDA
• $\gamma = 1$: general covariance matrix; $\gamma = 0$: spherical covariance matrix
• $\alpha$ and $\gamma$ are hyperparameters and must be optimized (e.g., by cross-validation)

Implementation of LDA
• use the diagonalization of the covariance matrices (either pooled or class-specific), which are symmetric and positive definite:
$$\Sigma_i = U_i D_i U_i^t$$
where $U_i$ is a $d \times d$ orthonormal matrix and $D_i$ is a diagonal matrix with the eigenvalues $d_{ik} > 0$ on the diagonal
• the ingredients of the decision functions become
$$(x - \mu_i)^t \Sigma_i^{-1}(x - \mu_i) = [U_i^t(x - \mu_i)]^t D_i^{-1}[U_i^t(x - \mu_i)]$$
and $\log|\Sigma_i| = \sum_k \log d_{ik}$

Implementation of LDA, cont'd
A possible 2-step procedure for LDA classification (common covariance matrix $\Sigma = U D U^t$):
1 "sphere" the data: $X^* = D^{-1/2} U^t X$
2 assign a sample $x$ to the closest centroid in the transformed space, modulo the effect of the priors

LD subspace
• the centroids $\mu_i$, $i = 1, \dots, K$, lie in an affine subspace of dimension at most $K - 1 < d$
• any direction orthogonal to this subspace does not influence the classification
• the classification can therefore be carried out in a low-dimensional space, hence we obtain a dimensionality reduction
• the subspace axes can be found sequentially, using Fisher's criterion (find the directions that maximally separate the centroids with respect to the within-class variance)
• this amounts, essentially, to a principal component analysis of the sphered class centroids

1 compute $M$, the $K \times d$ matrix of class centroids (by rows), and the common covariance matrix $W$ (within-class covariance)
2 compute $M^* = M W^{-1/2}$ (using the eigen-decomposition of $W$)
3 compute $B^*$, the covariance matrix of $M^*$ (between-class covariance matrix), and its eigen-decomposition $B^* = V^* D_B V^{*t}$
4 the columns of $V^*$ (ordered from largest to smallest eigenvalue $d_{B_i}$) give the coordinates of the optimal subspaces
• the $i$-th discriminant variable (canonical variable) is given by $Z_i = (W^{-1/2} v_i^*)^t X$ (a small sketch of this procedure is given below)
[Figure: classification in the reduced subspace spanned by the first two canonical coordinates; Hastie et al., The Elements of Statistical Learning, chapter 4]
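A minimal numpy sketch of the four-step procedure above, assuming the pooled within-class covariance is non-singular; function and variable names are chosen for illustration only, and the toy data are random.

```python
import numpy as np

def discriminant_coordinates(X, y, n_comp=2):
    """Project X onto the first n_comp discriminant (canonical) variables:
    sphere the class centroids with W^{-1/2}, then keep the principal
    directions of the sphered centroids."""
    classes = np.unique(y)
    n, d = X.shape
    K = len(classes)

    # step 1: class centroids M (K x d) and pooled within-class covariance W
    M = np.vstack([X[y == c].mean(axis=0) for c in classes])
    W = sum(np.cov(X[y == c], rowvar=False) * (np.sum(y == c) - 1)
            for c in classes) / (n - K)

    # step 2: sphere the centroids, M* = M W^{-1/2}, via the eigen-decomposition of W
    evals, U = np.linalg.eigh(W)
    W_inv_sqrt = U @ np.diag(evals ** -0.5) @ U.T
    M_star = M @ W_inv_sqrt

    # step 3: between-class covariance of the sphered centroids and its eigen-decomposition
    B_star = np.cov(M_star, rowvar=False)
    b_evals, V_star = np.linalg.eigh(B_star)
    order = np.argsort(b_evals)[::-1]          # largest eigenvalues first

    # step 4: discriminant directions in the original space, Z_i = (W^{-1/2} v*_i)^t X
    A = W_inv_sqrt @ V_star[:, order[:n_comp]]
    return X @ A

# toy usage with random three-class data (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(50, 4)) for m in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 50)
print(discriminant_coordinates(X, y, n_comp=2).shape)   # (150, 2)
```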
[Figure: projections of the data onto further pairs of canonical coordinates (coordinate 1 vs. 3, 2 vs. 3, 1 vs. 7, 9 vs. 10) for linear discriminant analysis; Hastie et al., The Elements of Statistical Learning, chapter 4]
LDA: wrap-up
• LDA, FDA and MSE regression with a particular coding of the class labels lead to equivalent solutions (the same separating hyperplane)
• LDA (QDA) is the optimal classifier in the case of Gaussian class-conditional distributions
• LDA can be used to project the data into a lower-dimensional space for visualization
• the LDA derivation assumes Gaussian densities, but the FDA derivation does not
• LDA extends naturally to multiple classes

Logistic regression
Idea: model the log-odds of the posterior probabilities as linear functions of x and ensure the posteriors sum up to 1. For K classes $g_1, \dots, g_K$:
$$\log\frac{P(g_i|x)}{P(g_K|x)} = \langle w_i, x\rangle + w_{i0}, \quad \forall i = 1, \dots, K-1$$
which leads to
$$P(g_i|x) = \frac{\exp(\langle w_i, x\rangle + w_{i0})}{1 + \sum_{j=1}^{K-1}\exp(\langle w_j, x\rangle + w_{j0})}, \quad i = 1, \dots, K-1$$
$$P(g_K|x) = \frac{1}{1 + \sum_{j=1}^{K-1}\exp(\langle w_j, x\rangle + w_{j0})}$$
• the transformation $p \mapsto \log[p/(1-p)]$ is called the logit transform
• the choice of the reference class (K in our case) is purely a convention
• the set of parameters of the model: $\theta = \{w_1, w_{10}, \dots, w_{K-1}, w_{K-1,0}\}$
• the log-likelihood is $L(\theta) = \sum_{i=1}^{n}\log P(g_i|x_i; \theta)$

For the binary case (K = 2), take the classes to be encoded in response variables $y_i$: $y_i = 0$ for class $g_1$ and $y_i = 1$ for class $g_2$.
• a single posterior probability is needed:
$$P(y = 1|x) = \frac{\exp(\langle w, x\rangle + w_0)}{1 + \exp(\langle w, x\rangle + w_0)}$$
• the log-likelihood function becomes
$$L(\theta = \{w, w_0\}) = \sum_{i=1}^{n}\left[y_i(\langle w, x_i\rangle + w_0) - \log(1 + \exp(\langle w, x_i\rangle + w_0))\right]$$
• using $z_i = [1, x_i]$ and $a = [w_0, w]$,
$$L(a) = \sum_{i=1}^{n}\left[y_i\langle a, z_i\rangle - \log(1 + \exp(\langle a, z_i\rangle))\right]$$
• objective: find $a^* = \arg\max_a L(a)$
• the gradient is $\frac{\partial L(a)}{\partial a} = \sum_{i=1}^{n} z_i\,(y_i - P(y_i = 1|z_i))$
• at a (local) extremum, $\frac{\partial L(a)}{\partial a} = 0$, which leads to a system of (nonlinear) equations to be solved for $a$
• the solution can be found by a Newton-Raphson procedure (iteratively re-weighted least squares); a small sketch follows the remarks below

A few remarks on logistic regression:
• brings the tools of linear regression to pattern recognition
• can be used to identify those input variables that explain the output
• its predictions can be interpreted as posterior probabilities
• by introducing a penalty term, variable selection can be embedded into the model construction - we'll see it later!
• both LDA and logistic regression use a linear form for the log-posterior odds $\log(P(g_i|x)/P(g_K|x))$; LDA arrives at this form by assuming Gaussian class-conditional densities, while logistic regression assumes only that the log-posterior odds are linear
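The Newton-Raphson update mentioned above can be written compactly as iteratively re-weighted least squares. Below is a minimal numpy sketch for the binary case, without safeguards such as step-size control or regularization; the toy data are illustrative.

```python
import numpy as np

def logreg_irls(X, y, n_iter=25, tol=1e-8):
    """Binary logistic regression by Newton-Raphson / IRLS.
    X: (n, d) inputs, y: (n,) labels in {0, 1}. Returns a = [w0, w]."""
    n, d = X.shape
    Z = np.hstack([np.ones((n, 1)), X])      # z_i = [1, x_i]
    a = np.zeros(d + 1)                      # a = [w0, w]
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ a))     # P(y_i = 1 | z_i)
        grad = Z.T @ (y - p)                 # dL/da = sum_i z_i (y_i - p_i)
        Wdiag = p * (1.0 - p)                # Newton weights
        H = Z.T @ (Z * Wdiag[:, None])       # Z^t W Z (curvature)
        step = np.linalg.solve(H, grad)      # Newton step
        a += step
        if np.linalg.norm(step) < tol:
            break
    return a

# toy usage with two overlapping Gaussian classes (illustrative only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
y = np.repeat([0, 1], 100)
print(logreg_irls(X, y))
```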
Large margin (linear) classifiers
• there are theoretical considerations justifying the goal of maximizing the margin achieved by the separating hyperplane
• intuitively, the larger the margin, the more "room" for noise in the data and, hence, the better the generalization
• let a training set be $\{(x_i, y_i),\ i = 1, \dots, n\}$ with $y_i = \pm 1$
• the margin of a point $x_i$ w.r.t. the boundary function $h$ is $\gamma_i = y_i h(x_i)$
• it can be shown that the generalization error of $h$ is upper bounded by a function of $\min_i \gamma_i$ (however, the bound might not be tight)

Linearly separable case
• consider the dataset $\{(x_i, y_i),\ i = 1, \dots, n\}$ to be linearly separable, i.e. $\gamma_i > 0$ for all $i$
• we will consider linear classifiers $h(x) = \langle w, x\rangle + w_0$ (with the predicted class being $\mathrm{sign}(h(x))$)
• if the (functional) margin achieved is 1, then $\gamma_i \geq 1$
• then, the geometric margin is the normalized functional margin $1/\|w\|$, hence:

Proposition. The hyperplane $(w, w_0)$ that solves the optimization problem
$$\text{minimize}_{w, w_0}\ \frac{1}{2}\langle w, w\rangle \quad \text{subject to}\ y_i(\langle w, x_i\rangle + w_0) \geq 1,\ i = 1, \dots, n$$
realizes the maximal margin hyperplane with geometric margin $\gamma = 1/\|w\|$. (A small numerical check of this primal problem is sketched after the discussion of duality below.)

Solving the constrained optimization problem:
• let the objective function be $f(w)$ and the equality constraints $h_i(w) = 0$ for $i = 1, \dots, m$; then the Lagrangian function is
$$L(w, \beta) = f(w) + \sum_{i=1}^{m}\beta_i h_i(w)$$
• a necessary and sufficient condition for $w^*$ to be a solution of the optimization problem ($f$ continuous and convex, $h_i$ continuous and differentiable) is
$$\frac{\partial L(w^*, \beta^*)}{\partial w} = 0, \qquad \frac{\partial L(w^*, \beta^*)}{\partial \beta} = 0$$
for some values of $\beta^*$.

For a constrained optimization with a domain $\Omega \subseteq \mathbb{R}^n$:
$$\text{minimize}_{w}\ f(w) \quad \text{subject to}\ g_i(w) \leq 0,\ i = 1, \dots, k; \quad h_i(w) = 0,\ i = 1, \dots, m$$
the Lagrangian function has the form
$$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k}\alpha_i g_i(w) + \sum_{i=1}^{m}\beta_i h_i(w)$$
with $\alpha_i$ and $\beta_i$ being the Lagrange multipliers.

Karush-Kuhn-Tucker (KKT) optimality conditions for a convex optimization problem: for a solution $w^*$ and the corresponding multipliers $\alpha^*$ and $\beta^*$,
$$\frac{\partial L}{\partial w} = 0, \quad \frac{\partial L}{\partial \beta} = 0, \quad \alpha_i^* g_i(w^*) = 0, \quad g_i(w^*) \leq 0, \quad \alpha_i^* \geq 0$$
• for active constraints ($g_i(w) = 0$), $\alpha_i > 0$; for inactive constraints ($g_i(w) < 0$), $\alpha_i = 0$
• $\alpha_i$ can be seen as the sensitivity of $f$ to the active constraint

Duality of convex optimization:
• the solution is a saddle point of the Lagrangian
• $w$ are the primal variables
• the Lagrange multipliers are the dual variables
• solving the dual optimization problem may be simpler: the Lagrange multipliers are the main variables, so set to 0 the derivatives w.r.t. $w$ and substitute the result into the Lagrangian
• the resulting function contains only the dual variables and must be maximized under simpler constraints
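Before deriving the dual for the maximal-margin problem, the Proposition above can be checked numerically: the primal problem is a small quadratic program that a generic convex solver can handle. The sketch below assumes the cvxpy package (not used elsewhere in these notes) and a linearly separable toy dataset.

```python
import numpy as np
import cvxpy as cp

# linearly separable toy data (illustrative only)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2.0, 0.5, (40, 2)), rng.normal(2.0, 0.5, (40, 2))])
y = np.repeat([-1.0, 1.0], 40)

# primal problem: minimize (1/2)<w, w>  subject to  y_i (<w, x_i> + w0) >= 1
w = cp.Variable(2)
w0 = cp.Variable()
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + w0) >= 1])
problem.solve()

print("w* =", w.value, " w0* =", w0.value)
print("geometric margin 1/||w*|| =", 1.0 / np.linalg.norm(w.value))
```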
...and back to our initial problem:
• the primal Lagrangian is
$$L(w, w_0, \alpha) = \frac{1}{2}\langle w, w\rangle - \sum_{i=1}^{n}\alpha_i\left[y_i(\langle w, x_i\rangle + w_0) - 1\right]$$
• from the KKT conditions, $w = \sum_i y_i\alpha_i x_i$ and $\sum_i y_i\alpha_i = 0$
• which leads to the dual Lagrangian
$$L(w, w_0, \alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j y_i y_j\alpha_i\alpha_j\langle x_i, x_j\rangle$$

Proposition. If $\alpha^*$ is the solution of the quadratic optimization problem
$$\text{maximize}\ W(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} y_i y_j\alpha_i\alpha_j\langle x_i, x_j\rangle$$
$$\text{subject to}\ \sum_{i=1}^{n}\alpha_i y_i = 0, \quad \alpha_i \geq 0,\ i = 1, \dots, n$$
then the vector $w^* = \sum_i y_i\alpha_i^* x_i$ realizes the maximal margin hyperplane with geometric margin $1/\|w^*\|$.

• in the dual formulation, $w_0^*$ still needs to be specified, e.g.
$$w_0^* = -\frac{1}{2}\left[\max_{y_i = -1}\{\langle w^*, x_i\rangle\} + \min_{y_i = 1}\{\langle w^*, x_i\rangle\}\right]$$
• from the KKT conditions, $\alpha_i^*\left[y_i(\langle w^*, x_i\rangle + w_0^*) - 1\right] = 0$, so $\alpha_i^* > 0$ only for the $x_i$ lying on the margin
• those $x_i$ for which $\alpha_i > 0$ are called support vectors
• the optimal hyperplane is a linear combination of the support vectors:
$$h(x) = \sum_{i\in SV} y_i\alpha_i^*\langle x_i, x\rangle + w_0^*$$
[Figure: separating hyperplane with the margin boundaries $\langle w, x\rangle + w_0 = 1$ and $\langle w, x\rangle + w_0 = -1$]
• the margin achieved is
$$\gamma = \frac{1}{\|w^*\|} = \left(\sum_{i\in SV}\alpha_i^*\right)^{-1/2}$$
• a (leave-one-out) estimate of the generalization error is bounded by the proportion of support vectors in the training sample, $\#SV/n$

Non-linearly separable case: soft margins

L2-norm soft margins
• introduce the slack variables $\xi_i$ and allow "softer" margins:
$$\text{minimize}_{w, w_0, \xi}\ \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^{n}\xi_i^2, \quad \text{subject to}\ y_i(\langle w, x_i\rangle + w_0) \geq 1 - \xi_i,\ \ \xi_i \geq 0,\ i = 1, \dots, n$$
for some $C > 0$
• theory suggests the optimal choice $C = 1/\max_i\{\|x_i\|^2\}$, but in practice $C$ is selected by testing various values
• the problem is solved in the dual space and the margin achieved is
$$\left(\sum_{i\in SV}\alpha_i^* - \|\alpha^*\|^2/C\right)^{-1/2}$$

L1-norm soft margins
• optimization problem:
$$\text{minimize}_{w, w_0, \xi}\ \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^{n}\xi_i, \quad \text{subject to}\ y_i(\langle w, x_i\rangle + w_0) \geq 1 - \xi_i,\ \ \xi_i \geq 0,\ i = 1, \dots, n$$
for some $C > 0$
• this results in "box constraints" on $\alpha_i$: $0 \leq \alpha_i \leq C$
• non-zero slack variables correspond to $\alpha_i = C$ and to points with geometric margin less than $1/\|w\|$ (a short numerical illustration is given after the wrap-up)

Wrap-up
• LDA and MSE-based methods lead to similar solutions, even though they are derived under different assumptions
• LDA (and FDA) assign a vector $x$ to the closest centroid in the transformed space
• logistic regression and LDA both model the log-posterior odds as a linear function of $x$
• the predicted values from logistic regression can be interpreted as posterior probabilities
• margin optimization provides an alternative approach to constructing linear classifiers
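To close, a short scikit-learn sketch (assuming the library is available) fits a 1-norm soft-margin linear SVM and reads off the support vectors and the achieved margin $1/\|w\|$; the data and the value of C are illustrative, and in practice C would be chosen by cross-validation.

```python
import numpy as np
from sklearn.svm import SVC

# overlapping two-class toy data (illustrative only)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
y = np.repeat([-1, 1], 100)

# 1-norm soft margin linear SVM: box constraints 0 <= alpha_i <= C
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_.ravel()
n_sv = clf.support_vectors_.shape[0]
print("number of support vectors:", n_sv)
print("geometric margin 1/||w|| =", 1.0 / np.linalg.norm(w))
print("proportion of support vectors (#SV/n):", n_sv / len(y))
```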