PA196: Pattern Recognition
03. Linear discriminants
Dr. Vlad Popovici, popovici@iba.muni.cz
Institute of Biostatistics and Analyses, Masaryk University, Brno

Outline
1 Linear Discriminant Analysis (cont'd): LDA, QDA, RDA; LD subspace; LDA: wrap-up
2 Logistic regression
3 Large margin (linear) classifiers: linearly separable case; soft margins (non-linearly separable case)

LDA
Remember (first lecture):
- Bayes decision: assign x to the class with maximum a posteriori probability
- let there be K classes denoted $g_1, \dots, g_K$, with corresponding priors $P(g_i)$
- the posteriors are
$$P(g_i|x) = \frac{p(x|g_i)P(g_i)}{\sum_{k=1}^{K} p(x|g_k)P(g_k)} \propto p(x|g_i)P(g_i)$$
- decision functions (for class $g_i$ vs class $g_j$) arise from log odds-ratios (for example):
$$\log\frac{P(g_i|x)}{P(g_j|x)} = \log\frac{p(x|g_i)}{p(x|g_j)} + \log\frac{P(g_i)}{P(g_j)} \quad \begin{cases} > 0, & \text{predict } g_i \\ < 0, & \text{predict } g_j \end{cases}$$

Under the assumption of Gaussian class-conditional densities,
$$p(x|g) = \frac{1}{(2\pi)^{d/2}|\Sigma_g|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_g)^t \Sigma_g^{-1} (x-\mu_g)\right)$$
($|\Sigma|$ is the determinant of the covariance matrix $\Sigma$), the decision function becomes
$$h_{ij}(x) = \log\frac{P(g_i|x)}{P(g_j|x)} = (x^t W_i x + w_i^t x + w_{i0}) - (x^t W_j x + w_j^t x + w_{j0})$$
where $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$ and $w_{i0} = -\frac{1}{2}\mu_i^t \Sigma_i^{-1} \mu_i - \frac{1}{2}\log|\Sigma_i| + \log P(g_i)$.

Simplest LDA
If $\Sigma_i = \Sigma_j = \sigma^2 I$ ("spherical" covariance matrices),
$$h_{ij}(x) = w_{ij}^t (x - x_0)$$
where
$$w_{ij} = \mu_i - \mu_j, \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \log\frac{P(g_i)}{P(g_j)} (\mu_i - \mu_j)$$

[Figure: three class centroids $\mu_1, \mu_2, \mu_3$ separated by the linear boundaries $h_{12}$, $h_{13}$, $h_{23}$]

Classical LDA
If all classes share a common covariance matrix, $\Sigma_i = \Sigma$, the decision function becomes
$$h_{ij}(x) = w^t (x - x_0)$$
where
$$w = \Sigma^{-1}(\mu_i - \mu_j), \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{1}{(\mu_i - \mu_j)^t \Sigma^{-1} (\mu_i - \mu_j)} \log\frac{P(g_i)}{P(g_j)} (\mu_i - \mu_j)$$

[Figure: the same three centroids with the boundaries $h_{12}$, $h_{13}$, $h_{23}$ tilted by the common covariance]
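To make the classical LDA rule concrete, here is a minimal NumPy sketch of the pairwise decision function $h_{ij}(x) = w^t(x - x_0)$; it is an illustration, not part of the original slides, and assumes the means, common covariance and priors are already known.

```python
import numpy as np

def lda_decision(x, mu_i, mu_j, Sigma, p_i, p_j):
    """Classical LDA decision h_ij(x) = w^t (x - x0); > 0 predicts g_i."""
    Sigma_inv = np.linalg.inv(Sigma)
    d = mu_i - mu_j
    w = Sigma_inv @ d
    x0 = 0.5 * (mu_i + mu_j) - np.log(p_i / p_j) / (d @ Sigma_inv @ d) * d
    return w @ (x - x0)
```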
Estimation of LDA parameters
We are given $\{(x_i, g_i),\ i = 1, \dots, n\}$ with $x_i \in \mathbb{R}^d$ and $g_i \in \{g_1, \dots, g_K\}$.
- priors: $\hat P(g_i) = n_i/n$, where $n_i$ is the number of elements of class $g_i$ in the training set
- mean vectors: $\hat\mu_i = \frac{1}{n_i}\sum_{x \in g_i} x$
- covariance matrix: $\hat\Sigma = \sum_{k=1}^{K} \sum_{x \in g_k} (x - \hat\mu_k)(x - \hat\mu_k)^t / (n - K)$
(a code sketch of these estimators follows below)

Quadratic Discriminant Analysis
The class-conditional probabilities are general Gaussians and the decision function keeps its quadratic form:
$$h_{ij}(x) = \log\frac{P(g_i|x)}{P(g_j|x)} = (x^t W_i x + w_i^t x + w_{i0}) - (x^t W_j x + w_j^t x + w_{j0})$$
where $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$ and $w_{i0} = -\frac{1}{2}\mu_i^t \Sigma_i^{-1} \mu_i - \frac{1}{2}\log|\Sigma_i| + \log P(g_i)$.

LDA and QDA
[Figure: LDA (left) vs QDA (right) decision boundaries on a three-class dataset. Hastie et al.: The Elements of Statistical Learning, chpt. 4]
Note: a boundary similar to QDA's could be obtained by applying LDA in an augmented space with axes $x_1, \dots, x_d, x_1x_2, \dots, x_{d-1}x_d, x_1^2, \dots, x_d^2$.
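A minimal NumPy sketch of the plug-in estimators above (our illustration; the name fit_lda and the data layout, an $n \times d$ array X with integer labels y, are assumptions):

```python
import numpy as np

def fit_lda(X, y):
    """Plug-in estimates: priors n_i/n, class means, pooled covariance."""
    classes = np.unique(y)
    n, d = X.shape
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    Sigma = np.zeros((d, d))
    for k, mu in zip(classes, means):
        Xc = X[y == k] - mu
        Sigma += Xc.T @ Xc                  # within-class scatter
    Sigma /= n - len(classes)               # pooled covariance, /(n - K)
    return priors, means, Sigma
```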
Regularized DA: between LDA and QDA
Combine the pooled covariance with the class-specific covariance matrices, and allow the pooled covariance to be more spherical or more general:
$$\hat\Sigma_k(\alpha, \gamma) = \alpha \hat\Sigma_k + (1 - \alpha)\left[\gamma \hat\Sigma + (1 - \gamma)\hat\sigma^2 I\right]$$
- $\alpha = 1$: QDA; $\alpha = 0$: LDA
- $\gamma = 1$: general covariance matrix; $\gamma = 0$: spherical covariance matrix
- $\alpha$ and $\gamma$ must be optimized

Implementation of LDA
Use the diagonalization of the covariance matrices (either pooled or class-specific), which are symmetric and positive definite:
$$\Sigma_i = U_i D_i U_i^t$$
where $U_i$ is a $d \times d$ orthonormal matrix and $D_i$ is a diagonal matrix with eigenvalues $d_{ik} > 0$ on the diagonal. The ingredients for the decision functions become
$$(x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) = [U_i^t(x - \mu_i)]^t D_i^{-1} [U_i^t(x - \mu_i)]$$
and $\log|\Sigma_i| = \sum_k \log d_{ik}$.

Implementation of LDA, cont'd
A possible 2-step procedure for LDA classification (common covariance matrix $\Sigma = UDU^t$):
1 "sphere" the data: $X^* = D^{-1/2} U^t X$
2 assign a sample x to the closest centroid in the transformed space, modulo the effect of the priors
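The regularized covariance and the two-step procedure can be sketched as follows (our illustration; the function names are ours and the parameters are assumed to be pre-estimated):

```python
import numpy as np

def rda_cov(Sigma_k, Sigma, sigma2, alpha, gamma):
    """RDA blend of class-specific, pooled and spherical covariances."""
    d = Sigma.shape[0]
    return alpha * Sigma_k + (1 - alpha) * (
        gamma * Sigma + (1 - gamma) * sigma2 * np.eye(d))

def sphere_and_classify(X, means, Sigma, priors):
    """Sphere with Sigma = U D U^t, then assign each row of X to the
    closest centroid in the transformed space, corrected for priors."""
    evals, U = np.linalg.eigh(Sigma)        # Sigma = U diag(evals) U^t
    T = U / np.sqrt(evals)                  # T = U D^{-1/2}
    Xs, Ms = X @ T, means @ T               # sphered data and centroids
    d2 = ((Xs[:, None, :] - Ms[None, :, :]) ** 2).sum(axis=2)
    return (-0.5 * d2 + np.log(priors)).argmax(axis=1)
```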
LD subspace
- the centroids $\mu_i$, $i = 1, \dots, K$, lie in an affine subspace of dimension at most $K - 1 < d$
- any direction orthogonal to this subspace does not influence the classification
- the classification is carried out in a low-dimensional space, hence we have a dimensionality reduction
- the subspace axes can be found sequentially, using Fisher's criterion (find directions that maximally separate the centroids with respect to the variance)
- this is essentially a principal component analysis of the (sphered) centroids

Procedure:
1 compute M, the $K \times d$ matrix of class centroids (by rows), and the common covariance matrix W (within-class covariance)
2 compute $M^* = M W^{-1/2}$ (using the eigen-decomposition of W)
3 compute $B^*$, the covariance matrix of $M^*$ (between-class covariance matrix), and its eigen-decomposition $B^* = V^* D_B V^{*t}$
4 the columns of $V^*$ (ordered from largest to smallest eigenvalue $d_{Bi}$) give the coordinates of the optimal subspaces; the i-th discriminant variable (canonical variable) is $Z_i = (W^{-1/2} v_i^*)^t X$

[Figure: classification in the reduced subspace spanned by the first two canonical coordinates. Hastie et al.: The Elements of Statistical Learning, chpt. 4]
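A direct transcription of steps 1-4 (a sketch under the assumption that the centroid matrix and the within-class covariance are given; the names are ours):

```python
import numpy as np

def canonical_variables(M, W):
    """Discriminant (canonical) directions from centroids M (K x d)
    and within-class covariance W; returns the d x d matrix A whose
    i-th column gives Z_i = A[:, i]^t x."""
    evals, U = np.linalg.eigh(W)
    W_inv_sqrt = (U / np.sqrt(evals)) @ U.T      # W^{-1/2}, symmetric
    M_star = M @ W_inv_sqrt                      # whitened centroids
    B_star = np.cov(M_star, rowvar=False)        # between-class covariance
    dB, V_star = np.linalg.eigh(B_star)
    order = np.argsort(dB)[::-1]                 # largest eigenvalue first
    return W_inv_sqrt @ V_star[:, order]         # columns: W^{-1/2} v*_i
```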
[Figure: pairwise scatter plots of selected canonical coordinates (coordinate 1 vs 3, 2 vs 3, 1 vs 7, 9 vs 10), with class centroids highlighted. Hastie et al.: The Elements of Statistical Learning, chpt. 4]
LDA: wrap-up
- LDA, FDA and MSE regression, with a particular coding of the class labels, lead to equivalent solutions (the same separating hyperplane)
- LDA (QDA) is the optimal classifier in the case of Gaussian class-conditional distributions
- LDA can be used to project the data into a lower-dimensional space for visualization
- the LDA derivation assumes Gaussian densities, but FDA does not
- LDA is naturally extended to multiple classes

Logistic regression
Idea: model the posterior probabilities as linear functions in x and ensure they sum up to 1. For K classes $g_1, \dots, g_K$:
$$\log\frac{P(g_i|x)}{P(g_K|x)} = \langle w_i, x\rangle + w_{i0}, \quad \forall i = 1, \dots, K-1$$
which leads to
$$P(g_i|x) = \frac{\exp(\langle w_i, x\rangle + w_{i0})}{1 + \sum_{j=1}^{K-1}\exp(\langle w_j, x\rangle + w_{j0})}, \quad i = 1, \dots, K-1$$
$$P(g_K|x) = \frac{1}{1 + \sum_{j=1}^{K-1}\exp(\langle w_j, x\rangle + w_{j0})}$$

- the transformation $p \mapsto \log[p/(1-p)]$ is called the logit transform
- the choice of the reference class (K in our case) is purely a convention
- the set of parameters of the model: $\theta = \{w_1, w_{10}, \dots, w_{K-1}, w_{K-1,0}\}$
- the log-likelihood is $L(\theta) = \sum_{i=1}^n \log P(g_i|x_i; \theta)$
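A small sketch of the posterior computation above (our illustration; Ws is assumed to be a $(K-1) \times d$ array of weight vectors and w0s the corresponding intercepts):

```python
import numpy as np

def posteriors(x, Ws, w0s):
    """P(g_i|x), i = 1..K, with class K as the reference class."""
    e = np.exp(Ws @ x + w0s)                 # K-1 linear scores
    denom = 1.0 + e.sum()
    return np.append(e, 1.0) / denom         # sums to 1
```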
Binary case (K = 2): take the classes to be encoded in response variables $y_i$: $y_i = 0$ for class $g_1$ and $y_i = 1$ for class $g_2$.
- a single posterior probability is needed; with this coding,
$$P(y = 1|x) = \frac{\exp(\langle w, x\rangle + w_0)}{1 + \exp(\langle w, x\rangle + w_0)}$$
- the log-likelihood function becomes
$$L(\theta = \{w, w_0\}) = \sum_{i=1}^n \left[y_i(\langle w, x_i\rangle + w_0) - \log(1 + \exp(\langle w, x_i\rangle + w_0))\right]$$
- using $z = [1, x]$ and $a = [w_0, w]$,
$$L(a) = \sum_{i=1}^n \left[y_i\langle a, z_i\rangle - \log(1 + \exp(\langle a, z_i\rangle))\right]$$

- objective: find $a^* = \arg\max_a L(a)$, with
$$\frac{\partial L(a)}{\partial a} = \sum_{i=1}^n z_i\big(y_i - P(y_i = 1|z_i)\big)$$
- at a (local) extremum, $\partial L(a)/\partial a = 0$, which is the system of equations to be solved for a
- the solution can be found by a Newton-Raphson procedure (iteratively reweighted least squares); a sketch follows the remarks below

A few remarks on logistic regression:
- brings the tools from linear regression to pattern recognition problems
- can be used to identify those input variables that explain the output
- its predictions can be interpreted as posterior probabilities
- by introducing a penalty term, variable selection can be embedded into the model construction (we'll see it later!)
- both LDA and logistic regression use a linear form for the log-posterior odds $\log(P(g_i|x)/P(g_K|x))$; LDA additionally assumes Gaussian class-conditional densities, while logistic regression assumes only that the log-posterior odds are linear
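A compact Newton-Raphson (IRLS) sketch for the binary case (our illustration, not the lecture's reference implementation): the update is $a \leftarrow a + (Z^t R Z)^{-1} Z^t(y - p)$ with $R = \mathrm{diag}(p_i(1 - p_i))$.

```python
import numpy as np

def logreg_irls(X, y, n_iter=25):
    """Binary logistic regression by Newton-Raphson / IRLS.
    X: n x d data, y in {0, 1}; returns a = [w0, w]."""
    Z = np.hstack([np.ones((len(X), 1)), X])     # z_i = [1, x_i]
    a = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ a))         # P(y = 1 | z)
        grad = Z.T @ (y - p)                     # dL/da
        H = Z.T @ (Z * (p * (1 - p))[:, None])   # Z^t R Z
        a += np.linalg.solve(H, grad)            # Newton step
    return a
```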
Large margin (linear) classifiers
- there are theoretical considerations that justify the goal of maximizing the margin achieved by the separating hyperplane
- intuitively, the larger the margin, the more "room" for noise in the data and, hence, the better the generalization
- let a training set be $\{(x_i, y_i),\ i = 1, \dots, n\}$ with $y_i = \pm 1$
- the margin of a point $x_i$ with respect to the boundary function h is $\gamma_i = y_i h(x_i)$
- it can be shown that the error attained by h is upper bounded by a function of $\min_i \gamma_i$ (however, the bound might not be tight)

Linearly separable case
- consider the dataset $\{(x_i, y_i),\ i = 1, \dots, n\}$ to be linearly separable, i.e. $\gamma_i > 0$ for all i
- we will consider linear classifiers $h(x) = \langle w, x\rangle + w_0$ (with the predicted class $\mathrm{sign}(h(x))$)
- if the (functional) margin achieved is 1, then $\gamma_i \ge 1$; the geometric margin is the normalized functional margin $1/\|w\|$, hence:

Proposition. The hyperplane $(w^*, w_0^*)$ that solves the optimization problem
$$\operatorname{minimize}_{w, w_0}\ \frac{1}{2}\langle w, w\rangle, \quad \text{subject to } y_i(\langle w, x_i\rangle + w_0) \ge 1,\ i = 1, \dots, n$$
realizes the maximal margin hyperplane, with geometric margin $\gamma = 1/\|w^*\|$.

Solving the constrained optimization:
- if the objective function is f(w) and the equality constraints are $h_i(w) = 0$, $i = 1, \dots, m$, the Lagrangian function is
$$L(w, \beta) = f(w) + \sum_{i=1}^m \beta_i h_i(w)$$
- a necessary and sufficient condition for $w^*$ to be a solution (f continuous and convex, $h_i$ continuous and differentiable) is
$$\frac{\partial L(w^*, \beta^*)}{\partial w} = 0, \qquad \frac{\partial L(w^*, \beta^*)}{\partial \beta} = 0$$
for some values of $\beta^*$

For a constrained optimization with a domain $\Omega \subseteq \mathbb{R}^n$,
$$\operatorname{minimize}\ f(w),\ w \in \Omega, \quad \text{subject to } g_i(w) \le 0,\ i = 1, \dots, k; \quad h_i(w) = 0,\ i = 1, \dots, m,$$
the Lagrangian function has the form
$$L(w, \alpha, \beta) = f(w) + \sum_i \alpha_i g_i(w) + \sum_j \beta_j h_j(w)$$
with $\alpha_i$ and $\beta_j$ being the Lagrange multipliers.

Karush-Kuhn-Tucker (KKT) optimality conditions for a convex optimization problem: for a solution $w^*$ and corresponding multipliers $\alpha^*$ and $\beta^*$,
$$\frac{\partial L}{\partial w} = 0, \quad \frac{\partial L}{\partial \beta} = 0, \quad \alpha_i^* g_i(w^*) = 0, \quad g_i(w^*) \le 0, \quad \alpha_i^* \ge 0$$
- for active constraints ($g_i(w) = 0$), $\alpha_i > 0$; for inactive constraints ($g_i(w) < 0$), $\alpha_i = 0$
- $\alpha_i$ can be seen as the sensitivity of f to the active constraint

Duality of convex optimization:
- the solution is a saddle point; w are the primal variables, the Lagrange multipliers are the dual variables
- solving the dual optimization may be simpler: the Lagrange multipliers become the main variables, so set to 0 the derivatives with respect to w and substitute the result into the Lagrangian
- the resulting function contains only the dual variables and must be maximized under simpler constraints

...and back to our initial problem: the primal Lagrangian is
$$L(w, w_0, \alpha) = \frac{1}{2}\langle w, w\rangle - \sum_{i=1}^n \alpha_i\left[y_i(\langle w, x_i\rangle + w_0) - 1\right]$$
From the KKT conditions,
$$w = \sum_{i=1}^n y_i \alpha_i x_i \quad \text{and} \quad \sum_{i=1}^n y_i \alpha_i = 0,$$
which leads to the dual Lagrangian
$$L(w, w_0, \alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle$$

Proposition. If $\alpha^*$ is the solution of the quadratic problem
$$\operatorname{maximize}\ W(\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle$$
$$\text{subject to } \sum_{i=1}^n \alpha_i y_i = 0, \quad \alpha_i \ge 0,\ i = 1, \dots, n,$$
then the vector $w^* = \sum_{i=1}^n y_i \alpha_i^* x_i$ realizes the maximal margin hyperplane with geometric margin $1/\|w^*\|$.
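For small problems the dual QP can be solved directly; below is a minimal sketch using SciPy's general-purpose SLSQP solver (our illustration; a dedicated QP solver would be preferable in practice):

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_svm(X, y):
    """Maximal margin classifier via the dual QP.
    X: n x d data, y in {-1, +1}; returns (w, w0)."""
    y = y.astype(float)
    n = len(y)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)    # y_i y_j <x_i, x_j>
    fun = lambda a: 0.5 * a @ Q @ a - a.sum()    # minimize -W(alpha)
    jac = lambda a: Q @ a - np.ones(n)
    cons = {'type': 'eq', 'fun': lambda a: a @ y, 'jac': lambda a: y}
    res = minimize(fun, np.zeros(n), jac=jac, method='SLSQP',
                   bounds=[(0.0, None)] * n, constraints=[cons])
    alpha = res.x
    w = (alpha * y) @ X                          # w* = sum_i y_i alpha_i x_i
    sv = alpha > 1e-6                            # support vectors
    w0 = np.mean(y[sv] - X[sv] @ w)              # from y_i(<w, x_i> + w0) = 1
    return w, w0
```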
- in the dual formulation, $w_0^*$ still needs to be specified:
$$w_0^* = -\frac{1}{2}\left[\max_{y_i = -1}\{\langle w^*, x_i\rangle\} + \min_{y_i = 1}\{\langle w^*, x_i\rangle\}\right]$$
- from the KKT conditions, $\alpha_i^*\left[y_i(\langle w^*, x_i\rangle + w_0^*) - 1\right] = 0$, so $\alpha_i^* > 0$ only for the $x_i$ lying on the margin
- those $x_i$ for which $\alpha_i^* > 0$ are called support vectors
- the optimal hyperplane is a linear combination of support vectors:
$$h(x) = \sum_{i \in SV} y_i \alpha_i^* \langle x_i, x\rangle + w_0^*$$

[Figure: maximal margin hyperplane with the margin hyperplanes $\langle w, x\rangle + w_0 = 1$ and $\langle w, x\rangle + w_0 = -1$; the support vectors lie on the margins]

- the margin achieved is
$$\gamma = \frac{1}{\|w^*\|} = \left(\sum_{i \in SV} \alpha_i^*\right)^{-1/2}$$
- a (leave-one-out) estimate of the generalization error is the proportion of support vectors in the training set, $|SV|/n$

2-Norm soft margin
Introduce the slack variables $\xi_i$ and allow "softer" margins:
$$\operatorname{minimize}_{w, w_0, \xi}\ \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^n \xi_i^2, \quad \text{subject to } y_i(\langle w, x_i\rangle + w_0) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \dots, n$$
for some $C > 0$.
- theory suggests an optimal choice for C: $1/\max_i\{\|x_i\|^2\}$; in practice, C is selected by testing various values
- the problem is solved in the dual space and the margin achieved is $\left(\sum_{i \in SV} \alpha_i^* - \|\alpha^*\|^2/C\right)^{-1/2}$

1-Norm soft margin
Optimization problem:
$$\operatorname{minimize}_{w, w_0, \xi}\ \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^n \xi_i, \quad \text{subject to } y_i(\langle w, x_i\rangle + w_0) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \dots, n$$
for some $C > 0$.
- this results in "box constraints" on $\alpha_i$: $0 \le \alpha_i \le C$
- non-zero slack variables correspond to $\alpha_i = C$ and to points with geometric margin less than $1/\|w\|$

Wrap-up
- LDA and MSE-based methods lead to similar solutions, even though they are derived under different assumptions
- LDA (and FDA) assign a vector x to the closest centroid, in a transformed space
- logistic regression and LDA both model the log-posterior odds as linear functions
- the predicted values from logistic regression can be interpreted as posterior probabilities
- margin optimization provides an alternative approach to constructing linear classifiers
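A closing practical note (our addition): the 1-norm soft margin keeps the same dual as the hard-margin sketch above, only the bounds change from $[0, \infty)$ to the box $[0, C]$ (replace bounds=[(0.0, None)] * n by bounds=[(0.0, C)] * n). Off-the-shelf implementations expose C directly, for example in scikit-learn:

```python
from sklearn.svm import SVC

# X, y: training data as above (assumed available); C trades margin vs. slack
clf = SVC(kernel='linear', C=1.0).fit(X, y)
print(clf.support_)      # indices of the support vectors
print(clf.dual_coef_)    # y_i * alpha_i for the support vectors
```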