PA196: Pattern Recognition
03. Linear discriminants
Dr. Vlad Popovici, popovici@iba.muni.cz
Institute of Biostatistics and Analyses, Masaryk University, Brno

Outline
1 Linear Discriminant Analysis (cont'd): LDA, QDA, RDA; LD subspace; LDA: wrap-up
2 Logistic regression
3 Large margin (linear) classifiers: linearly separable case; soft margins (non-linearly separable case)

LDA
Remember (first lecture):
- Bayes decision: assign x to the class with maximum a posteriori probability
- let there be K classes denoted $g_1, \dots, g_K$, with corresponding priors $P(g_i)$
- the posteriors are
$$P(g_i|x) = \frac{p(x|g_i)P(g_i)}{\sum_{k=1}^{K} p(x|g_k)P(g_k)} \propto p(x|g_i)P(g_i)$$
- decision functions (for class $g_i$ vs class $g_j$) arise from log odds-ratios (for example):
$$\log\frac{P(g_i|x)}{P(g_j|x)} = \log\frac{p(x|g_i)}{p(x|g_j)} + \log\frac{P(g_i)}{P(g_j)} \quad \begin{cases} > 0, & \text{predict } g_i \\ < 0, & \text{predict } g_j \end{cases}$$

Under the assumption of Gaussian class-conditional densities,
$$p(x|g) = \frac{1}{(2\pi)^{d/2}|\Sigma_g|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_g)^t \Sigma_g^{-1} (x-\mu_g)\right)$$
($|\Sigma|$ is the determinant of the covariance matrix $\Sigma$), the decision function becomes
$$h_{ij}(x) = \log\frac{P(g_i|x)}{P(g_j|x)} = (x^t W_i x + w_i^t x + w_{i0}) - (x^t W_j x + w_j^t x + w_{j0})$$
where $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$ and $w_{i0} = -\frac{1}{2}\mu_i^t \Sigma_i^{-1} \mu_i - \frac{1}{2}\log|\Sigma_i| + \log P(g_i)$.

Simplest LDA
If $\Sigma_i = \Sigma_j = \sigma^2 I$ ("spherical" covariance matrices),
$$h_{ij}(x) = w_{ij}^t (x - x_0)$$
where
$$w_{ij} = \mu_i - \mu_j, \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \log\frac{P(g_i)}{P(g_j)} (\mu_i - \mu_j)$$

[Figure: three class centroids $\mu_1, \mu_2, \mu_3$ separated by the linear boundaries $h_{12}$, $h_{13}$, $h_{23}$]

Classical LDA
If all classes share a common covariance matrix, $\Sigma_i = \Sigma$, the decision function becomes
$$h_{ij}(x) = w^t (x - x_0)$$
where
$$w = \Sigma^{-1}(\mu_i - \mu_j), \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{1}{(\mu_i - \mu_j)^t \Sigma^{-1} (\mu_i - \mu_j)} \log\frac{P(g_i)}{P(g_j)} (\mu_i - \mu_j)$$

[Figure: the same three centroids with the boundaries $h_{12}$, $h_{13}$, $h_{23}$ tilted by the common covariance]
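To make the classical LDA rule concrete, here is a minimal NumPy sketch of the pairwise decision function $h_{ij}(x) = w^t(x - x_0)$; it is an illustration, not part of the original slides, and assumes the means, common covariance and priors are already known.

```python
import numpy as np

def lda_decision(x, mu_i, mu_j, Sigma, p_i, p_j):
    """Classical LDA decision h_ij(x) = w^t (x - x0); > 0 predicts g_i."""
    Sigma_inv = np.linalg.inv(Sigma)
    d = mu_i - mu_j
    w = Sigma_inv @ d
    x0 = 0.5 * (mu_i + mu_j) - np.log(p_i / p_j) / (d @ Sigma_inv @ d) * d
    return w @ (x - x0)
```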
Estimation of LDA parameters
We are given $\{(x_i, g_i),\ i = 1, \dots, n\}$ with $x_i \in \mathbb{R}^d$ and $g_i \in \{g_1, \dots, g_K\}$.
- priors: $\hat P(g_i) = n_i/n$, where $n_i$ is the number of elements of class $g_i$ in the training set
- mean vectors: $\hat\mu_i = \frac{1}{n_i}\sum_{x \in g_i} x$
- covariance matrix: $\hat\Sigma = \sum_{k=1}^{K} \sum_{x \in g_k} (x - \hat\mu_k)(x - \hat\mu_k)^t / (n - K)$
(a code sketch of these estimators follows below)

Quadratic Discriminant Analysis
The class-conditional probabilities are general Gaussians and the decision function keeps its quadratic form:
$$h_{ij}(x) = \log\frac{P(g_i|x)}{P(g_j|x)} = (x^t W_i x + w_i^t x + w_{i0}) - (x^t W_j x + w_j^t x + w_{j0})$$
where $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$ and $w_{i0} = -\frac{1}{2}\mu_i^t \Sigma_i^{-1} \mu_i - \frac{1}{2}\log|\Sigma_i| + \log P(g_i)$.

LDA and QDA
[Figure: LDA (left) vs QDA (right) decision boundaries on a three-class dataset. Hastie et al.: The Elements of Statistical Learning, chpt. 4]
Note: a boundary similar to QDA's could be obtained by applying LDA in an augmented space with axes $x_1, \dots, x_d, x_1x_2, \dots, x_{d-1}x_d, x_1^2, \dots, x_d^2$.
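A minimal NumPy sketch of the plug-in estimators above (our illustration; the name fit_lda and the data layout, an $n \times d$ array X with integer labels y, are assumptions):

```python
import numpy as np

def fit_lda(X, y):
    """Plug-in estimates: priors n_i/n, class means, pooled covariance."""
    classes = np.unique(y)
    n, d = X.shape
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    Sigma = np.zeros((d, d))
    for k, mu in zip(classes, means):
        Xc = X[y == k] - mu
        Sigma += Xc.T @ Xc                  # within-class scatter
    Sigma /= n - len(classes)               # pooled covariance, /(n - K)
    return priors, means, Sigma
```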
Regularized DA: between LDA and QDA
Combine the pooled covariance with the class-specific covariance matrices, and allow the pooled covariance to be more spherical or more general:
$$\hat\Sigma_k(\alpha, \gamma) = \alpha \hat\Sigma_k + (1 - \alpha)\left[\gamma \hat\Sigma + (1 - \gamma)\hat\sigma^2 I\right]$$
- $\alpha = 1$: QDA; $\alpha = 0$: LDA
- $\gamma = 1$: general covariance matrix; $\gamma = 0$: spherical covariance matrix
- $\alpha$ and $\gamma$ must be optimized

Implementation of LDA
Use the diagonalization of the covariance matrices (either pooled or class-specific), which are symmetric and positive definite:
$$\Sigma_i = U_i D_i U_i^t$$
where $U_i$ is a $d \times d$ orthonormal matrix and $D_i$ is a diagonal matrix with eigenvalues $d_{ik} > 0$ on the diagonal. The ingredients for the decision functions become
$$(x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) = [U_i^t(x - \mu_i)]^t D_i^{-1} [U_i^t(x - \mu_i)]$$
and $\log|\Sigma_i| = \sum_k \log d_{ik}$.

Implementation of LDA, cont'd
A possible 2-step procedure for LDA classification (common covariance matrix $\Sigma = UDU^t$):
1 "sphere" the data: $X^* = D^{-1/2} U^t X$
2 assign a sample x to the closest centroid in the transformed space, modulo the effect of the priors
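The regularized covariance and the two-step procedure can be sketched as follows (our illustration; the function names are ours and the parameters are assumed to be pre-estimated):

```python
import numpy as np

def rda_cov(Sigma_k, Sigma, sigma2, alpha, gamma):
    """RDA blend of class-specific, pooled and spherical covariances."""
    d = Sigma.shape[0]
    return alpha * Sigma_k + (1 - alpha) * (
        gamma * Sigma + (1 - gamma) * sigma2 * np.eye(d))

def sphere_and_classify(X, means, Sigma, priors):
    """Sphere with Sigma = U D U^t, then assign each row of X to the
    closest centroid in the transformed space, corrected for priors."""
    evals, U = np.linalg.eigh(Sigma)        # Sigma = U diag(evals) U^t
    T = U / np.sqrt(evals)                  # T = U D^{-1/2}
    Xs, Ms = X @ T, means @ T               # sphered data and centroids
    d2 = ((Xs[:, None, :] - Ms[None, :, :]) ** 2).sum(axis=2)
    return (-0.5 * d2 + np.log(priors)).argmax(axis=1)
```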
LD subspace
- the centroids $\mu_i$, $i = 1, \dots, K$, lie in an affine subspace of dimension at most $K - 1 < d$
- any direction orthogonal to this subspace does not influence the classification
- the classification is carried out in a low-dimensional space, hence we have a dimensionality reduction
- the subspace axes can be found sequentially, using Fisher's criterion (find directions that maximally separate the centroids with respect to the variance)
- this is essentially a principal component analysis of the (sphered) centroids

Procedure:
1 compute M, the $K \times d$ matrix of class centroids (by rows), and the common covariance matrix W (within-class covariance)
2 compute $M^* = M W^{-1/2}$ (using the eigen-decomposition of W)
3 compute $B^*$, the covariance matrix of $M^*$ (between-class covariance matrix), and its eigen-decomposition $B^* = V^* D_B V^{*t}$
4 the columns of $V^*$ (ordered from largest to smallest eigenvalue $d_{Bi}$) give the coordinates of the optimal subspaces; the i-th discriminant variable (canonical variable) is $Z_i = (W^{-1/2} v_i^*)^t X$

[Figure: classification in the reduced subspace spanned by the first two canonical coordinates. Hastie et al.: The Elements of Statistical Learning, chpt. 4]
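A direct transcription of steps 1-4 (a sketch under the assumption that the centroid matrix and the within-class covariance are given; the names are ours):

```python
import numpy as np

def canonical_variables(M, W):
    """Discriminant (canonical) directions from centroids M (K x d)
    and within-class covariance W; returns the d x d matrix A whose
    i-th column gives Z_i = A[:, i]^t x."""
    evals, U = np.linalg.eigh(W)
    W_inv_sqrt = (U / np.sqrt(evals)) @ U.T      # W^{-1/2}, symmetric
    M_star = M @ W_inv_sqrt                      # whitened centroids
    B_star = np.cov(M_star, rowvar=False)        # between-class covariance
    dB, V_star = np.linalg.eigh(B_star)
    order = np.argsort(dB)[::-1]                 # largest eigenvalue first
    return W_inv_sqrt @ V_star[:, order]         # columns: W^{-1/2} v*_i
```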
[Figure: pairwise scatter plots of selected canonical coordinates (coordinate 1 vs 3, 2 vs 3, 1 vs 7, 9 vs 10), with class centroids highlighted. Hastie et al.: The Elements of Statistical Learning, chpt. 4]
LDA: wrap-up
- LDA, FDA and MSE regression, with a particular coding of the class labels, lead to equivalent solutions (the same separating hyperplane)
- LDA (QDA) is the optimal classifier in the case of Gaussian class-conditional distributions
- LDA can be used to project the data into a lower-dimensional space for visualization
- the LDA derivation assumes Gaussian densities, but FDA does not
- LDA is naturally extended to multiple classes

Logistic regression
Idea: model the posterior probabilities as linear functions in x and ensure they sum up to 1. For K classes $g_1, \dots, g_K$:
$$\log\frac{P(g_i|x)}{P(g_K|x)} = \langle w_i, x\rangle + w_{i0}, \quad \forall i = 1, \dots, K-1$$
which leads to
$$P(g_i|x) = \frac{\exp(\langle w_i, x\rangle + w_{i0})}{1 + \sum_{j=1}^{K-1}\exp(\langle w_j, x\rangle + w_{j0})}, \quad i = 1, \dots, K-1$$
$$P(g_K|x) = \frac{1}{1 + \sum_{j=1}^{K-1}\exp(\langle w_j, x\rangle + w_{j0})}$$

- the transformation $p \mapsto \log[p/(1-p)]$ is called the logit transform
- the choice of the reference class (K in our case) is purely a convention
- the set of parameters of the model: $\theta = \{w_1, w_{10}, \dots, w_{K-1}, w_{K-1,0}\}$
- the log-likelihood is $L(\theta) = \sum_{i=1}^n \log P(g_i|x_i; \theta)$
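A small sketch of the posterior computation above (our illustration; Ws is assumed to be a $(K-1) \times d$ array of weight vectors and w0s the corresponding intercepts):

```python
import numpy as np

def posteriors(x, Ws, w0s):
    """P(g_i|x), i = 1..K, with class K as the reference class."""
    e = np.exp(Ws @ x + w0s)                 # K-1 linear scores
    denom = 1.0 + e.sum()
    return np.append(e, 1.0) / denom         # sums to 1
```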
Binary case (K = 2): take the classes to be encoded in response variables $y_i$: $y_i = 0$ for class $g_1$ and $y_i = 1$ for class $g_2$.
- a single posterior probability is needed; with this coding,
$$P(y = 1|x) = \frac{\exp(\langle w, x\rangle + w_0)}{1 + \exp(\langle w, x\rangle + w_0)}$$
- the log-likelihood function becomes
$$L(\theta = \{w, w_0\}) = \sum_{i=1}^n \left[y_i(\langle w, x_i\rangle + w_0) - \log(1 + \exp(\langle w, x_i\rangle + w_0))\right]$$
- using $z = [1, x]$ and $a = [w_0, w]$,
$$L(a) = \sum_{i=1}^n \left[y_i\langle a, z_i\rangle - \log(1 + \exp(\langle a, z_i\rangle))\right]$$

- objective: find $a^* = \arg\max_a L(a)$, with
$$\frac{\partial L(a)}{\partial a} = \sum_{i=1}^n z_i\big(y_i - P(y_i = 1|z_i)\big)$$
- at a (local) extremum, $\partial L(a)/\partial a = 0$, which is the system of equations to be solved for a
- the solution can be found by a Newton-Raphson procedure (iteratively reweighted least squares); a sketch follows the remarks below

A few remarks on logistic regression:
- brings the tools from linear regression to pattern recognition problems
- can be used to identify those input variables that explain the output
- its predictions can be interpreted as posterior probabilities
- by introducing a penalty term, variable selection can be embedded into the model construction (we'll see it later!)
- both LDA and logistic regression use a linear form for the log-posterior odds $\log(P(g_i|x)/P(g_K|x))$; LDA additionally assumes Gaussian class-conditional densities, while logistic regression assumes only that the log-posterior odds are linear
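A compact Newton-Raphson (IRLS) sketch for the binary case (our illustration, not the lecture's reference implementation): the update is $a \leftarrow a + (Z^t R Z)^{-1} Z^t(y - p)$ with $R = \mathrm{diag}(p_i(1 - p_i))$.

```python
import numpy as np

def logreg_irls(X, y, n_iter=25):
    """Binary logistic regression by Newton-Raphson / IRLS.
    X: n x d data, y in {0, 1}; returns a = [w0, w]."""
    Z = np.hstack([np.ones((len(X), 1)), X])     # z_i = [1, x_i]
    a = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ a))         # P(y = 1 | z)
        grad = Z.T @ (y - p)                     # dL/da
        H = Z.T @ (Z * (p * (1 - p))[:, None])   # Z^t R Z
        a += np.linalg.solve(H, grad)            # Newton step
    return a
```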
Large margin (linear) classifiers
- there are theoretical considerations that justify the goal of maximizing the margin achieved by the separating hyperplane
- intuitively, the larger the margin, the more "room" for noise in the data and, hence, the better the generalization
- let a training set be $\{(x_i, y_i),\ i = 1, \dots, n\}$ with $y_i = \pm 1$
- the margin of a point $x_i$ with respect to the boundary function h is $\gamma_i = y_i h(x_i)$
- it can be shown that the error attained by h is upper bounded by a function of $\min_i \gamma_i$ (however, the bound might not be tight)

Linearly separable case
- consider the dataset $\{(x_i, y_i),\ i = 1, \dots, n\}$ to be linearly separable, i.e. $\gamma_i > 0$ for all i
- we will consider linear classifiers $h(x) = \langle w, x\rangle + w_0$ (with the predicted class $\mathrm{sign}(h(x))$)
- if the (functional) margin achieved is 1, then $\gamma_i \ge 1$; the geometric margin is the normalized functional margin $1/\|w\|$, hence:

Proposition. The hyperplane $(w^*, w_0^*)$ that solves the optimization problem
$$\operatorname{minimize}_{w, w_0}\ \frac{1}{2}\langle w, w\rangle, \quad \text{subject to } y_i(\langle w, x_i\rangle + w_0) \ge 1,\ i = 1, \dots, n$$
realizes the maximal margin hyperplane, with geometric margin $\gamma = 1/\|w^*\|$.

Solving the constrained optimization:
- if the objective function is f(w) and the equality constraints are $h_i(w) = 0$, $i = 1, \dots, m$, the Lagrangian function is
$$L(w, \beta) = f(w) + \sum_{i=1}^m \beta_i h_i(w)$$
- a necessary and sufficient condition for $w^*$ to be a solution (f continuous and convex, $h_i$ continuous and differentiable) is
$$\frac{\partial L(w^*, \beta^*)}{\partial w} = 0, \qquad \frac{\partial L(w^*, \beta^*)}{\partial \beta} = 0$$
for some values of $\beta^*$

For a constrained optimization with a domain $\Omega \subseteq \mathbb{R}^n$,
$$\operatorname{minimize}\ f(w),\ w \in \Omega, \quad \text{subject to } g_i(w) \le 0,\ i = 1, \dots, k; \quad h_i(w) = 0,\ i = 1, \dots, m,$$
the Lagrangian function has the form
$$L(w, \alpha, \beta) = f(w) + \sum_i \alpha_i g_i(w) + \sum_j \beta_j h_j(w)$$
with $\alpha_i$ and $\beta_j$ being the Lagrange multipliers.

Karush-Kuhn-Tucker (KKT) optimality conditions for a convex optimization problem: for a solution $w^*$ and corresponding multipliers $\alpha^*$ and $\beta^*$,
$$\frac{\partial L}{\partial w} = 0, \quad \frac{\partial L}{\partial \beta} = 0, \quad \alpha_i^* g_i(w^*) = 0, \quad g_i(w^*) \le 0, \quad \alpha_i^* \ge 0$$
- for active constraints ($g_i(w) = 0$), $\alpha_i > 0$; for inactive constraints ($g_i(w) < 0$), $\alpha_i = 0$
- $\alpha_i$ can be seen as the sensitivity of f to the active constraint

Duality of convex optimization:
- the solution is a saddle point; w are the primal variables, the Lagrange multipliers are the dual variables
- solving the dual optimization may be simpler: the Lagrange multipliers become the main variables, so set to 0 the derivatives with respect to w and substitute the result into the Lagrangian
- the resulting function contains only the dual variables and must be maximized under simpler constraints

...and back to our initial problem: the primal Lagrangian is
$$L(w, w_0, \alpha) = \frac{1}{2}\langle w, w\rangle - \sum_{i=1}^n \alpha_i\left[y_i(\langle w, x_i\rangle + w_0) - 1\right]$$
From the KKT conditions,
$$w = \sum_{i=1}^n y_i \alpha_i x_i \quad \text{and} \quad \sum_{i=1}^n y_i \alpha_i = 0,$$
which leads to the dual Lagrangian
$$L(w, w_0, \alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle$$

Proposition. If $\alpha^*$ is the solution of the quadratic problem
$$\operatorname{maximize}\ W(\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle$$
$$\text{subject to } \sum_{i=1}^n \alpha_i y_i = 0, \quad \alpha_i \ge 0,\ i = 1, \dots, n,$$
then the vector $w^* = \sum_{i=1}^n y_i \alpha_i^* x_i$ realizes the maximal margin hyperplane with geometric margin $1/\|w^*\|$.
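For small problems the dual QP can be solved directly; below is a minimal sketch using SciPy's general-purpose SLSQP solver (our illustration; a dedicated QP solver would be preferable in practice):

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_svm(X, y):
    """Maximal margin classifier via the dual QP.
    X: n x d data, y in {-1, +1}; returns (w, w0)."""
    y = y.astype(float)
    n = len(y)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)    # y_i y_j <x_i, x_j>
    fun = lambda a: 0.5 * a @ Q @ a - a.sum()    # minimize -W(alpha)
    jac = lambda a: Q @ a - np.ones(n)
    cons = {'type': 'eq', 'fun': lambda a: a @ y, 'jac': lambda a: y}
    res = minimize(fun, np.zeros(n), jac=jac, method='SLSQP',
                   bounds=[(0.0, None)] * n, constraints=[cons])
    alpha = res.x
    w = (alpha * y) @ X                          # w* = sum_i y_i alpha_i x_i
    sv = alpha > 1e-6                            # support vectors
    w0 = np.mean(y[sv] - X[sv] @ w)              # from y_i(<w, x_i> + w0) = 1
    return w, w0
```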
- in the dual formulation, $w_0^*$ still needs to be specified:
$$w_0^* = -\frac{1}{2}\left[\max_{y_i = -1}\{\langle w^*, x_i\rangle\} + \min_{y_i = 1}\{\langle w^*, x_i\rangle\}\right]$$
- from the KKT conditions, $\alpha_i^*\left[y_i(\langle w^*, x_i\rangle + w_0^*) - 1\right] = 0$, so $\alpha_i^* > 0$ only for the $x_i$ lying on the margin
- those $x_i$ for which $\alpha_i^* > 0$ are called support vectors
- the optimal hyperplane is a linear combination of support vectors:
$$h(x) = \sum_{i \in SV} y_i \alpha_i^* \langle x_i, x\rangle + w_0^*$$

[Figure: maximal margin hyperplane with the margin hyperplanes $\langle w, x\rangle + w_0 = 1$ and $\langle w, x\rangle + w_0 = -1$; the support vectors lie on the margins]

- the margin achieved is
$$\gamma = \frac{1}{\|w^*\|} = \left(\sum_{i \in SV} \alpha_i^*\right)^{-1/2}$$
- a (leave-one-out) estimate of the generalization error is the proportion of support vectors in the training set, $|SV|/n$

2-Norm soft margin
Introduce the slack variables $\xi_i$ and allow "softer" margins:
$$\operatorname{minimize}_{w, w_0, \xi}\ \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^n \xi_i^2, \quad \text{subject to } y_i(\langle w, x_i\rangle + w_0) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \dots, n$$
for some $C > 0$.
- theory suggests an optimal choice for C: $1/\max_i\{\|x_i\|^2\}$; in practice, C is selected by testing various values
- the problem is solved in the dual space and the margin achieved is $\left(\sum_{i \in SV} \alpha_i^* - \|\alpha^*\|^2/C\right)^{-1/2}$

1-Norm soft margin
Optimization problem:
$$\operatorname{minimize}_{w, w_0, \xi}\ \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^n \xi_i, \quad \text{subject to } y_i(\langle w, x_i\rangle + w_0) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \dots, n$$
for some $C > 0$.
- this results in "box constraints" on $\alpha_i$: $0 \le \alpha_i \le C$
- non-zero slack variables correspond to $\alpha_i = C$ and to points with geometric margin less than $1/\|w\|$

Wrap-up
- LDA and MSE-based methods lead to similar solutions, even though they are derived under different assumptions
- LDA (and FDA) assign a vector x to the closest centroid, in a transformed space
- logistic regression and LDA both model the log-posterior odds as linear functions
- the predicted values from logistic regression can be interpreted as posterior probabilities
- margin optimization provides an alternative approach to constructing linear classifiers
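A closing practical note (our addition): the 1-norm soft margin keeps the same dual as the hard-margin sketch above, only the bounds change from $[0, \infty)$ to the box $[0, C]$ (replace bounds=[(0.0, None)] * n by bounds=[(0.0, C)] * n). Off-the-shelf implementations expose C directly, for example in scikit-learn:

```python
from sklearn.svm import SVC

# X, y: training data as above (assumed available); C trades margin vs. slack
clf = SVC(kernel='linear', C=1.0).fit(X, y)
print(clf.support_)      # indices of the support vectors
print(clf.dual_coef_)    # y_i * alpha_i for the support vectors
```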