PA196: Pattern Recognition
3. Linear discriminants (cont'd)
Dr. Vlad Popovici
popovici@recetox.muni.cz
RECETOX, Masaryk University, Brno

Outline
1 Linear Discriminant Analysis (cont'd)
  LDA, QDA, RDA
  LD subspace
  LDA: wrap-up
2 Logistic regression
3 Large margin (linear) classifiers
  Linearly separable case
  Non-linearly separable case: soft margins

Linear Discriminant Analysis (cont'd): LDA, QDA, RDA

LDA
Remember (first lecture):
• Bayes decision: assign x to the class with maximum a posteriori probability
• let there be K classes denoted $g_1, \dots, g_K$, with corresponding priors $P(g_i)$
• the posteriors are
$$P(g_i|x) = \frac{p(x|g_i)P(g_i)}{\sum_{j=1}^{K} p(x|g_j)P(g_j)} \propto p(x|g_i)P(g_i)$$
• the decision function (for class $g_i$ vs. class $g_j$) arises from the log odds-ratio (for example):
$$\log\frac{P(g_i|x)}{P(g_j|x)} = \log\frac{p(x|g_i)}{p(x|g_j)} + \log\frac{P(g_i)}{P(g_j)} \quad \begin{cases} > 0, & \text{predict } g_i \\ < 0, & \text{predict } g_j \end{cases}$$

Under the assumption of Gaussian class-conditional densities,
$$p(x|g) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_g|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_g)^t \Sigma_g^{-1} (x-\mu_g)\right)$$
($|\Sigma|$ is the determinant of the covariance matrix $\Sigma$), the decision function becomes
$$h_{ij}(x) = \log\frac{P(g_i|x)}{P(g_j|x)} = (x^t W_i x + w_i^t x + w_{i0}) - (x^t W_j x + w_j^t x + w_{j0})$$
where $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$ and $w_{i0} = -\frac{1}{2}\mu_i^t \Sigma_i^{-1}\mu_i - \frac{1}{2}\log|\Sigma_i| + \log P(g_i)$.

Simplest LDA
If $\Sigma_i = \Sigma_j = \sigma^2 I$ ("spherical" covariance matrices),
$$h_{ij}(x) = w_{ij}^t(x - x_0)$$
where
$$w_{ij} = \mu_i - \mu_j, \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2}\,\log\frac{P(g_i)}{P(g_j)}\,(\mu_i - \mu_j)$$
[Figure: three class means $\mu_1, \mu_2, \mu_3$ and the pairwise linear boundaries $h_{12}, h_{13}, h_{23}$]

Classical LDA
If all classes share a common covariance matrix, $\Sigma_i = \Sigma$, the decision function becomes
$$h_{ij}(x) = w^t(x - x_0)$$
where
$$w = \Sigma^{-1}(\mu_i - \mu_j), \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{1}{(\mu_i - \mu_j)^t \Sigma^{-1}(\mu_i - \mu_j)}\,\log\frac{P(g_i)}{P(g_j)}\,(\mu_i - \mu_j)$$
[Figure: three class means $\mu_1, \mu_2, \mu_3$ and the pairwise linear boundaries $h_{12}, h_{13}, h_{23}$]
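To make the classical LDA rule concrete, here is a minimal numpy sketch of the pairwise decision function $h_{ij}(x) = w^t(x - x_0)$ for a shared covariance matrix; the means, covariance, and priors are assumed to be given (their estimation is discussed next), and all values in the usage example are purely illustrative.

```python
import numpy as np

def lda_pairwise_decision(x, mu_i, mu_j, Sigma, prior_i, prior_j):
    """Classical LDA decision function h_ij(x) = w^t (x - x0) for a
    common covariance matrix Sigma; h_ij(x) > 0 favours class i."""
    diff = mu_i - mu_j
    w = np.linalg.solve(Sigma, diff)   # w = Sigma^{-1} (mu_i - mu_j)
    # x0 = (mu_i + mu_j)/2 - log(P_i/P_j) / ((mu_i - mu_j)^t Sigma^{-1} (mu_i - mu_j)) * (mu_i - mu_j)
    x0 = 0.5 * (mu_i + mu_j) - np.log(prior_i / prior_j) / (diff @ w) * diff
    return w @ (x - x0)

# toy usage with illustrative parameter values
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
print(lda_pairwise_decision(np.array([1.5, 0.5]), mu1, mu2, Sigma, 0.5, 0.5))
```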
Estimation of LDA parameters
• we are given $\{(x_i, g_i),\ i = 1, \dots, n\}$ with $x_i \in \mathbb{R}^d$ and $g_i \in \{g_1, \dots, g_K\}$
• priors: $\hat{P}(g_i) = n_i/n$, where $n_i$ is the number of elements of class $g_i$ in the training set
• mean vectors: $\hat{\mu}_i = \frac{1}{n_i}\sum_{x \in g_i} x$
• covariance matrix: $\hat{\Sigma} = \sum_{k=1}^{K}\sum_{x \in g_k}(x - \hat{\mu}_k)(x - \hat{\mu}_k)^t / (n - K)$

Quadratic Discriminant Analysis
The class-conditional densities are general Gaussians and the decision function keeps the general (quadratic) form derived above:
$$h_{ij}(x) = \log\frac{P(g_i|x)}{P(g_j|x)} = (x^t W_i x + w_i^t x + w_{i0}) - (x^t W_j x + w_j^t x + w_{j0})$$
where $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$ and $w_{i0} = -\frac{1}{2}\mu_i^t \Sigma_i^{-1}\mu_i - \frac{1}{2}\log|\Sigma_i| + \log P(g_i)$.

LDA and QDA
[Figure: LDA and QDA decision boundaries on a three-class problem; Hastie et al., The Elements of Statistical Learning, chapter 4]
Note: a boundary similar to QDA's could be obtained by applying LDA in an augmented space with axes $x_1, \dots, x_d,\ x_1 x_2, \dots, x_{d-1}x_d,\ x_1^2, \dots, x_d^2$.
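As a small illustration of the LDA/QDA comparison and of the note above, the sketch below (assuming scikit-learn is available) fits LDA, QDA, and LDA on quadratically augmented features to a synthetic three-class dataset; the data and the resulting accuracies are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import PolynomialFeatures
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

# synthetic three-class data with unequal spreads (illustrative only)
X, y = make_blobs(n_samples=600, centers=3, cluster_std=[1.0, 1.5, 0.8],
                  random_state=0)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)

# LDA in the augmented space x1, x2, x1^2, x1*x2, x2^2:
# a linear boundary there is quadratic in the original space
X_aug = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
lda_aug = LinearDiscriminantAnalysis().fit(X_aug, y)

print("LDA accuracy:            ", lda.score(X, y))
print("QDA accuracy:            ", qda.score(X, y))
print("LDA (augmented) accuracy:", lda_aug.score(X_aug, y))
```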
Regularized DA: between LDA and QDA
Combine the pooled covariance with the class-specific covariance matrices, and allow the pooled covariance to be more spherical or more general:
$$\hat{\Sigma}_k(\alpha, \gamma) = \alpha\hat{\Sigma}_k + (1 - \alpha)\left[\gamma\hat{\Sigma} + (1 - \gamma)\hat{\sigma}^2 I\right]$$
• $\alpha = 1$: QDA; $\alpha = 0$: LDA
• $\gamma = 1$: general covariance matrix; $\gamma = 0$: spherical covariance matrix
• $\alpha$ and $\gamma$ are hyperparameters and must be optimized (e.g., by cross-validation)

Implementation of LDA
• use the diagonalization of the covariance matrices (either pooled or class-specific), which are symmetric and positive definite:
$$\Sigma_i = U_i D_i U_i^t$$
where $U_i$ is a $d \times d$ orthonormal matrix and $D_i$ is a diagonal matrix with the eigenvalues $d_{ik} > 0$ on the diagonal
• the ingredients of the decision functions become
$$(x - \mu_i)^t \Sigma_i^{-1}(x - \mu_i) = [U_i^t(x - \mu_i)]^t D_i^{-1}[U_i^t(x - \mu_i)]$$
and $\log|\Sigma_i| = \sum_k \log d_{ik}$

Implementation of LDA, cont'd
A possible 2-step procedure for LDA classification (common covariance matrix $\Sigma = U D U^t$):
1 "sphere" the data: $X^* = D^{-1/2} U^t X$
2 assign a sample $x$ to the closest centroid in the transformed space, modulo the effect of the priors

LD subspace
• the centroids $\mu_i$, $i = 1, \dots, K$, lie in an affine subspace of dimension at most $K - 1 < d$
• any direction orthogonal to this subspace does not influence the classification
• the classification can therefore be carried out in a low-dimensional space, hence we obtain a dimensionality reduction
• the subspace axes can be found sequentially, using Fisher's criterion (find the directions that maximally separate the centroids with respect to the within-class variance)
• this amounts, essentially, to a principal component analysis of the sphered class centroids

1 compute $M$, the $K \times d$ matrix of class centroids (by rows), and the common covariance matrix $W$ (within-class covariance)
2 compute $M^* = M W^{-1/2}$ (using the eigen-decomposition of $W$)
3 compute $B^*$, the covariance matrix of $M^*$ (between-class covariance matrix), and its eigen-decomposition $B^* = V^* D_B V^{*t}$
4 the columns of $V^*$ (ordered from largest to smallest eigenvalue $d_{B_i}$) give the coordinates of the optimal subspaces
• the $i$-th discriminant variable (canonical variable) is given by $Z_i = (W^{-1/2} v_i^*)^t X$ (a small sketch of this procedure is given below)
[Figure: classification in the reduced subspace spanned by the first two canonical coordinates; Hastie et al., The Elements of Statistical Learning, chapter 4]
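A minimal numpy sketch of the four-step procedure above, assuming the pooled within-class covariance is non-singular; function and variable names are chosen for illustration only, and the toy data are random.

```python
import numpy as np

def discriminant_coordinates(X, y, n_comp=2):
    """Project X onto the first n_comp discriminant (canonical) variables:
    sphere the class centroids with W^{-1/2}, then keep the principal
    directions of the sphered centroids."""
    classes = np.unique(y)
    n, d = X.shape
    K = len(classes)

    # step 1: class centroids M (K x d) and pooled within-class covariance W
    M = np.vstack([X[y == c].mean(axis=0) for c in classes])
    W = sum(np.cov(X[y == c], rowvar=False) * (np.sum(y == c) - 1)
            for c in classes) / (n - K)

    # step 2: sphere the centroids, M* = M W^{-1/2}, via the eigen-decomposition of W
    evals, U = np.linalg.eigh(W)
    W_inv_sqrt = U @ np.diag(evals ** -0.5) @ U.T
    M_star = M @ W_inv_sqrt

    # step 3: between-class covariance of the sphered centroids and its eigen-decomposition
    B_star = np.cov(M_star, rowvar=False)
    b_evals, V_star = np.linalg.eigh(B_star)
    order = np.argsort(b_evals)[::-1]          # largest eigenvalues first

    # step 4: discriminant directions in the original space, Z_i = (W^{-1/2} v*_i)^t X
    A = W_inv_sqrt @ V_star[:, order[:n_comp]]
    return X @ A

# toy usage with random three-class data (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(50, 4)) for m in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 50)
print(discriminant_coordinates(X, y, n_comp=2).shape)   # (150, 2)
```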
[Figure: projections of the data onto further pairs of canonical coordinates (coordinate 1 vs. 3, 2 vs. 3, 1 vs. 7, 9 vs. 10) for linear discriminant analysis; Hastie et al., The Elements of Statistical Learning, chapter 4]
LDA: wrap-up
• LDA, FDA and MSE regression with a particular coding of the class labels lead to equivalent solutions (the same separating hyperplane)
• LDA (QDA) is the optimal classifier in the case of Gaussian class-conditional distributions
• LDA can be used to project the data into a lower-dimensional space for visualization
• the LDA derivation assumes Gaussian densities, but the FDA derivation does not
• LDA extends naturally to multiple classes

Logistic regression
Idea: model the log-odds of the posterior probabilities as linear functions of x and ensure the posteriors sum up to 1. For K classes $g_1, \dots, g_K$:
$$\log\frac{P(g_i|x)}{P(g_K|x)} = \langle w_i, x\rangle + w_{i0}, \quad \forall i = 1, \dots, K-1$$
which leads to
$$P(g_i|x) = \frac{\exp(\langle w_i, x\rangle + w_{i0})}{1 + \sum_{j=1}^{K-1}\exp(\langle w_j, x\rangle + w_{j0})}, \quad i = 1, \dots, K-1$$
$$P(g_K|x) = \frac{1}{1 + \sum_{j=1}^{K-1}\exp(\langle w_j, x\rangle + w_{j0})}$$
• the transformation $p \mapsto \log[p/(1-p)]$ is called the logit transform
• the choice of the reference class (K in our case) is purely a convention
• the set of parameters of the model: $\theta = \{w_1, w_{10}, \dots, w_{K-1}, w_{K-1,0}\}$
• the log-likelihood is $L(\theta) = \sum_{i=1}^{n}\log P(g_i|x_i; \theta)$

For the binary case (K = 2), take the classes to be encoded in response variables $y_i$: $y_i = 0$ for class $g_1$ and $y_i = 1$ for class $g_2$.
• a single posterior probability is needed:
$$P(y = 1|x) = \frac{\exp(\langle w, x\rangle + w_0)}{1 + \exp(\langle w, x\rangle + w_0)}$$
• the log-likelihood function becomes
$$L(\theta = \{w, w_0\}) = \sum_{i=1}^{n}\left[y_i(\langle w, x_i\rangle + w_0) - \log(1 + \exp(\langle w, x_i\rangle + w_0))\right]$$
• using $z_i = [1, x_i]$ and $a = [w_0, w]$,
$$L(a) = \sum_{i=1}^{n}\left[y_i\langle a, z_i\rangle - \log(1 + \exp(\langle a, z_i\rangle))\right]$$
• objective: find $a^* = \arg\max_a L(a)$
• the gradient is $\frac{\partial L(a)}{\partial a} = \sum_{i=1}^{n} z_i\,(y_i - P(y_i = 1|z_i))$
• at a (local) extremum, $\frac{\partial L(a)}{\partial a} = 0$, which leads to a system of (nonlinear) equations to be solved for $a$
• the solution can be found by a Newton-Raphson procedure (iteratively re-weighted least squares); a small sketch follows the remarks below

A few remarks on logistic regression:
• brings the tools of linear regression to pattern recognition
• can be used to identify those input variables that explain the output
• its predictions can be interpreted as posterior probabilities
• by introducing a penalty term, variable selection can be embedded into the model construction - we'll see it later!
• both LDA and logistic regression use a linear form for the log-posterior odds $\log(P(g_i|x)/P(g_K|x))$; LDA arrives at this form by assuming Gaussian class-conditional densities, while logistic regression assumes only that the log-posterior odds are linear
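The Newton-Raphson update mentioned above can be written compactly as iteratively re-weighted least squares. Below is a minimal numpy sketch for the binary case, without safeguards such as step-size control or regularization; the toy data are illustrative.

```python
import numpy as np

def logreg_irls(X, y, n_iter=25, tol=1e-8):
    """Binary logistic regression by Newton-Raphson / IRLS.
    X: (n, d) inputs, y: (n,) labels in {0, 1}. Returns a = [w0, w]."""
    n, d = X.shape
    Z = np.hstack([np.ones((n, 1)), X])      # z_i = [1, x_i]
    a = np.zeros(d + 1)                      # a = [w0, w]
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ a))     # P(y_i = 1 | z_i)
        grad = Z.T @ (y - p)                 # dL/da = sum_i z_i (y_i - p_i)
        Wdiag = p * (1.0 - p)                # Newton weights
        H = Z.T @ (Z * Wdiag[:, None])       # Z^t W Z (curvature)
        step = np.linalg.solve(H, grad)      # Newton step
        a += step
        if np.linalg.norm(step) < tol:
            break
    return a

# toy usage with two overlapping Gaussian classes (illustrative only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
y = np.repeat([0, 1], 100)
print(logreg_irls(X, y))
```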
Large margin (linear) classifiers
• there are theoretical considerations justifying the goal of maximizing the margin achieved by the separating hyperplane
• intuitively, the larger the margin, the more "room" for noise in the data and, hence, the better the generalization
• let a training set be $\{(x_i, y_i),\ i = 1, \dots, n\}$ with $y_i = \pm 1$
• the margin of a point $x_i$ w.r.t. the boundary function $h$ is $\gamma_i = y_i h(x_i)$
• it can be shown that the generalization error of $h$ is upper bounded by a function of $\min_i \gamma_i$ (however, the bound might not be tight)

Linearly separable case
• consider the dataset $\{(x_i, y_i),\ i = 1, \dots, n\}$ to be linearly separable, i.e. $\gamma_i > 0$ for all $i$
• we will consider linear classifiers $h(x) = \langle w, x\rangle + w_0$ (with the predicted class being $\mathrm{sign}(h(x))$)
• if the (functional) margin achieved is 1, then $\gamma_i \geq 1$
• then, the geometric margin is the normalized functional margin $1/\|w\|$, hence:

Proposition. The hyperplane $(w, w_0)$ that solves the optimization problem
$$\text{minimize}_{w, w_0}\ \frac{1}{2}\langle w, w\rangle \quad \text{subject to}\ y_i(\langle w, x_i\rangle + w_0) \geq 1,\ i = 1, \dots, n$$
realizes the maximal margin hyperplane with geometric margin $\gamma = 1/\|w\|$. (A small numerical check of this primal problem is sketched after the discussion of duality below.)

Solving the constrained optimization problem:
• let the objective function be $f(w)$ and the equality constraints $h_i(w) = 0$ for $i = 1, \dots, m$; then the Lagrangian function is
$$L(w, \beta) = f(w) + \sum_{i=1}^{m}\beta_i h_i(w)$$
• a necessary and sufficient condition for $w^*$ to be a solution of the optimization problem ($f$ continuous and convex, $h_i$ continuous and differentiable) is
$$\frac{\partial L(w^*, \beta^*)}{\partial w} = 0, \qquad \frac{\partial L(w^*, \beta^*)}{\partial \beta} = 0$$
for some values of $\beta^*$.

For a constrained optimization with a domain $\Omega \subseteq \mathbb{R}^n$:
$$\text{minimize}_{w}\ f(w) \quad \text{subject to}\ g_i(w) \leq 0,\ i = 1, \dots, k; \quad h_i(w) = 0,\ i = 1, \dots, m$$
the Lagrangian function has the form
$$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k}\alpha_i g_i(w) + \sum_{i=1}^{m}\beta_i h_i(w)$$
with $\alpha_i$ and $\beta_i$ being the Lagrange multipliers.

Karush-Kuhn-Tucker (KKT) optimality conditions for a convex optimization problem: for a solution $w^*$ and the corresponding multipliers $\alpha^*$ and $\beta^*$,
$$\frac{\partial L}{\partial w} = 0, \quad \frac{\partial L}{\partial \beta} = 0, \quad \alpha_i^* g_i(w^*) = 0, \quad g_i(w^*) \leq 0, \quad \alpha_i^* \geq 0$$
• for active constraints ($g_i(w) = 0$), $\alpha_i > 0$; for inactive constraints ($g_i(w) < 0$), $\alpha_i = 0$
• $\alpha_i$ can be seen as the sensitivity of $f$ to the active constraint

Duality of convex optimization:
• the solution is a saddle point of the Lagrangian
• $w$ are the primal variables
• the Lagrange multipliers are the dual variables
• solving the dual optimization problem may be simpler: the Lagrange multipliers are the main variables, so set to 0 the derivatives w.r.t. $w$ and substitute the result into the Lagrangian
• the resulting function contains only the dual variables and must be maximized under simpler constraints
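Before deriving the dual for the maximal-margin problem, the Proposition above can be checked numerically: the primal problem is a small quadratic program that a generic convex solver can handle. The sketch below assumes the cvxpy package (not used elsewhere in these notes) and a linearly separable toy dataset.

```python
import numpy as np
import cvxpy as cp

# linearly separable toy data (illustrative only)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2.0, 0.5, (40, 2)), rng.normal(2.0, 0.5, (40, 2))])
y = np.repeat([-1.0, 1.0], 40)

# primal problem: minimize (1/2)<w, w>  subject to  y_i (<w, x_i> + w0) >= 1
w = cp.Variable(2)
w0 = cp.Variable()
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + w0) >= 1])
problem.solve()

print("w* =", w.value, " w0* =", w0.value)
print("geometric margin 1/||w*|| =", 1.0 / np.linalg.norm(w.value))
```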
...and back to our initial problem:
• the primal Lagrangian is
$$L(w, w_0, \alpha) = \frac{1}{2}\langle w, w\rangle - \sum_{i=1}^{n}\alpha_i\left[y_i(\langle w, x_i\rangle + w_0) - 1\right]$$
• from the KKT conditions, $w = \sum_i y_i\alpha_i x_i$ and $\sum_i y_i\alpha_i = 0$
• which leads to the dual Lagrangian
$$L(w, w_0, \alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j y_i y_j\alpha_i\alpha_j\langle x_i, x_j\rangle$$

Proposition. If $\alpha^*$ is the solution of the quadratic optimization problem
$$\text{maximize}\ W(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} y_i y_j\alpha_i\alpha_j\langle x_i, x_j\rangle$$
$$\text{subject to}\ \sum_{i=1}^{n}\alpha_i y_i = 0, \quad \alpha_i \geq 0,\ i = 1, \dots, n$$
then the vector $w^* = \sum_i y_i\alpha_i^* x_i$ realizes the maximal margin hyperplane with geometric margin $1/\|w^*\|$.

• in the dual formulation, $w_0^*$ still needs to be specified, e.g.
$$w_0^* = -\frac{1}{2}\left[\max_{y_i = -1}\{\langle w^*, x_i\rangle\} + \min_{y_i = 1}\{\langle w^*, x_i\rangle\}\right]$$
• from the KKT conditions, $\alpha_i^*\left[y_i(\langle w^*, x_i\rangle + w_0^*) - 1\right] = 0$, so $\alpha_i^* > 0$ only for the $x_i$ lying on the margin
• those $x_i$ for which $\alpha_i > 0$ are called support vectors
• the optimal hyperplane is a linear combination of the support vectors:
$$h(x) = \sum_{i\in SV} y_i\alpha_i^*\langle x_i, x\rangle + w_0^*$$
[Figure: separating hyperplane with the margin boundaries $\langle w, x\rangle + w_0 = 1$ and $\langle w, x\rangle + w_0 = -1$]
• the margin achieved is
$$\gamma = \frac{1}{\|w^*\|} = \left(\sum_{i\in SV}\alpha_i^*\right)^{-1/2}$$
• a (leave-one-out) estimate of the generalization error is bounded by the proportion of support vectors in the training sample, $\#SV/n$

Non-linearly separable case: soft margins

L2-norm soft margins
• introduce the slack variables $\xi_i$ and allow "softer" margins:
$$\text{minimize}_{w, w_0, \xi}\ \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^{n}\xi_i^2, \quad \text{subject to}\ y_i(\langle w, x_i\rangle + w_0) \geq 1 - \xi_i,\ \ \xi_i \geq 0,\ i = 1, \dots, n$$
for some $C > 0$
• theory suggests the optimal choice $C = 1/\max_i\{\|x_i\|^2\}$, but in practice $C$ is selected by testing various values
• the problem is solved in the dual space and the margin achieved is
$$\left(\sum_{i\in SV}\alpha_i^* - \|\alpha^*\|^2/C\right)^{-1/2}$$

L1-norm soft margins
• optimization problem:
$$\text{minimize}_{w, w_0, \xi}\ \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^{n}\xi_i, \quad \text{subject to}\ y_i(\langle w, x_i\rangle + w_0) \geq 1 - \xi_i,\ \ \xi_i \geq 0,\ i = 1, \dots, n$$
for some $C > 0$
• this results in "box constraints" on $\alpha_i$: $0 \leq \alpha_i \leq C$
• non-zero slack variables correspond to $\alpha_i = C$ and to points with geometric margin less than $1/\|w\|$ (a short numerical illustration is given after the wrap-up)

Wrap-up
• LDA and MSE-based methods lead to similar solutions, even though they are derived under different assumptions
• LDA (and FDA) assign a vector $x$ to the closest centroid in the transformed space
• logistic regression and LDA both model the log-posterior odds as a linear function of $x$
• the predicted values from logistic regression can be interpreted as posterior probabilities
• margin optimization provides an alternative approach to constructing linear classifiers
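To close, a short scikit-learn sketch (assuming the library is available) fits a 1-norm soft-margin linear SVM and reads off the support vectors and the achieved margin $1/\|w\|$; the data and the value of C are illustrative, and in practice C would be chosen by cross-validation.

```python
import numpy as np
from sklearn.svm import SVC

# overlapping two-class toy data (illustrative only)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
y = np.repeat([-1, 1], 100)

# 1-norm soft margin linear SVM: box constraints 0 <= alpha_i <= C
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_.ravel()
n_sv = clf.support_vectors_.shape[0]
print("number of support vectors:", n_sv)
print("geometric margin 1/||w|| =", 1.0 / np.linalg.norm(w))
print("proportion of support vectors (#SV/n):", n_sv / len(y))
```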