PA196: Pattern Recognition
3. Linear discriminants

Vlad Popovici
popovici@iba.muni.cz
Institute of Biostatistics and Analyses, Masaryk University, Brno

Outline
1 Introduction: general problem, margins, generalizations
2 Linearly separable binary problems: general approach, the perceptron
3 Fisher discriminant analysis
4 Linear regression: minimum squared-error procedures, the Widrow-Hoff procedure, Ho-Kashyap procedures

Reminder - scalar product
• scalar (dot, inner) product of two vectors $x, w \in \mathbb{R}^d$: $w \cdot x = \langle w, x \rangle = w^t x = \sum_{i=1}^d w_i x_i \in \mathbb{R}$
• $\cos\theta = \frac{\langle w, x \rangle}{\|w\| \, \|x\|}$
• $\langle w, x \rangle = 0 \iff w \perp x$
• the projection of $x$ on $w$ is $\mathrm{Proj}_w x = \frac{\langle x, w \rangle}{\|w\|} \frac{w}{\|w\|} = \frac{\langle x, w \rangle}{\|w\|^2} w$
[Figure: projection of $x$ onto $w$.]

General problem
• we consider the binary classification problem ($K = 2$)
• without loss of generality, we let the labels of the classes be $\pm 1$
• we are given a set $X \times Y = \{(x_i, y_i) \mid i = 1, \dots, n\} \subset \mathbb{R}^d \times \{-1, +1\}$
• the goal is to find the parameters of the classifier such that the number of misclassified points is minimized
• let the discriminant function have the form $h(x) = w^t x + w_0 = \langle w, x \rangle + w_0 = w_0 + \sum_{i=1}^d w_i x_i$
• note that $x$ can be replaced with $\phi(x)$! (we'll discuss this later)
• the classifier is $\mathrm{sign}(h(x)) = \mathrm{sign}(\langle w, x \rangle + w_0)$
• an error occurs if $\mathrm{sign}(\langle w, x_i \rangle + w_0) \neq y_i$; in other words, if $y_i(\langle w, x_i \rangle + w_0) < 0 \iff y_i h(x_i) < 0$
• the risk of misclassification (error) is $R(h) = \Pr[Y \neq \mathrm{sign}(h(X))]$, where $(X, Y)$ is a random pair of observations
• the empirical risk is the estimate of the risk on a given set of points: $\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^n 1_{\{y_i \neq \mathrm{sign}(h(x_i))\}} = \frac{1}{n}\sum_{i=1}^n 1_{\{y_i h(x_i) < 0\}}$
• you need $n \geq d + 1$ points for learning the classifier

[Figure: the linear decision boundary $H$, where $h(x) = w^t x + w_0 = 0$, separates the feature space into two half-spaces $R_1$ (where $h(x) > 0$) and $R_2$ (where $h(x) < 0$). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.]

Margins

Functional margin
The functional margin of a point $x_i$ with respect to a hyperplane $w$ is defined to be
$\gamma_i = y_i(\langle w, x_i \rangle + w_0) = y_i h(x_i)$.

Geometric margin
The geometric margin of a point $x_i$ with respect to a hyperplane $w$ is defined to be
$\gamma_i = y_i\left(\left\langle \frac{w}{\|w\|}, x_i \right\rangle + \frac{w_0}{\|w\|}\right) = y_i \frac{h(x_i)}{\|w\|}$.
→ The geometric margin is the normalized functional margin.

Margin of a point, margin of a set (of points)
The margin of a hyperplane with respect to a set of points is the smallest margin over the points of the set. The maximum such margin among all (hyper)planes is the margin of the set of points; the corresponding hyperplane is called the maximum margin hyperplane.
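As a quick illustration of the two definitions, here is a small sketch in Python/NumPy; the function names, the example hyperplane and the test point are illustrative, not part of the lecture material.

import numpy as np

def functional_margin(w, w0, x, y):
    # gamma_i = y_i * (<w, x_i> + w_0)
    return y * (np.dot(w, x) + w0)

def geometric_margin(w, w0, x, y):
    # the normalized functional margin: y_i * h(x_i) / ||w||
    return functional_margin(w, w0, x, y) / np.linalg.norm(w)

# example: the hyperplane x_1 + x_2 - 1 = 0 and a correctly classified point
w, w0 = np.array([1.0, 1.0]), -1.0
x, y = np.array([2.0, 1.0]), +1
print(functional_margin(w, w0, x, y))   # 2.0
print(geometric_margin(w, w0, x, y))    # 2.0 / sqrt(2) ~ 1.414

Note that rescaling $(w, w_0)$ changes the functional margin but leaves the geometric margin unchanged, which is why the latter is the quantity maximized by the maximum margin hyperplane.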
Generalization to multi-class problems
• a multi-class problem can be decomposed into a series of two-class problems: 1-vs-all or 1-vs-1
• or, one can use $K$ (the number of classes) discriminant functions $h_i(x)$ and build classifiers of the form: assign $x$ to class $i$ if $h_i(x) > h_j(x)$ for all $j \neq i$
• this defines $K(K-1)/2$ hyperplanes $H_{ij}: h_i(x) - h_j(x) = 0$
• in practice, there are usually fewer hyperplanes that form the decision surface

[Figure: decision regions $R_1, R_2, R_3$ of three discriminants $g_1, g_2, g_3$ and the hyperplanes $H_{12}, H_{13}, H_{23}$.]

Generalized linear discriminants
Consider a function $\psi: \mathbb{R}^d \to \mathbb{R}^{\hat{d}}$. The discriminant function
$g(x) = \langle a, \psi(x) \rangle = \sum_{i=1}^{\hat{d}} a_i \psi_i(x)$
is a linear function in $a$ (but not in $x$).
Example: let $x \in \mathbb{R}$ and let $\psi(x) = [1, x, x^2]^t \in \mathbb{R}^3$.

[Figure: points on the real line mapped by $\psi(x)$ into $\mathbb{R}^3$.]

Remarks:
• a problem which is not linearly separable in $\mathbb{R}^d$ may become linearly separable in $\mathbb{R}^{\hat{d}}$
• but how to choose $\psi$?
• finding the coefficients in $\mathbb{R}^{\hat{d}}$ requires many more training points!
• the decision surface, when projected back into $\mathbb{R}^d$ (by $\psi^{-1}$), is non-linear

• a convenient (but trivial) transformation: "normalization" of the notation
• take $\psi(x) = y[1, x]^t$. This allows us to write $\gamma = y h(x) = y(\langle w, x \rangle + w_0) = \langle a, z \rangle$, where $a = [w_0, w]^t$ and $z = y[1, x]^t$
• the problem becomes: find $a$ such that $\langle a, z \rangle > 0$, i.e. all the margins are positive
• the decision surface $\hat{H}$ in $\mathbb{R}^{d+1}$, defined by $\langle a, z \rangle = 0$, corresponds to a hyperplane passing through the origin of the $z$-space

Linearly separable binary problems
• consider we are given the set $\{(x_i, y_i)\}$ with $y_i = \pm 1$
• with the previous "normalized" notation, the set is linearly separable if $\langle a, z_i \rangle > 0$, $\forall i = 1, \dots, n$
• the solution $a$ is constrained by each point $z_i$
[Figure: the solution region for $a$ in weight space; each training point $z_i$ restricts the solution to one half-space.]
• under current conditions, the solution is not unique!
• solutions on the boundary of the solution space may be too sensitive → you can use the condition $\langle a, z_i \rangle \geq \xi > 0$

General approach
• let $J(a)$ be a criterion function that measures the "suitability" of a candidate solution $a$
• by convention, the solution to the classification problem is obtained as $a^* = \arg\min_a J(a)$
• usually, $J$ is chosen to be continuous (at least in a neighborhood of the solution) and differentiable

Gradient descent
$a_{k+1} = a_k - \eta_k \nabla J(a_k)$
• the negative gradient, $-\nabla J(a)$, is locally the steepest descent towards a (local) minimum
• $\eta_k$ is a line search parameter or learning rate
• start with some $a_0$ and iterate until $\|\eta_k \nabla J(a_k)\| < \theta$
[Figure: gradient descent iterates converging to a minimum.]

Using Taylor's 2nd-order approximation,
$J(a) \approx J(a_k) + \nabla J^t (a - a_k) + \frac{1}{2}(a - a_k)^t H (a - a_k)$,
where $H = \left[\frac{\partial^2 J}{\partial a_i \partial a_j}\right]_{ij}$ is the Hessian matrix, one can find the optimal learning rate as
$\eta_k = \frac{\|\nabla J\|^2}{(\nabla J)^t H (\nabla J)}$.
Note: if $J$ is quadratic, then $\eta_k$ is a constant.
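A minimal sketch of the gradient descent iteration above (Python/NumPy). The quadratic criterion, the matrix Q and all names are illustrative assumptions, used only to show the update $a_{k+1} = a_k - \eta_k \nabla J(a_k)$ and the second-order step-size formula.

import numpy as np

def gradient_descent(grad, a0, eta, theta=1e-8, max_iter=10000):
    # generic descent: a_{k+1} = a_k - eta * grad(a_k); stop when the step is small
    a = np.array(a0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad(a)
        if np.linalg.norm(step) < theta:
            break
        a = a - step
    return a

# illustrative quadratic criterion J(a) = 1/2 a^t Q a - c^t a, so grad J = Q a - c
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, 1.0])
grad = lambda a: Q @ a - c

# step size from eta = ||grad J||^2 / (grad J^t H grad J), evaluated at the starting point
g0 = grad(np.zeros(2))
eta = (g0 @ g0) / (g0 @ Q @ g0)
print(gradient_descent(grad, np.zeros(2), eta))   # ~ Q^{-1} c = [0.2, 0.4]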
Newton's method
$a_{k+1} = a_k - H^{-1} \nabla J$
• works well for quadratic objective functions
• problems if the Hessian is singular
• no need to invert $H$: solve the system $Hs = -\nabla J$ and update the solution $a_{k+1} = a_k + s$

The perceptron
• criterion: find $a^*$ (or, equivalently, $w^*$ and $w_0^*$) that minimizes
$J(a) = -\sum_{i \in I} \gamma_i = -\sum_{i \in I} \langle a, z_i \rangle$,
where $I$ is the set of indices of misclassified points
• note: since $\gamma_i < 0$ for all misclassified points, $J(a) \geq 0$, reaching 0 when all points are correctly classified
• it is easy to see that $\nabla_a J(a) = -\sum_{i \in I} z_i$
• using gradient descent we get updating iterations of the form $a_{k+1} = a_k + \eta_k \sum_{i \in I_k} z_i$ (or, sample by sample, $a_{k+1} = a_k + \eta_k z_i$)
• the perceptron is guaranteed to converge in a finite number of iterations, if the training set is separable

Novikoff's theorem
• from Novikoff's theorem, the number of mistakes the perceptron makes is upper bounded by $\left(\frac{2R}{\gamma}\right)^2$, where $R$ is the radius of the sphere containing the data points, i.e. $R = \max_i \|x_i\|$, and $\gamma$ is the margin of the set

Perceptron algorithm (batch perceptron)
Input: a separable training set $X \times Y$ and a stopping criterion $\theta$
Output: $a_k$ such that $\gamma_i > 0$, $\forall i$; $k$ is the number of mistakes
1: $a_0 \leftarrow 0$, $k \leftarrow 0$, $\eta_0 \leftarrow$ some initial value
2: repeat
3:   for $i = 1$ to $n$ do
4:     if $\gamma_i = \langle a_k, z_i \rangle < 0$ then
5:       $a_{k+1} \leftarrow a_k + \eta_k z_i$
6:       $k \leftarrow k + 1$
7:     end if
8:   end for
9: until $\|\eta_k \sum_{i \in I_k} z_i\| < \theta$

What about $\eta_k$? There are different "schedules" for modifying it...
• conditions: $\eta_k \geq 0$, $\lim_{m\to\infty} \sum_{k=1}^m \eta_k = \infty$ and $\lim_{m\to\infty} \frac{\sum_{k=1}^m \eta_k^2}{\left(\sum_{k=1}^m \eta_k\right)^2} = 0$
• $\eta_k = \text{constant} > 0$
• $\eta_k \propto \frac{1}{k}$

• let $a$ be the solution of the perceptron algorithm
• it is easy to see that $a = \sum_{i=1}^n \alpha_i z_i$, where
$\alpha_i = \begin{cases} 0, & \text{if point } i \text{ was always correctly classified} \\ > 0, & \propto \text{the number of times point } i \text{ was misclassified} \end{cases}$
• $\alpha_i$ can be seen as the importance (or contribution) of $z_i$ to the classification rule
• the discriminant function can be rewritten as $h(x) = \langle a, z \rangle = \left\langle \sum_{i=1}^n \alpha_i z_i, z \right\rangle = \sum_{i=1}^n \alpha_i \langle z_i, z \rangle$
• this is the dual form of the perceptron algorithm

Dual formulation of the perceptron algorithm
Input: a training set $X \times Y$
Output: $\alpha = [\alpha_1, \dots, \alpha_n]$
1: $\alpha \leftarrow 0$
2: repeat
3:   for $i = 1$ to $n$ do
4:     if $\gamma_i = \sum_{j=1}^n \alpha_j \langle z_j, z_i \rangle \leq 0$ then
5:       $\alpha_i \leftarrow \alpha_i + 1$
6:     end if
7:   end for
8: until no mistakes

Dual representation - remarks
• in the dual representation, the only way the data enters the algorithm/formula is through the dot products $\langle z_i, z_j \rangle$
• this property is valid for a large class of methods
• the dot products for the data can be computed offline and stored in a Gram matrix $G = [\langle z_i, z_j \rangle]_{ij}$
• similarly, to predict the class of a new point $x$, just (some of) the products $\langle z, z_i \rangle$ are needed
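To make the dual form concrete, here is a sketch in Python/NumPy; the function name and the toy data are illustrative. The counts $\alpha_i$ are updated using only the Gram matrix $G = [\langle z_i, z_j \rangle]_{ij}$, as in the dual algorithm above.

import numpy as np

def dual_perceptron(X, y, max_epochs=100):
    # dual-form perceptron on the "normalized" data z_i = y_i [1, x_i]^t
    Z = y[:, None] * np.hstack([np.ones((len(y), 1)), X])
    G = Z @ Z.T                          # Gram matrix of the z_i
    alpha = np.zeros(len(y))
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(len(y)):
            if alpha @ G[:, i] <= 0:     # gamma_i = sum_j alpha_j <z_j, z_i>
                alpha[i] += 1            # point i was misclassified: increase its count
                mistakes += 1
        if mistakes == 0:                # a full pass without mistakes: done
            break
    return alpha, Z

# toy separable problem; the primal solution is recovered as a = sum_i alpha_i z_i
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
alpha, Z = dual_perceptron(X, y)
a = alpha @ Z                            # a = [w_0, w]
print(alpha, a, Z @ a)                   # all margins <a, z_i> should be positive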
Relaxation procedures
Another objective function:
$J_r(a) = \frac{1}{2} \sum_{i \in I} \frac{(\langle a, z_i \rangle - \xi)^2}{\|z_i\|^2}$
• it is smooth and has a continuous gradient
• the term $\xi$ is introduced to avoid a solution on the boundary of the solution space
• $\|z_i\|^2$ is a normalization term that keeps $J_r$ from being dominated by the largest vectors
• the $1/2$ is merely to make the gradient nicer...
$\nabla J_r = \sum_{i \in I} \frac{\langle a, z_i \rangle - \xi}{\|z_i\|^2} z_i$

Algorithms:
• batch relaxation with margin; update step:
$a_{k+1} = a_k + \eta_k \sum_{i \in I_k} \frac{\xi - \langle a_k, z_i \rangle}{\|z_i\|^2} z_i$
• single-sample relaxation with margin; update step (for each misclassified sample $z_i$):
$a_{k+1} = a_k + \eta_k \frac{\xi - \langle a_k, z_i \rangle}{\|z_i\|^2} z_i$
• if $\eta_k < 1$: underrelaxation; if $\eta_k > 1$: overrelaxation

Fisher discriminant analysis

Fisher criterion
Objective: find the hyperplane $(w, w_0)$ on which the projected data are maximally separated.
• the length of the projection of a vector $z$ onto $w$ is $\frac{\langle w, z \rangle}{\|w\|}$
• projection of the difference vector between the means of the two classes (taking $\|w\| = 1$): $|\langle w, \mu_{+1} - \mu_{-1} \rangle|$
• maximize the difference, relative to the projected pooled variance (scatter): $\frac{1}{n_{+1} + n_{-1}}(s_{+1}^2 + s_{-1}^2)$
• $s_c^2 = \sum_i (\langle w, x_i \rangle - \langle w, \mu_c \rangle)^2$, where the sum is over the elements of class $c$

Objective: maximize
$J(w) = \frac{|\langle w, \mu_{+1} \rangle - \langle w, \mu_{-1} \rangle|^2}{s_{+1}^2 + s_{-1}^2}$

Fisher criterion
$w^* = \arg\max_w J(w) = \arg\max_w \frac{w^t S_B w}{w^t S_W w}$
where
• $S_B = (\mu_{+1} - \mu_{-1})(\mu_{+1} - \mu_{-1})^t$ ← between-class scatter matrix
• $S_W = \sum_{i \in I_{+1}} (x_i - \mu_{+1})(x_i - \mu_{+1})^t + \sum_{i \in I_{-1}} (x_i - \mu_{-1})(x_i - \mu_{-1})^t$ ← within-class scatter matrix
• $S_W$ is proportional to the sample covariance matrix of the pooled data
• $J(w)$ is also known as the Rayleigh quotient

• the solution has the form $w^* \propto S_W^{-1}(\mu_{+1} - \mu_{-1})$ and it defines the direction of Fisher's linear discriminant
• the classification of $d$-dimensional points is transformed into a classification of one-dimensional points
• no assumption on the underlying distributions was made in finding $w^*$
• the complete form of the linear discriminant is $\langle w, x \rangle + w_0 = 0$
• to find $w_0$ one can, for example:
  • assume $p(x \mid \pm 1)$ to be Gaussian: this leads to the previously seen formulas for $w_0$ (see Ch. 2)
  • try to find a value optimal for the training set
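A sketch of computing the Fisher direction $w \propto S_W^{-1}(\mu_{+1} - \mu_{-1})$ in Python/NumPy; the function name and the synthetic data are illustrative, and the within-class scatter system is solved directly instead of inverting $S_W$.

import numpy as np

def fisher_direction(X, y):
    # X is n x d, y contains +1 / -1 labels
    Xp, Xm = X[y == +1], X[y == -1]
    mu_p, mu_m = Xp.mean(axis=0), Xm.mean(axis=0)
    # within-class scatter matrix: sum of the two per-class scatter matrices
    Sw = (Xp - mu_p).T @ (Xp - mu_p) + (Xm - mu_m).T @ (Xm - mu_m)
    # solve S_W w = mu_p - mu_m rather than forming S_W^{-1} explicitly
    w = np.linalg.solve(Sw, mu_p - mu_m)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
Xp = rng.normal([2.0, 2.0], 1.0, size=(50, 2))
Xm = rng.normal([-2.0, -2.0], 1.0, size=(50, 2))
X = np.vstack([Xp, Xm])
y = np.r_[np.ones(50), -np.ones(50)]
print(fisher_direction(X, y))   # roughly the direction [1, 1] / sqrt(2)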
Linear regression

Minimum squared-error procedures

Linear regression problem
Find $a = [w_0, w]^t$ such that $b_i = \langle a, z_i \rangle$, $i = 1, 2, \dots, n$, for some fixed positive constants $b_i$. In matrix notation, solve the linear system $Za = b$ for $a$.
• $Z$ is an $n \times (d+1)$ matrix (design matrix), $a$ is a $(d+1)$-element vector
• $b$ is an $n$-element vector (response vector)
• usually $n > d + 1$, so the system is overdetermined → no exact solution
• define the error vector $e = Za - b$
• minimum squared-error criterion: minimize $J_s(a) = \|e\|^2 = \sum_{i=1}^n (\langle a, z_i \rangle - b_i)^2$
• at the minimum, the gradient $\nabla J_s = 2Z^t(Za - b)$ is zero $\Rightarrow a = (Z^t Z)^{-1} Z^t b = Z^{\dagger} b$, where $Z^{\dagger}$ is the pseudoinverse of $Z$
• the solution depends on $b$, and different choices lead to different properties of the solution

Relation to Fisher's linear discriminant
• by properly choosing the class coding, one can show that the MSE approach is equivalent to FDA
• $b_i = \frac{n}{n_{+1}}$ for the class "+1" (with $n_{+1}$ elements) and $b_j = \frac{n}{n_{-1}}$ for the class "-1" (with $n_{-1}$ elements)
• the MSE criterion for $a = [w_0, w]$ leads to $w \propto n S_W^{-1}(\mu_{+1} - \mu_{-1})$, which is the direction of FDA
• additionally, it gives a value for the threshold: $w_0 = -\mu^t w$ ($\mu$ is the grand mean vector)
• the decision rule becomes: if $w^t(x - \mu) > 0$, classify $x$ as belonging to the first class

Relation with the Bayesian classifier
• let the Bayesian discriminant be $h_0(x) = P(g_1 \mid x) - P(g_2 \mid x)$
• the samples are assumed to be drawn independently and identically distributed from the underlying distribution $p(x) = p(x \mid g_1)P(g_1) + p(x \mid g_2)P(g_2)$
• the MSE becomes $\epsilon^2 = \int (\langle a, z \rangle - h_0(x))^2 p(x)\, dx$
• → the solution to the MSE problem, $a$, generates an approximation of the Bayesian discriminant
• but $p(x) = ?$
• the main problem of MSE: it places more emphasis on points with high $p(x)$ instead of points near the decision surface
• → the "best" approximation of the Bayes decision does not necessarily minimize the probability of error

Numerical considerations on the LS problem
Using the pseudoinverse is not the best technique from a numerical stability perspective:
• computing $Z^t Z$ and $Z^t b$ may lead to information loss due to approximations in floating-point computations
• the conditioning of the system is worsened: $\mathrm{cond}(Z^t Z) = [\mathrm{cond}(Z)]^2$
Normally, a matrix factorization is used for improved numerical stability: QR, SVD, ...

QR factorization
The $n \times m$ (with $n > m$) matrix $Z$ can be factorized as $Z = QR$, where
• $Q$ is an orthogonal matrix: $Q^t Q = I \iff Q^{-1} = Q^t$
• $R$ is an upper triangular matrix
With this, the solution $a$ to our problem is the solution of the triangular system (solved by back-substitution): $Ra = Q^t b$

A statistical perspective
A linear model (linear regression) problem: $E[b] = Za$, under the assumption $\mathrm{Cov}(b) = \sigma^2 I$.
It can be shown that the best linear unbiased estimator is
$\hat{a} = (Z^t Z)^{-1} Z^t b = R^{-1} Q^t b$
for a decomposition $Z = QR$. Then $\hat{b} = Q Q^t b$.
(Gauss-Markov theorem: the LS estimator has the lowest variance among all unbiased linear estimators.)
Also, $\mathrm{Var}(\hat{a}) = (Z^t Z)^{-1}\sigma^2 = (R^t R)^{-1}\sigma^2$, where $\sigma^2$ is estimated by $\|b - \hat{b}\|^2 / (n - d - 1)$.
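The QR route described above can be sketched as follows (Python/NumPy; the toy data are illustrative): the least-squares solution is obtained from the triangular system $Ra = Q^t b$, avoiding the explicit formation of $Z^t Z$.

import numpy as np

def lse_qr(Z, b):
    # least-squares solution of Z a ~ b via the thin QR factorization Z = QR
    Q, R = np.linalg.qr(Z)              # Q: n x m with orthonormal columns, R: m x m upper triangular
    return np.linalg.solve(R, Q.T @ b)  # solve the triangular system R a = Q^t b

# small overdetermined example: n = 5 points, Z = [1, x] (d = 1)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Z = np.column_stack([np.ones_like(x), x])
b = 1.0 + 2.0 * x + np.array([0.1, -0.1, 0.05, 0.0, -0.05])   # noisy responses
a = lse_qr(Z, b)
print(a)                                         # close to [w_0, w] = [1, 2]
print(np.allclose(a, np.linalg.pinv(Z) @ b))     # agrees with the pseudoinverse solution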
The Widrow-Hoff procedure
• the MSE criterion, $J_s(a) = \sum_{i=1}^n (\langle a, z_i \rangle - b_i)^2$, can also be minimized by the gradient descent method
• since $\nabla J_s = 2Z^t(Za - b)$, the update rule becomes: $a_1 =$ some value, $a_{k+1} = a_k + \eta_k Z^t(b - Z a_k)$
• if $\eta_k = \eta_1 / k$, the procedure converges to a limiting value of $a$ satisfying $Z^t(Za - b) = 0$
• this algorithm yields a solution even if $Z^t Z$ is singular or badly conditioned

The Widrow-Hoff (or LMS) algorithm implements sequential gradient descent. (In signal processing: the least mean squares filter, adaptive filtering...)

Widrow-Hoff (LMS) algorithm
Input: a training set $(X, y)$
Output: $a$ - approximate MSE solution
1: initialize $a$, $b$, $\eta_1$, $\theta$ and $k \leftarrow 0$
2: repeat
3:   $k \leftarrow (k + 1) \bmod n$
4:   $a \leftarrow a + \eta_k (b_k - \langle a, z_k \rangle) z_k$
5:   $\eta_k \leftarrow \eta_1 / k$
6: until $\|\eta_k (b_k - \langle a, z_k \rangle) z_k\| < \theta$

[Figure (DHS, Fig. 5.17): comparison of a separating hyperplane and the LMS solution.]

Ho-Kashyap procedures
• consider $b = Za$ to be the margins (instead of fixed labels)
• idea: adjust both the coefficients $a$ and the margins $b$ such that $b > 0$ (each margin should be positive)
• formally: find $a$ and $b > 0$ such that $J_s(a, b) = \|Za - b\|^2$ becomes 0
• use a modified gradient descent, with the gradient taken w.r.t. both $a$ and $b$
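The slides stop at the idea of the Ho-Kashyap procedure; the sketch below (Python/NumPy) fills in the concrete updates following the standard formulation in Duda, Hart & Stork, so the specific details (b grows by the positive part of the error, a is re-fitted with the pseudoinverse) should be read as an assumption rather than as part of the lecture text.

import numpy as np

def ho_kashyap(Z, eta=0.5, b0=1.0, tol=1e-6, max_iter=1000):
    # adjust both a and the margins b > 0 so that ||Z a - b||^2 -> 0
    b = np.full(Z.shape[0], b0)        # initial positive margins
    Zpinv = np.linalg.pinv(Z)          # Z^dagger, reused at every iteration
    a = Zpinv @ b                      # MSE solution for the current b
    for _ in range(max_iter):
        e = Z @ a - b                  # error vector
        if np.all(np.abs(e) < tol):    # converged: Z a ~ b > 0
            break
        e_plus = 0.5 * (e + np.abs(e)) # positive part of the error, keeps b positive
        b = b + 2.0 * eta * e_plus     # raise the margins where <a, z_i> exceeds b_i
        a = Zpinv @ b                  # re-solve the MSE problem for the new margins
    return a, b

# usage on "normalized" data z_i = y_i [1, x_i]^t
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
Z = y[:, None] * np.hstack([np.ones((4, 1)), X])
a, b = ho_kashyap(Z)
print(a, Z @ a)   # for a separable set all margins <a, z_i> should end up positive

For a linearly separable training set this iteration reaches a solution with all margins positive in a finite number of steps; if the set is not separable, the error vector eventually has no positive components while being nonzero, which can serve as a non-separability indication.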