PA196: Pattern Recognition
3. Linear discriminants

Vlad Popovici
popovici@iba.muni.cz
Institute of Biostatistics and Analyses, Masaryk University, Brno

Outline
1 Introduction: General problem; Margins; Generalizations
2 Linearly separable binary problems: General approach; The perceptron
3 Fisher discriminant analysis
4 Linear regression: Minimum squared-error procedures; The Widrow-Hoff procedure; Ho-Kashyap procedures

Introduction

Reminder - scalar product
- the scalar (dot, inner) product of two vectors x, w ∈ R^d: ⟨w, x⟩ = w^t x = Σ_{i=1}^d w_i x_i ∈ R
- cos θ = ⟨w, x⟩ / (||w|| ||x||)
- ⟨w, x⟩ = 0 ⟺ w ⊥ x
- the projection of x on w is Proj_w x = (⟨x, w⟩ / ||w||) (w / ||w||) = (⟨x, w⟩ / ||w||²) w
[Figure: the vectors x and w, the angle θ between them, and the projection Proj_w x.]

General problem
- we consider the binary classification problem (K = 2)
- without loss of generality, we let the labels of the classes be ±1
- we are given a set X × Y = {(x_i, y_i) | i = 1, ..., n} ⊂ R^d × {−1, +1}
- the goal is to find the parameters of the classifier such that the number of misclassified points is minimized
- let the discriminant function have the form h(x) = w^t x + w_0 = ⟨w, x⟩ + w_0 = w_0 + Σ_{i=1}^d w_i x_i
- note that x can be replaced with φ(x)! (we'll discuss this later)
- the classifier is sign(h(x)) = sign(⟨w, x⟩ + w_0)

- an error occurs if sign(⟨w, x_i⟩ + w_0) ≠ y_i; in other words, if y_i(⟨w, x_i⟩ + w_0) < 0 ⟺ y_i h(x_i) < 0
- the risk of misclassification (error) is R(h) = Pr[Y ≠ sign(h(X))], where (X, Y) is a random pair of observations
- the empirical risk is the estimate of the risk on a given set of points: R̂_n(h) = (1/n) Σ_{i=1}^n 1{y_i ≠ sign(h(x_i))} = (1/n) Σ_{i=1}^n 1{y_i h(x_i) < 0}
- you need n ≥ d + 1 points for learning the classifier

[Figure: the linear decision boundary H, where h(x) = w^t x + w_0 = 0, separates the feature space into two half-spaces R_1 (where h(x) > 0) and R_2 (where h(x) < 0); w is normal to H, a point x decomposes as x = x_p + r w/||w||, and H lies at distance w_0/||w|| from the origin. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.]
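As a concrete illustration of h(x), the induced classifier and the empirical risk, here is a minimal numpy sketch; the data, w and w_0 are invented for the example:

import numpy as np

# toy data: n points in R^d with labels in {-1, +1} (made-up values)
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

w = np.array([1.0, 0.5])     # some candidate weight vector
w0 = -0.2                    # some candidate offset

h = X @ w + w0               # h(x_i) = <w, x_i> + w0 for every point
y_hat = np.sign(h)           # the linear classifier sign(h(x))

# an error occurs when y_i * h(x_i) < 0; the empirical risk is the error fraction
emp_risk = np.mean(y * h < 0)
print(y_hat, emp_risk)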
Margins

Functional margin: the functional margin of a point x_i with respect to a hyperplane (w, w_0) is defined to be γ_i = y_i(⟨w, x_i⟩ + w_0) = y_i h(x_i).

Geometric margin: the geometric margin of a point x_i with respect to a hyperplane (w, w_0) is defined to be γ_i = y_i(⟨w/||w||, x_i⟩ + w_0/||w||) = y_i h(x_i) / ||w||.

→ The geometric margin is the normalized functional margin.

Margin of a point
[Figure: the margin of a point with respect to a hyperplane.]

Margin of a set (of points)
The margin of a set of points is the largest margin attainable over all (hyper)planes. The corresponding hyperplane is called the maximum margin hyperplane.

Generalizations

Generalization to multi-class problems
- a multi-class problem can be decomposed into a series of two-class problems: 1-vs-all or 1-vs-1
- or, one can use K (the number of classes) discriminant functions h_i(x) and build classifiers of the form: assign x to class i if h_i(x) > h_j(x) for all j ≠ i
- this defines K(K − 1)/2 hyperplanes H_ij: h_i(x) − h_j(x) = 0
- in practice, usually fewer hyperplanes form the decision surface
[Figure: three discriminants g_1, g_2, g_3, the hyperplanes H_12, H_13, H_23 and the decision regions R_1, R_2, R_3.]

Generalized linear discriminants
Consider a function ψ: R^d → R^d̂. The discriminant function g(x) = ⟨a, ψ(x)⟩ = Σ_{i=1}^{d̂} a_i ψ_i(x) is a linear function in a (but not in x).
Example: let x ∈ R and let ψ(x) = [1, x, x²]^t ∈ R³.
[Figure: the mapping ψ(x) applied to a one-dimensional data set.]

Remarks:
- a problem which is not linearly separable in R^d may become linearly separable in R^d̂
- ψ = ? (how to choose the mapping is not obvious)
- finding the coefficients in R^d̂ requires many more training points!
- the decision surface, when projected back into R^d (by ψ^{-1}), is non-linear
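A minimal sketch of the 1-D example above, assuming the quadratic map ψ(x) = [1, x, x²]^t from the slide; the data values and the coefficient vector a are invented for illustration:

import numpy as np

# 1-D problem that is NOT linearly separable on the real line:
# class +1 lies inside the interval (-1, 1), class -1 outside it
x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
y = np.array([-1, -1, +1, +1, +1, -1, -1])

def psi(x):
    # the quadratic feature map from the slide: R -> R^3
    return np.stack([np.ones_like(x), x, x**2], axis=1)

# in the psi-space the classes ARE linearly separable, e.g. by a = [1, 0, -1]:
# g(x) = <a, psi(x)> = 1 - x^2 > 0  iff  |x| < 1 (a non-linear boundary in R)
a = np.array([1.0, 0.0, -1.0])
g = psi(x) @ a
print(np.all(np.sign(g) == y))   # True: the quadratic boundary separates the classes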
A convenient (but trivial) transformation: "normalization" of the notation
- map each labelled point to z = y [1, x]^t; this allows us to write γ = y h(x) = y(⟨w, x⟩ + w_0) = ⟨a, z⟩, where a = [w_0, w]^t
- the problem becomes: find a such that ⟨a, z_i⟩ > 0 for all i, i.e. all the margins are positive
- the decision surface Ĥ in R^{d+1}, defined by ⟨a, z⟩ = 0, corresponds to a hyperplane passing through the origin of the z-space

Linearly separable binary problems

- consider we are given the set {(x_i, y_i)} with y_i = ±1
- with the previous "normalized" notation, the set is linearly separable if ⟨a, z_i⟩ > 0, ∀i = 1, ..., n
- the solution a is constrained by each point z_i
- under these conditions alone, the solution is not unique!
- solutions on the boundary of the solution region may be too sensitive → one can use the condition ⟨a, z_i⟩ ≥ ξ > 0 instead

General approach
- let J(a) be a criterion function that measures the "suitability" of a candidate solution a
- by convention, the solution to the classification problem is obtained as a* = arg min_a J(a)
- usually, J is chosen to be continuous (at least in a neighborhood of the solution) and differentiable

Gradient descent
- a_{k+1} = a_k − η_k ∇J(a_k)
- the negative gradient −∇J(a) is locally the direction of steepest descent towards a (local) minimum
- η_k is a line search parameter or learning rate
- start with some a_0 and iterate until ||η_k ∇J(a_k)|| < θ

Using Taylor's 2nd-order approximation, J(a) ≈ J(a_k) + ∇J^t (a − a_k) + (1/2)(a − a_k)^t H (a − a_k), where H = [∂²J / (∂a_i ∂a_j)]_{ij} is the Hessian matrix, one can find the optimal learning rate as η_k = ||∇J||² / (∇J^t H ∇J).
Note: if J is quadratic, H is constant and the second-order expansion is exact, so this η_k is the exact line-minimizing step.
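A minimal sketch of the gradient-descent iteration with the step size η_k = ||∇J||² / (∇J^t H ∇J), on an invented quadratic criterion:

import numpy as np

# invented quadratic criterion J(a) = 1/2 a^t H a - c^t a, so grad J = H a - c
H = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian (symmetric positive definite)
c = np.array([1.0, -1.0])

def grad_J(a):
    return H @ a - c

a = np.zeros(2)
theta = 1e-8
for k in range(1000):
    g = grad_J(a)
    if np.linalg.norm(g) < theta:          # (near-)stationary point reached
        break
    eta = (g @ g) / (g @ H @ g)            # optimal step for a quadratic criterion
    a = a - eta * g                        # a_{k+1} = a_k - eta_k * grad J(a_k)

print(a, np.linalg.solve(H, c))            # the iterate approaches the exact minimizer H^{-1} c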
Newton's method
- a_{k+1} = a_k − H^{-1} ∇J
- works well for quadratic objective functions
- problems arise if the Hessian is singular
- no need to invert H: solve the system H s = −∇J and update the solution a_{k+1} = a_k + s

The perceptron
- criterion: find a* (or, equivalently, w* and w_0*) that minimizes J(a) = −Σ_{i∈I} γ_i = −Σ_{i∈I} ⟨a, z_i⟩, where I is the set of indices of misclassified points
- note: since γ_i < 0 for all misclassified points, J(a) ≥ 0, reaching 0 when all points are correctly classified
- it is easy to see that ∇_a J(a) = −Σ_{i∈I} z_i

- using gradient descent we get updating iterations of the form a_{k+1} = a_k + η_k z_i
- the perceptron is guaranteed to converge in a finite number of iterations if the training set is separable (Novikoff's theorem)
- from Novikoff's theorem, the number of mistakes the perceptron makes is upper bounded by (2R/γ)², where γ is the margin of the set and R is the radius of the sphere containing the data points, i.e. R = max_i ||x_i||

Perceptron algorithm (batch perceptron)
Input: a separable training set X × Y and a stopping criterion θ
Output: a_k such that γ_i > 0, ∀i, with k the number of mistakes
1: a_0 ← 0, k ← 0, η_0 ← some initial value
2: repeat
3:   for i = 1 to n do
4:     if γ_i = ⟨a_k, z_i⟩ < 0 then
5:       a_{k+1} ← a_k + η_k z_i
6:       k ← k + 1
7:     end if
8:   end for
9: until ||η_k Σ_{i∈I_k} z_i|| < θ

What about η_k? There are different "schedules" for modifying it.
- conditions: η_k ≥ 0, lim_{m→∞} Σ_{k=1}^m η_k = ∞ and lim_{m→∞} (Σ_{k=1}^m η_k²) / (Σ_{k=1}^m η_k)² = 0
- η_k = constant > 0
- η_k ∝ 1/k
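A minimal sketch of the perceptron update described above (fixed increment, η_k = 1); the toy data and the helper name perceptron are only for this example:

import numpy as np

def perceptron(X, y, max_epochs=100, eta=1.0):
    """Fixed-increment perceptron on 'normalized' vectors z_i = y_i * [1, x_i]."""
    Z = y[:, None] * np.hstack([np.ones((X.shape[0], 1)), X])   # z_i = y_i [1, x_i]
    a = np.zeros(Z.shape[1])                                    # a = [w0, w]
    for _ in range(max_epochs):
        mistakes = 0
        for z in Z:
            if a @ z <= 0:          # misclassified (margin not positive; <= so a=0 gets updated)
                a = a + eta * z     # a_{k+1} = a_k + eta_k * z_i
                mistakes += 1
        if mistakes == 0:           # all margins positive: done
            break
    return a

# linearly separable toy data
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
a = perceptron(X, y)
w0, w = a[0], a[1:]
print(np.sign(X @ w + w0))          # reproduces y on separable data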
- let a be the solution of the perceptron algorithm; it is easy to see that a = Σ_{i=1}^n α_i z_i, where α_i = 0 if point i was always correctly classified, and α_i > 0 (proportional to the number of times point i was misclassified) otherwise
- α_i can be seen as the importance (or contribution) of z_i to the classification rule
- the discriminant function can be rewritten as h(x) = ⟨a, z⟩ = ⟨Σ_{i=1}^n α_i z_i, z⟩ = Σ_{i=1}^n α_i ⟨z_i, z⟩
- this is the dual form of the perceptron algorithm

Dual formulation of the perceptron algorithm
Input: a training set X × Y
Output: α = [α_1, ..., α_n]
1: α ← 0
2: repeat
3:   for i = 1 to n do
4:     if γ_i = Σ_{j=1}^n α_j ⟨z_j, z_i⟩ ≤ 0 then
5:       α_i ← α_i + 1
6:     end if
7:   end for
8: until no mistakes

Dual representation - remarks
- in the dual representation, the only way the data enters the algorithm/formula is through the dot products ⟨z_i, z_j⟩
- this property holds for a large class of methods
- the dot products of the data can be computed offline and stored in a Gram matrix G = [⟨z_i, z_j⟩]_{ij}
- similarly, to predict the class of a new point x, only (some of) the products ⟨z, z_i⟩ are needed

Relaxation procedures
Another objective function: J_r(a) = (1/2) Σ_{i∈I} (⟨a, z_i⟩ − ξ)² / ||z_i||²
- it is smooth and has a continuous gradient
- the term ξ is introduced to avoid solutions on the boundary of the solution region
- ||z_i||² is a normalization term that prevents J_r from being dominated by the largest vectors
- the 1/2 is merely there to make the gradient nicer: ∇J_r = Σ_{i∈I} ((⟨a, z_i⟩ − ξ) / ||z_i||²) z_i

Algorithms:
- batch relaxation with margin, update step: a_{k+1} = a_k + η_k Σ_{i∈I_k} ((ξ − ⟨a_k, z_i⟩) / ||z_i||²) z_i
- single-sample relaxation with margin, update step (for each misclassified sample z_i): a_{k+1} = a_k + η_k ((ξ − ⟨a_k, z_i⟩) / ||z_i||²) z_i
- if η_k < 1: underrelaxation; if η_k > 1: overrelaxation
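A minimal sketch of the single-sample relaxation-with-margin update; the toy data and the choices ξ = 1, η = 1.5 are only illustrative:

import numpy as np

def relaxation_single_sample(Z, xi=1.0, eta=1.5, max_epochs=200):
    """Single-sample relaxation with margin: push every <a, z_i> above xi."""
    a = np.zeros(Z.shape[1])
    for _ in range(max_epochs):
        updated = False
        for z in Z:
            if a @ z < xi:                                   # margin below xi: correct it
                a = a + eta * (xi - a @ z) / (z @ z) * z     # relaxation update
                updated = True
        if not updated:                                      # all margins reach xi
            break
    return a

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
Z = y[:, None] * np.hstack([np.ones((len(X), 1)), X])        # z_i = y_i [1, x_i]
a = relaxation_single_sample(Z)
print(Z @ a)     # for separable data, every margin <a, z_i> should now be at least xi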
Fisher discriminant analysis

Fisher criterion
Objective: find the hyperplane (w, w_0) onto which the projected data are maximally separated.

- the length of the projection of a vector z onto w is ⟨w, z⟩ / ||w||
- the projection of the difference vector between the means of the two classes (taking ||w|| = 1) is |⟨w, µ_{+1} − µ_{−1}⟩|
- maximize this difference relative to the projected pooled variance (scatter): (1/(n_{+1} + n_{−1})) (s²_{+1} + s²_{−1}), where s²_· = Σ_i (⟨w, x_i⟩ − ⟨w, µ_·⟩)² and the sum is over the elements of the corresponding class

Objective: maximize J(w) = |⟨w, µ_{+1}⟩ − ⟨w, µ_{−1}⟩|² / (s²_{+1} + s²_{−1})

Fisher criterion
w* = arg max_w J(w) = arg max_w (w^t S_b w) / (w^t S_w w), where
- S_b = (µ_{+1} − µ_{−1})(µ_{+1} − µ_{−1})^t is the between-class scatter matrix
- S_w = Σ_{i∈I_{+1}} (x_i − µ_{+1})(x_i − µ_{+1})^t + Σ_{i∈I_{−1}} (x_i − µ_{−1})(x_i − µ_{−1})^t is the within-class scatter matrix
- S_w is proportional to the sample covariance matrix of the pooled data

- J(w) is also known as a Rayleigh quotient
- the solution has the form w* ∝ S_w^{-1} (µ_{+1} − µ_{−1}) and it defines the direction of Fisher's linear discriminant
- the classification of d-dimensional points is thus transformed into a classification of one-dimensional points

- no assumption on the underlying distributions was made in finding w*
- the complete form of the linear discriminant is ⟨w, x⟩ + w_0 = 0; to find w_0 one can, for example:
  - assume p(x | ±1) to be Gaussian: this leads to the previously seen formulas for w_0 (see Ch. 2)
  - try to find a value optimal for the training set
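A minimal numpy sketch of Fisher's direction w* ∝ S_w^{-1}(µ_{+1} − µ_{−1}); the Gaussian data are simulated only to exercise the formula, and the threshold used below (midpoint of the projected class means) is just one simple choice among those mentioned above:

import numpy as np

rng = np.random.default_rng(0)
# two simulated Gaussian classes with a shared covariance (illustration only)
X_pos = rng.multivariate_normal([2.0, 1.0], [[1.0, 0.3], [0.3, 1.0]], size=100)
X_neg = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=100)

mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)

# within-class scatter: sum of the two classes' scatter matrices
Sw = (X_pos - mu_pos).T @ (X_pos - mu_pos) + (X_neg - mu_neg).T @ (X_neg - mu_neg)

# Fisher's direction (solve a linear system instead of explicitly inverting Sw)
w = np.linalg.solve(Sw, mu_pos - mu_neg)

# simple threshold: midpoint of the projected class means
w0 = -0.5 * w @ (mu_pos + mu_neg)
y_hat = np.sign(np.vstack([X_pos, X_neg]) @ w + w0)
print((y_hat[:100] == 1).mean(), (y_hat[100:] == -1).mean())   # per-class training accuracy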
Linear regression

Minimum squared-error procedures

Linear regression problem: find a = [w_0, w]^t such that b_i = ⟨a, z_i⟩, i = 1, 2, ..., n, for some fixed positive constants b_i.

In matrix notation, solve the linear system Za = b for a.
- Z is an n × (d + 1) matrix (design matrix), a is a (d + 1)-element vector
- b is an n-element vector (response vector)
- usually n > d + 1, so the system is overdetermined → no exact solution

- define the error vector e = Za − b
- minimum squared-error criterion: minimize J_s(a) = ||e||² = Σ_{i=1}^n (⟨a, z_i⟩ − b_i)²
- at the minimum, the gradient ∇J_s = 2 Z^t (Za − b) is zero ⇒ a = (Z^t Z)^{-1} Z^t b = Z† b, where Z† is the pseudoinverse of Z
- the solution depends on b, and different choices lead to different properties of the solution

Relation to Fisher's linear discriminant
- by properly choosing the class coding, one can show that the MSE approach is equivalent to FDA
- take b_i = n / n_{+1} for the class "+1" (with n_{+1} elements) and b_j = n / n_{−1} for the class "−1" (with n_{−1} elements)
- the MSE criterion for a = [w_0, w] then leads to w ∝ n S_w^{-1} (µ_{+1} − µ_{−1}), which is the direction of FDA
- additionally, it gives a value for the threshold: w_0 = −µ^t w, where µ is the grand mean vector
- the decision rule becomes: if w^t (x − µ) > 0, classify x as belonging to the first class

Relation with the Bayesian classifier
- let the Bayesian discriminant be h_0(x) = P(g_1 | x) − P(g_2 | x)
- the samples are assumed to be drawn independently and identically distributed from the underlying distribution p(x) = p(x | g_1) P(g_1) + p(x | g_2) P(g_2)
- the MSE becomes ε² = ∫ (⟨a, z⟩ − h_0(x))² p(x) dx
- → the solution to the MSE problem, a, generates an approximation of the Bayesian discriminant (but p(x) is unknown)
- the main problem of MSE: it places more emphasis on points with high p(x) than on points near the decision surface → the "best" approximation of the Bayes decision does not necessarily minimize the probability of error

Numerical considerations on the LS problem
Using the pseudoinverse is not the best technique from a numerical-stability perspective:
- computing Z^t Z and Z^t b may lead to information loss due to approximations in floating-point computations
- the conditioning of the system is worsened: cond(Z^t Z) = [cond(Z)]²
Normally, a matrix factorization is used for improved numerical stability: QR, SVD, ...
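A minimal sketch of the MSE solution a = Z†b on invented data; np.linalg.lstsq is used rather than forming (Z^t Z)^{-1} explicitly, anticipating the numerical remarks above, and b = 1 is just one simple choice of targets:

import numpy as np

rng = np.random.default_rng(1)
# two invented classes in R^2
X = np.vstack([rng.normal([2.0, 2.0], 1.0, size=(50, 2)),
               rng.normal([-2.0, -2.0], 1.0, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

Z = y[:, None] * np.hstack([np.ones((100, 1)), X])   # z_i = y_i [1, x_i]
b = np.ones(100)                                     # target margins b_i = 1

# least-squares solution a = Z^dagger b (computed stably, not via (Z^t Z)^{-1} Z^t b)
a, *_ = np.linalg.lstsq(Z, b, rcond=None)
w0, w = a[0], a[1:]
print(np.mean(np.sign(X @ w + w0) == y))             # training accuracy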
QR factorization
The n × m matrix Z (with n > m) can be factorized as Z = QR, where
- Q has orthonormal columns: Q^t Q = I (and, for a square Q, Q^{-1} = Q^t)
- R is an upper triangular matrix
With this, the solution a to our problem is the solution of the triangular system (solved by back-substitution): Ra = Q^t b.

A statistical perspective
- a linear model (linear regression) problem: E[b] = Za, under the assumption Cov(b) = σ² I
- it can be shown that the best linear unbiased estimator is â = (Z^t Z)^{-1} Z^t b = R^{-1} Q^t b for a decomposition Z = QR; then b̂ = Q Q^t b
- (Gauss-Markov theorem: the LS estimator has the lowest variance among all unbiased linear estimators)
- also, Var(â) = (Z^t Z)^{-1} σ² = (R^t R)^{-1} σ², where σ² is estimated by ||b − b̂||² / (n − d − 1)

The Widrow-Hoff procedure

- the MSE criterion J_s(a) = Σ_{i=1}^n (⟨a, z_i⟩ − b_i)² can also be minimized by a gradient descent method
- since ∇J_s = 2 Z^t (Za − b), the update rule becomes: a_1 = some initial value, a_{k+1} = a_k + η_k Z^t (b − Z a_k)
- if η_k = η_1/k, the procedure converges to a limiting value of a satisfying Z^t (Za − b) = 0
- this algorithm yields a solution even if Z^t Z is singular or badly conditioned
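A small sketch of the QR route Ra = Q^t b on an invented overdetermined system, checked against the pseudoinverse solution:

import numpy as np

rng = np.random.default_rng(2)
Z = rng.standard_normal((100, 3))        # an overdetermined design matrix (n > d + 1)
b = rng.standard_normal(100)             # some response vector

# thin QR factorization: Z = Q R with Q^t Q = I and R upper triangular
Q, R = np.linalg.qr(Z)

# solve the triangular system R a = Q^t b (a dedicated back-substitution would
# exploit the triangular structure; np.linalg.solve is used here for brevity)
a_qr = np.linalg.solve(R, Q.T @ b)

# same solution via the pseudoinverse route
a_pinv = np.linalg.pinv(Z) @ b
print(np.allclose(a_qr, a_pinv))         # True: both give the least-squares solution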
The Widrow-Hoff (or LMS) algorithm implements a sequential (sample-by-sample) gradient descent. (In signal processing it is known as the least-mean-squares adaptive filter.)

Input: a training set (X, y)
Output: a, an approximate MSE solution
1: initialize a, b, η_1, θ and k ← 0
2: repeat
3:   k ← (k + 1) mod n
4:   a ← a + η_k (b_k − ⟨a, z_k⟩) z_k
5:   η_k ← η_1/k
6: until ||η_k (b_k − ⟨a, z_k⟩) z_k|| < θ

[Figure (DHS, Fig. 5.17): comparison of a separating hyperplane and the LMS solution in the (x_1, x_2) plane.]

Ho-Kashyap procedures
- consider b = Za to be the margins (instead of fixed labels)
- idea: adjust both the coefficients a and the margins b such that b > 0 (each margin should be positive)
- formally: find a and b > 0 such that J_s(a, b) = ||Za − b||² becomes 0
- use a modified gradient descent, with the gradient taken w.r.t. both a and b
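The slides above only state the idea; the sketch below uses the standard Ho-Kashyap update (alternating a ← Z†b with a positivity-preserving correction of b, as in Duda, Hart & Stork), on invented separable toy data:

import numpy as np

def ho_kashyap(Z, eta=0.5, b_min=1e-3, max_iter=1000):
    """Standard Ho-Kashyap: adjust a and the margins b so that Za ~ b > 0."""
    n = Z.shape[0]
    b = np.ones(n)                       # start with positive margins
    Zpinv = np.linalg.pinv(Z)
    a = Zpinv @ b
    for _ in range(max_iter):
        e = Z @ a - b                    # error vector e = Za - b
        e_plus = 0.5 * (e + np.abs(e))   # keep only the positive part of e
        b = b + 2 * eta * e_plus         # increase margins only where Za exceeds b
        a = Zpinv @ b                    # MSE solution for the current margins
        if np.all(np.abs(e) <= b_min):   # e ~ 0: Za = b > 0, a separating vector found
            break
    return a, b

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
Z = y[:, None] * np.hstack([np.ones((len(X), 1)), X])
a, b = ho_kashyap(Z)
print(Z @ a)                             # all margins positive for separable data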