PA196: Pattern Recognition
3. Linear discriminants

Vlad Popovici
popovici@iba.muni.cz
Institute of Biostatistics and Analyses, Masaryk University, Brno

Outline
1 Introduction: General problem; Margins; Generalizations
2 Linearly separable binary problems: General approach; The perceptron
3 Fisher discriminant analysis
4 Linear regression: Minimum squared-error procedures; The Widrow-Hoff procedure; Ho-Kashyap procedures

Introduction

Reminder - scalar product
- the scalar (dot, inner) product of two vectors x, w ∈ R^d: ⟨w, x⟩ = w^t x = Σ_{i=1}^d w_i x_i ∈ R
- cos θ = ⟨w, x⟩ / (||w|| ||x||)
- ⟨w, x⟩ = 0 ⟺ w ⊥ x
- the projection of x on w is Proj_w x = (⟨x, w⟩ / ||w||) (w / ||w||) = (⟨x, w⟩ / ||w||²) w
[Figure: the vectors x and w, the angle θ between them, and the projection Proj_w x.]

General problem
- we consider the binary classification problem (K = 2)
- without loss of generality, we let the labels of the classes be ±1
- we are given a set X × Y = {(x_i, y_i) | i = 1, ..., n} ⊂ R^d × {−1, +1}
- the goal is to find the parameters of the classifier such that the number of misclassified points is minimized
- let the discriminant function have the form h(x) = w^t x + w_0 = ⟨w, x⟩ + w_0 = w_0 + Σ_{i=1}^d w_i x_i
- note that x can be replaced with φ(x)! (we'll discuss this later)
- the classifier is sign(h(x)) = sign(⟨w, x⟩ + w_0)

- an error occurs if sign(⟨w, x_i⟩ + w_0) ≠ y_i; in other words, if y_i(⟨w, x_i⟩ + w_0) < 0 ⟺ y_i h(x_i) < 0
- the risk of misclassification (error) is R(h) = Pr[Y ≠ sign(h(X))], where (X, Y) is a random pair of observations
- the empirical risk is the estimate of the risk on a given set of points: R̂_n(h) = (1/n) Σ_{i=1}^n 1{y_i ≠ sign(h(x_i))} = (1/n) Σ_{i=1}^n 1{y_i h(x_i) < 0}
- you need n ≥ d + 1 points for learning the classifier

[Figure: the linear decision boundary H, where h(x) = w^t x + w_0 = 0, separates the feature space into two half-spaces R_1 (where h(x) > 0) and R_2 (where h(x) < 0); w is normal to H, a point x decomposes as x = x_p + r w/||w||, and H lies at distance w_0/||w|| from the origin. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.]
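As a concrete illustration of h(x), the induced classifier and the empirical risk, here is a minimal numpy sketch; the data, w and w_0 are invented for the example:

import numpy as np

# toy data: n points in R^d with labels in {-1, +1} (made-up values)
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

w = np.array([1.0, 0.5])     # some candidate weight vector
w0 = -0.2                    # some candidate offset

h = X @ w + w0               # h(x_i) = <w, x_i> + w0 for every point
y_hat = np.sign(h)           # the linear classifier sign(h(x))

# an error occurs when y_i * h(x_i) < 0; the empirical risk is the error fraction
emp_risk = np.mean(y * h < 0)
print(y_hat, emp_risk)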
Margins

Functional margin: the functional margin of a point x_i with respect to a hyperplane (w, w_0) is defined to be γ_i = y_i(⟨w, x_i⟩ + w_0) = y_i h(x_i).

Geometric margin: the geometric margin of a point x_i with respect to a hyperplane (w, w_0) is defined to be γ_i = y_i(⟨w/||w||, x_i⟩ + w_0/||w||) = y_i h(x_i) / ||w||.

→ The geometric margin is the normalized functional margin.

Margin of a point
[Figure: the margin of a point with respect to a hyperplane.]

Margin of a set (of points)
The margin of a set of points is the largest margin attainable over all (hyper)planes. The corresponding hyperplane is called the maximum margin hyperplane.

Generalizations

Generalization to multi-class problems
- a multi-class problem can be decomposed into a series of two-class problems: 1-vs-all or 1-vs-1
- or, one can use K (the number of classes) discriminant functions h_i(x) and build classifiers of the form: assign x to class i if h_i(x) > h_j(x) for all j ≠ i
- this defines K(K − 1)/2 hyperplanes H_ij: h_i(x) − h_j(x) = 0
- in practice, usually fewer hyperplanes form the decision surface
[Figure: three discriminants g_1, g_2, g_3, the hyperplanes H_12, H_13, H_23 and the decision regions R_1, R_2, R_3.]

Generalized linear discriminants
Consider a function ψ: R^d → R^d̂. The discriminant function g(x) = ⟨a, ψ(x)⟩ = Σ_{i=1}^{d̂} a_i ψ_i(x) is a linear function in a (but not in x).
Example: let x ∈ R and let ψ(x) = [1, x, x²]^t ∈ R³.
[Figure: the mapping ψ(x) applied to a one-dimensional data set.]

Remarks:
- a problem which is not linearly separable in R^d may become linearly separable in R^d̂
- ψ = ? (how to choose the mapping is not obvious)
- finding the coefficients in R^d̂ requires many more training points!
- the decision surface, when projected back into R^d (by ψ^{-1}), is non-linear
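A minimal sketch of the 1-D example above, assuming the quadratic map ψ(x) = [1, x, x²]^t from the slide; the data values and the coefficient vector a are invented for illustration:

import numpy as np

# 1-D problem that is NOT linearly separable on the real line:
# class +1 lies inside the interval (-1, 1), class -1 outside it
x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
y = np.array([-1, -1, +1, +1, +1, -1, -1])

def psi(x):
    # the quadratic feature map from the slide: R -> R^3
    return np.stack([np.ones_like(x), x, x**2], axis=1)

# in the psi-space the classes ARE linearly separable, e.g. by a = [1, 0, -1]:
# g(x) = <a, psi(x)> = 1 - x^2 > 0  iff  |x| < 1 (a non-linear boundary in R)
a = np.array([1.0, 0.0, -1.0])
g = psi(x) @ a
print(np.all(np.sign(g) == y))   # True: the quadratic boundary separates the classes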
A convenient (but trivial) transformation: "normalization" of the notation
- map each labelled point to z = y [1, x]^t; this allows us to write γ = y h(x) = y(⟨w, x⟩ + w_0) = ⟨a, z⟩, where a = [w_0, w]^t
- the problem becomes: find a such that ⟨a, z_i⟩ > 0 for all i, i.e. all the margins are positive
- the decision surface Ĥ in R^{d+1}, defined by ⟨a, z⟩ = 0, corresponds to a hyperplane passing through the origin of the z-space

Linearly separable binary problems

- consider we are given the set {(x_i, y_i)} with y_i = ±1
- with the previous "normalized" notation, the set is linearly separable if ⟨a, z_i⟩ > 0, ∀i = 1, ..., n
- the solution a is constrained by each point z_i
- under these conditions alone, the solution is not unique!
- solutions on the boundary of the solution region may be too sensitive → one can use the condition ⟨a, z_i⟩ ≥ ξ > 0 instead

General approach
- let J(a) be a criterion function that measures the "suitability" of a candidate solution a
- by convention, the solution to the classification problem is obtained as a* = arg min_a J(a)
- usually, J is chosen to be continuous (at least in a neighborhood of the solution) and differentiable

Gradient descent
- a_{k+1} = a_k − η_k ∇J(a_k)
- the negative gradient −∇J(a) is locally the direction of steepest descent towards a (local) minimum
- η_k is a line search parameter or learning rate
- start with some a_0 and iterate until ||η_k ∇J(a_k)|| < θ

Using Taylor's 2nd-order approximation, J(a) ≈ J(a_k) + ∇J^t (a − a_k) + (1/2)(a − a_k)^t H (a − a_k), where H = [∂²J / (∂a_i ∂a_j)]_{ij} is the Hessian matrix, one can find the optimal learning rate as η_k = ||∇J||² / (∇J^t H ∇J).
Note: if J is quadratic, H is constant and the second-order expansion is exact, so this η_k is the exact line-minimizing step.
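A minimal sketch of the gradient-descent iteration with the step size η_k = ||∇J||² / (∇J^t H ∇J), on an invented quadratic criterion:

import numpy as np

# invented quadratic criterion J(a) = 1/2 a^t H a - c^t a, so grad J = H a - c
H = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian (symmetric positive definite)
c = np.array([1.0, -1.0])

def grad_J(a):
    return H @ a - c

a = np.zeros(2)
theta = 1e-8
for k in range(1000):
    g = grad_J(a)
    if np.linalg.norm(g) < theta:          # (near-)stationary point reached
        break
    eta = (g @ g) / (g @ H @ g)            # optimal step for a quadratic criterion
    a = a - eta * g                        # a_{k+1} = a_k - eta_k * grad J(a_k)

print(a, np.linalg.solve(H, c))            # the iterate approaches the exact minimizer H^{-1} c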
Newton's method
- a_{k+1} = a_k − H^{-1} ∇J
- works well for quadratic objective functions
- problems arise if the Hessian is singular
- no need to invert H: solve the system H s = −∇J and update the solution a_{k+1} = a_k + s

The perceptron
- criterion: find a* (or, equivalently, w* and w_0*) that minimizes J(a) = −Σ_{i∈I} γ_i = −Σ_{i∈I} ⟨a, z_i⟩, where I is the set of indices of misclassified points
- note: since γ_i < 0 for all misclassified points, J(a) ≥ 0, reaching 0 when all points are correctly classified
- it is easy to see that ∇_a J(a) = −Σ_{i∈I} z_i

- using gradient descent we get updating iterations of the form a_{k+1} = a_k + η_k z_i
- the perceptron is guaranteed to converge in a finite number of iterations if the training set is separable (Novikoff's theorem)
- from Novikoff's theorem, the number of mistakes the perceptron makes is upper bounded by (2R/γ)², where γ is the margin of the set and R is the radius of the sphere containing the data points, i.e. R = max_i ||x_i||

Perceptron algorithm (batch perceptron)
Input: a separable training set X × Y and a stopping criterion θ
Output: a_k such that γ_i > 0, ∀i, with k the number of mistakes
1: a_0 ← 0, k ← 0, η_0 ← some initial value
2: repeat
3:   for i = 1 to n do
4:     if γ_i = ⟨a_k, z_i⟩ < 0 then
5:       a_{k+1} ← a_k + η_k z_i
6:       k ← k + 1
7:     end if
8:   end for
9: until ||η_k Σ_{i∈I_k} z_i|| < θ

What about η_k? There are different "schedules" for modifying it.
- conditions: η_k ≥ 0, lim_{m→∞} Σ_{k=1}^m η_k = ∞ and lim_{m→∞} (Σ_{k=1}^m η_k²) / (Σ_{k=1}^m η_k)² = 0
- η_k = constant > 0
- η_k ∝ 1/k
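A minimal sketch of the perceptron update described above (fixed increment, η_k = 1); the toy data and the helper name perceptron are only for this example:

import numpy as np

def perceptron(X, y, max_epochs=100, eta=1.0):
    """Fixed-increment perceptron on 'normalized' vectors z_i = y_i * [1, x_i]."""
    Z = y[:, None] * np.hstack([np.ones((X.shape[0], 1)), X])   # z_i = y_i [1, x_i]
    a = np.zeros(Z.shape[1])                                    # a = [w0, w]
    for _ in range(max_epochs):
        mistakes = 0
        for z in Z:
            if a @ z <= 0:          # misclassified (margin not positive; <= so a=0 gets updated)
                a = a + eta * z     # a_{k+1} = a_k + eta_k * z_i
                mistakes += 1
        if mistakes == 0:           # all margins positive: done
            break
    return a

# linearly separable toy data
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
a = perceptron(X, y)
w0, w = a[0], a[1:]
print(np.sign(X @ w + w0))          # reproduces y on separable data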
- let a be the solution of the perceptron algorithm; it is easy to see that a = Σ_{i=1}^n α_i z_i, where α_i = 0 if point i was always correctly classified, and α_i > 0 (proportional to the number of times point i was misclassified) otherwise
- α_i can be seen as the importance (or contribution) of z_i to the classification rule
- the discriminant function can be rewritten as h(x) = ⟨a, z⟩ = ⟨Σ_{i=1}^n α_i z_i, z⟩ = Σ_{i=1}^n α_i ⟨z_i, z⟩
- this is the dual form of the perceptron algorithm

Dual formulation of the perceptron algorithm
Input: a training set X × Y
Output: α = [α_1, ..., α_n]
1: α ← 0
2: repeat
3:   for i = 1 to n do
4:     if γ_i = Σ_{j=1}^n α_j ⟨z_j, z_i⟩ ≤ 0 then
5:       α_i ← α_i + 1
6:     end if
7:   end for
8: until no mistakes

Dual representation - remarks
- in the dual representation, the only way the data enters the algorithm/formula is through the dot products ⟨z_i, z_j⟩
- this property holds for a large class of methods
- the dot products of the data can be computed offline and stored in a Gram matrix G = [⟨z_i, z_j⟩]_{ij}
- similarly, to predict the class of a new point x, only (some of) the products ⟨z, z_i⟩ are needed

Relaxation procedures
Another objective function: J_r(a) = (1/2) Σ_{i∈I} (⟨a, z_i⟩ − ξ)² / ||z_i||²
- it is smooth and has a continuous gradient
- the term ξ is introduced to avoid solutions on the boundary of the solution region
- ||z_i||² is a normalization term that prevents J_r from being dominated by the largest vectors
- the 1/2 is merely there to make the gradient nicer: ∇J_r = Σ_{i∈I} ((⟨a, z_i⟩ − ξ) / ||z_i||²) z_i

Algorithms:
- batch relaxation with margin, update step: a_{k+1} = a_k + η_k Σ_{i∈I_k} ((ξ − ⟨a_k, z_i⟩) / ||z_i||²) z_i
- single-sample relaxation with margin, update step (for each misclassified sample z_i): a_{k+1} = a_k + η_k ((ξ − ⟨a_k, z_i⟩) / ||z_i||²) z_i
- if η_k < 1: underrelaxation; if η_k > 1: overrelaxation
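A minimal sketch of the single-sample relaxation-with-margin update; the toy data and the choices ξ = 1, η = 1.5 are only illustrative:

import numpy as np

def relaxation_single_sample(Z, xi=1.0, eta=1.5, max_epochs=200):
    """Single-sample relaxation with margin: push every <a, z_i> above xi."""
    a = np.zeros(Z.shape[1])
    for _ in range(max_epochs):
        updated = False
        for z in Z:
            if a @ z < xi:                                   # margin below xi: correct it
                a = a + eta * (xi - a @ z) / (z @ z) * z     # relaxation update
                updated = True
        if not updated:                                      # all margins reach xi
            break
    return a

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
Z = y[:, None] * np.hstack([np.ones((len(X), 1)), X])        # z_i = y_i [1, x_i]
a = relaxation_single_sample(Z)
print(Z @ a)     # for separable data, every margin <a, z_i> should now be at least xi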
Fisher discriminant analysis

Fisher criterion
Objective: find the hyperplane (w, w_0) onto which the projected data are maximally separated.

- the length of the projection of a vector z onto w is ⟨w, z⟩ / ||w||
- the projection of the difference vector between the means of the two classes (taking ||w|| = 1) is |⟨w, µ_{+1} − µ_{−1}⟩|
- maximize this difference relative to the projected pooled variance (scatter): (1/(n_{+1} + n_{−1})) (s²_{+1} + s²_{−1}), where s²_· = Σ_i (⟨w, x_i⟩ − ⟨w, µ_·⟩)² and the sum is over the elements of the corresponding class

Objective: maximize J(w) = |⟨w, µ_{+1}⟩ − ⟨w, µ_{−1}⟩|² / (s²_{+1} + s²_{−1})

Fisher criterion
w* = arg max_w J(w) = arg max_w (w^t S_b w) / (w^t S_w w), where
- S_b = (µ_{+1} − µ_{−1})(µ_{+1} − µ_{−1})^t is the between-class scatter matrix
- S_w = Σ_{i∈I_{+1}} (x_i − µ_{+1})(x_i − µ_{+1})^t + Σ_{i∈I_{−1}} (x_i − µ_{−1})(x_i − µ_{−1})^t is the within-class scatter matrix
- S_w is proportional to the sample covariance matrix of the pooled data

- J(w) is also known as a Rayleigh quotient
- the solution has the form w* ∝ S_w^{-1} (µ_{+1} − µ_{−1}) and it defines the direction of Fisher's linear discriminant
- the classification of d-dimensional points is thus transformed into a classification of one-dimensional points

- no assumption on the underlying distributions was made in finding w*
- the complete form of the linear discriminant is ⟨w, x⟩ + w_0 = 0; to find w_0 one can, for example:
  - assume p(x | ±1) to be Gaussian: this leads to the previously seen formulas for w_0 (see Ch. 2)
  - try to find a value optimal for the training set
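A minimal numpy sketch of Fisher's direction w* ∝ S_w^{-1}(µ_{+1} − µ_{−1}); the Gaussian data are simulated only to exercise the formula, and the threshold used below (midpoint of the projected class means) is just one simple choice among those mentioned above:

import numpy as np

rng = np.random.default_rng(0)
# two simulated Gaussian classes with a shared covariance (illustration only)
X_pos = rng.multivariate_normal([2.0, 1.0], [[1.0, 0.3], [0.3, 1.0]], size=100)
X_neg = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=100)

mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)

# within-class scatter: sum of the two classes' scatter matrices
Sw = (X_pos - mu_pos).T @ (X_pos - mu_pos) + (X_neg - mu_neg).T @ (X_neg - mu_neg)

# Fisher's direction (solve a linear system instead of explicitly inverting Sw)
w = np.linalg.solve(Sw, mu_pos - mu_neg)

# simple threshold: midpoint of the projected class means
w0 = -0.5 * w @ (mu_pos + mu_neg)
y_hat = np.sign(np.vstack([X_pos, X_neg]) @ w + w0)
print((y_hat[:100] == 1).mean(), (y_hat[100:] == -1).mean())   # per-class training accuracy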
Linear regression

Minimum squared-error procedures

Linear regression problem: find a = [w_0, w]^t such that b_i = ⟨a, z_i⟩, i = 1, 2, ..., n, for some fixed positive constants b_i.

In matrix notation, solve the linear system Za = b for a.
- Z is an n × (d + 1) matrix (design matrix), a is a (d + 1)-element vector
- b is an n-element vector (response vector)
- usually n > d + 1, so the system is overdetermined → no exact solution

- define the error vector e = Za − b
- minimum squared-error criterion: minimize J_s(a) = ||e||² = Σ_{i=1}^n (⟨a, z_i⟩ − b_i)²
- at the minimum, the gradient ∇J_s = 2 Z^t (Za − b) is zero ⇒ a = (Z^t Z)^{-1} Z^t b = Z† b, where Z† is the pseudoinverse of Z
- the solution depends on b, and different choices lead to different properties of the solution

Relation to Fisher's linear discriminant
- by properly choosing the class coding, one can show that the MSE approach is equivalent to FDA
- take b_i = n / n_{+1} for the class "+1" (with n_{+1} elements) and b_j = n / n_{−1} for the class "−1" (with n_{−1} elements)
- the MSE criterion for a = [w_0, w] then leads to w ∝ n S_w^{-1} (µ_{+1} − µ_{−1}), which is the direction of FDA
- additionally, it gives a value for the threshold: w_0 = −µ^t w, where µ is the grand mean vector
- the decision rule becomes: if w^t (x − µ) > 0, classify x as belonging to the first class

Relation with the Bayesian classifier
- let the Bayesian discriminant be h_0(x) = P(g_1 | x) − P(g_2 | x)
- the samples are assumed to be drawn independently and identically distributed from the underlying distribution p(x) = p(x | g_1) P(g_1) + p(x | g_2) P(g_2)
- the MSE becomes ε² = ∫ (⟨a, z⟩ − h_0(x))² p(x) dx
- → the solution to the MSE problem, a, generates an approximation of the Bayesian discriminant (but p(x) is unknown)
- the main problem of MSE: it places more emphasis on points with high p(x) than on points near the decision surface → the "best" approximation of the Bayes decision does not necessarily minimize the probability of error

Numerical considerations on the LS problem
Using the pseudoinverse is not the best technique from a numerical-stability perspective:
- computing Z^t Z and Z^t b may lead to information loss due to approximations in floating-point computations
- the conditioning of the system is worsened: cond(Z^t Z) = [cond(Z)]²
Normally, a matrix factorization is used for improved numerical stability: QR, SVD, ...
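A minimal sketch of the MSE solution a = Z†b on invented data; np.linalg.lstsq is used rather than forming (Z^t Z)^{-1} explicitly, anticipating the numerical remarks above, and b = 1 is just one simple choice of targets:

import numpy as np

rng = np.random.default_rng(1)
# two invented classes in R^2
X = np.vstack([rng.normal([2.0, 2.0], 1.0, size=(50, 2)),
               rng.normal([-2.0, -2.0], 1.0, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

Z = y[:, None] * np.hstack([np.ones((100, 1)), X])   # z_i = y_i [1, x_i]
b = np.ones(100)                                     # target margins b_i = 1

# least-squares solution a = Z^dagger b (computed stably, not via (Z^t Z)^{-1} Z^t b)
a, *_ = np.linalg.lstsq(Z, b, rcond=None)
w0, w = a[0], a[1:]
print(np.mean(np.sign(X @ w + w0) == y))             # training accuracy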
QR factorization
The n × m matrix Z (with n > m) can be factorized as Z = QR, where
- Q has orthonormal columns: Q^t Q = I (and, for a square Q, Q^{-1} = Q^t)
- R is an upper triangular matrix
With this, the solution a to our problem is the solution of the triangular system (solved by back-substitution): Ra = Q^t b.

A statistical perspective
- a linear model (linear regression) problem: E[b] = Za, under the assumption Cov(b) = σ² I
- it can be shown that the best linear unbiased estimator is â = (Z^t Z)^{-1} Z^t b = R^{-1} Q^t b for a decomposition Z = QR; then b̂ = Q Q^t b
- (Gauss-Markov theorem: the LS estimator has the lowest variance among all unbiased linear estimators)
- also, Var(â) = (Z^t Z)^{-1} σ² = (R^t R)^{-1} σ², where σ² is estimated by ||b − b̂||² / (n − d − 1)

The Widrow-Hoff procedure

- the MSE criterion J_s(a) = Σ_{i=1}^n (⟨a, z_i⟩ − b_i)² can also be minimized by a gradient descent method
- since ∇J_s = 2 Z^t (Za − b), the update rule becomes: a_1 = some initial value, a_{k+1} = a_k + η_k Z^t (b − Z a_k)
- if η_k = η_1/k, the procedure converges to a limiting value of a satisfying Z^t (Za − b) = 0
- this algorithm yields a solution even if Z^t Z is singular or badly conditioned
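A small sketch of the QR route Ra = Q^t b on an invented overdetermined system, checked against the pseudoinverse solution:

import numpy as np

rng = np.random.default_rng(2)
Z = rng.standard_normal((100, 3))        # an overdetermined design matrix (n > d + 1)
b = rng.standard_normal(100)             # some response vector

# thin QR factorization: Z = Q R with Q^t Q = I and R upper triangular
Q, R = np.linalg.qr(Z)

# solve the triangular system R a = Q^t b (a dedicated back-substitution would
# exploit the triangular structure; np.linalg.solve is used here for brevity)
a_qr = np.linalg.solve(R, Q.T @ b)

# same solution via the pseudoinverse route
a_pinv = np.linalg.pinv(Z) @ b
print(np.allclose(a_qr, a_pinv))         # True: both give the least-squares solution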
The Widrow-Hoff (or LMS) algorithm implements a sequential (sample-by-sample) gradient descent. (In signal processing it is known as the least-mean-squares adaptive filter.)

Input: a training set (X, y)
Output: a, an approximate MSE solution
1: initialize a, b, η_1, θ and k ← 0
2: repeat
3:   k ← (k + 1) mod n
4:   a ← a + η_k (b_k − ⟨a, z_k⟩) z_k
5:   η_k ← η_1/k
6: until ||η_k (b_k − ⟨a, z_k⟩) z_k|| < θ

[Figure (DHS, Fig. 5.17): comparison of a separating hyperplane and the LMS solution in the (x_1, x_2) plane.]

Ho-Kashyap procedures
- consider b = Za to be the margins (instead of fixed labels)
- idea: adjust both the coefficients a and the margins b such that b > 0 (each margin should be positive)
- formally: find a and b > 0 such that J_s(a, b) = ||Za − b||² becomes 0
- use a modified gradient descent, with the gradient taken w.r.t. both a and b
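The slides above only state the idea; the sketch below uses the standard Ho-Kashyap update (alternating a ← Z†b with a positivity-preserving correction of b, as in Duda, Hart & Stork), on invented separable toy data:

import numpy as np

def ho_kashyap(Z, eta=0.5, b_min=1e-3, max_iter=1000):
    """Standard Ho-Kashyap: adjust a and the margins b so that Za ~ b > 0."""
    n = Z.shape[0]
    b = np.ones(n)                       # start with positive margins
    Zpinv = np.linalg.pinv(Z)
    a = Zpinv @ b
    for _ in range(max_iter):
        e = Z @ a - b                    # error vector e = Za - b
        e_plus = 0.5 * (e + np.abs(e))   # keep only the positive part of e
        b = b + 2 * eta * e_plus         # increase margins only where Za exceeds b
        a = Zpinv @ b                    # MSE solution for the current margins
        if np.all(np.abs(e) <= b_min):   # e ~ 0: Za = b > 0, a separating vector found
            break
    return a, b

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
Z = y[:, None] * np.hstack([np.ones((len(X), 1)), X])
a, b = ho_kashyap(Z)
print(Z @ a)                             # all margins positive for separable data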