PA196: Pattern Recognition
1. Introduction
2. Bayesian decision theory

Dr. Vlad Popovici
popovici@recetox.muni.cz
RECETOX, Masaryk University, Brno

Before starting
Organization:
• 13 weeks: 2h lecture + 2h exercise
• 1st half of the semester: business as usual
• 2nd half of the semester (in addition to "business as usual"): preparation of individual (small) projects
• evaluation: 50% written exam, 40% project, and 10% participation

Bibliography
• [DHS] Duda, Hart, Stork: Pattern Classification. 2nd Ed. 2001
• [EST] Hastie, Tibshirani, Friedman: The Elements of Statistical Learning. 2nd Ed. (12th printing). 2017. Available in PDF: https://web.stanford.edu/~hastie/ElemStatLearn/
• Webb, Copsey: Statistical Pattern Recognition. 3rd Ed. 2011
• Kuncheva: Combining Pattern Classifiers. 2004

Outline
1 Introduction
  WHAT? (is Pattern Recognition)
  WHY? (applications)
  HOW? (different approaches)
2 Bayesian Decision Theory
  Bayes rule
  Discriminant functions
  Normal density
  Errors

WHAT?
Webster dictionary:
• Pattern: a combination of qualities, acts, tendencies, etc., forming a consistent or characteristic arrangement
• Recognition: the identification of something as having been previously seen, heard, known, etc.
Other definitions:
• DHS: "the act of taking in raw data and taking an action based on the category of the pattern"
• Wikipedia: "Pattern recognition is nearly synonymous with machine learning. This branch of artificial intelligence focuses on the recognition of patterns and regularities in data. In many cases, these patterns are learned from labeled "training" data (supervised learning), but when no labeled data are available other algorithms can be used to discover previously unknown patterns (unsupervised learning)."

An example - from DHS (Figs. 1.1-1.4)
Sorting fish (salmon vs. sea bass) based on measured features:
• single characteristic/feature, length: the length histograms of the two species overlap considerably, so a threshold l* on the length separates the classes poorly
• single characteristic/feature, lightness: the lightness histograms overlap less, and a threshold x* gives a smaller error
• combining the lightness and width features: in the two-dimensional feature space a linear boundary separates salmon from sea bass better than either feature alone (a small code sketch of such a two-feature decision rule is given at the end of this section)
[Figures: histograms of length and of lightness with thresholds l* and x*, and the scatter plot of lightness vs. width for salmon and sea bass]

WHY?
Applications - in no particular order
Biometrics:
• face detection and recognition
• gender and age recognition
• fingerprint recognition
• speaker recognition
• ...
Human-computer interfaces:
• user detection/recognition
• gait recognition
• gesture recognition
• brainwave categorization
• ...
Other:
• biomedical research: prediction of response, target genes, etc.
• military/security: target detection, intrusion detection, etc.
• spam filtering
• optical character recognition
• natural language processing
• remote sensing
• etc.
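To make the fish example above concrete, here is a minimal sketch in Python of a hand-tuned linear decision rule on the (lightness, width) features. The measurements and the weights are hypothetical, chosen only for illustration (they are not the DHS data).

    import numpy as np

    # Hypothetical (lightness, width) measurements -- not the DHS data.
    salmon  = np.array([[2.5, 15.0], [3.0, 16.0], [3.5, 15.5]])
    seabass = np.array([[7.5, 18.0], [8.0, 19.0], [7.0, 20.0]])

    # A linear decision rule in the 2-D feature space:
    # classify as sea bass if w.x + b > 0, otherwise as salmon.
    w = np.array([1.0, 0.3])   # weights set by hand for this toy example
    b = -11.0                  # bias / threshold

    def classify(x):
        return "sea bass" if w @ x + b > 0 else "salmon"

    for fish in np.vstack([salmon, seabass]):
        print(fish, "->", classify(fish))

Later lectures show how such weights are learned from training data instead of being set by hand.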
HOW?
Pattern recognition systems
environment → acquisition (sensor) → feature extraction → classification → decision
Example:
• acquisition/sensor: CCD camera
• feature extraction: in all rectangular regions of size 30 × 30, compute the Gabor wavelet decomposition in corner regions
• classification: given all the Gabor wavelet coefficients, classify the region as human face or "something else"

Some basic terminology
• learning: model fitting, optimization, training
• main types of problems:
  • CLASSIFICATION: the classes are known and the problem is to assign a class to each input; usually approached by supervised learning, where samples and their corresponding labels are used to train the classifier
  • CLASS DISCOVERY: the classes are not known (maybe not even how many there are); usually approached by clustering, i.e., grouping similar inputs together
• other modes of learning:
  • semi-supervised learning: some labeled and some unlabeled data
  • reinforcement learning: there is a "teacher" telling the system when it is right or wrong

Approaches
• there is no clear-cut separation between method types
• statistical/Bayesian: features are random variates; estimate the PDFs and use the maximum a posteriori rule for classification; minimize the "risk" of misclassification
• geometric: find boundaries between regions of the feature space
• neural networks
• model-based: reference patterns represent the classes; a "nearest pattern" rule is used for classification
• syntactic: classes are represented by grammars
• structural: classes are represented by graphs (or similar structures)

Design cycle
• data collection → sample size estimation
• feature selection
• classifier design → selection of classifier(s), training, model selection
• performance estimation → errors and costs, error estimates, variability of the estimates

Other issues
• pre-processing and normalization:
  • improve the stability of the models and the convergence of the learning procedure
  • depend on the classifier and the application domain
• feature standardization
• detection of outliers
• detection of errors in data

Goals of classifier design
• to build a classification model from a finite set of examples that minimizes some error measure and generalizes well
• to estimate its future performance on unseen data

A bit of formalism
• a sample or a pattern is represented as a real-valued vector x ∈ R^d
• x = [x1, ..., xi, ..., xd]^t, where xi is called a variable or feature (strictly speaking, it is a realization - a measurement - of a given variable)
• generally, there is a label g ∈ G = {g1, ..., gm} uniquely associated with each sample
• there is a probability P(g) that each class is observed - the a priori probability (prior); naturally, Σ_i P(gi) = 1
• the probability of observing a sample x, given a class gi, is given by the class-conditional probability density p(x|gi)
• NOTE: P(·) is used for probability mass functions (discrete variables) and p(·) for probability density functions

Similarities, metrics, distances...
• most methods rely on an explicit or implicit distance between data points, d : R^d × R^d → R+, with
  • d(x, z) > 0, ∀ x ≠ z
  • d(x, z) = 0 ⟺ x = z
  • d(x, z) = d(z, x)
• a metric is a distance which also satisfies the triangle inequality: d(x, z) ≤ d(x, y) + d(y, z), for all x, y, z ∈ R^d
• a similarity measure is less formally defined; it usually takes large values for more alike objects
• examples: the Euclidean metric; the correlation coefficient as a similarity measure (see the sketch below)
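A minimal sketch in Python of the two examples just mentioned: the Euclidean metric and the correlation coefficient used as a similarity measure (the vectors are made up for illustration).

    import numpy as np

    def euclidean(x, z):
        """Euclidean metric between two feature vectors."""
        x, z = np.asarray(x, float), np.asarray(z, float)
        return np.sqrt(np.sum((x - z) ** 2))

    def correlation_similarity(x, z):
        """Pearson correlation as a similarity: close to 1 for 'alike' profiles."""
        x, z = np.asarray(x, float), np.asarray(z, float)
        return np.corrcoef(x, z)[0, 1]

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([1.5, 2.5, 3.5])
    z = np.array([3.0, 2.0, 1.0])

    # symmetry and the triangle inequality hold for the Euclidean metric
    assert np.isclose(euclidean(x, z), euclidean(z, x))
    assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)

    print(euclidean(x, y), correlation_similarity(x, y))   # small distance, correlation +1
    print(euclidean(x, z), correlation_similarity(x, z))   # larger distance, correlation -1

The two measures capture different notions of "alike": x and y have correlation 1 even though they are not identical, while x and z are anti-correlated; which notion is appropriate depends on the application.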
Bayesian Decision Theory

Bayes rule
Let G = {gi} be the set of classes and let X = {xi} be a set of observed data points (i.e., the training set).
• p(x|g), viewed as a function of the class label g, is called the likelihood (function). Note: if g is fixed and x is considered random, the same quantity is the class-conditional density, i.e., the model for generating the observations
• P(g) is the prior
• the goal is to find P(g|x), i.e., the posterior probability that the label of x is g
• Bayes rule:
  P(gi|x) = p(x|gi) P(gi) / p(x) = p(x|gi) P(gi) / Σ_j p(x|gj) P(gj)
In short: posterior ∝ likelihood × prior.
Tricky part: how to estimate the likelihood?!

Class-conditional density or likelihood function?
• ...it depends!
• consider a set of random variables X1, ..., Xp that are conditionally independent given Θ = θ
• p_{Xi|Θ}(·|θ) is the (postulated) density model for the variable Xi: for each possible value of Θ it describes the uncertainty about the values of Xi
• Pr{X1 ∈ A1, ..., Xp ∈ Ap} = ∫_{A1×···×Ap} Π_{i=1}^p p_{Xi|Θ}(xi|θ) dx1 ... dxp
• once [x1, ..., xp] are observed (hence they are fixed!), define the function
  L_{x1,...,xp} : Ω → R,  L_{x1,...,xp}(θ) = Π_{i=1}^p p_{Xi|Θ}(xi|θ)
  and call it the likelihood function

Bayesian decision
• consider K classes {g1, ..., gK}
• there is a set of a possible actions {α1, ..., αa}
• let α(x) be the decision rule/function
• for each action-class pair there is a loss incurred: λ(αk|gi)
• conditional risk: R(αk|x) = Σ_{i=1}^K λ(αk|gi) P(gi|x)
• the expected risk of the rule α(x): R = ∫ R(α(x)|x) p(x) dx

The Bayesian decision
  α* = argmin_k R(αk|x)
• results in the minimum expected risk
• is the best decision one can take
• this framework can also be used to build "test cases" for other classifiers

Example
• consider a classification problem (actions correspond to class assignments) with the 0-1 loss function
  λ(gk|gi) = 0 if k = i, and 1 if k ≠ i
• the conditional risk becomes
  R(gk|x) = Σ_{i=1}^K λ(gk|gi) P(gi|x) = Σ_{i≠k} P(gi|x) = 1 − P(gk|x)
• the Bayesian decision rule becomes the maximum a posteriori (MAP) rule (a small numeric sketch is given at the end of this section)

Bayesian classification - Maximum A Posteriori rule
Assign x to class gk if P(gk|x) > P(gi|x), ∀ i ≠ k.

Discriminant functions
• discriminant functions: hi(x), i = 1, ..., K
• classifier: x is assigned to gi if hi(x) > hk(x), ∀ k ≠ i
• examples: hi(x) = P(gi|x) or hi(x) = ln p(x|gi) + ln P(gi)
• different discriminant functions may give equivalent classifiers

Binary case
• if K = 2: binary classifier or dichotomizer
• multiclass problems can be decomposed into a series of 2-class problems
• a single discriminant function suffices: h(x) = h1(x) − h2(x), with the decision rule: assign x to g1 if h(x) > 0
• this is the case we will study most of the time
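A minimal numeric sketch in Python of the minimum-risk decision and the MAP rule above. The priors, likelihood values and the loss matrix are hypothetical; a third "reject" action is added only to show that the minimum-risk action need not coincide with the MAP class.

    import numpy as np

    priors = np.array([0.3, 0.7])          # P(g1), P(g2)
    likelihoods = np.array([2.0, 0.5])     # p(x|g1), p(x|g2) at some observed x

    # Bayes rule: posterior = likelihood * prior / evidence
    evidence = np.sum(likelihoods * priors)        # p(x) = sum_i p(x|gi) P(gi)
    posteriors = likelihoods * priors / evidence
    print("P(g1|x), P(g2|x) =", posteriors)

    # Loss matrix lambda(alpha_k | g_i): rows = actions, columns = true classes.
    # 0-1 loss for the two class assignments, constant loss 0.2 for rejecting.
    loss = np.array([[0.0, 1.0],    # action: assign to g1
                     [1.0, 0.0],    # action: assign to g2
                     [0.2, 0.2]])   # action: reject

    # Conditional risk R(alpha_k|x) = sum_i lambda(alpha_k|g_i) P(g_i|x)
    risks = loss @ posteriors

    best_action = np.argmin(risks)      # Bayes decision: minimum conditional risk
    map_class = np.argmax(posteriors)   # MAP rule (optimal under 0-1 loss)
    print("risks =", risks, "| best action =", best_action, "| MAP class =", map_class)

With these numbers the posterior of g1 is about 0.63, so the MAP rule picks g1, while the cheap reject action has the lowest conditional risk (0.2), illustrating how the loss function shapes the optimal decision.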
The case of normal density
For x ∈ R^p, a column vector, x ∼ N(µ, Σ):
  p(x) = 1 / ((2π)^{p/2} |Σ|^{1/2}) · exp(−½ (x − µ)^t Σ^{-1} (x − µ))
• µ is the mean vector and Σ is the covariance matrix (symmetric, positive definite):
  Σ = E[(x − µ)(x − µ)^t] = ∫ (x − µ)(x − µ)^t p(x) dx
• for a p × r matrix A: y = A^t x ∼ N(A^t µ, A^t Σ A)
• whitening: Aw = Φ Λ^{-1/2}, where Φ is the matrix with the eigenvectors of Σ as columns and Λ is the diagonal matrix with the corresponding eigenvalues on its diagonal
[DHS - Fig. 2.8: a linear transform A^t maps N(µ, Σ) to N(A^t µ, A^t Σ A); the whitening transform Aw maps it to the circular N(Aw^t µ, I)]

Mahalanobis distance
The Mahalanobis distance r from x to µ is given by
  r² = (x − µ)^t Σ^{-1} (x − µ)
• if the variables/features are independent and standardized, Σ becomes the identity matrix and the Mahalanobis distance reduces to the Euclidean distance
• the volume of the hyperellipsoid corresponding to a Mahalanobis distance r is V = Vp |Σ|^{1/2} r^p, where
  Vp = π^{p/2} / (p/2)!, if p is even
  Vp = 2^p π^{(p−1)/2} ((p−1)/2)! / p!, if p is odd

Discriminant functions for normal densities
Assume p(x|gi) ∼ N(µi, Σi). Since P(gi|x) ∝ p(x|gi) P(gi), one can take a discriminant function of the form
  hi(x) = ln p(x|gi) + ln P(gi)
which leads to
  hi(x) = −½ (x − µi)^t Σi^{-1} (x − µi) − (p/2) ln 2π − ½ ln |Σi| + ln P(gi)
(a numeric sketch of this discriminant is given at the end of these notes)

Special case: Σi = σ² I
  hi(x) = −(1 / (2σ²)) ‖x − µi‖² + ln P(gi)
which can be re-written as a linear discriminant function hi(x) = wi^t x + wi0, with wi ∈ R^p and
  wi = (1/σ²) µi  ← coefficients
  wi0 = −(1/(2σ²)) µi^t µi + ln P(gi)  ← threshold or bias
Decision surface: let i and j be the categories with the highest posteriors. The equation of the decision boundary is hi(x) = hj(x) and can be written as
  w^t (x − x0) = 0, with w = µi − µj and
  x0 = ½ (µi + µj) − (σ² / ‖µi − µj‖²) ln(P(gi)/P(gj)) (µi − µj)
[DHS - Fig. 2.10: 1-D, 2-D and 3-D examples with Σi = σ²I and equal priors P(ω1) = P(ω2) = .5; the decision regions R1 and R2 are separated by a hyperplane orthogonal to the line joining the means, halfway between them]

Special case: Σi = Σ
• the separating hyperplane is, in general, no longer orthogonal to the line joining the Gaussian means
[DHS - Fig. 2.12: decision boundaries for a shared covariance matrix, shown for equal priors P(ω1) = P(ω2) = .5 and for unequal priors P(ω1) = .1, P(ω2) = .9]

General case: Σi arbitrary
• the decision boundaries are hyperquadrics
[DHS - Fig. 2.14]

Errors
• an error occurs when the predicted class is wrong
• consider the binary classification problem; the probability of error is
  Perr = P(x ∈ R2, g1) + P(x ∈ R1, g2)
       = ∫_{R2} p(x|g1) P(g1) dx + ∫_{R1} p(x|g2) P(g2) dx
• let xB denote the optimal (Bayes) decision threshold and x* some other decision threshold; the minimum of Perr is obtained for x* = xB
• to compute Perr we need the class-conditional probabilities
• for Gaussian class-conditional densities and binary classification, one can show that
  Perr ≤ exp(−ψ(β, µ1,2, Σ1,2))
  where β can be optimized to minimize the right-hand side (the Chernoff bound)
• for β = ½ one obtains the Bhattacharyya bound

Wrap-up
• Bayesian theory offers a complete framework for building classifiers (among other applications)
• to minimize the overall risk, choose the action that minimizes the conditional risk
• under the normality assumption, an exact formulation of the optimal classifier can be derived
• tight error bounds can also be computed
• ...but this assumption is rarely true in practice
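To make the normal-density discriminant concrete, here is a minimal sketch in Python that evaluates hi(x) = −½ (x − µi)^t Σi^{-1} (x − µi) − (p/2) ln 2π − ½ ln |Σi| + ln P(gi) for two classes with arbitrary covariance matrices (the means, covariances and priors are hypothetical).

    import numpy as np

    def gaussian_discriminant(x, mu, sigma, prior):
        """h_i(x) for a Gaussian class-conditional density and prior P(g_i)."""
        p = len(mu)
        diff = x - mu
        maha2 = diff @ np.linalg.inv(sigma) @ diff   # squared Mahalanobis distance
        return (-0.5 * maha2
                - 0.5 * p * np.log(2 * np.pi)
                - 0.5 * np.log(np.linalg.det(sigma))
                + np.log(prior))

    # two classes in R^2 with different (arbitrary) covariance matrices
    mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
    sigma = [np.eye(2),
             np.array([[2.0, 0.5],
                       [0.5, 1.0]])]
    prior = [0.5, 0.5]

    x = np.array([1.0, 2.0])
    h = [gaussian_discriminant(x, mu[i], sigma[i], prior[i]) for i in range(2)]
    print("h1(x), h2(x) =", h, "-> assign x to class", int(np.argmax(h)) + 1)

Because the covariance matrices differ, hi(x) is quadratic in x and the resulting decision boundary is a hyperquadric, as noted for the general case above.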