PA196: Pattern Recognition
1. Introduction
2. Bayesian decision theory

Dr. Vlad Popovici
popovici@iba.muni.cz
Institute of Biostatistics and Analyses, Masaryk University, Brno

Before starting

Organization:
- 14 weeks: 2h lecture + 2h exercise
- 1st half of the semester: business as usual
- 2nd half of the semester (in addition to "business as usual"): discussion of several articles, preparation of small projects
- evaluation: 25% from article discussions, 40% project, 25% written exam and 10% active participation

Bibliography
- [DHS] Duda, Hart, Stork: Pattern Classification. 2nd Ed. 2001
- [EST] Hastie, Tibshirani, Friedman: The Elements of Statistical Learning. 2nd Ed. (7th printing). 2013. Available in PDF at http://statweb.stanford.edu/~tibs/ElemStatLearn/
- Webb, Copsey: Statistical Pattern Recognition. 3rd Ed. 2011
- Kuncheva: Combining Pattern Classifiers. 2004

Outline
1 Introduction
  - WHAT? (is Pattern Recognition)
  - WHY? (applications)
  - HOW? (different approaches)
2 Bayesian Decision Theory
  - Bayes rule
  - Discriminant functions
  - Normal density
  - Errors

WHAT? (is Pattern Recognition)

Webster dictionary:
- Pattern: a combination of qualities, acts, tendencies, etc., forming a consistent or characteristic arrangement
- Recognition: the identification of something as having been previously seen, heard, known, etc.

DHS: "the act of taking in raw data and taking an action based on the category of the pattern"

Wikipedia: "Pattern recognition is nearly synonymous with machine learning. This branch of artificial intelligence focuses on the recognition of patterns and regularities in data. In many cases, these patterns are learned from labeled "training" data (supervised learning), but when no labeled data are available other algorithms can be used to discover previously unknown patterns (unsupervised learning)."

An example - from DHS (Figs. 1.1-1.4)

Sorting fish (salmon vs. sea bass) on a conveyor belt, based on measured characteristics/features.

[DHS Figs. 1.2-1.3: histograms of a single feature - first length, then lightness - for salmon and sea bass, with decision thresholds l* and x*; lightness separates the two classes better than length, but neither feature alone is perfect]
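To make the single-feature example concrete, here is a minimal sketch (my own illustration, not from the slides) of how a threshold like x* on lightness could be chosen by minimizing training error; the data are synthetic stand-ins for the salmon/sea bass measurements, and all numbers are made up.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the DHS example: lightness values for the
# two kinds of fish (salmon assumed darker than sea bass on average).
salmon = rng.normal(loc=4.0, scale=1.0, size=200)
seabass = rng.normal(loc=7.0, scale=1.0, size=200)

x = np.concatenate([salmon, seabass])
y = np.concatenate([np.zeros(200), np.ones(200)])  # 0 = salmon, 1 = sea bass

def training_error(threshold):
    """Error rate of the rule: predict sea bass if lightness > threshold."""
    pred = (x > threshold).astype(int)
    return np.mean(pred != y)

# Scan candidate thresholds and keep the one with the fewest mistakes,
# mimicking the choice of x* in the lightness histogram.
candidates = np.linspace(x.min(), x.max(), 500)
errors = [training_error(t) for t in candidates]
x_star = candidates[int(np.argmin(errors))]

print(f"chosen threshold x* = {x_star:.2f}, training error = {min(errors):.3f}")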
Combining width and lightness features:

[DHS Fig. 1.4: scatter plot of lightness vs. width for the two classes, with a linear decision boundary - combining the two features improves the separation]

WHY? (applications)

Applications - in no particular order

Biometrics:
- face detection and recognition
- gender and age recognition
- fingerprint recognition
- speaker recognition
- ...

Human-computer interfaces:
- user detection/recognition
- gait recognition
- gesture recognition
- brainwave categorization
- ...

Other:
- biomedical research: prediction of response, target genes, etc.
- military/security: target detection, intrusion detection, etc.
- spam filtering
- optical character recognition
- natural language processing
- remote sensing
- etc.

HOW? (different approaches)

Pattern recognition systems

environment → acquisition (sensor) → feature extraction → classification → decision

Example:
- acquisition/sensor: CCD camera
- feature extraction: in all rectangular regions of size 30 × 30, compute the Gabor wavelet decomposition in corner regions
- classification: given all the coefficients of the Gabor wavelets, classify the region as human face or "something else"

Some basic terminology
- learning: model fitting, optimization, training
- main types of problems (contrasted in the sketch below):
  - CLASSIFICATION: the classes are known; the problem is to assign classes to the inputs. Usually approached by supervised learning: samples and the corresponding labels are used in training the classifier.
  - CLASS DISCOVERY: the classes are not known (maybe not even how many there are). Usually approached by clustering: grouping similar inputs together.
- other modes of learning:
  - semi-supervised learning: some labeled and some unlabeled data
  - reinforcement learning: there is a "teacher" telling the system when it's right or wrong
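A minimal sketch (my own, not from the slides) contrasting the two problem types on the same made-up data: a nearest-centroid classifier stands in for supervised learning, and a bare-bones 2-means loop stands in for class discovery with the labels withheld.

import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian blobs in 2-D; y holds the (known) class labels.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.repeat([0, 1], 100)

# CLASSIFICATION (supervised): the labels are used to fit per-class centroids.
centroids = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
pred = np.argmin(((X[:, None, :] - centroids) ** 2).sum(axis=2), axis=1)
print("supervised nearest-centroid accuracy:", np.mean(pred == y))

# CLASS DISCOVERY (unsupervised): ignore y and let 2-means find groups.
# (No empty-cluster handling - fine for this well-separated toy data.)
means = X[rng.choice(len(X), 2, replace=False)]   # random initial centers
for _ in range(20):
    assign = np.argmin(((X[:, None, :] - means) ** 2).sum(axis=2), axis=1)
    means = np.array([X[assign == k].mean(axis=0) for k in (0, 1)])

# The discovered cluster indices need not match the label encoding.
agreement = max(np.mean(assign == y), np.mean(assign != y))
print("clusters vs. true classes agreement:", agreement)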
Approaches - no clear-cut separation between method types:
- statistical/Bayesian: features are random variates; estimate the PDFs and use the maximum a posteriori rule for classification; minimize the "risk" of misclassification
- geometric: find boundaries between regions of the feature space
- neural networks
- model-based: reference patterns represent the classes; a "nearest pattern" rule is used for classification
- syntactic: classes are represented by grammars
- structural: classes are represented by graphs (or similar)

Design cycle
- data collection → sample size estimation
- feature selection
- classifier design → selection of classifier(s), training, model selection
- performance estimation → errors and costs, error estimates, variability of the estimates

Other issues
- pre-processing and normalization:
  - improve stability of the models and convergence of the learning procedure
  - depend on classifier and application domain
- feature standardization
- detection of outliers
- detection of errors in data

Goals of classifier design
- to build a classification model from a finite set of examples that minimizes some error measure and which generalizes well
- to estimate its future performance on unseen data

A bit of formalism
- a sample or a pattern is represented as a real-valued vector x ∈ R^d: x = [x_1, ..., x_i, ..., x_d]^t, where x_i is called a variable or feature (actually, it is a realization - a measurement - of a given variable)
- generally, there is a label g ∈ G = {g_1, ..., g_m} uniquely associated with each sample
- there is a probability P(g) that each class will be observed - the a priori probability (prior). Normally, \sum_i P(g_i) = 1.
- the probability of observing a sample x, given a class g_i, is given by the class-conditional probability density p(x|g_i)
- NOTE: P(·) is used for probability mass functions (discrete variables) and p(·) for probability density functions

Similarities, metrics, distances...
- most methods rely on an explicit or implicit distance between data points, d : R^d × R^d → R_+, with
  - d(x, z) > 0 ∀ x ≠ z
  - d(x, z) = 0 ⇔ x = z
  - d(x, z) = d(z, x)
- a metric is a distance which also satisfies the triangle inequality: d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z ∈ R^d
- a similarity measure is less formally defined; it usually takes large values for more alike objects
- examples: the Euclidean metric; the correlation coefficient as a similarity measure (see the sketch below)
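A minimal sketch (my own, assuming numpy) of the two examples just mentioned; note how the two notions can disagree: points may be far apart in the Euclidean sense yet perfectly "alike" by correlation.

import numpy as np

def euclidean(x, z):
    """Euclidean metric on R^d: nonnegative, symmetric, zero iff x == z,
    and it satisfies the triangle inequality."""
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(z)) ** 2))

def correlation_similarity(x, z):
    """Pearson correlation as a similarity: close to 1 for profiles with
    the same 'shape', regardless of scale or offset. Not a metric."""
    return np.corrcoef(x, z)[0, 1]

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 1.0])
z = np.array([2.0, 4.0, 6.0])   # same shape as x, twice the scale

# Triangle inequality spot-check: d(x, z) <= d(x, y) + d(y, z)
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)

print("d(x, z)   =", euclidean(x, z))               # far apart in R^3
print("sim(x, z) =", correlation_similarity(x, z))  # 1.0: perfectly 'alike'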
Bayesian Decision Theory

Bayes rule
- let G = {g_i} be the set of classes and let X = {x_i} be a set of observed data points (i.e. the training set)
- p(x|g) is called the likelihood (function) when viewed as a function of the variable g (the class label). Note: if g is fixed, but x is considered random, we have the class-conditional density - the model for generating the observations.
- P(g) is the prior
- the goal is to find P(g|x), i.e. the posterior probability that the label for x is g
- Bayes rule:
  P(g_i \mid x) = \frac{p(x \mid g_i) P(g_i)}{p(x)} = \frac{p(x \mid g_i) P(g_i)}{\sum_j p(x \mid g_j) P(g_j)}

posterior ∝ likelihood × prior

Tricky part: how to estimate the likelihood?!

Class-conditional density or likelihood function? ...it depends!
- consider a set of random variables X_1, ..., X_p, conditionally independent given that Θ = θ
- p_{X_i|Θ}(·|θ) is the (postulated) density model for the variable X_i: for each possible value of Θ it describes the uncertainty about the values of X_i:
  \Pr\{X_1 \in A_1, \dots, X_p \in A_p\} = \int_{A_1 \times \dots \times A_p} \prod_{i=1}^{p} p_{X_i \mid \Theta}(x_i \mid \theta) \, dx_1 \dots dx_p
- once [x_1, ..., x_p] are observed (hence they are fixed!), define the function
  L_{x_1,\dots,x_p} : \Omega \to \mathbb{R}, \quad L_{x_1,\dots,x_p}(\theta) = \prod_{i=1}^{p} p_{X_i \mid \Theta}(x_i \mid \theta)
  and call it the likelihood function

Bayesian decision
- consider there are K classes {g_1, ..., g_K}
- there are a possible actions: {α_1, ..., α_a}
- let α(x) be the decision rule/function
- for each action-class pair there is a loss incurred: λ(α_k | g_i)
- conditional risk:
  R(\alpha_k \mid x) = \sum_{i=1}^{K} \lambda(\alpha_k \mid g_i) P(g_i \mid x)
- the expected risk for the rule α(x):
  R = \int R(\alpha(x) \mid x) \, p(x) \, dx

- the decision \alpha^* = \arg\min_k R(\alpha_k \mid x) results in the minimum expected risk
- it is the best decision that one can take
- you can use this framework to build "test cases" for other classifiers

Example (illustrated numerically below)
- consider a classification problem (actions correspond to class assignments), with the 0-1 loss function
  \lambda(g_k \mid g_i) = \begin{cases} 0, & k = i \\ 1, & k \neq i \end{cases}
- the conditional risk becomes
  R(g_k \mid x) = \sum_{i=1}^{K} \lambda(g_k \mid g_i) P(g_i \mid x) = \sum_{i \neq k} P(g_i \mid x) = 1 - P(g_k \mid x)
- the Bayesian decision rule becomes the maximum a posteriori (MAP) rule

Bayesian classification - Maximum A Posteriori rule: assign x to class g_k if P(g_k | x) > P(g_i | x) ∀ i ≠ k
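A minimal sketch (my own, with made-up posteriors and losses) that evaluates the conditional risk R(α_k|x) for a loss matrix and picks α* = argmin_k; with the 0-1 loss it reduces to the MAP rule, while an asymmetric loss can change the decision even when the posteriors stay the same.

import numpy as np

def bayes_decision(posteriors, loss):
    """Pick the action minimizing the conditional risk
    R(a_k | x) = sum_i loss[k, i] * P(g_i | x)."""
    risks = loss @ posteriors
    return int(np.argmin(risks)), risks

# Posteriors P(g_i | x) for three classes at some observed x (made up).
post = np.array([0.5, 0.3, 0.2])

# 0-1 loss: risk is 1 - P(g_k | x), so the rule reduces to MAP.
zero_one = 1.0 - np.eye(3)
k, risks = bayes_decision(post, zero_one)
print("0-1 loss -> action", k, "risks:", risks)      # picks class 0 (MAP)

# Asymmetric loss: mistaking class 2 is 10x worse; the optimal action
# changes even though class 0 still has the highest posterior.
asym = np.array([[0., 1., 10.],
                 [1., 0., 10.],
                 [1., 1.,  0.]])
k, risks = bayes_decision(post, asym)
print("asymmetric loss -> action", k, "risks:", risks)  # picks class 2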
Discriminant functions
- d-functions: h_i(x), i = 1, ..., K
- classifier: x is assigned to g_i if h_i(x) > h_k(x) ∀ k ≠ i
- e.g. h_i(x) = P(g_i | x) or h_i(x) = ln p(x | g_i) + ln P(g_i)
- different d-functions may give equivalent classifiers

Binary case
- if K = 2: binary classifier or dichotomizer
- multiclass problems can be decomposed into a series of 2-class problems
- a single discriminant function suffices: h(x) = h_1(x) − h_2(x), with the decision rule: assign x to g_1 if h(x) > 0
- this is the case we will study most of the time

The case of normal density
For x ∈ R^p a column vector, x ∼ N(µ, Σ):
  p(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^t \Sigma^{-1} (x - \mu) \right)
µ is the mean vector and Σ is the covariance matrix (symmetric, positive definite):
  \Sigma = E[(x - \mu)(x - \mu)^t] = \int (x - \mu)(x - \mu)^t p(x) \, dx

- for a p × r matrix A: y = A^t x ∼ N(A^t µ, A^t Σ A)
- whitening: A_w = Φ Λ^{-1/2}, where Φ is the matrix with the eigenvectors of Σ as columns and Λ is the diagonal matrix with the corresponding eigenvalues on the diagonal

[DHS Fig. 2.8: a linear transform A maps N(µ, Σ) to N(A^t µ, A^t Σ A); the whitening transform A_w yields the circularly symmetric N(A_w^t µ, I)]

Mahalanobis distance
- r^2 = (x - \mu)^t \Sigma^{-1} (x - \mu)
- if the variables/features are independent and standardized, Σ becomes the identity matrix and the M-distance becomes the Euclidean distance
- the volume of the hyperellipsoid corresponding to a distance r is V = V_p |\Sigma|^{1/2} r^p, where
  V_p = \begin{cases} \pi^{p/2} / (p/2)! & \text{if } p \text{ is even} \\ 2^p \pi^{(p-1)/2} \left( \frac{p-1}{2} \right)! / p! & \text{if } p \text{ is odd} \end{cases}

Discriminant functions for Normal densities
Assume p(x|g_i) ∼ N(µ_i, Σ_i). Since P(g_i|x) ∝ p(x|g_i) P(g_i), one can take a discriminant function of the form
  h_i(x) = \ln p(x \mid g_i) + \ln P(g_i)
which leads to
  h_i(x) = -\frac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \frac{p}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(g_i)

Special case: Σ_i = σ² I
  h_i(x) = -\frac{1}{2\sigma^2} \|x - \mu_i\|^2 + \ln P(g_i)
which (after dropping the x^t x term, common to all classes) can be re-written as linear discriminant functions
  h_i(x) = w_i^t x + w_{i0}
with w_i ∈ R^p:
  w_i = \frac{1}{\sigma^2} \mu_i  ← coefficients
  w_{i0} = -\frac{1}{2\sigma^2} \mu_i^t \mu_i + \ln P(g_i)  ← threshold or bias
A numerical sketch of these linear discriminants follows below.
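A minimal numerical sketch (my own, assuming numpy; the means, σ² and priors are made up) of the linear discriminants above: compute w_i and w_{i0} from the class means, the common σ², and the priors, then classify a point by the largest h_i(x).

import numpy as np

def linear_discriminants(means, sigma2, priors):
    """For p(x|g_i) ~ N(mu_i, sigma^2 I), return (W, w0) such that
    h_i(x) = W[i] @ x + w0[i]."""
    means = np.asarray(means, dtype=float)
    W = means / sigma2                                  # w_i = mu_i / sigma^2
    w0 = -np.sum(means**2, axis=1) / (2 * sigma2) + np.log(priors)
    return W, w0

means = [[0.0, 0.0], [3.0, 3.0]]   # made-up class means
sigma2 = 1.0
priors = np.array([0.5, 0.5])

W, w0 = linear_discriminants(means, sigma2, priors)

x = np.array([1.0, 1.0])
h = W @ x + w0
print("h(x) =", h, "-> assign to class", int(np.argmax(h)))

# With equal priors the boundary passes through the midpoint of the means:
mid = np.mean(means, axis=0)
print("h at midpoint:", W @ mid + w0)   # the two values coincide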
Decision surface
Let i and j be the categories with the highest posteriors. The equation of the decision boundary is given by h_i(x) = h_j(x) and can be written as
  w^t (x - x_0) = 0
with
  w = \mu_i - \mu_j, \quad x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln \frac{P(g_i)}{P(g_j)} (\mu_i - \mu_j)

[DHS Fig. 2.10: for Σ_i = σ²I and equal priors P(g_1) = P(g_2) = 0.5, the decision boundary between regions R_1 and R_2 is a hyperplane orthogonal to the line joining the means, shown in 1, 2 and 3 dimensions]

Special case: Σ_i = Σ
→ the separating hyperplane is no longer orthogonal to the line between the Gaussian centers

[DHS Fig. 2.12: equal covariance matrices; the linear boundary shifts away from the more probable class as the priors move from P(g_1) = P(g_2) = 0.5 to P(g_1) = 0.1, P(g_2) = 0.9]

General case: Σ_i arbitrary

[DHS Fig. 2.14: arbitrary covariance matrices give quadratic decision boundaries]

Errors
- error: predicting the wrong class
- consider the binary classification problem:
  P_{err} = P(x \in R_2, g_1) + P(x \in R_1, g_2)
- x_B: the optimal (Bayes) decision threshold; x*: another decision threshold

So
  P_{err} = \int_{R_2} p(x \mid g_1) P(g_1) \, dx + \int_{R_1} p(x \mid g_2) P(g_2) \, dx
- the minimum of P_err is obtained for x* = x_B
- to compute P_err we need the class-conditional probabilities
- for Gaussian densities and binary classification, one can show that
  P_{err} \le \exp(-\psi(\beta, \mu_{1,2}, \Sigma_{1,2}))
  where β can be optimized to minimize the right-hand side (the Chernoff bound)
- for β = 1/2 one obtains the Bhattacharyya bound

Wrap-up
- Bayesian theory offers a complete framework for building classifiers (among other applications)
- minimize the overall risk → choose the action that minimizes the conditional risk
- under the normality assumption, an exact formulation of the optimal classifier can be derived
- tight error bounds can also be computed (see the sketch below)
- ...but this assumption is rarely true in practice
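To close, a minimal sketch (my own, assuming numpy; the two Gaussians are made up) of the Bhattacharyya bound (β = 1/2) for two Gaussian classes, compared against a Monte Carlo estimate of the actual error of the Bayes rule; the bound should always sit above the estimate.

import numpy as np

def bhattacharyya_bound(mu1, S1, mu2, S2, p1=0.5):
    """Bhattacharyya bound on the Bayes error for two Gaussian classes:
    P_err <= sqrt(P1 * P2) * exp(-k(1/2))."""
    S = (S1 + S2) / 2
    dmu = mu2 - mu1
    k = (dmu @ np.linalg.solve(S, dmu)) / 8 + 0.5 * np.log(
        np.linalg.det(S) / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return np.sqrt(p1 * (1 - p1)) * np.exp(-k)

rng = np.random.default_rng(2)
mu1, S1 = np.zeros(2), np.eye(2)
mu2, S2 = np.array([2.0, 2.0]), np.eye(2)

print("Bhattacharyya bound:", bhattacharyya_bound(mu1, S1, mu2, S2))

# Monte Carlo estimate of the actual Bayes error: classify samples from
# each class with the MAP rule (equal priors, equal spherical covariances
# -> just compare squared Euclidean distances to the two means).
n = 100_000
x1 = rng.multivariate_normal(mu1, S1, n)
x2 = rng.multivariate_normal(mu2, S2, n)
err1 = np.mean(((x1 - mu2)**2).sum(1) < ((x1 - mu1)**2).sum(1))
err2 = np.mean(((x2 - mu1)**2).sum(1) < ((x2 - mu2)**2).sum(1))
print("Monte Carlo Bayes error:", 0.5 * (err1 + err2))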