PA196: Pattern Recognition
1. Introduction
2. Bayesian Decision Theory

Dr. Vlad Popovici (popovici@iba.muni.cz)
Institute of Biostatistics and Analyses, Masaryk University, Brno

Before starting

Organization:
- 14 weeks: 2h lecture + 2h exercise
- 1st half of the semester: business as usual
- 2nd half of the semester (in addition to "business as usual"): project work
- evaluation: 50% project, 40% written exam, 10% active participation

Bibliography
- [DHS] Duda, Hart, Stork: Pattern Classification. 2nd Ed., 2001
- [ESL] Hastie, Tibshirani, Friedman: The Elements of Statistical Learning. 2nd Ed. (7th printing), 2013. Available in PDF: http://statweb.stanford.edu/~tibs/ElemStatLearn/
- Webb, Copsey: Statistical Pattern Recognition. 3rd Ed., 2011
- Kuncheva: Combining Pattern Classifiers. 2004

Outline
1 Introduction
  - WHAT? (is Pattern Recognition)
  - WHY? (applications)
  - HOW? (different approaches)
2 Bayesian Decision Theory
  - Bayes rule
  - Discriminant functions
  - Normal density
  - Errors

WHAT? (is Pattern Recognition)

Webster dictionary:
- Pattern: a combination of qualities, acts, tendencies, etc., forming a consistent or characteristic arrangement
- Recognition: the identification of something as having been previously seen, heard, known, etc.

DHS: "the act of taking in raw data and taking an action based on the category of the pattern"

Wikipedia: "Pattern recognition is nearly synonymous with machine learning. This branch of artificial intelligence focuses on the recognition of patterns and regularities in data. In many cases, these patterns are learned from labeled "training" data (supervised learning), but when no labeled data are available other algorithms can be used to discover previously unknown patterns (unsupervised learning)."

An example - from DHS (figs 1.1-1.4): separating salmon from sea bass on a conveyor belt, based on features measured from camera images.

Characteristic/feature: length
[Figure: class-wise histograms of fish length for salmon and sea bass, with decision threshold l*]

Characteristic/feature: lightness
[Figure: class-wise histograms of lightness for salmon and sea bass, with decision threshold x*]

Combining width and lightness features:
[Figure: scatter plot of width vs. lightness for salmon and sea bass, with a linear decision boundary]
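To make the single-feature rule concrete, here is a minimal sketch that picks the lightness threshold x* by scanning candidate values and keeping the one with the smallest training error. All numbers are invented stand-ins for the DHS salmon/sea bass measurements.

```python
import numpy as np

# Hypothetical lightness measurements for the two fish classes
# (synthetic stand-ins for the DHS salmon / sea bass example).
rng = np.random.default_rng(0)
salmon = rng.normal(loc=3.0, scale=1.0, size=100)   # salmon tend to be darker
seabass = rng.normal(loc=6.0, scale=1.0, size=100)  # sea bass tend to be lighter

def error_at(threshold):
    """Training error of the rule: 'sea bass if lightness > threshold'."""
    misclassified = np.sum(salmon > threshold) + np.sum(seabass <= threshold)
    return misclassified / (salmon.size + seabass.size)

# Scan candidate thresholds and keep the one with the smallest error,
# playing the role of x* in the lightness histogram.
candidates = np.linspace(0.0, 10.0, 501)
x_star = candidates[np.argmin([error_at(t) for t in candidates])]
print(f"x* = {x_star:.2f}, training error = {error_at(x_star):.2%}")
```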
WHY? (applications)

Applications - in no particular order

Biometrics:
- face detection and recognition
- gender and age recognition
- fingerprint recognition
- speaker recognition
- ...

Human-computer interfaces:
- user detection/recognition
- gait recognition
- gesture recognition
- brainwave categorization
- ...

Other:
- biomedical research: prediction of response, target genes, etc.
- military/security: target detection, intrusion detection, etc.
- spam filtering
- optical character recognition
- natural language processing
- remote sensing
- etc.

HOW? (different approaches)

Pattern recognition systems

A typical system forms a pipeline: environment → acquisition (sensor) → feature extraction → classification → decision

Example:
- acquisition/sensor: a CCD camera
- feature extraction: in all rectangular regions of size 30 × 30, compute the Gabor wavelet decomposition in corner regions
- classification: given all the coefficients of the Gabor wavelets, classify the region as human face or "something else"

Some basic terminology
- learning: model fitting, optimization, training
- main types of problems (contrasted in the sketch below):
  - CLASSIFICATION: the classes are known; the problem is to assign classes to the inputs. Usually approached by supervised learning: samples and the corresponding labels are used in training the classifier.
  - CLASS DISCOVERY: the classes are not known (maybe not even how many there are). Usually approached by clustering: grouping similar inputs together.
- other modes of learning:
  - semi-supervised learning: some labeled and some unlabeled data
  - reinforcement learning: there is a "teacher" telling the system when it is right or wrong
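The two problem types can be contrasted in a few lines of code. This is a minimal sketch assuming scikit-learn is available; the synthetic blobs are not tied to any particular application above.

```python
# CLASSIFICATION vs. CLASS DISCOVERY on the same synthetic data.
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# CLASSIFICATION: the labels y are known and used for training (supervised).
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("predicted class of a sample:", clf.predict(X[:1]))

# CLASS DISCOVERY: labels are ignored; similar inputs are grouped (clustering).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("discovered cluster of the same sample:", km.labels_[0])
```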
Approaches: no clear-cut separation between method types
- statistical/Bayesian: features are random variates; estimate the PDFs and use the maximum a posteriori rule for classification; minimize the "risk" of misclassification
- geometric: find boundaries between regions of the feature space
- neural networks
- model-based: reference patterns represent the classes; a "nearest pattern" rule is used for classification
- syntactic: classes are represented by grammars
- structural: classes are represented by graphs (or similar)

Design cycle
- data collection → sample size estimation
- feature selection
- classifier design → selection of classifier(s), training, model selection
- performance estimation → errors and costs, error estimates, variability of the estimates

Other issues
- pre-processing and normalization: improve the stability of the models and the convergence of the learning procedure; depend on the classifier and the application domain
- feature standardization
- detection of outliers
- detection of errors in the data

Goals of classifier design
- to build a classification model from a finite set of examples, one that minimizes some error measure and generalizes well
- to estimate its future performance on unseen data

A bit of formalism
- a sample or a pattern is represented as a real-valued vector $x \in \mathbb{R}^d$, $x = [x_1, \dots, x_i, \dots, x_d]^t$; $x_i$ is called a variable or a feature (actually, it is a realization - a measurement - of a given variable)
- generally, there is a label $g \in G = \{g_1, \dots, g_m\}$ uniquely associated with each sample
- there is a probability $P(g)$ that each class is observed - the a priori probability (prior); normally $\sum_i P(g_i) = 1$
- the probability of observing a sample $x$, given a class $g_i$, is given by the class-conditional probability density $p(x|g_i)$
- NOTE: $P(\cdot)$ is used for probability mass functions (discrete variables) and $p(\cdot)$ for probability density functions

Similarities, metrics, distances...
- most methods rely on an explicit or implicit distance between data points, $d : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$, with $d(x, z) > 0 \ \forall x \neq z$; $d(x, z) = 0 \iff x = z$; $d(x, z) = d(z, x)$
- a metric is a distance which also satisfies the triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$, for all $x, y, z \in \mathbb{R}^d$
- a similarity measure is less formally defined; it usually takes large values for more alike objects
- examples (see the sketch below): the Euclidean metric; the correlation coefficient as a similarity measure
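A small sketch of the two examples, showing why the correlation coefficient is a similarity rather than a metric: $z = 2x$ has correlation 1 with $x$ even though the two points are far apart in the Euclidean sense. The vectors are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
z = np.array([2.0, 4.0, 6.0, 8.0])   # z = 2x: same "shape", different values

euclidean = np.linalg.norm(x - z)        # a metric: positive, symmetric,
                                         # satisfies the triangle inequality
correlation = np.corrcoef(x, z)[0, 1]    # a similarity: large for alike objects

print(f"Euclidean distance:     {euclidean:.3f}")    # > 0 although "alike"
print(f"Correlation similarity: {correlation:.3f}")  # 1.0, yet x != z
```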
Bayes rule

Let $G = \{g_i\}$ be the set of classes and let $X = \{x_i\}$ be a set of observed data points (i.e., the training set).
- $p(x|g)$ is called the likelihood (function), with $g$ (the class label) as the variable. Note: if $g$ is fixed but $x$ is considered random, we have the class-conditional density - the model for generating the observations.
- $P(g)$ is the prior.
- The goal is to find $P(g|x)$, i.e. the posterior probability that the label for $x$ is $g$.
- Bayes rule:
$$P(g_i|x) = \frac{p(x|g_i)P(g_i)}{p(x)} = \frac{p(x|g_i)P(g_i)}{\sum_j p(x|g_j)P(g_j)}$$

In short: posterior ∝ likelihood × prior. Tricky part: how to estimate the likelihood?!

Class-conditional density or likelihood function? ...it depends!
- consider a set of random variables $X_1, \dots, X_p$, conditionally independent given $\Theta = \theta$
- $p_{X_i|\Theta}(\cdot|\theta)$ is the (postulated) density model for the variable $X_i$: for each possible value of $\Theta$ it describes the uncertainty about the values of $X_i$:
$$\Pr\{X_1 \in A_1, \dots, X_p \in A_p\} = \int_{A_1 \times \dots \times A_p} \prod_{i=1}^{p} p_{X_i|\Theta}(x_i|\theta)\, dx_1 \dots dx_p$$
- once $[x_1, \dots, x_p]$ are observed (hence they are fixed!), define the function
$$L_{x_1,\dots,x_p} : \Omega \to \mathbb{R}, \qquad L_{x_1,\dots,x_p}(\theta) = \prod_{i=1}^{p} p_{X_i|\Theta}(x_i|\theta)$$
and call it the likelihood function.

Bayesian decision
- consider there are $K$ classes $\{g_1, \dots, g_K\}$
- there are $a$ possible actions: $\{\alpha_1, \dots, \alpha_a\}$
- let $\alpha(x)$ be the decision rule/function
- for each action-class pair there is a loss incurred: $\lambda(\alpha_k|g_i)$
- the conditional risk is
$$R(\alpha_k|x) = \sum_{i=1}^{K} \lambda(\alpha_k|g_i)\, P(g_i|x)$$
- the expected risk for the rule $\alpha(x)$ is
$$R = \int_x R(\alpha(x)|x)\, p(x)\, dx$$

Bayesian decision (cont.)
$$\alpha^* = \arg\min_k R(\alpha_k|x)$$
- results in the minimum expected risk
- is the best decision one can take
- this framework can be used to build "test cases" for other classifiers

Example (see the sketch below)
- consider a classification problem (actions correspond to class assignments), with the 0-1 loss function
$$\lambda(g_k|g_i) = \begin{cases} 0, & k = i \\ 1, & k \neq i \end{cases}$$
- the conditional risk becomes
$$R(g_k|x) = \sum_{i=1}^{K} \lambda(g_k|g_i)P(g_i|x) = \sum_{i \neq k} P(g_i|x) = 1 - P(g_k|x)$$
- the Bayesian decision rule becomes the maximum a posteriori (MAP) rule:

Bayesian classification - Maximum A Posteriori rule: assign $x$ to class $g_k$ if $P(g_k|x) > P(g_i|x)$ for all $i \neq k$.
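A minimal sketch of the decision machinery above: the conditional risks $R(\alpha_k|x)$ for a general loss matrix, plus a check that under the 0-1 loss the risk-minimizing action coincides with the MAP choice. The posteriors and losses are invented numbers, purely for illustration.

```python
import numpy as np

posteriors = np.array([0.5, 0.3, 0.2])   # P(g_i | x) for K = 3 classes

# General loss matrix: lam[k, i] = lambda(alpha_k | g_i).
lam = np.array([[0.0, 2.0, 4.0],
                [1.0, 0.0, 1.0],
                [3.0, 2.0, 0.0]])

cond_risk = lam @ posteriors              # R(alpha_k | x) for every action k
print("conditional risks:", cond_risk, "-> take action", np.argmin(cond_risk))

# With the 0-1 loss, minimizing the conditional risk is exactly the MAP rule:
zero_one = 1.0 - np.eye(3)                # lam(g_k | g_i) = 1 - delta_{ki}
assert np.argmin(zero_one @ posteriors) == np.argmax(posteriors)
```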
Discriminant functions
- d-functions: $h_i(x)$, $i = 1, \dots, K$
- classifier: $x$ is assigned to $g_i$ if $h_i(x) > h_k(x)$ for all $k \neq i$
- e.g. $h_i(x) = P(g_i|x)$ or $h_i(x) = \ln p(x|g_i) + \ln P(g_i)$
- different d-functions may give equivalent classifiers

Binary case
- if $K = 2$: binary classifier or dichotomizer
- multiclass problems can be decomposed into a series of 2-class problems
- a single discriminant function suffices: $h(x) = h_1(x) - h_2(x)$, with the decision rule: assign $x$ to $g_1$ if $h(x) > 0$
- this is the case we will study most of the time

The case of normal density

For $x \in \mathbb{R}^p$ a column vector, $x \sim N(\mu, \Sigma)$:
$$p(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^t \Sigma^{-1} (x-\mu)\right)$$
$\mu$ is the mean vector and $\Sigma$ is the covariance matrix (symmetric, positive definite),
$$\Sigma = E[(x-\mu)(x-\mu)^t] = \int (x-\mu)(x-\mu)^t\, p(x)\, dx$$

Linear transformations:
- for a $p \times r$ matrix $A$: $y = A^t x \sim N(A^t\mu, A^t\Sigma A)$
- whitening: $A_w = \Phi \Lambda^{-1/2}$, where $\Phi$ is the matrix with the eigenvectors of $\Sigma$ as columns and $\Lambda$ is the diagonal matrix with the corresponding eigenvalues on the diagonal; then $A_w^t x \sim N(A_w^t\mu, I)$
[DHS - Fig. 2.8: the effect of a projection, a linear transformation $A$ and the whitening transform $A_w$ on a Gaussian density]

Mahalanobis distance
$$r^2 = (x-\mu)^t \Sigma^{-1} (x-\mu)$$
- if the variables/features are independent and standardized, $\Sigma$ becomes the identity matrix and the Mahalanobis distance reduces to the Euclidean distance
- the volume of the hyperellipsoid corresponding to a Mahalanobis distance $r$ is $V = V_p |\Sigma|^{1/2} r^p$, where
$$V_p = \begin{cases} \pi^{p/2}/(p/2)! & \text{if } p \text{ is even} \\ 2^p \pi^{(p-1)/2} \left(\frac{p-1}{2}\right)! \,/\, p! & \text{if } p \text{ is odd} \end{cases}$$

Discriminant functions for normal densities

Assume $p(x|g_i) \sim N(\mu_i, \Sigma_i)$. Since $P(g_i|x) \propto p(x|g_i)P(g_i)$, one can take a discriminant function of the form
$$h_i(x) = \ln p(x|g_i) + \ln P(g_i)$$
which leads to
$$h_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1}(x-\mu_i) - \frac{p}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(g_i)$$

Special case: $\Sigma_i = \sigma^2 I$
$$h_i(x) = -\frac{1}{2\sigma^2}\|x-\mu_i\|^2 + \ln P(g_i)$$
which, after dropping the $x^t x$ term (it is the same for all classes), can be re-written as linear discriminant functions
$$h_i(x) = w_i^t x + w_{i0}$$
with $w_i \in \mathbb{R}^p$,
$$w_i = \frac{1}{\sigma^2}\mu_i \ \text{(coefficients)}, \qquad w_{i0} = -\frac{1}{2\sigma^2}\mu_i^t\mu_i + \ln P(g_i) \ \text{(threshold or bias)}$$
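A minimal sketch of the $\Sigma_i = \sigma^2 I$ case: the weights $w_i$ and biases $w_{i0}$ are computed exactly as above, with invented means, variance and priors.

```python
import numpy as np

mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]  # class means mu_i (invented)
priors = [0.5, 0.5]                                  # P(g_i)
sigma2 = 1.0                                         # common variance sigma^2

def linear_discriminant(mu, prior):
    w = mu / sigma2                                   # w_i = mu_i / sigma^2
    w0 = -(mu @ mu) / (2 * sigma2) + np.log(prior)    # bias w_i0
    return w, w0

x = np.array([1.0, 0.0])
params = [linear_discriminant(m, p) for m, p in zip(mus, priors)]
scores = [w @ x + w0 for w, w0 in params]             # h_i(x) = w_i^t x + w_i0
print("h_i(x):", scores, "-> assign to class", int(np.argmax(scores)))
```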
Decision surface: let $i$ and $j$ be the categories with the highest posteriors. The equation of the decision boundary is given by $h_i(x) = h_j(x)$ and can be written as
$$w^t(x - x_0) = 0$$
with
$$w = \mu_i - \mu_j, \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln\frac{P(g_i)}{P(g_j)}\,(\mu_i - \mu_j)$$
[DHS - Fig. 2.10: decision regions R1, R2 for two Gaussian classes with $\Sigma_i = \sigma^2 I$ and equal priors, in one, two and three dimensions]

Special case: $\Sigma_i = \Sigma$
→ the separating hyperplane is, in general, no longer orthogonal to the line between the Gaussian centers
[DHS - Fig. 2.12: decision regions R1, R2 for two Gaussian classes with equal covariance matrices, for equal and for unequal priors]

General case: $\Sigma_i$ arbitrary
→ the decision boundaries become hyperquadrics
[DHS - Fig. 2.14]

Errors
- error: predicting the wrong class
- consider the binary classification problem: $P_{err} = P(x \in R_2, g_1) + P(x \in R_1, g_2)$
- $x_B$: the optimal (Bayes) decision threshold; $x^*$: another decision threshold

So
$$P_{err} = \int_{R_2} p(x|g_1)P(g_1)\, dx + \int_{R_1} p(x|g_2)P(g_2)\, dx$$
- the minimum of $P_{err}$ is obtained for $x^* = x_B$
- to compute $P_{err}$ we need the class-conditional probabilities
- for Gaussian densities and binary classification, one can show that
$$P_{err} \leq \exp\left(-\psi(\beta, \mu_{1,2}, \Sigma_{1,2})\right)$$
where $\beta$ can be optimized to minimize the right-hand side (the Chernoff bound)
- for $\beta = \frac{1}{2}$ one obtains the Bhattacharyya bound

Wrap-up
- Bayesian theory offers a complete framework for building classifiers (among other applications)
- minimize the overall risk → choose the action that minimizes the conditional risk
- under the normality assumption, an exact formulation of the optimal classifier can be derived
- tight error bounds can also be computed (a numerical sketch follows)
- ...but this assumption is rarely true in practice
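As a closing illustration of the error bounds, here is a numerical sketch of the Bhattacharyya bound (the Chernoff bound at $\beta = 1/2$) for two Gaussian classes, using the standard closed form $P_{err} \leq \sqrt{P(g_1)P(g_2)}\, e^{-k(1/2)}$ as in DHS; all Gaussian parameters below are invented.

```python
import numpy as np

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])   # class means
S1 = np.array([[1.0, 0.2], [0.2, 1.0]])                 # class covariances
S2 = np.array([[1.5, 0.0], [0.0, 0.5]])
P1, P2 = 0.5, 0.5                                        # priors

S = (S1 + S2) / 2
dm = mu2 - mu1

# k(1/2): a Mahalanobis-like distance term plus a covariance-mismatch term.
k_half = (dm @ np.linalg.solve(S, dm) / 8
          + 0.5 * np.log(np.linalg.det(S)
                         / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))

bound = np.sqrt(P1 * P2) * np.exp(-k_half)   # P_err is at most this value
print(f"Bhattacharyya bound on P_err: {bound:.4f}")
```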