PA196: Pattern Recognition
1. Introduction
2. Bayesian decision theory

Dr. Vlad Popovici
popovici@recetox.muni.cz
RECETOX, Masaryk University, Brno

Before starting

Organization:
• 14 weeks: 2h lecture + 2h exercise
• 1st half of the semester: business as usual
• mid-term exam: 30% of the final mark; you need to solve at least 50% of the problems to qualify for the final exam
• 2nd half of the semester (in addition to "business as usual"): preparation of individual (small) projects; the project is required for the final exam
• problem modeling assignment
• evaluation: 30% written exam, 30% mid-term, 10% assignment, 20% project, and 10% participation

Bibliography
• [DHS] Duda, Hart, Stork: Pattern Classification. 2nd Ed. 2001
• [EST] Hastie, Tibshirani, Friedman: The Elements of Statistical Learning. 2nd Ed. (12th printing). 2017. Available in PDF: https://web.stanford.edu/~hastie/ElemStatLearn/
• Webb, Copsey: Statistical Pattern Recognition. 3rd Ed. 2011
• Kuncheva: Combining Pattern Classifiers. 2004

Outline
1 Introduction
  WHAT? (is Pattern Recognition)
  WHY? (applications)
  HOW? (different approaches)
2 Bayesian Decision Theory
  Bayes rule
  Discriminant functions
  Normal density
  Errors

WHAT?

Webster dictionary:
• Pattern: a combination of qualities, acts, tendencies, etc., forming a consistent or characteristic arrangement
• Recognition: the identification of something as having been previously seen, heard, known, etc.

• DHS: "the act of taking in raw data and taking an action based on the category of the pattern"
• Wikipedia: "Pattern recognition is nearly synonymous with machine learning. This branch of artificial intelligence focuses on the recognition of patterns and regularities in data. In many cases, these patterns are learned from labeled "training" data (supervised learning), but when no labeled data are available other algorithms can be used to discover previously unknown patterns (unsupervised learning)."

An example - from DHS (figs 1.1-1.4)
• characteristic/feature: length - [figure: histograms of counts vs. length for salmon and sea bass, with a decision threshold l*]
• characteristic/feature: lightness - [figure: histograms of counts vs. lightness for salmon and sea bass, with a decision threshold x*; a simple rule thresholds a single feature - see the short sketch at the end of this part]
• combining width and lightness features - [figure: scatter plot of width vs. lightness for salmon and sea bass, with a decision boundary in the two-dimensional feature space]

WHY?

Applications - in no particular order

Biometrics:
• face detection and recognition
• gender and age recognition
• fingerprint recognition
• speaker recognition
• ...

Human-computer interfaces:
• user detection/recognition
• gait recognition
• gesture recognition
• brainwave categorization
• ...

Other:
• biomedical research: prediction of response, target genes, etc.
• military/security: target detection, intrusion detection, etc.
• spam filtering
• optical character recognition
• natural language processing
• remote sensing
• etc.
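The following is a minimal sketch of the single-feature thresholding idea from the salmon / sea bass example above. All numbers (class means, spreads, sample sizes) are invented for illustration; only numpy is assumed.

```python
# Sketch of a single-feature threshold classifier (salmon vs. sea bass on "lightness").
# All numeric values below are hypothetical, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
lightness_salmon = rng.normal(loc=3.0, scale=1.0, size=100)    # hypothetical measurements
lightness_seabass = rng.normal(loc=6.0, scale=1.0, size=100)   # hypothetical measurements

x = np.concatenate([lightness_salmon, lightness_seabass])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])   # 0 = salmon, 1 = sea bass

# choose the threshold x* that minimizes the number of training errors
candidates = np.sort(x)
errors = [np.sum((x > t).astype(int) != y) for t in candidates]
x_star = candidates[int(np.argmin(errors))]

print(f"decision threshold x* = {x_star:.2f}, training error = {min(errors) / len(y):.2%}")
# decision rule: label a fish "sea bass" if its lightness exceeds x*, "salmon" otherwise
```

Adding a second feature (e.g. width) turns the threshold into a decision boundary in a two-dimensional feature space, as in the last figure of the example.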
HOW?

Pattern recognition systems

[diagram] environment → acquisition (sensor) → feature extraction → classification → decision

Example:
• acquisition/sensor: CCD camera
• feature extraction: in all rectangular regions of size 30 × 30, compute the Gabor wavelet decomposition in corner regions
• given all the Gabor wavelet coefficients, classify the region as a human face or "something else"

Some basic terminology
• learning: model fitting, optimization, training
• main types of problems:
  • CLASSIFICATION: the classes are known, the problem is to assign classes to the inputs; usually approached by supervised learning: samples and the corresponding labels are used in training the classifier
  • CLASS DISCOVERY: the classes are not known (maybe not even how many there are); usually approached by clustering: grouping similar inputs together
• other methods of learning:
  • semi-supervised learning: some labeled and some unlabeled data
  • reinforcement learning: there is a "teacher" telling the system when it is right or wrong

Approaches
• no clear-cut separation between method types
• statistical/Bayesian: features are random variates; estimate the PDFs and use the maximum a posteriori rule for classification; minimize the "risk" of misclassification
• geometric: find boundaries between regions of the feature space
• neural networks
• model-based: reference patterns represent the classes; a "nearest pattern" rule is used for classification
• syntactic: classes are represented by grammars
• structural: classes are represented by graphs (or similar)

Design cycle
• data collection → sample size estimation
• feature selection
• classifier design → selection of classifier(s), training, model selection
• performance estimation → errors and costs, error estimates, variability of the estimates

Other issues
• pre-processing and normalization:
  • improve stability of the models and convergence of the learning procedure
  • depend on the classifier and the application domain
• feature standardization
• detection of outliers
• detection of errors in the data

Goals of classifier design
• to build a classification model from a finite set of examples that minimizes some error measure and generalizes well
• to estimate its future performance on unseen data

A bit of formalism
• a sample or a pattern is represented as a real-valued vector: x ∈ R^d
• x = [x_1, ..., x_i, ..., x_d]^t; x_i is called a variable or feature (actually, it is a realization - a measurement - of a given variable)
• generally, there is a label g ∈ G = {g_1, ..., g_m} uniquely associated with each sample
• there is a probability P(g) that each class would be observed - the a priori probability (prior). Normally, Σ_i P(g_i) = 1.
• the probability of observing a sample x, given a class g_i, is given by the class-conditional probability density p(x|g_i)
• NOTE: P(·) is used for a probability mass function (discrete variables) and p(·) for a probability density function

Similarities, metrics, distances...
• most methods rely on an explicit or implicit distance between data points, d : R^d × R^d → R_+, with
  • d(x, z) > 0, ∀x ≠ z
  • d(x, z) = 0 ⟺ x = z
  • d(x, z) = d(z, x)
• a metric is a distance which also satisfies the triangle inequality: d(x, z) ≤ d(x, y) + d(y, z), for all x, y, z ∈ R^d
• a similarity measure is less formally defined; usually it takes large values for more alike objects
• examples: the Euclidean metric; the correlation coefficient as a similarity measure (see the sketch below)
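A minimal sketch contrasting the Euclidean metric with correlation used as a similarity measure. The random vectors are arbitrary; only numpy is assumed.

```python
# Sketch: Euclidean distance as a metric vs. Pearson correlation as a similarity.
import numpy as np

def euclidean(x, z):
    return float(np.linalg.norm(x - z))

def correlation_similarity(x, z):
    # Pearson correlation: close to 1 for "alike" profiles, regardless of scale
    return float(np.corrcoef(x, z)[0, 1])

rng = np.random.default_rng(1)
x, y, z = rng.normal(size=(3, 5))   # three arbitrary 5-dimensional points

# metric properties on this triple: symmetry and the triangle inequality
assert np.isclose(euclidean(x, z), euclidean(z, x))
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)

# correlation is invariant to shifting and scaling, so it is a similarity, not a metric
print(correlation_similarity(x, 2.0 * x + 3.0))   # ~1.0, although the points are far apart
print(correlation_similarity(x, z))               # some value in [-1, 1]
```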
Bayesian Decision Theory

Bayes rule
Let G = {g_i} be the set of classes and let X = {x_i} be a set of observed data points (i.e. the training set).
• p(x|g) is called the likelihood (function), with variable g (the class label). Note: if g is fixed, but x is considered random, we have the class-conditional density - the model for generating the observations.
• P(g) is the prior
• the goal is to find P(g|x), i.e. the posterior probability that the label for x is g
• Bayes rule:
  P(g_i|x) = p(x|g_i) P(g_i) / p(x) = p(x|g_i) P(g_i) / Σ_j p(x|g_j) P(g_j)

Bayes rule
posterior ∝ likelihood × prior
Tricky part: how to estimate the likelihood?!

Class-conditional density or likelihood function?
• ...it depends!
• consider a set of random variables X_1, ..., X_p, conditionally independent given that Θ = θ
• p_{X_i|Θ}(·|θ) is the (postulated) density model for the variable X_i: for each possible value of Θ, it describes the uncertainty about the values of X_i
• Pr{X_1 ∈ A_1, ..., X_p ∈ A_p} = ∫_{A_1 × ··· × A_p} Π_{i=1}^p p_{X_i|Θ}(x_i|θ) dx_1 ... dx_p
• once [x_1, ..., x_p] are observed (hence they are fixed!), define the function
  L_{x_1,...,x_p} : Ω → R, L_{x_1,...,x_p}(θ) = Π_{i=1}^p p_{X_i|Θ}(x_i|θ)
  and call it the likelihood function

Bayesian decision
• consider there are K classes {g_1, ..., g_K}
• there are a possible actions: {α_1, ..., α_a}
• let α(x) be the decision rule/function
• for each action-class pair there is a loss incurred: λ(α_k|g_i)
• conditional risk: R(α_k|x) = Σ_{i=1}^K λ(α_k|g_i) P(g_i|x)
• the expected risk for the rule α(x): R = ∫ R(α(x)|x) p(x) dx

Bayesian decision
α* = argmin_k R(α_k|x)
• results in the minimum expected risk
• is the best decision that one can take
• you can use this framework to build "test cases" for other classifiers

Example
• consider a classification problem (actions correspond to class assignments), with the 0-1 loss function:
  λ(g_k|g_i) = 0 if k = i, and 1 if k ≠ i
• the conditional risk becomes
  R(g_k|x) = Σ_{i=1}^K λ(g_k|g_i) P(g_i|x) = Σ_{i≠k} P(g_i|x) = 1 − P(g_k|x)
• the Bayesian decision rule becomes the maximum a posteriori (MAP) rule (a small numerical sketch is given at the end of this subsection)

Bayesian classification - Maximum A Posteriori rule
Assign x to class g_k if P(g_k|x) > P(g_i|x), ∀i ≠ k.

Discriminant functions
• d-functions: h_i(x), i = 1, ..., K
• classifier: x is assigned to g_i if h_i(x) > h_k(x), ∀k ≠ i
• ex.: h_i(x) = P(g_i|x) or h_i(x) = ln p(x|g_i) + ln P(g_i)
• different d-functions may give equivalent classifiers

Binary case
• if K = 2: binary classifier or dichotomizer
• multiclass problems can be decomposed into a series of 2-class problems
• a single discriminant function suffices: h(x) = h_1(x) − h_2(x), with the decision rule: assign x to g_1 if h(x) > 0
• this is the case we will study most of the time
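A minimal numerical sketch of the minimum conditional-risk decision discussed above, assuming the posteriors P(g_i|x) are already available. The posterior vector and the loss matrices are invented for illustration; only numpy is assumed. With the 0-1 loss the decision coincides with the MAP rule, while an asymmetric loss can change the chosen class.

```python
# Sketch: minimum conditional-risk decision and its reduction to the MAP rule.
# The posteriors and the loss matrices are invented for illustration.
import numpy as np

# posterior probabilities P(g_i | x) for one observation x, K = 3 classes
posterior = np.array([0.2, 0.5, 0.3])

# loss matrices: loss[k, i] = lambda(alpha_k | g_i); here actions = class assignments
zero_one_loss = 1.0 - np.eye(3)
asymmetric_loss = np.array([[0.0, 1.0, 1.0],
                            [2.0, 0.0, 2.0],   # wrongly declaring class 2 costs twice as much
                            [1.0, 1.0, 0.0]])

def min_risk_decision(loss, posterior):
    # conditional risk R(alpha_k | x) = sum_i loss[k, i] * P(g_i | x)
    risk = loss @ posterior
    return int(np.argmin(risk)), risk

k01, r01 = min_risk_decision(zero_one_loss, posterior)
kmap = int(np.argmax(posterior))
print(k01 == kmap)                                     # True: 0-1 loss gives the MAP decision
print(min_risk_decision(asymmetric_loss, posterior))   # here the asymmetric loss picks class 3
```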
The case of normal density

For x ∈ R^p a column vector, x ∼ N(µ, Σ):
  p(x) = 1 / ((2π)^{p/2} |Σ|^{1/2}) exp(−(1/2) (x − µ)^t Σ^{-1} (x − µ))
• µ is the mean vector and Σ is the covariance matrix (symmetric, positive definite):
  Σ = E[(x − µ)(x − µ)^t] = ∫ (x − µ)(x − µ)^t p(x) dx
• for a p × r matrix A: y = A^t x ∼ N(A^t µ, A^t Σ A)
• whitening: A_w = Φ Λ^{-1/2}, where Φ is the matrix with the eigenvectors of Σ as columns and Λ is a diagonal matrix with the corresponding eigenvalues on the diagonal
[DHS - Fig. 2.8: the action of linear transformations on N(µ, Σ): a projection P, a general linear map A giving N(A^t µ, A^t Σ A), and the whitening transform A_w giving N(A_w^t µ, I)]

Mahalanobis distance
  r^2 = (x − µ)^t Σ^{-1} (x − µ)
• if the variables/features are independent and standardized, Σ becomes the identity matrix and the Mahalanobis distance becomes the Euclidean distance
• the volume of the hyperellipsoid corresponding to a distance r is V = V_p |Σ|^{1/2} r^p, where
  V_p = π^{p/2} / (p/2)!  if p is even
  V_p = 2^p π^{(p−1)/2} ((p−1)/2)! / p!  if p is odd

Discriminant functions for normal densities
Assume p(x|g_i) ∼ N(µ_i, Σ_i). Since P(g_i|x) ∝ p(x|g_i) P(g_i), one can take a discriminant function of the form
  h_i(x) = ln p(x|g_i) + ln P(g_i)
which leads to
  h_i(x) = −(1/2) (x − µ_i)^t Σ_i^{-1} (x − µ_i) − (p/2) ln 2π − (1/2) ln |Σ_i| + ln P(g_i)
(a small numerical sketch of these discriminant functions is given after the wrap-up)

Special case: Σ_i = σ^2 I
  h_i(x) = −(1/(2σ^2)) ||x − µ_i||^2 + ln P(g_i)
which can be re-written as linear discriminant functions h_i(x) = w_i^t x + w_{i0}, with w_i ∈ R^p and
  w_i = (1/σ^2) µ_i  ← coefficients
  w_{i0} = −(1/(2σ^2)) µ_i^t µ_i + ln P(g_i)  ← threshold or bias
Decision surface: let i and j be the categories with the highest posteriors. The equation of the decision boundary is given by h_i(x) = h_j(x) and can be written as
  w^t (x − x_0) = 0, with w = µ_i − µ_j and
  x_0 = (1/2)(µ_i + µ_j) − (σ^2 / ||µ_i − µ_j||^2) ln(P(g_i)/P(g_j)) (µ_i − µ_j)
[DHS - Fig. 2.10: two Gaussian classes with Σ_i = σ^2 I in one, two and three dimensions; with equal priors P(ω1) = P(ω2) = 0.5 the decision boundary lies halfway between the two means]

Special case: Σ_i = Σ
→ the separating hyperplane is no longer orthogonal to the line between the Gaussian centers
[DHS - Fig. 2.12: two Gaussian classes with equal covariance matrices; unequal priors (e.g. P(ω1) = 0.1, P(ω2) = 0.9) shift the decision boundary towards the mean of the less probable class]

General case: Σ_i arbitrary
[DHS - Fig. 2.14: with arbitrary covariance matrices the decision boundaries are general hyperquadrics, and the decision regions need not be connected]

Errors
• error: predicting the wrong class
• consider the binary classification problem
• P_err = P(x ∈ R_2, g_1) + P(x ∈ R_1, g_2)
  where x_B is the optimal (Bayes) decision threshold and x* is another decision threshold. So
  P_err = ∫_{R_2} p(x|g_1) P(g_1) dx + ∫_{R_1} p(x|g_2) P(g_2) dx
• the minimum of P_err is obtained for x* = x_B
• to compute P_err we need the class-conditional probabilities (see the numerical sketch at the end)
• for Gaussian densities and binary classification, one can show that
  P_err ≤ exp(−ψ(β, µ_{1,2}, Σ_{1,2}))
  where β can be optimized to minimize the right-hand side (the Chernoff bound)
• for β = 1/2 one obtains the Bhattacharyya bound

Wrap-up
• Bayesian theory offers a complete framework for building classifiers (among other applications)
• minimize the overall risk → choose the action that minimizes the conditional risk
• under the normality assumption, an exact formulation of the optimal classifier can be derived
• tight error bounds can also be computed
• ...but this assumption is rarely true in practice
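As a closing illustration, here is a minimal sketch of the quadratic discriminant h_i(x) for Gaussian class-conditional densities derived above. The means, covariance matrices and priors are invented for illustration; only numpy is assumed.

```python
# Sketch: discriminant functions for normal class-conditional densities.
# h_i(x) = -1/2 (x-mu_i)^t Sigma_i^{-1} (x-mu_i) - p/2 ln(2 pi) - 1/2 ln|Sigma_i| + ln P(g_i)
import numpy as np

def gaussian_discriminant(x, mu, sigma, prior):
    p = len(mu)
    diff = x - mu
    mahal2 = diff @ np.linalg.solve(sigma, diff)   # squared Mahalanobis distance to the class mean
    return (-0.5 * mahal2 - 0.5 * p * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma)) + np.log(prior))

# two hypothetical 2-D classes with different covariance matrices (general case)
mu = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
sigma = [np.eye(2), np.array([[1.5, 0.4], [0.4, 0.8]])]
prior = [0.6, 0.4]

x = np.array([1.2, 0.9])
h = [gaussian_discriminant(x, mu[i], sigma[i], prior[i]) for i in range(2)]
print("h values:", h, "-> assign x to class", int(np.argmax(h)) + 1)
```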
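A second small sketch, for the error discussion: it numerically checks that, for two hypothetical 1-D Gaussian classes, the error P_err(t) of a threshold rule is minimized at the Bayes threshold x_B (with equal priors and equal variances, the midpoint of the means). The parameters are invented; scipy.stats is assumed available for the Gaussian tail probabilities.

```python
# Sketch: P_err(t) = P(g1) Pr(x > t | g1) + P(g2) Pr(x <= t | g2) is minimized at x_B.
import numpy as np
from scipy.stats import norm

mu1, s1, P1 = 0.0, 1.0, 0.5   # hypothetical class g1
mu2, s2, P2 = 2.0, 1.0, 0.5   # hypothetical class g2

def p_err(t):
    # region R2 = (t, inf) is assigned to g2, region R1 = (-inf, t] to g1
    return P1 * norm.sf(t, mu1, s1) + P2 * norm.cdf(t, mu2, s2)

ts = np.linspace(-3.0, 5.0, 2001)
t_star = ts[np.argmin([p_err(t) for t in ts])]
# with equal priors and variances, x_B is the midpoint of the two means
print(f"empirical optimum ~ {t_star:.3f}, Bayes threshold x_B = {(mu1 + mu2) / 2:.3f}")
print(f"minimum P_err ~ {p_err(t_star):.4f}")
```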