PA196: Pattern Recognition
04. Classifier performance: parameters, estimation and comparison

Dr. Vlad Popovici (popovici@iba.muni.cz)
Institute of Biostatistics and Analyses, Masaryk University, Brno

Specific bibliography

- W.J. Krzanowski, D.J. Hand: ROC Curves for Continuous Data. CRC Press, 2009
- M.S. Pepe: The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford Univ Press, 2003
- N. Japkowicz, M. Shah: Evaluating Learning Algorithms: A Classification Perspective. Cambridge Univ Press, 2011

Outline

1 Performance parameters (introduction; parameters for binary classifiers; parameters for continuous outputs)
2 Performance estimation
3 Performance comparison
4 An example

Performance parameters

Introduction

Context

- binary classifiers: Y(x) ∈ {0, 1} is the predicted class label
- Y is usually obtained from some discriminant function h(x) ∈ R: Y = I[h(x) ≥ θ]
- h(x) (be it a margin, a posterior probability, etc.) can be interpreted as a score
- let C be the true label (0 or 1): the gold standard or ground truth
- we assume a symmetric loss

In medical applications...

- a classifier is often called a test
- the class of interest usually refers to an abnormal condition (e.g. "diseased")
- a "positive test" indicates that the abnormal condition is predicted
- types of tests:
  - diagnostic: detect the presence of a disease
  - prognostic: predict a clinical outcome (e.g. "recurrence" vs "non-recurrence")
  - screening: a test applied to a large population to detect an abnormal condition with low prevalence; usually followed by other tests

Confusion matrix

                        Gold standard
                   C = 0             C = 1
  Y = 0       true negative     false negative
  Y = 1       false positive    true positive

Goal: estimate the conditional and marginal probabilities of this table.
Confusion matrix (in terms of probabilities)

                        Gold standard
                   C = 0                      C = 1
  Y = 0    true negative: P[Y=0|C=0]   false negative: P[Y=0|C=1]   P[Y=0]
  Y = 1    false positive: P[Y=1|C=0]  true positive:  P[Y=1|C=1]   P[Y=1]
                   P[C=0]                     P[C=1]

- estimation is based on a finite test sample {(Y_i, C_i) | i = 1, ..., n} drawn i.i.d. from the population
- the probabilities are estimated as fractions/proportions computed from the test sample

Confusion matrix based on the test sample:

                        Gold standard
                   C = 0      C = 1
  Y = 0           n^-_C̄       n^-_C       n^-
  Y = 1           n^+_C̄       n^+_C       n^+
                   n_C̄         n_C         n

where the subscript C indicates the "positive class" (C = 1), C̄ the "negative class" (C = 0), and the superscript the predicted label.

Notes on the sampling of the test set. The most frequent ways of selecting the test set are:

- i.i.d. sampling from the underlying distribution: the test sample then also approximates the class priors (prevalence) well; in clinical studies this is called a "cohort study"
- "case-control" sampling: a fixed number of positive (cases) and negative (controls) samples are randomly selected from the population, so the class priors are not respected

In the following, i.i.d. sampling is implied unless stated otherwise.
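A minimal Python sketch (illustrative, not from the lecture; function name and toy data are mine) of how these counts can be tabulated from a test sample of predicted labels Y and gold-standard labels C:

```python
import numpy as np

def confusion_counts(y_pred, c_true):
    """Counts (tn, fn, fp, tp) for 0/1 predictions vs. 0/1 gold standard."""
    y_pred, c_true = np.asarray(y_pred), np.asarray(c_true)
    tn = int(np.sum((y_pred == 0) & (c_true == 0)))   # n^-_Cbar
    fn = int(np.sum((y_pred == 0) & (c_true == 1)))   # n^-_C
    fp = int(np.sum((y_pred == 1) & (c_true == 0)))   # n^+_Cbar
    tp = int(np.sum((y_pred == 1) & (c_true == 1)))   # n^+_C
    return tn, fn, fp, tp

# toy i.i.d. test sample; in a cohort design the column totals also estimate the prevalence
y = [0, 1, 1, 0, 1, 0, 0, 1]
c = [0, 1, 0, 0, 1, 1, 0, 1]
print(confusion_counts(y, c))   # (3, 1, 1, 3)
```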
Performance parameters for binary classifiers

Basic performance parameters

- a performance parameter P is a random variable and we only estimate it as P̂; to simplify notation, however, we write P even when referring to its estimate, the meaning being clear from the context
- error rate (proportion of misclassified samples): Err = P[Y ≠ C], estimated by (n^+_C̄ + n^-_C) / n
- false positive fraction: FPF = P[Y = 1 | C = 0], estimated by n^+_C̄ / (n^+_C̄ + n^-_C̄) (aka 1 − Specificity (Sp))
- true positive fraction: TPF = P[Y = 1 | C = 1], estimated by n^+_C / (n^+_C + n^-_C) (aka Sensitivity (Se))

Let ρ = P[C = 1] be the prevalence of the positive cases; then

  Err = ρ(1 − TPF) + (1 − ρ) FPF

- Positive Predictive Value: PPV = P[C = 1 | Y = 1], estimated by n^+_C / (n^+_C + n^+_C̄)
- Negative Predictive Value: NPV = P[C = 0 | Y = 0], estimated by n^-_C̄ / (n^-_C̄ + n^-_C)
- perfect classifier/test: PPV = NPV = 1
- totally uninformative classifier/test: PPV = ρ, NPV = 1 − ρ

  PPV = \frac{ρ \, TPF}{ρ \, TPF + (1 − ρ) FPF}
  NPV = \frac{(1 − ρ)(1 − FPF)}{(1 − ρ)(1 − FPF) + ρ(1 − TPF)}

- in information-retrieval applications, recall stands for TPF and precision stands for PPV
- F-measure (weighted combination of precision and recall):

  F_α = \frac{(1 + α)(\text{precision} × \text{recall})}{α × \text{precision} + \text{recall}}

  (for α = 1 this is the usual F1 = 2 × precision × recall / (precision + recall))
- Matthews correlation coefficient:

  MCC = \frac{n^+_C n^-_{\bar C} − n^+_{\bar C} n^-_C}{\sqrt{(n^+_C + n^+_{\bar C})(n^+_C + n^-_C)(n^-_{\bar C} + n^+_{\bar C})(n^-_{\bar C} + n^-_C)}}
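A minimal Python sketch (the helper name is mine) turning the confusion-matrix counts into the point estimates defined above; the counts are those of the breast-cancer pCR example shown later in the lecture:

```python
import math

def basic_metrics(tn, fn, fp, tp):
    n = tn + fn + fp + tp
    return {
        "Err": (fp + fn) / n,                      # error rate
        "TPF": tp / (tp + fn),                     # sensitivity / recall
        "FPF": fp / (fp + tn),                     # 1 - specificity
        "PPV": tp / (tp + fp),                     # precision
        "NPV": tn / (tn + fn),
        "MCC": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# counts from the pCR example: 61 TN, 5 FN, 24 FP, 10 TP
print(basic_metrics(61, 5, 24, 10))   # TPF ~ 0.67, FPF ~ 0.28, PPV ~ 0.29, NPV ~ 0.92
```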
Correcting for chance...

- Example 1: let the prevalence of positive cases be ρ = 0.75 and consider a classifier that predicts "1" or "0" with equal probabilities (a flip of a coin); simply by chance it will correctly label a positive case in 0.5 × 0.75 = 0.375 of all cases
- Example 2 (medical imaging): the true labels are not known, but an expert produces one labelling and the classifier produces another; how can we compare the two while taking into account the concordances due to mere chance?

...using agreement statistics

- probability of observed agreement between the classifier and the true labels:

  P_o = \frac{n^-_{\bar C} + n^+_C}{n}

- the S-coefficient is defined as S = 2 P_o − 1
- taking the chance agreement P_e into account: what is the ratio between the difference between observed and expected chance agreement (P_o − P_e) and the maximum possible agreement beyond chance (1 − P_e):

  \frac{P_o − P_e}{1 − P_e}

- if the chance agreement is estimated as

  P_e = \left(\frac{n^+ + n_C}{2n}\right)^2 + \left(\frac{n^- + n_{\bar C}}{2n}\right)^2

  the ratio (P_o − P_e)/(1 − P_e) is denoted π and called Scott's π coefficient
- if the chance agreement is estimated as

  P_e = \frac{n^+ n_C}{n^2} + \frac{n^- n_{\bar C}}{n^2}

  the ratio is denoted κ and called Cohen's kappa coefficient
- in medical applications, κ is usually used for measuring the agreement between an expert and another system

Confidence intervals (CI)

- we need ways to characterize the uncertainty in the estimates
- informally, a CI is a measure of the reliability of an estimate; the sample-based (observed) confidence level describes how often such an interval contains the true value of the estimated parameter
- the values within the CI can be seen as alternative estimates of the parameter of interest
- the smaller the sample size, the larger the CI
- the estimates of (TPF, FPF) and (PPV, NPV) are proportions of successes in Bernoulli trials

(Reminder: a Bernoulli trial is an experiment with a random binary outcome. The binomial distribution is the discrete distribution of the number of successes in n independent Bernoulli trials with success probability p:

  X ~ B(n, p):  P[X = k] = \binom{n}{k} p^k (1 − p)^{n−k},  E[X] = np,  Var[X] = np(1 − p),

and, as n → ∞, (X − np)/\sqrt{np(1 − p)} ~ N(0, 1).)

Simplest CI, via the normal approximation: a 1 − α CI for the binomial parameter p (the proportion of successes, between 0 and 1, in n trials) is

  p ± z_{1−α/2} \sqrt{\frac{p(1 − p)}{n}}

where z_{1−α/2} is the 1 − α/2 percentile of the standard normal distribution (e.g. for a 95% CI, α = 0.05 and z_{0.975} ≈ 1.96).

Warning: the normal approximation is poor for FPF or TPF close to 0 or 1.
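A minimal sketch of the normal-approximation (Wald) interval above, applied to the TPF of the breast-cancer pCR example that appears a bit later (10 of the 15 true positives predicted correctly); as the warning says, the approximation degrades near 0 or 1 and for small n:

```python
from scipy.stats import norm

def wald_ci(successes, n, alpha=0.05):
    p = successes / n
    z = norm.ppf(1 - alpha / 2)                    # ~1.96 for a 95% CI
    half = z * (p * (1 - p) / n) ** 0.5
    return p - half, p + half

print(wald_ci(10, 15))   # about (0.43, 0.91), cf. the (0.428, 0.905) reported later
```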
Agresti–Coull 1 − α CI: let n be the number of trials and X the number of successes; define

  ñ = n + z_{1−α/2}^2,   p̃ = \frac{1}{ñ}\left(X + \frac{1}{2} z_{1−α/2}^2\right);

then a good approximation of the CI is

  p̃ ± z_{1−α/2} \sqrt{\frac{p̃(1 − p̃)}{ñ}}

Other formulas for CIs: Wilson score interval, Clopper–Pearson interval, Bayesian CIs.

Example: a test for predicting pCR (pathologic complete response) in breast cancer yields

                 pCR = 0    pCR = 1
  predicted 0       61          5
  predicted 1       24         10

  TPF = 0.67, FPF = 0.28, PPV = 0.29, NPV = 0.92

95% confidence intervals:

- normal approximation: FPF ∈ (0.197, 0.391), TPF ∈ (0.428, 0.905)
- Wilson: FPF ∈ (0.208, 0.398), TPF ∈ (0.417, 0.848)
- Bayesian: FPF ∈ (0.205, 0.397), TPF ∈ (0.416, 0.860)

Joint confidence intervals

[Figure: univariate confidence intervals for FPF and TPF drawn in the (FPF, TPF) plane]

What is the joint 100(1 − α)% confidence region for (FPF, TPF)?

Rectangular confidence regions

If (P_low, P_up) and (Q_low, Q_up) are the 1 − α* univariate confidence intervals for two independent binomial random variables P and Q, then the rectangle R ≡ (P_low, P_up) × (Q_low, Q_up) is a (1 − α) confidence region for (P, Q), where α = 1 − (1 − α*)^2.
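A minimal sketch (helper names are mine) of the Agresti–Coull interval and of the rectangular joint region for (FPF, TPF), using the pCR counts above; for a 95% joint region the univariate level is raised to roughly 97.5%, as the examples below state:

```python
from scipy.stats import norm

def agresti_coull_ci(successes, n, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    n_t = n + z ** 2                                # n-tilde
    p_t = (successes + 0.5 * z ** 2) / n_t          # p-tilde
    half = z * (p_t * (1 - p_t) / n_t) ** 0.5
    return p_t - half, p_t + half

def joint_region(fp, n_neg, tp, n_pos, alpha=0.05):
    # choose the univariate level alpha* so that (1 - alpha*)^2 = 1 - alpha
    alpha_star = 1 - (1 - alpha) ** 0.5
    return (agresti_coull_ci(fp, n_neg, alpha_star),    # FPF side of the rectangle
            agresti_coull_ci(tp, n_pos, alpha_star))    # TPF side of the rectangle

print(joint_region(24, 85, 10, 15))   # 95% rectangular region for (FPF, TPF)
```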
Examples:

- 95% univariate CIs lead to a 90.25% confidence region
- for a 95% confidence region, 97.5% univariate CIs are needed

Performance parameters for continuous outputs

A motivating example: using (FPF, TPF) for comparing tests

[Figure: the (FPF, TPF) plane (FPF = 1 − Sp, TPF = Se) partitioned into four regions A–D around the operating point of the current test]

- a single-point performance measure partitions the space into 4 regions relative to the current test
- region A: better than the current test
- region D: worse than the current test
- regions B, C: less clear-cut

Other issues with single-point performance metrics:

- difficulty in selecting the optimal threshold: different contexts may need different operating regimes
- additive batch effects may spoil the single-point performance

ROC curves: theory

[Figure: score densities of the negative and positive classes with a threshold t, and the corresponding ROC curve in the (FPF, TPF) plane]

- continuous test score Y = S(x) (could be the margin h(x))
- FPF(t) = P[Y ≥ t | C = 0], TPF(t) = P[Y ≥ t | C = 1]
- ROC = {(FPF(t), TPF(t)) | t ∈ R}
- lim_{t→∞} FPF(t) = lim_{t→∞} TPF(t) = 0 and lim_{t→−∞} FPF(t) = lim_{t→−∞} TPF(t) = 1

Properties of the ROC curve:

- it is a monotone increasing function
- it is invariant to strictly increasing transformations of the scores, Y = ψ(S(x))
- parametric form: ROC = {(t, TPF(FPF^{-1}(t))) | t ∈ (0, 1)}
- ROC(0) = 0, ROC(1) = 1, and

  \frac{\partial ROC(t)}{\partial t} = \frac{f_C(FPF^{-1}(t))}{f_{\bar C}(FPF^{-1}(t))},

  where f_C and f_{\bar C} are the probability densities of the scores in the diseased and healthy populations, respectively
- the ROC curve describes the relationship between the two score distributions, independently of the scale on which the scores are measured
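A minimal sketch of the definition above: the ROC curve traced by sweeping the threshold t over the observed scores (synthetic normal scores are used purely for illustration):

```python
import numpy as np

def roc_points(scores, labels):
    """(FPF(t), TPF(t)) for t running from +inf down through the observed scores."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    thresholds = np.r_[np.inf, np.sort(np.unique(scores))[::-1]]
    fpf = np.array([np.mean(scores[labels == 0] >= t) for t in thresholds])
    tpf = np.array([np.mean(scores[labels == 1] >= t) for t in thresholds])
    return fpf, tpf            # starts at (0, 0) and ends at (1, 1)

rng = np.random.default_rng(0)
s = np.r_[rng.normal(1.0, 1.0, 50), rng.normal(0.0, 1.0, 50)]   # diseased, then healthy
y = np.r_[np.ones(50, int), np.zeros(50, int)]
fpf, tpf = roc_points(s, y)
```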
Note that

  \frac{\partial ROC(t)}{\partial t} = \frac{P[Y = t | C = 1]}{P[Y = t | C = 0]} = LR(t),

the likelihood ratio at threshold t.

- if LR is monotonically increasing, then a classification rule of the form LR > t is optimal
- the ROC curve based on LR lies uniformly above all other ROC curves
- the optimal ROC curve is concave ⇒ its slope is a monotone decreasing function

Summary indices

Area under the ROC curve (AUC):

  AUC = \int_0^1 ROC(t) \, dt

Properties:

- 0.5 ≤ AUC ≤ 1
- AUC = P[Y_C > Y_{\bar C}]: the probability of correctly ordering a random pair of cases (the Mann–Whitney–Wilcoxon U-statistic)
- equivalently,

  AUC = \int_0^1 TPF(FPF^{-1}(t)) \, dt = -\int_{-\infty}^{\infty} TPF(t) \, d\,FPF(t)

[Figure: geometric sketch, using a thin strip between thresholds t and t − d, of why the area under the curve equals P[Y_C > Y_{\bar C}]]

The binormal ROC model:

  ROC(t) = Φ(α + β Φ^{-1}(t)),  β > 0,

where Φ is the standard normal CDF.

Properties:

- AUC = Φ(α / \sqrt{1 + β^2})
- binormal assumption: there exists some monotone, strictly increasing function h(·) which makes Y_C and Y_{\bar C} normally distributed
- if the ROC is binormal, ROC(t) = Φ(α + β Φ^{-1}(t)), then h(s) = −Φ^{-1}(FPF(s)) transforms the scores Y_C and Y_{\bar C} into normally distributed random variables
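A minimal sketch of the binormal model, with illustrative parameter values, checking numerically that the area under ROC(t) = Φ(α + βΦ⁻¹(t)) matches the closed form Φ(α/√(1+β²)):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

a, b = 1.2, 1.0                                     # illustrative binormal (alpha, beta)

def binormal_roc(t):
    return norm.cdf(a + b * norm.ppf(t))

auc_closed_form = norm.cdf(a / np.sqrt(1 + b ** 2))
auc_numeric, _ = quad(binormal_roc, 0.0, 1.0)       # integral_0^1 ROC(t) dt

print(auc_closed_form, auc_numeric)                 # the two agree up to integration error
```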
Empirical estimates of the ROC

ROC_e(t) = TPF̂(FPF̂^{-1}(t)), with

  TPF̂(t) = \frac{1}{n_C} \sum_{i=1}^{n_C} I[Y_{Ci} ≥ t],   FPF̂(t) = \frac{1}{n_{\bar C}} \sum_{i=1}^{n_{\bar C}} I[Y_{\bar C i} ≥ t]

[Figure: empirical ROC estimate, a step function in the (false positive rate, true positive rate) plane]

"ROC" for a single threshold

[Figure: a single (FPF, TPF) point in the ROC plane]

Empirical estimate of the AUC (Mann–Whitney–Wilcoxon U-statistic):

  AUC_e = \frac{1}{n_C n_{\bar C}} \sum_{i=1}^{n_C} \sum_{j=1}^{n_{\bar C}} \left( I[Y_{Ci} > Y_{\bar C j}] + 0.5\, I[Y_{Ci} = Y_{\bar C j}] \right)

Note: if only one point in the (FPF, TPF) space is given, AUC = 0.5(1 + TPF − FPF).

AUC: sampling variability

  Var(AUC_e) = \frac{1}{n_C n_{\bar C}} \left[ AUC(1 − AUC) + (n_C − 1)(Q_1 − AUC^2) + (n_{\bar C} − 1)(Q_2 − AUC^2) \right]

where

  Q_1 = P[Y_{Ci} ≥ Y_{\bar C j}, Y_{Ck} ≥ Y_{\bar C j}],   Q_2 = P[Y_{Ci} ≥ Y_{\bar C j}, Y_{Ci} ≥ Y_{\bar C k}]

Semi-parametric models

Start from ROC(t) = TPF(FPF^{-1}(t | α) | β) and assume parametric forms for TPF and FPF, whose parameters are estimated from the data.

Example of a semi-parametric model:

  Y_{Ci} = µ_C + σ_C ε_i,   Y_{\bar C i} = µ_{\bar C} + σ_{\bar C} ε_i,

where the ε have mean 0 and variance 1 and follow some distribution with survival function S, estimated by pooling the standardized scores:

  S(t) = \frac{1}{n_C + n_{\bar C}} \left( \sum_i I\!\left[\frac{Y_{Ci} − µ_C}{σ_C} ≥ t\right] + \sum_i I\!\left[\frac{Y_{\bar C i} − µ_{\bar C}}{σ_{\bar C}} ≥ t\right] \right),

which leads to

  ROC(t) = S\!\left( \frac{µ_{\bar C} − µ_C}{σ_C} + \frac{σ_{\bar C}}{σ_C} S^{-1}(t) \right)

Example: empirical vs. semi-parametric estimation

[Figure: score densities of the non-diseased and diseased groups, and the corresponding empirical and semi-parametric ROC estimates]

  AUC_e ≈ 0.7475;  AUC_sp ≈ 0.7418
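A minimal sketch of the Mann–Whitney–Wilcoxon form of AUC_e, with ties between a diseased and a healthy score counted as 1/2 (toy scores for illustration):

```python
import numpy as np

def auc_mww(scores_diseased, scores_healthy):
    sp = np.asarray(scores_diseased, float)[:, None]   # Y_C
    sn = np.asarray(scores_healthy, float)[None, :]    # Y_Cbar
    return float((sp > sn).mean() + 0.5 * (sp == sn).mean())

print(auc_mww([2.1, 1.4, 0.9, 3.0], [0.2, 1.4, -0.5, 0.7, 0.1]))   # 0.925
```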
Performance estimation

Why estimation?

- the training data is finite
- there is no formula for the CI of the performance without distributional assumptions
- often, a single data set is available for both model building and performance measurement
- performance estimated on the modeling data is optimistically biased

Idea: split (maybe repeatedly) the available data into a training and a validation set, and assess the performance only on the data that has not been used in building the model.

WARNING: all the processing steps that depend on the sampling and which lead to the final model MUST BE REPEATED IDENTICALLY ON EVERY TRAIN-VALIDATION SPLIT! This includes, but is not limited to: data normalization, feature selection, classifier training, meta-parameter optimization.

Notes:

- any two training sets generated from the full data set by resampling will usually overlap to some extent → the models are not totally independent
- the variability is usually under-estimated
- the procedure is easy to parallelize, but attention must be paid to the parallel RNG (to avoid repeating the same random sequences)

Resampling methods

- simple split-sample approach
- k-fold cross-validation
- Monte-Carlo cross-validation
- repeated k-fold cross-validation
- leave-one-out
- bootstrapping
- ...

k-fold cross-validation

- keeps the train and test sets separated
- randomly divide the data into k subsets (folds); you may also choose to enforce the class proportions in each fold (stratified CV)
- train on k − 1 folds and test on the held-out fold
- estimate the error as the average error measured on the held-out folds

[Figure: the data split into k folds, one fold held out as the test set, the remaining folds forming the train set]

- usually k = 5 or k = 10
- if k = n ⇒ the leave-one-out estimator
- improved estimation: repeated k-CV (e.g. 100 × (5-CV))

From the k folds one obtains the errors ε_1, ..., ε_k on the test folds (or any other performance parameter):

  Ê_{k-CV} = \frac{1}{k} \sum_{j=1}^{k} ε_j,

together with an estimated standard deviation. Confidence intervals (simple version, normal approximation):

  E ≈ Ê ± \left( \frac{0.5}{n} + z \sqrt{\frac{Ê(1 − Ê)}{n}} \right),

where n is the dataset size and z = Φ^{-1}(1 − α/2) for a 1 − α confidence interval (e.g. z ≈ 1.96 for a 95% CI).

Bootstrap error estimation

[Figure: the original dataset (X, Y) resampled into bootstrap datasets (X_1, Y_1), ..., (X_B, Y_B), each yielding an error estimate E_1, ..., E_B]

1 generate a new dataset (X_b, Y_b) by resampling with replacement from the original dataset (X, Y)
2 train the classifier on (X_b, Y_b) and test it on the left-out data, to obtain an error Ê_b
3 repeat steps 1–2 for b = 1, ..., B and collect the Ê_b

Estimate the error: for example, use the .632 estimator

  Ê = 0.368\, E_0 + 0.632 \frac{1}{B} \sum_{b=1}^{B} Ê_b,

where E_0 is the error rate on the full training set (X, Y). The empirical distribution of the Ê_b can be used to obtain confidence intervals.
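A minimal sketch of the .632 bootstrap above; `clf` is assumed to be any classifier with scikit-learn-style fit/predict methods, and X, y numpy arrays. The collected out-of-bag errors can also be used for the percentile confidence intervals just mentioned.

```python
import numpy as np

def bootstrap_632_error(clf, X, y, B=200, seed=None):
    rng = np.random.default_rng(seed)
    n = len(y)
    clf.fit(X, y)
    e0 = np.mean(clf.predict(X) != y)             # apparent (resubstitution) error E_0
    errs = []
    for _ in range(B):
        idx = rng.integers(0, n, n)               # resample with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # samples left out of the bootstrap set
        if oob.size == 0:
            continue
        clf.fit(X[idx], y[idx])
        errs.append(np.mean(clf.predict(X[oob]) != y[oob]))
    errs = np.array(errs)
    return 0.368 * e0 + 0.632 * errs.mean(), errs
```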
LPO (leave-pair-out) bootstrap

Classification rule: decide C if ĥ(x) ≥ θ and C̄ otherwise, where ĥ is the estimated log-likelihood ratio and C and C̄ are the class labels.

The empirical AUC (conditioned on the training set) can be approximated by

  AUC = \frac{1}{n_1 n_2} \sum_{j=1}^{n_2} \sum_{i=1}^{n_1} ψ\left( ĥ(x_i | C), ĥ(x_j | \bar C) \right),

where ψ is the Mann–Whitney kernel:

  ψ(a, b) = 1 if a > b;  1/2 if a = b;  0 if a < b.

Estimation of the expected AUC by the LPO bootstrap:

  AUC_{LPO} = \frac{1}{n_1 n_2} \sum_{j=1}^{n_2} \sum_{i=1}^{n_1} \overline{AUC}_{i,j},
  \overline{AUC}_{i,j} = \frac{\sum_{b=1}^{B} I_i^b I_j^b \, ψ(ĥ_b(x_i), ĥ_b(x_j))}{\sum_{b=1}^{B} I_i^b I_j^b},

where I_i^b indicates that sample i was left out of the b-th bootstrap training set.

When two independent data sets are available, one can estimate:

- the expected value of the conditional AUC: the expectation over the population of training sets of the same size;
- the variability of the performance estimate due to the finite train set;
- the variability of the performance estimate due to the finite validation sets.

Yousef et al., Assessing classifiers from two independent data sets using ROC analysis: a nonparametric approach, IEEE PAMI, 2006.

Conclusions

What we do learn from CV (and related methods):

- the expected performance of the modeling recipe;
- the imprecision in estimating the performance;
- and we can have a look at:
  - which features are the most stable
  - which points are always misclassified

What we do not learn from CV:

- the best features
- the best classifier
- the best meta-parameters

We obtain these by training on the full dataset (no CV).

Performance comparison

General considerations:

- are we comparing methods/algorithms or models?
- let there be two models M1 and M2 and a performance parameter P
- what differences are relevant?
- proper planning of the experimental design is needed
- hypothesis testing (equivalence/difference and inferiority/superiority):
  - H0: there is no difference in performance, P(M1) = P(M2)
  - H1: P(M1) ≠ P(M2) (two-sided test), or H1: P(M1) < P(M2) or P(M1) > P(M2) (one-sided, i.e. inferiority/superiority test)
- informally, one can check the overlap between the CIs
- ideally, one would have a very large test set for the comparison
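As a preview of the resampling-based comparison discussed on the next slides, a minimal sketch (illustrative fold-wise AUCs, names are mine) of the paired tests used there; keep in mind the caveat below about the underestimated variability and the resulting inflated Type I error:

```python
from scipy.stats import ttest_rel, wilcoxon

p1 = [0.78, 0.74, 0.81, 0.76, 0.79]   # model M1 on K = 5 test folds
p2 = [0.72, 0.73, 0.77, 0.74, 0.75]   # model M2 on the same folds

print(ttest_rel(p1, p2))              # paired t-test
print(wilcoxon(p1, p2))               # Wilcoxon signed-rank test
```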
In everyday applications...

- one has limited data → use resampling (like cross-validation) for the testing
- let P_11, ..., P_1K be the performance of the 1st model on the K test sets and P_21, ..., P_2K the performance of the 2nd model on the same K test sets
- simple tests: the paired t-test and the Wilcoxon signed-rank test
- warning: the variability is underestimated, hence the t-test has an inflated Type I error; there is a "corrected t-test" that alleviates the problem
- note that the two samples {P_1·} and {P_2·} are not independent!

McNemar's test

Consider a single test set of size m, on which both models are applied, and construct the following contingency table:

                          Model M2
                      wrong (0)   correct (1)
  Model M1  wrong (0)     c00         c01
            correct (1)   c10         c11

where

- c00 counts how many times both models misclassified the same sample
- c11 counts how many times both models correctly classified the same sample
- c10 and c01 count how many times M1 correctly classified a sample that M2 misclassified, and vice versa

McNemar's test of H0: both classifiers have the same performance (the same error rates):

- construct the test statistic

  χ²_Mc = \frac{(|c_{01} − c_{10}| − 1)^2}{c_{01} + c_{10}}

- χ²_Mc has an approximate χ² distribution with 1 degree of freedom
- χ²_Mc is compared with the χ²_{1, 1−α} quantile for a test at significance level α
- rule of thumb: the test needs a sample large enough that c01 + c10 is at least 30

Wrap-up

- there are many performance parameters; the choice depends on the intended usage
- performance estimation is a key step of the classifier building process
- pay attention to the proper application of resampling methods for performance estimation
- always (ALWAYS!) report the uncertainty in the estimates
- classifier performance comparison depends, again, on the intended application
- McNemar's test and confidence intervals provide indications of performance differences

An example

MAQC-II: ~300 participants from 5 countries:

- data providers
- data analysis teams (DATs): 36 teams (~100 people)
- a regulatory board (mainly FDA)

- 6 datasets, 13 endpoints, > 30,000 "models"
- each Data Analysis Plan (DAP) is peer-reviewed
- each DAT selects a single candidate model for each endpoint
- the MAQC-II consortium selects 2 models for each endpoint, before the release of the validation sets
DAP constraints:

- should be generally applicable, independent of dataset/endpoint
- trade-off: understandability/reproducibility vs. performance/complexity
- the models should make single-chip predictions

Solution:

- use MAS5 for normalization
- favor "simple" classifiers
- nested 10 × 5-CV
- use AUC as the main performance criterion

[Figure: the modeling pipeline – quality control, normalization (MAS5), feature selection/ranking, classifier design and model optimization, CV-based performance estimation, and a test on external data]

- classifiers: DLDA, LDA, k-NN, CART, logistic regression
- meta-parameters: number of features, k, ...
- inner CV: optimize the meta-parameters
- outer CV: estimate the performance of the system

Some performance results

[Figure: AUC estimated by 10 × 5-CV vs. validation AUC, per endpoint A–M]

Estimation bias

[Figure: difference between CV and validation AUC (deltaAUC), by endpoint and by organization]