PA196: Pattern Recognition 08. Multiple classifier systems (cont’d) Dr. Vlad Popovici popovici@recetox.muni.cz RECETOX Masaryk University, Brno General idea X X1 X2 XT Û Ø Ñ ÓÖ ØÝ ÚÓØ H = t wtht ×Ù × Ø× Ó Ü ÑÔÐ × ×Ù × Ø× Ó ØÙÖ × h1 h2 hT ½ n d Outline 1 Bagging 2 Random forests 3 AdaBoost Introduction Basic AdaBoost Different views on AdaBoost An additive logistic regression perspective Bagging (Breiman, 1996) • bagging = bootstrap aggregation • create T bootstrap samples Xt by sampling with replacement • train a classifier on each Xt • aggregate the classifications by plurality voting to obtain the aggregated classifier H • a similar approach works for regression • works well with unstable classifiers (with high variance): decision trees, neural networks Why does bagging work? • reduces variance (due to sampling in the test sets): E[(y − H(x))2 ] = (y − E[H(x)])2 + E[(H(x) − E[(H(x))])2 ] = bias2 + variance • the bias of H remains approximately the same as for ht : Bias(H) = 1 T T t=1 Bias(ht ) • but the variance is reduced: Var(H) ≈ 1 T Var(h1) Variants: • draw random subsamples of data → "Pasting" • draw random subsets of features → "Random Subspaces" • draw random subsamples and random features → "Random Patches" Outline 1 Bagging 2 Random forests 3 AdaBoost Introduction Basic AdaBoost Different views on AdaBoost An additive logistic regression perspective Idea: induce randomness in the base classifier (tree) and combine the predictions of an ensemble of such trees (forest) by averaging or majority vote. • refinement of bagging trees • (1st level of randomness) grow the trees on bootstrap samples • (2nd level of randomness) when growing a tree, at each node consider only a random subset of features (typically √ d or log2 d features) • for each tree, the error rate for observation left out from the learning set is monitored ("out-of-bag" error rate) • the result is a collection of "de-correlated" trees that by averaging/voting should lead to decreased variance of the final predictor Outline 1 Bagging 2 Random forests 3 AdaBoost Introduction Basic AdaBoost Different views on AdaBoost An additive logistic regression perspective Outline 1 Bagging 2 Random forests 3 AdaBoost Introduction Basic AdaBoost Different views on AdaBoost An additive logistic regression perspective General approach (I will follow Freund & Schapire’s tutorial on boosting) • let S = {(xi, yi)|i = 1, . . . , n} be a data set with yi ∈ {±1} and xi a d−dimensional vector of features xij • let there exist a learner (or a few) able to produce some basic classifiers ht , based on sets such as S • ht will be called "weak classifiers" and the condition is that Err(ht ) = 0.5 − t where 0 < ≤ 0.5 • for each iteration t = 1, . . . , T produce a version of the training set St on which ht are fit and then, assemble their predictions • how to select the training points at each round? → concentrate on most difficult points • how to combine the weak classifiers? → take the (weighted) majority vote Boosting A general methodology of producing highly accurate predictors based on averaging some weak classifiers. Context • PAC framework: • a strong-PAC algorithm: • for any distribution (of data) • ∀ > 0, ∀δ > 0 • given enough data (i.i.d. from the distribution) • with probability at least 1 − δ, the algorithm will find a classifier with error ≤ • a weak-PAC algorithm: the same conditions, but the guaranteed error is ≥ 1 2 − γ • when weak-PAC learnability leads to strong-PAC? AdaBoost • a development of previous "boosting" algorithms • first to reach widespread applicability, due to simplicity of the implementation and good observed performance (in addition to theoretical performance) • Freund & Schapire (EuroCOLT, 1995); the more complete version: "A decision-theoretic generalization of on-line learning and an application to boosting", J. Comp Sys Sc 1997 • AdaBoost: adaptive boosting Outline 1 Bagging 2 Random forests 3 AdaBoost Introduction Basic AdaBoost Different views on AdaBoost An additive logistic regression perspective Basic AdaBoost Input: a training set S = {(xi, yi)} and the number of iterations T Output: final classifier H as a combination of weak classifiers ht for t = 1 to T do construct a distribution Dt on {1, . . . , n} find a weak classifier ht : X → {−1, +1} which minimizes the error t on Dt , t = PrDt [ht (xi) yi] end for How to construct Dt ? • let D1(i) = 1/n (uninformative priors) • given Dt and a weak classifier ht , Dt+1 = Dt (i) Zt ×    exp(−αt ) if ht (xi) = yi exp(αt ) if ht (xi) yi = Dt (i) Zt exp(−αt yiht (xi)) • Zt is a properly chosen normalization constant • αt = 1 2 ln 1 − t t ∈ R+ What about the final decision/classifier? H(x) = sign   T t=1 αt ht (x)   How to use Dt for training a classifier? • either generate a new training sample from S by sampling according to Dt , or • use directly the sample weights for constructing ht (Classical) Example (Freund & Schapire) Initial state: D1 weak classifiers: single variable threshold function (Classical) Example (Freund & Schapire) Iteration 1: h1 α ε1 1 =0.30 =0.42 2D (Classical) Example (Freund & Schapire) Iteration 2: α ε2 2 =0.21 =0.65 h2 3D (Classical) Example (Freund & Schapire) Iteration 3: h3 α ε3 3=0.92 =0.14 (Classical) Example (Freund & Schapire) Final classifier: H final +0.92+0.650.42sign= = Training error theorem: let t < 1/2 be the error rate at step t and let γt = 1/2 − t , then the training error of the final classifier is upper bounded by Errtrain(H) ≤ exp  −2 T t=1 γ2 t   • then if ∀t : γt ≥ γ > 0, Errtrain ≤ exp(−2γ2 T) • it follows that Errtrain → 0 as T → ∞ • if γt γ the convergence is much faster What about overfitting? • Occam’s razor suggests that simpler rules are preferable • for SVMs, sparser models (less SVs) have better generalization properties • AdaBoost? What about overfitting? • Occam’s razor suggests that simpler rules are preferable • for SVMs, sparser models (less SVs) have better generalization properties • AdaBoost? • practice shows that AdaBoost is resistant to overfitting, in normal conditions What about overfitting? • Occam’s razor suggests that simpler rules are preferable • for SVMs, sparser models (less SVs) have better generalization properties • AdaBoost? • practice shows that AdaBoost is resistant to overfitting, in normal conditions • in highly noisy conditions, AdaBoost can overfit! Regularized versions exist to tackle this situation Where does the robustness (to overfitting) come from? • it’s a matter of margin! (most likely) • define the margin as the "strength of the vote", i.e. "weighted fraction of correct votes" - "weighted fraction of incorrect votes" The output from the final classifier (before sign()) ∈ [−1, 1] : 0 1−1 incorrect correct high confidence, correctno confidencehigh confidence, but wrong Example (from F&S’s tutorial): the "letters" data set from UCI, C4.5 weak classifiers 10 100 1000 0 5 10 15 20 error test train )T#of rounds( -1 -0.5 0.5 1 0.5 1.0 cumulativedistribution 1000 100 margin 5 #rounds 5 100 1000 train error 0.0 0.0 0.0 test error 8.4 3.3 3.1 %margins≤ 0.5 7.7 0.0 0.0 minimum margin 0.14 0.52 0.55 ...and a real world example: prediction of pCR in breast cancer • AdaBoost with weighted top scoring pairs weak classifiers • data: MDA gene expression data (∼22,000 variables) from MAQC project: n = 130 training samples, n = 100 testing samples • data comes different hospitals, clinical series, no much control on the representativeness of the training set • endpoint: pathologic complete response (pCR) Training and testing errors (with functional margin) # iterations = 1, errtr = 0.17, errts = 0.39 -1 0 1 2 3 4 0.00.20.40.60.81.0 CDF of margins at iteration 1 Margin CDF -1 0 1 2 3 4 0.00.20.40.60.81.0 CDF of margins at iteration 1 Margin CDF # iterations = 5, errtr = 0.02, errts = 0.34 -1 0 1 2 3 4 0.00.20.40.60.81.0 CDF of margins at iteration 5 Margin CDF -1 0 1 2 3 4 0.00.20.40.60.81.0 CDF of margins at iteration 5 Margin CDF # iterations = 10, errtr = 0.0, errts = 0.31 -1 0 1 2 3 4 0.00.20.40.60.81.0 CDF of margins at iteration 10 Margin CDF -1 0 1 2 3 4 0.00.20.40.60.81.0 CDF of margins at iteration 10 Margin CDF # iterations = 50, errtr = 0.0, errts = 0.23 -1 0 1 2 3 4 0.00.20.40.60.81.0 CDF of margins at iteration 50 Margin CDF -1 0 1 2 3 4 0.00.20.40.60.81.0 CDF of margins at iteration 50 Margin CDF # iterations = 100, errtr = 0.0, errts = 0.22 -1 0 1 2 3 4 0.00.20.40.60.81.0 CDF of margins at iteration 100 Margin CDF -1 0 1 2 3 4 0.00.20.40.60.81.0 CDF of margins at iteration 100 Margin CDF Ideas: • large margin allows a sparser approximation of the final classifier, hence the final classifier should have better generalization properties than its size would suggest • the AdaBoost increases the margin as T grows and decreases the effective complexity of the final classifier • ∀θ > 0, Err(H) ≤ ˆPr[margin ≤ θ] + O( √ h/n/θ) where h is the "complexity" of weak classifiers • ˆPr[margin ≤ θ] → 0 exponentially fast in T if γt > θ Outline 1 Bagging 2 Random forests 3 AdaBoost Introduction Basic AdaBoost Different views on AdaBoost An additive logistic regression perspective A few different interpretations • game theory: AdaBoost classifier as a solution of a minmax game • loss minimization • additive logistic model • maximum entropy • etc. etc. AdaBoost as a minimizer of exponential loss • let L(y, f(x)) be the loss function measuring the discrepancies between true target (label or real value) y and the predicted value f(x) • it can be shown that AdaBoost minimizes (remember the scaling factor Zt ?) t Zt = 1 n i exp(−yif(xi)) where f(x) = t αt ht (x) • yf(x) is the (functional) margin, similar to SVM • exponential loss is an upper bound of the 0-1 loss • AdaBoost is a greedy procedure for loss minimization: αt and ht are chosen locally to minimize the current loss Coordinate descent [Breiman] • let {h1, . . . , hm} be the space of all weak classifiers • the goal is to find β1, . . . , βm (coordinates in the space of weak classifiers) where the loss L(β1, . . . , βm) = i exp(−yi k βk h(xi)) is minimized • coordinate descent procedure: • start with βk = 0 • at each step: choose coordinate βk (on axis ht ) and update it by an increment αt • αt is chosen to maximize the decrease in loss • this is the very procedure implemented by AdaBoost Outline 1 Bagging 2 Random forests 3 AdaBoost Introduction Basic AdaBoost Different views on AdaBoost An additive logistic regression perspective Gradient descent optimization (reminder) • let Ω be a differentiable optimization criterion • let xk = xk−1 − γ Ω(xk−1), then for some small γ > 0 Ω(xk ) ≤ Ω(xk−1) Issues: • slow convergence • sensitive to initial point Gradient descent in function space • in the following, we will generalize from ±1-valued classifiers to real-valued functions • change of notation: F becomes the generalized version of H and f the generalized version of h, respectively • FM(x) = M 1 fm(x) is evaluated at each x • gradient (steepest) descent: fm(x) = −ρmgm(x) = −ρm F Ey,x [L(y, F(x))] F=Fm−1 ρm = arg min ρ Ey,x [L(y, Fm−1(x) − ρgm(x))] Additive models Friedman, Hastie, Tibshirani, Additive logistic regression: a statistical view of boosting, The Annals of Statistics, 2000. • regression models: let y ∈ R and model the mean: E [y|x] = p j=1 fj(xj), where x = (x1, . . . , xp) ∈ Rp . • iteratively update (backfit) the current approximation until convergence: fj(xj) ← E  y − k j fk (xk ) xj   . • the final solution, F(x) = p 1 fj(xj), is a minimizer of E (y − F(x))2 . Extended additive models • consider a family of functions fm(x) = βmb(x; γm). • b(·) : basis functions (linear, sigmoid, RBF, wavelets,. . . ) • Notes on basis functions: • span a function subspace • they need not be orthogonal, nor form a complete/minimal base • they can be chosen to form a redundant dictionary: matching pursuit • applications in (statistical) signal processing; image compression; multi–scale data analysis;... Fitting the model: • generalized backfitting: {βm, γm} ← arg min β,γ E    y − ( k m βk b(x; γk ) + βb(x; γ))   2  • greedy optimization: let FM(x) = M 1 βmb(x; γm) be the solution after M iterations; the successive approximations are {βm, γm} = arg min β,γ E (y − (Fm−1(x) + βb(x; γ))2 → matching pursuit; in classification: kernel matching pursuit Mallat, Zhang, Matching pursuit with time–frequency dictionaries, 1993 Vincent, Bengio, Kernel matching pursuit, 2002 Popovici, Thiran, Kernel matching pursuit for large datasets, 2005 From regression to classification • goal (for binary problems): estimate Pr(y = 1|x) • logistic regression: ln Pr(y = 1|x) Pr(y = −1|x) = FM(x) with FM(x) ∈ R. • ⇔ p(x) = Pr(y = 1|x) = exp(FM(x)) 1+exp(FM(x)) • FM is obtained by minimizing the expected loss: FM(x) = arg min F Ey,x [L(y, F(x))] = arg min F Ex Ey [L(y, F(x)))] |x Generalized boosting algorithm 1: given {(xi, yi)|i = 1, . . . , N}, let F0(x) = f0(x) 2: for all m = 1, . . . , M do 3: compute the current negative gradient: zi = − F L(F) F=Fm−1 = − ∂L(yi, F(xi)) ∂F F=Fm−1(xi) and fit fm using the new set {(xi, zi)|i = 1, . . . , N} 4: find the step–size cm = arg min c N i=1 L(yi, Fm−1(xi) + cfm(xi)) 5: let Fm(x) = Fm−1(x) + cmfm(x) 6: end for 7: return final classifier sign [FM(x)] Which loss function? Loss functions Loss y F(x) Exponential loss L(y, F) = E e−yF(x) Notes: • L(y, F) is minimized at F(x) = 1 2 ln Pr(y = 1|x) Pr(y = −1|x) • yF(x) F is called margin of sample x ⇒ L(y, F) forces margin maximization • L is differentiable and an upper bound of 1[yF(x)<0] • L has the same population minimizer as the binomial log–likelihood AdaBoost builds an additive logistic regression model 1: let wi = 1/N 2: for all m = 1, . . . , M do 3: fit the weak classifier fm(x) ∈ {±1} using the weights wi on the training data 4: errm = Ew 1[y fm(x)] { expectation with respect to weights! } 5: cm = ln 1−errm errm (note: cm = 2 arg minc L( m−1 1 fi + cfm)) 6: update the weights wi ← wi exp cm1[yi fm(xi)] , i = 1, . . . , N and normalize such that w = 1 7: end for 8: return final classifier sign M m=1 cmfm(x) Real AdaBoost: stagewise optimization of exponential loss 1: let wi = 1/N 2: for all m = 1, . . . , M do 3: fit the weak classifier using the weights wi on the training data and obtain the posteriors pm(x) = ˆPw(y = 1|x) ∈ [0, 1] 4: let fm(x) = 1 2 ln pm(x) 1−pm(x) {note: this is the local minimizer of L} 5: update the weights wi ← wi exp (−yifm(xi)) , i = 1, . . . , N and normalize such that w = 1 6: end for 7: return final classifier sign M m=1 fm(x) LogitBoost: stagewise opt. of binomial log–likelihood Let y∗ = (1 + y)/2 ∈ {0, 1} and Pr(y∗ = 1|x) = p = exp(F(x))/(exp(F(x)) + exp(−F(x))) 1: let wi = 1/N, pi = 1/2, ∀i = 1, . . . , N, F(x) = 0 2: for all m = 1, . . . , M do 3: let zi = y∗ i −pi pi(1−pi) { new responses, instead of y} 4: let wi = pi(1 − pi) 5: fit fm by weighted least–square regression of zi to xi using weights wi 6: update F ← F + 1/2fm 7: update p ← exp(F)/(exp(F) + exp(−F)) 8: end for 9: return final classifier sign M m=1 fm(x) Which weak learner? • any classifier with an error rate < 0.5 • decision stumps (classification tree with 1 node) • classical classification trees • top scoring pairs classifier • linear (logistic) regression (an example later) • radial basis functions • . . . Practical issues • the weak classifier should not be too strong • AdaBoost or LogitBoost are good first choices for classification problems • stopping rules: • quit when the weak classifier cannot fit the data anymore • choose M by an inner cross–validation or independent data set • use AIC, BIC, MDL as criteria for choosing M