PA196: Pattern Recognition
08. Multiple classifier systems (cont'd): Bagging, Random forests, AdaBoost
Dr. Vlad Popovici (popovici@iba.muni.cz)
Institute of Biostatistics and Analyses, Masaryk University, Brno

General idea
[Figure: T subsets of the data, X_1, X_2, ..., X_T (subsets of the n examples and/or of the d features), are used to train classifiers h_1, h_2, ..., h_T, whose predictions are combined by (weighted) majority vote, H = \sum_t w_t h_t.]

Outline
1 Bagging
2 Random forests
3 AdaBoost
  - Introduction
  - Basic AdaBoost
  - Different views on AdaBoost
  - An additive logistic regression perspective

Bagging (Breiman, 1996)
- bagging = bootstrap aggregation
- create T bootstrap samples X_t by sampling with replacement
- train a classifier on each X_t
- aggregate the classifications by plurality voting to obtain the aggregated classifier H
- a similar approach works for regression
- works well with unstable classifiers (those with high variance): decision trees, neural networks

Why does bagging work?
- it reduces the variance component of the error (the variance due to the sampling of the training set):

    \mathbb{E}[(y - H(x))^2] = (y - \mathbb{E}[H(x)])^2 + \mathbb{E}[(H(x) - \mathbb{E}[H(x)])^2] = \text{bias}^2 + \text{variance}

- the bias of H remains approximately the same as that of the h_t:

    \mathrm{Bias}(H) = \frac{1}{T} \sum_{t=1}^{T} \mathrm{Bias}(h_t)

- but the variance is reduced:

    \mathrm{Var}(H) \approx \frac{1}{T} \mathrm{Var}(h_1)

Variants
- draw random subsamples of the data → "Pasting"
- draw random subsets of the features → "Random Subspaces"
- draw random subsamples and random feature subsets → "Random Patches"

Random forests
Idea: induce randomness in the base classifier (a tree) and combine the predictions of an ensemble of such trees (a forest) by averaging or majority vote.
- a refinement of bagging trees
- (1st level of randomness) grow the trees on bootstrap samples
- (2nd level of randomness) when growing a tree, at each node consider only a random subset of the features (typically √d or log2 d of them)
- for each tree, the error rate on the observations left out of its bootstrap sample is monitored (the "out-of-bag" error rate)
- the result is a collection of "de-correlated" trees whose averaged/voted predictions should have decreased variance
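To make the bagging recipe concrete, here is a minimal sketch (mine, not from the slides) of bagging with an optional per-learner random feature subset, using scikit-learn decision trees as the unstable base learners; the names `fit_bagging` and `predict_bagging`, and all parameter defaults, are my own choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagging(X, y, T=50, max_features=None, seed=0):
    """Train T trees on bootstrap samples; optionally restrict each tree to a
    random subset of the features ('Random Patches'-style when combined with
    the bootstrap, 'Random Subspaces'-style if rows were kept intact)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ensemble = []
    for _ in range(T):
        rows = rng.integers(0, n, size=n)                # bootstrap: sample n rows with replacement
        cols = (np.arange(d) if max_features is None
                else rng.choice(d, size=max_features, replace=False))
        tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
        ensemble.append((cols, tree))
    return ensemble

def predict_bagging(ensemble, X):
    """Aggregate the individual predictions by plurality (majority) vote."""
    votes = np.stack([tree.predict(X[:, cols]) for cols, tree in ensemble])   # (T, n_samples)
    classes = np.unique(votes)
    counts = np.stack([(votes == c).sum(axis=0) for c in classes])            # votes per class
    return classes[counts.argmax(axis=0)]
```

Setting `max_features` to roughly √d mimics the feature subsampling mentioned above, but note that a proper random forest redraws the feature subset at every node of every tree, not once per tree; in practice one would use sklearn.ensemble.BaggingClassifier or RandomForestClassifier, which implement these schemes directly.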
AdaBoost: general approach
(following Freund & Schapire's tutorial on boosting)
- let S = {(x_i, y_i) | i = 1, ..., n} be a data set with y_i ∈ {±1} and x_i a d-dimensional vector of features x_ij
- assume there exists a learner (or several) able to produce basic classifiers h_t from sets such as S
- the h_t are called "weak classifiers"; the only condition is that Err(h_t) = 0.5 − ε_t with 0 < ε_t ≤ 0.5, i.e. each h_t does at least slightly better than random guessing
- at each iteration t = 1, ..., T, produce a version S_t of the training set, fit h_t on it, and finally assemble the predictions

Two questions:
- how to select the training points at each round? → concentrate on the most difficult points
- how to combine the weak classifiers? → take the (weighted) majority vote

Boosting
A general methodology for producing highly accurate predictors by (weighted) averaging of weak classifiers.

Context: the PAC framework
- a strong-PAC algorithm: for any distribution of the data and ∀ε > 0, ∀δ > 0, given enough data (i.i.d. from that distribution), with probability at least 1 − δ the algorithm finds a classifier with error ≤ ε
- a weak-PAC algorithm: the same conditions, but the error can only be guaranteed to satisfy ε ≥ 1/2 − γ (just better than guessing)
- question: when does weak-PAC learnability lead to strong-PAC learnability?

AdaBoost
- a development of previous "boosting" algorithms
- the first to reach widespread applicability, due to the simplicity of its implementation and its good observed performance (in addition to its theoretical guarantees)
- Freund & Schapire (EuroCOLT, 1995); the more complete version: "A decision-theoretic generalization of on-line learning and an application to boosting", J. Comput. Syst. Sci., 1997
- AdaBoost = adaptive boosting

Basic AdaBoost
Input: a training set S = {(x_i, y_i)} and the number of iterations T
Output: the final classifier H, a combination of the weak classifiers h_t
for t = 1 to T do
  construct a distribution D_t on {1, ..., n}
  find a weak classifier h_t : X → {−1, +1} which minimizes the weighted error ε_t on D_t, ε_t = Pr_{D_t}[h_t(x_i) ≠ y_i]
end for
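The weak learner only has to beat chance on the weighted sample D_t. As an illustration (mine, not from the slides), here is a minimal sketch of a weighted decision stump, i.e. a single-variable threshold classifier fitted under sample weights; it uses a brute-force search, so it is meant for small data only.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump for labels y in {-1, +1}: choose a feature j, a
    threshold s and a polarity so that polarity * sign(x_j - s) minimizes the
    weighted 0-1 error sum_i w_i * 1[h(x_i) != y_i]."""
    n, d = X.shape
    best_err, best = np.inf, None
    for j in range(d):
        for s in np.unique(X[:, j]):
            base = np.where(X[:, j] > s, 1, -1)
            for polarity in (+1, -1):
                err = np.sum(w[polarity * base != y])   # total weight of misclassified points
                if err < best_err:
                    best_err, best = err, (j, s, polarity)
    return best, best_err                               # the stump and its weighted error eps_t

def predict_stump(stump, X):
    j, s, polarity = stump
    return polarity * np.where(X[:, j] > s, 1, -1)
```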
How to construct D_t?
- let D_1(i) = 1/n (uninformative prior)
- given D_t and a weak classifier h_t,

    D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} \exp(-\alpha_t) & \text{if } h_t(x_i) = y_i \\ \exp(\alpha_t) & \text{if } h_t(x_i) \neq y_i \end{cases} = \frac{D_t(i)}{Z_t} \exp(-\alpha_t y_i h_t(x_i))

  where Z_t is a properly chosen normalization constant and

    \alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t} \in \mathbb{R}_+

What about the final decision/classifier?

    H(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right)

How to use D_t for training a classifier?
- either generate a new training sample from S by sampling according to D_t, or
- use the sample weights directly when constructing h_t

(Classical) Example (Freund & Schapire)
- weak classifiers: single-variable threshold functions; initial state: the uniform distribution D_1 over a 2D toy data set
- iteration 1: h_1 with ε_1 = 0.30, α_1 = 0.42
- iteration 2: h_2 with ε_2 = 0.21, α_2 = 0.65
- iteration 3: h_3 with ε_3 = 0.14, α_3 = 0.92
- final classifier: H_final = sign(0.42 h_1 + 0.65 h_2 + 0.92 h_3)
[Original slides show the toy data set with the re-weighted points and the decision boundary at each iteration.]

Training error
Theorem: let ε_t < 1/2 be the error rate at step t and let γ_t = 1/2 − ε_t; then the training error of the final classifier is upper bounded by

    \mathrm{Err}_{\text{train}}(H) \leq \exp\left( -2 \sum_{t=1}^{T} \gamma_t^2 \right)

- hence, if ∀t: γ_t ≥ γ > 0, then Err_train(H) ≤ exp(−2γ²T)
- it follows that Err_train → 0 as T → ∞
- if γ_t ≫ γ, the convergence is much faster
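Putting together the weight update, the choice of α_t and the weighted majority vote, the following is a minimal sketch of the basic AdaBoost loop (my own illustration, reusing the `fit_stump`/`predict_stump` sketch above as the weak learner; any learner that accepts sample weights would do).

```python
import numpy as np

def adaboost_fit(X, y, T=50, weak_fit=fit_stump, weak_predict=predict_stump):
    """Basic (discrete) AdaBoost for labels y in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                        # D_1(i) = 1/n
    weak, alphas = [], []
    for _ in range(T):
        h, eps = weak_fit(X, y, D)                 # weak classifier and its weighted error
        eps = np.clip(eps, 1e-12, 1 - 1e-12)       # guard against eps = 0 (a perfect weak classifier)
        alpha = 0.5 * np.log((1 - eps) / eps)
        pred = weak_predict(h, X)
        D = D * np.exp(-alpha * y * pred)          # down-weight correct, up-weight incorrect points
        D = D / D.sum()                            # division by Z_t
        weak.append(h)
        alphas.append(alpha)
    return weak, np.array(alphas)

def adaboost_decision(weak, alphas, X, weak_predict=predict_stump):
    """Raw ensemble score f(x) = sum_t alpha_t h_t(x); the classifier is sign(f)."""
    return sum(a * weak_predict(h, X) for h, a in zip(weak, alphas))
```

With each weighted error ε_t = 1/2 − γ_t strictly below 1/2, the training error of sign(f) drops at least as fast as exp(−2 Σ_t γ_t²), as stated in the theorem above.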
What about overfitting?
- Occam's razor suggests that simpler rules are preferable
- for SVMs, sparser models (fewer support vectors) have better generalization properties
- and AdaBoost?
  - practice shows that AdaBoost is resistant to overfitting under normal conditions
  - in highly noisy conditions, however, AdaBoost can overfit! Regularized versions exist to tackle this situation

Where does the robustness (to overfitting) come from?
- it is (most likely) a matter of margin!
- define the margin as the "strength of the vote": the weighted fraction of correct votes minus the weighted fraction of incorrect votes
- the output of the final classifier (before taking sign()) lies in [−1, 1]
[Figure: the [−1, 1] axis annotated as high-confidence incorrect near −1, low confidence around 0, and high-confidence correct near +1.]

Example (from F&S's tutorial): the "letters" data set from UCI, with C4.5 weak classifiers
[Figure: train/test error vs. the number of rounds T, and the cumulative distribution of margins after 5, 100 and 1000 rounds.]

  # rounds           5      100    1000
  train error (%)    0.0    0.0    0.0
  test error (%)     8.4    3.3    3.1
  % margins ≤ 0.5    7.7    0.0    0.0
  minimum margin     0.14   0.52   0.55

...and a real-world example: prediction of pCR in breast cancer
- AdaBoost with weighted top-scoring-pairs weak classifiers
- data: MDA gene expression data (~22,000 variables) from the MAQC project: n = 130 training samples, n = 100 test samples
- the data come from different hospitals and clinical series, with little control over the representativeness of the training set
- endpoint: pathologic complete response (pCR)
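The next slides track the training/test errors together with the cumulative distribution of the (functional) margins as boosting proceeds. A small sketch (mine) of how such a margin CDF can be computed from the raw score f returned by the `adaboost_decision` sketch above; dividing by Σ_t |α_t| gives the normalized margin in [−1, 1] used in the "letters" table.

```python
import numpy as np

def margin_cdf(f, y, alphas=None, grid=None):
    """Empirical CDF of the functional margins m_i = y_i * f(x_i); if the weights
    alpha_t are given, margins are normalized by sum_t |alpha_t| to lie in [-1, 1]."""
    margins = y * f
    if alphas is not None:
        margins = margins / np.sum(np.abs(alphas))
    if grid is None:
        grid = np.linspace(margins.min(), margins.max(), 201)
    cdf = np.array([(margins <= t).mean() for t in grid])
    return grid, margins, cdf

# e.g. the fraction of training points with normalized margin at most 0.5,
# as reported in the "letters" table above:
# _, margins, _ = margin_cdf(f_train, y_train, alphas)
# print((margins <= 0.5).mean())
```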
Training and testing errors (with functional margin)
[Original slides show, for each number of iterations, the empirical CDF of the margins on the training and test sets; the corresponding error rates are:]

  # iterations    err_train    err_test
  1               0.17         0.39
  5               0.02         0.34
  10              0.00         0.31
  50              0.00         0.23
  100             0.00         0.22

Ideas:
- a large margin allows a sparser approximation of the final classifier, hence the final classifier should have better generalization properties than its size would suggest
- AdaBoost increases the margin as T grows, and thus decreases the effective complexity of the final classifier
- ∀θ > 0,

    \mathrm{Err}(H) \leq \hat{\Pr}[\text{margin} \leq \theta] + O\left( \frac{\sqrt{h/n}}{\theta} \right)

  where h is the "complexity" of the weak classifiers
- \hat{\Pr}[\text{margin} \leq \theta] \to 0 exponentially fast in T if γ_t > θ

A few different interpretations
- game theory: the AdaBoost classifier as the solution of a min-max game
- loss minimization
- additive logistic model
- maximum entropy
- etc.

AdaBoost as a minimizer of exponential loss
- let L(y, f(x)) be a loss function measuring the discrepancy between the true target y (a label or a real value) and the predicted value f(x)
- it can be shown that AdaBoost minimizes (remember the scaling factors Z_t?)

    \prod_t Z_t = \frac{1}{n} \sum_i \exp(-y_i f(x_i)), \qquad \text{where } f(x) = \sum_t \alpha_t h_t(x)

- y f(x) is the (functional) margin, similar to the SVM margin
- the exponential loss is an upper bound of the 0-1 loss
- AdaBoost is a greedy procedure for loss minimization: α_t and h_t are chosen locally, to minimize the current loss
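A short numeric check of the statements above (my own illustration): the empirical exponential loss always upper-bounds the training 0-1 error of sign(f), because exp(−m) ≥ 1 whenever the margin m = y f(x) ≤ 0; for an AdaBoost ensemble this quantity equals the product of the normalization constants Z_t.

```python
import numpy as np

def exp_loss(f, y):
    """Empirical exponential loss (1/n) * sum_i exp(-y_i f(x_i));
    for an AdaBoost score f this equals prod_t Z_t."""
    return np.mean(np.exp(-y * f))

def zero_one_loss(f, y):
    """Training error of H(x) = sign(f(x)), counting ties as errors."""
    return np.mean(y * f <= 0)

# For any score f and labels y in {-1, +1}:  zero_one_loss(f, y) <= exp_loss(f, y).
```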
Coordinate descent [Breiman]
- let {h_1, ..., h_m} be the space of all weak classifiers
- the goal is to find β_1, ..., β_m (the coordinates in the space of weak classifiers) minimizing the loss

    L(\beta_1, \ldots, \beta_m) = \sum_i \exp\left( -y_i \sum_k \beta_k h_k(x_i) \right)

- coordinate descent procedure: start with β_k = 0; at each step, choose a coordinate β_k (i.e. an axis h_t) and update it by an increment α_t, with α_t chosen to maximize the decrease in loss
- this is exactly the procedure implemented by AdaBoost

Gradient descent optimization (reminder)
- let Ω be a differentiable optimization criterion
- let x_k = x_{k−1} − γ∇Ω(x_{k−1}); then, for some small γ > 0, Ω(x_k) ≤ Ω(x_{k−1})
- issues: slow convergence; sensitivity to the initial point

Gradient descent in function space
- in the following we generalize from ±1-valued classifiers to real-valued functions
- change of notation: F becomes the generalized version of H, and f the generalized version of h
- F_M(x) = \sum_{m=1}^{M} f_m(x) is evaluated at each x
- gradient (steepest) descent:

    f_m(x) = -\rho_m g_m(x), \qquad g_m(x) = \left. \nabla_F\, \mathbb{E}_{y,x}[L(y, F(x))] \right|_{F = F_{m-1}}

    \rho_m = \arg\min_{\rho}\; \mathbb{E}_{y,x}[L(y, F_{m-1}(x) - \rho\, g_m(x))]

Additive models
Friedman, Hastie, Tibshirani, "Additive logistic regression: a statistical view of boosting", The Annals of Statistics, 2000.
- regression models: let y ∈ R and model the mean

    \mathbb{E}[y \mid x] = \sum_{j=1}^{p} f_j(x_j), \qquad x = (x_1, \ldots, x_p) \in \mathbb{R}^p

- iteratively update (backfit) the current approximation until convergence:

    f_j(x_j) \leftarrow \mathbb{E}\left[ y - \sum_{k \neq j} f_k(x_k) \,\middle|\, x_j \right]

- the final solution, F(x) = \sum_{j=1}^{p} f_j(x_j), is a minimizer of \mathbb{E}[(y - F(x))^2]

Extended additive models
- consider a family of functions f_m(x) = β_m b(x; γ_m)
- b(·): basis functions (linear, sigmoid, RBF, wavelets, ...)
- notes on basis functions:
  - they span a function subspace
  - they need not be orthogonal, nor form a complete/minimal basis
  - they can be chosen to form a redundant dictionary → matching pursuit
  - applications in (statistical) signal processing, image compression, multi-scale data analysis, ...
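Before moving to how such models are fitted, here is a minimal sketch (my own, with arbitrary choices) of the backfitting update from the additive-models slide above, using simple univariate polynomial fits as the per-coordinate smoothers.

```python
import numpy as np

def backfit_additive(X, y, degree=3, n_iter=20):
    """Fit E[y|x] ~ alpha + sum_j f_j(x_j) by backfitting: each f_j is repeatedly
    re-fitted to the partial residuals y - alpha - sum_{k != j} f_k(x_k)."""
    n, p = X.shape
    alpha = y.mean()
    coefs = [np.zeros(degree + 1) for _ in range(p)]        # each f_j stored as a polynomial

    def f(j, xj):
        return np.polyval(coefs[j], xj)

    for _ in range(n_iter):
        for j in range(p):
            partial = y - alpha - sum(f(k, X[:, k]) for k in range(p) if k != j)
            coefs[j] = np.polyfit(X[:, j], partial, degree)  # "smooth" the partial residuals on x_j
            coefs[j][-1] -= f(j, X[:, j]).mean()             # center f_j to keep the decomposition identifiable

    def predict(Xnew):
        return alpha + sum(f(j, Xnew[:, j]) for j in range(p))
    return predict
```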
Fitting the model
- generalized backfitting:

    \{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta, \gamma}\; \mathbb{E}\left[ \left( y - \sum_{k \neq m} \beta_k b(x; \gamma_k) - \beta\, b(x; \gamma) \right)^2 \right]

- greedy optimization: let F_M(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m) be the solution after M iterations; the successive approximations are

    \{\beta_m, \gamma_m\} = \arg\min_{\beta, \gamma}\; \mathbb{E}\left[ \left( y - F_{m-1}(x) - \beta\, b(x; \gamma) \right)^2 \right]

  → matching pursuit; in classification: kernel matching pursuit
- Mallat, Zhang, "Matching pursuit with time-frequency dictionaries", 1993
- Vincent, Bengio, "Kernel matching pursuit", 2002
- Popovici, Thiran, "Kernel matching pursuit for large datasets", 2005

From regression to classification
- goal (for binary problems): estimate Pr(y = 1 | x)
- logistic regression:

    \ln \frac{\Pr(y = 1 \mid x)}{\Pr(y = -1 \mid x)} = F_M(x), \quad F_M(x) \in \mathbb{R}
    \;\Leftrightarrow\; p(x) = \Pr(y = 1 \mid x) = \frac{\exp(F_M(x))}{1 + \exp(F_M(x))}

- F_M is obtained by minimizing the expected loss:

    F_M(x) = \arg\min_F\; \mathbb{E}_{y,x}[L(y, F(x))] = \arg\min_F\; \mathbb{E}_x\!\left[ \mathbb{E}_y[L(y, F(x))] \mid x \right]

Generalized boosting algorithm
1: given {(x_i, y_i) | i = 1, ..., N}, let F_0(x) = f_0(x)
2: for all m = 1, ..., M do
3:   compute the current negative gradient

       z_i = -\left. \nabla_F L(F) \right|_{F = F_{m-1}} = -\left. \frac{\partial L(y_i, F(x_i))}{\partial F} \right|_{F = F_{m-1}(x_i)}

     and fit f_m using the new set {(x_i, z_i) | i = 1, ..., N}
4:   find the step size c_m = \arg\min_c \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + c f_m(x_i))
5:   let F_m(x) = F_{m-1}(x) + c_m f_m(x)
6: end for
7: return the final classifier sign[F_M(x)]
(a code sketch of this recipe, with squared loss, follows at the end of this subsection)

Which loss function?
[Figure: comparison of several loss functions (vertical axis: loss).]

Exponential loss

    L(y, F) = \mathbb{E}\left[ e^{-y F(x)} \right]

Notes:
- L(y, F) is minimized at

    F(x) = \frac{1}{2} \ln \frac{\Pr(y = 1 \mid x)}{\Pr(y = -1 \mid x)}

- yF(x) is called the margin of the sample x ⇒ L(y, F) forces margin maximization
- L is differentiable and an upper bound of 1[yF(x) < 0]
- L has the same population minimizer as the binomial log-likelihood

AdaBoost builds an additive logistic regression model
1: let w_i = 1/N
2: for all m = 1, ..., M do
3:   fit the weak classifier f_m(x) ∈ {±1} using the weights w_i on the training data
4:   err_m = E_w[1[y ≠ f_m(x)]]   { expectation with respect to the weights! }
5:   c_m = ln((1 − err_m)/err_m)   (note: c_m = 2 \arg\min_c L(\sum_{k=1}^{m-1} f_k + c f_m))
6:   update the weights w_i ← w_i exp(c_m 1[y_i ≠ f_m(x_i)]), i = 1, ..., N, and normalize so that \sum_i w_i = 1
7: end for
8: return the final classifier sign[\sum_{m=1}^{M} c_m f_m(x)]
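Referring back to the generalized boosting algorithm above, here is a minimal sketch (mine) with squared loss, where the negative gradient z_i is simply the residual y_i − F_{m−1}(x_i), and shallow regression trees from scikit-learn play the role of the f_m; the exact line search for c_m is replaced by a fixed small shrinkage factor, a common simplification.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, shrinkage=0.1, max_depth=2):
    """Functional gradient descent with L(y, F) = (y - F)^2 / 2."""
    F0 = y.mean()                               # F_0: the best constant under squared loss
    F = np.full(len(y), F0)
    stages = []
    for _ in range(M):
        z = y - F                               # negative gradient of the squared loss at F_{m-1}
        fm = DecisionTreeRegressor(max_depth=max_depth).fit(X, z)
        F = F + shrinkage * fm.predict(X)       # F_m = F_{m-1} + c * f_m (c fixed here)
        stages.append(fm)
    return F0, stages, shrinkage

def gradient_boost_predict(F0, stages, shrinkage, X):
    return F0 + shrinkage * sum(fm.predict(X) for fm in stages)
```

For the classification setting of the slides one would plug in the exponential or binomial loss instead and return sign(F_M(x)).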
Real AdaBoost: stagewise optimization of the exponential loss
1: let w_i = 1/N
2: for all m = 1, ..., M do
3:   fit the weak classifier using the weights w_i on the training data and obtain the posterior estimates p_m(x) = \hat{P}_w(y = 1 | x) ∈ [0, 1]
4:   let f_m(x) = \frac{1}{2} \ln \frac{p_m(x)}{1 - p_m(x)}   { note: this is the local minimizer of L }
5:   update the weights w_i ← w_i exp(−y_i f_m(x_i)), i = 1, ..., N, and normalize so that \sum_i w_i = 1
6: end for
7: return the final classifier sign[\sum_{m=1}^{M} f_m(x)]

LogitBoost: stagewise optimization of the binomial log-likelihood
Let y* = (1 + y)/2 ∈ {0, 1} and Pr(y* = 1 | x) = p = exp(F(x)) / (exp(F(x)) + exp(−F(x))).
1: let w_i = 1/N, p_i = 1/2 for all i = 1, ..., N, and F(x) = 0
2: for all m = 1, ..., M do
3:   let z_i = (y*_i − p_i) / (p_i(1 − p_i))   { new working responses, used instead of y }
4:   let w_i = p_i(1 − p_i)
5:   fit f_m by weighted least-squares regression of z_i on x_i using the weights w_i
6:   update F ← F + \frac{1}{2} f_m
7:   update p ← exp(F) / (exp(F) + exp(−F))
8: end for
9: return the final classifier sign[\sum_{m=1}^{M} f_m(x)]

Which weak learner?
- any classifier with an error rate < 0.5, for example:
- decision stumps (classification trees with a single node)
- classical classification trees
- the top-scoring-pairs classifier
- linear (logistic) regression
- radial basis functions
- ...

Practical issues
- the weak classifier should not be too strong
- AdaBoost and LogitBoost are good first choices for classification problems
- stopping rules:
  - stop when the weak classifier can no longer fit the (weighted) data
  - choose M by an inner cross-validation or on an independent data set
  - use AIC, BIC or MDL as the criterion for choosing M
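As an illustration of the second stopping rule (choosing M on independent data), a small sketch of my own, reusing the `adaboost_fit`/`predict_stump` sketches from earlier: fit a long run once, then pick the number of rounds with the lowest validation error of the partial ensemble.

```python
import numpy as np

def choose_M_by_validation(X_tr, y_tr, X_val, y_val, T_max=200):
    """Pick M <= T_max minimizing the validation error of sign(sum_{t<=M} alpha_t h_t(x))."""
    weak, alphas = adaboost_fit(X_tr, y_tr, T=T_max)
    score = np.zeros(len(y_val))
    val_err = []
    for h, a in zip(weak, alphas):
        score += a * predict_stump(h, X_val)        # staged score after adding one more round
        val_err.append(np.mean(np.sign(score) != y_val))
    best_M = int(np.argmin(val_err)) + 1
    return best_M, np.array(val_err)
```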