PA196: Pattern Recognition
08. Multiple classiﬁer systems (cont’d)
Dr. Vlad Popovici
popovici@recetox.muni.cz
RECETOX
Masaryk University, Brno
General idea
X
X1 X2 XT
Û Ø Ñ ÓÖ ØÝ ÚÓØ
H = t wtht
×Ù × Ø× Ó Ü ÑÔÐ ×
×Ù × Ø× Ó ØÙÖ ×
h1 h2
hT
½
n
d
Outline
1 Bagging
2 Random forests
3 AdaBoost
Introduction
Basic AdaBoost
Different views on AdaBoost
An additive logistic regression perspective
Bagging (Breiman, 1996)
• bagging = bootstrap aggregation
• create T bootstrap samples Xt by sampling with replacement
• train a classiﬁer on each Xt
• aggregate the classiﬁcations by plurality voting to obtain the
aggregated classiﬁer H
• a similar approach works for regression
• works well with unstable classiﬁers (with high variance):
decision trees, neural networks
Why does bagging work?
• reduces variance (due to sampling in the test sets):
E[(y − H(x))2
] = (y − E[H(x)])2
+ E[(H(x) − E[(H(x))])2
]
= bias2
+ variance
• the bias of H remains approximately the same as for ht :
Bias(H) =
1
T
T
t=1
Bias(ht )
• but the variance is reduced:
Var(H) ≈
1
T
Var(h1)
Variants:
• draw random subsamples of data → "Pasting"
• draw random subsets of features → "Random Subspaces"
• draw random subsamples and random features → "Random
Patches"
Outline
1 Bagging
2 Random forests
3 AdaBoost
Introduction
Basic AdaBoost
Different views on AdaBoost
An additive logistic regression perspective
Idea: induce randomness in the base classiﬁer (tree) and combine
the predictions of an ensemble of such trees (forest) by averaging
or majority vote.
• reﬁnement of bagging trees
• (1st level of randomness) grow the trees on bootstrap samples
• (2nd level of randomness) when growing a tree, at each node
consider only a random subset of features (typically
√
d or
log2 d features)
• for each tree, the error rate for observation left out from the
learning set is monitored ("out-of-bag" error rate)
• the result is a collection of "de-correlated" trees that by
averaging/voting should lead to decreased variance of the
ﬁnal predictor
Outline
1 Bagging
2 Random forests
3 AdaBoost
Introduction
Basic AdaBoost
Different views on AdaBoost
An additive logistic regression perspective
Outline
1 Bagging
2 Random forests
3 AdaBoost
Introduction
Basic AdaBoost
Different views on AdaBoost
An additive logistic regression perspective
General approach
(I will follow Freund & Schapire’s tutorial on boosting)
• let S = {(xi, yi)|i = 1, . . . , n} be a data set with yi ∈ {±1} and xi
a d−dimensional vector of features xij
• let there exist a learner (or a few) able to produce some basic
classiﬁers ht , based on sets such as S
• ht will be called "weak classiﬁers" and the condition is that
Err(ht ) = 0.5 − t where 0 < ≤ 0.5
• for each iteration t = 1, . . . , T produce a version of the training
set St on which ht are ﬁt and then, assemble their predictions
• how to select the training points at each round?
→ concentrate on most difﬁcult points
• how to combine the weak classiﬁers?
→ take the (weighted) majority vote
Boosting
A general methodology of producing highly accurate predictors
based on averaging some weak classiﬁers.
Context
• PAC framework:
• a strong-PAC algorithm:
• for any distribution (of data)
• ∀ > 0, ∀δ > 0
• given enough data (i.i.d. from the distribution)
• with probability at least 1 − δ, the algorithm will ﬁnd a classiﬁer
with error ≤
• a weak-PAC algorithm: the same conditions, but the
guaranteed error is ≥ 1
2
− γ
• when weak-PAC learnability leads to strong-PAC?
AdaBoost
• a development of previous "boosting" algorithms
• ﬁrst to reach widespread applicability, due to simplicity of the
implementation and good observed performance (in addition
to theoretical performance)
• Freund & Schapire (EuroCOLT, 1995); the more complete
version: "A decision-theoretic generalization of on-line
learning and an application to boosting", J. Comp Sys Sc 1997
• AdaBoost: adaptive boosting
Outline
1 Bagging
2 Random forests
3 AdaBoost
Introduction
Basic AdaBoost
Different views on AdaBoost
An additive logistic regression perspective
Basic AdaBoost
Input: a training set S = {(xi, yi)} and the number of iterations T
Output: ﬁnal classiﬁer H as a combination of weak classiﬁers ht
for t = 1 to T do
construct a distribution Dt on {1, . . . , n}
ﬁnd a weak classiﬁer
ht : X → {−1, +1}
which minimizes the error t on Dt ,
t = PrDt [ht (xi) yi]
end for
How to construct Dt ?
• let D1(i) = 1/n (uninformative priors)
• given Dt and a weak classiﬁer ht ,
Dt+1 =
Dt (i)
Zt
×



exp(−αt ) if ht (xi) = yi
exp(αt ) if ht (xi) yi
=
Dt (i)
Zt
exp(−αt yiht (xi))
• Zt is a properly chosen normalization constant
•
αt =
1
2
ln
1 − t
t
∈ R+
What about the ﬁnal decision/classiﬁer?
H(x) = sign


T
t=1
αt ht (x)


How to use Dt for training a classiﬁer?
• either generate a new training sample from S by sampling
according to Dt , or
• use directly the sample weights for constructing ht
(Classical) Example (Freund & Schapire)
Initial state:
D1
weak classiﬁers: single variable threshold function
(Classical) Example (Freund & Schapire)
Iteration 1:
h1
α
ε1
1
=0.30
=0.42
2D
(Classical) Example (Freund & Schapire)
Iteration 2:
α
ε2
2
=0.21
=0.65
h2 3D
(Classical) Example (Freund & Schapire)
Iteration 3:
h3
α
ε3
3=0.92
=0.14
(Classical) Example (Freund & Schapire)
Final classiﬁer:
H
final
+0.92+0.650.42sign=
=
Training error theorem: let t < 1/2 be the error rate at step t and
let γt = 1/2 − t , then the training error of the ﬁnal classiﬁer is
upper bounded by
Errtrain(H) ≤ exp

−2
T
t=1
γ2
t


• then if ∀t : γt ≥ γ > 0, Errtrain ≤ exp(−2γ2
T)
• it follows that Errtrain → 0 as T → ∞
• if γt γ the convergence is much faster
What about overﬁtting?
• Occam’s razor suggests that simpler rules are preferable
• for SVMs, sparser models (less SVs) have better
generalization properties
• AdaBoost?
What about overﬁtting?
• Occam’s razor suggests that simpler rules are preferable
• for SVMs, sparser models (less SVs) have better
generalization properties
• AdaBoost?
• practice shows that AdaBoost is resistant to overﬁtting, in
normal conditions
What about overﬁtting?
• Occam’s razor suggests that simpler rules are preferable
• for SVMs, sparser models (less SVs) have better
generalization properties
• AdaBoost?
• practice shows that AdaBoost is resistant to overﬁtting, in
normal conditions
• in highly noisy conditions, AdaBoost can overﬁt! Regularized
versions exist to tackle this situation
Where does the robustness (to overﬁtting) come from?
• it’s a matter of margin! (most likely)
• deﬁne the margin as the "strength of the vote", i.e. "weighted
fraction of correct votes" - "weighted fraction of incorrect
votes"
The output from the ﬁnal classiﬁer (before sign()) ∈ [−1, 1] :
0 1−1
incorrect correct
high confidence, correctno confidencehigh confidence, but wrong
Example (from F&S’s tutorial): the "letters" data set from UCI, C4.5
weak classiﬁers
10 100 1000
0
5
10
15
20
error
test
train
)T#of rounds(
-1 -0.5 0.5 1
0.5
1.0
cumulativedistribution
1000
100
margin
5
#rounds
5 100 1000
train error 0.0 0.0 0.0
test error 8.4 3.3 3.1
%margins≤ 0.5 7.7 0.0 0.0
minimum margin 0.14 0.52 0.55
...and a real world example: prediction of
pCR in breast cancer
• AdaBoost with weighted top scoring pairs weak classiﬁers
• data: MDA gene expression data (∼22,000 variables) from
MAQC project: n = 130 training samples, n = 100 testing
samples
• data comes different hospitals, clinical series, no much control
on the representativeness of the training set
• endpoint: pathologic complete response (pCR)
Training and testing errors (with
functional margin)
# iterations = 1, errtr = 0.17, errts = 0.39
-1 0 1 2 3 4
0.00.20.40.60.81.0
CDF of margins at iteration 1
Margin
CDF
-1 0 1 2 3 4
0.00.20.40.60.81.0
CDF of margins at iteration 1
Margin
CDF
# iterations = 5, errtr = 0.02, errts = 0.34
-1 0 1 2 3 4
0.00.20.40.60.81.0
CDF of margins at iteration 5
Margin
CDF
-1 0 1 2 3 4
0.00.20.40.60.81.0
CDF of margins at iteration 5
Margin
CDF
# iterations = 10, errtr = 0.0, errts = 0.31
-1 0 1 2 3 4
0.00.20.40.60.81.0
CDF of margins at iteration 10
Margin
CDF
-1 0 1 2 3 4
0.00.20.40.60.81.0
CDF of margins at iteration 10
Margin
CDF
# iterations = 50, errtr = 0.0, errts = 0.23
-1 0 1 2 3 4
0.00.20.40.60.81.0
CDF of margins at iteration 50
Margin
CDF
-1 0 1 2 3 4
0.00.20.40.60.81.0
CDF of margins at iteration 50
Margin
CDF
# iterations = 100, errtr = 0.0, errts = 0.22
-1 0 1 2 3 4
0.00.20.40.60.81.0
CDF of margins at iteration 100
Margin
CDF
-1 0 1 2 3 4
0.00.20.40.60.81.0
CDF of margins at iteration 100
Margin
CDF
Ideas:
• large margin allows a sparser approximation of the ﬁnal
classiﬁer, hence the ﬁnal classiﬁer should have better
generalization properties than its size would suggest
• the AdaBoost increases the margin as T grows and
decreases the effective complexity of the ﬁnal classiﬁer
• ∀θ > 0, Err(H) ≤ ˆPr[margin ≤ θ] + O(
√
h/n/θ) where h is the
"complexity" of weak classiﬁers
• ˆPr[margin ≤ θ] → 0 exponentially fast in T if γt > θ
Outline
1 Bagging
2 Random forests
3 AdaBoost
Introduction
Basic AdaBoost
Different views on AdaBoost
An additive logistic regression perspective
A few different interpretations
• game theory: AdaBoost classiﬁer as a solution of a minmax
game
• loss minimization
• additive logistic model
• maximum entropy
• etc. etc.
AdaBoost as a minimizer of exponential
loss
• let L(y, f(x)) be the loss function measuring the discrepancies
between true target (label or real value) y and the predicted
value f(x)
• it can be shown that AdaBoost minimizes (remember the
scaling factor Zt ?)
t
Zt =
1
n
i
exp(−yif(xi))
where f(x) = t αt ht (x)
• yf(x) is the (functional) margin, similar to SVM
• exponential loss is an upper bound of the 0-1 loss
• AdaBoost is a greedy procedure for loss minimization: αt and
ht are chosen locally to minimize the current loss
Coordinate descent [Breiman]
• let {h1, . . . , hm} be the space of all weak classiﬁers
• the goal is to ﬁnd β1, . . . , βm (coordinates in the space of weak
classiﬁers) where the loss
L(β1, . . . , βm) =
i
exp(−yi
k
βk h(xi))
is minimized
• coordinate descent procedure:
• start with βk = 0
• at each step: choose coordinate βk (on axis ht ) and update it
by an increment αt
• αt is chosen to maximize the decrease in loss
• this is the very procedure implemented by AdaBoost
Outline
1 Bagging
2 Random forests
3 AdaBoost
Introduction
Basic AdaBoost
Different views on AdaBoost
An additive logistic regression perspective
Gradient descent optimization (reminder)
• let Ω be a differentiable
optimization criterion
• let xk = xk−1 − γ Ω(xk−1),
then for some small γ > 0
Ω(xk ) ≤ Ω(xk−1)
Issues:
• slow convergence
• sensitive to initial point
Gradient descent in function space
• in the following, we will generalize from ±1-valued classiﬁers
to real-valued functions
• change of notation: F becomes the generalized version of H
and f the generalized version of h, respectively
• FM(x) = M
1 fm(x) is evaluated at each x
• gradient (steepest) descent:
fm(x) = −ρmgm(x) = −ρm F Ey,x [L(y, F(x))]
F=Fm−1
ρm = arg min
ρ
Ey,x [L(y, Fm−1(x) − ρgm(x))]
Additive models
Friedman, Hastie, Tibshirani, Additive logistic regression: a statistical view of boosting, The Annals of Statistics, 2000.
• regression models: let y ∈ R and model the mean:
E [y|x] =
p
j=1
fj(xj),
where x = (x1, . . . , xp) ∈ Rp
.
• iteratively update (backﬁt) the current approximation until
convergence:
fj(xj) ← E

y −
k j
fk (xk ) xj

 .
• the ﬁnal solution, F(x) = p
1
fj(xj), is a minimizer of
E (y − F(x))2
.
Extended additive models
• consider a family of functions
fm(x) = βmb(x; γm).
• b(·) : basis functions (linear, sigmoid, RBF, wavelets,. . . )
• Notes on basis functions:
• span a function subspace
• they need not be orthogonal, nor form a complete/minimal
base
• they can be chosen to form a redundant dictionary: matching
pursuit
• applications in (statistical) signal processing; image
compression; multi–scale data analysis;...
Fitting the model:
• generalized backﬁtting:
{βm, γm} ← arg min
β,γ
E



y − (
k m
βk b(x; γk ) + βb(x; γ))


2

• greedy optimization: let FM(x) = M
1 βmb(x; γm) be the
solution after M iterations; the successive approximations are
{βm, γm} = arg min
β,γ
E (y − (Fm−1(x) + βb(x; γ))2
→ matching pursuit; in classiﬁcation: kernel matching pursuit
Mallat, Zhang, Matching pursuit with time–frequency dictionaries, 1993
Vincent, Bengio, Kernel matching pursuit, 2002
Popovici, Thiran, Kernel matching pursuit for large datasets, 2005
From regression to classiﬁcation
• goal (for binary problems): estimate Pr(y = 1|x)
• logistic regression:
ln
Pr(y = 1|x)
Pr(y = −1|x)
= FM(x)
with FM(x) ∈ R.
• ⇔ p(x) = Pr(y = 1|x) =
exp(FM(x))
1+exp(FM(x))
• FM is obtained by minimizing the expected loss:
FM(x) = arg min
F
Ey,x [L(y, F(x))] = arg min
F
Ex Ey [L(y, F(x)))] |x
Generalized boosting algorithm
1: given {(xi, yi)|i = 1, . . . , N}, let F0(x) = f0(x)
2: for all m = 1, . . . , M do
3: compute the current negative gradient:
zi = − F L(F) F=Fm−1
= −
∂L(yi, F(xi))
∂F F=Fm−1(xi)
and ﬁt fm using the new set {(xi, zi)|i = 1, . . . , N}
4: ﬁnd the step–size
cm = arg min
c
N
i=1
L(yi, Fm−1(xi) + cfm(xi))
5: let Fm(x) = Fm−1(x) + cmfm(x)
6: end for
7: return ﬁnal classiﬁer sign [FM(x)]
Which loss function?
Loss functions
Loss
y F(x)
Exponential loss
L(y, F) = E e−yF(x)
Notes:
• L(y, F) is minimized at
F(x) =
1
2
ln
Pr(y = 1|x)
Pr(y = −1|x)
•
yF(x)
F is called margin of sample x ⇒ L(y, F) forces margin
maximization
• L is differentiable and an upper bound of 1[yF(x)<0]
• L has the same population minimizer as the binomial
log–likelihood
AdaBoost builds an additive logistic
regression model
1: let wi = 1/N
2: for all m = 1, . . . , M do
3: ﬁt the weak classiﬁer fm(x) ∈ {±1} using the weights wi on
the training data
4: errm = Ew 1[y fm(x)] { expectation with respect to weights! }
5: cm = ln 1−errm
errm
(note: cm = 2 arg minc L( m−1
1 fi + cfm))
6: update the weights
wi ← wi exp cm1[yi fm(xi)] , i = 1, . . . , N
and normalize such that w = 1
7: end for
8: return ﬁnal classiﬁer sign M
m=1 cmfm(x)
Real AdaBoost: stagewise optimization
of exponential loss
1: let wi = 1/N
2: for all m = 1, . . . , M do
3: ﬁt the weak classiﬁer using the weights wi on the training
data and obtain the posteriors
pm(x) = ˆPw(y = 1|x) ∈ [0, 1]
4: let fm(x) = 1
2 ln
pm(x)
1−pm(x)
{note: this is the local minimizer of L}
5: update the weights
wi ← wi exp (−yifm(xi)) , i = 1, . . . , N
and normalize such that w = 1
6: end for
7: return ﬁnal classiﬁer sign M
m=1 fm(x)
LogitBoost: stagewise opt. of binomial
log–likelihood
Let y∗ = (1 + y)/2 ∈ {0, 1} and
Pr(y∗ = 1|x) = p = exp(F(x))/(exp(F(x)) + exp(−F(x)))
1: let wi = 1/N, pi = 1/2, ∀i = 1, . . . , N, F(x) = 0
2: for all m = 1, . . . , M do
3: let zi =
y∗
i
−pi
pi(1−pi)
{ new responses, instead of y}
4: let wi = pi(1 − pi)
5: ﬁt fm by weighted least–square regression of zi to xi using
weights wi
6: update F ← F + 1/2fm
7: update p ← exp(F)/(exp(F) + exp(−F))
8: end for
9: return ﬁnal classiﬁer sign M
m=1 fm(x)
Which weak learner?
• any classiﬁer with an error rate < 0.5
• decision stumps (classiﬁcation tree with 1 node)
• classical classiﬁcation trees
• top scoring pairs classiﬁer
• linear (logistic) regression (an example later)
• radial basis functions
• . . .
Practical issues
• the weak classiﬁer should not be too strong
• AdaBoost or LogitBoost are good ﬁrst choices for
classiﬁcation problems
• stopping rules:
• quit when the weak classiﬁer cannot ﬁt the data anymore
• choose M by an inner cross–validation or independent data set
• use AIC, BIC, MDL as criteria for choosing M