PA196: Pattern Recognition
07. Decision trees
08. Multiple classifier systems
Dr. Vlad Popovici, popovici@iba.muni.cz
Institute of Biostatistics and Analyses, Masaryk University, Brno

Outline
1 Decision trees: Introduction; CART; Other classification trees
2 Multiple classifier systems: Introduction; Fusion of label outputs; Fusion of continuous outputs; Classifier selection

Decision trees: Introduction

[DHS, Fig. 8.1: a decision tree for fruit classification; the root (level 0) tests Color?, the next levels test Size?, Shape? and Taste?, and the leaves carry labels such as Watermelon, Apple, Grapefruit, Lemon, Grape, Cherry and Banana.]

- attributes can be continuous or nominal/categorical
- there is no need for a metric
- the interpretation is simple and can be written as a logical proposition
- natural handling of multi-class problems
- different, equivalent trees may exist for the same problem...
- feature selection is embedded into the algorithm
- open questions: what if there are tens of thousands of features? how many levels should the tree have?

Decision trees: CART

Classification And Regression Trees (CART)
- we are given a training set $S = \{(x_i, y_i) \mid i = 1, \dots, n\}$, where $y_i$ codes the class $g_i$ and $x_i$ is an ordered collection of attributes
- a tree splits the training set into subsets
- the objective is to "grow" a tree such that the leaves are pure: all elements in a leaf belong to the same class

Issues:
- binary or multi-valued decisions in the nodes (i.e. how many splits)?
- which attribute should be tested?
- when should a node be declared a leaf?
- which pruning strategy?
- if a leaf is impure, what is its label?
- how to handle missing values?

Number of splits
- design decision: what is the branching factor B of a node?
- B = 2: binary trees
- any tree can be transformed into an equivalent binary tree

[DHS, Fig. 8.2: the fruit tree of Fig. 8.1 rewritten as a binary tree of yes/no tests such as color = green?, size = big?, size = medium?, color = yellow?, shape = round?, size = small?, taste = sweet?]
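To make the "tree as a set of nested logical tests" view concrete, here is a minimal Python sketch of a binary decision-tree node and its prediction routine; the tiny hand-built tree only mimics a fragment of the fruit example and is purely illustrative (the Node/predict names and the particular tests are my own, not a transcription of the DHS figure).

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    # A leaf stores a class label; an internal node stores a boolean test
    # plus its "yes" and "no" subtrees.
    label: Optional[str] = None
    test: Optional[Callable[[dict], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None

def predict(node: Node, x: dict) -> str:
    """Route the sample x down the tree until a leaf is reached."""
    while node.label is None:
        node = node.yes if node.test(x) else node.no
    return node.label

# A tiny hand-built fragment inspired by the fruit example (illustrative only).
tree = Node(
    test=lambda x: x["color"] == "green",
    yes=Node(test=lambda x: x["size"] == "big",
             yes=Node(label="Watermelon"),
             no=Node(label="Apple")),
    no=Node(test=lambda x: x["color"] == "yellow",
            yes=Node(label="Banana"),
            no=Node(label="Cherry")),
)

print(predict(tree, {"color": "green", "size": "big"}))  # -> Watermelon
```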
Attribute (variable) selection

Decision boundaries: single-variable binary decisions lead to boundaries that are piecewise orthogonal to the axes. Oblique boundaries can only be approximated (by large trees).

[Figure: two partitions of the $(x_1, x_2)$ plane into regions $R_1$ and $R_2$ by axis-parallel splits.]

- for a node N, we search for the attribute T that would make the descendant nodes as pure as possible
- impurity $i(N)$: 0 if all elements belong to the same class, "large" if the classes are equally represented
- entropy impurity: $i(N) = -\sum_i P(g_i) \log_2 P(g_i)$
- variance impurity (for binary classification): $i(N) = P(g_1)\,P(g_2)$
- Gini impurity (generalized variance impurity):
  $i(N) = \sum_{i \neq j} P(g_i)\,P(g_j) = 1 - \sum_i [P(g_i)]^2$
  (interpretation: the expected error rate if the label were drawn at random from the class distribution present at N)
- misclassification impurity: $i(N) = 1 - \max_i P(g_i)$
  (interpretation: the minimum probability of misclassification at N)

[DHS, Fig. 8.4: comparison of the impurity measures for the two-class case: misclassification, entropy and Gini/variance impurity as functions of $P \in [0, 1]$, all maximal at $P = 0.5$.]

How to choose the test at node N?
- heuristic (greedy) approach: choose the test that maximizes the decrease in impurity of the descendant nodes:
  $\Delta i(N) = i(N) - P_L\, i(N_L) - (1 - P_L)\, i(N_R)$
  where $N_L$ and $N_R$ are the two (left and right) descendant nodes and $P_L$ is the fraction of examples that go to the left subtree
- one has to find the attribute (variable) T to test and the threshold value that maximize $\Delta i(N)$
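To tie the impurity definitions to the split-selection rule, here is a short NumPy sketch of the three impurity functions and a brute-force search for the threshold on a single continuous attribute that maximizes $\Delta i(N)$; the function names and the midpoint-scanning strategy are my own choices, not part of CART's specification.

```python
import numpy as np

def class_probs(y):
    """Class distribution P(g_i) at a node, given its label vector y."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def entropy_impurity(y):
    p = class_probs(y)
    return -np.sum(p * np.log2(p))

def gini_impurity(y):
    p = class_probs(y)
    return 1.0 - np.sum(p ** 2)

def misclassification_impurity(y):
    return 1.0 - class_probs(y).max()

def best_threshold(x, y, impurity=gini_impurity):
    """Scan candidate thresholds on one continuous attribute and return the
    one maximizing the impurity decrease Delta i(N)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    parent = impurity(y)
    best_gain, best_thr = -np.inf, None
    for k in range(1, len(x)):
        if x[k] == x[k - 1]:
            continue                          # no usable threshold between ties
        thr = 0.5 * (x[k] + x[k - 1])          # midpoint candidate
        p_left = k / len(x)
        gain = parent - p_left * impurity(y[:k]) - (1 - p_left) * impurity(y[k:])
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain

# toy usage
x = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold(x, y))   # threshold 0.55, impurity decrease 0.5
```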
Remarks:
- in general, the entropy or the Gini impurity is preferred; however, the choice makes little difference to the final quality of the classifier
- finding the optimal threshold may involve an optimization process for continuous variables
- for categorical variables, the optimal value is found by exhaustive search
- the optimum is local and may not be unique
- the misclassification impurity is not always decreasing under splitting
- there are algorithms that allow multiway splits

Stopping criteria
- if growth stops too early: not enough accuracy; if too late: overfitting
- use only a part of the data for growing the tree and the rest for estimating its error rate (either a single split of the training set or a cross-validation scheme); grow the tree as long as the error rate on the validation set decreases
- or: grow the tree as long as the reduction in impurity is above a threshold
- or: grow the tree as long as there are more than a certain number of elements in every leaf
- or: split until a minimum of $\alpha \cdot \mathrm{size} + \sum_{\text{leaf nodes}} i(N)$ is reached (a kind of MDL criterion)

Alternative approach: try to assess the statistical significance of the reduction in impurity
- different tests (e.g. $\chi^2$) can be used
- one can also build an empirical distribution for $\Delta i$ from the nodes already in the tree (once several nodes are available), etc.

Tree pruning
- horizon effect: the split decision at a node does not "see" the decisions made in its descendant nodes
- tree pruning is the opposite strategy to early stopping: the tree is grown to its fullest, and then leaves or even internal nodes are merged
- these actions try to optimize a global cost function
- the approach is much more computationally expensive than early stopping
- alternative: use propositional logic to simplify the rules expressed by the tree: remove irrelevant rules and try to improve the classification performance on a validation set

Label assignment for the leaves
- if a leaf is pure, the label is clear
- if $i(N) > 0$, the majority rule is used
- leaf purity is not the most important criterion: it may indicate overfitting or over-sensitivity to small changes in the training data (noise)

Other issues
- an approximation of the training complexity: $O(d\, n^2 \log n)$
- multivariable decisions are possible, e.g. a test of the form $-1.2\, x_1 + x_2 < 0.1$

[DHS, Fig. 8.5: a tree of axis-parallel tests on $x_1$ and $x_2$ (e.g. $x_1 < 0.27$, $x_2 < 0.32$, ...) partitioning the unit square into regions $R_1$ and $R_2$, compared with the simpler partition obtained when the oblique test $-1.2\, x_1 + x_2 < 0.1$ is allowed.]

Decision trees: Other classification trees

Other classical tree methods:
- ID3 (Iterative Dichotomiser 3): intended for nominal attributes; real values are quantized and used as nominal; the branching factor is usually > 2
- C4.5: successor and refinement of ID3; combines techniques from CART and ID3; real values are treated as in CART; nominal values generate multiple splits, as in ID3; a special method for pruning the rules

A slightly different tree...
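In practice, the design choices above (impurity function, stopping rules, cost-complexity pruning) are exposed as hyperparameters by library implementations; the following scikit-learn sketch is only illustrative, with an arbitrary data set and arbitrary parameter values rather than recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(
    criterion="gini",        # impurity function: "gini" or "entropy"
    min_samples_leaf=5,      # early-stopping rule: minimum leaf size
    max_depth=4,             # early-stopping rule: maximum number of levels
    ccp_alpha=0.01,          # cost-complexity pruning (size penalty)
    random_state=0,
)
clf.fit(X_tr, y_tr)
print("validation accuracy:", clf.score(X_val, y_val))
print(export_text(clf))      # the fitted tree written out as logical rules
```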
Multiple classifier systems: Introduction

Why combine classifiers?
- obviously, in the hope of improving the overall accuracy
- instead of looking for the "best" classifier, we look for the "best" way of combining some "reasonable" classifiers
- but there are other reasons as well: statistical, computational and representational

Statistical perspective: aggregating several "estimates" (classifiers) may bring us closer to the best classifier for the problem at hand.
[Figure: classifiers $h_1, \dots, h_4$ scattered in the classifier space around the best classifier $h^*$; their aggregate lies closer to $h^*$ than a single "good" classifier does.]

Computational perspective: the various classifiers may represent only local optima in the classifier space; hence, their combination may give a better approximation of the global optimum.
[Figure: $h_1, \dots, h_4$ as local optima in the classifier space, with $h^*$ the global optimum.]

Representational perspective: the space of classifiers chosen when modeling the problem may not contain the best classifier at all.
[Figure: $h^*$ lying outside the region of the classifier space covered by $h_1, \dots, h_4$.]

Different levels of combining classifiers:
- data level: different subsets of the training set are used to train the base classifiers
- feature level: different subsets of features are used by the base classifiers
- classifier level: different base classifiers are used
- combiner level: various combiners are used
- others: ECOC (error-correcting output codes): change the labels of the examples...
[Figure: a data set feeding L base classifiers $h_1, h_2, \dots, h_L$.]

Classifier fusion vs. selection
- classifier fusion: each base learner has knowledge of the whole feature space
- classifier selection: the base learners have different domains of competence (sets of features)
- fusion: combiners based on majority vote, weighted means, etc.
- cascades of classifiers: a special case of classifier selection

Decision optimization vs. coverage optimization
- decision optimization: optimize the combiner for a fixed set of base learners
- coverage optimization: fix the combiner and find the best set of base learners

Trainable vs. non-trainable ensembles
- non-trainable combiners: e.g. majority vote
- trainable combiners: may take into account, for example, the reliability of the base learners, or build the combiner as the ensemble is developed (e.g. AdaBoost, sketched below)
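As an illustration of a combiner built while the ensemble is grown, here is a minimal, simplified sketch of the discrete AdaBoost idea for two classes (labels in {-1, +1}), using scikit-learn decision stumps as base learners; this is my own compact rendering, not the exact formulation used in the course.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=20):
    """Minimal discrete AdaBoost for labels y in {-1, +1} (NumPy arrays),
    with decision stumps as base learners. Returns the stumps and their
    voting weights."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # example weights, updated each round
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)
        if err >= 0.5:                      # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        w *= np.exp(-alpha * y * pred)      # up-weight the misclassified examples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted vote of the base learners (sign of the weighted sum)."""
    agg = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(agg)

# usage sketch (X: (n, d) array, y: array of -1/+1 labels):
# stumps, alphas = adaboost_fit(X, y, n_rounds=50)
# y_hat = adaboost_predict(stumps, alphas, X_new)
```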
Multiple classifier systems: Fusion of label outputs

We consider a set of classifiers $h_1, \dots, h_L : \mathbb{R}^d \to G$. The goal is to construct a combiner (classifier) $H : G^L \to G$. The space $G^L$ is called the intermediate feature space.

Types of classifier outputs:
- type 0: the only information about the output of classifier $h_i$ is whether it is correct or not; for a data set S, the classifier $h_i$ produces an output vector $[y_{ik}]$ (one element for each point $x_k \in S$), with $y_{ik} = 1$ if $h_i$ classifies $x_k$ correctly and $y_{ik} = 0$ otherwise
- type 1: the classifier $h_i$ produces a class label $g_j$ for any input vector x
- type 2: each classifier produces an ordered list of possible class labels (a subset of G), from the most plausible to the least plausible
- type 3: each classifier outputs a vector $[f_1, \dots, f_C] \in \mathbb{R}^C$ whose values indicate the support for the hypothesis that x belongs to each of the $C = |G|$ classes

Majority vote
- vote types: unanimity, majority and plurality
- let $[c_{i1}, \dots, c_{iC}] \in \{0, 1\}^C$ be the vector associated with classifier $h_i$: $c_{ik} = 1$ if $h_i$ assigns x to class $g_k$
- the plurality vote can be written as: assign x to $g_k$ if
  $\sum_{i=1}^{L} c_{ik} = \max_{j=1,\dots,C} \sum_{i=1}^{L} c_{ij}$
[Figure: the relation between plurality, majority and unanimity voting.]
- depending on the patterns of success and failure of the individual classifiers, the majority vote can significantly improve the overall performance...
- ...but it can also decrease it with respect to the performance of the best base classifier

Weighted majority vote
- idea: give more weight to the better base classifiers
- the discriminant function for class $g_j$ has the form $H_j(x) = \sum_{i=1}^{L} b_i\, c_{ij}$, where $b_i$ is the weight given to classifier $h_i$
- if the L base classifiers are independent, with individual accuracies $p_1, \dots, p_L$, then the accuracy of the ensemble is maximized if the weights are chosen as
  $b_i \propto \ln \frac{p_i}{1 - p_i}$
(a NumPy sketch of both voting rules follows)
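In this sketch, labels are assumed to be integers $0, \dots, C-1$, `votes` is an $L \times n_{\text{samples}}$ array of base-classifier outputs, and the accuracies $p_i$ would in practice be estimated on a validation set; all names are my own.

```python
import numpy as np

def plurality_vote(votes, n_classes):
    """votes: (L, n_samples) array of integer labels in {0, ..., n_classes-1}.
    Returns, for each sample, the label receiving the most votes."""
    n_samples = votes.shape[1]
    counts = np.zeros((n_classes, n_samples))
    for v in votes:                                  # one row per classifier
        counts[v, np.arange(n_samples)] += 1
    return counts.argmax(axis=0)

def weighted_majority_vote(votes, accuracies, n_classes):
    """Same as above, but classifier i votes with weight b_i = ln(p_i/(1-p_i))."""
    p = np.asarray(accuracies, dtype=float)
    b = np.log(p / (1.0 - p))
    n_samples = votes.shape[1]
    counts = np.zeros((n_classes, n_samples))
    for v, bi in zip(votes, b):
        counts[v, np.arange(n_samples)] += bi
    return counts.argmax(axis=0)

# toy example: 3 classifiers, 4 samples, 2 classes
votes = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [1, 1, 0, 1]])
print(plurality_vote(votes, 2))                        # [0 1 0 0]
print(weighted_majority_vote(votes, [0.9, 0.7, 0.6], 2))  # [0 1 1 0]: the most
# accurate classifier overrides the other two on the third sample
```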
Other methods for combining labels
- naive Bayes: assumes conditional independence of the classifiers and tries to produce an estimate of the posterior probability based on the probabilities of assignment from each individual classifier
- multinomial methods: try to estimate the posterior probability for each possible combination of labels produced by the base classifiers
- combination based on SVD: uses correspondence analysis in the intermediate feature space
- etc.

Multiple classifier systems: Fusion of continuous outputs

- the output of a base classifier is interpreted as a degree of confidence in the class assignment: either a confidence value or a posterior probability
- the combiner tries to estimate the support (evidence) for each class
- let $DP(x)$ be the decision profile matrix:
$$DP(x) = \begin{bmatrix} h_{11}(x) & \dots & h_{1j}(x) & \dots & h_{1C}(x) \\ \vdots & & \vdots & & \vdots \\ h_{i1}(x) & \dots & h_{ij}(x) & \dots & h_{iC}(x) \\ \vdots & & \vdots & & \vdots \\ h_{L1}(x) & \dots & h_{Lj}(x) & \dots & h_{LC}(x) \end{bmatrix}$$
with the $i$-th row corresponding to the output of $h_i$ and the $j$-th column collecting the evidence from all classifiers in favor of class $g_j$

Class-conscious combiners:
- non-trainable (i.e. there are no parameters to optimize): compute the support $\mu_j$ for class $g_j$ as
  - the average over $DP(x)_{\cdot,j}$ (the $j$-th column)
  - the minimum / maximum / median over $DP(x)_{\cdot,j}$
  - a trimmed mean, the product, or some other mean over $DP(x)_{\cdot,j}$
- trainable:
  - various weighted means
  - the fuzzy integral: also measures the "strength" of all subsets of classifiers
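A small sketch of how the decision profile and the simple non-trainable, class-conscious rules can be implemented; the array shapes and function names are my own conventions, with the per-class supports assumed to come from the base classifiers' probabilistic outputs.

```python
import numpy as np

def decision_profile(prob_outputs):
    """prob_outputs: list of L arrays, each of shape (n_samples, C), giving
    each base classifier's per-class support. Returns an array of shape
    (n_samples, L, C): one decision profile DP(x) per sample."""
    return np.stack(prob_outputs, axis=1)

def combine(dp, rule="mean"):
    """Class-conscious combiners: the support mu_j for class g_j aggregates
    the j-th column of DP(x) across the L classifiers."""
    rules = {"mean": np.mean, "min": np.min, "max": np.max,
             "median": np.median, "product": np.prod}
    support = rules[rule](dp, axis=1)        # shape (n_samples, C)
    return support.argmax(axis=1)            # predicted labels

# toy example: L = 2 classifiers, 1 sample, C = 3 classes
p1 = np.array([[0.6, 0.3, 0.1]])
p2 = np.array([[0.2, 0.5, 0.3]])
dp = decision_profile([p1, p2])
print(combine(dp, "mean"))     # mean supports [0.4, 0.4, 0.2]; tie broken by argmax -> class 0
print(combine(dp, "product"))  # product supports [0.12, 0.15, 0.03] -> class 1
```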
Class-indifferent combiners:
- decision templates: build a "typical" (template) decision profile for each class and then compare the current decision profile with the templates
- the comparison can use the Euclidean, Minkowski, city-block, etc. metrics
- one can also try a k-NN classifier in the intermediate feature space

Multiple classifier systems: Classifier selection

- idea: build "expert" classifiers for some subdomains (subspaces of the original feature space) and find a way to identify which base classifier should take the decision for each new input x
- the regions of competence of the base classifiers can be estimated either dynamically or statically
- different ways of combining their outputs: fusion or selection
- stochastic selection: select the label for x by sampling from $\{h_1, \dots, h_L\}$ according to some distribution $p_1(x), \dots, p_L(x)$
- or: choose the classifier with the highest $p_i(x)$
- or: use a weighted average...
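One common way to estimate regions of competence dynamically is by local accuracy: for a new input x, select the base classifier that performs best on the k validation points nearest to x. The sketch below is a simplified version of this idea; it assumes scikit-learn-style classifiers with a predict method, and the names and the choice of Euclidean distance are mine.

```python
import numpy as np

def dcs_local_accuracy(x, classifiers, X_val, y_val, k=10):
    """Dynamic classifier selection by local accuracy: among the k validation
    points nearest to x, pick the base classifier with the highest accuracy
    and return its prediction for x."""
    d = np.linalg.norm(X_val - x, axis=1)           # Euclidean distances to x
    nn = np.argsort(d)[:k]                          # indices of the k nearest
    accs = [np.mean(clf.predict(X_val[nn]) == y_val[nn]) for clf in classifiers]
    best = int(np.argmax(accs))                     # the locally best "expert"
    return classifiers[best].predict(x.reshape(1, -1))[0]
```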