PA196: Pattern Recognition
07. Decision trees
08. Multiple classifier systems
Dr. Vlad Popovici
popovici@recetox.muni.cz
RECETOX, Masaryk University, Brno

Outline
1 Decision trees: Introduction; CART; Other classification trees
2 Multiple classifier systems: Introduction; Fusion of label outputs; Fusion of continuous outputs; Classifier selection

Decision trees - Introduction

[DHS - Fig. 8.1: an example decision tree classifying fruit by color, size, shape and taste; the root is at level 0 and the leaves (Grapefruit, Watermelon, Banana, Apple, Lemon, Cherry, Grape) are reached after at most three tests]

• attributes can be continuous or nominal/categorical
• there is no need for a metric on the attribute space
• the interpretation is simple and the tree can be written as a logical proposition
• natural handling of multi-class problems
• different, equivalent trees may exist for the same problem...
• feature selection is embedded into the algorithm
• what if there are tens of thousands of features?
• how many levels should the tree have?

Classification And Regression Trees - CART

• we are given a training set S = {(x_i, y_i) | i = 1, ..., n}, where y_i codes the class g_i and x_i is an ordered collection of attributes
• a tree splits the training set into subsets
• the objective is to "grow" a tree such that the leaves are pure: all elements in a leaf belong to the same class

Issues:
• binary or multi-valued decisions in the nodes? (i.e. how many splits?)
• which attribute should be tested?
• when should a node be declared a leaf?
• which pruning strategy?
• if a leaf is impure, what is its label?
• how to handle missing values?

Number of splits

• design decision: what is the branching factor B of a node?
• B = 2: binary trees
• any tree can be transformed into a binary tree

[DHS - Fig. 8.2: the fruit tree of Fig. 8.1 rewritten as a binary tree, using yes/no tests such as "color = green?", "size = big?", "size = medium?", "color = yellow?", "shape = round?", "size = small?", "taste = sweet?"]

Attribute (variable) selection

Decision boundaries: single-variable binary decisions lead to boundaries that are piecewise orthogonal to the axes. Oblique boundaries can only be approximated (by large trees).

• for a node N we search for the attribute T that makes the descendant nodes as pure as possible
• impurity i(N): 0 if all elements belong to the same class, "large" if the classes are equally represented
• entropy impurity: i(N) = -Σ_i P(g_i) log2 P(g_i)
• (for binary classification) variance impurity: i(N) = P(g_1) P(g_2)
• Gini impurity (generalized variance impurity): i(N) = Σ_{i≠j} P(g_i) P(g_j) = 1 - Σ_i [P(g_i)]^2 (interpretation: expected error rate if the label is randomly selected from the class distribution present at N)
• misclassification impurity: i(N) = 1 - max_i P(g_i) (interpretation: minimum probability of a misclassification)

[DHS - Fig. 8.4: comparison of the impurity measures (misclassification, entropy, Gini/variance) for the two-class case, as functions of P = P(g_1); all reach their maximum at P = 0.5]
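To make the impurity measures concrete, below is a minimal Python/NumPy sketch (illustrative only; the function names are our own, not taken from any library) computing the entropy, Gini and misclassification impurities from the labels of the samples reaching a node.

```python
import numpy as np

def class_proportions(labels):
    """Estimate the class probabilities P(g_i) at a node from the labels
    of the training samples that reach it."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy_impurity(labels):
    """i(N) = -sum_i P(g_i) * log2 P(g_i)"""
    p = class_proportions(labels)
    return -np.sum(p * np.log2(p))

def gini_impurity(labels):
    """i(N) = 1 - sum_i P(g_i)^2 (expected error of a random labelling)"""
    p = class_proportions(labels)
    return 1.0 - np.sum(p ** 2)

def misclassification_impurity(labels):
    """i(N) = 1 - max_i P(g_i)"""
    p = class_proportions(labels)
    return 1.0 - p.max()

# a node reached by 8 samples of class 0 and 2 samples of class 1
node_labels = np.array([0] * 8 + [1] * 2)
print(entropy_impurity(node_labels))            # ~0.72
print(gini_impurity(node_labels))               # 0.32
print(misclassification_impurity(node_labels))  # 0.2
```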
How to choose the test at the node N?

• heuristic (greedy approach): choose the test that maximizes the decrease in impurity of the descendent nodes: Δi(N) = i(N) - P_L i(N_L) - (1 - P_L) i(N_R), where N_L and N_R are the two (left and right) descendent nodes and P_L is the fraction of examples that go to the left subtree
• one has to find the attribute (variable) T to test and the threshold value that maximizes Δi(N) (a small search sketch is given at the end of this part)
• in general, the entropy or Gini impurity functions are preferred, but the choice makes little difference to the final quality of the classifier
• finding the optimal threshold may involve an optimization process for continuous variables
• for categorical variables, the optimal value is found by exhaustive search
• the optimum is local and may not be unique
• the misclassification impurity is not always decreasing under splitting
• there are algorithms that allow multiway splits

Stopping criteria

• if stopped too early: not enough accuracy; if too late: overfitting
• use only a part of the data for growing the tree and the rest for estimating its error rate (either a single split of the training set or in a cross-validation manner); grow the tree as long as the error rate (on the validation set) decreases
• or: grow the tree as long as the reduction in impurity is above a threshold
• or: grow the tree as long as there are more than a certain number of elements in every leaf
• or: split until a minimum of α·size + Σ_{leaf nodes N} i(N) is reached (an MDL-like criterion)

Alternative approach:
• try to assess the statistical significance of the reduction in impurity
• different tests (e.g. χ²) can be used
• one can also build an empirical distribution of Δi from the nodes already in the tree (once enough nodes are available)
• etc.

Tree pruning

• horizon effect: the split decision at a node does not "see" the decisions in the descendent nodes
• tree pruning is the opposite strategy to early stopping
• the tree is grown to its fullest, and then leaves or even internal nodes are joined
• these actions try to optimize a global cost function
• the approach is much more computationally expensive than early stopping
• alternative: use propositional logic to simplify the rules expressed by the tree: remove irrelevant rules and try to improve classification performance on a validation set

Label assignment for the leaves

• if a leaf is pure, the label is clear
• if i(N) > 0, the majority rule is used
• leaf purity is not the most important criterion: fully pure leaves may indicate overfitting or over-sensitivity to small changes in the training data (noise)

Other issues

• an approximation of the training complexity: O(d n² log n)
• multivariable decisions

[DHS - Fig. 8.5: a tree built from single-variable tests (x1 < 0.27, x2 < 0.32, x1 < 0.07, x2 < 0.6, x1 < 0.55, x2 < 0.86, x1 < 0.81) produces a staircase-like boundary between the regions R1 and R2, while a single multivariable test (-1.2 x1 + x2 < 0.1) yields a much simpler, oblique boundary]

Other classical tree methods:
• ID3 - Iterative Dichotomiser 3 - intended for nominal attributes
  • real values are quantized and treated as nominal
  • the branching factor is usually > 2
• C4.5 - successor and refinement of ID3
  • combines techniques from CART and ID3
  • real values are treated as in CART
  • nominal values generate multiple splits, as in ID3
  • special method for pruning the rules

A slightly different tree... [figure-only slide]
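Returning to the greedy split selection for CART, here is a minimal Python/NumPy sketch (names are our own) of the threshold search for a single continuous attribute, maximizing Δi(N) = i(N) − P_L·i(N_L) − (1 − P_L)·i(N_R); a full implementation would repeat this search over all attributes and recurse on the resulting subsets.

```python
import numpy as np

def gini_impurity(labels):
    """i(N) = 1 - sum_i P(g_i)^2, as in the previous sketch."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y, impurity=gini_impurity):
    """Greedy CART-style search over one continuous attribute: return the
    threshold t and the impurity decrease maximizing
    i(N) - P_L * i(N_L) - (1 - P_L) * i(N_R) for the split x <= t."""
    i_parent = impurity(y)
    values = np.unique(x)
    # candidate thresholds: midpoints between consecutive distinct values
    candidates = (values[:-1] + values[1:]) / 2.0
    best_t, best_gain = None, -np.inf
    for t in candidates:
        left, right = y[x <= t], y[x > t]
        p_left = len(left) / len(y)
        gain = i_parent - p_left * impurity(left) - (1.0 - p_left) * impurity(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# toy 1-D attribute: the two classes roughly separate around x = 0.5
rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(0.0, 0.55, 50), rng.uniform(0.45, 1.0, 50)])
y = np.array([0] * 50 + [1] * 50)
print(best_threshold(x, y))   # a threshold close to 0.5 and a large impurity decrease
```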
Multiple classifier systems - Introduction

Why combining classifiers?

• obviously, in the hope of improving the overall accuracy
• instead of looking for the "best" classifier, we look for the "best" way to combine some "reasonable" classifiers
• but there are also other reasons: statistical, computational and representational

Statistical perspective: aggregating several "estimates" (classifiers) may bring us closer to the best classifier for the problem at hand. [figure: the "good" classifiers h1, ..., h4 surround the best classifier h* in the classifier space]

Computational perspective: the various classifiers may represent only local optima in the classifier space, hence their combination may give a better approximation of the global optimum. [figure: h1, ..., h4 as local optima around h*]

Representational perspective: the space of classifiers chosen when modeling the problem may not contain the best classifier at all. [figure: h* lies outside the chosen classifier space containing h1, ..., h4]

Different levels of combining classifiers

• data level: different subsets of the training set are used to train the base classifiers
• feature level: different subsets of features are used for the base classifiers
• classifier level: use different base classifiers
• combiner level: use various combiners
• others: ECOC (error-correcting output codes): change the labels of the examples...

[figure: a data set feeding L base classifiers h1, h2, ..., hL, whose outputs are merged by a combiner]

Classifier fusion vs. selection

• classifier fusion: each base learner has knowledge of the whole feature space
• classifier selection: the base learners have different domains of competence (sets of features)
• fusion: combiners based on majority vote, weighted means, etc.
• cascades of classifiers: a special case of classifier selection

Decision optimization vs. coverage optimization

• decision optimization: optimize the combiner for a fixed set of base learners
• coverage optimization: fix the combiner and find the best set of base learners

Trainable vs. non-trainable ensembles

• non-trainable combiners: e.g. majority vote
• trainable combiners: may take into account, for example, the reliability of the base learners
• or build the combiner as the ensemble is developed (e.g. AdaBoost)

Fusion of label outputs

We consider a set of classifiers h_1, ..., h_L : R^d → G. The goal is to construct a combiner (classifier) H : G^L → G. The space G^L is called the intermediate feature space (a small construction sketch follows the list of output types below).

Types of classifier outputs:
• type 0: the only information about the output of classifier h_i is whether it is correct or not. For a data set S, classifier h_i produces an output vector with one element per point x_k ∈ S: [y_ik], where y_ik = 1 if h_i classifies x_k correctly and y_ik = 0 otherwise
• type 1: the classifier h_i produces a class label g_j for any input vector x
• type 2: each classifier produces an ordered list of possible class labels (a subset of G), from the most plausible to the least plausible
• type 3: each classifier outputs a vector [f_1, ..., f_C] ∈ R^C whose values indicate the support for the hypothesis that x belongs to each of the C = |G| classes
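As a small illustration of the intermediate feature space, the sketch below (toy, hand-made threshold "classifiers"; nothing here comes from a real library) stacks the type-1 (label) outputs of L = 3 base classifiers into a vector of G^L, the object on which a combiner H operates.

```python
import numpy as np

# three toy base classifiers h_i : R^2 -> G = {0, 1} (hand-made threshold rules)
def h1(x): return int(x[0] > 0.5)
def h2(x): return int(x[1] > 0.5)
def h3(x): return int(x[0] + x[1] > 1.0)

base_classifiers = [h1, h2, h3]

def intermediate_features(x, classifiers):
    """Map an input x to the intermediate feature space G^L: the vector of
    the L labels produced by the base classifiers (type-1 outputs)."""
    return np.array([h(x) for h in classifiers])

x = np.array([0.7, 0.2])
z = intermediate_features(x, base_classifiers)
print(z)   # [1 0 0] -- a point of G^L, the input of the combiner H : G^L -> G
```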
Majority vote

• variants: unanimity, majority and plurality
• let [c_i1, ..., c_iC] ∈ {0, 1}^C be the vector associated with classifier h_i: c_ik is 1 if h_i assigns x to class g_k
• the plurality vote can be written as: assign x to g_k if Σ_{i=1..L} c_ik = max_{j=1,...,C} Σ_{i=1..L} c_ij
• depending on the patterns of success and failure of the individual classifiers, the majority vote can significantly improve the overall performance...
• ...but it can also decrease it with respect to the performance of the best base classifier

Weighted majority vote

• idea: give more weight to the better base classifiers
• the discriminant function for class g_i has the form H_i(x) = Σ_{j=1..L} b_j c_ji, where c_ji indicates that classifier h_j votes for class g_i
• if the L base classifiers are independent, with individual accuracies p_1, ..., p_L, then the accuracy of the ensemble is maximized if the weights are chosen as b_i ∝ ln(p_i / (1 - p_i)) (see the code sketch at the end of this part)

Other methods for combining labels

• Naive Bayes: assumes conditional independence of the classifiers and tries to produce an estimate of the posterior probability based on the probabilities of assignment from each individual classifier
• multinomial methods try to estimate the posterior probability for each possible combination of labels produced by the base classifiers
• combination based on SVD uses correspondence analysis in the intermediate feature space
• etc.

Fusion of continuous outputs

• the output of the base classifier is interpreted as a degree of confidence in the class assignment: either a confidence or a posterior probability
• the combiner tries to estimate the support (evidence) for each class
• let DP(x) be the L × C decision profile matrix, DP(x) = [ h_11(x) ... h_1j(x) ... h_1C(x) ; ... ; h_i1(x) ... h_ij(x) ... h_iC(x) ; ... ], with the i-th row corresponding to the output of h_i and the j-th column collecting the evidence from all classifiers in favor of class g_j

Class-conscious combiners:
• non-trainable (i.e. there are no parameters to optimize): compute the support μ_j for class g_j as
  • the average over DP(x)_{·,j} (the j-th column)
  • the minimum/maximum/median over DP(x)_{·,j}
  • a trimmed mean, product, or some other mean over DP(x)_{·,j}
• trainable:
  • various weighted means
  • fuzzy integral: also measures the "strength" of all subsets of classifiers

Class-indifferent combiners:
• decision templates: build a "typical" (template) decision profile for each class and then compare the current decision profile with the templates
• the comparison can use the Euclidean, Minkowski, city-block etc. metrics
• one can also try a k-NN in the intermediate feature space
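The sketch below (Python/NumPy, illustrative only; names are our own) implements two of the combiners discussed in this part: the weighted majority vote with weights b_i ∝ ln(p_i / (1 − p_i)) for label outputs, and simple class-conscious fusion (mean and product rules) over the columns of the decision profile DP(x).

```python
import numpy as np

def weighted_majority_vote(labels, accuracies, n_classes):
    """Fuse label (type-1) outputs, weighting classifier i by
    b_i = ln(p_i / (1 - p_i)), with p_i its estimated accuracy."""
    p = np.asarray(accuracies, dtype=float)
    weights = np.log(p / (1.0 - p))
    support = np.zeros(n_classes)
    for label, w in zip(labels, weights):
        support[label] += w
    return int(np.argmax(support))

def fuse_decision_profile(dp, rule="mean"):
    """Class-conscious fusion of continuous outputs: combine the columns of
    the L x C decision profile DP(x) and pick the class with largest support."""
    if rule == "mean":
        mu = dp.mean(axis=0)
    elif rule == "product":
        mu = dp.prod(axis=0)
    else:
        raise ValueError("unknown rule: " + rule)
    return int(np.argmax(mu))

# three base classifiers, two classes
labels = [1, 0, 1]                 # label outputs of h_1, h_2, h_3 for some x
accuracies = [0.85, 0.60, 0.70]    # their estimated individual accuracies
print(weighted_majority_vote(labels, accuracies, n_classes=2))   # -> 1

dp = np.array([[0.20, 0.80],       # decision profile DP(x): row i = output of h_i
               [0.60, 0.40],
               [0.45, 0.55]])
print(fuse_decision_profile(dp, "mean"))      # -> 1 (mean supports: 0.42 vs 0.58)
print(fuse_decision_profile(dp, "product"))   # -> 1
```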
Classifier selection

• idea: build "expert" classifiers for some subdomains (subspaces of the original space) and find a way to identify which base classifier should take the decision for each new input x
• the regions of competence of the base classifiers can be estimated either dynamically or statically
• there are different ways of combining their outputs: fusion or selection
• stochastic selection: select the label for x by sampling from {h_1, ..., h_L} according to some distribution p_1(x), ..., p_L(x)
• or, choose the classifier with the highest p_i(x)
• or, use a weighted average...
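Finally, a minimal sketch of dynamic classifier selection under simple assumptions: the local competence p_i(x) of each base classifier is estimated as its accuracy over the k validation points nearest to x, and the most competent classifier labels x. The helper names and the toy data are our own.

```python
import numpy as np

# the same toy base classifiers as in the earlier sketch
def h1(x): return int(x[0] > 0.5)
def h2(x): return int(x[1] > 0.5)
def h3(x): return int(x[0] + x[1] > 1.0)

def select_and_classify(x, classifiers, X_val, y_val, k=5):
    """Dynamic classifier selection: estimate the local competence p_i(x) of each
    base classifier as its accuracy on the k validation points nearest to x,
    then let the most competent classifier label x."""
    dists = np.linalg.norm(X_val - x, axis=1)
    nn = np.argsort(dists)[:k]                        # k nearest neighbours of x
    competence = [np.mean([h(X_val[j]) == y_val[j] for j in nn])
                  for h in classifiers]
    best = int(np.argmax(competence))                 # or sample i with prob. ~ p_i(x)
    return classifiers[best](x), best

rng = np.random.default_rng(1)
X_val = rng.uniform(0.0, 1.0, size=(100, 2))
y_val = (X_val.sum(axis=1) > 1.0).astype(int)         # toy "ground truth"
label, chosen = select_and_classify(np.array([0.9, 0.3]), [h1, h2, h3], X_val, y_val)
print(label, chosen)   # predicted label and index of the locally most competent classifier
```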