PA196: Pattern Recognition
07. Decision trees
08. Multiple classifier systems
Dr. Vlad Popovici, popovici@iba.muni.cz
Institute of Biostatistics and Analyses, Masaryk University, Brno

Outline
1 Decision trees: Introduction; CART; Other classification trees
2 Multiple classifier systems: Introduction; Fusion of label outputs; Fusion of continuous outputs; Classifier selection

Decision trees: Introduction

[DHS, Fig. 8.1: a decision tree for fruit classification; the root (level 0) tests Color?, the next levels test Size?, Shape? and Taste?, and the leaves carry labels such as Watermelon, Apple, Grapefruit, Lemon, Grape, Cherry and Banana.]

- attributes can be continuous or nominal/categorical
- there is no need for a metric
- the interpretation is simple and can be written as a logical proposition
- natural handling of multi-class problems
- different, equivalent trees may exist for the same problem...
- feature selection is embedded into the algorithm
- open questions: what if there are tens of thousands of features? how many levels should the tree have?

Decision trees: CART

Classification And Regression Trees (CART)
- we are given a training set $S = \{(x_i, y_i) \mid i = 1, \dots, n\}$, where $y_i$ codes the class $g_i$ and $x_i$ is an ordered collection of attributes
- a tree splits the training set into subsets
- the objective is to "grow" a tree such that the leaves are pure: all elements in a leaf belong to the same class

Issues:
- binary or multi-valued decisions in the nodes (i.e. how many splits)?
- which attribute should be tested?
- when should a node be declared a leaf?
- which pruning strategy?
- if a leaf is impure, what is its label?
- how to handle missing values?

Number of splits
- design decision: what is the branching factor B of a node?
- B = 2: binary trees
- any tree can be transformed into an equivalent binary tree

[DHS, Fig. 8.2: the fruit tree of Fig. 8.1 rewritten as a binary tree of yes/no tests such as color = green?, size = big?, size = medium?, color = yellow?, shape = round?, size = small?, taste = sweet?]
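To make the "tree as a set of nested logical tests" view concrete, here is a minimal Python sketch of a binary decision-tree node and its prediction routine; the tiny hand-built tree only mimics a fragment of the fruit example and is purely illustrative (the Node/predict names and the particular tests are my own, not a transcription of the DHS figure).

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    # A leaf stores a class label; an internal node stores a boolean test
    # plus its "yes" and "no" subtrees.
    label: Optional[str] = None
    test: Optional[Callable[[dict], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None

def predict(node: Node, x: dict) -> str:
    """Route the sample x down the tree until a leaf is reached."""
    while node.label is None:
        node = node.yes if node.test(x) else node.no
    return node.label

# A tiny hand-built fragment inspired by the fruit example (illustrative only).
tree = Node(
    test=lambda x: x["color"] == "green",
    yes=Node(test=lambda x: x["size"] == "big",
             yes=Node(label="Watermelon"),
             no=Node(label="Apple")),
    no=Node(test=lambda x: x["color"] == "yellow",
            yes=Node(label="Banana"),
            no=Node(label="Cherry")),
)

print(predict(tree, {"color": "green", "size": "big"}))  # -> Watermelon
```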
Attribute (variable) selection

Decision boundaries: single-variable binary decisions lead to boundaries that are piecewise orthogonal to the axes. Oblique boundaries can only be approximated (by large trees).

[Figure: two partitions of the $(x_1, x_2)$ plane into regions $R_1$ and $R_2$ by axis-parallel splits.]

- for a node N, we search for the attribute T that would make the descendant nodes as pure as possible
- impurity $i(N)$: 0 if all elements belong to the same class, "large" if the classes are equally represented
- entropy impurity: $i(N) = -\sum_i P(g_i) \log_2 P(g_i)$
- variance impurity (for binary classification): $i(N) = P(g_1)\,P(g_2)$
- Gini impurity (generalized variance impurity):
  $i(N) = \sum_{i \neq j} P(g_i)\,P(g_j) = 1 - \sum_i [P(g_i)]^2$
  (interpretation: the expected error rate if the label were drawn at random from the class distribution present at N)
- misclassification impurity: $i(N) = 1 - \max_i P(g_i)$
  (interpretation: the minimum probability of misclassification at N)

[DHS, Fig. 8.4: comparison of the impurity measures for the two-class case: misclassification, entropy and Gini/variance impurity as functions of $P \in [0, 1]$, all maximal at $P = 0.5$.]

How to choose the test at node N?
- heuristic (greedy) approach: choose the test that maximizes the decrease in impurity of the descendant nodes:
  $\Delta i(N) = i(N) - P_L\, i(N_L) - (1 - P_L)\, i(N_R)$
  where $N_L$ and $N_R$ are the two (left and right) descendant nodes and $P_L$ is the fraction of examples that go to the left subtree
- one has to find the attribute (variable) T to test and the threshold value that maximize $\Delta i(N)$
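To tie the impurity definitions to the split-selection rule, here is a short NumPy sketch of the three impurity functions and a brute-force search for the threshold on a single continuous attribute that maximizes $\Delta i(N)$; the function names and the midpoint-scanning strategy are my own choices, not part of CART's specification.

```python
import numpy as np

def class_probs(y):
    """Class distribution P(g_i) at a node, given its label vector y."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def entropy_impurity(y):
    p = class_probs(y)
    return -np.sum(p * np.log2(p))

def gini_impurity(y):
    p = class_probs(y)
    return 1.0 - np.sum(p ** 2)

def misclassification_impurity(y):
    return 1.0 - class_probs(y).max()

def best_threshold(x, y, impurity=gini_impurity):
    """Scan candidate thresholds on one continuous attribute and return the
    one maximizing the impurity decrease Delta i(N)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    parent = impurity(y)
    best_gain, best_thr = -np.inf, None
    for k in range(1, len(x)):
        if x[k] == x[k - 1]:
            continue                          # no usable threshold between ties
        thr = 0.5 * (x[k] + x[k - 1])          # midpoint candidate
        p_left = k / len(x)
        gain = parent - p_left * impurity(y[:k]) - (1 - p_left) * impurity(y[k:])
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain

# toy usage
x = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold(x, y))   # threshold 0.55, impurity decrease 0.5
```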
Remarks:
- in general, the entropy or the Gini impurity is preferred; however, the choice makes little difference to the final quality of the classifier
- finding the optimal threshold may involve an optimization process for continuous variables
- for categorical variables, the optimal value is found by exhaustive search
- the optimum is local and may not be unique
- the misclassification impurity is not always decreasing under splitting
- there are algorithms that allow multiway splits

Stopping criteria
- if growth stops too early: not enough accuracy; if too late: overfitting
- use only a part of the data for growing the tree and the rest for estimating its error rate (either a single split of the training set or a cross-validation scheme); grow the tree as long as the error rate on the validation set decreases
- or: grow the tree as long as the reduction in impurity is above a threshold
- or: grow the tree as long as there are more than a certain number of elements in every leaf
- or: split until a minimum of $\alpha \cdot \mathrm{size} + \sum_{\text{leaf nodes}} i(N)$ is reached (a kind of MDL criterion)

Alternative approach: try to assess the statistical significance of the reduction in impurity
- different tests (e.g. $\chi^2$) can be used
- one can also build an empirical distribution for $\Delta i$ from the nodes already in the tree (once several nodes are available), etc.

Tree pruning
- horizon effect: the split decision at a node does not "see" the decisions made in its descendant nodes
- tree pruning is the opposite strategy to early stopping: the tree is grown to its fullest, and then leaves or even internal nodes are merged
- these actions try to optimize a global cost function
- the approach is much more computationally expensive than early stopping
- alternative: use propositional logic to simplify the rules expressed by the tree: remove irrelevant rules and try to improve the classification performance on a validation set

Label assignment for the leaves
- if a leaf is pure, the label is clear
- if $i(N) > 0$, the majority rule is used
- leaf purity is not the most important criterion: it may indicate overfitting or over-sensitivity to small changes in the training data (noise)

Other issues
- an approximation of the training complexity: $O(d\, n^2 \log n)$
- multivariable decisions are possible, e.g. a test of the form $-1.2\, x_1 + x_2 < 0.1$

[DHS, Fig. 8.5: a tree of axis-parallel tests on $x_1$ and $x_2$ (e.g. $x_1 < 0.27$, $x_2 < 0.32$, ...) partitioning the unit square into regions $R_1$ and $R_2$, compared with the simpler partition obtained when the oblique test $-1.2\, x_1 + x_2 < 0.1$ is allowed.]

Decision trees: Other classification trees

Other classical tree methods:
- ID3 (Iterative Dichotomiser 3): intended for nominal attributes; real values are quantized and used as nominal; the branching factor is usually > 2
- C4.5: successor and refinement of ID3; combines techniques from CART and ID3; real values are treated as in CART; nominal values generate multiple splits, as in ID3; a special method for pruning the rules

A slightly different tree...
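In practice, the design choices above (impurity function, stopping rules, cost-complexity pruning) are exposed as hyperparameters by library implementations; the following scikit-learn sketch is only illustrative, with an arbitrary data set and arbitrary parameter values rather than recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(
    criterion="gini",        # impurity function: "gini" or "entropy"
    min_samples_leaf=5,      # early-stopping rule: minimum leaf size
    max_depth=4,             # early-stopping rule: maximum number of levels
    ccp_alpha=0.01,          # cost-complexity pruning (size penalty)
    random_state=0,
)
clf.fit(X_tr, y_tr)
print("validation accuracy:", clf.score(X_val, y_val))
print(export_text(clf))      # the fitted tree written out as logical rules
```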
Multiple classifier systems: Introduction

Why combine classifiers?
- obviously, in the hope of improving the overall accuracy
- instead of looking for the "best" classifier, we look for the "best" way of combining some "reasonable" classifiers
- but there are other reasons as well: statistical, computational and representational

Statistical perspective: aggregating several "estimates" (classifiers) may bring us closer to the best classifier for the problem at hand.
[Figure: classifiers $h_1, \dots, h_4$ scattered in the classifier space around the best classifier $h^*$; their aggregate lies closer to $h^*$ than a single "good" classifier does.]

Computational perspective: the various classifiers may represent only local optima in the classifier space; hence, their combination may give a better approximation of the global optimum.
[Figure: $h_1, \dots, h_4$ as local optima in the classifier space, with $h^*$ the global optimum.]

Representational perspective: the space of classifiers chosen when modeling the problem may not contain the best classifier at all.
[Figure: $h^*$ lying outside the region of the classifier space covered by $h_1, \dots, h_4$.]

Different levels of combining classifiers:
- data level: different subsets of the training set are used to train the base classifiers
- feature level: different subsets of features are used by the base classifiers
- classifier level: different base classifiers are used
- combiner level: various combiners are used
- others: ECOC (error-correcting output codes): change the labels of the examples...
[Figure: a data set feeding L base classifiers $h_1, h_2, \dots, h_L$.]

Classifier fusion vs. selection
- classifier fusion: each base learner has knowledge of the whole feature space
- classifier selection: the base learners have different domains of competence (sets of features)
- fusion: combiners based on majority vote, weighted means, etc.
- cascades of classifiers: a special case of classifier selection

Decision optimization vs. coverage optimization
- decision optimization: optimize the combiner for a fixed set of base learners
- coverage optimization: fix the combiner and find the best set of base learners

Trainable vs. non-trainable ensembles
- non-trainable combiners: e.g. majority vote
- trainable combiners: may take into account, for example, the reliability of the base learners, or build the combiner as the ensemble is developed (e.g. AdaBoost, sketched below)
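As an illustration of a combiner built while the ensemble is grown, here is a minimal, simplified sketch of the discrete AdaBoost idea for two classes (labels in {-1, +1}), using scikit-learn decision stumps as base learners; this is my own compact rendering, not the exact formulation used in the course.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=20):
    """Minimal discrete AdaBoost for labels y in {-1, +1} (NumPy arrays),
    with decision stumps as base learners. Returns the stumps and their
    voting weights."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # example weights, updated each round
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)
        if err >= 0.5:                      # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        w *= np.exp(-alpha * y * pred)      # up-weight the misclassified examples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted vote of the base learners (sign of the weighted sum)."""
    agg = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(agg)

# usage sketch (X: (n, d) array, y: array of -1/+1 labels):
# stumps, alphas = adaboost_fit(X, y, n_rounds=50)
# y_hat = adaboost_predict(stumps, alphas, X_new)
```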
Multiple classifier systems: Fusion of label outputs

We consider a set of classifiers $h_1, \dots, h_L : \mathbb{R}^d \to G$. The goal is to construct a combiner (classifier) $H : G^L \to G$. The space $G^L$ is called the intermediate feature space.

Types of classifier outputs:
- type 0: the only information about the output of classifier $h_i$ is whether it is correct or not; for a data set S, the classifier $h_i$ produces an output vector $[y_{ik}]$ (one element for each point $x_k \in S$), with $y_{ik} = 1$ if $h_i$ classifies $x_k$ correctly and $y_{ik} = 0$ otherwise
- type 1: the classifier $h_i$ produces a class label $g_j$ for any input vector x
- type 2: each classifier produces an ordered list of possible class labels (a subset of G), from the most plausible to the least plausible
- type 3: each classifier outputs a vector $[f_1, \dots, f_C] \in \mathbb{R}^C$ whose values indicate the support for the hypothesis that x belongs to each of the $C = |G|$ classes

Majority vote
- vote types: unanimity, majority and plurality
- let $[c_{i1}, \dots, c_{iC}] \in \{0, 1\}^C$ be the vector associated with classifier $h_i$: $c_{ik} = 1$ if $h_i$ assigns x to class $g_k$
- the plurality vote can be written as: assign x to $g_k$ if
  $\sum_{i=1}^{L} c_{ik} = \max_{j=1,\dots,C} \sum_{i=1}^{L} c_{ij}$
[Figure: the relation between plurality, majority and unanimity voting.]
- depending on the patterns of success and failure of the individual classifiers, the majority vote can significantly improve the overall performance...
- ...but it can also decrease it with respect to the performance of the best base classifier

Weighted majority vote
- idea: give more weight to the better base classifiers
- the discriminant function for class $g_j$ has the form $H_j(x) = \sum_{i=1}^{L} b_i\, c_{ij}$, where $b_i$ is the weight given to classifier $h_i$
- if the L base classifiers are independent, with individual accuracies $p_1, \dots, p_L$, then the accuracy of the ensemble is maximized if the weights are chosen as
  $b_i \propto \ln \frac{p_i}{1 - p_i}$
(a NumPy sketch of both voting rules follows)
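In this sketch, labels are assumed to be integers $0, \dots, C-1$, `votes` is an $L \times n_{\text{samples}}$ array of base-classifier outputs, and the accuracies $p_i$ would in practice be estimated on a validation set; all names are my own.

```python
import numpy as np

def plurality_vote(votes, n_classes):
    """votes: (L, n_samples) array of integer labels in {0, ..., n_classes-1}.
    Returns, for each sample, the label receiving the most votes."""
    n_samples = votes.shape[1]
    counts = np.zeros((n_classes, n_samples))
    for v in votes:                                  # one row per classifier
        counts[v, np.arange(n_samples)] += 1
    return counts.argmax(axis=0)

def weighted_majority_vote(votes, accuracies, n_classes):
    """Same as above, but classifier i votes with weight b_i = ln(p_i/(1-p_i))."""
    p = np.asarray(accuracies, dtype=float)
    b = np.log(p / (1.0 - p))
    n_samples = votes.shape[1]
    counts = np.zeros((n_classes, n_samples))
    for v, bi in zip(votes, b):
        counts[v, np.arange(n_samples)] += bi
    return counts.argmax(axis=0)

# toy example: 3 classifiers, 4 samples, 2 classes
votes = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [1, 1, 0, 1]])
print(plurality_vote(votes, 2))                        # [0 1 0 0]
print(weighted_majority_vote(votes, [0.9, 0.7, 0.6], 2))  # [0 1 1 0]: the most
# accurate classifier overrides the other two on the third sample
```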
Other methods for combining labels
- naive Bayes: assumes conditional independence of the classifiers and tries to produce an estimate of the posterior probability based on the probabilities of assignment from each individual classifier
- multinomial methods: try to estimate the posterior probability for each possible combination of labels produced by the base classifiers
- combination based on SVD: uses correspondence analysis in the intermediate feature space
- etc.

Multiple classifier systems: Fusion of continuous outputs

- the output of a base classifier is interpreted as a degree of confidence in the class assignment: either a confidence value or a posterior probability
- the combiner tries to estimate the support (evidence) for each class
- let $DP(x)$ be the decision profile matrix:
$$DP(x) = \begin{bmatrix} h_{11}(x) & \dots & h_{1j}(x) & \dots & h_{1C}(x) \\ \vdots & & \vdots & & \vdots \\ h_{i1}(x) & \dots & h_{ij}(x) & \dots & h_{iC}(x) \\ \vdots & & \vdots & & \vdots \\ h_{L1}(x) & \dots & h_{Lj}(x) & \dots & h_{LC}(x) \end{bmatrix}$$
with the $i$-th row corresponding to the output of $h_i$ and the $j$-th column collecting the evidence from all classifiers in favor of class $g_j$

Class-conscious combiners:
- non-trainable (i.e. there are no parameters to optimize): compute the support $\mu_j$ for class $g_j$ as
  - the average over $DP(x)_{\cdot,j}$ (the $j$-th column)
  - the minimum / maximum / median over $DP(x)_{\cdot,j}$
  - a trimmed mean, the product, or some other mean over $DP(x)_{\cdot,j}$
- trainable:
  - various weighted means
  - the fuzzy integral: also measures the "strength" of all subsets of classifiers
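A small sketch of how the decision profile and the simple non-trainable, class-conscious rules can be implemented; the array shapes and function names are my own conventions, with the per-class supports assumed to come from the base classifiers' probabilistic outputs.

```python
import numpy as np

def decision_profile(prob_outputs):
    """prob_outputs: list of L arrays, each of shape (n_samples, C), giving
    each base classifier's per-class support. Returns an array of shape
    (n_samples, L, C): one decision profile DP(x) per sample."""
    return np.stack(prob_outputs, axis=1)

def combine(dp, rule="mean"):
    """Class-conscious combiners: the support mu_j for class g_j aggregates
    the j-th column of DP(x) across the L classifiers."""
    rules = {"mean": np.mean, "min": np.min, "max": np.max,
             "median": np.median, "product": np.prod}
    support = rules[rule](dp, axis=1)        # shape (n_samples, C)
    return support.argmax(axis=1)            # predicted labels

# toy example: L = 2 classifiers, 1 sample, C = 3 classes
p1 = np.array([[0.6, 0.3, 0.1]])
p2 = np.array([[0.2, 0.5, 0.3]])
dp = decision_profile([p1, p2])
print(combine(dp, "mean"))     # mean supports [0.4, 0.4, 0.2]; tie broken by argmax -> class 0
print(combine(dp, "product"))  # product supports [0.12, 0.15, 0.03] -> class 1
```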
Class-indifferent combiners:
- decision templates: build a "typical" (template) decision profile for each class and then compare the current decision profile with the templates
- the comparison can use the Euclidean, Minkowski, city-block, etc. metrics
- one can also try a k-NN classifier in the intermediate feature space

Multiple classifier systems: Classifier selection

- idea: build "expert" classifiers for some subdomains (subspaces of the original feature space) and find a way to identify which base classifier should take the decision for each new input x
- the regions of competence of the base classifiers can be estimated either dynamically or statically
- different ways of combining their outputs: fusion or selection
- stochastic selection: select the label for x by sampling from $\{h_1, \dots, h_L\}$ according to some distribution $p_1(x), \dots, p_L(x)$
- or: choose the classifier with the highest $p_i(x)$
- or: use a weighted average...
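One common way to estimate regions of competence dynamically is by local accuracy: for a new input x, select the base classifier that performs best on the k validation points nearest to x. The sketch below is a simplified version of this idea; it assumes scikit-learn-style classifiers with a predict method, and the names and the choice of Euclidean distance are mine.

```python
import numpy as np

def dcs_local_accuracy(x, classifiers, X_val, y_val, k=10):
    """Dynamic classifier selection by local accuracy: among the k validation
    points nearest to x, pick the base classifier with the highest accuracy
    and return its prediction for x."""
    d = np.linalg.norm(X_val - x, axis=1)           # Euclidean distances to x
    nn = np.argsort(d)[:k]                          # indices of the k nearest
    accs = [np.mean(clf.predict(X_val[nn]) == y_val[nn]) for clf in classifiers]
    best = int(np.argmax(accs))                     # the locally best "expert"
    return classifiers[best].predict(x.reshape(1, -1))[0]
```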