PA196: Pattern Recognition
07. Decision trees
08. Multiple classifier systems
Dr. Vlad Popovici
popovici@recetox.muni.cz
RECETOX, Masaryk University, Brno

Outline
1 Decision trees: Introduction; CART; Other classification trees
2 Multiple classifier systems: Introduction; Fusion of label outputs; Fusion of continuous outputs; Classifier selection

Decision trees - Introduction

[DHS - Fig. 8.1: an example decision tree classifying fruit by color, size, shape and taste; the root is at level 0 and the leaves (Grapefruit, Watermelon, Banana, Apple, Lemon, Cherry, Grape) are reached after at most three tests]

• attributes can be continuous or nominal/categorical
• there is no need for a metric on the attribute space
• the interpretation is simple and the tree can be written as a logical proposition
• natural handling of multi-class problems
• different, equivalent trees may exist for the same problem...
• feature selection is embedded into the algorithm
• what if there are tens of thousands of features?
• how many levels should the tree have?

Classification And Regression Trees - CART

• we are given a training set S = {(x_i, y_i) | i = 1, ..., n}, where y_i codes the class g_i and x_i is an ordered collection of attributes
• a tree splits the training set into subsets
• the objective is to "grow" a tree such that the leaves are pure: all elements in a leaf belong to the same class

Issues:
• binary or multi-valued decisions in the nodes? (i.e. how many splits?)
• which attribute should be tested?
• when should a node be declared a leaf?
• which pruning strategy?
• if a leaf is impure, what is its label?
• how to handle missing values?

Number of splits

• design decision: what is the branching factor B of a node?
• B = 2: binary trees
• any tree can be transformed into a binary tree

[DHS - Fig. 8.2: the fruit tree of Fig. 8.1 rewritten as a binary tree, using yes/no tests such as "color = green?", "size = big?", "size = medium?", "color = yellow?", "shape = round?", "size = small?", "taste = sweet?"]

Attribute (variable) selection

Decision boundaries: single-variable binary decisions lead to boundaries that are piecewise orthogonal to the axes. Oblique boundaries can only be approximated (by large trees).

• for a node N we search for the attribute T that makes the descendant nodes as pure as possible
• impurity i(N): 0 if all elements belong to the same class, "large" if the classes are equally represented
• entropy impurity: i(N) = -Σ_i P(g_i) log2 P(g_i)
• (for binary classification) variance impurity: i(N) = P(g_1) P(g_2)
• Gini impurity (generalized variance impurity): i(N) = Σ_{i≠j} P(g_i) P(g_j) = 1 - Σ_i [P(g_i)]^2 (interpretation: expected error rate if the label is randomly selected from the class distribution present at N)
• misclassification impurity: i(N) = 1 - max_i P(g_i) (interpretation: minimum probability of a misclassification)

[DHS - Fig. 8.4: comparison of the impurity measures (misclassification, entropy, Gini/variance) for the two-class case, as functions of P = P(g_1); all reach their maximum at P = 0.5]
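To make the impurity measures concrete, below is a minimal Python/NumPy sketch (illustrative only; the function names are our own, not taken from any library) computing the entropy, Gini and misclassification impurities from the labels of the samples reaching a node.

```python
import numpy as np

def class_proportions(labels):
    """Estimate the class probabilities P(g_i) at a node from the labels
    of the training samples that reach it."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy_impurity(labels):
    """i(N) = -sum_i P(g_i) * log2 P(g_i)"""
    p = class_proportions(labels)
    return -np.sum(p * np.log2(p))

def gini_impurity(labels):
    """i(N) = 1 - sum_i P(g_i)^2 (expected error of a random labelling)"""
    p = class_proportions(labels)
    return 1.0 - np.sum(p ** 2)

def misclassification_impurity(labels):
    """i(N) = 1 - max_i P(g_i)"""
    p = class_proportions(labels)
    return 1.0 - p.max()

# a node reached by 8 samples of class 0 and 2 samples of class 1
node_labels = np.array([0] * 8 + [1] * 2)
print(entropy_impurity(node_labels))            # ~0.72
print(gini_impurity(node_labels))               # 0.32
print(misclassification_impurity(node_labels))  # 0.2
```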
How to choose the test at the node N?

• heuristic (greedy approach): choose the test that maximizes the decrease in impurity of the descendent nodes: Δi(N) = i(N) - P_L i(N_L) - (1 - P_L) i(N_R), where N_L and N_R are the two (left and right) descendent nodes and P_L is the fraction of examples that go to the left subtree
• one has to find the attribute (variable) T to test and the threshold value that maximizes Δi(N) (a small search sketch is given at the end of this part)
• in general, the entropy or Gini impurity functions are preferred, but the choice makes little difference to the final quality of the classifier
• finding the optimal threshold may involve an optimization process for continuous variables
• for categorical variables, the optimal value is found by exhaustive search
• the optimum is local and may not be unique
• the misclassification impurity is not always decreasing under splitting
• there are algorithms that allow multiway splits

Stopping criteria

• if stopped too early: not enough accuracy; if too late: overfitting
• use only a part of the data for growing the tree and the rest for estimating its error rate (either a single split of the training set or in a cross-validation manner); grow the tree as long as the error rate (on the validation set) decreases
• or: grow the tree as long as the reduction in impurity is above a threshold
• or: grow the tree as long as there are more than a certain number of elements in every leaf
• or: split until a minimum of α·size + Σ_{leaf nodes N} i(N) is reached (an MDL-like criterion)

Alternative approach:
• try to assess the statistical significance of the reduction in impurity
• different tests (e.g. χ²) can be used
• one can also build an empirical distribution of Δi from the nodes already in the tree (once enough nodes are available)
• etc.

Tree pruning

• horizon effect: the split decision at a node does not "see" the decisions in the descendent nodes
• tree pruning is the opposite strategy to early stopping
• the tree is grown to its fullest, and then leaves or even internal nodes are joined
• these actions try to optimize a global cost function
• the approach is much more computationally expensive than early stopping
• alternative: use propositional logic to simplify the rules expressed by the tree: remove irrelevant rules and try to improve classification performance on a validation set

Label assignment for the leaves

• if a leaf is pure, the label is clear
• if i(N) > 0, the majority rule is used
• leaf purity is not the most important criterion: fully pure leaves may indicate overfitting or over-sensitivity to small changes in the training data (noise)

Other issues

• an approximation of the training complexity: O(d n² log n)
• multivariable decisions

[DHS - Fig. 8.5: a tree built from single-variable tests (x1 < 0.27, x2 < 0.32, x1 < 0.07, x2 < 0.6, x1 < 0.55, x2 < 0.86, x1 < 0.81) produces a staircase-like boundary between the regions R1 and R2, while a single multivariable test (-1.2 x1 + x2 < 0.1) yields a much simpler, oblique boundary]

Other classical tree methods:
• ID3 - Iterative Dichotomiser 3 - intended for nominal attributes
  • real values are quantized and treated as nominal
  • the branching factor is usually > 2
• C4.5 - successor and refinement of ID3
  • combines techniques from CART and ID3
  • real values are treated as in CART
  • nominal values generate multiple splits, as in ID3
  • special method for pruning the rules

A slightly different tree... [figure-only slide]
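Returning to the greedy split selection for CART, here is a minimal Python/NumPy sketch (names are our own) of the threshold search for a single continuous attribute, maximizing Δi(N) = i(N) − P_L·i(N_L) − (1 − P_L)·i(N_R); a full implementation would repeat this search over all attributes and recurse on the resulting subsets.

```python
import numpy as np

def gini_impurity(labels):
    """i(N) = 1 - sum_i P(g_i)^2, as in the previous sketch."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y, impurity=gini_impurity):
    """Greedy CART-style search over one continuous attribute: return the
    threshold t and the impurity decrease maximizing
    i(N) - P_L * i(N_L) - (1 - P_L) * i(N_R) for the split x <= t."""
    i_parent = impurity(y)
    values = np.unique(x)
    # candidate thresholds: midpoints between consecutive distinct values
    candidates = (values[:-1] + values[1:]) / 2.0
    best_t, best_gain = None, -np.inf
    for t in candidates:
        left, right = y[x <= t], y[x > t]
        p_left = len(left) / len(y)
        gain = i_parent - p_left * impurity(left) - (1.0 - p_left) * impurity(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# toy 1-D attribute: the two classes roughly separate around x = 0.5
rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(0.0, 0.55, 50), rng.uniform(0.45, 1.0, 50)])
y = np.array([0] * 50 + [1] * 50)
print(best_threshold(x, y))   # a threshold close to 0.5 and a large impurity decrease
```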
Multiple classifier systems - Introduction

Why combining classifiers?

• obviously, in the hope of improving the overall accuracy
• instead of looking for the "best" classifier, we look for the "best" way to combine some "reasonable" classifiers
• but there are also other reasons: statistical, computational and representational

Statistical perspective: aggregating several "estimates" (classifiers) may bring us closer to the best classifier for the problem at hand. [figure: the "good" classifiers h1, ..., h4 surround the best classifier h* in the classifier space]

Computational perspective: the various classifiers may represent only local optima in the classifier space, hence their combination may give a better approximation of the global optimum. [figure: h1, ..., h4 as local optima around h*]

Representational perspective: the space of classifiers chosen when modeling the problem may not contain the best classifier at all. [figure: h* lies outside the chosen classifier space containing h1, ..., h4]

Different levels of combining classifiers

• data level: different subsets of the training set are used to train the base classifiers
• feature level: different subsets of features are used for the base classifiers
• classifier level: use different base classifiers
• combiner level: use various combiners
• others: ECOC (error-correcting output codes): change the labels of the examples...

[figure: a data set feeding L base classifiers h1, h2, ..., hL, whose outputs are merged by a combiner]

Classifier fusion vs. selection

• classifier fusion: each base learner has knowledge of the whole feature space
• classifier selection: the base learners have different domains of competence (sets of features)
• fusion: combiners based on majority vote, weighted means, etc.
• cascades of classifiers: a special case of classifier selection

Decision optimization vs. coverage optimization

• decision optimization: optimize the combiner for a fixed set of base learners
• coverage optimization: fix the combiner and find the best set of base learners

Trainable vs. non-trainable ensembles

• non-trainable combiners: e.g. majority vote
• trainable combiners: may take into account, for example, the reliability of the base learners
• or build the combiner as the ensemble is developed (e.g. AdaBoost)

Fusion of label outputs

We consider a set of classifiers h_1, ..., h_L : R^d → G. The goal is to construct a combiner (classifier) H : G^L → G. The space G^L is called the intermediate feature space (a small construction sketch follows the list of output types below).

Types of classifier outputs:
• type 0: the only information about the output of classifier h_i is whether it is correct or not. For a data set S, classifier h_i produces an output vector with one element per point x_k ∈ S: [y_ik], where y_ik = 1 if h_i classifies x_k correctly and y_ik = 0 otherwise
• type 1: the classifier h_i produces a class label g_j for any input vector x
• type 2: each classifier produces an ordered list of possible class labels (a subset of G), from the most plausible to the least plausible
• type 3: each classifier outputs a vector [f_1, ..., f_C] ∈ R^C whose values indicate the support for the hypothesis that x belongs to each of the C = |G| classes
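As a small illustration of the intermediate feature space, the sketch below (toy, hand-made threshold "classifiers"; nothing here comes from a real library) stacks the type-1 (label) outputs of L = 3 base classifiers into a vector of G^L, the object on which a combiner H operates.

```python
import numpy as np

# three toy base classifiers h_i : R^2 -> G = {0, 1} (hand-made threshold rules)
def h1(x): return int(x[0] > 0.5)
def h2(x): return int(x[1] > 0.5)
def h3(x): return int(x[0] + x[1] > 1.0)

base_classifiers = [h1, h2, h3]

def intermediate_features(x, classifiers):
    """Map an input x to the intermediate feature space G^L: the vector of
    the L labels produced by the base classifiers (type-1 outputs)."""
    return np.array([h(x) for h in classifiers])

x = np.array([0.7, 0.2])
z = intermediate_features(x, base_classifiers)
print(z)   # [1 0 0] -- a point of G^L, the input of the combiner H : G^L -> G
```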
Majority vote

• variants: unanimity, majority and plurality
• let [c_i1, ..., c_iC] ∈ {0, 1}^C be the vector associated with classifier h_i: c_ik is 1 if h_i assigns x to class g_k
• the plurality vote can be written as: assign x to g_k if Σ_{i=1..L} c_ik = max_{j=1,...,C} Σ_{i=1..L} c_ij
• depending on the patterns of success and failure of the individual classifiers, the majority vote can significantly improve the overall performance...
• ...but it can also decrease it with respect to the performance of the best base classifier

Weighted majority vote

• idea: give more weight to the better base classifiers
• the discriminant function for class g_i has the form H_i(x) = Σ_{j=1..L} b_j c_ji, where c_ji indicates that classifier h_j votes for class g_i
• if the L base classifiers are independent, with individual accuracies p_1, ..., p_L, then the accuracy of the ensemble is maximized if the weights are chosen as b_i ∝ ln(p_i / (1 - p_i)) (see the code sketch at the end of this part)

Other methods for combining labels

• Naive Bayes: assumes conditional independence of the classifiers and tries to produce an estimate of the posterior probability based on the probabilities of assignment from each individual classifier
• multinomial methods try to estimate the posterior probability for each possible combination of labels produced by the base classifiers
• combination based on SVD uses correspondence analysis in the intermediate feature space
• etc.

Fusion of continuous outputs

• the output of the base classifier is interpreted as a degree of confidence in the class assignment: either a confidence or a posterior probability
• the combiner tries to estimate the support (evidence) for each class
• let DP(x) be the L × C decision profile matrix, DP(x) = [ h_11(x) ... h_1j(x) ... h_1C(x) ; ... ; h_i1(x) ... h_ij(x) ... h_iC(x) ; ... ], with the i-th row corresponding to the output of h_i and the j-th column collecting the evidence from all classifiers in favor of class g_j

Class-conscious combiners:
• non-trainable (i.e. there are no parameters to optimize): compute the support μ_j for class g_j as
  • the average over DP(x)_{·,j} (the j-th column)
  • the minimum/maximum/median over DP(x)_{·,j}
  • a trimmed mean, product, or some other mean over DP(x)_{·,j}
• trainable:
  • various weighted means
  • fuzzy integral: also measures the "strength" of all subsets of classifiers

Class-indifferent combiners:
• decision templates: build a "typical" (template) decision profile for each class and then compare the current decision profile with the templates
• the comparison can use the Euclidean, Minkowski, city-block etc. metrics
• one can also try a k-NN in the intermediate feature space
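The sketch below (Python/NumPy, illustrative only; names are our own) implements two of the combiners discussed in this part: the weighted majority vote with weights b_i ∝ ln(p_i / (1 − p_i)) for label outputs, and simple class-conscious fusion (mean and product rules) over the columns of the decision profile DP(x).

```python
import numpy as np

def weighted_majority_vote(labels, accuracies, n_classes):
    """Fuse label (type-1) outputs, weighting classifier i by
    b_i = ln(p_i / (1 - p_i)), with p_i its estimated accuracy."""
    p = np.asarray(accuracies, dtype=float)
    weights = np.log(p / (1.0 - p))
    support = np.zeros(n_classes)
    for label, w in zip(labels, weights):
        support[label] += w
    return int(np.argmax(support))

def fuse_decision_profile(dp, rule="mean"):
    """Class-conscious fusion of continuous outputs: combine the columns of
    the L x C decision profile DP(x) and pick the class with largest support."""
    if rule == "mean":
        mu = dp.mean(axis=0)
    elif rule == "product":
        mu = dp.prod(axis=0)
    else:
        raise ValueError("unknown rule: " + rule)
    return int(np.argmax(mu))

# three base classifiers, two classes
labels = [1, 0, 1]                 # label outputs of h_1, h_2, h_3 for some x
accuracies = [0.85, 0.60, 0.70]    # their estimated individual accuracies
print(weighted_majority_vote(labels, accuracies, n_classes=2))   # -> 1

dp = np.array([[0.20, 0.80],       # decision profile DP(x): row i = output of h_i
               [0.60, 0.40],
               [0.45, 0.55]])
print(fuse_decision_profile(dp, "mean"))      # -> 1 (mean supports: 0.42 vs 0.58)
print(fuse_decision_profile(dp, "product"))   # -> 1
```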
Classifier selection

• idea: build "expert" classifiers for some subdomains (subspaces of the original space) and find a way to identify which base classifier should take the decision for each new input x
• the regions of competence of the base classifiers can be estimated either dynamically or statically
• there are different ways of combining their outputs: fusion or selection
• stochastic selection: select the label for x by sampling from {h_1, ..., h_L} according to some distribution p_1(x), ..., p_L(x)
• or, choose the classifier with the highest p_i(x)
• or, use a weighted average...
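Finally, a minimal sketch of dynamic classifier selection under simple assumptions: the local competence p_i(x) of each base classifier is estimated as its accuracy over the k validation points nearest to x, and the most competent classifier labels x. The helper names and the toy data are our own.

```python
import numpy as np

# the same toy base classifiers as in the earlier sketch
def h1(x): return int(x[0] > 0.5)
def h2(x): return int(x[1] > 0.5)
def h3(x): return int(x[0] + x[1] > 1.0)

def select_and_classify(x, classifiers, X_val, y_val, k=5):
    """Dynamic classifier selection: estimate the local competence p_i(x) of each
    base classifier as its accuracy on the k validation points nearest to x,
    then let the most competent classifier label x."""
    dists = np.linalg.norm(X_val - x, axis=1)
    nn = np.argsort(dists)[:k]                        # k nearest neighbours of x
    competence = [np.mean([h(X_val[j]) == y_val[j] for j in nn])
                  for h in classifiers]
    best = int(np.argmax(competence))                 # or sample i with prob. ~ p_i(x)
    return classifiers[best](x), best

rng = np.random.default_rng(1)
X_val = rng.uniform(0.0, 1.0, size=(100, 2))
y_val = (X_val.sum(axis=1) > 1.0).astype(int)         # toy "ground truth"
label, chosen = select_and_classify(np.array([0.9, 0.3]), [h1, h2, h3], X_val, y_val)
print(label, chosen)   # predicted label and index of the locally most competent classifier
```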