JOURNAL OF PATTERN RECOGNITION RESEARCH 1 (2009) 1-13
Received Jul 27, 2008. Accepted Feb 6, 2009.
© 2009 JPRR. All rights reserved.

A Genetic Algorithm for Constructing Compact Binary Decision Trees

Sung-Hyuk Cha (scha@pace.edu)
Charles Tappert (ctappert@pace.edu)
Computer Science Department, Pace University
861 Bedford Road, Pleasantville, New York, 10570, USA

Abstract
Tree-based classifiers are important in pattern recognition and have been well studied. Although the problem of finding an optimal decision tree has received attention, it is a hard optimization problem. Here we propose utilizing a genetic algorithm to improve the search for compact, near-optimal decision trees. We present a method to encode and decode a decision tree to and from a chromosome on which genetic operators such as mutation and crossover can be applied. Theoretical properties of decision trees, encoded chromosomes, and fitness functions are presented.

Keywords: Binary Decision Tree, Genetic Algorithm.

1. Introduction

Decision trees have been well studied and widely used in knowledge discovery and decision support systems. Here we are concerned with decision trees for classification, where the leaves represent class labels and the branches represent the feature-based splits that lead to those labels. Such trees approximate discrete-valued target functions and are a widely used practical method for inductive inference [1]. Decision trees have prospered in knowledge discovery and decision support systems because they classify a pattern through a natural and intuitive sequence of questions.

Algorithms for constructing decision trees, such as ID3 [1-3], often use heuristics that tend to find short trees. Finding the shortest decision tree is a hard optimization problem [4, 5]. Attempts to construct short, near-optimal decision trees have been described (see the extensive survey [6]). Gehrke et al. developed a bootstrapped optimistic algorithm for decision tree construction [7]. For continuous attribute data, Zhao and Shirasaka [8] suggested an evolutionary design, Bennett and Blue [9] proposed an extreme point tabu search algorithm, and Pajunen and Girolami [10] exploited linear independent component analysis (ICA) to construct binary decision trees.

Genetic algorithms (GAs) are an optimization technique based on natural evolution [1, 2, 11, 12]. GAs have been used to find near-optimal decision trees in two ways. On the one hand, they have been used to select the attributes from which decision trees are then constructed, in a hybrid or preprocessing manner [13-15]. On the other hand, they have been applied directly to decision trees [16, 17]. A problem that arises with the direct approach is that an attribute may appear more than once in a path of the tree. In this paper, we describe an alternative method of constructing near-optimal binary decision trees, proposed succinctly in [18]. In order to utilize genetic algorithms, decision trees must be represented as chromosomes on which genetic operators such as mutation and crossover can be applied. The main contribution of this paper is a new scheme to encode and decode a decision tree to and from a chromosome.

The remainder of the paper is organized as follows. Section 2 reviews decision trees and defines a new function, denoted d∆, that describes the complexity of the search space.
Section 3 presents the encoding and decoding of decision trees to and from chromosomes, which stems from the d∆ function; the genetic operators of mutation and crossover; and the fitness functions and their analysis. Finally, Section 4 concludes this work.

2. Preliminary

Let D be a set of labeled training data, a database of instances represented by attribute-value pairs where each attribute can take a small number of possible disjoint values. Here we consider only binary attributes. Hence, D has n instances where each instance xi consists of d ordered binary attributes and a target value, which is one of the c states of nature w. The following sample database D, where n = 6, d = 4, c = 2, and w = {w1, w2}, is used for illustration throughout the rest of this paper.

D =
        A  B  C  D   w
  x1    0  0  0  0   w1
  x2    0  1  1  0   w1
  x3    1  0  1  0   w1
  x4    0  0  1  1   w2
  x5    1  1  0  0   w2
  x6    0  0  1  0   w2

Algorithms for constructing a decision tree take a set of training instances D as input and output a learned discrete-valued target function in the form of a tree. A decision tree is a rooted tree T that consists of internal nodes representing attributes, leaf nodes representing class labels, and edges representing the attributes' possible values. In binary decision trees, left branches carry the value 0 and right branches the value 1, as shown in Fig. 1. For simplicity we omit the value labels in some later figures.

Fig. 1: Decision trees consistent and inconsistent with D: (a) T1, consistent with D, and (b) Tx, inconsistent with D.

A decision tree represents a disjunction of conjunctions. In Fig. 1(a), for example, T1 represents the w1 and w2 states as (¬C ∧ ¬A) ∨ (C ∧ ¬D ∧ ¬B ∧ A) ∨ (C ∧ ¬D ∧ B) and (¬C ∧ A) ∨ (C ∧ ¬D ∧ ¬B ∧ ¬A) ∨ (C ∧ D), respectively. Each conjunction corresponds to a path from the root to a leaf. A decision tree built from a database D with c classes is a c-class classifier.

Decision trees classify instances by traversing from the root node to a leaf node. The classification process starts at the root node, tests the attribute specified at that node, and then moves down the branch corresponding to the given attribute value. Fig. 1 shows two decision trees, T1 and Tx. The decision tree T1 is said to be a consistent decision tree because it is consistent with all instances in D. However, the decision tree Tx is inconsistent with D because x2's class is actually w1 in D whereas Tx classifies it as w2.

Fig. 2: Two decision trees consistent with D: (a) T2 with 11 nodes, built by ID3, and (b) T3 with 9 nodes, built by the GA.

There are two important properties of a binary decision tree:

Property 1 The size (total number of nodes) of a binary decision tree with l leaves is 2l − 1.

Property 2 The lower and upper bounds of l for a consistent binary decision tree are c and n: c ≤ l ≤ n.

In the best case, the number of leaves in a consistent decision tree is the number of classes c. In the worst case, the number of leaves is the size of D, with each instance corresponding to a unique leaf, e.g., T1 and T2.
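To make the preliminaries concrete, the following sketch (Python; the names DATA, T1, classify, and consistent are ours and purely illustrative, not from the paper) writes D as labeled bit-vectors and T1 of Fig. 1(a) as nested (attribute index, left subtree, right subtree) tuples, and verifies that T1 is consistent with D.

    # Illustrative sketch only: the sample database D and the tree T1 of Fig. 1(a).
    # Attributes are ordered (A, B, C, D); internal nodes are tuples
    # (attribute_index, left_subtree, right_subtree); leaves are class labels.
    DATA = [((0, 0, 0, 0), 'w1'),   # x1
            ((0, 1, 1, 0), 'w1'),   # x2
            ((1, 0, 1, 0), 'w1'),   # x3
            ((0, 0, 1, 1), 'w2'),   # x4
            ((1, 1, 0, 0), 'w2'),   # x5
            ((0, 0, 1, 0), 'w2')]   # x6

    T1 = (2, (0, 'w1', 'w2'),                        # root C; C = 0: test A
             (3, (1, (0, 'w2', 'w1'), 'w1'), 'w2'))  # C = 1: test D, then B, then A

    def classify(tree, x):
        """Follow the left branch on attribute value 0 and the right branch on 1."""
        while isinstance(tree, tuple):
            attr, left, right = tree
            tree = right if x[attr] else left
        return tree

    def consistent(tree, data):
        """A tree is consistent with D if it classifies every instance correctly."""
        return all(classify(tree, x) == w for x, w in data)

    print(consistent(T1, DATA))   # True

The same tuple representation is reused in the later sketches.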
2.1 Occam's Razor and ID3

Among the numerous decision trees that are consistent with the training database of instances, Fig. 2 shows two. All instances x = {x1, . . . , x6} are classified correctly by both decision trees T2 and T3. However, an unknown instance ⟨0, 0, 0, 1, ?⟩, which is not in the training set D, is classified differently by the two decision trees: T2 classifies the instance as w2 whereas T3 classifies it as w1. Which prediction to prefer is a fundamental problem of inductive inference in machine learning. The minimum description length principle, formalized from Occam's Razor [19], is a very important concept in machine learning theory [1, 2]. Albeit controversial, many decision-tree building algorithms such as ID3 [3] prefer smaller (more compact, shorter depth, fewer nodes) trees, and thus the instance ⟨0, 0, 0, 1, ?⟩ is preferably classified as w1, the class given by T3, because T3 is shorter than T2. In other words, T3 has a simpler description than T2. The shorter the tree, the fewer questions are required to classify instances.

Based on Occam's Razor, Ross Quinlan proposed a heuristic that tends to find smaller decision trees [3]. The algorithm, called ID3 (Iterative Dichotomizer 3), utilizes entropy, a measure of the homogeneity of the examples, defined in Eq. (1).

Entropy(D) = -\sum_{x \in w = \{w_1, w_2\}} P(x) \log P(x)    (1)

Fig. 3: Illustration of the ID3 algorithm.

Information gain, or simply gain, is defined in terms of entropy, where X is one of the attributes in D. When all attributes are binary, the gain can be defined as in Eq. (2).

gain(D, X) = Entropy(D) - \left( \frac{|D_L|}{|D|} Entropy(D_L) + \frac{|D_R|}{|D|} Entropy(D_R) \right)    (2)

The ID3 algorithm first selects the attribute with the maximum gain as the root node. For each subtree of the root node, it then iteratively selects the attribute with the maximum gain on the corresponding sub-database. Fig. 3 illustrates the ID3 algorithm. Starting with the root node, it evaluates all attributes in the database D. Since attribute D has the highest gain, it is selected as the root node, and the database D is partitioned into two sub-databases, DL and DR. For each sub-database, the gains are computed again. As a result, the decision tree T2 is built. However, as is apparent from Fig. 2, the ID3 algorithm does not necessarily find the smallest decision tree.
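As a concrete illustration of Eqs. (1) and (2), the short sketch below (Python; the helper names entropy and gain are ours) computes the gain of each attribute on D and confirms that ID3 places attribute D, whose gain is about 0.191, at the root, as in Fig. 3.

    from math import log2

    DATA = [((0,0,0,0),'w1'), ((0,1,1,0),'w1'), ((1,0,1,0),'w1'),
            ((0,0,1,1),'w2'), ((1,1,0,0),'w2'), ((0,0,1,0),'w2')]   # x1..x6

    def entropy(data):
        """Entropy of the class labels in a (sub-)database, Eq. (1)."""
        if not data:
            return 0.0
        counts = {}
        for _, w in data:
            counts[w] = counts.get(w, 0) + 1
        return -sum(c / len(data) * log2(c / len(data)) for c in counts.values())

    def gain(data, attr):
        """Information gain of splitting on a binary attribute, Eq. (2)."""
        left = [(x, w) for x, w in data if x[attr] == 0]
        right = [(x, w) for x, w in data if x[attr] == 1]
        n = len(data)
        return entropy(data) - (len(left) / n * entropy(left) +
                                len(right) / n * entropy(right))

    print([round(gain(DATA, a), 3) for a in range(4)])   # [0.0, 0.0, 0.0, 0.191]
    print(max(range(4), key=lambda a: gain(DATA, a)))    # 3, i.e., attribute D

A full ID3 implementation would then recurse on DL and DR with the remaining attributes until every node is homogeneous.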
2.2 Complexity of the d∆ Function

To the extent that smaller trees are preferred, it becomes interesting to find a smallest decision tree. Finding a smallest decision tree is, however, an NP-complete problem [4, 5]. To appreciate the complexity of finding a smallest decision tree, consider a full binary decision tree T1^f (the superscript f denotes full), in which each path from the root to a leaf contains every attribute exactly once, as exemplified in Fig. 4. Such a tree has 2^d leaves, and d + 1 is its height. To build a consistent full binary tree, one may choose any attribute as the root node; e.g., there are four choices in Fig. 4. At each node in the second level, one can choose any attribute that is not at the root, and so on. For example, in Fig. 4 there are 2 × 2 × 2 × 2 × 3 × 3 × 4 = 576 possible full binary trees. We denote the number of possible full binary trees with d attributes by d∆.

Fig. 4: A sample full binary decision tree structure T1^f.

Table 1: Comparison of the d∆ and d! functions.

  d   d∆                                                                        d!
  1   1∆ = 1                                                                    1! = 1
  2   2∆ = 1 × 1 × 2 = 2                                                        2! = 2
  3   3∆ = 1 × 1 × 1 × 1 × 2 × 2 × 3 = 12                                       3! = 6
  4   4∆ = 2 × 2 × 2 × 2 × 3 × 3 × 4 = 576                                      4! = 24
  5   5∆ = 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 × 3 × 3 × 3 × 3 × 4 × 4 × 5 = 1658880  5! = 120
  6   6∆ = 2^16 × 3^8 × 4^4 × 5^2 × 6 = 16511297126400                          6! = 720
  7   7∆ = 2^32 × 3^16 × 4^8 × 5^4 × 6^2 × 7 ≈ 1.9084 × 10^27                   7! = 5040
  8   8∆ = 2^64 × 3^32 × 4^16 × 5^8 × 6^4 × 7^2 × 8 ≈ 2.9135 × 10^55            8! = 40320
  9   9∆ = 2^128 × 3^64 × 4^32 × 5^16 × 6^8 × 7^4 × 8^2 × 9 ≈ 7.6395 × 10^111   9! = 362880
 10  10∆ = 2^256 × 3^128 × 4^64 × 5^32 × 6^16 × 7^8 × 8^4 × 9^2 × 10 ≈ 5.8362 × 10^224   10! = 3628800

Definition 1 The number of possible full binary trees with d attributes is formally defined as

d\Delta = \prod_{i=1}^{d} i^{2^{d-i}}.    (3)

As illustrated in Table 1, as d grows, the function d∆ grows faster than all polynomial, exponential, and even factorial functions. The factorial function d! is the product of all positive integers less than or equal to d, and many combinatorial optimization problems have a search space of complexity O(d!). The search space of full binary decision trees is much larger, i.e., d∆ = Ω(d!); lower and upper bounds of d∆ are Ω(2^{2^{d-1}}) and o(d^{2^{d-1}}). Note that real decision trees, such as T1 in Fig. 1(a), can be sparse because internal nodes can become leaves as soon as they are homogeneous. Hence, the search space for finding a shortest binary decision tree can be smaller than d∆.
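A few lines of code suffice to reproduce Table 1 from Definition 1; the sketch below (Python; the function name delta is ours) evaluates Eq. (3) exactly with arbitrary-precision integers.

    def delta(d):
        """d-delta of Definition 1: the number of full binary decision trees
        over d binary attributes, prod_{i=1}^{d} i ** (2 ** (d - i))."""
        result = 1
        for i in range(1, d + 1):
            result *= i ** (2 ** (d - i))
        return result

    print([delta(d) for d in range(1, 7)])
    # [1, 2, 12, 576, 1658880, 16511297126400], matching Table 1

Even for d = 10 the exact count already has more than 220 decimal digits, which is why Table 1 switches to scientific notation.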
3. Genetic Algorithm for Binary Decision Tree Construction

Genetic algorithms can provide good solutions to many optimization problems [11, 12]. They are based on the natural processes of evolution and the survival-of-the-fittest concept. In order to use the genetic algorithm process, one must define at least the following four steps: encoding, genetic operators such as mutation and crossover, decoding, and a fitness function.

3.1 Encoding

For genetic algorithms to construct decision trees, the decision trees must be encoded so that genetic operators, such as mutation and crossover, can be applied. Let A = {a1, a2, . . . , ad} be the ordered attribute list. We illustrate the process on the full binary decision tree T1^f in Fig. 4, where A = {A, B, C, D}. Graphically speaking, the encoding process converts each attribute name in T1^f into the index of that attribute in the ordered attribute list A, recursively, starting from the root, as depicted in Fig. 5. For example, the root is C and its index in A is 3. Recursively, for each sub-tree, the attribute list is updated to A − {C} = {A, B, D}. The possible integer values at a node in the i'th level of the encoded decision tree T^e are therefore 1 to d − i + 1. Finally, a breadth-first traversal of the encoded tree generates the chromosome string S, whose structure is what gives rise to the d∆ function.

Fig. 5: Encoding and decoding schema: (a) the encoded tree T1^e for T1^f in Fig. 4 and (b) its chromosome attribute-selection scheduling string S1^f.

For T1^f the chromosome string S1^f is given in Fig. 5(b). T1 in Fig. 1 is encoded into S1 = ⟨3, 1, 3, ∗, ∗, 2, ∗⟩, where ∗ can be any number within the restricted bounds. T2 and T3 in Fig. 2 are encoded into S2 = ⟨4, 1, ∗, 1, 1, ∗, ∗⟩ and S3 = ⟨1, 1, 1, 1, ∗, ∗, ∗⟩, respectively. We call such a string a chromosome attribute-selection scheduling string S, on which genetic operators can be applied. Properties of S include:

Property 3 The parent position of position i is ⌊i/2⌋, except for i = 1, the root.

Property 4 The left and right child positions of position i are 2i and 2i + 1, respectively, if i ≤ 2^{d-2} − 1; otherwise, there are no children.

Property 5 The length of S is exponential in d: |S| = 2^{d-1} − 1.

Property 6 The possible integer values at position i are 1 to d − ⌈log(i + 1)⌉ + 1: Si ∈ {1, . . . , d − ⌈log(i + 1)⌉ + 1}.

Property 6 provides the restricted bounds for each position of S.

Fig. 6: Mutation operator.

Property 7 The number of possible strings S is |π(S)| = d∆, i.e., |π(S)| = \prod_{i=1}^{|S|} (d − ⌈log(i + 1)⌉ + 1). The lower and upper bounds of |π(S)| are Ω(2^{|S|}) and o(d^{|S|}).

3.2 Genetic Operators

Two of the most common genetic operators are mutation and crossover. The mutation operator changes the value at a certain position of a string to one of the other possible values in that position's range. We illustrate the mutation process on the attribute-selection scheduling string S1^f = ⟨3, 1, 3, 2, 1, 2, 2⟩ in Fig. 6. If a mutation occurs in the first position and changes the value to 4, which is in the range {1, . . . , 4}, T4^f is generated. If a mutation happens in the third position and changes the value to 2, which is in the range {1, . . . , 3}, then T5^f is generated. As long as the changed value is within the allowed range, the resulting string always generates a valid full binary decision tree.

Consider Sx = ⟨4, 1, 3, 1, 1, 2, 1⟩. A full binary decision tree is built according to this schedule, and the final decision tree for Sx is T2 in Fig. 2, i.e., Sx matches S2 = ⟨4, 1, ∗, 1, 1, ∗, ∗⟩. There are 3 × 2 × 2 = 12 equivalent schedules that produce T2. Therefore, for the mutation to be effective, we apply the mutation operator only to those positions that are not ∗.

To illustrate the crossover operator, consider the two parent attribute-selection scheduling strings P1 and P2 in Fig. 7. After randomly selecting a split point, the first part of P1 and the last part of P2 are combined to yield a child string S6. Reversing the roles of the parents produces a second child S7. The resulting full binary decision trees for these two children are T6^f and T7^f, respectively.

Fig. 7: Crossover operator.
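The sketch below (Python; the names upper_bounds, mutate, and crossover are ours) implements the two operators on scheduling strings using the per-position bounds of Property 6; with the cut placed after the second position, crossing P1 = ⟨3,1,3,2,1,2,2⟩ with P2 = ⟨4,3,2,2,1,1,2⟩ reproduces the children S6 and S7 of Fig. 7.

    import random

    def upper_bounds(d):
        """Largest legal value at each position of S (Property 6).
        Position i lies on level floor(log2 i) + 1, where d - level + 1
        attributes are still available; |S| = 2 ** (d - 1) - 1 (Property 5)."""
        return [d - i.bit_length() + 1 for i in range(1, 2 ** (d - 1))]

    def mutate(S, bounds, rng=random):
        """Redraw one randomly chosen position within its legal range."""
        S = list(S)
        i = rng.randrange(len(S))
        S[i] = rng.randint(1, bounds[i])
        return S

    def crossover(P1, P2, cut=None, rng=random):
        """One-point crossover; children remain valid because every position
        has the same legal range in every schedule."""
        if cut is None:
            cut = rng.randint(1, len(P1) - 1)
        return P1[:cut] + P2[cut:], P2[:cut] + P1[cut:]

    print(upper_bounds(4))                                  # [4, 3, 3, 2, 2, 2, 2]
    print(crossover([3,1,3,2,1,2,2], [4,3,2,2,1,1,2], 2))   # S6 and S7 of Fig. 7

In an actual run, as noted above, mutation would be applied only to positions that do not correspond to ∗ after decoding.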
3.3 Decoding

Decoding is the reverse of the encoding process in Fig. 5. Starting from the root node, we place the attribute indicated by the chromosome schedule S, which contains index values into the attribute list A. When an attribute a is selected, the data at the node are divided into left and right branches DL and DR: DL consists of all the xi with a value of 0 for a, and DR of all the xi with a value of 1. For each of the two sub-trees we repeat the process recursively with the new attribute list A − {a}. When a node becomes homogeneous, i.e., all class values of the instances reaching it are the same, it is labeled as a leaf. Fig. 8 displays the decision trees decoded from S4, S5, S6, and S7.

Fig. 8: Decoded decision trees.

Sometimes a chromosome introduces mutants. For instance, consider the chromosome S8 = ⟨3, 3, 2, 1, 1, 1, 2⟩, which results in T8 in Fig. 9. The symbol ⊗ occurs when the instances at a node with attribute a have non-homogeneous labels but the a column has either all 0's or all 1's; in other words, attribute a provides no information gain. We refer to such a decision tree as a mutant tree for two reasons. First, what label should be placed at ⊗? The label may be chosen at random, but the training data provide no clue. Second, if a value is entered for ⊗, Property 2 may be violated: the number of leaves l may exceed n. Indeed, T8 behaves identically to T6 with respect to D. Thus, mutant decision trees will not be chosen as the fittest trees by the fitness functions presented in the next subsection.

Fig. 9: Mutant binary decision tree.

3.4 Fitness Functions

Each attribute-selection scheduling string S must be evaluated according to a fitness function. We consider two cases: in the first, D contains no contradicting instances and d is a small finite number; in the other, d is very large. Contradicting instances are those whose attribute values are identical but whose target values differ. The case with small d and no contradicting instances has application to network function representation [20].

Theorem 1 Every attribute-selection scheduling string S produces a consistent full binary decision tree as long as the set of instances contains no contradicting instances.

Proof Every attribute-selection scheduling string S produces a full binary decision tree, e.g., that of Fig. 4. Since each instance in D corresponds to a certain branch, label the leaves with the target values of the instances accordingly. If the target value for a leaf is not available in the training set, assign a random target value. The result is a complete truth table, as the number of leaves is 2^d, and the value predicted by the decision tree is always consistent with the actual target value in D. If there are two instances with the same attribute values but different classes, no consistent binary decision tree exists.

Since the databases we consider contain no contradicting instances, one obvious fitness function is the size of the tree: the fitness function fs is the total number of nodes, e.g., fs(T2) = 11 and fs(T3) = 9. Here, however, we use a different fitness function fd, the sum over all instances of the depth of the leaf each instance reaches. This is equivalent to the total number of questions asked to classify the instances, e.g., fd(T2) = 18 and fd(T3) = fd(T6) = 15, as shown in Table 2. This requires the assumption that each instance in D has equal prior probability. Both of these evaluation functions indicate that T3 and T6 are better than T2, the tree produced by the ID3 algorithm.
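Continuing the earlier sketches, the following code (Python; decode and f_d are our own names) decodes a scheduling string against the training set as described in Section 3.3 and evaluates the depth-sum fitness fd; decoding S3 with its ∗ positions set arbitrarily to 1 yields T3, for which fd = 15 as in Table 2.

    DATA = [((0,0,0,0),'w1'), ((0,1,1,0),'w1'), ((1,0,1,0),'w1'),
            ((0,0,1,1),'w2'), ((1,1,0,0),'w2'), ((0,0,1,0),'w2')]   # x1..x6

    def decode(S, data, d):
        """Grow a tree following the breadth-first schedule S; a node whose
        instances are homogeneous becomes a leaf ('?' marks a mutant leaf)."""
        def build(pos, avail, subset):
            labels = {w for _, w in subset}
            if len(labels) <= 1:
                return labels.pop() if labels else '?'
            k = S[pos - 1] if pos <= len(S) else 1   # beyond |S| one attribute remains
            attr = avail[k - 1]
            rest = [a for a in avail if a != attr]
            left = [(x, w) for x, w in subset if x[attr] == 0]
            right = [(x, w) for x, w in subset if x[attr] == 1]
            return (attr, build(2 * pos, rest, left), build(2 * pos + 1, rest, right))
        return build(1, list(range(d)), data)

    def f_d(tree, data):
        """Sum over all instances of the number of questions asked (leaf depth)."""
        total = 0
        for x, _ in data:
            node, depth = tree, 0
            while isinstance(node, tuple):
                node = node[2] if x[node[0]] else node[1]
                depth += 1
            total += depth
        return total

    T3 = decode([1, 1, 1, 1, 1, 1, 1], DATA, 4)   # S3 with the *'s set to 1
    print(f_d(T3, DATA))                          # 15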
For two decision trees of the same size 2l − 1, where l is the number of leaves, the number of questions to be asked can still differ, as stated in Theorem 2.

Theorem 2 There exist Tx and Ty such that fs(Tx) = fs(Ty) and fd(Tx) < fd(Ty).

Proof Let Tx be a balanced binary tree and Ty a skewed (linear) binary tree. Then fd(Tx) = Θ(n log n) whereas fd(Ty) = Θ(n^2) = Θ(\sum_{i=1}^{n} i).

Table 2: Training instances and their depths in the decision trees.

        T1  T2  T3  T4  T5  T6  T7  T8
  x1     2   4   3   3   2   2   3   3
  x2     3   3   2   4   2   2   4   2
  x3     4   3   2   3   4   3   3   3
  x4     2   1   3   1   3   3   1   3
  x5     2   3   2   3   2   2   3   3
  x6     4   4   3   4   4   3   4   3
  sum   17  18  15  18  17  15  18  17

The fd fitness function thus prefers not only shorter decision trees to larger ones, but also balanced decision trees to skewed ones.

Corollary 1 n log c ≤ fd(Tx) ≤ (\sum_{i=1}^{n} i) − 1, where n is the number of instances.

Proof By Property 2, in the best case the decision tree has l = c leaves and is completely balanced, with height log c; thus the lower bound for fd(Tx) is n log c. In the worst case, the number of leaves is n by Property 2 and the decision tree is a skewed (linear) binary tree; thus the upper bound is (\sum_{i=1}^{n} i) − 1 = n(n + 1)/2 − 1.

If fd(Tx) < n log c, Tx classifies instances into only a subset of the classes; regardless of its size, such a tree should not be selected. If fd(Tx) > (\sum_{i=1}^{n} i) − 1, there is a mutant node and Tx can be shortened.

When d is large, the size of the chromosome |S| explodes by Property 5. In this case we must limit the height of the decision tree: the chromosome is given a finite length and schedules attribute selections only up to a certain height. Since typically n ≪ 2^d, a good choice for the height of the decision tree is h = log n, i.e., |S| ≈ n. When a branch reaches height h, the node is made a leaf labeled with the class of highest prior probability instead of selecting another attribute. When the height is limited, decision trees may no longer be consistent with D.

Suppose that the height is limited to h = 3 in the case of Fig. 3. Fig. 10 shows the pruned binary decision tree T9. At the 4th node in breadth-first traversal order, instead of choosing attribute B as in Fig. 3, the leaf is labeled w1 because the prior probability of w1 is higher than that of w2 in DLL. When the prior probabilities are equal, either class can be chosen.

Fig. 10: Pruned decision tree at the height of 3.

Let fa be the fitness function that gives the percentage of instances correctly classified by the decision tree. In Fig. 10, {x1, x2, x4, x5} are correctly predicted by T9 and hence fa(T9) = 4/6.

We randomly generated a training database with 26 binary attributes and limited the tree height to 8. Fig. 11 shows an initial population of 100 binary decision trees generated from random chromosomes, plotted by their fd(Tx) and their accuracy fa(Tx) on the training examples. The sizes of the decision trees and their accuracies on the training examples show no correlation.

Fig. 11: Binary decision trees with respect to fd and accuracy fa.
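To make fa concrete, the sketch below (Python; f_a is our own name) scores the pruned tree T9 of Fig. 10, written in the nested-tuple form of the earlier sketches, against the training set and recovers fa(T9) = 4/6.

    DATA = [((0,0,0,0),'w1'), ((0,1,1,0),'w1'), ((1,0,1,0),'w1'),
            ((0,0,1,1),'w2'), ((1,1,0,0),'w2'), ((0,0,1,0),'w2')]   # x1..x6

    # T9 of Fig. 10: root D; D = 1 -> w2; D = 0 -> test A, with majority-class
    # leaves w1 (A = 0) and w2 (A = 1) because the height is capped at 3.
    T9 = (3, (0, 'w1', 'w2'), 'w2')

    def f_a(tree, data):
        """Fraction of training instances the tree classifies correctly."""
        correct = 0
        for x, w in data:
            node = tree
            while isinstance(node, tuple):
                node = node[2] if x[node[0]] else node[1]
            correct += (node == w)
        return correct / len(data)

    print(f_a(T9, DATA))   # 0.666..., i.e., 4/6: x3 and x6 are misclassified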
4. Discussion

In this paper we reviewed binary decision trees and demonstrated how to utilize genetic algorithms to find compact binary decision trees. By limiting the tree height, the presented method is guaranteed to find a decision tree at least as good as those found by the best known algorithms, since such trees can be placed in the initial population.

Methods that apply GAs directly to decision trees [16, 17] can yield subtrees that are never visited, as shown in Fig. 12. After a mutation at node 'O' in Ta, the resulting tree Tc contains a dashed subtree that is never visited. After a crossover between Ta and Tb, the child trees Td and Te also contain dashed subtrees. These unnecessary subtrees occur whenever an attribute appears more than once in a path of a decision tree. With the encoding proposed here, this problem never occurs, as illustrated in Fig. 13. When d is large, limiting the height is recommended, as discussed above.

The encoding of the binary decision tree in this paper used a breadth-first traversal. A pre-order depth-first traversal can also be used, as shown in Fig. 13. The mutation operator presented here remains valid in the pre-order depth-first representation, but crossover may produce a mutant when the two subtrees to be exchanged are at different levels: the index value at a node may then exceed its bound, so a mutation immediately after the crossover is unavoidable. Analyzing and implementing the depth-first-traversal encoding of decision trees remains ongoing work. Encoding and decoding non-binary decision trees, where different attributes have different numbers of possible values, is an open problem.

References
[1] T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[2] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2001.
[3] J. R. Quinlan, Induction of decision trees, Machine Learning, Vol. 1, No. 1, pp. 81-106, 1986.
[4] L. Hyafil and R. L. Rivest, Constructing optimal binary decision trees is NP-complete, Information Processing Letters, Vol. 5, No. 1, pp. 15-17, 1976.
[5] H. L. Bodlaender and H. Zantema, Finding small equivalent decision trees is hard, International Journal of Foundations of Computer Science, Vol. 11, No. 2, pp. 343-354, World Scientific Publishing, 2000.
[6] S. R. Safavian and D. Landgrebe, A survey of decision tree classifier methodology, IEEE Transactions on Systems, Man and Cybernetics, Vol. 21, No. 3, pp. 660-674, 1991.
[7] J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Loh, BOAT-optimistic decision tree construction, in Proc. of the ACM SIGMOD Conference on Management of Data, 1999, pp. 169-180.
[8] Q. Zhao and M. Shirasaka, A study on evolutionary design of binary decision trees, in Proceedings of the Congress on Evolutionary Computation, Vol. 3, IEEE, 1999, pp. 1988-1993.
[9] K. Bennett and J. Blue, Optimal decision trees, Tech. Rpt. No. 214, Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, New York, 1996.
[10] P. Pajunen and M. Girolami, Implementing decisions in binary decision trees using independent component analysis, in Proceedings of ICA, 2000, pp. 477-481.
[11] D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, 1989.
[12] M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, 1996.
[13] K. M. Kim, J. J. Park, M. H. Song, I. C. Kim, and C. Y. Suen, Binary decision tree using genetic algorithm for recognizing defect patterns of cold mill strip, LNCS Vol. 3060, Springer, 2004.
[14] S. P. Teeuwsen, I. Erlich, M. A. El-Sharkawi, and U. Bachmann, Genetic algorithm and decision tree-based oscillatory stability assessment, IEEE Transactions on Power Systems, Vol. 21, No. 2, pp. 746-753, May 2006.
[15] J. Bala, J. Huang, H. Vafaie, K. DeJong, and H. Wechsler, Hybrid learning using genetic algorithms and decision trees for pattern classification, in Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, Canada, 1995, pp. 719-724.
[16] A. Papagelis and D. Kalles, GA Tree: genetically evolved decision trees, in Proceedings of the 12th IEEE International Conference on Tools with Artificial Intelligence, 2000, pp. 203-206.
[17] Z. Fu, An innovative GA-based decision tree classifier in large scale data mining, LNCS Vol. 1704, Springer, 1999, pp. 348-353.
[18] S. Cha and C. C. Tappert, Constructing binary decision trees using genetic algorithms, in Proceedings of the International Conference on Genetic and Evolutionary Methods, Las Vegas, Nevada, July 14-17, 2008.
[19] P. Grünwald, The Minimum Description Length Principle, MIT Press, June 2007.
[20] T. R. Martinez and D. M. Campbell, A self-organizing binary decision tree for incrementally defined rule based systems, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 21, No. 5, pp. 1231-1238, 1991.

Fig. 12: Direct GA on decision trees.

Fig. 13: Pre-order depth first traversal for decision trees.