© The McGraw-Hill Companies, Inc., 2008

C4.5 and CART

ID3
- Creates a tree using information theory concepts and tries to reduce the expected number of comparisons.
- ID3 chooses the split attribute with the highest information gain:
  $Gain(D, S) = H(D) - \sum_{i=1}^{s} P(D_i) H(D_i)$
  where $H(D)$ is the entropy of data set $D$ and $D_1, \dots, D_s$ are the subsets that split $S$ produces.

ID3
[Figure: tree diagram. Node labels: Attribute 1, Attribute 2, Attribute 2'. Branch labels: Value 1, Value 2.]

[Worked ID3 example; figures not recovered. Slide titles in order: Entropy of D; Attribute Humidity; Attribute Wind; Best Attribute Chosen. Remaining figure labels: Strong, Weak.]

Day  D1  D2  …  D13  D14
     No  No  …  Yes  No

C4.5
- ID3 favors attributes with a large number of divisions.
- Improved version of ID3:
  - Missing data
  - Continuous data
  - Pruning
  - Rules
  - GainRatio: $GainRatio(D, S) = \frac{Gain(D, S)}{H\left(\frac{|D_1|}{|D|}, \dots, \frac{|D_s|}{|D|}\right)}$

ID3: ?
C4.5: OK

CART
- Creates a binary tree.
- Uses entropy.
- Formula to choose the split point $s$ for node $t$:
  $\Phi(s \mid t) = 2 P_L P_R \sum_{j=1}^{m} \left| P(C_j \mid t_L) - P(C_j \mid t_R) \right|$
- $P_L$, $P_R$: probability that a tuple in the training set will be on the left or right side of the tree.
- $P(C_j \mid t_L)$, $P(C_j \mid t_R)$: probability that a tuple is in class $C_j$ and in the left or right subtree, measured against all tuples at node $t$.
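The entropy-based criteria from the ID3 and C4.5 slides can be sketched in Python as follows. This is a minimal illustration, not the slides' own worked example: the function names (`entropy`, `information_gain`, `gain_ratio`) and the class counts in `parent`, `two_way`, and `by_id` are hypothetical, chosen only to show how a split with one subset per tuple gets the highest gain under ID3 but a lower gain ratio under C4.5.

```python
from math import log2


def entropy(counts):
    """H = -sum(p_i * log2(p_i)) over class counts; zero counts are skipped."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Gain = H(parent) - sum_i P(D_i) * H(D_i), with children as class counts."""
    total = sum(parent)
    weighted = sum(sum(ch) / total * entropy(ch) for ch in children)
    return entropy(parent) - weighted


def gain_ratio(parent, children):
    """Gain divided by the entropy of the subset sizes (the split information)."""
    split_info = entropy([sum(ch) for ch in children])
    if split_info == 0:  # only one non-empty subset: nothing was split
        return 0.0
    return information_gain(parent, children) / split_info


# Hypothetical class counts [positive, negative]; not taken from the slides.
parent = [7, 7]
two_way = [[7, 1], [0, 6]]            # an ordinary two-valued attribute
by_id = [[1, 0]] * 7 + [[0, 1]] * 7   # one subset per tuple, like a Day/ID column

for name, split in (("two_way", two_way), ("by_id", by_id)):
    print(name,
          "gain:", round(information_gain(parent, split), 4),
          "gain ratio:", round(gain_ratio(parent, split), 4))
```

With these assumed counts, the per-tuple split has the larger information gain, while the two-way split has the larger gain ratio, which is the bias the C4.5 slide refers to.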
Choose the split point with the maximum value.

Case Study

Age         Income    Risk     Result
not-middle  not-high  not-low  On-time
not-middle  not-high  not-low  On-time
not-middle  not-high  not-low  Late
not-middle  not-high  not-low  Late
not-middle  not-high  low      On-time
not-middle  not-high  not-low  Late
not-middle  not-high  not-low  On-time
not-middle  not-high  not-low  On-time
not-middle  not-high  not-low  On-time
not-middle  not-high  low      On-time
not-middle  not-high  low      On-time
not-middle  high      not-low  Late
middle      not-high  not-low  Late
middle      not-high  low      On-time
middle      not-high  low      On-time
middle      high      low      On-time
middle      high      low      On-time
not-middle  high      low      On-time
not-middle  high      low      On-time
not-middle  high      low      On-time

CART Analysis
At the start, there are three choices for split point. In each calculation, the terms of the last factor are |class count on the left - class count on the right| divided by the number of tuples at the node; for Age, |4 - 11|/20 = 7/20 (On-time) and |1 - 4|/20 = 3/20 (Late).

$\Phi(\text{Age}) = 2\,(5/20)(15/20)(7/20 + 3/20) = 0.1875$
  Age [15 On-time, 5 Late]
    Middle:     [4 On-time, 1 Late]
    Not-middle: [11 On-time, 4 Late]

$\Phi(\text{Income}) = 2\,(6/20)(14/20)(5/20 + 3/20) = 0.168$
  Income [15 On-time, 5 Late]
    High:     [5 On-time, 1 Late]
    Not-high: [10 On-time, 4 Late]

$\Phi(\text{Risk}) = 2\,(10/20)(10/20)(5/20 + 5/20) = 0.25$ (maximum)
  Risk [15 On-time, 5 Late]
    Low:     [10 On-time, 0 Late]
    Not-low: [5 On-time, 5 Late]

Step 2: (tuples with Risk = not-low)

Age         Income    Risk     Result
not-middle  not-high  not-low  On-time
not-middle  not-high  not-low  On-time
not-middle  not-high  not-low  Late
not-middle  not-high  not-low  Late
not-middle  not-high  not-low  Late
not-middle  not-high  not-low  On-time
not-middle  not-high  not-low  On-time
not-middle  not-high  not-low  On-time
not-middle  high      not-low  Late
middle      not-high  not-low  Late

CART Analysis
At this node, two choices remain for the split point:

$\Phi(\text{Age}) = 2\,(1/10)(9/10)(5/10 + 3/10) = 0.144$
  Age [5 On-time, 5 Late]
    Middle:     [0 On-time, 1 Late]
    Not-middle: [5 On-time, 4 Late]

$\Phi(\text{Income}) = 2\,(1/10)(9/10)(5/10 + 3/10) = 0.144$
  Income [5 On-time, 5 Late]
    High:     [0 On-time, 1 Late]
    Not-high: [5 On-time, 4 Late]

The two values are identical; Age is used for the split.

Risk [15 On-time, 5 Late]
  Low:     [10 On-time, 0 Late]
  Not-low: [5 On-time, 5 Late], split on Age
    Middle:     [0 On-time, 1 Late]
    Not-middle: [5 On-time, 4 Late]

Step 3: (tuples with Risk = not-low and Age = not-middle)

Age         Income    Risk     Result
not-middle  not-high  not-low  On-time
not-middle  not-high  not-low  On-time
not-middle  not-high  not-low  Late
not-middle  not-high  not-low  Late
not-middle  not-high  not-low  Late
not-middle  not-high  not-low  On-time
not-middle  not-high  not-low  On-time
not-middle  not-high  not-low  On-time
not-middle  high      not-low  Late

Income is the only attribute that still varies, so it is the last split:

Risk [15 On-time, 5 Late]
  Low:     [10 On-time, 0 Late]
  Not-low: [5 On-time, 5 Late], split on Age
    Middle:     [0 On-time, 1 Late]
    Not-middle: [5 On-time, 4 Late], split on Income
      High:     [0 On-time, 1 Late]
      Not-high: [5 On-time, 3 Late]
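The split scores in the CART analysis above can be checked with a short Python sketch. It assumes the reading of the measure that the worked numbers imply: each class term is a count difference divided by the number of tuples at the node. The names `ROWS`, `ATTRS`, `phi`, and `score_all` are illustrative choices, and the rows are transcribed from the 20-tuple table (row order does not affect the result).

```python
from collections import Counter

# (age, income, risk, result), transcribed from the 20-tuple case study table.
ROWS = (
    [("not-middle", "high", "low", "On-time")] * 3
    + [("middle", "high", "low", "On-time")] * 2
    + [("middle", "not-high", "low", "On-time")] * 2
    + [("middle", "not-high", "not-low", "Late")]
    + [("not-middle", "high", "not-low", "Late")]
    + [("not-middle", "not-high", "low", "On-time")] * 3
    + [("not-middle", "not-high", "not-low", "On-time")] * 5
    + [("not-middle", "not-high", "not-low", "Late")] * 3
)

# Attribute name -> (column index, value sent to the left branch).
ATTRS = {"Age": (0, "middle"), "Income": (1, "high"), "Risk": (2, "low")}


def phi(rows, col, left_value):
    """2 * P_L * P_R * sum_j |P(C_j|t_L) - P(C_j|t_R)| for one binary split."""
    n = len(rows)
    left = Counter(r[3] for r in rows if r[col] == left_value)
    right = Counter(r[3] for r in rows if r[col] != left_value)
    p_left = sum(left.values()) / n
    p_right = sum(right.values()) / n
    # Class probabilities are taken relative to all n tuples at this node.
    diff = sum(abs(left[c] - right[c]) / n for c in set(left) | set(right))
    return 2 * p_left * p_right * diff


def score_all(rows):
    """Score every candidate attribute split at a node."""
    return {name: phi(rows, col, val) for name, (col, val) in ATTRS.items()}


root = score_all(ROWS)                                  # first split
step2 = score_all([r for r in ROWS if r[2] == "not-low"])  # Risk = not-low node

print({k: round(v, 4) for k, v in root.items()})
print({k: round(v, 4) for k, v in step2.items()})
```

Run as written, this should reproduce the three root scores with Risk as the maximum, and the tie between Age and Income at the not-low node; Risk scores zero there because every remaining tuple is not-low.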