Combining the Principal Components Method with Different Learning Algorithms

Luboš Popelínský
Faculty of Informatics, Masaryk University Brno, Czech Republic
popel@fi.muni.cz

Abstract. Principal components (PCO's) analysis is an unsupervised projection method that computes linear combinations of the original attributes constructed to explain maximally the variance. We show that the method is useful as a pre-processing step for various learning algorithms. Adding PCO's as new attributes results in a decrease of error rate for a significant number of data sets for the C5.0 decision tree learner, an instance-based learner, and the naive Bayes classifier. In the case of C5.0 the decrease of error rate can be achieved without an increase of decision tree complexity. Replacement of the original attributes with PCO's results in an increase of accuracy only for the naive Bayes classifier.

1 Introduction

In the process of learning, transformation and reduction of the data often play an important role. A good transformation/reduction technique may result in new attributes that are more appropriate for a learning algorithm. A wide range of constructive induction techniques [3,4,16] has been explored with the aim of enriching the language of propositional learners. Principal components (PCO's) [9,13] are useful linear functions that capture linear dependencies among attributes. This data preprocessing technique frequently serves for better understanding (e.g. visualisation) or for compressing the data [5,6]. To the best of our knowledge, few systematic studies have been carried out to evaluate its benefits and the associated computational costs. The aim of this work is to contribute to this issue.

We have shown in [10,11] that, if the query examples (attribute values of a test set) are known, the combination of the principal components method and the decision tree learner C5.0 leads to a decrease in error rate without an increase in hypothesis complexity (the size of a decision tree). Here we present results for the usual setting in which query examples are not known in advance. We show that also in this setting some decrease of error rate was observed for several learning algorithms.

The method presented here consists of two steps. In the first step we compute the principal components from the numerical attributes. In the second step we add these principal components to the original attributes. The first step is actually a form of constructive induction [3] applicable to numerical variables, which uses the methodology of principal component analysis to generate derived attributes. We show that for three learning algorithms - C5.0, an instance-based learner and naive Bayes - the addition of PCO's yields a decrease of error rate for a significant number of data sets. For C5.0 we introduce new criteria that say when PCO's have such an effect and what profit can be expected.

The paper is organised as follows. In Section 2 we briefly overview the main ideas of principal components. Information about the data sets can be found in Section 3. The method used is described in Section 4. A general overview of the experimental results can be found in Section 5. A discussion of the results follows in Section 6.

2 Principal Components Method

Principal components analysis [9,13] (or the Karhunen-Loeve expansion) is a statistical method for dimensionality reduction. Principal components are commonly used to replace the attributes with a smaller number of new attributes.
It is an unsupervised projection method that computes linear combinations of the original attributes constructed to explain maximally the variance. Let X = (X_1, ..., X_n) be a random vector with variance matrix V. When looking for the first principal component C_1 we look for a vector C_1 = (c_11, ..., c_1n) such that C_1 C_1^T = 1 and the variance of C_1 X^T is maximal. This means that C_1 X^T describes as much of the variability of the data X as possible. The next principal component is computed in a similar way, with the additional requirement that it must be uncorrelated with the first one, etc. Thus the first few principal components are supposed to contain most of the information implicit in the attributes. It can be shown [9,13] that the principal components C_i are equal to the eigenvectors of the variance matrix V.

3 Data sets

The data sets used represent a subset of the data sets used in comparative studies within the project METAL [2] (October 2000). We explored all data sets that do not contain missing values. We also removed from our experiments the data sets for which most learners reached an accuracy better than 99%, as no profit can be obtained for such data. Because of dimensionality problems (too many attributes or too many records, > 20000), the musk and pyrimidines data sets were removed, too. We evaluated both methods on 17 data sets containing both numerical and categorical attributes.

4 Method

Learners. We explored propositional learners that are frequently used in the machine learning community: the decision tree learner C5.0 without boosting - c50tree, the instance-based learner mlcib [1] and the naive Bayes classifier mlcnb [1,8]. For all learners the default settings of parameters have been used.

Algorithm. All results below are computed using 10-fold cross-validation. In each step the following algorithm was called:
1. Normalise the data (i.e. the continuous attributes);
2. Compute the first N principal components (linear functions) from the learning set;
3. Employ these linear functions to compute PCO's for the learning set as new attributes; ADD the computed values of the principal components to the learning set;
4. Normalise the test set using the means and standard deviations that have been computed from the learning set;
5. Employ the same linear functions as above to compute PCO's for the test set; ADD the computed values of the principal components to the test set;
6. Run a learning algorithm on the extended learning set;
7. Test the result on the extended test set.
We tested all possible values of N = 1..Number_of_Attributes for each data set (a code sketch of this procedure is given below).

5 Results

In Fig. 1 there are results for four data sets and the three learners c50tree, mlcib, mlcnb that display typical trends of increase/decrease of accuracy. Four typical trends can be observed that are independent of the particular learner. For some data sets PCO's displayed no effect - when increasing their number the error rate did not change (diabetes for all three learners; letter for c50tree). Sometimes the error rate was decreasing but it never dropped below the default level - the error rate on the data set without PCO's (letter and c50tree). The most interesting trends can be seen for the satimage and waveform21 data sets. For a small number of PCO's - 1 or 2 - the error rate is higher, but for 3, 4, ... PCO's it decreases below the default level (satimage for any of the three learners; waveform21 for c50tree, mlcnb). The important fact is that this trend is typical for the majority of data sets for which an extension of the data with PCO's results in an increase of accuracy.
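For concreteness, the pre-processing procedure of Section 4 can be sketched in code. This is only a minimal illustration under several assumptions: scikit-learn's PCA and DecisionTreeClassifier stand in for the normalisation, principal components and C5.0 steps of the paper, the data are synthetic, and all names are ours rather than the authors'.

```python
# A minimal sketch (not the authors' original code) of the pre-processing
# procedure from Section 4.  scikit-learn's PCA and DecisionTreeClassifier
# stand in for the principal components computation and for C5.0; the data
# are synthetic.  Everything here is for illustration only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                     # numeric attributes
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)     # synthetic class label
N_PCO = 3                                         # number of PCO's to add

errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]

    # Steps 1 and 4: normalise the learning set; normalise the test set with
    # the means and standard deviations computed on the learning set.
    mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
    X_tr_n, X_te_n = (X_tr - mu) / sd, (X_te - mu) / sd

    # Steps 2-3: compute the first N principal components on the learning
    # set and ADD them to the learning set as new attributes.
    pca = PCA(n_components=N_PCO).fit(X_tr_n)
    X_tr_ext = np.hstack([X_tr_n, pca.transform(X_tr_n)])

    # Step 5: apply the same linear functions to the test set and ADD the PCO's.
    X_te_ext = np.hstack([X_te_n, pca.transform(X_te_n)])

    # Steps 6-7: train on the extended learning set, test on the extended test set.
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr_ext, y_tr)
    errors.append(1.0 - clf.score(X_te_ext, y_te))

print("10-fold CV error rate with %d added PCO's: %.3f" % (N_PCO, float(np.mean(errors))))
```

Varying N_PCO from 1 to the number of attributes in this loop corresponds to the experimental setup described above, although with different learners and data the exact error rates will of course differ from those reported below.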
Tables 1, 2 and 3 contain results for C5.0 without boosting (c50tree), the instance-based learner mlcib, and the naive Bayes classifier mlcnb when PCO's have been added to the original data. The first three columns contain results for the optimal number of PCO's: the number of PCO's, a comparison with the error rate on the original data (++ - decrease of error rate significant on level > 95%, + - significant on level < 95%), the error rate ratio (the error rate for the extended data set divided by the error rate for the original one), and the error rate for the extended data. The second and the third triple of columns differ only in the first column. The last column of each table displays the error rate for the original data set.

Fig. 1. Error rate (%) as a function of the number of added principal components for the diabetes, letter, satimage and waveform21 data sets, for c50tree, mlcib and mlcnb (original data and data with added PCO's).

Table 1. Results for C5.0 without boosting (c50tree)

data set       best N of PCO's              added 2 PCO's         added 3 PCO's         c50tree
               N    e.r.ratio  err.rate     e.r.ratio  err.rate   e.r.ratio  err.rate   err.rate
ann            1        1.06     0.35           1.303    0.43        1.303     0.43       0.33
australian     1        1.000   15.51           1.009   15.65        1.000    15.51      15.51
diabetes       1        1.00    23.83           1.016   24.21        1.016    24.48      23.83
fluid*         2   ++   0.94     2.79      ++   0.94     2.79   ++   0.94      2.79       2.98
german num*    3   +    0.980   28.80      +    0.986   29.00   +    0.980    28.80      29.40
letter         1        1.01    12.16           1.021   12.34        1.024    12.38      12.09
optical        1   ++   0.91     8.36      ++   0.921    8.47   ++   0.925     8.51       9.20
page           3   +    0.99     3.05           1.010    3.12   +    0.99      3.05       3.09
pendigits     12   ++   0.893    3.24      +    0.986    3.58   ++   0.915     3.32       3.63
quisclas       6   +    0.984   37.14           1.014   38.28   +    0.998    37.67      37.75
satimage      11   ++   0.838   11.45      ++   0.943   12.88   ++   0.877    11.98      13.66
segment        3   +    0.964    3.51           1.022    3.72   +    0.964     3.51       3.64
vehicle        2        1.009   26.48           1.009   26.48        1.027    26.95      26.24
vowel          4   ++   0.837   18.18      ++   0.889   19.29   ++   0.870    18.89      21.71
waveform21     5   ++   0.728   17.10      ++   0.732   17.20   ++   0.729    17.12      23.48
waveform40     6   ++   0.689   17.28      ++   0.707   17.72   ++   0.709    17.78      25.08
yeast          2        1.014   43.87           1.014   43.87        1.025    44.34      43.26
Table 2. Results for the mlcib instance-based learner

data set       best N of PCO's              added 2 PCO's         added 3 PCO's         mlcib
               N    e.r.ratio  err.rate     e.r.ratio  err.rate   e.r.ratio  err.rate   err.rate
ann            4   +    0.99     2.32           1.02     2.37         1.01     2.36       2.33
australian     2   +    0.994   22.03      +    0.994   22.03         1.020   22.61      22.17
diabetes       3   ++   0.949   29.30      +    0.975   30.08   ++    0.949   29.30      30.86
fluid*         5   ++   0.821    4.28      ++   0.923    4.84   ++    0.893    4.65       5.21
german num*    3   +    0.968   30.80      +    0.974   31.00   +     0.968   30.80      31.80
letter        16   ++   0.815    5.20           1.028    6.56   +     0.992    6.33       6.38
optical        9   ++   0.886    3.02      +    0.977    3.33   +     0.956    3.26       3.41
page           2   +    0.971    3.73      +    0.971    3.73   +     0.990    3.80       3.84
pendigits      1   +    0.956    0.66           1.000    0.69         1.000    0.69       0.69
quisclas       3   +    0.99    41.08      +    0.99    41.27   +     0.99    41.08      41.54
satimage       3   +    0.981    8.81           1.006    9.03   +     0.981    8.81       8.98
segment       11   +    0.951    5.63           1.103    6.53         1.074    6.36       5.92
vehicle       16   ++   0.797   24.11           1.00    30.38         1.05    31.68      30.26
vowel          6   ++   0.54     0.71      ++   0.84     1.11   ++    0.92     1.21       1.31
waveform21     2   +    0.97    23.74      +    0.97    23.74   +     0.99    24.30      24.58
waveform40     3   +    0.95    28.36      +    0.96    28.61   +     0.95    28.36      29.84
yeast          4   +    0.95    47.03      +    0.98    48.38   +     0.98    48.45      49.33

First we have to answer the question about the usefulness of PCO's for a particular learner. For c50tree (Tab. 1), a decrease of error rate has been observed for 11 out of 17 data sets. This means that a decrease of error rate appeared for a significant number of data sets (one-sample Wilcoxon test on level 95%). However, the decrease itself is not significant on a level > 90% (t-test). It should be noticed that, with the exception of waveform21, none of these results reaches the accuracy of C5.0 with boosting.

For the instance-based learner mlcib we can see (first triple of columns in Tab. 2) that for every data set the error rate with PCO's is smaller than the error rate for the original data. Using a t-test, the hypothesis that PCO's do not influence the error rate can be rejected on level 90% (T=2.042, n=16). We can conclude that the extension of a data set with PCO's results in a significantly lower error rate. Unfortunately, a similar assertion does not hold for 3 PCO's.

As can be seen from Tab. 3, even for naive Bayes the method results in a decrease of error rate for most of the data sets. It is significant on level 99% (one-sample Wilcoxon test) but the decrease of error rate is not significant on level > 90% (t-test). In the next section we mainly discuss the results obtained with C5.0.
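The paper does not give details of these tests (a one-sample Wilcoxon test and a t-test over the 17 data sets), so the following is only a rough sketch of how such a comparison could be run with scipy, formulated here as paired tests on per-data-set error rates; the few value pairs are taken from Table 1, everything else is an assumption made for illustration.

```python
# Sketch (our reconstruction, not the authors' scripts) of significance tests
# over data sets: a Wilcoxon signed-rank test and a paired t-test on error
# rates with and without added PCO's.
import numpy as np
from scipy import stats

# A few illustrative value pairs taken from Table 1 (c50tree error rate on the
# original data vs. error rate with the best number of added PCO's).
err_original = np.array([23.83, 15.51, 12.09, 9.20, 3.63, 13.66, 21.71, 23.48])
err_with_pco = np.array([23.83, 15.51, 12.16, 8.36, 3.24, 11.45, 18.18, 17.10])

# One-sided Wilcoxon signed-rank test: are error rates with PCO's lower?
w_stat, w_p = stats.wilcoxon(err_with_pco, err_original, alternative="less")

# Paired t-test on the same pairs of error rates.
t_stat, t_p = stats.ttest_rel(err_with_pco, err_original)

print("Wilcoxon: W=%.1f, p=%.3f   paired t-test: t=%.2f, p=%.3f"
      % (w_stat, w_p, t_stat, t_p))
```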
Table 3. Results for the mlcnb naive Bayes classifier

data set       best N of PCO's              added 2 PCO's         added 3 PCO's         mlcnb
               N    e.r.ratio  err.rate     e.r.ratio  err.rate   e.r.ratio  err.rate   err.rate
ann            6   ++   0.90     4.06      ++   0.905    4.08   ++    0.94     4.24       4.51
australian     1        1.077   24.35           1.090   24.64         1.078   24.35      22.61
diabetes       8        1.032   24.87           1.065   25.65         1.092   26.30      24.09
fluid*         9   ++   0.919    8.56           1.000    9.31   ++    1.000    9.31       9.31
german num*    1        1.004   24.40           1.004   24.40         1.062   25.80      24.30
letter        16   ++   0.854   30.42           1.056   37.62         1.045   37.25      35.63
optical       10   +    0.971    9.38           1.006    9.72   +     0.997    9.63       9.66
page           5   ++   0.727    7.29      ++   0.919    9.22   ++    0.821    8.24      10.03
pendigits     10   ++   0.791   11.69           1.032   14.73   ++    0.947   13.51      14.27
quisclas      16   +    0.971   58.83      +    0.997   60.43         0.998   60.48      60.60
satimage      34   ++   0.861   17.64      +    0.986   20.20   ++    0.926   18.96      20.48
segment       11   ++   0.840   16.88           1.036   20.82         1.012   20.34      20.09
vehicle       17   ++   0.71    37.71           1.02    54.14         1.01    53.55      53.07
vowel         10   ++   0.84    27.37                   33.03                 34.74      32.52
waveform21     2   ++   0.90    17.20      ++   0.90    17.20   ++    0.94    17.90      19.08
waveform40     9   ++   0.93    18.50      ++   0.93    18.50   ++    0.93    18.56      19.96
yeast          7   +    0.96    45.55                   46.36                 45.96      47.57

6 Discussion

For what data sets are PCO's useful? The statements above can be reformulated in the following way: for most data sets there exists a number N of PCO's that are good candidates for adding to the original set of attributes. A question that needs to be answered now is: what is the optimal number N of PCO's that we should add?

One way of finding N is based on the eigenvalues of the variance matrix. In [10,11] N was set to the number of eigenvalues that are significantly greater than one. It can be computed using the following criterion [14]: N = #{eigenvalues λ | λ > 1 + 2 * sqrt((n-1)/(m-1))}, where n is the number of attributes and m the number of examples.

Another possibility is to exploit the level of variance that is explained by the first N principal components. Let X = (X_1, ..., X_n) again be a random vector with variance matrix V. Let Z_i, i = 1, ..., r, be the principal components, where r is the number of positive eigenvalues of the variance matrix V. Then the following equality holds:

var Z_1 + var Z_2 + ... + var Z_r = var X_1 + var X_2 + ... + var X_n

As var Z_i = λ_i, where λ_i is the i-th eigenvalue of V, the previous equation can be rewritten as

λ_1 + λ_2 + ... + λ_r = var X_1 + var X_2 + ... + var X_n

The sum σ^2 = var X_1 + var X_2 + ... + var X_n can be seen as a variability measure of the vector X. Thus, to explain X in terms of Z we need a number N of principal components such that (λ_1 + ... + λ_N)/σ^2 is close to 1.

It was already mentioned above that for most data sets the trend of accuracy as a function of N has a similar shape. For a small number of PCO's - 1 or 2 - the error rate is higher than the error rate of c50tree, but starting from 3 PCO's the error rate decreases below this level. Compare now the first and the third triple of columns in Tab. 1. We can see that all data sets that are promising for some value of N are also promising for N = 3. This does not exclude that for some N > 3 the error rate ratio is even smaller; however, the expected difference is not big (see Tab. 1). To prevent over-fitting we prefer to add as few attributes as possible. Thus N = 3 seems to be a good choice.

Let us validate this hypothesis. As shown above, (λ_1 + ... + λ_N)/σ^2 expresses the quality of the approximation via the first N PCO's. For 15 of the 17 data sets explored it was observed that if the first three PCO's explain more than three fifths of the variability, i.e. (λ_1 + λ_2 + λ_3)/σ^2 > 0.6, then an increase of accuracy appears. This is significant on level 99% (sign test).
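Both criteria for choosing N are easy to compute from the eigenvalues of the variance matrix. The sketch below illustrates them under the assumption that the criterion of [14] has the form reconstructed above; the function name, the 0.6 threshold default and the synthetic data are ours, for illustration only.

```python
# Sketch of the two criteria for choosing N discussed above (our
# reconstruction; n attributes, m examples, eigenvalues of the variance matrix).
import numpy as np

def choose_n_pcos(X, var_threshold=0.6):
    """X: (m, n) matrix of normalised numeric attributes (m examples, n attributes)."""
    m, n = X.shape
    cov = np.cov(X, rowvar=False)                     # variance matrix V (n x n)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]  # eigenvalues, descending

    # Criterion 1 (form of [14] as reconstructed in the text):
    # count the eigenvalues significantly greater than one.
    cut = 1.0 + 2.0 * np.sqrt((n - 1.0) / (m - 1.0))
    n_by_eigenvalue = int(np.sum(eigvals > cut))

    # Criterion 2: smallest N whose first N components explain at least
    # var_threshold of the total variability sigma^2 = sum of eigenvalues.
    explained = np.cumsum(eigvals) / eigvals.sum()
    n_by_variance = int(np.searchsorted(explained, var_threshold)) + 1

    return n_by_eigenvalue, n_by_variance

# Tiny synthetic example: 300 examples, 10 attributes, one strongly correlated pair.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=300)
X = (X - X.mean(axis=0)) / X.std(axis=0)              # normalise as in Section 4
print(choose_n_pcos(X))
```

In practice such a function would be applied to the normalised learning set of each cross-validation fold, before deciding how many PCO's to add.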
In the following text we call a data set promising (for a given learner) if the error rate on the extended data is smaller than on the original set of attributes.

How did the hypothesis size change? We compared the sizes of decision trees (number of nodes) with the corresponding sizes of trees built with PCO's. For 13 out of 17 data sets the size of the generated decision tree decreased when principal components were used. Among the promising data sets, only for the fluid data did the tree complexity exceed the complexity of the tree without PCO's. A small increase in hypothesis complexity was observed for fluid, from 15.4 to 16.3 (5.2%), which corresponds to 1 extra node in the decision tree. More extra nodes appeared for australian (an increase from 19.9 to 21.8, Δ=9.5%, i.e. approx. 2 extra nodes), yeast (166.3 to 181.5, Δ=9.1%, 15 nodes) and diabetes (22.1 to 25.4, Δ=14.9%, approx. 3 extra nodes). When the complexity of the 3 PCO's (number of attributes * 3) is added to the complexity of the tree, this sum was smaller than the c50tree complexity only for waveform21.

Where do PCO's appear in a decision tree? For the promising data sets we observed that at least one of the PCO's appeared in the root (on level 1) of the decision tree or not deeper than on level 4. This holds for 15 out of 17 data sets (australian was removed because no PCO appeared in the tree) and is significant for the sign test on level 99%. It must be noticed that it is the 2nd principal component that appears "soonest": it appeared as the first PCO in the decision tree for 7 out of 11 such data sets (the 1st PCO three times, the 3rd PCO twice).

Table 4. Training time

data set       PCO time   c50tree+ train. time   c50tree time
ann              0.71           0.56                 0.43
australian       0.04           0.07                 0.04
fluid            0.18           0.11                 0.15
diabetes         0.06           0.08                 0.05
letter          10.05           7.68
optical          0.85           4.36                 7.42
pendigits        3.96           1.02                 3.37
page             0.52           0.82                 1.14
quisclas         1.87           2.44                 2.53
satimage         4.22           3.73                 4.55
segment          0.75           0.57                 0.61
vehicle          0.31           0.23                 0.24
vowel            0.34           0.74                 0.51
waveform21       0.20           1.87                 2.55
waveform40       1.79           3.68                 5.46
yeast            0.23           0.43                 0.60

Training time. In our analysis of times (Tab. 4), we first compared the training time with and without PCO's. For 12 out of 17 data sets the training time for the data with PCO's was smaller than for the same data without PCO's. Among the promising data sets, only vowel displayed a greater training time for the data with PCO's. Then the time to compute the PCO's was added to the training time of the decision tree and compared with the time used to construct a decision tree without PCO's. Only for the ann data set did the ratio of these times exceed 3, and only for 5 out of 16 data sets was the ratio greater than 2, i.e. more than twice slower than without PCO's. For the optical data the sum of the PCO time and the training time was even smaller than the time without PCO's. Thus, the time complexity seems acceptable.

Other learners. We also performed first experiments with other learners, namely with C5.0 with boosting, with a linear discriminant method, and with linear trees (decision trees that can introduce linear functions in their nodes) [7]. However, no significant decrease of error rate has been observed for them. The reason may be that they already form some kind of linear combination of attributes themselves.

7 Conclusion

We described a method that employs principal components analysis as a pre-processing stage for propositional learners.
To summarise our results:
- adding principal components to the original data set results in a decrease of error rate for a significant number of data sets for C5.0, the instance-based learner and the naive Bayes learner;
- the decrease of error rate is significant for the instance-based learner;
- for C5.0 the decrease of error rate can be achieved without increasing the complexity of the decision tree.

Naive Bayes is the only algorithm for which the replacement of attributes with PCO's leads to a decrease of error rate; this happened for 11 out of 17 data sets.

Some of the results are rather preliminary, e.g. the criterion for the usefulness of PCO's, and need to be evaluated on other data sets. Nor did we compare PCO's with other preprocessing methods for attribute reduction. Further work should be carried out to explore other ways of exploiting principal components, e.g. pruning them (simplifying the linear functions) or choosing only some of them. No theoretical analysis has been performed yet to answer these questions. The exploitation of other statistical methods, e.g. correspondence analysis, is also a challenge.

Acknowledgements. Thanks to Pavel Brázdil and other members of LIACC of the University of Porto for their help. This work has been partially supported by the ESPRIT METAL Project.

References
1. Machine Learning Library in C++. http://www.sgi.com/Technology/mlc/
2. Home page of the Esprit LTR Project METAL. http://www.ncc.up.pt/liacc/ML/METAL/
3. Bala J.W., Michalski R.S., Wnek J.: The Principal Axes Method for Constructive Induction. In Sleeman D. and Edwards P. (eds.), Machine Learning: Proceedings of the Ninth International Workshop (ML92), Morgan Kaufmann, Los Altos/Palo Alto/San Francisco, pp. 20-29, 1992.
4. Bloedorn E. and Michalski R.S.: Data-Driven Constructive Induction. IEEE Intelligent Systems, Special issue on Feature Transformation and Subset Selection, pp. 30-37, March/April 1998.
5. Duszak Z. and Koczkodaj W.W.: Using Principal Component Transformation in Machine Learning. Proceedings of the International Conference on Systems Research, Informatics and Cybernetics, Baden-Baden, Germany, pp. 125-129, 1994.
6. Famili A., Shen W.-M., Weber R., Simoudis E.: Data Preprocessing and Intelligent Data Analysis. Intelligent Data Analysis, Vol. 1, No. 1, 1997.
7. Gama J.: Probabilistic linear tree. In D. Fisher (ed.), Proceedings of the 14th Int. Conf. on Machine Learning (ICML'97), Morgan Kaufmann, 1997.
8. Mitchell T.M.: Machine Learning. McGraw-Hill, New York, 1997.
9. Murtagh F. and Heck A.: Multivariate Data Analysis. Kluwer Academic, Dordrecht, 1987.
10. Popelínský L., Brazdil P.: Combining the Principal Components Method with Decision Tree Learning. In Proceedings of the Multistrategy Learning Workshop MSL-2000, Guimaraes, Portugal, 2000.
11. Popelínský L., Brazdil P.: The Principal Components Method as a Pre-processing Stage for Decision Tree Learning. PKDD 2000 Workshop on Metalearning, Lyon, 2000.
12. Quinlan J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, 1992.
13. Ripley B.D.: Pattern Recognition and Neural Networks. Cambridge University Press, 1996.
14. Saporta G.: Some simple rules for interpreting outputs of principal components and correspondence analysis. In Bacelar-Nicolau H., Nicolau F.C., Janssen J. (eds.): Proceedings of the 9th Intl. Symp. on Applied Stochastic Models and Data Analysis ASMDA-99, University of Lisbon, 1999.
15. Torgo L.: Inductive Learning of Tree-based Regression Models. PhD Thesis, Faculty of Science, University of Porto, 1999.
16. Wnek J. and Michalski R.S.: Hypothesis-driven Constructive Induction in AQ17-HCI: A Method and Experiments. Machine Learning, Vol. 14, No. 2, pp. 139-168, 1994.
Appendix A: Results

Figures (error rate in % as a function of the number of principal components) for the byzantine, diabetes, letter, germannumb, satimage and waveform21 data sets, comparing c50tree, mlcib and mlcnb on the original data, on data with added PCO's, and on data with attributes replaced by PCO's.