Artif Intell Rev (2018) 49:105–130
https://doi.org/10.1007/s10462-016-9518-2

The robustness of majority voting compared to filtering misclassified instances in supervised classification tasks

Michael R. Smith · Tony Martinez

Published online: 22 September 2016
© Springer Science+Business Media Dordrecht 2016

Abstract  Removing or filtering outliers and mislabeled instances prior to training a learning algorithm has been shown to increase classification accuracy, especially in noisy data sets. A popular approach is to remove any instance that is misclassified by a learning algorithm. However, the use of ensemble methods has also been shown to generally increase classification accuracy. In this paper, we extensively examine filtering and ensembling. We examine 9 learning algorithms individually and ensembled together as filtering algorithms, as well as the effects of filtering on the 9 chosen learning algorithms, on a set of 54 data sets. We compare the filtering results with using a majority voting ensemble. We find that the majority voting ensemble significantly outperforms filtering unless there are high amounts of noise present in the data set. Additionally, in most cases, using an ensemble of learning algorithms for filtering produces a greater increase in classification accuracy than using a single learning algorithm for filtering.

Keywords  Voting ensemble · Class noise · Filtering · Machine learning

Michael R. Smith, msmith4@sandia.gov
Department of Computer Science, Brigham Young University, Provo, UT 84602, USA

1 Introduction

The goal of supervised machine learning is to induce an accurate generalizing function $\mathcal{F}: X \mapsto Y$ from a set of input feature vectors $X = \{x_1, x_2, \ldots, x_n\}$ and a corresponding set of label vectors $Y = \{y_1, y_2, \ldots, y_n\}$. The quality of the function $\mathcal{F}$ induced by a learning algorithm depends on the quality of the data used for training. However, many real-world data sets are inherently noisy, where the noise in a data set can be label noise and/or attribute noise. Label noise has been shown to be more detrimental than attribute noise (Zhu and Wu 2004) and is the focus of this paper. Noise can arise from various sources such as subjectivity, human error, and sensor malfunctions. Most learning algorithms are designed to tolerate a certain degree of noise by avoiding overfitting the training data. There are two general approaches for handling class noise: (1) creating learning algorithms that are robust to noise, such as the C4.5 algorithm for decision trees (Quinlan 1993), and (2) preprocessing the data prior to inducing a model of the data, such as filtering (Wilson 1972; Brodley and Friedl 1999), weighting (Rebbapragada and Brodley 2007; Smith and Martinez 2014), or correcting (Teng 2003) noisy instances. Previous works have generally examined filtering in a limited context using a single or very few learning algorithms and/or using a limited number of data sets. This may be due in part to the extra computational requirement to first filter a data set and then induce a model of the data using the filtered data set. As such, previous works were generally limited to investigating relatively fast learning algorithms such as decision trees (John 1995) and nearest-neighbor algorithms (Tomek 1976; Wilson and Martinez 2000).
In addition, filtering prior to using instance-based learning algorithms was motivated in part by the desire to reduce the number of instances that have to be stored and because instance-based learning algorithms are more sensitive to noise than other learning algorithms. Most previous works also added artificial noise to the data set (making 5–50 % of the instances noisy) to show that filtering, weighting, or cleaning the data set is beneficial.

In this work, we examine filtering misclassified instances and using a majority voting ensemble on a set of 54 data sets and 9 learning algorithms without adding artificial noise. The artificial noise was added in previous works to show that filtering/weighting/cleaning provided significant improvements on noisy data sets. Within the context of the benefits of filtering established by previous work, we examine the extent to which filtering affects the performance of a learning algorithm without adding artificial noise to a data set. This also avoids making assumptions about the generation of the noise, which may or may not be accurate, and it shows the effect of filtering on the inherent noise in real-world data sets, which is not known beforehand. The results provide insights on the robustness of a majority voting ensemble and on when to employ a misclassification filter. Using a larger number of data sets allows for more statistical confidence in the results than if only a small number of data sets were used.

We find that, in general, a voting ensemble is robust to noise and achieves significantly higher classification accuracy when trained on unfiltered data than a single learning algorithm trained on filtered data. For filtering, we find that using an ensemble filter achieves significantly higher classification accuracy than using a single learning algorithm as a filter. On data sets with higher percentages of inherently noisy instances, however, using the ensemble filter achieves higher classification accuracy than a voting ensemble for some learning algorithms. Training a voting ensemble on filtered training data significantly decreases classification accuracy compared to training a voting ensemble on unfiltered training data. This is likely due to a reduction of diversity in the induced models of the ensemble.

In the next section, we present previous works for handling noise in supervised classification problems. A mathematical motivation for filtering misclassified instances is presented in Sect. 3. We then present our experimental methodology in Sect. 4, followed by a presentation of the results in Sect. 5. In Sect. 6 we provide conclusions and directions for future work.

2 Related work

As many real-world data sets are inherently noisy, most learning algorithms are designed to tolerate a certain degree of noise. Typically, learning algorithms are designed to be somewhat robust to noise by making a trade-off between the complexity of the induced model and optimizing the induced function on the training data to prevent overfitting. Some techniques to avoid overfitting include early stopping using a validation set, pruning (such as in the C4.5 algorithm for decision trees; Quinlan 1993), or regularization by adding a complexity penalty to the loss function (Bishop and Nasrabadi 2006). Some previous works have examined how class noise and attribute noise affect the performance of various learning algorithms (Zhu and Wu 2004; Nettleton et al.
2010) and found that class noise is generally more harmful than attribute noise and that noise in the training set is more harmful than noise in the test set. Further, some learning algorithms have been adapted specifically to better handle label noise. For example, noisy instances are problematic for boosting algorithms (Schapire 1990; Freund 1990) in which more weight is placed upon misclassified instances, which often include mislabeled and noisy instances. To address this, Servedio (2003) presented a boosting algorithm that does not place too much weight on any single training instance. For support vector machines, Collobert et al. (2006) use the ramp-loss function to place a bound on the maximum penalty for an instance that lies on the wrong side of the margin. Lawrence and Schölkopf (2001) explicitly model the possibility that an instance is mislabeled using a generative model and then use expectation maximization to update the probability that an instance is mislabeled.

Preprocessing the data set is another approach that explicitly handles label noise. This can be done by removing noisy instances, weighting the instances, or correcting incorrect labels. All three approaches first attempt to identify which instances are noisy by various criteria. Filtering noisy instances has received much attention and has generally resulted in an increase in classification accuracy (Gamberger et al. 2000; Smith and Martinez 2011). One frequently used filtering technique removes any instance that is misclassified by a learning algorithm (Wilson 1972) or by a set of learning algorithms (Brodley and Friedl 1999). Verbaeten and Van Assche (2003) further pursued the idea of using an ensemble for filtering, using ideas from boosting and bagging. Other approaches use learning algorithm heuristics to remove noisy instances. Segata et al. (2009), for example, remove instances that are too close to, or on the wrong side of, the decision surface generated by a support vector machine. Zeng and Martinez (2003) remove instances while training a neural network that have a low probability of being labeled correctly, where the probability is calculated using the output from the neural network. Filtering has the potential downside of discarding useful instances. However, it is assumed that there are significantly more non-noisy instances and that throwing away a few correct instances along with the noisy ones will not have a negative impact on a large data set.

Weighting the instances in a training set has the benefit of not discarding any instances. Rebbapragada and Brodley (2007) weight the instances using expectation maximization to cluster instances that belong to a pair of classes. The probabilities between classes for each instance are compiled and used to weight the influence of each instance. Smith and Martinez (2014) examine weighting the instances based on their probability of being misclassified.

Similar to weighting the training instances, data cleaning does not discard any instances but rather strives to correct the noise in the instances. As in filtering, the output from a learning algorithm has been used to clean the data. Automatic data enhancement (Zeng and Martinez 2001) uses the output from a neural network to correct the label for training instances that have a low probability of being correctly labeled. Polishing (Teng 2000, 2003) trains a learning algorithm (in this case a decision tree) to predict the value for each attribute (including the class). The predicted (i.e.
corrected) attribute values for the instances that increase generalization accuracy on a validation set are used instead of the uncleaned attribute values.

We differ from the related work in that we do not add artificial noise to the data sets when we examine filtering. Thus, we avoid making any assumptions about the noise source and focus on the noise inherent in the data sets. We also examine the effects of filtering on a larger set of learning algorithms and data sets, providing more significance to the generality of the results.

[Fig. 1 Graphical model of the generative probabilistic model proposed by Lawrence and Schölkopf (2001), panels a and b]

3 Modeling class noise in a discriminative model

Lawrence and Schölkopf (2001) proposed to model a data set probabilistically using a generative model that models the noise process. They assume that the joint distribution $p(x, y, \hat{y})$ (where $x$ is the set of input features, $y$ is the observed, possibly noisy, class label given in the training set, and $\hat{y}$ is the actual, unknown class label) is factorized as $p(y \mid \hat{y})\, p(x \mid \hat{y})\, p(\hat{y})$, as shown in Fig. 1a. However, since modeling the prior distribution of the unobserved random variable $\hat{y}$ is not feasible, it is more practical to estimate the prior distribution $p(\hat{y})$ with some assumptions about the class noise, as shown in Fig. 1b.

Here, we follow the premise of Lawrence and Schölkopf by explicitly modeling the possibility that an instance is misclassified. Rather than using a generative model, though, we use a discriminative model since we are focusing on classification tasks and do not require the full joint distribution. Also, discriminative models have been shown to yield better performance on classification tasks (Ng and Jordan 2001). Using a discriminative model that accounts for class noise motivates our investigation of filtering and of using a majority voting ensemble.

Let $T$ be a training set composed of instances $(x_i, y_i)$ drawn i.i.d. from the underlying data distribution $\mathcal{D}$. Each instance is composed of an input vector $x_i$ with a corresponding, possibly noisy, label $y_i$. Given the training data $T$, a learning algorithm generally seeks to find the most probable hypothesis $h$ that maps each $x_i \mapsto y_i$. For supervised classification problems, most learning algorithms maximize $p(y_i \mid x_i, h)$ for all instances in $T$. This is shown graphically in Fig. 2a, where the probabilities are estimated using a discriminative approach such as a neural network or a decision tree to induce a hypothesis of the data. Using Bayes' rule and decomposing $T$ into its individual constituent instances, the maximum a posteriori (MAP) hypothesis is:

\[
\operatorname*{argmax}_{h \in H} p(h \mid T) = \frac{p(T \mid h)\, p(h)}{p(T)} \propto \prod_i p(x_i, y_i \mid h)\, p(h) = \prod_i p(y_i \mid x_i, h)\, p(x_i \mid h)\, p(h). \qquad (1)
\]

In Eq. 1, the MAP hypothesis $h$ is found by finding a global optimum where all instances are included in the optimization problem. However, noisy instances are often detrimental to finding the global optimum since they are not representative of the true (and unknown) underlying data distribution $\mathcal{D}$. The possibility of label noise is not explicitly modeled in this form; $\hat{y}_i$ is ignored completely. Thus, label noise is generally handled by avoiding overfitting, such that more probable, simpler hypotheses are preferred ($p(h)$).

[Fig. 2 Graphical representation of a discriminative probabilistic model for (a) $p(y \mid x)\,p(x)$ and (b) $p(y \mid x, \hat{y})\,p(\hat{y} \mid x)\,p(x)$]
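For readability, the steps compressed into Eq. 1 can be written out explicitly. The block below is only a restatement of the derivation above under the stated i.i.d. assumption; it introduces no new quantities.

\[
\begin{aligned}
\operatorname*{argmax}_{h \in H} p(h \mid T)
  &= \operatorname*{argmax}_{h \in H} \frac{p(T \mid h)\, p(h)}{p(T)} && \text{(Bayes' rule)} \\
  &= \operatorname*{argmax}_{h \in H} p(T \mid h)\, p(h) && \text{($p(T)$ does not depend on $h$)} \\
  &= \operatorname*{argmax}_{h \in H} \prod_i p(x_i, y_i \mid h)\, p(h) && \text{(instances drawn i.i.d.)} \\
  &= \operatorname*{argmax}_{h \in H} \prod_i p(y_i \mid x_i, h)\, p(x_i \mid h)\, p(h) && \text{(product rule on $p(x_i, y_i \mid h)$).}
\end{aligned}
\]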
The possibility of label noise can be modeled explicitly by including the latent random variable $\hat{y}_i$ along with $x_i$ and $y_i$. Thus, an instance is the triplet $(x_i, y_i, \hat{y}_i)$, and a supervised learning algorithm seeks to maximize $p(y_i \mid x_i, \hat{y}_i, h)$, modeled graphically in Fig. 2b. Using the model in Fig. 2b, the MAP hypothesis becomes:

\[
\operatorname*{argmax}_{h \in H} p(h \mid T) = \prod_i p(y_i \mid x_i, \hat{y}_i, h)\, p(\hat{y}_i \mid x_i, h)\, p(x_i \mid h)\, p(h). \qquad (2)
\]

Equation 2 shows that for an instance $x_i$, the probability of an observed class label ($p(y_i \mid x_i, \hat{y}_i, h)$) should be weighted by the probability of the actual class ($p(\hat{y}_i \mid x_i, h)$). What we are really interested in is the probability that $\hat{y}_i = y_i$. Using a discriminative model $h$ trained on $T$, we can calculate $p(\hat{y}_i \mid y_i, x_i, h)$ as

\[
p(\hat{y}_i \mid y_i, x_i, h) = p(\hat{y}_i \mid y_i, h)\, p(y_i \mid x_i, h)\, p(h).
\]

Since the quantity $p(\hat{y}_i \mid y_i, h)$ is unknown, $p(\hat{y}_i \mid y_i, x_i, h)$ can be approximated as $p(y_i \mid x_i, h)$, assuming that $p(\hat{y}_i \mid y_i, h)$ is represented in $h$. In other words, the induced discriminative model is able to model whether one class label is more likely than another class label given an observed, possibly noisy, label. Otherwise, all class labels are assumed to be equally likely given an observed label. Thus, $p(\hat{y}_i \mid y_i, x_i, h)$ can be approximated by finding the class distribution for a given $x_i$ from an induced discriminative model. That is, after training a learning algorithm on $T$, the class distribution for an instance $x_i$ can be calculated based on the output from the learning algorithm. As shown in Eq. 1, $p(y_i \mid x_i, h)$ arises naturally through a derivation of Bayes' rule. The quantity $p(y_i \mid x_i, h)$ is the likelihood of an instance given a hypothesis $h$, which a learning algorithm tries to maximize for each instance. Further, the dependence on a specific $h$ can be removed by summing over all possible hypotheses $h$ in $H$ and multiplying each $p(y_i \mid x_i, h)$ by $p(h)$:

\[
p(\hat{y}_i \mid y_i, x_i) \approx p(y_i \mid x_i) = \sum_{h \in H} p(y_i \mid x_i, h)\, p(h). \qquad (3)
\]

This formulation is infeasible, though, because (1) it is not practical (or possible) to sum over the set of all hypotheses, (2) calculating $p(h)$ is non-trivial, and (3) not all learning algorithms produce a probability distribution. These limitations make probabilistic generative models attractive, such as the kernel Fisher discriminant algorithm (Lawrence and Schölkopf 2001). However, for classification tasks, generative models generally have a higher asymptotic error than discriminative models (Ng and Jordan 2001). The following section shows how we estimate $p(\hat{y}_i \mid y_i, x_i, h)$.

This framework for modeling class noise in a discriminative model motivates removing instances with low $p(\hat{y}_i \mid y_i, x_i, h)$ and using ensembles to lessen the dependence on a given hypothesis $h$. Following Eq. 2, removing instances with low $p(\hat{y}_i \mid x_i, h)$ will increase the global $p(h \mid T)$ since $p(y_i \mid x_i, \hat{y}_i, h)$ will be low. Further, following Eq. 3, an ensemble should theoretically be more robust to the bias of a particular hypothesis and to noise by utilizing multiple overfit-avoidance techniques. This motivates our examination of a majority voting ensemble composed of models induced by different learning algorithms¹ as a filtering technique and as a classifier.

4 Methodology

In this section, we present how we calculate $p(\hat{y}_i \mid y_i, x_i, h)$ as well as the learning algorithms and data sets that we use in our analysis. We also provide an overview of our experiments.

4.1 Calculating $p(\hat{y}_i \mid y_i, x_i, h)$

To calculate $p(\hat{y}_i \mid y_i, x_i, h)$ for each instance, we use a model induced by training a learning algorithm on the training set $T$.
To lessen the dependence of $p(\hat{y}_i \mid y_i, x_i, h)$ on a particular $h$, we approximate marginalizing over the hypothesis space $H$ by selecting a diverse set of learning algorithms to represent $H$. The diversity of the learning algorithms refers to the learning algorithms not having the same classification for all of the instances, and it is determined using unsupervised meta-learning (UML) (Lee and Giraud-Carrier 2011). UML first uses Classifier Output Difference (COD) (Peterson and Martinez 2005) to measure the diversity between learning algorithms. COD measures the distance between two learning algorithms as the probability that the learning algorithms make different predictions. UML then clusters the learning algorithms based on their COD scores with hierarchical agglomerative clustering. We considered 20 learning algorithms from Weka with their default parameters (Hall et al. 2009). The resulting dendrogram is shown in Fig. 3, where the height of the line connecting two clusters corresponds to the distance (COD value) between them. A cut-point of 0.18 was chosen to create 9 clusters, and a representative algorithm from each cluster was used to create a diverse set of learning algorithms. The learning algorithms that were used are listed in Table 1. UML provides a diverse set of learning algorithms intended to be representative of $H$.

[Fig. 3 Dendrogram of the considered learning algorithms clustered using unsupervised meta-learning based on their classifier output difference]

Table 1 Set of learning algorithms used for filtering
  Multilayer Perceptron trained with Back Propagation (MLP)
  Decision Tree (C4.5) (Quinlan 1993)
  Locally Weighted Learning (LWL)
  5-Nearest Neighbors (5-NN)
  Nearest Neighbor with generalization (NNge)
  Naive Bayes (NB)
  RIpple DOwn Rule learner (RIDOR)
  Random Forest (RandForest)
  Repeated Incremental Pruning to Produce Error Reduction (RIPPER)

¹ As opposed to an ensemble composed of models induced by the same learning algorithm, such as bagging or boosting.

4.2 Experiments

Given a method for estimating $p(\hat{y}_i \mid y_i, x_i, h)$ and for lessening the dependence on a specific $h$, we examine several techniques for filtering instances with low $p(\hat{y}_i \mid y_i, x_i, h)$ and for constructing a voting ensemble.

4.2.1 Misclassification filters

In this paper, we examine misclassification filters, which filter any instance that is misclassified by a given learning algorithm. Given that a number of different learning algorithms could be employed for filtering, we conduct an extensive evaluation of filtering misclassified instances using the diverse set of learning algorithms described in the previous section. Each learning algorithm first filters the instances from the training set that it misclassifies and then induces a model of the data using the filtered training set. Misclassification filters using a single learning algorithm establish a good baseline to compare against.
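To make the procedure concrete, the following is a minimal sketch of a single-learner misclassification filter. It is not the authors' code: the paper uses Weka learners with default parameters, whereas this sketch assumes scikit-learn estimators, and the function names (filter_misclassified, train_with_biased_filter) are illustrative.

```python
# Minimal sketch of a single-learner ("biased") misclassification filter (Sect. 4.2.1).
# Assumes scikit-learn; the paper itself used Weka implementations of the learners in Table 1.
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def filter_misclassified(learner, X, y):
    """Train `learner` on (X, y) and drop every instance it misclassifies."""
    model = clone(learner).fit(X, y)
    keep = model.predict(X) == y            # True for instances the filter classifies correctly
    return X[keep], y[keep]

def train_with_biased_filter(learner, X, y):
    """Biased filter: the same algorithm both filters the data and induces the final model."""
    X_filtered, y_filtered = filter_misclassified(learner, X, y)
    return clone(learner).fit(X_filtered, y_filtered)

# Example usage (C4.5 is approximated here by a CART decision tree):
# model = train_with_biased_filter(DecisionTreeClassifier(), X_train, y_train)
```

The cross-validated variant described in Sect. 4.3, which removes instances misclassified in held-out folds, would replace the single fit in filter_misclassified with per-fold predictions.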
4.2.2 Ensemble filter

We also examine using an ensemble filter, removing instances that are misclassified by different percentages of the 9 learning algorithms. The ensemble filter more closely approximates $p(\hat{y}_i \mid y_i, x_i)$ from Eq. 3 since it sums over a set of learning algorithms (which in this case were chosen to be diverse and to represent a larger subset of the hypothesis space $H$), lessening the dependence on a single hypothesis $h$. For the ensemble filter, $p(\hat{y}_i \mid y_i, x_i)$ is estimated using a subset of learning algorithms $C$:

\[
p(\hat{y}_i \mid y_i, x_i) \approx p(y_i \mid x_i, C) \approx \frac{1}{|C|} \sum_{j=1}^{|C|} p(y_i \mid x_i, l_j(T)) \qquad (4)
\]

where $l_j(T)$ is the hypothesis from the $j$th learning algorithm trained on the training set $T$. From Eq. 3, $p(h)$ is estimated as $\frac{1}{|C|}$ for each hypothesis generated from training the learning algorithms in $C$ on $T$ and as zero for all of the other hypotheses in $H$. Also, $p(y_i \mid x_i, l_j(T))$ is estimated using the indicator function $\mathbb{1}(h(x_i) = y_i)$ with $h = l_j(T)$, since not all learning algorithms produce a probability distribution over the output classes. Set up as such, the ensemble filter removes any instance that is misclassified by $x\,\%$ of the learning algorithms in the ensemble. In this paper, we examine an ensemble filter that removes instances misclassified by 50, 70, and 90 percent of the learning algorithms in the ensemble. One of the problems of using an ensemble filter is having to choose the percentage of learning algorithms that must misclassify an instance for it to be filtered. For the results, we report the accuracy from the percentage that produces the highest accuracy, using 5 by 10-fold cross-validation to choose the best percentage for each data set. This method highlights the impact of using an ensemble filter; in practice, however, a validation set is often used to determine the percentage that would be used.

4.2.3 Adaptive filter

We also examine an adaptive filtering approach that iteratively adds a learning algorithm to a set of filtering learning algorithms by selecting the learning algorithm from a set of candidate learning algorithms C that produces the highest classification accuracy on a validation set when added to the set of learning algorithms used for filtering, as shown in Algorithm 1. The function runLA(F) trains a learning algorithm on a data set using the filter set F to filter the instances and returns the accuracy of the learning algorithm on a validation set. As with the ensemble filter, instances are removed that are misclassified by a given percentage of the filtering learning algorithms. The idea is to choose an optimal subset of learning algorithms through a greedy search of the candidate filtering algorithms. For the results, we report the accuracy from the percentage that produces the highest accuracy, using n-fold cross-validation to choose the best percentage for each data set.

Algorithm 1 Adaptively constructing a filter set
1: Let F be the filter set used for filtering and C be the set of candidate learning algorithms for F.
2: Initialize F to the empty set: F ← {}
3: Initialize the current accuracy to the accuracy from an empty filter set: currAcc ← runLA({}).  {runLA(F) returns the accuracy from a learning algorithm trained on a data set filtered with F.}
4: while C ≠ {} do
5:   bestAcc ← currAcc; bestLA ← null
6:   for all g ∈ C do
7:     tempF ← F + g; acc ← runLA(tempF)
8:     if acc > bestAcc then
9:       bestAcc ← acc; bestLA ← g
10:    end if
11:   end for
12:   if bestAcc > currAcc then
13:     C ← C − bestLA; F ← F + bestLA; currAcc ← bestAcc
14:   else
15:     break
16:   end if
17: end while
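The sketch below illustrates the ensemble filter of Sect. 4.2.2 under the indicator-function estimate of Eq. 4. It is an illustrative sketch rather than the authors' implementation: the paper uses the 9 Weka learners of Table 1, whereas this code assumes scikit-learn estimators, and the names ensemble_filter and train_on_filtered, as well as the threshold parameter, are hypothetical.

```python
# Minimal sketch of the ensemble filter (Sect. 4.2.2): an instance is removed when at least
# `threshold` (e.g. 0.5, 0.7, or 0.9) of the diverse learners misclassify it after training on T.
# Assumes scikit-learn estimators in place of the Weka learners of Table 1.
import numpy as np
from sklearn.base import clone

def ensemble_filter(learners, X, y, threshold=0.7):
    """Return the subset of (X, y) kept by the ensemble misclassification filter."""
    # miss[j, i] is 1.0 when learner j misclassifies instance i (the negated indicator of Eq. 4).
    miss = np.array([clone(l).fit(X, y).predict(X) != y for l in learners], dtype=float)
    misclassified_fraction = miss.mean(axis=0)      # approximates 1 - p(y_i | x_i, C)
    keep = misclassified_fraction < threshold
    return X[keep], y[keep]

def train_on_filtered(final_learner, learners, X, y, threshold=0.7):
    """Filter the training set with the ensemble, then induce the final model on what remains."""
    X_f, y_f = ensemble_filter(learners, X, y, threshold)
    return clone(final_learner).fit(X_f, y_f)
```

The adaptive filter of Algorithm 1 would wrap ensemble_filter in a greedy loop over candidate learners, keeping a candidate only when it improves accuracy on a validation set.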
4.2.4 Majority voting ensemble

In addition, we examine the use of a majority voting ensemble compared to filtering misclassified instances. The majority voting ensemble is composed of the diverse set of learning algorithms described in Sect. 4.1. The classification of an instance is the class that receives the most votes from the trained ensemble of models.

4.3 Evaluation

Each method is evaluated using 5 by 10-fold cross-validation (running 10-fold cross-validation 5 times, each time with a different seed to partition the data). We examine filtering using the 9 chosen learning algorithms on a set of 47 data sets from the UCI data repository and 7 non-UCI data sets (Thomson and McQueen 1996; Salojärvi et al. 2005; Sayyad Shirabad and Menzies 2005; Stiglic and Kokol 2009). For filtering, we examine two methods for training the filtering algorithms: (1) removing the misclassified instances when the filter is trained on the entire training set, and (2) using cross-validation on the training set and removing instances that are misclassified in the validation folds. The number of folds for cross-validation on the training set was set to 2, 3, 4, and 5. Table 2 shows the data sets used in this study, organized according to the number of instances, number of attributes, and attribute type. The non-UCI data sets are in bold. Statistical significance between pairs of algorithms is determined using the Wilcoxon signed-ranks test, as suggested by Demšar (2006); a sketch of the evaluation protocol is given after the list below. We emphasize the extensive nature of this evaluation:

1. Filtering is examined on 9 diverse learning algorithms.
2. 9 diverse learning algorithms are examined as misclassification filtering techniques.
3. In addition to the single-algorithm misclassification filters, an ensemble filter and an adaptive filter are examined.
4. Each filtering method is examined on a set of 54 data sets using 5 by 10-fold cross-validation.
5. Each filtering method is examined on the entire training set as well as using 2-, 3-, 4-, and 5-fold cross-validation.
6. A majority voting ensemble is examined on a set of 54 data sets using 5 by 10-fold cross-validation.
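As a concrete reading of this protocol, the sketch below runs 5 by 10-fold cross-validation for one method on one data set and compares two methods across data sets with the Wilcoxon signed-ranks test. It is a simplified illustration, not the experimental code: it assumes scikit-learn and SciPy, and train_fn stands for any of the methods above (e.g. a filtered learner or the voting ensemble).

```python
# Minimal sketch of the evaluation protocol of Sect. 4.3: 5 x 10-fold cross-validation per
# data set, and a Wilcoxon signed-ranks test over paired per-data-set accuracies (Demsar 2006).
import numpy as np
from scipy.stats import wilcoxon
from sklearn.model_selection import StratifiedKFold

def five_by_tenfold_accuracy(train_fn, X, y, n_repeats=5):
    """Average accuracy of `train_fn` over 5 differently seeded runs of 10-fold cross-validation."""
    accs = []
    for seed in range(n_repeats):
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        for train_idx, test_idx in cv.split(X, y):
            model = train_fn(X[train_idx], y[train_idx])   # e.g. train_on_filtered(...) or a voting ensemble
            accs.append(np.mean(model.predict(X[test_idx]) == y[test_idx]))
    return float(np.mean(accs))

def compare_methods(accuracies_a, accuracies_b):
    """Wilcoxon signed-ranks test over the paired per-data-set accuracies of two methods."""
    statistic, p_value = wilcoxon(accuracies_a, accuracies_b)
    return p_value
```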
Table 2 Data sets used, organized by number of instances (# Ins, M), number of attributes (k), and attribute type (categorical, numerical, or mixed)
  M < 100, k < 10: Contact lenses, Post-operative, cm1_req
  M < 100, 10 < k < 100: Lung cancer, desharnais, Labor, Pasture
  100 < M < 1000, k < 10: Breast-w, Iris, Badges 2, Breast cancer, Ecoli, Teaching-assistant, Pima Indians, Glass, Bupa, Balance Scale
  100 < M < 1000, 10 < k < 100: Audiology, Ionosphere, Annealing, Soybean (large), Wine, Dermatology, Lymphography, Sonar, Credit-A, Congressional voting records, Heart-Statlog, Credit-G, ar1, Horse Colic, Vowel, Heart-c, Primary-Tumor, Hepatitis, Zoo, Autos, Heart-h, eucalyptus
  100 < M < 1000, k > 100: AP_Breast-Uterus, Arrhythmia
  1000 < M < 10,000, k < 10: Car Evaluation, Yeast, Titanic
  1000 < M < 10,000, k < 100: Waveform-5000, Thyroid (sick & hypothyroid), Segment, Spambase, Ozone level-detection
  M > 10,000, k < 10: Nursery, MAGIC Telescope
  M > 10,000, k < 100: Chess (King-Rook vs. King-Pawn), Eye-movements

5 Results

In this section, we present the results from filtering the 54 data sets using (1) a biased misclassification filter (the same learning algorithm used to filter misclassified instances is used to induce a model of the data), (2) the ensemble filter, and (3) the adaptive filter, as well as the results from a voting ensemble. Our results can be summarized as follows: (1) using a voting ensemble is generally preferable to filtering, (2) when filtering, our results suggest that using the ensemble filter produces the best results in all cases, and (3) filtering is preferable to a voting ensemble in some cases with high amounts of noise. Except for the adaptive filter, we find that using cross-validation on the training set for filtering results in a lower accuracy (and often significantly lower) than using the entire training set; as such, the following results for the biased filter and the ensemble filter are from using the entire training set for filtering rather than cross-validation. We first show how filtering affects each learning algorithm in Sect. 5.1. Next, we examine using a set of data set measures to determine when filtering is the most effective in Sect. 5.2. In Sect. 5.3, we compare filtering with a voting ensemble.

5.1 Filtering results

The results of the biased, ensemble, and adaptive filters are summarized in Table 3, showing the average classification accuracy for each learning algorithm and filtering algorithm pair.²

² The NNge learning algorithm did not finish running on two data sets (eye-movements and MAGIC telescope), and RIPPER did not finish on the lung cancer data set. In these cases, the data sets are omitted from the presented results; as such, NNge was evaluated on a set of 52 data sets and RIPPER on a set of 53 data sets.

Table 3 Summary of filtering using the same learning algorithm to filter misclassified instances and to induce a model of the data (Biased), the ensemble filter, and the adaptive filter

            MLP    C4.5   IB5    LWL    NB     NNge   RF     Rid    RIP
  Orig      81.74  80.80  79.91  72.80  76.94  80.14  82.28  79.90  79.76
  Biased    81.72  80.75  79.53  70.91  75.88  80.34  82.14  79.02  79.87
  Ensemble  83.40  81.61  80.85  73.48  78.92  82.21  82.93  80.57  81.26
  Adaptive  82.38  80.63  80.01  73.44  78.48  81.33  81.87  80.00  80.43

For all learning algorithms, the ensemble filter significantly increases the classification accuracy. The values in bold represent those that are a statistically significant improvement over not filtering.

The results of the statistical significance tests for each of the learning algorithms are provided in Tables 10, 11, 12, 13, 14, 15, 16, 17 and 18 in "Appendix 1". We find that using a biased filter does not significantly increase the classification accuracy for any of the learning algorithms and that using a biased filter significantly decreases the classification accuracy for the LWL, naive Bayes, Ridor, and RIPPER learning algorithms (Tables 13, 14, 17, 18). These results suggest that simply removing the instances misclassified by a single learning algorithm is not sufficient. Bear in mind that these results reflect not adding any artificial noise to the training set. In the case where artificial noise is added to the training set (as was commonly done in previous works), using a biased filter may result in an improvement in accuracy. However, most real-world scenarios do not artificially add noise to their data sets but are concerned with the inherent noise found within them.

For all of the learning algorithms, the ensemble filter significantly increases the classification accuracy over not filtering and over the other filtering techniques. An ensemble generally provides better predictive performance than any of its constituent learning algorithms (Polikar 2006) and generally yields better results when the underlying ensembled models are diverse (Kuncheva and Whitaker 2003). Thus, by using a more powerful model, only the noisiest instances are removed.
This provides empirical evidence supporting the notion that filtering instances with low $p(y_i \mid x_i)$, which is not dependent on a single hypothesis, is preferable to filtering instances where the probability of the class is dependent on a particular hypothesis, $p(y_i \mid x_i, h)$, as outlined in Eq. 3. Surprisingly, the adaptive filter does not outperform the ensemble filter and, in one case, it does not even outperform training on unfiltered data. Perhaps this is because it overfits the training data, since the best accuracy is chosen on the training set. Adaptive filtering has significantly better results when cross-validation is used to filter misclassified instances as opposed to removing misclassified instances that were also used to train the filtering algorithm. Even using the cross-validation results, however, the adaptive filter is not significantly better than the ensemble filter.

Examining each learning algorithm individually, we find that some learning algorithms are more robust to noise than others. To determine which learning algorithms are more robust to noise, we compare the accuracy of the learning algorithms without filtering to the accuracy obtained using the ensemble filter. The p values from the Wilcoxon signed-ranks statistical significance test are shown in Table 4, ordered from greatest (least significant impact) to least, reading from left to right.

Table 4 The p values from the Wilcoxon signed-ranks statistical significance test comparing not filtering with the ensemble filter

          RF     C4.5   Rid    IB5    NNge   MLP    LWL    RIP     NB
  p val   0.045  0.035  0.019  0.018  0.006  0.004  0.004  <0.001  <0.001

The learning algorithms are ordered in descending order of p value from left to right.

We see that random forests and decision trees are the most robust to noise, as filtering has the least significant impact on their accuracy. This is not too surprising given that the C4.5 algorithm was designed to take noise into account and that random forests are built using decision trees. Ridor and 5-nearest neighbors (IB5) are more robust to noise, but still improve considerably with filtering. IB5 is more robust to noise since it considers the 5 nearest neighbors of an instance; if k were set to 1, filtering would have a greater effect on the accuracy. Filtering has the most significant effect on the accuracy of the last five learning algorithms: MLP, NNge, LWL, RIPPER, and naive Bayes.

5.2 Analysis of when to filter

Using only the inherent noise in a data set, the efficacy of filtering is limited and can be detrimental on some data sets. Thus, we examine the cases in which filtering significantly improves the classification accuracy. This investigation is similar to the recent work by Sáez et al. (2013), who investigate creating a set of rules to understand when to filter using a 1-nearest neighbor learning algorithm. They use a set of data complexity measures from Ho and Basu (2002). The complexity measures are designed for binary classification problems, yet we do not limit ourselves to binary classification problems. As such, we use a subset of the data complexity measures, shown in Table 5, that have been extended to handle multi-class problems (Orriols-Puig et al. 2009). In addition, we also examine a set of hardness measures (Smith et al. 2014), shown in Table 6.
The hardness measures are designed to determine and characterize instances that have a high likelihood of being misclassified, and they are taken with respect to a specific instance. For the hardness measures, the "disjunct" refers to the leaf of a decision tree that classifies the investigated instance. We examine using the set of data complexity measures and the hardness measures to create rules and/or a classifier to determine when to use filtering. We set up the classification problem similarly to Sáez et al., where the features are the complexity measures and the hardness measures. The class label is set to "TRUE" if filtering significantly improves the classification accuracy for a data set according to the Wilcoxon signed-ranks test; otherwise it is set to "FALSE". We also examine predicting the difference in accuracy between using and not using a filter. Unlike Sáez et al., we find that the data complexity measures and the hardness measures do not create a satisfactory classifier to determine when to filter. Granted, we examine more learning algorithms and do not artificially add noise to the data sets, which leaves few data sets where filtering significantly improves the classification accuracy. In the study by Sáez et al., 75 % of the data sets had at least 5 % noise added, providing more positive examples. More future work is required to determine when to use filtering on unmodified data sets. Based on our results, we would recommend always using the ensemble filter for all of the learning algorithms, as it significantly outperforms the other filtering techniques.

Table 5 List of complexity measures from Ho and Basu (2002)
  F2, Volume of overlap region: The overlap of the per-class bounding boxes, calculated for each attribute by normalizing the difference of the maximum and minimum values from each class
  F3, Max individual feature efficiency: For all of the features, the maximum ratio of the number of instances not in the overlapping region to the total number of instances
  F4, Collective feature efficiency: F3 only returns the ratio for the attribute that maximizes the ratio; F4 is a measure over all of the attributes
  N1, Fraction of points on class boundary: The fraction of instances in a data set that are connected to nearest neighbors with a different class in a spanning tree
  N2, Ratio of ave intra/inter class NN dist: The average distance to the nearest intra-class neighbors divided by the average distance to the nearest inter-class neighbors
  N3, Error rate of 1NN classifier: Leave-one-out error estimate of 1NN
  T1, Fraction of maximum covering spheres: The normalized count of the number of clusters of instances containing a single class
  T2, Ave number of points per dimension: Compares the number of instances to the number of features

Table 6 List of hardness measures from Smith et al.
(2014)
  kDN, k-Disagreeing neighbors: The percentage of the k nearest neighbors (using Euclidean distance) of an instance that do not share its target class value
  DS, Disjunct size: The number of instances covered by the disjunct that the investigated instance belongs to, divided by the number of instances covered by the largest disjunct, in an unpruned decision tree induced using C4.5 (Quinlan 1993)
  DCP, Disjunct class percentage: The number of instances in a disjunct that have the same class label as the investigated instance divided by the total number of instances in the disjunct, in a pruned decision tree
  TD, Tree depth: The depth of the leaf node that classifies an instance in an induced decision tree
  CL, Class likelihood: The probability that an instance belongs to its class given the input features
  CLD, Class likelihood difference: The difference between the class likelihood of an instance and the maximum class likelihood for all of the other classes
  MV, Minority value: The ratio of the number of instances sharing the instance's target class value to the number of instances in the majority class
  CB, Class balance: The difference between the ratio of the number of instances belonging to the instance's class and the ratio the classes would have if they were distributed equally

5.3 Voting ensemble versus filtering

In this section, we compare the results of filtering using the ensemble filter with a voting ensemble. The voting ensemble uses the same learning algorithms as the ensemble filter (Table 1), and the vote from each learning algorithm is equally weighted. Table 7 compares the voting ensemble with using the ensemble filter on each of the investigated learning algorithms, giving the average accuracy, the p value, and the number of times that the accuracy of the voting ensemble is greater than, equal to, or less than that obtained using the ensemble filter. The results for each data set are provided in Table 19 in "Appendix 2". With no artificially generated noise, a voting ensemble achieves significantly higher classification accuracy than the ensemble filter for each of the examined learning algorithms.

Table 7 Summary of the comparison between a voting ensemble and filtering using the ensemble filter on each learning algorithm

           Ensemble  MLP        C4.5       IB5        LWL       NB
  Acc      84.37     83.40      81.61      80.85      73.48     78.92
  p value            0.008      <0.001     <0.001     <0.001    <0.001
  >, =, <            33, 1, 20  43, 1, 10  42, 2, 10  47, 1, 6  41, 0, 13

           Ensemble  NNge       RF         Rid        RIP
  Acc      84.37     81.59      82.93      80.57      80.76
  p value            <0.001     <0.001     <0.001     <0.001
  >, =, <            44, 2, 8   39, 0, 15  47, 1, 6   44, 1, 9

The average accuracy for each method over all tested data sets is used to illustrate the differences. Using the voting ensemble significantly improves the classification accuracy over using the ensemble filter for all of the examined learning algorithms.

This is not too surprising considering that previous research has shown that ensemble methods address issues that are common to all non-ensemble learning algorithms (Dietterich 2000) and that ensemble methods generally obtain greater accuracy than a single learning algorithm that makes up part of the ensemble (Opitz and Maclin 1999). Considering the computational requirements for training, using a voting ensemble for classification rather than filtering appears to be more beneficial. Many previous studies (Zhu and Wu 2004; Lawrence and Schölkopf 2001; Brodley and Friedl 1999; Verbaeten and Van Assche 2003) have shown that when a large amount of artificial noise is added to a data set (i.e.
> 10 %), then filtering outperforms a voting ensemble. We examine which of the 54 data sets have a high percentage of noise using instance hardness (Smith et al. 2014) to identify suspected noisy instances. Instance hardness approximates the likelihood that an instance will be misclassified by evaluating the classification of the instance by a set of learning algorithms $C$: $p(\hat{y}_i \mid x_i, C)$. The set of learning algorithms $C$ is composed of the learning algorithms shown in Table 1. Instances that have a probability greater than 0.9 of being misclassified are considered to be noisy. Table 8 shows the accuracies from a voting ensemble and from the considered learning algorithms using the ensemble filter for the subset of data sets with more than 10 % noisy instances.

Table 8 Comparison of a voting ensemble with the ensemble filter on the subset of data sets where more than 10 % of the constituent instances are noisy

  Data set  Ens    MLP    C4.5   IB5    LWL    NB     NNge   RF     Rid    RIP    Per
  breastc   73.99  73.43  75.17  74.13  73.31  73.43  74.13  73.19  74.71  74.59  10.1
  arrhyth   71.11  70.13  70.65  59.14  57.67  65.63  65.71  66.59  70.65  71.09  12.0
  contact   76.67  83.33  83.33  76.39  76.39  76.39  80.56  80.56  79.17  77.78  12.5
  lungCan   53.75  52.08  56.25  47.92  55.21  55.21  56.25  52.08  51.04  54.17  12.5
  yeast     61.08  59.43  60.13  59.32  40.70  58.15  59.40  61.08  59.74  60.01  13.7
  cm1_req   75.73  76.40  77.53  77.53  76.78  77.53  77.15  76.78  77.53  76.78  16.9
  titanic   78.72  78.66  78.68  78.59  77.90  77.77  78.68  78.68  78.28  78.68  16.9
  post-op   69.78  71.11  71.11  71.11  71.11  71.11  71.11  71.11  71.11  71.11  26.7
  pri-tum   48.08  47.79  41.20  45.82  34.42  48.57  44.44  45.23  39.82  40.41  32.2
  Acc       67.66  68.04  68.23  65.55  62.61  67.09  67.49  67.26  66.89  67.18
  p value          0.367  0.820  0.125  0.213  0.410  0.545  0.312  0.410  0.715
  >, =, <          6,0,3  4,0,5  6,0,3  6,0,3  5,0,4  4,0,5  5,1,3  5,0,4  4,0,5

The accuracy of the voting ensemble ("Ens") is in bold if it is greater than the accuracies from using the ensemble filter for the investigated learning algorithms. The accuracy from using the ensemble filter is in bold if it is higher than the accuracy from the voting ensemble. The column "Per" refers to the percentage of instances in the data set that are considered noisy.

Examining the noisier data sets shows that the gains from using the ensemble filter are more noticeable. However, only 9 of the 54 investigated data sets were identified as having more than 10 % noisy instances. We ran a Wilcoxon signed-ranks test, but with such a small sample size it is difficult to determine the statistical significance of using the ensemble filter over using a voting ensemble. Based on the small sample provided here, training a learning algorithm on a filtered data set is statistically equivalent to training a voting ensemble classifier. The computational complexity required to train a voting ensemble is less than that required to train an ensemble for filtering followed by training another learning algorithm on the filtered data set. A single learning algorithm trained on the filtered data set has the benefit that only one learning algorithm is queried for a novel instance. Future work will include discovering whether a smaller subset of learning algorithms for filtering approximates using the ensemble filter, in order to reduce the computational complexity.

Examining the noisier data sets shows that filtering has a more significant effect on classification accuracy; however, the amount of noise is not the only factor that needs to be considered. For example, 32.2 % of the instances in the primary-tumor data set are noisy, yet
The column "Per" refers to the percentage of instances in the data set that are considered noisy 120 M. R. Smith, T. Martinez Table 9 Comparison of a majority voting ensemble trained on unfiltered (Ens) and filtered data (FEns) Ens FEns 50 FEns 70 FEns 90 FEns Max All Accuracy 84.37 83.40 82.21 73.96 83.62 p value <0.001 <0.001 <0.001 <0.001 Greater-equal-less 42,3,9 44,2,8 48, 1,5 39,2, 13 Noisy Accuracy 67.66 67.00 67.47 60.52 67.93 p value 0.102 0.455 0.049 0.633 Greater-equal-less 7,0,2 5,0,4 6,0,3 5,0,4 <90% Accuracy 78.49 77.19 75.74 66.02 77.44 p values <0.001 <0.001 <0.001 <0.001 Greater-equal-less 31,0,6 30,1,6 34,0,3 28,0,9 <80% Accuracy 74.70 73.08 71.41 61.49 73.41 p values <0.001 <0.001 <0.001 0.002 Greater-equal-less 24,0,3 22,0,5 24,0,3 21,0,6 <70% Accuracy 64.65 61.99 60.41 51.04 62.25 p values 0.001 0.002 <0.001 0.009 Greater-equal-less 10,0, 1 10,0, 1 10,0, 1 10,0, 1 <60% Accuracy 58.44 55.81 53.49 42.56 56.12 p values 0.016 0.016 0.016 0.016 Greater-equal-less 6,0,0 6,0,0 6,0,0 6,0,0 <50% Accuracy 50.92 48.22 48.07 38.61 49.16 p values 0.250 0.250 0.250 0.250 Greater-equal-less 2,0,0 2,0,0 2,0,0 2,0,0 The value after "FEns" represents the percentage of learning algorithms that have to misclassify an instance for it to be filtered from the training set and "Max" uses the accuracy from the percentage that results in the greatest accuracy. Training with unfiltered data is significantly better than training with filtered data for a voting ensemble only one learning algorithm achieves a greater classification accuracy than the voting ensemble. On the other hand, the classification accuracy on the ar 1 and ozone data sets for all of the considered learning algorithms trained on filtered data is greater than using a voting ensemble despite only having 3.3 and 0.5 % noisy instances respectively. Thus, there are other unknown data set features affecting when filtering is appropriate. Future work also includes discovering and examining data set features that are indicative of when filtering should be used. We further investigate the robustness of the majority voting ensemble to noise by applying the ensemble filter to the training data for the voting ensemble. We find that a majority voting ensemble is significantly better without filtering. The summary results are shown in Table 9 and the full results for each data set can be found in Table 20 in "Appendix 2". Table 9 divides the data sets into subsets that have more than 10% noisy instances ("Noisy"), and Sprin ger The robustness of majority voting compared to filtering 121 those that have an original accuracy less than 90, 80, 70, 60, and 50% averaged across the investigated learning algorithms (