H. Dai, R. Srikant, and C. Zhang (Eds.): PAKDD 2004, LNAI 3056, pp. 207–211, 2004.
© Springer-Verlag Berlin Heidelberg 2004
The Application of Emerging Patterns for Improving the
Quality of Rare-Class Classification
Hamad Alhammady and Kotagiri Ramamohanarao
Department of Computer Science and Software Engineering
The University of Melbourne
Parkville, Victoria 3010, Australia
{hhammady, rao}@cs.mu.oz.au
Abstract. The classification of rare cases is a challenging problem in many real
life applications. The scarcity of the rare cases makes it difficult for traditional
classifiers to classify them correctly. In this paper, we propose a new approach
to use emerging patterns (EPs) [3] in rare-class classification (EPRC).
Traditional EP-based classifiers [2] fail to achieve accepted results when
dealing with rare cases. EPRC overcomes this problem by applying three
improving stages: generating new undiscovered EPs for the rare class, pruning
low-support EPs, and increasing the supports of the rare-class EPs. An
experimental evaluation carried out on a number of rare-class databases shows
that EPRC outperforms EP-based classifiers as well as other classification
methods such as PNrule [1], Metacost [6], and C4.5 [7].
1 Introduction
Classification of rare cases is an important problem in data mining. This problem is
identified as distinguishing rarely-occurring samples from other overwhelming
samples in a significantly imbalanced dataset [1]. In this paper, we investigate how to
employ emerging patterns (EPs) in rare-case classification. EPs are a new kind of
patterns that introduced recently [3]. EPs are defined as itemsets whose supports
increase significantly from one class to another. The power of EPs can be used to
build high-performance classifiers [2]. Usually, these classifiers achieve higher
accuracies than other state-of-the-art classifiers. However, simple EP-based classifiers
do not retain their high performance when dealing with datasets which have rare
cases. The reason for this failure is that the number of the rare-class EPs is very small,
and their supports are very low. Hence, they fail to distinguish rare cases from a vast
majority of other cases.
In this paper we propose a new approach to use the advantage of EPs to classify
rare cases in imbalanced datasets. The aim of our approach (called EPRC) is to
improve the discriminating power of EPs so that they achieve better results when
dealing with rare cases. This is achieved through three improving stages; 1)
generating new undiscovered EPs for the rare class, 2) pruning the low-support EPs,
and 3) increasing the support of rare-class EPs. These stages are detailed in section 3.
208 H. Alhammady and K. Ramamohanarao
In this paper we adopt the weighted accuracy [8], and the f-measure [9] as they are
well-known metrics for measuring the performance of rare-case classification. The fmeasure
depends on the recall and precision of the rare class.
2 Emerging Patterns
Let obj = {a1, a2, a3, ... an} is a data object following the schema {A1, A2, A3, ... An}. A1,
A2, A3.... An are called attributes, and a1, a2, a3, ... an are values related to these
attributes. We call each pair (attribute, value) an item.
Let I denote the set of all items in an encoding dataset D. Itemsets are subsets of I.
We say an instance Y contains an itemset X, if X ⊆ Y.
Definition 1. Given a dataset D, and an itemset X, the support of X in D is defined as
the percentage of the instances in D that contain X.
Definition 2. Given two different classes of datasets D1 and D2. The growth rate of
an itemset X from D1 to D2 is defined as the ratio between the support of X in D2 and
its support in D1.
Definition 3. Given a growth rate threshold p>1, an itemset X is said to be
a p-emerging pattern (p-EP or simply EP) from D1 to D2 if
pXGrowthRate DD ≥→ )(21 .
3 Improving Emerging Patterns
As described earlier, our approach aims at using the discriminating power of EPs in
rare-case classification. We introduce the idea of generating new EPs for the rare
class. Moreover, we adopt eliminating low-support EPs in both the major and rare
classes, and increasing the support of rare-class EPs.
The first step in our approach involves generating new rare-class EPs. Given a
training dataset and a set of the discovered EPs, the values that have the highest
growth rates from the major class to the rare class are found. The new EPs are
generated by replacing different attribute values (in the original rare-class EPs) with
the highest-growth-rate values. After that, the new EPs that already exist in the
original set of EPs are filtered out. Figure 1 shows an example of this process. The
left table shows four rare-class EPs. Suppose that the values that have the highest
growth rates for attributes A1 and A3 are V11 and V33 respectively. Using these two
values and EP e4, {V13, X, V34, V44, V55}, we can generate 2 more EPs (in the right
table). The first EP is {V11, X, V34, V44, V55} (by replacing V13 with V11). The second
EP is {V13, X, V33, V44, V55} (by replacing V34 with V33). However, the first new EP
already exists in the original set of EPs (e1). This EP is filtered out. We argue that
these new generated EPs have a strong power to discriminate rare-class instances
from major-class instances. There are two reasons for this argument. The first reason
The Application of Emerging Patterns for Improving the Quality 209
is that these new EPs are inherited from the original rare-class EPs which themselves
have a discriminating power to classify rare cases. The second reason is that they
contain the most discriminating attribute values (attribute values with the highest
growth rates) obtained from the training dataset.
Fig. 1. Example of generating new rare-class Eps
Based on the above explanation, we have algorithm 1 to generate new rare-class
EPs.
Algorithm 1 (Generating new rare-class EPs)
Input: the training dataset D, and discovered EPs E.
Output: a set of new rare-class EPs.
Method:
For each attribute i in D
Ai = value with the highest growth rate of attribute i.
For each rare-class EP e
For each attribute value k related to i
If k != Ai
Generate a new EP enew = e
Replace k by Ai in enew
If enew does not exist in E
The second step involves pruning the low-support EPs. This is performed for both
the major and rare classes. Given a pruning threshold, the average growth rate of the
attribute values in each EP is found. The EPs whose average growth rates are less
than the given threshold are eliminated. We argue that these eliminated EPs have the
least discriminating power as they contain many least-occurring values in the dataset.
The third step involves increasing the support of rare-class EPs by a given
percentage. The postulate behind this point is that this increase compensates the effect
of the large number of major-class EPs. That is, the overwhelming major-class EPs
make many rare-class instances classified as major-class.
4 Experimental Evaluation
In order to investigate the performance of our approach, we carry out a number of
experiments. We use three challenging rare-class databases with different
distributions of data between the major and rare classes. These datasets are the
insurance dataset [5], the disease dataset [4], and the sick dataset [4]. We compare our
210 H. Alhammady and K. Ramamohanarao
approach to other methods such as PNrule, Metacost, C4.5, and CEP. The comparison
we present is based on the weighted accuracy, traditional accuracy, major-class
accuracy, recall (rare-class accuracy), precision, and F-measure.
4.1 Tuning
Our approach uses three parameters. These parameters are the threshold of pruning
the major-class EPs, the threshold of pruning the rare-class EPs, and the percentage of
increasing the support of rare-class EPs. The parameters of our approach need to be
tuned to achieve the best results. To achieve this aim, we split the training set into 2
partitions. The first partition (70%) is used to train the classifier. The second partition
(30%) is used to tune the parameters.
4.2 Comparative Results
After tuning the insurance dataset, the parameters of our approach are fixed to deal
with this dataset. We run different methods on the test set of this dataset. These
methods include EPRC (our approach), PNrule [1], C4.5 [7], Metacost [6], and CEP
(EP-based classifier) [2]. The results of these methods are presented in table 1. Our
approach outperforms all other methods in terms of the weighted accuracy and fmeasure.
This performance is achieved by balancing both the recall and the precision.
Table 1. The results of the insurance dataset
4.3 The Effects on the EP-Based Classifier
Our basic aim behind the work presented in this paper is to improve the performance
of EP-based classifiers in rare-case classification. In this experiment we compare the
results obtained for our three datasets using CEP (EP-based classifier) and EPRC. As
stated in section 2, EPRC uses CEP as its basic EP-based classifier. The three
datasets were tuned using 30% of the training set. Table 2 shows how our approach
enhances the performance of the EP-based classifier. There are significant increases
in the weighted accuracy and f-measure from CEP to EPRC.
The Application of Emerging Patterns for Improving the Quality 211
Table 2. The effect on the EP-based classifier
Experiment Weighted accuracy F-measure
Insurance dataset (CEP) 50.85% 3.87%
Insurance dataset (EPRC) 63.89% 21.59%
Disease dataset (CEP) 49.94% Undefined
Disease dataset (EPRC) 65.07% 34.78%
Sick dataset (CEP) 78.89% 70%
Sick dataset (EPRC) 94.57% 79.71%
5 Conclusions and Future Research
In this paper, we propose a new EP-based approach to classify rare classes. Our
approach, called EPRC, introduces the idea of generating new rare-class EPs.
Moreover, it improves EPs by adopting pruning low-support EPs, and increasing the
support of rare-class EPs. We empirically demonstrate how improving EPs enhances
the performance of EP-based classifiers in rare-case classification problems.
Moreover, our approach helps EP-based classifiers outperform other classifiers in
such problems. The proposed approach opens many doors for further research. One
possibility is improving the performance of EP-based classifiers further by adding
further improving stages to increase the discriminating power of EPs.
References
1. M. V. Joshi, R. Agarwal, and V. Kumar. Mining Needles in a Haystack: Classifying Rare
Classes via Two-Phase Rule Induction. In Proceedings of ACM (SIGMOD '01), Santa
Barbara, California, USA.
2. J. Bailey, T. Manoukian, and K. Ramamohanarao. Classification Using Constrained
Emerging Patterns. In Proceedings of the Fourth International Conference on Web-Age
Information Management (WAIM '03), Chengdu, China.
3. G. Dong, and J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and
Differences. In Proceedings of 1999 International Conference on Knowledge Discovery
and Data Mining (KDD'99), San Diego, CA, USA.
4. C. Blake, E. Keogh, and C. J. Merz. UCI repository of machine learning databases.
Department of Information and Computer Science, University of California at Irvine,
CA, 1999. http://www.ics.uci.edu/~mlearn/MLRepository.html.
5. P. van der Putten, M. de Ruiter, and M. van Someren. The CoIL Challenge 2000 report.
http://www.liacs.nl/~putten/library/cc2000, June 2000.
6. P. Domingos. MetaCost: A General Method for Making Classifiers Cost-Sensitive. In
Proceedings of 1999 International Conference on Knowledge Discovery and Data
Mining (KDD'99), San Diego, CA, USA.
7. I. H. Witten, E. Frank. Data Mining: Practical Machine Learning Tools and Techniques
with Java Implementations. Morgan Kaufmann, San Mateo, CA., 1999.
8. J. Cheng, C. Hatzis, H. Hayashi, M. Krogel, S. Morishita, D. Page, and J. Sese. KDD
Cup 2001 report. ACM SIGKDD Explorations, January, 2002.
9. C. J. van Rijsbergan. Information Retrieval. Butterworths, London, 1979.