Learning with Imbalanced Domains
and Rare Event Detection
Luis Torgo, Stan Matwin, Nathalie Japkowicz, Nuno Moniz,
Paula Branco, Rita P. Ribeiro and Lubomir Popelinsky
Dalhousie University, Canada
INESC TEC, University of Porto, Portugal
American University, USA
University of Ottawa, Canada
Masaryk University, Czech Republic
September, 2020
(Torgo et. al.) LIDTA2020 September, 2020 1 / 127
Outline
1 Welcome
2 Rare Event Detection - Principles
3 Methods and Evaluation
4 Class-based Outlier Detection
5 Explanation of rare events
6 Open Challenges
Learning with Imbalanced Domains
Imbalanced Domain Learning
It is based on the following assumptions:
the representativeness of the cases in the training data is not uniform;
the underrepresented cases are the most relevant ones for the domain.
The focus is on the identification of these scarce/outlier cases.
However, the definition of these cases depends on application
domain knowledge.
Nature of Input Data
Key Aspects of Imbalanced Domain Learning
Each data instance has:
  - One attribute (univariate)
  - Multiple attributes (multivariate)
Relationship among data instances:
  - None
  - Sequential/Temporal
  - Spatial
  - Spatio-temporal
  - Graph
Dimensionality of data
Performance Metrics
Key Aspects of Imbalanced Domain Learning
Standard performance metrics (e.g. accuracy, error rate) assume that
all instances are equally relevant for the model performance.
These metrics give a good performance estimate to a model that
performs well on normal (frequent) cases and poorly on outlier (rare)
cases.
Credit Card Fraud Detection:
  - data set D with only 1% of fraudulent transactions;
  - model M predicts all transactions as non-fraudulent;
  - M has an estimated accuracy of 99%;
  - yet, all the fraudulent transactions were missed!
Standard performance metrics are not suitable!
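The fraud example can be reproduced in a few lines. This is a minimal sketch on hypothetical data (1,000 transactions, 10 of them fraudulent); the helper names `accuracy` and `recall` are illustrative and not taken from the slides.

```python
# Hypothetical data: 1,000 transactions, 1% fraudulent (label 1).
y_true = [1] * 10 + [0] * 990
# Model M: predicts every transaction as non-fraudulent (label 0).
y_pred = [0] * len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of cases predicted correctly, all cases weighted equally."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of the positive (rare) cases that the model detected."""
    predictions_on_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in predictions_on_positives) / len(predictions_on_positives)

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy = {acc:.2f}, recall on fraud = {rec:.2f}")
```

Accuracy comes out at 0.99 while recall on the fraudulent class is 0.00, which is the mismatch the slide points at.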
Predictive Modelling
Supervised Imbalanced Domain Learning
In a supervised learning task the goal is:
  - given an unknown function $Y = f(X_1, X_2, \cdots, X_p)$,
  - use a training set $D = \{\langle x_i, y_i \rangle\}_{i=1}^{n}$ with examples of this function
  - to obtain the best approximation to the function $f$, i.e. the model,
    $h(X_1, X_2, \cdots, X_p)$.
Depending on the type of target variable $Y$, we have:
  - classification task, if $Y$ is nominal
  - regression task, if $Y$ is numeric
Imbalanced Predictive Modelling
  - More importance is assigned to a subset of the domain of the target variable $Y$.
  - The cases that are more relevant are poorly represented in the training set.
How to specify these non-uniform importance values?
Predictive Modelling: Notion of Relevance
Supervised Imbalanced Domain Learning
Relevance function $\phi(Y)$ (Torgo and Ribeiro, 2007)
A relevance function $\phi(Y) : \mathcal{Y} \to [0, 1]$ is a function that expresses the
application-specific bias concerning the target variable domain $\mathcal{Y}$ by
mapping it into a [0, 1] scale of relevance, where 0 and 1 represent the
minimum and maximum relevance, respectively.
The notion of relevance is applicable to both classification and
regression problems.
It can be used to build the sets of rare and normal cases.
Torgo, L. and Ribeiro, R. (2007). “Utility-based Regression”. In: Proceedings of 11th
ECML/PKDD 2007. Springer.
Predictive Modelling: Notion of Relevance (cont.)
Supervised Imbalanced Domain Learning
With a user-defined threshold $t_R$ on the relevance values,
partition the training set $D$ into two complementary subsets:
  - $D_R = \{\langle x, y \rangle \in D : \phi(y) \geq t_R\}$
  - $D_N = D \setminus D_R$
In this case, we have that $|D_R| \ll |D_N|$.
How to define the relevance function $\phi(Y)$?
  - It can be provided by domain knowledge.
  - It can be estimated from the target variable data distribution, so that rare target
    classes/values are assigned more importance.
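The partition into rare and normal cases can be sketched as follows for a nominal target. The frequency-based relevance used here (rarest class gets 1, most frequent gets 0) is an assumed heuristic chosen for illustration, not a definition given in the slides; the names `frequency_relevance` and `partition` are hypothetical.

```python
from collections import Counter

def frequency_relevance(labels):
    """Assumed heuristic: relevance grows as class frequency shrinks,
    rescaled so the rarest class gets 1 and the most frequent gets 0."""
    counts = Counter(labels)
    lo, hi = min(counts.values()), max(counts.values())
    if lo == hi:  # perfectly balanced: no class stands out
        return {c: 0.0 for c in counts}
    return {c: (hi - n) / (hi - lo) for c, n in counts.items()}

def partition(D, phi, t_R):
    """Split D into the rare set (phi(y) >= t_R) and its complement."""
    D_R = [(x, y) for x, y in D if phi[y] >= t_R]
    D_N = [(x, y) for x, y in D if phi[y] < t_R]
    return D_R, D_N

# Hypothetical training set: 95 "normal" cases and 5 "fraud" cases.
D = [((i,), "normal") for i in range(95)] + [((i,), "fraud") for i in range(5)]
phi = frequency_relevance([y for _, y in D])
D_R, D_N = partition(D, phi, t_R=0.5)
print(phi, len(D_R), len(D_N))
```

With this data the rare set holds the 5 fraud cases and the normal set the remaining 95, so the rare set is much smaller than the normal one, as expected.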
Predictive Modelling: Imbalanced Classification
Supervised Imbalanced Domain Learning
In imbalanced classification, specifying the relevance of each class of the
target variable is feasible.
The most important cases are those labelled with infrequent classes of
the target variable $Y$, i.e. the cases for which $\phi(y) \geq t_R$.
Predictive Modelling: Imbalanced Regression
Supervised Imbalanced Domain Learning
In imbalanced regression, given the potentially infinite nature of the
target variable domain, specifying the relevance of all values is
virtually impossible, requiring an approximation.
Ribeiro (2011) proposed two methods for estimating $\phi(Y)$:
interpolation method
  - user provides a set of interpolating points
automatic method
  - no input required from the user;
  - it uses the target variable distribution;
  - it assumes that the most relevant cases are located at the extremes of
    the target variable distribution.
Ribeiro, Rita P. "Utility-based regression". PhD thesis, Dep. Computer Science, Faculty of
Sciences, University of Porto, 2011.
Predictive Modelling: Imbalanced Regression (cont.)
Supervised Imbalanced Domain Learning
The automatic method interpolates the boxplot statistics to obtain a
continuous relevance function that maps the domain of the target variable $Y$
to the relevance interval [0, 1], so that the extreme values of $Y$ are the most
important ones, i.e. the cases for which $\phi(y) \geq t_R$.
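A rough illustration of the idea, using NumPy. This is a simplified sketch and not the method as published: it assumes relevance 0 at the median and relevance 1 at the Tukey fences (Q1 - 1.5 IQR and Q3 + 1.5 IQR), and it uses plain linear interpolation. The control points, the interpolation scheme and the name `boxplot_relevance` are all assumptions made for this example.

```python
import numpy as np

def boxplot_relevance(y_train):
    """Return a relevance function built from boxplot statistics of y_train.

    Assumption: relevance is 0 at the median and 1 at (and beyond) the
    Tukey fences Q1 - 1.5*IQR and Q3 + 1.5*IQR, linear in between.
    Requires IQR > 0 so that the three knots are strictly increasing.
    """
    q1, med, q3 = np.percentile(y_train, [25, 50, 75])
    iqr = q3 - q1
    knots = [q1 - 1.5 * iqr, med, q3 + 1.5 * iqr]
    # np.interp clamps to the end values outside the knots, so anything
    # beyond a fence keeps relevance 1.
    return lambda y: np.interp(y, knots, [1.0, 0.0, 1.0])

# Hypothetical target values with one extreme observation.
y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)
phi = boxplot_relevance(y)
t_R = 0.8
rare = y[phi(y) >= t_R]
print(rare)
```

On this toy sample only the value 50 reaches the threshold of 0.8, so it is the single case that ends up in the rare set.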
Predictive Modelling Challenges
Supervised Imbalanced Domain Learning
It is of key importance that the obtained models are particularly
accurate at the sub-range of the domain of the target variable for
which training examples are rare.
To prevent the models from being biased towards the most frequent cases, it is
necessary to use:
performance metrics biased towards the performance on rare cases;
learning strategies that focus on these rare cases:
  - Data pre-processing
  - Special-purpose learning
  - Prediction post-processing
Branco, P., Torgo, L. and Ribeiro, R. P. (2016). "A survey of predictive modeling on imbalanced
domains". In: ACM Computing Surveys (CSUR) 49 (2), 1–35.
Data Pre-Processing Strategies
Supervised Imbalanced Domain Learning
Proposal
Change the data distribution to make standard algorithms focus on
rare and relevant cases.
Advantages
They allow the application of any learning algorithm
The obtained model will be biased to the goals of the domain
Models will be interpretable
Disadvantages
Difficulty of relating the modifications in the data distribution to
domain preferences
Mapping the given data distribution into an optimal new distribution
according to domain goals is not easy
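One simple way to change the data distribution is random oversampling of the rare cases. The slide does not name a specific technique, so this choice, the target of equal subset sizes and the function name `random_oversample` are assumptions made for illustration only.

```python
import random

def random_oversample(D_R, D_N, seed=0):
    """Replicate rare cases (sampling with replacement) until the two
    subsets have the same size. D_R and D_N are lists of (x, y) pairs."""
    rng = random.Random(seed)
    n_extra = max(0, len(D_N) - len(D_R))
    extra = [rng.choice(D_R) for _ in range(n_extra)]
    balanced = D_N + D_R + extra
    rng.shuffle(balanced)
    return balanced

# Hypothetical split: 95 normal cases (label 0) and 5 rare cases (label 1).
D_N = [((i,), 0) for i in range(95)]
D_R = [((i,), 1) for i in range(5)]
balanced = random_oversample(D_R, D_N)
n_rare = sum(y == 1 for _, y in balanced)
print(len(balanced), n_rare)
```

The resulting set can be passed to any standard learner. Note that the 1:1 target ratio is itself a guess, which is the disadvantage listed above: nothing in the domain preferences says which new distribution is the right one.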
Special-purpose Learning Strategies
Supervised Imbalanced Domain Learning
Proposal
Change the learning algorithms so they can learn from imbalanced data.
Advantages
The domain goals are incorporated directly into the models by setting
an appropriate preference criterion.
Models will be interpretable.
Disadvantages
It is restricted to that speciﬁc set of modiﬁed learning algorithms.
It requires a deep knowledge of algorithms.
If the preference criterion changes, models have to be relearned and,
possibly, the algorithm has to be re-adapted.
It is not easy to map the domain preferences to a suitable
preference criterion.
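As one possible reading of "setting an appropriate preference criterion", the loss that a learner minimises can be weighted so that errors on rare cases cost more. The sketch below only evaluates such a loss; the weighting scheme, the weight of 99 (the ratio 990 / 10 in the toy data) and the name `weighted_log_loss` are assumptions, not something the slides prescribe.

```python
import math

def weighted_log_loss(y_true, p_pred, w_pos, w_neg=1.0):
    """Binary cross-entropy where each rare (positive) case counts
    w_pos times and each frequent (negative) case counts w_neg times."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, p_pred):
        w = w_pos if y == 1 else w_neg
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum

# Hypothetical data: 10 rare cases among 1,000.
y_true = [1] * 10 + [0] * 990
p_pred = [0.01] * 1000  # a model that almost always says "not rare"
plain = weighted_log_loss(y_true, p_pred, w_pos=1.0)
weighted = weighted_log_loss(y_true, p_pred, w_pos=99.0)
print(plain, weighted)
```

Under the unweighted criterion this model looks good (low loss), while the weighted criterion penalises it heavily. A learner optimising the weighted version would be pushed towards the rare cases, at the price of having to be retrained whenever the weights change.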
Prediction Post-processing Strategies
Proposal
Use the original data set and a standard learning algorithm, only
manipulating the predictions of the models according to the domain
preferences and the imbalance of the data.
Advantages
It is not necessary to be aware of the domain preferences at learning
time.
The same model can be applied to different deployment scenarios
without having to be relearned.
Any standard learning algorithm can be used.
Disadvantages
The models do not reflect the domain preferences.
Model interpretability is jeopardized as the models were obtained by
optimizing a function that does not follow the domain preference bias.
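A common form of prediction post-processing is moving the decision threshold applied to a model's scores. The sketch below assumes a binary task and hypothetical predicted probabilities; the thresholds 0.5 and 0.25 and the function name `classify` are illustrative choices.

```python
def classify(probabilities, threshold):
    """Label a case as rare (1) when its predicted probability of
    being rare reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical scores from an already trained model (not retrained below).
p_rare = [0.05, 0.10, 0.30, 0.45, 0.70]
default = classify(p_rare, threshold=0.5)
lowered = classify(p_rare, threshold=0.25)
print(default, lowered)
```

The same scores yield one rare prediction at the default threshold and three at the lowered one, so different deployment scenarios can be served by changing only the threshold. The model itself is unchanged, which is why it does not reflect the domain preferences.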