Learning with Imbalanced Domains and Rare Event Detection
Luis Torgo, Stan Matwin, Nathalie Japkowicz, Nuno Moniz, Paula Branco, Rita P. Ribeiro and Lubomir Popelinsky
Dalhousie University, Canada; INESC TEC, University of Porto, Portugal; American University, USA; University of Ottawa, Canada; Masaryk University, Czech Republic
LIDTA 2020, September 2020

Outline
1 Welcome
2 Rare Event Detection - Principles
3 Methods and Evaluation
4 Class-based Outlier Detection
5 Explanation of Rare Events
6 Open Challenges

Rare Event Detection: Principles

Motivation
Most machine learning tasks focus on creating a model of the "normal" patterns in the data, extracting knowledge from what is common (e.g. frequent patterns). Still, rare patterns can also give us important insights about the data. These patterns represent rare events, i.e. outliers. Depending on the goal, those insights can be even more interesting than the "normal" patterns.

What is an Outlier?
"An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism" (Hawkins, 1980)

What is an Outlier? (cont.)
Outliers represent patterns in data that do not conform to a well-defined notion of normal. Initially, outliers were considered errors/noise and their identification served data cleaning purposes. However, they can represent a truthful deviation in the data. For some applications, they carry critical information, which can trigger preventive or corrective actions.

Learning with Imbalanced Domains
Imbalanced domain learning is based on the following assumptions:
- the representativeness of the cases in the training data is not uniform;
- the underrepresented cases are the most relevant ones for the domain.
The focus is on the identification of these scarce/outlier cases. However, the definition of these cases depends on application domain knowledge.

Some Applications with Imbalanced Domains
- Financial applications: credit card fraud, insurance claim fraud, stock market anomalies
- (Cyber)security applications: host-based and network intrusion detection
- Medical applications: medical sensor or imaging data for rare disease diagnostics
- Text and social media applications: anomalous activity in social networks, fake news detection
- Earth science applications: sea surface temperature anomalies, environmental disasters
- Fault detection applications: quality control, systems diagnosis, structure defect detection

Challenges of Imbalanced Domain Learning
- Defining every possible "normal" behaviour is hard.
- The boundary between normal and outlying behaviour is often not precise.
- There is no general outlier definition; it depends on the application domain.
- It is difficult to distinguish real, meaningful outliers from simple random noise in the data.
- The outlier behaviour may evolve with time.
- Malicious actions adapt themselves to appear normal.
- There is an inherent lack of known labelled outliers for training/validation of models.

Key Aspects of Imbalanced Domain Learning
Nature of Input Data; Type of Outliers; Intended Output; Learning Task; Performance Metrics
Nature of Input Data (Key Aspects of Imbalanced Domain Learning)
Each data instance has one attribute (univariate) or multiple attributes (multivariate).
Relationship among data instances: none, sequential/temporal, spatial, spatio-temporal, or graph.
The dimensionality of the data also matters.

Types of Outliers
Point Outlier: an instance that, individually or in a small group, is very different from the rest of the instances.
Contextual Outlier: an instance that, when considered within a context, is very different from the rest of the instances.
Collective Outlier: an instance that, even though it may not be an outlier in isolation, is an outlier when inspected in conjunction with related instances and with regard to the entire data set.

Intended Output
Value: a label / numeric value identifying an instance as normal or outlier.
Score: the probability of being an outlier. This allows the output to be ranked, but requires the specification of a threshold.

Learning Task
Unsupervised learning: the data set has no information on the behaviour of each instance. It assumes that instances with normal behaviour are far more frequent. This is the most common case in real-life applications.
Semi-supervised learning: the data set has a few instances labelled with normal or outlier behaviour. Some real-life applications, such as fault detection, provide such data.
Supervised learning: the data set has labelled instances of both normal and outlier behaviour. Such data is hard to obtain in real-life applications.

Performance Metrics
Standard performance metrics (e.g. accuracy, error rate) assume that all instances are equally relevant for the model's performance. These metrics give a good performance estimate to a model that performs well on normal (frequent) cases and badly on outlier (rare) cases.
Credit card fraud detection: a data set D with only 1% of fraudulent transactions; a model M predicts all transactions as non-fraudulent; M has an estimated accuracy of 99%; yet, all the fraudulent transactions were missed! Standard performance metrics are not suitable (a short code sketch below makes this concrete).

Taxonomy of Approaches (Imbalanced Domain Learning)
- Unsupervised: statistical-based, proximity-based, clustering-based, hybrid approaches
- Semi-supervised: one-class classification, other approaches, hybrid approaches
- Supervised: data pre-processing, special-purpose methods, prediction post-processing
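To make the credit-card fraud accuracy example above concrete, here is a minimal Python sketch (using scikit-learn and made-up class proportions) showing that a trivial all-normal classifier reaches roughly 99% accuracy while recalling none of the frauds.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, precision_score

# Hypothetical labels: ~1% fraud (positive class = 1), ~99% normal (0).
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)

# A "model" that always predicts the majority (non-fraud) class.
y_pred = np.zeros_like(y_true)

print("accuracy :", accuracy_score(y_true, y_pred))                    # ~0.99, looks great
print("recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0, every fraud missed
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0, no fraud ever flagged
```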
Statistical-based Approaches (Unsupervised Imbalanced Domain Learning)
Proposal: all the points that satisfy a statistical discordance test for some statistical model are declared outliers.
Advantages:
- If the assumptions of the statistical model hold, these techniques provide an acceptable solution for outlier detection.
- The outlier score is associated with a confidence interval.
Disadvantages:
- The data does not always follow a statistical model.
- Choosing the best hypothesis test statistic is not straightforward.
- Capturing interactions between attributes is not always possible.
- Estimating the parameters of some statistical models is hard.

Proximity-based Approaches (Unsupervised Imbalanced Domain Learning)
Proposal: normal instances occur in dense neighbourhoods, while outliers occur far from their closest neighbours.
Advantages:
- Purely data-driven technique.
- Does not make any assumptions regarding the underlying distribution of the data.
Disadvantages:
- True outliers and noisy regions of low density may be hard to distinguish; these methods need to combine global and local analysis.
- In high-dimensional data, the contrast in the distances is lost.
- Computationally expensive in the test phase.

Clustering-based Approaches (Unsupervised Imbalanced Domain Learning)
Proposal: normal instances belong to large and dense clusters, while outlier instances are those that do not belong to any of the clusters, are far from their closest cluster, or form very small or low-density clusters.
Advantages:
- Easily adaptable to an on-line/incremental mode.
- The test phase is fast.
Disadvantages:
- Computationally expensive in the training phase.
- If normal points do not form any clusters, this technique may fail.
- In high-dimensional spaces, clustering algorithms may not give any meaningful clusters.
- Some techniques detect outliers only as a byproduct, i.e. they are not optimized to find outliers; their main aim is to find clusters.

One-Class Classification Approach (Semi-supervised Imbalanced Domain Learning)
Proposal: build a prediction model of the normal behaviour and classify any deviation from this behaviour as an outlier.
Advantages:
- Models are interpretable.
- Normal behaviour can be accurately learned.
- Can detect new outliers that may not appear close to any outlier point in the training set.
Disadvantages:
- Requires previously labelled instances of normal behaviour.
- Possibly high false alarm rate: previously unseen normal data may be identified as outliers.

Predictive Modelling (Supervised Imbalanced Domain Learning)
In a supervised learning task the goal is: given an unknown function $Y = f(X_1, X_2, \cdots, X_p)$, use a training set $D = \{\langle x_i, y_i \rangle\}_{i=1}^{n}$ with examples of this function to obtain the best approximation to the function $f$, i.e. the model $h(X_1, X_2, \cdots, X_p)$.
Depending on the type of target variable Y, we have a classification task if Y is nominal and a regression task if Y is numeric.

Imbalanced Predictive Modelling
More importance is assigned to a subset of the domain of the target variable Y, and the cases that are more relevant are poorly represented in the training set. How can these non-uniform importance values be specified?

Predictive Modelling: Notion of Relevance
Relevance function $\phi(Y)$ (Torgo and Ribeiro, 2007): a relevance function $\phi : \mathcal{Y} \rightarrow [0, 1]$ expresses the application-specific bias concerning the domain of the target variable Y by mapping it into a [0, 1] scale of relevance, where 0 and 1 represent the minimum and maximum relevance, respectively.
The notion of relevance is applicable to both classification and regression problems, and it can be used to build the sets of rare and normal cases.
Torgo, L. and Ribeiro, R. (2007). "Utility-based Regression". In: Proceedings of the 11th ECML/PKDD 2007. Springer.

Predictive Modelling: Notion of Relevance (cont.)
With a user-defined threshold $t_R$ on the relevance values, partition the training set D into two complementary subsets:
$D_R = \{ \langle x, y \rangle \in D : \phi(y) \geq t_R \}$
$D_N = D \setminus D_R$
In this case, we have $|D_R| \ll |D_N|$.
How to define the relevance function $\phi(Y)$? It can be provided by domain knowledge, or estimated from the target variable's data distribution so that rare target classes/values are assigned more importance.

Predictive Modelling: Imbalanced Classification
In imbalanced classification, specifying the relevance of the target variable for each class is feasible. The most important cases are those labelled with infrequent classes of the target variable Y, i.e. the cases for which $\phi(y) \geq t_R$.

Predictive Modelling: Imbalanced Regression
In imbalanced regression, given the potentially infinite nature of the target variable's domain, specifying the relevance of all values is virtually impossible, requiring an approximation. Ribeiro (2011) proposed two methods for estimating $\phi(Y)$:
- interpolation method: the user provides a set of interpolating points;
- automatic method: no input is required from the user; it uses the target variable distribution and assumes that the most relevant cases are located at the extremes of that distribution.
Ribeiro, Rita P. "Utility-based Regression". PhD thesis, Dep. Computer Science, Faculty of Sciences, University of Porto, 2011.

Predictive Modelling: Imbalanced Regression (cont.)
The automatic method interpolates the boxplot statistics to obtain a continuous relevance function that maps the domain of the target variable Y into the relevance interval [0, 1], so that the extreme values of Y are the most important ones, i.e. the cases for which $\phi(y) \geq t_R$ (a rough code sketch of this idea follows below).

Predictive Modelling Challenges
It is of key importance that the obtained models are particularly accurate in the sub-range of the domain of the target variable for which training examples are rare.
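As an illustration of the automatic, boxplot-based relevance function described above, here is a minimal Python sketch. It is my own simplification (linear interpolation between the median and the whiskers), not Ribeiro's exact piecewise-cubic interpolation of the boxplot statistics.

```python
import numpy as np

def boxplot_relevance(y):
    """Rough relevance function phi(y): 0 at the median, 1 at/beyond the
    boxplot whiskers (Q1 - 1.5*IQR and Q3 + 1.5*IQR), linear in between.
    A simplification of the automatic method of Ribeiro (2011)."""
    y = np.asarray(y, dtype=float)
    q1, med, q3 = np.percentile(y, [25, 50, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    phi = np.where(
        y >= med,
        np.clip((y - med) / max(hi - med, 1e-12), 0, 1),  # upper tail
        np.clip((med - y) / max(med - lo, 1e-12), 0, 1),  # lower tail
    )
    return phi

# Example: partition a sample into rare (D_R) and normal (D_N) cases.
y = np.random.default_rng(1).lognormal(size=1000)
phi = boxplot_relevance(y)
t_R = 0.9
print("rare cases:", (phi >= t_R).sum(), "of", len(y))
```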
To prevent the models from being biased towards the most frequent cases, it is necessary to use:
- performance metrics biased towards the performance on rare cases;
- learning strategies that focus on these rare cases: data pre-processing, special-purpose learning, and prediction post-processing.
Branco P., Torgo L., Ribeiro R.P. (2016). "A survey of predictive modeling on imbalanced domains". ACM Computing Surveys (CSUR) 49(2), 1-35.

Data Pre-Processing Strategies (Supervised Imbalanced Domain Learning)
Proposal: change the data distribution to make standard algorithms focus on the rare and relevant cases (a toy oversampling sketch is given further below).
Advantages:
- They allow the application of any learning algorithm.
- The obtained model will be biased towards the goals of the domain.
- Models will be interpretable.
Disadvantages:
- It is difficult to relate the modifications in the data distribution to the domain preferences.
- Mapping the given data distribution into an optimal new distribution according to the domain goals is not easy.

Special-purpose Learning Strategies (Supervised Imbalanced Domain Learning)
Proposal: change the learning algorithms so that they can learn from imbalanced data.
Advantages:
- The domain goals are incorporated directly into the models by setting an appropriate preference criterion.
- Models will be interpretable.
Disadvantages:
- It is restricted to that specific set of modified learning algorithms.
- It requires deep knowledge of the algorithms.
- If the preference criterion changes, models have to be relearned and, possibly, the algorithm has to be re-adapted.
- It is not easy to map the domain preferences to a suitable preference criterion.

Prediction Post-processing Strategies (Supervised Imbalanced Domain Learning)
Proposal: use the original data set and a standard learning algorithm, manipulating only the predictions of the models according to the domain preferences and the imbalance of the data.
Advantages:
- It is not necessary to be aware of the domain preferences at learning time; the same model can be applied to different deployment scenarios without having to be relearned.
- Any standard learning algorithm can be used.
Disadvantages:
- The models do not reflect the domain preferences.
- Model interpretability is jeopardized, as the models were obtained by optimizing a function that does not follow the domain preference bias.

Up next: suitable performance metrics and specific learning methods.

Methods and Evaluation: Performance Evaluation

Why is performance evaluation a challenge?
Standard metrics (e.g. error rate or mean squared error) describe the average predictive performance of the models. When the user is focused on a small subset of rare values, averaging is not a good idea: these metrics will be mostly influenced by the performance of the models on cases that are irrelevant to the user.

An Example from Classification: Fraud Detection
Two classes, Fraud and Normal, with fraudulent cases roughly 1% of the training sample. A classifier that always predicts Normal would achieve, on average, 99% accuracy, yet this classifier is completely useless! Because frauds are very rare, missing them or correctly predicting them has only a minor impact on the accuracy (or error rate) metric.
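As a minimal illustration of the data pre-processing strategies discussed above, the sketch below randomly oversamples the minority class of a hypothetical binary problem until the classes are balanced. It uses plain NumPy; packages such as imbalanced-learn provide more sophisticated variants (e.g. SMOTE).

```python
import numpy as np

def random_oversample(X, y, target_class=1, random_state=0):
    """Duplicate randomly chosen minority-class rows until both classes
    have the same number of examples. A deliberately simple stand-in for
    the resampling strategies surveyed by Branco et al. (2016)."""
    rng = np.random.default_rng(random_state)
    minority = np.flatnonzero(y == target_class)
    majority = np.flatnonzero(y != target_class)
    n_extra = len(majority) - len(minority)
    extra = rng.choice(minority, size=n_extra, replace=True)
    keep = np.concatenate([majority, minority, extra])
    return X[keep], y[keep]

# Hypothetical imbalanced data: 1000 normal cases, 20 rare cases.
X = np.vstack([np.random.randn(1000, 2), np.random.randn(20, 2) + 4])
y = np.array([0] * 1000 + [1] * 20)
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y), "->", np.bincount(y_bal))  # [1000 20] -> [1000 1000]
```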
An Example from Regression: Forecasting Stock Market Returns
Very high or low returns (% variations of prices) are interesting; near-zero returns are very common but uninteresting for traders unable to cover transaction costs. Examples:
- Forecasting a future return of 3% when -5% happens is a very bad error.
- Forecasting a return of 3% when 11% happens has the same error amplitude, but it is not a serious error.
- Forecasting 0.2% for a true value of 0.4% is reasonably accurate but irrelevant.
- Forecasting -7.5% for a true value of -8% is a good and useful prediction.
Because near-zero returns are very common, a model that always forecasts 0 is hard to beat in terms of mean squared error. But this model is useless!

Metrics and the Available Information
Different applications may involve different types of information on the user preferences, which may have an impact on the metrics you can and/or should calculate. Independently of this, there are two classes of metrics: scalar and graphical.

Evaluation with Full Utility Information: Utility Matrices
A utility matrix is a table where each entry specifies the cost (negative benefit) or benefit of each type of prediction:

            Pred. c1   Pred. c2   Pred. c3
Obs. c1     B(1,1)     C(1,2)     C(1,3)
Obs. c2     C(2,1)     B(2,2)     C(2,3)
Obs. c3     C(3,1)     C(3,2)     B(3,3)

Models are then evaluated by the total utility of their predictions, i.e. the sum of the benefits minus the costs. A similar setting exists for regression, using utility surfaces (Ribeiro, 2011).
R. Ribeiro (2011). "Utility-based Regression". PhD on Computer Science, Univ. Porto.

The Precision/Recall Framework (Classification)
Classification problems with two classes, where one of the classes is much less frequent and also the most relevant.

            Pred. Pos               Pred. Neg
Obs. Pos    True Positives (TP)     False Negatives (FN)
Obs. Neg    False Positives (FP)    True Negatives (TN)

Precision - proportion of the signals (events) of the model that are correct:
$Prec = \frac{TP}{TP + FP}$
Recall - proportion of the real events that are captured by the model:
$Rec = \frac{TP}{TP + FN}$

The F-Measure: Combining Precision and Recall into a Single Measure
It is useful to have a single measure, e.g. for optimization within a search procedure. Maximizing one of precision or recall is easy at the cost of the other (it is easy to have 100% recall: always predict "P"). What is difficult is to have both with high values.
The F-measure is a statistic based on the values of precision and recall that allows establishing a trade-off between the two using a user-defined parameter $\beta$:
$F_\beta = \frac{(\beta^2 + 1) \cdot Prec \cdot Rec}{\beta^2 \cdot Prec + Rec}$
where $\beta$ controls the relative importance of Prec and Rec. If $\beta = 1$, F is the harmonic mean of Prec and Rec; when $\beta \rightarrow 0$ the weight of Rec decreases, and when $\beta \rightarrow \infty$ the weight of Prec decreases.
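The small Python helper below turns confusion-matrix counts into precision, recall and $F_\beta$, matching the formulas above (the counts in the example are made up).

```python
def prf_from_counts(tp, fp, fn, beta=1.0):
    """Precision, recall and F-beta from confusion-matrix counts,
    following the definitions above."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    if prec == 0.0 and rec == 0.0:
        return prec, rec, 0.0
    f_beta = (beta**2 + 1) * prec * rec / (beta**2 * prec + rec)
    return prec, rec, f_beta

# Hypothetical counts for a rare positive class.
print(prf_from_counts(tp=30, fp=20, fn=70))          # F1 balances precision and recall
print(prf_from_counts(tp=30, fp=20, fn=70, beta=2))  # beta=2 weighs recall more heavily
```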
The G-Mean and Adjusted G-Mean
$G_m = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}} = \sqrt{sensitivity \times specificity}$
$AG_m = \begin{cases} \frac{G_m + specificity \times N_n}{1 + N_n} & \text{if } sensitivity > 0 \\ 0 & \text{if } sensitivity = 0 \end{cases}$
where $N_n$ is the proportion of majority class examples in the data set.
M. Kubat and S. Matwin. "Addressing the curse of imbalanced training sets: one-sided selection." In Proc. of the 14th Int. Conf. on Machine Learning, 1997, Nashville, USA, pp. 179-186.
R. Batuwita and V. Palade. "A new performance measure for class imbalance learning. Application to bioinformatics problems." In ICMLA'09, pp. 545-550. IEEE, 2009.

Metrics for Multiclass Imbalance Problems
$\phi(i)$ is the relevance of class $i$; there are different ways to obtain $\phi()$ depending on the available domain information (Branco et al., 2017). A small code sketch of these weighted metrics is given at the end of this evaluation part.
$Rec_\phi = \frac{1}{\sum_{i=1}^{C} \phi(i)} \sum_{i=1}^{C} \phi(i) \cdot recall_i$
$Prec_\phi = \frac{1}{\sum_{i=1}^{C} \phi(i)} \sum_{i=1}^{C} \phi(i) \cdot precision_i$
$F^\phi_\beta = \frac{(1+\beta^2) \cdot Prec_\phi \cdot Rec_\phi}{\beta^2 \cdot Prec_\phi + Rec_\phi}$
$AvF^\phi_\beta = \frac{1}{\sum_{i=1}^{C} \phi(i)} \sum_{i=1}^{C} \frac{\phi(i) \cdot (1+\beta^2) \cdot precision_i \cdot recall_i}{\beta^2 \cdot precision_i + recall_i}$
$CBA^\phi = \sum_{i=1}^{C} \frac{\phi(i) \cdot mat_{i,i}}{\max\left(\sum_{j=1}^{C} mat_{i,j},\ \sum_{j=1}^{C} mat_{j,i}\right)}$
P. Branco, L. Torgo, and R. Ribeiro. "Relevance-based evaluation metrics for multi-class imbalanced domains." PAKDD, Springer, Cham, pp. 698-710 (2017).

The Precision/Recall Framework: Regression
For forecasting rare extreme values, the concepts of precision and recall were also adapted to regression (Torgo and Ribeiro, 2009; Branco, 2014):
$prec_\phi = \frac{\sum_{\phi(\hat{y}_i) > t_R} (1 + U(\hat{y}_i, y_i))}{\sum_{\phi(\hat{y}_i) > t_R} (1 + \phi(\hat{y}_i))}$
$rec_\phi = \frac{\sum_{\phi(y_i) > t_R} (1 + U(\hat{y}_i, y_i))}{\sum_{\phi(y_i) > t_R} (1 + \phi(y_i))}$
L. Torgo and R. P. Ribeiro (2009). "Precision and Recall for Regression". In: Discovery Science 2009. Springer.
P. Branco (2014). "Re-sampling Approaches for Regression Tasks under Imbalanced Domains". MSc on Computer Science, Univ. Porto.

Summary of Scalar Metrics for Imbalanced Domains
Adapted from: P. Branco, L. Torgo and R. Ribeiro. "A Survey of Predictive Modeling on Imbalanced Domains". ACM Comput. Surv. 49(2), 1-31 (2016); and P. Branco (2018). "Utility-based Predictive Analytics". PhD on Computer Science, Univ. Porto.

ROC Curve and Precision-Recall Curve (Classification)
[Figure: ROC curves (true positive rate vs. false positive rate) and precision-recall curves for three models A, B and C, with AUC-ROC(A) = 0.87, AUC-ROC(B) = 0.83, AUC-ROC(C) = 0.64 and AUC-PR(A) = 0.97, AUC-PR(B) = 0.80, AUC-PR(C) = 0.76.]
Taken from: P. Branco (2018). "Utility-based Predictive Analytics". PhD on Computer Science, Univ. Porto.

ROC Curve and Precision-Recall Curve (Regression)
Taken from: R. Ribeiro (2011). "Utility-based Regression". PhD on Computer Science, Univ. Porto.

Summary of Graphical Metrics for Imbalanced Domains
Adapted from: P. Branco, L. Torgo and R. Ribeiro. "A Survey of Predictive Modeling on Imbalanced Domains". ACM Comput. Surv. 49(2), 1-31 (2016); and P. Branco (2018). "Utility-based Predictive Analytics". PhD on Computer Science, Univ. Porto.

Unsupervised Methods
Statistical-based, proximity-based, clustering-based, and hybrid approaches.
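Before moving on to the unsupervised methods, here is a small Python sketch of the relevance-weighted multiclass precision, recall and F reconstructed above; it reflects my reading of the (extraction-garbled) formulas, and the class relevances and confusion matrix are made up.

```python
import numpy as np

def relevance_weighted_prf(conf_mat, phi, beta=1.0):
    """Relevance-weighted multiclass recall, precision and F-beta,
    following the Rec_phi / Prec_phi / F_phi formulas above.
    conf_mat[i, j] = number of class-i cases predicted as class j."""
    conf_mat = np.asarray(conf_mat, dtype=float)
    phi = np.asarray(phi, dtype=float)
    recall = np.diag(conf_mat) / conf_mat.sum(axis=1).clip(min=1)
    precision = np.diag(conf_mat) / conf_mat.sum(axis=0).clip(min=1)
    rec_phi = (phi * recall).sum() / phi.sum()
    prec_phi = (phi * precision).sum() / phi.sum()
    f_phi = (1 + beta**2) * prec_phi * rec_phi / (beta**2 * prec_phi + rec_phi)
    return prec_phi, rec_phi, f_phi

# Hypothetical 3-class problem where class 2 is rare and most relevant.
conf_mat = [[90, 5, 5],
            [10, 80, 10],
            [ 2,  3,  5]]
phi = [0.1, 0.3, 1.0]   # assumed relevance of each class
print(relevance_weighted_prf(conf_mat, phi))
```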
Statistical-based Methods: Parametric
Assumption: normal instances occur in high-probability regions of a stochastic model, while anomalies occur in the low-probability regions of that model. Variants: Gaussian model based, regression model based, mixture of parametric distributions based.
Grubbs' test: $z = \frac{|x - \mu|}{\sigma}$
Box plot rule.
Regression model based: fit a regression model and use the residuals to determine the anomaly score.

Statistical-based Methods: Non-parametric
Assumption: the model structure is not determined a priori but is instead determined from the given data. These methods make few assumptions regarding the data when compared to parametric techniques. Variants: histogram based and kernel function based.
Histogram based: build a histogram; for a new test instance, check whether it falls in a bin of the histogram. If it does, it is normal; otherwise, it is an anomaly. Variant: assign an anomaly score based on the bin frequency.
Kernel function based: non-parametric techniques for probability density estimation, e.g. Parzen window estimation (Parzen, 1962). They use kernel functions to approximate the actual density. Similar to parametric methods; the difference lies in the density estimation technique used.
Parzen, E. (1962). On the estimation of a probability density function and mode. Annals of Mathematical Statistics 33, 1065-1076.

Proximity-based Methods: Distance-based, Nearest Neighbours (NN) Approach
The anomaly score of a case is its distance to its k-th nearest neighbour; a threshold on the anomaly score determines whether a case is anomalous or not (sketched in code further below). Examples of applications: land mine detection from satellite ground images, detecting anomalies in large synchronous turbine-generators.
Ramaswamy, S., Rastogi, R. and Shim, K. "Efficient algorithms for mining outliers from large data sets." Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000.

An alternative way of computing the anomaly score: count the number of nearest neighbours that are not more than a distance d apart from the case. This can be viewed as a way to obtain an estimate of the global density around each case.
Knorr, E. M., and Ng, R. T. "Algorithms for mining distance-based outliers in large datasets." Proceedings of the International Conference on Very Large Data Bases, 1998.

Proximity-based Methods: LOF-based
Local Outlier Factor (LOF) (Breunig et al., 2000): each point has a score that captures the relative degree of isolation of the point from its surrounding neighbourhood.
Breunig, M. M., Kriegel, H. P., Ng, R., and Sander, J. (2000). "LOF: Identifying density-based local outliers." In Proceedings of the ACM SIGMOD 2000 International Conference on Management of Data. ACM Press.
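A minimal sketch of the distance-based score above (distance to the k-th nearest neighbour, in the spirit of Ramaswamy et al., 2000), using scikit-learn's NearestNeighbors on made-up data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)),          # dense "normal" cloud
               rng.normal(loc=6.0, size=(3, 2))])  # a few isolated points

k = 5
# k + 1 neighbours because each training point is its own nearest neighbour.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)
score = dist[:, -1]                 # distance to the k-th true neighbour

threshold = np.quantile(score, 0.98)
print("flagged as outliers:", np.flatnonzero(score > threshold))
```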
Proximity-based Methods: LOF Approach
MinPts is the number of nearest neighbours used to define the local neighbourhood. For each point x, compute the distance to its k-th nearest neighbour ($k\text{-}dist$), and then:
Reachability distance: $reach\text{-}dist_k(x, p) = \max\{k\text{-}dist(p),\ d(x, p)\}$
Local reachability density: $lrd_{MinPts}(x) = \frac{MinPts}{\sum_{p} reach\text{-}dist_{MinPts}(x, p)}$
LOF score: $LOF_{MinPts}(x) = \frac{1}{MinPts} \sum_{p} \frac{lrd_{MinPts}(p)}{lrd_{MinPts}(x)}$

Proximity-based Methods: KNN-based vs. LOF-based
[Comparison figure.]

Proximity-based Methods: Isolation Forest
Anomalies are few and different. An Isolation Forest is a collection of isolation trees (iTrees); each iTree isolates every case from the remaining cases for a given sample. Anomalies should be more susceptible to isolation, i.e. they exhibit a shorter average path length:
$Score(x) = \frac{1}{t} \sum_{i=1}^{t} l_i(x)$
where $l_i(x)$ is the path length of observation x in tree i.
Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (2008). "Isolation Forest". 2008 Eighth IEEE International Conference on Data Mining: 413-422.

Clustering-based Methods: DBSCAN
Idea: find the areas that satisfy a simple minimum density level and which are separated by areas of lower density.
Parameters: MinPts, a threshold for the number of neighbours, and ε, a radius. Objects with more than MinPts neighbours within a radius of ε (including the query point) are considered core points.
Steps: compute the neighbours of each point and identify core points; join neighbouring core points into clusters; for each non-core point, add it to a neighbouring core point if possible, otherwise mark it as noise.
Ester, Martin, et al. "A density-based algorithm for discovering clusters in large spatial databases with noise." KDD, Vol. 96, No. 34, 1996.
Schubert, Erich, et al. "DBSCAN revisited, revisited: why and how you should (still) use DBSCAN." ACM TODS 42.3 (2017): 1-21.

Clustering-based Methods: CBLOF
Cluster-based Local Outlier Factor (He et al., 2003). Two parameters: α, the ratio of the data set that is expected to be normal, and β, the minimum ratio of the size of a large cluster to the small clusters.
Idea: the anomaly score of a case is the distance to the nearest large cluster multiplied by the size of the cluster the case belongs to.
He, Zengyou, Xiaofei Xu, and Shengchun Deng. "Discovering cluster-based local outliers."

Hybrid Approaches: Feature Bagging for Outlier Detection
Feature Bagging for Outlier Detection runs the LOF method on multiple projections of the data and combines the results for improved detection quality in high dimensions. It was the first ensemble learning approach to outlier detection.
Lazarevic, A. and Kumar, V. Feature bagging for outlier detection. In KDD '05, 2005.
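A minimal sketch of the feature-bagging idea above: run LOF on several random feature subsets and average the outlier scores. It uses scikit-learn's LocalOutlierFactor; averaging the scores is a simplification of the combination functions studied by Lazarevic and Kumar (2005).

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def feature_bagging_lof(X, n_rounds=10, n_neighbors=20, random_state=0):
    """Average LOF outlier scores over random feature subsets.
    Higher returned score = more outlying."""
    rng = np.random.default_rng(random_state)
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(n_rounds):
        # Random subspace of between d/2 and d-1 features, as in feature bagging.
        k = rng.integers(max(1, d // 2), d)
        feats = rng.choice(d, size=k, replace=False)
        lof = LocalOutlierFactor(n_neighbors=n_neighbors)
        lof.fit(X[:, feats])
        # negative_outlier_factor_ is roughly -LOF; negate so larger = more anomalous.
        scores += -lof.negative_outlier_factor_
    return scores / n_rounds

X = np.random.default_rng(1).normal(size=(300, 8))
X[:3] += 6                                        # plant a few outliers
print(np.argsort(feature_bagging_lof(X))[-3:])    # indices of the top-scoring cases
```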
Up next: semi-supervised methods, class-based anomaly detection, and explanation of rare events.

Semi-supervised Outlier Detection
The training data has labelled instances for only one class; most commonly, only normal data is available and labelled outliers are missing. These methods are more robust than unsupervised methods and can outperform supervised ones when we are not sure about the representativeness of the labelled outliers.
Jason Sopheap Tun, Semi-Supervised Outlier Detection Algorithms, U. California, 2018.

Methods
One-class learning: any unsupervised anomaly detection algorithm can be used; the learning set contains only normal instances, while the test set contains both, which amounts to a sort of novelty detection (a minimal code sketch is given further below). Disadvantage: it can be sensitive to outliers (as One-class SVM is) and thus may not perform very well (see also Aggarwal for details). Evaluation: outliers lie outside the learned region. Novelty detection is more general: it can flag even a dense cluster that is far from the normal points.

Example: scikit-learn
[Figure: scikit-learn one-class / novelty detection example.]

Class-based Outlier Detection
Why do we need a new concept? Example: an e-shop planning a marketing campaign to increase income. Which clients should be sent a new offer? We monitor two groups of clients:
- Group PLUS: clients buying products more or less often.
- Group MINUS: clients browsing the list of offers/products more or less often, but who have (almost) never bought anything so far.
Which clients - subsets of groups PLUS and MINUS - should be sent a new offer?

Class-based Outliers: Definition
Each example belongs to a class. Class-based outliers are those cases that look anomalous when the class labels are taken into account, but do not have to be anomalous when the class labels are ignored. In other words, a class-based outlier is a data point that behaves differently from other data points in the same class, while it may look normal with respect to data points in another class.

Multi-class Outliers
Han, Data Mining: Concepts and Techniques, 3rd edition: learn a model for each normal class; if a data point does not fit any of the models, it is declared an outlier. Advantage: easy to use. Disadvantage: some outliers cannot be detected.

Semantic Outliers
He et al. (2004) solve the problem by clustering and then computing the probability of the class label of the example with respect to the other members of the cluster, i.e. the similarity between the example and other examples in the class. They introduce COF, a class outlier factor:
COF = OF w.r.t. its own class (+) OF w.r.t. the other classes.
Disadvantage: how to define the (+) addition.
He Z. et al. Mining Class Outliers: Concepts, Algorithms and Applications in CRM. Expert Systems with Applications (ESWA), 27(4), pp. 681-697, 2004.

CODB
A combination of the distance-based and density-based approaches with respect to the class attribute, available in RapidMiner. With T an instance, K the number of nearest neighbours, and α, β parameters:
COF(T) = SimilarityToTheKNearestNeighbors (compare the class of T to the classes of its neighbours)
         + α * 1/DistanceFromOtherElementsOfTheClass (distance term)
         + β * DistanceFromTheNearestNeighbors (density term)
Hewahi N.M. and Saad M.K. Class Outliers Mining: Distance-Based Approach. Int. Journal of Intelligent Systems and Technologies, Vol. 2, No. 1, pp. 55-68, 2007.
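A minimal sketch of the semi-supervised, one-class setting referred to above (and in the scikit-learn figure placeholder): train on normal data only and flag deviations at test time. It uses scikit-learn's OneClassSVM; the data and the nu value are made up.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 2))                     # training set: normal behaviour only
X_test = np.vstack([rng.normal(size=(10, 2)),            # unseen normal cases
                    rng.normal(loc=5.0, size=(3, 2))])   # deviating cases

# nu roughly bounds the fraction of training points treated as outliers.
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_normal)
pred = clf.predict(X_test)                                # +1 = normal, -1 = outlier
print(pred)
```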
RF-OEX
A Random Forest-based method that uses the proximity matrix for class outlier factor computation. The COF is the sum of three different measures of proximity or outlierness:
COF = proximity to the members of the same class
    + misclassification (proximity to the members of other classes)
    + an ambiguity measure (the percentage of ambiguous classifications).
More: https://www.fi.muni.cz/~popel/685269/
Nezvalová, Leona, Lubomír Popelínský, Luis Torgo and Karel Vaculík. Class-Based Outlier Detection: Staying Zombies or Awaiting for Resurrection? In Proceedings of IDA 2015.

ILP: Rule-based Approach
Given positive examples E+, negative examples E- and background knowledge B, learn a concept C and the dual concept C1 (swap the positive and negative examples). C and C1 are pure logic programs. Look for examples that, if removed from the learning set, change the description (logic program) of C and C1 significantly, i.e. the difference in coverage is greater than a threshold. Those examples are the outliers.
Angiulli, Fabrizio; Fassetti, Fabio. Exploiting domain knowledge to detect outliers. Data Mining and Knowledge Discovery, 2014, vol. 28, no. 2.

Case Studies
- Educational data mining: correct vs. incorrect student solutions in logic.
- Czech Parliament: the 44 most important votes; deputies that look anomalous when compared with other members of the same party.
- Small and medium enterprises (growing/non-growing), among others.

IMDb: Star Ratings vs. Sentiment of a Review
Transform the 0-10 star rating into a positive/negative rating, perform two-class sentiment analysis of the review, and detect class-based outliers using RF-OEX, CODB and LOF (LOF run for each class separately).

IMDb: Example Results
A positive review, but the actors are described as horrible: "Tsui Hark's visual artistry is at its peek in this movie. Unfortunately the terrible acting by Ekin Cheng and especially Cecilia Cheung (I felt the urge to strangle her while watching this, it's that bad :) made it difficult to watch at times. This movie is a real breakthrough in the visual department. ..."
A positive review of a really bad horror movie that cannot be taken seriously: "People are seeing it as a typical horror movie that is set out to scare us and prevent us from getting some sleep. Which if it was trying to do then it would deservedly get a 1/10. The general view on this movie is that it has bad acting, a simple script that a 10 year old could produce and that it cant be taken seriously... ..."

Open Challenges
- Two groups A and B, where a member of A pretends to be in B.
- Filtering outliers to improve (classifier) accuracy.
- Anomalies in multi-modal data.

An updated version of this part and the next one can be found at https://www.fi.muni.cz/~popel/685269/
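Before moving on, here is a rough sketch of the simplest class-based baseline used in the IMDb study above: run an unsupervised detector (here LOF) separately within each class. It assumes scikit-learn and made-up data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def per_class_lof(X, y, n_neighbors=20):
    """Run LOF separately inside each class and return a score per case
    (higher = more outlying with respect to its own class)."""
    scores = np.zeros(len(y))
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        lof = LocalOutlierFactor(n_neighbors=min(n_neighbors, len(idx) - 1))
        lof.fit(X[idx])
        scores[idx] = -lof.negative_outlier_factor_
    return scores

# Two well-separated classes plus one mislabelled point (a class-based outlier).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
y[0] = 1                       # point from class 0 labelled as class 1
scores = per_class_lof(X, y)
print("most outlying case:", int(np.argmax(scores)))   # expected: 0
```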
Explanation of Rare Events

Need for Explanation of Outliers
A user needs to understand why an instance is detected as an outlier. For many applications, the explanation (interpretation, description, outlying property detection, characterization) of outliers is as important as their identification. The outlier factor (degree) and ranking provide only quantitative information; qualitative information is needed, and not only for high-dimensional data.
Based also on ODD v5.0: Outlier Detection De-constructed, ACM SIGKDD 2018 Workshop keynote speeches, namely "Making sense of unusual suspects - Finding and Characterizing Outliers" (Ira Assent) and "Outlier Description and Interpretation" (Jian Pei).

How to Generate an Explanation?
Compare with inlying data as well as with confirmed outlying data. Find an outlier explanatory component / outlying property / outlier context / outlier characteristic. Help the domain expert to verify outliers and to understand how the outlier method works.

What is a Meaningful Explanation?
A method for finding an explanation must be helpful for a user, namely easy to understand (e.g. the smallest subset of attributes), as well as efficient and scalable. The most frequent approaches are visual, or look for a subset of attributes such that each outlier has its own explanatory subspace.

Finding the Most Important Attributes
For an object q, find the subspaces where q is most unusual compared to the rest of the data. Example: a 3D space {x, y, z} and all its 2D projections, where {x, z} is an explanatory subspace (Micenková, 2015).

Strongest, Weak and Trivial Outliers (Knorr and Ng, 1998)
Non-trivial outlier: P is a non-trivial outlier in space A if P is not an outlier in any subspace of A.
Strongest outlier: the space A containing one or more outliers is called a strongest outlying space if no outlier exists in any subspace of A. Any P that is an outlier in A is called a strongest outlier. Any non-trivial outlier that is not strongest is called a weak outlier.

Example: NHL Ice Hockey Players (Knorr and Ng, 1999)
A 5-D space {A, B, C, D, E} of power-play goals, short-handed goals, game-winning goals, game-tying goals, and games played, with a lattice representation of its subspaces.

Explaining Outliers by Subspace Separability (Micenková and Ng, 2013)
One cannot derive the explanatory subspace just by analyzing the vicinity of the point in the full space, so different subspace projections need to be considered. There is no monotonicity property for outliers with respect to subspaces, and heuristics are needed because of the exponential complexity: look for a subspace A where the outlier factor is high and the dimension of A is low. Separability: an instance's outlierness is related to its separability from the rest of the data.
B. Micenková, R. T. Ng, X. H. Dang, and I. Assent. Explaining outliers by subspace separability. In IEEE ICDM 2013.

Outlierness as Accuracy of Classification (Micenková and Ng, 2013)
Separability is measured as the error of a classification task. Assume that the data follows a distribution f: the original data forms the inlier class, while the outlier plus artificially generated points form the outlier class; then use standard feature selection methods to find explanatory subspaces.
[Figure: measuring outlierness by separability; p1 and p2 are points from the distribution f(x), and the normal distributions g_p1(x) and g_p2(x) were artificially generated.]
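A rough sketch of the separability idea above, assuming scikit-learn: sample artificial points from a small Gaussian around the query point, train a classifier to separate them from the real data, and read the attribute importances as a hint of the explanatory subspace. This is a simplification of Micenková et al. (2013), which uses dedicated feature selection rather than random-forest importances.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def explanatory_attributes(X, query, n_artificial=200, scale=0.3, seed=0):
    """Score each attribute by how much it helps separate the query point
    (plus artificial points sampled around it) from the rest of the data."""
    rng = np.random.default_rng(seed)
    artificial = query + rng.normal(scale=scale, size=(n_artificial, X.shape[1]))
    X_all = np.vstack([X, artificial])
    y_all = np.r_[np.zeros(len(X)), np.ones(n_artificial)]   # 0 = inliers, 1 = outlier side
    clf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_all, y_all)
    return clf.feature_importances_

# Toy data: the outlier is unusual only in attribute 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
outlier = np.array([8.0, 0.0, 0.0])
print(explanatory_attributes(X, outlier))   # importance should concentrate on attribute 0
```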
RF-OEX: Explanation from the Analysis of the Random Forest
Two methods: 1. search for frequent branches, and 2. reduction of trees.
Nezvalová, Leona et al. Class-Based Outlier Detection: Staying Zombies or Awaiting for Resurrection? In Proceedings of IDA 2015.

RF-OEX: Examples of Explanation
Form: (condition, certainty factor).
Zoo dataset, instance number 64, class mammal: (eggs=true, 0.51); (toothed=false, 0.49).
Iris dataset, instance number 19, class Iris-setosa: (sepallength >= 5.5 && sepalwidth < 4, 0.53); (sepallength >= 5.5, 0.47).

Recent Work
- Beyond Outlier Detection: LookOut for Pictorial Explanation. ECML PKDD (Gupta et al., 2018).
- Explaining anomalies in groups with characterizing subspace rules. Data Mining and Knowledge Discovery 32 (Macha and Akoglu, 2018).
- Oui! Outlier Interpretation on Multi-dimensional Data via Visual Analytics. Eurographics Conference on Visualization (EuroVis) (Xun Zhao et al., 2019).
- Sequential Feature Explanation for Anomaly Detection. ACM Transactions on Knowledge Discovery from Data, Vol. 13, No. 1 (Siddiqui et al., 2019).
- Towards explaining anomalies: A deep Taylor decomposition of one-class models. Pattern Recognition 101 (Kauffmann et al., 2020).

Open Challenges: Imbalanced Domains and Rare Events

The Future
More and more human activities are being monitored through data collection. For a large set of critical application domains, stability is a key factor: deviations from normality are undesirable, and their timely anticipation may be highly rewarding. This is the central playing field of imbalanced domains and rare event detection.

Some Examples
Monitoring critical ecosystems (e.g. the ocean, marine protected areas), autonomous vehicles, health data science (e.g. monitoring elderly people, ICUs), financial domains, etc.

Some Critical Aspects of Imbalanced Domains
We face imbalanced domains when the following two conditions (both!) are true: not all values of the target variable are equally important, and the more important values are scarcely represented in the training data. This raises several questions: How to differentiate importance? How to define rarity? How to make algorithms focus on what is important (being rare)? How to properly evaluate the performance of the algorithms on what matters to the end user?

Some Critical Aspects of Imbalanced Domains (cont.)
Established answers exist for some special cases. Problem definition: the most established case is binary classification with one class value being rare and more important. Focus of algorithms: the most established methods revolve around resampling. Evaluation: the most common is to use the precision + recall / F-measure setting.

Other Problem Settings
Multiclass imbalance, regression, time series and data streams, spatiotemporal data streams, ordinal classification, multi-label classification, association rule mining, multi-instance learning, explainability, etc.
Multiclass Imbalance
Classification problems with more than two classes, with differentiated importance among the classes and with several of the relevant class values being rare. Key questions: How to express which classes are relevant? How to differentiate between the different classes (maybe some values are more important than others)? How to bias the algorithms taking this importance information into account? How to evaluate performance in these contexts?

Multiclass Imbalance: How to Express Which Classes are Relevant?
Branco et al. (2017) proposed to use the concept of relevance to specify the importance of each class value. How to derive this relevance information depends on the type of information we can get from the user:
- Informal: the user just lets us know that the rarer the class, the more important it is.
- Intermediate: the user is able to establish a partial order of importance between class values.
- Full: the user is able to provide a full quantification of the importance of each class value.
The authors provide means of estimating the relevance of each class for the different user-information settings and, based on the estimated relevance, propose a series of evaluation metrics for the performance of the models.
P. Branco, L. Torgo, and R. Ribeiro. "Relevance-based evaluation metrics for multi-class imbalanced domains." PAKDD, Springer, Cham, pp. 698-710 (2017).

Multiclass Imbalance: Open Challenges
How to bias the models? Resampling based on relevance? Algorithms directly optimizing the proposed metrics? Other approaches to multiclass, such as decomposition into multiple binary problems (e.g. Fernández et al., 2013)?
Fernández, A., López, V., Galar, M., del Jesús, M.J., Herrera, F.: Analysing the classification of imbalanced data-sets with multiple classes: binarization techniques and ad-hoc approaches. Knowledge-Based Systems, 42, 97-110 (2013).

Imbalance in Time Series Forecasting
In time series, observations are ordered by time and the goal is to obtain models able to forecast future values of the series. Imbalance occurs when some of the observed values are more important, but rare. Time series are usually numeric, so can we re-use the methods of imbalanced regression?

Imbalance in Time Series Forecasting: Specifying Differentiated Importance
Moniz et al. (2017) used the concept of relevance, deriving it automatically by assuming that extreme values of the time series are rare and thus more important (e.g. stock market returns). Standard resampling algorithms were applied to numeric time series forecasting tasks (under- and over-sampling and SMOTE), and several biased resampling methods were proposed (a simplified sketch is given below):
- Time bias: prefer to keep important values that are more recent.
- Time and relevance bias: also allow older values if they are extremely relevant.
N. Moniz, P. Branco, and L. Torgo. Resampling strategies for imbalanced time series forecasting. International Journal of Data Science and Analytics, 3(3):161-181, 2017.

Imbalance in Time Series Forecasting: Open Challenges
How to address other importance criteria (not only extreme values)? What about time series of symbolic values? Can we also adapt classification approaches? Are there other forms of biasing resampling using other properties of time series (e.g. taking seasonality into account)?
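The following Python sketch illustrates, in a heavily simplified way that is not Moniz et al.'s exact algorithm, the time-bias idea from the slides above: when oversampling relevant cases of a numeric series, sample them with a probability that grows with both their relevance and their recency.

```python
import numpy as np

def time_biased_oversample(y, phi, t_R=0.8, n_extra=50, seed=0):
    """Pick extra copies of relevant cases (phi >= t_R), preferring recent ones.
    y: numeric series values (oldest first); phi: relevance of each value.
    Returns the indices of the sampled extra cases."""
    rng = np.random.default_rng(seed)
    idx = np.flatnonzero(phi >= t_R)                  # the rare/relevant cases
    recency = (idx + 1) / len(y)                      # later positions -> larger weight
    weights = recency * phi[idx]                      # time bias x relevance bias
    weights = weights / weights.sum()
    return rng.choice(idx, size=n_extra, replace=True, p=weights)

# Made-up series and relevance (extreme values are the relevant ones).
rng = np.random.default_rng(1)
y = rng.normal(size=500)
phi = np.clip(np.abs(y) / 2.5, 0, 1)
extra = time_biased_oversample(y, phi)
print("mean position of sampled cases:", extra.mean())  # skewed towards the end of the series
```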
Imbalance in Data Streams
Data streams share some of the characteristics of time series, but are usually too big for normal batch learning and frequently suffer from serious concept drift effects. The imbalance ratio may change with time: what was rare may become common and vice versa, and new types of values not seen in the past may appear in the data. Full access to past data is typically impossible.

Imbalance in Data Streams (cont.)
Several authors describe some of the efforts in this area (e.g. Krawczyk et al., 2017; or Hoens et al., 2012). A frequent strategy to fight the problems raised by the properties of data streams is the use of ensembles.
B. Krawczyk, L. Minku, J. Gama, J. Stefanowski, and M. Wozniak. Ensemble learning for data stream analysis: A survey. Information Fusion, 37:132-156, 2017.
Hoens, T.R., Polikar, R., Chawla, N.V.: Learning from streaming data with concept drift and imbalance: an overview. Progress in AI 1(1), 89-101 (2012).

With the advances of mobile computing, more and more data with both spatial and temporal properties is being collected. How to resample a dataset that is a moving target? How to properly evaluate the performance in these settings?

Imbalance in Spatiotemporal Datasets
In spatiotemporal forecasting you have to cope not only with temporal correlation but also with spatial correlation. Are standard resampling strategies applicable? Is there any change in the way we should evaluate the models? Is it worth thinking about special-purpose learning algorithms?

Imbalance in Spatiotemporal Datasets (cont.)
Oliveira et al. (2019) are among the first to address the issue of imbalance in spatiotemporal forecasting. The authors propose resampling approaches tuned for this type of data: biased resampling that takes into account the temporal and spatial correlation of the data. The bias is calculated through temporal and spatial weights:
- temporal: favour more recent observations;
- spatial (1): favour spatially isolated rare cases;
- spatial (2): decrease the weights of cases that are far away from rare cases.
M. Oliveira, N. Moniz, L. Torgo and V. Santos Costa. "Biased Resampling Strategies for Imbalanced Spatio-Temporal Forecasting." 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Washington, DC, USA, 2019, pp. 100-109, doi: 10.1109/DSAA.2019.00024.

Imbalance in Spatiotemporal Datasets: Open Challenges
Coping with real-time data (spatiotemporal data streams), new forms of sampling bias, and coping with rarity that may be location dependent.

Explainability with Imbalanced Domains
Explainable AI and ML is a hot topic: many critical decisions are being taken based on the outcome of ML models. This is even more critical with imbalanced domains, which have to do with rarity and high importance. They are frequently associated with rare and costly events, frequently used for the early detection of these costly events, and drive important (and frequently costly) decisions. Explaining WHY becomes even more important.
Learning with Imbalanced Domains and Rare Event Detection
Luis Torgo, Stan Matwin, Nathalie Japkowicz, Nuno Moniz, Paula Branco, Rita P. Ribeiro and Lubomir Popelinsky
LIDTA 2020, September 2020