Learning with Imbalanced Domains and Rare Event Detection

Luis Torgo, Stan Matwin, Nathalie Japkowicz, Nuno Moniz, Paula Branco, Rita P. Ribeiro and Lubomir Popelinsky

Dalhousie University, Canada
INESC TEC, University of Porto, Portugal
American University, USA
University of Ottawa, Canada
Masaryk University, Czech Republic

September, 2020

(Torgo et. al.) LIDTA2020 September, 2020 1 / 127

Outline

1 Welcome
2 Rare Event Detection - Principles
3 Methods and Evaluation
4 Class-based Outlier Detection
5 Explanation of rare events
6 Open Challenges

Learning with Imbalanced Domains
Imbalanced Domain Learning

It is based on the following assumptions:
- the representativeness of the cases in the training data is not uniform;
- the underrepresented cases are the most relevant ones for the domain.

The focus is on the identification of these scarce/outlier cases. However, the definition of these cases depends on application domain knowledge.

Nature of Input Data
Key Aspects of Imbalanced Domain Learning

Each data instance has:
- one attribute (univariate)
- multiple attributes (multivariate)

Relationship among data instances:
- none
- sequential/temporal
- spatial
- spatio-temporal
- graph

Dimensionality of data

Performance Metrics
Key Aspects of Imbalanced Domain Learning

Standard performance metrics (e.g. accuracy, error rate) assume that all instances are equally relevant for the model performance. These metrics give a good performance estimate to a model that performs well on normal (frequent) cases and badly on outlier (rare) cases.

Credit Card Fraud Detection:
- data set D with only 1% of fraudulent transactions;
- model M predicts all transactions as non-fraudulent;
- M has an estimated accuracy of 99%;
- yet, all the fraudulent transactions were missed!

Standard performance metrics are not suitable!
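The credit card fraud example can be reproduced with a few lines of Python. This is only a minimal sketch: the data set of 1,000 transactions and the helper names `accuracy` and `recall` are assumptions made for illustration, and recall on the rare class is used here simply as one example of a metric that exposes the problem.

```python
def accuracy(y_true, y_pred):
    """Fraction of cases where the prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of the truly positive (rare) cases that were predicted positive."""
    preds_on_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_on_positives:
        return 0.0
    return sum(p == positive for p in preds_on_positives) / len(preds_on_positives)


# Hypothetical data set D: 1,000 transactions, 1% fraudulent (label 1).
y_true = [1] * 10 + [0] * 990

# Model M: predicts every transaction as non-fraudulent (label 0).
y_pred = [0] * len(y_true)

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy = {acc:.2f}, recall on fraud = {rec:.2f}")
```

Accuracy comes out at 99% while recall on the fraudulent class is zero, which is the mismatch the slide points out.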
Predictive Modelling
Supervised Imbalanced Domain Learning

In a supervised learning task the goal is:
- given an unknown function $Y = f(X_1, X_2, \cdots, X_p)$,
- use a training set $D = \{\langle x_i, y_i \rangle\}_{i=1}^{n}$ with examples of this function
- to obtain the best approximation to the function $f$, i.e. the model, $h(X_1, X_2, \cdots, X_p)$.

Depending on the type of target variable $Y$, we have:
- a classification task, if $Y$ is nominal
- a regression task, if $Y$ is numeric

Imbalanced Predictive Modelling
- More importance is assigned to a subset of the domain of the target variable $Y$.
- The cases that are more relevant are poorly represented in the training set.

How to specify these non-uniform importance values?

Predictive Modelling: Notion of Relevance
Supervised Imbalanced Domain Learning

Relevance function $\phi(Y)$ (Torgo and Ribeiro, 2007)
A relevance function $\phi(Y) : \mathcal{Y} \rightarrow [0, 1]$ is a function that expresses the application-specific bias concerning the target variable domain $\mathcal{Y}$ by mapping it into a $[0, 1]$ scale of relevance, where 0 and 1 represent the minimum and maximum relevance, respectively.

- The notion of relevance is applicable to both classification and regression problems.
- It can be used to build the sets of rare and normal cases.

Torgo, L. and Ribeiro, R. (2007). "Utility-based Regression". In: Proceedings of 11th ECML/PKDD 2007. Springer.

Predictive Modelling: Notion of Relevance (cont.)
Supervised Imbalanced Domain Learning

Given a user-defined threshold $t_R$ on the relevance values, partition the training set $D$ into two complementary subsets:
- $D_R = \{\langle x, y \rangle \in D : \phi(y) \geq t_R\}$
- $D_N = D \setminus D_R$

In this case, we have that $|D_R| \ll |D_N|$.

How to define the relevance function $\phi(Y)$?
- It can be provided by domain knowledge.
- It can be estimated from the target variable data distribution, so that rare target classes/values are assigned more importance.
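The partition of the training set into rare and normal cases can be sketched as follows for a classification target. The relevance used here (count of the rarest class divided by the count of each class) is an assumption chosen only because it maps rare classes to high relevance in [0, 1]; it is not the estimation method of the cited works. The names `frequency_relevance` and `partition_by_relevance`, the toy data and the threshold value 0.5 are likewise hypothetical.

```python
from collections import Counter


def frequency_relevance(labels):
    """Assumed relevance: rarest class gets 1.0, more frequent classes get less."""
    counts = Counter(labels)
    rarest = min(counts.values())
    return {label: rarest / count for label, count in counts.items()}


def partition_by_relevance(cases, phi, t_r):
    """Split cases (x, y) into D_R (phi(y) >= t_r) and D_N (the remaining cases)."""
    d_r = [(x, y) for x, y in cases if phi[y] >= t_r]
    d_n = [(x, y) for x, y in cases if phi[y] < t_r]
    return d_r, d_n


# Hypothetical training set D: 5 "fraud" cases and 95 "normal" cases.
cases = [(i, "fraud") for i in range(5)] + [(i, "normal") for i in range(5, 100)]

phi = frequency_relevance(y for _, y in cases)
d_r, d_n = partition_by_relevance(cases, phi, t_r=0.5)

print(phi)
print(len(d_r), len(d_n))  # 5 95
```

With this toy data the rare set holds the 5 fraud cases and the normal set the other 95, so the rare set is much smaller than the normal one, as expected in an imbalanced domain.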
Predictive Modelling: Imbalanced Classification
Supervised Imbalanced Domain Learning

In imbalanced classification, specifying the relevance of each class of the target variable is feasible.

The most important cases are the cases labelled with infrequent classes in the target variable $Y$, i.e. the cases for which $\phi(y) \geq t_R$.

Predictive Modelling: Imbalanced Regression
Supervised Imbalanced Domain Learning

In imbalanced regression, given the potentially infinite nature of the target variable domain, specifying the relevance of all values is virtually impossible, requiring an approximation.

Ribeiro (2011) proposed two methods for estimating $\phi(Y)$:

interpolation method
- the user provides a set of interpolating points

automatic method
- no input required from the user;
- it uses the target variable distribution;
- it assumes that the most relevant cases are located at the extremes of the target variable distribution.

Ribeiro, Rita P. "Utility-based regression". PhD thesis, Dep. Computer Science, Faculty of Sciences, University of Porto, 2011.

Predictive Modelling: Imbalanced Regression (cont.)
Supervised Imbalanced Domain Learning

The automatic method interpolates the boxplot statistics to obtain a continuous relevance function that maps the domain of the target variable $Y$ to the relevance interval $[0, 1]$, so that the extreme values of $Y$ are the most important ones, i.e. the cases for which $\phi(y) \geq t_R$.

Predictive Modelling Challenges
Supervised Imbalanced Domain Learning

It is of key importance that the obtained models are particularly accurate at the sub-range of the domain of the target variable for which training examples are rare.
To prevent the models from being biased towards the most frequent cases, it is necessary to use:
- performance metrics biased towards the performance on rare cases;
- learning strategies that focus on these rare cases:
  - data pre-processing
  - special-purpose learning
  - prediction post-processing

Branco, P., Torgo, L. and Ribeiro, R. P. (2016). "A survey of predictive modeling on imbalanced domains". In: ACM Computing Surveys (CSUR) 49 (2), 1–35.

Data Pre-Processing Strategies
Supervised Imbalanced Domain Learning

Proposal
Change the data distribution to make standard algorithms focus on rare and relevant cases.

Advantages
- They allow the application of any learning algorithm.
- The obtained model will be biased to the goals of the domain.
- Models will be interpretable.

Disadvantages
- It is difficult to relate the modifications in the data distribution to the domain preferences.
- Mapping the given data distribution into an optimal new distribution according to domain goals is not easy.

Special-purpose Learning Strategies
Supervised Imbalanced Domain Learning

Proposal
Change the learning algorithms so they can learn from imbalanced data.

Advantages
- The domain goals are incorporated directly into the models by setting an appropriate preference criterion.
- Models will be interpretable.

Disadvantages
- It is restricted to that specific set of modified learning algorithms.
- It requires a deep knowledge of the algorithms.
- If the preference criterion changes, models have to be relearned and, possibly, the algorithm has to be re-adapted.
- It is not easy to map the domain preferences to a suitable preference criterion.
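As a concrete illustration of the data pre-processing strategy described above, the following sketch applies random under-sampling, one simple and well-known way of changing the data distribution before a standard learner is trained. It is not presented as the method of the slides: the function name `random_undersample`, the `is_rare` predicate, the fixed seed and the toy data are all assumptions made for the example.

```python
import random


def random_undersample(data, is_rare, seed=0):
    """Keep all rare cases and a random sample of normal cases of the same size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rare = [case for case in data if is_rare(case)]
    normal = [case for case in data if not is_rare(case)]
    kept_normal = rng.sample(normal, k=min(len(rare), len(normal)))
    balanced = rare + kept_normal
    rng.shuffle(balanced)  # avoid an ordering where all rare cases come first
    return balanced


# Hypothetical training set: 5 "fraud" cases and 95 "normal" cases.
data = [(i, "fraud") for i in range(5)] + [(i, "normal") for i in range(5, 100)]

balanced = random_undersample(data, is_rare=lambda case: case[1] == "fraud")
counts = {
    label: sum(1 for _, y in balanced if y == label)
    for label in ("fraud", "normal")
}
print(counts)  # {'fraud': 5, 'normal': 5}
```

Any standard learning algorithm can then be trained on `balanced`. The sketch also shows the disadvantage listed for this strategy: the 1:1 ratio is an arbitrary choice, and nothing in the procedure says which new distribution best matches the domain goals.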
Prediction Post-processing Strategies

Proposal
Use the original data set and a standard learning algorithm, only manipulating the predictions of the models according to the domain preferences and the imbalance of the data.

Advantages
- It is not necessary to be aware of the domain preferences at learning time.
- The same model can be applied to different deployment scenarios without having to be relearned.
- Any standard learning algorithm can be used.

Disadvantages
- The models do not reflect the domain preferences.
- Model interpretability is jeopardized, as the models were obtained by optimizing a function that does not follow the domain preference bias.
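One way to picture prediction post-processing is to move the decision threshold applied to the scores of an already trained model. The sketch below assumes a hypothetical model that outputs a score in [0, 1] for the rare class; the scores, labels, misclassification costs and function names are invented for illustration and do not come from the slides.

```python
def predict_with_threshold(scores, threshold):
    """Label a case as rare (1) when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def total_cost(y_true, y_pred, cost_fn, cost_fp):
    """Domain cost: missed rare cases (FN) and false alarms (FP), weighted."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return cost_fn * fn + cost_fp * fp


def best_threshold(y_true, scores, cost_fn, cost_fp):
    """Pick the candidate threshold with the lowest total cost on held-out data."""
    candidates = sorted(set(scores)) + [float("inf")]  # inf = never predict rare
    return min(
        candidates,
        key=lambda th: total_cost(
            y_true, predict_with_threshold(scores, th), cost_fn, cost_fp
        ),
    )


# Hypothetical held-out scores from one fixed model, with their true labels.
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]
y_true = [0, 0, 0, 1, 0, 0, 1, 1]

# Two deployment scenarios that reuse the same model and the same scores.
equal_costs = best_threshold(y_true, scores, cost_fn=1, cost_fp=1)
costly_misses = best_threshold(y_true, scores, cost_fn=10, cost_fp=1)
print(equal_costs, costly_misses)  # 0.6 0.3
```

The model is never retrained: only the threshold changes when the domain preferences change, which matches the advantage listed above. The model itself stays unaware of those preferences, which is the listed disadvantage.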