Learning with Imbalanced Domains and Rare Event Detection
Luis Torgo, Stan Matwin, Nathalie Japkowicz, Nuno Moniz, Paula Branco, Rita P. Ribeiro and Lubomir Popelinsky
Dalhousie University, Canada; INESC TEC, University of Porto, Portugal; American University, USA; University of Ottawa, Canada; Masaryk University, Czech Republic
LIDTA 2020, September 2020

Outline
1 Welcome
2 Rare Event Detection - Principles
3 Methods and Evaluation
4 Class-based Outlier Detection
5 Explanation of Rare Events
6 Open Challenges

Rare Event Detection: Principles

Motivation
Most machine learning tasks focus on creating a model of the "normal" patterns in the data, extracting knowledge from what is common (e.g. frequent patterns). Still, rare patterns can also give us important insights about the data. These patterns represent rare events, i.e. outliers. Depending on the goal, those insights can be even more interesting than the "normal" patterns.

What is an Outlier?
"An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism" (Hawkins, 1980)

What is an Outlier? (cont.)
Outliers represent patterns in data that do not conform to a well-defined notion of normal. Initially, outliers were considered errors/noise and their identification served data cleaning purposes. However, they can represent a truthful deviation in the data. For some applications, they carry critical information, which can trigger preventive or corrective actions.

Learning with Imbalanced Domains
Imbalanced domain learning is based on the following assumptions:
- the representativeness of the cases in the training data is not uniform;
- the underrepresented cases are the most relevant ones for the domain.
The focus is on the identification of these scarce/outlier cases. However, the definition of these cases depends on application domain knowledge.

Some Applications with Imbalanced Domains
- Financial applications: credit card fraud, insurance claim fraud, stock market anomalies
- (Cyber)security applications: host-based and network intrusion detection
- Medical applications: medical sensor or imaging data for rare disease diagnostics
- Text and social media applications: anomalous activity in social networks, fake news detection
- Earth science applications: sea surface temperature anomalies, environmental disasters
- Fault detection applications: quality control, systems diagnosis, structure defect detection

Challenges of Imbalanced Domain Learning
- Defining every possible "normal" behaviour is hard.
- The boundary between normal and outlying behaviour is often not precise.
- There is no general outlier definition; it depends on the application domain.
- It is difficult to distinguish real, meaningful outliers from simple random noise in the data.
- The outlier behaviour may evolve with time.
- Malicious actions adapt themselves to appear normal.
- There is an inherent lack of known labelled outliers for training/validation of models.

Key Aspects of Imbalanced Domain Learning
Nature of Input Data; Type of Outliers; Intended Output; Learning Task; Performance Metrics
Nature of Input Data (Key Aspects of Imbalanced Domain Learning)
Each data instance has one attribute (univariate) or multiple attributes (multivariate).
Relationship among data instances: none, sequential/temporal, spatial, spatio-temporal, or graph.
The dimensionality of the data also matters.

Types of Outliers
Point Outlier: an instance that, individually or in a small group, is very different from the rest of the instances.
Contextual Outlier: an instance that, when considered within a context, is very different from the rest of the instances.
Collective Outlier: an instance that, even though it may not be an outlier in isolation, is an outlier when inspected in conjunction with related instances and with regard to the entire data set.

Intended Output
Value: a label / numeric value identifying an instance as normal or outlier.
Score: the probability of being an outlier. This allows the output to be ranked, but requires the specification of a threshold.

Learning Task
Unsupervised learning: the data set has no information on the behaviour of each instance. It assumes that instances with normal behaviour are far more frequent. This is the most common case in real-life applications.
Semi-supervised learning: the data set has a few instances labelled with normal or outlier behaviour. Some real-life applications, such as fault detection, provide such data.
Supervised learning: the data set has labelled instances of both normal and outlier behaviour. Such data is hard to obtain in real-life applications.

Performance Metrics
Standard performance metrics (e.g. accuracy, error rate) assume that all instances are equally relevant for the model's performance. These metrics give a good performance estimate to a model that performs well on normal (frequent) cases and badly on outlier (rare) cases.
Credit card fraud detection: a data set D with only 1% of fraudulent transactions; a model M predicts all transactions as non-fraudulent; M has an estimated accuracy of 99%; yet, all the fraudulent transactions were missed! Standard performance metrics are not suitable (a short code sketch below makes this concrete).

Taxonomy of Approaches (Imbalanced Domain Learning)
- Unsupervised: statistical-based, proximity-based, clustering-based, hybrid approaches
- Semi-supervised: one-class classification, other approaches, hybrid approaches
- Supervised: data pre-processing, special-purpose methods, prediction post-processing
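To make the credit-card fraud accuracy example above concrete, here is a minimal Python sketch (using scikit-learn and made-up class proportions) showing that a trivial all-normal classifier reaches roughly 99% accuracy while recalling none of the frauds.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, precision_score

# Hypothetical labels: ~1% fraud (positive class = 1), ~99% normal (0).
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)

# A "model" that always predicts the majority (non-fraud) class.
y_pred = np.zeros_like(y_true)

print("accuracy :", accuracy_score(y_true, y_pred))                    # ~0.99, looks great
print("recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0, every fraud missed
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0, no fraud ever flagged
```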
Statistical-based Approaches (Unsupervised Imbalanced Domain Learning)
Proposal: all the points that satisfy a statistical discordance test for some statistical model are declared outliers.
Advantages:
- If the assumptions of the statistical model hold, these techniques provide an acceptable solution for outlier detection.
- The outlier score is associated with a confidence interval.
Disadvantages:
- The data does not always follow a statistical model.
- Choosing the best hypothesis test statistic is not straightforward.
- Capturing interactions between attributes is not always possible.
- Estimating the parameters of some statistical models is hard.

Proximity-based Approaches (Unsupervised Imbalanced Domain Learning)
Proposal: normal instances occur in dense neighbourhoods, while outliers occur far from their closest neighbours.
Advantages:
- Purely data-driven technique.
- Does not make any assumptions regarding the underlying distribution of the data.
Disadvantages:
- True outliers and noisy regions of low density may be hard to distinguish; these methods need to combine global and local analysis.
- In high-dimensional data, the contrast in the distances is lost.
- Computationally expensive in the test phase.

Clustering-based Approaches (Unsupervised Imbalanced Domain Learning)
Proposal: normal instances belong to large and dense clusters, while outlier instances are those that do not belong to any of the clusters, are far from their closest cluster, or form very small or low-density clusters.
Advantages:
- Easily adaptable to an on-line/incremental mode.
- The test phase is fast.
Disadvantages:
- Computationally expensive in the training phase.
- If normal points do not form any clusters, this technique may fail.
- In high-dimensional spaces, clustering algorithms may not give any meaningful clusters.
- Some techniques detect outliers only as a byproduct, i.e. they are not optimized to find outliers; their main aim is to find clusters.

One-Class Classification Approach (Semi-supervised Imbalanced Domain Learning)
Proposal: build a prediction model of the normal behaviour and classify any deviation from this behaviour as an outlier.
Advantages:
- Models are interpretable.
- Normal behaviour can be accurately learned.
- Can detect new outliers that may not appear close to any outlier point in the training set.
Disadvantages:
- Requires previously labelled instances of normal behaviour.
- Possibly high false alarm rate: previously unseen normal data may be identified as outliers.

Predictive Modelling (Supervised Imbalanced Domain Learning)
In a supervised learning task the goal is: given an unknown function $Y = f(X_1, X_2, \cdots, X_p)$, use a training set $D = \{\langle x_i, y_i \rangle\}_{i=1}^{n}$ with examples of this function to obtain the best approximation to the function $f$, i.e. the model $h(X_1, X_2, \cdots, X_p)$.
Depending on the type of target variable Y, we have a classification task if Y is nominal and a regression task if Y is numeric.

Imbalanced Predictive Modelling
More importance is assigned to a subset of the domain of the target variable Y, and the cases that are more relevant are poorly represented in the training set. How can these non-uniform importance values be specified?

Predictive Modelling: Notion of Relevance
Relevance function $\phi(Y)$ (Torgo and Ribeiro, 2007): a relevance function $\phi : \mathcal{Y} \rightarrow [0, 1]$ expresses the application-specific bias concerning the domain of the target variable Y by mapping it into a [0, 1] scale of relevance, where 0 and 1 represent the minimum and maximum relevance, respectively.
The notion of relevance is applicable to both classification and regression problems, and it can be used to build the sets of rare and normal cases.
Torgo, L. and Ribeiro, R. (2007). "Utility-based Regression". In: Proceedings of the 11th ECML/PKDD 2007. Springer.

Predictive Modelling: Notion of Relevance (cont.)
With a user-defined threshold $t_R$ on the relevance values, partition the training set D into two complementary subsets:
$D_R = \{ \langle x, y \rangle \in D : \phi(y) \geq t_R \}$
$D_N = D \setminus D_R$
In this case, we have $|D_R| \ll |D_N|$.
How to define the relevance function $\phi(Y)$? It can be provided by domain knowledge, or estimated from the target variable's data distribution so that rare target classes/values are assigned more importance.

Predictive Modelling: Imbalanced Classification
In imbalanced classification, specifying the relevance of the target variable for each class is feasible. The most important cases are those labelled with infrequent classes of the target variable Y, i.e. the cases for which $\phi(y) \geq t_R$.

Predictive Modelling: Imbalanced Regression
In imbalanced regression, given the potentially infinite nature of the target variable's domain, specifying the relevance of all values is virtually impossible, requiring an approximation. Ribeiro (2011) proposed two methods for estimating $\phi(Y)$:
- interpolation method: the user provides a set of interpolating points;
- automatic method: no input is required from the user; it uses the target variable distribution and assumes that the most relevant cases are located at the extremes of that distribution.
Ribeiro, Rita P. "Utility-based Regression". PhD thesis, Dep. Computer Science, Faculty of Sciences, University of Porto, 2011.

Predictive Modelling: Imbalanced Regression (cont.)
The automatic method interpolates the boxplot statistics to obtain a continuous relevance function that maps the domain of the target variable Y into the relevance interval [0, 1], so that the extreme values of Y are the most important ones, i.e. the cases for which $\phi(y) \geq t_R$ (a rough code sketch of this idea follows below).

Predictive Modelling Challenges
It is of key importance that the obtained models are particularly accurate in the sub-range of the domain of the target variable for which training examples are rare.
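As an illustration of the automatic, boxplot-based relevance function described above, here is a minimal Python sketch. It is my own simplification (linear interpolation between the median and the whiskers), not Ribeiro's exact piecewise-cubic interpolation of the boxplot statistics.

```python
import numpy as np

def boxplot_relevance(y):
    """Rough relevance function phi(y): 0 at the median, 1 at/beyond the
    boxplot whiskers (Q1 - 1.5*IQR and Q3 + 1.5*IQR), linear in between.
    A simplification of the automatic method of Ribeiro (2011)."""
    y = np.asarray(y, dtype=float)
    q1, med, q3 = np.percentile(y, [25, 50, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    phi = np.where(
        y >= med,
        np.clip((y - med) / max(hi - med, 1e-12), 0, 1),  # upper tail
        np.clip((med - y) / max(med - lo, 1e-12), 0, 1),  # lower tail
    )
    return phi

# Example: partition a sample into rare (D_R) and normal (D_N) cases.
y = np.random.default_rng(1).lognormal(size=1000)
phi = boxplot_relevance(y)
t_R = 0.9
print("rare cases:", (phi >= t_R).sum(), "of", len(y))
```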
To prevent the models from being biased towards the most frequent cases, it is necessary to use:
- performance metrics biased towards the performance on rare cases;
- learning strategies that focus on these rare cases: data pre-processing, special-purpose learning, and prediction post-processing.
Branco P., Torgo L., Ribeiro R.P. (2016). "A survey of predictive modeling on imbalanced domains". ACM Computing Surveys (CSUR) 49(2), 1-35.

Data Pre-Processing Strategies (Supervised Imbalanced Domain Learning)
Proposal: change the data distribution to make standard algorithms focus on the rare and relevant cases (a toy oversampling sketch is given further below).
Advantages:
- They allow the application of any learning algorithm.
- The obtained model will be biased towards the goals of the domain.
- Models will be interpretable.
Disadvantages:
- It is difficult to relate the modifications in the data distribution to the domain preferences.
- Mapping the given data distribution into an optimal new distribution according to the domain goals is not easy.

Special-purpose Learning Strategies (Supervised Imbalanced Domain Learning)
Proposal: change the learning algorithms so that they can learn from imbalanced data.
Advantages:
- The domain goals are incorporated directly into the models by setting an appropriate preference criterion.
- Models will be interpretable.
Disadvantages:
- It is restricted to that specific set of modified learning algorithms.
- It requires deep knowledge of the algorithms.
- If the preference criterion changes, models have to be relearned and, possibly, the algorithm has to be re-adapted.
- It is not easy to map the domain preferences to a suitable preference criterion.

Prediction Post-processing Strategies (Supervised Imbalanced Domain Learning)
Proposal: use the original data set and a standard learning algorithm, manipulating only the predictions of the models according to the domain preferences and the imbalance of the data.
Advantages:
- It is not necessary to be aware of the domain preferences at learning time; the same model can be applied to different deployment scenarios without having to be relearned.
- Any standard learning algorithm can be used.
Disadvantages:
- The models do not reflect the domain preferences.
- Model interpretability is jeopardized, as the models were obtained by optimizing a function that does not follow the domain preference bias.

Up next: suitable performance metrics and specific learning methods.

Methods and Evaluation: Performance Evaluation

Why is performance evaluation a challenge?
Standard metrics (e.g. error rate or mean squared error) describe the average predictive performance of the models. When the user is focused on a small subset of rare values, averaging is not a good idea: these metrics will be mostly influenced by the performance of the models on cases that are irrelevant to the user.

An Example from Classification: Fraud Detection
Two classes, Fraud and Normal, with fraudulent cases roughly 1% of the training sample. A classifier that always predicts Normal would achieve, on average, 99% accuracy, yet this classifier is completely useless! Because frauds are very rare, missing them or correctly predicting them has only a minor impact on the accuracy (or error rate) metric.
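As a minimal illustration of the data pre-processing strategies discussed above, the sketch below randomly oversamples the minority class of a hypothetical binary problem until the classes are balanced. It uses plain NumPy; packages such as imbalanced-learn provide more sophisticated variants (e.g. SMOTE).

```python
import numpy as np

def random_oversample(X, y, target_class=1, random_state=0):
    """Duplicate randomly chosen minority-class rows until both classes
    have the same number of examples. A deliberately simple stand-in for
    the resampling strategies surveyed by Branco et al. (2016)."""
    rng = np.random.default_rng(random_state)
    minority = np.flatnonzero(y == target_class)
    majority = np.flatnonzero(y != target_class)
    n_extra = len(majority) - len(minority)
    extra = rng.choice(minority, size=n_extra, replace=True)
    keep = np.concatenate([majority, minority, extra])
    return X[keep], y[keep]

# Hypothetical imbalanced data: 1000 normal cases, 20 rare cases.
X = np.vstack([np.random.randn(1000, 2), np.random.randn(20, 2) + 4])
y = np.array([0] * 1000 + [1] * 20)
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y), "->", np.bincount(y_bal))  # [1000 20] -> [1000 1000]
```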
An Example from Regression: Forecasting Stock Market Returns
Very high or low returns (% variations of prices) are interesting; near-zero returns are very common but uninteresting for traders unable to cover transaction costs. Examples:
- Forecasting a future return of 3% when -5% happens is a very bad error.
- Forecasting a return of 3% when 11% happens has the same error amplitude, but it is not a serious error.
- Forecasting 0.2% for a true value of 0.4% is reasonably accurate but irrelevant.
- Forecasting -7.5% for a true value of -8% is a good and useful prediction.
Because near-zero returns are very common, a model that always forecasts 0 is hard to beat in terms of mean squared error. But this model is useless!

Metrics and the Available Information
Different applications may involve different types of information on the user preferences, which may have an impact on the metrics you can and/or should calculate. Independently of this, there are two classes of metrics: scalar and graphical.

Evaluation with Full Utility Information: Utility Matrices
A utility matrix is a table where each entry specifies the cost (negative benefit) or benefit of each type of prediction:

            Pred. c1   Pred. c2   Pred. c3
Obs. c1     B(1,1)     C(1,2)     C(1,3)
Obs. c2     C(2,1)     B(2,2)     C(2,3)
Obs. c3     C(3,1)     C(3,2)     B(3,3)

Models are then evaluated by the total utility of their predictions, i.e. the sum of the benefits minus the costs. A similar setting exists for regression, using utility surfaces (Ribeiro, 2011).
R. Ribeiro (2011). "Utility-based Regression". PhD on Computer Science, Univ. Porto.

The Precision/Recall Framework (Classification)
Classification problems with two classes, where one of the classes is much less frequent and also the most relevant.

            Pred. Pos               Pred. Neg
Obs. Pos    True Positives (TP)     False Negatives (FN)
Obs. Neg    False Positives (FP)    True Negatives (TN)

Precision - proportion of the signals (events) of the model that are correct:
$Prec = \frac{TP}{TP + FP}$
Recall - proportion of the real events that are captured by the model:
$Rec = \frac{TP}{TP + FN}$

The F-Measure: Combining Precision and Recall into a Single Measure
It is useful to have a single measure, e.g. for optimization within a search procedure. Maximizing one of precision or recall is easy at the cost of the other (it is easy to have 100% recall: always predict "P"). What is difficult is to have both with high values.
The F-measure is a statistic based on the values of precision and recall that allows establishing a trade-off between the two using a user-defined parameter $\beta$:
$F_\beta = \frac{(\beta^2 + 1) \cdot Prec \cdot Rec}{\beta^2 \cdot Prec + Rec}$
where $\beta$ controls the relative importance of Prec and Rec. If $\beta = 1$, F is the harmonic mean of Prec and Rec; when $\beta \rightarrow 0$ the weight of Rec decreases, and when $\beta \rightarrow \infty$ the weight of Prec decreases.
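The small Python helper below turns confusion-matrix counts into precision, recall and $F_\beta$, matching the formulas above (the counts in the example are made up).

```python
def prf_from_counts(tp, fp, fn, beta=1.0):
    """Precision, recall and F-beta from confusion-matrix counts,
    following the definitions above."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    if prec == 0.0 and rec == 0.0:
        return prec, rec, 0.0
    f_beta = (beta**2 + 1) * prec * rec / (beta**2 * prec + rec)
    return prec, rec, f_beta

# Hypothetical counts for a rare positive class.
print(prf_from_counts(tp=30, fp=20, fn=70))          # F1 balances precision and recall
print(prf_from_counts(tp=30, fp=20, fn=70, beta=2))  # beta=2 weighs recall more heavily
```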
The G-Mean and Adjusted G-Mean
$G_m = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}} = \sqrt{sensitivity \times specificity}$
$AG_m = \begin{cases} \frac{G_m + specificity \times N_n}{1 + N_n} & \text{if } sensitivity > 0 \\ 0 & \text{if } sensitivity = 0 \end{cases}$
where $N_n$ is the proportion of majority class examples in the data set.
M. Kubat and S. Matwin. "Addressing the curse of imbalanced training sets: one-sided selection." In Proc. of the 14th Int. Conf. on Machine Learning, 1997, Nashville, USA, pp. 179-186.
R. Batuwita and V. Palade. "A new performance measure for class imbalance learning. Application to bioinformatics problems." In ICMLA'09, pp. 545-550. IEEE, 2009.

Metrics for Multiclass Imbalance Problems
$\phi(i)$ is the relevance of class $i$; there are different ways to obtain $\phi()$ depending on the available domain information (Branco et al., 2017). A small code sketch of these weighted metrics is given at the end of this evaluation part.
$Rec_\phi = \frac{1}{\sum_{i=1}^{C} \phi(i)} \sum_{i=1}^{C} \phi(i) \cdot recall_i$
$Prec_\phi = \frac{1}{\sum_{i=1}^{C} \phi(i)} \sum_{i=1}^{C} \phi(i) \cdot precision_i$
$F^\phi_\beta = \frac{(1+\beta^2) \cdot Prec_\phi \cdot Rec_\phi}{\beta^2 \cdot Prec_\phi + Rec_\phi}$
$AvF^\phi_\beta = \frac{1}{\sum_{i=1}^{C} \phi(i)} \sum_{i=1}^{C} \frac{\phi(i) \cdot (1+\beta^2) \cdot precision_i \cdot recall_i}{\beta^2 \cdot precision_i + recall_i}$
$CBA^\phi = \sum_{i=1}^{C} \frac{\phi(i) \cdot mat_{i,i}}{\max\left(\sum_{j=1}^{C} mat_{i,j},\ \sum_{j=1}^{C} mat_{j,i}\right)}$
P. Branco, L. Torgo, and R. Ribeiro. "Relevance-based evaluation metrics for multi-class imbalanced domains." PAKDD, Springer, Cham, pp. 698-710 (2017).

The Precision/Recall Framework: Regression
For forecasting rare extreme values, the concepts of precision and recall were also adapted to regression (Torgo and Ribeiro, 2009; Branco, 2014):
$prec_\phi = \frac{\sum_{\phi(\hat{y}_i) > t_R} (1 + U(\hat{y}_i, y_i))}{\sum_{\phi(\hat{y}_i) > t_R} (1 + \phi(\hat{y}_i))}$
$rec_\phi = \frac{\sum_{\phi(y_i) > t_R} (1 + U(\hat{y}_i, y_i))}{\sum_{\phi(y_i) > t_R} (1 + \phi(y_i))}$
L. Torgo and R. P. Ribeiro (2009). "Precision and Recall for Regression". In: Discovery Science 2009. Springer.
P. Branco (2014). "Re-sampling Approaches for Regression Tasks under Imbalanced Domains". MSc on Computer Science, Univ. Porto.

Summary of Scalar Metrics for Imbalanced Domains
Adapted from: P. Branco, L. Torgo and R. Ribeiro. "A Survey of Predictive Modeling on Imbalanced Domains". ACM Comput. Surv. 49(2), 1-31 (2016); and P. Branco (2018). "Utility-based Predictive Analytics". PhD on Computer Science, Univ. Porto.

ROC Curve and Precision-Recall Curve (Classification)
[Figure: ROC curves (true positive rate vs. false positive rate) and precision-recall curves for three models A, B and C, with AUC-ROC(A) = 0.87, AUC-ROC(B) = 0.83, AUC-ROC(C) = 0.64 and AUC-PR(A) = 0.97, AUC-PR(B) = 0.80, AUC-PR(C) = 0.76.]
Taken from: P. Branco (2018). "Utility-based Predictive Analytics". PhD on Computer Science, Univ. Porto.

ROC Curve and Precision-Recall Curve (Regression)
Taken from: R. Ribeiro (2011). "Utility-based Regression". PhD on Computer Science, Univ. Porto.

Summary of Graphical Metrics for Imbalanced Domains
Adapted from: P. Branco, L. Torgo and R. Ribeiro. "A Survey of Predictive Modeling on Imbalanced Domains". ACM Comput. Surv. 49(2), 1-31 (2016); and P. Branco (2018). "Utility-based Predictive Analytics". PhD on Computer Science, Univ. Porto.

Unsupervised Methods
Statistical-based, proximity-based, clustering-based, and hybrid approaches.
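Before moving on to the unsupervised methods, here is a small Python sketch of the relevance-weighted multiclass precision, recall and F reconstructed above; it reflects my reading of the (extraction-garbled) formulas, and the class relevances and confusion matrix are made up.

```python
import numpy as np

def relevance_weighted_prf(conf_mat, phi, beta=1.0):
    """Relevance-weighted multiclass recall, precision and F-beta,
    following the Rec_phi / Prec_phi / F_phi formulas above.
    conf_mat[i, j] = number of class-i cases predicted as class j."""
    conf_mat = np.asarray(conf_mat, dtype=float)
    phi = np.asarray(phi, dtype=float)
    recall = np.diag(conf_mat) / conf_mat.sum(axis=1).clip(min=1)
    precision = np.diag(conf_mat) / conf_mat.sum(axis=0).clip(min=1)
    rec_phi = (phi * recall).sum() / phi.sum()
    prec_phi = (phi * precision).sum() / phi.sum()
    f_phi = (1 + beta**2) * prec_phi * rec_phi / (beta**2 * prec_phi + rec_phi)
    return prec_phi, rec_phi, f_phi

# Hypothetical 3-class problem where class 2 is rare and most relevant.
conf_mat = [[90, 5, 5],
            [10, 80, 10],
            [ 2,  3,  5]]
phi = [0.1, 0.3, 1.0]   # assumed relevance of each class
print(relevance_weighted_prf(conf_mat, phi))
```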
Statistical-based Methods: Parametric
Assumption: normal instances occur in high-probability regions of a stochastic model, while anomalies occur in the low-probability regions of that model. Variants: Gaussian model based, regression model based, mixture of parametric distributions based.
Grubbs' test: $z = \frac{|x - \mu|}{\sigma}$
Box plot rule.
Regression model based: fit a regression model and use the residuals to determine the anomaly score.

Statistical-based Methods: Non-parametric
Assumption: the model structure is not determined a priori but is instead determined from the given data. These methods make few assumptions regarding the data when compared to parametric techniques. Variants: histogram based and kernel function based.
Histogram based: build a histogram; for a new test instance, check whether it falls in a bin of the histogram. If it does, it is normal; otherwise, it is an anomaly. Variant: assign an anomaly score based on the bin frequency.
Kernel function based: non-parametric techniques for probability density estimation, e.g. Parzen window estimation (Parzen, 1962). They use kernel functions to approximate the actual density. Similar to parametric methods; the difference lies in the density estimation technique used.
Parzen, E. (1962). On the estimation of a probability density function and mode. Annals of Mathematical Statistics 33, 1065-1076.

Proximity-based Methods: Distance-based, Nearest Neighbours (NN) Approach
The anomaly score of a case is its distance to its k-th nearest neighbour; a threshold on the anomaly score determines whether a case is anomalous or not (sketched in code further below). Examples of applications: land mine detection from satellite ground images, detecting anomalies in large synchronous turbine-generators.
Ramaswamy, S., Rastogi, R. and Shim, K. "Efficient algorithms for mining outliers from large data sets." Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000.

An alternative way of computing the anomaly score: count the number of nearest neighbours that are not more than a distance d apart from the case. This can be viewed as a way to obtain an estimate of the global density around each case.
Knorr, E. M., and Ng, R. T. "Algorithms for mining distance-based outliers in large datasets." Proceedings of the International Conference on Very Large Data Bases, 1998.

Proximity-based Methods: LOF-based
Local Outlier Factor (LOF) (Breunig et al., 2000): each point has a score that captures the relative degree of isolation of the point from its surrounding neighbourhood.
Breunig, M. M., Kriegel, H. P., Ng, R., and Sander, J. (2000). "LOF: Identifying density-based local outliers." In Proceedings of the ACM SIGMOD 2000 International Conference on Management of Data. ACM Press.
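A minimal sketch of the distance-based score above (distance to the k-th nearest neighbour, in the spirit of Ramaswamy et al., 2000), using scikit-learn's NearestNeighbors on made-up data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)),          # dense "normal" cloud
               rng.normal(loc=6.0, size=(3, 2))])  # a few isolated points

k = 5
# k + 1 neighbours because each training point is its own nearest neighbour.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)
score = dist[:, -1]                 # distance to the k-th true neighbour

threshold = np.quantile(score, 0.98)
print("flagged as outliers:", np.flatnonzero(score > threshold))
```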
Proximity-based Methods: LOF Approach
MinPts is the number of nearest neighbours used to define the local neighbourhood. For each point x, compute the distance to its k-th nearest neighbour ($k\text{-}dist$), and then:
Reachability distance: $reach\text{-}dist_k(x, p) = \max\{k\text{-}dist(p),\ d(x, p)\}$
Local reachability density: $lrd_{MinPts}(x) = \frac{MinPts}{\sum_{p} reach\text{-}dist_{MinPts}(x, p)}$
LOF score: $LOF_{MinPts}(x) = \frac{1}{MinPts} \sum_{p} \frac{lrd_{MinPts}(p)}{lrd_{MinPts}(x)}$

Proximity-based Methods: KNN-based vs. LOF-based
[Comparison figure.]

Proximity-based Methods: Isolation Forest
Anomalies are few and different. An Isolation Forest is a collection of isolation trees (iTrees); each iTree isolates every case from the remaining cases for a given sample. Anomalies should be more susceptible to isolation, i.e. they exhibit a shorter average path length:
$Score(x) = \frac{1}{t} \sum_{i=1}^{t} l_i(x)$
where $l_i(x)$ is the path length of observation x in tree i.
Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (2008). "Isolation Forest". 2008 Eighth IEEE International Conference on Data Mining: 413-422.

Clustering-based Methods: DBSCAN
Idea: find the areas that satisfy a simple minimum density level and which are separated by areas of lower density.
Parameters: MinPts, a threshold for the number of neighbours, and ε, a radius. Objects with more than MinPts neighbours within a radius of ε (including the query point) are considered core points.
Steps: compute the neighbours of each point and identify core points; join neighbouring core points into clusters; for each non-core point, add it to a neighbouring core point if possible, otherwise mark it as noise.
Ester, Martin, et al. "A density-based algorithm for discovering clusters in large spatial databases with noise." KDD, Vol. 96, No. 34, 1996.
Schubert, Erich, et al. "DBSCAN revisited, revisited: why and how you should (still) use DBSCAN." ACM TODS 42.3 (2017): 1-21.

Clustering-based Methods: CBLOF
Cluster-based Local Outlier Factor (He et al., 2003). Two parameters: α, the ratio of the data set that is expected to be normal, and β, the minimum ratio of the size of a large cluster to the small clusters.
Idea: the anomaly score of a case is the distance to the nearest large cluster multiplied by the size of the cluster the case belongs to.
He, Zengyou, Xiaofei Xu, and Shengchun Deng. "Discovering cluster-based local outliers."

Hybrid Approaches: Feature Bagging for Outlier Detection
Feature Bagging for Outlier Detection runs the LOF method on multiple projections of the data and combines the results for improved detection quality in high dimensions. It was the first ensemble learning approach to outlier detection.
Lazarevic, A. and Kumar, V. Feature bagging for outlier detection. In KDD '05, 2005.
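A minimal sketch of the feature-bagging idea above: run LOF on several random feature subsets and average the outlier scores. It uses scikit-learn's LocalOutlierFactor; averaging the scores is a simplification of the combination functions studied by Lazarevic and Kumar (2005).

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def feature_bagging_lof(X, n_rounds=10, n_neighbors=20, random_state=0):
    """Average LOF outlier scores over random feature subsets.
    Higher returned score = more outlying."""
    rng = np.random.default_rng(random_state)
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(n_rounds):
        # Random subspace of between d/2 and d-1 features, as in feature bagging.
        k = rng.integers(max(1, d // 2), d)
        feats = rng.choice(d, size=k, replace=False)
        lof = LocalOutlierFactor(n_neighbors=n_neighbors)
        lof.fit(X[:, feats])
        # negative_outlier_factor_ is roughly -LOF; negate so larger = more anomalous.
        scores += -lof.negative_outlier_factor_
    return scores / n_rounds

X = np.random.default_rng(1).normal(size=(300, 8))
X[:3] += 6                                        # plant a few outliers
print(np.argsort(feature_bagging_lof(X))[-3:])    # indices of the top-scoring cases
```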
Up next: semi-supervised methods, class-based anomaly detection, and explanation of rare events.

Semi-supervised Outlier Detection
The training data has labelled instances for only one class; most commonly, only normal data is available and labelled outliers are missing. These methods are more robust than unsupervised methods and can outperform supervised ones when we are not sure about the representativeness of the labelled outliers.
Jason Sopheap Tun, Semi-Supervised Outlier Detection Algorithms, U. California, 2018.

Methods
One-class learning: any unsupervised anomaly detection algorithm can be used; the learning set contains only normal instances, while the test set contains both, which amounts to a sort of novelty detection (a minimal code sketch is given further below). Disadvantage: it can be sensitive to outliers (as One-class SVM is) and thus may not perform very well (see also Aggarwal for details). Evaluation: outliers lie outside the learned region. Novelty detection is more general: it can flag even a dense cluster that is far from the normal points.

Example: scikit-learn
[Figure: scikit-learn one-class / novelty detection example.]

Class-based Outlier Detection
Why do we need a new concept? Example: an e-shop planning a marketing campaign to increase income. Which clients should be sent a new offer? We monitor two groups of clients:
- Group PLUS: clients buying products more or less often.
- Group MINUS: clients browsing the list of offers/products more or less often, but who have (almost) never bought anything so far.
Which clients - subsets of groups PLUS and MINUS - should be sent a new offer?

Class-based Outliers: Definition
Each example belongs to a class. Class-based outliers are those cases that look anomalous when the class labels are taken into account, but do not have to be anomalous when the class labels are ignored. In other words, a class-based outlier is a data point that behaves differently from other data points in the same class, while it may look normal with respect to data points in another class.

Multi-class Outliers
Han, Data Mining: Concepts and Techniques, 3rd edition: learn a model for each normal class; if a data point does not fit any of the models, it is declared an outlier. Advantage: easy to use. Disadvantage: some outliers cannot be detected.

Semantic Outliers
He et al. (2004) solve the problem by clustering and then computing the probability of the class label of the example with respect to the other members of the cluster, i.e. the similarity between the example and other examples in the class. They introduce COF, a class outlier factor:
COF = OF w.r.t. its own class (+) OF w.r.t. the other classes.
Disadvantage: how to define the (+) addition.
He Z. et al. Mining Class Outliers: Concepts, Algorithms and Applications in CRM. Expert Systems with Applications (ESWA), 27(4), pp. 681-697, 2004.

CODB
A combination of the distance-based and density-based approaches with respect to the class attribute, available in RapidMiner. With T an instance, K the number of nearest neighbours, and α, β parameters:
COF(T) = SimilarityToTheKNearestNeighbors (compare the class of T to the classes of its neighbours)
         + α * 1/DistanceFromOtherElementsOfTheClass (distance term)
         + β * DistanceFromTheNearestNeighbors (density term)
Hewahi N.M. and Saad M.K. Class Outliers Mining: Distance-Based Approach. Int. Journal of Intelligent Systems and Technologies, Vol. 2, No. 1, pp. 55-68, 2007.
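A minimal sketch of the semi-supervised, one-class setting referred to above (and in the scikit-learn figure placeholder): train on normal data only and flag deviations at test time. It uses scikit-learn's OneClassSVM; the data and the nu value are made up.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 2))                     # training set: normal behaviour only
X_test = np.vstack([rng.normal(size=(10, 2)),            # unseen normal cases
                    rng.normal(loc=5.0, size=(3, 2))])   # deviating cases

# nu roughly bounds the fraction of training points treated as outliers.
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_normal)
pred = clf.predict(X_test)                                # +1 = normal, -1 = outlier
print(pred)
```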
RF-OEX
A Random Forest-based method that uses the proximity matrix for class outlier factor computation. The COF is the sum of three different measures of proximity or outlierness:
COF = proximity to the members of the same class
    + misclassification (proximity to the members of other classes)
    + an ambiguity measure (the percentage of ambiguous classifications).
More: https://www.fi.muni.cz/~popel/685269/
Nezvalová, Leona, Lubomír Popelínský, Luis Torgo and Karel Vaculík. Class-Based Outlier Detection: Staying Zombies or Awaiting for Resurrection? In Proceedings of IDA 2015.

ILP: Rule-based Approach
Given positive examples E+, negative examples E- and background knowledge B, learn a concept C and the dual concept C1 (swap the positive and negative examples). C and C1 are pure logic programs. Look for examples that, if removed from the learning set, change the description (logic program) of C and C1 significantly, i.e. the difference in coverage is greater than a threshold. Those examples are the outliers.
Angiulli, Fabrizio; Fassetti, Fabio. Exploiting domain knowledge to detect outliers. Data Mining and Knowledge Discovery, 2014, vol. 28, no. 2.

Case Studies
- Educational data mining: correct vs. incorrect student solutions in logic.
- Czech Parliament: the 44 most important votes; deputies that look anomalous when compared with other members of the same party.
- Small and medium enterprises (growing/non-growing), among others.

IMDb: Star Ratings vs. Sentiment of a Review
Transform the 0-10 star rating into a positive/negative rating, perform two-class sentiment analysis of the review, and detect class-based outliers using RF-OEX, CODB and LOF (LOF run for each class separately).

IMDb: Example Results
A positive review, but the actors are described as horrible: "Tsui Hark's visual artistry is at its peek in this movie. Unfortunately the terrible acting by Ekin Cheng and especially Cecilia Cheung (I felt the urge to strangle her while watching this, it's that bad :) made it difficult to watch at times. This movie is a real breakthrough in the visual department. ..."
A positive review of a really bad horror movie that cannot be taken seriously: "People are seeing it as a typical horror movie that is set out to scare us and prevent us from getting some sleep. Which if it was trying to do then it would deservedly get a 1/10. The general view on this movie is that it has bad acting, a simple script that a 10 year old could produce and that it cant be taken seriously... ..."

Open Challenges
- Two groups A and B, where a member of A pretends to be in B.
- Filtering outliers to improve (classifier) accuracy.
- Anomalies in multi-modal data.

An updated version of this part and the next one can be found at https://www.fi.muni.cz/~popel/685269/
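Before moving on, here is a rough sketch of the simplest class-based baseline used in the IMDb study above: run an unsupervised detector (here LOF) separately within each class. It assumes scikit-learn and made-up data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def per_class_lof(X, y, n_neighbors=20):
    """Run LOF separately inside each class and return a score per case
    (higher = more outlying with respect to its own class)."""
    scores = np.zeros(len(y))
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        lof = LocalOutlierFactor(n_neighbors=min(n_neighbors, len(idx) - 1))
        lof.fit(X[idx])
        scores[idx] = -lof.negative_outlier_factor_
    return scores

# Two well-separated classes plus one mislabelled point (a class-based outlier).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
y[0] = 1                       # point from class 0 labelled as class 1
scores = per_class_lof(X, y)
print("most outlying case:", int(np.argmax(scores)))   # expected: 0
```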
Explanation of Rare Events

Need for Explanation of Outliers
A user needs to understand why an instance is detected as an outlier. For many applications, the explanation (interpretation, description, outlying property detection, characterization) of outliers is as important as their identification. The outlier factor (degree) and ranking provide only quantitative information; qualitative information is needed, and not only for high-dimensional data.
Based also on ODD v5.0: Outlier Detection De-constructed, ACM SIGKDD 2018 Workshop keynote speeches, namely "Making sense of unusual suspects - Finding and Characterizing Outliers" (Ira Assent) and "Outlier Description and Interpretation" (Jian Pei).

How to Generate an Explanation?
Compare with inlying data as well as with confirmed outlying data. Find an outlier explanatory component / outlying property / outlier context / outlier characteristic. Help the domain expert to verify outliers and to understand how the outlier method works.

What is a Meaningful Explanation?
A method for finding an explanation must be helpful for a user, namely easy to understand (e.g. the smallest subset of attributes), as well as efficient and scalable. The most frequent approaches are visual, or look for a subset of attributes such that each outlier has its own explanatory subspace.

Finding the Most Important Attributes
For an object q, find the subspaces where q is most unusual compared to the rest of the data. Example: a 3D space {x, y, z} and all its 2D projections, where {x, z} is an explanatory subspace (Micenková, 2015).

Strongest, Weak and Trivial Outliers (Knorr and Ng, 1998)
Non-trivial outlier: P is a non-trivial outlier in space A if P is not an outlier in any subspace of A.
Strongest outlier: the space A containing one or more outliers is called a strongest outlying space if no outlier exists in any subspace of A. Any P that is an outlier in A is called a strongest outlier. Any non-trivial outlier that is not strongest is called a weak outlier.

Example: NHL Ice Hockey Players (Knorr and Ng, 1999)
A 5-D space {A, B, C, D, E} of power-play goals, short-handed goals, game-winning goals, game-tying goals, and games played, with a lattice representation of its subspaces.

Explaining Outliers by Subspace Separability (Micenková and Ng, 2013)
One cannot derive the explanatory subspace just by analyzing the vicinity of the point in the full space, so different subspace projections need to be considered. There is no monotonicity property for outliers with respect to subspaces, and heuristics are needed because of the exponential complexity: look for a subspace A where the outlier factor is high and the dimension of A is low. Separability: an instance's outlierness is related to its separability from the rest of the data.
B. Micenková, R. T. Ng, X. H. Dang, and I. Assent. Explaining outliers by subspace separability. In IEEE ICDM 2013.

Outlierness as Accuracy of Classification (Micenková and Ng, 2013)
Separability is measured as the error of a classification task. Assume that the data follows a distribution f: the original data forms the inlier class, while the outlier plus artificially generated points form the outlier class; then use standard feature selection methods to find explanatory subspaces.
[Figure: measuring outlierness by separability; p1 and p2 are points from the distribution f(x), and the normal distributions g_p1(x) and g_p2(x) were artificially generated.]
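A rough sketch of the separability idea above, assuming scikit-learn: sample artificial points from a small Gaussian around the query point, train a classifier to separate them from the real data, and read the attribute importances as a hint of the explanatory subspace. This is a simplification of Micenková et al. (2013), which uses dedicated feature selection rather than random-forest importances.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def explanatory_attributes(X, query, n_artificial=200, scale=0.3, seed=0):
    """Score each attribute by how much it helps separate the query point
    (plus artificial points sampled around it) from the rest of the data."""
    rng = np.random.default_rng(seed)
    artificial = query + rng.normal(scale=scale, size=(n_artificial, X.shape[1]))
    X_all = np.vstack([X, artificial])
    y_all = np.r_[np.zeros(len(X)), np.ones(n_artificial)]   # 0 = inliers, 1 = outlier side
    clf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_all, y_all)
    return clf.feature_importances_

# Toy data: the outlier is unusual only in attribute 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
outlier = np.array([8.0, 0.0, 0.0])
print(explanatory_attributes(X, outlier))   # importance should concentrate on attribute 0
```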
RF-OEX: Explanation from the Analysis of the Random Forest
Two methods: 1. search for frequent branches, and 2. reduction of trees.
Nezvalová, Leona et al. Class-Based Outlier Detection: Staying Zombies or Awaiting for Resurrection? In Proceedings of IDA 2015.

RF-OEX: Examples of Explanation
Form: (condition, certainty factor).
Zoo dataset, instance number 64, class mammal: (eggs=true, 0.51); (toothed=false, 0.49).
Iris dataset, instance number 19, class Iris-setosa: (sepallength >= 5.5 && sepalwidth < 4, 0.53); (sepallength >= 5.5, 0.47).

Recent Work
- Beyond Outlier Detection: LookOut for Pictorial Explanation. ECML PKDD (Gupta et al., 2018).
- Explaining anomalies in groups with characterizing subspace rules. Data Mining and Knowledge Discovery 32 (Macha and Akoglu, 2018).
- Oui! Outlier Interpretation on Multi-dimensional Data via Visual Analytics. Eurographics Conference on Visualization (EuroVis) (Xun Zhao et al., 2019).
- Sequential Feature Explanation for Anomaly Detection. ACM Transactions on Knowledge Discovery from Data, Vol. 13, No. 1 (Siddiqui et al., 2019).
- Towards explaining anomalies: A deep Taylor decomposition of one-class models. Pattern Recognition 101 (Kauffmann et al., 2020).

Open Challenges: Imbalanced Domains and Rare Events

The Future
More and more human activities are being monitored through data collection. For a large set of critical application domains, stability is a key factor: deviations from normality are undesirable, and their timely anticipation may be highly rewarding. This is the central playing field of imbalanced domains and rare event detection.

Some Examples
Monitoring critical ecosystems (e.g. the ocean, marine protected areas), autonomous vehicles, health data science (e.g. monitoring elderly people, ICUs), financial domains, etc.

Some Critical Aspects of Imbalanced Domains
We face imbalanced domains when the following two conditions (both!) are true: not all values of the target variable are equally important, and the more important values are scarcely represented in the training data. This raises several questions: How to differentiate importance? How to define rarity? How to make algorithms focus on what is important (being rare)? How to properly evaluate the performance of the algorithms on what matters to the end user?

Some Critical Aspects of Imbalanced Domains (cont.)
Established answers exist for some special cases. Problem definition: the most established case is binary classification with one class value being rare and more important. Focus of algorithms: the most established methods revolve around resampling. Evaluation: the most common is to use the precision + recall / F-measure setting.

Other Problem Settings
Multiclass imbalance, regression, time series and data streams, spatiotemporal data streams, ordinal classification, multi-label classification, association rule mining, multi-instance learning, explainability, etc.
Multiclass Imbalance
Classification problems with more than two classes, with differentiated importance among the classes and with several of the relevant class values being rare. Key questions: How to express which classes are relevant? How to differentiate between the different classes (maybe some values are more important than others)? How to bias the algorithms taking this importance information into account? How to evaluate performance in these contexts?

Multiclass Imbalance: How to Express Which Classes are Relevant?
Branco et al. (2017) proposed to use the concept of relevance to specify the importance of each class value. How to derive this relevance information depends on the type of information we can get from the user:
- Informal: the user just lets us know that the rarer the class, the more important it is.
- Intermediate: the user is able to establish a partial order of importance between class values.
- Full: the user is able to provide a full quantification of the importance of each class value.
The authors provide means of estimating the relevance of each class for the different user-information settings and, based on the estimated relevance, propose a series of evaluation metrics for the performance of the models.
P. Branco, L. Torgo, and R. Ribeiro. "Relevance-based evaluation metrics for multi-class imbalanced domains." PAKDD, Springer, Cham, pp. 698-710 (2017).

Multiclass Imbalance: Open Challenges
How to bias the models? Resampling based on relevance? Algorithms directly optimizing the proposed metrics? Other approaches to multiclass, such as decomposition into multiple binary problems (e.g. Fernández et al., 2013)?
Fernández, A., López, V., Galar, M., del Jesús, M.J., Herrera, F.: Analysing the classification of imbalanced data-sets with multiple classes: binarization techniques and ad-hoc approaches. Knowledge-Based Systems, 42, 97-110 (2013).

Imbalance in Time Series Forecasting
In time series, observations are ordered by time and the goal is to obtain models able to forecast future values of the series. Imbalance occurs when some of the observed values are more important, but rare. Time series are usually numeric, so can we re-use the methods of imbalanced regression?

Imbalance in Time Series Forecasting: Specifying Differentiated Importance
Moniz et al. (2017) used the concept of relevance, deriving it automatically by assuming that extreme values of the time series are rare and thus more important (e.g. stock market returns). Standard resampling algorithms were applied to numeric time series forecasting tasks (under- and over-sampling and SMOTE), and several biased resampling methods were proposed (a simplified sketch is given below):
- Time bias: prefer to keep important values that are more recent.
- Time and relevance bias: also allow older values if they are extremely relevant.
N. Moniz, P. Branco, and L. Torgo. Resampling strategies for imbalanced time series forecasting. International Journal of Data Science and Analytics, 3(3):161-181, 2017.

Imbalance in Time Series Forecasting: Open Challenges
How to address other importance criteria (not only extreme values)? What about time series of symbolic values? Can we also adapt classification approaches? Are there other forms of biasing resampling using other properties of time series (e.g. taking seasonality into account)?
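The following Python sketch illustrates, in a heavily simplified way that is not Moniz et al.'s exact algorithm, the time-bias idea from the slides above: when oversampling relevant cases of a numeric series, sample them with a probability that grows with both their relevance and their recency.

```python
import numpy as np

def time_biased_oversample(y, phi, t_R=0.8, n_extra=50, seed=0):
    """Pick extra copies of relevant cases (phi >= t_R), preferring recent ones.
    y: numeric series values (oldest first); phi: relevance of each value.
    Returns the indices of the sampled extra cases."""
    rng = np.random.default_rng(seed)
    idx = np.flatnonzero(phi >= t_R)                  # the rare/relevant cases
    recency = (idx + 1) / len(y)                      # later positions -> larger weight
    weights = recency * phi[idx]                      # time bias x relevance bias
    weights = weights / weights.sum()
    return rng.choice(idx, size=n_extra, replace=True, p=weights)

# Made-up series and relevance (extreme values are the relevant ones).
rng = np.random.default_rng(1)
y = rng.normal(size=500)
phi = np.clip(np.abs(y) / 2.5, 0, 1)
extra = time_biased_oversample(y, phi)
print("mean position of sampled cases:", extra.mean())  # skewed towards the end of the series
```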
Imbalance in Data Streams
Data streams share some of the characteristics of time series, but are usually too big for normal batch learning and frequently suffer from serious concept drift effects. The imbalance ratio may change with time: what was rare may become common and vice versa, and new types of values not seen in the past may appear in the data. Full access to past data is typically impossible.

Imbalance in Data Streams (cont.)
Several authors describe some of the efforts in this area (e.g. Krawczyk et al., 2017; or Hoens et al., 2012). A frequent strategy to fight the problems raised by the properties of data streams is the use of ensembles.
B. Krawczyk, L. Minku, J. Gama, J. Stefanowski, and M. Wozniak. Ensemble learning for data stream analysis: A survey. Information Fusion, 37:132-156, 2017.
Hoens, T.R., Polikar, R., Chawla, N.V.: Learning from streaming data with concept drift and imbalance: an overview. Progress in AI 1(1), 89-101 (2012).

With the advances of mobile computing, more and more data with both spatial and temporal properties is being collected. How to resample a dataset that is a moving target? How to properly evaluate the performance in these settings?

Imbalance in Spatiotemporal Datasets
In spatiotemporal forecasting you have to cope not only with temporal correlation but also with spatial correlation. Are standard resampling strategies applicable? Is there any change in the way we should evaluate the models? Is it worth thinking about special-purpose learning algorithms?

Imbalance in Spatiotemporal Datasets (cont.)
Oliveira et al. (2019) are among the first to address the issue of imbalance in spatiotemporal forecasting. The authors propose resampling approaches tuned for this type of data: biased resampling that takes into account the temporal and spatial correlation of the data. The bias is calculated through temporal and spatial weights:
- temporal: favour more recent observations;
- spatial (1): favour spatially isolated rare cases;
- spatial (2): decrease the weights of cases that are far away from rare cases.
M. Oliveira, N. Moniz, L. Torgo and V. Santos Costa. "Biased Resampling Strategies for Imbalanced Spatio-Temporal Forecasting." 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Washington, DC, USA, 2019, pp. 100-109, doi: 10.1109/DSAA.2019.00024.

Imbalance in Spatiotemporal Datasets: Open Challenges
Coping with real-time data (spatiotemporal data streams), new forms of sampling bias, and coping with rarity that may be location dependent.

Explainability with Imbalanced Domains
Explainable AI and ML is a hot topic: many critical decisions are being taken based on the outcome of ML models. This is even more critical with imbalanced domains, which have to do with rarity and high importance. They are frequently associated with rare and costly events, frequently used for the early detection of these costly events, and drive important (and frequently costly) decisions. Explaining WHY becomes even more important.
Learning with Imbalanced Domains and Rare Event Detection
Luis Torgo, Stan Matwin, Nathalie Japkowicz, Nuno Moniz, Paula Branco, Rita P. Ribeiro and Lubomir Popelinsky
LIDTA 2020, September 2020