Outlier Detection
Zuzana Pekarčíková

Outline
- What is an Outlier?
- Applications of Outlier Detection
- Types of Outliers
- Outlier Detection Methods Types
- Basic Outlier Detection Methods
- High-dimensional Outlier Detection Methods
- Class Outlier Detection – Random Forests
- Context-based Approach

What is an Outlier?
- Definition of Hawkins [Hawkins 1980]: “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism”

Applications of Outlier Detection
- Fraud detection
  - The purchasing behavior of a credit card owner usually changes when the card is stolen.
- Medicine
  - Unusual symptoms or test results may indicate potential health problems of a patient.
  - Whether a particular test result is abnormal may depend on other characteristics of the patient (e.g. gender, age, …).
- Detecting measurement errors
  - Data derived from sensors may contain measurement errors.
  - Removing such errors can be important in other data mining and data analysis tasks.

Types of Outliers
- Point Anomalies
  - An individual data instance is anomalous with respect to the rest of the data.
  - The simplest type of outlier.
  - Example: credit card fraud detection

Types of Outliers
- Contextual Anomalies
  - A data instance is anomalous in a specific context, but not otherwise.
  - Example: in a temperature time series, t1 = t2, but t1 occurs in winter whereas t2 occurs in summer

Types of Outliers
- Collective Anomalies
  - A collection of related data instances is anomalous with respect to the entire data set.
  - The individual data instances in a collective anomaly may not be anomalies by themselves, but their occurrence together as a collection is anomalous.
  - Example: human cardiogram

Outlier Detection Methods Types
- The labels associated with a data instance denote whether that instance is normal or anomalous.
- Supervised Methods
  - a training data set with labeled instances for both the normal and the anomaly class is available
  - a predictive model is built for normal vs. anomaly classes, so the problem is transformed into a classification problem
  - Problems:
    - anomalous instances are far fewer than normal instances
    - obtaining accurate labels for the anomaly class is challenging
- Semi-supervised Methods
  - training data has labeled instances only for the normal class
- Unsupervised Methods
  - no labels; the most widely used type
  - assumption: normal instances are far more frequent than anomalies in the test data, and they form clusters

Outlier Detection Methods
- Statistical Methods
  - normal data objects are generated by a statistical (stochastic) model, and data not following the model are outliers
  - Example: with a Gaussian distribution as the model, outliers are points that have a low probability of being generated by that distribution
  - Problems: the mean and standard deviation are very sensitive to outliers, and these values are computed from the complete data set (including potential outliers)
  - Advantage: there is a statistical justification for why the object is an outlier

Outlier Detection Methods
- Proximity-Based Methods
  - An object is an outlier if the proximity of the object to its neighbors significantly deviates from the proximity of most of the other objects to their neighbors in the same data set.
  - Distance-based detection: the neighborhood is given by a radius r or by the k nearest neighbors
  - Density-based detection: the relative density of an object is computed from the density of its neighbors
- Clustering-Based Methods
  - Normal data objects belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any cluster.
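
The statistical and distance-based ideas above can be sketched in a few lines. This is only an illustration under stated assumptions, not a method taken from the slides: the function names `zscore_outliers` and `knn_distance_scores`, the thresholds, and the sample data are hypothetical, and NumPy is assumed to be available. The first function flags points that are unlikely under a Gaussian fitted to the whole data set; the second scores each point by the distance to its k-th nearest neighbor.

```python
import numpy as np


def zscore_outliers(values, threshold=3.0):
    """Statistical sketch: flag values far from the mean of a fitted Gaussian.

    The mean and standard deviation are computed from all values, including
    potential outliers, which is exactly the weakness noted on the slide.
    """
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        return np.zeros(values.shape, dtype=bool)
    return np.abs(values - values.mean()) / std > threshold


def knn_distance_scores(points, k=3):
    """Distance-based sketch: score = distance to the k-th nearest neighbor."""
    points = np.asarray(points, dtype=float)
    if not 0 < k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the distance of a point to itself.
    return np.sort(dists, axis=1)[:, k]


# Hypothetical data: twenty ordinary readings and one extreme reading.
readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10] * 2 + [50]
reading_flags = zscore_outliers(readings)

# Hypothetical data: a small cluster near the origin and one distant point.
cloud = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [10, 10]]
cloud_scores = knn_distance_scores(cloud, k=2)
print(reading_flags.nonzero()[0], int(np.argmax(cloud_scores)))
```

In both cases the result is a score that still has to be thresholded or ranked; the choice of `threshold` and `k` is an assumption that has to be tuned per data set.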
Outlier Detection Methods
- Classification-Based Methods
  - Main idea: train a classification model that can distinguish normal data from outliers
  - Problem: imbalanced classes
  - Solution: use a one-class model; the classifier describes only the normal class, and samples that do not belong to the normal class are regarded as outliers

High-dimensional Outlier Detection Methods
Problems in high-dimensional data:
- The relative contrast between distances decreases with increasing dimensionality
- Data are very sparse, so almost all points are outliers
- The concept of neighborhood becomes meaningless
Solutions:
- Use more robust distance functions and find full-dimensional outliers
- Find outliers in projections (subspaces) of the original feature space

High-dimensional Outlier Detection Methods
ABOD (angle-based outlier degree)
- Object o is an outlier if most other objects are located in similar directions
- Object o is not an outlier if many other objects are located in varying directions
(Figure: an outlier o compared with a non-outlier.)

Class Outlier Detection
- "Semantic outlier"
- A semantic outlier is a data point that behaves differently from other data points in the same class, while looking normal with respect to data points in another class.

Class Outlier Detection
- Multi-class classification-based anomaly detection techniques assume that the training data contains labeled instances belonging to multiple normal classes
- These techniques train a classifier to distinguish between each normal class and the rest of the classes.

Class Outlier Detection – Random Forests
- Random Forests is an ensemble classification and regression approach.
- Ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models.
- Random Forests:
  - consists of many classification trees
  - about 1/3 of the samples are left out of each bootstrap sample (OOB, out-of-bag data) and are used to estimate the classification error
  - each tree is constructed from a different bootstrap sample of the original data
  - all data are run down the tree and proximities are computed for each pair of cases; these proximities are used for outlier detection
  - outliers are cases whose proximities to all other cases in the data are generally small
  - outliers are defined relative to their class: an outlier in class j is a case whose proximities to all other class j cases are small

Context-based Approach
Is a temperature of 28°C an outlier?
- If we are in Brno in summer: NO
- If we are in Brno in winter: YES
→ It depends on the location and time, i.e. on the CONTEXT

Context-based Approach
A contextual outlier deviates significantly from the model with respect to a specific context of the object.
Generally, the attributes of the data objects are divided into two groups:
- Contextual attributes: define the object’s context. In the example, the contextual attributes may be date and location.
- Behavioral attributes: define the object’s characteristics, and are used to evaluate whether the object is an outlier in the context to which it belongs. In the example, the behavioral attributes may be the temperature and humidity.
Contextual outlier detection methods can be divided into two categories according to whether the contexts can be clearly identified:
- Transforming Contextual Outlier Detection to Conventional Outlier Detection: the context can be easily identified.
- Modeling Normal Behavior with Respect to Contexts: the context is more difficult to identify.

Context-based Approach
Transforming Contextual Outlier Detection to Conventional Outlier Detection
General idea: whether the object is an outlier is evaluated in two steps:
- identify the context of the object using the contextual attributes
- calculate the outlier score for the object in that context using a conventional outlier detection method
Example: In customer relationship management, we can detect outlier customers in the context of customer groups.
3 attributes:
- contextual: age group (under 25, 25-45, 45-65, and over 65), post code
- behavioral: number of transactions per year
Is customer c an outlier?
- locate the context of c using the attributes age group and post code
- compare c with the other customers in the same group, using a conventional outlier detection method

Context-based Approach
Modeling Normal Behavior with Respect to Contexts
The context is not easy to identify.
Example: An online store records the sequence of products searched for by each customer. Outlier behavior occurs when a customer suddenly purchases a product that is unrelated to those he/she recently browsed.
→ Contexts cannot be easily specified, because it is unclear how many products browsed earlier should be considered as the context, and this number will likely differ for each product.
General idea: model normal behavior with respect to contexts.
Using a training data set, the method trains a model that predicts the expected behavioral attribute values with respect to the contextual attribute values.
Is an object an outlier?
We apply the model to the contextual attributes of the object. If the behavioral attribute values deviate significantly from the values predicted by the model, then the object is a contextual outlier.
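
Both contextual approaches can be sketched briefly. The code below is a rough illustration under stated assumptions, not the method of any particular system: all function names, the post code value, the thresholds, and the sample data are hypothetical, and NumPy is assumed to be available. `outliers_within_context` follows the two-step idea (locate the context, then apply a conventional z-score test inside it). `fit_context_model` and `deviates_from_model` follow the modeling idea, with a plain linear regression on a single numeric contextual attribute standing in for whatever model would be trained in practice.

```python
from collections import defaultdict

import numpy as np


def outliers_within_context(records, threshold=2.0):
    """Approach 1: group objects by context, then z-score inside each group.

    `records` is a list of (context, behavior_value) pairs; `context` must be
    hashable, e.g. a tuple (age_group, post_code).
    """
    groups = defaultdict(list)
    for index, (context, value) in enumerate(records):
        groups[context].append((index, value))

    flags = [False] * len(records)
    for members in groups.values():
        values = np.array([value for _, value in members], dtype=float)
        std = values.std()
        if std == 0:
            continue  # all members behave identically, nothing to flag
        scores = np.abs(values - values.mean()) / std
        for (index, _), score in zip(members, scores):
            flags[index] = bool(score > threshold)
    return flags


def fit_context_model(context, behavior):
    """Approach 2: learn expected behavior as a linear function of context."""
    context = np.asarray(context, dtype=float)
    behavior = np.asarray(behavior, dtype=float)
    slope, intercept = np.polyfit(context, behavior, deg=1)
    residuals = behavior - (slope * context + intercept)
    return slope, intercept, residuals.std()


def deviates_from_model(model, context_value, behavior_value, threshold=3.0):
    """Flag an object whose behavior is far from the model's prediction."""
    slope, intercept, residual_std = model
    expected = slope * context_value + intercept
    return bool(abs(behavior_value - expected) > threshold * residual_std)


# Hypothetical customers: (age group, post code) -> transactions per year.
young = ("25-45", "60200")
senior = ("over 65", "60200")
records = [(young, v) for v in [20, 22, 19, 21, 20, 23, 21, 80]]
records += [(senior, v) for v in [2, 3, 2, 4, 3, 2, 3, 20]]
flags = outliers_within_context(records)

# Hypothetical training data: behavior grows roughly linearly with context.
train_context = list(range(10))
train_behavior = [2 * x + 1 + (0.1 if x % 2 == 0 else -0.1) for x in train_context]
model = fit_context_model(train_context, train_behavior)
print(flags, deviates_from_model(model, 5, 30.0))
```

In this sample data, 20 transactions per year is unremarkable in the 25-45 group but is flagged in the over-65 group, which is the point of evaluating behavior inside its context. The within-group z-score could be replaced by any conventional outlier detection method.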