Outlier detection “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism” [Hawkins 1980] Outlier factor = dissimilarity with other instances Two needs for OD 1. Detect THAN Remove & Run again 2. Detect THAN Analyze Applications of Outlier Detection !  Detecting measurement errors Data derived from sensors may contain measurement errors. Removing such errors can be important in other data mining and data analysis tasks !  Fraud detection Purchasing behavior of a credit card owner usually changes when the card is stolen !  Education: detection of unexpected solutions e.g. constructive tasks in logic !  Intrusion detection Attacks to a network, or to a blog !  Plagiarism detection A part of text has been written by somebody else. Local Outlier Factor (LOF) distk(o) . . . k-distance of an object o . . . distance from o to its kth nearest neighbor Nk(o) k-distance neighborhood of o . . . set of k nearest neighbors of o reach.distk(o, p) = max{distk(p), dist(o,p)} . . . reachability-distance of an object o with respect to another object p The local reachability-distance is the inverse of the average reachability-distance of its k-neighborhood. LOF is the average of the ratio between the local reachability-distance of o and those of its k-nearest neighbors. Example: online-shop, planning marketing campaigne Example: online-shop, planning marketing campaigne To which clients you should sent a new offer? Example: online-shop, planning marketing campaigne To which clients you should sent a new offer? monitoring two groups of clients Group PLUS : buying products more or less often Example: online-shop, planning marketing campaigne To which clients you should sent a new offer? monitoring two groups of clients Group PLUS : buying products more or less often Group MINUS : just browsing list of offers/products more or less often but (almost) have not bought anything so far Example: online-shop, planning marketing campaigne To which clients you should sent a new offer? monitoring two groups of clients Group PLUS : buying products more or less often Group MINUS : just browsing list of offers/products more or less often but (almost) have not bought anything so far To which clients you should sent a new offer? Class-based outliers. Why we need a new concept? Example: online-shop, planning marketing campaigne To which clients you should sent a new offer? monitoring two groups of clients Group PLUS : buying products more or less often Group MINUS : just browsing list of offers/products more or less often but (almost) have not bought anything so far To which clients you should sent a new offer? Luboš Popelínský Luboš Popelínský, popel@fi.muni.cz DML & KDLab Faculty of Informatics, FI MU Brno dcc f.sciencias up 20 nov 2019 On class-based outlier detection Thanks to Luis Torgo, Lea Nezvalová, Karel Vaculík and other members of the KDLab Class-based Outliers each example belongs to a class Class-based outliers are those cases that look anomalous when the class labels are taken into account, but they do not have to be anomalous when the class labels are ignored. outliers = data point which behaves differently with other data points in the same class may look normal with respect to data points in another class Class-based Outlier Detection !  sometimes called ‘semantic outlier’ Multi-class outlier detection [Han et al. 2012] Data Mining. Principle and Techniques, 3rd edition learn a model for each normal class if the data point does not fit any of the model, then it is declared an outlier + easy to use – some outliers cannot be detected Class-based outlier factor. How to compute [He et al. 2004] Mining Class Outliers: Concepts, Algorithms and Applications in CRM. Class outlier factor (COF) Semantic outliers; clustering-based COF = OF w.r.t. own class (+) OF w.r.t. the other class/es Pros & Cons Semantic outliers (cont.) x1 has the same rank To fix it . . . Semantic outliers [Hewahi and Saad 2007] use a supervised machine learning algorithm ROBUST-C4.5 [John 1995] C4.5 incorporates a pruning scheme that partially addresses the outlier removal problem. extending the pruning method to fully remove the effect of outliers this results in a smaller tree without decrease of accuracy (average and st.dev.on 21 datasets). CODB [Hewahi and Saad 2007] !  combination of distance-based and density-based approach w.r.t class attribute no need for clustering CODB COF(T) = SimilarityToTheK-NearestNeighbors + α * 1/DistanceFromOtherElementsOfTheClass + β * DistanceFromTheNearestNeighbors COF(T) = K*PCL(T,K) + α * 1/Dev(T) + β *Kdist(T) PCL(T,K) ... the probability of the class label of T w.r.t. the K nearest neighbors Dev(T) ... ... the sum of distance from all other elements from the same class Kdist(T) ... the distance between T and its K nearest neighbors RF-OEX: Class Outlier Detection with Random Forests (Nezvalová et al. IDA 2015) Random Forests [Breiman 2000] is an ensemble classification and regression approach employs 1. bootstraping and 2. random tree learning Class Outlier Detection – Random Forests After each tree is built, all of the data are run down the tree, and proximities are computed for each pair of cases: If two cases occupy the same terminal node, their proximity is increased by one. Proximity matrix Príklad 1 Príklad 2 Príklad 3 Príklad 4 Príklad 5 Príklad 1 0 1 1 2 Príklad 2 0 0 1 1 Príklad 3 1 0 4 3 Príklad 4 1 1 4 3 Príklad 5 2 1 3 3 Class Outlier Factor Outlier factor = sum of three different measures of proximity or outlierness = Proximity to the members of the same class + Misclassication - proximity to the members of other classes and + Ambiguity measure – a percentage of ambiguous classification RF-OEX Detection + explanation Applications E-shop: Clients vs. potential clients ZOO Educational data mining: Students with standard/non-standard study interval Intro to logic: Finding anomalous solutions IMDb Czech Parliament Data pre-processing .. or·ni·tor·rin·co (ornito-+ grego rhúgkos, -eos, focinho) Género de monotremos de corpo alongado e cujo focinho se assemelha a um bico de pato. ptakopysk vtákopysk dziobak ornithorynque schnabeltier Applications E-shop: Clients vs. potential clients ZOO Educational data mining: Students with standard/non-standard study interval Intro to logic: Finding anomalous solutions IMDb Czech Parliament Data pre-processing .. Teaching Logic: Finding student anomalous solutions Task: Build a resolution proof, 400 students, at least 3 tasks to solve Automated evaluation: error detection Two classes CORRECT, INCORRECT If a serious error appeared, the solution is classified as incorrect (ignoring typos) Teaching logic: Finding anomalous solutions (cont.) Search/discover students’ solutions which are unusual frequent pattern mining, frequent subgraphs One attribute for each higher-level generalized pattern true (occurrence of the pattern) and false (non-occurrence of the pattern). Class: occurrence or non-occurrence of the error of resolving on two literals at the same time Novel „solutions“ found, not recognised with the tool used IMDb Movie database: Funny reviews Search/discover reviews that do not correspond to positive or negative star evaluation Large Movie Review Dataset Each review represented as a list of word appearance Only 68 most frequent words in the dataset used Class negative *… **** positive *******…*** IMDb: Ambiguous or funny reviews A positive review with very poor actor ratings IMDb: Ambiguous or funny reviews A positive review of a film about extreme human poverty Applications E-shop: Clients vs. potential clients ZOO Educational data mining: Students with standard/non-standard study interval Intro to logic: Finding anomalous solutions IMDb Czech Parliament Data pre-processing .. Applications Current: Coming back to cleaning data Outlier filtering followed by supervised learning algorithm Can we improve performance by outlier filtering? Thanks for your attention popel@fi.muni.cz www.fi.muni.cz/~popel Literature L. Nezvalová, L. Torgo, K. Vaculík, L. Popelínský [AIMSA 2014] [IDA 2015] Han j. et al. Data Mining. Principles and Techniques. 3rd edition. He Z. et al. Mining Class Outliers: Concepts, Algorithms and Applications in CRM. Expert Systems and Applications, ESWA 2004, 27(4), pp. 681-697, 2004. Hewahi N.M. and Saad M.K. Class Outliers Mining: Distance-Based Approach. International Journal of Intelligent Systems and Technologies, Vol. 2, No. 1, pp 55-68, 2007. John G.H. Robust Decision Trees: Removing Outliers from Databases. Knowledge Discovery and Data Mining - KDD , pp. 174-179, 1995 Weiss G.M. Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter 6 (1), 7-19 ILP Idea Given E+ positive and E- negative examples and the background knowledge B, learn concept C and dual Concept C’ (swap positive and negative examples) Look for examples that if removed from the learning set do not change the description (logic program) of C and C’ significantly i.e. difference of coverage is smaller then a threshold = normal examples Idea Suppose A, a set of normal examples, is a subset of E+ E+ \ A = A’ … abnormal examples Given the k-max, the number of outliers, find the abnormal subset A’ of examples not greater than k-max. Explanation of an outlier: two theories. 1.  – rules that cover some of abnormal examples A^ - examples outside of A’ covered only by clauses that cover an example from A’ 2. - rules induced in absence of A’ and covers some of examples from A^ Literature Angiulli F. and Fasetti F. Outlier detection using Inductive Logic Programming. Proceedings of ICDM 2009. Han J. et al. Data Mining. Principles and Techniques. 3rd edition, 2012 He Z. et al. Mining Class Outliers: Concepts, Algorithms and Applications in CRM. Expert Systems and Applications, ESWA 2004, 27(4), pp. 681-697, 2004. Hewahi N.M. and Saad M.K. Class Outliers Mining: Distance-Based Approach. International Journal of Intelligent Systems and Technologies, Vol. 2, No. 1, pp 55-68, 2007. John G.H. Robust Decision Trees: Removing Outliers from Databases. Knowledge Discovery and Data Mining - KDD , pp. 174-179, 1995 Future/Ideas (Naive) bayes classifier, 2 classes, Sureness around 50% => outlier Similar to supervised methods VOC Pascal data, 2048 features by Resnet att504 <= 0.291603 | att1746 <= 0.653862: aeroplane (95.0/4.0) | att1746 > 0.653862: person (5.0) att504 > 0.291603 | att456 <= 1.082573 | | att268 <= 1.711109 | | | att1195 <= 1.121543: person (148.0/2.0) | | | att1195 > 1.121543 | | | | att1142 <= 0.340023: person (12.0/4.0) | | | | att1142 > 0.340023: aeroplane (2.0) | | att268 > 1.711109 | | | att521 <= 1.855855 | | | | att365 <= 0.007182: aeroplane (5.0) | | | | att365 > 0.007182: person (48.0/14.0) | | | att521 > 1.855855: aeroplane (13.0) | att456 > 1.082573 | | att1928 <= 0.140609: aeroplane (21.0/1.0) | | att1928 > 0.140609: person (3.0/1.0) Similar to supervised methods (cont.) att504 <= 0.291603 | att1746 <= 0.653862: aeroplane (95.0/4.0) | att1746 > 0.653862: person (5.0) att504 > 0.291603 | att456 <= 1.082573 | | att268 <= 1.711109 | | | att1195 <= 1.121543: person (148.0/2.0) | | | att1195 > 1.121543 | | | | att1142 <= 0.340023: person (12.0/4.0) | | | | att1142 > 0.340023: aeroplane (2.0) | | att268 > 1.711109 | | | att521 <= 1.855855 | | | | att365 <= 0.007182: aeroplane (5.0) | | | | att365 > 0.007182: person (48.0/14.0) | | | att521 > 1.855855: aeroplane (13.0) | att456 > 1.082573 | | att1928 <= 0.140609: aeroplane (21.0/1.0) | | att1928 > 0.140609: person (3.0/1.0) Class Outlier Detection – Random Forests !  After each tree is built, all of the data are run down the tree, and proximities are computed for each pair of cases: !  If two cases occupy the same terminal node, their proximity is increased by one. !  At the end of the run, the proximities are normalized by dividing by the number of trees. !  Define the average proximity from case n in class j to the rest of the training data class j as: !  The raw outlier measure for case n is defined as E-shop: Clients vs. potential clients