RF-OEX Outlier factor and anomaly detection Petr Matonoha 485185@mail.muni.cz April 20, 2021 Table of Contents Random forest RF-OEX Experiments P. Matonoha ·RF-OEX ·April 20, 2021 2 / 27 Random forest Random forest Random forest A random forest is a group of decision trees, which make an independent decision, then the random forest makes the decision based on a majority vote. The main advantage of the random forest is that it does not overfit. P. Matonoha ·RF-OEX ·April 20, 2021 3 / 27 Random forest Random forest classifier Random forest classifier A random forest classifier is composed of tree-structured classifiers h(x, Θk), k = 1, 2, . . ., where the Θk are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x P. Matonoha ·RF-OEX ·April 20, 2021 4 / 27 RF-OEX RF-OEX RF-OEX Random Forest-outlier explanation (RF-OEX) is an outlier detection method focused on class-based anomalies. The main idea of RF-OEX is using proximity matrix differently than classical random forest. The main difference is how RF-OEX exploits information about the class label. The outlier factor of instance p is calculated as the sum of three different values. P. Matonoha ·RF-OEX ·April 20, 2021 5 / 27 RF-OEX Proximity matrix Proximity matrix Proximity value between element a and b is the number of trees that classified a as the same class as b divided by the total number of trees. The proximity matrix contains proximity values between all elements. P. Matonoha ·RF-OEX ·April 20, 2021 6 / 27 RF-OEX Outlier factor Outlier factor OF(p) = OF1(p)same−class + OF2(p)misclassification + OF3(p)ambiguity P. Matonoha ·RF-OEX ·April 20, 2021 7 / 27 RF-OEX OF1 OF1 OF1(p)same−class: OF1(p) = 1 cl(p)=cl(q) Prox(p,q), where cl(p) is the class of the point p. The result have to be normalized, because classes are not uniformly distributed. P. Matonoha ·RF-OEX ·April 20, 2021 8 / 27 RF-OEX OF2 OF2 OF2(p)misclassification: We define TopCp (p) as |Cp| (all points with class cl(p)) points, which are the closest to p. To be comparable with OF1 and OF3, we multiply the value with the constant c = maxq∈DBOF1(q) 4 , OF2(p) = c · |{q | cl(q)=cl(p) & q∈Top|Cp|(p)}| |Cp| P. Matonoha ·RF-OEX ·April 20, 2021 9 / 27 RF-OEX OF3 OF3 OF3(p)ambiguity: For increasing importance of outliers, which are distant from other points. OF3(p) = c · |Cp|− q∈Top|Cp|(p) Prox(p,q) |Cp| P. Matonoha ·RF-OEX ·April 20, 2021 10 / 27 Experiments Setting Setting Classifiers: J48, JRip, PART, SMO Removed: 1, 3, 5 % of outliers Datasets: We are expecting improvement in all used datasets, with one exception vehicle dataset. P. Matonoha ·RF-OEX ·April 20, 2021 11 / 27 Experiments Different weights Different weights Testing OF(p) = A · OF1(p) + B · OF2(p) + C · OF3(p) and OF(p) = OF1(p) · OF3(p) P. Matonoha ·RF-OEX ·April 20, 2021 12 / 27 Experiments Different weights dataset 111 110 001 101-multi analcatdata_dmft 19.16 19.25 19.2 19.47 balance-scale 82.11 82.41 82.93 82.16 blood-transfusion-service 77.47 77.42 77.64 77.63 cmc 51.2 51.42 51.58 51.11 dresses-sales 57.9 57.93 59.33 57.9 eucalyptus 55.47 54.84 55.92 54.88 ilpd 68.98 68.78 69.8 68.87 pima-diabetes 75.49 75.51 75.04 75.49 qsar-biodeg 83.7 83.51 83.93 83.98 vehicle 70.69 70.78 71.1 70.73 comparison with 111 0 0 0.4304 0.0024 P. Matonoha ·RF-OEX ·April 20, 2021 13 / 27 Experiments Different weights dataset 111 001 102 103 104 112 analcatdata_dmft 19.44 19.48 19.25 19.36 19.47 19.25 balance-scale 82.23 82.45 82.04 82.44 82.35 81.84 blood-transfusion-service 77.53 77.83 77.75 77.75 77.71 77.63 cmc 51.1 51.09 51.4 51.27 51.52 51.4 dresses-sales 58.75 58.28 58.65 58.55 58.38 58.65 eucalyptus 54.96 56.05 55.05 55.61 55.37 54.92 ilpd 69.21 69.44 69.25 69.54 69.67 69.34 pima-diabetes 75.26 75.69 75.4 75.67 75.51 75.42 qsar-biodeg 83.97 84.03 83.82 83.97 84.06 83.97 vehicle 70.93 70.49 70.44 70.77 70.67 70.44 comparison with 111 0 0.146 -0.032 0.155 0.132 -0.051 P. Matonoha ·RF-OEX ·April 20, 2021 14 / 27 Experiments Ground truth data Ground truth data Testing individual OFi on ground truth data P. Matonoha ·RF-OEX ·April 20, 2021 15 / 27 Experiments Ground truth data Figure: PositivesNegativesAroundPositive Anomaly 1: OF3: 5.62 % OF1 94.37 % P. Matonoha ·RF-OEX ·April 20, 2021 16 / 27 Experiments Ground truth data Figure: Rastr-3x4-0 Anomaly 1: OF3: 9.05 % OF2 4.97 % OF1 85.96 % Anomaly 2: OF3: 4.9 % OF1 95.09 % P. Matonoha ·RF-OEX ·April 20, 2021 17 / 27 Experiments Ground truth data Figure: TwoCircles Anomaly 1: OF3: 7.44 % OF2 2.08 % OF1 90.47 % Anomaly 2: OF3: 8.27 % OF2 3.39 % OF1 88.33 % P. Matonoha ·RF-OEX ·April 20, 2021 18 / 27 Experiments Influence OFi on OF Influence OFi on OF Testing influence of OFi on used datasets. Influence OFi = q∈DB |OFi(q)| |OF1(q)|+|OF2(q)|+|OF3(q)| |DB| P. Matonoha ·RF-OEX ·April 20, 2021 19 / 27 Experiments Influence OFi on OF 5 % elements with the highest outlier factor dataset OF1 % OF2 % OF3 % analcatdata_dmft 73.08 0.08 26.82 balance-scale 51.6 2.65 45.74 blood-transfusion-service 81.85 1.96 16.17 dresses-sales 88.34 0.25 11.39 eucalyptus 65.69 0.38 33.92 ilpd 84.39 0.46 15.13 pima-diabetes 86.93 0.39 12.66 qsar-biodeg 94.1 0.0 5.89 average 78.25 0.77 20.97 P. Matonoha ·RF-OEX ·April 20, 2021 20 / 27 Experiments Influence OFi on OF All elements dataset OF1 % OF2 % OF3 % analcatdata_dmft 56.42 0.0 43.56 balance-scale 50.21 0.32 49.45 blood-transfusion-service 66.5 0.7 32.78 dresses-sales 67.71 0.02 32.25 eucalyptus 42.76 0.08 57.15 ilpd 67.93 0.03 32.02 pima-diabetes 73.37 0.05 26.57 qsar-biodeg 85.9 0.0 14.09 average 63.85 0.15 35.98 P. Matonoha ·RF-OEX ·April 20, 2021 21 / 27 Experiments Influence OFi on OF Results The highest influence of OF2 for 5 % elements with the highest outlier factor: balance-scale 2.65 % The highest influence of OF2 for all elements: blood-transfusion-service 0.7 % P. Matonoha ·RF-OEX ·April 20, 2021 22 / 27 Experiments Linear regression Linear regression Our problem is choosing the best weights for OF. To find them we decided to use linear regression. Influence of OF2 is low in all shown datasets, hence we decided to not use it in linear regression For linear regression were used weights: 101, 100, 001, 102, 201, 103, 301, 104, 401, 203, 302 We used basic model sklearn.linear_model.LinearRegression. We were trying to find coefficients function, such that gain_clfBest = A · OF1 + B · OF3 + C We were using leave one out method, to be able to test results. P. Matonoha ·RF-OEX ·April 20, 2021 23 / 27 Experiments Linear regression Linear regression with all weights without dataset score A B analcatdata_dmft 0.3994 -0.00078 -0.00002 balance-scale 0.47336 -0.00075 -0.00004 blood-transfusion-service 0.47385 -0.00077 -0.00005 dresses-sales 0.4029 -0.0006 0.00004 eucalyptus 0.42369 -0.00049 -0.00003 ilpd 0.54039 -0.00058 0.00015 pima-diabetes 0.47233 -0.00064 -0.00002 qsar-biodeg 0.41695 -0.00071 -0.00002 P. Matonoha ·RF-OEX ·April 20, 2021 24 / 27 Experiments Linear regression Linear regression with six the best performing weights without dataset score A B analcatdata_dmft 0.53131 -0.00112 -0.00009 balance-scale 0.7033 -0.00192 0.00011 blood-transfusion-service 0.78788 -0.00197 0.00007 dresses-sales 0.51715 -0.00055 -0.00007 eucalyptus 0.4756 -0.00104 -0.00001 ilpd 0.77981 -0.00139 0.00024 pima-diabetes 0.76615 -0.00172 0.00011 qsar-biodeg 0.76329 -0.00209 0.00015 P. Matonoha ·RF-OEX ·April 20, 2021 25 / 27 Random forest RF-OEX Experiments Thank you for your attention!