CS340 Machine learning: ROC curves

Performance measures for binary classifiers

• False negative rate = false rejection rate = type II error rate = FN / P = 1 - TPR
• Sensitivity = recall = true positive rate = hit rate = TP / P = 1 - FNR
• False positive rate = false acceptance rate = type I error rate = FP / N = 1 - specificity
• Specificity = TN / N = 1 - FPR
• Precision = positive predictive value (PPV) = TP / P̂, where P̂ = TP + FP is the number of predicted positives
• All of these can be read off the confusion matrix (contingency table).

Performance depends on threshold

• Declare x_n to be a positive if p(y = 1 | x_n) > θ; otherwise declare it to be a negative (y = 0):

    \hat{y}_n = 1 \iff p(y = 1 \mid x_n) > \theta

• The number of TPs and FPs depends on the threshold θ. As we change θ, we get different (TPR, FPR) points, where

    \mathrm{TPR} = p(\hat{y} = 1 \mid y = 1), \qquad \mathrm{FPR} = p(\hat{y} = 1 \mid y = 0)

Example

Example 1: the classifier ranks every positive above every negative.

    i   y_i   p(y_i = 1 | x_i)   ŷ_i(θ = 0)   ŷ_i(θ = 0.5)   ŷ_i(θ = 1)
    1    1         0.9                1             1              0
    2    1         0.8                1             1              0
    3    1         0.7                1             1              0
    4    1         0.6                1             1              0
    5    1         0.5                1             1              0
    6    0         0.4                1             0              0
    7    0         0.3                1             0              0
    8    0         0.2                1             0              0
    9    0         0.1                1             0              0

Example 2: the classifier mis-ranks one positive (i = 5) and one negative (i = 6).

    i   y_i   p(y_i = 1 | x_i)   ŷ_i(θ = 0)   ŷ_i(θ = 0.5)   ŷ_i(θ = 1)
    1    1         0.9                1             1              0
    2    1         0.8                1             1              0
    3    1         0.7                1             1              0
    4    1         0.6                1             1              0
    5    1         0.2                1             0              0
    6    0         0.6                1             1              0
    7    0         0.3                1             0              0
    8    0         0.2                1             0              0
    9    0         0.1                1             0              0

Performance measures

• EER (equal error rate, also called the crossover error rate): the error rate at the threshold where the false positive rate equals the false negative rate; smaller is better.
• AUC (area under the ROC curve): larger is better.
• Accuracy = (TP + TN) / (P + N)

Precision-recall curves

• Useful when the notion of a "negative" (and hence the FPR) is not well defined, or when there are too many negatives (rare event detection).
• Recall: of those that exist, how many did you find?

    \mathrm{recall} = p(\hat{y} = 1 \mid y = 1)

• Precision: of those that you found, how many are correct?

    \mathrm{prec} = p(y = 1 \mid \hat{y} = 1)

• The F-score is the harmonic mean of precision (P) and recall (R):

    F = \frac{2}{1/P + 1/R} = \frac{2PR}{R + P}

Word of caution

• Consider binary classifiers A, B, and C with the following joint distributions p(ŷ, y) (rows are ŷ, columns are y):

            A              B              C
          y=1   y=0      y=1   y=0      y=1    y=0
    ŷ=1   0.9   0.1      0.8   0        0.78   0
    ŷ=0   0     0        0.1   0.1      0.12   0.1

• Clearly A is useless, since it always predicts label 1, regardless of the input. Also, B is slightly better than C (less probability mass is wasted on the off-diagonal entries). Yet here are the performance metrics:

    Metric      A       B       C
    Accuracy    0.9     0.9     0.88
    Precision   0.9     1.0     1.0
    Recall      1.0     0.888   0.8667
    F-score     0.947   0.941   0.9286

• Accuracy ties the useless classifier A with B, and the F-score ranks A highest of the three.

Mutual information is a better measure

• The mutual information between the estimated and true labels is

    I(\hat{Y}, Y) = \sum_{\hat{y}=0}^{1} \sum_{y=0}^{1} p(\hat{y}, y) \log \frac{p(\hat{y}, y)}{p(\hat{y})\, p(y)}

• Adding this to the table above gives the intuitively correct ranking B > C > A:

    Metric               A     B        C
    Mutual information   0     0.1865   0.1735
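The threshold sweep in Example 2 above can be checked numerically. Below is a minimal Python sketch (not from the slides; the names y_true, scores, and roc_point are my own) that recomputes the (FPR, TPR) point at θ = 0, 0.5, and 1, and then approximates the AUC with the trapezoidal rule.

```python
# Example 2 from the slides: true labels y_i and predicted probabilities p(y_i = 1 | x_i).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.6, 0.3, 0.2, 0.1]

P = sum(y_true)          # number of positives
N = len(y_true) - P      # number of negatives

def roc_point(theta):
    """Return (FPR, TPR) when predicting y_hat = 1 iff p(y = 1 | x) > theta."""
    y_hat = [1 if s > theta else 0 for s in scores]
    tp = sum(1 for yh, y in zip(y_hat, y_true) if yh == 1 and y == 1)
    fp = sum(1 for yh, y in zip(y_hat, y_true) if yh == 1 and y == 0)
    return fp / N, tp / P

for theta in [0.0, 0.5, 1.0]:
    fpr, tpr = roc_point(theta)
    print(f"theta = {theta:3.1f}: FPR = {fpr:.2f}, TPR = {tpr:.2f}")

# Sweeping theta over all distinct scores traces out the ROC curve;
# theta = -1 gives the (1, 1) corner and theta = max(scores) gives (0, 0).
points = sorted(roc_point(t) for t in [-1.0] + sorted(set(scores)))
auc = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(f"AUC = {auc:.2f}")   # 0.85 for Example 2
```

At θ = 0.5 this reproduces the table: the mis-ranked pair in Example 2 gives TPR = 0.8 and FPR = 0.25 rather than the perfect (FPR, TPR) = (0, 1) of Example 1.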
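The same counts give precision, recall, and the F-score at a fixed threshold. A short sketch, again on Example 2 and with θ = 0.5 (this particular threshold choice is mine, not from the slides):

```python
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.6, 0.3, 0.2, 0.1]
theta = 0.5

y_hat = [1 if s > theta else 0 for s in scores]

tp = sum(1 for yh, y in zip(y_hat, y_true) if yh == 1 and y == 1)
fp = sum(1 for yh, y in zip(y_hat, y_true) if yh == 1 and y == 0)
fn = sum(1 for yh, y in zip(y_hat, y_true) if yh == 0 and y == 1)

precision = tp / (tp + fp)    # of those you found, how many are correct
recall    = tp / (tp + fn)    # of those that exist, how many did you find
f_score   = 2 * precision * recall / (precision + recall)   # harmonic mean

print(round(precision, 3), round(recall, 3), round(f_score, 3))   # 0.8 0.8 0.8
```

Varying θ and plotting the resulting (recall, precision) pairs gives the precision-recall curve.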
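Finally, the "word of caution" comparison can be reproduced from the three joint distributions p(ŷ, y). The sketch below (the dictionary layout is my own) computes accuracy, precision, recall, F-score, and mutual information; using the natural log reproduces the MI values 0, 0.1865, and 0.1735 from the table.

```python
from math import log

# Joint distributions p(y_hat, y) for classifiers A, B, C, keyed by (y_hat, y).
joints = {
    "A": {(1, 1): 0.90, (1, 0): 0.1, (0, 1): 0.00, (0, 0): 0.0},
    "B": {(1, 1): 0.80, (1, 0): 0.0, (0, 1): 0.10, (0, 0): 0.1},
    "C": {(1, 1): 0.78, (1, 0): 0.0, (0, 1): 0.12, (0, 0): 0.1},
}

for name, p in joints.items():
    acc  = p[(1, 1)] + p[(0, 0)]                     # p(y_hat = y)
    prec = p[(1, 1)] / (p[(1, 1)] + p[(1, 0)])       # p(y = 1 | y_hat = 1)
    rec  = p[(1, 1)] / (p[(1, 1)] + p[(0, 1)])       # p(y_hat = 1 | y = 1)
    f1   = 2 * prec * rec / (prec + rec)

    # Marginals and mutual information I(Y_hat, Y); zero cells contribute nothing.
    p_yh = {v: p[(v, 0)] + p[(v, 1)] for v in (0, 1)}
    p_y  = {v: p[(0, v)] + p[(1, v)] for v in (0, 1)}
    mi = sum(p[(yh, y)] * log(p[(yh, y)] / (p_yh[yh] * p_y[y]))
             for yh in (0, 1) for y in (0, 1) if p[(yh, y)] > 0)

    print(f"{name}: acc={acc:.2f} prec={prec:.3f} rec={rec:.4f} F={f1:.4f} MI={mi:.4f}")
```

Mutual information ranks the classifiers B > C > A, matching the slides, whereas the F-score puts the useless classifier A first.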