CS340 Machine learning: ROC curves

Performance measures for binary classifiers

• False negative rate = false rejection rate = type II error rate = FN / P = 1 - TPR
• Sensitivity = recall = true positive rate = hit rate = TP / P = 1 - FNR
• False positive rate = false acceptance rate = type I error rate = FP / N = 1 - specificity
• Specificity = TN / N = 1 - FPR
• Precision = positive predictive value (PPV) = TP / P̂, where P̂ = TP + FP is the number of predicted positives
• All of these can be read off the confusion matrix (contingency table).

Performance depends on threshold

• Declare x_n to be a positive if p(y = 1 | x_n) > θ; otherwise declare it to be a negative (y = 0):

    \hat{y}_n = 1 \iff p(y = 1 \mid x_n) > \theta

• The number of TPs and FPs depends on the threshold θ. As we change θ, we get different (TPR, FPR) points, where

    \mathrm{TPR} = p(\hat{y} = 1 \mid y = 1), \qquad \mathrm{FPR} = p(\hat{y} = 1 \mid y = 0)

Example

Example 1: the classifier ranks every positive above every negative.

    i   y_i   p(y_i = 1 | x_i)   ŷ_i(θ = 0)   ŷ_i(θ = 0.5)   ŷ_i(θ = 1)
    1    1         0.9                1             1              0
    2    1         0.8                1             1              0
    3    1         0.7                1             1              0
    4    1         0.6                1             1              0
    5    1         0.5                1             1              0
    6    0         0.4                1             0              0
    7    0         0.3                1             0              0
    8    0         0.2                1             0              0
    9    0         0.1                1             0              0

Example 2: the classifier mis-ranks one positive (i = 5) and one negative (i = 6).

    i   y_i   p(y_i = 1 | x_i)   ŷ_i(θ = 0)   ŷ_i(θ = 0.5)   ŷ_i(θ = 1)
    1    1         0.9                1             1              0
    2    1         0.8                1             1              0
    3    1         0.7                1             1              0
    4    1         0.6                1             1              0
    5    1         0.2                1             0              0
    6    0         0.6                1             1              0
    7    0         0.3                1             0              0
    8    0         0.2                1             0              0
    9    0         0.1                1             0              0

Performance measures

• EER (equal error rate, also called the crossover error rate): the error rate at the threshold where the false positive rate equals the false negative rate; smaller is better.
• AUC (area under the ROC curve): larger is better.
• Accuracy = (TP + TN) / (P + N)

Precision-recall curves

• Useful when the notion of a "negative" (and hence the FPR) is not well defined, or when there are too many negatives (rare event detection).
• Recall: of those that exist, how many did you find?

    \mathrm{recall} = p(\hat{y} = 1 \mid y = 1)

• Precision: of those that you found, how many are correct?

    \mathrm{prec} = p(y = 1 \mid \hat{y} = 1)

• The F-score is the harmonic mean of precision (P) and recall (R):

    F = \frac{2}{1/P + 1/R} = \frac{2PR}{R + P}

Word of caution

• Consider binary classifiers A, B, and C with the following joint distributions p(ŷ, y) (rows are ŷ, columns are y):

            A              B              C
          y=1   y=0      y=1   y=0      y=1    y=0
    ŷ=1   0.9   0.1      0.8   0        0.78   0
    ŷ=0   0     0        0.1   0.1      0.12   0.1

• Clearly A is useless, since it always predicts label 1, regardless of the input. Also, B is slightly better than C (less probability mass is wasted on the off-diagonal entries). Yet here are the performance metrics:

    Metric      A       B       C
    Accuracy    0.9     0.9     0.88
    Precision   0.9     1.0     1.0
    Recall      1.0     0.888   0.8667
    F-score     0.947   0.941   0.9286

• Accuracy ties the useless classifier A with B, and the F-score ranks A highest of the three.

Mutual information is a better measure

• The mutual information between the estimated and true labels is

    I(\hat{Y}, Y) = \sum_{\hat{y}=0}^{1} \sum_{y=0}^{1} p(\hat{y}, y) \log \frac{p(\hat{y}, y)}{p(\hat{y})\, p(y)}

• Adding this to the table above gives the intuitively correct ranking B > C > A:

    Metric               A     B        C
    Mutual information   0     0.1865   0.1735
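The threshold sweep in Example 2 above can be checked numerically. Below is a minimal Python sketch (not from the slides; the names y_true, scores, and roc_point are my own) that recomputes the (FPR, TPR) point at θ = 0, 0.5, and 1, and then approximates the AUC with the trapezoidal rule.

```python
# Example 2 from the slides: true labels y_i and predicted probabilities p(y_i = 1 | x_i).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.6, 0.3, 0.2, 0.1]

P = sum(y_true)          # number of positives
N = len(y_true) - P      # number of negatives

def roc_point(theta):
    """Return (FPR, TPR) when predicting y_hat = 1 iff p(y = 1 | x) > theta."""
    y_hat = [1 if s > theta else 0 for s in scores]
    tp = sum(1 for yh, y in zip(y_hat, y_true) if yh == 1 and y == 1)
    fp = sum(1 for yh, y in zip(y_hat, y_true) if yh == 1 and y == 0)
    return fp / N, tp / P

for theta in [0.0, 0.5, 1.0]:
    fpr, tpr = roc_point(theta)
    print(f"theta = {theta:3.1f}: FPR = {fpr:.2f}, TPR = {tpr:.2f}")

# Sweeping theta over all distinct scores traces out the ROC curve;
# theta = -1 gives the (1, 1) corner and theta = max(scores) gives (0, 0).
points = sorted(roc_point(t) for t in [-1.0] + sorted(set(scores)))
auc = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(f"AUC = {auc:.2f}")   # 0.85 for Example 2
```

At θ = 0.5 this reproduces the table: the mis-ranked pair in Example 2 gives TPR = 0.8 and FPR = 0.25 rather than the perfect (FPR, TPR) = (0, 1) of Example 1.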
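The same counts give precision, recall, and the F-score at a fixed threshold. A short sketch, again on Example 2 and with θ = 0.5 (this particular threshold choice is mine, not from the slides):

```python
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.6, 0.3, 0.2, 0.1]
theta = 0.5

y_hat = [1 if s > theta else 0 for s in scores]

tp = sum(1 for yh, y in zip(y_hat, y_true) if yh == 1 and y == 1)
fp = sum(1 for yh, y in zip(y_hat, y_true) if yh == 1 and y == 0)
fn = sum(1 for yh, y in zip(y_hat, y_true) if yh == 0 and y == 1)

precision = tp / (tp + fp)    # of those you found, how many are correct
recall    = tp / (tp + fn)    # of those that exist, how many did you find
f_score   = 2 * precision * recall / (precision + recall)   # harmonic mean

print(round(precision, 3), round(recall, 3), round(f_score, 3))   # 0.8 0.8 0.8
```

Varying θ and plotting the resulting (recall, precision) pairs gives the precision-recall curve.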
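Finally, the "word of caution" comparison can be reproduced from the three joint distributions p(ŷ, y). The sketch below (the dictionary layout is my own) computes accuracy, precision, recall, F-score, and mutual information; using the natural log reproduces the MI values 0, 0.1865, and 0.1735 from the table.

```python
from math import log

# Joint distributions p(y_hat, y) for classifiers A, B, C, keyed by (y_hat, y).
joints = {
    "A": {(1, 1): 0.90, (1, 0): 0.1, (0, 1): 0.00, (0, 0): 0.0},
    "B": {(1, 1): 0.80, (1, 0): 0.0, (0, 1): 0.10, (0, 0): 0.1},
    "C": {(1, 1): 0.78, (1, 0): 0.0, (0, 1): 0.12, (0, 0): 0.1},
}

for name, p in joints.items():
    acc  = p[(1, 1)] + p[(0, 0)]                     # p(y_hat = y)
    prec = p[(1, 1)] / (p[(1, 1)] + p[(1, 0)])       # p(y = 1 | y_hat = 1)
    rec  = p[(1, 1)] / (p[(1, 1)] + p[(0, 1)])       # p(y_hat = 1 | y = 1)
    f1   = 2 * prec * rec / (prec + rec)

    # Marginals and mutual information I(Y_hat, Y); zero cells contribute nothing.
    p_yh = {v: p[(v, 0)] + p[(v, 1)] for v in (0, 1)}
    p_y  = {v: p[(0, v)] + p[(1, v)] for v in (0, 1)}
    mi = sum(p[(yh, y)] * log(p[(yh, y)] / (p_yh[yh] * p_y[y]))
             for yh in (0, 1) for y in (0, 1) if p[(yh, y)] > 0)

    print(f"{name}: acc={acc:.2f} prec={prec:.3f} rec={rec:.4f} F={f1:.4f} MI={mi:.4f}")
```

Mutual information ranks the classifiers B > C > A, matching the slides, whereas the F-score puts the useless classifier A first.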