Nathalie Japkowicz                          Mohak Shah
University of Ottawa, Canada                GE Global Research, USA
                                            Northern Illinois University, USA
nat@site.uottawa.ca                         mohak@mohakshah.com

"Evaluation is the key to making real progress in data mining"
[Witten & Frank, 2005], p. 143

Yet…
- Evaluation gives us a lot of, sometimes contradictory, information. How do we make sense of it all?
- Evaluation standards differ from person to person and discipline to discipline. How do we decide which standards are right for us?
- Evaluation gives us support for one theory over another, but rarely, if ever, certainty. So where does that leave us?

Example I: Which classifier is better?
There are almost as many answers as there are performance measures! (e.g., UCI Breast Cancer)

  Algo    Acc     RMSE    TPR   FPR   Prec   Rec   F     AUC   Info S
  NB      71.7    .4534   .44   .16   .53    .44   .48   .7    48.11
  C4.5    75.5    .4324   .27   .04   .74    .27   .4    .59   34.28
  3NN     72.4    .5101   .32   .1    .56    .32   .41   .63   43.37
  Ripp    71      .4494   .37   .14   .52    .37   .43   .6    22.34
  SVM     69.6    .5515   .33   .15   .48    .33   .39   .59   54.89
  Bagg    67.8    .4518   .17   .1    .4     .17   .23   .63   11.30
  Boost   70.3    .4329   .42   .18   .5     .42   .46   .7    34.48
  RanF    69.23   .47     .33   .15   .48    .33   .39   .63   20.78

Example I: Which classifier is better?
Ranking the results (the original slides colour-code acceptable vs. questionable contradictions)

  Algo    Acc   RMSE   TPR   FPR   Prec   Rec   F   AUC   Info S
  NB      3     5      1     7     3      1     1   1     2
  C4.5    1     1      7     1     1      7     5   7     5
  3NN     2     7      6     2     2      6     4   3     3
  Ripp    4     3      3     4     4      3     3   6     6
  SVM     6     8      4     5     5      4     6   7     1
  Bagg    8     4      8     2     8      8     8   3     8
  Boost   5     2      2     8     7      2     2   1     4
  RanF    7     6      4     5     5      4     7   3     7

Example II: What's wrong with our results?
- A new algorithm was designed based on data that had been provided to us by the National Institutes of Health (NIH). [Wang and Japkowicz, 2008, 2010]
- According to the standard evaluation practices in machine learning, we found our results to be significantly better than the state of the art.
- The conference version of the paper received a best paper award, and the work is receiving a bit of attention in the ML community (35+ citations).
- NIH would not consider our algorithm because it was probably not significantly better (if at all) than others, as they felt we had not conducted our evaluation according to their standards.

Example III: What do our results mean?
- In an experiment comparing the accuracy of eight classifiers on ten domains, we concluded that a one-way repeated-measures ANOVA rejects the hypothesis that the eight classifiers perform similarly on the ten domains considered, at the 95% significance level [Japkowicz & Shah, 2011], p. 273.
- Can we be sure of our results?
- What if we are in the 5% not vouched for by the statistical test?
- Was the ANOVA test truly applicable to our specific data? Is there any way to verify that?
- Can we have any 100% guarantee on our results?

Purpose of the tutorial
- To give the audience an appreciation for the complexity and uncertainty of evaluation.
- To discuss the different aspects of evaluation, to present an overview of the issues that come up in their context, and to suggest what the possible remedies may be.
- To present some of the newer, lesser-known results in classifier evaluation.
- Finally, to direct the audience to some available, easy-to-use resources that can help improve the robustness of evaluation in the field.

Book on which the tutorial is based
Evaluating Learning Algorithms: A Classification Perspective
Nathalie Japkowicz & Mohak Shah, Cambridge University Press, 2011

Review: "This treasure-trove of a book covers the important topic of performance evaluation of machine learning algorithms in a very comprehensive and lucid fashion. As Japkowicz and Shah point out, performance evaluation is too often a formulaic affair in machine learning, with scant appreciation of the appropriateness of the evaluation methods used or the interpretation of the results obtained. This book makes significant steps in rectifying this situation by providing a reasoned catalogue of evaluation measures and methods, written specifically for a machine learning audience and accompanied by concrete machine learning examples and implementations in R. This is truly a book to be savoured by machine learning professionals, and required reading for Ph.D. students." — Peter A. Flach, University of Bristol

The main steps of evaluation
[Figure: the main steps of the evaluation process]

What these steps depend on
These steps depend on the purpose of the evaluation:
- Comparison of a new algorithm to other (generic or application-specific) classifiers on a specific domain (e.g., when proposing a novel learning algorithm)
- Comparison of a new generic algorithm to other generic ones on a set of benchmark domains (e.g., to demonstrate the general effectiveness of the new approach against other approaches)
- Characterization of generic classifiers on benchmark domains (e.g., to study the algorithms' behavior on general domains for subsequent use)
- Comparison of multiple classifiers on a specific domain (e.g., to find the best algorithm for a given application task)

Outline of the tutorial
- Performance measures
- Error estimation / resampling
- Statistical significance testing
- Data set selection and evaluation benchmark design
- Available resources

Performance Measures

Outline
- Ontology
- Confusion-matrix-based measures
- Graphical measures: ROC analysis and AUC, cost curves
- Recent developments on the ROC front: smooth ROC curves, the H measure
- Other curves: PR curves, lift, etc.
- Probabilistic measures
- Others: qualitative measures, combination frameworks, visualization, agreement statistics

Overview of Performance Measures
[Figure: ontology of performance measures]

Confusion-Matrix-Based Performance Measures

A confusion matrix:

                         True class
  Hypothesized class     Pos        Neg
  Yes                    TP         FP
  No                     FN         TN
                         P = TP+FN  N = FP+TN

- Multi-class focus:
  - Accuracy = (TP+TN)/(P+N)
- Single-class focus:
  - Precision = TP/(TP+FP)
  - Recall = TP/P
  - Fallout = FP/N
  - Sensitivity = TP/(TP+FN)
  - Specificity = TN/(FP+TN)

Aliases and other measures
- Accuracy = 1 (or 100%) - Error rate
- Recall = TPR = Hit rate = Sensitivity
- Fallout = FPR = False alarm rate
- Precision = Positive Predictive Value (PPV)
- Negative Predictive Value (NPV) = TN/(TN+FN)
- Likelihood ratios:
  - LR+ = Sensitivity/(1-Specificity)
  - LR- = (1-Sensitivity)/Specificity

Pairs of Measures and Compounded Measures
- Precision / Recall
- Sensitivity / Specificity
- Likelihood ratios (LR+ and LR-)
- Positive / Negative Predictive Values
- F-measure:
  - Fα = [(1+α)(Precision x Recall)] / [(α x Precision) + Recall],  with α = 1, 2, 0.5
- G-mean:
  - two-class version: G-mean = sqrt(TPR x TNR)
  - single-class version: G-mean = sqrt(TPR x Precision)

Skew and Cost Considerations
- Skew-sensitive assessments, e.g., class ratios:
  - Ratio+ = (TP+FN)/(FP+TN)
  - Ratio- = (FP+TN)/(TP+FN)
  - TPR can be weighted by Ratio+ and TNR can be weighted by Ratio-.
- Asymmetric misclassification costs. If the costs are known:
  - weight each non-diagonal entry of the confusion matrix by the appropriate misclassification cost;
  - compute the weighted version of any previously discussed evaluation measure.

Some issues with performance measures

        True class                 True class
        Pos    Neg                 Pos    Neg
  Yes   200    100           Yes   400    300
  No    300    400           No    100    200
        P=500  N=500               P=500  N=500

- Both classifiers obtain 60% accuracy.
- They exhibit very different behaviours:
  - on the left: weak positive recognition rate / strong negative recognition rate;
  - on the right: strong positive recognition rate / weak negative recognition rate.

Some issues with performance measures (cont'd)

        True class                 True class
        Pos    Neg                 Pos    Neg
  Yes   500    5             Yes   450    1
  No    0      0             No    50     4
        P=500  N=5                 P=500  N=5

- The classifier on the left obtains 99.01% accuracy while the classifier on the right obtains 89.9%.
- Yet, the classifier on the right is much more sophisticated than the classifier on the left, which just labels everything as positive and misses all the negative examples.

Some issues with performance measures (cont'd)

        True class                 True class
        Pos    Neg                 Pos    Neg
  Yes   200    100           Yes   200    100
  No    300    400           No    300    0
        P=500  N=500               P=500  N=100

- Both classifiers obtain the same precision and recall values of 66.7% and 40% (note: the data sets are different).
- They exhibit very different behaviours:
  - same positive recognition rate;
  - extremely different negative recognition rate: strong on the left / nil on the right.
- Note: accuracy has no problem catching this!
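To make these definitions concrete, here is a minimal R sketch (not the book's code) that computes the main confusion-matrix-based measures for the left-hand matrix of the last example; it reproduces the 66.7% precision and 40% recall quoted above.

  # Minimal sketch: confusion-matrix-based measures for a binary problem.
  # Counts taken from the left-hand example above (P = N = 500).
  TP <- 200; FP <- 100; FN <- 300; TN <- 400

  accuracy    <- (TP + TN) / (TP + FP + FN + TN)
  precision   <- TP / (TP + FP)
  recall      <- TP / (TP + FN)        # = TPR = sensitivity
  fallout     <- FP / (FP + TN)        # = FPR
  specificity <- TN / (FP + TN)
  f1          <- 2 * precision * recall / (precision + recall)   # Fα with α = 1
  gmean       <- sqrt(recall * specificity)                       # two-class G-mean

  round(c(accuracy = accuracy, precision = precision, recall = recall,
          fallout = fallout, specificity = specificity, F1 = f1, Gmean = gmean), 3)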
Some issues with performance measures (cont'd)
- This and similar analyses reveal that each performance measure conveys some information and hides other information.
- Therefore, there is an information trade-off carried through the different metrics.
- A practitioner has to choose, quite carefully, the quantity s/he is interested in monitoring, while keeping in mind that other values matter as well.
- Classifiers may rank differently depending on the metrics along which they are being compared.

Graphical Measures: ROC Analysis, AUC, Cost Curves
- Many researchers have now adopted the AUC (the area under the ROC curve) to evaluate their results.
- The principal advantage of the AUC is that it is more robust than accuracy in class-imbalanced situations, where the default classifier that always outputs the most probable label yields exceedingly good estimates of performance.
- The AUC takes the class distribution into consideration and, therefore, gives more weight to correct classification of the minority class, thus outputting fairer results.

ROC Analysis
- ROC analysis is applicable to scoring classifiers rather than merely deterministic ones.
- ROC graphs are insensitive to class imbalances (or skew) since they consider the TPR and FPR independently and do not take the class distribution into account. They therefore give very nice overall comparisons of two systems.
- However, practically speaking, ROC graphs ignore the skew which the performance measure of interest (pmi) usually takes into consideration. Therefore, at model selection time, it is wise to consider isometrics for the pmi, i.e., lines in ROC space along which the same performance value is obtained for that pmi. Different skew ratios are represented by different isolines, making the selection of the optimal operating point quite easy.

AUC
[Figure: the AUC as the area under the ROC curve]

Cost Curves
Cost curves are more practical than ROC curves because they tell us for what class probabilities one classifier is preferable over the other. ROC curves only tell us that sometimes one classifier is preferable over the other.

Recent Developments I: Smooth ROC Curves (smROC)
[Klement, Flach, Japkowicz & Matwin, ECML'2011]
- ROC analysis visualizes the ranking performance of classifiers but ignores any kind of scoring information.
- Scores are important as they tell us whether the difference between two rankings is large or small, but they are difficult to integrate because, unlike probabilities, each scoring system is different from one classifier to the next and their integration is not obvious.
- smROC extends the ROC curve to include normalized scores as smoothing weights added to its line segments.
- smROC was shown to capture the similarity between similar classification models better than ROC, and to capture differences between classification models that ROC is barely sensitive to, if at all.
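ROC curves and the AUC can be obtained directly from any scoring classifier's outputs. Below is a minimal sketch using the ROCR package (the same package pointed to in the resources section at the end of the tutorial); the scores and labels are made up for illustration.

  # Minimal ROC/AUC sketch with the ROCR package (scores and labels are made up).
  library(ROCR)

  scores <- c(0.95, 0.90, 0.80, 0.75, 0.60, 0.40, 0.30, 0.20)   # classifier scores
  labels <- c(1,    1,    0,    1,    0,    1,    0,    0)      # true classes

  pred <- prediction(scores, labels)
  roc  <- performance(pred, measure = "tpr", x.measure = "fpr")
  plot(roc)                                          # the ROC curve
  auc <- performance(pred, measure = "auc")@y.values[[1]]
  auc                                                # the area under the curve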
Recent Developments II: The H Measure (I)
- A criticism of the AUC was given by [Hand, 2009]. The argument goes along the following lines: the misclassification cost distributions (and hence the skew-ratio distributions) used by the AUC are different for different classifiers. Therefore, we may be comparing apples and oranges, as the AUC may give more weight to misclassifying a point by classifier A than it does by classifier B.
- To address this problem, [Hand, 2009] proposed the H measure.
- In essence, the H measure allows the user to select a cost-weight function that is equal for all the classifiers under comparison and thus allows for fairer comparisons.

Recent Developments II: The H Measure (II)
- [Flach, Hernandez-Orallo and Ferri, ICML'2011] argued that:
  - [Hand, 2009]'s criticism holds when the AUC is interpreted as a classification measure, but not when it is interpreted as a pure ranking measure.
  - When [Hand, 2009]'s criticism holds, the problem occurs only if the thresholds used to construct the ROC curve are assumed to be optimal.
  - When constructing the ROC curve using all the points in the data set as thresholds, rather than only optimal ones, the AUC measure can be shown to be coherent and the H measure unnecessary, at least from a theoretical point of view.
  - The H measure is a linear transformation of the value that the area under the cost curve would produce.

Other Curves
- Other research communities have produced and used graphical evaluation techniques similar to ROC curves:
  - Lift charts plot the number of true positives versus the overall number of examples in the data set that was considered for the specific true positive number on the vertical axis.
  - PR curves plot the precision as a function of the recall.
  - DET curves are like ROC curves, but plot the FNR rather than the TPR on the vertical axis. They are also typically log-scaled.
  - Relative superiority graphs are more akin to cost curves as they consider costs. They map the ratios of costs into the [0,1] interval.

Probabilistic Measures I: RMSE
- The root-mean-squared error (RMSE) is usually used for regression, but can also be used with probabilistic classifiers. The formula for the RMSE is:

  RMSE(f) = sqrt( (1/m) * Σ_{i=1..m} (f(x_i) - y_i)^2 )

  where m is the number of test examples, f(x_i) is the classifier's probabilistic output on x_i, and y_i is the actual label.

  ID   f(x_i)   y_i   (f(x_i) - y_i)^2
  1    .95      1     .0025
  2    .6       0     .36
  3    .8       1     .04
  4    .75      0     .5625
  5    .9       1     .01

  RMSE(f) = sqrt((.0025 + .36 + .04 + .5625 + .01)/5) = sqrt(0.975/5) = 0.4416
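A quick R check of the RMSE computation in the table above (a sketch, not the book's code):

  # RMSE of a probabilistic classifier: the five test examples from the table above.
  f <- c(0.95, 0.60, 0.80, 0.75, 0.90)   # probabilistic outputs f(x_i)
  y <- c(1,    0,    1,    0,    1)      # actual labels y_i

  rmse <- sqrt(mean((f - y)^2))
  rmse                                    # approximately 0.4416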
Probabilistic Measures II: Information Score
- Kononenko and Bratko's Information Score assumes a prior P(y) on the labels, which can be estimated by the class distribution of the training data. The output (posterior probability) of a probabilistic classifier is P(y|f), where f is the classifier. I(a) is the indicator function.

  IS(x) = I(P(y|f) >= P(y)) * (-log(P(y)) + log(P(y|f)))
        + I(P(y|f) <  P(y)) * (-log(1-P(y)) + log(1-P(y|f)))

  IS_avg = (1/m) * Σ_{i=1..m} IS(x_i)

  (logarithms are base 2)

  x    P(y_i|f)   y_i   IS(x)
  1    .95        1     0.66
  2    .6         0     0
  3    .8         1     .42
  4    .75        0     .32
  5    .9         1     .59

  P(y=1) = 3/5 = 0.6,  P(y=0) = 2/5 = 0.4
  IS(x_1) = 1 * (-log(.6) + log(.95)) + 0 * (-log(.4) + log(.05)) = 0.66
  IS_avg = (0.66 + 0 + .42 + .32 + .59)/5 = 0.40

Other Measures I: A Multi-Criteria Measure — The Efficiency Method
- The efficiency method is a framework for combining various measures, including qualitative metrics such as interestingness. It considers the positive metrics, for which higher values are desirable (e.g., accuracy), and the negative metrics, for which lower values are desirable (e.g., computational time).

  εS(f) = Σ_i w_i pm_i+(f) / Σ_j w_j nm_j-(f)

  where pm_i+ are the positive metrics and nm_j- are the negative metrics. The w_i's and w_j's have to be determined, and a solution is proposed that uses linear programming.

Other Measures II: Visualization-Based Measure
[Japkowicz, Sanghi, Tischer, ECML'08, ISAIM'08]
- Classifier evaluation can be viewed as a problem of analyzing high-dimensional data.
- The performance measures currently used are but one class of projections that can be applied to these data.
- Why not apply other (standard or not) projections to the data, with various kinds of (standard or not) distance measures?

Confusion matrices for a single classifier on three domains:

        True class               True class               True class
        Pos   Neg                Pos   Neg                Pos   Neg
  Yes   82    17           Yes   15    5            Yes   99    6
  No    12    114          No    25    231          No    1     94

  Flattened into a single vector: 82 17 12 114  15 5 25 231  99 6 1 94

Illustration on multiple domains: Breast Cancer, Labor and Liver

            Acc                F-Meas.    AUC
  NB        BC:  71.7          .48        .7
            La:  89.5          .92        .97
            Li:  55.4          .6         .64
            Avg: 72.2          .67        .77
  SMO       BC:  69.6          .39        .59
            La:  89.5          .92        .87
            Li:  58.3          .014       .5
            Avg: 72.46         .44        .65
  Boost.    BC:  70.3          .46        .7
            La:  87.7          .91        .87
            Li:  66.1          .534       .68
            Avg: 74.7          .64        .75

[Projection plot: 1 = Ideal, 8 = SVM (SMO), 9 = NB]
Abnormality detection with our new approach is a lot easier and more accurate than when relying on accuracy, F-measure, or AUC listings on each domain, or on their average over all domains. Also, our new approach allows us to mix binary and multi-class domains; averaging does not!

Illustration on a multiclass domain: Anneal

  Adaboost:                                 NB:
   a  b  c    d  e   f   <- classified as   a  b   c    d  e   f   <- classified as
   0  0  8    0  0   0   | a                7  0   1    0  0   0   | a
   0  0  99   0  0   0   | b                0  99  0    0  0   0   | b
   0  0  684  0  0   0   | c                3  38  564  0  0   79  | c
   0  0  0    0  0   0   | d                0  0   0    0  0   0   | d
   0  0  0    0  67  0   | e                0  0   0    0  67  0   | e
   0  0  40   0  0   0   | f                0  0   2    0  0   38  | f

  Accuracy: NB 86.3, Adaboost 83.6, other classifiers 97.4 to 99.3

[Projection plot: 1 = Ideal, 8 = Adaboost, 9 = NB]
Accuracy does not tell us whether NB and Adaboost make the same kind of errors!
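A toy illustration of this idea (a sketch only, not the actual method of [Japkowicz, Sanghi, Tischer]): flatten each confusion matrix into a vector, as done above, and compare a classifier's vector with that of a hypothetical ideal classifier by Euclidean distance; with several classifiers, hierarchical clustering of these vectors groups models that make similar kinds of errors.

  # Toy sketch: compare flattened confusion matrices by Euclidean distance.
  # The classifier's three matrices are the ones shown above; the "ideal" vector
  # holds the corresponding perfect matrices (all examples on the diagonal).
  classifier <- c(82, 17, 12, 114,   15, 5, 25, 231,    99, 6, 1, 94)
  ideal      <- c(94,  0,  0, 131,   40, 0,  0, 236,   100, 0, 0, 100)

  dist(rbind(classifier = classifier, ideal = ideal))   # Euclidean distance to the ideal
  # With more classifiers, hclust(dist(...)) clusters those with similar error patterns.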
Other Measures III: Accounting for chance and evaluating under imprecise labeling
- The measures considered so far assume the existence of ground-truth labels (against which, for instance, accuracy is measured).
- In the absence of ground truth, experts typically label the instances.

Multiple Annotations
[Figure: multiple experts labeling the same instances]
- Inter-expert agreement: overall agreement of the group.
- Classifier agreement against the group.

Questions of interest
- Such measurements are also desired when:
  - assessing a classifier in a skewed class setting;
  - assessing a classifier against expert labeling;
  - there are reasons to believe that bias has been introduced in the labeling process (say, due to class imbalance, asymmetric representativeness of classes in the data, …);
  - …
- Agreement statistics aim to address this by accounting for chance agreements (coincidental concordances).

Coincidental concordances (natural propensities) need to be accounted for
[Figure: per-rater label counts — R1: 1526, R2: 2484, R3: 1671, R4: 816, R5: 754, R6: 1891]

General Agreement Statistic
- Examples: Cohen's kappa, Fleiss' kappa, Scott's pi, ICC, …
- General form: (observed agreement - chance agreement) / (maximum achievable agreement - chance agreement).

Multiple raters over multiple classes
- Various generalizations exist to estimate inter-rater agreement in the general multi-class, multi-rater scenario:
  - e.g., Fleiss' kappa (Fleiss, 1971); ICC; Scott's pi; (Vanbelle and Albert, 2009);
  - typically relying on a marginalization argument (a fixed number of raters is assumed to be sampled from a pool of raters to label each instance);
  - recently, a generalization for the fixed-rater scenario has been proposed (the κS measure, Shah, 2011) to address this.
- For assessing agreement against a group of raters:
  - typically a consensus argument (a deterministic label set formed via majority vote, against which the classifier is evaluated) or marginalization-based measures;
  - issues with these arguments were highlighted for the fixed-rater setting (Shah, 2011);
  - a novel measure, the S measure, has been proposed to address this.

So we decided on a performance measure. How do we estimate it in an unbiased manner?
- What if we had all the data?
  - Re-substitution: overly optimistic (the best performance is achieved with complete over-fit).
- Which factors affect such estimation choices?
  - Sample size / domains
  - Random variations: in the training set, testing set, learning algorithms, class noise
  - …

Hold-out approach
- Set aside a separate test set T and estimate the empirical risk on it:

  R_T(f) = (1/|T|) * Σ_{(x_i, y_i) in T} I(f(x_i) ≠ y_i)

- Advantages:
  - independence from the training set;
  - generalization behavior can be characterized;
  - estimates can be obtained for any classifier.

Hold-out approach
- Confidence intervals can be found too, based on the binomial approximation, e.g. the usual Gaussian-approximation interval R_T(f) ± z * sqrt(R_T(f)(1 - R_T(f))/|T|).

A tighter bound
- Based on binomial inversion: instead of relying on the Gaussian approximation, the binomial tail is inverted directly to obtain lower and upper bounds on the true risk.
- More realistic than the asymptotic Gaussian assumption (μ ± 2σ).

Binomial vs. Gaussian assumptions
- Bl and Bu show the binomial lower and upper bounds; CIl and CIu show the lower and upper confidence intervals based on the Gaussian assumption.
[Figure: binomial vs. Gaussian bounds]
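A small R sketch contrasting the two constructions just discussed (the test-set size and error count are made up): binom.test gives an exact interval by binomial inversion, while the Gaussian interval is the usual normal approximation.

  # Hold-out error: exact binomial vs. Gaussian confidence intervals (95%).
  n_test <- 200        # test set size (made up)
  errors <- 30         # misclassifications observed on the test set
  err_hat <- errors / n_test

  # Exact interval by binomial inversion, via binom.test:
  binom_ci <- binom.test(errors, n_test)$conf.int

  # Gaussian approximation: err_hat +/- z * sqrt(err_hat * (1 - err_hat) / n)
  z <- qnorm(0.975)
  gauss_ci <- err_hat + c(-1, 1) * z * sqrt(err_hat * (1 - err_hat) / n_test)

  rbind(binomial = binom_ci, gaussian = gauss_ci)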
Hold-out sample size requirements
- Inverting the bound above to solve for the required test-set size shows that the number of hold-out examples needed grows very quickly as the error tolerance ε and the confidence parameter δ are made small.

The need for re-sampling
- Too few training examples -> learning a poor classifier.
- Too few test examples -> erroneous error estimates.
- Hence: resampling.
  - It allows for accurate performance estimates while allowing the algorithm to train on most data examples.
  - It allows for closer duplication conditions.

What implicitly guides re-sampling: a quick look at the relationship with bias-variance behavior
- Bias-variance analysis helps understand the behavior of learning algorithms.
- In the classification case:
  - Bias: the difference between the most frequent prediction of the algorithm and the optimal label of the examples.
  - Variance: the degree of the algorithm's divergence from its most probable (average) prediction in response to variations in the training set.
- Hence, bias is independent of the training set while variance isn't.
- Having too few examples in the training set affects the bias of the algorithm by making its average prediction unreliable.
- Having too few examples in the test set results in high variance.
- Hence, the choice of resampling can introduce different limitations on the bias and variance estimates of an algorithm, and hence on the reliability of the error estimates.

An ontology of error estimation techniques
[Figure: ontology of error estimation / resampling techniques]

Simple Resampling: k-fold Cross-Validation
- In cross-validation, the data set is divided into k folds and, at each iteration, a different fold is reserved for testing while all the others are used for training the classifier.
[Figure: the k folds, with the testing subset rotating from Fold 1 to Fold k]

Simple Resampling: Some variations of Cross-Validation
- Stratified k-fold cross-validation:
  - maintains the class distribution of the original dataset when forming the k CV subsets;
  - useful when the class distribution of the data is imbalanced/skewed.
- Leave-One-Out:
  - the extreme case of k-fold CV where each subset is of size 1;
  - also known as the jackknife estimate.

Observations
- k-fold CV is arguably the best-known and most commonly used resampling technique.
- With k of reasonable size, it is less computer-intensive than Leave-One-Out.
- It is easy to apply.
- In all the variants, the testing sets are independent of one another, as required by many statistical testing methods, but the training sets are highly overlapping. This can affect the bias of the error estimates (generally mitigated for large dataset sizes).
- It results in an averaged estimate over k different classifiers.
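A minimal k-fold cross-validation sketch in R (not the book's implementation); 3-nearest-neighbour from the 'class' package, which ships with standard R installations, stands in for an arbitrary learner.

  # Minimal k-fold cross-validation sketch (placeholder learner: 3-NN on iris).
  library(class)
  set.seed(1)
  data(iris)

  k <- 10
  folds <- sample(rep(1:k, length.out = nrow(iris)))   # random fold assignment

  fold_err <- sapply(1:k, function(i) {
    train_x <- iris[folds != i, 1:4]; train_y <- iris$Species[folds != i]
    test_x  <- iris[folds == i, 1:4]; test_y  <- iris$Species[folds == i]
    pred <- knn(train_x, test_x, cl = train_y, k = 3)  # train on k-1 folds, test on the held-out one
    mean(pred != test_y)                               # error rate on this fold
  })

  mean(fold_err)   # the k-fold CV estimate: an average over k different classifiers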
Observations
- Leave-One-Out can result in estimates with high variance, given that each testing fold contains only one example.
  - On the other hand, the classifier is practically unbiased since each training fold contains almost all the data.
- Leave-One-Out can be quite effective for moderate dataset sizes.
  - For small datasets the estimates have high variance, while for large datasets its application becomes too computer-intensive.
- Leave-One-Out can also be beneficial, compared to k-fold CV, when:
  - there is a wide dispersion of the data distribution;
  - the data set contains extreme values.

Limitations of Simple Resampling
- These methods do not estimate the risk of a classifier but rather the expected risk of the algorithm over partitions of that size.
- While the stability of the estimates can give insight into the robustness of the algorithm, classifier comparisons in fact compare the averaged estimates (and not individual classifiers).
- Providing rigorous guarantees is quite problematic:
  - no confidence intervals can be established (unlike, say, the hold-out method);
  - even under a normality assumption, the standard deviation at best conveys the uncertainty in the estimates.

Multiple Resampling: Bootstrapping
- Attempts to answer: what can be done when the data is too small for the application of k-fold CV or Leave-One-Out?
- Assumes that the available sample is representative.
- Creates a large number of new samples by drawing with replacement.

Multiple Resampling: ε0 Bootstrapping
- Given a dataset D of size m, create k bootstrap samples Bi, each of size m, by sampling with replacement.
- At each iteration i:
  - Bi represents the training set, while the test set contains a single copy of the examples in D that are not present in Bi;
  - train a classifier on Bi and test it on the test set to obtain the error estimate ε0i.
- Average over the ε0i's to obtain the ε0 estimate.

Multiple Resampling: ε0 Bootstrapping
- In every iteration, the probability of an example not being chosen after m samples is (1 - 1/m)^m ≈ e^(-1) ≈ 0.368.
- Hence, the average number of distinct examples in the training set is (1 - 0.368)m = 0.632m.
- The ε0 measure can hence be pessimistic, since in each run the classifier is typically trained on only about 63.2% of the data.

Multiple Resampling: e632 Bootstrapping
- The e632 estimate corrects for this pessimistic bias by taking into account the optimistic bias of the resubstitution error over the remaining 0.368 fraction:

  e632 = 0.632 * ε0 + 0.368 * e_resub

  where e_resub is the resubstitution (training) error.

Discussion
- Bootstrapping can be useful when datasets are too small.
  - In these cases, the estimates can also have low variance owing to the (artificially) increased data size.
- ε0: effective in cases of very high true error rate; has lower variance than k-fold CV but is more biased.
- e632 (owing to the correction it makes): effective for small true error rates.
- An interesting observation: these error estimates can be algorithm-dependent.
  - E.g., the bootstrap can be a poor estimator for algorithms such as nearest neighbor or FOIL that do not benefit from duplicate instances.
Multiple Resampling: other approaches
- Randomization:
  - over labels: assess the dependence of the algorithm on the actual label assignment (as opposed to obtaining a similar classifier on a chance assignment);
  - over samples (permutation): assess the stability of the estimates over different re-orderings of the data.
- Multiple trials of simple resampling:
  - potentially higher replicability and more stable estimates;
  - multiple runs (how many?): 5x2 CV, 10x10 CV (the main motivation is the comparison of algorithms).

What to watch out for when selecting an error estimation method
- Domain (e.g., data distribution, dispersion)
- Dataset size
- Bias-variance dependence
- (Expected) true error rate
- Type of learning algorithms to be evaluated (e.g., robust to, say, a permutation test; or not benefiting from bootstrapping)
- Computational complexity

"The relations between the evaluation methods, whether statistically significantly different or not, varies with data quality. Therefore, one cannot replace one test with another, and only evaluation methods appropriate for the context may be used." — Reich and Barai (1999, p. 11)

Statistical Significance Testing
- Error estimation techniques allow us to obtain an estimate of the desired performance metric(s) for different classifiers.
- A difference is seen between the metric values of the classifiers.
- Questions:
  - Can the difference be attributed to real characteristics of the algorithms?
  - Could the observed difference(s) be merely coincidental concordances?

Statistical Significance Testing
- Statistical significance testing can help ascertain whether the observed results are statistically significant (within certain constraints).
- We primarily focus on Null Hypothesis Significance Testing (NHST), typically w.r.t.:
  - evaluating an algorithm A of interest against other algorithms on a specific problem;
  - evaluating the generic capabilities of A against other algorithms on benchmark datasets;
  - evaluating the performance of multiple classifiers on benchmark datasets or on a specific problem of interest.

NHST
- State a null hypothesis.
  - Usually the opposite of what we wish to test (for example: classifiers A and B perform equivalently).
- Choose a suitable statistical test and statistic that will be used to (possibly) reject the null hypothesis.
- Choose a critical region for the statistic to lie in that is extreme enough for the null hypothesis to be rejected.
- Calculate the test statistic from the data.
- If the test statistic lies in the critical region: reject the null hypothesis.
  - If not, we fail to reject the null hypothesis, but do not accept it either.
- Rejecting the null hypothesis gives us some confidence in the belief that our observations did not occur merely by chance.

Choosing a Statistical Test
- There are several aspects to consider when choosing a statistical test:
  - What kind of problem is being handled?
  - Do we have enough information about the underlying distributions of the classifiers' results to apply a parametric test?
  …     Statistical  tests  considered  can  be  categorized  by  the  task   they  address,  i.e.,  comparing     2  algorithms  on  a  single  domain     2  algorithms  on  several  domains     multiple  algorithms  on  multiple  domains     76 Issues  with  hypothesis  tesNng     NHST  never  constitutes  a  proof  that  our  observation  is  valid     Some  researchers  (Drummond,  2006;  Demsar,  2008)  have  pointed  out  that   NHST:     overvalues  the  results;  and     limits  the  search  for  novel  ideas  (owing  to  excessive,  often  unwarranted,   confidence  in  the  results)     NHST  outcomes  are  prone  to  misinterpretation     Rejecting  null  hypothesis  with  confidence  p  does  not  signify  that  the   performance-­‐difference  holds  with  probability  1-­‐p     1-­‐p  in  fact  denotes  the  likelihood  of  evidence  from  the  experiment  being   correct  if  the  difference  indeed  exists     Given  enough  data,  a  statistically  significant  difference  can  always  be   shown   77 Issues  with  hypothesis  tesNng:   But  NHST  can  sNll  be  helpful     NHST  never  constitutes  a  proof  that  our  observation  is  valid  (but  provides   added  concrete  support  to  observations)     Some  researchers  (Drummond,  2006;  Demsar,  2008)  have  pointed  out  that   NHST:     overvalues  the  results;  and     limits  the  search  for  novel  ideas  (owing  to  excessive,  often  unwarranted,   confidence  in  the  results)     NHST  outcomes  are  prone  to  misinterpretation  (to  which  a  better   sensitization  is  necessary)     Given  enough  data,  a  statistically  significant  difference  can  always  be   shown   The  impossibility  to  reject  the  null  hypothesis  while  using  a  reasonable  effect   size  is  telling   78 An  overview  of  representaNve  tests   79 Parametric  vs.  Non-­‐parametric  tests   80 Make assumptions on distribution of population (i.e., error estimates); typically more powerful than nonparametric counterparts Tests  covered  in  this  tutorial   81 Comparing  2  algorithms  on  a  single  domain:   t-­‐test     Arguably,  one  of  the  most  widely  used  statistical  tests     Comes  in  various  flavors  (addressing  different  experimental  settings)     In  our  case:  two  matched  samples  t-­‐test     to  compare  performances  of  two  classifiers  applied  to  the  same  dataset   with  matching  randomizations  and  partitions     Measures  if  the  difference  between  two  means  (mean  value  of   performance  measures,  e.g.,  error  rate)  is  meaningful     Null  hypothesis:  the  two  samples  (performance  measures  of  two   classifiers  over  the  dataset)  come  from  the  same  population   82 Comparing  2  algorithms  on  a  single  domain:   t-­‐test     The  t-­‐statistic:     where   83 Average Performance measure, e.g., error rate of classifier f1 number of trials Comparing  2  algorithms  on  a  single  domain:   t-­‐test:  IllustraNon     We  apply  C4.5  and  Naïve  Bayes  (NB)  on  the  Labor  dataset  (from  UCI   repository)     10  runs  of  10-­‐fold  CV  (maintaining  same  training  and  testing  folds  across   algorithms).       The  result  of  each  10-­‐fold  CV  is  considered  a  single  result.  Hence,  each  run   of  10-­‐fold  CV  is  a  different  trial.  
- Consequently, n = 10.
- Computing the t-statistic on these ten paired results and referring to the t-table for n-1 = 9 degrees of freedom, we can reject the null hypothesis at the 0.001 significance level (for which the observed t-value must be greater than 4.781).

Effect size
- The t-test measured the effect (i.e., that the difference in the sample means is indeed statistically significant) but not the size of this effect (i.e., the extent to which this difference is practically important).
- Statistics offers many methods to determine this effect size, including Pearson's correlation coefficient, Hedges' G, etc.
- In the case of the t-test, we use Cohen's d statistic:

  d = (pm̄_f1 - pm̄_f2) / s_pooled

  where s_pooled is the pooled standard deviation estimate of the two samples.

Effect size
- A typical interpretation proposed by Cohen for effect size is the following discrete scale:
  - a value of about 0.2 or 0.3: small effect, but probably meaningful;
  - ≈ 0.5: medium effect that is noticeable;
  - ≥ 0.8: large effect.
- In the case of the Labor dataset example, d is computed from the same two sets of CV results.

Comparing 2 algorithms on a single domain: t-test: so far so good, but what about the assumptions made by the t-test?
- Assumptions:
  - normality or pseudo-normality of the populations;
  - randomness of the samples (representativeness);
  - equal variances of the populations.
- Let's see if these assumptions hold in our case of the Labor dataset:
  - Normality: we may assume pseudo-normality since each trial eventually tests the algorithms on each of the 57 instances (note, however, that we compare averaged estimates).
  - Randomness: the data was collected in 1987-1988, and all the collective agreements were collected; there is no reason to assume this to be non-representative.
- However:
  - the sample variances cannot be considered equal;
  - we also ignored the correlated measurement effect (due to overlapping training sets), which results in a higher Type I error.
- Were we warranted to use the t-test? Probably not (even though the t-test is quite robust to violations of some assumptions).
- Alternatives:
  - other variations of the t-test (e.g., Welch's t-test);
  - McNemar's test (non-parametric).
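For reference, this is how the matched-pairs t-test and Cohen's d discussed above can be computed in R (a sketch; the per-trial error rates below are made up — with the Labor example one would use the ten 10-fold-CV results of C4.5 and NB):

  # Matched-pairs t-test and Cohen's d for two classifiers over n = 10 trials.
  err_f1 <- c(0.22, 0.25, 0.21, 0.24, 0.23, 0.26, 0.22, 0.24, 0.25, 0.23)
  err_f2 <- c(0.28, 0.30, 0.27, 0.31, 0.29, 0.33, 0.28, 0.30, 0.32, 0.29)

  t.test(err_f1, err_f2, paired = TRUE)               # two matched samples t-test

  s_pooled <- sqrt((var(err_f1) + var(err_f2)) / 2)   # pooled standard deviation
  cohens_d <- (mean(err_f1) - mean(err_f2)) / s_pooled
  cohens_d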
Comparing 2 algorithms on a single domain: McNemar's test
- Non-parametric counterpart to the t-test.
- Computes the following contingency matrix, where 0 denotes a misclassification by the concerned classifier, i.e.:
  - c00 denotes the number of instances misclassified by both f1 and f2;
  - c10 denotes the number of instances correctly classified by f1 but not by f2;
  - …

Comparing 2 algorithms on a single domain: McNemar's test
- Null hypothesis: the two classifiers disagree in both directions equally often, i.e., c01 = c10.
- McNemar's statistic:

  χ²_McN = (|c01 - c10| - 1)² / (c01 + c10)

  which is distributed approximately as χ² with 1 degree of freedom if c01 + c10 is greater than 20.
- To reject the null hypothesis, χ²_McN should be greater than χ²_{1,α} for the desired α (typically 0.05, for which χ²_{1,0.05} = 3.841).

Comparing 2 algorithms on a single domain: McNemar's test
- Back to the Labor dataset: comparing C4.5 and NB, we obtain a statistic greater than the desired value of 3.841, allowing us to reject the null hypothesis at 95% confidence.

Comparing 2 algorithms on a single domain: McNemar's test
- To keep in mind:
  - The test is aimed at matched pairs (maintain the same training and test folds across learning algorithms).
  - Using resampling instead of independent test sets (over multiple runs) can introduce some bias.
  - A constraint: c01 + c10 should be at least 20, which was not true in this case.
  - In such a case, the sign test (also applicable for comparing 2 algorithms on multiple domains) should be used instead.

Comparing 2 algorithms on multiple domains
- There is no clear parametric way to compare two classifiers on multiple domains:
  - the t-test is not a very good alternative because it is not clear that we have commensurability of the performance measures across domains;
  - the normality assumption is difficult to establish;
  - the t-test is susceptible to outliers, which is more likely when many different domains are considered.
- Therefore we will describe two non-parametric alternatives:
  - the sign test;
  - Wilcoxon's signed-rank test.

Comparing 2 algorithms on multiple domains: Sign test
- Can be used to compare two classifiers on a single domain (using the results at each fold as a trial).
- More commonly used to compare two classifiers on multiple domains.
- Calculate:
  - nf1: the number of times that f1 outperforms f2;
  - nf2: the number of times that f2 outperforms f1.
- The null hypothesis holds if the number of wins follows a binomial distribution.
- Practically speaking, a classifier should perform better on at least wα (the critical value) datasets to reject the null hypothesis at the α significance level.
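Both of the preceding tests are available in base R. A minimal sketch with made-up counts (mcnemar.test applies the continuity correction by default; binom.test needs integer win counts, so ties are dropped or split beforehand):

  # McNemar's test: 2x2 table of the two classifiers' per-instance correctness
  # (the off-diagonal cells are the discordant counts c01 and c10).
  tab <- matrix(c(8,  6,
                  20, 66),
                nrow = 2, byrow = TRUE,
                dimnames = list(f1 = c("wrong", "right"), f2 = c("wrong", "right")))
  mcnemar.test(tab)        # continuity-corrected chi-squared statistic, 1 df

  # Sign test: win counts of f1 vs. f2 over a set of domains.
  wins_f1 <- 2
  wins_f2 <- 8
  binom.test(wins_f1, wins_f1 + wins_f2, p = 0.5)   # two-sided by default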
Comparing 2 algorithms on multiple domains: Sign test: Illustration

  Dataset          NB      SVM     Adaboost   Rand Forest
  Anneal           96.43   99.44   83.63      99.55
  Audiology        73.42   81.34   46.46      79.15
  Balance Scale    72.30   91.51   72.31      80.97
  Breast Cancer    71.70   66.16   70.28      69.99
  Contact Lenses   71.67   71.67   71.67      71.67
  Pima Diabetes    74.36   77.08   74.35      74.88
  Glass            70.63   62.21   44.91      79.87
  Hepatitis        83.21   80.63   82.54      84.58
  Hypothyroid      98.22   93.58   93.21      99.39
  Tic-Tac-Toe      69.62   99.90   72.54      93.94

- NB vs. SVM: nf1 = 4.5, nf2 = 5.5 and w0.05 = 8, so we cannot reject the null hypothesis (1-tailed).
- Ada vs. RF: nf1 = 1 and nf2 = 8.5, so we can reject the null hypothesis at level α = 0.05 (1-tailed) and conclude that RF is significantly better than Ada on these data sets at that significance level.

Comparing 2 algorithms on multiple domains: Wilcoxon's Signed-Rank test
- Non-parametric, and typically more powerful than the sign test (since, unlike the sign test, it quantifies the differences in performance).
- Commonly used to compare two classifiers on multiple domains.
- Method:
  - For each domain, calculate the difference in performance of the two classifiers.
  - Rank the absolute differences and graft the signs in front of these ranks.
  - Calculate the sum of positive ranks Ws1 and the sum of negative ranks Ws2.
  - Calculate Twilcox = min(Ws1, Ws2).
  - Reject the null hypothesis if Twilcox ≤ Vα (the critical value).

Comparing 2 algorithms on multiple domains: Wilcoxon's Signed-Rank test: Illustration

  Data   NB      SVM     NB-SVM    |NB-SVM|   Rank       Signed rank
  1      .9643   .9944   -0.0301   0.0301     3          -3
  2      .7342   .8134   -0.0792   0.0792     6          -6
  3      .7230   .9151   -0.1921   0.1921     8          -8
  4      .7170   .6616   +0.0554   0.0554     5          +5
  5      .7167   .7167    0        0          (removed)
  6      .7436   .7708   -0.0272   0.0272     2          -2
  7      .7063   .6221   +0.0842   0.0842     7          +7
  8      .8321   .8063   +0.0258   0.0258     1          +1
  9      .9822   .9358   +0.0464   0.0464     4          +4
  10     .6962   .9990   -0.3028   0.3028     9          -9

- Ws1 = 17 and Ws2 = 28, so TWilcox = min(17, 28) = 17.
- For the n = 10 - 1 = 9 remaining (non-zero) differences and α = 0.05, V = 8 for the 1-sided test. Since 17 > 8, we cannot reject the null hypothesis.
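The example above can be checked directly with base R (a quick sketch; wilcox.test drops the zero difference and ranks the absolute differences itself):

  # Wilcoxon signed-rank test on the NB vs. SVM accuracies from the table above.
  nb  <- c(.9643, .7342, .7230, .7170, .7167, .7436, .7063, .8321, .9822, .6962)
  svm <- c(.9944, .8134, .9151, .6616, .7167, .7708, .6221, .8063, .9358, .9990)

  wilcox.test(nb, svm, paired = TRUE)   # V is the sum of positive ranks (17 here);
                                        # p > 0.05, so H0 is not rejected, as above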
Comparing multiple algorithms on multiple domains
- Two alternatives:
  - parametric: (one-way repeated-measures) ANOVA;
  - non-parametric: Friedman's test.
- Such multiple-hypothesis tests are called omnibus tests.
- Their null hypothesis is that all the classifiers perform similarly, and its rejection signifies that there exists at least one pair of classifiers with significantly different performances.
- In case of rejection of this null hypothesis, the omnibus test is followed by a post-hoc test to identify the significantly different pairs of classifiers.
- In this tutorial, we will discuss Friedman's test (omnibus) and the Nemenyi test (post-hoc).

Friedman's Test
- Algorithms are ranked on each domain according to their performance (best performance = lowest rank).
- In case of a d-way tie at rank r: assign ((r+1)+(r+2)+…+(r+d))/d to each classifier.
- For each classifier j, compute R.j, the sum of its ranks on all domains.
- Calculate Friedman's statistic over the n domains and k classifiers:

  χ²_F = [12 / (n k (k+1))] * Σ_j R.j² - 3 n (k+1)

Illustration of the Friedman test

  Accuracies:                          Ranks:
  Domain   fA      fB      fC          Domain   fA     fB    fC
  1        85.83   75.86   84.19       1        1      3     2
  2        85.91   73.18   85.90       2        1.5    3     1.5
  3        86.12   69.08   83.83       3        1      3     2
  4        85.82   74.05   85.11       4        1      3     2
  5        86.28   74.71   86.38       5        2      3     1
  6        86.42   65.90   81.20       6        1      3     2
  7        85.91   76.25   86.38       7        2      3     1
  8        86.10   75.10   86.75       8        2      3     1
  9        85.95   70.50   88.03       9        2      3     1
  10       86.12   73.95   87.18       10       2      3     1
                                       R.j      15.5   30    14.5

  (Here χ²_F = 0.1 * (15.5² + 30² + 14.5²) - 3 * 10 * 4 = 15.05, which exceeds the χ² critical value for k-1 = 2 degrees of freedom, so the null hypothesis is rejected.)

Nemenyi Test
- The omnibus test rejected the null hypothesis.
- Follow up with a post-hoc test: the Nemenyi test, to identify the classifier pairs with a significant performance difference.
- Let Rij be the rank of classifier fj on dataset Si, and compute the mean rank of fj over all datasets.
- Between classifiers fj1 and fj2, calculate the Nemenyi statistic q, based on the difference in their rank sums R.j1 - R.j2 scaled by sqrt(k(k+1)/(6n)).

Nemenyi Test: Illustration
- Computing the q statistic of the Nemenyi test for the different classifier pairs, we obtain:
  - qAB = -32.22
  - qBC = 34.44
  - qAC = 2.22
- At α = 0.05, we have qα = 2.55.
- Hence, we can show a statistically significantly different performance between classifiers A and B, and between classifiers B and C, but not between A and C.

Considerations to keep in mind while choosing an appropriate test bed
- Wolpert's "No Free Lunch" theorems: if one algorithm tends to perform better than another on a given class of problems, then the reverse will be true on a different class of problems.
- LaLoudouana and Tarate (2003) showed that even mediocre learning approaches can be made to look competitive by selecting the test domains carefully.
- The purpose of data set selection should not be to demonstrate one algorithm's superiority over another in all cases, but rather to identify the areas of strength of various algorithms with respect to domain characteristics or on specific domains of interest.

Where can we get our data from?
- Repository data: data repositories such as the UCI repository and the UCI KDD Archive have been extremely popular in machine learning research.
- Artificial data: the creation of artificial data sets representing the particular characteristic an algorithm was designed to deal with is also a common practice in machine learning.
- Web-based exchanges: could we imagine a multi-disciplinary effort conducted on the Web where researchers in need of data analysis would "lend" their data to machine learning researchers in exchange for an analysis?

Pros and Cons of Repository Data
- Pros:
  - Very easy to use: the data is available, already processed, and the user does not need any knowledge of the underlying field.
  - The data is not artificial since it was provided by labs and so on. So, in some sense, the research is conducted in a real-world setting (albeit a limited one).
  - Replication and comparisons are facilitated, since many researchers use the same data sets to validate their ideas.
- Cons:
  - The use of data repositories does not guarantee that the results will generalize to other domains.
  - The data sets in the repository are not representative of the data mining process, which involves many steps other than classification.
  - Community experiment / multiplicity effect: since so many experiments are run on the same data set, by chance some will yield interesting (though meaningless) results.

Pros and Cons of Artificial Data
- Pros:
  - Data sets can be created to mimic the traits that are expected to be present in real data (which are unfortunately unavailable).
  - The researcher has the freedom to explore various related situations that may not have been readily testable from accessible data.
  - Since new data can be generated at will, the multiplicity effect will not be a problem in this setting.
- Cons:
  - Real data is unpredictable and does not systematically behave according to a well-defined generation model.
  - Classifiers that happen to model the same distribution as the one used in the data generation process have an unfair advantage.

Pros and Cons of Web-Based Exchanges
- Pros:
  - Would be useful for three communities:
    - domain experts, who would get interesting analyses of their problems;
    - machine learning researchers, who would get their hands on interesting data, and thus encounter new problems;
    - statisticians, who could study their techniques in an applied context.
- Cons:
  - It has not been organized. Is it truly feasible?
  - How would the quality of the studies be controlled?

A Technique for Characterizing our Data Sets
[Japkowicz, Canadian AI'2012]
- The approach can be characterized as a meta-level transductive process driven by classifier evaluation.
[Figure: artificial domains with known hidden characteristics and the real domain of interest are mapped into a common evaluation space]

Evaluation Space Mapping
[Figure: artificial and real-world domains are run through a set of classifiers, and the results are fed to a hierarchical clusterer to produce the clustering results]
- The graph below shows the hidden characteristics that we have extracted from each of the UCI domains we considered, along with the relative strength of these characteristics (which were derived from the distance of each domain's mapping to the ideal domain in the evaluation space, by way of the hierarchical clustering obtained for each domain).
[Figure: extracted hidden characteristics per UCI domain]
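A toy sketch of the general idea only (not the authors' implementation): each domain is represented by the vector of performance values a set of classifiers obtains on it, and domains, together with an idealized reference, are compared by hierarchical clustering. All numbers below are made up.

  # Toy sketch of the evaluation-space idea (all performance values are made up).
  perf <- rbind(
    ideal        = c(1.00, 1.00, 1.00, 1.00),   # hypothetical perfect performance
    real_domain  = c(0.72, 0.70, 0.70, 0.69),
    artificial_1 = c(0.95, 0.93, 0.96, 0.94),
    artificial_2 = c(0.74, 0.71, 0.69, 0.70),
    artificial_3 = c(0.55, 0.60, 0.58, 0.57)
  )
  colnames(perf) <- c("NB", "C4.5", "SVM", "RF")

  hc <- hclust(dist(perf))   # Euclidean distances between performance profiles
  plot(hc)                   # domains clustering near real_domain share its hidden traits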
What help is available for conducting proper evaluation?
- There is no need for researchers to program all the code necessary to conduct proper evaluation, nor to do the computations by hand.
- Resources are readily available for every step of the process, including:
  - evaluation metrics,
  - statistical tests,
  - re-sampling,
  - and, of course, data repositories.
- Pointers to these resources and explanations on how to use them are discussed in our book: "Evaluating Learning Algorithms: A Classification Perspective" by Japkowicz and Shah, Cambridge University Press, 2011.

Where to look for evaluation metrics?
- Actually, as most people know, Weka is a great source not only for classifiers, but also for computing results according to a variety of evaluation metrics.
- WEKA even performs ROC analysis and draws cost curves, although better graphical analysis packages are available in R, namely the ROCR package, which permits the visualization of many versions of ROC and cost curves, and also lift curves, PR curves, and so on.

Where to look for statistical tests?
- The R software package contains implementations of all the statistical tests discussed here and many more. They are very simple to run.

Where to look for re-sampling methods?
- In our book, we have implemented all the re-sampling schemes described in this tutorial. The actual code is also available upon request (e-mail us!).

  # ε0 bootstrap estimation of the accuracies of two classifiers
  # (assumes RWeka-style classifiers, e.g. J48, whose models are scored with
  # evaluate_Weka_classifier; the substr() call parses the accuracy out of the
  # printed evaluation summary).
  e0Boot <- function(iter, dataSet, setSize, dimension, classifier1, classifier2) {
    classifier1e0Boot <- numeric(iter)
    classifier2e0Boot <- numeric(iter)
    for (i in 1:iter) {
      Subsamp  <- sample(setSize, setSize, replace = TRUE)            # bootstrap sample indices
      Basesamp <- 1:setSize
      oneTrain <- dataSet[Subsamp, 1:dimension]
      oneTest  <- dataSet[setdiff(Basesamp, Subsamp), 1:dimension]    # out-of-bag examples
      classifier1model <- classifier1(class ~ ., data = oneTrain)
      classifier2model <- classifier2(class ~ ., data = oneTrain)
      classifier1eval  <- evaluate_Weka_classifier(classifier1model, newdata = oneTest)
      classifier1acc   <- as.numeric(substr(classifier1eval$string, 70, 80))
      classifier2eval  <- evaluate_Weka_classifier(classifier2model, newdata = oneTest)
      classifier2acc   <- as.numeric(substr(classifier2eval$string, 70, 80))
      classifier1e0Boot[i] <- classifier1acc
      classifier2e0Boot[i] <- classifier2acc
    }
    return(rbind(classifier1e0Boot, classifier2e0Boot))
  }

Some Concluding Remarks

If you need help, advice, etc., please contact us at: nat@site.uottawa.ca or mohak@mohakshah.com
Thank you for attending the tutorial!!!

References
- Too many to put down here, but see pp. 393-402 of the book.
- If you want specific ones, come and see us at the end of the tutorial or send us an e-mail.