Text Classification
and Naïve Bayes
The Task of Text Classification
Dan	  Jurafsky	  
Is this spam?

Who wrote which Federalist Papers?
•  1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
•  Authorship of 12 of the letters in dispute
•  1963: solved by Mosteller and Wallace using Bayesian methods
[Portraits: James Madison, Alexander Hamilton]

Male or female author?
1.  By 1925 present-day Vietnam was divided into three parts under French colonial rule. The southern region embracing Saigon and the Mekong delta was the colony of Cochin-China; the central area with its imperial capital at Hue was the protectorate of Annam…
2.  Clara never failed to be astonished by the extraordinary felicity of her own name. She found it hard to trust herself to the mercy of fate, which had managed over the years to convert her greatest shame into one of her greatest assets…
S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, 2003. “Gender, Genre, and Writing Style in Formal Written Texts,” Text, volume 23, number 3, pp. 321–346

Positive or negative movie review?
•  unbelievably disappointing
•  Full of zany characters and richly applied satire, and some great plot twists
•  this is the greatest screwball comedy ever filmed
•  It was pathetic. The worst part about it was the boxing scenes.

What is the subject of this article?
[Figure: a MEDLINE article to be assigned (?) to categories in the MeSH Subject Category Hierarchy]
MeSH Subject Category Hierarchy:
•  Antagonists and Inhibitors
•  Blood Supply
•  Chemistry
•  Drug Therapy
•  Embryology
•  Epidemiology
•  …

Text Classification
•  Assigning subject categories, topics, or genres
•  Spam detection
•  Authorship identification
•  Age/gender identification
•  Language identification
•  Sentiment analysis
•  …

Text Classification: definition
•  Input:
   •  a document d
   •  a fixed set of classes C = {c1, c2, …, cJ}
•  Output: a predicted class c ∈ C

Classification Methods: Hand-coded rules
•  Rules based on combinations of words or other features
   •  spam: black-list-address OR (“dollars” AND “have been selected”)
•  Accuracy can be high
   •  If rules are carefully refined by an expert
•  But building and maintaining these rules is expensive
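The spam rule above can be written down directly as a boolean function. This is only a minimal sketch: the blacklist contents, the function name `is_spam`, and the example addresses are hypothetical, and the phrase matching is a plain case-insensitive substring test.

```python
# Hypothetical blacklist; a real system would load this from configuration.
BLACKLIST = {"promo@example.com"}


def is_spam(sender: str, body: str) -> bool:
    """Rule from the slide: black-list-address OR ("dollars" AND "have been selected")."""
    text = body.lower()
    on_blacklist = sender.lower() in BLACKLIST
    phrase_match = "dollars" in text and "have been selected" in text
    return on_blacklist or phrase_match


print(is_spam("friend@example.org", "You have been selected to receive a million dollars"))
print(is_spam("friend@example.org", "Lunch costs ten dollars"))
```

Every new spam pattern needs another hand-written clause, which is the maintenance cost the slide refers to.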

Classification Methods: Supervised Machine Learning
•  Input:
   •  a document d
   •  a fixed set of classes C = {c1, c2, …, cJ}
   •  a training set of m hand-labeled documents (d1, c1), …, (dm, cm)
•  Output:
   •  a learned classifier γ: d → c

Classification Methods: Supervised Machine Learning
•  Any kind of classifier
   •  Naïve Bayes
   •  Logistic regression
   •  Support-vector machines
   •  k-Nearest Neighbors
   •  …

Naïve Bayes (I)

Naïve Bayes Intuition
•  Simple (“naïve”) classification method based on Bayes rule
•  Relies on a very simple representation of the document
   •  Bag of words

The bag of words representation
γ(document) = c, where the document is the review:
I love this movie! It's sweet,
but with satirical humor. The
dialogue is great and the
adventure scenes are fun… It
manages to be whimsical and
romantic while laughing at the
conventions of the fairy tale
genre. I would recommend it to
just about anyone. I've seen
it several times, and I'm
always happy to see it again
whenever I have a friend who
hasn't seen it yet.

The bag of words representation: using a subset of words
γ(love, sweet, satirical, great, fun, whimsical, romantic, laughing, recommend, several, happy, again) = c
[All other words of the review are masked out.]

The bag of words representation
γ(word counts) = c
word        count
great       2
love        2
recommend   1
laugh       1
happy       1
...         ...
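A bag of words is just a table of word counts with the order thrown away. The sketch below shows one way to build it; the tokenizer (lowercasing plus a simple regular expression) and the shortened review text are assumptions for illustration, and no stemming is applied, so "laughing" would not be merged into "laugh" as in the table above.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into word tokens, and count them; order is discarded."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Shortened, hypothetical variant of the review used on the slides.
review = "I love this movie! The dialogue is great and I love the great adventure scenes."
bow = bag_of_words(review)
print(bow["great"], bow["love"], bow["movie"])
```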

Bag of words for document classification
Test document: parser, language, label, translation, … → class ?
Class                Characteristic words
Machine Learning     learning, training, algorithm, shrinkage, network, ...
NLP                  parser, tag, training, translation, language, ...
Garbage Collection   garbage, collection, memory, optimization, region, ...
Planning             planning, temporal, reasoning, plan, language, ...
GUI                  ...

Formalizing the Naïve Bayes Classifier

Bayes’ Rule Applied to Documents and Classes
•  For a document d and a class c
$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$$
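
Naïve Bayes Classifier (I)
$$
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(c \mid d) && \text{MAP is “maximum a posteriori” = most likely class} \\
&= \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} && \text{Bayes rule} \\
&= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) && \text{Dropping the denominator}
\end{aligned}
$$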

Naïve Bayes Classifier (II)
Document d represented as features x1..xn
$$
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) \\
&= \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)
\end{aligned}
$$

Naïve Bayes Classifier (IV)
$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$
•  P(c): How often does this class occur? We can just count the relative frequencies in a corpus.
•  P(x1, x2, …, xn | c): O(|X|^n · |C|) parameters. Could only be estimated if a very, very large number of training examples was available.

Multinomial Naïve Bayes Independence Assumptions
$$P(x_1, x_2, \ldots, x_n \mid c)$$
•  Bag of Words assumption: Assume position doesn’t matter
•  Conditional Independence: Assume the feature probabilities P(xi | cj) are independent given the class c.
$$P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdots P(x_n \mid c)$$

Multinomial Naïve Bayes Classifier
$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$
$$c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{x \in X} P(x \mid c)$$

Applying Multinomial Naive Bayes Classifiers to Text Classification
positions ← all word positions in test document
$$c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$
	   	   	  

Naïve Bayes: Learning

Learning the Multinomial Naïve Bayes Model
•  First attempt: maximum likelihood estimates
   •  simply use the frequencies in the data
$$\hat{P}(c_j) = \frac{doccount(C = c_j)}{N_{doc}}$$
$$\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}$$
Sec. 13.3

Parameter estimation
$$\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}$$
= fraction of times word wi appears among all words in documents of topic cj
•  Create mega-document for topic j by concatenating all docs in this topic
   •  Use frequency of w in mega-document

Problem with Maximum Likelihood
•  What if we have seen no training documents with the word fantastic and classified in the topic positive (thumbs-up)?
$$\hat{P}(\text{“fantastic”} \mid \text{positive}) = \frac{count(\text{“fantastic”}, \text{positive})}{\sum_{w \in V} count(w, \text{positive})} = 0$$
•  Zero probabilities cannot be conditioned away, no matter the other evidence!
$$c_{MAP} = \operatorname*{argmax}_{c} \hat{P}(c) \prod_{i} \hat{P}(x_i \mid c)$$
Sec. 13.3
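
Laplace (add-1) smoothing for Naïve Bayes
Maximum likelihood estimate:
$$\hat{P}(w_i \mid c) = \frac{count(w_i, c)}{\sum_{w \in V} count(w, c)}$$
With add-1 smoothing:
$$\hat{P}(w_i \mid c) = \frac{count(w_i, c) + 1}{\sum_{w \in V} \left( count(w, c) + 1 \right)} = \frac{count(w_i, c) + 1}{\left( \sum_{w \in V} count(w, c) \right) + |V|}$$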

Multinomial Naïve Bayes: Learning
•  From training corpus, extract Vocabulary
•  Calculate P(cj) terms
   •  For each cj in C do
         docsj ← all docs with class = cj
         P(cj) ← |docsj| / |total # documents|
•  Calculate P(wk | cj) terms
   •  Textj ← single doc containing all docsj
   •  For each word wk in Vocabulary
         nk ← # of occurrences of wk in Textj
         P(wk | cj) ← (nk + α) / (n + α |Vocabulary|), where n is the total number of word tokens in Textj
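The learning procedure above can be sketched in a few lines of Python. The function name `train_nb`, the data layout (a list of token-list and label pairs), and the two-document toy corpus are assumptions made for illustration; `alpha=1.0` corresponds to add-1 smoothing.

```python
from collections import Counter, defaultdict


def train_nb(docs, alpha=1.0):
    """docs: list of (tokens, label). Returns (priors, likelihoods, vocabulary)."""
    vocab = {w for tokens, _ in docs for w in tokens}
    doc_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)  # word_counts[c][w] is n_k for class c
    for tokens, label in docs:
        word_counts[label].update(tokens)

    priors = {c: doc_counts[c] / len(docs) for c in doc_counts}
    likelihoods = {}
    for c in doc_counts:
        n = sum(word_counts[c].values())  # total tokens in the class mega-document
        likelihoods[c] = {
            w: (word_counts[c][w] + alpha) / (n + alpha * len(vocab)) for w in vocab
        }
    return priors, likelihoods, vocab


# Hypothetical toy corpus, not taken from the slides.
toy = [("good fun film".split(), "pos"), ("dull boring film".split(), "neg")]
priors, likelihoods, vocab = train_nb(toy)
print(priors["pos"], likelihoods["pos"]["good"], likelihoods["neg"]["good"])
```

Because every vocabulary word gets the same pseudo-count, the smoothed likelihoods of each class still sum to one.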

Laplace (add-1) smoothing: unknown words
Add one extra word to the vocabulary, the “unknown word” wu
$$\hat{P}(w_u \mid c) = \frac{count(w_u, c) + 1}{\left( \sum_{w \in V} count(w, c) \right) + |V| + 1} = \frac{1}{\left( \sum_{w \in V} count(w, c) \right) + |V| + 1}$$

Naïve Bayes: Relationship to Language Modeling

Generative Model for Multinomial Naïve Bayes
c = China
X1 = Shanghai   X2 = and   X3 = Shenzhen   X4 = issue   X5 = bonds

Naïve Bayes and Language Modeling
•  Naïve Bayes classifiers can use any sort of feature
   •  URL, email address, dictionaries, network features
•  But if, as in the previous slides
   •  We use only word features
   •  we use all of the words in the text (not a subset)
•  Then
   •  Naïve Bayes has an important similarity to language modeling.

Each class = a unigram language model
•  Assigning each word: P(word | c)
•  Assigning each sentence: P(s | c) = Π P(word | c)
Class pos
P(word | pos)   word
0.1             I
0.1             love
0.01            this
0.05            fun
0.1             film
…
s:               I     love   this   fun    film
P(word | pos):   0.1   0.1    0.01   0.05   0.1
P(s | pos) = 0.0000005
Sec. 13.2.1

Naïve Bayes as a Language Model
•  Which class assigns the higher probability to s?
word   Model pos   Model neg
I      0.1         0.2
love   0.1         0.001
this   0.01        0.01
fun    0.05        0.005
film   0.1         0.1
s = I love this fun film
P(s | pos) > P(s | neg)
Sec. 13.2.1
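The comparison between the two class models can be checked numerically. The sketch below copies the word probabilities from the slides; the helper name `sentence_prob` is an assumption, and `math.prod` requires Python 3.8 or later.

```python
from math import prod

# Per-class unigram probabilities, copied from the slides.
pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}


def sentence_prob(words, model):
    """P(s | c) under a unigram model: the product of the per-word probabilities."""
    return prod(model[w] for w in words)


s = "I love this fun film".split()
p_pos = sentence_prob(s, pos)
p_neg = sentence_prob(s, neg)
print(p_pos, p_neg, p_pos > p_neg)
```

Up to floating-point rounding, `p_pos` is 5e-7 (the 0.0000005 shown earlier) and `p_neg` is 1e-9, so the positive model assigns the sentence the higher probability.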

Multinomial Naïve Bayes: A Worked Example

           Doc   Words                                 Class
Training   1     Chinese Beijing Chinese               c
           2     Chinese Chinese Shanghai              c
           3     Chinese Macao                         c
           4     Tokyo Japan Chinese                   j
Test       5     Chinese Chinese Chinese Tokyo Japan   ?
$$\hat{P}(c) = \frac{N_c}{N} \qquad \hat{P}(w \mid c) = \frac{count(w, c) + 1}{count(c) + |V|}$$
Priors:
P(c) = 3/4
P(j) = 1/4
Conditional Probabilities:
P(Chinese|c) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo|c)   = (0+1) / (8+6) = 1/14
P(Japan|c)   = (0+1) / (8+6) = 1/14
P(Chinese|j) = (1+1) / (3+6) = 2/9
P(Tokyo|j)   = (1+1) / (3+6) = 2/9
P(Japan|j)   = (1+1) / (3+6) = 2/9
Choosing a class:
P(c|d5) ∝ 3/4 * (3/7)^3 * 1/14 * 1/14 ≈ 0.0003
P(j|d5) ∝ 1/4 * (2/9)^3 * 2/9 * 2/9 ≈ 0.0001
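The worked example can be reproduced with exact rational arithmetic. This is a sketch that recomputes the add-1 smoothed estimates from the four training documents; the variable names are assumptions, and `fractions.Fraction` is used only so the results can be compared with the hand-computed fractions.

```python
from collections import Counter
from fractions import Fraction

train = [
    ("Chinese Beijing Chinese", "c"),
    ("Chinese Chinese Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "j"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan".split()

vocab = {w for text, _ in train for w in text.split()}  # |V| = 6
scores = {}
for cls in ("c", "j"):
    class_docs = [text.split() for text, label in train if label == cls]
    counts = Counter(w for doc in class_docs for w in doc)
    total = sum(counts.values())  # tokens in the class mega-document
    score = Fraction(len(class_docs), len(train))  # prior P(class)
    for w in test_doc:
        score *= Fraction(counts[w] + 1, total + len(vocab))  # add-1 likelihood
    scores[cls] = score

predicted = max(scores, key=scores.get)
print({cls: float(s) for cls, s in scores.items()}, predicted)
```

The unnormalized scores come out at roughly 0.0003 for class c and 0.0001 for class j, so the test document is assigned to c.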

Naïve Bayes in Spam Filtering
•  SpamAssassin Features:
   •  Mentions Generic Viagra
   •  Online Pharmacy
   •  Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)
   •  Phrase: impress ... girl
   •  From: starts with many numbers
   •  Subject is all capitals
   •  HTML has a low ratio of text to image area
   •  One hundred percent guaranteed
   •  Claims you can be removed from the list
   •  'Prestigious Non-Accredited Universities'
   •  http://spamassassin.apache.org/tests_3_3_x.html

Summary: Naive Bayes is Not So Naive
•  Very fast, low storage requirements
•  Robust to irrelevant features
   Irrelevant features cancel each other without affecting results
•  Very good in domains with many equally important features
   Decision trees suffer from fragmentation in such cases, especially if there is little data
•  Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
•  A good dependable baseline for text classification
   •  But we will see other classifiers that give better accuracy

Precision, Recall, and the F measure

The 2-by-2 contingency table
               correct   not correct
selected       tp        fp
not selected   fn        tn

Precision and recall
•  Precision: % of selected items that are correct
•  Recall: % of correct items that are selected
               correct   not correct
selected       tp        fp
not selected   fn        tn
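In terms of the contingency table, precision is tp / (tp + fp) and recall is tp / (tp + fn). A minimal sketch follows; the counts are made up for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not something the slides specify.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts: 10 items selected, 8 of them correct, 16 correct items overall.
p, r = precision_recall(tp=8, fp=2, fn=8)
print(p, r)
```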

A combined measure: F
•  A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean):
$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$$
•  The harmonic mean is a very conservative average; see IIR § 8.3
•  People usually use balanced F1 measure
   •  i.e., with β = 1 (that is, α = ½): F = 2PR/(P+R)
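The β form of the F measure translates directly into code. In this sketch the function name `f_measure` and the example values P = 0.8, R = 0.5 are assumptions; returning 0.0 when both inputs are zero is a convention chosen here to avoid dividing by zero.

```python
def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 gives balanced F1."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)


f1 = f_measure(0.8, 0.5)
arithmetic_mean = (0.8 + 0.5) / 2
print(f1, arithmetic_mean)
```

With these inputs F1 is about 0.615, below the arithmetic mean of 0.65, which illustrates why the harmonic mean is called a conservative average.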

Text Classification: Evaluation

More Than Two Classes: Sets of binary classifiers
•  Dealing with any-of or multivalue classification
   •  A document can belong to 0, 1, or >1 classes.
•  For each class c ∈ C
   •  Build a classifier γc to distinguish c from all other classes c’ ∈ C
•  Given test doc d,
   •  Evaluate it for membership in each class using each γc
   •  d belongs to any class for which γc returns true
Sec. 14.5

More Than Two Classes: Sets of binary classifiers
•  One-of or multinomial classification
   •  Classes are mutually exclusive: each document in exactly one class
•  For each class c ∈ C
   •  Build a classifier γc to distinguish c from all other classes c’ ∈ C
•  Given test doc d,
   •  Evaluate it for membership in each class using each γc
   •  d belongs to the one class with maximum score
Sec. 14.5
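The two decision rules differ only in how the per-class outputs are combined. The sketch below assumes each binary classifier γc has already produced a numeric score for the document; the score values, the class names, and the 0.5 threshold standing in for "γc returns true" are all hypothetical.

```python
def any_of(scores: dict, threshold: float = 0.5) -> set:
    """Any-of: the document joins every class whose binary classifier fires."""
    return {c for c, s in scores.items() if s >= threshold}


def one_of(scores: dict) -> str:
    """One-of: the document joins exactly one class, the one with maximum score."""
    return max(scores, key=scores.get)


# Hypothetical per-class scores for one test document.
scores = {"grain": 0.9, "wheat": 0.7, "trade": 0.2}
print(any_of(scores), one_of(scores))
```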

Evaluation: Classic Reuters-21578 Data Set
•  Most (over)used data set, 21,578 docs (each 90 types, 200 tokens)
•  9603 training, 3299 test articles (ModApte/Lewis split)
•  118 categories
   •  An article can be in more than one category
   •  Learn 118 binary category distinctions
•  Average document (with at least one category) has 1.24 classes
•  Only about 10 out of 118 categories are large
Common categories (#train, #test)
•  Earn (2877, 1087)
•  Acquisitions (1650, 179)
•  Money-fx (538, 179)
•  Grain (433, 149)
•  Crude (389, 189)
•  Trade (369, 119)
•  Interest (347, 131)
•  Ship (197, 89)
•  Wheat (212, 71)
•  Corn (182, 56)
Sec. 15.2.4

Reuters Text Categorization data set (Reuters-21578) document
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981"
NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow,
March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions
on a number of issues, according to the National Pork Producers Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future
direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to
endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry,
the NPPC added. Reuter
</BODY></TEXT></REUTERS>
Sec. 15.2.4

Confusion matrix c
•  For each pair of classes <c1,c2> how many documents from c1 were incorrectly assigned to c2?
   •  c3,2: 90 wheat documents incorrectly assigned to poultry
Docs in test set   Assigned UK   Assigned poultry   Assigned wheat   Assigned coffee   Assigned interest   Assigned trade
True UK            95            1                  13               0                 1                   0
True poultry       0             1                  0                0                 0                   0
True wheat         10            90                 0                1                 0                   0
True coffee        0             0                  0                34                3                   7
True interest      -             1                  2                13                26                  5
True trade         0             0                  2                14                5                   10

Per class evaluation measures
Recall: Fraction of docs in class i classified correctly:
$$\frac{c_{ii}}{\sum_{j} c_{ij}}$$
Precision: Fraction of docs assigned class i that are actually about class i:
$$\frac{c_{ii}}{\sum_{j} c_{ji}}$$
Accuracy: (1 - error rate) Fraction of docs classified correctly:
$$\frac{\sum_{i} c_{ii}}{\sum_{j} \sum_{i} c_{ij}}$$
Sec. 15.2.4
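These three measures can be read straight off a confusion matrix whose rows are true classes and whose columns are assigned classes. The sketch below uses the six-class matrix from the confusion matrix slide; treating the "-" cell (true interest, assigned UK) as 0 is an assumption, as is returning 0.0 for an empty row or column.

```python
labels = ["UK", "poultry", "wheat", "coffee", "interest", "trade"]
# c[i][j] = number of documents of true class i that were assigned class j.
c = [
    [95, 1, 13, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [10, 90, 0, 1, 0, 0],
    [0, 0, 0, 34, 3, 7],
    [0, 1, 2, 13, 26, 5],  # the "-" cell is assumed to be 0
    [0, 0, 2, 14, 5, 10],
]


def recall(i: int) -> float:
    """c_ii divided by the sum of row i (all docs truly in class i)."""
    row_total = sum(c[i])
    return c[i][i] / row_total if row_total else 0.0


def precision(i: int) -> float:
    """c_ii divided by the sum of column i (all docs assigned class i)."""
    col_total = sum(row[i] for row in c)
    return c[i][i] / col_total if col_total else 0.0


accuracy = sum(c[i][i] for i in range(len(c))) / sum(map(sum, c))

for i, name in enumerate(labels):
    print(f"{name}: recall={recall(i):.2f} precision={precision(i):.2f}")
print(f"accuracy={accuracy:.2f}")
```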

Micro- vs. Macro-Averaging
•  If we have more than one class, how do we combine multiple performance measures into one quantity?
•  Macroaveraging: Compute performance for each class, then average.
•  Microaveraging: Collect decisions for all classes, compute contingency table, evaluate.
Sec. 15.2.4

Micro- vs. Macro-Averaging: Example
Class 1
                  Truth: yes   Truth: no
Classifier: yes   10           10
Classifier: no    10           970
Class 2
                  Truth: yes   Truth: no
Classifier: yes   90           10
Classifier: no    10           890
Micro Ave. Table
                  Truth: yes   Truth: no
Classifier: yes   100          20
Classifier: no    20           1860
•  Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
•  Microaveraged precision: 100/120 = .83
•  Microaveraged score is dominated by score on common classes
Sec. 15.2.4
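The two averages in this example can be recomputed from the per-class true-positive and false-positive counts. This is a small sketch; the dictionary layout and variable names are assumptions.

```python
# True positives and false positives per class, copied from the tables above.
tables = {
    "class 1": {"tp": 10, "fp": 10},
    "class 2": {"tp": 90, "fp": 10},
}

# Macroaveraging: compute precision per class, then average the precisions.
per_class = [t["tp"] / (t["tp"] + t["fp"]) for t in tables.values()]
macro_precision = sum(per_class) / len(per_class)

# Microaveraging: pool the counts into one table, then compute precision once.
tp = sum(t["tp"] for t in tables.values())
fp = sum(t["fp"] for t in tables.values())
micro_precision = tp / (tp + fp)

print(round(macro_precision, 2), round(micro_precision, 2))
```

Class 2 contributes 100 of the 120 pooled "yes" decisions, so the microaverage sits close to that class's precision of 0.9.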

Development Test Sets and Cross-validation
[Data split: Training Set | Development Test Set | Test Set]
•  Metric: P/R/F1 or Accuracy
•  Unseen test set
   •  avoid overfitting (‘tuning to the test set’)
   •  more conservative estimate of performance
•  Cross-validation over multiple splits
   •  Handle sampling errors from different datasets
   •  Pool results over each split
   •  Compute pooled dev set performance
[Figure: in each cross-validation split a different portion of the data serves as the Dev Test set and the rest as the Training Set; the Test Set is held out.]
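One way to produce the multiple splits used in cross-validation is to cut the non-test data into k contiguous folds and let each fold serve once as the dev test set. The sketch below is an illustration under stated assumptions: the function name `k_fold_splits` is made up, the final test set is assumed to have been removed beforehand, and no shuffling is done.

```python
def k_fold_splits(n_items: int, k: int):
    """Yield (train_indices, dev_indices) pairs for k contiguous folds."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        dev = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, dev
        start += size


splits = list(k_fold_splits(10, 3))
for train, dev in splits:
    print("train:", train, "dev:", dev)
```

Pooling the dev-set decisions from all splits before computing the metric gives the pooled dev set performance mentioned above.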

Text Classification: Practical Issues

The Real World
•  Gee, I’m building a text classifier for real, now!
•  What should I do?
Sec. 15.3.1

No training data?
Manually written rules
If (wheat or grain) and not (whole or bread) then
    Categorize as grain
•  Need careful crafting
   •  Human tuning on development data
   •  Time-consuming: 2 days per class
Sec. 15.3.1

Very little data?
•  Use Naïve Bayes
   •  Naïve Bayes is a “high-bias” algorithm (Ng and Jordan 2002 NIPS)
•  Get more labeled data
   •  Find clever ways to get humans to label data for you
•  Try semi-supervised training methods:
   •  Bootstrapping, EM over unlabeled documents, …
Sec. 15.3.1

A reasonable amount of data?
•  Perfect for all the clever classifiers
   •  SVM
   •  Regularized Logistic Regression
•  You can even use user-interpretable decision trees
   •  Users like to hack
   •  Management likes quick fixes
Sec. 15.3.1

A huge amount of data?
•  Can achieve high accuracy!
•  At a cost:
   •  SVMs (train time) or kNN (test time) can be too slow
   •  Regularized logistic regression can be somewhat better
•  So Naïve Bayes can come back into its own again!
Sec. 15.3.1

Accuracy as a function of data size
•  With enough data
   •  Classifier may not matter
[Figure: Brill and Banko on spelling correction]
Sec. 15.3.1

Real-world systems generally combine:
•  Automatic classification
•  Manual review of uncertain/difficult/“new” cases

Underflow Prevention: log space
•  Multiplying lots of probabilities can result in floating-point underflow.
•  Since log(xy) = log(x) + log(y)
   •  Better to sum logs of probabilities instead of multiplying probabilities.
•  Class with highest un-normalized log probability score is still most probable.
$$c_{NB} = \operatorname*{argmax}_{c_j \in C} \left[ \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \right]$$
•  Model is now just max of sum of weights
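The underflow problem is easy to reproduce. In the sketch below the per-word probabilities (1e-5 and 2e-5), the document length of 100 positions, and the function names are hypothetical; the point is only that both products collapse to 0.0 in double precision while the log-space scores remain comparable.

```python
import math


def product_score(prior: float, word_probs) -> float:
    """P(c) times the product of P(x_i | c); prone to floating-point underflow."""
    score = prior
    for q in word_probs:
        score *= q
    return score


def log_score(prior: float, word_probs) -> float:
    """log P(c) plus the sum of log P(x_i | c); same argmax, no underflow."""
    return math.log(prior) + sum(math.log(q) for q in word_probs)


# Hypothetical 100-word document scored under two classes.
class_a = [1e-5] * 100
class_b = [2e-5] * 100

print(product_score(0.5, class_a), product_score(0.5, class_b))
print(log_score(0.5, class_a), log_score(0.5, class_b))
```

Both products print as 0.0, so the classes cannot be ranked, whereas the log scores still show that class_b is the more probable class.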
How to tweak performance
•  Domain-specific features and weights: very important in real performance
•  Sometimes need to collapse terms:
   •  Part numbers, chemical formulas, …
   •  But stemming generally doesn’t help
•  Upweighting: Counting a word as if it occurred twice:
   •  title words (Cohen & Singer 1996)
   •  first sentence of each paragraph (Murata, 1999)
   •  In sentences that contain title words (Ko et al, 2002)
Sec. 15.3.2
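Upweighting title words amounts to adding extra counts for them before the counts are fed to the classifier. The sketch below is one simple reading of "counting a word as if it occurred twice": each title token contributes `title_weight` counts while each body token contributes one. The function name, the whitespace tokenizer, and the example text are assumptions.

```python
from collections import Counter


def weighted_counts(title: str, body: str, title_weight: int = 2) -> Counter:
    """Bag-of-words counts in which each title token counts title_weight times."""
    counts = Counter(body.lower().split())
    for word in title.lower().split():
        counts[word] += title_weight
    return counts


# Hypothetical article.
counts = weighted_counts("Wheat prices", "prices rise as wheat exports fall")
print(counts["wheat"], counts["prices"], counts["rise"])
```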