Text Classification and Naïve Bayes: The Task of Text Classification

Is this spam?

Who wrote which Federalist papers?
• 1787–8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
• Authorship of 12 of the letters in dispute
• 1963: solved by Mosteller and Wallace using Bayesian methods

Male or female author?
1. By 1925 present-day Vietnam was divided into three parts under French colonial rule. The southern region embracing Saigon and the Mekong delta was the colony of Cochin-China; the central area with its imperial capital at Hue was the protectorate of Annam…
2. Clara never failed to be astonished by the extraordinary felicity of her own name. She found it hard to trust herself to the mercy of fate, which had managed over the years to convert her greatest shame into one of her greatest assets…

S. Argamon, M. Koppel, J. Fine, A. R. Shimoni. 2003. "Gender, Genre, and Writing Style in Formal Written Texts," Text 23(3), pp. 321–346.

Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.

What is the subject of this article?
[Figure: a MEDLINE article to be assigned a category from the MeSH Subject Category Hierarchy.]
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …

Text Classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …

Text Classification: definition
• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
• Output: a predicted class c ∈ C

Classification Methods: Hand-coded rules
• Rules based on combinations of words or other features
  • spam: black-list-address OR ("dollars" AND "have been selected")
• Accuracy can be high
  • If rules carefully refined by expert
• But building and maintaining these rules is expensive

Classification Methods: Supervised Machine Learning
• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1, c1), …, (dm, cm)
• Output:
  • a learned classifier γ: d → c

Classification Methods: Supervised Machine Learning
• Any kind of classifier
  • Naïve Bayes
  • Logistic regression
  • Support-vector machines
  • k-Nearest Neighbors
  • …
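To make the supervised setup concrete, here is a minimal sketch of a learned classifier γ: d → c. It is not part of the slides: scikit-learn, the toy training set, and the labels are all illustrative assumptions.

    # A minimal sketch of supervised text classification (assumes scikit-learn
    # is installed; the tiny training set below is invented for illustration).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_docs = ["unbelievably disappointing",
                  "the greatest screwball comedy ever filmed",
                  "it was pathetic, the worst boxing scenes"]
    train_labels = ["neg", "pos", "neg"]

    # gamma: d -> c, learned from (d1,c1), ..., (dm,cm)
    gamma = make_pipeline(CountVectorizer(), MultinomialNB())
    gamma.fit(train_docs, train_labels)
    print(gamma.predict(["richly applied satire and great plot twists"]))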
Text Classification and Naïve Bayes: Naïve Bayes (I)

Naïve Bayes Intuition
• Simple ("naïve") classification method based on Bayes rule
• Relies on very simple representation of document
  • Bag of words

The bag of words representation

γ( "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet." ) = c

The bag of words representation: using a subset of words
[The same review with all other words masked out; only the subset survives:] … love … sweet … satirical … great … fun … whimsical … romantic … laughing … recommend … several … happy … again …

The bag of words representation

γ( great 2, love 2, recommend 1, laugh 1, happy 1, … ) = c

Bag of words for document classification
[Figure: characteristic word lists for several classes, e.g. Machine Learning (learning, training, algorithm, shrinkage, network…), Planning (planning, temporal, reasoning, plan, language…), NLP (parser, tag, training, translation, language…), Garbage Collection (garbage, collection, memory, optimization, region…), GUI; a test document (parser, language, label, translation, …) must be assigned to one of these classes.]
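A bag of words is just a multiset of the document's tokens. A minimal sketch in plain Python (the whitespace tokenizer is deliberately crude; real systems use a proper tokenizer):

    from collections import Counter

    def bag_of_words(text):
        # Lowercase and split on whitespace; word order and position are
        # discarded, only counts survive.
        return Counter(text.lower().split())

    print(bag_of_words("I love this movie. I would recommend it. great great fun"))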
Text Classification and Naïve Bayes: Formalizing the Naïve Bayes Classifier

Bayes' Rule Applied to Documents and Classes
• For a document d and a class c:

  P(c|d) = P(d|c) P(c) / P(d)

Naïve Bayes Classifier (I)

  cMAP = argmax_{c∈C} P(c|d)                  (MAP is "maximum a posteriori" = most likely class)
       = argmax_{c∈C} P(d|c) P(c) / P(d)      (Bayes rule)
       = argmax_{c∈C} P(d|c) P(c)             (dropping the denominator)

Naïve Bayes Classifier (II)
• Document d represented as features x1, …, xn:

  cMAP = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)

Naïve Bayes Classifier (IV)

  cMAP = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)

• P(x1, x2, …, xn | c): O(|X|ⁿ · |C|) parameters; could only be estimated if a very, very large number of training examples was available.
• P(c): how often does this class occur? We can just count the relative frequencies in a corpus.

Multinomial Naïve Bayes Independence Assumptions

  P(x1, x2, …, xn | c)

• Bag of Words assumption: assume position doesn't matter
• Conditional Independence: assume the feature probabilities P(xi|cj) are independent given the class c:

  P(x1, …, xn | c) = P(x1|c) · P(x2|c) · P(x3|c) · … · P(xn|c)

Multinomial Naïve Bayes Classifier

  cMAP = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)

  cNB = argmax_{cj∈C} P(cj) ∏_{x∈X} P(x|cj)

Applying Multinomial Naive Bayes Classifiers to Text Classification

  cNB = argmax_{cj∈C} P(cj) ∏_{i∈positions} P(xi|cj)

  positions ← all word positions in test document

Text Classification and Naïve Bayes: Naïve Bayes: Learning

Learning the Multinomial Naïve Bayes Model (Sec. 13.3)
• First attempt: maximum likelihood estimates; simply use the frequencies in the data:

  P̂(cj) = doccount(C = cj) / Ndoc

  P̂(wi|cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)

Parameter estimation
• P̂(wi|cj) is the fraction of times word wi appears among all words in documents of topic cj
• Create a mega-document for topic j by concatenating all docs in this topic
• Use the frequency of w in the mega-document

Problem with Maximum Likelihood (Sec. 13.3)
• What if we have seen no training documents with the word fantastic classified in the topic positive (thumbs-up)?

  P̂("fantastic"|positive) = count("fantastic", positive) / Σ_{w∈V} count(w, positive) = 0

• Zero probabilities cannot be conditioned away, no matter the other evidence!

  cMAP = argmax_c P̂(c) ∏_i P̂(xi|c)

Laplace (add-1) smoothing for Naïve Bayes

  MLE:   P̂(wi|c) = count(wi, c) / Σ_{w∈V} count(w, c)

  Add-1: P̂(wi|c) = (count(wi, c) + 1) / Σ_{w∈V} (count(w, c) + 1)
                  = (count(wi, c) + 1) / ( (Σ_{w∈V} count(w, c)) + |V| )

Multinomial Naïve Bayes: Learning
(a runnable sketch of this procedure appears at the end of this section)
• From training corpus, extract Vocabulary
• Calculate P(cj) terms
  • For each cj in C do
    • docsj ← all docs with class = cj
    • P(cj) ← |docsj| / |total # documents|
• Calculate P(wk|cj) terms
  • Textj ← single doc containing all docsj
  • For each word wk in Vocabulary
    • nk ← # of occurrences of wk in Textj
    • P(wk|cj) ← (nk + α) / (n + α |Vocabulary|)

Laplace (add-1) smoothing: unknown words
• Add one extra word to the vocabulary, the "unknown word" wu:

  P̂(wu|c) = (count(wu, c) + 1) / ( (Σ_{w∈V} count(w, c)) + |V| + 1 )
           = 1 / ( (Σ_{w∈V} count(w, c)) + |V| + 1 )

Text Classification and Naïve Bayes: Naïve Bayes: Relationship to Language Modeling

Generative Model for Multinomial Naïve Bayes
[Figure: a class node c = China generating the word sequence X1 = Shanghai, X2 = and, X3 = Shenzhen, X4 = issue, X5 = bonds.]

Naïve Bayes and Language Modeling
• Naïve Bayes classifiers can use any sort of feature
  • URL, email address, dictionaries, network features
• But if, as in the previous slides,
  • we use only word features
  • we use all of the words in the text (not a subset)
• then Naïve Bayes has an important similarity to language modeling.
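The learning procedure above fits in a few lines of plain Python. This is a hedged sketch, not the course's reference code: it assumes whitespace tokenization, add-1 smoothing (α = 1), and that unseen test words are simply dropped.

    from collections import Counter, defaultdict

    def train_nb(documents, classes):
        # documents: list of (text, class) pairs.
        vocab = set()
        mega_doc = defaultdict(Counter)   # Text_j: all docs of class j concatenated
        n_docs = Counter()
        for text, c in documents:
            words = text.lower().split()
            vocab.update(words)
            mega_doc[c].update(words)
            n_docs[c] += 1
        prior = {c: n_docs[c] / len(documents) for c in classes}
        likelihood = {}
        for c in classes:
            total = sum(mega_doc[c].values())
            # Laplace add-1 smoothing: (count(w,c) + 1) / (total + |V|)
            likelihood[c] = {w: (mega_doc[c][w] + 1) / (total + len(vocab))
                             for w in vocab}
        return prior, likelihood, vocab

    def classify_nb(text, prior, likelihood, vocab, classes):
        # c_NB = argmax_c P(c) * prod_i P(x_i | c), over known words only.
        words = [w for w in text.lower().split() if w in vocab]
        def score(c):
            p = prior[c]
            for w in words:
                p *= likelihood[c][w]
            return p
        return max(classes, key=score)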
Each class = a unigram language model (Sec. 13.2.1)
• Assigning each word: P(word|c)
• Assigning each sentence: P(s|c) = ∏ P(word|c)

  Class pos: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1, …

  P("I love this fun film" | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005

Naïve Bayes as a Language Model (Sec. 13.2.1)
• Which class assigns the higher probability to s?

  Model pos: 0.1 I, 0.1 love, 0.01 this, 0.05 fun, 0.1 film
  Model neg: 0.2 I, 0.001 love, 0.01 this, 0.005 fun, 0.1 film

  s = "I love this fun film"
  P(s|pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 5 × 10⁻⁷
  P(s|neg) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1 = 1 × 10⁻⁹

  P(s|pos) > P(s|neg)

Text Classification and Naïve Bayes: Multinomial Naïve Bayes: A Worked Example

  Estimators: P̂(c) = Nc / N      P̂(w|c) = (count(w, c) + 1) / (count(c) + |V|)

           Doc  Words                                 Class
  Training  1   Chinese Beijing Chinese               c
            2   Chinese Chinese Shanghai              c
            3   Chinese Macao                         c
            4   Tokyo Japan Chinese                   j
  Test      5   Chinese Chinese Chinese Tokyo Japan   ?

Priors:
  P(c) = 3/4
  P(j) = 1/4

Conditional probabilities (|V| = 6):
  P(Chinese|c) = (5+1) / (8+6) = 6/14 = 3/7
  P(Tokyo|c)   = (0+1) / (8+6) = 1/14
  P(Japan|c)   = (0+1) / (8+6) = 1/14
  P(Chinese|j) = (1+1) / (3+6) = 2/9
  P(Tokyo|j)   = (1+1) / (3+6) = 2/9
  P(Japan|j)   = (1+1) / (3+6) = 2/9

Choosing a class:
  P(c|d5) ∝ 3/4 × (3/7)³ × 1/14 × 1/14 ≈ 0.0003
  P(j|d5) ∝ 1/4 × (2/9)³ × 2/9 × 2/9  ≈ 0.0001

so document 5 is assigned class c.
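A quick check of the worked example with exact fractions (a throwaway script using Python's fractions module; the probabilities are copied from the table above):

    from fractions import Fraction as F

    # Add-1-smoothed conditionals from the worked example (|V| = 6).
    p_chinese_c, p_tokyo_c, p_japan_c = F(6, 14), F(1, 14), F(1, 14)
    p_chinese_j, p_tokyo_j, p_japan_j = F(2, 9), F(2, 9), F(2, 9)

    score_c = F(3, 4) * p_chinese_c**3 * p_tokyo_c * p_japan_c
    score_j = F(1, 4) * p_chinese_j**3 * p_tokyo_j * p_japan_j
    print(float(score_c), float(score_j))   # ~0.0003 vs ~0.0001, so choose c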
Naïve Bayes in Spam Filtering
• SpamAssassin features:
  • Mentions Generic Viagra
  • Online Pharmacy
  • Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)
  • Phrase: impress ... girl
  • From: starts with many numbers
  • Subject is all capitals
  • HTML has a low ratio of text to image area
  • One hundred percent guaranteed
  • Claims you can be removed from the list
  • 'Prestigious Non-Accredited Universities'
  • http://spamassassin.apache.org/tests_3_3_x.html

Summary: Naive Bayes is Not So Naive
• Very fast, low storage requirements
• Robust to irrelevant features: irrelevant features cancel each other without affecting results
• Very good in domains with many equally important features: decision trees suffer from fragmentation in such cases, especially with little data
• Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
• A good dependable baseline for text classification
• But we will see other classifiers that give better accuracy

Text Classification and Naïve Bayes: Precision, Recall, and the F measure

The 2-by-2 contingency table

                 correct   not correct
  selected       tp        fp
  not selected   fn        tn

Precision and recall
• Precision: % of selected items that are correct:  P = tp / (tp + fp)
• Recall: % of correct items that are selected:     R = tp / (tp + fn)

A combined measure: F
• A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean):

  F = 1 / ( α (1/P) + (1 − α) (1/R) ) = (β² + 1) P R / (β² P + R)

• The harmonic mean is a very conservative average; see IIR § 8.3
• People usually use the balanced F1 measure, i.e., with β = 1 (that is, α = ½):

  F1 = 2PR / (P + R)
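These definitions in code, as a small helper (the counts in the example call are invented for illustration):

    def precision_recall_f(tp, fp, fn, beta=1.0):
        # Precision: fraction of selected items that are correct.
        p = tp / (tp + fp)
        # Recall: fraction of correct items that are selected.
        r = tp / (tp + fn)
        # Weighted harmonic mean; beta=1 gives the balanced F1 = 2PR/(P+R).
        f = (beta**2 + 1) * p * r / (beta**2 * p + r)
        return p, r, f

    print(precision_recall_f(tp=90, fp=10, fn=10))  # (0.9, 0.9, 0.9)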
Text Classification and Naïve Bayes: Text Classification: Evaluation

More Than Two Classes: Sets of binary classifiers (Sec. 14.5)
• Dealing with any-of or multivalue classification
  • A document can belong to 0, 1, or >1 classes.
• For each class c ∈ C
  • Build a classifier γc to distinguish c from all other classes c′ ∈ C
• Given test doc d,
  • Evaluate it for membership in each class using each γc
  • d belongs to any class for which γc returns true

More Than Two Classes: Sets of binary classifiers (Sec. 14.5)
• One-of or multinomial classification
  • Classes are mutually exclusive: each document in exactly one class
• For each class c ∈ C
  • Build a classifier γc to distinguish c from all other classes c′ ∈ C
• Given test doc d,
  • Evaluate it for membership in each class using each γc
  • d belongs to the one class with maximum score

Evaluation: Classic Reuters-21578 Data Set (Sec. 15.2.4)
• Most (over)used data set, 21,578 docs (each averaging 90 types, 200 tokens)
• 9,603 training, 3,299 test articles (ModApte/Lewis split)
• 118 categories
  • An article can be in more than one category
  • Learn 118 binary category distinctions
• Average document (with at least one category) has 1.24 classes
• Only about 10 out of 118 categories are large
• Common categories (#train, #test): Earn (2877, 1087), Acquisitions (1650, 179), Money-fx (538, 179), Grain (433, 149), Crude (389, 189), Trade (369, 119), Interest (347, 131), Ship (197, 89), Wheat (212, 71), Corn (182, 56)

Reuters Text Categorization data set (Reuters-21578) document (Sec. 15.2.4)

  2-MAR-1987 16:51:43.42  livestock hog
  AMERICAN PORK CONGRESS KICKS OFF TOMORROW
  CHICAGO, March 2 - The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

Confusion matrix c
• For each pair of classes (c1, c2): how many documents from c1 were incorrectly assigned to c2?
  • e.g., c3,2: 90 wheat documents incorrectly assigned to poultry

  Docs in test set   Assigned UK   Assigned poultry   Assigned wheat   Assigned coffee   Assigned interest   Assigned trade
  True UK            95            1                  13               0                 1                   0
  True poultry       0             1                  0                0                 0                   0
  True wheat         10            90                 0                1                 0                   0
  True coffee        0             0                  0                34                3                   7
  True interest      -             1                  2                13                26                  5
  True trade         0             0                  2                14                5                   10

Per class evaluation measures (Sec. 15.2.4)
• Recall: fraction of docs in class i classified correctly:

  Recall_i = c_ii / Σ_j c_ij

• Precision: fraction of docs assigned class i that are actually about class i:

  Precision_i = c_ii / Σ_j c_ji

• Accuracy (1 − error rate): fraction of docs classified correctly:

  Accuracy = Σ_i c_ii / Σ_i Σ_j c_ij
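Per-class precision and recall from a confusion matrix, as a sketch in pure Python (rows are true classes, columns are assigned classes, following the convention above; the 2-class matrix is invented for illustration):

    def per_class_measures(c, i):
        # c[i][j]: number of docs of true class i assigned to class j.
        recall = c[i][i] / sum(c[i][j] for j in range(len(c)))
        precision = c[i][i] / sum(c[j][i] for j in range(len(c)))
        return precision, recall

    c = [[95, 5],
         [10, 90]]
    print(per_class_measures(c, 0))  # precision ~0.905, recall 0.95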
Micro- vs. Macro-Averaging (Sec. 15.2.4)
• If we have more than one class, how do we combine multiple performance measures into one quantity?
• Macroaveraging: compute performance for each class, then average.
• Microaveraging: collect decisions for all classes, compute one contingency table, evaluate.

Micro- vs. Macro-Averaging: Example (Sec. 15.2.4)

  Class 1              Truth: yes   Truth: no
  Classifier: yes      10           10
  Classifier: no       10           970

  Class 2              Truth: yes   Truth: no
  Classifier: yes      90           10
  Classifier: no       10           890

  Micro Ave. Table     Truth: yes   Truth: no
  Classifier: yes      100          20
  Classifier: no       20           1860

• Macroaveraged precision: (0.5 + 0.9) / 2 = 0.7
• Microaveraged precision: 100/120 ≈ 0.83
• The microaveraged score is dominated by the score on common classes
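Reproducing the example, as a quick sketch (each class contributes a (tp, fp) pair taken from the two per-class tables above):

    # (tp, fp) for each class, from the tables above.
    per_class = [(10, 10), (90, 10)]

    # Macroaverage: average the per-class precisions.
    macro_p = sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)

    # Microaverage: pool the counts, then compute one precision.
    micro_tp = sum(tp for tp, _ in per_class)
    micro_fp = sum(fp for _, fp in per_class)
    micro_p = micro_tp / (micro_tp + micro_fp)

    print(macro_p, micro_p)  # 0.7 and ~0.83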
Development Test Sets and Cross-validation
• Metric: P/R/F1 or Accuracy
• Unseen test set
  • avoid overfitting ('tuning to the test set')
  • more conservative estimate of performance
• Cross-validation over multiple splits
  • Handle sampling errors from different datasets
  • Pool results over each split
  • Compute pooled dev set performance

[Figure: the data divided into Training set / Development Test Set / Test Set, with the Dev Test portion rotated across multiple splits.]

Text Classification and Naïve Bayes: Text Classification: Practical Issues

The Real World (Sec. 15.3.1)
• Gee, I'm building a text classifier for real, now!
• What should I do?

No training data? Manually written rules (Sec. 15.3.1)

  If (wheat or grain) and not (whole or bread) then categorize as grain

• Need careful crafting
  • Human tuning on development data
  • Time-consuming: 2 days per class

Very little data? (Sec. 15.3.1)
• Use Naïve Bayes
  • Naïve Bayes is a "high-bias" algorithm (Ng and Jordan 2002 NIPS)
• Get more labeled data
  • Find clever ways to get humans to label data for you
• Try semi-supervised training methods:
  • Bootstrapping, EM over unlabeled documents, …

A reasonable amount of data? (Sec. 15.3.1)
• Perfect for all the clever classifiers
  • SVM
  • Regularized logistic regression
• You can even use user-interpretable decision trees
  • Users like to hack
  • Management likes quick fixes

A huge amount of data? (Sec. 15.3.1)
• Can achieve high accuracy!
• At a cost:
  • SVMs (train time) or kNN (test time) can be too slow
  • Regularized logistic regression can be somewhat better
• So Naïve Bayes can come back into its own again!

Accuracy as a function of data size (Sec. 15.3.1)
• With enough data, the classifier may not matter (Brill and Banko on spelling correction)

Real-world systems generally combine:
• Automatic classification
• Manual review of uncertain/difficult/"new" cases

Underflow Prevention: log space
• Multiplying lots of probabilities can result in floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to sum logs of probabilities instead of multiplying probabilities.
• The class with the highest unnormalized log probability score is still the most probable:

  cNB = argmax_{cj∈C} [ log P(cj) + Σ_{i∈positions} log P(xi|cj) ]

• The model is now just a max of a sum of weights.

How to tweak performance (Sec. 15.3.2)
• Domain-specific features and weights: very important in real performance
• Sometimes need to collapse terms:
  • Part numbers, chemical formulas, …
  • But stemming generally doesn't help
• Upweighting: counting a word as if it occurred twice:
  • title words (Cohen & Singer 1996)
  • first sentence of each paragraph (Murata, 1999)
  • in sentences that contain title words (Ko et al., 2002)
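To close the loop on underflow prevention: the classify step from the earlier training sketch, redone in log space (same assumptions as before: whitespace tokenization, unseen words dropped; a sketch, not reference code):

    import math

    def classify_nb_log(text, prior, likelihood, vocab, classes):
        words = [w for w in text.lower().split() if w in vocab]
        def log_score(c):
            # Sum of log probabilities: a max over sums of weights,
            # immune to floating-point underflow on long documents.
            return math.log(prior[c]) + sum(math.log(likelihood[c][w])
                                            for w in words)
        return max(classes, key=log_score)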