Text Classification
and Naïve Bayes
The Task of Text Classification
Dan Jurafsky
Is this spam?
Who wrote which Federalist papers?
• 1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
• Authorship of 12 of the letters in dispute
• 1963: solved by Mosteller and Wallace using Bayesian methods
(Pictured: James Madison, Alexander Hamilton)
Male or female author?
1. By 1925 present-day Vietnam was divided into three parts under French colonial rule. The southern region embracing Saigon and the Mekong delta was the colony of Cochin-China; the central area with its imperial capital at Hue was the protectorate of Annam…
2. Clara never failed to be astonished by the extraordinary felicity of her own name. She found it hard to trust herself to the mercy of fate, which had managed over the years to convert her greatest shame into one of her greatest assets…
S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, 2003. “Gender, Genre, and Writing Style in Formal Written Texts,” Text, volume 23, number 3, pp. 321–346
Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.
What is the subject of this article?
MEDLINE Article → ?
MeSH Subject Category Hierarchy:
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …
Text Classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …
Text Classification: definition
• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
• Output: a predicted class c ∈ C
Classification Methods: Hand-coded rules
• Rules based on combinations of words or other features
  • spam: black-list-address OR (“dollars” AND “have been selected”)
• Accuracy can be high
  • If rules carefully refined by expert
• But building and maintaining these rules is expensive
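The spam rule above can be written down directly as a predicate. This is only a minimal sketch: the function name `is_spam`, the example address, and the lower-casing of the text are assumptions made for illustration, not part of the slide.

```python
def is_spam(sender: str, body: str, blacklist: set[str]) -> bool:
    """Hand-coded rule: black-list-address OR ("dollars" AND "have been selected")."""
    text = body.lower()
    on_blacklist = sender.lower() in blacklist
    phrase_match = "dollars" in text and "have been selected" in text
    return on_blacklist or phrase_match


# Hypothetical black-listed address, used only for this example.
blacklist = {"promo@example.com"}

print(is_spam("friend@example.org", "You have been selected to receive dollars", blacklist))
print(is_spam("friend@example.org", "Lunch tomorrow?", blacklist))
```

Every new spam pattern needs another hand-written clause, which is where the maintenance cost mentioned on the slide comes from.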
Classification Methods: Supervised Machine Learning
• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1,c1), …, (dm,cm)
• Output:
  • a learned classifier γ: d → c
Classification Methods: Supervised Machine Learning
• Any kind of classifier
  • Naïve Bayes
  • Logistic regression
  • Support-vector machines
  • k-Nearest Neighbors
  • …
Naïve Bayes (I)
Naïve Bayes Intuition
• Simple (“naïve”) classification method based on Bayes rule
• Relies on very simple representation of document
  • Bag of words
The bag of words representation
γ( I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet. ) = c
The bag of words representation: using a subset of words
γ( love, sweet, satirical, great, fun, whimsical, romantic, laughing, recommend, several, happy, again; all other words masked out ) = c
The bag of words representation
γ( word counts ) = c
word        count
great       2
love        2
recommend   1
laugh       1
happy       1
...         ...
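The count table above is just a multiset of tokens. As a minimal sketch, it can be built with `collections.Counter`; the regular-expression tokenizer and the shortened review text are assumptions for illustration.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-case the text, split it into word tokens, and count them (order is discarded)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Shortened, hypothetical review used only for this example.
review = "I love this movie! The dialogue is great and I would recommend it. Great fun."
bag = bag_of_words(review)
print(bag["great"], bag["love"], bag["recommend"])
```

Any two documents with the same word counts get the same representation, whatever the word order.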
Bag of words for document classification
Test document: parser, language, label, translation, … → ?
Candidate classes and their words:
• Machine Learning: learning, training, algorithm, shrinkage, network...
• NLP: parser, tag, training, translation, language...
• Garbage Collection: garbage, collection, memory, optimization, region...
• Planning: planning, temporal, reasoning, plan, language...
• GUI: ...
Formalizing the Naïve Bayes Classifier
Bayes’ Rule Applied to Documents and Classes
• For a document d and a class c
\[ P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)} \]
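Naïve Bayes Classifier (I)
\[
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(c \mid d) && \text{MAP is “maximum a posteriori” = most likely class} \\
&= \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} && \text{Bayes Rule} \\
&= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) && \text{Dropping the denominator}
\end{aligned}
\]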
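Naïve Bayes Classifier (II)
\[
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) \\
&= \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c) && \text{Document } d \text{ represented as features } x_1..x_n
\end{aligned}
\]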
Naïve Bayes Classifier (IV)
\[ c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c) \]
• P(c): How often does this class occur?
  • We can just count the relative frequencies in a corpus
• P(x1, x2, …, xn | c): O(|X|^n · |C|) parameters
  • Could only be estimated if a very, very large number of training examples was available.
Multinomial Naïve Bayes Independence Assumptions
\[ P(x_1, x_2, \ldots, x_n \mid c) \]
• Bag of Words assumption: Assume position doesn’t matter
• Conditional Independence: Assume the feature probabilities P(xi | cj) are independent given the class c.
\[ P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdot \ldots \cdot P(x_n \mid c) \]
Multinomial Naïve Bayes Classifier
\[ c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c) \]
\[ c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{x \in X} P(x \mid c) \]
Applying Multinomial Naive Bayes Classifiers to Text Classification
positions ← all word positions in test document
\[ c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j) \]
Naïve Bayes: Learning
Learning the Multinomial Naïve Bayes Model
• First attempt: maximum likelihood estimates
  • simply use the frequencies in the data
\[ \hat{P}(c_j) = \frac{doccount(C = c_j)}{N_{doc}} \]
\[ \hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)} \]
Sec. 13.3
Parameter estimation
\[ \hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)} \]
fraction of times word wi appears among all words in documents of topic cj
• Create mega-document for topic j by concatenating all docs in this topic
  • Use frequency of w in mega-document
Problem with Maximum Likelihood
• What if we have seen no training documents with the word fantastic and classified in the topic positive (thumbs-up)?
\[ \hat{P}(\text{“fantastic”} \mid \text{positive}) = \frac{count(\text{“fantastic”}, \text{positive})}{\sum_{w \in V} count(w, \text{positive})} = 0 \]
• Zero probabilities cannot be conditioned away, no matter the other evidence!
\[ c_{MAP} = \operatorname*{argmax}_{c} \hat{P}(c) \prod_{i} \hat{P}(x_i \mid c) \]
Sec. 13.3
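To see why a single zero is fatal, the short sketch below multiplies a prior by per-word estimates, one of which is zero. All numbers are hypothetical and chosen only to illustrate the product in the formula above.

```python
import math

# Hypothetical maximum likelihood estimates P(w | positive); "fantastic" was never seen.
p_word_given_positive = {"great": 0.03, "plot": 0.02, "fantastic": 0.0}
prior_positive = 0.5  # hypothetical P(positive)

document = ["great", "plot", "fantastic"]
score = prior_positive * math.prod(p_word_given_positive[w] for w in document)
print(score)  # the zero factor wipes out all other evidence
```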
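Laplace (add-1) smoothing for Naïve Bayes
Maximum likelihood estimate:
\[ \hat{P}(w_i \mid c) = \frac{count(w_i, c)}{\sum_{w \in V} count(w, c)} \]
Add-1 estimate:
\[ \hat{P}(w_i \mid c) = \frac{count(w_i, c) + 1}{\sum_{w \in V} \left( count(w, c) + 1 \right)} = \frac{count(w_i, c) + 1}{\left( \sum_{w \in V} count(w, c) \right) + |V|} \]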
Multinomial Naïve Bayes: Learning
• From training corpus, extract Vocabulary
• Calculate P(cj) terms
  • For each cj in C do
    docsj ← all docs with class = cj
    P(cj) ← |docsj| / |total # documents|
• Calculate P(wk | cj) terms
  • Textj ← single doc containing all docsj
  • For each word wk in Vocabulary
    nk ← # of occurrences of wk in Textj
    P(wk | cj) ← (nk + α) / (n + α |Vocabulary|)
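The learning procedure above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `train_multinomial_nb`, the `(tokens, label)` input format, the toy documents, and reading n as the total number of tokens in Textj are all assumptions.

```python
from collections import Counter, defaultdict


def train_multinomial_nb(docs, alpha=1.0):
    """docs: list of (tokens, label) pairs. Returns (priors, cond_probs, vocabulary)."""
    vocabulary = {w for tokens, _ in docs for w in tokens}
    docs_per_class = Counter(label for _, label in docs)

    # text_j[c] holds word counts of the "mega-document" made of all docs of class c.
    text_j = defaultdict(Counter)
    for tokens, label in docs:
        text_j[label].update(tokens)

    priors, cond_probs = {}, {}
    for c, n_docs in docs_per_class.items():
        priors[c] = n_docs / len(docs)
        n = sum(text_j[c].values())  # assumed meaning of n: total tokens in Text_j
        cond_probs[c] = {
            w: (text_j[c][w] + alpha) / (n + alpha * len(vocabulary))
            for w in vocabulary
        }
    return priors, cond_probs, vocabulary


# Hypothetical toy training set.
training_docs = [
    ("great fun".split(), "pos"),
    ("great great film".split(), "pos"),
    ("boring film".split(), "neg"),
]
priors, cond_probs, vocabulary = train_multinomial_nb(training_docs)
print(priors["pos"], cond_probs["pos"]["great"], cond_probs["neg"]["great"])
```

With alpha = 1 this reduces to add-1 smoothing, so every vocabulary word gets a non-zero probability in every class.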
Laplace (add-1) smoothing: unknown words
Add one extra word to the vocabulary, the “unknown word” wu
\[ \hat{P}(w_u \mid c) = \frac{count(w_u, c) + 1}{\left( \sum_{w \in V} count(w, c) \right) + |V| + 1} = \frac{1}{\left( \sum_{w \in V} count(w, c) \right) + |V| + 1} \]
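A small sketch of add-1 smoothing with the extra unknown-word slot follows. The token name `<UNK>`, the function name, and the counts are hypothetical; the denominator follows the formula above, with |V| + 1 added to the class token count.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical name for the extra "unknown word" w_u


def add_one_prob(word, counts, vocab):
    """Add-1 estimate of P(word | c) over vocab plus one unknown-word slot."""
    total = sum(counts[w] for w in vocab)
    key = word if word in vocab else UNK  # any out-of-vocabulary word maps to w_u
    return (counts[key] + 1) / (total + len(vocab) + 1)


# Hypothetical counts for one class.
counts = Counter({"great": 3, "fun": 2})
vocab = {"great", "fun", "boring"}

print(add_one_prob("great", counts, vocab))
print(add_one_prob("fantastic", counts, vocab))  # unseen word, count 0
```

The probabilities of the vocabulary words plus the unknown word still sum to one, so the estimate remains a proper distribution.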
Naïve Bayes: Relationship to Language Modeling
Generative Model for Multinomial Naïve Bayes
c=China
X1=Shanghai X2=and X3=Shenzhen X4=issue X5=bonds
Naïve Bayes and Language Modeling
• Naïve Bayes classifiers can use any sort of feature
  • URL, email address, dictionaries, network features
• But if, as in the previous slides
  • We use only word features
  • we use all of the words in the text (not a subset)
• Then
  • Naïve Bayes has an important similarity to language modeling.
Each class = a unigram language model
• Assigning each word: P(word | c)
• Assigning each sentence: P(s | c) = Π P(word | c)
Class pos:
P(word | pos)   word
0.1             I
0.1             love
0.01            this
0.05            fun
0.1             film
…
s:              I     love   this   fun    film
P(word | pos):  0.1   0.1    0.01   0.05   0.1
P(s | pos) = 0.0000005
Sec. 13.2.1
Naïve Bayes as a Language Model
• Which class assigns the higher probability to s?
word   Model pos   Model neg
I      0.1         0.2
love   0.1         0.001
this   0.01        0.01
fun    0.05        0.005
film   0.1         0.1
s = I love this fun film
P(s|pos) > P(s|neg)
Sec. 13.2.1
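The comparison above is a product of unigram probabilities under each class model. The sketch below uses the probabilities from the two tables; the function and variable names are made up for illustration.

```python
import math

# Unigram probabilities P(word | class) taken from the two models on the slide.
model_pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
model_neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}


def sentence_probability(sentence, model):
    """P(s | c) as the product of P(word | c) over the words of s."""
    return math.prod(model[word] for word in sentence.split())


s = "I love this fun film"
p_pos = sentence_probability(s, model_pos)
p_neg = sentence_probability(s, model_neg)
print(p_pos, p_neg, "pos" if p_pos > p_neg else "neg")
```

Under these numbers the positive model assigns the sentence roughly 5e-7 and the negative model roughly 1e-9, so the positive class wins before priors are considered.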
Multinomial Naïve Bayes: A Worked Example
A Worked Example
\[ \hat{P}(c) = \frac{N_c}{N} \qquad \hat{P}(w \mid c) = \frac{count(w, c) + 1}{count(c) + |V|} \]
           Doc   Words                                 Class
Training   1     Chinese Beijing Chinese               c
           2     Chinese Chinese Shanghai              c
           3     Chinese Macao                         c
           4     Tokyo Japan Chinese                   j
Test       5     Chinese Chinese Chinese Tokyo Japan   ?
Priors:
P(c) = 3/4
P(j) = 1/4
Conditional Probabilities:
P(Chinese|c) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo|c) = (0+1) / (8+6) = 1/14
P(Japan|c) = (0+1) / (8+6) = 1/14
P(Chinese|j) = (1+1) / (3+6) = 2/9
P(Tokyo|j) = (1+1) / (3+6) = 2/9
P(Japan|j) = (1+1) / (3+6) = 2/9
Choosing a class:
P(c|d5) ∝ 3/4 * (3/7)^3 * 1/14 * 1/14 ≈ 0.0003
P(j|d5) ∝ 1/4 * (2/9)^3 * 2/9 * 2/9 ≈ 0.0001
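The arithmetic of the worked example can be checked with exact fractions. This sketch recomputes the priors, the add-1 conditional probabilities, and the two class scores from the training table; the variable and function names are illustrative.

```python
from collections import Counter
from fractions import Fraction

training = [
    ("Chinese Beijing Chinese", "c"),
    ("Chinese Chinese Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "j"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan".split()

vocab = {w for text, _ in training for w in text.split()}
word_counts = {"c": Counter(), "j": Counter()}
doc_counts = Counter()
for text, label in training:
    word_counts[label].update(text.split())
    doc_counts[label] += 1


def score(label):
    """Prior times the add-1 smoothed probability of every token in the test document."""
    total_tokens = sum(word_counts[label].values())
    result = Fraction(doc_counts[label], len(training))
    for w in test_doc:
        result *= Fraction(word_counts[label][w] + 1, total_tokens + len(vocab))
    return result


scores = {label: score(label) for label in doc_counts}
prediction = max(scores, key=scores.get)
print({label: float(s) for label, s in scores.items()}, prediction)
```

The scores come out to about 0.0003 for c and 0.0001 for j, so document 5 is assigned to class c.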
Naïve Bayes in Spam Filtering
• SpamAssassin Features:
  • Mentions Generic Viagra
  • Online Pharmacy
  • Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)
  • Phrase: impress ... girl
  • From: starts with many numbers
  • Subject is all capitals
  • HTML has a low ratio of text to image area
  • One hundred percent guaranteed
  • Claims you can be removed from the list
  • 'Prestigious Non-Accredited Universities'
  • http://spamassassin.apache.org/tests_3_3_x.html
Summary: Naive Bayes is Not So Naive
• Very fast, low storage requirements
• Robust to irrelevant features
  Irrelevant features cancel each other without affecting results
• Very good in domains with many equally important features
  Decision trees suffer from fragmentation in such cases, especially if little data
• Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
• A good dependable baseline for text classification
  • But we will see other classifiers that give better accuracy
Precision, Recall, and the F measure
The 2-by-2 contingency table
               correct   not correct
selected       tp        fp
not selected   fn        tn
Precision and recall
• Precision: % of selected items that are correct
• Recall: % of correct items that are selected
               correct   not correct
selected       tp        fp
not selected   fn        tn
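In terms of the table cells, precision is tp / (tp + fp) and recall is tp / (tp + fn). A minimal sketch, with hypothetical counts and a zero-denominator guard that is an added convention rather than something the slide specifies:

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of selected items that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of correct items that are selected."""
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical contingency counts.
tp, fp, fn = 8, 2, 8
print(precision(tp, fp), recall(tp, fn))
```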
A combined measure: F
• A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean):
\[ F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R} \]
• The harmonic mean is a very conservative average; see IIR § 8.3
• People usually use balanced F1 measure
  • i.e., with β = 1 (that is, α = ½): F = 2PR/(P+R)
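The two forms of the F measure can be checked against each other numerically. This sketch implements the β form and, for comparison, the α form with α = 1/(β² + 1); the function names and the example values of P and R are illustrative.

```python
def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision p and recall r (beta form)."""
    if p == 0.0 and r == 0.0:
        return 0.0  # convention chosen here to avoid dividing by zero
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)


def f_measure_alpha(p: float, r: float, alpha: float) -> float:
    """Same measure written as 1 / (alpha/P + (1 - alpha)/R); needs p, r > 0."""
    return 1.0 / (alpha / p + (1 - alpha) / r)


p, r = 0.8, 0.5  # hypothetical precision and recall
print(f_measure(p, r))            # balanced F1 = 2PR / (P + R)
print(f_measure(p, r, beta=2.0))  # weights recall more heavily
```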
Text Classification: Evaluation
More Than Two Classes: Sets of binary classifiers
• Dealing with any-of or multivalue classification
  • A document can belong to 0, 1, or >1 classes.
• For each class c ∈ C
  • Build a classifier γc to distinguish c from all other classes c’ ∈ C
• Given test doc d,
  • Evaluate it for membership in each class using each γc
  • d belongs to any class for which γc returns true
Sec. 14.5
More Than Two Classes: Sets of binary classifiers
• One-of or multinomial classification
  • Classes are mutually exclusive: each document in exactly one class
• For each class c ∈ C
  • Build a classifier γc to distinguish c from all other classes c’ ∈ C
• Given test doc d,
  • Evaluate it for membership in each class using each γc
  • d belongs to the one class with maximum score
Sec. 14.5
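The two decision rules differ only in how the per-class outputs are combined. In the sketch below, the per-class scores stand in for the outputs of the binary classifiers γc; the score values and the 0.5 threshold used to decide when γc "returns true" are assumptions for illustration.

```python
def any_of(scores: dict, threshold: float = 0.5) -> set:
    """Any-of: d belongs to every class whose binary classifier fires."""
    return {c for c, s in scores.items() if s >= threshold}


def one_of(scores: dict) -> str:
    """One-of: d belongs to the single class with the maximum score."""
    return max(scores, key=scores.get)


# Hypothetical scores of one test document under three binary classifiers.
scores = {"wheat": 0.9, "grain": 0.7, "trade": 0.2}
print(any_of(scores))  # may contain zero, one, or several classes
print(one_of(scores))  # always exactly one class
```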
Evaluation: Classic Reuters-21578 Data Set
• Most (over)used data set, 21,578 docs (each 90 types, 200 tokens)
• 9603 training, 3299 test articles (ModApte/Lewis split)
• 118 categories
  • An article can be in more than one category
  • Learn 118 binary category distinctions
• Average document (with at least one category) has 1.24 classes
• Only about 10 out of 118 categories are large
Common categories (#train, #test)
• Earn (2877, 1087)
• Acquisitions (1650, 179)
• Money-fx (538, 179)
• Grain (433, 149)
• Crude (389, 189)
• Trade (369, 119)
• Interest (347, 131)
• Ship (197, 89)
• Wheat (212, 71)
• Corn (182, 56)
Sec. 15.2.4
Reuters Text Categorization data set (Reuters-21578) document
2-MAR-1987 16:51:43.42
livestock hog
AMERICAN PORK CONGRESS KICKS OFF TOMORROW
CHICAGO, March 2 - The American Pork Congress kicks off tomorrow,
March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions
on a number of issues, according to the National Pork Producers Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future
direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to
endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry,
the NPPC added. Reuter
Sec. 15.2.4
Confusion matrix c
• For each pair of classes, how many documents from c1 were incorrectly assigned to c2?
  • c3,2: 90 wheat documents incorrectly assigned to poultry
Docs in test set   Assigned UK   Assigned poultry   Assigned wheat   Assigned coffee   Assigned interest   Assigned trade
True UK            95            1                  13               0                 1                   0
True poultry       0             1                  0                0                 0                   0
True wheat         10            90                 0                1                 0                   0
True coffee        0             0                  0                34                3                   7
True interest      -             1                  2                13                26                  5
True trade         0             0                  2                14                5                   10
Per class evaluation measures
Recall: Fraction of docs in class i classified correctly:
\[ \frac{c_{ii}}{\sum_{j} c_{ij}} \]
Precision: Fraction of docs assigned class i that are actually about class i:
\[ \frac{c_{ii}}{\sum_{j} c_{ji}} \]
Accuracy: (1 - error rate) Fraction of docs classified correctly:
\[ \frac{\sum_{i} c_{ii}}{\sum_{j} \sum_{i} c_{ij}} \]
Sec. 15.2.4
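The three measures can be read straight off a confusion matrix stored as a list of rows. The sketch below uses the counts from the confusion-matrix slide, with rows as true classes and columns as assigned classes; reading the "-" cell in the interest row as 0 is an assumption, as are the function names.

```python
classes = ["UK", "poultry", "wheat", "coffee", "interest", "trade"]

# c[i][j] = number of docs whose true class is i and assigned class is j.
# The "-" entry in the original table is treated as 0 here (an assumption).
c = [
    [95, 1, 13, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [10, 90, 0, 1, 0, 0],
    [0, 0, 0, 34, 3, 7],
    [0, 1, 2, 13, 26, 5],
    [0, 0, 2, 14, 5, 10],
]


def recall(i: int) -> float:
    row_total = sum(c[i])  # all docs truly in class i
    return c[i][i] / row_total if row_total else 0.0


def precision(i: int) -> float:
    column_total = sum(row[i] for row in c)  # all docs assigned class i
    return c[i][i] / column_total if column_total else 0.0


def accuracy() -> float:
    correct = sum(c[i][i] for i in range(len(c)))
    return correct / sum(sum(row) for row in c)


for i, name in enumerate(classes):
    print(name, round(recall(i), 3), round(precision(i), 3))
print("accuracy", round(accuracy(), 3))
```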
Micro- vs. Macro-Averaging
• If we have more than one class, how do we combine multiple performance measures into one quantity?
• Macroaveraging: Compute performance for each class, then average.
• Microaveraging: Collect decisions for all classes, compute contingency table, evaluate.
Sec. 15.2.4
Micro- vs. Macro-Averaging: Example
Class 1
                  Truth: yes   Truth: no
Classifier: yes   10           10
Classifier: no    10           970
Class 2
                  Truth: yes   Truth: no
Classifier: yes   90           10
Classifier: no    10           890
Micro Ave. Table
                  Truth: yes   Truth: no
Classifier: yes   100          20
Classifier: no    20           1860
Sec. 15.2.4
• Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
• Microaveraged precision: 100/120 = .83
• Microaveraged score is dominated by score on common classes
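The two averages in this example can be reproduced from the per-class tables. A short sketch using the counts above; the dictionary layout and names are illustrative.

```python
# Per-class contingency tables from the example.
tables = {
    "class 1": {"tp": 10, "fp": 10, "fn": 10, "tn": 970},
    "class 2": {"tp": 90, "fp": 10, "fn": 10, "tn": 890},
}


def precision(table: dict) -> float:
    return table["tp"] / (table["tp"] + table["fp"])


# Macroaveraging: compute precision per class, then average the results.
macro_precision = sum(precision(t) for t in tables.values()) / len(tables)

# Microaveraging: pool the decisions into one table, then compute precision once.
pooled = {cell: sum(t[cell] for t in tables.values()) for cell in ("tp", "fp", "fn", "tn")}
micro_precision = precision(pooled)

print(pooled)
print(round(macro_precision, 2), round(micro_precision, 2))
```

Class 2 contributes 100 of the 120 positive decisions, which is why the microaverage sits close to its precision of 0.9.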
Development Test Sets and Cross-validation
• Metric: P/R/F1 or Accuracy
• Unseen test set
  • avoid overfitting (‘tuning to the test set’)
  • more conservative estimate of performance
• Cross-validation over multiple splits
  • Handle sampling errors from different datasets
  • Pool results over each split
  • Compute pooled dev set performance
[Figure: data divided into Training Set, Development Test Set, and Test Set; across cross-validation splits the Dev Test portion takes a different position within the Training Set]
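One way to produce the multiple splits is k-fold partitioning of the non-test data, where each fold serves once as the dev test set. This is a minimal sketch under that assumption; the function name and the fold layout (contiguous, unshuffled blocks) are choices made for illustration, and the final test set is assumed to have been set aside beforehand.

```python
def k_fold_splits(n_items: int, k: int):
    """Yield (train_indices, dev_indices) pairs; each item is in the dev set exactly once."""
    indices = list(range(n_items))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        dev = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, dev
        start += size


splits = list(k_fold_splits(10, 3))
for train, dev in splits:
    print("train:", train, "dev:", dev)
```

Pooling then means collecting the dev-set decisions from every split into one table before computing P/R/F1 or accuracy.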
Text Classification: Practical Issues
The Real World
• Gee, I’m building a text classifier for real, now!
• What should I do?
Sec. 15.3.1
No training data? Manually written rules
If (wheat or grain) and not (whole or bread) then categorize as grain
• Need careful crafting
  • Human tuning on development data
  • Time-consuming: 2 days per class
Sec. 15.3.1
Very little data?
• Use Naïve Bayes
  • Naïve Bayes is a “high-bias” algorithm (Ng and Jordan 2002 NIPS)
• Get more labeled data
  • Find clever ways to get humans to label data for you
• Try semi-supervised training methods:
  • Bootstrapping, EM over unlabeled documents, …
Sec. 15.3.1
A reasonable amount of data?
• Perfect for all the clever classifiers
  • SVM
  • Regularized Logistic Regression
• You can even use user-interpretable decision trees
  • Users like to hack
  • Management likes quick fixes
Sec. 15.3.1
A huge amount of data?
• Can achieve high accuracy!
• At a cost:
  • SVMs (train time) or kNN (test time) can be too slow
  • Regularized logistic regression can be somewhat better
• So Naïve Bayes can come back into its own again!
Sec. 15.3.1
Accuracy as a function of data size
• With enough data
  • Classifier may not matter
[Figure: Brill and Banko on spelling correction]
Sec. 15.3.1
Real-world systems generally combine:
• Automatic classification
• Manual review of uncertain/difficult/“new” cases
Underflow Prevention: log space
• Multiplying lots of probabilities can result in floating-point underflow.
• Since log(xy) = log(x) + log(y)
  • Better to sum logs of probabilities instead of multiplying probabilities.
• Class with highest un-normalized log probability score is still most probable.
\[ c_{NB} = \operatorname*{argmax}_{c_j \in C} \left[ \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \right] \]
• Model is now just max of sum of weights
How to tweak performance
• Domain-specific features and weights: very important in real performance
• Sometimes need to collapse terms:
  • Part numbers, chemical formulas, …
  • But stemming generally doesn’t help
• Upweighting: Counting a word as if it occurred twice:
  • title words (Cohen & Singer 1996)
  • first sentence of each paragraph (Murata, 1999)
  • In sentences that contain title words (Ko et al, 2002)
Sec. 15.3.2
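Returning to the underflow-prevention slide: the effect is easy to reproduce. In the sketch below, multiplying one hundred small probabilities underflows to zero in double precision, while the sum of their logs stays finite and can still be compared across classes. The probabilities, the prior, and the function name are hypothetical.

```python
import math


def log_score(prior: float, word_probs: list) -> float:
    """Un-normalized log probability: log P(c) + sum of log P(x_i | c)."""
    return math.log(prior) + sum(math.log(p) for p in word_probs)


# Hypothetical per-position probabilities for a 100-word document.
word_probs = [1e-5] * 100
prior = 0.5

naive_product = prior * math.prod(word_probs)  # true value 5e-501 is below float range
score = log_score(prior, word_probs)

print(naive_product)  # underflows to 0.0
print(score)          # finite, usable in an argmax over classes
```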