Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
PV211: Introduction to Information Retrieval
https://www.fi.muni.cz/~sojka/PV211
IIR 14: Vector Space Classiﬁcation
Handout version
Petr Sojka, Hinrich Schütze et al.
Faculty of Informatics, Masaryk University, Brno
Center for Information and Language Processing, University of Munich
2023-04-05
(compiled on 2023-04-05 11:55:58)
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 1 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Overview
1 Intro vector space classiﬁcation
2 Rocchio
3 kNN
4 Linear classiﬁers
5 > two classes
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 2 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Take-away today
Vector space classiﬁcation: Basic idea of doing text
classiﬁcation for documents that are represented as vectors
Rocchio classiﬁer: Rocchio relevance feedback idea applied to
text classiﬁcation
k nearest neighbor classiﬁcation
Linear classiﬁers
More than two classes
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 3 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Roadmap for today
Naive Bayes is simple and a good baseline.
Use it if you want to get a text classiﬁer up and running in a
hurry.
But other classiﬁcation methods are more accurate.
Perhaps the simplest well performing alternative: kNN
kNN is a vector space classiﬁer.
Plan for rest of today
1 intro vector space classiﬁcation
2 very simple vector space classiﬁcation: Rocchio
3 kNN
4 general properties of classiﬁers
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 5 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Recall vector space representation
Each document is a vector, one component for each term.
Terms are axes.
High dimensionality: 100,000s of dimensions
Normalize vectors (documents) to unit length
How can we do classiﬁcation in this space?
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 6 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Vector space classiﬁcation
As before, the training set is a set of documents, each labeled
with its class.
In vector space classiﬁcation, this set corresponds to a labeled
set of points or vectors in the vector space.
Premise 1: Documents in the same class form a contiguous
region.
Premise 2: Documents from diﬀerent classes don’t overlap.
We deﬁne lines, surfaces, hyper-surfaces to divide regions.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 7 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Classes in the vector space
xx
x
x
⋄
⋄
⋄⋄
⋄
⋄
China
Kenya
UK
⋆
Should the document ⋆ be assigned to China, UK or Kenya?
Find separators between the classes
Based on these separators: ⋆ should be assigned to China
How do we ﬁnd separators that do a good job at classifying new
documents like ⋆? – Main topic of today
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 8 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Aside: 2D/3D graphs can be misleading
d true
dprojected
x1
x2 x3 x4
x5
x′
1 x′
2 x′
3 x′
4 x′
5
x′
1 x′
2 x′
3
x′
4 x′
5
Left: A projection of the 2D semicircle to 1D. For the points x1, x2, x3, x4, x5 at x
coordinates −0.9, −0.2, 0, 0.2, 0.9 the distance |x2x3| ≈ 0.201 only diﬀers by 0.5%
from |x′
2x′
3| = 0.2; but |x1x3|/|x′
1x′
3| = dtrue/dprojected ≈ 1.06/0.9 ≈ 1.18 is an
example of a large distortion (18%) when projecting a large area. Right: The
corresponding projection of the 3D hemisphere to 2D.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 9 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Relevance feedback
In relevance feedback, the user marks documents as
relevant/non-relevant.
Relevant/non-relevant can be viewed as classes or categories.
For each document, the user decides which of these two
classes is correct.
The IR system then uses these class assignments to build a
better query (“model”) of the information need . . .
. . . and returns better documents.
Relevance feedback is a form of text classiﬁcation.
The notion of text classiﬁcation (TC) is very general and has
many applications within and beyond information retrieval.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 11 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Using Rocchio for vector space classiﬁcation
The principal diﬀerence between relevance feedback and text
classiﬁcation:
The training set is given as part of the input in text
classiﬁcation.
It is interactively created in relevance feedback.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 12 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Rocchio classiﬁcation: Basic idea
Compute a centroid for each class
The centroid is the average of all documents in the class.
Assign each test document to the class of its closest centroid.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 13 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Recall deﬁnition of centroid
µ(c) =
1
|Dc| d∈Dc
v(d)
where Dc is the set of all documents that belong to class c and
v(d) is the vector space representation of d.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 14 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Rocchio algorithm
TrainRocchio(C, D)
1 for each cj ∈ C
2 do Dj ← {d : d, cj ∈ D}
3 µj ← 1
|Dj | d∈Dj
v(d)
4 return {µ1, . . . , µJ }
ApplyRocchio({µ1, . . . , µJ}, d)
1 return arg minj |µj − v(d)|
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 15 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Rocchio illustrated: a1 = a2, b1 = b2, c1 = c2
xx
x
x
⋄
⋄
⋄
⋄
⋄
⋄
China
Kenya
UK
⋆ a1
a2
b1
b2
c1
c2
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 16 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Rocchio properties
Rocchio forms a simple representation for each class: the
centroid
We can interpret the centroid as the prototype of the class.
Classiﬁcation is based on similarity to / distance from
centroid/prototype.
Does not guarantee that classiﬁcations are consistent with the
training data!
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 17 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Time complexity of Rocchio
mode time complexity
training Θ(|D|Lave + |C||V |) ≈ Θ(|D|Lave)
testing Θ(La + |C|Ma) ≈ Θ(|C|Ma)
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 18 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Rocchio vs. Naive Bayes
In many cases, Rocchio performs worse than Naive Bayes.
One reason: Rocchio does not handle nonconvex, multimodal
classes correctly.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 19 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Rocchio cannot handle nonconvex, multimodal classes
a
a
a
a
a
a
a a
a
a
a
a
a
a
a a
aa
a
a
a
a
a
a
a
a
a
a
a
a a
aa
aa
a
a
aa
a
b
b
b
b
bb bb b
b
b
bb
b
b
b
b
b
b
X XA
B
o
Exercise: Why is Rocchio not
expected to do well for the
classiﬁcation task a vs. b here?
A is centroid of the a’s,
B is centroid of the b’s.
The point o is closer to
A than to B.
But o is a better ﬁt for
the b class.
A is a multimodal class
with two prototypes.
But in Rocchio we only
have one prototype.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 20 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
kNN classiﬁcation
kNN classiﬁcation is another vector space classiﬁcation
method.
It also is very simple and easy to implement.
kNN is more accurate (in most cases) than Naive Bayes and
Rocchio.
If you need to get a pretty accurate classiﬁer up and running
in a short time . . .
. . . and you don’t care about eﬃciency that much . . .
. . . use kNN.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 22 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
kNN classiﬁcation
kNN = k nearest neighbors
kNN classiﬁcation rule for k = 1 (1NN): Assign each test
document to the class of its nearest neighbor in the training
set.
1NN is not very robust – one document can be mislabeled or
atypical.
kNN classiﬁcation rule for k > 1 (kNN): Assign each test
document to the majority class of its k nearest neighbors in
the training set.
Rationale of kNN: contiguity hypothesis
We expect a test document d to have the same label as the
training documents located in the local region surrounding d.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 23 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Probabilistic kNN
Probabilistic version of kNN: P(c|d) = fraction of k neighbors
of d that are in c
kNN classiﬁcation rule for probabilistic kNN: Assign d to class
c with highest P(c|d)
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 24 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
kNN is based on Voronoi tessellation
x
x
x
x
x
x
x
x
x x
x
⋄
⋄
⋄
⋄
⋄
⋄
⋄
⋄⋄
⋄ ⋄
⋆
1NN, 3NN
classiﬁcation
decision for
star?
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 25 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
kNN algorithm
Train-kNN(C, D)
1 D′ ← Preprocess(D)
2 k ← Select-k(C, D′)
3 return D′, k
Apply-kNN(D′, k, d)
1 Sk ← ComputeNearestNeighbors(D′, k, d)
2 for each cj ∈ C(D′)
3 do pj ← |Sk ∩ cj|/k
4 return arg maxj pj
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 26 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Exercise
⋆
x
x
x
x
x
x
x
x
x
x
o
o
o
o
o
How is star classiﬁed by:
(i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio?
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 27 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Time complexity of kNN
kNN with preprocessing of training set
training Θ(|D|Lave)
testing Θ(La + |D|MaveMa) = Θ(|D|MaveMa)
kNN test time proportional to the size of the training set!
The larger the training set, the longer it takes to classify a
test document.
kNN is ineﬃcient for very large training sets.
Question: Can we divide up the training set into regions, so
that we only have to search in one region to do kNN
classiﬁcation for a given test document? (which perhaps
would give us better than linear time complexity)
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 28 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Curse of dimensionality
Our intuitions about space are based on the 3D world we live
in.
Intuition 1: some things are close by, some things are distant.
Intuition 2: we can carve up space into areas such that: within
an area things are close, distances between areas are large.
These two intuitions don’t necessarily hold for high
dimensions.
In particular: for a set of k uniformly distributed points, let
dmin be the smallest distance between any two points and
dmax be the largest distance between any two points.
Then
lim
d→∞
dmax − dmin
dmin
= 0
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 29 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Curse of dimensionality: Simulation
Simulate
lim
d→∞
dmax − dmin
dmin
= 0
Pick a dimensionality d
Generate 10 random points in the d-dimensional hypercube
(uniform distribution)
Compute all 45 distances
Compute dmax−dmin
dmin
We see that intuition 1 (some things are close, others are
distant) is not true for high dimensions.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 30 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Intuition 2: Space can be carved up
Intuition 2: we can carve up space into areas such that: within
an area things are close, distances between areas are large.
If this is true, then we have a simple and eﬃcient algorithm
for kNN.
To ﬁnd the k closest neighbors of data point
< x1, x2, . . . , xd > do the following.
Using binary search ﬁnd all data points whose ﬁrst dimension
is in [x1 − ǫ, x1 + ǫ]. This is O(log n) where n is the number of
data points.
Do this for each dimension, then intersect the d subsets.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 31 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Intuition 2: Space can be carved up
Size of data set n = 100
Again, assume uniform distribution in hypercube
Set ǫ = 0.05: we will look in an interval of length 0.1 for
neighbors on each dimension.
What is the probability that the nearest neighbor of a new
data point x is in this neighborhood in d = 1 dimension?
for d = 1: 1 − (1 − 0.1)100 ≈ 0.99997
In d = 2 dimensions?
for d = 2: 1 − (1 − 0.12)100 ≈ 0.63
In d = 3 dimensions?
for d = 3: 1 − (1 − 0.13)100 ≈ 0.095
In d = 4 dimensions?
for d = 4: 1 − (1 − 0.14)100 ≈ 0.0095
In d = 5 dimensions?
for d = 5: 1 − (1 − 0.15)100 ≈ 0.0009995
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 32 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Intuition 2: Space can be carved up
In d = 5 dimensions?
for d = 5: 1 − (1 − 0.15)100 ≈ 0.0009995
In other words: with enough dimensions, there is only one
“local” region that will contain the nearest neighbor with high
certainty: the entire search space.
We cannot carve up high-dimensional space into neat
neighborhoods . . .
. . . unless the “true” dimensionality is much lower than d.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 33 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
kNN: Discussion
No training necessary
But linear preprocessing of documents is as expensive as
training Naive Bayes.
We always preprocess the training set, so in reality training
time of kNN is linear.
kNN is very accurate if training set is large.
Optimality result: asymptotically zero error if Bayes rate is
zero.
But kNN can be very inaccurate if training set is small.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 34 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Linear classiﬁers
Deﬁnition:
A linear classiﬁer computes a linear combination or weighted
sum i wi xi of the feature values.
Classiﬁcation decision: i wi xi > θ?
. . . where θ (the threshold) is a parameter.
(First, we only consider binary classiﬁers.)
Geometrically, this corresponds to a line (2D), a plane (3D) or
a hyperplane (higher dimensionalities), the separator.
We ﬁnd this separator based on training set.
Methods for ﬁnding separator: Perceptron, Rocchio, Naive
Bayes – as we will explain on the next slides
Assumption: The classes are linearly separable.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 36 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
A linear classiﬁer in 1D
A linear classiﬁer in 1D is
a point described by the
equation w1d1 = θ
The point at θ/w1
Points (d1) with w1d1 ≥ θ
are in the class c.
Points (d1) with w1d1 < θ
are in the complement
class c.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 37 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
A linear classiﬁer in 2D
A linear classiﬁer in 2D is
a line described by the
equation w1d1 + w2d2 = θ
Example for a 2D linear
classiﬁer
Points (d1 d2) with
w1d1 + w2d2 ≥ θ are in
the class c.
Points (d1 d2) with
w1d1 + w2d2 < θ are in
the complement class c.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 38 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
A linear classiﬁer in 3D
A linear classiﬁer in 3D is
a plane described by the
equation
w1d1 + w2d2 + w3d3 = θ
Example for a 3D linear
classiﬁer
Points (d1 d2 d3) with
w1d1 + w2d2 + w3d3 ≥ θ
are in the class c.
Points (d1 d2 d3) with
w1d1 + w2d2 + w3d3 < θ
are in the complement
class c.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 39 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Rocchio as a linear classiﬁer
Rocchio is a linear classiﬁer deﬁned by:
M
i=1
wi di = wd = θ
where w is the normal vector µ(c1) − µ(c2) and
θ = 0.5 ∗ (|µ(c1)|2 − |µ(c2)|2).
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 40 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Naive Bayes as a linear classiﬁer
Multinomial Naive Bayes is a linear classiﬁer (in log space) deﬁned
by:
M
i=1
wi di = θ
where wi = log[ˆP(ti |c)/ˆP(ti |¯c)], di = number of occurrences of ti
in d, and θ = − log[ˆP(c)/ˆP(¯c)]. Here, the index i, 1 ≤ i ≤ M,
refers to terms of the vocabulary (not to positions in d as k did in
our original deﬁnition of Naive Bayes)
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 41 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
kNN is not a linear classiﬁer
x
x
x x
x
x x
x
x x
x
⋄
⋄
⋄
⋄
⋄
⋄
⋄
⋄⋄
⋄ ⋄
⋆
Classiﬁcation decision
based on majority of
k nearest neighbors.
The decision
boundaries between
classes are piecewise
linear . . .
. . . but they are in
general not linear
classiﬁers that can be
described as
M
i=1 wi di = θ.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 42 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Example of a linear two-class classiﬁer
ti wi d1i d2i ti wi d1i d2i
prime 0.70 0 1 dlrs -0.71 1 1
rate 0.67 1 0 world -0.35 1 0
interest 0.63 0 0 sees -0.33 0 0
rates 0.60 0 0 year -0.25 0 0
discount 0.46 1 0 group -0.24 0 0
bundesbank 0.43 0 0 dlr -0.24 0 0
This is for the class interest in Reuters-21578.
For simplicity: assume a simple 0/1 vector representation
d1: “rate discount dlrs world”
d2: “prime dlrs”
θ = 0
Exercise: Which class is d1 assigned to? Which class is d2 assigned to?
We assign document d1 “rate discount dlrs world” to interest since
wT d1 = 0.67 · 1 + 0.46 · 1 + (−0.71) · 1 + (−0.35) · 1 = 0.07 > 0 = θ.
We assign d2 “prime dlrs” to the complement class (not in interest) since
wT d2 = −0.01 ≤ θ.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 43 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Which hyperplane?
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 44 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Learning algorithms for vector space classiﬁcation
In terms of actual computation, there are two types of
learning algorithms.
(i) Simple learning algorithms that estimate the parameters of
the classiﬁer directly from the training data, often in one
linear pass.
Naive Bayes, Rocchio, kNN are all examples of this.
(ii) Iterative algorithms
Support vector machines
Perceptron (example available as PDF on website:
http://cislmu.org)
The best performing learning algorithms usually require
iterative learning.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 45 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Perceptron update rule
Randomly initialize linear separator w
Do until convergence:
Pick data point x
If sign(wT
x) is correct class (1 or -1): do nothing
Otherwise: w = w − sign(wT
x)x
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 46 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Perceptron
w
x
S
NO
YES
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 47 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Perceptron
w
x
x
S
NO
YES
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 48 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Perceptron
w
x
x
w + x
S S′
NO
YES
NOYES
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 49 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Perceptron
x
w + x
S′
NOYES
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 50 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Which hyperplane?
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 51 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Which hyperplane?
For linearly separable training sets: there are inﬁnitely many
separating hyperplanes.
They all separate the training set perfectly . . .
. . . but they behave diﬀerently on test data.
Error rates on new data are low for some, high for others.
How do we ﬁnd a low-error separator?
Perceptron: generally bad; Naive Bayes, Rocchio: ok; linear
SVM: good
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 52 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Linear classiﬁers: Discussion
Many common text classiﬁers are linear classiﬁers: Naive
Bayes, Rocchio, logistic regression, linear support vector
machines, etc.
Each method has a diﬀerent way of selecting the separating
hyperplane
Huge diﬀerences in performance on test documents
Can we get better performance with more powerful nonlinear
classiﬁers?
Not in general: A given amount of training data may suﬃce
for estimating a linear boundary, but not for estimating a
more complex nonlinear boundary.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 53 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
A nonlinear problem
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
Linear classiﬁer like Rocchio does badly on this task.
kNN will do well (assuming enough training data)
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 54 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Which classiﬁer do I use for a given TC problem?
Is there a learning method that is optimal for all text
classiﬁcation problems?
No, because there is a trade-oﬀ between bias and variance.
Factors to take into account:
How much training data is available?
How simple/complex is the problem? (linear vs. nonlinear
decision boundary)
How noisy is the problem?
How stable is the problem over time?
For an unstable problem, it’s better to use a simple and robust
classiﬁer.
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 55 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
How to combine hyperplanes for > 2 classes?
?
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 57 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
One-of problems
One-of or multiclass classiﬁcation
Classes are mutually exclusive.
Each document belongs to exactly one class.
Example: language of a document (assumption: no document
contains multiple languages)
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 58 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
One-of classiﬁcation with linear classiﬁers
Combine two-class linear classiﬁers as follows for one-of
classiﬁcation:
Run each classiﬁer separately
Rank classiﬁers (e.g., according to score)
Pick the class with the highest score
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 59 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Any-of problems
Any-of or multilabel classiﬁcation
A document can be a member of 0, 1, or many classes.
A decision on one class leaves decisions open on all other
classes.
A type of “independence” (but not statistical independence)
Example: topic classiﬁcation
Usually: make decisions on the region, on the subject area, on
the industry and so on “independently”
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 60 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Any-of classiﬁcation with linear classiﬁers
Combine two-class linear classiﬁers as follows for any-of
classiﬁcation:
Simply run each two-class classiﬁer separately on the test
document and assign document accordingly
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 61 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Take-away today
Vector space classiﬁcation: Basic idea of doing text
classiﬁcation for documents that are represented as vectors
Rocchio classiﬁer: Rocchio relevance feedback idea applied to
text classiﬁcation
k nearest neighbor classiﬁcation
Linear classiﬁers
More than two classes
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 62 / 63
Intro vector space classiﬁcation Rocchio kNN Linear classiﬁers > two classes
Resources
Chapter 13 of IIR (feature selection)
Chapter 14 of IIR
Resources at http://cislmu.org
Perceptron example
General overview of text classiﬁcation: Sebastiani (2002)
Text classiﬁcation chapter on decision tress and perceptrons:
Manning & Schütze (1999)
One of the best machine learning textbooks: Hastie, Tibshirani
& Friedman (2003)
Sojka, IIR Group: PV211: Vector Space Classiﬁcation 63 / 63