Vector space classification (Chapter 14)
Algorithm 1 (Rocchio classification)
function Train-Rocchio(C, D)
    for all $c_j \in C$ do
        $D_j \leftarrow \{d : \langle d, c_j \rangle \in D\}$
        $\vec{\mu}_j \leftarrow \frac{1}{|D_j|} \sum_{d \in D_j} \vec{v}(d)$
    end for
    return $\{\vec{\mu}_1, \ldots, \vec{\mu}_J\}$
end function

function Apply-Rocchio($\{\vec{\mu}_1, \ldots, \vec{\mu}_J\}$, $d$)
    return $\arg\min_j |\vec{\mu}_j - \vec{v}(d)|$
end function
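The Rocchio pseudocode can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function names, the plain-list vector representation, and the toy training data are assumptions made for the example. Classes are derived from the labels found in the training data rather than passed in separately, and distance is taken to be Euclidean.

```python
from collections import defaultdict
import math


def train_rocchio(labeled_docs):
    """Compute one centroid per class from (vector, label) pairs."""
    by_class = defaultdict(list)
    for vector, label in labeled_docs:
        by_class[label].append(vector)

    centroids = {}
    for label, vectors in by_class.items():
        # Component-wise mean of all document vectors in the class.
        centroids[label] = [sum(column) / len(vectors) for column in zip(*vectors)]
    return centroids


def apply_rocchio(centroids, vector):
    """Return the label of the centroid closest to the document vector."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))


# Hypothetical toy data: two classes in a two-dimensional space.
training = [
    ([0.0, 0.0], "a"),
    ([0.0, 2.0], "a"),
    ([4.0, 0.0], "b"),
    ([6.0, 0.0], "b"),
]
centroids = train_rocchio(training)
prediction = apply_rocchio(centroids, [1.0, 1.0])
```

Training is a single pass over the data, and classification compares the document against one centroid per class, so the cost of applying the classifier does not grow with the size of the training set.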
Algorithm 2 (𝑘 nearest neighbor classification)
function Train-kNN(C, D)
    $D' \leftarrow$ Preprocess(D)
    $k \leftarrow$ Select-k(C, $D'$)
    return $D'$, $k$
end function

function Apply-kNN(C, $D'$, $k$, $d$)
    $S_k \leftarrow$ ComputeNearestNeighbors($D'$, $k$, $d$)
    for all $c_j \in C$ do
        $p_j \leftarrow |S_k \cap c_j| / k$
    end for
    return $\arg\max_j p_j$
end function
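A comparable Python sketch of the kNN pseudocode is given below. It rests on several assumptions that the pseudocode leaves open: preprocessing is skipped, k is supplied by the caller instead of being selected automatically, distance is Euclidean, and ties in the vote go to the label encountered first among the nearest neighbors. The function names and toy data are hypothetical.

```python
from collections import Counter
import math


def train_knn(labeled_docs, k):
    """kNN fits no model: it only stores the training set and the chosen k."""
    return list(labeled_docs), k


def apply_knn(training, k, vector):
    """Classify a vector by majority vote among its k nearest training documents."""
    # S_k: the k training documents closest to the query vector.
    neighbors = sorted(training, key=lambda pair: math.dist(pair[0], vector))[:k]
    votes = Counter(label for _, label in neighbors)
    # p_j = votes[j] / k, so the arg max over p_j equals the arg max over raw counts.
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two classes in a two-dimensional space.
training, k = train_knn(
    [
        ([0.0, 0.0], "a"),
        ([1.0, 0.0], "a"),
        ([0.0, 1.0], "a"),
        ([5.0, 5.0], "b"),
        ([6.0, 5.0], "b"),
    ],
    k=3,
)
prediction = apply_knn(training, k, [1.0, 1.0])
```

Unlike Rocchio, all the work happens at classification time: each query is compared against every stored training document, which is why this sketch sorts the whole training set per query.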
Exercise 14/1
What is the contiguity hypothesis?
Exercise 14/2
Discuss the main idea behind Rocchio classification. How does Rocchio classification
differ from our linear classifier from exercises 13/3 and 13/4 in the previous seminar?
Exercise 14/3
Discuss the main idea behind 𝑘 nearest neighbor (𝑘NN) classification. How large a 𝑘
(how many neighbors) should we use?
Exercise 14/4
Build Rocchio and 1NN classifiers for the training set {([1, 1], 1), ([2, 0], 1), ([2, 3], 2)} and
classify the document 𝑞 = [1, 2]. Do the classifiers agree?