Vector space classification (Chapter 14)

Algorithm 1 (Rocchio classification)

    function Train-Rocchio($C$, $D$)
        for all $c_j \in C$ do
            $D_j \leftarrow \{d : \langle d, c_j \rangle \in D\}$
            $\vec{\mu}_j \leftarrow \frac{1}{|D_j|} \sum_{d \in D_j} \vec{v}(d)$
        end for
        return $\{\vec{\mu}_1, \ldots, \vec{\mu}_J\}$
    end function

    function Apply-Rocchio($\{\vec{\mu}_1, \ldots, \vec{\mu}_J\}$, $d$)
        return $\arg\min_j |\vec{\mu}_j - \vec{v}(d)|$
    end function

Algorithm 2 ($k$ nearest neighbor classification)

    function Train-kNN($C$, $D$)
        $D' \leftarrow$ Preprocess($D$)
        $k \leftarrow$ Select-k($C$, $D'$)
        return $D'$, $k$
    end function

    function Apply-kNN($C$, $D'$, $k$, $d$)
        $S_k \leftarrow$ ComputeNearestNeighbors($D'$, $k$, $d$)
        for all $c_j \in C$ do
            $p_j \leftarrow |S_k \cap c_j| / k$
        end for
        return $\arg\max_j p_j$
    end function

Exercise 14/1
What is the contiguity hypothesis?

Exercise 14/2
Discuss the main idea behind Rocchio classification. How does Rocchio classification differ from our linear classifier from exercises 13/3 and 13/4 in the previous seminar?

Exercise 14/3
Discuss the main idea behind $k$ nearest neighbor ($k$NN) classification. How large should $k$ be, that is, how many neighbors should we use?

Exercise 14/4
Build Rocchio and 1NN classifiers for the training set {([1, 1], 1), ([2, 0], 1), ([2, 3], 2)} and classify the document $q = [1, 2]$. Do the classifiers agree?
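
Algorithms 1 and 2 above can be tried out with a short Python sketch. It is one possible reading of the pseudocode rather than a reference implementation, and it makes several assumptions that the handout does not state: documents are already given as plain coordinate lists (so $\vec{v}(d) = d$), distance is Euclidean, the Preprocess and Select-k steps are skipped (the caller passes $k$ directly), and a tie in the $k$NN vote goes to the class whose neighbor is closest. The function names `train_rocchio`, `apply_rocchio`, and `apply_knn` are illustrative only.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8


def train_rocchio(training_set):
    """Return {class label: centroid} for a list of (vector, label) pairs."""
    vectors_by_class = {}
    for vector, label in training_set:
        vectors_by_class.setdefault(label, []).append(vector)
    centroids = {}
    for label, vectors in vectors_by_class.items():
        n = len(vectors)
        # Component-wise mean of all vectors labeled with this class.
        centroids[label] = [sum(component) / n for component in zip(*vectors)]
    return centroids


def apply_rocchio(centroids, d):
    """Return the label of the centroid closest to d."""
    return min(centroids, key=lambda label: dist(centroids[label], d))


def apply_knn(training_set, k, d):
    """Return the majority label among the k training vectors closest to d."""
    # Assumption: no preprocessing; k is chosen by the caller.
    neighbors = sorted(training_set, key=lambda pair: dist(pair[0], d))[:k]
    votes = Counter(label for _, label in neighbors)
    # most_common keeps first-seen order among equal counts, so a tie
    # goes to the class of the nearest neighbor.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Training set and query document from Exercise 14/4.
    training_set = [([1, 1], 1), ([2, 0], 1), ([2, 3], 2)]
    q = [1, 2]
    centroids = train_rocchio(training_set)
    print("centroids:", centroids)
    print("Rocchio:", apply_rocchio(centroids, q))
    print("1NN:", apply_knn(training_set, 1, q))
```

Running the script on the Exercise 14/4 data gives a way to check a hand-computed answer. Sorting the whole training set for every query costs $O(|D| \log |D|)$ per document, which is acceptable for a toy example but not for a large collection.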