Intro vector space classification Rocchio kNN Linear classifiers > two classes PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/PV211 IIR 14: Vector Space Classification Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center for Information and Language Processing, University of Munich 2024-04-25 (compiled on 2024-06-10 08:51:10) Sojka, IIR Group: PV211: Vector Space Classification 1 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Overview 1 Intro vector space classification 2 Rocchio 3 kNN 4 Linear classifiers 5 > two classes Sojka, IIR Group: PV211: Vector Space Classification 2 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Take-away today Vector space classification: Basic idea of doing text classification for documents that are represented as vectors Rocchio classifier: Rocchio relevance feedback idea applied to text classification k nearest neighbor classification Linear classifiers More than two classes Sojka, IIR Group: PV211: Vector Space Classification 3 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Roadmap for today Naive Bayes is simple and a good baseline. Use it if you want to get a text classifier up and running in a hurry. But other classification methods are more accurate. Perhaps the simplest well performing alternative: kNN kNN is a vector space classifier. Plan for rest of today 1 intro vector space classification 2 very simple vector space classification: Rocchio 3 kNN 4 general properties of classifiers Sojka, IIR Group: PV211: Vector Space Classification 5 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Recall vector space representation Each document is a vector, one component for each term. Terms are axes. High dimensionality: 100,000s of dimensions Normalize vectors (documents) to unit length How can we do classification in this space? Sojka, IIR Group: PV211: Vector Space Classification 6 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Vector space classification As before, the training set is a set of documents, each labeled with its class. In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space. Premise 1: Documents in the same class form a contiguous region. Premise 2: Documents from different classes don’t overlap. We define lines, surfaces, hyper-surfaces to divide regions. Sojka, IIR Group: PV211: Vector Space Classification 7 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Classes in the vector space xx x x ⋄ ⋄ ⋄⋄ ⋄ ⋄ China Kenya UK ⋆ Should the document ⋆ be assigned to China, UK or Kenya? Find separators between the classes Based on these separators: ⋆ should be assigned to China How do we find separators that do a good job at classifying new documents like ⋆? – Main topic of today Sojka, IIR Group: PV211: Vector Space Classification 8 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Aside: 2D/3D graphs can be misleading d true dprojected x1 x2 x3 x4 x5 x′ 1 x′ 2 x′ 3 x′ 4 x′ 5 x′ 1 x′ 2 x′ 3 x′ 4 x′ 5 Left: A projection of the 2D semicircle to 1D. For the points x1, x2, x3, x4, x5 at x coordinates −0.9, −0.2, 0, 0.2, 0.9 the distance |x2x3| ≈ 0.201 only differs by 0.5% from |x′ 2x′ 3| = 0.2; but |x1x3|/|x′ 1x′ 3| = dtrue/dprojected ≈ 1.06/0.9 ≈ 1.18 is an example of a large distortion (18%) when projecting a large area. Right: The corresponding projection of the 3D hemisphere to 2D. Sojka, IIR Group: PV211: Vector Space Classification 9 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Relevance feedback In relevance feedback, the user marks documents as relevant/non-relevant. Relevant/non-relevant can be viewed as classes or categories. For each document, the user decides which of these two classes is correct. The IR system then uses these class assignments to build a better query (“model”) of the information need . . . . . . and returns better documents. Relevance feedback is a form of text classification. The notion of text classification (TC) is very general and has many applications within and beyond information retrieval. Sojka, IIR Group: PV211: Vector Space Classification 11 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Using Rocchio for vector space classification The principal difference between relevance feedback and text classification: The training set is given as part of the input in text classification. It is interactively created in relevance feedback. Sojka, IIR Group: PV211: Vector Space Classification 12 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Rocchio classification: Basic idea Compute a centroid for each class The centroid is the average of all documents in the class. Assign each test document to the class of its closest centroid. Sojka, IIR Group: PV211: Vector Space Classification 13 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Recall definition of centroid µ(c) = 1 |Dc| d∈Dc v(d) where Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d. Sojka, IIR Group: PV211: Vector Space Classification 14 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Rocchio algorithm TrainRocchio(C, D) 1 for each cj ∈ C 2 do Dj ← {d : d, cj ∈ D} 3 µj ← 1 |Dj | d∈Dj v(d) 4 return {µ1, . . . , µJ } ApplyRocchio({µ1, . . . , µJ}, d) 1 return arg minj |µj − v(d)| Sojka, IIR Group: PV211: Vector Space Classification 15 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Rocchio illustrated: a1 = a2, b1 = b2, c1 = c2 xx x x ⋄ ⋄ ⋄ ⋄ ⋄ ⋄ China Kenya UK ⋆ a1 a2 b1 b2 c1 c2 Sojka, IIR Group: PV211: Vector Space Classification 16 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Rocchio properties Rocchio forms a simple representation for each class: the centroid We can interpret the centroid as the prototype of the class. Classification is based on similarity to / distance from centroid/prototype. Does not guarantee that classifications are consistent with the training data! Sojka, IIR Group: PV211: Vector Space Classification 17 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Time complexity of Rocchio mode time complexity training Θ(|D|Lave + |C||V |) ≈ Θ(|D|Lave) testing Θ(La + |C|Ma) ≈ Θ(|C|Ma) Sojka, IIR Group: PV211: Vector Space Classification 18 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Rocchio vs. Naive Bayes In many cases, Rocchio performs worse than Naive Bayes. One reason: Rocchio does not handle nonconvex, multimodal classes correctly. Sojka, IIR Group: PV211: Vector Space Classification 19 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Rocchio cannot handle nonconvex, multimodal classes a a a a a a a a a a a a a a a a aa a a a a a a a a a a a a a aa aa a a aa a b b b b bb bb b b b bb b b b b b b X XA B o Exercise: Why is Rocchio not expected to do well for the classification task a vs. b here? A is centroid of the a’s, B is centroid of the b’s. The point o is closer to A than to B. But o is a better fit for the b class. A is a multimodal class with two prototypes. But in Rocchio we only have one prototype. Sojka, IIR Group: PV211: Vector Space Classification 20 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes kNN classification kNN classification is another vector space classification method. It also is very simple and easy to implement. kNN is more accurate (in most cases) than Naive Bayes and Rocchio. If you need to get a pretty accurate classifier up and running in a short time . . . . . . and you don’t care about efficiency that much . . . . . . use kNN. Sojka, IIR Group: PV211: Vector Space Classification 22 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes kNN classification kNN = k nearest neighbors kNN classification rule for k = 1 (1NN): Assign each test document to the class of its nearest neighbor in the training set. 1NN is not very robust – one document can be mislabeled or atypical. kNN classification rule for k > 1 (kNN): Assign each test document to the majority class of its k nearest neighbors in the training set. Rationale of kNN: contiguity hypothesis We expect a test document d to have the same label as the training documents located in the local region surrounding d. Sojka, IIR Group: PV211: Vector Space Classification 23 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Probabilistic kNN Probabilistic version of kNN: P(c|d) = fraction of k neighbors of d that are in c kNN classification rule for probabilistic kNN: Assign d to class c with highest P(c|d) Sojka, IIR Group: PV211: Vector Space Classification 24 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes kNN is based on Voronoi tessellation x x x x x x x x x x x ⋄ ⋄ ⋄ ⋄ ⋄ ⋄ ⋄ ⋄⋄ ⋄ ⋄ ⋆ 1NN, 3NN classification decision for star? Sojka, IIR Group: PV211: Vector Space Classification 25 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes kNN algorithm Train-kNN(C, D) 1 D′ ← Preprocess(D) 2 k ← Select-k(C, D′) 3 return D′, k Apply-kNN(D′, k, d) 1 Sk ← ComputeNearestNeighbors(D′, k, d) 2 for each cj ∈ C(D′) 3 do pj ← |Sk ∩ cj|/k 4 return arg maxj pj Sojka, IIR Group: PV211: Vector Space Classification 26 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Exercise ⋆ x x x x x x x x x x o o o o o How is star classified by: (i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio? Sojka, IIR Group: PV211: Vector Space Classification 27 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Time complexity of kNN kNN with preprocessing of training set training Θ(|D|Lave) testing Θ(La + |D|MaveMa) = Θ(|D|MaveMa) kNN test time proportional to the size of the training set! The larger the training set, the longer it takes to classify a test document. kNN is inefficient for very large training sets. Question: Can we divide up the training set into regions, so that we only have to search in one region to do kNN classification for a given test document? (which perhaps would give us better than linear time complexity) Sojka, IIR Group: PV211: Vector Space Classification 28 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Curse of dimensionality Our intuitions about space are based on the 3D world we live in. Intuition 1: some things are close by, some things are distant. Intuition 2: we can carve up space into areas such that: within an area things are close, distances between areas are large. These two intuitions don’t necessarily hold for high dimensions. In particular: for a set of k uniformly distributed points, let dmin be the smallest distance between any two points and dmax be the largest distance between any two points. Then lim d→∞ dmax − dmin dmin = 0 Sojka, IIR Group: PV211: Vector Space Classification 29 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Curse of dimensionality: Simulation Simulate lim d→∞ dmax − dmin dmin = 0 Pick a dimensionality d Generate 10 random points in the d-dimensional hypercube (uniform distribution) Compute all 45 distances Compute dmax−dmin dmin We see that intuition 1 (some things are close, others are distant) is not true for high dimensions. Sojka, IIR Group: PV211: Vector Space Classification 30 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Intuition 2: Space can be carved up Intuition 2: we can carve up space into areas such that: within an area things are close, distances between areas are large. If this is true, then we have a simple and efficient algorithm for kNN. To find the k closest neighbors of data point < x1, x2, . . . , xd > do the following. Using binary search find all data points whose first dimension is in [x1 − ǫ, x1 + ǫ]. This is O(log n) where n is the number of data points. Do this for each dimension, then intersect the d subsets. Sojka, IIR Group: PV211: Vector Space Classification 31 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Intuition 2: Space can be carved up Size of data set n = 100 Again, assume uniform distribution in hypercube Set ǫ = 0.05: we will look in an interval of length 0.1 for neighbors on each dimension. What is the probability that the nearest neighbor of a new data point x is in this neighborhood in d = 1 dimension? for d = 1: 1 − (1 − 0.1)100 ≈ 0.99997 In d = 2 dimensions? for d = 2: 1 − (1 − 0.12)100 ≈ 0.63 In d = 3 dimensions? for d = 3: 1 − (1 − 0.13)100 ≈ 0.095 In d = 4 dimensions? for d = 4: 1 − (1 − 0.14)100 ≈ 0.0095 In d = 5 dimensions? for d = 5: 1 − (1 − 0.15)100 ≈ 0.0009995 Sojka, IIR Group: PV211: Vector Space Classification 32 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Intuition 2: Space can be carved up In d = 5 dimensions? for d = 5: 1 − (1 − 0.15)100 ≈ 0.0009995 In other words: with enough dimensions, there is only one “local” region that will contain the nearest neighbor with high certainty: the entire search space. We cannot carve up high-dimensional space into neat neighborhoods . . . . . . unless the “true” dimensionality is much lower than d. Sojka, IIR Group: PV211: Vector Space Classification 33 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes kNN: Discussion No training necessary But linear preprocessing of documents is as expensive as training Naive Bayes. We always preprocess the training set, so in reality training time of kNN is linear. kNN is very accurate if training set is large. Optimality result: asymptotically zero error if Bayes rate is zero. But kNN can be very inaccurate if training set is small. Sojka, IIR Group: PV211: Vector Space Classification 34 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Linear classifiers Definition: A linear classifier computes a linear combination or weighted sum i wi xi of the feature values. Classification decision: i wi xi > θ? . . . where θ (the threshold) is a parameter. (First, we only consider binary classifiers.) Geometrically, this corresponds to a line (2D), a plane (3D) or a hyperplane (higher dimensionalities), the separator. We find this separator based on training set. Methods for finding separator: Perceptron, Rocchio, Naive Bayes – as we will explain on the next slides Assumption: The classes are linearly separable. Sojka, IIR Group: PV211: Vector Space Classification 36 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes A linear classifier in 1D A linear classifier in 1D is a point described by the equation w1d1 = θ The point at θ/w1 Points (d1) with w1d1 ≥ θ are in the class c. Points (d1) with w1d1 < θ are in the complement class c. Sojka, IIR Group: PV211: Vector Space Classification 37 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes A linear classifier in 2D A linear classifier in 2D is a line described by the equation w1d1 + w2d2 = θ Example for a 2D linear classifier Points (d1 d2) with w1d1 + w2d2 ≥ θ are in the class c. Points (d1 d2) with w1d1 + w2d2 < θ are in the complement class c. Sojka, IIR Group: PV211: Vector Space Classification 38 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes A linear classifier in 3D A linear classifier in 3D is a plane described by the equation w1d1 + w2d2 + w3d3 = θ Example for a 3D linear classifier Points (d1 d2 d3) with w1d1 + w2d2 + w3d3 ≥ θ are in the class c. Points (d1 d2 d3) with w1d1 + w2d2 + w3d3 < θ are in the complement class c. Sojka, IIR Group: PV211: Vector Space Classification 39 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Rocchio as a linear classifier Rocchio is a linear classifier defined by: M i=1 wi di = wd = θ where w is the normal vector µ(c1) − µ(c2) and θ = 0.5 ∗ (|µ(c1)|2 − |µ(c2)|2). Sojka, IIR Group: PV211: Vector Space Classification 40 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Naive Bayes as a linear classifier Multinomial Naive Bayes is a linear classifier (in log space) defined by: M i=1 wi di = θ where wi = log[ˆP(ti |c)/ˆP(ti |¯c)], di = number of occurrences of ti in d, and θ = − log[ˆP(c)/ˆP(¯c)]. Here, the index i, 1 ≤ i ≤ M, refers to terms of the vocabulary (not to positions in d as k did in our original definition of Naive Bayes) Sojka, IIR Group: PV211: Vector Space Classification 41 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes kNN is not a linear classifier x x x x x x x x x x x ⋄ ⋄ ⋄ ⋄ ⋄ ⋄ ⋄ ⋄⋄ ⋄ ⋄ ⋆ Classification decision based on majority of k nearest neighbors. The decision boundaries between classes are piecewise linear . . . . . . but they are in general not linear classifiers that can be described as M i=1 wi di = θ. Sojka, IIR Group: PV211: Vector Space Classification 42 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Example of a linear two-class classifier ti wi d1i d2i ti wi d1i d2i prime 0.70 0 1 dlrs -0.71 1 1 rate 0.67 1 0 world -0.35 1 0 interest 0.63 0 0 sees -0.33 0 0 rates 0.60 0 0 year -0.25 0 0 discount 0.46 1 0 group -0.24 0 0 bundesbank 0.43 0 0 dlr -0.24 0 0 This is for the class interest in Reuters-21578. For simplicity: assume a simple 0/1 vector representation d1: “rate discount dlrs world” d2: “prime dlrs” θ = 0 Exercise: Which class is d1 assigned to? Which class is d2 assigned to? We assign document d1 “rate discount dlrs world” to interest since wT d1 = 0.67 · 1 + 0.46 · 1 + (−0.71) · 1 + (−0.35) · 1 = 0.07 > 0 = θ. We assign d2 “prime dlrs” to the complement class (not in interest) since wT d2 = −0.01 ≤ θ. Sojka, IIR Group: PV211: Vector Space Classification 43 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Which hyperplane? Sojka, IIR Group: PV211: Vector Space Classification 44 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Learning algorithms for vector space classification In terms of actual computation, there are two types of learning algorithms. (i) Simple learning algorithms that estimate the parameters of the classifier directly from the training data, often in one linear pass. Naive Bayes, Rocchio, kNN are all examples of this. (ii) Iterative algorithms Support vector machines Perceptron (example available as PDF on website: http://cislmu.org) The best performing learning algorithms usually require iterative learning. Sojka, IIR Group: PV211: Vector Space Classification 45 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Perceptron update rule Randomly initialize linear separator w Do until convergence: Pick data point x If sign(wT x) is correct class (1 or -1): do nothing Otherwise: w = w − sign(wT x)x Sojka, IIR Group: PV211: Vector Space Classification 46 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Perceptron w x S NO YES Sojka, IIR Group: PV211: Vector Space Classification 47 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Perceptron w x x S NO YES Sojka, IIR Group: PV211: Vector Space Classification 48 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Perceptron w x x w + x S S′ NO YES NOYES Sojka, IIR Group: PV211: Vector Space Classification 49 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Perceptron x w + x S′ NOYES Sojka, IIR Group: PV211: Vector Space Classification 50 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Which hyperplane? Sojka, IIR Group: PV211: Vector Space Classification 51 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Which hyperplane? For linearly separable training sets: there are infinitely many separating hyperplanes. They all separate the training set perfectly . . . . . . but they behave differently on test data. Error rates on new data are low for some, high for others. How do we find a low-error separator? Perceptron: generally bad; Naive Bayes, Rocchio: ok; linear SVM: good Sojka, IIR Group: PV211: Vector Space Classification 52 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Linear classifiers: Discussion Many common text classifiers are linear classifiers: Naive Bayes, Rocchio, logistic regression, linear support vector machines, etc. Each method has a different way of selecting the separating hyperplane Huge differences in performance on test documents Can we get better performance with more powerful nonlinear classifiers? Not in general: A given amount of training data may suffice for estimating a linear boundary, but not for estimating a more complex nonlinear boundary. Sojka, IIR Group: PV211: Vector Space Classification 53 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes A nonlinear problem 0.0 0.2 0.4 0.6 0.8 1.0 0.00.20.40.60.81.0 Linear classifier like Rocchio does badly on this task. kNN will do well (assuming enough training data) Sojka, IIR Group: PV211: Vector Space Classification 54 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Which classifier do I use for a given TC problem? Is there a learning method that is optimal for all text classification problems? No, because there is a trade-off between bias and variance. Factors to take into account: How much training data is available? How simple/complex is the problem? (linear vs. nonlinear decision boundary) How noisy is the problem? How stable is the problem over time? For an unstable problem, it’s better to use a simple and robust classifier. Sojka, IIR Group: PV211: Vector Space Classification 55 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes How to combine hyperplanes for > 2 classes? ? Sojka, IIR Group: PV211: Vector Space Classification 57 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes One-of problems One-of or multiclass classification Classes are mutually exclusive. Each document belongs to exactly one class. Example: language of a document (assumption: no document contains multiple languages) Sojka, IIR Group: PV211: Vector Space Classification 58 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes One-of classification with linear classifiers Combine two-class linear classifiers as follows for one-of classification: Run each classifier separately Rank classifiers (e.g., according to score) Pick the class with the highest score Sojka, IIR Group: PV211: Vector Space Classification 59 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Any-of problems Any-of or multilabel classification A document can be a member of 0, 1, or many classes. A decision on one class leaves decisions open on all other classes. A type of “independence” (but not statistical independence) Example: topic classification Usually: make decisions on the region, on the subject area, on the industry and so on “independently” Sojka, IIR Group: PV211: Vector Space Classification 60 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Any-of classification with linear classifiers Combine two-class linear classifiers as follows for any-of classification: Simply run each two-class classifier separately on the test document and assign document accordingly Sojka, IIR Group: PV211: Vector Space Classification 61 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Take-away today Vector space classification: Basic idea of doing text classification for documents that are represented as vectors Rocchio classifier: Rocchio relevance feedback idea applied to text classification k nearest neighbor classification Linear classifiers More than two classes Sojka, IIR Group: PV211: Vector Space Classification 62 / 63 Intro vector space classification Rocchio kNN Linear classifiers > two classes Resources Chapter 13 of IIR (feature selection) Chapter 14 of IIR Resources at http://cislmu.org Perceptron example General overview of text classification: Sebastiani (2002) Text classification chapter on decision trees and perceptrons: Manning & Schütze (1999) One of the best machine learning textbooks: Hastie, Tibshirani & Friedman (2003) Sojka, IIR Group: PV211: Vector Space Classification 63 / 63