16 Flat clustering

Clustering algorithms group a set of documents into subsets or clusters. The algorithms' goal is to create clusters that are coherent internally, but clearly different from each other. In other words, documents within a cluster should be as similar as possible; and documents in one cluster should be as dissimilar as possible from documents in other clusters.

◮ Figure 16.1 An example of a data set with a clear cluster structure.

Clustering is the most common form of unsupervised learning. No supervision means that there is no human expert who has assigned documents to classes. In clustering, it is the distribution and makeup of the data that will determine cluster membership. A simple example is Figure 16.1. It is visually clear that there are three distinct clusters of points. This chapter and Chapter 17 introduce algorithms that find such clusters in an unsupervised fashion.

The difference between clustering and classification may not seem great at first. After all, in both cases we have a partition of a set of documents into groups. But as we will see the two problems are fundamentally different. Classification is a form of supervised learning (Chapter 13, page 256): our goal is to replicate a categorical distinction that a human supervisor imposes on the data. In unsupervised learning, of which clustering is the most important example, we have no such teacher to guide us.

The key input to a clustering algorithm is the distance measure. In Figure 16.1, the distance measure is distance in the 2D plane. This measure suggests three different clusters in the figure. In document clustering, the distance measure is often also Euclidean distance. Different distance measures give rise to different clusterings. Thus, the distance measure is an important means by which we can influence the outcome of clustering.

Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other. Hierarchical clustering creates a hierarchy of clusters and will be covered in Chapter 17. Chapter 17 also addresses the difficult problem of labeling clusters automatically.

A second important distinction can be made between hard and soft clustering algorithms. Hard clustering computes a hard assignment – each document is a member of exactly one cluster. The assignment of soft clustering algorithms is soft – a document's assignment is a distribution over all clusters. In a soft assignment, a document has fractional membership in several clusters. Latent semantic indexing, a form of dimensionality reduction, is a soft clustering algorithm (Chapter 18, page 417).

This chapter motivates the use of clustering in information retrieval by introducing a number of applications (Section 16.1), defines the problem we are trying to solve in clustering (Section 16.2) and discusses measures for evaluating cluster quality (Section 16.3). It then describes two flat clustering algorithms, K-means (Section 16.4), a hard clustering algorithm, and the Expectation-Maximization (or EM) algorithm (Section 16.5), a soft clustering algorithm. K-means is perhaps the most widely used flat clustering algorithm due to its simplicity and efficiency.
The EM algorithm is a generalization of K-means and can be applied to a large variety of document representations and distributions.

16.1 Clustering in information retrieval

The cluster hypothesis states the fundamental assumption we make when using clustering in information retrieval.

Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs.

The hypothesis states that if there is a document from a cluster that is relevant to a search request, then it is likely that other documents from the same cluster are also relevant. This is because clustering puts together documents that share many terms. The cluster hypothesis essentially is the contiguity hypothesis in Chapter 14 (page 289). In both cases, we posit that similar documents behave similarly with respect to relevance.

Application               What is clustered?       Benefit                                                        Example
Search result clustering  search results           more effective information presentation to user               Figure 16.2
Scatter-Gather            (subsets of) collection  alternative user interface: "search without typing"           Figure 16.3
Collection clustering     collection               effective information presentation for exploratory browsing   McKeown et al. (2002), http://news.google.com
Language modeling         collection               increased precision and/or recall                             Liu and Croft (2004)
Cluster-based retrieval   collection               higher efficiency: faster search                               Salton (1971a)

◮ Table 16.1 Some applications of clustering in information retrieval.

Table 16.1 shows some of the main applications of clustering in information retrieval. They differ in the set of documents that they cluster – search results, collection or subsets of the collection – and the aspect of an information retrieval system they try to improve – user experience, user interface, effectiveness or efficiency of the search system. But they are all based on the basic assumption stated by the cluster hypothesis.

The first application mentioned in Table 16.1 is search result clustering, where by search results we mean the documents that were returned in response to a query. The default presentation of search results in information retrieval is a simple list. Users scan the list from top to bottom until they have found the information they are looking for. Instead, search result clustering clusters the search results, so that similar documents appear together. It is often easier to scan a few coherent groups than many individual documents. This is particularly useful if a search term has different word senses. The example in Figure 16.2 is jaguar. Three frequent senses on the web refer to the car, the animal and an Apple operating system. The Clustered Results panel returned by the Vivísimo search engine (http://vivisimo.com) can be a more effective user interface for understanding what is in the search results than a simple list of documents.

A better user interface is also the goal of Scatter-Gather, the second application in Table 16.1. Scatter-Gather clusters the whole collection to get groups of documents that the user can select or gather. The selected groups are merged and the resulting set is again clustered. This process is repeated until a cluster of interest is found. An example is shown in Figure 16.3.

◮ Figure 16.2 Clustering of search results to improve recall.
None of the top hits cover the animal sense of jaguar, but users can easily access it by clicking on the cat cluster in the Clustered Results panel on the left (third arrow from the top).

Automatically generated clusters like those in Figure 16.3 are not as neatly organized as a manually constructed hierarchical tree like the Open Directory at http://dmoz.org. Also, finding descriptive labels for clusters automatically is a difficult problem (Section 17.7, page 396). But cluster-based navigation is an interesting alternative to keyword searching, the standard information retrieval paradigm. This is especially true in scenarios where users prefer browsing over searching because they are unsure about which search terms to use.

As an alternative to the user-mediated iterative clustering in Scatter-Gather, we can also compute a static hierarchical clustering of a collection that is not influenced by user interactions ("Collection clustering" in Table 16.1). Google News and its precursor, the Columbia NewsBlaster system, are examples of this approach. In the case of news, we need to frequently recompute the clustering to make sure that users can access the latest breaking stories. Clustering is well suited for access to a collection of news stories since news reading is not really search, but rather a process of selecting a subset of stories about recent events.

◮ Figure 16.3 An example of a user session in Scatter-Gather. A collection of New York Times news stories is clustered ("scattered") into eight clusters (top row). The user manually gathers three of these into a smaller collection International Stories and performs another scattering operation. This process repeats until a small cluster with relevant documents is found (e.g., Trinidad).

The fourth application of clustering exploits the cluster hypothesis directly for improving search results, based on a clustering of the entire collection. We use a standard inverted index to identify an initial set of documents that match the query, but we then add other documents from the same clusters even if they have low similarity to the query. For example, if the query is car and several car documents are taken from a cluster of automobile documents, then we can add documents from this cluster that use terms other than car (automobile, vehicle etc.). This can increase recall since a group of documents with high mutual similarity is often relevant as a whole.

More recently this idea has been used for language modeling. Equation (12.10), page 245, showed that to avoid sparse data problems in the language modeling approach to IR, the model of document d can be interpolated with a collection model. But the collection contains many documents with terms untypical of d. By replacing the collection model with a model derived from d's cluster, we get more accurate estimates of the occurrence probabilities of terms in d.

Clustering can also speed up search. As we saw in Section 6.3.2 (page 123) search in the vector space model amounts to finding the nearest neighbors to the query. The inverted index supports fast nearest-neighbor search for the standard IR setting. However, sometimes we may not be able to use an inverted index efficiently, e.g., in latent semantic indexing (Chapter 18). In such cases, we could compute the similarity of the query to every document, but this is slow. The cluster hypothesis offers an alternative: Find the clusters that are closest to the query and only consider documents from these clusters. Within this much smaller set, we can compute similarities exhaustively and rank documents in the usual way. Since there are many fewer clusters than documents, finding the closest cluster is fast; and since the documents matching a query are all similar to each other, they tend to be in the same clusters. While this algorithm is inexact, the expected decrease in search quality is small. This is essentially the application of clustering that was covered in Section 7.1.6 (page 141).
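A minimal sketch of this two-step scheme (an illustrative addition, not the book's implementation): it assumes dense, length-normalized numpy vectors for documents and query, and a precomputed flat clustering given by an assignment array and a centroid matrix.

```python
import numpy as np

def cluster_based_retrieval(query, doc_vectors, assignments, centroids, top=10):
    """Rank only the documents of the cluster whose centroid is most similar to the query.

    query:        length-normalized query vector, shape (M,)
    doc_vectors:  length-normalized document vectors, shape (N, M)
    assignments:  assignments[n] = cluster index of document n
    centroids:    cluster centroids, shape (K, M)
    """
    # Step 1: find the closest cluster; for length-normalized vectors,
    # cosine similarity is just the dot product.
    best_cluster = int(np.argmax(centroids @ query))

    # Step 2: compute similarities exhaustively within that (much smaller) cluster.
    members = np.flatnonzero(assignments == best_cluster)
    sims = doc_vectors[members] @ query
    order = np.argsort(-sims)[:top]
    return [(int(members[i]), float(sims[i])) for i in order]
```

Considering several of the closest clusters instead of a single one reduces the risk of missing relevant documents, at the cost of more similarity computations.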
Exercise 16.1  Define two documents as similar if they have at least two proper names like Clinton or Sarkozy in common. Give an example of an information need and two documents, for which the cluster hypothesis does not hold for this notion of similarity.

Exercise 16.2  Make up a simple one-dimensional example (i.e. points on a line) with two clusters where the inexactness of cluster-based retrieval shows up. In your example, retrieving clusters close to the query should do worse than direct nearest neighbor search.

16.2 Problem statement

We can define the goal in hard flat clustering as follows. Given (i) a set of documents D = {d1, ..., dN}, (ii) a desired number of clusters K, and (iii) an objective function that evaluates the quality of a clustering, we want to compute an assignment γ : D → {1, ..., K} that minimizes (or, in other cases, maximizes) the objective function. In most cases, we also demand that γ is surjective, i.e., that none of the K clusters is empty.

The objective function is often defined in terms of similarity or distance between documents. Below, we will see that the objective in K-means clustering is to minimize the average distance between documents and their centroids or, equivalently, to maximize the similarity between documents and their centroids. The discussion of similarity measures and distance metrics in Chapter 14 (page 291) also applies to this chapter. As in Chapter 14, we use both similarity and distance to talk about relatedness between documents.

For documents, the type of similarity we want is usually topic similarity or high values on the same dimensions in the vector space model. For example, documents about China have high values on dimensions like Chinese, Beijing, and Mao whereas documents about the UK tend to have high values for London, Britain and Queen. We approximate topic similarity with cosine similarity or Euclidean distance in vector space (Chapter 6). If we intend to capture similarity of a type other than topic, for example, similarity of language, then a different representation may be appropriate. When computing topic similarity, stop words can be safely ignored, but they are important cues for separating clusters of English (in which the occurs frequently and la infrequently) and French documents (in which the occurs infrequently and la frequently).

A note on terminology. An alternative definition of hard clustering is that a document can be a full member of more than one cluster. Partitional clustering always refers to a clustering where each document belongs to exactly one cluster. (But in a partitional hierarchical clustering (Chapter 17) all members of a cluster are of course also members of its parent.)
On the definition of hard clustering that permits multiple membership, the difference between soft clustering and hard clustering is that membership values in hard clustering are either 0 or 1, whereas they can take on any non-negative value in soft clustering.

Some researchers distinguish between exhaustive clusterings that assign each document to a cluster and non-exhaustive clusterings, in which some documents will be assigned to no cluster. Non-exhaustive clusterings in which each document is a member of either no cluster or one cluster are called exclusive. We define clustering to be exhaustive in this book.

16.2.1 Cardinality – the number of clusters

A difficult issue in clustering is determining the number of clusters or cardinality of a clustering, which we denote by K. Often K is nothing more than a good guess based on experience or domain knowledge. But for K-means, we will also introduce a heuristic method for choosing K and an attempt to incorporate the selection of K into the objective function. Sometimes the application puts constraints on the range of K. For example, the Scatter-Gather interface in Figure 16.3 could not display more than about K = 10 clusters per layer because of the size and resolution of computer monitors in the early 1990s.

Since our goal is to optimize an objective function, clustering is essentially a search problem. The brute force solution would be to enumerate all possible clusterings and pick the best. However, there are exponentially many partitions, so this approach is not feasible.¹ For this reason, most flat clustering algorithms refine an initial partitioning iteratively. If the search starts at an unfavorable initial point, we may miss the global optimum. Finding a good starting point is therefore another important problem we have to solve in flat clustering.

¹ An upper bound on the number of clusterings is K^N/K!. The exact number of different partitions of N documents into K clusters is the Stirling number of the second kind. See http://mathworld.wolfram.com/StirlingNumberoftheSecondKind.html or Comtet (1974).

16.3 Evaluation of clustering

Typical objective functions in clustering formalize the goal of attaining high intra-cluster similarity (documents within a cluster are similar) and low inter-cluster similarity (documents from different clusters are dissimilar). This is an internal criterion for the quality of a clustering. But good scores on an internal criterion do not necessarily translate into good effectiveness in an application. An alternative to internal criteria is direct evaluation in the application of interest. For search result clustering, we may want to measure the time it takes users to find an answer with different clustering algorithms. This is the most direct evaluation, but it is expensive, especially if large user studies are necessary.

As a surrogate for user judgments, we can use a set of classes in an evaluation benchmark or gold standard (see Section 8.5, page 164, and Section 13.6, page 279). The gold standard is ideally produced by human judges with a good level of inter-judge agreement (see Chapter 8, page 152). We can then compute an external criterion that evaluates how well the clustering matches the gold standard classes. For example, we may want to say that the optimal clustering of the search results for jaguar in Figure 16.2 consists of three classes corresponding to the three senses car, animal, and operating system. In this type of evaluation, we only use the partition provided by the gold standard, not the class labels.
This section introduces four external criteria of clustering quality. Purity is a simple and transparent evaluation measure. Normalized mutual information can be information-theoretically interpreted. The Rand index penalizes both false positive and false negative decisions during clustering. The F measure in addition supports differential weighting of these two types of errors.

To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by N. Formally:

purity(\Omega, C) = \frac{1}{N} \sum_k \max_j |\omega_k \cap c_j|    (16.1)

where Ω = {ω1, ω2, ..., ωK} is the set of clusters and C = {c1, c2, ..., cJ} is the set of classes. We interpret ωk as the set of documents in ωk and cj as the set of documents in cj in Equation (16.1). We present an example of how to compute purity in Figure 16.4.² Bad clusterings have purity values close to 0, a perfect clustering has a purity of 1. Purity is compared with the other three measures discussed in this chapter in Table 16.2.

◮ Figure 16.4 Purity as an external evaluation criterion for cluster quality. Majority class and number of members of the majority class for the three clusters are: x, 5 (cluster 1); o, 4 (cluster 2); and ⋄, 3 (cluster 3). Purity is (1/17) × (5 + 4 + 3) ≈ 0.71.

                         purity   NMI    RI     F5
lower bound              0.0      0.0    0.0    0.0
maximum                  1        1      1      1
value for Figure 16.4    0.71     0.36   0.68   0.46

◮ Table 16.2 The four external evaluation measures applied to the clustering in Figure 16.4.

² Recall our note of caution from Figure 14.2 (page 291) when looking at this and other 2D figures in this and the following chapter: these illustrations can be misleading because 2D projections of length-normalized vectors distort similarities and distances between points.

High purity is easy to achieve when the number of clusters is large – in particular, purity is 1 if each document gets its own cluster. Thus, we cannot use purity to trade off the quality of the clustering against the number of clusters.

A measure that allows us to make this tradeoff is normalized mutual information or NMI:

NMI(\Omega, C) = \frac{I(\Omega; C)}{[H(\Omega) + H(C)]/2}    (16.2)

I is mutual information (cf. Chapter 13, page 272):

I(\Omega; C) = \sum_k \sum_j P(\omega_k \cap c_j) \log \frac{P(\omega_k \cap c_j)}{P(\omega_k) P(c_j)}    (16.3)
             = \sum_k \sum_j \frac{|\omega_k \cap c_j|}{N} \log \frac{N |\omega_k \cap c_j|}{|\omega_k| |c_j|}    (16.4)

where P(ωk), P(cj), and P(ωk ∩ cj) are the probabilities of a document being in cluster ωk, class cj, and in the intersection of ωk and cj, respectively. Equation (16.4) is equivalent to Equation (16.3) for maximum likelihood estimates of the probabilities (i.e., the estimate of each probability is the corresponding relative frequency).

H is entropy as defined in Chapter 5 (page 99):

H(\Omega) = -\sum_k P(\omega_k) \log P(\omega_k)    (16.5)
          = -\sum_k \frac{|\omega_k|}{N} \log \frac{|\omega_k|}{N}    (16.6)

where, again, the second equation is based on maximum likelihood estimates of the probabilities. I(Ω; C) in Equation (16.3) measures the amount of information by which our knowledge about the classes increases when we are told what the clusters are.
The minimum of I(Ω; C) is 0 if the clustering is random with respect to class membership. In that case, knowing that a document is in a particular cluster does not give us any new information about what its class might be. Maximum mutual information is reached for a clustering Ω_exact that perfectly recreates the classes – but also if clusters in Ω_exact are further subdivided into smaller clusters (Exercise 16.7). In particular, a clustering with K = N one-document clusters has maximum MI. So MI has the same problem as purity: it does not penalize large cardinalities and thus does not formalize our bias that, other things being equal, fewer clusters are better.

The normalization by the denominator [H(Ω) + H(C)]/2 in Equation (16.2) fixes this problem since entropy tends to increase with the number of clusters. For example, H(Ω) reaches its maximum log N for K = N, which ensures that NMI is low for K = N. Because NMI is normalized, we can use it to compare clusterings with different numbers of clusters. The particular form of the denominator is chosen because [H(Ω) + H(C)]/2 is a tight upper bound on I(Ω; C) (Exercise 16.8). Thus, NMI is always a number between 0 and 1.

An alternative to this information-theoretic interpretation of clustering is to view it as a series of decisions, one for each of the N(N − 1)/2 pairs of documents in the collection. We want to assign two documents to the same cluster if and only if they are similar. A true positive (TP) decision assigns two similar documents to the same cluster, a true negative (TN) decision assigns two dissimilar documents to different clusters. There are two types of errors we can commit. A false positive (FP) decision assigns two dissimilar documents to the same cluster. A false negative (FN) decision assigns two similar documents to different clusters. The Rand index (RI) measures the percentage of decisions that are correct. That is, it is simply accuracy (Section 8.3, page 155).

RI = \frac{TP + TN}{TP + FP + FN + TN}

As an example, we compute RI for Figure 16.4. We first compute TP + FP. The three clusters contain 6, 6, and 5 points, respectively, so the total number of "positives" or pairs of documents that are in the same cluster is:

TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 40

Of these, the x pairs in cluster 1, the o pairs in cluster 2, the ⋄ pairs in cluster 3, and the x pair in cluster 3 are true positives:

TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 20

Thus, FP = 40 − 20 = 20. FN and TN are computed similarly, resulting in the following contingency table:

                     Same cluster   Different clusters
Same class           TP = 20        FN = 24
Different classes    FP = 20        TN = 72

RI is then (20 + 72)/(20 + 20 + 24 + 72) ≈ 0.68.

The Rand index gives equal weight to false positives and false negatives. Separating similar documents is sometimes worse than putting pairs of dissimilar documents in the same cluster. We can use the F measure (Section 8.3, page 154) to penalize false negatives more strongly than false positives by selecting a value β > 1, thus giving more weight to recall.

P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN} \qquad F_\beta = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}

Based on the numbers in the contingency table, P = 20/40 = 0.5 and R = 20/44 ≈ 0.455. This gives us F1 ≈ 0.48 for β = 1 and F5 ≈ 0.456 for β = 5. In information retrieval, evaluating clustering with F has the advantage that the measure is already familiar to the research community.
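All four measures are easy to compute directly from the cluster and class labels. The following sketch (an illustrative addition, not code from the book; the label 'd' stands in for the ⋄ class) reproduces the values of Table 16.2 for the clustering of Figure 16.4.

```python
from collections import Counter
from itertools import combinations  # not needed below, pairs are counted with comb()
from math import comb, log

def external_measures(clusters, classes, beta=1.0):
    """Purity, NMI, Rand index and F_beta for two equal-length label lists."""
    N = len(clusters)
    pairs = list(zip(clusters, classes))

    # Purity: majority class per cluster (Equation 16.1).
    purity = sum(max(Counter(c for w, c in pairs if w == k).values())
                 for k in set(clusters)) / N

    # NMI (Equations 16.2-16.6), maximum likelihood estimates; the log base cancels.
    def entropy(labels):
        return -sum((n / N) * log(n / N) for n in Counter(labels).values())
    joint, wc, cc = Counter(pairs), Counter(clusters), Counter(classes)
    mi = sum((n / N) * log(N * n / (wc[w] * cc[c])) for (w, c), n in joint.items())
    nmi = mi / ((entropy(clusters) + entropy(classes)) / 2)

    # Pairwise decisions: Rand index and F_beta.
    tp = sum(comb(n, 2) for n in joint.values())
    fp = sum(comb(n, 2) for n in wc.values()) - tp
    fn = sum(comb(n, 2) for n in cc.values()) - tp
    tn = comb(N, 2) - tp - fp - fn
    ri = (tp + tn) / comb(N, 2)
    p, r = tp / (tp + fp), tp / (tp + fn)
    f = (beta**2 + 1) * p * r / (beta**2 * p + r)
    return purity, nmi, ri, f

# The clustering of Figure 16.4: cluster label and gold class for each of the 17 points.
clusters = [1]*6 + [2]*6 + [3]*5
classes  = ['x']*5 + ['o'] + ['o']*4 + ['x', 'd'] + ['d']*3 + ['x']*2
print(external_measures(clusters, classes, beta=5))   # approx. (0.71, 0.36, 0.68, 0.46)
```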
Exercise 16.3  Replace every point d in Figure 16.4 with two identical copies of d in the same class. (i) Is it less difficult, equally difficult or more difficult to cluster this set of 34 points as opposed to the 17 points in Figure 16.4? (ii) Compute purity, NMI, RI, and F5 for the clustering with 34 points. Which measures increase and which stay the same after doubling the number of points? (iii) Given your assessment in (i) and the results in (ii), which measures are best suited to compare the quality of the two clusterings?

16.4 K-means

K-means is the most important flat clustering algorithm. Its objective is to minimize the average squared Euclidean distance (Chapter 6, page 131) of documents from their cluster centers where a cluster center is defined as the mean or centroid µ of the documents in a cluster ω:

\mu(\omega) = \frac{1}{|\omega|} \sum_{x \in \omega} x

The definition assumes that documents are represented as length-normalized vectors in a real-valued space in the familiar way. We used centroids for Rocchio classification in Chapter 14 (page 292). They play a similar role here. The ideal cluster in K-means is a sphere with the centroid as its center of gravity. Ideally, the clusters should not overlap. Our desiderata for classes in Rocchio classification were the same. The difference is that we have no labeled training set in clustering for which we know which documents should be in the same cluster.

A measure of how well the centroids represent the members of their clusters is the residual sum of squares or RSS, the squared distance of each vector from its centroid summed over all vectors:

RSS_k = \sum_{x \in \omega_k} |x - \mu(\omega_k)|^2

RSS = \sum_{k=1}^{K} RSS_k    (16.7)

RSS is the objective function in K-means and our goal is to minimize it. Since N is fixed, minimizing RSS is equivalent to minimizing the average squared distance, a measure of how well centroids represent their documents.

K-MEANS({x1, ..., xN}, K)
  (s1, s2, ..., sK) ← SELECTRANDOMSEEDS({x1, ..., xN}, K)
  for k ← 1 to K
    do µk ← sk
  while stopping criterion has not been met
    do for k ← 1 to K
         do ωk ← {}
       for n ← 1 to N
         do j ← arg min_{j'} |µ_{j'} − xn|
            ωj ← ωj ∪ {xn}                    (reassignment of vectors)
       for k ← 1 to K
         do µk ← (1/|ωk|) ∑_{x ∈ ωk} x        (recomputation of centroids)
  return {µ1, ..., µK}

◮ Figure 16.5 The K-means algorithm. For most IR applications, the vectors xn ∈ R^M should be length-normalized. Alternative methods of seed selection and initialization are discussed on page 364.

The first step of K-means is to select as initial cluster centers K randomly selected documents, the seeds. The algorithm then moves the cluster centers around in space in order to minimize RSS. As shown in Figure 16.5, this is done iteratively by repeating two steps until a stopping criterion is met: reassigning documents to the cluster with the closest centroid; and recomputing each centroid based on the current members of its cluster. Figure 16.6 shows snapshots from nine iterations of the K-means algorithm for a set of points. The "centroid" column of Table 17.2 (page 397) shows examples of centroids.

We can apply one of the following termination conditions.

• A fixed number of iterations I has been completed. This condition limits the runtime of the clustering algorithm, but in some cases the quality of the clustering will be poor because of an insufficient number of iterations.

• Assignment of documents to clusters (the partitioning function γ) does not change between iterations.
Except for cases with a bad local minimum, this produces a good clustering, but runtimes may be unacceptably long.

• Centroids µk do not change between iterations. This is equivalent to γ not changing (Exercise 16.5).

• Terminate when RSS falls below a threshold. This criterion ensures that the clustering is of a desired quality after termination. In practice, we need to combine it with a bound on the number of iterations to guarantee termination.

• Terminate when the decrease in RSS falls below a threshold θ. For small θ, this indicates that we are close to convergence. Again, we need to combine it with a bound on the number of iterations to prevent very long runtimes.

◮ Figure 16.6 A K-means example for K = 2 in R². The position of the two centroids (µ's shown as X's in the top four panels) converges after nine iterations. (Panel titles: selection of seeds; assignment of documents (iter. 1); recomputation/movement of µ's (iter. 1); µ's after convergence (iter. 9); movement of µ's in 9 iterations.)

We now show that K-means converges by proving that RSS monotonically decreases in each iteration. We will use decrease in the meaning decrease or does not change in this section. First, RSS decreases in the reassignment step since each vector is assigned to the closest centroid, so the distance it contributes to RSS decreases. Second, it decreases in the recomputation step because the new centroid is the vector v for which RSS_k reaches its minimum.

RSS_k(v) = \sum_{x \in \omega_k} |v - x|^2 = \sum_{x \in \omega_k} \sum_{m=1}^{M} (v_m - x_m)^2    (16.8)

\frac{\partial RSS_k(v)}{\partial v_m} = \sum_{x \in \omega_k} 2(v_m - x_m)    (16.9)

where x_m and v_m are the mth components of their respective vectors. Setting the partial derivative to zero, we get:

v_m = \frac{1}{|\omega_k|} \sum_{x \in \omega_k} x_m    (16.10)

which is the componentwise definition of the centroid. Thus, we minimize RSS_k when the old centroid is replaced with the new centroid. RSS, the sum of the RSS_k, must then also decrease during recomputation.

Since there is only a finite set of possible clusterings, a monotonically decreasing algorithm will eventually arrive at a (local) minimum. Take care, however, to break ties consistently, e.g., by assigning a document to the cluster with the lowest index if there are several equidistant centroids. Otherwise, the algorithm can cycle forever in a loop of clusterings that have the same cost.

While this proves the convergence of K-means, there is unfortunately no guarantee that a global minimum in the objective function will be reached. This is a particular problem if a document set contains many outliers, documents that are far from any other documents and therefore do not fit well into any cluster. Frequently, if an outlier is chosen as an initial seed, then no other vector is assigned to it during subsequent iterations. Thus, we end up with a singleton cluster (a cluster with only one document) even though there is probably a clustering with lower RSS. Figure 16.7 shows an example of a suboptimal clustering resulting from a bad choice of initial seeds.

◮ Figure 16.7 The outcome of clustering in K-means depends on the initial seeds. For seeds d2 and d5, K-means converges to {{d1, d2, d3}, {d4, d5, d6}}, a suboptimal clustering. For seeds d2 and d3, it converges to {{d1, d2, d4, d5}, {d3, d6}}, the global optimum for K = 2.
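The pseudocode of Figure 16.5 translates almost line by line into the following sketch (an illustrative addition, not the book's implementation). It assumes a dense numpy array of length-normalized document vectors, stops when the assignment no longer changes, and returns the final RSS so that runs started from different random seeds can be compared.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Basic K-means for X of shape (N, M); returns (assignments, centroids, rss)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # random seeds
    assignments = np.full(len(X), -1)

    for _ in range(max_iter):
        # Reassignment: each vector goes to the cluster with the closest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assignments = dists.argmin(axis=1)      # argmin breaks ties toward the lowest index
        if np.array_equal(new_assignments, assignments):
            break                                   # stopping criterion: assignment unchanged
        assignments = new_assignments
        # Recomputation: each centroid becomes the mean of its current members.
        for k in range(K):
            members = X[assignments == k]
            if len(members) > 0:                    # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)

    rss = ((X - centroids[assignments]) ** 2).sum()
    return assignments, centroids, rss
```

Running the function several times with different values of seed and keeping the result with the lowest RSS implements the multiple-restart heuristic discussed next.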
Another type of suboptimal clustering that frequently occurs is one with empty clusters (Exercise 16.11).

Effective heuristics for seed selection include (i) excluding outliers from the seed set; (ii) trying out multiple starting points and choosing the clustering with lowest cost; and (iii) obtaining seeds from another method such as hierarchical clustering. Since deterministic hierarchical clustering methods are more predictable than K-means, a hierarchical clustering of a small random sample of size iK (e.g., for i = 5 or i = 10) often provides good seeds (see the description of the Buckshot algorithm, Chapter 17, page 399). Other initialization methods compute seeds that are not selected from the vectors to be clustered. A robust method that works well for a large variety of document distributions is to select i (e.g., i = 10) random vectors for each cluster and use their centroid as the seed for this cluster. See Section 16.6 for more sophisticated initializations.

What is the time complexity of K-means? Most of the time is spent on computing vector distances. One such operation costs Θ(M). The reassignment step computes KN distances, so its overall complexity is Θ(KNM). In the recomputation step, each vector gets added to a centroid once, so the complexity of this step is Θ(NM). For a fixed number of iterations I, the overall complexity is therefore Θ(IKNM). Thus, K-means is linear in all relevant factors: iterations, number of clusters, number of vectors and dimensionality of the space. This means that K-means is more efficient than the hierarchical algorithms in Chapter 17. We had to fix the number of iterations I, which can be tricky in practice. But in most cases, K-means quickly reaches either complete convergence or a clustering that is close to convergence. In the latter case, a few documents would switch membership if further iterations were computed, but this has a small effect on the overall quality of the clustering.

There is one subtlety in the preceding argument. Even a linear algorithm can be quite slow if one of the arguments of Θ(...) is large, and M usually is large. High dimensionality is not a problem for computing the distance between two documents. Their vectors are sparse, so that only a small fraction of the theoretically possible M componentwise differences need to be computed. Centroids, however, are dense since they pool all terms that occur in any of the documents of their clusters. As a result, distance computations are time consuming in a naive implementation of K-means. However, there are simple and effective heuristics for making centroid-document similarities as fast to compute as document-document similarities. Truncating centroids to the most significant k terms (e.g., k = 1000) hardly decreases cluster quality while achieving a significant speedup of the reassignment step (see references in Section 16.6).

The same efficiency problem is addressed by K-medoids, a variant of K-means that computes medoids instead of centroids as cluster centers. We define the medoid of a cluster as the document vector that is closest to the centroid. Since medoids are sparse document vectors, distance computations are fast.
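A small sketch of the medoid computation under this definition (an illustrative addition): the medoid is simply the member vector with the smallest distance to the centroid, found in one pass over the cluster.

```python
import numpy as np

def medoid(members):
    """Return the member vector closest to the cluster centroid.

    members: array of shape (n_k, M) holding the vectors of one cluster.
    """
    centroid = members.mean(axis=0)
    dists = ((members - centroid) ** 2).sum(axis=1)   # squared Euclidean distances
    return members[dists.argmin()]                    # an actual document vector of the cluster
```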
16.4.1 Cluster cardinality in K-means

We stated in Section 16.2 that the number of clusters K is an input to most flat clustering algorithms. What do we do if we cannot come up with a plausible guess for K?

A naive approach would be to select the optimal value of K according to the objective function, namely the value of K that minimizes RSS. Defining RSS_min(K) as the minimal RSS of all clusterings with K clusters, we observe that RSS_min(K) is a monotonically decreasing function in K (Exercise 16.13), which reaches its minimum 0 for K = N where N is the number of documents. We would end up with each document being in its own cluster. Clearly, this is not an optimal clustering.

A heuristic method that gets around this problem is to estimate RSS_min(K) as follows. We first perform i (e.g., i = 10) clusterings with K clusters (each with a different initialization) and compute the RSS of each. Then we take the minimum of the i RSS values. We denote this minimum by RSS_min(K). Now we can inspect the values RSS_min(K) as K increases and find the "knee" in the curve – the point where successive decreases in RSS_min become noticeably smaller. There are two such points in Figure 16.8, one at K = 4, where the gradient flattens slightly, and a clearer flattening at K = 9. This is typical: there is seldom a single best number of clusters. We still need to employ an external constraint to choose from a number of possible values of K (4 and 9 in this case).

◮ Figure 16.8 Estimated minimal residual sum of squares as a function of the number of clusters in K-means (x-axis: number of clusters; y-axis: residual sum of squares). In this clustering of 1203 Reuters-RCV1 documents, there are two points where the RSS_min curve flattens: at 4 clusters and at 9 clusters. The documents were selected from the categories China, Germany, Russia and Sports, so the K = 4 clustering is closest to the Reuters classification.

A second type of criterion for cluster cardinality imposes a penalty for each new cluster – where conceptually we start with a single cluster containing all documents and then search for the optimal number of clusters K by successively incrementing K by one. To determine the cluster cardinality in this way, we create a generalized objective function that combines two elements: distortion, a measure of how much documents deviate from the prototype of their clusters (e.g., RSS for K-means); and a measure of model complexity. We interpret a clustering here as a model of the data. Model complexity in clustering is usually the number of clusters or a function thereof. For K-means, we then get this selection criterion for K:

K = \arg\min_K \,[RSS_{\min}(K) + \lambda K]    (16.11)

where λ is a weighting factor. A large value of λ favors solutions with few clusters. For λ = 0, there is no penalty for more clusters and K = N is the best solution.

The obvious difficulty with Equation (16.11) is that we need to determine λ. Unless determining λ is easier than determining K directly, we are back to square one. In some cases, we can choose values of λ that have worked well for similar data sets in the past. For example, if we periodically cluster news stories from a newswire, there is likely to be a fixed value of λ that gives us the right K in each successive clustering. In this application, we would not be able to determine K based on past experience since K changes.
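Both selection strategies are straightforward to sketch (an illustrative addition, not the book's code). The helper below estimates RSS_min(K) as the best of several random restarts of a clustering function – for example the kmeans sketch above, or any function with the same return signature – and then applies the penalized criterion of Equation (16.11). With λ = 0 it degenerates to picking the largest K, so in practice one either supplies a positive λ or inspects the returned curve for a knee.

```python
def estimate_rss_min(X, K, cluster_fn, restarts=10):
    """Approximate RSS_min(K) as the lowest RSS over several random restarts.

    cluster_fn(X, K, seed=s) must return (assignments, centroids, rss)."""
    return min(cluster_fn(X, K, seed=s)[2] for s in range(restarts))

def choose_cardinality(X, cluster_fn, max_K=10, lam=0.0):
    """Pick the K in 1..max_K that minimizes RSS_min(K) + lam * K (Equation 16.11)."""
    curve = {K: estimate_rss_min(X, K, cluster_fn) for K in range(1, max_K + 1)}
    best_K = min(curve, key=lambda K: curve[K] + lam * K)
    return best_K, curve
```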
A theoretical justification for Equation (16.11) is the Akaike Information Criterion or AIC, an information-theoretic measure that trades off distortion against model complexity. The general form of AIC is:

AIC:  K = \arg\min_K \,[-2 L(K) + 2q(K)]    (16.12)

where −L(K), the negative maximum log-likelihood of the data for K clusters, is a measure of distortion and q(K), the number of parameters of a model with K clusters, is a measure of model complexity. We will not attempt to derive the AIC here, but it is easy to understand intuitively. The first property of a good model of the data is that each data point is modeled well by the model. This is the goal of low distortion. But models should also be small (i.e., have low model complexity) since a model that merely describes the data (and therefore has zero distortion) is worthless. AIC provides a theoretical justification for one particular way of weighting these two factors, distortion and model complexity, when selecting a model.

For K-means, the AIC can be stated as follows:

AIC:  K = \arg\min_K \,[RSS_{\min}(K) + 2MK]    (16.13)

Equation (16.13) is a special case of Equation (16.11) for λ = 2M. To derive Equation (16.13) from Equation (16.12) observe that q(K) = KM in K-means since each element of the K centroids is a parameter that can be varied independently; and that L(K) = −(1/2)RSS_min(K) (modulo a constant) if we view the model underlying K-means as a Gaussian mixture with hard assignment, uniform cluster priors and identical spherical covariance matrices (see Exercise 16.19).

The derivation of AIC is based on a number of assumptions, e.g., that the data are independent and identically distributed. These assumptions are only approximately true for data sets in information retrieval. As a consequence, the AIC can rarely be applied without modification in text clustering. In Figure 16.8, the dimensionality of the vector space is M ≈ 50,000. Thus, 2MK > 50,000 dominates the smaller RSS-based term (RSS_min(1) < 5000, not shown in the figure) and the minimum of the expression is reached for K = 1. But as we know, K = 4 (corresponding to the four classes China, Germany, Russia and Sports) is a better choice than K = 1. In practice, Equation (16.11) is often more useful than Equation (16.13) – with the caveat that we need to come up with an estimate for λ.

Exercise 16.4  Why are documents that do not use the same term for the concept car likely to end up in the same cluster in K-means clustering?

Exercise 16.5  Two of the possible termination conditions for K-means were (1) assignment does not change, (2) centroids do not change (page 361). Do these two conditions imply each other?

16.5 Model-based clustering

In this section, we describe a generalization of K-means, the EM algorithm. It can be applied to a larger variety of document representations and distributions than K-means.

In K-means, we attempt to find centroids that are good representatives. We can view the set of K centroids as a model that generates the data. Generating a document in this model consists of first picking a centroid at random and then adding some noise.
If the noise is normally distributed, this procedure will result in clusters of spherical shape. Model-based clustering assumes that the data were generated by a model and tries to recover the original model from the data. The model that we recover from the data then defines clusters and an assignment of documents to clusters.

A commonly used criterion for estimating the model parameters is maximum likelihood. In K-means, the quantity exp(−RSS) is proportional to the likelihood that a particular model (i.e., a set of centroids) generated the data. For K-means, maximum likelihood and minimal RSS are equivalent criteria. We denote the model parameters by Θ. In K-means, Θ = {µ1, ..., µK}.

More generally, the maximum likelihood criterion is to select the parameters Θ that maximize the log-likelihood of generating the data D:

\Theta = \arg\max_\Theta L(D|\Theta) = \arg\max_\Theta \log \prod_{n=1}^{N} P(d_n|\Theta) = \arg\max_\Theta \sum_{n=1}^{N} \log P(d_n|\Theta)

L(D|Θ) is the objective function that measures the goodness of the clustering. Given two clusterings with the same number of clusters, we prefer the one with higher L(D|Θ).

This is the same approach we took in Chapter 12 (page 237) for language modeling and in Section 13.1 (page 265) for text classification. In text classification, we chose the class that maximizes the likelihood of generating a particular document. Here, we choose the clustering Θ that maximizes the likelihood of generating a given set of documents. Once we have Θ, we can compute an assignment probability P(d|ωk; Θ) for each document-cluster pair. This set of assignment probabilities defines a soft clustering. An example of a soft assignment is that a document about Chinese cars may have a fractional membership of 0.5 in each of the two clusters China and automobiles, reflecting the fact that both topics are pertinent. A hard clustering like K-means cannot model this simultaneous relevance to two topics.

Model-based clustering provides a framework for incorporating our knowledge about a domain. K-means and the hierarchical algorithms in Chapter 17 make fairly rigid assumptions about the data. For example, clusters in K-means are assumed to be spheres. Model-based clustering offers more flexibility. The clustering model can be adapted to what we know about the underlying distribution of the data, be it Bernoulli (as in the example in Table 16.3), Gaussian with non-spherical variance (another model that is important in document clustering) or a member of a different family.

A commonly used algorithm for model-based clustering is the Expectation-Maximization algorithm or EM algorithm. EM clustering is an iterative algorithm that maximizes L(D|Θ). EM can be applied to many different types of probabilistic modeling. We will work with a mixture of multivariate Bernoulli distributions here, the distribution we know from Section 11.3 (page 222) and Section 13.3 (page 263):

P(d|\omega_k; \Theta) = \left(\prod_{t_m \in d} q_{mk}\right) \left(\prod_{t_m \notin d} (1 - q_{mk})\right)    (16.14)

where Θ = {Θ1, ..., ΘK}, Θk = (αk, q1k, ..., qMk), and qmk = P(Um = 1|ωk) are the parameters of the model.³ P(Um = 1|ωk) is the probability that a document from cluster ωk contains term tm. The probability αk is the prior of cluster ωk: the probability that a document d is in ωk if we have no information about d.

³ Um is the random variable we defined in Section 13.3 (page 266) for the Bernoulli Naive Bayes model. It takes the values 1 (term tm is present in the document) and 0 (term tm is absent in the document).
The mixture model then is:

P(d|\Theta) = \sum_{k=1}^{K} \alpha_k \left(\prod_{t_m \in d} q_{mk}\right) \left(\prod_{t_m \notin d} (1 - q_{mk})\right)    (16.15)

In this model, we generate a document by first picking a cluster k with probability αk and then generating the terms of the document according to the parameters qmk. Recall that the document representation of the multivariate Bernoulli is a vector of M Boolean values (and not a real-valued vector).

How do we use EM to infer the parameters of the clustering from the data? That is, how do we choose parameters Θ that maximize L(D|Θ)? EM is similar to K-means in that it alternates between an expectation step, corresponding to reassignment, and a maximization step, corresponding to recomputation of the parameters of the model. The parameters of K-means are the centroids, the parameters of the instance of EM in this section are the αk and qmk.

The maximization step recomputes the conditional parameters qmk and the priors αk as follows:

Maximization step:   q_{mk} = \frac{\sum_{n=1}^{N} r_{nk} I(t_m \in d_n)}{\sum_{n=1}^{N} r_{nk}} \qquad \alpha_k = \frac{\sum_{n=1}^{N} r_{nk}}{N}    (16.16)

where I(tm ∈ dn) = 1 if tm ∈ dn and 0 otherwise and rnk is the soft assignment of document dn to cluster k as computed in the preceding iteration. (We'll address the issue of initialization in a moment.) These are the maximum likelihood estimates for the parameters of the multivariate Bernoulli from Table 13.3 (page 268) except that documents are assigned fractionally to clusters here. These maximum likelihood estimates maximize the likelihood of the data given the model.

The expectation step computes the soft assignment of documents to clusters given the current parameters qmk and αk:

Expectation step:   r_{nk} = \frac{\alpha_k \left(\prod_{t_m \in d_n} q_{mk}\right)\left(\prod_{t_m \notin d_n} (1 - q_{mk})\right)}{\sum_{k=1}^{K} \alpha_k \left(\prod_{t_m \in d_n} q_{mk}\right)\left(\prod_{t_m \notin d_n} (1 - q_{mk})\right)}    (16.17)

This expectation step applies Equations (16.14) and (16.15) to computing the likelihood that ωk generated document dn. It is the classification procedure for the multivariate Bernoulli in Table 13.3. Thus, the expectation step is nothing else but Bernoulli Naive Bayes classification (including normalization, i.e. dividing by the denominator, to get a probability distribution over clusters).

We clustered a set of 11 documents into two clusters using EM in Table 16.3. After convergence in iteration 25, the first 5 documents are assigned to cluster 1 (ri,1 = 1.00) and the last 6 to cluster 2 (ri,1 = 0.00). Somewhat atypically, the final assignment is a hard assignment here. EM usually converges to a soft assignment. In iteration 25, the prior α1 for cluster 1 is 5/11 ≈ 0.45 because 5 of the 11 documents are in cluster 1.

Some terms are quickly associated with one cluster because the initial assignment can "spread" to them unambiguously. For example, membership in cluster 2 spreads from document 7 to document 8 in the first iteration because they share sugar (r8,1 = 0 in iteration 1). For parameters of terms occurring in ambiguous contexts, convergence takes longer. Seed documents 6 and 7 both contain sweet. As a result, it takes 25 iterations for the term to be unambiguously associated with cluster 2 (qsweet,1 = 0 in iteration 25).
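One EM iteration for this Bernoulli mixture is short enough to sketch directly from Equations (16.16) and (16.17) (an illustrative addition, not the book's code). It operates on a Boolean document–term matrix and uses the same smoothing constant ε as the caption of Table 16.3; bookkeeping details, such as exactly how the priors are normalized in the very first iteration, may differ slightly from the run shown in the table.

```python
import numpy as np

def m_step(X, r, eps=1e-4):
    """Maximization step (Equation 16.16) for a Boolean doc-term matrix X of shape (N, M)
    and soft assignments r of shape (N, K). Returns priors alpha (K,) and q (M, K)."""
    r = r + eps                                   # smoothing r_nk + eps, as in Table 16.3
    alpha = r.sum(axis=0) / len(X)
    q = (X.T @ r) / r.sum(axis=0)                 # q[m, k] = sum_n r_nk I(t_m in d_n) / sum_n r_nk
    return alpha, q

def e_step(X, alpha, q):
    """Expectation step (Equation 16.17): recompute soft assignments r of shape (N, K)."""
    q = np.clip(q, 1e-10, 1 - 1e-10)              # guard against log(0)
    # log P(d_n | omega_k) for the multivariate Bernoulli (Equation 16.14), in log space.
    log_like = X @ np.log(q) + (1 - X) @ np.log(1 - q)       # shape (N, K)
    log_post = np.log(alpha) + log_like
    log_post -= log_post.max(axis=1, keepdims=True)          # avoid underflow before normalizing
    r = np.exp(log_post)
    return r / r.sum(axis=1, keepdims=True)

def em_cluster(X, r_init, iterations=25):
    """Alternate the two steps, starting from an initial (hard or soft) assignment r_init."""
    r = r_init
    for _ in range(iterations):
        alpha, q = m_step(X, r)
        r = e_step(X, alpha, q)
    return r, alpha, q
```

For the example of Table 16.3, r_init would be an 11 × 2 array with the rows for documents 6 and 7 set to (1, 0) and (0, 1) respectively and zeros elsewhere; the ε smoothing in the M-step handles the initially unassigned documents.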
(a)
docID  document text                docID  document text
1      hot chocolate cocoa beans    7      sweet sugar
2      cocoa ghana africa           8      sugar cane brazil
3      beans harvest ghana          9      sweet sugar beet
4      cocoa butter                 10     sweet cake icing
5      butter truffles              11     cake black forest
6      sweet chocolate

(b)
Parameter    Iteration of clustering
             0      1      2      3      4      5      15     25
α1                  0.50   0.45   0.53   0.57   0.58   0.54   0.45
r1,1                1.00   1.00   1.00   1.00   1.00   1.00   1.00
r2,1                0.50   0.79   0.99   1.00   1.00   1.00   1.00
r3,1                0.50   0.84   1.00   1.00   1.00   1.00   1.00
r4,1                0.50   0.75   0.94   1.00   1.00   1.00   1.00
r5,1                0.50   0.52   0.66   0.91   1.00   1.00   1.00
r6,1         1.00   1.00   1.00   1.00   1.00   1.00   0.83   0.00
r7,1         0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
r8,1                0.00   0.00   0.00   0.00   0.00   0.00   0.00
r9,1                0.00   0.00   0.00   0.00   0.00   0.00   0.00
r10,1               0.50   0.40   0.14   0.01   0.00   0.00   0.00
r11,1               0.50   0.57   0.58   0.41   0.07   0.00   0.00
qafrica,1           0.000  0.100  0.134  0.158  0.158  0.169  0.200
qafrica,2           0.000  0.083  0.042  0.001  0.000  0.000  0.000
qbrazil,1           0.000  0.000  0.000  0.000  0.000  0.000  0.000
qbrazil,2           0.000  0.167  0.195  0.213  0.214  0.196  0.167
qcocoa,1            0.000  0.400  0.432  0.465  0.474  0.508  0.600
qcocoa,2            0.000  0.167  0.090  0.014  0.001  0.000  0.000
qsugar,1            0.000  0.000  0.000  0.000  0.000  0.000  0.000
qsugar,2            1.000  0.500  0.585  0.640  0.642  0.589  0.500
qsweet,1            1.000  0.300  0.238  0.180  0.159  0.153  0.000
qsweet,2            1.000  0.417  0.507  0.610  0.640  0.608  0.667

◮ Table 16.3 The EM clustering algorithm. The table shows a set of documents (a) and parameter values for selected iterations during EM clustering (b). Parameters shown are prior α1, soft assignment scores rn,1 (both omitted for cluster 2), and lexical parameters qm,k for a few terms. The authors initially assigned document 6 to cluster 1 and document 7 to cluster 2 (iteration 0). EM converges after 25 iterations. For smoothing, the rnk in Equation (16.16) were replaced with rnk + ε where ε = 0.0001.

Finding good seeds is even more critical for EM than for K-means. EM is prone to get stuck in local optima if the seeds are not chosen well. This is a general problem that also occurs in other applications of EM.⁴ Therefore, as with K-means, the initial assignment of documents to clusters is often computed by a different algorithm. For example, a hard K-means clustering may provide the initial assignment, which EM can then "soften up."

⁴ For example, this problem is common when EM is used to estimate parameters of hidden Markov models, probabilistic grammars, and machine translation models in natural language processing (Manning and Schütze 1999).

Exercise 16.6  We saw above that the time complexity of K-means is Θ(IKNM). What is the time complexity of EM?

16.6 References and further reading

Berkhin (2006b) gives a general up-to-date survey of clustering methods with special attention to scalability. The classic reference for clustering in pattern recognition, covering both K-means and EM, is (Duda et al. 2000). Rasmussen (1992) introduces clustering from an information retrieval perspective. Anderberg (1973) provides a general introduction to clustering for applications. In addition to Euclidean distance and cosine similarity, Kullback-Leibler divergence is often used in clustering as a measure of how (dis)similar documents and clusters are (Xu and Croft 1999, Muresan and Harper 2004, Kurland and Lee 2004).

The cluster hypothesis is due to Jardine and van Rijsbergen (1971) who state it as follows: Associations between documents convey information about the relevance of documents to requests.
Salton (1971a; 1975), Croft (1978), Voorhees (1985a), Can and Ozkarahan (1990), Cacheda et al. (2003), Can et al. (2004), Singitham et al. (2004) and Altingövde et al. (2008) investigate the efficiency and effectiveness of cluster-based retrieval. While some of these studies show improvements in effectiveness, efficiency or both, there is no consensus that cluster-based retrieval works well consistently across scenarios. Cluster-based language modeling was pioneered by Liu and Croft (2004).

There is good evidence that clustering of search results improves user experience and search result quality (Hearst and Pedersen 1996, Zamir and Etzioni 1999, Tombros et al. 2002, Käki 2005, Toda and Kataoka 2005), although not as much as search result structuring based on carefully edited category hierarchies (Hearst 2006). The Scatter-Gather interface for browsing collections was presented by Cutting et al. (1992). A theoretical framework for analyzing the properties of Scatter/Gather and other information seeking user interfaces is presented by Pirolli (2007).

Schütze and Silverstein (1997) evaluate LSI (Chapter 18) and truncated representations of centroids for efficient K-means clustering. The Columbia NewsBlaster system (McKeown et al. 2002), a forerunner to the now much more famous and refined Google News (http://news.google.com), used hierarchical clustering (Chapter 17) to give two levels of news topic granularity. See Hatzivassiloglou et al. (2000) for details, and Chen and Lin (2000) and Radev et al. (2001) for related systems. Other applications of clustering in information retrieval are duplicate detection (Yang and Callan (2006), Section 19.6, page 438), novelty detection (see references in Section 17.9, page 399) and metadata discovery on the semantic web (Alonso et al. 2006).

The discussion of external evaluation measures is partially based on Strehl (2002). Dom (2002) proposes a measure Q0 that is better motivated theoretically than NMI. Q0 is the number of bits needed to transmit class memberships assuming cluster memberships are known. The Rand index is due to Rand (1971). Hubert and Arabie (1985) propose an adjusted Rand index that ranges between −1 and 1 and is 0 if there is only chance agreement between clusters and classes (similar to κ in Chapter 8, page 165). Basu et al. (2004) argue that the three evaluation measures NMI, Rand index and F measure give very similar results. Stein et al. (2003) propose expected edge density as an internal measure and give evidence that it is a good predictor of the quality of a clustering. Kleinberg (2002) and Meilă (2005) present axiomatic frameworks for comparing clusterings.

Authors that are often credited with the invention of the K-means algorithm include Lloyd (1982) (first distributed in 1957), Ball (1965), MacQueen (1967), and Hartigan and Wong (1979). Arthur and Vassilvitskii (2006) investigate the worst-case complexity of K-means. Bradley and Fayyad (1998), Pelleg and Moore (1999) and Davidson and Satyanarayana (2003) investigate the convergence properties of K-means empirically and how it depends on initial seed selection. Dhillon and Modha (2001) compare K-means clusters with SVD-based clusters (Chapter 18).
The K-medoid algorithm was presented by Kaufman and Rousseeuw (1990). The EM algorithm was originally introduced by Dempster et al. (1977). An in-depth treatment of EM is (McLachlan and Krishnan 1996). See Section 18.5 (page 417) for publications on latent analysis, which can also be viewed as soft clustering.

AIC is due to Akaike (1974) (see also Burnham and Anderson (2002)). An alternative to AIC is BIC, which can be motivated as a Bayesian model selection procedure (Schwarz 1978). Fraley and Raftery (1998) show how to choose an optimal number of clusters based on BIC. An application of BIC to K-means is (Pelleg and Moore 2000). Hamerly and Elkan (2003) propose an alternative to BIC that performs better in their experiments. Another influential Bayesian approach for determining the number of clusters (simultaneously with cluster assignment) is described by Cheeseman and Stutz (1996). Two methods for determining cardinality without external criteria are presented by Tibshirani et al. (2001).

We only have space here for classical completely unsupervised clustering. An important current topic of research is how to use prior knowledge to guide clustering (e.g., Ji and Xu (2006)) and how to incorporate interactive feedback during clustering (e.g., Huang and Mitchell (2006)). Fayyad et al. (1998) propose an initialization for EM clustering. For algorithms that can cluster very large data sets in one scan through the data see Bradley et al. (1998).

The applications in Table 16.1 all cluster documents. Other information retrieval applications cluster words (e.g., Crouch 1988), contexts of words (e.g., Schütze and Pedersen 1995) or words and documents simultaneously (e.g., Tishby and Slonim 2000, Dhillon 2001, Zha et al. 2001). Simultaneous clustering of words and documents is an example of co-clustering or biclustering.

16.7 Exercises

Exercise 16.7  Let Ω be a clustering that exactly reproduces a class structure C and Ω′ a clustering that further subdivides some clusters in Ω. Show that I(Ω; C) = I(Ω′; C).

Exercise 16.8  Show that I(Ω; C) ≤ [H(Ω) + H(C)]/2.

Exercise 16.9  Mutual information is symmetric in the sense that its value does not change if the roles of clusters and classes are switched: I(Ω; C) = I(C; Ω). Which of the other three evaluation measures are symmetric in this sense?

Exercise 16.10  Compute RSS for the two clusterings in Figure 16.7.

Exercise 16.11  (i) Give an example of a set of points and three initial centroids (which need not be members of the set of points) for which 3-means converges to a clustering with an empty cluster. (ii) Can a clustering with an empty cluster be the global optimum with respect to RSS?

Exercise 16.12  Download Reuters-21578. Discard documents that do not occur in one of the 10 classes acquisitions, corn, crude, earn, grain, interest, money-fx, ship, trade, and wheat. Discard documents that occur in two of these 10 classes. (i) Compute a K-means clustering of this subset into 10 clusters. There are a number of software packages that implement K-means, such as WEKA (Witten and Frank 2005) and R (R Development Core Team 2005). (ii) Compute purity, normalized mutual information, F1 and RI for the clustering with respect to the 10 classes. (iii) Compile a confusion matrix (Table 14.5, page 308) for the 10 classes and 10 clusters. Identify classes that give rise to false positives and false negatives.
Exercise 16.13  Prove that RSSmin(K) is monotonically decreasing in K.

Exercise 16.14  There is a soft version of K-means that computes the fractional membership of a document in a cluster as a monotonically decreasing function of the distance ∆ from its centroid, e.g., as e^{−∆}. Modify reassignment and recomputation steps of hard K-means for this soft version.

Exercise 16.15  In the last iteration in Table 16.3, document 6 is in cluster 2 even though it was the initial seed for cluster 1. Why does the document change membership?

Exercise 16.16  The values of the parameters qmk in iteration 25 in Table 16.3 are rounded. What are the exact values that EM will converge to?

Exercise 16.17  Perform a K-means clustering for the documents in Table 16.3. After how many iterations does K-means converge? Compare the result with the EM clustering in Table 16.3 and discuss the differences.

Exercise 16.18  [⋆⋆⋆] Modify the expectation and maximization steps of EM for a Gaussian mixture. The maximization step computes the maximum likelihood parameter estimates αk, µk, and Σk for each of the clusters. The expectation step computes for each vector a soft assignment to clusters (Gaussians) based on their current parameters. Write down the equations for Gaussian mixtures corresponding to Equations (16.16) and (16.17).

Exercise 16.19  [⋆⋆⋆] Show that K-means can be viewed as the limiting case of EM for Gaussian mixtures if variance is very small and all covariances are 0.

Exercise 16.20  [⋆⋆⋆] The within-point scatter of a clustering is defined as \sum_k \frac{1}{2} \sum_{x_i \in \omega_k} \sum_{x_j \in \omega_k} |x_i - x_j|^2. Show that minimizing RSS and minimizing within-point scatter are equivalent.

Exercise 16.21  [⋆⋆⋆] Derive an AIC criterion for the multivariate Bernoulli mixture model from Equation (16.12).