Unsupervised Learning: Clustering

Clustering
•  Partition unlabeled examples into disjoint subsets (clusters), such that:
   –  Examples within a cluster are very similar.
   –  Examples in different clusters are very different.
•  Discover new categories in an unsupervised manner (no sample category labels are provided).

Clustering Example
[Figure: a 2-D scatter plot of unlabeled points that fall into several visually distinct groups.]

Hierarchical Clustering
•  Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.
•  Recursive application of a standard clustering algorithm can produce a hierarchical clustering.
[Figure: example taxonomy: animal splits into vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean).]

Agglomerative vs. Divisive Clustering
•  Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
•  Divisive (partitional, top-down) methods immediately separate all examples into clusters.

Direct Clustering Method
•  Direct clustering methods require a specification of the number of clusters, k, desired.
•  A clustering evaluation function assigns a real-valued quality measure to a clustering.
•  The number of clusters can be determined automatically by explicitly generating clusterings for multiple values of k and choosing the best result according to a clustering evaluation function.

Hierarchical Agglomerative Clustering (HAC)
•  Assumes a similarity function for determining the similarity of two instances.
•  Starts with every instance in its own cluster and then repeatedly joins the two most similar clusters until only one cluster remains.
•  The history of merging forms a binary tree, or hierarchy.

HAC Algorithm
Start with all instances in their own cluster.
Until there is only one cluster:
   Among the current clusters, determine the two clusters, ci and cj, that are most similar.
   Replace ci and cj with a single cluster ci ∪ cj.
(A code sketch of this procedure follows the Cluster Similarity slide below.)

Hierarchical Clustering
•  Uses a distance matrix as the clustering criterion. This method does not require the number of clusters, k, as an input, but it does need a termination condition.
[Figure: the same five objects {a, b, c, d, e} are merged step by step, bottom-up, by agglomerative clustering (AGNES), and split step by step, top-down, by divisive clustering (DIANA).]

Dendrogram: Shows How Clusters Are Merged
•  Decompose the data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram.
•  A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

Cluster Similarity
•  Assume a similarity function that determines the similarity of two instances: sim(x, y).
   –  Euclidean/Mahalanobis, Hamming, cosine similarity, Pearson r, etc.
•  How do we compute the similarity of two clusters, each possibly containing multiple instances?
   –  Single link: similarity of the two most similar members.
   –  Complete link: similarity of the two least similar members.
   –  Group average: average similarity between members.
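To make the HAC loop above concrete, here is a minimal from-scratch sketch in Python/NumPy, not taken from the slides: single- or complete-link agglomerative clustering using negative Euclidean distance as the similarity. The function name hac and the brute-force O(n³) search are illustrative simplifications; in practice one would use an optimized routine such as scipy.cluster.hierarchy.linkage.

```python
import numpy as np

def hac(X, linkage="single"):
    """Naive hierarchical agglomerative clustering (illustrative sketch).

    X: (n, m) array of real-valued instances.
    linkage: 'single' (max pairwise similarity) or 'complete' (min pairwise similarity).
    Returns the merge history as a list of (cluster_i, cluster_j) index sets.
    """
    # Pairwise similarity: negative Euclidean distance (larger = more similar).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sim = -dist

    clusters = [{i} for i in range(len(X))]   # start with each instance in its own cluster
    merges = []

    def cluster_sim(a, b):
        pair_sims = [sim[i, j] for i in a for j in b]
        return max(pair_sims) if linkage == "single" else min(pair_sims)

    while len(clusters) > 1:
        # Among the current clusters, find the two that are most similar.
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_sim(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] | clusters[j]   # replace ci and cj with ci ∪ cj
        del clusters[j]
    return merges

# Example: five 2-D points; the first merges join the nearby pairs.
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]], dtype=float)
print(hac(X, linkage="single"))
```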
Distance between Clusters
•  Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min{ dist(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
•  Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max{ dist(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
•  Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg{ dist(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
•  Centroid: distance between the centroids of the two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
•  Medoid: distance between the medoids of the two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
   –  Medoid: a chosen, centrally located object in the cluster.

Single Link Agglomerative Clustering
•  Use the maximum similarity over pairs of members:
      sim(ci, cj) = max{ sim(x, y) : x ∈ ci, y ∈ cj }
•  Can result in "straggly" (long and thin) clusters due to the chaining effect.
   –  Appropriate in some domains, such as clustering islands.

Single Link Example
[Figure: single-link clustering producing elongated, chained clusters.]

Complete Link Agglomerative Clustering
•  Use the minimum similarity over pairs of members:
      sim(ci, cj) = min{ sim(x, y) : x ∈ ci, y ∈ cj }
•  Produces "tighter," more spherical clusters that are typically preferable.

Complete Link Example
[Figure: complete-link clustering producing compact, roughly spherical clusters.]

Computational Complexity
•  In the first iteration, all HAC methods need to compute the similarity of all pairs of the n individual instances, which is O(n²).
•  In each of the subsequent n − 2 merging iterations, they must compute the distance between the most recently created cluster and all other existing clusters.
•  In order to maintain overall O(n²) performance, computing the similarity to each other cluster must be done in constant time.

Computing Cluster Similarity
•  After merging ci and cj, the similarity of the resulting cluster to any other cluster, ck, can be computed by:
   –  Single link: sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))
   –  Complete link: sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))

Non-Hierarchical Clustering

Non-Hierarchical Clustering
•  Typically must provide the number of desired clusters, k.
•  Randomly choose k instances as seeds, one per cluster.
•  Form initial clusters based on these seeds.
•  Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
•  Stop when the clustering converges or after a fixed number of iterations.

K-Means
•  Assumes instances are real-valued vectors.
•  Clusters are based on centroids (the center of gravity, or mean, of the points in a cluster c):
      μ(c) = (1/|c|) ∑_{x∈c} x
•  Reassignment of instances to clusters is based on distance to the current cluster centroids.

Distance Metrics
•  Euclidean distance (L2 norm):  L2(x, y) = √( ∑i (xi − yi)² )
•  L1 norm:  L1(x, y) = ∑i |xi − yi|
•  Cosine similarity (transform to a distance by subtracting from 1):  cos(x, y) = (x · y) / (‖x‖ ‖y‖)

K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s1, s2, … sk} as seeds.
Until the clustering converges or another stopping criterion is met:
   For each instance xi:
      Assign xi to the cluster cj such that d(xi, sj) is minimal.
   (Update the seeds to the centroid of each cluster.)
   For each cluster cj:
      sj = μ(cj)
(A runnable sketch follows the Time Complexity slide below.)

K-Means Example (k = 2)
[Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged.]

Time Complexity
•  Assume computing the distance between two instances is O(m), where m is the dimensionality of the vectors.
•  Reassigning clusters: O(kn) distance computations, i.e., O(knm).
•  Computing centroids: each instance vector is added once to some centroid: O(nm).
•  Assuming these two steps are each done for I iterations, the total is O(Iknm).
•  Linear in all relevant factors, assuming a fixed number of iterations; more efficient than O(n²) HAC.
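Below is a minimal NumPy sketch of the k-means loop just described, using Euclidean distance. The function name kmeans, the seed handling, and the convergence test (centroids stop moving) are illustrative assumptions, not part of the slides.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means sketch: X is an (n, m) array of real-valued vectors."""
    rng = np.random.default_rng(seed)
    # Select k random instances as the initial seeds.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()

    for _ in range(max_iters):
        # Assignment step: put each xi in the cluster whose centroid is nearest (L2 distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)  # (n, k)
        labels = dists.argmin(axis=1)

        # Update step: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        if np.allclose(new_centroids, centroids):  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

# Example: two well-separated blobs should be recovered as two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```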
K-Means Objective
•  The objective of k-means is to minimize the total sum of the squared distances of every point to its corresponding cluster centroid.
•  Finding the global optimum is NP-hard.
•  The k-means algorithm is guaranteed to converge to a local optimum.

Seed Choice
•  Results can vary based on the random seed selection.
•  Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
•  Select good seeds using a heuristic or the results of another method.

Buckshot Algorithm
•  Combines HAC and k-means clustering.
•  First randomly take a sample of instances of size √n.
•  Run group-average HAC on this sample, which takes only O(n) time.
•  Use the results of HAC as initial seeds for k-means.
•  The overall algorithm is O(n) and avoids the problems of bad seed selection.

Soft Clustering
•  Clustering typically assumes that each instance is given a "hard" assignment to exactly one cluster.
•  This does not allow uncertainty in class membership, or an instance to belong to more than one cluster.
•  Soft clustering gives the probabilities that an instance belongs to each of a set of clusters.
•  Each instance is assigned a probability distribution across a set of discovered categories (the probabilities of all categories must sum to 1).

Expectation Maximization (EM)
•  Probabilistic method for soft clustering.
•  Direct method that assumes k clusters: {c1, c2, … ck}.
•  Soft version of k-means.
•  Assumes a probabilistic model of categories that allows computing P(ci | E) for each category, ci, for a given example, E.
•  For text, typically assume a naïve-Bayes category model.
   –  Parameters θ = {P(ci), P(wj | ci) : i ∈ {1,…,k}, j ∈ {1,…,|V|}}

EM Algorithm
•  Iterative method for learning a probabilistic categorization model from unsupervised data.
•  Initially assume a random assignment of examples to categories.
•  Learn an initial probabilistic model by estimating the model parameters θ from this randomly labeled data.
•  Iterate the following two steps until convergence:
   –  Expectation (E-step): Compute P(ci | E) for each example given the current model, and probabilistically re-label the examples based on these posterior probability estimates.
   –  Maximization (M-step): Re-estimate the model parameters, θ, from the probabilistically re-labeled data.

EM (illustrated)
[Figure sequence showing unlabeled examples being soft-labeled "+"/"−":]
•  Initialize: assign random probabilistic labels to the unlabeled data.
•  Initialize: give the soft-labeled training data to a probabilistic learner.
•  Initialize: the learner produces a probabilistic classifier.
•  E-step: relabel the unlabeled data using the trained classifier.
•  M-step: retrain the classifier on the relabeled data; continue EM iterations until the probabilistic labels on the unlabeled data converge.

Learning from Probabilistically Labeled Data
•  Instead of training data labeled with "hard" category labels, the training data is labeled with "soft" probabilistic category labels.
•  When estimating the model parameters θ from the training data, weight the counts by the corresponding probability of the given category label.
•  For example, if P(c1 | E) = 0.8 and P(c2 | E) = 0.2, each word wj in E contributes only 0.8 towards the counts n1 and n1j, and 0.2 towards the counts n2 and n2j (see the sketch below).
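As a sketch of the weighted counting just described, assuming a naïve-Bayes bag-of-words model with Laplace smoothing, the following illustrative function estimates P(ci) and P(wj | ci) from soft labels. The names estimate_parameters, docs, soft_labels, and vocab are hypothetical, not from the slides.

```python
from collections import defaultdict

def estimate_parameters(docs, soft_labels, k, vocab, alpha=1.0):
    """M-step sketch: estimate P(ci) and P(wj | ci) from soft-labeled documents.

    docs: list of token lists; soft_labels[d][i] = P(ci | document d) from the E-step.
    Counts are weighted by the soft label instead of a hard 0/1 assignment.
    """
    n_i = [0.0] * k                                # expected number of documents per category
    n_ij = [defaultdict(float) for _ in range(k)]  # expected word counts per category

    for tokens, probs in zip(docs, soft_labels):
        for i in range(k):
            n_i[i] += probs[i]
            for w in tokens:
                n_ij[i][w] += probs[i]             # each word contributes P(ci | E), e.g. 0.8 / 0.2

    prior = [n_i[i] / len(docs) for i in range(k)]  # P(ci)
    cond = []
    for i in range(k):
        total = sum(n_ij[i].values())
        # Laplace-smoothed P(wj | ci)
        cond.append({w: (n_ij[i][w] + alpha) / (total + alpha * len(vocab)) for w in vocab})
    return prior, cond

# Tiny example with two documents and k = 2 categories (soft labels sum to 1 per document).
docs = [["ball", "game"], ["election", "vote", "vote"]]
soft_labels = [[0.8, 0.2], [0.1, 0.9]]
vocab = {"ball", "game", "election", "vote"}
prior, cond = estimate_parameters(docs, soft_labels, k=2, vocab=vocab)
print(prior)
print(cond[0]["ball"], cond[1]["vote"])
```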
Naïve Bayes EM
Randomly assign the examples probabilistic category labels.
Use standard naïve-Bayes training to learn a probabilistic model with parameters θ from this labeled data.
Until convergence, or until a maximum number of iterations is reached:
   E-Step: Use the naïve Bayes model θ to compute P(ci | E) for each category and example, and re-label each example using these probability values as soft category labels.
   M-Step: Use standard naïve-Bayes training to re-estimate the parameters θ from these new probabilistic category labels.

Assessing Clustering Tendency
•  Assess whether non-random structure exists in the data by measuring the probability that the data were generated by a uniform distribution.
•  Test spatial randomness with a statistical test: the Hopkins statistic.
   –  Given a dataset D regarded as a sample of a random variable o, determine how far o is from being uniformly distributed in the data space.
   –  Sample n points, p1, …, pn, uniformly at random from the data space containing D. For each pi, find its nearest neighbor in D: xi = min{ dist(pi, v) : v ∈ D }.
   –  Sample n points, q1, …, qn, uniformly from D. For each qi, find its nearest neighbor in D − {qi}: yi = min{ dist(qi, v) : v ∈ D, v ≠ qi }.
   –  Calculate the Hopkins statistic:  H = ∑ yi / (∑ xi + ∑ yi)
   –  If D is uniformly distributed, ∑ xi and ∑ yi will be close to each other and H will be close to 0.5. If D is highly skewed (clustered), H will be close to 0.
   –  (A code sketch of this statistic appears after the silhouette slide below.)

Measuring Clustering Quality
•  Two kinds of methods: extrinsic vs. intrinsic.
•  Extrinsic: supervised, i.e., the ground truth is available.
   –  Compare a clustering against the ground truth using a clustering quality measure.
•  Intrinsic: unsupervised, i.e., the ground truth is unavailable.
   –  Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are.
   –  Example: the silhouette coefficient.

Measuring Clustering Quality: Extrinsic Methods
•  Clustering quality measure: Q(C, Cg), for a clustering C given the ground truth Cg.
•  Q is good if it satisfies the following four essential criteria:
   –  Cluster homogeneity: the purer, the better.
   –  Cluster completeness: objects belonging to the same category in the ground truth should be assigned to the same cluster.
   –  Rag bag: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., a "miscellaneous" or "other" category).
   –  Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces.

Silhouette Coefficient
•  Considers both the intra- and inter-cluster distances.
•  For a point xi, the average of the distances to all other points in the same cluster is computed. This value is ai.
•  Then, for each cluster that does not contain xi, the average distance of xi to all the data points in that cluster is computed; bi is the minimum of these averages (the average distance to the nearest other cluster).
•  Using ai and bi, the silhouette coefficient of the point is computed:
      s(xi) = (bi − ai) / max(ai, bi)
•  The average of all the silhouettes in the dataset is called the average silhouette width.
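The following NumPy sketches implement the two evaluation tools discussed above: the Hopkins statistic, under the H = ∑ yi / (∑ xi + ∑ yi) convention used on the slide, and the silhouette coefficient with its average silhouette width. The function names, the bounding-box sampling, and the sample size are illustrative assumptions; in practice a library routine such as sklearn.metrics.silhouette_score can compute the silhouette directly.

```python
import numpy as np

def hopkins_statistic(D, n_samples=50, seed=0):
    """H = sum(y) / (sum(x) + sum(y)); about 0.5 for uniform data, near 0 for clustered data."""
    rng = np.random.default_rng(seed)
    n, m = D.shape
    # p1..pn: uniform random locations inside the bounding box of the data space.
    p = rng.uniform(D.min(axis=0), D.max(axis=0), size=(n_samples, m))
    x = np.linalg.norm(p[:, None, :] - D[None, :, :], axis=-1).min(axis=1)
    # q1..qn: actual data points; nearest neighbor in D excluding the point itself.
    q_idx = rng.choice(n, size=n_samples, replace=False)
    d_qq = np.linalg.norm(D[q_idx][:, None, :] - D[None, :, :], axis=-1)
    d_qq[np.arange(n_samples), q_idx] = np.inf        # exclude qi itself
    y = d_qq.min(axis=1)
    return y.sum() / (x.sum() + y.sum())

def silhouette(X, labels):
    """Average silhouette width: mean over points of (b - a) / max(a, b)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i, li in enumerate(labels):
        same = (labels == li)
        a = dist[i, same & (np.arange(len(X)) != i)].mean()   # avg intra-cluster distance
        b = min(dist[i, labels == lj].mean() for lj in set(labels) if lj != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Example: two tight, well-separated blobs give H near 0 and a silhouette near 1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(6, 0.3, (40, 2))])
labels = np.array([0] * 40 + [1] * 40)
print(hopkins_statistic(X), silhouette(X, labels))
```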
Silhouette Coefficient
•  To evaluate the quality of a clustering, one can compute the average silhouette coefficient over all points.

Conclusions
•  Unsupervised learning induces categories from unlabeled data.
•  Agglomerative vs. divisive; hard vs. soft clustering.
•  There are a variety of approaches, including:
   –  HAC
   –  k-means
   –  EM