Unsupervised Learning: Clustering

Clustering
•  Partition unlabeled examples into disjoint subsets (clusters), such that:
   –  Examples within a cluster are very similar.
   –  Examples in different clusters are very different.
•  Discover new categories in an unsupervised manner (no sample category labels are provided).

Clustering Example
[Figure: a 2-D scatter plot of unlabeled points that fall into several visually distinct groups.]

Hierarchical Clustering
•  Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.
•  Recursive application of a standard clustering algorithm can produce a hierarchical clustering.
[Figure: example taxonomy: animal splits into vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean).]

Agglomerative vs. Divisive Clustering
•  Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
•  Divisive (partitional, top-down) methods immediately separate all examples into clusters.

Direct Clustering Method
•  Direct clustering methods require a specification of the number of clusters, k, desired.
•  A clustering evaluation function assigns a real-valued quality measure to a clustering.
•  The number of clusters can be determined automatically by explicitly generating clusterings for multiple values of k and choosing the best result according to a clustering evaluation function.

Hierarchical Agglomerative Clustering (HAC)
•  Assumes a similarity function for determining the similarity of two instances.
•  Starts with every instance in its own cluster and then repeatedly joins the two most similar clusters until only one cluster remains.
•  The history of merging forms a binary tree, or hierarchy.

HAC Algorithm
Start with all instances in their own cluster.
Until there is only one cluster:
   Among the current clusters, determine the two clusters, ci and cj, that are most similar.
   Replace ci and cj with a single cluster ci ∪ cj.
(A code sketch of this procedure follows the Cluster Similarity slide below.)

Hierarchical Clustering
•  Uses a distance matrix as the clustering criterion. This method does not require the number of clusters, k, as an input, but it does need a termination condition.
[Figure: the same five objects {a, b, c, d, e} are merged step by step, bottom-up, by agglomerative clustering (AGNES), and split step by step, top-down, by divisive clustering (DIANA).]

Dendrogram: Shows How Clusters Are Merged
•  Decompose the data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram.
•  A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

Cluster Similarity
•  Assume a similarity function that determines the similarity of two instances: sim(x, y).
   –  Euclidean/Mahalanobis, Hamming, cosine similarity, Pearson r, etc.
•  How do we compute the similarity of two clusters, each possibly containing multiple instances?
   –  Single link: similarity of the two most similar members.
   –  Complete link: similarity of the two least similar members.
   –  Group average: average similarity between members.
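To make the HAC loop above concrete, here is a minimal from-scratch sketch in Python/NumPy, not taken from the slides: single- or complete-link agglomerative clustering using negative Euclidean distance as the similarity. The function name hac and the brute-force O(n³) search are illustrative simplifications; in practice one would use an optimized routine such as scipy.cluster.hierarchy.linkage.

```python
import numpy as np

def hac(X, linkage="single"):
    """Naive hierarchical agglomerative clustering (illustrative sketch).

    X: (n, m) array of real-valued instances.
    linkage: 'single' (max pairwise similarity) or 'complete' (min pairwise similarity).
    Returns the merge history as a list of (cluster_i, cluster_j) index sets.
    """
    # Pairwise similarity: negative Euclidean distance (larger = more similar).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sim = -dist

    clusters = [{i} for i in range(len(X))]   # start with each instance in its own cluster
    merges = []

    def cluster_sim(a, b):
        pair_sims = [sim[i, j] for i in a for j in b]
        return max(pair_sims) if linkage == "single" else min(pair_sims)

    while len(clusters) > 1:
        # Among the current clusters, find the two that are most similar.
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_sim(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] | clusters[j]   # replace ci and cj with ci ∪ cj
        del clusters[j]
    return merges

# Example: five 2-D points; the first merges join the nearby pairs.
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]], dtype=float)
print(hac(X, linkage="single"))
```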
Distance between Clusters
•  Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min{ dist(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
•  Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max{ dist(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
•  Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg{ dist(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
•  Centroid: distance between the centroids of the two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
•  Medoid: distance between the medoids of the two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
   –  Medoid: a chosen, centrally located object in the cluster.

Single Link Agglomerative Clustering
•  Use the maximum similarity over pairs of members:
      sim(ci, cj) = max{ sim(x, y) : x ∈ ci, y ∈ cj }
•  Can result in "straggly" (long and thin) clusters due to the chaining effect.
   –  Appropriate in some domains, such as clustering islands.

Single Link Example
[Figure: single-link clustering producing elongated, chained clusters.]

Complete Link Agglomerative Clustering
•  Use the minimum similarity over pairs of members:
      sim(ci, cj) = min{ sim(x, y) : x ∈ ci, y ∈ cj }
•  Produces "tighter," more spherical clusters that are typically preferable.

Complete Link Example
[Figure: complete-link clustering producing compact, roughly spherical clusters.]

Computational Complexity
•  In the first iteration, all HAC methods need to compute the similarity of all pairs of the n individual instances, which is O(n²).
•  In each of the subsequent n − 2 merging iterations, they must compute the distance between the most recently created cluster and all other existing clusters.
•  In order to maintain overall O(n²) performance, computing the similarity to each other cluster must be done in constant time.

Computing Cluster Similarity
•  After merging ci and cj, the similarity of the resulting cluster to any other cluster, ck, can be computed by:
   –  Single link: sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))
   –  Complete link: sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))

Non-Hierarchical Clustering

Non-Hierarchical Clustering
•  Typically must provide the number of desired clusters, k.
•  Randomly choose k instances as seeds, one per cluster.
•  Form initial clusters based on these seeds.
•  Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
•  Stop when the clustering converges or after a fixed number of iterations.

K-Means
•  Assumes instances are real-valued vectors.
•  Clusters are based on centroids (the center of gravity, or mean, of the points in a cluster c):
      μ(c) = (1/|c|) ∑_{x∈c} x
•  Reassignment of instances to clusters is based on distance to the current cluster centroids.

Distance Metrics
•  Euclidean distance (L2 norm):  L2(x, y) = √( ∑i (xi − yi)² )
•  L1 norm:  L1(x, y) = ∑i |xi − yi|
•  Cosine similarity (transform to a distance by subtracting from 1):  cos(x, y) = (x · y) / (‖x‖ ‖y‖)

K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s1, s2, … sk} as seeds.
Until the clustering converges or another stopping criterion is met:
   For each instance xi:
      Assign xi to the cluster cj such that d(xi, sj) is minimal.
   (Update the seeds to the centroid of each cluster.)
   For each cluster cj:
      sj = μ(cj)
(A runnable sketch follows the Time Complexity slide below.)

K-Means Example (k = 2)
[Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged.]

Time Complexity
•  Assume computing the distance between two instances is O(m), where m is the dimensionality of the vectors.
•  Reassigning clusters: O(kn) distance computations, i.e., O(knm).
•  Computing centroids: each instance vector is added once to some centroid: O(nm).
•  Assuming these two steps are each done for I iterations, the total is O(Iknm).
•  Linear in all relevant factors, assuming a fixed number of iterations; more efficient than O(n²) HAC.
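Below is a minimal NumPy sketch of the k-means loop just described, using Euclidean distance. The function name kmeans, the seed handling, and the convergence test (centroids stop moving) are illustrative assumptions, not part of the slides.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means sketch: X is an (n, m) array of real-valued vectors."""
    rng = np.random.default_rng(seed)
    # Select k random instances as the initial seeds.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()

    for _ in range(max_iters):
        # Assignment step: put each xi in the cluster whose centroid is nearest (L2 distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)  # (n, k)
        labels = dists.argmin(axis=1)

        # Update step: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        if np.allclose(new_centroids, centroids):  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

# Example: two well-separated blobs should be recovered as two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```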
K-Means Objective
•  The objective of k-means is to minimize the total sum of the squared distances of every point to its corresponding cluster centroid.
•  Finding the global optimum is NP-hard.
•  The k-means algorithm is guaranteed to converge to a local optimum.

Seed Choice
•  Results can vary based on the random seed selection.
•  Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
•  Select good seeds using a heuristic or the results of another method.

Buckshot Algorithm
•  Combines HAC and k-means clustering.
•  First randomly take a sample of instances of size √n.
•  Run group-average HAC on this sample, which takes only O(n) time.
•  Use the results of HAC as initial seeds for k-means.
•  The overall algorithm is O(n) and avoids the problems of bad seed selection.

Soft Clustering
•  Clustering typically assumes that each instance is given a "hard" assignment to exactly one cluster.
•  This does not allow uncertainty in class membership, or an instance to belong to more than one cluster.
•  Soft clustering gives the probabilities that an instance belongs to each of a set of clusters.
•  Each instance is assigned a probability distribution across a set of discovered categories (the probabilities of all categories must sum to 1).

Expectation Maximization (EM)
•  Probabilistic method for soft clustering.
•  Direct method that assumes k clusters: {c1, c2, … ck}.
•  Soft version of k-means.
•  Assumes a probabilistic model of categories that allows computing P(ci | E) for each category, ci, for a given example, E.
•  For text, typically assume a naïve-Bayes category model.
   –  Parameters θ = {P(ci), P(wj | ci) : i ∈ {1,…,k}, j ∈ {1,…,|V|}}

EM Algorithm
•  Iterative method for learning a probabilistic categorization model from unsupervised data.
•  Initially assume a random assignment of examples to categories.
•  Learn an initial probabilistic model by estimating the model parameters θ from this randomly labeled data.
•  Iterate the following two steps until convergence:
   –  Expectation (E-step): Compute P(ci | E) for each example given the current model, and probabilistically re-label the examples based on these posterior probability estimates.
   –  Maximization (M-step): Re-estimate the model parameters, θ, from the probabilistically re-labeled data.

EM (illustrated)
[Figure sequence showing unlabeled examples being soft-labeled "+"/"−":]
•  Initialize: assign random probabilistic labels to the unlabeled data.
•  Initialize: give the soft-labeled training data to a probabilistic learner.
•  Initialize: the learner produces a probabilistic classifier.
•  E-step: relabel the unlabeled data using the trained classifier.
•  M-step: retrain the classifier on the relabeled data; continue EM iterations until the probabilistic labels on the unlabeled data converge.

Learning from Probabilistically Labeled Data
•  Instead of training data labeled with "hard" category labels, the training data is labeled with "soft" probabilistic category labels.
•  When estimating the model parameters θ from the training data, weight the counts by the corresponding probability of the given category label.
•  For example, if P(c1 | E) = 0.8 and P(c2 | E) = 0.2, each word wj in E contributes only 0.8 towards the counts n1 and n1j, and 0.2 towards the counts n2 and n2j (see the sketch below).
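As a sketch of the weighted counting just described, assuming a naïve-Bayes bag-of-words model with Laplace smoothing, the following illustrative function estimates P(ci) and P(wj | ci) from soft labels. The names estimate_parameters, docs, soft_labels, and vocab are hypothetical, not from the slides.

```python
from collections import defaultdict

def estimate_parameters(docs, soft_labels, k, vocab, alpha=1.0):
    """M-step sketch: estimate P(ci) and P(wj | ci) from soft-labeled documents.

    docs: list of token lists; soft_labels[d][i] = P(ci | document d) from the E-step.
    Counts are weighted by the soft label instead of a hard 0/1 assignment.
    """
    n_i = [0.0] * k                                # expected number of documents per category
    n_ij = [defaultdict(float) for _ in range(k)]  # expected word counts per category

    for tokens, probs in zip(docs, soft_labels):
        for i in range(k):
            n_i[i] += probs[i]
            for w in tokens:
                n_ij[i][w] += probs[i]             # each word contributes P(ci | E), e.g. 0.8 / 0.2

    prior = [n_i[i] / len(docs) for i in range(k)]  # P(ci)
    cond = []
    for i in range(k):
        total = sum(n_ij[i].values())
        # Laplace-smoothed P(wj | ci)
        cond.append({w: (n_ij[i][w] + alpha) / (total + alpha * len(vocab)) for w in vocab})
    return prior, cond

# Tiny example with two documents and k = 2 categories (soft labels sum to 1 per document).
docs = [["ball", "game"], ["election", "vote", "vote"]]
soft_labels = [[0.8, 0.2], [0.1, 0.9]]
vocab = {"ball", "game", "election", "vote"}
prior, cond = estimate_parameters(docs, soft_labels, k=2, vocab=vocab)
print(prior)
print(cond[0]["ball"], cond[1]["vote"])
```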
Naïve Bayes EM
Randomly assign the examples probabilistic category labels.
Use standard naïve-Bayes training to learn a probabilistic model with parameters θ from this labeled data.
Until convergence, or until a maximum number of iterations is reached:
   E-Step: Use the naïve Bayes model θ to compute P(ci | E) for each category and example, and re-label each example using these probability values as soft category labels.
   M-Step: Use standard naïve-Bayes training to re-estimate the parameters θ from these new probabilistic category labels.

Assessing Clustering Tendency
•  Assess whether non-random structure exists in the data by measuring the probability that the data were generated by a uniform distribution.
•  Test spatial randomness with a statistical test: the Hopkins statistic.
   –  Given a dataset D regarded as a sample of a random variable o, determine how far o is from being uniformly distributed in the data space.
   –  Sample n points, p1, …, pn, uniformly at random from the data space containing D. For each pi, find its nearest neighbor in D: xi = min{ dist(pi, v) : v ∈ D }.
   –  Sample n points, q1, …, qn, uniformly from D. For each qi, find its nearest neighbor in D − {qi}: yi = min{ dist(qi, v) : v ∈ D, v ≠ qi }.
   –  Calculate the Hopkins statistic:  H = ∑ yi / (∑ xi + ∑ yi)
   –  If D is uniformly distributed, ∑ xi and ∑ yi will be close to each other and H will be close to 0.5. If D is highly skewed (clustered), H will be close to 0.
   –  (A code sketch of this statistic appears after the silhouette slide below.)

Measuring Clustering Quality
•  Two kinds of methods: extrinsic vs. intrinsic.
•  Extrinsic: supervised, i.e., the ground truth is available.
   –  Compare a clustering against the ground truth using a clustering quality measure.
•  Intrinsic: unsupervised, i.e., the ground truth is unavailable.
   –  Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are.
   –  Example: the silhouette coefficient.

Measuring Clustering Quality: Extrinsic Methods
•  Clustering quality measure: Q(C, Cg), for a clustering C given the ground truth Cg.
•  Q is good if it satisfies the following four essential criteria:
   –  Cluster homogeneity: the purer, the better.
   –  Cluster completeness: objects belonging to the same category in the ground truth should be assigned to the same cluster.
   –  Rag bag: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., a "miscellaneous" or "other" category).
   –  Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces.

Silhouette Coefficient
•  Considers both the intra- and inter-cluster distances.
•  For a point xi, the average of the distances to all other points in the same cluster is computed. This value is ai.
•  Then, for each cluster that does not contain xi, the average distance of xi to all the data points in that cluster is computed; bi is the minimum of these averages (the average distance to the nearest other cluster).
•  Using ai and bi, the silhouette coefficient of the point is computed:
      s(xi) = (bi − ai) / max(ai, bi)
•  The average of all the silhouettes in the dataset is called the average silhouette width.
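The following NumPy sketches implement the two evaluation tools discussed above: the Hopkins statistic, under the H = ∑ yi / (∑ xi + ∑ yi) convention used on the slide, and the silhouette coefficient with its average silhouette width. The function names, the bounding-box sampling, and the sample size are illustrative assumptions; in practice a library routine such as sklearn.metrics.silhouette_score can compute the silhouette directly.

```python
import numpy as np

def hopkins_statistic(D, n_samples=50, seed=0):
    """H = sum(y) / (sum(x) + sum(y)); about 0.5 for uniform data, near 0 for clustered data."""
    rng = np.random.default_rng(seed)
    n, m = D.shape
    # p1..pn: uniform random locations inside the bounding box of the data space.
    p = rng.uniform(D.min(axis=0), D.max(axis=0), size=(n_samples, m))
    x = np.linalg.norm(p[:, None, :] - D[None, :, :], axis=-1).min(axis=1)
    # q1..qn: actual data points; nearest neighbor in D excluding the point itself.
    q_idx = rng.choice(n, size=n_samples, replace=False)
    d_qq = np.linalg.norm(D[q_idx][:, None, :] - D[None, :, :], axis=-1)
    d_qq[np.arange(n_samples), q_idx] = np.inf        # exclude qi itself
    y = d_qq.min(axis=1)
    return y.sum() / (x.sum() + y.sum())

def silhouette(X, labels):
    """Average silhouette width: mean over points of (b - a) / max(a, b)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i, li in enumerate(labels):
        same = (labels == li)
        a = dist[i, same & (np.arange(len(X)) != i)].mean()   # avg intra-cluster distance
        b = min(dist[i, labels == lj].mean() for lj in set(labels) if lj != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Example: two tight, well-separated blobs give H near 0 and a silhouette near 1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(6, 0.3, (40, 2))])
labels = np.array([0] * 40 + [1] * 40)
print(hopkins_statistic(X), silhouette(X, labels))
```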
Silhouette Coefficient
•  To evaluate the quality of a clustering, one can compute the average silhouette coefficient over all points.

Conclusions
•  Unsupervised learning induces categories from unlabeled data.
•  Agglomerative vs. divisive; hard vs. soft clustering.
•  There are a variety of approaches, including:
   –  HAC
   –  k-means
   –  EM