PA196: Pattern Recognition
10. Clustering
Dr. Vlad Popovici
popovici@iba.muni.cz
Institute of Biostatistics and Analyses, Masaryk University, Brno

Introduction

- clustering: a collection of methods for grouping data based on some measure of similarity
- a technique for data exploration
- tries to identify groupings of the data: a partition of the given data set into subsets that are (more or less) homogeneous in some sense
- different methods may lead to completely different partitions
- can be used as a starting point in a supervised analysis: the cluster labels may guide classifier design

Why clustering? Examples:

- Eisen et al., PNAS 1998: "...statistical algorithms to arrange genes according to similarity in pattern of gene expression."
- Adamic, Adar, 2005: the social network of e-mail communication within a corporation
- Mignotte, PRL 2011: image segmentation
- Li et al., IEEE Trans Inf Theory, 2004

Two classes of algorithms

- flat clustering: the data space is partitioned into a number of subsets (the most "meaningful" ones); requires the number of partitions to be specified
- hierarchical clustering: builds a nested hierarchy of partitions

Outline

1 Flat clustering: mixture densities, k-means clustering, spectral clustering
2 Hierarchical clustering: linkage algorithms
3 Further topics: other methods, cluster quality, cluster validation

Mixture densities

Assume a generative model:
- the observed data $D = \{x_i \mid i = 1, \dots, n\}$ originates from $G$ groups
- each group has a prior $P(g_k)$, $k = 1, \dots, G$
- the class-conditional probabilities have a known parametric form $p(x \mid g_k, \theta_k)$
- the labels of the points $x$ are unknown, as are the parameters $\theta_k$

The resulting PDF is
$$p(x \mid \theta_1, \dots, \theta_G) = \sum_{i=1}^{G} p(x \mid g_i, \theta_i)\, P(g_i)$$
- this is a mixture density
- the $p(\cdot)$ are the mixture components and the $P(\cdot)$ are the mixing parameters
- the goal is to estimate $\theta_1, \dots, \theta_G$
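To make the mixture notation concrete, here is a minimal sketch (assuming Python with NumPy and SciPy, which the slides do not prescribe) that evaluates and samples the mixture density above; the priors and component parameters are made-up values, not estimates from any data.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-component mixture in R^2: priors P(g_k) and
# component parameters theta_k = (mean_k, cov_k) are made-up values.
priors = [0.3, 0.7]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]

def mixture_pdf(x):
    """Evaluate p(x) = sum_k P(g_k) * p(x | g_k, theta_k)."""
    return sum(p * multivariate_normal(m, c).pdf(x)
               for p, m, c in zip(priors, means, covs))

# Sampling: pick a component according to the priors, then sample from it.
rng = np.random.default_rng(0)
k = rng.choice(len(priors), p=priors)
x = rng.multivariate_normal(means[k], covs[k])
print(mixture_pdf(x))
```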
Maximum likelihood estimates

Let $T = \{\theta_1, \dots, \theta_G\}$. The likelihood of the observed data is
$$p(D \mid T) = \prod_{i=1}^{n} p(x_i \mid T)$$
and the maximum likelihood estimate $\hat{T}$ is
$$\hat{T} = \arg\max_{T} p(D \mid T)$$

The maximum likelihood estimate is usually obtained through an iterative process: expectation-maximization (EM). Repeat until convergence:
1 Expectation step (E-step): compute the expected log-likelihood under the current estimates $\hat{T}^{(t)}$
2 Maximization step (M-step): find $\hat{T}^{(t+1)}$ that maximizes this expectation

Gaussian mixture models

- depending on how constrained the model is, different forms can be fitted
- usually, the constraints are on the form of the covariance matrices: spherical, diagonal, full
- Example (from scikit-learn): (figure)

K-means clustering

- given: $n$ data points $x_1, \dots, x_n \in \mathbb{R}^d$ and the number $K$ of clusters
- in the Gaussian mixture model, assume the covariance matrices are identity matrices, so the Mahalanobis distance becomes the Euclidean distance
- goal: find the $K$ cluster centres (based on the Euclidean distance)

Overview of the algorithm:
1 choose $K$ cluster centres at random
2 assign every point to the cluster defined by the closest centre
3 re-compute the cluster centres
4 repeat steps 2-3 until nothing changes (or some other convergence criterion is met)

[DHS, Fig. 10.3]
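A minimal sketch of the k-means loop above, assuming NumPy; it is illustrative only (in practice one would rather use an existing implementation such as scikit-learn's KMeans).

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain k-means: X is an (n, d) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=K, replace=False)]   # step 1: random centres
    for _ in range(n_iter):
        # step 2: assign each point to the closest centre (Euclidean distance)
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # step 3: re-compute each centre as the mean of its assigned points
        # (keep the old centre if a cluster happens to be empty)
        new_centres = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centres[k] for k in range(K)])
        if np.allclose(new_centres, centres):                # step 4: stop on convergence
            break
        centres = new_centres
    return labels, centres
```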
Spectral clustering

Some definitions

Graph: a (simple, unoriented) graph $G$ is a pair of sets $G = (V, E)$, where $V$ is a vertex set and $E$ is an edge set. An edge $(i, j) \in E$ is an unordered pair of vertices $i, j \in V$.

Oriented graph: an oriented graph $G$ is a pair of sets $G = (V, E)$, where $V$ is a vertex set and $E$ is an arc set. An arc $(i, j) \in E$ is an ordered pair of vertices $i, j \in V$, with $i$ called the tail and $j$ the head.

Figure: a simple graph. Figure: an oriented graph.

- A path of length $r$ from $i$ to $j$ is a sequence of $r + 1$ distinct vertices, each adjacent to the next, starting with $i$ and ending with $j$.
- A connected graph is a graph in which any two vertices can be linked by a path.
- A connected component is a maximal induced subgraph that is connected.
- The degree $d_i$ of a vertex $i$ is the number of edges incident to $i$.
- The in-degree $d_i^{(i)}$ of a vertex $i$ is the number of arcs ending in $i$; the out-degree $d_i^{(o)}$ is the number of arcs starting in $i$.

Figure: a graph with 2 connected components.

Graph isomorphism: two graphs $G_1$ and $G_2$ are isomorphic if there exists a bijection $\varphi : V_{G_1} \to V_{G_2}$ such that $(i, j) \in E_{G_1}$ iff $(\varphi(i), \varphi(j)) \in E_{G_2}$.

Adjacency and incidence matrices:
- the adjacency matrix $A(G) = [a_{ij}]$ of a graph $G$ is a $|V| \times |V|$ 0-1 matrix with $a_{ij} = I_{(i,j) \in E}$
- the incidence matrix $B(G) = [b_{ij}]$ of a graph $G$ is a $|V| \times |E|$ matrix with $b_{ik} = -1$ and $b_{jk} = 1$ for each arc $k = (i, j) \in E$

Properties:
- for a graph $G$: $BB^t = \mathrm{diag}(d_1, \dots, d_{|V|}) - A$
- if $G_1$ and $G_2$ are isomorphic, then $\det(xI - A(G_1)) = \det(xI - A(G_2))$, so they have the same spectrum
- the number of paths of length $r$ from $i$ to $j$ is given by $(A^r)_{ij}$

Laplacian of a graph

The Laplacian of the graph $G$ is the matrix $L(G) = BB^t$, where $B$ is the incidence matrix of $G$. Properties:
- $L_{ij} = d_i$ if $i = j$; $L_{ij} = -1$ if $i \neq j$ and $(i, j) \in E$; $L_{ij} = 0$ otherwise
- $L$ is symmetric (by construction) and positive semi-definite
- $L \mathbf{1}_{|V|} = 0$, so the smallest eigenvalue is $\lambda_1 = 0$ and the corresponding eigenvector is $\mathbf{1}_{|V|}$
- the number of connected components of $G$ equals the number of null eigenvalues

Warning: there are various other variants of the definition of $L$ (e.g. admittance matrix, Kirchhoff matrix).

Generalizations:
- $A$ is replaced by a weight/similarity matrix $W$ (still symmetric)
- the degree of a vertex $i$ becomes $d_i = \sum_{j=1}^{|V|} w_{ij}$
- the Laplacian becomes $L = D - W$, where $D = \mathrm{diag}(d_1, \dots, d_{|V|})$ (its properties carry over)
- the normalized Laplacian is defined as $D^{-1/2} (D - W) D^{-1/2}$
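A small numerical check of the Laplacian properties listed above, assuming NumPy; the graph (two disjoint triangles, i.e. two connected components) is a made-up example.

```python
import numpy as np

# Adjacency matrix of a made-up graph with two connected components
# (two triangles on vertices {0,1,2} and {3,4,5}).
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1

D = np.diag(A.sum(axis=1))       # degree matrix
L = D - A                        # (unnormalized) graph Laplacian

eigvals = np.linalg.eigvalsh(L)  # L is symmetric, so eigvalsh applies
print(np.round(eigvals, 6))      # two (numerically) zero eigenvalues
                                 # -> two connected components
print(np.allclose(L @ np.ones(6), 0))  # the constant vector is in the null space
```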
Overview of spectral clustering

Spectral clustering: partitioning the similarity graph under some constraints.

Figure: where to cut?
- balance the size of the clusters?
- minimize the number of edges removed?
- ...

Let $A, B$ be the two clusters/subgraphs and let $s(A, B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$. Objective functions:
- ratio cut: $J(A, B) = \frac{s(A,B)}{|A|} + \frac{s(A,B)}{|B|}$, i.e. minimize the similarity between $A$ and $B$
- normalized cut: $J(A, B) = \frac{s(A,B)}{\sum_{i \in A} d_i} + \frac{s(A,B)}{\sum_{i \in B} d_i}$
- min-max cut: $J(A, B) = \frac{s(A,B)}{s(A,A)} + \frac{s(A,B)}{s(B,B)}$, i.e. minimize the similarity between $A$ and $B$ while maximizing the similarity within $A$ and within $B$

All of these lead to finding the eigenvectors corresponding to the smallest eigenvalues of $L = D - W$.

Figure: adjacency matrix and the 2nd eigenvector.

Spectral clustering - main algorithm

Input: a similarity matrix $S$ and the number of clusters $k$.
1 compute the Laplacian $L$
2 compute the eigenvectors $v_i$ of $L$ and order them by increasing eigenvalue
3 build $V \in \mathbb{R}^{n \times k}$ with the first $k$ eigenvectors as columns
4 the rows of $V$ are the new data points $z_i \in \mathbb{R}^k$ to be clustered
5 use k-means to cluster the $z_i$

Hierarchical clustering

A nested set of partitions and the corresponding dendrogram (figure):
- the height of the descending segments is proportional to the distance/similarity between the merged partitions
- the ordering is irrelevant (i.e. you can swap left and right sub-trees without altering the meaning)
- it can be seen as a density estimation method

There are two main approaches to constructing a hierarchical clustering:
- agglomerative: initially, each point is alone in its own cluster; the two closest clusters/groups are merged to form a new cluster
- divisive: starts with all points in a single cluster, which is then iteratively split

Linkage algorithms

- bottom-up/agglomerative strategy: start with each point in its own cluster, merge the two closest clusters, and continue until only one cluster is left
- Johnson: Hierarchical clustering schemes. Psychometrika, 2:241-254, 1967
- many applications, including phylogenetic trees
- the key is to define a distance over the space of clusters

- single linkage: $\delta(C_1, C_2) = \min_{x \in C_1, z \in C_2} d(x, z)$
- average linkage: $\delta(C_1, C_2) = \frac{1}{|C_1||C_2|} \sum_{x \in C_1, z \in C_2} d(x, z)$
- complete linkage: $\delta(C_1, C_2) = \max_{x \in C_1, z \in C_2} d(x, z)$
- other variations...
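A short sketch of agglomerative clustering with the three linkage rules above, using SciPy's scipy.cluster.hierarchy (an assumption; the slides do not name a library); the data are synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Synthetic data: two made-up Gaussian blobs in R^2.
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Agglomerative clustering with different cluster-to-cluster distances.
Z_single = linkage(X, method="single")      # single linkage (min pairwise distance)
Z_complete = linkage(X, method="complete")  # complete linkage (max pairwise distance)
Z_average = linkage(X, method="average")    # average linkage (mean pairwise distance)

# Cut the dendrogram to obtain a flat partition into 2 clusters;
# scipy.cluster.hierarchy.dendrogram(Z_complete) would draw the tree itself.
labels = fcluster(Z_complete, t=2, criterion="maxclust")
print(labels)
```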
Comments:
- single linkage tends to produce unbalanced trees, with long "chains"
- complete linkage leads to more compact clusters
- single linkage is related to the minimum spanning tree: cut the longest branch of the MST to obtain the first two clusters
- sensitive to outliers
- does not need a pre-specified number of clusters, but you often have to cut the dendrogram and decide how many clusters are in the data
- the clustering tree is not unique

Further topics

Other methods: information-theoretic methods
- idea: construct a clustering that preserves most of the "information" in the data
- hence, you need a cost (distortion) function
- density estimates are based on frequencies (counts)
- no notion of similarity is needed
- example: Information Bottleneck

Other methods: iterative refinement
- Karypis, Kumar: multilevel partitioning of graphs, SIAM J Sci Comp 1999
- start with a large graph
- merge nodes into "super-nodes" (coarsen the graph)
- cluster the coarse graph
- uncoarsen again: refine the clustering

Other methods: ensemble methods
- produce a series of clusterings based on perturbed versions of the original data (resampling, different parameters, etc.)
- use the ensemble of clusterings to decide on the final clustering
- see Strehl, Ghosh: cluster ensembles, JMLR 2002

Other methods
- support vector clustering
- subspace clustering
- co-clustering (bi-clustering): obtain a simultaneous clustering of the rows and the columns of the data matrix
- etc.

Cluster quality
- some clustering methods (e.g. k-means) define an objective function as the goal of the clustering
- there are also (many!) quality functions that are algorithm independent
- two quantities are usually accounted for: the within-cluster similarity (cluster homogeneity), which should be high, and the between-cluster similarity, which should be low
- while mathematically attractive, such criteria usually lead to NP-hard optimization problems
- the choice of quality function is rather ad-hoc
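As one concrete example of an algorithm-independent quality function (the silhouette coefficient, not named on the slides, which contrasts exactly these within-cluster and between-cluster distances), a small sketch with scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two made-up, well-separated blobs.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

# Score the partitions produced for several values of k; a higher silhouette
# means tighter and better separated clusters.
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
```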
Cluster stability and optimal k

General idea: start with a data set $D = \{x_i\}$ and some clustering algorithm $\mathcal{A}$. For various numbers of clusters (e.g. $k = 2, 3, \dots, K$):
- draw subsamples from $D$
- use $\mathcal{A}$ to cluster them into $k$ clusters
- compare the resulting clusterings using a distance between clusterings and compute a stability index describing how variable/stable the clusterings are
- choose the $k$ that gives the best stability (a small sketch of this procedure is given at the end of the section)

Distances between clusterings of the same data: let $f^{(1)}$ and $f^{(2)}$ be two clusterings (with $f(x, z) = 1$ if $x$ and $z$ fall in the same cluster and $0$ otherwise) and define $N_{ij}$ as the number of pairs $(x, z)$ for which $f^{(1)}(x, z) = i$ and $f^{(2)}(x, z) = j$, for $i, j \in \{0, 1\}$.

Similarity/distance functions (some of many):
- Rand index: $(N_{00} + N_{11}) / [n(n - 1)]$
- Jaccard index: $N_{11} / (N_{11} + N_{01} + N_{10})$
- Hamming distance: $(N_{01} + N_{10}) / [n(n - 1)]$
- an information-theoretic distance: $\mathrm{Entropy}(f^{(1)}) + \mathrm{Entropy}(f^{(2)}) - 2\,\mathrm{MutualInformation}(f^{(1)}, f^{(2)})$

What if the clusterings are not defined on the same data?
- far from trivial
- idea: "extend" clustering $f^{(1)}$ to $D_2$ and $f^{(2)}$ to $D_1$
- some clustering algorithms are easily extended: k-means just requires assigning the new data to the existing clusters, etc.
- other clustering algorithms are not so flexible (e.g. spectral clustering)
- one could use a classifier for the extension... but what about the classification errors?
- what if the two data sets are not exactly "aligned"?

How many clusters?
- most flat clustering algorithms require $k$ as an input
- hierarchical clustering is usually cut at some height to yield the final number of clusters
- e.g. in image segmentation, what should $k$ be?
- one way: use a cluster quality function to choose the best $k$
- or use the stability approach
- or use the "gap statistic" - see Tibshirani et al., J Royal Statist Soc, 2001

Cluster validation
- given a clustering, how confident are we that it represents "real" groups of points?
- if a new data set were available, would we obtain the same clusters? and as many as before?
- complication: in real applications, the data may not always come from the same conditions (e.g. a change of measurement device, etc.)
- maybe the original data set does not capture all the clusters that could arise in the data
- there is no clear method for validation, but the ideas are the same as for cluster stability and quality
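A rough sketch of the stability procedure described above, with k-means as the clustering algorithm and scikit-learn's adjusted Rand index (a chance-corrected variant of the Rand index) as the agreement measure; the data, the subsample fraction, and the choice to compare labels on the intersection of the two subsamples are all illustrative assumptions, not part of the slides.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Made-up data with three blobs.
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(5, 1, (60, 2)),
               rng.normal((0, 6), 1, (60, 2))])
n = len(X)

def stability(k, n_pairs=20, frac=0.8):
    """Average agreement (adjusted Rand index) between clusterings of
    overlapping subsamples; a higher value suggests a more stable k."""
    scores = []
    for _ in range(n_pairs):
        s1 = rng.choice(n, size=int(frac * n), replace=False)
        s2 = rng.choice(n, size=int(frac * n), replace=False)
        common = np.intersect1d(s1, s2)
        # Cluster each subsample, then compare the labels on the common points.
        l1 = KMeans(n_clusters=k, n_init=10).fit(X[s1]).predict(X[common])
        l2 = KMeans(n_clusters=k, n_init=10).fit(X[s2]).predict(X[common])
        scores.append(adjusted_rand_score(l1, l2))
    return np.mean(scores)

for k in range(2, 6):
    print(k, round(stability(k), 3))
```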