Week 8: Clustering and dimension reduction
Introduction to Bioinformatics (LF:DSIB01)

Clustering (cluster analysis)
• Very broad definition – grouping of data objects
• Unsupervised classification
• Used in data mining
• What data do you need for clustering?
‒ Data points with features on which you can measure a distance
• Some methods require the triangle inequality to hold
‒ A distance matrix
‒ Important for sequence comparison!

Clustering types
• Centroid models
‒ K-means clustering
‒ Mean-shift clustering
• Distribution models
‒ Expectation–maximization (EM) clustering
• Density models
‒ DBSCAN
• Connectivity models
‒ Hierarchical clustering
• Graph-based models
‒ Clique, spanning tree, Markov Cluster Algorithm (MCL)
• …

Clustering
• Hard vs. fuzzy clustering
‒ Fuzzy = each object belongs to each cluster to some degree (a probability)
‒ Overlapping clustering ≈ discrete fuzzy clustering
‒ Gives a more detailed analysis, but is hard to interpret – good as an intermediate step, not as a final one

K-means clustering
• The most well-known clustering algorithm
1. Select the number of clusters k and randomly initialize the centers
2. Create k clusters by associating every observation with the nearest center
3. The centroid of each of the k clusters becomes the new center
4. Repeat steps 2 and 3 until a stopping rule is met

K-means clustering
[Figure: k-means iterations]

K-means clustering
• Advantages
‒ Simple to implement
‒ Very fast: linear complexity O(n)
• Disadvantages
‒ Must select the number of clusters
‒ Results depend on the random starting points
‒ Identifies only 'spherical' clusters

K-means clustering pseudocode
[Figure: pseudocode]

Mean-shift clustering
• Steps
1. Place a circular sliding window centered at a randomly selected point C
2. In every iteration, shift the window towards regions of higher density by moving the center point to the mean of the points within the window
3. Continue until there is no direction in which a shift would cover more points (convergence)
4. Steps 1–3 are repeated
with many sliding windows until all points lie within a window

Mean-shift clustering
[Figures: sliding-window iterations]

Mean-shift clustering
• Advantages
‒ No need to select the number of clusters
‒ Fast computation
• Disadvantages
‒ Must select the kernel (window) radius r
‒ Identifies only 'spherical' clusters

Expectation–maximization (EM) clustering using Gaussian mixture models (GMM)
• Similar to k-means clustering
‒ Each cluster is defined not only by its mean but also by its variance
‒ A Gaussian (normal) curve in each dimension

Expectation–maximization (EM) clustering using Gaussian mixture models (GMM)
• K-means
1. Select the number of clusters and randomly initialize the centers
2. Create k clusters by associating every observation with the nearest center
3. The centroid of each of the k clusters becomes the new center
4. Repeat steps 2 and 3 until a stopping rule is met
• EM with GMM
1. Select the number of clusters and randomly initialize the Gaussian distribution parameters
2. Using the Gaussian distribution of each cluster, compute the probability that each data point belongs to that cluster
3. Compute a new set of Gaussian parameters that maximizes the probabilities of the data points within the clusters (the new parameters are weighted sums over the individual data points, weighted by these probabilities)
4. Repeat steps 2 and 3 until convergence

Expectation–maximization (EM) clustering using Gaussian mixture models (GMM)
[Figure: EM fitting a mixture of Gaussians]

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
1. DBSCAN begins with an arbitrary starting data point that has not been visited. The neighborhood of this point is extracted using a distance epsilon ε (all points within the ε distance are neighborhood points).
2. If there are a sufficient number of points (according to minPoints) within this neighborhood, the clustering process starts and the current data point becomes the first point of a new cluster. Otherwise, the point is labeled as noise (this noisy point might later become part of a cluster). In both cases the point is marked as "visited".
3. For this first point of the new cluster, the points within its ε-distance neighborhood also become part of the same cluster. This procedure of making all points in the ε neighborhood belong to the same cluster is then repeated for all of the points that have just been added to the cluster.
4. Steps 2 and 3 are repeated until all points in the cluster are determined, i.e. all points within the ε neighborhood of the cluster have been visited and labeled.
5. Once we are done with the current cluster, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or of noise. This repeats until all points have been visited, so each point ends up marked either as belonging to a cluster or as noise.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
[Figure: DBSCAN clusters and noise points]

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Advantages
‒ No need to select the number of clusters
‒ Identifies outliers as noise
‒ Works on a distance matrix
• Disadvantages
‒ Does not perform as well as the others when the clusters are of varying density
‒ The same drawback occurs with very high-dimensional data

Hierarchical clustering
• Top-down or bottom-up
‒ Divisive (top-down) or agglomerative (bottom-up)

Agglomerative hierarchical clustering
• Steps
1. Select the smallest distance in the distance matrix (this selects two clusters; initially every data point is its own cluster)
2. Combine the selected two clusters into one new cluster. Recompute the distances to all other clusters.
3. Repeat steps 1 and 2 until we end up with one cluster or decide to stop.

Agglomerative hierarchical clustering
[Figure: dendrogram construction]

Agglomerative hierarchical clustering
• Different agglomeration (linkage) functions:
‒ Max = complete linkage
‒ Min = single linkage
‒ Mean = average linkage
‒ Ward linkage
• Ward's minimum-variance criterion minimizes the total within-cluster variance

Agglomerative hierarchical clustering
• Advantages
‒ Gives results for all possible numbers of clusters at once
‒ Uncovers the underlying hierarchy
‒ Works on a distance matrix
‒ The agglomeration method can modulate the results
• Disadvantages
‒ High computational complexity: O(n³)
‒ Still only simple cluster shapes

Clustering overview
[Figure: comparison of clustering algorithms]

Dimension reduction
• Why?

Dimension reduction
• Feature selection
• Feature projection

Feature selection
• Subset selection methods
‒ Exhaustive search
‒ Greedy forward selection
‒ Genetic algorithms
• Evaluation metric types
‒ Wrapper methods: build the full model and test it
‒ Filter methods: mutual information, Pearson product-moment correlation coefficient
‒ Embedded methods

Feature projection
• Principal component analysis (PCA)
• Non-negative matrix factorization (NMF)
• Independent component analysis (ICA)
• t-SNE

PCA
• Highlights variance
• Mainly used to visualize data better

PCA
[Figures: PCA examples]

PCA
• Warnings:
‒ Check the variance explained
‒ Check the normality (and scale) of your data points: features with relatively large values can 'capture' the whole PCA

www.ceitec.eu | CEITEC | @CEITEC_Brno
Thank you for your attention! 60 minutes lunch break.
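As a hands-on supplement to the k-means slides above, here is a minimal sketch of the four listed steps in Python (pure NumPy; the function name `kmeans` and its defaults are my own, not from the lecture):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means following the four steps on the slide."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every observation to the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: the centroid of each cluster becomes the new center
        # (an empty cluster keeps its old center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4 (stopping rule): stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Note the disadvantage from the slide: a different `seed` (random start point) can give a different clustering.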
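The DBSCAN behavior described above (clusters of dense points, isolated points labeled as noise) can be demonstrated in a few lines, assuming scikit-learn is available; the parameter values here are illustrative choices, not from the lecture:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.2, (30, 2)),   # dense blob -> one cluster
    rng.normal(5, 0.2, (30, 2)),   # second dense blob -> another cluster
    [[2.5, 2.5]],                  # isolated point -> noise
])

# eps is the ε neighborhood radius, min_samples is minPoints from the slides
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
# labels holds a cluster index per point; -1 marks noise
```

Unlike k-means, the number of clusters is discovered from the data, and the lone point between the blobs is reported as noise rather than forced into a cluster.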
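The agglomerative procedure above (merge the closest pair, recompute distances, repeat) does not need to be hand-written: assuming SciPy is available, it can be run directly on a distance matrix, which is the property the slides highlight for sequence comparison. The toy points are my own:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# four 2-D points: two near the origin, two near (5, 5)
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

D = pdist(X)                      # condensed pairwise distance matrix
Z = linkage(D, method='average')  # average linkage; 'single'/'complete' also work
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram into 2 clusters
```

Because `Z` encodes the full merge history, the same `Z` can be cut at any level, which is the "all numbers of clusters at once" advantage from the slides.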
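Finally, the PCA warnings above (check the variance explained; large-valued features can dominate) can be made concrete with a from-scratch sketch via eigendecomposition of the covariance matrix (pure NumPy; the function name `pca` is my own):

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # sort descending by variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()      # fraction of variance per component
    scores = Xc @ eigvecs[:, :n_components]  # project onto the top components
    return scores, explained
```

If one feature has a much larger scale than the others, `explained[0]` will be close to 1 and PC1 will essentially reproduce that single feature, so standardizing features first is usually advisable.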