Cluster analysis
Petr Ocelík
MVPd002 Quantitative Research in International and European Politics

Plan for today
• Intuition
• Cluster analysis step-by-step
• Exercise

Cluster analysis
• Data reduction technique
• Cluster: a grouping of similar objects
• Basic idea: identifying groups of mutually similar objects based on particular variable(s)
• Unsupervised technique
[Figures: example clusterings (wikimedia commons; Ansari & Oskrochi 2006; Doise et al. 1999; Radanliev et al. 2020; Pipis 2020)]

Cluster analysis: process
1. Sampling and data collection
2. Similarity measures
3. Clustering methods
4. Cluster solution interpretation
5. Cluster solution diagnostics

1. Sampling and data collection
• What is the target population?
• What is the level of analysis?
• What is the unit of observation?
• What set of variables are we interested in?
• Practical considerations

2. Similarity measures
• The (dis)similarity of objects is quantified using similarity measures
• The choice of a similarity measure needs to consider:
1. level of measurement: categorical vs continuous
2. data dimensionality: number of variables (vars)
3. scale sensitivity: small vs large data, variable scales
• We distinguish between association-based and distance-based similarity measures (the distinction is not exhaustive)

2. Similarity measures: representations
• The degree of (dis)similarity can be captured numerically or graphically, for example by:
• Dendrogram [Figure: Pai 2021]
• Heatmap [Figure: UC Business 2024]
• Cluster profile

2.1 Pearson's coefficient
• Pearson's r measures the existence, strength, and direction of the linear relationship between two variables
• r = the covariance of the two variables divided by the product of their standard deviations
• Not suitable for heterogeneous data and/or nonlinear relationships between variables
(Soukup et al. 2022; try http://guessthecorrelation.com/)

measurement level       number of values   range     coefficient
continuous-continuous   many-many          <-1, 1>   Pearson's r

2.2 Euclidean distance
• Do you recall the Pythagorean theorem? [Figure: wikimedia commons]
• Euclidean distance (d) is a generalization of the Pythagorean theorem
• The distance d(p, q) of two objects (p, q) equals the length of the straight line between them
• d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²)
[Figures: wikimedia commons; Koschade 2007]

2.3 Jaccard's coefficient
• J = the size of the intersection (A ∩ B) divided by the size of the union (A ∪ B) of the two samples
• J = |A ∩ B| / |A ∪ B|
• Does not account for observations missing in both samples (∉ A ∪ B)

                    sample B: present   sample B: absent
sample A: present   A ∩ B               A – B
sample A: absent    B – A               ∉ A ∪ B

[Figure: wikimedia commons]

measurement level         number of values   range    coefficient
categorical-categorical   binary             <0, 1>   Jaccard's

2.4 Cosine distance
(Karabiber 2024)

term             doc_A   doc_B
geopolitics      4       0
climate change   0       1
Brexit           0       12
Euro             3       9
sovereignty      5       2

• CS = (4·0 + 0·1 + 0·12 + 3·9 + 5·2) / (√(16 + 0 + 0 + 9 + 25) · √(0 + 1 + 144 + 81 + 4)) = 37 / (√50 · √230) ≈ 0.345
• CD = 1 − CS ≈ 1 − 0.345 = 0.655

2. Similarity measures: summary
• Cosine and Euclidean distances are appropriate for higher-dimensional data
• Cosine distance is typically used for text-based data
• Euclidean distance and Jaccard's coefficient are used for network data
• Pearson's coefficient is appropriate for continuous data and linear relationships
• There are many more measures of similarity
• (a short code sketch of these measures follows below)
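To make the four measures concrete, here is a minimal Python sketch (the course itself uses Jamovi; NumPy and the variable names doc_a and doc_b are illustrative assumptions). It reproduces the doc_A/doc_B cosine example from slide 2.4 and applies the other three measures to the same vectors:

```python
# A minimal sketch, assuming NumPy; doc_a/doc_b hold the term counts
# from the cosine example (geopolitics, climate change, Brexit, Euro,
# sovereignty)
import numpy as np

doc_a = np.array([4, 0, 0, 3, 5])
doc_b = np.array([0, 1, 12, 9, 2])

# Pearson's r: covariance divided by the product of standard deviations
r = np.corrcoef(doc_a, doc_b)[0, 1]

# Euclidean distance: length of the straight line between the two points
d = np.linalg.norm(doc_a - doc_b)

# Jaccard's coefficient needs binary data, so the counts are binarized
# here (term present vs absent): |A ∩ B| / |A ∪ B|
a_bin, b_bin = doc_a > 0, doc_b > 0
jaccard = np.logical_and(a_bin, b_bin).sum() / np.logical_or(a_bin, b_bin).sum()

# Cosine similarity (CS) and cosine distance (CD = 1 - CS)
cos_sim = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
cos_dist = 1 - cos_sim

print(f"Pearson's r:       {r:.3f}")
print(f"Euclidean d:       {d:.3f}")
print(f"Jaccard's J:       {jaccard:.3f}")
print(f"Cosine similarity: {cos_sim:.3f}")  # ~0.345, as on the slide
print(f"Cosine distance:   {cos_dist:.3f}")  # ~0.655
```

Note that applying Pearson's r or Jaccard's coefficient to raw term counts is for illustration only; per the tables above, each measure presupposes a particular level of measurement.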
Cluster analysis: process
1. Sampling
2. Similarity measure
3. Clustering method
4. Cluster solution interpretation
5. Cluster solution diagnostics

3. Clustering methods
• After the (dis)similarities between objects are calculated, we need to select a particular clustering method that partitions (clusters) the data according to specific rules
• There are several clustering approaches; k-means clustering and hierarchical clustering are among the most common

3.1 k-means clustering
• k-means clustering partitions the data into k clusters based on the mean distances within each cluster (see the code sketches after this section):
1. The number of clusters k needs to be pre-selected
2. The algorithm starts by randomly assigning cluster centers (centroids)
3. Each object is assigned to the nearest centroid based on a particular similarity measure
4. Within each cluster, the centroid is updated to the mean of the objects assigned to that cluster
5. Steps 3-4 are repeated until the centroids no longer change → solution
[Figures: 2-cluster solution (Jordan 2012); Determining k (James et al. 2013)]

3.2 Hierarchical clustering
• Hierarchical clustering can be performed in an agglomerative or a divisive sequence
• The number of clusters is not pre-selected
• The linkage method (how clusters are merged/split) needs to be defined:
1. Treat each object as a separate cluster (agglomerative sequence)
2. The average distances between the clusters are calculated (average linkage)
3. The clusters with the lowest average distance are merged
4. Steps 2-3 are repeated until only a single cluster remains
5. The process is represented by a dendrogram
6. Considering substantive insights, the k-cluster solution is identified
[Figures: NYU 2011]
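The course demonstrates these methods in Jamovi; purely as a supplementary illustration, a minimal k-means sketch with scikit-learn (the artificial two-variable data, k = 2, and the random seed are all assumptions, not course materials):

```python
# A minimal k-means sketch, assuming scikit-learn and artificial data
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two artificial groups of 50 points around different centers
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[3, 3], scale=0.5, size=(50, 2)),
])

# n_clusters is the pre-selected k (step 1); n_init repeats the random
# centroid initialization (step 2) several times and keeps the best run
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(data)

print(km.cluster_centers_)  # final centroids after steps 3-4 converge
print(km.labels_[:10])      # cluster assignments of the first 10 objects
```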
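And a matching sketch of agglomerative hierarchical clustering with average linkage, using SciPy and reusing the toy data array from the k-means sketch above (cutting the dendrogram at k = 2 is again an illustrative choice):

```python
# A minimal agglomerative sketch, assuming SciPy and matplotlib and the
# `data` array defined in the k-means example above
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Euclidean distances combined with average linkage (steps 1-4)
merge_tree = linkage(data, method="average", metric="euclidean")

# The dendrogram visualizes the full merge sequence (step 5)
dendrogram(merge_tree)
plt.show()

# Cut the tree into a k-cluster solution (step 6); k = 2 here
labels = fcluster(merge_tree, t=2, criterion="maxclust")
print(labels[:10])
```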
Cluster analysis: process
1. Sampling
2. Similarity measure
3. Clustering method
4. Cluster solution interpretation
5. Cluster solution diagnostics

4. Cluster solution interpretation
• Do the clusters have prima facie validity (eyeballing test)?
• Substantive and theoretical insights are vital: to what extent does the solution align with our expectations?
• Size of the clusters: do the clusters markedly differ in their size?
• Outliers: are there any?
• Do the clusters reduce our data well? → cluster solution diagnostics

5. Cluster solution diagnostics
• It is important to assess the quality of our solution:
• Are the clusters internally cohesive?
• Are the clusters well separated?
• What is the optimal number of clusters?

5.1 Within/between sum of squares
• The within-cluster sum of squares (WCSS) is the sum of squared distances between each object and its cluster centroid (k-means clustering approach)
• The between-cluster sum of squares (BCSS) is the sum of squared distances between the centroid of each cluster and the overall centroid of all objects
• WCSS captures the cohesiveness of clusters, while BCSS measures the separation between clusters
• Elbow graph: plotting WCSS against k, the "elbow" where the curve flattens suggests a suitable number of clusters (see the sketch after the exercise)
[Figures: WCSS and BCSS (SSRI 2020); Elbow graph (Kulma 2017)]

5.2 Silhouette score
• The silhouette score ranges over <-1, 1>: high positive values indicate a good fit of the object within its own cluster, zero indicates a borderline position, and negative values indicate that the object is misclassified
• s = (b − a) / max(a, b), where a = average within-cluster distance and b = average distance to the nearest other cluster
• The silhouette is calculated for each object; the average is then taken to evaluate the cluster solution as a whole
• The silhouette score can be applied to both k-means and hierarchical clustering
• Rule of thumb: s > 0.5 is a good solution; s < 0.25 is a bad solution (see the sketch after the exercise)

Exercise
• Open Jamovi and install the snowCluster extension
• Load the file "state.csv" into Jamovi
• Check the variables relig_prot, urban, and relig_high in the codebook
• Describe the variables
• Apply (1) k-means clustering and (2) hierarchical clustering
• Interpret the results
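For those who want to double-check the diagnostics from section 5.1 outside Jamovi, a minimal sketch of the elbow graph, reusing the toy data array from the k-means sketch above (the range k = 1..8 is an illustrative choice; scikit-learn exposes WCSS as inertia_):

```python
# A minimal elbow-graph sketch, assuming scikit-learn, matplotlib, and
# the `data` array from the k-means example above
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(data)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this k

plt.plot(range(1, 9), wcss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS")
plt.show()  # look for the 'elbow' where the curve flattens
```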
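Similarly, a sketch of the average silhouette score from section 5.2 over the same range of k; scikit-learn's silhouette_score averages the per-object s = (b − a) / max(a, b):

```python
# A minimal silhouette sketch, assuming scikit-learn and the `data`
# array from the k-means example above
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 9):  # the silhouette requires at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(data)
    s = silhouette_score(data, labels)
    print(f"k = {k}: average silhouette = {s:.3f}")
# Rule of thumb from the slides: s > 0.5 good, s < 0.25 bad
```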