Unsupervised Learning

Clustering Algorithms
• Groups objects that are similar
• Typically organized by modeling approach
• Two classes:
  • Hard clustering
  • Fuzzy clustering
• Examples:
  • Connectivity-based clustering
  • K-means
  • Distribution-based clustering
  • Density-based clustering

Connectivity-Based Clustering
• Objects are more related to nearby objects than to those farther away
• Similarity measure: Euclidean distance or any other metric
Source: https://en.wikipedia.org/wiki/Cluster_analysis

Centroid-Based Clustering
• Each cluster is represented by a center vector
• The center does not necessarily have to be one of the data points
Source: https://en.wikipedia.org/wiki/Cluster_analysis

Distribution-Based Clustering
• Clusters are defined as objects belonging to the same distribution
• Convenient for artificial datasets, but suffers from overfitting in practice
Source: https://en.wikipedia.org/wiki/Cluster_analysis

Density-Based Clustering
• Clusters are defined as areas of higher density (requires density drops between clusters)
• Objects in sparse areas are considered noise
Source: https://en.wikipedia.org/wiki/Cluster_analysis

Intrinsic Dimensions
(figure: example data tables with variables X, Y, Z over samples 1, 2, 3)

Dimensionality Reduction Algorithms
• Reduce the number of random variables to a set of principal variables
• Find structure in the data to reduce dimensionality in an unsupervised way
• The lower-dimensional variables are often visualized for labeling and further supervised learning
• Examples:
  • Principal component analysis (PCA)
  • Linear discriminant analysis (LDA)
  • t-distributed stochastic neighbor embedding (t-SNE)
  • Uniform manifold approximation and projection (UMAP)

K-means

K-means Introduction
• Centroid-based clustering
• Assumes Euclidean space/distance
• Advantages
  • Suitable for large datasets
  • Can be applied to clusters that are not well separated
• Disadvantage
  • Requires selecting the number of clusters k

K-means Algorithm
• Input:
  • k (number of clusters)
  • Data set {x_1, x_2, …, x_m}
• Algorithm (a minimal sketch follows below):
  1. Randomly select k centroids
  2. Assign a cluster index to each point based on its distance to the centroids
  3. Update the centroid locations
  4. Repeat steps 2-3 until convergence (i.e., no change)
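The four steps above translate directly into a few lines of NumPy. The sketch below is illustrative only: the function name kmeans, the random initialization details, and the empty-cluster guard are my own choices, not part of the lecture material; in practice a library implementation such as sklearn.cluster.KMeans would typically be used.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch. X: array of shape (n_points, n_dims)."""
    rng = np.random.default_rng(seed)
    # 1. Randomly select k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update each centroid to the mean of its assigned points
        #    (keep the old centroid if a cluster happens to be empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat steps 2-3 until convergence (no change in the centroids)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: two well-separated blobs of points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)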
Selecting the Value of k
• Try different values of k and observe the average distance to the centroid as k increases
• Alternatively, use the silhouette score

Selecting Starting Points
• Naïve approach
  • Select points randomly
  • Possible problems when the selected points end up close to each other
• Approach 1: Sampling
  • Cluster a smaller subset of the data using a different clustering algorithm
  • Pick representatives from each cluster
• Approach 2: Dispersed Set
  • Select the first point randomly
  • Select each subsequent point so that it has the largest possible distance from the already selected points

Complexity
• In each round we examine each input point once
• O(kn) per round for n points and k clusters
• The problem is the number of rounds needed to converge

Image Processing

Short Turing Test
(image slides labeled Machine / Human)

Image Recognition
• This lecture focuses on a single example of image recognition
• Humans mostly focus on locally outstanding features and contours
• We need a technique to detect those local characteristics
Sources: Artwork by Matt Small; Arts with Miss Griffin, Types of Contours

Image Recognition
• Goal: a program that recognizes classes (circles and rectangles) in an image, learned from a labeled training set
• To do:
  • Transfer the images to a basis suitable for edge detection and local features
    • Wavelet decomposition
  • Find the features associated with the different classes
    • Principal components
  • Design a statistical decision mechanism for classifying new objects
    • Linear discriminant analysis

Decomposition Revisited
• Some well-known decompositions
  • SVD, PCA
  • ...
• There are many more decompositions out there
• Principle
  • Find a suitable basis
  • Find coefficients to represent the data
• Wavelet decomposition is yet another decomposition, where the basis consists of wavelets

What is a Wavelet?
• A wavelet is an oscillating function whose amplitude begins at zero, increases, and returns to zero.
• Wavelets can be combined to create other, more complex functions.
Source: Mathworks

Why Wavelets?
• Wavelets are spatially localized
• Perfect for non-periodic functions/signals
• Pyramid representation

Wavelet Decomposition
• Wavelets are an ideal way to represent multi-scale information
• Very efficient at detecting and highlighting edges
• Image data is often represented in wavelets for machine learning and data analysis
• Wavelets are able to detect local changes in the data - they can "march along" the data
Source: Wikipedia, Wavelet_transform

Wavelet Decomposition
• Every wavelet can be described through a mother wavelet function ψ and a mother scaling function φ.
• The simplest wavelet is the Haar wavelet
  • Developed by Alfred Haar in 1909
  • The simplest and most widely adopted wavelet basis

Haar Wavelet
• Let's start with an example
• A step signal with value 9 on [0, 0.5) and value 1 on [0.5, 1) can be expressed as a combination of an average and a difference function:
  • 5 · φ(t), where φ is the mother scaling function
  • 4 · ψ(t), where ψ is the mother wavelet function

Haar Wavelet
• Now consider a more complex signal
• For a 1-D discrete signal of length 2^N we can take the average and difference of each pair of neighboring values to obtain 2^(N-1) scale coefficients and 2^(N-1) detail coefficients (see the sketch below).
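A minimal sketch of this averaging/differencing step (my own implementation, not from the lecture). Applied to the pair [9, 1] it reproduces the 5·φ + 4·ψ example above. Note that the orthonormal variant discussed further below divides by √2 instead of 2, which is what PyWavelets' pywt.dwt(signal, 'haar') returns.

import numpy as np

def haar_step(signal):
    # One level of the (non-normalized) Haar decomposition:
    # pairwise averages (scale coefficients) and pairwise
    # differences (detail coefficients).
    x = np.asarray(signal, dtype=float)
    a, b = x[0::2], x[1::2]
    cA = (a + b) / 2   # averages / scale coefficients
    cD = (a - b) / 2   # differences / detail coefficients
    return cA, cD

# The step signal from the slide: value 9 on [0, 0.5), value 1 on [0.5, 1)
print(haar_step([9, 1]))   # (array([5.]), array([4.])) -> 5*phi(t) + 4*psi(t)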
Haar Wavelet
• Let's look at the values:
  • Y = [1, 9, 8, 7, 3, 4, 5, 6]
• The averages of the neighboring values and the differences:
  • cA = [5.0, 7.5, 3.5, 5.5] and cD = [-4.0, 0.5, -0.5, -0.5]
• The vector [cA, cD] is a single-level wavelet decomposition
• Single level means that the decomposition step was performed once

Haar Wavelet
• Let's look at the values again:
  • Y = [1, 9, 8, 7, 3, 4, 5, 6]
• The averages of the neighboring values and the differences:
  • cA1 = [5.0, 7.5, 3.5, 5.5] and cD1 = [-4.0, 0.5, -0.5, -0.5]
• Repeat with the new averages:
  • cA2 = [6.25, 4.5] and cD2 = [-1.25, -1]
• And one last time:
  • cA3 = 5.375 and cD3 = 0.875
• The 3-level wavelet transform of Y is now [cA3, cD3, cD2, cD1]

Orthonormal Wavelet Basis
• We need an orthonormal basis for the representation
• The basis becomes orthonormal if we compute:
  • Means via (a + b)/sqrt(2)
  • Differences via (a - b)/sqrt(2)

Odd-Length Signals
• Two strategies
  • Preferable: copy the last value
  • Alternative: remove the last data point

Haar Wavelet
• What about 2D?
• In 2D the principle stays the same; for a single-level decomposition:
  • Average over cells of four elements
  • Compute the horizontal differences
  • Compute the vertical differences
  • Compute the diagonal differences
Source: Wavelets for Computer Graphics: A Primer

Haar Wavelet
• For a multi-level decomposition one has to choose the order (level) of the decomposition
  • Iterate over the averages
  • Remember all computed differences

Haar Wavelet
(example image)
Source: https://chengtsolin.wordpress.com

Back to Image Recognition
• Achieved: a new representation of the labeled images
  • Local changes are encoded
• To do: create a decision mechanism based on the edges
  • Circle vs. rectangle
• Create new data that encodes the edges
• Only a small fraction of the principal components is needed to describe the data sufficiently
  • Remember SVD
• Next step: find a number of principal components associated with the objects

Underfitting / Overfitting
Source: https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/

SVD for Image Recognition
• Assume we have the edge data of n cube images and m sphere images (ed_cu, ed_sp). We want to characterize the images based on k features.
• Perform SVD on the stacked data:
  • [U,S,V] = svd([ed_cu,ed_sp])
• Let's take a closer look at the decomposed matrices.

SVD for Image Recognition
• [U,S,V] = svd([ed_cu,ed_sp]) = svd(A)
• S: impact of the singular values
• objects = S*V' is a new basis
• The size of objects depends only on the number of samples (a NumPy sketch of this step follows below)
(diagram: A = U S V')
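A rough NumPy equivalent of the MATLAB call above. The shapes and the random placeholder data are illustrative assumptions only; in the lecture setting ed_cu and ed_sp would hold the wavelet edge data of the cube and sphere images, one image per column.

import numpy as np

rng = np.random.default_rng(0)
# Placeholder edge data: p pixels per image, n cube images, m sphere images
p, n, m = 1024, 30, 30
ed_cu = rng.normal(size=(p, n))    # stand-in for cube edge images
ed_sp = rng.normal(size=(p, m))    # stand-in for sphere edge images

A = np.hstack([ed_cu, ed_sp])                    # stacked data, shape (p, n+m)
U, s, Vt = np.linalg.svd(A, full_matrices=False) # economy-size SVD

# objects = S*V': coordinates of every image in the basis of the left
# singular vectors U; its size depends only on the number of samples.
objects = np.diag(s) @ Vt                        # shape (n+m, n+m)

k = 10                       # keep only the k dominant features
features = objects[:k, :]    # k-feature characterization of each image

With this stacking, the first n columns of features describe the cube images and the remaining m columns the sphere images.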
Linear Discriminant Analysis
• Having detected a number of principal components of a class, we can now set up a statistical decision mechanism to identify objects in new images.
• One possible way to do so is to use linear discriminant analysis (LDA).
• LDA aims to reduce the dimensionality while preserving as much of the class-discriminatory information as possible.

LDA Illustration
• Assume we have a set of high-dimensional samples x with two classes ω_1 and ω_2: spheres and cubes.
• We seek to obtain a scalar y by projecting the samples x onto a line, y = w^T x.
• Of all possible lines we want to select the one that maximizes the separability of the two groups.
(illustration: a bad projection vs. a good projection)

LDA
• In order to find a good projection we need to define a measure of separation.
• One possibility is to compute the mean vector μ_i of each class ω_i and use the distance between the projected means as our objective function:
  $\tilde{S}_B = (\tilde{\mu}_1 - \tilde{\mu}_2)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = w^T S_B w$
• But considering just the means is not enough: two classes can have a large distance between the projected means μ_1 and μ_2 and still overlap after projection.

LDA
• Fisher suggested maximizing the difference between the means, normalized by a measure of the within-class differences.
• For each class define the scatter, an equivalent of the variance:
  $S_W = \sum_{j=1}^{2} \sum_{x \in \omega_j} (x - \mu_j)(x - \mu_j)^T, \qquad \tilde{S}_W = w^T S_W w$
• The Fisher linear discriminant is defined as the linear projection w that maximizes the criterion function
  $J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$
• Such a projection minimizes the distance within each class and maximizes the distance between the classes.

LDA
• Solving the generalized eigenvalue problem
  $S_W^{-1} S_B\, w = J(w)\, w = \lambda w$
  gives us the solution:
  $w^* = \arg\max_w \frac{w^T S_B w}{w^T S_W w} = S_W^{-1} (\mu_1 - \mu_2)$
• This solution is known as Fisher's linear discriminant, even though it is not a discriminant itself but a specific choice of direction for projecting the data down to one dimension.
• This projection can now be used to distinguish between the two groups.
• One simple method: if $w^{*T} x > \text{threshold}$, classify as cube; everything else is a sphere.
• More sophisticated methods can be used for the classification (a minimal sketch of the Fisher discriminant follows below).
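A minimal NumPy sketch of Fisher's discriminant as written above. The two Gaussian point clouds, the midpoint threshold, and the class labels are illustrative assumptions, not the lecture's actual image features.

import numpy as np

def fisher_lda(X1, X2):
    """Fisher's linear discriminant direction for two classes.
    X1, X2: arrays of shape (n_samples, n_features), one row per sample."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_W = sum over both classes of (x - mu_j)(x - mu_j)^T
    S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    # w* = S_W^{-1} (mu1 - mu2), the direction maximizing J(w)
    return np.linalg.solve(S_W, mu1 - mu2)

# Illustrative data: class 1 ("cubes") and class 2 ("spheres") in feature space
rng = np.random.default_rng(1)
X1 = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(40, 2))
X2 = rng.normal(loc=[0.0, 1.5], scale=0.5, size=(40, 2))

w = fisher_lda(X1, X2)
# Project onto w and classify with a simple midpoint threshold
# between the two projected class means (my own choice of threshold).
threshold = 0.5 * ((X1 @ w).mean() + (X2 @ w).mean())
new_sample = np.array([1.8, 0.2])
label = "cube" if new_sample @ w > threshold else "sphere"

In practice one would typically rely on a library implementation such as scikit-learn's LinearDiscriminantAnalysis rather than this hand-rolled version.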