Unsupervised Learning

Clustering Algorithms
• Groups objects that are similar
• Typically organized by modeling approach
• Two classes:
  • Hard clustering
  • Fuzzy clustering
• Examples:
  • Connectivity-based clustering
  • K-means
  • Distribution-based clustering
  • Density-based clustering

Connectivity-Based Clustering
• Objects are more related to nearby objects than to those farther away
• Similarity measure: Euclidean distance or any other metric
Source: https://en.wikipedia.org/wiki/Cluster_analysis

Centroid-Based Clustering
• Each cluster is represented by a center vector
• The center does not necessarily have to be one of the data points
Source: https://en.wikipedia.org/wiki/Cluster_analysis

Distribution-Based Clustering
• Clusters are defined as objects belonging to the same distribution
• Convenient for artificial datasets, but suffers from overfitting in practice
Source: https://en.wikipedia.org/wiki/Cluster_analysis

Density-Based Clustering
• Clusters are defined as areas of higher density (requires density drops between clusters)
• Objects in sparse areas are considered noise
Source: https://en.wikipedia.org/wiki/Cluster_analysis

Intrinsic Dimensions
(figure: example data tables with variables X, Y, Z over samples 1, 2, 3)

Dimensionality Reduction Algorithms
• Reduce the number of random variables to a set of principal variables
• Find structure in the data to reduce dimensionality in an unsupervised way
• The lower-dimensional variables are often visualized for labeling and further supervised learning
• Examples:
  • Principal component analysis (PCA)
  • Linear discriminant analysis (LDA)
  • t-distributed stochastic neighbor embedding (t-SNE)
  • Uniform manifold approximation and projection (UMAP)

K-means

K-means Introduction
• Centroid-based clustering
• Assumes Euclidean space/distance
• Advantages
  • Suitable for large datasets
  • Can be applied to clusters that are not well separated
• Disadvantage
  • Requires selecting the number of clusters k

K-means Algorithm
• Input:
  • k (number of clusters)
  • Data set {x_1, x_2, …, x_m}
• Algorithm (a minimal sketch follows below):
  1. Randomly select k centroids
  2. Assign a cluster index to each point based on its distance to the centroids
  3. Update the centroid locations
  4. Repeat steps 2-3 until convergence (i.e., no change)
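The four steps above translate directly into a few lines of NumPy. The sketch below is illustrative only: the function name kmeans, the random initialization details, and the empty-cluster guard are my own choices, not part of the lecture material; in practice a library implementation such as sklearn.cluster.KMeans would typically be used.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch. X: array of shape (n_points, n_dims)."""
    rng = np.random.default_rng(seed)
    # 1. Randomly select k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update each centroid to the mean of its assigned points
        #    (keep the old centroid if a cluster happens to be empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat steps 2-3 until convergence (no change in the centroids)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: two well-separated blobs of points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)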
Selecting the Value of k
• Try different values of k and observe the average distance to the centroid as k increases
• Alternatively, use the silhouette score

Selecting Starting Points
• Naïve approach
  • Select points randomly
  • Possible problems when the selected points end up close to each other
• Approach 1: Sampling
  • Cluster a smaller subset of the data using a different clustering algorithm
  • Pick representatives from each cluster
• Approach 2: Dispersed Set
  • Select the first point randomly
  • Select each subsequent point so that it has the largest possible distance from the already selected points

Complexity
• In each round we examine each input point once
• O(kn) per round for n points and k clusters
• The problem is the number of rounds needed to converge

Image Processing

Short Turing Test
(image slides labeled Machine / Human)

Image Recognition
• This lecture focuses on a single example of image recognition
• Humans mostly focus on locally outstanding features and contours
• We need a technique to detect those local characteristics
Sources: Artwork by Matt Small; Arts with Miss Griffin, Types of Contours

Image Recognition
• Goal: a program that recognizes classes (circles and rectangles) in an image, learned from a labeled training set
• To do:
  • Transfer the images to a basis suitable for edge detection and local features
    • Wavelet decomposition
  • Find the features associated with the different classes
    • Principal components
  • Design a statistical decision mechanism for classifying new objects
    • Linear discriminant analysis

Decomposition Revisited
• Some well-known decompositions
  • SVD, PCA
  • ...
• There are many more decompositions out there
• Principle
  • Find a suitable basis
  • Find coefficients to represent the data
• Wavelet decomposition is yet another decomposition, where the basis consists of wavelets

What is a Wavelet?
• A wavelet is an oscillating function whose amplitude begins at zero, increases, and returns to zero.
• Wavelets can be combined to create other, more complex functions.
Source: Mathworks

Why Wavelets?
• Wavelets are spatially localized
• Perfect for non-periodic functions/signals
• Pyramid representation

Wavelet Decomposition
• Wavelets are an ideal way to represent multi-scale information
• Very efficient at detecting and highlighting edges
• Image data is often represented in wavelets for machine learning and data analysis
• Wavelets are able to detect local changes in the data - they can "march along" the data
Source: Wikipedia, Wavelet_transform

Wavelet Decomposition
• Every wavelet can be described through a mother wavelet function ψ and a mother scaling function φ.
• The simplest wavelet is the Haar wavelet
  • Developed by Alfred Haar in 1909
  • The simplest and most widely adopted wavelet basis

Haar Wavelet
• Let's start with an example
• A step signal with value 9 on [0, 0.5) and value 1 on [0.5, 1) can be expressed as a combination of an average and a difference function:
  • 5 · φ(t), where φ is the mother scaling function
  • 4 · ψ(t), where ψ is the mother wavelet function

Haar Wavelet
• Now consider a more complex signal
• For a 1-D discrete signal of length 2^N we can take the average and difference of each pair of neighboring values to obtain 2^(N-1) scale coefficients and 2^(N-1) detail coefficients (see the sketch below).
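A minimal sketch of this averaging/differencing step (my own implementation, not from the lecture). Applied to the pair [9, 1] it reproduces the 5·φ + 4·ψ example above. Note that the orthonormal variant discussed further below divides by √2 instead of 2, which is what PyWavelets' pywt.dwt(signal, 'haar') returns.

import numpy as np

def haar_step(signal):
    # One level of the (non-normalized) Haar decomposition:
    # pairwise averages (scale coefficients) and pairwise
    # differences (detail coefficients).
    x = np.asarray(signal, dtype=float)
    a, b = x[0::2], x[1::2]
    cA = (a + b) / 2   # averages / scale coefficients
    cD = (a - b) / 2   # differences / detail coefficients
    return cA, cD

# The step signal from the slide: value 9 on [0, 0.5), value 1 on [0.5, 1)
print(haar_step([9, 1]))   # (array([5.]), array([4.])) -> 5*phi(t) + 4*psi(t)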
Haar Wavelet
• Let's look at the values:
  • Y = [1, 9, 8, 7, 3, 4, 5, 6]
• The averages of the neighboring values and the differences:
  • cA = [5.0, 7.5, 3.5, 5.5] and cD = [-4.0, 0.5, -0.5, -0.5]
• The vector [cA, cD] is a single-level wavelet decomposition
• Single level means that the decomposition step was performed once

Haar Wavelet
• Let's look at the values again:
  • Y = [1, 9, 8, 7, 3, 4, 5, 6]
• The averages of the neighboring values and the differences:
  • cA1 = [5.0, 7.5, 3.5, 5.5] and cD1 = [-4.0, 0.5, -0.5, -0.5]
• Repeat with the new averages:
  • cA2 = [6.25, 4.5] and cD2 = [-1.25, -1]
• And one last time:
  • cA3 = 5.375 and cD3 = 0.875
• The 3-level wavelet transform of Y is now [cA3, cD3, cD2, cD1]

Orthonormal Wavelet Basis
• We need an orthonormal basis for the representation
• The basis becomes orthonormal if we compute:
  • Means via (a + b)/sqrt(2)
  • Differences via (a - b)/sqrt(2)

Odd-Length Signals
• Two strategies
  • Preferable: copy the last value
  • Alternative: remove the last data point

Haar Wavelet
• What about 2D?
• In 2D the principle stays the same; for a single-level decomposition:
  • Average over cells of four elements
  • Compute the horizontal differences
  • Compute the vertical differences
  • Compute the diagonal differences
Source: Wavelets for Computer Graphics: A Primer

Haar Wavelet
• For a multi-level decomposition one has to choose the order (level) of the decomposition
  • Iterate over the averages
  • Remember all computed differences

Haar Wavelet
(example image)
Source: https://chengtsolin.wordpress.com

Back to Image Recognition
• Achieved: a new representation of the labeled images
  • Local changes are encoded
• To do: create a decision mechanism based on the edges
  • Circle vs. rectangle
• Create new data that encodes the edges
• Only a small fraction of the principal components is needed to describe the data sufficiently
  • Remember SVD
• Next step: find a number of principal components associated with the objects

Underfitting / Overfitting
Source: https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/

SVD for Image Recognition
• Assume we have the edge data of n cube images and m sphere images (ed_cu, ed_sp). We want to characterize the images based on k features.
• Perform SVD on the stacked data:
  • [U,S,V] = svd([ed_cu,ed_sp])
• Let's take a closer look at the decomposed matrices.

SVD for Image Recognition
• [U,S,V] = svd([ed_cu,ed_sp]) = svd(A)
• S: impact of the singular values
• objects = S*V' is a new basis
• The size of objects depends only on the number of samples (a NumPy sketch of this step follows below)
(diagram: A = U S V')
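A rough NumPy equivalent of the MATLAB call above. The shapes and the random placeholder data are illustrative assumptions only; in the lecture setting ed_cu and ed_sp would hold the wavelet edge data of the cube and sphere images, one image per column.

import numpy as np

rng = np.random.default_rng(0)
# Placeholder edge data: p pixels per image, n cube images, m sphere images
p, n, m = 1024, 30, 30
ed_cu = rng.normal(size=(p, n))    # stand-in for cube edge images
ed_sp = rng.normal(size=(p, m))    # stand-in for sphere edge images

A = np.hstack([ed_cu, ed_sp])                    # stacked data, shape (p, n+m)
U, s, Vt = np.linalg.svd(A, full_matrices=False) # economy-size SVD

# objects = S*V': coordinates of every image in the basis of the left
# singular vectors U; its size depends only on the number of samples.
objects = np.diag(s) @ Vt                        # shape (n+m, n+m)

k = 10                       # keep only the k dominant features
features = objects[:k, :]    # k-feature characterization of each image

With this stacking, the first n columns of features describe the cube images and the remaining m columns the sphere images.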
Linear Discriminant Analysis
• Having detected a number of principal components of a class, we can now set up a statistical decision mechanism to identify objects in new images.
• One possible way to do so is to use linear discriminant analysis (LDA).
• LDA aims to reduce the dimensionality while preserving as much of the class-discriminatory information as possible.

LDA Illustration
• Assume we have a set of high-dimensional samples x with two classes ω_1 and ω_2: spheres and cubes.
• We seek to obtain a scalar y by projecting the samples x onto a line, y = w^T x.
• Of all possible lines we want to select the one that maximizes the separability of the two groups.
(illustration: a bad projection vs. a good projection)

LDA
• In order to find a good projection we need to define a measure of separation.
• One possibility is to compute the mean vector μ_i of each class ω_i and use the distance between the projected means as our objective function:
  $\tilde{S}_B = (\tilde{\mu}_1 - \tilde{\mu}_2)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = w^T S_B w$
• But considering just the means is not enough: two classes can have a large distance between the projected means μ_1 and μ_2 and still overlap after projection.

LDA
• Fisher suggested maximizing the difference between the means, normalized by a measure of the within-class differences.
• For each class define the scatter, an equivalent of the variance:
  $S_W = \sum_{j=1}^{2} \sum_{x \in \omega_j} (x - \mu_j)(x - \mu_j)^T, \qquad \tilde{S}_W = w^T S_W w$
• The Fisher linear discriminant is defined as the linear projection w that maximizes the criterion function
  $J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$
• Such a projection minimizes the distance within each class and maximizes the distance between the classes.

LDA
• Solving the generalized eigenvalue problem
  $S_W^{-1} S_B\, w = J(w)\, w = \lambda w$
  gives us the solution:
  $w^* = \arg\max_w \frac{w^T S_B w}{w^T S_W w} = S_W^{-1} (\mu_1 - \mu_2)$
• This solution is known as Fisher's linear discriminant, even though it is not a discriminant itself but a specific choice of direction for projecting the data down to one dimension.
• This projection can now be used to distinguish between the two groups.
• One simple method: if $w^{*T} x > \text{threshold}$, classify as cube; everything else is a sphere.
• More sophisticated methods can be used for the classification (a minimal sketch of the Fisher discriminant follows below).
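A minimal NumPy sketch of Fisher's discriminant as written above. The two Gaussian point clouds, the midpoint threshold, and the class labels are illustrative assumptions, not the lecture's actual image features.

import numpy as np

def fisher_lda(X1, X2):
    """Fisher's linear discriminant direction for two classes.
    X1, X2: arrays of shape (n_samples, n_features), one row per sample."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_W = sum over both classes of (x - mu_j)(x - mu_j)^T
    S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    # w* = S_W^{-1} (mu1 - mu2), the direction maximizing J(w)
    return np.linalg.solve(S_W, mu1 - mu2)

# Illustrative data: class 1 ("cubes") and class 2 ("spheres") in feature space
rng = np.random.default_rng(1)
X1 = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(40, 2))
X2 = rng.normal(loc=[0.0, 1.5], scale=0.5, size=(40, 2))

w = fisher_lda(X1, X2)
# Project onto w and classify with a simple midpoint threshold
# between the two projected class means (my own choice of threshold).
threshold = 0.5 * ((X1 @ w).mean() + (X2 @ w).mean())
new_sample = np.array([1.8, 0.2])
label = "cube" if new_sample @ w > threshold else "sphere"

In practice one would typically rely on a library implementation such as scikit-learn's LinearDiscriminantAnalysis rather than this hand-rolled version.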