Bi7740: Scientific computing
Matrix computation: Exercises

Vlad Popovici
popovici@iba.muni.cz
Institute of Biostatistics and Analyses, Masaryk University, Brno

Before starting

Notation used throughout:
- x ∈ R^m, x = [x1, ..., xm], is a row vector of observations
- µ ∈ R^m is the mean vector (population mean or sample average, depending on the context)
- X = [x1; ...; xn] is the n × m data matrix whose rows are the observations, and Xµ = [x1 − µ; ...; xn − µ] is its centered version
- Σ = E[(x − µ)^t (x − µ)] is the covariance matrix
- Σ̂ = 1/(n−1) Xµ^t Xµ is the sample-based unbiased estimator of Σ

Load the data: load 'artificial_data.mat' and load 'genexpr_data.mat'.

Outline
1 Mahalanobis distance
2 Principal component analysis
3 Data sphering

1 Mahalanobis distance

Let N(µ, Σ) be an m-dimensional Gaussian (normal) distribution with mean µ and covariance Σ. Then the (squared) Mahalanobis distance from x to the distribution is

    d^2(x, N(µ, Σ)) = (x − µ) Σ^{−1} (x − µ)^t

Exercise: write Matlab functions computing the Mahalanobis distance from each of the observations (rows) in X to a Gaussian distribution:
- mhdist0(X) for N(0, I) (this reduces to the Euclidean distance to 0!)
- mhdist1(X, mu, sigma) for N(µ, Σ), with µ and Σ provided by the user
- mhdist(X, Z) for N(µ̂, Σ̂), with µ̂ and Σ̂ estimated from Z
(A sketch of one possible solution is given after the PCA exercise below.)

2 Principal component analysis

- Z = X W is an orthogonal transformation such that the new axes align with the directions of largest variation
- the resulting variables are linearly uncorrelated
- one interpretation: PCA finds the axes (linear combinations of the original variables) that minimize the squared error of approximating the original data, as in linear regression
- another perspective: it is a spectral analysis of the covariance matrix
- W has as columns the eigenvectors of the covariance matrix of the vectors in X

Exercise: write the following Matlab functions (sketches of both, and of the Mahalanobis functions above, follow below):
- prcomp_naive(X, npc): computes the principal vectors and the corresponding coefficients by eigendecomposition of the covariance matrix. What happens when you try to find all the eigenvectors for m > n? What if m >> n?
- prcomp(X): finds all the principal vectors and the corresponding coefficients using the SVD. Remember that the SVD of X is mathematically equivalent to the eigendecomposition of X^t X.
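One possible solution to the Mahalanobis exercise, sketched for reference (each function goes in its own .m file). Note that mhdist1 avoids forming Σ^{−1} explicitly: it uses a Cholesky factorization and a triangular solve instead.

    function d = mhdist0(X)
    % Squared Mahalanobis distance of each row of X under N(0, I):
    % this is just the squared Euclidean distance to the origin.
    d = sum(X.^2, 2);
    end

    function d = mhdist1(X, mu, sigma)
    % Squared Mahalanobis distance of each row of X under N(mu, sigma).
    % Cholesky factorization + triangular solve instead of inv(sigma).
    Xc = bsxfun(@minus, X, mu(:)');  % center the rows at mu
    R  = chol(sigma);                % sigma = R' * R, R upper triangular
    Y  = Xc / R;                     % Y = Xc * inv(R), by back-substitution
    d  = sum(Y.^2, 2);               % rows give (x - mu) sigma^-1 (x - mu)'
    end

    function d = mhdist(X, Z)
    % Squared Mahalanobis distance under the Gaussian estimated from Z.
    d = mhdist1(X, mean(Z, 1), cov(Z));  % cov() is the unbiased estimator
    end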
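A possible sketch of the two PCA functions. The SVD route in prcomp never forms the m × m covariance matrix, which is exactly what makes prcomp_naive impractical when m >> n (and for m > n the centered data have rank at most n − 1, so most eigenvalues are zero anyway).

    function [W, Z] = prcomp_naive(X, npc)
    % First npc principal vectors (columns of W) and coefficients Z,
    % by eigendecomposition of the m-by-m sample covariance matrix.
    Xc = bsxfun(@minus, X, mean(X, 1));  % center the columns
    C  = (Xc' * Xc) / (size(X, 1) - 1);  % unbiased covariance estimate
    [V, D]  = eig(C);
    [~, ix] = sort(diag(D), 'descend');  % eig() does not sort eigenvalues
    W = V(:, ix(1:npc));                 % principal vectors
    Z = Xc * W;                          % coefficients (scores)
    end

    function [W, Z, lambda] = prcomp(X)
    % All principal vectors, coefficients and covariance eigenvalues via
    % the SVD of the centered data: Xc = U*S*V' gives Xc'*Xc = V*S^2*V'.
    n  = size(X, 1);
    Xc = bsxfun(@minus, X, mean(X, 1));
    [U, S, V] = svd(Xc, 'econ');         % no m-by-m matrix is ever built
    W      = V;                          % principal vectors
    Z      = U * S;                      % coefficients, equal to Xc * W
    lambda = diag(S).^2 / (n - 1);       % eigenvalues of the covariance
    end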
3 Data sphering

PCA can be used to decorrelate the variables; with a proper scaling of the result, the transformed variables have the identity covariance matrix:

    Z = X V Λ^{−1/2}

where V has the eigenvectors of the covariance matrix as columns and Λ is the diagonal matrix of the corresponding eigenvalues. The procedure is also called data whitening: the eigenvalue spectrum of the transformed data is flat (uniform), as for white noise.

Exercise: implement data sphering in Matlab (DO NOT USE inv()):
- [Z, W, IW] = sphere(X) transforms the data X −→ Z and returns the transformation matrix W together with its inverse IW; try it!
- [Z, W] = sphere2(X) for matrices with m >> n. Use the following identity: X X^t u = λu ⇔ (X^t X)(X^t u) = λ(X^t u), so the eigenvectors v_i of the covariance matrix of interest (of the form X^t X) can be obtained from the eigenvectors u_i of the much smaller matrix X X^t as v_i = X^t u_i, followed by proper scaling (for normalization).
(A sketch of both functions follows.)
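A sketch of both functions under the conventions above. For m > n the sample covariance is singular, so sphere keeps only the min(m, n) directions returned by the economy-size SVD, and IW inverts W on those directions; sphere2 solves the small n × n eigenproblem instead, as the hint suggests.

    function [Z, W, IW] = sphere(X)
    % Data sphering (whitening) without inv(): from the economy SVD of
    % the centered data, cov(X) = V * diag(s.^2) * V' with s as below.
    n  = size(X, 1);
    Xc = bsxfun(@minus, X, mean(X, 1));
    [~, S, V] = svd(Xc, 'econ');
    s  = diag(S) / sqrt(n - 1);      % sqrt of the covariance eigenvalues
    W  = bsxfun(@rdivide, V, s');    % W  = V * Lambda^(-1/2)
    IW = bsxfun(@times,  V, s')';    % IW = Lambda^(1/2) * V'
    Z  = Xc * W;                     % cov(Z) is (close to) the identity
    end

    function [Z, W] = sphere2(X)
    % Sphering for m >> n: eigendecompose the small n-by-n matrix Xc*Xc'
    % and map its eigenvectors u to eigenvectors v = Xc' * u of Xc'*Xc.
    n  = size(X, 1);
    Xc = bsxfun(@minus, X, mean(X, 1));
    [U, D] = eig(Xc * Xc');              % n-by-n eigenproblem only
    lam    = diag(D);
    keep   = lam > n * eps(max(lam));    % drop numerically zero eigenvalues
    U   = U(:, keep);
    lam = lam(keep);
    V = Xc' * U;                         % columns have norm sqrt(lam), ...
    V = bsxfun(@rdivide, V, sqrt(lam)'); % ... so rescale to unit length
    W = bsxfun(@rdivide, V, sqrt(lam/(n - 1))');  % V * Lambda^(-1/2)
    Z = Xc * W;
    end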