Bi7740: Scientific computing
Matrix computation: Exercises

Vlad Popovici
popovici@iba.muni.cz
Institute of Biostatistics and Analyses, Masaryk University, Brno

Before starting

Notation used throughout:
- x ∈ R^m, x = [x1, ..., xm], is a row vector of observations
- µ ∈ R^m is the mean vector (population mean or sample average, depending on the context)
- X = [x1; ...; xn] is the n × m data matrix whose rows are the observations, and Xµ = [x1 − µ; ...; xn − µ] is its centered version
- Σ = E[(x − µ)^t (x − µ)] is the covariance matrix
- Σ̂ = 1/(n−1) Xµ^t Xµ is the sample-based unbiased estimator of Σ

Load the data: load 'artificial_data.mat' and load 'genexpr_data.mat'.

Outline
1 Mahalanobis distance
2 Principal component analysis
3 Data sphering

1 Mahalanobis distance

Let N(µ, Σ) be an m-dimensional Gaussian (normal) distribution with mean µ and covariance Σ. Then the (squared) Mahalanobis distance from x to the distribution is

    d^2(x, N(µ, Σ)) = (x − µ) Σ^{−1} (x − µ)^t

Exercise: write Matlab functions computing the Mahalanobis distance from each of the observations (rows) in X to a Gaussian distribution:
- mhdist0(X) for N(0, I) (this reduces to the Euclidean distance to 0!)
- mhdist1(X, mu, sigma) for N(µ, Σ), with µ and Σ provided by the user
- mhdist(X, Z) for N(µ̂, Σ̂), with µ̂ and Σ̂ estimated from Z
(A sketch of one possible solution is given after the PCA exercise below.)

2 Principal component analysis

- Z = X W is an orthogonal transformation such that the new axes align with the directions of largest variation
- the resulting variables are linearly uncorrelated
- one interpretation: PCA finds the axes (linear combinations of the original variables) that minimize the squared error of approximating the original data, as in linear regression
- another perspective: it is a spectral analysis of the covariance matrix
- W has as columns the eigenvectors of the covariance matrix of the vectors in X

Exercise: write the following Matlab functions (sketches of both, and of the Mahalanobis functions above, follow below):
- prcomp_naive(X, npc): computes the principal vectors and the corresponding coefficients by eigendecomposition of the covariance matrix. What happens when you try to find all the eigenvectors for m > n? What if m >> n?
- prcomp(X): finds all the principal vectors and the corresponding coefficients using the SVD. Remember that the SVD of X is mathematically equivalent to the eigendecomposition of X^t X.
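One possible solution to the Mahalanobis exercise, sketched for reference (each function goes in its own .m file). Note that mhdist1 avoids forming Σ^{−1} explicitly: it uses a Cholesky factorization and a triangular solve instead.

    function d = mhdist0(X)
    % Squared Mahalanobis distance of each row of X under N(0, I):
    % this is just the squared Euclidean distance to the origin.
    d = sum(X.^2, 2);
    end

    function d = mhdist1(X, mu, sigma)
    % Squared Mahalanobis distance of each row of X under N(mu, sigma).
    % Cholesky factorization + triangular solve instead of inv(sigma).
    Xc = bsxfun(@minus, X, mu(:)');  % center the rows at mu
    R  = chol(sigma);                % sigma = R' * R, R upper triangular
    Y  = Xc / R;                     % Y = Xc * inv(R), by back-substitution
    d  = sum(Y.^2, 2);               % rows give (x - mu) sigma^-1 (x - mu)'
    end

    function d = mhdist(X, Z)
    % Squared Mahalanobis distance under the Gaussian estimated from Z.
    d = mhdist1(X, mean(Z, 1), cov(Z));  % cov() is the unbiased estimator
    end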
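A possible sketch of the two PCA functions. The SVD route in prcomp never forms the m × m covariance matrix, which is exactly what makes prcomp_naive impractical when m >> n (and for m > n the centered data have rank at most n − 1, so most eigenvalues are zero anyway).

    function [W, Z] = prcomp_naive(X, npc)
    % First npc principal vectors (columns of W) and coefficients Z,
    % by eigendecomposition of the m-by-m sample covariance matrix.
    Xc = bsxfun(@minus, X, mean(X, 1));  % center the columns
    C  = (Xc' * Xc) / (size(X, 1) - 1);  % unbiased covariance estimate
    [V, D]  = eig(C);
    [~, ix] = sort(diag(D), 'descend');  % eig() does not sort eigenvalues
    W = V(:, ix(1:npc));                 % principal vectors
    Z = Xc * W;                          % coefficients (scores)
    end

    function [W, Z, lambda] = prcomp(X)
    % All principal vectors, coefficients and covariance eigenvalues via
    % the SVD of the centered data: Xc = U*S*V' gives Xc'*Xc = V*S^2*V'.
    n  = size(X, 1);
    Xc = bsxfun(@minus, X, mean(X, 1));
    [U, S, V] = svd(Xc, 'econ');         % no m-by-m matrix is ever built
    W      = V;                          % principal vectors
    Z      = U * S;                      % coefficients, equal to Xc * W
    lambda = diag(S).^2 / (n - 1);       % eigenvalues of the covariance
    end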
3 Data sphering

PCA can be used to decorrelate the variables; with a proper scaling of the result, the transformed variables have the identity covariance matrix:

    Z = X V Λ^{−1/2}

where V has the eigenvectors of the covariance matrix as columns and Λ is the diagonal matrix of the corresponding eigenvalues. The procedure is also called data whitening: the eigenvalue spectrum of the transformed data is flat (uniform), as for white noise.

Exercise: implement data sphering in Matlab (DO NOT USE inv()):
- [Z, W, IW] = sphere(X) transforms the data X −→ Z and returns the transformation matrix W together with its inverse IW; try it!
- [Z, W] = sphere2(X) for matrices with m >> n. Use the following identity: X X^t u = λu ⇔ (X^t X)(X^t u) = λ(X^t u), so the eigenvectors v_i of the covariance matrix of interest (of the form X^t X) can be obtained from the eigenvectors u_i of the much smaller matrix X X^t as v_i = X^t u_i, followed by proper scaling (for normalization).
(A sketch of both functions follows.)
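A sketch of both functions under the conventions above. For m > n the sample covariance is singular, so sphere keeps only the min(m, n) directions returned by the economy-size SVD, and IW inverts W on those directions; sphere2 solves the small n × n eigenproblem instead, as the hint suggests.

    function [Z, W, IW] = sphere(X)
    % Data sphering (whitening) without inv(): from the economy SVD of
    % the centered data, cov(X) = V * diag(s.^2) * V' with s as below.
    n  = size(X, 1);
    Xc = bsxfun(@minus, X, mean(X, 1));
    [~, S, V] = svd(Xc, 'econ');
    s  = diag(S) / sqrt(n - 1);      % sqrt of the covariance eigenvalues
    W  = bsxfun(@rdivide, V, s');    % W  = V * Lambda^(-1/2)
    IW = bsxfun(@times,  V, s')';    % IW = Lambda^(1/2) * V'
    Z  = Xc * W;                     % cov(Z) is (close to) the identity
    end

    function [Z, W] = sphere2(X)
    % Sphering for m >> n: eigendecompose the small n-by-n matrix Xc*Xc'
    % and map its eigenvectors u to eigenvectors v = Xc' * u of Xc'*Xc.
    n  = size(X, 1);
    Xc = bsxfun(@minus, X, mean(X, 1));
    [U, D] = eig(Xc * Xc');              % n-by-n eigenproblem only
    lam    = diag(D);
    keep   = lam > n * eps(max(lam));    % drop numerically zero eigenvalues
    U   = U(:, keep);
    lam = lam(keep);
    V = Xc' * U;                         % columns have norm sqrt(lam), ...
    V = bsxfun(@rdivide, V, sqrt(lam)'); % ... so rescale to unit length
    W = bsxfun(@rdivide, V, sqrt(lam/(n - 1))');  % V * Lambda^(-1/2)
    Z = Xc * W;
    end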