In [2]:
%matplotlib inline

In [3]:
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np

# Gaussian Mixture Models (GMMs)
See the documentation at http://scikit-learn.org/stable/modules/mixture.html#gmm section 2.1.1.

In [4]:
from sklearn import datasets
from sklearn.mixture import GaussianMixture

For a full example with plots, see http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html

Note that although the linked example uses GMMs for classification, their basic use is clustering (or density estimation): the .fit() method does not need the labels.

Here we just look at the main steps for fitting a model:

In [5]:
#-get the data
iris = datasets.load_iris()
X = iris.data
y = iris.target

#-how many features are there?
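
One way to answer this is to look at the shape of the data matrix. The short sketch below reloads the iris data so that it runs on its own; the names `n_samples` and `n_features` are introduced here only for illustration.

```python
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

# Rows are samples (flowers), columns are features (measurements).
n_samples, n_features = X.shape
print(n_samples, n_features)
print(iris.feature_names)
```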

In [6]:
#-fit a model (example):
n_classes = 3
clst = GaussianMixture(n_components=n_classes, covariance_type='diag', init_params='kmeans', max_iter=20)
clst.fit(X)

GaussianMixture(covariance_type='diag', init_params='kmeans', max_iter=20,
 means_init=None, n_components=3, n_init=1, precisions_init=None,
 random_state=None, reg_covar=1e-06, tol=0.001, verbose=0,
 verbose_interval=10, warm_start=False, weights_init=None)

In [7]:
#-inspect the fitted parameters:
print(clst.means_) # component means (centers), one per row
print(clst.weights_) # mixing coefficients, one per component

[[ 6.81209583 3.07212929 5.72666022 2.10669076]
 [ 5.006 3.418 1.464 0.244 ]
 [ 5.92793543 2.75046463 4.40762592 1.41444826]]
[ 0.25188836 0.33333333 0.41477831]


In [8]:
#-you can assign the data points to a cluster:
y_pred = clst.predict(X)
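
Since a GMM is a probabilistic model, the hard assignment returned by .predict() is only part of the picture. As a minimal sketch (refitting a model with a fixed `random_state`, which is an assumption made here for reproducibility), the soft assignments and the per-sample log-density can be obtained like this:

```python
import numpy as np
from sklearn import datasets
from sklearn.mixture import GaussianMixture

X = datasets.load_iris().data

gmm = GaussianMixture(n_components=3, covariance_type='diag', random_state=0)
gmm.fit(X)  # no labels needed

# Posterior probability of each component for each sample (rows sum to 1).
resp = gmm.predict_proba(X)

# Hard assignment: the component with the highest posterior probability.
hard = gmm.predict(X)

# Log of the mixture density at each sample (the density-estimation view).
log_density = gmm.score_samples(X)

print(resp[:3].round(3))
print(hard[:3], log_density[:3])
```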

In [9]:
#-and "compare" with the true labels (fraction of mismatching labels):
np.mean(y != y_pred)

0.90666666666666662

Why is there such a mismatch between the cluster labels and the class labels? Inspect the two and try to find a way to put the cluster labels in correspondence with the class labels.
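
One possible approach, sketched below, rests on the observation that cluster indices are arbitrary: cluster 0 need not correspond to class 0. The sketch builds the class-versus-cluster contingency table and picks the one-to-one matching that maximizes agreement, using `scipy.optimize.linear_sum_assignment` (the Hungarian method). The fixed `random_state` and the names `mapping` and `y_mapped` are assumptions made for this illustration, not part of the original notebook.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.mixture import GaussianMixture

iris = datasets.load_iris()
X, y = iris.data, iris.target

gmm = GaussianMixture(n_components=3, covariance_type='diag', random_state=0)
y_pred = gmm.fit_predict(X)

# Contingency table: rows are true classes, columns are cluster indices.
cm = confusion_matrix(y, y_pred, labels=[0, 1, 2])

# Find the class <-> cluster pairing with the largest total overlap.
# linear_sum_assignment minimizes cost, so the counts are negated.
class_idx, cluster_idx = linear_sum_assignment(-cm)
mapping = dict(zip(cluster_idx, class_idx))

# Relabel the clusters with their matched class and recompute the mismatch.
y_mapped = np.array([mapping[c] for c in y_pred])
error_rate = np.mean(y != y_mapped)
print(cm)
print(mapping, error_rate)
```

After relabeling, the mismatch rate measures how well the clusters agree with the classes, rather than how the arbitrary indices happened to line up.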

# K-means clustering

See the documentation at
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans

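Note that older scikit-learn releases (such as the one that produced the output below) have a parameter, n_jobs, which lets you run the
code on several CPUs and can give a large speed-up for big data sets. In recent releases this parameter has been removed from KMeans, which now parallelizes through threads instead.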

The basic steps are as before:

In [10]:
from sklearn.cluster import KMeans

In [11]:
#-fit a model (example):
n_classes = 3
clst = KMeans(n_clusters=n_classes)
clst.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
 n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
 random_state=None, tol=0.0001, verbose=0)

In [12]:
#-inspect the fitted parameters:
clst.cluster_centers_ # cluster centers, one per row

array([[ 5.006 , 3.418 , 1.464 , 0.244 ],
 [ 5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
 [ 6.85 , 3.07368421, 5.74210526, 2.07105263]])

In [13]:
#-you can assign the data points to a cluster:
y_pred = clst.predict(X)

In [15]:
#-and compare with the true labels:
np.mean(y != y_pred) # Why does this not behave as you would expect?

0.10666666666666667
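
A direct comparison of labels depends on how the clusters happen to be numbered, so the value changes from run to run. As a hedged alternative, the sketch below uses the adjusted Rand index from `sklearn.metrics`, which is unchanged by any renumbering of the clusters. The fixed `random_state` and the artificial permutation are assumptions added for the demonstration.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

km = KMeans(n_clusters=3, n_init=10, random_state=0)
y_pred = km.fit_predict(X)

# Agreement between the partition and the classes (1.0 means identical).
ari = adjusted_rand_score(y, y_pred)

# Renumber the clusters (0->1, 1->2, 2->0): the partition is the same,
# so the score does not change, unlike the naive label comparison.
permuted = (y_pred + 1) % 3
ari_permuted = adjusted_rand_score(y, permuted)

print(ari, ari_permuted)
```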

**NOTE:** For large data sets there is a mini-batch version of KMeans, MiniBatchKMeans, which is usually much faster at the cost of a slightly worse clustering. Look at:
 http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html#sklearn.cluster.MiniBatchKMeans

And see the discussion:
http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html
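
As a rough sketch of the basic usage, the two estimators can be fitted side by side and compared through their inertia (the within-cluster sum of squared distances). The synthetic data from `make_blobs`, the `batch_size` value and the variable names are assumptions chosen for this illustration only.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for a "large" data set.
X_big, _ = make_blobs(n_samples=20000, centers=3, n_features=4, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_big)

# Same interface, but each update uses only a small random batch of samples.
mbk = MiniBatchKMeans(n_clusters=3, batch_size=1024, n_init=3,
                      random_state=0).fit(X_big)

# Lower inertia means tighter clusters; the two values are usually close.
print(km.inertia_, mbk.inertia_)
```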

## Application: color quantization
An important application of KMeans is in image processing, for example in re-quantizing the color levels of an image. For this, execute the example at http://scikit-learn.org/stable/auto_examples/cluster/plot_color_quantization.html
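
The core idea can be sketched in a few lines: treat every pixel as a point in RGB space, cluster the pixels, and replace each pixel by the center of its cluster. The sketch below is not the linked example; it uses a small random array as a stand-in for a real image, and `n_colors = 8` is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for an RGB image with channel values in [0, 1].
rng = np.random.RandomState(0)
image = rng.rand(32, 32, 3)

# Flatten to a (n_pixels, 3) matrix: one row per pixel, one column per channel.
h, w, d = image.shape
pixels = image.reshape(h * w, d)

# Cluster the pixel colors; the centers form the reduced palette.
n_colors = 8
km = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)

# Replace every pixel by its palette color and restore the image shape.
quantized = km.cluster_centers_[km.labels_].reshape(h, w, d)
print(quantized.shape)
```

To view the result, `plt.imshow(quantized)` should work with the matplotlib import from the top of the notebook.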

### TODO: (if time allows)
Use a different clustering method. Some options to explore:

- mean shift: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MeanShift.html#sklearn.cluster.MeanShift
- spectral clustering: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering
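- hierarchical clustering (Ward): http://scikit-learn.org/0.16/modules/generated/sklearn.cluster.Ward.html (this class exists only in old releases; in current ones use AgglomerativeClustering with linkage='ward')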
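
As a possible starting point, the sketch below fits the three methods on the iris data with mostly default settings and scores each partition against the class labels using the adjusted Rand index. It assumes a recent scikit-learn, where Ward clustering is available as `AgglomerativeClustering(linkage='ward')`; the parameter choices are illustrative, not tuned.

```python
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering, MeanShift, SpectralClustering
from sklearn.metrics import adjusted_rand_score

X, y = datasets.load_iris(return_X_y=True)

models = {
    # Mean shift chooses the number of clusters itself (bandwidth is estimated).
    'mean shift': MeanShift(),
    'spectral': SpectralClustering(n_clusters=3, random_state=0),
    'ward': AgglomerativeClustering(n_clusters=3, linkage='ward'),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    scores[name] = adjusted_rand_score(y, labels)
    print(name, 'clusters:', len(set(labels)), 'ARI:', round(scores[name], 3))
```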