data science

  • k-means clustering

    k-means is a kind of clustering algorithms, which belong to the family of unsupervised machine learning models. It aims at finding $k$ groups of similar data (clusters) in an unlabeled multidimensional dataset.

    The k-means minimization problem

    Let $(x1,..., xn)$ be a set of $n$ observations with $xi \in \mathbb{R}^{d}$, for $1 \leq i \leq n$. The aim of the k-means algorithms is to find a disjoint partition $S={S1,..., Sk }$ of the $n$ observations into $k \leq n$ clusters, minimizing $D$ the within-cluster distance to center: $$ D(S) = \sum{i=1}^k \sum{x \in Si} \| x - \mui \|^2 $$ where $\mui$ is the $i$-th cluster center (i.e. the arithmetic mean of the cluster observations): $\mui = \frac{1}{|Si|} \sum{xj \in Si} xj$, for $1 \leq i \leq n$.

    Unfortunately, finding the exact solution of this problem is very tough (NP-hard) and a local minimum is generally sought using a heuristic.

    The algorithm

    Here is a simple description of the algorithm taken from the book "Data Science from Scratch" by Joel Grus (O'Reilly):

    1. Start with a set of k-means, which are $k$ points in $d$-dimensional space.
    2. Assign each point to the mean to which it is closest.
    3. If no point’s assignment has changed, stop and keep the clusters.
    4. If some point’s assignment has changed, recompute the means and return to step 2.

    This algorithm is an iterative refinement procedure. In his book "Python Data Science Handbook" (O'Reilly), Jake VanderPlas refers to this algorithm as kind of Expectation–Maximization (E–M). Since step 1 is the algorithm initialization and step 3 the stopping criteria, we can see that the algorithm consists in only two alternating steps:

    step 2. is the Expectation:

    "updating our expectation of which cluster each point belongs to".

    step 4. is the Maximization:

    "maximizing some fitness function that defines the location of the cluster centers".

    This is described with more details in the following link.

    An interesting geometrical interpretation is that step 2 corresponds to partitioning the observations according to the Voronoi diagram generated by the centers computed previously (either on step 1 or 4). This is also why the standard k-means algorithm is also called Lloyd's algorithm, which is a Voronoi iteration method for finding evenly spaced sets of points in subsets of Euclidean spaces.

    Voronoi diagram

    Let us have a look at the Voronoi diagram generated by the $k$ means.

    As in Jake VanderPlas' book, we generate some fake observation data using scikit-learn 2-dimensional blobs, in order to easily plot them.

    ```python %matplotlib inline import matplotlib.pyplot as plt from sklearn.datasets.samplesgenerator import makeblobs

    k = 20 n = 1000 X, _ = makeblobs(nsamples=n, centers=k, clusterstd=0.70, randomstate=0) plt.scatter(X[:, 0], X[:, 1], s=5); ```