Given a set of datapoints, we often want to know how many clusters the datapoints form. The gap statistic and the prediction strength are two practical algorithms for choosing the number of clusters.
Gap Statistic
The gap statistic algorithm works as follows:
For each i from 1 up to some maximum number of clusters,
Run a k-means algorithm on the original dataset to find i clusters, and sum the distance of all points from their cluster mean. Call this sum the dispersion.
Generate a set of reference datasets (of the same size as the original). One simple way of generating a reference dataset is to sample uniformly from the original dataset’s bounding rectangle; a more sophisticated approach is take into account the original dataset’s shape by sampling, say, from a rectangle formed from the original dataset’s principal components.
Calculate the dispersion of each of these reference datasets, and take their mean.
Define the ith gap by: log(mean dispersion of reference datasets) - log(dispersion of original dataset).
Once we’ve calculated all the gaps (we can add confidence intervals as well; see the original paper for the formula), we can select the number of clusters to be the one that gives the maximum gap. (Sidenote: I view the gap statistic as a very statistical-minded algorithm, since it compares the original dataset against a set of reference “control” datasets.)
For example, here I’ve generated three Gaussian clusters:
And running the gap statistic algorithm, we see that it correctly detects the number of clusters to be three:
For a sample R implementation of the gap statistic, see the Github repository here.
Prediction Strength
Another cluster-counting algorithm is the prediction strength algorithm. In contrast to the gap statistic (which, as mentioned above, I find very statistically), I see prediction strength as taking a more machine learning viewpoint, since it’s formulated as a supervised learning problem validated against a test set.
To calculate prediction strength, for each i from 1 up to some maximum number of clusters:
Divide the dataset into two groups, a training set and a test set.
Run a k-means algorithm on each set to find i clusters.
For each test cluster, count the proportion of pairs of points in that cluster that would remain in the same cluster, if each were assigned to its closest training cluster mean.
The minimum over these proportions is the prediction strength for i clusters.
Once we’ve calculated the prediction strength for each number of clusters, we select the number of clusters to be the maximum i such that the prediction strength for i is greater than some threshold. (The paper suggests 0.8 - 0.9 as a good threshold, and I’ve seen 0.8 work well in practice.)
Here’s the prediction strength algorithm run on the same example above:
Again, check out a sample R implementation of the prediction strength here.
In practice, I tend to prefer using the gap statistic algorithm, since it’s a little easier to code and it doesn’t require selecting an arbitrary threshold like the prediction strength does. I’ve also found that it gives slightly better results (though the original prediction strength paper has the opposite finding).
Appendix
I ended up giving a brief description of two very common clustering algorithms, k-means and Gaussian mixture models in the comments, so I figured I might as well bring them up here.
k-means algorithm
Suppose we have a set of datapoints that we want to cluster. We want to learn two things:
- A description of the clusters themselves (so that if new points come in, we can assign them to a cluster).
- Which clusters our current points fall into.
We start by initializing k cluster centers (e.g., by randomly choosing k points among our datapoints). Then we repeatedly
- Step A: Assign each datapoint to the nearest cluster center.
- Step B: Update all the cluster centers: for each cluster i, take the mean over all points currently in the cluster, and update cluster center i to be this mean.
- (Repeat steps A and B above until the cluster assignments stop changing.)
And that’s pretty much it for k-means.
k-means from an EM point of View
To ease the transition into Gaussian mixture models, let’s also describe the k-means algorithm using EM language.
Note that if we knew for certain either 1) the exact cluster centers or 2) the cluster each point belonged to, we could trivially solve k-means, since
- If we knew the exact cluster centers, all we’d have to do is assign each point to its nearest cluster center, and we’d be done.
- If we knew which cluster each point belonged to, we could pick the cluster center by simply taking the mean over all points in that cluster.
The problem is that we know neither of these, and so we alternate between making educated guesses of each one:
- In A step above, we pretend that we know the cluster centers, and based off this pretense, we guess which cluster each point belongs to. (This is also known as the E step in the EM algorithm.)
- In the B step above, we do the reverse: we pretend that we know which cluster each point belongs to, and then try to guess the cluster centers. (This is also known as the M step in EM.)
Our guesses keep getting better and better, and eventually we’ll converge.
Gaussian Mixture Models
k-means has a hard notion of clustering: point X either belongs to cluster C or it doesn’t. But sometimes we want a soft notion instead: point X belongs to cluster C with probability p (according to a Gaussian kernel). This is where Gaussian mixture modeling comes in.
To run a GMM, we start by initializing $k$ Gaussians (say, by randomly choosing $k$ points to be the centers of the Gaussians and by setting the variance of each Gaussians to be the overall variance), and then we repeatedly:
- E Step: Pretend we know the parameters of each Gaussian cluster, and assign each datapoint to Gaussian cluster i with appropriate probability.
- M Step: Pretend we know the probabilities that each point belongs to a given cluster. Using these probabilities, update the means and variances of each Gaussian cluster: the new mean for cluster i is the weighted mean over all points (where the weight of each point X is the probability that X belongs to cluster i), and similarly for the new variance.
This is exactly like k-means in the EM formulation, except we replace the binary clustering formula with Gaussian kernels.