Efficient k-means evaluation with silhouette score in sklearn

I am running k-means clustering on a dataset with around 1 million items and around 100 attributes. I applied clustering for various values of k, and I want to evaluate the different groupings with the silhouette score from the sklearn library. Running it with no sampling seems infeasible and takes prohibitively long, so I assume I need to use sampling, i.e.:

metrics.silhouette_score(feature_matrix, cluster_labels, metric='euclidean', sample_size=???)

I don't have a good sense of what an appropriate sampling approach is. Is there a rule of thumb for the ideal sample size given the size of my matrix? Is it better to take the largest sample size my analysis machine can handle, or to take the average of smaller samples?

My preliminary test with sample_size=10000 has produced some unintuitive results.

I am open to alternative, more scalable evaluation metrics.


Edit, to visualize the issue:

The plot shows the silhouette score against the number of clusters.

In my opinion, it is normal behavior that increasing the sample size reduces the noise. But given that I have 1 million heterogeneous vectors, it seems unintuitive that 2 or 3 would be the "best" number of clusters. In other words, I would not expect to see a monotonic decrease in the silhouette score as I increase the number of clusters.

Cyrilcyrill answered 15/5, 2014 at 19:41 Comment(7)
Define unintuitive results, and try rerunning that test multiple times with different sample sizes.Plutocracy
Running code to generate a clarifying plot. Will edit and post asap.Cyrilcyrill
Those silhouette scores are pretty low. Data with strong cluster structure will give you silhouette scores above 0.7 or so. Have you tried using the Gap Statistic to estimate the proper number of clusters? Another possibility is that some of the 100 features are adding noise and are hiding clusters. You might try PCA to get rid of some of the noise.Mainmast
I've also encountered a similar problem. When I increased the number of clusters, the silhouette score computed by sklearn.metrics.silhouette_score decreased monotonically, and I can't figure out why this happened.Shontashoo
@AnnabellChan did you ever figure out what was going on with sklearn.metrics.silhouette_score? I have the same problem of monotonically decreasing values with larger k.Periphrastic
@Periphrastic not yet, but I read a paper discussing the main internal validation measures (see Understanding the Internal Clustering Validation Measures) and replaced the silhouette score with SDbw, which was demonstrated to be the most robust index in that paper.Shontashoo
All things being equal, the silhouette score will decrease if you increase the number of clusters or increase the number of features used as anchors for the model. Another thing to keep in mind: just as with correlation scores, suggesting from a real-life application standpoint that only 0.7 and above are good scores is not realistic.Tennyson

Other metrics

  1. Elbow method: Compute the % variance explained for each K, and choose the K where the plot starts to level off (a good description is here: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set). Obviously, if you have K == the number of data points, you can explain 100% of the variance; the question is where the improvements in variance explained start to level off (see the sketch after this list).

  2. Information theory: If you can calculate a likelihood for a given K, then you can use the AIC, AICc, or BIC (or any other information-theoretic approach). E.g. for the AICc, it just balances the increase in likelihood as you increase K against the increase in the number of parameters you need. In practice, all you do is choose the K that minimises the AICc.

  3. You may be able to get a feel for a roughly appropriate K by running alternative methods that give you back an estimate of the number of clusters, like DBSCAN. I haven't seen this approach used to estimate K, and it is likely inadvisable to rely on it like this. However, if DBSCAN also gave you a small number of clusters here, then there's likely something about your data that you might not be appreciating (i.e. not as many clusters as you're expecting).
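As a rough sketch of points 1 and 2, assuming the feature matrix X from the question is already in memory (the range of K values and the use of a Gaussian mixture's BIC as a stand-in for a k-means likelihood are illustrative choices, not part of the original answer):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

ks = range(2, 16)  # illustrative range of candidate K values
inertias, bics = [], []

for k in ks:
    # 1. Elbow method: record the within-cluster sum of squares ("inertia").
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

    # 2. Information criterion: BIC of a Gaussian mixture with k components,
    #    used here as a likelihood-based proxy for k-means.
    gm = GaussianMixture(n_components=k, covariance_type='diag',
                         random_state=0).fit(X)
    bics.append(gm.bic(X))

# Plot inertias against ks and look for the elbow; pick the K that minimises the BIC.
best_k_bic = list(ks)[int(np.argmin(bics))]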

How much to sample

It looks like you've answered this with your plot: no matter what your sample size, you get the same pattern in the silhouette score. So that pattern seems very robust to sampling assumptions.

Reposeful answered 2/5, 2017 at 2:46 Comment(0)

k-means converges to local minima, so the starting positions play a crucial role in the optimal number of clusters. It is often a good idea to reduce noise and dimensionality using PCA or another dimensionality reduction technique before proceeding with k-means.
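A minimal sketch of that preprocessing idea, assuming the feature matrix X from the question; the number of components and clusters are purely illustrative:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Scale the features, project onto the leading principal components,
# then cluster in the reduced space (20 components and 5 clusters are
# illustrative values only).
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=20, random_state=0),
    KMeans(n_clusters=5, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)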

Just to add for the sake of completeness: it might be a good idea to get the optimal number of clusters with "partitioning around medoids" (PAM). It is equivalent to using the silhouette method.
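scikit-learn itself does not ship PAM, but, assuming the separate scikit-learn-extra package is available, a k-medoids sketch could look like the following, picking K by the (sampled) silhouette score in line with the approach above:

from sklearn_extra.cluster import KMedoids  # requires scikit-learn-extra
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 11):  # illustrative range of K
    labels = KMedoids(n_clusters=k, metric='euclidean',
                      random_state=0).fit_predict(X)
    # Sampled silhouette keeps the evaluation tractable on large data.
    scores[k] = silhouette_score(X, labels, metric='euclidean',
                                 sample_size=10000, random_state=0)

best_k = max(scores, key=scores.get)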

The reason for the weird observations could be the different starting points for different-sized samples.

Having said all the above, it is important to evaluate the clusterability of the dataset at hand. A tractable way to do this is the worst-pair ratio, as discussed here: Clusterability.

Spud answered 30/5, 2017 at 14:34 Comment(0)

Since there is no widely accepted best approach to determine the optimal number of clusters, all evaluation techniques, including Silhouette Score, Gap Statistic, etc., fundamentally rely on some form of heuristic / trial-and-error argument. So to me, the best approach is to try out multiple techniques and NOT to develop over-confidence in any of them.

In your case, the ideal and most accurate score would be calculated on the entire data set. However, if you need to use partial samples to speed up the computation, you should use the largest sample size your machine can handle. The rationale is the same as getting as many data points as possible out of the population of interest.

One more thing is that the sklearn implementation of the silhouette score uses random (non-stratified) sampling. You can repeat the calculation multiple times using the same sample size (say sample_size=50000) to get a sense of whether the sample size is large enough to produce consistent results.
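A rough sketch of that consistency check, reusing feature_matrix and cluster_labels from the question (five repeats is an arbitrary illustrative choice):

import numpy as np
from sklearn.metrics import silhouette_score

# Repeat the sampled computation with different seeds; if the spread is
# small, the sample size is probably large enough.
scores = [silhouette_score(feature_matrix, cluster_labels,
                           metric='euclidean', sample_size=50000,
                           random_state=seed)
          for seed in range(5)]
print(np.mean(scores), np.std(scores))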

Virescence answered 12/2, 2020 at 3:32 Comment(0)

I stumbled over the same unintuitive monotonic decrease in the silhouette score as I increased the number of clusters.

However, taking a closer look at the equation for the intra-cluster distance a(i), you can see that this distance is only calculated over the other data points j in the same cluster as i. So with an increasing number of clusters, and thus an increasing number of clusters containing only a single sample, those points' silhouette values default to 0, which reduces the overall silhouette score.

The definition, from Wikipedia (https://en.wikipedia.org/wiki/Silhouette_(clustering)), is

a(i) = 1 / (|C_i| - 1) * sum over j in C_i, j != i of d(i, j)

with the convention that s(i) = 0 whenever |C_i| = 1.
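A tiny toy example of that edge case (the data is made up purely for illustration): the point that sits alone in its cluster gets a silhouette value of 0, which pulls the mean score down.

import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two tight two-point clusters plus one singleton cluster (label 2).
X = np.array([[0.0, 0.0], [0.0, 1.0],
              [10.0, 10.0], [10.0, 11.0],
              [50.0, 50.0]])
labels = np.array([0, 0, 1, 1, 2])

print(silhouette_samples(X, labels))  # last entry (the singleton) comes out as 0.0
print(silhouette_score(X, labels))    # the mean is dragged down by that 0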

Clincher answered 3/5 at 9:0 Comment(0)
