What method do you use for selecting the optimum number of clusters in k-means and EM?

Many clustering algorithms are available. A popular one is k-means, which, given a number of clusters, iterates to find the best assignment of objects to clusters.

What method do you use to determine the number of clusters in the data in k-means clustering?

Does any package available in R contain the V-fold cross-validation method for determining the right number of clusters?
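For reference, the simplest baseline I know of is the "elbow" heuristic: run k-means for a range of k, plot the total within-cluster sum of squares, and look for the bend. A rough R sketch (iris used purely as a stand-in dataset):

    # Elbow heuristic: total within-cluster sum of squares versus k.
    # iris is only a stand-in dataset for illustration.
    x   <- scale(as.matrix(iris[, 1:4]))
    ks  <- 1:10
    wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
    plot(ks, wss, type = "b",
         xlab = "number of clusters k",
         ylab = "total within-cluster sum of squares")

The bend has to be read off the plot by eye, which is exactly why I am looking for something more automatic, such as V-fold cross-validation.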

Another widely used approach is the Expectation-Maximization (EM) algorithm, which assigns to each instance a probability distribution indicating how likely it is to belong to each of the clusters.

Is this algorithm implemented in R?

If it is, does it have the option to automatically select the optimum number of clusters by cross validation?

Do you prefer some other clustering method instead?
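(To be concrete about what I mean by automatic selection: the closest thing I am aware of is BIC-based selection of the number of mixture components, as in the mclust package, rather than cross-validation. A rough sketch, with iris again as a stand-in dataset:)

    # Model-based clustering: Mclust() fits Gaussian mixtures by EM over a
    # range of component counts and picks one, but by BIC, not cross-validation.
    library(mclust)
    x   <- scale(as.matrix(iris[, 1:4]))
    fit <- Mclust(x, G = 1:9)   # try 1 to 9 mixture components
    summary(fit)                # chosen model and number of components
    fit$G                       # selected number of clusters
    plot(fit, what = "BIC")     # BIC curves across candidate G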

Westwardly answered 22/2, 2010 at 17:53 Comment(5)
I intentionally left out hierarchical clustering because hclust is a rather memory-hungry method, not suitable for the large datasets I'm actually mostly interested in.Westwardly
Please define what you mean by "optimal"Atchley
Great question @Svante, I've been thinking a lot about that one. I even intended to write a package with several algorithms for the optimal number of clusters (hclust methods only). @hadley, I'm acquainted with: the C-H index (Caliński & Harabasz), the C-index, the Goodman-Kruskal gamma coefficient, and there's a way to "pick an optimal cluster solution" by utilizing an F-test. Here's a ref: Milligan, G.W. & Cooper, M.C. (1985). An Examination of Procedures for Determining the Number of Clusters in a Data Set, Psychometrika, 50, 159-179. Though I assume that you prefer a "graph-based" decision on the optimal solution...Claypoole
@hadley, in the sense of maximizing some score function having as arguments, say, the between-class distance and the within-class distance. See, for example, the method described in the paragraph "Optimal Number of Clusters" here: sandro.saitta.googlepages.com/…Westwardly
This may also come in handy: stats.stackexchange.com/questions/723/…Lubeck

For large "sparse" datasets I would seriously recommend the "affinity propagation" method. It has superior performance compared to k-means and is deterministic in nature.

It was published in the journal Science: http://www.psi.toronto.edu/affinitypropagation/
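If you want to try it from R, the apcluster package provides an implementation (my suggestion, not part of the original link); a minimal sketch with iris as a stand-in dataset:

    # Affinity propagation via the apcluster package; the number of clusters
    # is not fixed in advance but emerges from the similarity matrix.
    library(apcluster)
    x  <- as.matrix(iris[, 1:4])
    ap <- apcluster(negDistMat(r = 2), x)   # negative squared Euclidean similarities
    length(ap@clusters)                     # number of clusters found
    ap@exemplars                            # indices of the exemplar points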

However, the choice of the optimal clustering algorithm depends on the dataset under consideration. K-means is a textbook method, and it is very likely that someone has developed a better algorithm more suitable for your type of dataset.

This is a good tutorial by Prof. Andrew Moore (CMU, Google) on k-means and hierarchical clustering: http://www.autonlab.org/tutorials/kmeans.html

Concierge answered 28/2, 2010 at 22:54 Comment(0)

Last week I coded up such an estimate-the-number-of-clusters algorithm for a K-Means clustering program. I used the method outlined in:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.70.9687&rep=rep1&type=pdf

My biggest implementation problem was finding a suitable cluster validation index (i.e., error metric) that would work. Now it is mostly a matter of processing speed, but the results currently look reasonable.
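Not the index from that paper, but as a quick point of comparison in R, average silhouette width is an easy validation index to loop over (rough sketch, iris as a stand-in dataset):

    # Pick k by maximizing the average silhouette width over k-means runs.
    library(cluster)                 # provides silhouette()
    x  <- scale(as.matrix(iris[, 1:4]))
    ks <- 2:10
    avg_sil <- sapply(ks, function(k) {
      km <- kmeans(x, centers = k, nstart = 25)
      mean(silhouette(km$cluster, dist(x))[, "sil_width"])
    })
    best_k <- ks[which.max(avg_sil)]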

Samathasamau answered 21/10, 2010 at 1:35 Comment(0)
