Hierarchical Clustering: Determine the optimal number of clusters and statistically describe the clusters

I could use some advice on methods in R to determine the optimal number of clusters and, later on, to describe the clusters with different statistical criteria. I am new to R and have only basic knowledge of the statistical foundations of cluster analysis.

  1. Methods to determine the number of clusters: One common method in the literature is the so-called "elbow criterion", which compares the sum of squared differences (SSD) across cluster solutions: the SSD is plotted against the number of clusters, and the optimal number is read off at the "elbow" of the plot (e.g. here: https://en.wikipedia.org/wiki/File:DataClustering_ElbowCriterion.JPG). This method is a first approach to get a subjective impression, so I'd like to implement it in R; information on the internet about it is sparse. There is one good example here: http://www.mattpeeples.net/kmeans.html, where the author also uses an interesting iterative approach to check whether the elbow is stable over several repetitions of the clustering process (although it is for partitioning cluster methods, not hierarchical ones). Other methods in the literature are the so-called "stopping rules". Milligan & Cooper compared 30 of these stopping rules in their paper "An examination of procedures for determining the number of clusters in a data set" (available here: http://link.springer.com/article/10.1007%2FBF02294245) and found that the stopping rule of Calinski and Harabasz gave the best results in a Monte Carlo evaluation (see the sketch after this list). Information on implementing this in R is even sparser, so if anyone has ever implemented this or another stopping rule (or another method), advice would be very helpful.

  2. Statistically describe the clusters: For describing the clusters I thought of using the mean and some sort of variance criterion. My data is on agricultural land use and shows the production figures of different crops per municipality. My aim is to find similar patterns of land use in my dataset.
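
For the Calinski-Harabasz stopping rule from point 1, here is a minimal sketch, not a definitive implementation. It assumes the calinhara() function from the fpc package and that the numeric clustering variables sit in columns 4 to 24 of the subset aTOk built in the script below; the candidate range 2:8 is arbitrary.

    library(fpc) #provides calinhara(); install.packages("fpc") if needed
    x<-aTOk[,4:24]                        #numeric clustering variables
    hc<-hclust(dist(x),method="ward.D2")  #Ward linkage on Euclidean distances
    #Compute the Calinski-Harabasz index for each candidate cut of the tree;
    #the k with the largest index is the recommended number of clusters
    ch<-sapply(2:8, function(k)calinhara(x,cutree(hc,k)))
    names(ch)<-2:8
    ch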

I produced a script for a subset of objects to do a first test run. It looks like this (explanations of the steps are within the script, sources below).

    #Cluster analysis: agriculture

    #Load data
    agriculture<-read.table("C:\\Users\\etc...",header=TRUE,sep=";")

    #Define dataframe to work with (read.table already returns a data frame)
    df<-data.frame(agriculture)

    #Define a subset of the first 11 objects to test the script
    aTOk<-df[1:11,]

    #Calculate Euclidean distances including only columns 4 to 24
    dist.euclid<-dist(aTOk[,4:24],method="euclidean",diag=TRUE,upper=FALSE)
    print(dist.euclid)

    #Cluster with Ward's method ("ward" was renamed in R >= 3.1.0;
    #"ward.D2" implements the Ward criterion on Euclidean distances)
    cluster.ward<-hclust(dist.euclid,method="ward.D2")

    #Plot the dendrogram; labels=df$Geocode failed because df has more rows
    #than the plotted subset, so take the labels from aTOk instead
    plot(cluster.ward,labels=aTOk$Geocode,hang=-0.01,cex=0.7)

    #here are missing methods to determine the optimal number of clusters
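    #A possible elbow plot for this hierarchical clustering (a sketch, not
    #part of the original script): for each k, cut the tree, compute the
    #total within-cluster sum of squares over the clustering variables,
    #and look for the "elbow" where the curve flattens out
    wss<-sapply(1:10, function(k){
      groups<-cutree(cluster.ward,k)
      sum(sapply(unique(groups), function(g){
        sub<-aTOk[groups==g,4:24]     #objects in cluster g
        sum(scale(sub,scale=FALSE)^2) #squared deviations from the cluster mean
      }))
    })
    plot(1:10,wss,type="b",xlab="Number of clusters",ylab="Within-cluster SSD")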

    #Tabulate cluster sizes for solutions with 2 to 5 clusters
    #(the result is a list because the tables differ in length)
    n.cluster<-sapply(2:5, function(k)table(cutree(cluster.ward,k)))
    n.cluster

    #Show the objects within each cluster for the three-cluster solution
    three.cluster<-cutree(cluster.ward,3)
    sapply(unique(three.cluster), function(g)aTOk$Geocode[three.cluster==g])

    #Calculate some statistics to describe the clusters
    three.cluster.median<-aggregate(aTOk[,4:24],list(three.cluster),median)
    three.cluster.median
    three.cluster.min<-aggregate(aTOk[,4:24],list(three.cluster),min)
    three.cluster.min
    three.cluster.max<-aggregate(aTOk[,4:24],list(three.cluster),max)
    three.cluster.max
    #Summary statistics for one variable
    three.cluster.summary<-aggregate(aTOk[,4],list(three.cluster),summary)
    three.cluster.summary


Sources:

Grania answered 6/11, 2012 at 10:51 Comment(5)
You may want to take a look at Numerical Ecology with R, by Borcard, Gillet, and Legendre, which has a good chapter on cluster analysis: springer.com/statistics/life+sciences,+medicine+%26+health/book/…Matzo
I just ordered a copy of the book from our library and will have a look at it. Thank you for the advice!... I must admit that I find it quite strange that there are a lot of manuals on how to perform a cluster analysis and only a few on how to actually deal with the results :/Grania
In my opinion that's because there are many more people who know how to perform CA than understand the results! If you like the book you may also want to check out Legendre and Legendre Numerical Ecology, which is not R-specific but is very general and authoritative.Matzo
@Drew Steen thanks once more for the advice on the literature. I just received the book and it comes with quite a lot of interesting methods to deepen the cluster analysis and plot the dissimilarities. As I get through it I'll try to post a solution.Grania
I've answered a Q elsewhere that addresses part 1 of the above: https://mcmap.net/q/80153/-cluster-analysis-in-r-determine-the-optimal-number-of-clustersMozellamozelle

This is a very late answer and probably not useful for the asker anymore - but maybe for others. Check out the package NbClust. It contains 26 indices that give you a recommended number of clusters (and you can also choose your type of clustering). You can run it in such a way that you get the results for all the indices and then you can basically go with the number of clusters recommended by most indices. And yes, I think the basic statistics are the best way to describe clusters.
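
A minimal sketch of how that could look with the data from the question (assumptions: the aTOk subset and columns 4 to 24 from the question's script, "ward.D2" as the current name for Ward linkage, and an arbitrary candidate range of 2 to 8 clusters):

    library(NbClust) #install.packages("NbClust") if needed
    #Run all applicable indices and take the majority recommendation
    nb<-NbClust(data=aTOk[,4:24],distance="euclidean",min.nc=2,max.nc=8,
                method="ward.D2",index="all")
    nb$Best.nc        #number of clusters recommended by each index
    nb$Best.partition #partition under the majority recommendation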

Annabel answered 24/12, 2013 at 2:37 Comment(0)

The elbow criterion, as your links indicate, is for k-means. The cluster mean is likewise tied to k-means and is not appropriate for linkage clustering (in particular not for single linkage; see the single-link effect).

Your question title, however, mentions hierarchical clustering, and so does your code.

Note that the elbow criterion does not choose the optimal number of clusters. It chooses the optimal number of k-means clusters. If you use a different clustering method, it may need a different number of clusters.

There is no such thing as the objectively best clustering, and thus there is no objectively best number of clusters either. There is a rule of thumb for k-means that chooses a (maybe best) tradeoff between the number of clusters and the value of the target function (because increasing the number of clusters can always improve the target function), but that is mostly there to counter a deficit of k-means. It is by no means objective.

Cluster analysis in itself is not an objective task. A clustering may be mathematically good but useless. A clustering may score much worse mathematically yet give you insight into your data that cannot be measured mathematically.

Eichler answered 6/11, 2012 at 20:54 Comment(5)
Thank you for your answer. I think it highlights some important points in cluster analysis. I agree with you completely that there is no such thing as an objectively best clustering. Clustering methods are to a good degree subjective, and in fact I wasn't searching for an objective method to interpret the results of the cluster method. I was/am searching for a robust method to determine the best number of clusters in hierarchical clustering in R, one that best represents my data structure. I think this is a tricky point in cluster analysis because, as you mentioned, there are always a bunch...Grania
...of possible solutions. So besides the empirical interpretation, some statistical indicators can be used to determine a good number of clusters based on homogeneity inside the clusters and heterogeneity between them. The elbow criterion based on SSD is not necessarily linked to the k-means algorithm. Ward clustering is also based on minimizing the SSD within clusters (with the difference that this task is executed in a hierarchical way). Therefore the elbow in SSD can indicate a good number of homogeneous clusters where the SSD is still low inside clusters and high between them.Grania
For hierarchical clustering, the common approach is to look at the dendrogram. Just fixing the target number of clusters doesn't give you the option of cutting at different depth. A visual check helps a lot here.Eichler
If I understood it right, then looking at the dendrogram and plotting SSDs against the number of clusters is much the same, isn't it? The problem with looking at my dendrogram is that I have so many objects that it is too crowded to see anything. Maybe you have advice on how to plot it at a higher resolution (I'm quite new to R, so I get stuck on this basic stuff)? Maybe it would be interesting to plot both the dendrogram and the SSDs against the number of clusters...Grania
Only look at the top part of the dendrogram. The point is, you want to see if there is a clear threshold. If the dendrogram doesn't have big steps at the top, it's not significant. SSD can't capture this, because it tests one particular horizontal cut, not whether there is a good reason to choose this cut.Eichler
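
For the crowded-dendrogram problem raised in these comments, one option (a sketch, not from the thread; the cut height h=100 is a placeholder to be read off your own plot) is base R's cut() on a dendrogram object, which keeps only the branches above a chosen height:

    #Inspect only the top of a crowded dendrogram
    dend<-as.dendrogram(cluster.ward) #hclust object from the question
    top<-cut(dend,h=100)              #h=100 is a hypothetical height
    plot(top$upper,main="Top of the dendrogram")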

You can also try the R-NN Curves method. http://rguha.net/writing/pres/rnn.pdf

Autotrophic answered 11/4, 2013 at 21:57 Comment(2)
Thanks for your advice and the link! Have you ever done this in R?Grania
When working on segmentation I was handling time series of 48 points, so the R-NN curves method didn't fit my needs because decreasing the dimensionality removed the differences I was trying to highlight... But I could probably still help you. I must have, somewhere, a document much more detailed (including some scripts) than the simple one I've posted. I'll be back as soon as I find it.Autotrophic

K-means clustering is highly sensitive to the scale of the data. For example, with a person's age and salary, if the variables are not normalized, k-means would treat salary as a more important variable for clustering than age, which you do not want. So before applying the clustering algorithm, it is always good practice to normalize the data, bringing the variables to the same scale.
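
A short illustration of that advice with the kind of data in the question (the column range 4:24 is taken from the question's script):

    #Standardize each variable to mean 0 and unit variance before
    #computing distances and clustering
    df.scaled<-scale(df[,4:24])
    dist.scaled<-dist(df.scaled,method="euclidean")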

Diecious answered 23/4, 2019 at 9:16 Comment(0)
