Clustering list for hclust function

Using plot(hclust(dist(x))), I was able to draw a cluster dendrogram. It works, but I would like to get a list of all clusters rather than a tree diagram, because I have a huge amount of data (around 150K nodes) and the plot gets messy.

In other words, let's say a, b, c form one cluster and d, e, f, g form another. Then I would like to get something like this:

1 a,b,c
2 d,e,f,g

Please note that this is not exactly what I want as "output"; it is just an example. I would like to get a list of clusters instead of a tree plot. It could be a vector, a matrix, or simply numbers that show which group each element belongs to.

How is this possible?

Jeopardous answered 29/6, 2011 at 9:5 Comment(1)
This may help: #28378613 - Walkway

I will use a dataset that ships with R to demonstrate how to cut a tree into the desired number of pieces. The result is a named vector giving the group of each observation.

Construct a hclust object.

hc <- hclust(dist(USArrests), "ave")
#plot(hc)

You can now cut the tree into as many groups as you want. For my next trick, I will split the tree into two groups. You set the number of groups with the k parameter. See ?cutree and the use of parameter h (the height at which to cut), which may be more useful to you (see cutree(hc, k = 2) == cutree(hc, h = 110)).

cutree(hc, k = 2)
       Alabama         Alaska        Arizona       Arkansas     California 
             1              1              1              2              1 
      Colorado    Connecticut       Delaware        Florida        Georgia 
             2              2              1              1              2 
        Hawaii          Idaho       Illinois        Indiana           Iowa 
             2              2              1              2              2 
        Kansas       Kentucky      Louisiana          Maine       Maryland 
             2              2              1              2              1 
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
             2              1              2              1              2 
       Montana       Nebraska         Nevada  New Hampshire     New Jersey 
             2              2              1              2              2 
    New Mexico       New York North Carolina   North Dakota           Ohio 
             1              1              1              2              2 
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
             2              2              2              2              1 
  South Dakota      Tennessee          Texas           Utah        Vermont 
             2              2              2              2              2 
      Virginia     Washington  West Virginia      Wisconsin        Wyoming 
             2              2              2              2              2
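
To get from that named vector to the "1 a,b,c" style list asked for in the question, one option is base R's split(). This is a minimal sketch on the same USArrests tree; the names groups, clusters and cluster_lines are only illustrative.

```r
hc <- hclust(dist(USArrests), "average")
groups <- cutree(hc, k = 2)

# One list element per cluster, each holding the names of its members
clusters <- split(names(groups), groups)

# One string per cluster in the form "<cluster id> <comma-separated members>"
cluster_lines <- paste(names(clusters),
                       sapply(clusters, paste, collapse = ","))

# Cluster sizes, which is often a more useful summary for large data
cluster_sizes <- table(groups)
```

For a data set with 150K rows, printing cluster_lines is unlikely to be readable; cluster_sizes or writing clusters to a file may be more practical.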
Squabble answered 29/6, 2011 at 9:36 Comment(3)
Excellent, thank you! This makes me wonder how one can estimate a good value for the parameter "k", so that the number of clusters reflects the data rather than what I want it to be. In other words, what if I don't know how many cuts to make because I don't know how many clusters there are in the data? That is in fact what I am trying to find out: the number of clusters and the elements within each cluster. Sorry if I was not clear earlier. - Jeopardous
@dave, is it possible for you to know at which height you want to cut the tree? If yes, use the parameter h (see ?cutree). The function will return the appropriate number of groups (and allegiance of leaves). - Goosestep
I see, maybe this is what I can do. hclust objects have components such as the merge matrix, heights, etc. Let's say a is an hclust object; we can access the possible heights using a$height. So maybe by selecting the max height from those values, I can find out the number of possible clusters. That is what I was able to find through my reading. - Jeopardous
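
The idea in these comments (inspect the merge heights, then cut with h) could be sketched as below. Cutting in the middle of the largest gap between consecutive merge heights is only one heuristic, assumed here for illustration; it is not something cutree does for you, and the names heights, gaps, h_cut and groups_h are hypothetical. The sketch also assumes the merge heights are non-decreasing, as they are for average linkage.

```r
hc <- hclust(dist(USArrests), "average")

# hc$height is a numeric vector with one value per merge (n - 1 values)
heights <- hc$height

# Heuristic (an assumption, not a rule): find the largest jump between
# consecutive merge heights and cut halfway through it
gaps <- diff(heights)
i <- which.max(gaps)
h_cut <- (heights[i] + heights[i + 1]) / 2

# Cut at that height; the number of groups follows from the data
groups_h <- cutree(hc, h = h_cut)
n_clusters <- length(unique(groups_h))
```

Here n_clusters is determined by the chosen height rather than fixed in advance, which is the behaviour the comment about h describes.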

Let's say:

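y <- dist(x)                    # pairwise distances between the rows of x
clust <- hclust(y)              # hierarchical clustering (default linkage)
groups <- cutree(clust, k = 3)  # cluster id (1 to 3) for each row of x
x <- cbind(x, groups)           # append the cluster id as a new column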

Now each record has its cluster group attached. You can subset the dataset as well:

x1 <- subset(x, groups == 1)
x2 <- subset(x, groups == 2)
x3 <- subset(x, groups == 3)
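
When the number of groups is larger or not known in advance, writing one subset() call per group gets tedious. A possible alternative, sketched here with USArrests standing in for x (an assumption, since the original x is not shown), is to split the data frame by group in one step:

```r
x <- USArrests                          # stand-in data; replace with your own
groups <- cutree(hclust(dist(x)), k = 3)

# A list with one data frame per cluster, named "1", "2", "3"
by_cluster <- split(x, groups)

# Number of records in each cluster
cluster_rows <- sapply(by_cluster, nrow)
```

by_cluster[["1"]] then corresponds to x1 above, without hard-coding the number of groups.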
Deucalion answered 16/9, 2013 at 11:21 Comment(0)
