hclust() in R on large datasets

I am trying to implement hierarchical clustering in R with hclust(), which requires a distance matrix created by dist(). My dataset has around a million rows, and even EC2 instances run out of RAM. Is there a workaround?
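To see why the matrix is the problem, a quick back-of-the-envelope calculation: dist() materializes n*(n-1)/2 pairwise distances as 8-byte doubles, which for a million rows is terabytes of memory.

## Memory needed by dist() for n rows: n*(n-1)/2 doubles, 8 bytes each
n = 1e6
n * (n - 1) / 2 * 8 / 2^40    ## ~3.6 TiB for the distance matrix alone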

Richierichlad asked 6/12, 2016 at 6:20. Comments (3):
The best approach here is probably to make subsets of your data that are closest to each other (perhaps using another algorithm, such as k-means, to form the subsets), then build hierarchical clusters of those subsets, and finally assign each cluster a location in the hierarchy. This post discusses the basic idea; its implementation is in Python, but most of it discusses ideas rather than code. (Aftermath)
Also, have a look at the Rclusterpp package for a more memory-efficient hierarchical clustering algorithm. (Aftermath)
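For illustration only: Rclusterpp (since archived on CRAN) exposed an hclust()-style function that works on the data matrix directly instead of a precomputed dist object. A minimal sketch, assuming its Rclusterpp.hclust() interface, with yourData as a placeholder for your numeric data:

## Sketch only: assumes the (now CRAN-archived) Rclusterpp package;
## yourData is a placeholder for your numeric matrix or data frame
library(Rclusterpp)
hc = Rclusterpp.hclust(as.matrix(yourData), method="ward", distance="euclidean")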
Possible duplicate of Large distance matrix in clustering. (Aftermath)

One possible solution is to sample your data, cluster the smaller sample, and then treat the clustered sample as training data for k-nearest neighbors to "classify" the rest of the data. Here is a quick example with 1.1M rows, using a sample of 5,000 points. The original data are not well separated, but with only 1/220 of the data, the sample is. Since your question referred to hclust, I used that, but you could use other clustering algorithms such as DBSCAN or mean shift.

## Generate 1.1M points in three overlapping Gaussian clusters
set.seed(2017)
x = c(rnorm(250000, 0,0.9), rnorm(350000, 4,1), rnorm(500000, -5,1.1))
y = c(rnorm(250000, 0,0.9), rnorm(350000, 5.5,1), rnorm(500000,  5,1.1))
XY = data.frame(x,y)
Sample5K = sample(length(x), 5000)     ## Downsample to 5,000 points

## Cluster the sample (single-linkage hierarchical clustering)
DM5K = dist(XY[Sample5K,])
HC5K = hclust(DM5K, method="single")
Groups = cutree(HC5K, 8)        ## Cut into 8 groups, then lump the tiny
Groups[Groups>4] = 4            ## outlier groups together as group 4
plot(XY[Sample5K,], pch=20, col=rainbow(4, alpha=c(0.2,0.2,0.2,1))[Groups])

[Figure: the clustered 5,000-point sample]

Now just assign all other points to the nearest cluster.

Core = which(Groups<4)    ## Keep only points in the three main clusters
library(class)
## 1-nearest-neighbor classification: the clustered core sample is the
## training set; all 1.1M points are assigned to the nearest cluster
knnClust = knn(XY[Sample5K[Core], ], XY, Groups[Core])
plot(XY, pch=20, col=rainbow(3, alpha=0.1)[knnClust])

[Figure: all 1.1M points colored by their assigned cluster]

A few quick notes.

  1. Because I created the data, I knew to choose three clusters. With a real problem, you would have to do the work of figuring out an appropriate number of clusters (see the sketch after these notes).
  2. Sampling 1/220 of the data could completely miss any small clusters; in a sample this size, they would just look like noise.
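As one illustration of note 1, a minimal sketch (assuming the cluster package, and reusing DM5K and HC5K from above): compute the average silhouette width on the sample for a range of candidate cluster counts and look for a peak.

## Sketch: average silhouette width on the 5,000-point sample for k = 2..10
library(cluster)
avgSil = sapply(2:10, function(k) mean(silhouette(cutree(HC5K, k), DM5K)[, "sil_width"]))
plot(2:10, avgSil, type="b", xlab="number of clusters k", ylab="average silhouette width")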
Navvy answered 1/1, 2017 at 1:06. Comments (3):
This clusters the data, but it doesn't do hierarchical clustering. Flat clustering just divides things into a set number of groups; hierarchical clustering builds a "family tree" for all the data, assigning each individual data point a specific place in the tree. (Aftermath)
@Aftermath - yes, I agree. It is not hierarchical clustering on the full data. (Navvy)
The end result isn't hierarchical clustering at all (or at best it is a hierarchy of 3): you threw out all of the relationships between the groups when you cut the tree and just used the top three clusters for your classifier. You could get a sort of semi-hierarchy if you kept all 5,000 leaves from hclust and assigned the rest of the data to each of the 5,000 branches. You could then make a real hierarchy (though with some potential errors) by running hclust on each of the groups and hooking them back up into the tree; see the sketch below. (Aftermath)
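A minimal sketch of the first step of that idea (with illustrative names, assuming the FNN package): attach every point to its single nearest sampled point, so each point inherits that sample point's leaf in the HC5K dendrogram.

## Sketch (assumes the FNN package): map each of the 1.1M points to its
## nearest sampled point, i.e. to one of the 5,000 leaves of HC5K
library(FNN)
nearestLeaf = get.knnx(XY[Sample5K,], XY, k=1)$nn.index[,1]
## nearestLeaf[i] is a row of the sample: the branch under which point i hangs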
