hclust() in R on large datasets

I am trying to implement hierarchical clustering in R with hclust(), which requires a distance matrix created by dist(). My dataset has around a million rows, and even EC2 instances run out of RAM. Is there a workaround?
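To see why the matrix is the problem, a quick back-of-the-envelope calculation: dist() materializes n*(n-1)/2 pairwise distances as 8-byte doubles, which for a million rows is terabytes of memory.

## Memory needed by dist() for n rows: n*(n-1)/2 doubles, 8 bytes each
n = 1e6
n * (n - 1) / 2 * 8 / 2^40    ## ~3.6 TiB for the distance matrix alone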

Richierichlad asked 6/12, 2016 at 6:20. Comments (3):
The best approach here is probably to make subsets of your data that are closest to each other (perhaps using another algorithm, such as k-means, to form the subsets), then build hierarchical clusters of those subsets, and finally assign each cluster a location in the hierarchy. This post discusses the basic idea; its implementation is in Python, but most of it discusses ideas rather than code. (Aftermath)
Also, have a look at the Rclusterpp package for a more memory-efficient hierarchical clustering algorithm. (Aftermath)
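For illustration only: Rclusterpp (since archived on CRAN) exposed an hclust()-style function that works on the data matrix directly instead of a precomputed dist object. A minimal sketch, assuming its Rclusterpp.hclust() interface, with yourData as a placeholder for your numeric data:

## Sketch only: assumes the (now CRAN-archived) Rclusterpp package;
## yourData is a placeholder for your numeric matrix or data frame
library(Rclusterpp)
hc = Rclusterpp.hclust(as.matrix(yourData), method="ward", distance="euclidean")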
Possible duplicate of Large distance matrix in clustering. (Aftermath)

One possible solution is to sample your data, cluster the smaller sample, and then treat the clustered sample as training data for k-nearest neighbors to "classify" the rest of the data. Here is a quick example with 1.1M rows, using a sample of 5,000 points. The original data are not well separated, but with only 1/220 of the data, the sample is. Since your question referred to hclust, I used that, but you could use other clustering algorithms such as DBSCAN or mean shift.

## Generate 1.1M points in three overlapping Gaussian clusters
set.seed(2017)
x = c(rnorm(250000, 0,0.9), rnorm(350000, 4,1), rnorm(500000, -5,1.1))
y = c(rnorm(250000, 0,0.9), rnorm(350000, 5.5,1), rnorm(500000,  5,1.1))
XY = data.frame(x,y)
Sample5K = sample(length(x), 5000)     ## Downsample to 5,000 points

## Cluster the sample (single-linkage hierarchical clustering)
DM5K = dist(XY[Sample5K,])
HC5K = hclust(DM5K, method="single")
Groups = cutree(HC5K, 8)        ## Cut into 8 groups, then lump the tiny
Groups[Groups>4] = 4            ## outlier groups together as group 4
plot(XY[Sample5K,], pch=20, col=rainbow(4, alpha=c(0.2,0.2,0.2,1))[Groups])

[Figure: the clustered 5,000-point sample]

Now just assign all other points to the nearest cluster.

Core = which(Groups<4)    ## Keep only points in the three main clusters
library(class)
## 1-nearest-neighbor classification: the clustered core sample is the
## training set; all 1.1M points are assigned to the nearest cluster
knnClust = knn(XY[Sample5K[Core], ], XY, Groups[Core])
plot(XY, pch=20, col=rainbow(3, alpha=0.1)[knnClust])

[Figure: all 1.1M points colored by their assigned cluster]

A few quick notes.

  1. Because I created the data, I knew to choose three clusters. With a real problem, you would have to do the work of figuring out an appropriate number of clusters (see the sketch after these notes).
  2. Sampling 1/220 of the data could completely miss any small clusters; in a sample this size, they would just look like noise.
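As one illustration of note 1, a minimal sketch (assuming the cluster package, and reusing DM5K and HC5K from above): compute the average silhouette width on the sample for a range of candidate cluster counts and look for a peak.

## Sketch: average silhouette width on the 5,000-point sample for k = 2..10
library(cluster)
avgSil = sapply(2:10, function(k) mean(silhouette(cutree(HC5K, k), DM5K)[, "sil_width"]))
plot(2:10, avgSil, type="b", xlab="number of clusters k", ylab="average silhouette width")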
Navvy answered 1/1, 2017 at 1:06. Comments (3):
This clusters the data, but it doesn't do hierarchical clustering. Flat clustering just divides things into a set number of groups; hierarchical clustering builds a "family tree" for all the data, assigning each individual data point a specific place in the tree. (Aftermath)
@Aftermath - yes, I agree. It is not hierarchical clustering on the full data. (Navvy)
The end result isn't hierarchical clustering at all (or at best it is a hierarchy of 3): you threw out all of the relationships between the groups when you cut the tree and just used the top three clusters for your classifier. You could get a sort of semi-hierarchy if you kept all 5,000 leaves from hclust and assigned the rest of the data to each of the 5,000 branches. You could then make a real hierarchy (though with some potential errors) by running hclust on each of the groups and hooking them back up into the tree; see the sketch below. (Aftermath)
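A minimal sketch of the first step of that idea (with illustrative names, assuming the FNN package): attach every point to its single nearest sampled point, so each point inherits that sample point's leaf in the HC5K dendrogram.

## Sketch (assumes the FNN package): map each of the 1.1M points to its
## nearest sampled point, i.e. to one of the 5,000 leaves of HC5K
library(FNN)
nearestLeaf = get.knnx(XY[Sample5K,], XY, k=1)$nn.index[,1]
## nearestLeaf[i] is a row of the sample: the branch under which point i hangs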
