dist() function in R: vector size limitation

I was trying to draw a hierarchical clustering of some samples (40 of them) over some features (genes). I have a big table with 500k rows and 41 columns (the first one is the name), and when I tried

d <- dist(as.matrix(file), method = "euclidean")

I got this error

Error: cannot allocate vector of size 1101.1 Gb

How can I get around this limitation? I googled it and came across the ff package in R, but I don't quite understand whether that could solve my issue.

Thanks!

Whitby answered 17/10, 2013 at 20:6 Comment(6)
You are going to need to explore other methods. Perhaps delete this question and pose it on CrossValidated.com instead. – Basaltware
Perhaps someone will correct my arithmetic here, but I believe that you'd be talking about memory on the order of petabytes in order to just hold one copy of the distance matrix in memory. – Obscurantism
You are creating a distance matrix of size 500k x 500k. Are you aware of what you are doing? Why not use another clustering technique (e.g. from the stream package)? – Lyso
So I have 40 samples, and each sample was measured at each gene. I want to cluster the samples based on these measurements and hope to see that similar samples (like two biological replicates, or samples with similar biological properties) cluster together over the list of genes. I'm not familiar with stream; what is the advantage of the clustering techniques in stream compared to hclust in R? – Whitby
You are clustering the genes, not the samples, in your example. Depending on what you want to cluster and the amount of RAM, other techniques can be more suited. – Lyso
@jwijffels Yes, the answer below also pointed that out. I just tried transposing my data and now it works. I really need to learn the basics of R, I guess. – Whitby

Generally speaking, hierarchical clustering is not the best approach for dealing with very large datasets.

In your case, however, there is a different problem. If you want to cluster samples, the structure of your data is wrong. Observations should be represented as rows, and gene expression values (or whatever kind of data you have) as columns.

Let's assume you have data like this:

data <- as.data.frame(matrix(rnorm(n = 500000 * 40), ncol = 40))

What you want to do is:

 # Create transposed data matrix
 data.matrix.t <- t(as.matrix(data))

 # Create distance matrix
 dists <- dist(data.matrix.t)

 # Clustering
 hcl <- hclust(dists)

 # Plot
 plot(hcl)

NOTE

You should remember that Euclidean distances can be rather misleading when you work with high-dimensional data.
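
One commonly used alternative for this kind of data (a sketch of my own, not part of the original answer) is a correlation-based dissimilarity between samples; with genes in the rows and samples in the columns, cor() on the columns directly gives a 40 x 40 sample-by-sample matrix:

 # Correlation-based dissimilarity between the 40 samples (hedged sketch).
 # 'data' is the 500000 x 40 data frame simulated above.
 sample.cor <- cor(as.matrix(data))    # Pearson correlation between samples
 dists.cor <- as.dist(1 - sample.cor)  # turn similarity into dissimilarity
 hcl.cor <- hclust(dists.cor)
 plot(hcl.cor)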

Cartierbresson answered 17/10, 2013 at 20:28 Comment(3)
Oh, the dist function in R says it "computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix", so a whole row should be all the measurements for one sample. Thanks! – Whitby
Yeah, so many numbers are reduced to one dimension; I guess there could be artifacts. – Whitby
Manhattan distance could be a better metric in this case, but most of the time you should consider selecting a subset of interesting parameters and/or applying dimensionality reduction. – Cartierbresson

When dealing with large data sets, R is not the best choice.

The majority of methods in R seem to be implemented by computing a full distance matrix, which inherently needs O(n^2) memory and runtime. Matrix-based implementations don't scale well to large data unless the matrix is sparse (which a distance matrix, by definition, isn't).

I don't know if you realized it, but 1101.1 Gb is more than a terabyte. I don't think you have that much RAM, and you probably won't have the time to wait for this matrix to be computed either.
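
As a back-of-the-envelope check (my own arithmetic, not from the original answer): dist() stores the lower triangle of the n x n distance matrix as 8-byte doubles, so even at exactly 500,000 rows you are already in the terabyte range; the slightly larger figure R reports suggests the table has a bit more than 500k rows.

 # Rough memory estimate for dist() on ~500k observations
 n <- 5e5                    # number of rows from the question
 entries <- n * (n - 1) / 2  # pairwise distances stored by dist()
 entries * 8 / 1024^3        # ~931 GiB, same ballpark as the 1101.1 Gb error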

For example, ELKI is much more powerful for clustering, as you can enable index structures to accelerate many algorithms. This saves both memory (usually down to linear memory usage, e.g. for storing the cluster assignments) and runtime (usually down to O(n log n), i.e. one O(log n) operation per object).

But of course, it also varies from algorithm to algorithm. K-means, for example, which needs only point-to-mean distances, does not need (and cannot use) an O(n^2) distance matrix.
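
To illustrate that last point (a hedged sketch, not from the original answer; k = 5 is an arbitrary choice), base R's kmeans() handles the full 500k x 40 table with only linear memory, because it never materializes pairwise distances:

 # k-means on the full table: memory stays proportional to the data itself
 big <- matrix(rnorm(500000 * 40), ncol = 40)
 km <- kmeans(big, centers = 5)
 table(km$cluster)  # cluster sizes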

So in the end: I don't think the memory limit of R is your actual problem. The method you want to use doesn't scale.

Sweeney answered 18/10, 2013 at 7:44 Comment(0)

I just experienced a related issue, but with fewer rows (around 100 thousand, with 16 columns).
RAM size is the limiting factor.
To limit the memory footprint I used two different functions from two different packages. From parallelDist, the function parDist() lets you obtain the distances quite fast; it uses RAM during the process, of course, but the resulting dist object seems to take up less memory (no idea why).
Then I used the hclust() function, but from the package fastcluster. fastcluster is actually not that fast on such an amount of data, but it seems to use less memory than the default hclust().
Hope this is useful for anybody who finds this topic.
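
A minimal sketch of that combination (my own reconstruction, assuming the parallelDist and fastcluster packages are installed; the toy matrix is deliberately smaller than 100k rows so the example runs on a modest machine):

 # parDist() computes pairwise distances with multiple threads;
 # fastcluster's hclust() is a drop-in replacement for stats::hclust().
 library(parallelDist)
 library(fastcluster)

 x <- matrix(rnorm(10000 * 16), ncol = 16)   # toy data, 10k rows x 16 columns
 d <- parDist(x, method = "euclidean")       # parallel distance computation
 hc <- hclust(d, method = "complete")        # fastcluster's hclust is used here
 plot(hc, labels = FALSE)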

Ettie answered 15/1, 2019 at 9:50 Comment(0)
