I'm trying to create dist
objects from large distance matrices. I'm running out of memory using stats::as.dist
. For example, I have around 128 Gb available on current machine but as.dist
runs out of memory when processing a 73,000 x 73,000 matrix (approx 42Gb). Given that the final dist
object should be less than half the size of the matrix (i.e. it is the lower triangle, stored as a vector) it seems to me that it should be possible to do this calculation in a more memory-efficient way - provided we are careful not to create large intermediate objects and just copy the relevant elements of the input directly to the output.
Looking at the code for getS3method('as.dist', 'default')
, I see that it does the calculation using ans <- m[row(m) > col(m)]
which requires the creation of row
and col
matrices of the same dimensionality as the input.
I thought I might be able to improve on this using the algorithm from here to generate the indexes of the lower triangle. Here is my attempt using this method.
as.dist.new = function(dm, diag = FALSE, upper = FALSE) {
n = dim(dm)[1]
stopifnot(is.matrix(dm))
stopifnot(dim(dm)[2] == n)
k = 1:((n^2 - n)/2)
j <- floor(((2 * n + 1) - sqrt((2 * n - 1) ^ 2 - 8 * (k - 1))) / 2)
i <- j + k - (2 * n - j) * (j - 1) / 2
idx = cbind(i,j)
remove(i,j,k)
gc()
d = dm[idx]
class(d) <- "dist"
attr(d, "Size") <- n
attr(d, "call") <- match.call()
attr(d, "Diag") <- diag
attr(d, "Upper") <- upper
d
}
This works fine on smaller matrices. Here's a simple example:
N = 10
dm = matrix(runif(N*N), N, N)
diag(dm) = 0
x = as.dist(dm)
y = as.dist.new(dm)
However, if we create a larger distance matrix it runs into the same memory issues as as.dist
.
E.g. this version crashes on my system:
N = 73000
dm = matrix(runif(N*N), N, N)
gc()
diag(dm) = 0
gc()
as.dist.new(dm)
Does anyone have a suggestion how to perform this operation more efficiently? R or Rcpp solutions welcome. NB looking at this answer to a related problem (generating the full distance matrix from 2-column location data) it seems that it may be possible to do this using RcppArmadillo
, but I've got no experience of using that.