Efficient (memory-wise) function for repeated distance matrix calculations AND chunking of extra large distance matrices

I wonder if anyone could have a look at the following code and minimal example and suggest improvements - in particular regarding efficiency of the code when working with really large data sets.

The function takes a data.frame and splits it by a grouping variable (factor) and then calculates the distance matrix for all the rows in each group.

I do not need to keep the distance matrices themselves - only some statistics, e.g. the mean and the histogram - after which they can be discarded.

I don't know much about memory allocation and the like, and I am wondering what the best way to do this would be, since I will be working with 10,000 - 100,000 cases per group. Any thoughts will be greatly appreciated!

Also, what would be the least painful way of incorporating bigmemory or some other large-data package into the function as it is, in case I run into serious memory issues?

FactorDistances <- function(df) {
  # df is the data frame where the first column is the grouping variable. 
  # find the names and number of groups in df (in the example there are three: 2, 3, 4)
  factor.names <- unique(df[1])
  n.factors <- length(unique(df$factor))
  # split df by factor into a list - each subset data frame is one list element
  df.l <- list()
  for (f in 1:n.factors) {df.l[[f]]<-df[which(df$factor==factor.names[f,]),]}
  # use lapply to go through list and calculate distance matrix for each group
  # this results in a new list where each element is a distance matrix
  distances <- lapply (df.l, function(x) dist(x[,2:length(x)], method="minkowski", p=2))  
  # again use lapply to get the mean distance for each group
  means <- lapply (distances,  mean)  
  rm(distances)
  gc()
  return(means)
}

df <- data.frame(cbind(factor=rep(2:4,2:4), rnorm(9), rnorm(9)))
FactorDistances(df)
# The result is three average Euclidean distances, one for all pairs within each group
# If a group has only one member, the value is NaN

Edit: I edited the title to reflect the chunking issue I posted as an answer below.

Thermoluminescent answered 19/11, 2012 at 15:15 Comment(2)
Looking through the code, I developed the suspicion it might not be doing what you wanted to accomplish. However, the lack of any comments within the code prevents us from understanding what you think each line should be constructing.Deutzia
Sorry, I added comments now (and cleared out some clutter) - hope it is clearer now!Thermoluminescent

I've come up with a chunking solution for those extra large matrices that dist() can't handle, which I'm posting here in case anyone else finds it helpful (or finds fault with it, please!). It is significantly slower than dist(), but that is kind of irrelevant, since it should only ever be used when dist() throws an error - usually one of the following:

"Error in double(N * (N - 1)/2) : vector size specified is too large" 
"Error: cannot allocate vector of size 6.0 Gb"
"Error: negative length vectors are not allowed"

The function calculates the mean distance for the matrix, but you can change that to anything else; in case you want to actually save the matrix, I believe some sort of file-backed bigmemory matrix is in order. Kudos to link for the idea and Ari for his help!

FunDistanceMatrixChunking <- function (df, blockSize = 100){
  n <- nrow(df)
  # number of row blocks, rounded up if n is not a multiple of blockSize
  blocks <- n %/% blockSize
  if ((n %% blockSize) > 0) blocks <- blocks + 1
  # one row per chunk: the number of distances in the chunk and their mean
  chunk.means <- matrix(NA, nrow = blocks * (blocks + 1) / 2, ncol = 2)
  dex <- 1:blockSize
  chunk <- 0
  for (i in 1:blocks) {
    # row indices of block i (the last block may be shorter than blockSize)
    p <- dex + (i - 1) * blockSize
    lex <- (blockSize + 1):(2 * blockSize)
    lex <- lex[p <= n]   # positions of block i's rows inside the combined matrix below
    p <- p[p <= n]
    for (j in 1:blocks) {
      # row indices of block j
      q <- dex + (j - 1) * blockSize
      q <- q[q <= n]
      if (i == j) {
        # diagonal chunk: all distances within block i
        chunk <- chunk + 1
        x <- dist(df[p, , drop = FALSE])  # drop = FALSE keeps a single-row block as a matrix
        # a single-row block has no within-block distances; give it zero weight
        chunk.means[chunk, ] <- c(length(x), if (length(x) > 0) mean(x) else 0)
      }
      if (i > j) {
        # off-diagonal chunk: distances between block i (rows) and block j (columns),
        # taken from the rectangular corner of the combined distance matrix
        chunk <- chunk + 1
        x <- as.matrix(dist(df[c(q, p), ]))[lex, dex]
        chunk.means[chunk, ] <- c(length(x), mean(x))
      }
    }
  }
  # weighting each chunk mean by its number of distances reproduces the overall mean
  weighted.mean(chunk.means[, 2], chunk.means[, 1])
}
df <- cbind(var1 = rnorm(1000), var2 = rnorm(1000))
mean(dist(df))                                  # exact mean over all pairs
FunDistanceMatrixChunking(df, blockSize = 100)  # should agree up to floating-point error
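
Since I also need histograms (see the comments under the other answer), the same chunking idea should work for those too, as long as the breaks are fixed in advance so that the per-chunk counts can simply be added up. Here is an untested sketch along those lines - the breaks default is just a placeholder and has to span all observed distances, otherwise hist() will throw an error:

FunDistanceHistChunking <- function (df, blockSize = 100, breaks = seq(0, 10, by = 0.1)){
  n <- nrow(df)
  blocks <- n %/% blockSize
  if ((n %% blockSize) > 0) blocks <- blocks + 1
  counts <- numeric(length(breaks) - 1)
  for (i in 1:blocks) {
    p <- ((i - 1) * blockSize + 1):min(i * blockSize, n)
    for (j in 1:i) {
      if (i == j) {
        x <- as.vector(dist(df[p, , drop = FALSE]))
      } else {
        q <- ((j - 1) * blockSize + 1):min(j * blockSize, n)
        # rectangular corner: distances between block i (rows) and block j (columns)
        x <- as.vector(as.matrix(dist(df[c(q, p), ]))[length(q) + seq_along(p), seq_along(q)])
      }
      # identical breaks in every chunk, so the counts can just be summed
      if (length(x) > 0) counts <- counts + hist(x, breaks = breaks, plot = FALSE)$counts
    }
  }
  counts
}

On the small df above, the result should be comparable against hist(as.vector(dist(df)), breaks = breaks, plot = FALSE)$counts.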

Not sure whether I should have posted this as an edit instead of an answer. It does solve my problem, although I didn't really specify it this way.

Thermoluminescent answered 22/11, 2012 at 0:43 Comment(1)
Posting it as an answer was the right call. Thanks for bringing your solution back to the community.Confidante

A few thoughts:

  • unique(df[1]) probably works (by ignoring the data.frame property of your list), but it makes me nervous and is hard to read. unique(df[,1]) would be better.
  • for (f in 1:n.factors) {df.l[[f]]<-df[which(df$factor==factor.names[f,]),]} can be done with split (see the sketch after this list).
  • If you're worried about memory, definitely don't store the entire distance matrix for every level and only then calculate your summary statistics! Change your lapply to something like: lapply(df.l, function(x) mean(dist(x[,2:length(x)], method="minkowski", p=2))).
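
Putting the second and third points together, a minimal (untested) sketch of the split-based version might look like this, assuming the column layout from your example (grouping variable in the first column):

FactorDistances <- function(df) {
  # split() replaces the manual loop; each distance matrix is reduced to its
  # mean right away and never kept around
  df.l <- split(df, df$factor)
  lapply(df.l, function(x) mean(dist(x[, -1], method = "minkowski", p = 2)))
}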

If you need more than one summary statistic, calculate both and return a list:

lapply(df.l, function(x) {
  dmat <- dist(x[,2:length(x)], method="minkowski", p=2)
  list(mean = mean(dmat), median = median(dmat))
})

See if that gets you anywhere. If not, you may have to go more specialized (avoiding lapply, storing your data.frames as matrices instead, etc.)
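
For instance, one more specialized (and untested) variant might convert the numeric columns to a plain matrix up front and loop over index vectors, so no intermediate data-frame copies hang around between groups:

m <- as.matrix(df[, -1])                     # numeric columns only, grouping column dropped
groups <- split(seq_len(nrow(m)), df$factor)
means <- numeric(length(groups))
for (g in seq_along(groups)) {
  means[g] <- mean(dist(m[groups[[g]], , drop = FALSE], method = "minkowski", p = 2))
  gc()  # optionally force garbage collection between groups
}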

Confidante answered 19/11, 2012 at 21:13 Comment(4)
(1.) split, of course! (2.) And your last bit of code is exactly what I was looking for, since I am calculating more than just the mean but couldn't figure out how to get the results out! (3.) While the code is now working fine for reasonable sizes, it did end with a "could not allocate vector of size 2.9 Gb", so I need to find another solution. What did you mean by "avoiding lapply"? I don't think the data.frames are an issue memory-wise, just the distance matrices.Thermoluminescent
You may be ok with lapply, depending on whether garbage collection runs between rounds or not (it likely does - R is pretty good about it). So don't worry about that. For your 2.9 Gb problem, I'd figure out which group triggers the error and then run that distance matrix calculation by itself. If that still produces the error, then you know where to focus. Or just get an Amazon EC2 cluster with lots of memory and run it on that. 2.9 Gb isn't that big.Confidante
Thanks! The error gets triggered by a 40K-by-40K distance matrix, but in addition to the groups above, I also intend to do all of the pairs (200K), hopefully without resorting to sampling. Before I try EC2 I'm gonna try doing it in blocks. I need the means, but also the histograms, and those I can "sum up" if I make sure the breaks are the same. I've found this link which looks helpful, although I'm not sure I can get a histogram out of a filebacked.big.matrix, so I'll start with just the block option.Thermoluminescent
Doing it in blocks is a great idea. Even if you can calculate it all as one big file-backed matrix, it's not necessarily better...disks are slow.Confidante
