Could anyone have a look at the following code and minimal example and suggest improvements, in particular regarding the efficiency of the code when working with really large data sets?
The function takes a data.frame, splits it by a grouping variable (factor), and then calculates the distance matrix for all the rows in each group.
I do not need to keep the distance matrices, only some statistics, i.e. the mean, the histogram, etc.; after that they can be discarded.
I don't know much about memory allocation and the like and am wondering what would be the best way to do this, since I will be working with 10,000 to 100,000 cases per group. Any thoughts will be greatly appreciated!
Also, what would be the least painful way of including bigmemory or some other large data handling package in the function as is, in case I run into serious memory issues?
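For a rough sense of scale, here is a back-of-the-envelope estimate (not a measurement). It assumes a dist object stores only the lower triangle, i.e. n * (n - 1) / 2 doubles of 8 bytes each for a group of n rows, and it ignores attributes and any temporary copies. The helper name dist_size_gb is made up for illustration.

```r
# Hypothetical helper: approximate size of one dist object in GB,
# assuming n * (n - 1) / 2 doubles of 8 bytes each (attributes ignored).
dist_size_gb <- function(n) {
  n * (n - 1) / 2 * 8 / 1e9
}

# Group sizes from the question: 10,000 and 100,000 rows
dist_size_gb(c(1e4, 1e5))
```

Under these assumptions that is roughly 0.4 GB for a group of 10,000 rows and roughly 40 GB for a group of 100,000 rows, per distance matrix.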
FactorDistances <- function(df) {
  # df is a data frame whose first column is the grouping variable.
  # Find the names and number of groups in df (in the example there are three: 2, 3, 4).
  factor.names <- unique(df[[1]])
  n.factors <- length(factor.names)
  # Split df by group into a list; each subset data frame is one list element.
  df.l <- vector("list", n.factors)
  for (f in seq_len(n.factors)) {
    df.l[[f]] <- df[which(df[[1]] == factor.names[f]), ]
  }
  # Use lapply to go through the list and calculate the distance matrix for each group.
  # This results in a new list where each element is a distance matrix.
  # (Minkowski distance with p = 2 is the Euclidean distance.)
  distances <- lapply(df.l, function(x) dist(x[, -1, drop = FALSE], method = "minkowski", p = 2))
  # Again use lapply to get the mean distance for each group.
  means <- lapply(distances, mean)
  rm(distances)
  gc()
  return(means)
}
df <- data.frame(cbind(factor=rep(2:4,2:4), rnorm(9), rnorm(9)))
FactorDistances(df)
# The result is three average Euclidean distances, one per group, taken over all pairs of rows in that group.
# If a group has only one member, the value is NaN.
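A minimal sketch of the "summarise, then discard" idea described above, in case it helps the discussion. The function name FactorDistanceStats, the breaks argument, and the choice of returned statistics are assumptions made for illustration. The sketch computes and summarises one group at a time inside a single lapply call, so only one dist object needs to exist at any moment; it does not reduce the memory needed for the largest single group.

```r
# Sketch: summarise each group's distances inside a single lapply call,
# so that only one dist object needs to exist at any time.
# FactorDistanceStats and the chosen statistics are illustrative assumptions.
FactorDistanceStats <- function(df, breaks = 10) {
  # One data frame per group (first column is the grouping variable),
  # named by group value
  groups <- split(df[-1], df[[1]])
  lapply(groups, function(x) {
    d <- dist(x, method = "euclidean")  # same as minkowski with p = 2
    if (length(d) == 0) {
      # A group with a single row has no pairs
      return(list(mean = NaN, hist = NULL))
    }
    # Keep only the summaries; d is dropped when this function returns
    list(mean = mean(d),
         hist = hist(as.vector(d), breaks = breaks, plot = FALSE))
  })
}

set.seed(1)
df <- data.frame(factor = rep(2:4, 2:4), x = rnorm(9), y = rnorm(9))
stats <- FactorDistanceStats(df)
sapply(stats, function(s) s$mean)  # one mean distance per group
```

Unlike the original function, this returns a list named by group value ("2", "3", "4") and ordered by group rather than by first appearance.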
Edit: I edited the title to reflect the chunking issue I posted as an answer.