Efficient way to read and write data into files over a loop using R

I am trying to read and write data into files at each time step.

To do this, I am using the h5 package to store large datasets, but the code built on its functions runs slowly. Because I am working with very large datasets, I also run into memory limits. Here is a reproducible example:

library(ff)
library(h5)
set.seed(12345)
for(t in 1:3650){

  print(t)

  ## Initialize the matrix to fill
  mat_to_fill <- ff(-999, dim=c(7200000, 48), dimnames=list(NULL, paste0("P", as.character(seq(1, 48, 1)))), vmode="double", overwrite = T) 
  ## print(mat_to_fill)
  ## summary(mat_to_fill[,])

  ## Create the output file
  f_t <- h5file(paste0("file",t,".h5"))

  ## Retrieve the matrix at t - 1 if t > 1
  if(t > 1){
    f_t_1 <- h5file(paste0("file", t - 1, ".h5"))
    mat_t_1 <- f_t_1["testmat"][] ## *********** ##
    ## f_t_1["testmat"][]
    h5close(f_t_1) ## close the handle on the previous step's file

  } else {

    mat_t_1 <- 0

  }

  ## Fill the matrix
  mat_to_fill[,] <- matrix(data = sample(1:100, 7200000*48, replace = TRUE), nrow = 7200000, ncol = 48) + mat_t_1
  ## mat_to_fill[1:3,]

  ## Write data
  system.time(f_t["testmat"] <- mat_to_fill[,]) ## *********** ##
  ## f_t["testmat"][]
  h5close(f_t)

}

Is there an efficient way to speed up my code (see the lines marked ## *********** ##)? Any advice would be much appreciated.

EDIT

I have tried to create a data frame with the createDataFrame function from the SparkR package, but I get this error message:

Error in writeBin(batch, con, endian = "big") : 
  long vectors not supported yet: connections.c:4418

I have also tested other functions for writing large data to a file:

test <- mat_to_fill[,]

library(data.table)
system.time(fwrite(test, file = "Test.csv", row.names=FALSE))
   user  system elapsed
  33.74    2.10   13.06 

system.time(save(test, file = "Test.RData"))
 user  system elapsed 
 223.49    0.67  224.75 

system.time(saveRDS(test, "Test.Rds"))
 user  system elapsed 
 197.42    0.98  199.01 

library(feather)
test <- data.frame(mat_to_fill[,])
system.time(write_feather(test, "Test.feather")) 
   user  system elapsed 
   0.99    1.22   10.00 

If possible, I would like to reduce the elapsed time to <= 1 sec.
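For reference, here is a minimal sketch of another option: streaming the matrix to a plain binary file with writeBin()/readBin(), one column at a time. The "long vectors not supported" error above comes from trying to push the whole serialized object (several GB of raw bytes) through a connection in a single call; column-sized chunks of 7.2e6 doubles (~58 MB each) sidestep that limit. The helper and file names below are made up for illustration, there is no metadata or cross-platform handling, and the speed is ultimately bounded by the disk.

write_mat_bin <- function(mat, path) {
  con <- file(path, open = "wb")
  on.exit(close(con))
  for (j in seq_len(ncol(mat))) {
    writeBin(as.vector(mat[, j]), con)   # one ~58 MB chunk per call
  }
}

read_mat_bin <- function(path, nrow, ncol) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  out <- matrix(0, nrow = nrow, ncol = ncol)
  for (j in seq_len(ncol)) {
    out[, j] <- readBin(con, what = "double", n = nrow)
  }
  out
}

## Example usage (timings depend on the disk):
## m <- mat_to_fill[,]
## system.time(write_mat_bin(m, "Test.bin"))
## m2 <- read_mat_bin("Test.bin", nrow = 7200000, ncol = 48)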

SUPPLEMENTARY INFORMATION

I am building an agent-based model in R, but I have memory issues because I work with large 3D arrays. In these arrays, the first dimension is time (3650 rows, one per day), the second dimension holds the properties of individuals or landscape cells (48 columns), and the third dimension indexes each individual (720000 in total) or landscape cell (90000 in total). I have 8 such 3D arrays in total.

Currently, the 3D arrays are allocated at initialization and filled at each time step (1 day) by several functions. However, to fill one 3D array at time t, the model only needs the data at t - 1 and at t - tf - 1, where tf is a fixed duration parameter (e.g., tf = 320 days). I don't know how to manage these 3D arrays in the ABM at each time step. My first idea for avoiding the memory issues was therefore to save the data contained in the 3D array for each individual or cell at each time step (i.e., a 2D array) to disk, and to read the data back at t - 1 and t - tf - 1.
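Since the slice at t - 1 is needed again on the very next step, one option (a sketch only, assuming at least that one slice can be held in RAM) is to carry it across iterations in an ordinary R object and go to disk only for the t - tf - 1 slice; files that can no longer be needed can be deleted as the loop advances. Below is a rough sketch of that bookkeeping; step_file() and update_step() are placeholder names standing in for the model's own file layout and update functions.

tf <- 320                      # fixed duration parameter of the model

step_file <- function(t) sprintf("step_%05d.rds", t)   # placeholder naming scheme

## Dummy transition mirroring the example in the question; the real model's
## update functions would go here
update_step <- function(t, mat_prev, mat_lag) {
  new <- matrix(sample(1:100, 7200000 * 48, replace = TRUE), nrow = 7200000, ncol = 48)
  if (is.null(mat_prev)) new else new + mat_prev
}

mat_prev <- NULL               # slice at t - 1, carried in memory between steps

for (t in 1:3650) {

  ## Only the slice at t - tf - 1 has to come back from disk
  mat_lag <- if (t - tf - 1 >= 1) readRDS(step_file(t - tf - 1)) else NULL

  mat_t <- update_step(t, mat_prev, mat_lag)

  ## Persist the current slice so it can be re-read tf + 1 steps from now;
  ## compress = FALSE avoids the compression cost seen with save()/saveRDS() above
  saveRDS(mat_t, step_file(t), compress = FALSE)

  ## Once the t - tf - 1 file has been read it is never needed again,
  ## so it can (optionally) be deleted to bound disk usage
  if (!is.null(mat_lag)) unlink(step_file(t - tf - 1))

  mat_prev <- mat_t            # this becomes the in-memory t - 1 slice next step
}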

Mocambique asked 9/8, 2019 at 23:48

Comments (10):

How large are your datasets? How much available RAM do you have? A definitive solution would be to use SparkR. Hadoop, on the other hand, would be a bit of overkill. It really depends on the size of the dataset and your memory restrictions. – Schoolman
Thank you very much for your answer. My 8 datasets are 7200000 x 48 matrices. I have 32 GB RAM. – Mocambique
What do you need to achieve with the dataset? Add each one to the previous one? On which OS are you running? – Nicolette
I need to save each data frame at t so that I can access all these data at the next time step. I have memory problems in R, so I save the data to the hard disk rather than keeping it in R at each time step. I am running on Windows 7. – Mocambique
Just out of curiosity, have you tried loading your data with the data.table::fread() function? – Comeuppance
Thank you very much for your answer. Yes, I have tried the fread and fwrite functions, but 10 s for each time step t is too long. – Mocambique
Writing to disk is slow; that's why it's preferable to work in memory. Rather than looking for a faster way to write to disk, working in an environment with enough memory to hold two time steps would be much faster, letting you build time step t with time step t-1 still in memory. Unless you need to log the intermediate times anyway... – Conative
Have you tried the vroom package? Or storing the data in a database? – Habitual
@Mocambique Can you elaborate on what exactly you are trying to do? I really don't get it from your example code. Reading and writing are by definition slow. There are, however, numerous ways to improve the execution speed of various operations. E.g., if you have memory issues but enough processing resources, parallelization could make sense. But to assess that and give you concrete guidance, some more information on your goals and on the types of data sources you are dealing with (where does the data come from?) would be needed. – Mask
@Mask Thank you very much for your comment. I have added some details in the "supplementary information" section. – Mocambique

Your matrix is 7200000 x 48, and with a 4-byte integer you get 7200000 * 48 * 4 bytes, or ~1.4 GB. At an HDD read/write speed of about 120 MB/s you are lucky to get 10 seconds with an average HDD. With a good SSD you should be able to get 2-3 GB/s and therefore about 0.5 second using the fwrite or write_feather approaches you tried. I assume you don't have an SSD, as it is not mentioned.

You have 32 GB of memory, which seems to be enough for 8 datasets of that size, so chances are you are spending the time copying this data around in memory. You can try to optimize your memory usage instead of writing the data to the hard drive, or work with a portion of the dataset at a time, although both approaches present implementation challenges. The problem of splitting data and merging results comes up frequently in distributed computing, which requires splitting datasets and then merging results from multiple workers.

Using a database is always slower than plain disk operations, unless it is an in-memory database, which does not help here since the data reportedly does not fit in memory, or unless you have some very specific sparse data that can easily be compressed and extracted.
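To make that back-of-the-envelope arithmetic explicit (the 4-byte figure applies if the values are written out as 32-bit integers; R holds the matrix in memory as 8-byte doubles, so the in-memory object is roughly twice that size):

n <- 7200000 * 48        # number of cells in the matrix

n * 4 / 1e9              # ~1.38 GB written as 4-byte integers
n * 8 / 1e9              # ~2.76 GB as 8-byte doubles (R's in-memory representation)

(n * 4 / 1e9) / 0.12     # ~11.5 s at ~120 MB/s (average HDD)
(n * 4 / 1e9) / 2.5      # ~0.55 s at ~2.5 GB/s (good NVMe SSD)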

Lind answered 20/8, 2019 at 6:56

You can try using the fst package:

library(fst)
write.fst(x, path, compress = 50, uniform_encoding = TRUE)   # x: a data frame, path: output file

You can find a more detailed comparison here: https://www.fstpackage.org/

Note: you can use the compress parameter to make it more efficient.
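A minimal usage sketch for the matrix from the question (the file name and settings are illustrative; fst works on data frames, so the matrix has to be converted first, and actual timings depend on the disk and on how many threads fst can use):

library(fst)

## fst is multithreaded; the number of threads can be tuned (optional)
threads_fst(4)

## fst reads and writes data frames, so convert the matrix first
test_df <- as.data.frame(mat_to_fill[,])

system.time(write.fst(test_df, "Test.fst", compress = 50))
system.time(x <- read.fst("Test.fst"))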

Canton answered 20/8, 2019 at 19:40
