Efficient way to read and write data into files over a loop using R

I am trying to read and write data into files at each time step.

To do this, I am using the h5 package to store large datasets, but the code built on its functions runs slowly. Because I am working with very large datasets, I also run into memory limits. Here is a reproducible example:

library(ff)
library(h5)
set.seed(12345)
for(t in 1:3650){

  print(t)

  ## Initialize the matrix to fill
  mat_to_fill <- ff(-999, dim=c(7200000, 48), dimnames=list(NULL, paste0("P", as.character(seq(1, 48, 1)))), vmode="double", overwrite = T) 
  ## print(mat_to_fill)
  ## summary(mat_to_fill[,])

  ## Create the output file
  f_t <- h5file(paste0("file",t,".h5"))

  ## Retrieve the matrix at t - 1 if t > 1
  if(t > 1){
    f_t_1 <- h5file(paste0("file", t - 1, ".h5"))
    mat_t_1 <- f_t_1["testmat"][] ## *********** ##
    ## f_t_1["testmat"][]
    h5close(f_t_1) ## close the handle on the previous step's file

  } else {

    mat_t_1 <- 0

  }

  ## Fill the matrix
  mat_to_fill[,] <- matrix(data = sample(1:100, 7200000*48, replace = TRUE), nrow = 7200000, ncol = 48) + mat_t_1
  ## mat_to_fill[1:3,]

  ## Write data
  system.time(f_t["testmat"] <- mat_to_fill[,]) ## *********** ##
  ## f_t["testmat"][]
  h5close(f_t)

}

Is there an efficient way to speed up my code (see the lines marked ## *********** ##)? Any advice would be much appreciated.

EDIT

I have tried to create a data frame with the createDataFrame function from the SparkR package, but I get this error message:

Error in writeBin(batch, con, endian = "big") : 
  long vectors not supported yet: connections.c:4418

I have also tested other functions for writing large data to a file:

test <- mat_to_fill[,]

library(data.table)
system.time(fwrite(test, file = "Test.csv", row.names=FALSE))
   user  system elapsed
  33.74    2.10   13.06 

system.time(save(test, file = "Test.RData"))
 user  system elapsed 
 223.49    0.67  224.75 

system.time(saveRDS(test, "Test.Rds"))
 user  system elapsed 
 197.42    0.98  199.01 

library(feather)
test <- data.frame(mat_to_fill[,])
system.time(write_feather(test, "Test.feather")) 
   user  system elapsed 
   0.99    1.22   10.00 

If possible, I would like to reduce the elapsed time to <= 1 sec.
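For reference, here is a minimal sketch of another option: streaming the matrix to a plain binary file with writeBin()/readBin(), one column at a time. The "long vectors not supported" error above comes from trying to push the whole serialized object (several GB of raw bytes) through a connection in a single call; column-sized chunks of 7.2e6 doubles (~58 MB each) sidestep that limit. The helper and file names below are made up for illustration, there is no metadata or cross-platform handling, and the speed is ultimately bounded by the disk.

write_mat_bin <- function(mat, path) {
  con <- file(path, open = "wb")
  on.exit(close(con))
  for (j in seq_len(ncol(mat))) {
    writeBin(as.vector(mat[, j]), con)   # one ~58 MB chunk per call
  }
}

read_mat_bin <- function(path, nrow, ncol) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  out <- matrix(0, nrow = nrow, ncol = ncol)
  for (j in seq_len(ncol)) {
    out[, j] <- readBin(con, what = "double", n = nrow)
  }
  out
}

## Example usage (timings depend on the disk):
## m <- mat_to_fill[,]
## system.time(write_mat_bin(m, "Test.bin"))
## m2 <- read_mat_bin("Test.bin", nrow = 7200000, ncol = 48)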

SUPPLEMENTARY INFORMATION

I am building an agent-based model in R, but I have memory issues because I work with large 3D arrays. In these arrays, the first dimension is time (3650 rows, one per day), the second dimension holds the properties of individuals or landscape cells (48 columns), and the third dimension indexes each individual (720000 in total) or landscape cell (90000 in total). I have 8 such 3D arrays in total.

Currently, the 3D arrays are allocated at initialization and filled at each time step (1 day) by several functions. However, to fill one 3D array at time t, the model only needs the data at t - 1 and at t - tf - 1, where tf is a fixed duration parameter (e.g., tf = 320 days). I don't know how to manage these 3D arrays in the ABM at each time step. My first idea for avoiding the memory issues was therefore to save the data contained in the 3D array for each individual or cell at each time step (i.e., a 2D array) to disk, and to read the data back at t - 1 and t - tf - 1.
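Since the slice at t - 1 is needed again on the very next step, one option (a sketch only, assuming at least that one slice can be held in RAM) is to carry it across iterations in an ordinary R object and go to disk only for the t - tf - 1 slice; files that can no longer be needed can be deleted as the loop advances. Below is a rough sketch of that bookkeeping; step_file() and update_step() are placeholder names standing in for the model's own file layout and update functions.

tf <- 320                      # fixed duration parameter of the model

step_file <- function(t) sprintf("step_%05d.rds", t)   # placeholder naming scheme

## Dummy transition mirroring the example in the question; the real model's
## update functions would go here
update_step <- function(t, mat_prev, mat_lag) {
  new <- matrix(sample(1:100, 7200000 * 48, replace = TRUE), nrow = 7200000, ncol = 48)
  if (is.null(mat_prev)) new else new + mat_prev
}

mat_prev <- NULL               # slice at t - 1, carried in memory between steps

for (t in 1:3650) {

  ## Only the slice at t - tf - 1 has to come back from disk
  mat_lag <- if (t - tf - 1 >= 1) readRDS(step_file(t - tf - 1)) else NULL

  mat_t <- update_step(t, mat_prev, mat_lag)

  ## Persist the current slice so it can be re-read tf + 1 steps from now;
  ## compress = FALSE avoids the compression cost seen with save()/saveRDS() above
  saveRDS(mat_t, step_file(t), compress = FALSE)

  ## Once the t - tf - 1 file has been read it is never needed again,
  ## so it can (optionally) be deleted to bound disk usage
  if (!is.null(mat_lag)) unlink(step_file(t - tf - 1))

  mat_prev <- mat_t            # this becomes the in-memory t - 1 slice next step
}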

Mocambique asked 9/8, 2019 at 23:48

Comments (10):

How large are your datasets? How much available RAM do you have? A definitive solution would be to use SparkR. Hadoop, on the other hand, would be a bit of overkill. It really depends on the size of the dataset and your memory restrictions. – Schoolman
Thank you very much for your answer. My 8 datasets are 7200000 x 48 matrices. I have 32 GB RAM. – Mocambique
What do you need to achieve with the dataset? Add each one to the previous one? On which OS are you running? – Nicolette
I need to save each data frame at t so that I can access all these data at the next time step. I have memory problems in R, so I save the data to the hard disk rather than keeping it in R at each time step. I am running on Windows 7. – Mocambique
Just out of curiosity, have you tried loading your data with the data.table::fread() function? – Comeuppance
Thank you very much for your answer. Yes, I have tried the fread and fwrite functions, but 10 s for each time step t is too long. – Mocambique
Writing to disk is slow; that's why it's preferable to work in memory. Rather than looking for a faster way to write to disk, working in an environment with enough memory to hold two time steps would be much faster, letting you build time step t with time step t-1 still in memory. Unless you need to log the intermediate times anyway... – Conative
Have you tried the vroom package? Or storing the data in a database? – Habitual
@Mocambique Can you elaborate on what exactly you are trying to do? I really don't get it from your example code. Reading and writing are by definition slow. There are, however, numerous ways to improve the execution speed of various operations. E.g., if you have memory issues but enough processing resources, parallelization could make sense. But to assess that and give you concrete guidance, some more information on your goals and on the types of data sources you are dealing with (where does the data come from?) would be needed. – Mask
@Mask Thank you very much for your comment. I have added some details in the "supplementary information" section. – Mocambique

Your matrix is 7200000 x 48, and with a 4-byte integer you get 7200000 * 48 * 4 bytes, or ~1.4 GB. At an HDD read/write speed of about 120 MB/s you are lucky to get 10 seconds with an average HDD. With a good SSD you should be able to get 2-3 GB/s and therefore about 0.5 second using the fwrite or write_feather approaches you tried. I assume you don't have an SSD, as it is not mentioned.

You have 32 GB of memory, which seems to be enough for 8 datasets of that size, so chances are you are spending the time copying this data around in memory. You can try to optimize your memory usage instead of writing the data to the hard drive, or work with a portion of the dataset at a time, although both approaches present implementation challenges. The problem of splitting data and merging results comes up frequently in distributed computing, which requires splitting datasets and then merging results from multiple workers.

Using a database is always slower than plain disk operations, unless it is an in-memory database, which does not help here since the data reportedly does not fit in memory, or unless you have some very specific sparse data that can easily be compressed and extracted.
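To make that back-of-the-envelope arithmetic explicit (the 4-byte figure applies if the values are written out as 32-bit integers; R holds the matrix in memory as 8-byte doubles, so the in-memory object is roughly twice that size):

n <- 7200000 * 48        # number of cells in the matrix

n * 4 / 1e9              # ~1.38 GB written as 4-byte integers
n * 8 / 1e9              # ~2.76 GB as 8-byte doubles (R's in-memory representation)

(n * 4 / 1e9) / 0.12     # ~11.5 s at ~120 MB/s (average HDD)
(n * 4 / 1e9) / 2.5      # ~0.55 s at ~2.5 GB/s (good NVMe SSD)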

Lind answered 20/8, 2019 at 6:56

You can try using the fst package:

library(fst)
write.fst(x, path, compress = 50, uniform_encoding = TRUE)   # x: a data frame, path: output file

You can find a more detailed comparison here: https://www.fstpackage.org/

Note: you can use the compress parameter to make it more efficient.
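A minimal usage sketch for the matrix from the question (the file name and settings are illustrative; fst works on data frames, so the matrix has to be converted first, and actual timings depend on the disk and on how many threads fst can use):

library(fst)

## fst is multithreaded; the number of threads can be tuned (optional)
threads_fst(4)

## fst reads and writes data frames, so convert the matrix first
test_df <- as.data.frame(mat_to_fill[,])

system.time(write.fst(test_df, "Test.fst", compress = 50))
system.time(x <- read.fst("Test.fst"))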

Canton answered 20/8, 2019 at 19:40
