I am trying to read and write data to files at each time step. To do this, I use the h5 package to store large datasets, but the functions of this package make my code run slowly. Because I am working with very large datasets, I also run into memory limits. Here is a reproducible example:
library(ff)
library(h5)
set.seed(12345)
for(t in 1:3650){
  print(t)
  ## Initialize the matrix to fill
  mat_to_fill <- ff(-999, dim = c(7200000, 48),
                    dimnames = list(NULL, paste0("P", as.character(seq(1, 48, 1)))),
                    vmode = "double", overwrite = TRUE)
  ## print(mat_to_fill)
  ## summary(mat_to_fill[,])
  ## Create the output file
  f_t <- h5file(paste0("file", t, ".h5"))
  ## Retrieve the matrix at t - 1 if t > 1
  if(t > 1){
    f_t_1 <- h5file(paste0("file", t - 1, ".h5"))
    mat_t_1 <- f_t_1["testmat"][] ## *********** ##
    ## f_t_1["testmat"][]
    h5close(f_t_1)  ## close the handle of the previous day's file
  } else {
    mat_t_1 <- 0
  }
  ## Fill the matrix
  mat_to_fill[,] <- matrix(data = sample(1:100, 7200000*48, replace = TRUE),
                           nrow = 7200000, ncol = 48) + mat_t_1
  ## mat_to_fill[1:3,]
  ## Write data
  system.time(f_t["testmat"] <- mat_to_fill[,]) ## *********** ##
  ## f_t["testmat"][]
  h5close(f_t)
}
Is there an efficient way to speed up my code (see the lines marked ## *********** ##)? Any advice would be much appreciated.
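For reference, one variant I have not timed yet is the Bioconductor rhdf5 package, which allows creating the dataset explicitly with no compression (level = 0) and a chosen chunk size, and reading back only part of a dataset. This is only a sketch: the file name and the chunk size are arbitrary assumptions, any speed gain is unmeasured, and mat_to_fill is the ff matrix from the example above.
library(rhdf5)
fname <- "file1_rhdf5.h5"
h5createFile(fname)
## level = 0 disables compression; the chunk size is an arbitrary guess
h5createDataset(fname, "testmat", dims = c(7200000, 48),
                storage.mode = "double", chunk = c(100000, 48), level = 0)
system.time(h5write(mat_to_fill[,], fname, "testmat"))
## Only part of the matrix needs to be read back if desired
mat_t_1 <- h5read(fname, "testmat", index = list(1:1000, NULL))
h5closeAll()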
EDIT
I have tried to create a data frame with the createDataFrame function of the SparkR package, but I get this error message:
Error in writeBin(batch, con, endian = "big") :
long vectors not supported yet: connections.c:4418
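For completeness, the attempt looked roughly like this (a reconstruction rather than my exact code; sparkR.session() and createDataFrame() are the standard SparkR entry points, and mat_to_fill is the ff matrix from the example above):
library(SparkR)
sparkR.session()
test <- data.frame(mat_to_fill[,])
## createDataFrame() is where the writeBin() error above is raised
sdf <- createDataFrame(test)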
I have also tested other functions for writing huge data to a file:
test <- mat_to_fill[,]
library(data.table)
system.time(fwrite(test, file = "Test.csv", row.names=FALSE))
user system elapsed
33.74 2.10 13.06
system.time(save(test, file = "Test.RData"))
user system elapsed
223.49 0.67 224.75
system.time(saveRDS(test, "Test.Rds"))
user system elapsed
197.42 0.98 199.01
library(feather)
test <- data.frame(mat_to_fill[,])
system.time(write_feather(test, "Test.feather"))
user system elapsed
0.99 1.22 10.00
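Another format that might be worth timing in the same way is fst; I have not measured it on this machine, so whether compress = 0 gets anywhere near 1 sec here is an open question (the column names P1, P2 come from the example above):
library(fst)
test <- data.frame(mat_to_fill[,])
## compress = 0 trades file size for write speed; untested on this data
system.time(write_fst(test, "Test.fst", compress = 0))
## read_fst() can also load only selected columns
system.time(read_fst("Test.fst", columns = c("P1", "P2")))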
If possible, I would like to reduce the elapsed time to <= 1 sec.
SUPPLEMENTARY INFORMATION
I am building an agent-based model in R, but I have memory issues because I work with large 3D arrays. In these 3D arrays, the first dimension corresponds to time (each array has 3650 rows), the second dimension defines the properties of individuals or landscape cells (48 columns), and the third dimension represents each individual (720000 individuals in total) or landscape cell (90000 cells in total). In total, I have eight 3D arrays. Currently, the 3D arrays are defined at initialization, and data are stored in them at each time step (1 day) by several functions. However, to fill a 3D array at time t, the model only needs the data at t - 1 and t - tf - 1, where tf is a fixed duration parameter (e.g., tf = 320 days). I don't know how to manage these 3D arrays in the ABM at each time step. My first solution to avoid memory issues was therefore to save, at each time step, the data contained in the 3D array for each individual or cell (thus a 2D array) to a file, and to read the data back from files at t - 1 and t - tf - 1.
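One change that is independent of the storage format: the matrix written at step t is exactly what is read back as mat_t_1 at step t + 1, so the previous day's matrix could simply be kept in a variable and only the older t - tf - 1 slice read from disk. Below is a minimal sketch of that idea, reusing the h5 calls from the example above; whether keeping one extra 7200000 x 48 matrix in memory is acceptable is an assumption on my side.
library(h5)
tf <- 320
mat_t_1 <- 0  ## previous day's matrix, kept in memory instead of being re-read
for (t in 1:3650) {
  mat_t <- matrix(sample(1:100, 7200000 * 48, replace = TRUE),
                  nrow = 7200000, ncol = 48) + mat_t_1
  ## Each day is still written out so that the t - tf - 1 slice can be read back later
  f_t <- h5file(paste0("file", t, ".h5"))
  f_t["testmat"] <- mat_t
  h5close(f_t)
  if (t > tf + 1) {
    f_old <- h5file(paste0("file", t - tf - 1, ".h5"))
    mat_t_tf_1 <- f_old["testmat"][]  ## needed by the model at step t
    h5close(f_old)
  }
  mat_t_1 <- mat_t  ## reused at step t + 1 without re-opening file t
}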
COMMENTS
[...] fread and fwrite. But 10 s for each time step t is too long. – Mocambiquet
[...] with time step t - 1 still in memory. Unless you need to log the intermediate times anyway... – Conativevroom
[...] package? Or storing the data in a database? – Habitual