How to avoid filling up the RAM when doing multiprocessing in R (future)?

I am using furrr which is built on top of future.

I have a very simple question. I have a list of files, say list('/mydata/file1.csv.gz', '/mydata/file2.csv.gz'), and I am processing them in parallel with a simple function that loads the data, does some filtering, and writes the result to disk.

In essence, my function is

processing_func <- function(file){
  mydata <- readr::read_csv(file)
  mydata <- mydata %>% dplyr::filter(var == 1)
  data.table::fwrite(mydata, 'myfolder/processed.csv.gz')
  rm()
  gc()
}

and so I am simply running

listfiles %>% furrr::future_map(~processing_func(.x))

This works, but despite my gc() and rm() calls, the RAM keeps filling up until the session crashes.

What is the conceptual issue here? Why would residual objects somehow remain in memory when I explicitly discard them?

Thanks!

Adelleadelpho answered 30/7, 2019 at 21:34 Comment(7)
It's hard to know without a replicable example, but rm() is not doing anything for you. You need to tell rm() what to remove. For example rm(mydata).Premonition
damn!!!! is it that simple???Andrea
Did that work for you?Premonition
I am trying right now :)Andrea
If running n instances of R causes you to run out of memory, try n-1 or n-2 instances of R. Doing things in parallel can decrease run-time, but always increases CPU and memory usage. (Or is there something else I'm missing in your workflow?)Subaxillary
In the use case presented above I would go with grep, piping, awk, etc. rather than R. Unless the filter is more complex.Chian
interesting. can we use awk from R?Andrea
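Regarding the awk suggestion in the comments: one way to call awk from R is via the cmd argument of data.table::fread(). A minimal sketch, assuming awk and gzip are on the PATH and that var is the third comma-separated column (a hypothetical layout):

# Decompress, keep the header row plus rows where the 3rd field equals 1,
# and read the filtered stream straight into R
filtered <- data.table::fread(
  cmd = "gzip -dc /mydata/file1.csv.gz | awk -F',' 'NR == 1 || $3 == 1'"
)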

You can try using a callr future plan; it may be less memory-hungry. As quoted from the future.callr vignette:

When using callr futures, each future is resolved in a fresh background R session which ends as soon as the value of the future has been collected. In contrast, multisession futures are resolved in background R worker sessions that serve multiple futures over their life spans. The advantage of using a new R process for each future is that the R environment is guaranteed not to be contaminated by previous futures, e.g. memory allocations, finalizers, modified options, and loaded and attached packages. The disadvantage is the added overhead of launching a new R process.

library("future.callr")
plan(callr)
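
Since furrr dispatches work according to whatever future plan is active, a minimal sketch of wiring this into the original workflow (assuming the same listfiles and processing_func as in the question; workers = 4 is arbitrary):

library(furrr)
library(future.callr)

# Each future gets a fresh, disposable R session that exits once its value is collected
plan(callr, workers = 4)

# walk instead of map: we only care about the side effect of writing to disk
furrr::future_walk(listfiles, processing_func)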
Philbrook answered 16/5, 2021 at 7:31 Comment(0)

Assuming you're using 64-bit R on Windows, R is by default only bounded by the available RAM. You can use memory.limit() to increase the amount of memory your R session is allowed to use; for example, memory.limit(50*1024) would allow your R session to use 50 GB of memory. Also, R automatically calls gc() whenever it's running low on space, so that line isn't helping you.
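
For reference, a minimal sketch of what that looks like (Windows only; note that memory.limit() was turned into an inactive stub in R 4.2, so this only applies to older versions):

memory.limit()           # query the current limit, in MB
memory.limit(50 * 1024)  # raise the limit to roughly 50 GB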

Encincture answered 31/7, 2019 at 1:28 Comment(1)
OK, I'm not sure if memory.limit() will help you then, let me know!?Encincture

With future multisession:

future::plan(multisession)
processing_func <- function(file){
  readr::read_csv(file) |> 
    dplyr::filter(var == 1) |> 
    data.table::fwrite('...csv.gz')
  gc()
}
listfiles |> furrr::future_walk(processing_func)

Note that I am

  1. Not creating any variables in processing_func so there is nothing to rm
  2. Using future_walk, not future_map, as we don't need the resolved value.
  3. Using gc() inside the future.

Passing files to functions in futures is a nice way to parallelize things. I also like to use multicore instead of multisession to share some objects from the parent environment.
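
A sketch of that multicore variant (assuming a Unix-alike: forked processing is not supported on Windows or from within RStudio, and workers = 4 is arbitrary):

# Forked workers inherit the parent environment via copy-on-write,
# so objects in the parent are shared rather than copied up front
future::plan(future::multicore, workers = 4)
# ...then run the same furrr::future_walk() call as above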

It seems like these sessions run out of memory if you aren't careful. A gc call in the future function seems to help pretty often.

Noranorah answered 15/7, 2022 at 1:58 Comment(0)
