Why does gc() not free memory?
I run simulations on a 64-bit Windows computer with 64 GB of RAM. Memory use reaches 55%, and after a finished simulation run I remove all objects in the workspace with rm(list=ls()), followed by a gc().

I assumed this would free enough memory for the next simulation run, but memory usage actually drops by just 1%. Consulting many different forums, I could not find a satisfactory explanation, only vague comments such as:

"Depending on your operating system, the freed up memory might not be returned to the operating system, but kept in the process space."

I'd like to find information on:

  1. under which OSes and conditions freed memory is not returned to the OS, and
  2. whether there is any remedy other than closing R and starting it again before the next simulation run?
Laurettalaurette answered 29/1, 2013 at 10:3 Comment(4)
Will the next run run out of memory if you do not close R?Pending
Do you actually run out of memory later on?Hent
I could not check that yet. I'm in quite a hurry with the current project and did not want to risk getting stuck with the simulations (they take between six hours and two days).Laurettalaurette
That's why you test the behavior of gc() or rm() on a nice small dataset before executing the entire simulation.Outlier

How do you check memory usage? Normally a virtual machine allocates a chunk of memory that it uses to store its data. Some of that allocated memory may be unused and marked as free. What the GC does is discover data that is no longer referenced from anywhere and mark the corresponding chunks of memory as unused; this does not mean the memory is released to the OS. Still, from the VM's perspective there is now more free memory available for further computation.

As others have asked: did you experience out-of-memory errors? If not, then there's nothing to worry about.

EDIT: This and this should be enough to understand how memory allocation and garbage collection works in R.

From the first document:

Occasionally an attempt is made to release unused pages back to the operating system. When pages are released, a number of free nodes equal to R_MaxKeepFrac times the number of allocated nodes for each class is retained. Pages not needed to meet this requirement are released. An attempt to release pages is made every R_PageReleaseFreq level 1 or level 2 collections.

EDIT2:

To see memory usage in detail, run gc() with verbose set to TRUE:

gc(verbose = TRUE)

Here's the result with an array of 10,000,000 integers in memory:

Garbage collection 9 = 1+0+8 (level 2) ... 
10.7 Mbytes of cons cells used (49%)
40.6 Mbytes of vectors used (72%)
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  198838 10.7     407500 21.8   350000 18.7
Vcells 5311050 40.6    7421749 56.7  5311504 40.6

And here's the result after discarding the reference to it:

Garbage collection 10 = 1+0+9 (level 2) ... 
10.7 Mbytes of cons cells used (49%)
2.4 Mbytes of vectors used (5%)
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 198821 10.7     407500 21.8   350000 18.7
Vcells 310987  2.4    5937399 45.3  5311504 40.6

As you can see, memory used by Vcells fell from 40.6 Mb to 2.4 Mb.
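A minimal sketch to reproduce output like the above in your own session (the exact cell counts and trigger values will differ):

```r
x <- integer(1e7)    # ~40 MB of integer data (4 bytes per element)
gc(verbose = TRUE)   # Vcells "used" now includes x
rm(x)
gc(verbose = TRUE)   # Vcells "used" drops back, even if the OS-level
                     # process footprint (e.g. in Task Manager) does not
```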

The answered 29/1, 2013 at 10:20 Comment(4)
I check memory usage with the Windows Task Manager.Laurettalaurette
@Laurettalaurette Memory shown in Task Manager as used by the R process could be marked as free at the VM level, meaning all of it will be available for future computation. The GC, when performing a level 1 or level 2 collection, may decide to release some of it to the system so other processes can use it.The
After finishing my analysis I checked whether I would actually run out of memory - - I did not (although the Windows Task Manager showed that the largest part of memory was still occupied). So I better trust my gc()-output...Laurettalaurette
On my computer (Windows 10, 12 GB RAM) the garbage collector works very badly too. If I work with large datasets for a long time, all of Windows' memory gets filled whether I use gc() or not, and the computer becomes really slow and unusable.Sylvan

The R garbage collector is imperfect in the following (not so) subtle way: it does not move objects (i.e., it does not compact memory) because of the way it interacts with C libraries. (Some other languages/implementations suffer from this too, but others, despite also having to interact with C, manage to have a compacting generational GC which does not suffer from this problem).

This means that if you take turns allocating small chunks of memory which are then discarded and larger chunks for more permanent objects (this is a common situation when doing string/regexp processing), then your memory becomes fragmented and the garbage collector can do nothing about it: the memory is released, but cannot be re-used because the free chunks are too short.
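The allocation pattern described above can be sketched like this (illustrative only; whether and how badly the heap fragments depends on the allocator and the object sizes involved):

```r
keep <- list()
for (i in 1:100) {
  # short-lived small allocations, discarded on every iteration
  tmp <- strsplit(paste(sample(letters, 1e4, replace = TRUE), collapse = " "), " ")
  # a longer-lived, larger allocation may land between the freed chunks
  keep[[i]] <- numeric(1e5)
}
rm(tmp)
gc()  # reclaims the temporaries, but because R does not compact memory,
      # the holes they leave may be too small to reuse for large objects
```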

The only way to fix the problem is to save the objects you want, restart R, and reload the objects.

Since you are doing rm(list=ls()), i.e., you do not need any objects, you do not need to save and reload anything, so, in your case, the solution is precisely what you want to avoid - restarting R.
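For the general case where you do have objects worth keeping, a minimal sketch of the save/restart/reload workflow (object names are placeholders):

```r
# before restarting: persist only what you need
saveRDS(results, "results.rds")

# ...restart R (quit and relaunch; in RStudio, .rs.restartR() also works)...

# after restarting: reload into a fresh, unfragmented heap
results <- readRDS("results.rds")
```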

PS1. Garbage collection is a highly non-trivial topic. E.g., Ruby used 5 (!) different GC algorithms over 20 years. Java GC does not suck because Sun/Oracle and IBM spent many programmer-years on their respective implementations of the GC. On the other hand, R and Python have lousy GC - because no one bothered to invest the necessary man-years - and they are quite popular. That's worse-is-better for you.

PS2. Related: R: running out of memory using `strsplit`

Vouchsafe answered 21/4, 2013 at 15:3 Comment(9)
You seem to be contradicting yourself. No need to save and reload means no need to restart R, right?Groh
@AlexanderHanysz: not at all. Alas, the only way to reliably clean up the memory is to restart R. The objects which intersperse the released memory might be parts of the working environment which are not removed by rm(list=ls()).Vouchsafe
Thanks for the response. This is very unintuitive! Can you give examples of objects that aren't removed by rm(list=ls())?Groh
@AlexanderHanysz: if it were that easy, it would have been fixed. :-) I am not such an expert in R internals, sorry.Vouchsafe
"The only way to fix the problem is to save the objects you want, restart R, and reload the objects." I think that doesn't speak well of R.Sylvan
Do you think it will change? Or should we move on to other platform such as Julia or Python or would we need something more complex?Sylvan
@skan: R is pretty good at what it does, and it is extremely unlikely that its GC will be replaced. Python's GC sucks too (it's refcounting!) I know nothing about Julia. Generally speaking, GC has secondary importance for a research platform like R. It's more important for Python which is often used in production, but there I think there might be changes in the cards.Vouchsafe
Great answer. It’s even more extreme than you imply: “many programmer-years” is a phenomenal understatement (the actual number is in the centuries by now), and their GC still sucks in some scenarios.Fbi
Do you know if Julia has the same problem?Sylvan

If possible, you can run the memory-leaking work in a background R process using the callr package. I used this approach in a for loop instead of collecting garbage or restarting R.

Below is an example where I call a function in the background. You can load packages inside the background function.

# run the memory-hungry work in a separate R process; its memory is
# returned to the OS when that process exits
p = callr::r_bg(fun = function(n) {
  library(dplyr)
  library(data.table)
  df = data.frame(matrix(data = rep(1, n), ncol = 10))
  write.csv(x = df, file = 'xxxxx.csv')
  write.csv(x = data.frame(matrix(data = rep(1, n), ncol = 10)), file = 'xxxxx2.csv')
  df2 = df %>% summarise(x = sum(X1))
  df3 = fread('xxxxx.csv')
  fwrite(df3, file = 'x.csv')
}, args = list(n = 10000000))

p$wait()  # block until the background process finishes

print('finished')
Freund answered 14/10, 2022 at 5:28 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.