Saving a single object within a function in R: RData file size is very large

Asked 7/12, 2015 at 11:12 Answered 25/6, 2024 at 22:57

I am trying to save trimmed-down GLM objects in R (i.e. with all the "non-essential" characteristics set to NULL e.g. residuals, prior.weights, qr$qr).

As an example, looking at the smallest object that I need to do this with:

print(object.size(glmObject))
168992 bytes
save(glmObject, "FileName.RData")

Assigning this object in the global environment and saving leads to an RData file of about 6KB.

However, I effectively need to create and save the glm object within a function, which is itself called from within another function. So the code looks something like:

subFn <- function(DT, otherArg, ...){
  glmObject <- glm(...)
  save(glmObject,"FileName.RData")
}

mainFn <- function(DT, ...){
  subFn(DT, otherArg, ...)
}

mainFn(DT, ...)

This leads to much, much larger RData files of about 20 MB, despite the object itself being the same size.

So I understand this to be an environment issue, but I'm struggling to pinpoint exactly how and why it's happening. The resulting file size seems to vary quite a lot. I have tried using saveRDS, and equally I have tried assigning the glmObject via <<- to make it global, but nothing seems to help.

My understanding of environments in R clearly isn't very good, and I would really appreciate it if anyone could suggest a way around this. Thanks.

Shadwell answered 7/12, 2015 at 11:12 Comment(3)

Can you flesh out your example, including fake data so that your issue is reproducible? – Lindie 7/12, 2015 at 11:32

Yes, and also could you give the output of object.size from within the function? – Impersonalize 1/4, 2018 at 15:32

See the solution proposed here (with code correction in the comments) for plots. It should also work for this case. #32192798 – Amboina 1/8, 2019 at 20:31

Formulas have an environment attached. If that is the global environment or a package environment, only a reference to it is saved, but if it is an environment that cannot be reconstructed on load (such as the frame of a function call), it is saved along with everything in it.

glm results typically contain formulas, so they can contain the environment attached to that formula.

You don't need glm to demonstrate this. Just try this:

formula1 <- y ~ x
save(formula1, file = "formula1.Rdata")

f <- function() {
   z <- rnorm(1000000)
   formula2 <- y ~ x
   save(formula2, file = "formula2.Rdata")
}
f()

When I run the code above, formula1.Rdata ends up at 114 bytes, while formula2.Rdata ends up at 7.7 MB. This is because the latter captures the environment it was created in, and that contains the big vector z.
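This also explains why object.size() gives no warning of the problem: it does not count what is stored in an attached environment. A minimal sketch to check this, using a made-up helper called make_formula and the length of serialize()'s output as a rough stand-in for what save() writes before compression:

```r
# Hypothetical helper: returns a formula created inside a function frame
make_formula <- function() {
  z <- rnorm(1e6)   # large object that the formula never refers to
  y ~ x
}

fml_local <- make_formula()
fml_global <- y ~ x
environment(fml_global) <- globalenv()  # make the comparison explicit

# The captured function frame still holds z
captured <- ls(environment(fml_local))
print(captured)

# object.size() does not include the contents of attached environments,
# so both formulas look small
size_local <- as.numeric(object.size(fml_local))
size_global <- as.numeric(object.size(fml_global))
print(c(local = size_local, global = size_global))

# serialize() includes the captured frame, whereas the global environment
# is stored only as a reference
bytes_local <- length(serialize(fml_local, NULL))
bytes_global <- length(serialize(fml_global, NULL))
print(c(local = bytes_local, global = bytes_global))
```

Exact byte counts will differ between R versions, so the sketch prints them rather than assuming particular values.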

To avoid this, clean up the environment where you created a formula before saving the formula. Don't delete things that the formula refers to (because glm may need those), but do delete irrelevant things (like z in my example). See:

g <- function() {
   z <- rnorm(1000000)
   formula3 <- y ~ x
   rm(z)
   save(formula3, file = "formula3.Rdata")
}
g()

This gives formula3.Rdata of 144 bytes.
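For the glm case in the question, an alternative to rm() might be to repoint the environments held by the fitted object before saving. The following is only a sketch under some assumptions: make_fit is a made-up helper, the data are passed through data= (so that fit$data is a data frame rather than an environment), and the model does not need to look anything up in the function frame later on.

```r
# Hypothetical helper: fits a small glm inside a frame that also holds ballast
make_fit <- function() {
  z <- rnorm(1e6)  # irrelevant ballast living in the function frame
  d <- data.frame(x = 1:10, y = c(0, 1, 0, 1, 1, 0, 1, 0, 1, 1))
  glm(y ~ x, family = binomial, data = d)
}

fit <- make_fit()
size_before <- length(serialize(fit, NULL))

# The formula, the terms object, and the terms attribute of the model frame
# each carry a reference to the frame of make_fit(); point them elsewhere.
environment(fit$formula) <- globalenv()
attr(fit$terms, ".Environment") <- globalenv()
attr(attr(fit$model, "terms"), ".Environment") <- globalenv()

size_after <- length(serialize(fit, NULL))
print(c(before = size_before, after = size_after))
```

Other components of a fitted model may hold environments too, so it is worth comparing the serialized size before and after rather than assuming these three are the only ones.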

Demetricedemetris answered 2/4, 2018 at 0:38 Comment(0)

Do you find that you have the same problem when you name the arguments in your call to save?

I used:

subFn <- function(y, x){
  glmObject <- glm(y ~ x, family = "binomial")
  save(list = "glmObject", file = "FileName.RData")
}

mainFn <- function(y, x){
  subFn(y, x)
}

mainFn(y = rbinom(n = 10, size = 1, prob = 1 / 2), x = 1:10)

I saw that the file "FileName.RData" was created in my working directory. It is 6.6 KB in size.

I then use:

load("FileName.RData")

to load the contents, glmObject, to my global environment.
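As a quick check of why naming the arguments matters (a sketch only, writing to a temporary file so that nothing in the working directory is touched): save() treats every unnamed argument as the name of an object to save, so a call shaped like the one in the question, taken literally, should fail rather than write a file.

```r
x <- 1:3
out <- tempfile(fileext = ".RData")

# Unnamed second argument: interpreted as an object name, not as the file
unnamed <- tryCatch({
  save(x, "FileName.RData")
  "saved"
}, error = function(e) "error")
print(unnamed)

# Named file argument: writes the file as intended
save(x, file = out)
print(file.exists(out))
```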

Links answered 29/3, 2018 at 21:11 Comment(0)

Another approach would be setting the environment of the already-saved function to emptyenv() before re-saving.

Here is a short simulation to see what kind of ballast gets attached to functions and formulæ:

set.seed(1)
ballast1 <- runif(1000000)
sigma2 <- function(x, ...) x^2
runSimulation <- function(seed, ...) {
  set.seed(seed)
  ballast2 <- -runif(100000)
  design <- list(n = 500, sigma2 = function(x) sigma2(x, ...))
  return(list(result = mean(sigma2(ballast2)), design = design))
}
seeds <- 1:10
results <- lapply(seeds, function(s) runSimulation(seed = s, r = -1/3))
results$baseFun <- sigma2
save(results, file = "test.RData")
file.size("test.RData")  # 5333203 -- HUGE!

The output size is 5.1 MB (!). Let us start from a clean session, load the file, and see what was captured:

results[[1]]$design
# $n
# 500
# 
# $sigma2
# function(x) sigma2(x, ...)
# <environment: 0x5ff9dbb370f8>

environment(results[[1]]$design$sigma2)
# <environment: 0x5ff9dbb370f8>

ls(envir = environment(results[[1]]$design$sigma2))
# "ballast2" "design"   "seed"    

results$baseFun
# function(x, ...) x^2
# <bytecode: 0x5ff9e0efc820>

ls(envir = environment(results$baseFun))
# "results"

We see that the second ballast from the environment of runSimulation() was attached, but the first ballast from the global environment was not. So instead of removing the ballast before saving, as @user2554330 recommends, one can empty the environment afterwards, which helps when the ballast still has to be manipulated inside the function.

The shortest solution taking into account the known output structure would look like this:

results.compact <- lapply(results, function(x) {
  if (is.list(x)) environment(x$design$sigma2) <- emptyenv()
  return(x)})
save(results.compact, file = "test2.RData")
file.size("test2.RData")
# 756

This method allows the user to clean up existing results that were saved with heavy environments attached and then re-save them compactly.
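One caveat worth checking before relying on this: a function whose environment is emptyenv() can no longer be called, because even operators such as ^ are looked up through the enclosing environments. If the stored function still has to run after loading, pointing it at globalenv() instead may be an option. This is a sketch of that variation, not the method used above, and the names sq, sq_empty and sq_global are made up:

```r
# A closure that drags along ballast from the environment it was created in
sq <- local({
  ballast <- runif(1e5)
  function(x) x^2
})
bytes_full <- length(serialize(sq, NULL))

# Variant 1: empty environment. Compact, but the function cannot be called.
sq_empty <- sq
environment(sq_empty) <- emptyenv()
bytes_empty <- length(serialize(sq_empty, NULL))
call_result <- tryCatch(sq_empty(2), error = function(e) conditionMessage(e))
print(call_result)

# Variant 2: global environment. Also compact (stored as a reference),
# and the function still works.
sq_global <- sq
environment(sq_global) <- globalenv()
bytes_global <- length(serialize(sq_global, NULL))
print(sq_global(2))

print(c(full = bytes_full, empty = bytes_empty, global = bytes_global))
```

With globalenv(), any free variables in the function body are resolved in whatever the global environment contains at call time, so this only suits functions that do not depend on their original frame.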

Azotemia answered 25/6, 2024 at 22:57 Comment(0)
