How to minimize size of object of class "lm" without compromising it being passed to predict()

Asked 20/2, 2014 at 1:30 Answered 11/1 at 9:26

I want to run lm() on a large dataset with 50M+ observations and 2 predictors. The analysis is run on a remote server with only 10GB for storing the data. I have tested lm() on 10K observations sampled from the data, and the resulting object was over 2GB in size.

I need the object of class "lm" returned from lm() ONLY to produce the summary statistics of the model (summary(lm_object)) and to make predictions (predict(lm_object)).

I have experimented with the options model, x, y and qr of lm(). If I set them all to FALSE, I reduce the size by 38%:

library(MASS)
fit1=lm(medv~lstat,data=Boston)
size1 <- object.size(fit1)
print(size1, units = "Kb")
# 127.4 Kb
fit2=lm(medv~lstat,data=Boston,model=F,x=F,y=F,qr=F)
size2 <- object.size(fit2)
print(size2, units = "Kb")
# 78.5 Kb
- ((as.integer(size1) - as.integer(size2)) / as.integer(size1)) * 100
# -38.37994

but

summary(fit2)
# Error in qr.lm(object) : lm object does not have a proper 'qr' component.
#  Rank zero or should not have used lm(.., qr=FALSE).
predict(fit2,newdata=Boston)
# Error in qr.lm(object) : lm object does not have a proper 'qr' component.
#  Rank zero or should not have used lm(.., qr=FALSE).

Apparently I need to keep qr=TRUE, which reduces the object size by only 9% compared with the default object:

fit3=lm(medv~lstat,data=Boston,model=F,x=F,y=F,qr=T)
size3 <- object.size(fit3)
print(size3, units = "Kb")
# 115.8 Kb
- ((as.integer(size1) - as.integer(size3)) / as.integer(size1)) * 100
# -9.142752

How do I bring the size of the "lm" object to a minimum without dumping a lot of unneeded information in memory and storage?

Thain answered 20/2, 2014 at 1:30 Comment(7)

+1 Interesting question. You haven't tried toggling each of the options yourself yet? By the way, it's safer to write out TRUE and FALSE, as you may forget and make variables with those names later. – Dillie 20/2, 2014 at 1:34

I'm sure you'll find your answer in #15260929 or in one of the questions linked there – Robyn 20/2, 2014 at 1:35

I don't see how lm using only 10000 observations can result in a 2GB object. How many columns are there in your dataset? – Insistence 20/2, 2014 at 2:1

@HongOoi I use two predictors in the model. I think the dataset including variables I don't model has 5 columns – Thain 20/2, 2014 at 2:38

There is no way a 10000x5 dataset can result in a 2GB object. I'd check to make sure you're not including big environments by accident. Are you calling lm from inside another function, which manipulates your big dataset? – Insistence 20/2, 2014 at 4:8

No, this is the function: lm(response~predictor1+predictor2,data=predictors) – Thain 20/2, 2014 at 6:48

someone might want to write an answer referring to cran.r-project.org/web/packages/butcher/index.html – Jape 8/1, 2020 at 1:39

The link here provides a relevant answer (for a glm object, which is very similar to the lm output object).

http://www.win-vector.com/blog/2014/05/trimming-the-fat-from-glm-models-in-r/

Basically, predict only uses the coefficient part, which is a very small portion of the glm output. The function below (copied from the link) trims information that will not be used by predict.

It does have a caveat, though. After trimming, the object can't be used by summary(fit) or other summary functions, since those functions need more than what predict requires.

cleanModel1 = function(cm) {
  # just in case we forgot to set
  # y=FALSE and model=FALSE
  cm$y = c()
  cm$model = c()

  cm$residuals = c()
  cm$fitted.values = c()
  cm$effects = c()
  cm$qr$qr = c()
  cm$linear.predictors = c()
  cm$weights = c()
  cm$prior.weights = c()
  cm$data = c()
  cm
}
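
Before trimming, it may help to check which components actually take up the space. The sketch below is an illustration only: it uses the built-in trees data rather than the asker's data, and the variable names fit and sizes are made up for the example. Note that object.size() does not count environments attached to the object, so the terms environment discussed elsewhere on this page will not show up in these numbers.

```r
# Fit a small example model on a built-in dataset (illustrative only).
fit <- lm(Volume ~ Girth + Height, data = trees)

# Size in bytes of each top-level component, largest first.
sizes <- sort(sapply(fit, object.size), decreasing = TRUE)
print(sizes)

# Share of the total attributable to each component.
print(round(sizes / sum(sizes), 3))
```

Components such as model, qr, residuals and fitted.values grow with the number of observations, which is why they are the usual candidates for removal.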

Ethos answered 3/5, 2016 at 20:52 Comment(2)

That article claims 99.7% reduction (for small models) to 99.985% reduction (large). Also, do summary(fit2) and save to text file before trimming down the model. – Zoraidazorana 13/5, 2017 at 5:50

I recently tested this and found that the elements or subelements of the output object can be downsized further: just try emptying each one in turn and check whether predict() still works on the result. On the other hand, the only useful part of the object is fit$coefficients. In a bootstrap exercise I re-fitted the model 1000 times and saved only the coefficients for forecasting, which saves a lot more memory than saving 1000 glm result objects. – Ethos 14/5, 2017 at 20:34
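
As a rough illustration of the "keep only the coefficients" idea from the comment above, the sketch below predicts by hand from the coefficient vector and the model terms. It assumes a full-rank lm fit with numeric predictors only (no NA coefficients, no factor levels to carry over); the trees data and the names beta, tt and X are chosen purely for the example.

```r
fit <- lm(Volume ~ Girth + Height, data = trees)

# The only pieces kept for forecasting: coefficients and the
# right-hand side of the model terms.
beta <- coef(fit)
tt <- delete.response(terms(fit))

newdata <- data.frame(Girth = c(10, 12), Height = c(70, 80))

# Build the design matrix for the new rows and multiply by the coefficients.
X <- model.matrix(tt, newdata)
pred_manual <- drop(X %*% beta)

# Compare against predict() on the full object.
pred_ref <- predict(fit, newdata = newdata)
print(all.equal(unname(pred_manual), unname(pred_ref)))
```

The terms object still carries a reference to the environment the formula was created in, so it is subject to the same environment issue raised elsewhere on this page.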

The answer of xappp is nice but not the whole story. There is also a potentially huge environment attached to the model's terms that you can do something about (see: https://blogs.oracle.com/R/entry/is_the_size_of_your)

Either add this to xappp's function

     e <- attr(cm$terms, ".Environment")
     parent.env(e) <- emptyenv()
     rm(list=ls(envir=e), envir=e)

Or use this version, which removes less data but still allows you to use summary():

     cleanModel1 = function(cm) {
       # just in case we forgot to set
       # y=FALSE and model=FALSE
       cm$y = c()
       cm$model = c()

       e <- attr(cm$terms, ".Environment")
       parent.env(e) <- emptyenv()
       rm(list=ls(envir=e), envir=e)
       cm
     }

Equitable answered 2/11, 2016 at 13:54 Comment(1)

You should only use this if you fitted a model using some other function. With normal lm use in the global environment this will start deleting all kinds of objects on your search path. – Suzan 7/1, 2020 at 23:37
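
Given the caveat in the comment above, a gentler option might be to re-point the captured environment at the global environment instead of emptying it, so nothing gets deleted. The following is only a sketch under that assumption: fit_in_function and big are hypothetical names, and serialize() is used to measure size because object.size() does not count attached environments.

```r
# A model fitted inside a function captures that function's environment,
# including large local objects that have nothing to do with the model.
fit_in_function <- function() {
  big <- rnorm(1e6)  # about 8 MB of unrelated local data
  lm(Volume ~ Girth + Height, data = trees)
}

fit <- fit_in_function()
newdata <- data.frame(Girth = c(10, 12), Height = c(70, 80))
pred_before <- predict(fit, newdata = newdata)
size_before <- length(serialize(fit, NULL))

# Re-point both copies of the terms at the global environment.
# Nothing is removed from any environment.
attr(fit$terms, ".Environment") <- globalenv()
attr(attr(fit$model, "terms"), ".Environment") <- globalenv()

pred_after <- predict(fit, newdata = newdata)
size_after <- length(serialize(fit, NULL))

print(c(before = size_before, after = size_after))
print(identical(pred_before, pred_after))
```

This only matters when the model is saved or serialized; the formula must not rely on variables that existed only in the original function environment.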

I'm trying to deal with the same issue as well. What I use is not perfect for other purposes but works for predict: you can basically remove the qr slot of the qr slot in the lm object:

lmFull <- lm(Volume~Girth+Height,data=trees)
lmSlim <- lmFull
lmSlim$fitted.values <- lmSlim$qr$qr <- lmSlim$residuals <- lmSlim$model <- lmSlim$effects <- NULL
pred1 <- predict(lmFull,newdata=data.frame(Girth=c(1,2,3),Height=c(2,3,4)))
pred2 <- predict(lmSlim,newdata=data.frame(Girth=c(1,2,3),Height=c(2,3,4)))
identical(pred1,pred2)
[1] TRUE

as.numeric((object.size(lmFull) - object.size(lmSlim)) / object.size(lmFull))
[1] 0.6550523
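
Since a slimmed object like this may no longer work with summary(), one possibility (in line with a comment earlier on this page) is to extract the summary statistics first and keep them separately. This is a sketch only; the list name kept and the choice of which statistics to store are assumptions made for illustration.

```r
fit <- lm(Volume ~ Girth + Height, data = trees)

# Keep the summary pieces of interest while the full object still exists.
s <- summary(fit)
kept <- list(coefficients = coef(s), r.squared = s$r.squared, sigma = s$sigma)

# Slim the model by dropping the same components as in the answer above.
slim <- fit
slim$fitted.values <- slim$qr$qr <- slim$residuals <- slim$model <- slim$effects <- NULL

# Predictions on new data are unchanged, and the stored table is still available.
newdata <- data.frame(Girth = c(10, 12), Height = c(70, 80))
same_pred <- identical(predict(fit, newdata = newdata),
                       predict(slim, newdata = newdata))
print(same_pred)
print(kept$coefficients)
```

The kept list is a few hundred bytes regardless of the number of observations, so it can be saved alongside the slimmed model.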

Starobin answered 25/2, 2014 at 8:26 Comment(0)

If you are using caret to train the model as 'lm' or 'glm', you can use the model's own trim method to shrink it:

model = caret::train(X,y,method = 'glm')
model$finalModel = model$modelInfo$trim(model$finalModel)

This shrank my model from 120 Mb to 4 Mb without hurting the prediction function.

Goods answered 11/1 at 9:26 Comment(0)
