I want to run `lm()` on a large dataset with 50M+ observations and 2 predictors. The analysis runs on a remote server with only 10GB available for storing the data. I have tested `lm()` on 10K observations sampled from the data, and the resulting object had a size of 2GB+.
I need the object of class `"lm"` returned from `lm()` ONLY to produce the summary statistics of the model (`summary(lm_object)`) and to make predictions (`predict(lm_object)`).
I have done some experiments with the options `model`, `x`, `y`, and `qr` of `lm()`. If I set them all to `FALSE`, I reduce the size by 38%:
```r
library(MASS)
fit1 <- lm(medv ~ lstat, data = Boston)
size1 <- object.size(fit1)
print(size1, units = "Kb")
# 127.4 Kb
fit2 <- lm(medv ~ lstat, data = Boston, model = FALSE, x = FALSE, y = FALSE, qr = FALSE)
size2 <- object.size(fit2)
print(size2, units = "Kb")
# 78.5 Kb
- ((as.integer(size1) - as.integer(size2)) / as.integer(size1)) * 100
# -38.37994
```
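To see which parts of the fit account for the remaining bytes, a quick diagnostic (my addition, not part of the original experiment) is to size each component of the fitted object individually:

```r
# Rough per-component breakdown; sizes are approximate because
# object.size() double-counts shared memory and skips environments.
sort(sapply(fit1, object.size), decreasing = TRUE)
```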
but

```r
summary(fit2)
# Error in qr.lm(object) : lm object does not have a proper 'qr' component.
# Rank zero or should not have used lm(.., qr=FALSE).
predict(fit2, newdata = Boston)
# Error in qr.lm(object) : lm object does not have a proper 'qr' component.
# Rank zero or should not have used lm(.., qr=FALSE).
```
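Both errors come from the internal helper `stats:::qr.lm()`, which `summary.lm()` and `predict.lm()` call to retrieve the stored decomposition; it does nothing more than return the `$qr` component and fail when it is absent. A simplified sketch of that check:

```r
# Simplified sketch of stats:::qr.lm(): just look up the stored $qr component.
qr_lm_sketch <- function(object) {
  if (is.null(r <- object$qr))
    stop("lm object does not have a proper 'qr' component.\n",
         " Rank zero or should not have used lm(.., qr=FALSE).")
  r
}
```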
Apparently I need to keep `qr = TRUE`, which reduces the object size by only 9% compared with the default object:
```r
fit3 <- lm(medv ~ lstat, data = Boston, model = FALSE, x = FALSE, y = FALSE, qr = TRUE)
size3 <- object.size(fit3)
print(size3, units = "Kb")
# 115.8 Kb
- ((as.integer(size1) - as.integer(size3)) / as.integer(size1)) * 100
# -9.142752
```
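For scale, a hedged illustration of my own (not part of the original measurements): the retained `$qr` component stores a decomposition matrix with one row per observation, so with 50M rows it grows linearly with n, just like the residuals, fitted values, and effects that `lm()` always keeps:

```r
# Components of fit3 that scale with the number of observations n:
dim(fit3$qr$qr)             # n x p matrix underlying the QR decomposition
length(fit3$residuals)      # n
length(fit3$fitted.values)  # n
length(fit3$effects)        # n
```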
How do I bring the size of the `"lm"` object to a minimum, without keeping a lot of unneeded information in memory and storage?
Comments:

- `lm` using only 10000 observations can result in a 2GB object. How many columns are there in your dataset? – Insistence
- Are you calling `lm` from inside another function, which manipulates your big dataset? – Insistence
- `lm(response ~ predictor1 + predictor2, data = predictors)` – Thain
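The second comment is worth unpacking with a hedged sketch (`make_fit` and `big_local` are hypothetical names of mine): a formula keeps a reference to the environment in which it was created, and while `object.size()` does not recurse into environments, serializing the fit (e.g. with `saveRDS()`) drags the whole captured frame along, which can explain a multi-GB object from only 10K rows:

```r
# Hedged sketch: the formula captures make_fit()'s environment,
# including the unrelated big_local vector.
make_fit <- function() {
  big_local <- rnorm(1e6)            # ~8 MB living in the captured frame
  lm(medv ~ lstat, data = Boston)
}
fit <- make_fit()
environment(fit$terms)               # make_fit()'s frame, not globalenv()
length(serialize(fit, NULL))         # serialized size includes big_local
```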