How to make group_by and lm fast?
Here is a sample:

library(dplyr)

df <- tibble(
  subject = rep(letters[1:7], c(5, 6, 7, 5, 2, 5, 2)),
  day = c(3:7, 2:7, 1:7, 3:7, 6:7, 3:7, 6:7),
  x1 = runif(32), x2 = rpois(32, 3), x3 = rnorm(32), x4 = rnorm(32, 1, 5))

df %>%
  group_by(subject) %>%
  summarise(
    coef_x1 = lm(x1 ~ day)$coefficients[2],
    coef_x2 = lm(x2 ~ day)$coefficients[2],
    coef_x3 = lm(x3 ~ day)$coefficients[2],
    coef_x4 = lm(x4 ~ day)$coefficients[2])

This data is small, so performance is not a problem.

But my actual data is large (approximately 1,000,000 rows and 200,000 subjects), and this code is very slow.

I think the cause is not lm's speed itself but the large number of subjects and all the subsetting.

Bunco asked 20/8, 2018 at 2:53. Comment:
The fastest way would be to not use lm at all, but to implement it in Rcpp with a list of indices. (Phrasing)

In theory

Firstly, you can fit a linear model with a multiple LHS (several response variables at once).

Secondly, explicit data splitting is not the only way (or a recommended way) for group-by regression. See R regression analysis: analyzing data for a certain ethnicity and R: build separate models for each category. So build your model as cbind(x1, x2, x3, x4) ~ day * subject where subject is a factor variable.
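
On the small sample above, these two points combine into one call (a sketch; with 200,000 subject levels this dense fit becomes infeasible, which is the next point):

fit <- lm(cbind(x1, x2, x3, x4) ~ day * subject, data = df)
dim(coef(fit))  ## 14 x 4: one row per model term, one column per response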

Finally, since you have many factor levels and work with a large dataset, lm is infeasible. Consider using speedglm::speedlm with sparse = TRUE, or MatrixModels::glm4 with sparse = TRUE.


In reality

Neither speedlm nor glm4 is under active development. Their functionality is (in my view) primitive.

Neither speedlm nor glm4 supports a multiple LHS as lm does. So you need to fit 4 separate models, x1 ~ day * subject through x4 ~ day * subject, instead.
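
A minimal sketch of that loop, assuming the DF built in the test section below and the sparse-friendly formula derived further down:

responses <- paste0("y", 1:4)
## fit one sparse model per response column
fits <- lapply(responses, function(v)
  MatrixModels::glm4(as.formula(paste(v, "~ 0 + subject + day:subject")),
                     data = DF, sparse = TRUE))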

These two packages have different logic behind sparse = TRUE.

  • speedlm first constructs a dense design matrix using the standard model.matrix.default, then uses is.sparse to check whether it is sparse. If TRUE, subsequent computations can use sparse methods.
  • glm4 uses model.Matrix to construct a design matrix, and a sparse one can be built directly.

So it is not surprising that speedlm is just as bad as lm on this sparsity issue, and glm4 is the one we really want to use.
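
To see the difference directly (a sketch, assuming the DF from the test section below): model.Matrix can return a design matrix that is sparse from the start, never densified:

library(MatrixModels)
X <- model.Matrix(~ 0 + subject + day:subject, data = DF, sparse = TRUE)
class(X)  ## "dsparseModelMatrix": constructed sparse, never densified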

glm4 does not have a full, useful set of generic functions for analyzing fitted models. You can extract coefficients, fitted values and residuals with coef, fitted and residuals, but you have to compute all statistics (standard error, t-statistic, F-statistic, etc.) yourself. This is not a big deal for people who know regression theory well, but it is still rather inconvenient.
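
For instance, standard errors and t-statistics can be recovered from regression theory alone (a sketch, assuming the fit and the sparse model matrix X from the surrounding sketches; none of this is a glm4 API). For this formula, crossprod(X) is essentially block-diagonal after reordering, so its inverse stays sparse:

library(Matrix)
rss    <- sum(residuals(fit)^2)
sigma2 <- rss / (nrow(X) - ncol(X))      ## residual variance estimate
V      <- sigma2 * solve(crossprod(X))   ## coefficient covariance matrix
se     <- sqrt(diag(V))                  ## standard errors
tstat  <- coef(fit) / se                 ## t-statistics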

glm4 still expects you to use the best model formula so that the sparsest matrix can be constructed. The conventional ~ day * subject is really not a good one. I should probably set up a Q & A on this issue later. Basically, if your formula has an intercept and factors are contrasted, you lose sparsity. This is the one we should use: ~ 0 + subject + day:subject.
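
A quick way to verify the sparsity claim on a toy factor, counting nonzero entries in the two design matrices:

f <- gl(3, 2)                               ## 3 levels, 2 rows each
t <- rep(1:2, 3)
Matrix::nnzero(model.matrix(~ t * f))       ## 20: intercept and contrasts cost sparsity
Matrix::nnzero(model.matrix(~ 0 + f + t:f)) ## 12: block-diagonal, much sparser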


A test with glm4

## use chinsoon12's data in his answer
set.seed(0L)
nSubj <- 200e3
nr <- 1e6
DF <- data.frame(subject = gl(nSubj, 5),
                 day = 3:7,
                 y1 = runif(nr), 
                 y2 = rpois(nr, 3), 
                 y3 = rnorm(nr), 
                 y4 = rnorm(nr, 1, 5))

library(MatrixModels)
fit <- glm4(y1 ~ 0 + subject + day:subject, data = DF, sparse = TRUE)

It takes about 6 ~ 7 seconds on my 1.1GHz Sandy Bridge laptop. Let's extract its coefficients:

b <- coef(fit)

head(b)
#  subject1   subject2   subject3   subject4   subject5   subject6 
# 0.4378952  0.3582956 -0.2597528  0.8141229  1.3337102 -0.2168463 

tail(b)
#subject199995:day subject199996:day subject199997:day subject199998:day 
#      -0.09916175       -0.15653402       -0.05435883       -0.02553316 
#subject199999:day subject200000:day 
#       0.02322640       -0.09451542 

You can do B <- matrix(b, ncol = 2), so that the 1st column holds the intercepts and the 2nd the slopes.
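
For example, with subject levels as row names (the column order follows the coefficient vector: intercepts first, then slopes):

B <- matrix(b, ncol = 2,
            dimnames = list(levels(DF$subject), c("intercept", "slope")))
head(B)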


My thoughts: We probably need better packages for large regression

Using glm4 here does not give an attractive advantage over chinsoon12's data.table solution, since it too basically just gives you the regression coefficients. It is also a bit slower than the data.table method, because it computes fitted values and residuals as well.

Analysis of simple regression does not require a proper model-fitting routine. I have a few answers on doing fancy stuff with this kind of regression, like Fast pairwise simple linear regression between all variables in a data frame, where details on how to compute all statistics are also given. But as I wrote this answer, I was thinking about the large regression problem in general. We might need better packages; otherwise there is no end to case-by-case coding.


In reply to OP

speedglm::speedlm(x1 ~ 0 + subject + day:subject, data = df, sparse = TRUE) gives Error: cannot allocate vector of size 74.5 Gb

Yeah, because it has bad sparse logic.

MatrixModels::glm4(x1 ~ day * subject, data = df, sparse = TRUE) gives Error in Cholesky(crossprod(from), LDL = FALSE) : internal_chm_factor: Cholesky factorization failed

This is because you have only one datum for some subject. You need at least two data points to fit a line. Here is an example (in dense settings):

dat <- data.frame(t = c(1:5, 1:9, 1),
                  f = rep(gl(3,1,labels = letters[1:3]), c(5, 9, 1)),
                  y = rnorm(15))

Level "c" of f only has one datum / row.

X <- model.matrix(~ 0 + f + t:f, dat)
XtX <- crossprod(X)
chol(XtX)
#Error in chol.default(XtX) : 
#  the leading minor of order 6 is not positive definite

Cholesky factorization is unable to resolve a rank-deficient model. If we use lm's QR factorization, we will see an NA coefficient.

lm(y ~ 0 + f + t:f, dat)
#Coefficients:
#      fa        fb        fc      fa:t      fb:t      fc:t  
# 0.49893   0.52066  -1.90779  -0.09415  -0.03512        NA  

We can only estimate an intercept for level "c", not a slope.

Note that if you use the data.table solution, you end up with 0 / 0 when computing the slope for this level, and the final result is NaN.
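
You can verify the 0 / 0 by plugging a single observation into the slope formula from that answer:

day <- 6; y <- 0.5; n <- 1
(sum(day * y) - sum(day) * sum(y) / n) / (sum(day^2) - sum(day)^2 / n)
## NaN: numerator and denominator are both exactly 0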


Update: fast solution now available

Check out Fast group-by simple linear regression and general paired simple linear regression.

Nazarite answered 20/8, 2018 at 3:23. Comments:
speedglm::speedlm(x1 ~ day * subject, data = df, sparse = TRUE) gives Error: cannot allocate vector of size 74.5 Gb; MatrixModels::glm4(x1 ~ day * subject, data = df, sparse = TRUE) gives Error in Cholesky(crossprod(from), LDL = FALSE) : internal_chm_factor: Cholesky factorization failed. (Bunco)
It is a very helpful answer!! My R skills can't handle Rcpp yet, but I can see a new field that I have to study. I will study it from now on. (Bunco)

Since the OP appears to be looking only for the betas, here is an approach that uses the data.table package to calculate only the betas. See the reference for the formula; for a simple regression of y on x, slope = (sum(x*y) - sum(x)*sum(y)/n) / (sum(x^2) - sum(x)^2/n).

## per-subject building blocks of the slope formula
dt[, sumx := sum(day), by = .(subject)][,
    denom := sum(day^2) - sumx^2 / .N, by = .(subject)]

## slope of each response column y1..y4, by subject
dt[, lapply(.SD, function(y) (sum(day * y) - (sumx[1L] * sum(y)) / .N) / denom[1L]),
    by = .(subject),
    .SDcols = paste0("y", 1:4)]

data:

library(data.table)

set.seed(0L)
nSubj <- 200e3
nr <- 1e6
dt <- data.table(
    subject = rep(1:nSubj, each=5),
    day = 3:7,
    y1 = runif(nr), 
    y2 = rpois(nr, 3), 
    y3 = rnorm(nr), 
    y4 = rnorm(nr, 1, 5))
dt2 <- copy(dt)

reference: Inference in Linear Regression

In terms of intellectual understanding of linear regression, Lee Zheyuan's solution is the best (+1)


Adding some timings:

## alternative: slope as cov(day, y) / var(day), computed per group
system.time({
    dt[, lapply(.SD, function(y) cov(day, y) / var(day)),
        by = .(subject),
        .SDcols = paste0("y", 1:4)]
})
#   user  system elapsed 
#  73.96    0.00   74.15 


system.time({
    dt2[, sumx := sum(day), by = .(subject)][,
        denom := sum(day^2) - sumx^2 / .N, by = .(subject)]

    dt2[, lapply(.SD, function(y) (sum(day * y) - (sumx[1L] * sum(y)) / .N) / denom[1L]),
        by = .(subject),
        .SDcols = paste0("y", 1:4)]
})
#   user  system elapsed 
#   2.14    0.00    2.14 
Weariless answered 21/8, 2018 at 1:36. Comments:
@李哲源, I added some timings. It could technically be even faster, but I am not quite sure how to do it (i.e. turning on GForce). (Weariless)
@李哲源, yeah, I just saw it. It's quite over my head; I will need to learn linear regression properly. I have both #51954209 and #51996521 open but have not gone through them yet. (Weariless)
@李哲源, can I contact you via email regarding something about regression? (Weariless)
@李哲源, when will you be graduating? (Weariless)
