Is there a fast estimation of simple regression (a regression line with only intercept and slope)?

This question relates to a machine learning feature selection procedure.

I have a large matrix of features - columns are the features of the subjects (rows):

set.seed(1)
features.mat <- matrix(rnorm(10*100),ncol=100)
colnames(features.mat) <- paste("F",1:100,sep="")
rownames(features.mat) <- paste("S",1:10,sep="")

The response was measured for each subject (S) under different conditions (C) and therefore looks like this:

response.df <-
data.frame(S = c(sapply(1:10, function(x) rep(paste("S", x, sep = ""),100))),
           C = rep(paste("C", 1:100, sep = ""), 10),
           response = rnorm(1000), stringsAsFactors = F)

So I match the subjects in response.df:

match.idx <- match(response.df$S, rownames(features.mat))

I'm looking for a fast way to compute the univariate regression of the response on each feature.

Is there anything faster than this?

fun <- function(f){
  fit <- lm(response.df$response ~ features.mat[match.idx, f])
  beta <- coef(summary(fit))
  data.frame(feature = colnames(features.mat)[f], effect = beta[2, 1],
             p.val = beta[2, 4], stringsAsFactors = F)
  }

res <- do.call(rbind, lapply(1:ncol(features.mat), fun))

I am mainly interested in speeding up the per-regression computation itself, i.e., in methods other than parallel computing via mclapply or mclapply2.

Mosesmosey asked on 19 Oct 2016 at 21:22. Comments (7):

Ioneionesco: Try lm.fit (or even .lm.fit).
Mosesmosey: That's right, I only need Estimate and Pr(>|t|).
Mosesmosey: @Zheyuan Li, it seems that lm.chol fits an lm to all features in one model, whereas I'm looking for a fast way to fit an lm to each feature/target pair separately. Am I wrong?
Mosesmosey: Yep, my actual data has millions of features.
Euchromosome: speedlm.fit from the speedglm package may be useful if you have a lot of observations. It's pretty fast when you have millions of observations, and it returns a pretty rich object.
Euchromosome: @ZheyuanLi, in my quick benchmarks with your toy data, going from 50 observations to 1e6, then 5e6, then 1e7, speedlm.fit was a bit (~25%) faster whether I used eigen, Cholesky, or qr as the method. Anyway, I meant it as a helpful comment for the OP and/or future readers (whose goals may not be exactly the same).
Euchromosome: @ZheyuanLi, speedlm.fit was barely (~5%) faster in the 1e6 case, ~8% faster for 5e6, and ~25% faster for 1e7 (based on median time). I'm not sure what BLAS I'm using on this particular machine, but I have to head out for the evening.

Here is a lightweight toy routine for estimating a simple regression model y ~ x, i.e., a regression line with only an intercept and a slope. As we will see, it is about 36 times faster than lm + summary.lm.

## toy data
set.seed(0)
x <- runif(50)
y <- 0.3 * x + 0.1 + rnorm(50, sd = 0.05)

## fast estimation of simple linear regression: y ~ x 
simplelm <- function (x, y) {
  ## number of data
  n <- length(x)
  ## centring
  y0 <- sum(y) / length(y); yc <- y - y0
  x0 <- sum(x) / length(x); xc <- x - x0
  ## fitting an intercept-free model: yc ~ xc + 0
  xty <- c(crossprod(xc, yc))
  xtx <- c(crossprod(xc))
  slope <- xty / xtx
  rc <- yc - xc * slope
  ## Pearson estimate of residual variance (sigma^2)
  sigma2 <- c(crossprod(rc)) / (n - 2)
  ## standard error for slope
  slope_se <- sqrt(sigma2 / xtx)
  ## t-score and p-value for slope
  tscore <- slope / slope_se
  pvalue <- 2 * pt(abs(tscore), n - 2, lower.tail = FALSE)
  ## return estimation summary for slope
  c("Estimate" = slope, "Std. Error" = slope_se, "t value" = tscore, "Pr(>|t|)" = pvalue)
  }

Let's have a test:

simplelm(x, y)

#    Estimate   Std. Error      t value     Pr(>|t|) 
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15

On the other hand, lm + summary.lm gives:

coef(summary(lm(y ~ x)))

#             Estimate Std. Error   t value     Pr(>|t|)
#(Intercept) 0.1154549 0.01373051  8.408633 5.350248e-11
#x           0.2656737 0.02279663 11.654079 1.337380e-15

So the results match. If you need R-squared and adjusted R-squared, they can easily be computed as well.
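
For instance, here is a minimal sketch of how they could be obtained; rsq_simplelm is a hypothetical helper (not part of the routine above) that recomputes the centred quantities used inside simplelm:

## illustration only: R-squared and adjusted R-squared for y ~ x
rsq_simplelm <- function (x, y) {
  n <- length(x)
  yc <- y - sum(y) / n; xc <- x - sum(x) / n
  slope <- c(crossprod(xc, yc)) / c(crossprod(xc))
  rss <- c(crossprod(yc - xc * slope))   ## residual sum of squares
  tss <- c(crossprod(yc))                ## total sum of squares
  r2 <- 1 - rss / tss
  ## adjusted R-squared with a single slope parameter
  c("R-squared" = r2, "Adj. R-squared" = 1 - (1 - r2) * (n - 1) / (n - 2))
  }

rsq_simplelm(x, y)   ## compare with summary(lm(y ~ x))$r.squared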


Let's have a benchmark:

set.seed(0)
x <- runif(10000)
y <- 0.3 * x + 0.1 + rnorm(10000, sd = 0.05)

library(microbenchmark)

microbenchmark(coef(summary(lm(y ~ x))), simplelm(x, y))

#Unit: microseconds
#                     expr      min       lq       mean   median       uq      max neval cld
# coef(summary(lm(y ~ x))) 14158.28 14305.28 17545.1544 14444.34 17089.00 114662.2   100   b
#           simplelm(x, y)   235.08   265.72   485.4076   288.20   319.46   3409.6   100  a

Quite a boost: roughly 36 times faster (comparing mean times, 17545 / 485 ≈ 36)!
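
For the OP's per-feature screening, simplelm can simply replace lm inside the original loop. A minimal sketch, assuming features.mat, response.df and match.idx as defined in the question:

res <- do.call(rbind, lapply(seq_len(ncol(features.mat)), function (f) {
  est <- simplelm(features.mat[match.idx, f], response.df$response)
  data.frame(feature = colnames(features.mat)[f],
             effect = unname(est["Estimate"]),
             p.val = unname(est["Pr(>|t|)"]),
             stringsAsFactors = FALSE)
  }))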


Remark-1 (solving the normal equations)

simplelm is based on solving the normal equations via Cholesky factorization. But since the model is so simple, no actual matrix computation is involved. If we need regression with multiple covariates, we can use the lm.chol function defined in this answer of mine.

The normal equations can also be solved with LU factorization. I will not touch on this here, but if you are interested, see: Solving normal equation gives different coefficients from using lm?.
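
For illustration, here is a minimal sketch of the normal-equation / Cholesky route for a general design matrix (a hypothetical chol_lm helper, not the lm.chol from the linked answer), using base R's chol, forwardsolve and backsolve:

## illustration only: least squares via the normal equations X'X b = X'y,
## with the Cholesky factorization X'X = R'R
chol_lm <- function (X, y) {
  XtX <- crossprod(X)        ## X'X
  Xty <- crossprod(X, y)     ## X'y
  R <- chol(XtX)             ## upper triangular R
  backsolve(R, forwardsolve(t(R), Xty))
  }

drop(chol_lm(cbind(1, x), y))   ## same coefficients as coef(lm(y ~ x))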

Remark-2 (alternative via cor.test)

simplelm is an extension of the fastsim function from my answer Monte Carlo simulation of correlation between two Brownian motion (continuous random walk). An alternative approach is based on cor.test. It is also much faster than lm + summary.lm, but as shown in that answer, it is still slower than my proposal above.
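
For completeness, here is a minimal sketch of such a cor.test-based alternative (a hypothetical simplelm_cor helper, not the fastsim from the linked answer). It relies on the fact that the t-test for the correlation coefficient is identical to the t-test for the slope in simple regression:

## illustration only: slope, t-score and p-value via cor.test
simplelm_cor <- function (x, y) {
  ct <- cor.test(x, y)
  slope <- unname(ct$estimate) * sd(y) / sd(x)   ## b = r * sd(y) / sd(x)
  tscore <- unname(ct$statistic)
  c("Estimate" = slope, "Std. Error" = slope / tscore,
    "t value" = tscore, "Pr(>|t|)" = ct$p.value)
  }

simplelm_cor(x, y)   ## matches simplelm(x, y) up to floating point error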

Remark-3 (alternative via QR method)

A QR-based method is also possible, in which case we want to use .lm.fit, a lightweight wrapper for qr.default, qr.coef, qr.fitted and qr.resid at the C level. Here is how we can add this option to our simplelm:

## fast estimation of simple linear regression: y ~ x 
simplelm <- function (x, y, QR = FALSE) {
  ## number of data
  n <- length(x)
  ## centring
  y0 <- sum(y) / length(y); yc <- y - y0
  x0 <- sum(x) / length(x); xc <- x - x0
  ## fitting intercept free model: yc ~ xc + 0
  if (QR) {
    fit <- .lm.fit(matrix(xc), yc)
    slope <- fit$coefficients
    rc <- fit$residuals
    } else {
    xty <- c(crossprod(xc, yc))
    xtx <- c(crossprod(xc))
    slope <- xty / xtx
    rc <- yc - xc * slope
    }
  ## Pearson estimate of residual variance (sigma^2)
  sigma2 <- c(crossprod(rc)) / (n - 2)
  ## standard error for slope
  if (QR) {
    slope_se <- sqrt(sigma2) / abs(fit$qr[1])
    } else {
    slope_se <- sqrt(sigma2 / xtx)
    }
  ## t-score and p-value for slope
  tscore <- slope / slope_se
  pvalue <- 2 * pt(abs(tscore), n - 2, lower.tail = FALSE)
  ## return estimation summary for slope
  c("Estimate" = slope, "Std. Error" = slope_se, "t value" = tscore, "Pr(>|t|)" = pvalue)
  }

For our toy data, the QR method and the Cholesky method give the same result:

set.seed(0)
x <- runif(50)
y <- 0.3 * x + 0.1 + rnorm(50, sd = 0.05)

simplelm(x, y, TRUE)

#    Estimate   Std. Error      t value     Pr(>|t|) 
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15 

simplelm(x, y, FALSE)

#    Estimate   Std. Error      t value     Pr(>|t|) 
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15

The QR method is known to be 2 to 3 times slower than the Cholesky method (read my answer Why the built-in lm function is so slow in R? for a detailed explanation). Here is a quick check:

set.seed(0)
x <- runif(10000)
y <- 0.3 * x + 0.1 + rnorm(10000, sd = 0.05)

library(microbenchmark)

microbenchmark(simplelm(x, y, TRUE), simplelm(x, y))

#Unit: microseconds
#                 expr    min     lq      mean median     uq     max neval cld
# simplelm(x, y, TRUE) 776.88 873.26 1073.1944 908.72 933.82 3420.92   100   b
#       simplelm(x, y) 238.32 292.02  441.9292 310.44 319.32 3515.08   100  a 

So indeed, 908 / 310 ≈ 2.93 (comparing median times).

Remark-4 (simple regression for GLM)

If we move on to GLMs, there is also a fast, lightweight version based on glm.fit. You can read my answer R loop help: leave out one observation and run glm one variable at a time and use the function f defined there. At the moment f is tailored to logistic regression, but it can easily be generalized to other response families.
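
To give a flavour of the approach, here is a minimal sketch of a per-pair logistic regression built on glm.fit (a hypothetical simpleglm helper, not the f from the linked answer), assuming a 0/1 response yb:

## illustration only: simple logistic regression yb ~ x via glm.fit,
## returning the slope's estimate, standard error, z-score and p-value
simpleglm <- function (x, yb) {
  X <- cbind(1, x)   ## intercept + single covariate
  fit <- glm.fit(X, yb, family = binomial())
  beta <- unname(fit$coefficients)
  ## unscaled covariance (X'WX)^(-1) from the IRLS working weights
  XtWX <- crossprod(X * fit$weights, X)
  se <- unname(sqrt(diag(chol2inv(chol(XtWX)))))
  z <- beta / se
  p <- 2 * pnorm(abs(z), lower.tail = FALSE)
  c("Estimate" = beta[2], "Std. Error" = se[2], "z value" = z[2], "Pr(>|z|)" = p[2])
  }

## quick check against glm + summary.glm
set.seed(0)
xb <- runif(200)
yb <- rbinom(200, 1, plogis(0.5 + 1.5 * xb))
simpleglm(xb, yb)
coef(summary(glm(yb ~ xb, family = binomial())))[2, ]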

Garica answered on 19 Oct 2016 at 22:38.
