Here is a lightweight toy routine for estimating a simple regression model y ~ x, i.e., a regression line with only an intercept and a slope. As will be seen, it is about 36 times faster than lm + summary.lm.
## toy data
set.seed(0)
x <- runif(50)
y <- 0.3 * x + 0.1 + rnorm(50, sd = 0.05)

## fast estimation of simple linear regression: y ~ x
simplelm <- function (x, y) {
  ## number of data
  n <- length(x)
  ## centring
  y0 <- sum(y) / length(y); yc <- y - y0
  x0 <- sum(x) / length(x); xc <- x - x0
  ## fitting an intercept-free model: yc ~ xc + 0
  xty <- c(crossprod(xc, yc))
  xtx <- c(crossprod(xc))
  slope <- xty / xtx
  rc <- yc - xc * slope
  ## Pearson estimate of residual standard error
  sigma2 <- c(crossprod(rc)) / (n - 2)
  ## standard error for slope
  slope_se <- sqrt(sigma2 / xtx)
  ## t-score and p-value for slope
  tscore <- slope / slope_se
  pvalue <- 2 * pt(abs(tscore), n - 2, lower.tail = FALSE)
  ## return estimation summary for slope
  c("Estimate" = slope, "Std. Error" = slope_se, "t value" = tscore, "Pr(>|t|)" = pvalue)
}
Let's have a test:
simplelm(x, y)
# Estimate Std. Error t value Pr(>|t|)
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15
On the other hand, lm + summary.lm gives:
coef(summary(lm(y ~ x)))
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1154549 0.01373051 8.408633 5.350248e-11
#x 0.2656737 0.02279663 11.654079 1.337380e-15
So the results match. If you require R-squared and adjusted R-squared, they can be computed easily from the same quantities, too.
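For example, here is a quick sketch (the helper name rsq_simplelm is purely illustrative) of how both could be obtained from the centred quantities already computed in simplelm:

## sketch: R-squared and adjusted R-squared for y ~ x
rsq_simplelm <- function (x, y) {
  n <- length(x)
  ## centring, slope and residuals as in simplelm
  yc <- y - sum(y) / n; xc <- x - sum(x) / n
  slope <- c(crossprod(xc, yc)) / c(crossprod(xc))
  rc <- yc - xc * slope
  RSS <- c(crossprod(rc)); TSS <- c(crossprod(yc))
  r.squared <- 1 - RSS / TSS
  ## adjust for the 2 estimated coefficients (intercept and slope)
  adj.r.squared <- 1 - (1 - r.squared) * (n - 1) / (n - 2)
  c("r.squared" = r.squared, "adj.r.squared" = adj.r.squared)
}

For the toy data this should agree with the r.squared and adj.r.squared reported by summary(lm(y ~ x)).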
Let's have a benchmark:
set.seed(0)
x <- runif(10000)
y <- 0.3 * x + 0.1 + rnorm(10000, sd = 0.05)
library(microbenchmark)
microbenchmark(coef(summary(lm(y ~ x))), simplelm(x, y))
#Unit: microseconds
#                     expr      min       lq       mean   median       uq      max neval cld
# coef(summary(lm(y ~ x))) 14158.28 14305.28 17545.1544 14444.34 17089.00 114662.2   100   b
#           simplelm(x, y)   235.08   265.72   485.4076   288.20   319.46   3409.6   100  a
Holy!!! That is about a 36-fold speed boost!
Remark-1 (solving the normal equations)

The simplelm is based on solving the normal equations via Cholesky factorization. But since the model is so simple, no actual matrix computation is involved. If we need regression with multiple covariates, we can use the lm.chol defined in this answer of mine.

The normal equations can also be solved with LU factorization. I will not touch on that here, but if you are interested, see: Solving normal equation gives different coefficients from using lm?.
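For illustration only, here is a minimal sketch of the Cholesky idea for a general model matrix; this is not the lm.chol from the linked answer, just the bare normal-equations solve:

## sketch: solve the normal equations X'X b = X'y via Cholesky (X'X = R'R)
chol_coef <- function (X, y) {
  R <- chol(crossprod(X))                                ## upper triangular R
  backsolve(R, forwardsolve(t(R), crossprod(X, y)))      ## solve R'z = X'y, then R b = z
}
## e.g. chol_coef(cbind(1, x), y) should agree with coef(lm(y ~ x))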
Remark-2 (alternative via cor.test)

The simplelm is an extension of the fastsim in my answer Monte Carlo simulation of correlation between two Brownian motion (continuous random walk). An alternative approach is based on cor.test. It is also much faster than lm + summary.lm, but as shown in that answer, it is still slower than my proposal above.
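For reference, here is a rough sketch of the cor.test route (the name simplelm_cor is mine, not code from that answer). In simple regression the slope equals r * sd(y) / sd(x), and the t-statistic of the correlation test is exactly the t-statistic of the slope:

## sketch: slope inference recovered from cor.test
simplelm_cor <- function (x, y) {
  ct <- cor.test(x, y)
  slope <- unname(ct$estimate) * sd(y) / sd(x)   ## r * sd(y) / sd(x)
  tscore <- unname(ct$statistic)                 ## same t as the slope's t-test
  c("Estimate" = slope, "Std. Error" = slope / tscore,
    "t value" = tscore, "Pr(>|t|)" = ct$p.value)
}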
Remark-3 (alternative via QR method)

A QR-based method is also possible, in which case we want to use .lm.fit, a lightweight wrapper for qr.default, qr.coef, qr.fitted and qr.resid at C level. Here is how we can add this option to our simplelm:
## fast estimation of simple linear regression: y ~ x
simplelm <- function (x, y, QR = FALSE) {
  ## number of data
  n <- length(x)
  ## centring
  y0 <- sum(y) / length(y); yc <- y - y0
  x0 <- sum(x) / length(x); xc <- x - x0
  ## fitting an intercept-free model: yc ~ xc + 0
  if (QR) {
    fit <- .lm.fit(matrix(xc), yc)
    slope <- fit$coefficients
    rc <- fit$residuals
  } else {
    xty <- c(crossprod(xc, yc))
    xtx <- c(crossprod(xc))
    slope <- xty / xtx
    rc <- yc - xc * slope
  }
  ## Pearson estimate of residual standard error
  sigma2 <- c(crossprod(rc)) / (n - 2)
  ## standard error for slope
  if (QR) {
    slope_se <- sqrt(sigma2) / abs(fit$qr[1])
  } else {
    slope_se <- sqrt(sigma2 / xtx)
  }
  ## t-score and p-value for slope
  tscore <- slope / slope_se
  pvalue <- 2 * pt(abs(tscore), n - 2, lower.tail = FALSE)
  ## return estimation summary for slope
  c("Estimate" = slope, "Std. Error" = slope_se, "t value" = tscore, "Pr(>|t|)" = pvalue)
}
For our toy data, both the QR method and the Cholesky method give the same result:
set.seed(0)
x <- runif(50)
y <- 0.3 * x + 0.1 + rnorm(50, sd = 0.05)
simplelm(x, y, TRUE)
# Estimate Std. Error t value Pr(>|t|)
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15
simplelm(x, y, FALSE)
# Estimate Std. Error t value Pr(>|t|)
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15
The QR method is known to be 2~3 times slower than the Cholesky method (read my answer Why the built-in lm function is so slow in R? for a detailed explanation). Here is a quick check:
set.seed(0)
x <- runif(10000)
y <- 0.3 * x + 0.1 + rnorm(10000, sd = 0.05)
library(microbenchmark)
microbenchmark(simplelm(x, y, TRUE), simplelm(x, y))
#Unit: microseconds
# expr min lq mean median uq max neval cld
# simplelm(x, y, TRUE) 776.88 873.26 1073.1944 908.72 933.82 3420.92 100 b
# simplelm(x, y) 238.32 292.02 441.9292 310.44 319.32 3515.08 100 a
So indeed, 908 / 310 = 2.93.
Remark-4 (simple regression for GLM)

If we move on to GLMs, there is also a fast, lightweight version based on glm.fit. You can read my answer R loop help: leave out one observation and run glm one variable at a time and use the function f defined there. At the moment f is customized to logistic regression, but we can generalize it to other response families easily.
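As an illustration only (my own sketch, not the f from that answer), a logistic regression y ~ x with a 0/1 response can be fitted with glm.fit and summarized in the same spirit as summary.glm:

## sketch: logistic regression y ~ x via glm.fit (illustrative helper)
fastglm_logit <- function (x, y) {
  X <- cbind("(Intercept)" = 1, "x" = x)
  fit <- glm.fit(X, y, family = binomial())
  beta <- fit$coefficients
  ## unscaled covariance from the R factor of the weighted QR, as summary.glm does
  ## (assumes no column pivoting, which holds for this full-rank 2-column X)
  covmat <- chol2inv(fit$qr$qr[1:2, 1:2, drop = FALSE])
  se <- sqrt(diag(covmat))
  z <- beta / se
  p <- 2 * pnorm(abs(z), lower.tail = FALSE)
  cbind("Estimate" = beta, "Std. Error" = se, "z value" = z, "Pr(>|z|)" = p)
}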
Comments:

- lm.fit (or even .lm.fit)? – Ioneionesco
- speedlm.fit from the speedglm package may be useful if you have a lot of observations. It's pretty fast when you have millions of observations, and it returns a pretty rich object. – Euchromosome
- speedlm.fit was a bit (~25%) faster whether I used eigen, Cholesky, or qr as the method. Anyway, I meant it to be a helpful comment for the OP and/or future readers (whose goals may not be exactly the same). – Euchromosome
- speedlm.fit was barely (~5%) faster in the case of 1e6 observations, ~8% for 5e6, and ~25% faster for 1e7 (based on median time). I'm not sure what BLAS I'm using on this particular machine, but I have to head out for the evening. – Euchromosome