Run lm with multiple responses and weights

Now I would also like to add weights to the model that correspond to the accuracy of responses. Therefore, for each response vector I would need also different weight vector. However, the lm allows to enter the weights only as a vector not matrix. So I could not enter the weights in batch and would have to run lm for every response separately. This way the calculations would become much slower.

As explained in Fitting a linear model with multiple LHS, efficiency of "mlm" demands a shared model matrix for all LHS responses. However, weighted regression gives no reuse of model matrix, as for a different set of weights, both response y and model matrix X need be rescaled. Have a read on R: lm() result differs when using weights argument and when using manually reweighted data to see how weighted regression works.

Is there any way run these type of models in batch mode, without calling lm repeatedly?

It depends on what you want. If you need a full lmObject, then you have to call lm every time. If you only want coefficients, you may use .lm.fit. The 2nd link above has demonstrated the use of lm.fit, while the use of .lm.fit is almost identical. A simple template may be the following:

## weighted mlm, by specifying matrix directly
## `xmat`: non-weighted model matrix, manually created from `model.matrix`
## `ymat`: non-weighted response matrix
## `wmat`: matrix of weights

## all matrices must have the same number of rows (not checked)
## `ymat` and `wmat` must have the same number of columns (not checked)
## no `NA` values in any where is allowed (not checked)
## all elements of `wmat` must be strictly positive (not checked)

wmlm <- function (xmat, ymat, wmat) {
  N <- ncol(ymat)
  wmlmList <- vector("list", length = N)
  for (j in 1:N) {
    rw <- sqrt(wmat[, j])
    wmlmList[[j]] <- .lm.fit(rw * xmat, rw * ymat[, j])
    }
  return(wmlmList)
  }

Consider a simple example of its use:

## a toy dataset of 200 data with 3 numerical variables and 1 factor variable
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200), f = gl(5, 40, labels = letters[1:5]))

## consider a model `~ x1 + poly(x3, 3) + x2 * f`
## we construct the non-weighted model matrix
xmat <- model.matrix (~ x1 + poly(x3, 3) + x2 * f, dat)

## now let's assume we have 100 model responses as well as 100 sets of weights
ymat <- matrix(rnorm(200 * 100), 200)
wmat <- matrix(runif(200 * 100), 200)

## Let's call `wmlm`:
fit <- wmlm (xmat, ymat, wmat)

.lm.fit returns critical information for further model inference, and a complete lmObject would inherit most of these entries.

## take the first fitted model as an example
str(fit[[1]])
 #$ qr          : num [1:200, 1:14] -10.4116 0.061 0.0828 0.0757 0.0698 ...
 # ..- attr(*, "assign")= int [1:14] 0 1 2 2 2 3 4 4 4 4 ...
 # ..- attr(*, "contrasts")=List of 1
 # .. ..$ f: chr "contr.treatment"
 # ..- attr(*, "dimnames")=List of 2
 # .. ..$ : chr [1:200] "1" "2" "3" "4" ...
 # .. ..$ : chr [1:14] "(Intercept)" "x1" "poly(x3, 3)1" "poly(x3, 3)2" ...
 #$ coefficients: num [1:14] 0.1184 -0.0506 0.3032 0.1643 0.4269 ...
 #$ residuals   : num [1:200] -0.7311 -0.0795 -0.2495 0.4097 0.0495 ...
 #$ effects     : num [1:200] -0.351 -0.36 0.145 0.182 0.291 ...
 #$ rank        : int 14
 #$ pivot       : int [1:14] 1 2 3 4 5 6 7 8 9 10 ...
 #$ qraux       : num [1:14] 1.06 1.13 1.07 1.05 1.01 ...
 #$ tol         : num 1e-07
 #$ pivoted     : logi FALSE

Result of .lm.fit has no supported generic functions like summary, anova, predict, plot, etc. But inference of a linear model is easy, so it is straightforward to compute yourself (provided you know the theory behind):

t-statistic table for regression coefficients (from $qr);
F-statistic and ANOVA table (from $effects);
residual standard error, R-squared and adjusted R-squared (from $residulas and $rank).

Finally, I offer a benchmark:

library(microbenchmark)
microbenchmark(wmlm = {wmlm (xmat, ymat, wmat);},
               lm = {for (i in 1:ncol(ymat))
                       lm(ymat[, i] ~ x1 + poly(x3, 3) + x2 * f, dat, weights = wmat[, i]);} )

#Unit: milliseconds
# expr       min        lq      mean    median       uq      max neval cld
# wmlm  20.84512  23.02756  27.29539  24.49314  25.9027  79.8894   100  a 
#   lm 400.53000 405.10622 430.09787 414.42152 442.2640 535.9144   100   b

So 17.25 times boost is seen (based on median time).

Recommended topics

Hot tags