Multiple regression analysis in R using QR decomposition
Asked Answered
B

1

8

I am trying to write a function for solving multiple regression using QR decomposition. Input: y vector and X matrix; output: b, e, R^2. So far I`ve got this and am terribly stuck; I think I have made everything way too complicated:

QR.regression <- function(y, X) {
X <- as.matrix(X)
y <- as.vector(y)
p <- as.integer(ncol(X))
if (is.na(p)) stop("ncol(X) is invalid")
n <- as.integer(nrow(X))
if (is.na(n)) stop("nrow(X) is invalid")
nr <- length(y)
nc <- NCOL(X)

# Householder 
for (j in seq_len(nc)) {
id <- seq.int(j, nr)
sigma <- sum(X[id, j]^2)
s <- sqrt(sigma)
diag_ej <- X[j, j]
gamma <- 1.0 / (sigma + abs(s * diag_ej))
kappa <- if (diag_ej < 0) s else -s
X[j,j] <- X[j, j] - kappa
if (j < nc)
for (k in seq.int(j+1, nc)) {
yPrime <- sum(X[id,j] * X[id,k]) * gamma
X[id,k] <- X[id,k] - X[id,j] * yPrime
}
yPrime <- sum(X[id,j] * y[id]) * gamma
y[id] <- y[id] - X[id,j] * yPrime
X[j,j] <- kappa
} # end of Householder transformation

rss <- sum(y[seq.int(nc+1, nr)]^2)  # residuals sum of squares
e <- rss/nr
e <- mean(residuals(QR.regression)^2)
beta <- solve(t(X) %*% X, t(X) %*% y)
for (i in seq_len(ncol(X))) # set zeros in the lower triangular side of X
X[seq.int(i+1, nr),i] <- 0
Rsq <- (X[1:nc,1:nc])^2
return(list(Rsq=Rsq, y = y, beta = beta, e = e))
}


UPDATE:
my.QR <- function(y, X) {
X <- as.matrix(X)
y <- as.vector(y)
p <- as.integer(ncol(X))
if (is.na(p)) stop("ncol(X) is invalid")
n <- as.integer(nrow(X))
if (is.na(n)) stop("nrow(X) is invalid")
qr.X <- qr(X)
b <- solve(t(X) %*% X, t(X) %*% y)
e <- as.vector(y - X %*% beta) #e
R2 <- (X[1:p, 1:p])^2
return(list(b = b, e= e, R2 = R2 ))
}

X <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)
y <- c(1,2,3,4)
my.QR(X, y)
Bittern answered 25/10, 2016 at 22:20 Comment(3)
By "think I made this way too complicated" perhaps you mean: why didn't I just use lm(y~X)?Grasp
I am allowed to use lm(y~X) only to prove that my function provided correct answers. Untill then I need to make mine work.Bittern
@ZheyuanLi thank you, I was already reading that post. The task is rather vague, but I think I should better use some of the built-in functions. Also: my Householder is off since I keep on getting this error message: "Error in X[id, j] : subscript out of bounds". Any help would be very much appreciated, seriously.Bittern
C
12

It all depends on how much of R's built-in facility you are allowed to use to solve this problem. I already know that lm is not allowed, so here is the rest of the story.


If you are allowed to use any other routines than lm

Then you can simply use lm.fit, .lm.fit or lsfit for QR-based ordinary least squares solving.

lm.fit(X, y)
.lm.fit(X, y)
lsfit(X, y, intercept = FALSE)

Among those, .lm.fit is the most light-weighed, while lm.fit and lsfit are pretty similar. The following is what we can do via .lm.fit:

f1 <- function (X, y) {
  z <- .lm.fit(X, y)
  RSS <- crossprod(z$residuals)[1]
  TSS <- crossprod(y - mean(y))[1]
  R2 <- 1 - RSS / TSS
  list(coefficients = z$coefficients, residuals = z$residuals, R2 = R2)
  }

In the question by your fellow classmate: Toy R function for solving ordinary least squares by singular value decomposition, I have already used this to check the correctness of SVD approach.


If you are not allowed to use R's built-in QR factorization routine qr.default

If .lm.fit is not allowed, but qr.default is, then it is also not that complicated.

f2 <- function (X, y) {
  ## QR factorization `X = QR`
  QR <- qr.default(X)
  ## After rotation of `X` and `y`, solve upper triangular system `Rb = Q'y` 
  b <- backsolve(QR$qr, qr.qty(QR, y))
  ## residuals
  e <- as.numeric(y - X %*% b)
  ## R-squared
  RSS <- crossprod(e)[1]
  TSS <- crossprod(y - mean(y))[1]
  R2 <- 1 - RSS / TSS
  ## multiple return
  list(coefficients = b, residuals = e, R2 = R2)
  }

If you further want variance-covariance of estimated coefficients, follow How to calculate variance of least squares estimator using QR decomposition in R?.


If you are not even allowed to use qr.default

Then we have to write QR decomposition ourselves. Writing a Householder QR factorization function in R code is giving this.

Using the function myqr there, we can write

f3 <- function (X, y) {
  ## our own QR factorization
  ## complete Q factor is not required
  QR <- myqr(X, complete = FALSE)
  Q <- QR$Q
  R <- QR$R
  ## rotation of `y`
  Qty <- as.numeric(crossprod(Q, y))
  ## solving upper triangular system
  b <- backsolve(R, Qty)
  ## residuals
  e <- as.numeric(y - X %*% b)
  ## R-squared
  RSS <- crossprod(e)[1]
  TSS <- crossprod(y - mean(y))[1]
  R2 <- 1 - RSS / TSS
  ## multiple return
  list(coefficients = b, residuals = e, R2 = R2)
  }

f3 is not extremely efficient, as we have formed Q explicitly, even though it is the thin-Q factor. In principle, we should rotate y along with the QR factorization of X, thus Q needs not be formed.


If you want to fix your existing code

This requires some debugging effort so would take some time. I will make another answer regarding this later.

Cinquain answered 25/10, 2016 at 23:39 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.