Ridge regression with `glmnet` gives different coefficients than what I compute by "textbook definition"?
I am running ridge regression using the glmnet R package. I noticed that the coefficients I obtain from the glmnet::glmnet function are different from those I get by computing the coefficients by definition (using the same lambda value). Could somebody explain to me why?

The data (both the response Y and the design matrix X) are scaled.

library(MASS)
library(glmnet)

# Data dimensions
p.tmp <- 100
n.tmp <- 100

# Data objects
set.seed(1)
X <- scale(mvrnorm(n.tmp, mu = rep(0, p.tmp), Sigma = diag(p.tmp)))
beta <- rep(0, p.tmp)
beta[sample(1:p.tmp, 10, replace = FALSE)] <- 10
Y.true <- X %*% beta
Y <- scale(Y.true + matrix(rnorm(n.tmp))) # Y.true + Gaussian noise

# Run glmnet 
ridge.fit.cv <- cv.glmnet(X, Y, alpha = 0)
ridge.fit.lambda <- ridge.fit.cv$lambda.1se

# Extract coefficient values for lambda.1se (without intercept)
ridge.coef <- (coef(ridge.fit.cv, s = ridge.fit.lambda))[2:(p.tmp+1)]

# Get coefficients "by definition"
ridge.coef.DEF <- solve(t(X) %*% X + ridge.fit.lambda * diag(p.tmp)) %*% t(X) %*% Y

# Plot estimates
plot(ridge.coef, type = "l", ylim = range(c(ridge.coef, ridge.coef.DEF)),
     main = "black: Ridge `glmnet`\nred: Ridge by definition")
lines(ridge.coef.DEF, col = "red")

[Plot of coefficient estimates. Black: Ridge glmnet; red: Ridge by definition]

Phidias asked 4/10, 2016 at 23:42 Comment(3)
This is really a request for statistical tutoring and as such is more appropriate for CrossValidated.com. (I think the answer is that ridge regression is a penalized method, but you would probably get a more authoritative answer from the CV crowd.)Toniatonic
@42- It seems like this is actually a coding question. If I understand correctly, the OP is asking why the coefficients returned by glmnet for a given value of lambda (the penalty term) are not the same as the coefficients he gets by solving for the regression coefficients directly using the same value of lambda as glmnet.Mebane
Interestingly, the OP's "by hand" calculation with 100*ridge.fit.lambda results in (almost) exactly the same coefficients from solve(t(X) %*% X + 100*ridge.fit.lambda * diag(p.tmp)) %*% t(X) %*% Y as you get from glmnet with ridge.fit.lambda.Mebane

If you read ?glmnet, you will see that the penalized objective function for a Gaussian response is:

1/2 * RSS / nobs + lambda * penalty

When the ridge penalty 1/2 * ||beta_j||_2^2 is used, we have

1/2 * RSS / nobs + 1/2 * lambda * ||beta_j||_2^2

which is proportional to

RSS + lambda * nobs * ||beta_j||_2^2

This is different from what we usually see in textbooks for ridge regression:

RSS + lambda * ||beta_j||_2^2

The formula you wrote:

##solve(t(X) %*% X + ridge.fit.lambda * diag(p.tmp)) %*% t(X) %*% Y
drop(solve(crossprod(X) + diag(ridge.fit.lambda, p.tmp), crossprod(X, Y)))

is for the textbook result; for glmnet we should expect:

##solve(t(X) %*% X + n.tmp * ridge.fit.lambda * diag(p.tmp)) %*% t(X) %*% Y
drop(solve(crossprod(X) + diag(n.tmp * ridge.fit.lambda, p.tmp), crossprod(X, Y)))

So, the textbook uses penalized least squares, but glmnet uses penalized mean squared error.

Note that I did not use your original code with t(), "%*%" and solve(A) %*% b; using crossprod and solve(A, b) is more efficient! See the Follow-up section at the end.


Now let's make a new comparison:

library(MASS)
library(glmnet)

# Data dimensions
p.tmp <- 100
n.tmp <- 100

# Data objects
set.seed(1)
X <- scale(mvrnorm(n.tmp, mu = rep(0, p.tmp), Sigma = diag(p.tmp)))
beta <- rep(0, p.tmp)
beta[sample(1:p.tmp, 10, replace = FALSE)] <- 10
Y.true <- X %*% beta
Y <- scale(Y.true + matrix(rnorm(n.tmp)))

# Run glmnet 
ridge.fit.cv <- cv.glmnet(X, Y, alpha = 0, intercept = FALSE)
ridge.fit.lambda <- ridge.fit.cv$lambda.1se

# Extract coefficient values for lambda.1se (without intercept)
ridge.coef <- (coef(ridge.fit.cv, s = ridge.fit.lambda))[-1]

# Get coefficients "by definition"
ridge.coef.DEF <- drop(solve(crossprod(X) + diag(n.tmp * ridge.fit.lambda, p.tmp), crossprod(X, Y)))

# Plot estimates
plot(ridge.coef, type = "l", ylim = range(c(ridge.coef, ridge.coef.DEF)),
     main = "black: Ridge `glmnet`\nred: Ridge by definition")
lines(ridge.coef.DEF, col = "red")

[Plot of coefficient estimates. Black: Ridge glmnet; red: Ridge by definition. The two curves now overlap.]
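
Beyond the plot, a quick numerical check (reusing the objects created above) is to look at the largest discrepancy between the two estimates; exact equality is not expected, since glmnet fits by iterative coordinate descent with a convergence threshold rather than solving the normal equations exactly:

# maximum absolute discrepancy between glmnet and the "by definition" estimate;
# expect a small but nonzero value due to glmnet's convergence tolerance
max(abs(ridge.coef - ridge.coef.DEF))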

Note that I have set intercept = FALSE when calling cv.glmnet (or glmnet). This matters more conceptually than in practice. Conceptually, our textbook computation has no intercept, so we want to drop the intercept when using glmnet. But practically, since your X and Y are standardized, the theoretical estimate of the intercept is 0. Even with intercept = TRUE (the glmnet default), you can check that the estimated intercept is ~1e-17 (numerically 0), hence the estimates of the other coefficients are not notably affected. The other answer demonstrates exactly this.
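
For instance, one way to verify this (a small sketch; ridge.fit.cv.default is just a throwaway name here, and the exact value depends on the CV folds) is to refit with the default intercept and inspect the first coefficient:

# refit with the default intercept = TRUE and look at the estimated intercept;
# for standardized X and Y it should be numerically zero
ridge.fit.cv.default <- cv.glmnet(X, Y, alpha = 0)
coef(ridge.fit.cv.default, s = "lambda.1se")[1]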


Follow-up

As for using crossprod and solve(A, b) - interesting! Do you by chance have any reference to a simulation comparison for that?

t(X) %*% Y will first take the transpose X1 <- t(X) and then do X1 %*% Y, while crossprod(X, Y) does not form the transpose explicitly. "%*%" is a wrapper for DGEMM for the case op(A) = A, op(B) = B, while crossprod is a wrapper for op(A) = A', op(B) = B. Similarly, tcrossprod is for op(A) = A, op(B) = B'.

A major use of crossprod(X) is for t(X) %*% X; similarly, tcrossprod(X) is for X %*% t(X), in which case DSYRK is called instead of DGEMM. You can read the first section of Why the built-in lm function is so slow in R? for the reason and a benchmark.

Be aware that if X is not a square matrix, crossprod(X) and tcrossprod(X) are not equally fast, as they involve different amounts of floating-point operations; for this you may read the side note of Any faster R function than "tcrossprod" for symmetric dense matrix multiplication?

Regarding solve(A, b) and solve(A) %*% b, please read the first section of How to compute diag(X %*% solve(A) %*% t(X)) efficiently without taking matrix inverse?
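
As a rough illustration (just a sketch with arbitrary dimensions; absolute timings depend heavily on the BLAS your R is linked to), you can time the two formulations yourself:

# timing sketch: crossprod vs explicit transpose, and solve(A, b) vs solve(A) %*% b
set.seed(1)
n <- 2000; p <- 500
A <- matrix(rnorm(n * p), n, p)
b <- rnorm(n)
system.time(for (i in 1:20) t(A) %*% A)      # explicit transpose, then DGEMM
system.time(for (i in 1:20) crossprod(A))    # DSYRK, no explicit transpose
M <- crossprod(A) + diag(p)                  # p x p positive definite matrix
v <- crossprod(A, b)
system.time(for (i in 1:50) solve(M) %*% v)  # forms the inverse explicitly
system.time(for (i in 1:50) solve(M, v))     # one linear solve, no explicit inverse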

Morette answered 5/10, 2016 at 3:18 Comment(2)
Zheyuan Li, I appreciate your answer greatly! As for using crossprod and solve(A, b) - interesting! Do you by chance have any reference to a simulation comparison for that? Is it efficient only in a computing-time sense, or does it also yield a better underlying computational solution that gives more precision when computing some "troublesome" matrices (very large or with a high amplitude of values)? One drawback I can see is that it makes the code less lucid than a chain of %*% operations.Phidias
Zheyuan Li, thanks for following-up! I will definitely check these links over the weekend. Bests!Phidias

Adding on top of Zheyuan's interesting post, I did some more experiments to see that we can get the same results with an intercept as well, as follows:

# ridge with intercept glmnet
ridge.fit.cv.int <- cv.glmnet(X, Y, alpha = 0, intercept = TRUE, family="gaussian")
ridge.fit.lambda.int <- ridge.fit.cv.int$lambda.1se
ridge.coef.with.int <- as.vector(as.matrix(coef(ridge.fit.cv.int, s = ridge.fit.lambda.int)))

# ridge with intercept by definition, using the same lambda obtained via cv.glmnet;
# note this formula also penalizes the intercept column (glmnet does not), but with
# standardized X and Y the intercept is ~0, so the effect is negligible
X.with.int <- cbind(1, X)
ridge.coef.DEF.with.int <- drop(solve(crossprod(X.with.int) + ridge.fit.lambda.int * diag(n.tmp, p.tmp+1), crossprod(X.with.int, Y)))

# plot comparison of the two estimates (ggplot2 is needed for this plot)
library(ggplot2)
ggplot() + geom_point(aes(ridge.coef.with.int, ridge.coef.DEF.with.int))

[Scatter plot comparing ridge.coef.with.int and ridge.coef.DEF.with.int]

# compute residual sums of squares (against Y.true)
RSS.fit.cv.int <- sum((Y.true - predict(ridge.fit.cv.int, newx=X))^2) # predict adds the intercept
RSS.DEF.int <- sum((Y.true - X.with.int %*% ridge.coef.DEF.with.int)^2)

RSS.fit.cv.int
[1] 110059.9
RSS.DEF.int
[1] 110063.4
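
To put that gap in perspective (continuing with the objects above), one can look at the relative difference, which is tiny:

# relative difference between the two residual sums of squares
(RSS.DEF.int - RSS.fit.cv.int) / RSS.DEF.int
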
Trinia answered 5/10, 2016 at 6:11 Comment(2)
Hi sandipan, thank you for this post! I am even surprised that the difference in residuals is noticeable (not a super-marginal difference on the order of 1e-10 or smaller). In the case of this post, all I actually cared about was simplicity in plotting the two vectors together (I needed to make them consistent in length).Phidias
@MartaKaras but the difference is quite small if you consider the magnitude of RSS, it's (110063.4-110059.9)/110063.4 = 3.179985e-05.Trinia
