What follows is a slight adaptation of the example that @NPR linked to from statmethods.net. Essentially, I adapted the example to turn it into a function.
library(bootstrap)

k_fold_rsq <- function(lmfit, ngroup = 10) {
    # assumes library(bootstrap) has been loaded
    # adapted from http://www.statmethods.net/stats/regression.html
    mydata     <- lmfit$model
    outcome    <- names(lmfit$model)[1]
    predictors <- names(lmfit$model)[-1]

    theta.fit     <- function(x, y) lsfit(x, y)
    theta.predict <- function(fit, x) cbind(1, x) %*% fit$coef

    X <- as.matrix(mydata[predictors])
    y <- as.matrix(mydata[outcome])

    results <- crossval(X, y, theta.fit, theta.predict, ngroup = ngroup)
    raw_rsq <- cor(y, lmfit$fitted.values)^2  # raw R-squared
    cv_rsq  <- cor(y, results$cv.fit)^2       # cross-validated R-squared
    c(raw_rsq = raw_rsq, cv_rsq = cv_rsq)
}
So, using the data from before:
# sample data
set.seed(1234)
x <- rnorm(100)
z <- rnorm(100)
y <- rnorm(100, x + z)
mydata <- data.frame(x, y, z)
We can fit a linear model and call the cross-validation function:
# fit and call function
lmfit <- lm(y ~ x + z, mydata)
k_fold_rsq(lmfit, ngroup = 30)
And get the resulting raw and cross-validated R-squared:
raw_rsq cv_rsq
0.7237907 0.7050297
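As a quick sanity check (my own addition, not part of the original answer): for an OLS fit with an intercept, the raw R-squared computed via cor()^2 should agree exactly with the Multiple R-squared that summary() reports, which confirms that half of the function's output.

```r
# Rebuild the sample data and model, then check that the "raw" R-squared
# computed via squared correlation matches summary()'s Multiple R-squared.
set.seed(1234)
x <- rnorm(100)
z <- rnorm(100)
y <- rnorm(100, x + z)
mydata <- data.frame(x, y, z)

lmfit   <- lm(y ~ x + z, mydata)
raw_rsq <- cor(mydata$y, lmfit$fitted.values)^2

stopifnot(isTRUE(all.equal(raw_rsq, summary(lmfit)$r.squared)))
```

This identity holds for linear models with an intercept, which is exactly the class of models the function above is designed for.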
Caveat: While raw_rsq is clearly correct and cv_rsq is in the ballpark I expect, note that I haven't yet examined exactly what the crossval function does. So use it at your own risk, and if anyone has any feedback, it would be most welcome. The function is also only designed for linear models with an intercept and standard main-effects notation.
If you search for DAAG on this site, you'll find some relevant questions and answers. – Kaliningrad