How to obtain cross-validated r-square from a linear model in R?

I have a linear model in R.

set.seed(1234)
x <- rnorm(100)
z <- rnorm(100)
y <- rnorm(100, x+z)
mydata <- data.frame(x,y,z)

fit <- lm(y ~ x + z, mydata)

I would like to obtain an estimate of the out-of-sample r-square. I was thinking of using some form of k-fold cross-validation.

  • What code in R takes a linear model fit and returns a cross-validated r-square?
  • Or is there some other approach to obtaining cross-validated r-square using R?
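
To make concrete what I'm after, here is a rough hand-rolled sketch of the kind of computation I have in mind (just an illustration; it assumes that cor(predicted, observed)^2 over held-out folds is an acceptable definition of cross-validated r-square, and the fold_id / cv_pred names are only placeholders):

# rough sketch only: manual 10-fold cross-validation of r-square
# (uses mydata and the y ~ x + z model from above)
k <- 10
set.seed(1234)
fold_id <- sample(rep(1:k, length.out = nrow(mydata)))

cv_pred <- rep(NA_real_, nrow(mydata))
for (i in 1:k) {
  train <- mydata[fold_id != i, ]
  test  <- mydata[fold_id == i, ]
  fit_i <- lm(y ~ x + z, data = train)
  cv_pred[fold_id == i] <- predict(fit_i, newdata = test)
}

cor(cv_pred, mydata$y)^2  # cross-validated r-square
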
Gypsie asked on 16/4/2013 at 5:35 (7 comments)
Brassiere: May be off-topic here, and a good fit for Cross Validated.
Gypsie: Why? It is about how to implement a statistical technique in the language R, which has close to 30,000 questions on this site. If you'd prefer, I could remove the statistical elements of the question and just focus on the R implementation.
Kaliningrad: Take a look at statmethods.net/stats/regression.html
Gypsie: @Kaliningrad Many thanks. That looks like it will do the trick. Once I've applied it, I'll post an example of how it applies to the example above.
Kaliningrad: @JeromyAnglim If you search for DAAG on this site, you'll find some relevant questions and answers.
Gypsie: @geektrader I thought about Cross Validated, but they generally refer questions about R implementation to Stack Overflow. I agree this question has a mix of both statistics and implementation, but I felt that the emphasis was on R implementation and was therefore more suited to Stack Overflow.
Gypsie: @geektrader Okay, I've edited the question to be purely about R implementation.

What follows is a slight adaptation of the example that @NPR linked to on statmethods.net. Essentially, I adapted the example to make it a function.

library(bootstrap)

k_fold_rsq <- function(lmfit, ngroup=10) {
    # assumes library(bootstrap)
    # adapted from http://www.statmethods.net/stats/regression.html
    mydata <- lmfit$model
    outcome <- names(lmfit$model)[1]
    predictors <- names(lmfit$model)[-1]

    theta.fit <- function(x,y){lsfit(x,y)}
    theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef} 
    X <- as.matrix(mydata[predictors])
    y <- as.matrix(mydata[outcome]) 

    results <- crossval(X,y,theta.fit,theta.predict,ngroup=ngroup)
    raw_rsq <- cor(y, lmfit$fitted.values)**2 # raw R2 
    cv_rsq <- cor(y,results$cv.fit)**2 # cross-validated R2

    c(raw_rsq=raw_rsq, cv_rsq=cv_rsq)
}

So, using the data from before:

# sample data
set.seed(1234)
x <- rnorm(100)
z <- rnorm(100)
y <- rnorm(100, x+z)
mydata <- data.frame(x,y,z)

We can fit a linear model and call the cross-validation function:

# fit and call function
lmfit <- lm(y ~ x + z, mydata)
k_fold_rsq(lmfit, ngroup=30)

And get the resulting raw and cross-validated r-square:

  raw_rsq    cv_rsq 
0.7237907 0.7050297

Caveat: While raw_rsq is clearly correct and cv_rsq is in the ballpark I expect, note that I haven't yet examined exactly what the crossval function does, so use it at your own risk; any feedback would be most welcome. It's also only designed for linear models with an intercept and standard main-effects notation.
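
If factor predictors or interaction terms are needed, note that lsfit() breaks on the raw model frame because as.matrix() coerces factor columns to character. One possible workaround, which I haven't stress-tested, is to pass crossval the expanded design matrix from model.matrix() instead, dropping the intercept column because lsfit() adds its own (the k_fold_rsq_mm name below is just a placeholder):

# untested variant: expand factors/interactions via model.matrix() before crossval
k_fold_rsq_mm <- function(lmfit, ngroup=10) {
    # assumes library(bootstrap)
    X <- model.matrix(lmfit)[, -1, drop = FALSE]  # drop intercept; lsfit() adds its own
    y <- model.response(model.frame(lmfit))

    theta.fit <- function(x,y){lsfit(x,y)}
    theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}

    results <- crossval(X, y, theta.fit, theta.predict, ngroup=ngroup)
    c(raw_rsq = cor(y, lmfit$fitted.values)^2,
      cv_rsq  = cor(y, results$cv.fit)^2)
}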

Gypsie answered on 16/4/2013 at 6:16 (2 comments)
Azal: This function breaks for models with factor predictors. Example: fit = lm(Sepal.Length ~ Species, data = iris); k_fold_rsq(fit) gives Error in lsfit(x, y): NA/NaN/Inf in 'x', plus a warning: In lsfit(x, y): NAs introduced by coercion.
Holladay: I wasn't sure how to implement this with interactions.

I wrote a function for doing this. It also works for nominal predictors. It only works for lm objects (I think), but could easily be expanded to glm etc.

# from
# https://mcmap.net/q/1534914/-how-to-obtain-cross-validated-r-square-from-linear-model-in-r
# via http://www.statmethods.net/stats/regression.html

#' Calculate k-fold cross-validated r2
#'
#' Using k-fold cross-validation, estimate the true r2 in a new sample. This is better than using adjusted r2 values.
#' @param lmfit (an lm fit) An lm fit object.
#' @param folds (whole number scalar) The number of folds to use (default 10).
#' @param runs (whole number scalar) The number of times to repeat the random k-fold split; results are averaged (default 100).
#' @param seed (whole number scalar) Seed for reproducibility; set to NA to skip seeding (default 1).
#' @export
#' @examples
#' fit = lm("Petal.Length ~ Sepal.Length", data = iris)
#' MOD_k_fold_r2(fit)
MOD_k_fold_r2 = function(lmfit, folds = 10, runs = 100, seed = 1) {
  library(magrittr)

  #get data
  data = lmfit$model

  #seed
  if (!is.na(seed)) set.seed(seed)

  v_runs = sapply(1:runs, FUN = function(run) {
    #Randomly shuffle the data
    data2 = data[sample(nrow(data)), ]

    #Create n equally sized folds
    folds_idx <- cut(seq(1, nrow(data2)), breaks = folds, labels = FALSE)

    #Perform n fold cross validation
    sapply(1:folds, function(i) {
      #Segment your data by fold using the which() function

      test_idx = which(folds_idx==i, arr.ind=TRUE)
      test_data = data2[test_idx, ]
      train_data = data2[-test_idx, ]

      #weights: carry the model's weights into the training data if present,
      #otherwise fall back to unit weights
      if ("(weights)" %in% names(train_data)) {
        train_data$.weights = train_data[["(weights)"]]
      } else {
        train_data$.weights = rep(1, nrow(train_data))
      }

      #fit on the training folds, reusing the original model formula
      fit = lm(formula = formula(lmfit), data = train_data, weights = .weights)

      #predict
      preds = predict(fit, newdata = test_data)

      #correlate to get r2
      cor(preds, test_data[[1]], use = "p")^2
    }) %>%
      mean()
  })

  #return
  c("raw_r2" = summary(lmfit)$r.squared, "cv_r2" = mean(v_runs))
}

Testing it:

fit = lm("Petal.Length ~ Species", data = iris)
MOD_k_fold_r2(fit)
#>    raw_r2     cv_r2 
#> 0.9413717 0.9398156 

And on the OP sample:

> MOD_k_fold_r2(lmfit)
#raw_r2  cv_r2 
# 0.724  0.718 
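
As for expanding it to glm: an untested sketch of that idea might look like the following. It assumes a numeric response, refits with glm() inside each fold, and correlates response-scale predictions with the observed values (cv_r2_glm is just a placeholder name):

# untested sketch: same fold logic, but refit with glm() and predict on the response scale
cv_r2_glm = function(glmfit, folds = 10) {
  data = glmfit$model
  data = data[sample(nrow(data)), ]                                   # shuffle rows
  folds_idx = cut(seq_len(nrow(data)), breaks = folds, labels = FALSE)

  r2s = sapply(1:folds, function(i) {
    test  = data[folds_idx == i, ]
    train = data[folds_idx != i, ]
    fit = glm(formula(glmfit), family = glmfit$family, data = train)
    preds = predict(fit, newdata = test, type = "response")
    cor(preds, test[[1]], use = "p")^2   # assumes a numeric response in column 1
  })

  mean(r2s)
}
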
Azal answered on 13/4/2016 at 8:00

Discussion over on stats.stackexchange (e.g., link 1 and link 2) argues that mean squared error (MSE) should be used rather than R^2.

Leave-one-out cross-validation (the special case of k-fold CV where k = N) has a property that allows quick computation of the CV MSE for linear models using a simple formula. See Section 5.1.2 of "An Introduction to Statistical Learning (with Applications in R)". The following code should compute the LOOCV RMSE for lm models (using Equation 5.2 from the same section):

sqrt(sum((residuals(fit)/(1-hatvalues(fit)))^2)/length(fit$residuals))

You could compare this to the "regular" RMSE (the residual standard error reported by summary()):

summary(fit)$sigma 

or RMSE obtained from 5- or 10-fold cross-validation, I suppose.
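
As a quick sanity check, the shortcut can be compared with an explicit leave-one-out loop over fit and mydata from the question; for lm fits the two values should agree:

# explicit LOOCV RMSE, for comparison with the hat-value shortcut above
loo_pred <- sapply(seq_len(nrow(mydata)), function(i) {
  fit_i <- lm(y ~ x + z, data = mydata[-i, ])
  predict(fit_i, newdata = mydata[i, , drop = FALSE])
})
sqrt(mean((mydata$y - loo_pred)^2))                                       # explicit loop
sqrt(sum((residuals(fit)/(1 - hatvalues(fit)))^2)/length(fit$residuals))  # shortcut formula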

Holladay answered on 9/11/2017 at 21:10
