R-squared on test data

I fit a linear regression model on 75% of my data set that includes ~11000 observations and 143 variables:

gl.fit <- lm(y[1:ceiling(length(y)*(3/4))] ~ ., data= x[1:ceiling(length(y)*(3/4)),]) #3/4 for training

and I got an R^2 of 0.43. I then tried predicting on my test data using the rest of the data:

ytest=y[(ceiling(length(y)*(3/4))+1):length(y)]
x.test <- cbind(1,x[(ceiling(length(y)*(3/4))+1):length(y),]) #The rest for test
yhat <- as.matrix(x.test)%*%gl.fit$coefficients  #Calculate the predicted values

I now would like to calculate the R^2 value on my test data. Is there any easy way to calculate that?

Thank you

Sherlock answered 5/9, 2014 at 17:33 Comment(4)
See this similar question on CrossValidated.Elodea
@Elodea Thanks; I used the formula in the mentioned question and got a negative number (-0.59) as my R^2 value. I have a doubt about my lm model, should I add an intercept (I assume R does it automatically)? Then why am I getting negative R^2?Sherlock
Did you use the formula in the question or the formula in the comment below the question? Because the formula in the question is incorrect - see @Panos's comment on that question.Elodea
@Elodea I used the one in the comment: errors <- (ytest - yhat); 1 - sum(errors^2)/sum((ytest-mean(ytest))^2) and I get -0.59!Sherlock

There are a couple of problems here. First, this is not a good way to use lm(...). lm(...) is meant to be used with a data frame, with the formula expressions referencing columns in the df. So, assuming your data is in two vectors x and y,

set.seed(1)    # for reproducible example
x <- 1:11000
y <- 3+0.1*x + rnorm(11000,sd=1000)

df <- data.frame(x,y)
# training set
train <- sample(1:nrow(df),0.75*nrow(df))   # random sample of 75% of data

fit <- lm(y~x,data=df[train,])

Now fit has the model based on the training set. Using lm(...) this way allows you, for example, to generate predictions without all the matrix multiplication.
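
For example, a quick sketch using the objects defined above:

# predictions for the rows held out of the training sample -- no manual matrix algebra needed
head(predict(fit, newdata=df[-train,]))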

The second problem is the definition of R-squared. The conventional definition is:

1 - SS.residuals/SS.total

For the training set, and the training set ONLY,

SS.total = SS.regression + SS.residual

so

SS.regression = SS.total - SS.residual,

and therefore

R.sq = SS.regression/SS.total

so R.sq is the fraction of variability in the dataset that is explained by the model, and will always be between 0 and 1.

You can see this below.

SS.total      <- with(df[train,],sum((y-mean(y))^2))
SS.residual   <- sum(residuals(fit)^2)
SS.regression <- sum((fitted(fit)-mean(df[train,]$y))^2)
SS.total - (SS.regression+SS.residual)
# [1] 1.907349e-06
SS.regression/SS.total     # fraction of variation explained by the model
# [1] 0.08965502
1-SS.residual/SS.total     # same thing, for model frame ONLY!!! 
# [1] 0.08965502          
summary(fit)$r.squared     # both are = R.squared
# [1] 0.08965502

But this does not work with the test set (e.g., when you make predictions from a model).

test <- -train
test.pred <- predict(fit,newdata=df[test,])
test.y    <- df[test,]$y

SS.total      <- sum((test.y - mean(test.y))^2)
SS.residual   <- sum((test.y - test.pred)^2)
SS.regression <- sum((test.pred - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
# [1] 8958890

# NOT the fraction of variability explained by the model
test.rsq <- 1 - SS.residual/SS.total  
test.rsq
# [1] 0.0924713

# fraction of variability explained by the model
SS.regression/SS.total 
# [1] 0.08956405

In this contrived example there is not much difference, but it is very possible to have an R-sq. value less than 0 (when defined this way).

If, for example, the model is a very poor predictor with the test set, then the residuals can actually be larger than the total variation in the test set. This is equivalent to saying that the test set is modeled better by its own mean than by the model derived from the training set.

I noticed that you use the first three quarters of your data as the training set, rather than taking a random sample (as in this example). If the dependence of y on x is non-linear, and the x's are in order, then you could get a negative R-sq with the test set.
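
A minimal sketch of that failure mode (contrived data with a non-linear dependence and an ordered, non-random split; the object names are only illustrative):

set.seed(1)
x2 <- 1:1000
y2 <- (x2-500)^2 + rnorm(1000, sd=1000)      # y depends non-linearly on x
df2 <- data.frame(x=x2, y=y2)
train2 <- 1:750                              # first 3/4 of the ordered data as the training set
fit2 <- lm(y~x, data=df2[train2,])
pred2 <- predict(fit2, newdata=df2[-train2,])
y2.test <- df2[-train2,]$y
1 - sum((y2.test-pred2)^2)/sum((y2.test-mean(y2.test))^2)
# strongly negative: the test-set mean is a far better predictor than the fitted line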

Regarding OP's comment below, one way to assess the model with a test set is by comparing in-model to out-of-model mean squared error (MSE).

mse.train <- summary(fit)$sigma^2
mse.test  <- sum((test.pred - test.y)^2)/(nrow(df)-length(train)-2)

If we assume that the training and test sets are both normally distributed with the same variance and have means that follow the same model formula, then the ratio of these MSEs should have an F-distribution with (n.train-2) and (n.test-2) degrees of freedom. If the MSEs are significantly different based on an F-test, then the model does not fit the test data well.
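
A sketch of that F-test, under the stated assumptions (the ratio is oriented to match the degrees of freedom above; object names follow the earlier code):

n.train <- length(train)
n.test  <- nrow(df) - length(train)
F.ratio <- mse.train/mse.test
# two-sided p-value for H0: the two MSEs estimate the same error variance
p.value <- 2*min(pf(F.ratio, n.train-2, n.test-2),
                 pf(F.ratio, n.train-2, n.test-2, lower.tail=FALSE))
p.value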

Have you plotted your test.y and test.pred vs x? This alone will tell you a lot.

Bomke answered 5/9, 2014 at 20:22 Comment(6)
Thank you so much for this elaborate example. In this case, what is the best way for me to assess my model on the test data set?Sherlock
I just edited the response to bring it into alignment with the more conventional definition of R-sq, but the main conclusions are unchanged. Regarding your question, see my comments at the end.Bomke
Excellent answer, as usual. I changed my train/test split as you suggested to select the points randomly. I am no longer getting a negative R-squared for my test set (assuming it has a meaning). I also calculated the training and test MSEs: 0.00056 for training, 0.00036 for test, ratio ~0.65. Comparing with qf(0.95, length(train)-2, length(test)-2) = 1.036603, the model is doing something. Please correct me if I am making a mistake.Sherlock
Well, since the out-of-model MSE is less than the MSE for the fit, it's hard to argue that the model doesn't work with the test data. Have you looked at the diagnostic plots par(mfrow=c(2,2)); plot(fit)?Bomke
I did, but I am not 100% sure on how to interpret the graphs. Here is the link: dropbox.com/s/tdufnw8jqr7gfzs/diag.pdf?dl=0. Can you please comment on that? ThanksSherlock
You either have a very large number of outliers, or your model is not adequate. The Q-Q plot should be a straight line. Your residuals have a much higher proportion of very large -ve and +ve values than would be expected from a normal distribution. Also, generally most points should have leverage < 0.1. The fact that your residuals have leverage as large as 1.0 is strongly suggestive of outliers. Read this, especially Chapt. 7.Bomke

Calculating R-squared on the testing data is a little tricky, as you have to remember what your baseline is. Your baseline prediction is the mean of your training data.

Therefore, extending the example provided by @jlhoward above:

SS.test.total      <- sum((test.y - mean(df[train,]$y))^2)
SS.test.residual   <- sum((test.y - test.pred)^2)
SS.test.regression <- sum((test.pred - mean(df[train,]$y))^2)
SS.test.total - (SS.test.regression+SS.test.residual)
# [1] 11617720 not 8958890

test.rsq <- 1 - SS.test.residual/SS.test.total  
test.rsq
# [1] 0.09284556 not 0.0924713

# fraction of variability explained by the model
SS.test.regression/SS.test.total 
# [1] 0.08907705 not 0.08956405

Update: the miscTools::rSquared() function assumes that R-squared is calculated on the same dataset the model was trained on, because it computes

yy <- y - mean(y)

behind the scenes in line 184 here: https://github.com/cran/miscTools/blob/master/R/utils.R
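
If you want this baseline choice wrapped up for reuse, a small helper along these lines works (just a sketch; this function is not part of miscTools):

# out-of-sample R-squared using the training-set mean as the baseline
rsq.oos <- function(y.test, y.pred, y.train) {
  1 - sum((y.test - y.pred)^2)/sum((y.test - mean(y.train))^2)
}
rsq.oos(test.y, test.pred, df[train,]$y)
# [1] 0.09284556 (same as test.rsq above)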

Walrus answered 19/4, 2016 at 19:39 Comment(1)
do you have any reference in the literature for using the mean of the training set as reference point? intuitively it seems convincing but not many people use it...Redwing

If you want a function, the miscTools package has an rSquared function.

require(miscTools)
r2 <- rSquared(ytest, resid = ytest-yhat)
Nay answered 5/9, 2014 at 17:47 Comment(3)
I couldn't find this package: Installing package into ‘C:/Users/Haidar/Documents/R/win-library/3.1’ (as ‘lib’ is unspecified) Warning in install.packages : package ‘micsTools’ is not available (for R version 3.1.1)Sherlock
@H_A, typo on my part, sorry. It is miscTools.Nay
Thanks, it worked, I am still getting negative value for my R^2, I suspect that there is something wrong with my regression/prediction procedure.Sherlock

When you use an R2 measure on an (out-of-)sample, you lose certain aspects of the interpretation of R2:

  • the equivalence SSR total = SSR explained + SSR error
  • the fact that R2 is equal to the square of the correlation between y and predicted y
  • the fact that R2 is in [0, 1]

If you want to use R, I would recommend the function modelr::rsquare. Note this uses the SSR total from the test sample, not the training sample (as some people seem to advocate).

Here I take an example where the training data has only 3 points, so there is a high risk of a bad model and hence poor out-of-sample performance. Indeed, you can see that the R2 is negative!

library(modelr)

train <- mtcars[c(1,3,4),]
test  <- mtcars[-c(1,3,4),]

mod <- lm(carb ~ drat, data = train)

Compute on train data:

## train
y_train <- train$carb
SSR_y_train <- sum((y_train-mean(y_train))^2)

cor(fitted(mod), y_train)^2
#> [1] 0.2985092
rsquare(mod, train)
#> [1] 0.2985092
1-sum(residuals(mod)^2)/SSR_y_train
#> [1] 0.2985092

Compute on test data:

## test
pred_test <- predict(mod, newdata = test)
y_test <- test$carb
SSR_y_test <- sum((y_test-mean(y_test))^2)

cor(pred_test, y_test)^2
#> [1] 0.01737236
rsquare(mod, test)
#> [1] -0.6769549

1- 28* var(pred_test-y_test)/SSR_y_train
#> [1] -19.31621
1- 28* var(pred_test-y_test)/SSR_y_test
#> [1] -0.6769549
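
(A note on the 28: var() divides by n - 1, and dropping the 3 training rows from mtcars leaves 29 test observations, so 28*var(e) is a centered sum of squares. A quick check:)

n_test <- nrow(test)      # 29
all.equal((n_test-1)*var(pred_test - y_test),
          sum((pred_test - y_test - mean(pred_test - y_test))^2))
#> [1] TRUE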
Titer answered 1/5, 2019 at 2:1 Comment(1)
why do you use 28 and not 29?Redwing
# Python/statsmodels example: this reports the model's in-sample R-squared,
# i.e. on the data used for fitting, not on a held-out test set.
import statsmodels.api as sm

model_1 = sm.OLS(df['ExpirationMonth'],
                 sm.add_constant(df[['Salerank',
                                     'X2013USSales',
                                     'X2013WorldSales',
                                     'ProfitMargin',
                                     'NumStores',
                                     'RewardSize']])).fit()

r_squared_1 = model_1.rsquared
print(f"Model 1 R-squared: {r_squared_1}")
Callous answered 26/12, 2023 at 7:38 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Neomaneomah
