scikit-learn & statsmodels - which R-squared is correct?

I'd like to choose the best algorithm for future predictions. I found some solutions, but I don't understand which R-squared value is correct.

To do this, I split my data into training and test sets, and printed the two different R-squared values shown below.

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

lineer = LinearRegression()
lineer.fit(x_train,y_train)
lineerPredict = lineer.predict(x_test)

scoreLineer = r2_score(y_test, lineerPredict)  # First R-Squared

model = sm.OLS(lineerPredict, y_test)
print(model.fit().summary()) # Second R-Squared

The first R-squared result is -4.28; the second is 0.84.

Which value is correct?

Greater answered 10/2, 2019 at 7:4

Arguably, the real challenge in such cases is to be sure that you compare apples to apples. And in your case, it seems that you don't. Our best friend is always the relevant documentation, combined with simple experiments. So...

Although scikit-learn's LinearRegression() (i.e. your 1st R-squared) is fitted by default with fit_intercept=True (docs), this is not the case with statsmodels' OLS (your 2nd R-squared); quoting from the docs:

An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.
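
To see concretely what the docs mean, here is a minimal sketch (separate from the dummy data used below) of what add_constant does:

import numpy as np
import statsmodels.api as sm

X = np.arange(1, 5).reshape(-1, 1)  # a plain single-column design matrix
sm.add_constant(X)
# array([[1., 1.],
#        [1., 2.],
#        [1., 3.],
#        [1., 4.]])
# add_constant simply prepends a column of ones, i.e. the intercept term
# that statsmodels' OLS does not add on its own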

Keeping this important detail in mind, let's run some simple experiments with dummy data:

import numpy as np
import statsmodels.api as sm
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

# dummy data:
y = np.array([1,3,4,5,2,3,4])
X = np.array(range(1,8)).reshape(-1,1) # reshape to column

# scikit-learn:
lr = LinearRegression()
lr.fit(X,y)
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
#     normalize=False)

lr.score(X,y)
# 0.16118421052631582

y_pred=lr.predict(X)
r2_score(y, y_pred)
# 0.16118421052631582


# statsmodels
# first artificially add intercept to X, as advised in the docs:
X_ = sm.add_constant(X)

model = sm.OLS(y,X_) # X_ here
results = model.fit()
results.rsquared
# 0.16118421052631593

For all practical purposes, these two values of R-squared produced by scikit-learn and statsmodels are identical.

Let's go a step further, and try a scikit-learn model without intercept, but where we use the artificially "intercepted" data X_ we have already built for use with statsmodels:

lr2 = LinearRegression(fit_intercept=False)
lr2.fit(X_,y) # X_ here
# LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None,
#         normalize=False)

lr2.score(X_, y)
# 0.16118421052631593

y_pred2 = lr2.predict(X_)
r2_score(y, y_pred2)
# 0.16118421052631593

Again, the R-squared is identical with the previous values.

So, what happens when we "accidentally" forget to account for the fact that statsmodels OLS is fitted without an intercept? Let's see:

model3 = sm.OLS(y,X) # X here, i.e. no intercept
results3 = model3.fit()
results3.rsquared
# 0.8058035714285714

Well, an R-squared of 0.80 is indeed very far from the 0.16 returned by a model with an intercept, and arguably this is exactly what has happened in your case.

So far so good, and I could easily finish the answer here; but there is indeed a point where this harmonious world breaks down: let's see what happens when we fit both models without an intercept, using the initial data X to which we have not artificially added any intercept column. We have already fitted such an OLS model above and got an R-squared of 0.80; what about a similar model from scikit-learn?

# scikit-learn
lr3 = LinearRegression(fit_intercept=False)
lr3.fit(X,y) # X here
lr3.score(X,y)
# -0.4309210526315792

y_pred3 = lr3.predict(X)
r2_score(y, y_pred3)
# -0.4309210526315792

Ooops...! What the heck??

It seems that scikit-learn, when computing the r2_score, always assumes an intercept, either explicitly in the model (fit_intercept=True) or implicitly in the data (the way we produced X_ from X above, using statsmodels' add_constant); digging a little online reveals a Github thread (closed without a remedy) where it is confirmed that this is indeed the case.

[UPDATE Dec 2021: for a more detailed & in-depth investigation and explanation of why the two scores are different in this particular case (i.e. both models fitted without an intercept), see this great answer by Flavia]
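
As a quick numeric sanity check of that explanation, here is a minimal sketch, assuming (as the linked answer explains) that statsmodels falls back to an uncentered total sum of squares when no constant is included, while sklearn's r2_score always centers around the mean of y:

import numpy as np

y = np.array([1,3,4,5,2,3,4])
X = np.array(range(1,8)).reshape(-1,1)
x = X.ravel()

# slope and residuals of the no-intercept least-squares fit
beta = (x @ y) / (x @ x)
ss_res = ((y - beta * x) ** 2).sum()

# sklearn-style R-squared: total sum of squares centered around the mean of y
1 - ss_res / ((y - y.mean()) ** 2).sum()
# -0.4309...  (matches lr3.score and r2_score above)

# statsmodels-style R-squared without a constant: uncentered total sum of squares
1 - ss_res / (y ** 2).sum()
# 0.8058...  (matches results3.rsquared above)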

Let me clarify that the discrepancy I have described above has nothing to do with your issue: in your case, the real issue is that you are actually comparing apples (a model with intercept) with oranges (a model without intercept).


So, why does scikit-learn not only fail in such an (admittedly edge) case, but, even when the fact emerges in a Github issue, treat it with indifference? (Notice also that the scikit-learn core developer who replies in the above thread casually admits that "I'm not super familiar with stats"...).

The answer goes a little beyond coding issues, such as the ones SO is mainly about, but it may be worth elaborating a little here.

Arguably, the reason is that the whole R-squared concept in fact comes directly from the world of statistics, where the emphasis is on interpretative models, and it has little use in machine learning contexts, where the emphasis is clearly on predictive models; at least AFAIK, and beyond some very introductory courses, I have never (I mean never...) seen a predictive modeling problem where R-squared is used for any kind of performance assessment; nor is it an accident that popular machine learning introductions, such as Andrew Ng's Machine Learning at Coursera, do not even bother to mention it. And, as noted in the Github thread above (emphasis added):

In particular when using a test set, it's a bit unclear to me what the R^2 means.

with which I certainly concur.

As for the edge case discussed above (to include or not an intercept term?), I suspect it would sound really irrelevant to modern deep learning practitioners, where the equivalent of an intercept (bias parameters) is always included by default in neural network models...

See the accepted (and highly upvoted) answer in the Cross Validated question Difference between statsmodel OLS and scikit linear regression for a more detailed discussion along these last lines. The discussion (and links) in Is R-squared Useless?, triggered by some relevant (negative) remarks by the great statistician Cosma Shalizi, is also enlightening and highly recommended.

Gasometry answered 10/2, 2019 at 17:13 Comment(1)
It's not that unclear: the R2 is zero if you predict the mean of the test set (or close to zero if you use the mean estimated on the training set), it's below zero if you do worse than predicting the mean, and it's one if you make a perfect prediction. So it's somewhat interpretable. It's also scale-independent, so it can be aggregated across data sets. But I agree, I've never seen it used in practice.Oconnor

You seem to be using sklearn.metrics.r2_score. The documentation states that

Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse)

The Wikipedia article that the documentation links to points out that

values of R2 outside the range 0 to 1 can occur when the model fits the data worse than a horizontal hyperplane. This would occur when the wrong model was chosen, or nonsensical constraints were applied by mistake.

For this reason, the fact that you got such a negative r2_score is probably far more significant than the relatively good (but not great) R^2 statistic computed the other way. If the first score indicates that your model choice is poor, then the second statistic is likely to be just an artifact of overfitting.
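
To make the quoted behaviour concrete, here is a minimal sketch with toy numbers (unrelated to the question's data) showing where r2_score sits in each regime:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

r2_score(y_true, np.full_like(y_true, y_true.mean()))  # 0.0  (predicting the mean, i.e. a "horizontal hyperplane")
r2_score(y_true, np.array([4.0, 3.0, 2.0, 1.0]))       # -3.0 (worse than predicting the mean)
r2_score(y_true, y_true)                                # 1.0  (perfect prediction)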

Gorden answered 10/2, 2019 at 12:1

As you note, and as the Wikipedia article notes, there are multiple definitions of "r squared" or "R squared." However, the common ones all have the property that they range from 0 to 1. They are usually positive, as is clear from the "squared" part of the name. (For exceptions to this general rule, see the Wikipedia article.)

Your "First R-Squared result" is -4.28, which is not between 0 and 1 and is not even positive. Thus it is not really an "R squared" at all. So use the "Second R-Squared result" which is in the correct range.

Jarl answered 10/2, 2019 at 10:58 Comment(3)
The Wikipedia article states that there are multiple definitions, some of which take on negative values. To say that the first R-squared is not really an R-squared at all is to take sides in a way that the Wikipedia article doesn't (although I would tend to agree with you that anything called R-squared which isn't positive is misnamed, but such is the terminology in this area). But you are correct that only the second one is really standard, so +1Gorden
@JohnColeman: I tried to briefly cover that some R-squared definitions result in negative values by my comments and my link. I do consider those definitions to be non-standard, as you state. Your answer covers those other definitions well and gives needed and helpful context, so +1 for you.Jarl
Arguably, "use the second one because it looks better" is naive and poor advice; even the Wikipedia article explicitly mentions that R-squared can be negative. The real issue here seems to be that the OP tries to compare apples with oranges (i.e. models with and without an intercept); (-1) from me, willing of course to rectify it in case the answer gets edited...Gasometry
