I have noticed by accident that OLS models implemented by sklearn and statsmodels yield different values of R^2 when the intercept is not fitted. Otherwise they seem to work the same. The following code:
import numpy as np
import sklearn
import statsmodels
import sklearn.linear_model as sl
import statsmodels.api as sm
np.random.seed(42)
N=1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)
sklearnIntercept = sl.LinearRegression(fit_intercept=True).fit(X, Y)
sklearnNoIntercept = sl.LinearRegression(fit_intercept=False).fit(X, Y)
statsmodelsIntercept = sm.OLS(Y, sm.add_constant(X))
statsmodelsNoIntercept = sm.OLS(Y, X)
print(sklearnIntercept.score(X, Y), statsmodelsIntercept.fit().rsquared)
print(sklearnNoIntercept.score(X, Y), statsmodelsNoIntercept.fit().rsquared)
print(sklearn.__version__, statsmodels.__version__)
prints:
0.78741906105 0.78741906105
-0.950825182861 0.783154483028
0.19.1 0.8.0
Where does the difference come from?
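To probe where the discrepancy might lie, the no-intercept fit can be recomputed by hand and scored with two different total-sum-of-squares conventions: a centered one (relative to the mean of Y) and an uncentered one (relative to zero). This is only a sketch of a suspected cause, not a confirmed explanation; it assumes `np.linalg.lstsq` reproduces the same no-intercept fit as both libraries.

```python
import numpy as np

np.random.seed(42)
N = 1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

# Ordinary least squares without an intercept, computed directly
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
ssr = np.sum((Y - X @ beta) ** 2)  # residual sum of squares

# Two conventions for the total sum of squares
tss_centered = np.sum((Y - Y.mean()) ** 2)  # relative to the mean
tss_uncentered = np.sum(Y ** 2)             # relative to zero

print(1 - ssr / tss_centered)    # matches sklearn's negative score above
print(1 - ssr / tss_uncentered)  # matches statsmodels' rsquared above
```

The two hand-computed values reproduce both printed numbers, which suggests the packages disagree only on which total sum of squares to use when no constant term is present.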
The question differs from Different Linear Regression Coefficients with statsmodels and sklearn, as there sklearn.linear_model.LinearRegression (with intercept) was fit on X prepared as for statsmodels.api.OLS.

It also differs from Statsmodels: Calculate fitted values and R squared, as it addresses the difference between two Python packages (statsmodels and scikit-learn), while the linked question is about statsmodels and the common R^2 definition. Both are answered by the same answer, but that issue has already been discussed here: Does the same answer imply that the questions should be closed as duplicate?