Different coefficients: scikit-learn vs statsmodels (logistic regression)
When running a logistic regression, the coefficients I get from statsmodels are correct (I verified them against some course material). However, I am unable to reproduce the same coefficients with sklearn, and preprocessing the data hasn't helped. This is my code:

Statsmodels:

import statsmodels.api as sm

# statsmodels does not add an intercept term automatically,
# so prepend a constant column before fitting
X_const = sm.add_constant(X)
model = sm.Logit(y, X_const)
results = model.fit()
print(results.summary())

The relevant output is:

                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.2382      3.983     -0.060      0.952      -8.045       7.569
a              2.0349      0.837      2.430      0.015       0.393       3.676
b              0.8077      0.823      0.981      0.327      -0.806       2.421
c              1.4572      0.768      1.897      0.058      -0.049       2.963
d             -0.0522      0.063     -0.828      0.407      -0.176       0.071
e_2            0.9157      1.082      0.846      0.397      -1.205       3.037
e_3            2.0080      1.052      1.909      0.056      -0.054       4.070

Scikit-learn (no preprocessing):

from sklearn.linear_model import LogisticRegression

# sklearn's LogisticRegression fits an intercept by default,
# so no constant column is needed here
model = LogisticRegression()
results = model.fit(X, y)
print(results.coef_)
print(results.intercept_)

The coefficients given are:

array([[ 1.29779008,  0.56524976,  0.97268593, -0.03762884,  0.33646097,
         0.98020901]])

And the intercept/constant given is:

array([ 0.0949539])

As you can see, regardless of which coefficient corresponds to which variable, the numbers given by sklearn don't match the correct ones from statsmodels. What am I missing? Thanks in advance!

Starling answered 19/5, 2018 at 19:37

Thanks to a kind soul on reddit, this was solved. To get the same coefficients, you have to effectively switch off the L2 regularisation that sklearn applies to logistic regression by default, for example by making C very large:

model = LogisticRegression(C=1e8)

where, according to the documentation, C is:

C : float, default: 1.0

Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
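
Putting this together, here is a minimal sketch (assuming X and y are defined as in the question) that runs both fits side by side; with a very large C the two sets of coefficients should agree to several decimal places:

import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

# statsmodels: plain (unpenalised) maximum-likelihood fit
sm_results = sm.Logit(y, sm.add_constant(X)).fit(disp=0)

# sklearn: make the default L2 penalty negligible via a huge C
skl_model = LogisticRegression(C=1e8).fit(X, y)

print(sm_results.params)                     # intercept first, then slopes
print(skl_model.intercept_, skl_model.coef_)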

Starling answered 19/5, 2018 at 20:01 Comment(1)
You can also turn regularization off entirely: LogisticRegression(penalty="none") – Hypertensive
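
A version note on the comment above: the "none" string was deprecated in scikit-learn 1.2 in favour of the Python literal None, so on recent releases the equivalent call is:

from sklearn.linear_model import LogisticRegression

# unpenalised fit on scikit-learn >= 1.2 (older releases used penalty="none")
model = LogisticRegression(penalty=None)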

I'm not familiar with statsmodels, but could it be that the .fit() method of that library uses different default arguments compared to sklearn? To verify this, you could explicitly set the same corresponding arguments for each .fit() call and see whether you still get different results.
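
For instance, a hedged sketch of what "making the arguments explicit" might look like on the sklearn side (the specific values here are illustrative, not either library's defaults):

from sklearn.linear_model import LogisticRegression

# pin down the optimiser and neutralise the default L2 penalty so the
# result approaches an unpenalised maximum-likelihood estimate
model = LogisticRegression(C=1e8, solver="lbfgs", tol=1e-10, max_iter=10000)
model.fit(X, y)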

Tomahawk answered 19/5, 2018 at 19:49

You can find an excellent article on this question at https://blog.finxter.com/logistic-regression-scikit-learn-vs-statsmodels/

Addend answered 7/7, 2022 at 12:10
