OLS Regression: Scikit vs. Statsmodels? [closed]

Short version: I was using scikit-learn's LinearRegression on some data, but I'm used to p-values, so I put the same data into statsmodels' OLS. Although the R^2 is about the same, the variable coefficients all differ by large amounts. This concerns me, since the most likely explanation is that I've made an error somewhere, and now I don't feel confident in either output (I have probably set up one of the models incorrectly, but I don't know which one).

Longer version: Because I don't know where the issue is, I don't know exactly which details to include, and including everything is probably too much. I am also not sure about including code or data.

I am under the impression that scikit's LR and statsmodels OLS should both be doing OLS, and as far as I know OLS is OLS so the results should be the same.

For scikit-learn's LinearRegression, the results are (statistically) the same whether I set normalize=True or normalize=False, which I find somewhat strange.

For statsmodels' OLS, I normalize the data using StandardScaler from sklearn. I also add a column of ones so the model includes an intercept (since scikit's output includes one); more on that here: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html (Adding this column did not change the variable coefficients to any notable degree, and the intercept was very close to zero.) StandardScaler didn't like that my ints weren't floats, so I tried this: https://github.com/scikit-learn/scikit-learn/issues/1709 That makes the warning go away, but the results are exactly the same.
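
For reference, a minimal sketch of that preprocessing, with stand-in random data in place of the real features:

import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

# Stand-in data: in the real setup X holds the weekly avatar features
# (ints cast to float to avoid the StandardScaler warning) and y the
# levels gained.
X = np.random.randint(0, 10, size=(100, 3)).astype(float)
y = np.random.random(100)

X_scaled = StandardScaler().fit_transform(X)
X_with_const = sm.add_constant(X_scaled)  # column of ones -> intercept
print(sm.OLS(y, X_with_const).fit().params)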

Granted, I'm using 5-fold CV for the sklearn approach (the R^2 values are consistent for both test and training data each time), while for statsmodels I just give it all the data.
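
To be concrete, that sklearn setup looks roughly like this (a sketch using the current cross-validation API, with stand-in data):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = np.random.random((100, 3))
y = np.random.random(100)

# cross_val_score returns one out-of-sample R^2 per fold, while a
# statsmodels fit on all the data reports a single in-sample R^2,
# so the two numbers are not directly comparable.
print(cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2'))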

R^2 is about 0.41 for both sklearn and statsmodels (this is good for social science). This could be a good sign or just a coincidence.

The data is observations of avatars in WoW (from http://mmnet.iis.sinica.edu.tw/dl/wowah/) which I munged about to make it weekly with some different features. Originally this was a class project for a data science class.

Independent variables include number of observations in a week (int), character level (int), if in a guild (Boolean), when seen (Booleans on weekday day, weekday eve, weekday late, and the same three for weekend), a dummy for character class (at the time for the data collection, there were only 8 classes in WoW, so there are 7 dummy vars and the original string categorical variable is dropped), and others.
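
For reference, dummy coding like that can be done with pandas (a sketch; the column name and values are hypothetical):

import pandas as pd

# With the 8 classes in the real data, drop_first=True would leave 7
# dummies and drop the original categorical column's reference level.
df = pd.DataFrame({'char_class': ['Mage', 'Warrior', 'Priest', 'Druid']})
dummies = pd.get_dummies(df['char_class'], prefix='class', drop_first=True)
print(dummies)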

The dependent variable is how many levels each character gained during that week (int).

Interestingly, some of the relative ordering within groups of similar variables is maintained across statsmodels and sklearn. The rank order of the "when seen" variables is the same although the loadings are very different, and the rank order of the character-class dummies is the same although, again, the loadings are very different.

I think this question is similar to this one: Difference in Python statsmodels OLS and R's lm

I am good enough at Python and stats to make a go of it, but not good enough to figure something like this out. I tried reading the sklearn docs and the statsmodels docs, but if the answer was there staring me in the face, I did not understand it.

I would love to know:

  1. Which output might be accurate? (Granted they might both be if I missed a kwarg.)
  2. If I made a mistake, what is it and how to fix it?
  3. Could I have figured this out without asking here, and if so how?

I know this question has some rather vague bits (no code, no data, no output), but I am thinking it is more about the general processes of the two packages. Sure, one seems to be more stats and one seems to be more machine learning, but they're both OLS so I don't understand why the outputs aren't the same.

(I even tried some other OLS calls to triangulate, one gave a much lower R^2, one looped for five minutes and I killed it, and one crashed.)

Thanks!

Persephone answered 26/2, 2014 at 22:34 Comment(6)
Can you replicate your problem on a small input? If so, can you post the input and your code here? – Tuft
Just one possibility: did you check the rank of your matrix of explanatory variables? Could it be singular? But it's difficult to tell what might cause differences without a more explicit example. – Indeterminacy
Ah ok -- I will see if I can improve the q with some of those things tomorrow (US Eastern time). I was worried I would not be able to ask a question with the right specifics for this case. – Persephone
One possibility is for you to generate some random data and run your procedure with it, and see whether you get the same difference. This way you could see whether it's a problem in the data or in the usage of statsmodels versus scikit-learn. – Indeterminacy
Oh that's a good idea too! I am not sure what "the rank of your matrix of explanatory variables" means, btw. My stats are all old and rusty, and the machine learning side of things seems to use different names for things, and the approaches are a bit different, so I'm struggling at times with the nomenclature. – Persephone
@NatPoor: if the matrix is singular, that indicates that some of your independent variables are correlated, which might explain why you're getting two sets of different coefficients. – Chemotherapy

It sounds like you are not feeding the same matrix of regressors X to both procedures (but see below). Here's an example to show you which options you need to use for sklearn and statsmodels to produce identical results.

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Generate artificial data (2 regressors + constant)
nobs = 100
X = np.random.random((nobs, 2))
X = sm.add_constant(X)  # prepend a column of ones
beta = [1, .1, .5]
e = np.random.random(nobs)
y = np.dot(X, beta) + e

# Fit the regression model both ways; the coefficients match
sm.OLS(y, X).fit().params
# array([ 1.4507724 ,  0.08612654,  0.60129898])

# fit_intercept=False because X already contains the constant column
LinearRegression(fit_intercept=False).fit(X, y).coef_
# array([ 1.4507724 ,  0.08612654,  0.60129898])

As a commenter suggested, even if you are giving both programs the same X, X may not have full column rank, and statsmodels and sklearn could be taking (different) actions under the hood to make the OLS computation go through (i.e., dropping different columns).
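
One quick way to check, reusing the X built above (a sketch):

import numpy as np

# If the rank is smaller than the number of columns, X is singular and
# each library may resolve the redundancy in its own way.
print(np.linalg.matrix_rank(X), X.shape[1])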

I recommend you use pandas and patsy to take care of this:

import pandas as pd
from patsy import dmatrices

dat = pd.read_csv('wow.csv')
y, X = dmatrices('levels ~ week + character + guild', data=dat)
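
(patsy expands a categorical variable like the character class into dummy columns automatically, dropping one level as the reference, which avoids the rank problem described above.)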

Or, alternatively, the statsmodels formula interface:

import statsmodels.formula.api as smf
dat = pd.read_csv('wow.csv')
mod = smf.ols('levels ~ week + character + guild', data=dat).fit()

Edit: This example might be useful: http://statsmodels.sourceforge.net/devel/example_formulas.html

Lacker answered 27/2, 2014 at 13:42 Comment(5)
Awesome, thanks. Let me... well, I'll post the functions I built, and then come back and try to apply these ideas. I do understand what "not feeding the same matrix" means, #win... I would hope I didn't mess up at that level, but of course it is possible. – Persephone
Actually, I will try the code sample here before pasting in 75 lines of code (my two different function calls). I don't want to waste people's time by having them read over code if the answer is here already. (Granted, I might end up posting it if this code works and then I can't quite figure out where I went wrong, but one step at a time.) Should be able to get to it sometime today (maybe later). Thanks everyone! – Persephone
Ok! That code did indeed get me the same results across the two libraries for the same data! Nice! However, the numbers are totally different from the previous two I have -- good thing I asked here! I will work on figuring that out, now that I have a good starting point and some numbers that I think I can trust. (I'm a little disappointed that I managed to make two regressions and yet they went totally awry... Maybe I should stick to SPSS and R.... no way!) – Persephone
Summary: Ok, I got SM with normalizing (StandardScaler) and also SK with CV (and with SS) to work with roughly the same results. The problem seems to be that I had to convert the integers to numpy floats (at this point I cannot recall why), and that worked for both the SM and SK (no CV) versions (worked meaning they gave the same results, and I am confident those results are accurate). When I added CV to the working SK function (with numpy floats), the R^2 went to like -5000. So something (perhaps obvious?) isn't working between the CV and the np floats. I take np floats out and it is ok! – Persephone

If you use statsmodels, I would highly recommend using the statsmodels formula interface instead. You will get the same OLS results from the statsmodels formula interface as you would from sklearn.linear_model.LinearRegression, or R, or SAS, or Excel.

import statsmodels.formula.api as smf

# df is a DataFrame containing columns y and x
smod = smf.ols(formula='y ~ x', data=df)
result = smod.fit()
print(result.summary())

When in doubt, please

  1. try reading the source code,
  2. try a different language as a benchmark, or
  3. try OLS from scratch, which is basic linear algebra (see the sketch below).
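
For point 3, a minimal from-scratch sketch with numpy (np.linalg.lstsq is the numerically safer route when X may be ill-conditioned):

import numpy as np

# Toy data with an explicit intercept column
X = np.column_stack([np.ones(100), np.random.random((100, 2))])
y = np.dot(X, [1, .1, .5]) + np.random.random(100)

# Normal equations: beta = (X'X)^{-1} X'y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver, stable even when X is near-singular
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
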
Varia answered 23/8, 2019 at 23:30 Comment(1)
statsmodels is way more friendly than scikit-learn. I'm about done with (mostly failing at) deciphering the incomprehensible input and output array/matrix formats required by the latter. – Orangeman

I just wanted to add here that, in terms of sklearn, it does not use the OLS method for linear regression under the hood. Since sklearn comes from the data-mining/machine-learning realm, they like to use the steepest-descent gradient algorithm. This is a numerical method that is sensitive to initial conditions etc., while OLS is an analytical closed-form approach, so one should expect differences. statsmodels comes from the classical statistics field, hence they would use the OLS technique. So there are differences between the two linear regressions from the two different libraries.

Dorso answered 18/6, 2018 at 22:35 Comment(3)
This answer is wrong. LinearRegression from sklearn uses OLS. Just look at the source code: github.com/scikit-learn/scikit-learn/blob/1495f6924/sklearn/… – Varia
Hi, back when I answered this, I contacted the guys at sklearn, and they informed me that they did not have an OLS implementation, only the SGD algorithm. But I did not try to look into the git code base, so thanks for finding this out, Sarah. So either the person that replied to me was not aware, or they more recently implemented OLS. Either way, thanks for pointing this out, Sarah; really appreciate it. – Dorso
Thank you, Palu, for responding with your nice comments :) – Varia
