OLS using statsmodels.formula.api versus statsmodels.api

Can anyone explain the difference between ols in statsmodels.formula.api and OLS in statsmodels.api?

Using the Advertising data from the ISLR text, I ran an ols using both, and got different results. I then compared with scikit-learn's LinearRegression.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Raw string so the backslashes in the Windows path are not treated as escapes
df = pd.read_csv(r"C:\...\Advertising.csv")

# Keep both as single-column DataFrames
x1 = df.loc[:, ['TV']]
y1 = df.loc[:, ['Sales']]

print("Statsmodel.Formula.Api Method")
model1 = smf.ols(formula='Sales ~ TV', data=df).fit()
print(model1.params)

print("\nStatsmodel.Api Method")
model2 = sm.OLS(y1, x1)
results = model2.fit()
print(results.params)

print("\nSci-Kit Learn Method")
model3 = LinearRegression()
model3.fit(x1, y1)
print(model3.coef_)
print(model3.intercept_)

The output is as follows:

Statsmodel.Formula.Api Method
Intercept    7.032594
TV           0.047537
dtype: float64

Statsmodel.Api Method
TV    0.08325
dtype: float64

Sci-Kit Learn Method
[[ 0.04753664]]
[ 7.03259355]

The statsmodels.api method returns a different coefficient for TV than the statsmodels.formula.api and scikit-learn methods, and it returns no intercept at all.

What kind of OLS algorithm is statsmodels.api running that would produce a different result? Does anyone have a link to documentation that could help answer this question?

Trev answered 4/6, 2015 at 17:20 Comment(1)
It's worth noting that interactions and non-linear terms can be created directly in the formula string rather than by generating new columns in your dataset. This page provides a useful guide to formulas: statsmodels.org/dev/examples/notebooks/generated/formulas.html - Bonine
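
To illustrate the comment above, here is a minimal sketch of interaction and non-linear terms written directly in a formula. The data are synthetic and the column names TV, Radio, and Sales are only borrowed from the question for illustration; nothing here uses the real Advertising file.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption: made up for illustration, not the ISLR file)
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "TV": rng.uniform(0, 300, size=200),
    "Radio": rng.uniform(0, 50, size=200),
})
df["Sales"] = (5.0 + 0.04 * df["TV"] + 0.1 * df["Radio"]
               + 0.001 * df["TV"] * df["Radio"]
               + rng.normal(0, 1, size=200))

# "TV * Radio" expands to TV + Radio + TV:Radio (main effects plus interaction)
interaction_fit = smf.ols("Sales ~ TV * Radio", data=df).fit()

# I(...) makes the formula parser treat ** as ordinary arithmetic (squaring)
quadratic_fit = smf.ols("Sales ~ TV + I(TV ** 2)", data=df).fit()

print(interaction_fit.params.index.tolist())
print(quadratic_fit.params.index.tolist())
```

Neither model needed an extra column in df; the formula builds the design matrix, including the intercept.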

The difference is due to the presence of intercept or not:

  • in statsmodels.formula.api, as in R, a constant is automatically added to your data and an intercept is fitted
  • in statsmodels.api, you have to add a constant yourself (see the documentation here). Try using add_constant from statsmodels.api

    x1 = sm.add_constant(x1)
    
Nato answered 4/6, 2015 at 18:6 Comment(2)
I will add that statsmodels.formula.api is easier if you want to run a fixed-effect regression. - Bursar
I have the impression that this answer is now obsolete. Please have a look here for more details: #51127428. Apologies in advance if I am wrong in what I am saying. - Bernstein

Came across this issue today and wanted to elaborate on @stellasia's answer because the statsmodels documentation is perhaps a bit ambiguous.

Unless you are using actual R-style string formulas when instantiating OLS, you need to add a constant (literally a column of 1s) under both statsmodels.formula.api and plain statsmodels.api. @Chetan is using R-style formatting here (formula='Sales ~ TV'), so he will not run into this subtlety, but for people with some Python knowledge and no R background this could be very confusing.

Furthermore it doesn't matter whether you specify the hasconst parameter when building the model. (Which is kind of silly.) In other words, unless you are using R-style string formulas, hasconst is ignored even though it is supposed to

[Indicate] whether the RHS includes a user-supplied constant

because, in the footnotes

No constant is added by the model unless you are using formulas.

The example below shows that both .formula.api and .api will require a user-added column vector of 1s if not using R-style string formulas.

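import numpy as np
import statsmodels.api as sm

# Generate some relational data
np.random.seed(123)
nobs = 25
x = np.random.random((nobs, 2))
# Append a column of 1s as the last column (prepend=False)
x_with_ones = sm.add_constant(x, prepend=False)
beta = [.1, .5, 1]  # two slopes, then the intercept
e = np.random.random(nobs)
y = np.dot(x_with_ones, beta) + e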

Now throw x and y into Excel and run Data>Data Analysis>Regression, making sure "Constant is zero" is unchecked. You'll get the following coefficients:

Intercept       1.497761024
X Variable 1    0.012073045
X Variable 2    0.623936056

Now, try running this regression on x, not x_with_ones, in either statsmodels.formula.api or statsmodels.api with hasconst set to None, True, or False. You'll see that in each of those 6 scenarios, there is no intercept returned. (There are only 2 parameters.)

import statsmodels.formula.api as smf
import statsmodels.api as sm

print('smf models')
print('-' * 10)
for hc in [None, True, False]:
    model = smf.OLS(endog=y, exog=x, hasconst=hc).fit()
    print(model.params)

# smf models
# ----------
# [ 1.46852293  1.8558273 ]
# [ 1.46852293  1.8558273 ]
# [ 1.46852293  1.8558273 ]

Now running things correctly with a column vector of 1.0s added to x. You can use smf here but it's really not necessary if you're not using formulas.

print('sm models')
print('-' * 10)
for hc in [None, True, False]:
    model = sm.OLS(endog=y, exog=x_with_ones, hasconst=hc).fit()
    print(model.params)

# sm models
# ----------
# [ 0.01207304  0.62393606  1.49776102]
# [ 0.01207304  0.62393606  1.49776102]
# [ 0.01207304  0.62393606  1.49776102]
Rhebarhee answered 13/7, 2017 at 18:30 Comment(2)
Very good answer, which clarified many things from stellasia's answer. In my opinion, your answer should be ticked as correct. (Because of stellasia's answer I spent some hours working out that even statsmodels.formula.api does not add an intercept if you do not use formulas.) - Bernstein
In version 0.14.0 (statsmodels.__version__), the capitalized OLS function has been removed from the formula API, so the above code doesn't work. I believe this change was made several versions ago. Recent versions are therefore more straightforward: statsmodels.formula.api only accepts formulas and automatically adds the intercept unless you eliminate it via your formula, while statsmodels.api only accepts endog / exog inputs and does not add the intercept unless you add it yourself, either manually or via the add_constant() function. - Uncanny

I had a similar issue with the Logit function. (I used patsy to create my matrices, so the intercept was there.) My sm.logit was not converging. My sm.formula.logit was converging however.

The data going in was exactly the same. I changed the solver method to 'newton' and the sm.logit converged as well. Is it possible that the two versions have different default solver methods?

Agave answered 18/6, 2020 at 16:1 Comment(0)
