Different results from lm in R vs. statsmodel OLS in Python
I'm new to Python and have been an R user. I am getting very different results from a simple regression model when I build it in R vs. when I run the same thing in IPython.

The R-squared, the p-value, the significance of the coefficients: nothing matches. Am I reading the output wrong, or am I making some other fundamental error?

Below is my code for both, along with the results:

R Code

str(df_nv)
Classes 'tbl_df', 'tbl' and 'data.frame':   81 obs. of  2 variables:
 $ Dependent Variable  : num  733 627 405 353 434 556 381 558 612 901 ...
 $ Independent Variable: num  0.193 0.167 0.169 0.14 0.145 ...


summary(lm(`Dependent Variable` ~ `Independent Variable`, data = df_nv))

Call:
    lm(formula = `Dependent Variable` ~ `Independent Variable`, data = df_nv)


Residuals:
    Min      1Q  Median      3Q     Max 
-501.18 -139.20  -82.61  -15.82 2136.74 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)   
(Intercept)               478.2      148.2   3.226  0.00183 **
`Independent Variable`   -196.1     1076.9  -0.182  0.85601   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 381.5 on 79 degrees of freedom
Multiple R-squared:  0.0004194, Adjusted R-squared:  -0.01223 
F-statistic: 0.03314 on 1 and 79 DF,  p-value: 0.856

IPython Notebook Code

df_nv.dtypes

Dependent Variable           float64
Independent Variable         float64
dtype: object

model = sm.OLS(df_nv['Dependent Variable'], df_nv['Independent Variable'])

results = model.fit()
results.summary()

                             OLS Regression Results
==============================================================================
Dep. Variable:     Dependent Variable   R-squared:                       0.537
Model:                            OLS   Adj. R-squared:                  0.531
Method:                 Least Squares   F-statistic:                     92.63
Date:                Fri, 20 Jan 2017   Prob (F-statistic):           5.23e-15
Time:                        09:08:54   Log-Likelihood:                -600.40
No. Observations:                  81   AIC:                             1203.
Df Residuals:                      80   BIC:                             1205.
Df Model:                           1
Covariance Type:            nonrobust
========================================================================================
                           coef    std err        t      P>|t|    [95.0% Conf. Int.]
----------------------------------------------------------------------------------------
Independent Variable  3133.1825    325.537    9.625      0.000    2485.342  3781.023
========================================================================================
Omnibus:                       89.595   Durbin-Watson:                   1.940
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              980.289
Skew:                           3.489   Prob(JB):                    1.36e-213
Kurtosis:                      18.549   Cond. No.                         1.00
==============================================================================

For reference, the head of the dataframe in both R and Python:

R:

head(df_nv)
  Dependent Variable Independent Variable
          <dbl>                <dbl>
1           733            0.1932367
2           627            0.1666667
3           405            0.1686183
4           353            0.1398601
5           434            0.1449275
6           556            0.1475410

Python:

df_nv.head()

    Dependent Variable  Independent Variable
5292    733.0   0.193237
5320    627.0   0.166667
5348    405.0   0.168618
5404    353.0   0.139860
5460    434.0   0.144928
Acicular answered 20/1, 2017 at 14:20 Comment(8)
Where do you add the intercept in the Python code? – Flotation
You have to add the intercept explicitly? I was referring to this page in the statsmodels documentation: statsmodels.sourceforge.net/devel/… – Acicular
How do I add the intercept? – Acicular
See there. – Flotation
Note that I don't speak Python. But I get suspicious if regression output doesn't show a coefficient estimate for the intercept. – Flotation
Yeah, I was a bit hasty there. – Acicular
sm.add_constant(df_nv['Independent Variable']) – Primateship
Or use the formula interface, which imitates R and other packages by adding a constant by default. – Halfcock
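
Pulling the comments together: a minimal sketch of the two fixes, assuming df_nv as defined in the question (Q() is used to quote the column names because they contain spaces):

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Option 1: add the intercept column explicitly before fitting.
X = sm.add_constant(df_nv['Independent Variable'])  # prepends a column of ones named 'const'
results = sm.OLS(df_nv['Dependent Variable'], X).fit()
print(results.summary())

# Option 2: use the formula interface, which, like R's lm(),
# adds an intercept by default.
results_f = smf.ols('Q("Dependent Variable") ~ Q("Independent Variable")', data=df_nv).fit()
print(results_f.summary())

Either way the fitted model now includes an intercept, and the coefficient table should match the lm() output above.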

The following is the result of running a linear regression on the gapminder dataset in Python (pandas plus statsmodels.formula.api) and in R; the results are exactly the same:

R code

df <- read.csv('gapminder.csv')
df <- df[c('internetuserate', 'urbanrate')]
df <- df[complete.cases(df),]
dim(df)
# [1] 190   2
m <- lm(internetuserate~urbanrate, df)
summary(m)
#Call:
#lm(formula = internetuserate ~ urbanrate, data = df)

#Residuals:
#    Min      1Q  Median      3Q     Max 
#-51.474 -15.857  -3.954  14.305  74.590 

#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept) -4.90375    4.11485  -1.192    0.235    
#urbanrate    0.72022    0.06753  10.665   <2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
#Residual standard error: 22.03 on 188 degrees of freedom
#Multiple R-squared:  0.3769,   Adjusted R-squared:  0.3736 
#F-statistic: 113.7 on 1 and 188 DF,  p-value: < 2.2e-16

python code

import pandas
import statsmodels.formula.api as smf 
data = pandas.read_csv('gapminder.csv')
data = data[['internetuserate', 'urbanrate']]
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data = data.dropna(axis=0, how='any')
print(data.shape)
# (190, 2)
reg1 = smf.ols('internetuserate ~  urbanrate', data=data).fit()
print (reg1.summary())
#                           OLS Regression Results
#==============================================================================
#Dep. Variable:        internetuserate   R-squared:                       0.377
#Model:                            OLS   Adj. R-squared:                  0.374
#Method:                 Least Squares   F-statistic:                     113.7
#Date:                Fri, 20 Jan 2017   Prob (F-statistic):           4.56e-21
#Time:                        23:27:50   Log-Likelihood:                -856.14
#No. Observations:                 190   AIC:                             1716.
#Df Residuals:                     188   BIC:                             1723.
#Df Model:                           1
#Covariance Type:            nonrobust
#==============================================================================
#                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
#------------------------------------------------------------------------------
#Intercept     -4.9037      4.115     -1.192      0.235       -13.021     3.213
#urbanrate      0.7202      0.068     10.665      0.000         0.587     0.853
#==============================================================================
#Omnibus:                       10.750   Durbin-Watson:                   2.097
#Prob(Omnibus):                  0.005   Jarque-Bera (JB):               10.990
#Skew:                           0.574   Prob(JB):                      0.00411
#Kurtosis:                       3.262   Cond. No.                         157.
#==============================================================================
Parrish answered 20/1, 2017 at 18:10 Comment(3)
But when we use statsmodels.api we don't get the same result. Do you know why? – Teetotaler
import statsmodels.api as sm; model = sm.OLS(data['internetuserate'], data['urbanrate']); results = model.fit(); results.summary() – Teetotaler
OLS Regression Results: Dep. Variable: internetuserate, R-squared: 0.762, Model: OLS, Adj. R-squared: 0.761, Method: Least Squares, F-statistic: 605.3, Date: Thu, 18 Oct 2018, Prob (F-statistic): 7.89e-61, Time: 16:50:18, Log-Likelihood: -856.86, No. Observations: 190, AIC: 1716., Df Residuals: 189, BIC: 1719., Df Model: 1, Covariance Type: nonrobust – Teetotaler
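
A sketch of why, for the last comment: sm.OLS fits exactly the design matrix it is given, so without a constant the line is forced through the origin, and the reported R-squared is then computed without centering, which is why it looks inflated (0.762). Adding the constant reproduces the formula-interface (and R) fit; this assumes the same data frame built in the answer above:

import statsmodels.api as sm

# Add the intercept column explicitly; sm.OLS never adds one on its own.
X = sm.add_constant(data['urbanrate'])
results = sm.OLS(data['internetuserate'], X).fit()
print(results.summary())  # should now match smf.ols and R's lm()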
