Different results from lm in R vs. statsmodels OLS in Python

I'm new to Python and have been an R user. I am getting VERY different results from a simple regression model when I build it in R vs. when I run the same thing in IPython.

The R-squared, the p-values, the significance of the coefficients: nothing matches. Am I reading the output wrong or making some other fundamental error?

Below are my code and results for both:

R Code

str(df_nv)
Classes 'tbl_df', 'tbl' and 'data.frame':   81 obs. of  2 variables:
 $ Dependent Variable  : num  733 627 405 353 434 556 381 558 612 901 ...
 $ Independent Variable: num  0.193 0.167 0.169 0.14 0.145 ...


summary(lm(`Dependent Variable` ~ `Independent Variable`, data = df_nv))

Call:
    lm(formula = `Dependent Variable` ~ `Independent Variable`, data = df_nv)


Residuals:
    Min      1Q  Median      3Q     Max 
-501.18 -139.20  -82.61  -15.82 2136.74 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)   
(Intercept)               478.2      148.2   3.226  0.00183 **
`Independent Variable`   -196.1     1076.9  -0.182  0.85601   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 381.5 on 79 degrees of freedom
Multiple R-squared:  0.0004194, Adjusted R-squared:  -0.01223 
F-statistic: 0.03314 on 1 and 79 DF,  p-value: 0.856

IPython Notebook Code

df_nv.dtypes

Dependent Variable           float64
Independent Variable         float64
dtype: object

model = sm.OLS(df_nv['Dependent Variable'], df_nv['Independent Variable'])

results = model.fit()
results.summary()

                          OLS Regression Results
Dep. Variable:     Dependent Variable   R-squared:                0.537
Model:                            OLS   Adj. R-squared:           0.531
Method:                 Least Squares   F-statistic:              92.63
Date:                Fri, 20 Jan 2017   Prob (F-statistic):    5.23e-15
Time:                        09:08:54   Log-Likelihood:         -600.40
No. Observations:                  81   AIC:                      1203.
Df Residuals:                      80   BIC:                      1205.
Df Model:                           1
Covariance Type:            nonrobust

                           coef    std err        t    P>|t|    [95.0% Conf. Int.]
Independent Variable  3133.1825    325.537    9.625    0.000    2485.342  3781.023

Omnibus:                       89.595   Durbin-Watson:            1.940
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       980.289
Skew:                           3.489   Prob(JB):             1.36e-213
Kurtosis:                      18.549   Cond. No.                  1.00

For reference, here is the head of the data frame in both R and Python:

head(df_nv)
  Dependent Variable Independent Variable
          <dbl>                <dbl>
1           733            0.1932367
2           627            0.1666667
3           405            0.1686183
4           353            0.1398601
5           434            0.1449275
6           556            0.1475410

Python:

df_nv.head()

    Dependent Variable  Independent Variable
5292    733.0   0.193237
5320    627.0   0.166667
5348    405.0   0.168618
5404    353.0   0.139860
5460    434.0   0.144928

R code

df <- read.csv('gapminder.csv')
df <- df[c('internetuserate', 'urbanrate')]
df <- df[complete.cases(df),]
dim(df)
# [1] 190   2

m <- lm(internetuserate~urbanrate, df)
summary(m)
#Call:
#lm(formula = internetuserate ~ urbanrate, data = df)
#
#Residuals:
#    Min      1Q  Median      3Q     Max
#-51.474 -15.857  -3.954  14.305  74.590
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) -4.90375    4.11485  -1.192    0.235
#urbanrate    0.72022    0.06753  10.665   <2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 22.03 on 188 degrees of freedom
#Multiple R-squared:  0.3769, Adjusted R-squared:  0.3736
#F-statistic: 113.7 on 1 and 188 DF,  p-value: < 2.2e-16

Python code

import pandas
import statsmodels.formula.api as smf

data = pandas.read_csv('gapminder.csv')
data = data[['internetuserate', 'urbanrate']]
# Coerce non-numeric entries to NaN so they can be dropped below
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data = data.dropna(axis=0, how='any')
print(data.shape)
# (190, 2)

reg1 = smf.ols('internetuserate ~ urbanrate', data=data).fit()
print(reg1.summary())
#                            OLS Regression Results
#==============================================================================
#Dep. Variable:        internetuserate   R-squared:                       0.377
#Model:                            OLS   Adj. R-squared:                  0.374
#Method:                 Least Squares   F-statistic:                     113.7
#Date:                Fri, 20 Jan 2017   Prob (F-statistic):           4.56e-21
#Time:                        23:27:50   Log-Likelihood:                -856.14
#No. Observations:                 190   AIC:                             1716.
#Df Residuals:                     188   BIC:                             1723.
#Df Model:                           1
#Covariance Type:            nonrobust
#==============================================================================
#                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
#------------------------------------------------------------------------------
#Intercept     -4.9037      4.115     -1.192      0.235       -13.021     3.213
#urbanrate      0.7202      0.068     10.665      0.000         0.587     0.853
#==============================================================================
#Omnibus:                       10.750   Durbin-Watson:                   2.097
#Prob(Omnibus):                  0.005   Jarque-Bera (JB):               10.990
#Skew:                           0.574   Prob(JB):                      0.00411
#Kurtosis:                       3.262   Cond. No.                         157.
#==============================================================================
