Why doesn't statsmodels GLM have R^2 in results?

I ran a simple GLM experiment in statsmodels and was perplexed to find that the GLM results object does not contain any R^2 attribute.

I feel like I'm missing something very simple here about why GLM does not provide an R^2 calculation, and about how I can calculate it myself.

Thanks!

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import statsmodels.api as sm

In [4]: x = np.arange(0, 10, 0.5)

In [5]: y = np.zeros(len(x))

In [6]: y[0] = 0

In [7]: for i in range(1, len(x)):
   ...:     y[i] = 0.5*x[i] + 2.5*y[i-1] + 10*np.random.rand()
   ...:

In [8]: print(y)
[  0.00000000e+00   9.35177024e-01   8.18487881e+00   2.95126464e+01
   8.08584645e+01   2.11423251e+02   5.38685230e+02   1.35653420e+03
   3.39564225e+03   8.49234338e+03   2.12377817e+04   5.31015961e+04
   1.32764789e+05   3.31924691e+05   8.29818265e+05   2.07455796e+06
   5.18640343e+06   1.29660216e+07   3.24150658e+07   8.10376747e+07]

In [9]: X = pd.DataFrame({'x1': x[1:], 'y-Lag1': y[:-1]})


In [10]: m1 = sm.GLM(y[1:], X).fit()

In [11]: m1.summary()
Out[11]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                   19
Model:                            GLM   Df Residuals:                       17
Model Family:                Gaussian   Df Model:                            1
Link Function:               identity   Scale:                   12.9022715725
Method:                          IRLS   Log-Likelihood:                -50.199
Date:                Thu, 23 Oct 2014   Deviance:                       219.34
Time:                        13:44:22   Pearson chi2:                     219.
No. Iterations:                     3                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             1.5746      0.175      8.999      0.000         1.232     1.918
y-Lag1         2.5000   1.23e-07   2.03e+07      0.000         2.500     2.500
==============================================================================
"""

In [12]: m1.<TAB>
m1.aic                    m1.llf                    m1.remove_data
m1.bic                    m1.load                   m1.resid_anscombe
m1.bse                    m1.model                  m1.resid_deviance
m1.conf_int               m1.mu                     m1.resid_pearson
m1.cov_params             m1.nobs                   m1.resid_response
m1.deviance               m1.norm_resid             m1.resid_working
m1.df_model               m1.normalized_cov_params  m1.save
m1.df_resid               m1.null                   m1.scale
m1.f_test                 m1.null_deviance          m1.summary
m1.family                 m1.params                 m1.summary2
m1.fit_history            m1.pearson_chi2           m1.t_test
m1.fittedvalues           m1.pinv_wexog             m1.tvalues
m1.initialize             m1.predict                
m1.k_constant             m1.pvalues         
Hurd asked 24/10, 2014 at 5:16 Comment(4):
Bucci: Curious! Try to print the result of m1.rsquared()?
Bucci: Check this thread as well: #21785549
Hurd: Not sure why it's implemented this way, but the SO answer above and some wikipedia-ing allowed me to calculate the R^2 by hand: sst_val = sum(map(lambda x: np.power(x,2), y - np.mean(y))); sse_val = sum(map(lambda x: np.power(x,2), m1.resid_response)); r2 = 1.0 - sse_val/sst_val
Bucci: Great! Report a bug in their GitHub repo (github.com/statsmodels); it might be helpful for other users.

For GLM with Gaussian errors and the identity link, R^2 makes sense (if the model has a constant), but it doesn't make sense as a general goodness of fit measure for GLM. You can file an enhancement request (or better yet a pull request) for including some better, more general goodness of fit statistics in the GLM results.

You can read some more about this in the context of R here.
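As an illustration of what a more general statistic could look like (a sketch added for context, not something GLMResults computes for you), a deviance-based pseudo-R^2 can be built from the deviance and null_deviance attributes the results object already exposes. For the Gaussian family with the identity link and an intercept it reduces to the ordinary R^2; for other families it is only a rough heuristic:

# Deviance-based pseudo-R^2 (the helper name is ours, not a statsmodels API):
# 1 - residual deviance / null deviance.
def deviance_pseudo_r2(glm_results):
    return 1.0 - glm_results.deviance / glm_results.null_deviance

# Usage with the model from the question:
# print(deviance_pseudo_r2(m1))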

Scrappy answered 24/10, 2014 at 15:51 Comment(0)

Not sure why it's implemented this way, but the SO answer above and some Wikipedia reading allowed me to calculate the R^2 by hand easily:

sst_val = np.sum((y[1:] - np.mean(y[1:]))**2)   # total sum of squares, over the observations the model was fit on
sse_val = np.sum(m1.resid_response**2)          # residual sum of squares
r2 = 1.0 - sse_val / sst_val
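
As a sanity check (a sketch, not part of the original answer: it adds an intercept with sm.add_constant, which the model in the question does not have, so that the centered R^2 is well defined), you can compare against plain OLS, since a Gaussian GLM with the identity link fits the same model and OLS does report rsquared:

import numpy as np
import statsmodels.api as sm

Xc = sm.add_constant(X)                        # add an intercept column
ols = sm.OLS(y[1:], Xc).fit()                  # OLS exposes .rsquared directly
glm = sm.GLM(y[1:], Xc, family=sm.families.Gaussian()).fit()

sst = np.sum((y[1:] - np.mean(y[1:]))**2)      # total sum of squares
sse = np.sum(glm.resid_response**2)            # residual sum of squares
print(ols.rsquared, 1.0 - sse / sst)           # the two numbers should agree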
Hurd answered 24/10, 2014 at 6:05 Comment(0)
