Summary not working for OLS estimation
I am having an issue with a statsmodels OLS estimation. The model fits without any issues, but when I call summary() to see the actual results I get a TypeError saying the axis must be specified when the shapes of a and weights differ.

My code looks like this:

from __future__ import print_function, division 
import xlrd as xl
import numpy as np
import scipy as sp
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm

file_loc = "/Users/NiklasLindeke/Python/dataset_3.xlsx"
workbook = xl.open_workbook(file_loc)
sheet = workbook.sheet_by_index(0)
tot = sheet.nrows

data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]

rv1 = []
rv5 = []
rv22 = []
rv1fcast = []
T = []
price = []
time = []
retnor = []

model = []

for i in range(1, tot):        
    t = data[i][0]
    ret = data[i][1]
    ret5 = data[i][2]
    ret22 = data[i][3]
    ret1_1 = data[i][4]
    retn = data[i][5]
    t = xl.xldate_as_tuple(t, 0)
    rv1.append(ret)
    rv5.append(ret5)
    rv22.append(ret22)
    rv1fcast.append(ret1_1)
    retnor.append(retn)
    T.append(t)


df = pd.DataFrame({'RVFCAST': rv1fcast, 'RV1': rv1, 'RV5': rv5, 'RV22': rv22})
df = df[df.RVFCAST != ""]

Model = smf.ols(formula='RVFCAST ~ RV1 + RV5 + RV22', data=df).fit()
print(Model.summary())

In other words, this doesn't work.

The traceback is the following:

print Model.summary()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-394-ea8ea5139fd4> in <module>()
----> 1 print Model.summary()

/Users/NiklasLindeke/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.6-x86_64.egg/statsmodels/regression/linear_model.pyc in summary(self, yname, xname, title, alpha)
   1948             top_left.append(('Covariance Type:', [self.cov_type]))
   1949 
-> 1950         top_right = [('R-squared:', ["%#8.3f" % self.rsquared]),
   1951                      ('Adj. R-squared:', ["%#8.3f" % self.rsquared_adj]),
   1952                      ('F-statistic:', ["%#8.4g" % self.fvalue] ),

/Users/NiklasLindeke/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.6-x86_64.egg/statsmodels/tools/decorators.pyc in __get__(self, obj, type)
     92         if _cachedval is None:
     93             # Call the "fget" function
---> 94             _cachedval = self.fget(obj)
     95             # Set the attribute in obj
     96 #            print("Setting %s in cache to %s" % (name, _cachedval))

/Users/NiklasLindeke/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.6-x86_64.egg/statsmodels/regression/linear_model.pyc in rsquared(self)
   1179     def rsquared(self):
   1180         if self.k_constant:
-> 1181             return 1 - self.ssr/self.centered_tss
   1182         else:
   1183             return 1 - self.ssr/self.uncentered_tss

/Users/NiklasLindeke/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.6-x86_64.egg/statsmodels/tools/decorators.pyc in __get__(self, obj, type)
     92         if _cachedval is None:
     93             # Call the "fget" function
---> 94             _cachedval = self.fget(obj)
     95             # Set the attribute in obj
     96 #            print("Setting %s in cache to %s" % (name, _cachedval))

/Users/NiklasLindeke/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.6-x86_64.egg/statsmodels/regression/linear_model.pyc in centered_tss(self)
   1159         if weights is not None:
   1160             return np.sum(weights*(model.endog - np.average(model.endog,
-> 1161                                                         weights=weights))**2)
   1162         else:  # this is probably broken for GLS
   1163             centered_endog = model.wendog - model.wendog.mean()

/Users/NiklasLindeke/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/numpy/lib/function_base.pyc in average(a, axis, weights, returned)
    522             if axis is None:
    523                 raise TypeError(
--> 524                     "Axis must be specified when shapes of a and weights "
    525                     "differ.")
    526             if wgt.ndim != 1:

TypeError: Axis must be specified when shapes of a and weights differ.

I am sorry, but I have no idea what to do from there. After this I also want to correct for auto-correlation with a Newey-West method, which I saw could be done with the following line:

mdl = Model.get_robustcov_results(cov_type='HAC',maxlags=1)

But when I try to run that with my model it returns the error:

ValueError: operands could not be broadcast together with shapes (256,766) (256,1,256) 

I gather from this that statsmodels.formula might not be compatible with get_robustcov_results, but if so, how could I correct for the auto-correlation?
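As an aside, since statsmodels 0.6 the covariance type can be passed directly to fit(), so a separate get_robustcov_results call is not needed. A minimal sketch with synthetic data (the column names follow the question; the values are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the real dataset
rng = np.random.RandomState(0)
df = pd.DataFrame({'RV1': rng.rand(100), 'RV5': rng.rand(100), 'RV22': rng.rand(100)})
df['RVFCAST'] = 0.3 * df['RV1'] + 0.7 * df['RV22'] + 0.001 * rng.randn(100)

# Request Newey-West (HAC) standard errors directly at fit time
model = smf.ols('RVFCAST ~ RV1 + RV5 + RV22', data=df).fit(
    cov_type='HAC', cov_kwds={'maxlags': 1})
print(model.summary())
```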

My most pressing issue, though, is that I cannot produce a summary for my OLS fit.

As requested, here are the first thirty rows of my dataset in df.

print df
             RV1          RV22           RV5      RVFCAST
0     0.01553801    0.01309511    0.01081393  0.008421236
1    0.008881671    0.01301336    0.01134905   0.01553801
2     0.01042178    0.01326669    0.01189979  0.008881671
3    0.009809431    0.01334593    0.01170942   0.01042178
4    0.009418737    0.01358808    0.01152253  0.009809431
5     0.01821364    0.01362502    0.01269661  0.009418737
6     0.01163536    0.01331585    0.01147541   0.01821364
7    0.009469907    0.01329509    0.01172988   0.01163536
8    0.008875018    0.01361841    0.01202432  0.009469907
9     0.01528914    0.01430873    0.01233219  0.008875018
10    0.01210761    0.01412724    0.01238776   0.01528914
11    0.01290773     0.0144439    0.01432174   0.01210761
12    0.01094212    0.01425895    0.01493865   0.01290773
13    0.01041433    0.01430177     0.0156763   0.01094212
14    0.01556703     0.0142857    0.01986616   0.01041433
15     0.0217775    0.01430253    0.01864532   0.01556703
16    0.01599228    0.01390088    0.01579069    0.0217775
17    0.01463037    0.01384096    0.01416622   0.01599228
18    0.03136361    0.01395866    0.01398807   0.01463037
19   0.009462822    0.01295695     0.0106063   0.03136361
20   0.007504367    0.01295204    0.01114677  0.009462822
21   0.007869922    0.01300863    0.01267322  0.007504367
22    0.01373964     0.0129547    0.01314553  0.007869922
23    0.01445476    0.01271198       0.01268   0.01373964
24    0.01216517    0.01249902    0.01202476   0.01445476
25     0.0151366    0.01266783     0.0129083   0.01216517
26    0.01023149    0.01258627     0.0146934    0.0151366
27    0.01141199    0.01284094    0.01490637   0.01023149
28    0.01117856    0.01321258    0.01643881   0.01141199
29    0.01658287    0.01340074    0.01597086   0.01117856
Marauding answered 22/4, 2015 at 13:31 Comment(5)
I guess there is something strange with your dataframe df that doesn't get properly converted to numpy arrays. There is also an extra dimension in the value error. Can you show a few lines of df, or even better, enough so that we can run the example? If you don't have categorical variables (strings) in your df, then you could try df.astype(float) and check whether the numbers make sense. – Richmound
BTW: since statsmodels 0.6, you can specify the cov_type directly as an argument to model.fit. – Richmound
Specifically, based on the traceback, Model.model.endog doesn't seem to be a one-dimensional numpy array, as it is supposed to be. – Richmound
Additionally, what is df.dtypes? The print statement doesn't show it. – Richmound
If I copy your data into a dataframe which has dtypes float64, then I don't have any problem with summary(). – Richmound
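To illustrate the dtype issue these comments point at: a column read from Excel that mixes numbers with empty strings ends up as object dtype, which is what breaks OLS downstream. A minimal sketch with made-up values:

```python
import pandas as pd

# A column built from xlrd cells can end up as object dtype when it
# mixes empty strings with numbers
df = pd.DataFrame({'RVFCAST': [0.0084, '', 0.0155],
                   'RV1': [0.0155, 0.0089, 0.0104]})
print(df.dtypes)            # RVFCAST is object, not float64

df = df[df.RVFCAST != ""]   # drop the blank rows first...
df = df.astype(float)       # ...then force everything to float64
print(df.dtypes)            # now both columns are float64
```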

I would like to thank user333800 for all the help!

For future reference, in case anyone comes across the same issue.

The following code:

df = pd.DataFrame({'RVFCAST': rv1fcast, 'RV1': rv1, 'RV5': rv5, 'RV22': rv22})
df = df[df.RVFCAST != ""]
df = df.astype(float)

Model = smf.ols(formula='RVFCAST ~ RV1 + RV5 + RV22', data=df).fit()
mdl = Model.get_robustcov_results(cov_type='HAC', maxlags=1)

gave me:

print(mdl.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                RVFCAST   R-squared:                       0.681
Model:                            OLS   Adj. R-squared:                  0.677
Method:                 Least Squares   F-statistic:                     120.9
Date:                Wed, 22 Apr 2015   Prob (F-statistic):           1.60e-48
Time:                        17:19:19   Log-Likelihood:                 1159.8
No. Observations:                 256   AIC:                            -2312.
Df Residuals:                     252   BIC:                            -2297.
Df Model:                           3                                         
Covariance Type:                  HAC                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.0005      0.000      2.285      0.023      7.24e-05     0.001
RV1            0.2823      0.104      2.710      0.007         0.077     0.487
RV5           -0.0486      0.193     -0.252      0.802        -0.429     0.332
RV22           0.7450      0.232      3.212      0.001         0.288     1.202
==============================================================================
Omnibus:                      174.186   Durbin-Watson:                   2.045
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2152.634
Skew:                           2.546   Prob(JB):                         0.00
Kurtosis:                      16.262   Cond. No.                     1.19e+03
==============================================================================

And I can now continue with my paper :)

Marauding answered 22/4, 2015 at 15:21 Comment(1)
As explanation: string variables are interpreted as factors (categorical variables) by the formula handling (patsy). They are converted to a dummy representation by default, and OLS then breaks at some point because it cannot handle a multivariate dependent variable in most parts. – Richmound
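A minimal sketch of that behaviour, calling patsy directly on made-up data: a string-typed dependent variable is expanded into one dummy column per distinct value, so the model ends up with a multi-column endog instead of a single column.

```python
import pandas as pd
from patsy import dmatrices

# If the dependent variable is object/string dtype, patsy treats it as
# categorical and expands it to dummy columns on the left-hand side
df = pd.DataFrame({'y': ['0.1', '0.2', '0.3'], 'x': [1.0, 2.0, 3.0]})
y, X = dmatrices('y ~ x', data=df)
print(y.shape)   # one dummy column per distinct string value, not (3, 1)
```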

I also had the same problem and found that the cause was the input data: I solved it by changing the decimal separator from ',' to '.'.
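A sketch of one way to do that conversion in pandas (the values are made up): replace the decimal comma before converting to numeric.

```python
import pandas as pd

# Values like '0,0155' parse as strings; swap the decimal comma for a
# point, then convert the column to float
s = pd.Series(['0,0155', '0,0089', '0,0104'])
s = pd.to_numeric(s.str.replace(',', '.', regex=False))
print(s.dtype)   # float64
```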

Shillelagh answered 16/1, 2019 at 16:15 Comment(0)