Statsmodels - OLS Clustered Standard Errors (not accepting Series from DF?)

Asked 14/11, 2015 at 19:26 Answered 26/6, 2021 at 13:45

I am running an analysis that could benefit from clustering by BEA regions. I have not used the clustered standard error option in Statsmodels before, so I am unsure whether I am getting the syntax wrong or the option is broken. Any help would be greatly appreciated.

Here is the relevant section of code (note that the topline_specs dict returns Patsy-style formulas):

import statsmodels.formula.api as smf

# Capture topline specs (each entry of spec_dict is a Patsy-style formula string)
topline_specs = {'GO': spec_dict['PC_GO']['Total']['TYPE']['BOTH'],
                 'RV': spec_dict['PC_RV']['Total']['TYPE']['BOTH'],
                 'ISSUER': spec_dict['PROP']['ISSUER']['TYPE']['BOTH'],
                 'PURPOSE': spec_dict['PROP']['PURPOSE']['TYPE']['BOTH']}

# Estimate each model, clustering the standard errors on BEA region
topline_mods = {'GO': smf.ols(formula=topline_specs['GO'], data=data_d).fit(
    cov_type='cluster', cov_kwds={'groups': data_d['BEA_INT']})}

topline_mods['GO']

The traceback stems from a numpy call. It returns the following:

ValueError: The weights and list don't have the same length.

Everything I could find on the use of clustered standard errors suggested that the cov_kwds argument can take a Series from the DataFrame housing the model data. What am I missing?

Weathertight answered 14/11, 2015 at 19:26 Comment(3)

The usage of cov_type and cov_kwds looks good to me. But the formula needs to be a string, e.g. formula="GO ~ RV + ISSUER + PURPOSE". Otherwise you can use the data directly OLS(data_d['GO'], sm.add_constant(data_d[['RV', ....]])).fit(...) – Fugitive 14/11, 2015 at 20:58

Also, if you have missing values in data_d, then you need to remove them before calling OLS. AFAIR, there is no check that 'groups' has matching entries if missing values were removed from the data by the formula/data handling in ols. – Fugitive 14/11, 2015 at 21:1

The dictionary returns a string. However, your second comment was right on point. I had inadvertently created missing values. Thanks for the suggestion. If you want to put it in an answer, I will check it. – Weathertight 14/11, 2015 at 22:54

When a model is created with a formula, the missing value handling defaults to 'drop', and rows with missing observations are dropped from all data arrays given to the model (__init__). In the non-formula interface the default is currently missing='none', that is, missing values are not checked for or removed.

However, there is currently no check for, or automatic dropping of, missing values in arrays that are supplied at a later point, in this case the groups required in cov_kwds. If that array still has the original set of observations, but some rows have been dropped from the dependent and explanatory variables, then there will be a length mismatch, and it will raise the reported exception.
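The mismatch can be reproduced on a small example. This is only a sketch: the data, the column names and the variable mismatch_error are hypothetical, and the exact exception may depend on the statsmodels and numpy versions, so the error is caught rather than assumed. The second half shows the fix suggested in the comments, dropping incomplete rows before fitting.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: 12 rows in 4 clusters, two missing regressor values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "y": rng.normal(size=12),
    "x": rng.normal(size=12),
    "region": np.repeat([0, 1, 2, 3], 3),
})
df.loc[[1, 7], "x"] = np.nan

model = smf.ols("y ~ x", data=df)  # the formula interface drops 2 rows

# Full-length groups: 12 labels for the 10 observations actually used.
try:
    model.fit(cov_type="cluster", cov_kwds={"groups": df["region"]})
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)
print("mismatch error:", mismatch_error)

# Dropping incomplete rows first keeps the data and the groups aligned.
clean = df.dropna(subset=["y", "x", "region"])
res = smf.ols("y ~ x", data=clean).fit(
    cov_type="cluster", cov_kwds={"groups": clean["region"]}
)
print(res.bse)
```

Listing the columns in dropna(subset=...) avoids discarding rows because of missing values in columns the model never uses.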

I reopened https://github.com/statsmodels/statsmodels/issues/1220 because it is possible to handle missing values in the special cases where we have enough information through the pandas indices.
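Until something like that is built in, the pandas index can be used by hand. The sketch below is an assumption about how one might do it, not what the issue will implement: it relies on the fitted values of a formula-based model carrying the index of the rows that were kept, and on the group column itself having no missing values. The data and names are made up.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data with a labelled index and two missing regressor values.
rng = np.random.default_rng(1)
df = pd.DataFrame(
    {
        "y": rng.normal(size=12),
        "x": rng.normal(size=12),
        "region": np.repeat([0, 1, 2, 3], 3),
    },
    index=[f"obs{i}" for i in range(12)],
)
df.loc[["obs1", "obs7"], "x"] = np.nan

# Fit first with the default covariance; incomplete rows are dropped.
res = smf.ols("y ~ x", data=df).fit()

# The fitted values keep the index labels of the rows that were used.
kept = res.fittedvalues.index

# Select the group labels for exactly those rows, then ask for the
# cluster-robust covariance after the fact.
groups = df.loc[kept, "region"]
robust = res.get_robustcov_results(cov_type="cluster", groups=groups)
print(robust.bse)
```

Selecting by label rather than by position means this also works when the DataFrame does not have a default integer index.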

Fugitive answered 15/11, 2015 at 14:6 Comment(0)

Here is a workaround until the issue mentioned by Josef is fixed:

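import pandas as pd
from statsmodels.api import OLS

def cluster_fit(formula, data, group_var):
    # Fit with the default covariance; the formula interface drops rows with missing values
    fit = OLS.from_formula(formula, data=data).fit()
    # Positions of the rows that were kept (everything except the dropped rows)
    to_keep = pd.RangeIndex(len(data)).difference(pd.Index(fit.model.data.missing_row_idx))
    # Recompute the covariance with groups restricted to the kept rows
    robust = fit.get_robustcov_results(cov_type='cluster',
                                       groups=data.iloc[to_keep][group_var])
    return robust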

Use it as res = cluster_fit('y ~ x + z', data=mydata, group_var='uid').

Note that, for some reason, the result will be a RegressionResults rather than a RegressionResultsWrapper (not sure if this makes any difference).

Matins answered 26/6, 2021 at 13:45 Comment(1)

Thanks a lot for posting a workaround! Another approach is to run the regression on subsample = data[[<columns used in formula and robustness>]].dropna(). Less clean and with computational overhead, though. – Greco 7/9, 2023 at 13:50
