Pandas Dataframe AttributeError: 'DataFrame' object has no attribute 'design_info'
I am trying to use the predict() function of the statsmodels.formula.api OLS implementation. When I pass a new DataFrame to get predicted values for an out-of-sample dataset, result.predict(newdf) raises the following error: 'DataFrame' object has no attribute 'design_info'. What does this mean, and how do I fix it? The full traceback is:

    p = result.predict(newdf)
  File "C:\Python27\lib\site-packages\statsmodels\base\model.py", line 878, in predict
    exog = dmatrix(self.model.data.orig_exog.design_info.builder,
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2088, in __getattr__
    (type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'design_info'

EDIT: Here is a reproducible example. The error appears to occur when I pickle and then unpickle the result object (which I need to do in my actual project):

import cPickle
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm

df = pd.DataFrame({"A": [10,20,30,324,2353], "B": [20, 30, 10, 1, 2332], "C": [0, -30, 120, 11, 2]})

result = sm.ols(formula="A ~ B + C", data=df).fit()
print result.summary()

test1 = result.predict(df) #works

f_myfile = open('resultobject', "wb")
cPickle.dump(result, f_myfile, 2)
f_myfile.close()
print("Result Object Saved")


f_myfile = open('resultobject', "rb")
model = cPickle.load(f_myfile)

test2 = model.predict(df) #produces error
Otalgia answered 22/12, 2013 at 0:18 Comment(9)
Please edit your question and include some sample code as well as the complete stack trace. - Radiography
I've added the full traceback. I can try to add a reproducible example if no one knows why this error generally occurs. - Otalgia
I think we need a reproducible example. I don't see a reason why the formula information design_info is not there, but I don't fully understand the code path for this with the interaction with patsy. You could also open an issue with statsmodels on GitHub. It might not be very robust to keep the formula information attached to the original DataFrame. - Koal
Added a reproducible example; it seems to have something to do with pickling and unpickling the object. - Otalgia
Yes, I was thinking about that as a possible candidate. It's also possible to remove the data before pickling if we only want to predict after unpickling, which will also cause the same problem. My guess is that statsmodels doesn't have any unit tests for pickling when formulas have been used. - Koal
Is there an alternative to using a formula, or is there a way to feed in the formula once it's been unpickled? - Otalgia
"It's also possible to remove the data before pickling" Does the result object store the training dataset? How do you remove it? - Otalgia
statsmodels.sourceforge.net/devel/generated/… It is supposed to remove all full-length arrays, including the training data. The only parts that are kept are the parameters and other small attributes. This was specifically added to allow prediction after pickling a results instance without the full dataset. However, this was added and tested only for datasets that are numpy arrays and, as this question indicates, will not work correctly with formulas. - Koal
I am facing this problem too. Has there been a solution to this? - Belomancy

Pickling and unpickling of a pandas DataFrame doesn't save and restore attributes that have been attached by a user, as far as I know.

Since the formula information is currently stored together with the DataFrame of the original design matrix, this information is lost after unpickling a Results and Model instance.
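The attribute loss can be reproduced without statsmodels. The sketch below is only an illustration, assuming Python 3 and a reasonably recent pandas: it attaches an ad-hoc attribute to a small DataFrame (reusing the name design_info, with a plain string standing in for the real patsy object) and round-trips the frame through pickle.

```python
import pickle

import pandas as pd

df = pd.DataFrame({"B": [20, 30, 10], "C": [0, -30, 120]})

# Attach an ad-hoc attribute; the string is only a stand-in for the
# formula metadata that statsmodels attaches to orig_exog.
df.design_info = "stand-in for formula metadata"

# Round-trip through pickle, as in the question.
restored = pickle.loads(pickle.dumps(df, protocol=2))

data_survived = restored.equals(df)                    # column data
attribute_survived = hasattr(restored, "design_info")  # ad-hoc attribute
print(data_survived, attribute_survived)
```

If the behaviour described above holds, the data comes back intact while the attached attribute does not, which is exactly what predict() then trips over.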

If you don't use categorical variables or transformations, the correct design matrix can be built with patsy.dmatrix. I think the following should work:

import patsy

x = patsy.dmatrix("B + C", data=df)  # df is the data for prediction
test2 = model.predict(x, transform=False)

Alternatively, constructing the design matrix for the prediction directly should also work. Note that we need to explicitly add the constant that the formula adds by default:

from statsmodels.api import add_constant
test2 = model.predict(add_constant(df[["B", "C"]]), transform=False)
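To see why the constant matters, here is a minimal numpy-only sketch of what a linear prediction amounts to. It is not the statsmodels code path: the helper design_matrix is hypothetical, the fit uses np.linalg.lstsq in place of sm.ols, and the column order Intercept, B, C is an assumption about how the formula "A ~ B + C" lays out its parameters.

```python
import numpy as np

# Data from the question.
A = np.array([10, 20, 30, 324, 2353], dtype=float)
B = np.array([20, 30, 10, 1, 2332], dtype=float)
C = np.array([0, -30, 120, 11, 2], dtype=float)


def design_matrix(b, c):
    """Hypothetical helper: columns in the assumed order Intercept, B, C."""
    return np.column_stack([np.ones_like(b), b, c])


# Least-squares fit: three parameters (intercept plus two slopes).
X = design_matrix(B, C)
params, *_ = np.linalg.lstsq(X, A, rcond=None)

# Prediction is just the design matrix times the parameter vector.
predictions = design_matrix(B, C) @ params

# Leaving out the constant column gives a (5, 2) matrix that cannot be
# multiplied with three parameters.
X_no_const = np.column_stack([B, C])
try:
    X_no_const @ params
    shapes_match = True
except ValueError:
    shapes_match = False
```

The shape mismatch in the last step is the same kind of failure as a "matrices are not aligned" error: the fitted model has one more parameter than the matrix has columns.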

If the formula and design matrix contain (stateful) transformations or categorical variables, then it's not possible to conveniently construct the design matrix without the original formula information. Constructing it by hand and doing all the calculations explicitly is difficult in this case, and loses all the advantages of using formulas.

The only real solution is to pickle the formula information design_info independently of the dataframe orig_exog.

Koal answered 22/12, 2013 at 4:44 Comment(3)
I opened an issue with statsmodels: github.com/statsmodels/statsmodels/issues/1263 - Koal
Solution 1 produces the same error with the sample code. Solution 2 gives ValueError: matrices are not aligned, again with the sample code. - Otalgia
I fixed both examples. In the first, I forgot to add transform=False to avoid calling patsy; in the second, I just forgot to add the constant that patsy adds automatically. - Koal
