Using predict() on statsmodels.formula data with different column names using Python and Pandas

I've got some regression results from running statsmodels.formula.api.ols. Here's a toy example:

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

example_df = pd.DataFrame(np.random.randn(10, 3))
example_df.columns = ["a", "b", "c"]
fit = smf.ols('a ~ b', example_df).fit()

I'd like to apply the model to column c, but a naive attempt to do so doesn't work:

fit.predict(example_df["c"])

Here's the exception I get:

PatsyError: Error evaluating factor: NameError: name 'b' is not defined
    a ~ b
        ^

I can do something gross and create a new, temporary DataFrame in which I rename the column of interest:

example_df2 = pd.DataFrame(example_df["c"])
example_df2.columns = ["b"]
fit.predict(example_df2)

Is there a cleaner way to do this? (short of switching to statsmodels.api instead of statsmodels.formula.api)

Saito answered 12/3, 2015 at 20:58 Comment(0)

You can use a dictionary:

>>> fit.predict({"b": example_df["c"]})
array([ 0.84770672, -0.35968269,  1.19592387, -0.77487812, -0.98805215,
        0.90584753, -0.15258093,  1.53721494, -0.26973941,  1.23996892])

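or build a NumPy array for the prediction yourself and skip the formula transformation with transform=False (here sm is statsmodels.api), although that is much more complicated if there are categorical explanatory variables:

>>> import statsmodels.api as sm
>>> fit.predict(sm.add_constant(example_df["c"].values), transform=False)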
array([ 0.84770672, -0.35968269,  1.19592387, -0.77487812, -0.98805215,
        0.90584753, -0.15258093,  1.53721494, -0.26973941,  1.23996892])
Tisman answered 12/3, 2015 at 23:2 Comment(2)
Did you find this in the documentation somewhere? My search didn't turn anything up. - Saito
I might have seen it in one of the notebook examples. In any case, patsy handles the formula and builds the design matrix, and in most, if not all, cases the data can be given as a pandas DataFrame or any other dictionary-like structure. This is not explicitly documented for each method or model. - Tisman

If you replace your fit definition with this line:

fit = smf.ols('example_df.a ~ example_df.b', example_df).fit()

It should work.

fit.predict(example_df["c"])

array([-0.52664491, -0.53174346, -0.52172484, -0.52819856, -0.5253607 ,
       -0.52391618, -0.52800043, -0.53350634, -0.52362988, -0.52520823])
Childbearing answered 12/3, 2015 at 21:53 Comment(3)
Or fit = smf.ols("example_df['a'] ~ example_df['b']", example_df).fit() if you prefer the other style of column reference. - Shipyard
With the versions of patsy and pandas I have installed, this does not give the correct results. Check against example_df["c"] * fit.params[1] + fit.params[0]. - Tisman
This doesn't seem to be working for me. fit.predict seems to be ignoring the argument. I get the same output if I use fit.predict(None). - Saito
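
One way to test the behaviour described in these comments is to compare the prediction with the in-sample fitted values. The sketch below assumes the patsy-based formula handling in statsmodels, where a name like example_df that is not found in the data passed to predict() is looked up in the environment captured when the model was built. Under that assumption, predict() would silently reuse column b of the original frame. The names env_fit, new_data and ignored_argument are hypothetical, and the result may differ across library versions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Seeded toy data (the seed is an arbitrary choice).
rng = np.random.default_rng(0)
example_df = pd.DataFrame(rng.standard_normal((10, 3)), columns=["a", "b", "c"])

# Formula that names the frame itself rather than its columns.
env_fit = smf.ols("example_df.a ~ example_df.b", example_df).fit()

# New data with no variable that the formula refers to.
new_data = example_df[["c"]]
pred = np.asarray(env_fit.predict(new_data))

# If the prediction equals the in-sample fitted values, the argument was
# not used to build the design matrix.
ignored_argument = bool(np.allclose(pred, np.asarray(env_fit.fittedvalues)))
print("predict() ignored its argument:", ignored_argument)
```

If the flag comes out true on your installation, the numbers returned by this style of formula are just the fitted values for column b, not predictions for column c.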
