Numbers as variable names not recognized by statsmodels.formula.api

Consider the following example:

import pandas as pd
from pandas import DataFrame
import statsmodels.formula.api as smf
df = DataFrame({'a': [1,2,3], 'b': [2,3,4]})
df2 = DataFrame({'177sdays': [1,2,3], 'b': [2,3,4]})

Then run smf.ols('a ~ b', df) and smf.ols('177sdays ~ b', df2).

The first works and the second does not. The only difference seems to be the presence of numerical characters in the variable name. Why is this?

Impeachable answered 23/11, 2016 at 1:25 Comment(4)
In particular, it raises the error "invalid syntax"! - Impeachable
... valid Python names cannot begin with a digit. Perhaps there is an eval under the hood in statsmodels. Try prefixing the name with an underscore. - Candleberry
Q can "quote" arbitrary variable names: patsy.readthedocs.io/en/latest/… - Housebreak
@Housebreak What if there is a variable named Q that conflicts with the Q function? - Heterotaxis

Apparently, statsmodels uses a library called patsy to interpret the formulas passed to ols. From the docs, an expression of the form:

y ~ a + a:b + np.log(x)

will construct a patsy object of the form:

ModelDesc([Term([EvalFactor("y")])],
      [Term([]),
       Term([EvalFactor("a")]),
       Term([EvalFactor("a"), EvalFactor("b")]),
       Term([EvalFactor("np.log(x)")])])

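EvalFactor then "executes arbitrary Python code." Thus your variable names must be valid Python identifiers: they may contain the uppercase and lowercase letters A through Z, the underscore _ and, except as the first character, the digits 0 through 9 (Python 3 additionally accepts many non-ASCII letters). The name 177sdays begins with a digit, so it is not a valid identifier, and evaluating it fails with a syntax error.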
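To see in advance which column names would trip this up, one option is to test them with Python's own identifier rules. The sketch below uses only the standard library; is_formula_safe and safe_names are hypothetical names chosen for illustration, and the extra keyword check is an assumption on my part (a reserved word such as class is not usable as a bare name in evaluated Python code either).

```python
import keyword


def is_formula_safe(name: str) -> bool:
    """Return True if `name` is a valid Python identifier and not a keyword.

    Hypothetical helper: it only mirrors Python's identifier rules and does
    not call patsy or statsmodels.
    """
    return name.isidentifier() and not keyword.iskeyword(name)


columns = ["a", "b", "177sdays"]

for col in columns:
    print(col, is_formula_safe(col))

# One possible workaround, as suggested in the comments: prefix the
# offending names with an underscore so they become valid identifiers.
safe_names = {c: (c if is_formula_safe(c) else "_" + c) for c in columns}
print(safe_names)
```

With pandas, a mapping like safe_names could then be passed to df2.rename(columns=safe_names) before building the formula.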

Candleberry answered 23/11, 2016 at 1:49 Comment(1)
This was super helpful. Otherwise it's a "gotcha" with an utterly vague error message. Thanks! - Gressorial

As @Josef stated, one can use patsy's Q to quote the variable name:

smf.ols('Q("177sdays") ~ b', df2).fit()
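If the column names are not known ahead of time, the quoting can be applied automatically when the formula string is assembled. This is only a sketch: quote_name and build_formula are hypothetical helpers, not part of patsy or statsmodels, and the escaping of quotes and backslashes assumes the quoted name is parsed as an ordinary Python string literal.

```python
import keyword


def quote_name(name: str) -> str:
    """Wrap `name` in Q("...") unless it is already a plain identifier."""
    if name.isidentifier() and not keyword.iskeyword(name):
        return name
    # Escape backslashes and double quotes so the result stays a valid
    # Python string literal inside the formula.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return 'Q("' + escaped + '")'


def build_formula(response: str, predictors: list) -> str:
    """Assemble a 'y ~ x1 + x2' style formula with quoting where needed."""
    right_side = " + ".join(quote_name(p) for p in predictors)
    return quote_name(response) + " ~ " + right_side


formula = build_formula("177sdays", ["b"])
print(formula)  # Q("177sdays") ~ b
```

The resulting string could then be passed to smf.ols(formula, df2) in the same way as the hand-written version above.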
Girandole answered 2/9, 2019 at 15:45 Comment(0)
