Why doesn't Statsmodels OLS support reading in columns with multiple words?

I've been experimenting with Seaborn's lmplot() and Statsmodels .ols() functions for simple linear regression plots and their associated p-values, r-squared, etc.

I've noticed that when I specify which columns to use for lmplot, I can pass a column name even if it contains multiple words:

import seaborn as sns
import pandas as pd
input_csv = pd.read_csv('./test.csv',index_col = 0,header = 0)
input_csv

[Image: CSV Plot]

sns.lmplot(x='Age',y='Count of Specific Strands',data = input_csv)
<seaborn.axisgrid.FacetGrid at 0x2800985b710>

However, when I try to use ols, I get an error for passing "Count of Specific Strands" as my dependent variable (I've only listed the last few lines of the traceback):

import statsmodels.formula.api as smf
test_results = smf.ols('Count of Specific Strands ~ Age',data = input_csv).fit()

File "<unknown>", line 1
    Count of Specific Strands
           ^
SyntaxError: invalid syntax

Conversely, if I select the "Count of Specific Strands" column by position as shown below, the regression works:

test_results = smf.ols('input_csv.iloc[:,1] ~ Age',data = input_csv).fit()
test_results.summary()

[Image: Regression Results]

Does anyone know why this is? Is it just because of how Statsmodels was written? Is there another way to specify the dependent variable for regression analysis that doesn't involve iloc or loc?

Aniseikonia answered 17/10, 2018 at 18:30 Comment(0)

This is due to the way patsy, the formula parser that Statsmodels uses, is written: see this link for more information

The authors of patsy have, however, thought of this problem: (quoted from here)

This flexibility does create problems in one case, though – because we interpret whatever you write in-between the + signs as Python code, you do in fact have to write valid Python code. And this can be tricky if your variable names have funny characters in them, like whitespace or punctuation. Fortunately, patsy has a builtin “transformation” called Q() that lets you “quote” such variables
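To see why the unquoted name fails, here is a minimal sketch that uses only Python's standard `ast` module. It is not patsy's actual parsing code; it just assumes, as the quote above says, that each term has to be valid Python. The helper `is_valid_python_expression` is a hypothetical name made up for this illustration.

```python
import ast


def is_valid_python_expression(term: str) -> bool:
    """Return True if `term` parses as a single Python expression."""
    try:
        ast.parse(term, mode="eval")
    except SyntaxError:
        return False
    return True


# The bare column name is several identifiers separated by spaces,
# which is not a valid Python expression.
raw_term = "Count of Specific Strands"

# Wrapped in Q("..."), the same name becomes a string argument
# to a function call, which Python can parse.
quoted_term = 'Q("Count of Specific Strands")'

print(is_valid_python_expression(raw_term))
print(is_valid_python_expression(quoted_term))
print(is_valid_python_expression("Age"))
```

The single-word column `Age` parses fine on its own, which is why only the multi-word column needs special handling.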

Therefore, in your case, you should be able to write:

smf.ols('Q("Count of Specific Strands") ~ Age',data = input_csv).fit()
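If you want to try this without your CSV, here is a small self-contained sketch. The data values are made up for illustration; only the column names come from the question. It also shows one possible alternative, renaming the column to a valid Python identifier before fitting (`strand_count` is a hypothetical name), which should give the same coefficients.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in for test.csv; only the column names match the question.
input_csv = pd.DataFrame({
    "Age": [20, 30, 40, 50, 60],
    "Count of Specific Strands": [5, 8, 9, 12, 13],
})

# Option 1: quote the multi-word column name with Q() inside the formula.
quoted_fit = smf.ols('Q("Count of Specific Strands") ~ Age', data=input_csv).fit()

# Option 2: rename the column to a valid Python identifier first,
# then refer to it directly in the formula.
renamed = input_csv.rename(columns={"Count of Specific Strands": "strand_count"})
renamed_fit = smf.ols("strand_count ~ Age", data=renamed).fit()

# Both fits use the same data, so the coefficients should agree.
print(quoted_fit.params)
print(renamed_fit.params)
```

Q() leaves the DataFrame untouched, which is handy if the original column names are needed later (for plot labels, say). Renaming keeps the formula and the summary output easier to read.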
Closemouthed answered 17/10, 2018 at 20:4 Comment(0)
