I have a DataFrame with 700 rows and 2 columns. Below is code to reproduce a smaller version of it (with 7 rows).
import pandas as pd
df = pd.DataFrame(columns=['Private','Elite'])
df[''] = ['Abilene Christian University', 'Center for Creative Studies', 'Florida Institute of Technology',
'LaGrange College', 'Muhlenberg College', 'Saint Mary-of-the-Woods College', 'Union College KY']
df = df.set_index('')
df['Private'] = 'Yes'
df['Elite'] = 'No'
for x in df.columns:
    df[x] = df[x].astype('category')
df_train = df.copy(deep=True)
Both columns are categorical (dtype = 'category'), with values of either Yes or No.
According to the post "Linear regression with dummy/categorical variables", the code below should work, since I am marking both variables as categorical with C(Private) and C(Elite):
from statsmodels.formula.api import ols
fit = ols('C(Private) ~ C(Elite)', data=df_train).fit()
fit.summary()
The full error is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [213], in <cell line: 3>()
1 from statsmodels.formula.api import ols
----> 3 fit = ols('C(Private) ~ C(Elite)', data=df_train).fit()
5 fit.summary()
File ~/.local/lib/python3.10/site-packages/statsmodels/base/model.py:206, in Model.from_formula(cls, formula, data, subset, drop_cols, *args, **kwargs)
203 max_endog = cls._formula_max_endog
204 if (max_endog is not None and
205 endog.ndim > 1 and endog.shape[1] > max_endog):
--> 206 raise ValueError('endog has evaluated to an array with multiple '
207 'columns that has shape {0}. This occurs when '
208 'the variable converted to endog is non-numeric'
209 ' (e.g., bool or str).'.format(endog.shape))
210 if drop_cols is not None and len(drop_cols) > 0:
211 cols = [x for x in exog.columns if x not in drop_cols]
ValueError: endog has evaluated to an array with multiple columns that has shape (700, 2). This occurs when the variable converted to endog is non-numeric (e.g., bool or str).
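The shape (700, 2) in the message suggests that the categorical response was expanded into one column per level. The sketch below is only an illustration of that effect: it uses a hypothetical toy frame (not my real data) and pd.get_dummies as a stand-in for whatever coding the formula machinery applies internally, then shows a single 0/1 column for comparison.

```python
import pandas as pd

# Hypothetical toy data with both levels present (not the original 700-row frame).
toy = pd.DataFrame({
    "Private": ["Yes", "No", "Yes", "Yes"],
    "Elite": ["No", "No", "Yes", "No"],
}).astype("category")

# Stand-in for the dummy coding: a two-level categorical becomes two columns.
endog_like = pd.get_dummies(toy["Private"])
print(endog_like.shape)

# A single numeric column instead: 1 where Private is "Yes", 0 otherwise.
private_num = (toy["Private"] == "Yes").astype(int)
print(private_num.tolist())
```

A two-column response like endog_like is what the error complains about, while private_num stays one-dimensional.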
In other posts, I saw some solutions using pd.to_numeric(). I tried the code below, mapping No to 0 and Yes to 1 and then applying pd.to_numeric(), but I still get the same error.
from statsmodels.formula.api import ols
df_train = df_train.replace('Yes', int(1))
df_train = df_train.replace('No', int(0))
for x in df_train.columns:
    df_train[x] = pd.to_numeric(df_train[x])
fit = ols('C(Private) ~ C(Elite)', data=df_train).fit()
fit.summary()
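One possible explanation for why the numeric conversion makes no difference is that the formula still wraps the response in C(), which treats it as categorical again. The following is a minimal sketch under that assumption, using a hypothetical toy frame and a hypothetical helper column Private_num; it leaves C() off the left-hand side so the response stays a single numeric column.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical toy data (not the original frame), with both levels in each column.
toy = pd.DataFrame({
    "Private": ["Yes", "No", "Yes", "Yes", "No", "Yes"],
    "Elite": ["No", "No", "Yes", "No", "No", "Yes"],
})

# Hypothetical helper column: 1 for "Yes", 0 for "No".
toy["Private_num"] = (toy["Private"] == "Yes").astype(int)

# No C() around the response; only the predictor is marked as categorical.
fit = ols("Private_num ~ C(Elite)", data=toy).fit()
print(fit.params)
```

With this formula the response is one-dimensional, so the check that raises the ValueError should not be triggered; whether the same holds on the full data is something I have not verified.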