Trying to get AIC (or BIC) values of categorical data - ValueError: endog has evaluated to an array with multiple columns that has shape (700, 2)

I have a df that is 700 rows x 2 columns; below is code to reproduce a smaller version of it (with 7 rows).

import pandas as pd

df = pd.DataFrame(columns=['Private', 'Elite'])
df[''] = ['Abilene Christian University', 'Center for Creative Studies', 'Florida Institute of Technology', 
          'LaGrange College', 'Muhlenberg College', 'Saint Mary-of-the-Woods College', 'Union College KY']

df = df.set_index('')
df['Private'] = 'Yes'
df['Elite'] = 'No'

# cast both columns to the pandas 'category' dtype
for x in df.columns:
    df[x] = df[x].astype('category')

df_train = df.copy(deep=True)

Both columns have dtype 'category', and the values are either Yes or No.
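
As a quick sanity check after running the reproduction code above, the dtypes confirm this:

print(df_train.dtypes)
# Private    category
# Elite      category
# dtype: object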

According to this post: Linear regression with dummy/categorical variables, the code below should work, as I am specifying C(Elite) and C(Private) as categorical variables...

from statsmodels.formula.api import ols

fit = ols('C(Private) ~ C(Elite)', data=df_train).fit()

fit.summary()

The full error is:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [213], in <cell line: 3>()
      1 from statsmodels.formula.api import ols
----> 3 fit = ols('C(Private) ~ C(Elite)', data=df_train).fit()
      5 fit.summary()

File ~/.local/lib/python3.10/site-packages/statsmodels/base/model.py:206, in Model.from_formula(cls, formula, data, subset, drop_cols, *args, **kwargs)
    203 max_endog = cls._formula_max_endog
    204 if (max_endog is not None and
    205         endog.ndim > 1 and endog.shape[1] > max_endog):
--> 206     raise ValueError('endog has evaluated to an array with multiple '
    207                      'columns that has shape {0}. This occurs when '
    208                      'the variable converted to endog is non-numeric'
    209                      ' (e.g., bool or str).'.format(endog.shape))
    210 if drop_cols is not None and len(drop_cols) > 0:
    211     cols = [x for x in exog.columns if x not in drop_cols]

ValueError: endog has evaluated to an array with multiple columns that has shape (700, 2). This occurs when the variable converted to endog is non-numeric (e.g., bool or str).

In other posts I saw solutions using pd.to_numeric(). I tried the code below, replacing No with 0 and Yes with 1 and then calling pd.to_numeric(), but I still get the same error.

from statsmodels.formula.api import ols

df_train = df_train.replace('Yes', int(1))
df_train = df_train.replace('No', int(0))

for x in df_train.columns:
    df_train[x] = pd.to_numeric(df_train[x])

fit = ols('C(Private) ~ C(Elite)', data=df_train).fit()

fit.summary()
Anzio answered 16/8, 2022 at 11:16 Comment(2)
Please add a minimal reproducible example with sample data, so that we can reproduce your error. See also How to make good reproducible pandas examples. – Anaphylaxis
@Anaphylaxis thanks, I have edited my question with a code example to reproduce a smaller version of the df I have. – Anzio

For the endogenous (dependent) variable, Private, you just need to define it as numeric yourself, without using the C() function. Wrapping a variable in C() makes patsy expand it into a multi-column design matrix: on the right-hand side that matrix contains the intercept and a dummy column, and on the left-hand side it contains one indicator column per level. You can see what the right-hand side looks like by passing it to patsy's dmatrix function:

from patsy import dmatrix

dmatrix('C(Elite)', df_train)

# DesignMatrix with shape (7, 2)
#   Intercept  C(Elite)[T.Yes]
#           1                0
#           1                0
#           1                1
#           1                1
#           1                1
#           1                0
#           1                0
#   Terms:
#     'Intercept' (column 0)
#     'C(Elite)' (column 1)

As the error message indicates, by using C(Private) on the left-hand side you are likewise creating a 2-column array, which triggers the ValueError.
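
To make that concrete, here is a minimal sketch with a small throwaway frame (the demo data is purely illustrative): patsy's dmatrices builds both sides of the formula, and a categorical left-hand side is expanded into one indicator column per level, which is exactly the multi-column endog that statsmodels rejects.

import pandas as pd
from patsy import dmatrices

# throwaway frame with raw Yes/No strings, mirroring the shape of the real data
demo = pd.DataFrame({'Private': ['Yes', 'No', 'Yes', 'No'],
                     'Elite':   ['No',  'No', 'Yes', 'Yes']})

# patsy builds both sides of the formula; a categorical outcome on the
# left-hand side is expanded into one indicator column per level
y, X = dmatrices('C(Private) ~ C(Elite)', demo)

print(y.shape)                      # (4, 2) -- the same shape problem as (700, 2)
print(y.design_info.column_names)   # ['C(Private)[No]', 'C(Private)[Yes]']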

The independent variable Elite does not need to be converted to a categorical data type, as statsmodels will automatically treat it as categorical because it contains strings. It will encode it using the Treatment coding scheme, which is just dummy coding; one of the categories is omitted if the intercept term is used.

So, you can write the full code as follows (I modified the code to create the initial column values with np.random.choice() rather than using the same values for all rows):

import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['Private','Elite'])
df[''] = ['Abilene Christian University', 'Center for Creative Studies', 'Florida Institute of Technology', 
          'LaGrange College', 'Muhlenberg College', 'Saint Mary-of-the-Woods College', 'Union College KY']

df = df.set_index('')
df['Private'] = np.random.choice(['Yes','No'], 7)
df['Elite'] = np.random.choice(['Yes','No'], 7)

# map string values to numeric values
df['Private'] = df['Private'].map({'Yes':1, 'No':0})

df_train = df.copy(deep=True)
print(df_train)

#                                  Private Elite
#
# Abilene Christian University           0   Yes
# Center for Creative Studies            0    No
# Florida Institute of Technology        0   Yes
# LaGrange College                       0    No
# Muhlenberg College                     1   Yes
# Saint Mary-of-the-Woods College        0   Yes
# Union College KY                       1   Yes

The model code:

fit = ols('Private ~ C(Elite)', data=df_train).fit()

print(fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Private   R-squared:                       0.160
Model:                            OLS   Adj. R-squared:                 -0.008
Method:                 Least Squares   F-statistic:                    0.9524
Date:                Tue, 16 Aug 2022   Prob (F-statistic):              0.374
Time:                        21:53:46   Log-Likelihood:                -3.7600
No. Observations:                   7   AIC:                             11.52
Df Residuals:                       5   BIC:                             11.41
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept       -6.895e-17      0.346  -1.99e-16      1.000      -0.890       0.890
C(Elite)[T.Yes]     0.4000      0.410      0.976      0.374      -0.654       1.454
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.367
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.817
Skew:                           0.483   Prob(JB):                        0.665
Kurtosis:                       1.633   Cond. No.                         3.51
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
/usr/local/lib/python3.7/dist-packages/statsmodels/stats/stattools.py:75: ValueWarning: omni_normtest is not valid with less than 8 observations; 7 samples were given.
  "samples were given." % int(n), ValueWarning)

The C(Elite)[T.Yes] notation indicates that the No level of the Elite column was used as the reference category.
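
If you want a different baseline, patsy's Treatment contrast accepts an explicit reference level inside the formula. A minimal sketch (here Yes is chosen as the reference instead, purely for illustration):

# pick 'Yes' as the reference level instead of the default 'No'
fit2 = ols("Private ~ C(Elite, Treatment(reference='Yes'))", data=df_train).fit()

# the dummy column is now labelled [T.No] and its coefficient flips sign
print(fit2.params)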

Anaphylaxis answered 16/8, 2022 at 21:54 Comment(0)
