I'm new to using statsmodels for statistical analyses. I'm getting the expected answers most of the time, but there are some things I don't quite understand about the way statsmodels defines endog (dependent) variables for logistic regression when they are entered as strings.
An example Pandas dataframe to illustrate the issue can be defined as shown below. The yN, yA and yA2 columns represent different ways to define an endog variable: yN is a binary variable coded 0/1; yA is a binary variable coded 'y'/'n'; and yA2 is a three-level variable coded 'y', 'w' and 'n':
import pandas as pd

df = pd.DataFrame({'yN': [0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1],
                   'yA': ['y','y','y','y','y','y','y','n','n','n','n','n','n','n','n','n','n','n','n','n'],
                   'yA2': ['y','y','y','w','y','w','y','n','n','n','n','n','n','n','n','n','n','n','n','n'],
                   'xA': ['a','a','b','b','b','c','c','c','c','c','a','a','a','a','b','b','b','b','c','c']})
The dataframe looks like:
xA yA yA2 yN
0 a y y 0
1 a y y 0
2 b y y 0
3 b y w 0
4 b y y 0
5 c y w 0
6 c y y 0
7 c n n 1
8 c n n 1
9 c n n 1
10 a n n 1
11 a n n 1
12 a n n 1
13 a n n 1
14 b n n 1
15 b n n 1
16 b n n 1
17 b n n 1
18 c n n 1
19 c n n 1
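As a quick sanity check on the data (not part of the question itself, just a tabulation I added), the counts of the 0/1 outcome within each level of xA can be seen with a crosstab:

```python
import pandas as pd

df = pd.DataFrame({'yN': [0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1],
                   'xA': ['a','a','b','b','b','c','c','c','c','c','a','a','a','a','b','b','b','b','c','c']})

# Cross-tabulate the categorical predictor against the binary outcome
counts = pd.crosstab(df['xA'], df['yN'])
print(counts)
# yN  0  1
# xA
# a   2  4
# b   3  4
# c   2  5
```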
I can run a 'standard' logistic regression using a 0/1 encoded endog variable and a categorical exog variable (xA) as follows:
import statsmodels.formula.api as smf
import statsmodels.api as sm
phjLogisticRegressionResults = smf.glm(formula='yN ~ C(xA)',
                                       data=df,
                                       family=sm.families.Binomial(link=sm.genmod.families.links.logit)).fit()
print('\nResults of logistic regression model')
print(phjLogisticRegressionResults.summary())
This produces the following results, which are exactly as I'd expect:
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: yN No. Observations: 20
Model: GLM Df Residuals: 17
Model Family: Binomial Df Model: 2
Link Function: logit Scale: 1.0
Method: IRLS Log-Likelihood: -12.787
Date: Thu, 18 Jan 2018 Deviance: 25.575
Time: 02:19:45 Pearson chi2: 20.0
No. Iterations: 4
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.6931 0.866 0.800 0.423 -1.004 2.391
C(xA)[T.b] -0.4055 1.155 -0.351 0.725 -2.669 1.858
C(xA)[T.c] 0.2231 1.204 0.185 0.853 -2.137 2.583
==============================================================================
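For what it's worth, these coefficients can be reproduced by hand from the raw counts, since with a single categorical predictor each coefficient is just a difference of log-odds (a check I added, assuming the data above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'yN': [0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1],
                   'xA': ['a','a','b','b','b','c','c','c','c','c','a','a','a','a','b','b','b','b','c','c']})

counts = pd.crosstab(df['xA'], df['yN'])

# Log-odds of yN == 1 within each level of xA
log_odds = np.log(counts[1] / counts[0])

print(round(log_odds['a'], 4))                  # 0.6931  -> Intercept
print(round(log_odds['b'] - log_odds['a'], 4))  # -0.4055 -> C(xA)[T.b]
print(round(log_odds['c'] - log_odds['a'], 4))  # 0.2231  -> C(xA)[T.c]
```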
However, if I run the same model using the binary endog variable coded 'y' and 'n' (which, note, maps to the opposite of the intuitive 0/1 coding in the previous example), or if I use the variable where some of the 'y' codes have been replaced by 'w' (i.e. there are now three outcome levels), it still produces exactly the same results:
phjLogisticRegressionResults = smf.glm(formula='yA ~ C(xA)',
                                       data=df,
                                       family=sm.families.Binomial(link=sm.genmod.families.links.logit)).fit()
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: ['yA[n]', 'yA[y]'] No. Observations: 20
Model: GLM Df Residuals: 17
Model Family: Binomial Df Model: 2
Link Function: logit Scale: 1.0
Method: IRLS Log-Likelihood: -12.787
Date: Thu, 18 Jan 2018 Deviance: 25.575
Time: 02:29:06 Pearson chi2: 20.0
No. Iterations: 4
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.6931 0.866 0.800 0.423 -1.004 2.391
C(xA)[T.b] -0.4055 1.155 -0.351 0.725 -2.669 1.858
C(xA)[T.c] 0.2231 1.204 0.185 0.853 -2.137 2.583
==============================================================================
...and...
phjLogisticRegressionResults = smf.glm(formula='yA2 ~ C(xA)',
                                       data=df,
                                       family=sm.families.Binomial(link=sm.genmod.families.links.logit)).fit()
Generalized Linear Model Regression Results
==========================================================================================
Dep. Variable: ['yA2[n]', 'yA2[w]', 'yA2[y]'] No. Observations: 20
Model: GLM Df Residuals: 17
Model Family: Binomial Df Model: 2
Link Function: logit Scale: 1.0
Method: IRLS Log-Likelihood: -12.787
Date: Thu, 18 Jan 2018 Deviance: 25.575
Time: 02:29:06 Pearson chi2: 20.0
No. Iterations: 4
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.6931 0.866 0.800 0.423 -1.004 2.391
C(xA)[T.b] -0.4055 1.155 -0.351 0.725 -2.669 1.858
C(xA)[T.c] 0.2231 1.204 0.185 0.853 -2.137 2.583
==============================================================================
The Dep. Variable cell in the output table shows that statsmodels recognises the differences between the endog variables, yet the results are identical in all three cases. What rule is statsmodels using to code the endog variable when it is entered as a string?
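One way to see part of what is happening (a diagnostic I tried, assuming patsy, the library that the statsmodels formula interface uses to parse formulas) is to build the design matrices directly and inspect the left-hand side:

```python
import pandas as pd
import patsy

df = pd.DataFrame({'yA': ['y','y','y','y','y','y','y','n','n','n','n','n','n','n','n','n','n','n','n','n'],
                   'xA': ['a','a','b','b','b','c','c','c','c','c','a','a','a','a','b','b','b','b','c','c']})

# patsy expands a string response into one indicator column per level
y, X = patsy.dmatrices('yA ~ C(xA)', data=df, return_type='dataframe')
print(y.columns.tolist())  # ['yA[n]', 'yA[y]']
print(y.head(3))
```

This mirrors the ['yA[n]', 'yA[y]'] label shown in the Dep. Variable cell of the summary output above, so the string endog is being turned into indicator columns before the model is fitted.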