I'm new to using statsmodels for statistical analyses. I'm getting the expected answers most of the time, but there are some things I don't quite understand about the way statsmodels defines endog (dependent) variables for logistic regression when they are entered as strings.
An example Pandas dataframe to illustrate the issue can be defined as shown below. The yN, yA and yA2 columns represent different ways to define an endog variable: yN is a binary variable coded 0/1; yA is a binary variable coded 'y'/'n'; and yA2 is a three-level variable coded 'y', 'w' and 'n':
import pandas as pd

df = pd.DataFrame({'yN': [0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1],
                   'yA': ['y','y','y','y','y','y','y','n','n','n','n','n','n','n','n','n','n','n','n','n'],
                   'yA2': ['y','y','y','w','y','w','y','n','n','n','n','n','n','n','n','n','n','n','n','n'],
                   'xA': ['a','a','b','b','b','c','c','c','c','c','a','a','a','a','b','b','b','b','c','c']})
The dataframe looks like:
xA yA yA2 yN
0 a y y 0
1 a y y 0
2 b y y 0
3 b y w 0
4 b y y 0
5 c y w 0
6 c y y 0
7 c n n 1
8 c n n 1
9 c n n 1
10 a n n 1
11 a n n 1
12 a n n 1
13 a n n 1
14 b n n 1
15 b n n 1
16 b n n 1
17 b n n 1
18 c n n 1
19 c n n 1
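As a quick sanity check on the data (not part of the question itself, just a tabulation I added), the counts of the 0/1 outcome within each level of xA can be seen with a crosstab:

```python
import pandas as pd

df = pd.DataFrame({'yN': [0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1],
                   'xA': ['a','a','b','b','b','c','c','c','c','c','a','a','a','a','b','b','b','b','c','c']})

# Cross-tabulate the categorical predictor against the binary outcome
counts = pd.crosstab(df['xA'], df['yN'])
print(counts)
# yN  0  1
# xA
# a   2  4
# b   3  4
# c   2  5
```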
I can run a 'standard' logistic regression using a 0/1 encoded endog variable and a categorical exog variable (xA) as follows:
import statsmodels.formula.api as smf
import statsmodels.api as sm
phjLogisticRegressionResults = smf.glm(formula='yN ~ C(xA)',
                                       data=df,
                                       family=sm.families.Binomial(link=sm.genmod.families.links.logit)).fit()
print('\nResults of logistic regression model')
print(phjLogisticRegressionResults.summary())
This produces the following results, which are exactly as I'd expect:
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: yN No. Observations: 20
Model: GLM Df Residuals: 17
Model Family: Binomial Df Model: 2
Link Function: logit Scale: 1.0
Method: IRLS Log-Likelihood: -12.787
Date: Thu, 18 Jan 2018 Deviance: 25.575
Time: 02:19:45 Pearson chi2: 20.0
No. Iterations: 4
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.6931 0.866 0.800 0.423 -1.004 2.391
C(xA)[T.b] -0.4055 1.155 -0.351 0.725 -2.669 1.858
C(xA)[T.c] 0.2231 1.204 0.185 0.853 -2.137 2.583
==============================================================================
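For what it's worth, these coefficients can be reproduced by hand from the raw counts, since with a single categorical predictor each coefficient is just a difference of log-odds (a check I added, assuming the data above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'yN': [0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1],
                   'xA': ['a','a','b','b','b','c','c','c','c','c','a','a','a','a','b','b','b','b','c','c']})

counts = pd.crosstab(df['xA'], df['yN'])

# Log-odds of yN == 1 within each level of xA
log_odds = np.log(counts[1] / counts[0])

print(round(log_odds['a'], 4))                  # 0.6931  -> Intercept
print(round(log_odds['b'] - log_odds['a'], 4))  # -0.4055 -> C(xA)[T.b]
print(round(log_odds['c'] - log_odds['a'], 4))  # 0.2231  -> C(xA)[T.c]
```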
However, if I run the same model using the binary endog variable coded 'y' and 'n' (which, note, maps to the opposite of the intuitive 0/1 coding in the previous example), or if I use the variable where some of the 'y' codes have been replaced by 'w' (i.e. there are now three outcome levels), it still produces exactly the same results:
phjLogisticRegressionResults = smf.glm(formula='yA ~ C(xA)',
                                       data=df,
                                       family=sm.families.Binomial(link=sm.genmod.families.links.logit)).fit()
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: ['yA[n]', 'yA[y]'] No. Observations: 20
Model: GLM Df Residuals: 17
Model Family: Binomial Df Model: 2
Link Function: logit Scale: 1.0
Method: IRLS Log-Likelihood: -12.787
Date: Thu, 18 Jan 2018 Deviance: 25.575
Time: 02:29:06 Pearson chi2: 20.0
No. Iterations: 4
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.6931 0.866 0.800 0.423 -1.004 2.391
C(xA)[T.b] -0.4055 1.155 -0.351 0.725 -2.669 1.858
C(xA)[T.c] 0.2231 1.204 0.185 0.853 -2.137 2.583
==============================================================================
...and...
phjLogisticRegressionResults = smf.glm(formula='yA2 ~ C(xA)',
                                       data=df,
                                       family=sm.families.Binomial(link=sm.genmod.families.links.logit)).fit()
Generalized Linear Model Regression Results
==========================================================================================
Dep. Variable: ['yA2[n]', 'yA2[w]', 'yA2[y]'] No. Observations: 20
Model: GLM Df Residuals: 17
Model Family: Binomial Df Model: 2
Link Function: logit Scale: 1.0
Method: IRLS Log-Likelihood: -12.787
Date: Thu, 18 Jan 2018 Deviance: 25.575
Time: 02:29:06 Pearson chi2: 20.0
No. Iterations: 4
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.6931 0.866 0.800 0.423 -1.004 2.391
C(xA)[T.b] -0.4055 1.155 -0.351 0.725 -2.669 1.858
C(xA)[T.c] 0.2231 1.204 0.185 0.853 -2.137 2.583
==============================================================================
The Dep. Variable cell in the output table shows that statsmodels recognises the differences between the endog variables, yet the results are identical in all three cases. What rule is statsmodels using to code the endog variable when it is entered as a string?
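One way to see part of what is happening (a diagnostic I tried, assuming patsy, the library that the statsmodels formula interface uses to parse formulas) is to build the design matrices directly and inspect the left-hand side:

```python
import pandas as pd
import patsy

df = pd.DataFrame({'yA': ['y','y','y','y','y','y','y','n','n','n','n','n','n','n','n','n','n','n','n','n'],
                   'xA': ['a','a','b','b','b','c','c','c','c','c','a','a','a','a','b','b','b','b','c','c']})

# patsy expands a string response into one indicator column per level
y, X = patsy.dmatrices('yA ~ C(xA)', data=df, return_type='dataframe')
print(y.columns.tolist())  # ['yA[n]', 'yA[y]']
print(y.head(3))
```

This mirrors the ['yA[n]', 'yA[y]'] label shown in the Dep. Variable cell of the summary output above, so the string endog is being turned into indicator columns before the model is fitted.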