You just need to set your reference group to either male or female, depending on which comparison you're interested in.
With a small test dataset in R, the code and model summary look like this:
df <- data.frame(c(0,0,1,1,0), c("Male", "Female", "Female", "Male", "Male"))
colnames(df) <- c("Survived", "Sex")
model <- glm(Survived ~ Sex, data=df, family="binomial")
summary(model)
Output:
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.084e-16  1.414e+00   0.000    1.000
SexMale     -6.931e-01  1.871e+00  -0.371    0.711
To get something similar in Python/statsmodels:
import pandas as pd
import statsmodels.api as sm
df = pd.DataFrame({"Survived": [0, 0, 1, 1, 0],
                   "Sex": ["Male", "Female", "Female", "Male", "Male"]})
model = sm.formula.glm("Survived ~ C(Sex, Treatment(reference='Female'))",
                       family=sm.families.Binomial(), data=df).fit()
print(model.summary())
Which will give:
                                                     coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------------
Intercept                                       5.551e-16      1.414   3.93e-16      1.000      -2.772       2.772
C(Sex, Treatment(reference='Female'))[T.Male]     -0.6931      1.871     -0.371      0.711      -4.360       2.974
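As a sanity check on both tables, the two coefficients can be reproduced by hand. With a single binary predictor, the fitted log-odds for each group equal the logit of that group's observed survival rate. The sketch below assumes the same five-row toy dataset; the `logit` helper is written here for illustration and is not part of statsmodels.

```python
import math

import pandas as pd

df = pd.DataFrame({"Survived": [0, 0, 1, 1, 0],
                   "Sex": ["Male", "Female", "Female", "Male", "Male"]})

# Observed survival rate per group: Female = 1/2, Male = 1/3.
rates = df.groupby("Sex")["Survived"].mean()


def logit(p):
    """Log-odds of a probability p (illustrative helper)."""
    return math.log(p / (1 - p))


# With Female as the reference group, the intercept is the Female log-odds
# and the SexMale coefficient is the Male log-odds minus the Female log-odds.
intercept = logit(rates["Female"])
sex_male = logit(rates["Male"]) - intercept

print(intercept, sex_male)
```

The intercept comes out at essentially zero and the Male coefficient at roughly -0.6931, which is what both R and statsmodels report.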
Notice the use of Treatment() to set the reference group. I've set it to Female in this case to match the R output, but with your dataset it might make more sense to use Male. Either way, it's just a matter of being explicit about which group is used as the reference.
sm.formula.glm is not available in base Python. Please list any modules / packages that you are using in the body of your question or add the appropriate tag. – Horthy
import numpy as np
import pandas as pd
import statsmodels.api as sm
– Sparid