glm in python vs R
Asked Answered
S

1

6

I have some dataset: titanic

Doing this in R

glm(Survived ~ Sex, titanic, family = "binomial")

I get

(Intercept)     SexMale 
1.124321   -2.477825

R takes survived as positive outcome.

But when I'm doing the same in Python

sm.formula.glm("Survived ~ Sex", family=sm.families.Binomial(), data=titanic).fit()

I get negative results: i.e. Python takes not survived as positive outcome.

How can I adjust Python's glm function behavior so it will return the same result as R does?

Sparid answered 29/11, 2017 at 11:18 Comment(2)
I doubt sm.formula.glm is available in base python. Please list any modules / packages that you are using in the body of your question or add the appropriate tag.Horthy
import numpy as np import pandas as pd import statsmodels.api as smSparid
S
12

You just need to set your reference group to either male or female (depending on what you're interested in):

With a small test dataset in R, the code and model summary looks like this:

df <- data.frame(c(0,0,1,1,0), c("Male", "Female", "Female", "Male", "Male"))
colnames(df) <- c("Survived", "Sex")

model <- glm(Survived ~ Sex, data=df, family="binomial")
summary(model)

Output:

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.084e-16  1.414e+00   0.000    1.000
SexMale     -6.931e-01  1.871e+00  -0.371    0.711

To get something similar in Python/statsmodels:

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"Survived": [0,0,1,1,0],
                   "Sex": ["Male", "Female", "Female", "Male", "Male"]})

model = sm.formula.glm("Survived ~ C(Sex, Treatment(reference='Female'))",
                       family=sm.families.Binomial(), data=df).fit()
print(model.summary())

Which will give:

                                                    coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------------
Intercept                                      5.551e-16      1.414   3.93e-16      1.000      -2.772       2.772
C(Sex, Treatment(reference='Female'))[T.Male]    -0.6931      1.871     -0.371      0.711      -4.360       2.974

Notice the use of Treatment() to set the reference group. I've set it to Female in this case to match the R output, but with your dataset it might make more sense to use Male. Either way, its just an issue of being explicit about which group is used as reference.

Sensibility answered 5/12, 2017 at 23:19 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.