Python Negative Binomial Regression - Results Don't Match those from R
I'm experimenting with negative binomial regression using Python. I found this example using R, along with a data set:

http://www.karlin.mff.cuni.cz/~pesta/NMFM404/NB.html

I tried to replicate the results on the web page using this code:

import pandas as pd
import statsmodels.formula.api as smf 
import statsmodels.api as sm

df = pd.read_stata("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/nb_data.dta")

model = smf.glm(formula = "daysabs ~ math + prog", data=df, family=sm.families.NegativeBinomial()).fit()

model.summary()

Unfortunately this didn't give the same coefficients. It gave the following:

coef        std err     z       P>|z|   [95.0% Conf. Int.]
Intercept    3.4875     0.236   14.808  0.000    3.026  3.949
math        -0.0067     0.003   -2.600  0.009   -0.012 -0.002
prog        -0.6781     0.101   -6.683  0.000   -0.877 -0.479

These aren't even close to those on the website. Assuming the R code is correct, what am I doing wrong?

Community answered 16/2, 2017 at 15:0 Comment(0)

The reason for the discrepancy is that when you are reading in the dataset with Pandas, the prog variable is treated as type float by default:

df.prog.head()

0    2.0
1    2.0
2    2.0
3    2.0
4    2.0
Name: prog, dtype: float32

In the R example, on the other hand, the prog variable was explicitly cast to a factor (categorical) variable:

dat <- within(dat, {
    prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
    id <- factor(id)
})
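The pandas equivalent of R's factor cast is to convert the column to a categorical dtype before fitting. A minimal sketch using a toy stand-in for the `prog` column (the real values come from the `.dta` file above):

```python
import pandas as pd

# Toy stand-in for the prog column read from the .dta file
df = pd.DataFrame({"prog": [2.0, 2.0, 1.0, 3.0]})

# Cast the float codes to labelled categories, mirroring R's
# factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
df["prog"] = pd.Categorical(
    df["prog"], categories=[1.0, 2.0, 3.0]
).rename_categories(["General", "Academic", "Vocational"])

print(df["prog"].dtype)  # category

# Dummy-coding drops the first level as the baseline, giving n-1 columns
dummies = pd.get_dummies(df["prog"], drop_first=True)
print(list(dummies.columns))  # ['Academic', 'Vocational']
```

With the column cast this way, a formula like `daysabs ~ math + prog` would pick up the categorical encoding automatically, just as R does after the `factor()` call.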

As a result, when you look at the fit summary in R, you can see that the prog variable has been split into n-1 dummy-coded (indicator) terms:

> summary(m1 <- glm.nb(daysabs ~ math + prog, data = dat))

Call:
glm.nb(formula = daysabs ~ math + prog, data = dat, init.theta = 1.032713156, 
    link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1547  -1.0192  -0.3694   0.2285   2.5273  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     2.615265   0.197460  13.245  < 2e-16 ***
math           -0.005993   0.002505  -2.392   0.0167 *  
progAcademic   -0.440760   0.182610  -2.414   0.0158 *  
progVocational -1.278651   0.200720  -6.370 1.89e-10 ***

Compare this to how the prog variable appears in the Python fit summary you posted.

To fix the issue, you can wrap the variable in C() in the formula, which is the patsy syntax statsmodels uses to mark a variable as categorical. This way you will arrive at the same result:

model = smf.glm(formula = "daysabs ~ math + C(prog)", data=df, family=sm.families.NegativeBinomial()).fit()
model.summary()

<class 'statsmodels.iolib.summary.Summary'>
"""
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                daysabs   No. Observations:                  314
Model:                            GLM   Df Residuals:                      310
Model Family:        NegativeBinomial   Df Model:                            3
Link Function:                    log   Scale:                   1.06830885199
Method:                          IRLS   Log-Likelihood:                -865.68
Date:                Thu, 16 Feb 2017   Deviance:                       350.98
Time:                        10:34:16   Pearson chi2:                     331.
No. Iterations:                     6                                         
==================================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          2.6150      0.207     12.630      0.000       2.209       3.021
C(prog)[T.2.0]    -0.4408      0.192     -2.302      0.021      -0.816      -0.065
C(prog)[T.3.0]    -1.2786      0.210     -6.079      0.000      -1.691      -0.866
math              -0.0060      0.003     -2.281      0.023      -0.011      -0.001
==================================================================================
"""
Pilsen answered 16/2, 2017 at 15:39 Comment(4)
Aha!! V cool - thanks! – Community
The results are now close, but still not exactly the same, e.g. the intercept coefficient is 2.6150 vs. 2.61526 in R. I assume this is nothing to be worried about, but what would be the reason for the slight difference? – Community
Note that statsmodels GLM does not fix the scale at 1 for the negative binomial, so the standard errors may differ from R's: statsmodels defaults to a quasi-negative-binomial fit in GLM. I don't know if this is a bug or a feature of statsmodels: github.com/statsmodels/statsmodels/issues/2888 – Raft
If you run a negative binomial regression in R with a fixed theta (i.e. alpha in statsmodels), you get the same standard errors: round(summary(m1 <- glm(daysabs ~ math + as.factor(prog), family=negative.binomial(theta=1.033), data = dat))$coefficients[,2],3) = 0.207 0.003 0.191 0.210 – Gallager
