Statsmodels Poisson glm different than R
I am trying to fit some models (Spatial interaction models) according to some code which is provided in R. I have been able to get some of the code to work using statsmodels in a python framework but some of them do not match at all. I believe that the code I have for R and Python should give identical results. Does anyone see any differences? Or is there some fundamental differences which might be throwing things off? The R code is the original code which matches the numbers given in a tutorial (Found here: http://www.bartlett.ucl.ac.uk/casa/pdf/paper181).

R sample Code:

library(mosaic)
Data = fetchData('http://dl.dropbox.com/u/8649795/AT_Austria.csv')
Model = glm(Data~Origin+Destination+Dij+offset(log(Offset)), family=poisson(link="log"), data = Data)
cor = cor(Data$Data, Model$fitted, method = "pearson", use = "complete")
rsquared = cor * cor
rsquared

R output:

> Model = glm(Data~Origin+Destination+Dij+offset(log(Offset)), family=poisson(link="log"), data = Data)
Warning messages:
1: glm.fit: fitted rates numerically 0 occurred 
2: glm.fit: fitted rates numerically 0 occurred 
> cor = cor(Data$Data, Model$fitted, method = "pearson", use = "complete")
> rsquared = cor * cor
> rsquared
[1] 0.9753279

Python Code:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from scipy.stats.stats import pearsonr

Data= pd.DataFrame(pd.read_csv('http://dl.dropbox.com/u/8649795/AT_Austria.csv'))
Model = smf.glm('Data~Origin+Destination+Dij', data=Data, offset=np.log(Data['Offset']), family=sm.families.Poisson(link=sm.families.links.log)).fit()

cor = pearsonr(Model.fittedvalues, Data["Data"])[0]

print "R-squared for doubly-constrained model is: " + str(cor*cor)

Python Output:

R-squared for doubly-constrained model is: 0.104758481123
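For context on what both listings are computing: the "R-squared" here is just the squared Pearson correlation between observed and fitted values, which can be reproduced with the standard library alone. A minimal sketch with made-up observed/fitted values (the numbers are illustrative, not from the AT_Austria data):

```python
import math

def pearson_r(x, y):
    # Plain Pearson correlation: covariance over product of std. deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

observed = [5.0, 3.0, 8.0, 2.0, 6.0]   # hypothetical Data column
fitted   = [4.8, 3.5, 7.5, 2.2, 5.9]   # hypothetical model fitted values
r = pearson_r(observed, fitted)
print(r * r)  # the pseudo-R^2 used in both listings above
```

Because this statistic only compares fitted values to observations, two packages can only agree on it if their fits agree first, which is exactly the point raised in the comments below.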
Massive answered 14/2, 2014 at 16:52 Comment(9)
I'm not sure that the R^2 makes sense for a non-linear model. You could compute something similar using the ratio of deviance explained to null deviance, I suppose. Or there are alternatives to R^2 for GLMs you could investigate.Giesecke
Thanks for the response. From what I have read, these models typically use several different measures to check their fit, and do not rely solely on R^2. That said, there must be a reason these two are so radically off. The "Dij" variable can sometimes be interpreted as log("Dij"), and in that case I have been able to fit all models of interest using the same code in both R and python+statsmodelsMassive
Well, have you checked that the fits are the same? Compare the fitted values in both software. Do some model checking. With difficult data, sometimes the algorithms might not work well and may need some coaxing etc. You seem to have gone from fitting a model to checking some dubious summary stat without checking if the actual fits are the same. For one, just look at the coefficients (and make sure you know what scales there are on), and are the fitted values in Python on the scale of the response or the link function? R's are on the scale of response.Giesecke
Indeed, I agree, it would not be helpful to jump from fitting a model to one single dubious summary stat. I know that neither the fitted values nor the model coefficients match up between the R code and Python code. When I use the exact same code but change the model formula to "Data~Origin+Destination+log(Dij)", I get matching R-squared values, matching fitted values, and matching coefficients between both frameworks. When I use "Dij" as a variable, not taking the log, nothing matches up, though I have been taking the R code as correct since it was from an example in a working paperMassive
I am therefore thinking there must be some difference in the mechanics behind the glm functions between the two frameworks, which I have searched for and not found.Massive
That suggests a problem with the fitting. Check convergence criteria in the R model, plot the fitted model to look at the fit. The output you show for R suggests some issues - it is warning that you are fitting a model where \hat{E(mu)} is predicted to be 0, i.e. the expectation of the mean of the Poisson distribution is being fitted as 0. Which suggests a possible issue with the fit because of some odd values in Dij which is being reduced if you transform to the log scale.Giesecke
you have several 1.000000e-300 in your Dij. Are they supposed to be missing values or really almost zero?Coff
Also, every tenth Data is zero. So as Gavin pointed out, there is something strange with the data. I don't know what R is doing, but the results in statsmodels look to be heavily influenced by these outliers. If I drop those observations, then I get with 'Dij' as regressor an rsquared of 0.97486 with statsmodels (using 72 observations).Coff
They are supposed to represent zero values and therefore are meant to be really almost zero. This fixed the issue... it appears that Python handles these values differently than R, so substituting in slightly larger numbers (say 1.0e-25) gives the expected model results.Massive
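The near-zero sentinels flagged in the comments are worth seeing numerically. A hedged sketch (the distance values are invented; only the 1e-300 sentinel and the 1e-25 floor come from the thread) of why a value like 1e-300 is harmless on the raw scale but catastrophic once a log link is involved:

```python
import math

# Hypothetical distance column; 1e-300 stands in for a "zero" distance,
# as in the AT_Austria Dij column discussed above.
dij = [12.3, 45.1, 1e-300, 30.2, 1e-300, 18.7]

# On the raw scale the sentinels look like small, innocent numbers...
print(min(dij))  # 1e-300

# ...but a log-link fit is driven by log-scale magnitudes, and
# log(1e-300) is about -690.8: an enormous leverage point.
print(math.log(1e-300))

# Two possible fixes matching the comments: drop those rows,
# or clip to a less extreme floor such as 1e-25 before fitting.
cleaned = [d for d in dij if d > 1e-200]    # drop the sentinel rows
clipped = [max(d, 1e-25) for d in dij]      # or floor them at 1e-25
print(len(cleaned))  # 4 rows survive the drop
```

Either route removes the extreme log-scale outliers that appear to be destabilizing the optimizer.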
It looks like GLM has convergence problems here in statsmodels. Maybe in R too, but R only gives these warnings.

Warning messages:
1: glm.fit: fitted rates numerically 0 occurred 
2: glm.fit: fitted rates numerically 0 occurred 

That could mean something like perfect separation in Logit/Probit context. I'd have to think about it for a Poisson model.

R is doing a better, if subtle, job of telling you that something may be wrong in your fitting. If you look at the fitted log-likelihood in statsmodels, for instance, it's -1.12e27. That should be a clue right there that something is off.
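To see why a log-likelihood of -1.12e27 is such a red flag, it helps to compute the Poisson log-likelihood by hand. A minimal sketch with invented counts and rates (not the actual model output): when a fitted rate collapses toward zero for an observation with a positive count, the y*log(mu) term explodes, which is exactly what R's "fitted rates numerically 0" warning is hinting at.

```python
import math

def poisson_loglike(y, mu):
    # Poisson log-likelihood: sum over observations of
    #   y*log(mu) - mu - log(y!)
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1)
               for yi, mi in zip(y, mu))

y = [3, 0, 7, 2]
healthy_rates   = [2.8, 0.5, 6.5, 2.1]    # rates near the observed counts
collapsed_rates = [1e-12, 0.5, 6.5, 2.1]  # rate ~0 where the count is 3

print(poisson_loglike(y, healthy_rates))    # modest value
print(poisson_loglike(y, collapsed_rates))  # far more negative
```

With real data and rates pushed to 1e-300 rather than 1e-12, the same mechanism produces log-likelihoods of astronomically large negative magnitude.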

Using the Poisson model directly (I always prefer maximum likelihood to GLM when possible), I can replicate the R results (but I get a convergence warning). Tellingly, again, the default Newton-Raphson solver fails, so I use bfgs.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from scipy.stats.stats import pearsonr

data= pd.DataFrame(pd.read_csv('http://dl.dropbox.com/u/8649795/AT_Austria.csv'))

mod = smf.poisson('Data~Origin+Destination+Dij', data=data, offset=np.log(data['Offset'])).fit(method='bfgs')

print mod.mle_retvals['converged']
Retinol answered 14/2, 2014 at 20:25 Comment(1)
I filed a bug report to look into exactly what's going on here in statsmodels GLM. github.com/statsmodels/statsmodels/issues/1391Retinol
