Fitting a Poisson distribution to data in statsmodels
Asked Answered
A

1

13

I am trying to fit a Poisson distribution to my data using statsmodels but I am confused by the results that I am getting and how to use the library.

My real data will be a series of numbers that I think that I should be able to describe as having a poisson distribution plus some outliers so eventually I would like to do a robust fit to the data.

However for testing purposes, I just create a dataset using scipy.stats.poisson

samp = scipy.stats.poisson.rvs(4,size=200)

So to fit this using statsmodels I think that I just need to have a constant 'endog'

res = sm.Poisson(samp,np.ones_like(samp)).fit()

print res.summary()

                          Poisson Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  200
Model:                        Poisson   Df Residuals:                      199
Method:                           MLE   Df Model:                            0
Date:                Fri, 27 Jun 2014   Pseudo R-squ.:                   0.000
Time:                        14:28:29   Log-Likelihood:                -404.37
converged:                       True   LL-Null:                       -404.37
                                        LLR p-value:                       nan
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          1.3938      0.035     39.569      0.000         1.325     1.463
==============================================================================

Ok, that doesn't look right, But if I do

res.predict()

I get an array of 4.03 (which was the mean for this test sample). So basically, firstly I very confused how to interpret this result from statsmodel and secondly I should probably being doing something completely different if I'm interested in robust parameter estimation of a distribution rather than fitting trends but how should I go about doing that?

Edit I should really have given more detail in order to answer the second part of my question.

I have an event that occurs a random time after a starting time. When I plot a histogram of the delay times for many events, I see that the distribution looks like a scaled Poisson distribution plus several outlier points which are normally caused by issues in my underlying system. So I simply wanted to find the expected time delay for the dataset, excluding the outliers. If not for the outliers, I could simply find the mean time. I suppose that I could exclude them manually but I thought that I could find something more exacting.

Edit On further reflection, I will be considering other distributions instead of sticking with a Poissonion and the details of my issue are probably a distraction from the original question but I've left them here anyway.

Ant answered 27/6, 2014 at 13:6 Comment(3)
What do you mean by "robust"? robust to outliers, robust to misspecification, robust to numerical problems, ...?Dannadannel
I meant robust to outliersAnt
I added some comments about outlier robust estimation to my reply. I started to look into it a while ago, but there is still a long way to go until we have it available in statsmodels.Dannadannel
D
9

The Poisson model, as most other models in generalized linear model families or for other discrete data, assumes that we have a transformation that bounds the prediction in the appropriate range.

Poisson works for nonnegative numbers and the transformation is exp, so the model that is estimated assumes that the expected value of an observation, conditional on the explanatory variables is

 E(y | x) = exp(X dot params)

To get the lambda parameter of the poisson distribution, we need to use exp, i.e.

>>> np.exp(1.3938)
4.0301355071650118

predict does this by default, but you can request just the linear part (X dot params) with a keyword argument.

BTW: statsmodels' controversial terminology endog is y exog is x (has x in it) (http://statsmodels.sourceforge.net/devel/endog_exog.html )

Outlier Robust Estimation

The answer to the last part of the question is that there is currently no outlier robust estimation in Python for Poisson or other count models, as far as I know.

For overdispersed data, where the variance is larger than the mean, we can use NegativeBinomial Regression. For outliers in Poisson we would have to use R/Rpy or do manual trimming of outliers. Outlier identification could be based on one of the standardized residuals.

It will not be available in statsmodels for some time, unless someone is contributing this.

Dannadannel answered 27/6, 2014 at 13:32 Comment(3)
Thank you, so this answers my first question. Clearly I would need to do a lot more statistics before I become comfortable with statsmodels nomenclature.Ant
endog/exog is just something to memorize (with the help that exog is ex). I refused so far to introduce one letter names into the statsmodels code.Dannadannel
As a new user, the meaning of endog/exog didn't confuse me so much as how to specify exog and I was confused between performing a regression vs parameter fitting a distribution. Many thanks for your answers and indeed for statsmodels.Ant

© 2022 - 2024 — McMap. All rights reserved.