Fitting for Discrete Data: Negative Binomial, Poisson, and Geometric Distributions
In scipy there is no support for fitting discrete distributions to data. I know there are already many questions on this subject.

For example, if I have an array like the one below:

x = [2,3,4,5,6,7,0,1,1,0,1,8,10,9,1,1,1,0,0]

I cannot do the following for this array:

from scipy.stats import nbinom
param = nbinom.fit(x)  # AttributeError: nbinom has no 'fit' method

But I would like to ask, as of today: is there any way to fit these three discrete distributions and then choose the best fit for the discrete dataset?

Suitcase answered 12/12, 2019 at 15:57 Comment(9)
What do you mean, there is no support? What about docs.scipy.org/doc/scipy/reference/generated/…?Plight
I edited my question @PlightSuitcase
What is x supposed to mean? What did you expect nbinom.fit(x) to do? scipy.stats.nbinom has no fit method.Plight
I know there is "no fit method". I want to learn whether there is any way to fit these discrete distributions and get their parameters, or not... @PlightSuitcase
FYI: For Poisson, see #37500906Blanchblancha
Thank you, but I am trying to find fits for all discrete distributions and pick the best fit @Weckesser.Suitcase
There is no generic method to fit an arbitrary discrete distribution, as there is an infinite number of them, with potentially unlimited parameters. There are methods to fit a particular distribution, though, e.g. the Method of Moments. If you only need these three, I can show how to use it.Cushiony
@Marat, I'd like that. It would be super-helpful for me, thank you.Suitcase
@Plight I know it's not your responsibility or anything, but the decision to omit 'fit' methods from all discrete distributions seems like a pretty weak design choice. Surely it's reasonable to provide methods for the commonly-used discrete distributions. Oh well, I won't worry about it.Horotelic
You can use the Method of Moments to fit any particular distribution.

Basic idea: get the empirical first, second, etc. moments, then derive the distribution parameters from them.

So, in all these cases we only need two moments. Let's get them:

import pandas as pd
# scipy ships the PMFs we need; for other distributions, you'd implement the PMF yourself
from scipy.stats import nbinom, poisson, geom

x = [2, 3, 4, 5, 6, 7, 0, 1, 1, 0, 1, 8, 10, 9, 1, 1, 1, 0, 0]  # the data from the question
x = pd.Series(x)
mean = x.mean()
var = x.var()
likelihoods = {}  # we'll collect each distribution's likelihood here

Note: I used pandas instead of numpy because numpy's var() and std() don't apply Bessel's correction by default, while pandas' do. If you have 100+ samples there shouldn't be much difference, but on smaller samples it can matter.
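
For instance, a quick check of the defaults (the sample list here is arbitrary):

import numpy as np
import pandas as pd

sample = [2, 3, 4, 5, 6, 7, 0, 1]
print(np.var(sample))              # numpy default: population variance (ddof=0)
print(np.var(sample, ddof=1))      # Bessel-corrected sample variance
print(pd.Series(sample).var())     # pandas default is already ddof=1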

Now, let's get the parameters for these distributions. The negative binomial has two parameters, p and r. Let's estimate them and calculate the likelihood of the dataset:

# From the Wikipedia page, we have:
# mean = pr / (1-p)
# var = pr / (1-p)**2
# (without the wiki page, you could derive the moments from the MGF; too long to explain here)
# Dividing the two gives var / mean = 1 / (1-p); solving for p and r:

p = 1 - mean / var  # TODO: check for zero variance and limit p to [0, 1]
r = (1-p) * mean / p

UPD: Wikipedia and scipy use different definitions of p, one treating it as the probability of success and the other as the probability of failure. To be consistent with the scipy notation, use:

p = mean / var
r = p * mean / (1-p)

END OF UPD
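
As the TODO above suggests, the moment estimates only make sense when the sample is over-dispersed (var > mean); otherwise p falls outside (0, 1). A minimal guard might look like this (a sketch, not part of the original answer, reusing mean and var from above):

# guard against under-dispersed or constant samples, where the
# moment estimates for the negative binomial are invalid or undefined
if var > mean:
    p = mean / var           # scipy parameterization, as in the UPD above
    r = p * mean / (1 - p)
else:
    p = r = None             # the negative binomial is not a sensible fit here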

UPD2:

I'd suggest using @thilak's log-likelihood code instead. It avoids loss of precision, which is especially important on large samples.

END OF UPD2

Calculate likelihood:

likelihoods['nbinom'] = x.map(lambda val: nbinom.pmf(val, r, p)).prod()

Same for Poisson; there is only one parameter:

# from Wikipedia,
# mean = variance = lambda. Nothing to solve here
lambda_ = mean
likelihoods['poisson'] = x.map(lambda val: poisson.pmf(val, lambda_)).prod()

Same for the geometric distribution:

# mean = 1 / p  # this form matches the scipy definition, whose support starts at 1
p = 1 / mean

# note: geom.pmf(0, p) == 0 under scipy's definition, so for data containing
# zeros (like the question's) this product collapses to zero
likelihoods['geometric'] = x.map(lambda val: geom.pmf(val, p)).prod()
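
Since the sample above contains zeros, a shifted variant may be worth checking as well. This is a sketch, not part of the original answer: scipy's geom accepts loc=-1, which models the number of failures before the first success, with support {0, 1, 2, ...} and mean 1/p - 1:

# shifted geometric: support {0, 1, 2, ...}, mean = 1/p - 1
p_shifted = 1 / (mean + 1)  # method of moments for the shifted form
likelihoods['geometric_shifted'] = x.map(
    lambda val: geom.pmf(val, p_shifted, loc=-1)
).prod()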

Finally, let's get the best fit:

best_fit = max(likelihoods, key=likelihoods.get)  # key avoids shadowing the data variable x
print("Best fit:", best_fit)
print("Likelihood:", likelihoods[best_fit])

Let me know if you have any questions

Cushiony answered 12/12, 2019 at 20:45 Comment(7)
Thank you so much. I have one more question; I'd appreciate it if you could answer it. I know that my dataset is discrete, but let's say I wanted to see whether it fits the normal distribution or not. Is it possible? Is there any way to do this, like the Method of Moments?Suitcase
And also, can we apply this method to mixture models, like a binomial mixture?Suitcase
@Suitcase yes, it works for continuous distributions like the Gaussian. In some cases, however, it is hard or even impossible to estimate all parameters. For example, for a mixture of two binomials you'll need three parameters and thus three moments; it is already unpleasant to solve. It gets even worse once you add more components to the mixture.Cushiony
@Marat, how do I do it for a mixture of two binomials? Could you please show me the way?Suitcase
@Suitcase Sorry, but I will not. It already deviates from the scope of the original question, so it's probably better to post a separate question for that. Also, it is actually five parameters, not three, and as I mentioned it gets really unpleasant to solve in closed form. In practice, mixture models are usually estimated using the EM algorithm and Gaussians.Cushiony
@Marat, are you sure about the negative binomial code above? When I generate random data, it does not give the correct result for that distribution.Suitcase
@Salih, the posted code actually doesn't work for the question's data, as p > 1, right? What is the previous step to take before implementing your algorithm?Bushtit
In addition to Marat's great answer, I would most certainly recommend taking the log of the probability mass function. Some information on why log-likelihood is preferred over likelihood: https://math.stackexchange.com/questions/892832/why-we-consider-log-likelihood-instead-of-likelihood-in-gaussian-distribution

I would rewrite the code for the negative binomial to:

log_likelihoods = {}
log_likelihoods['nbinom'] = x.map(lambda val: nbinom.logpmf(val, r, p)).sum()

Note that I have used:

  • logpmf instead of pmf
  • sum instead of product
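
The same pattern carries over to the other two distributions. A sketch, reusing mean from the accepted answer and renaming the geometric parameter so it doesn't clobber the nbinom p used above:

lambda_ = mean     # Poisson rate, as in the accepted answer
p_geom = 1 / mean  # geometric parameter (scipy definition)

log_likelihoods['poisson'] = x.map(lambda val: poisson.logpmf(val, lambda_)).sum()
log_likelihoods['geometric'] = x.map(lambda val: geom.logpmf(val, p_geom)).sum()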

And to find the best distribution:

best_fit = max(log_likelihoods, key=log_likelihoods.get)
print("Best fit:", best_fit)
print("Log-likelihood:", log_likelihoods[best_fit])
Urbanity answered 12/4, 2022 at 19:22 Comment(0)
