Fitting for Discrete Data: Negative Binomial, Poisson, and Geometric Distributions
In scipy there is no support for fitting discrete distributions to data. I know there are already many questions on this subject.

For example, if I have an array like the one below:

x = [2,3,4,5,6,7,0,1,1,0,1,8,10,9,1,1,1,0,0]

I cannot do the following for this array:

from scipy.stats import nbinom
param = nbinom.fit(x)  # AttributeError: nbinom has no 'fit' method

But I would like to ask, as of today: is there any way to fit these three discrete distributions and then choose the best fit for the discrete dataset?

Suitcase answered 12/12, 2019 at 15:57 Comment(9)
What do you mean, there is no support? What about docs.scipy.org/doc/scipy/reference/generated/…?Plight
I edited my question @PlightSuitcase
What is x supposed to mean? What did you expect nbinom.fit(x) to do? scipy.stats.nbinom has no fit method.Plight
I know there is "no fit method". I want to learn whether there is any way to fit these discrete distributions and get their parameters, or not... @PlightSuitcase
FYI: For Poisson, see #37500906Blanchblancha
Thank you, but I am trying to find fits for all discrete distributions and pick the best fit @Weckesser.Suitcase
There is no generic method to fit an arbitrary discrete distribution, as there is an infinite number of them, with potentially unlimited parameters. There are methods to fit a particular distribution, though, e.g. the Method of Moments. If you only need these three, I can show how to use it.Cushiony
@Marat, I'd like that. It would be super-helpful for me, thank you.Suitcase
@Plight I know it's not your responsibility or anything, but the decision to omit 'fit' methods from all discrete distributions seems like a pretty weak design choice. Surely it's reasonable to provide methods for the commonly-used discrete distributions. Oh well, I won't worry about it.Horotelic
You can use the Method of Moments to fit any particular distribution.

Basic idea: get the empirical first, second, etc. moments, then derive the distribution parameters from them.

So, in all these cases we only need two moments. Let's get them:

import pandas as pd
# scipy ships the PMFs we need; for other distributions, you'd implement the PMF yourself
from scipy.stats import nbinom, poisson, geom

x = [2, 3, 4, 5, 6, 7, 0, 1, 1, 0, 1, 8, 10, 9, 1, 1, 1, 0, 0]  # the data from the question
x = pd.Series(x)
mean = x.mean()
var = x.var()
likelihoods = {}  # we'll collect each distribution's likelihood here

Note: I used pandas instead of numpy because numpy's var() and std() don't apply Bessel's correction by default, while pandas' do. If you have 100+ samples there shouldn't be much difference, but on smaller samples it can matter.
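
For instance, a quick check of the defaults (the sample list here is arbitrary):

import numpy as np
import pandas as pd

sample = [2, 3, 4, 5, 6, 7, 0, 1]
print(np.var(sample))              # numpy default: population variance (ddof=0)
print(np.var(sample, ddof=1))      # Bessel-corrected sample variance
print(pd.Series(sample).var())     # pandas default is already ddof=1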

Now, let's get the parameters for these distributions. The negative binomial has two parameters, p and r. Let's estimate them and calculate the likelihood of the dataset:

# From the Wikipedia page, we have:
# mean = pr / (1-p)
# var = pr / (1-p)**2
# (without the wiki page, you could derive the moments from the MGF; too long to explain here)
# Dividing the two gives var / mean = 1 / (1-p); solving for p and r:

p = 1 - mean / var  # TODO: check for zero variance and limit p to [0, 1]
r = (1-p) * mean / p

UPD: Wikipedia and scipy use different definitions of p, one treating it as the probability of success and the other as the probability of failure. To be consistent with the scipy notation, use:

p = mean / var
r = p * mean / (1-p)

END OF UPD
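
As the TODO above suggests, the moment estimates only make sense when the sample is over-dispersed (var > mean); otherwise p falls outside (0, 1). A minimal guard might look like this (a sketch, not part of the original answer, reusing mean and var from above):

# guard against under-dispersed or constant samples, where the
# moment estimates for the negative binomial are invalid or undefined
if var > mean:
    p = mean / var           # scipy parameterization, as in the UPD above
    r = p * mean / (1 - p)
else:
    p = r = None             # the negative binomial is not a sensible fit here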

UPD2:

I'd suggest using @thilak's log-likelihood code instead. It avoids loss of precision, which is especially important on large samples.

END OF UPD2

Calculate likelihood:

likelihoods['nbinom'] = x.map(lambda val: nbinom.pmf(val, r, p)).prod()

Same for Poisson; there is only one parameter:

# from Wikipedia,
# mean = variance = lambda. Nothing to solve here
lambda_ = mean
likelihoods['poisson'] = x.map(lambda val: poisson.pmf(val, lambda_)).prod()

Same for the geometric distribution:

# mean = 1 / p  # this form matches the scipy definition, whose support starts at 1
p = 1 / mean

# note: geom.pmf(0, p) == 0 under scipy's definition, so for data containing
# zeros (like the question's) this product collapses to zero
likelihoods['geometric'] = x.map(lambda val: geom.pmf(val, p)).prod()
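
Since the sample above contains zeros, a shifted variant may be worth checking as well. This is a sketch, not part of the original answer: scipy's geom accepts loc=-1, which models the number of failures before the first success, with support {0, 1, 2, ...} and mean 1/p - 1:

# shifted geometric: support {0, 1, 2, ...}, mean = 1/p - 1
p_shifted = 1 / (mean + 1)  # method of moments for the shifted form
likelihoods['geometric_shifted'] = x.map(
    lambda val: geom.pmf(val, p_shifted, loc=-1)
).prod()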

Finally, let's get the best fit:

best_fit = max(likelihoods, key=likelihoods.get)  # key avoids shadowing the data variable x
print("Best fit:", best_fit)
print("Likelihood:", likelihoods[best_fit])

Let me know if you have any questions

Cushiony answered 12/12, 2019 at 20:45 Comment(7)
Thank you so much. I have one more question; I'd appreciate it if you could answer it. I know that my dataset is discrete, but let's say I wanted to see whether it fits the normal distribution or not. Is it possible? Is there any way to do this, like the Method of Moments?Suitcase
And also, can we apply this method to mixture models, like a binomial mixture?Suitcase
@Suitcase yes, it works for continuous distributions like the Gaussian. In some cases, however, it is hard or even impossible to estimate all parameters. For example, for a mixture of two binomials you'll need three parameters and thus three moments; it is already unpleasant to solve. It gets even worse once you add more components to the mixture.Cushiony
@Marat, how do I do it for a mixture of two binomials? Could you please show me the way?Suitcase
@Suitcase Sorry, but I will not. It already deviates from the scope of the original question, so it's probably better to post a separate question for that. Also, it is actually five parameters, not three, and as I mentioned it gets really unpleasant to solve in closed form. In practice, mixture models are usually estimated using the EM algorithm and Gaussians.Cushiony
@Marat, are you sure about the negative binomial code above? When I generate random data, it does not give the correct result for that distribution.Suitcase
@Salih, the posted code actually doesn't work for the question's data, as p > 1, right? What is the previous step to take before implementing your algorithm?Bushtit
In addition to Marat's great answer, I would most certainly recommend taking the log of the probability mass function. Some information on why log-likelihood is preferred over likelihood: https://math.stackexchange.com/questions/892832/why-we-consider-log-likelihood-instead-of-likelihood-in-gaussian-distribution

I would rewrite the code for the negative binomial to:

log_likelihoods = {}
log_likelihoods['nbinom'] = x.map(lambda val: nbinom.logpmf(val, r, p)).sum()

Note that I have used:

  • logpmf instead of pmf
  • sum instead of product
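
The same pattern carries over to the other two distributions. A sketch, reusing mean from the accepted answer and renaming the geometric parameter so it doesn't clobber the nbinom p used above:

lambda_ = mean     # Poisson rate, as in the accepted answer
p_geom = 1 / mean  # geometric parameter (scipy definition)

log_likelihoods['poisson'] = x.map(lambda val: poisson.logpmf(val, lambda_)).sum()
log_likelihoods['geometric'] = x.map(lambda val: geom.logpmf(val, p_geom)).sum()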

And to find the best distribution:

best_fit = max(log_likelihoods, key=log_likelihoods.get)
print("Best fit:", best_fit)
print("Log-likelihood:", log_likelihoods[best_fit])
Urbanity answered 12/4, 2022 at 19:22 Comment(0)
