Fit a distribution to a histogram
Asked Answered
V

1

8

I want to know the distribution of my data points, so first I plotted the histogram of my data. My histogram looks like the following: my histogram

Second, in order to fit them to a distribution, here's the code I wrote:

size = 20000
x = scipy.arange(size)
# fit
param = scipy.stats.gamma.fit(y)
pdf_fitted = scipy.stats.gamma.pdf(x, *param[:-2], loc = param[-2], scale = param[-1]) * size
plt.plot(pdf_fitted, color = 'r')

# plot the histogram
plt.hist(y)

plt.xlim(0, 0.3)
plt.show()

The result is:

enter image description here

What am I doing wrong?

Vadavaden answered 23/3, 2015 at 10:53 Comment(0)
Z
13

Your data does not appear to be gamma-distributed, but assuming it is, you could fit it like this:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

gamma = stats.gamma
a, loc, scale = 3, 0, 2
size = 20000
y = gamma.rvs(a, loc, scale, size=size)

x = np.linspace(0, y.max(), 100)
# fit
param = gamma.fit(y, floc=0)
pdf_fitted = gamma.pdf(x, *param)
plt.plot(x, pdf_fitted, color='r')

# plot the histogram
plt.hist(y, normed=True, bins=30)

plt.show()

enter image description here

  • The area under the pdf (over the entire domain) equals 1. The area under the histogram equals 1 if you use normed=True.

  • x has length size (i.e. 20000), and pdf_fitted has the same shape as x. If we call plot and specify only the y-values, e.g. plt.plot(pdf_fitted), then values are plotted over the x-range [0, size]. That is much too large an x-range. Since the histogram is going to use an x-range of [min(y), max(y)], we much choose x to span a similar range: x = np.linspace(0, y.max()), and call plot with both the x- and y-values specified, e.g. plt.plot(x, pdf_fitted).

  • As Warren Weckesser points out in the comments, for most applications you know the gamma distribution's domain begins at 0. If that is the case, use floc=0 to hold the loc parameter to 0. Without floc=0, gamma.fit will try to find the best-fit value for the loc parameter too, which given the vagaries of data will generally not be exactly zero.

Zirconium answered 23/3, 2015 at 11:15 Comment(8)
Note that typically, the loc parameter of the gamma distribution is not used (i.e. the PDF should not be shifted), and the value is fixed at 0. By default, the fit method treats loc as fitting parameter, so you might get a small nonzero shift--check the parameters returned by fit. You can tell fit to not include loc as a fitting parameter by using the argument floc=0.Coed
@WarrenWeckesser: Thanks very much; I didn't know about floc.Zirconium
@unutbu, thanks a lot for the detailed answer! Now I understand what I was doing wrong. Indeed, my data is not gamma-distributed. Any hints what might my distribution be?Vadavaden
@po6: Sorry, I don't recognize what distribution your data might be coming from. If there is a theoretical model for the system you are measuring, then that model would provide a guess for the distribution. Or, if there is no such model, then perhaps you need more samples to "flesh out" that thin tail. If there is no model and you can't obtain more samples, then perhaps you would have to use the data itself as the definition of a discrete probability mass function.Zirconium
@Zirconium One question about this fitting using the gamma distribution: for the histogram, since you have used normed=True the area of all bars adds up to one, but what if we instead re-normalize such that the y-value of each bar is already the right probability, since all bins are same size, using weights = np.ones_like(y)/float(len(y)), that's fine so far, but how do I make my gamma fit match this new normalization?Kosciusko
I ask because, given the parameter results of the fitting, with a=param[0] the shape parameter, and theta=param[2] the scale, how can we renormalize them so that the obtained pdf is normed? (are there parameters we could include in gamma.fit() so that the resulting pdf is normed?) (to test that it is not normed currently, sum(pdf_fitted) doesn't give one) I fixed it by multiplying it by np.diff(bins)[0].Kosciusko
hello!! do you have any idea how to do the same fit, but instead of a gamma distribution, to use a custom distribution like f(x)=C*exp(a*x+b*x^2+c*x^4+d*x^8)?Intervention
Based on matplotlib version 3.5.0, we should be using plt.hist(y, density=True, bins=30) instead of plt.hist(y, normed=True, bins=30).Hydroquinone

© 2022 - 2024 — McMap. All rights reserved.