Making Probability Distribution Functions (PDFs) from histograms

Say I have several histograms, each with counts at different bin locations (on a real axis). e.g.

import numpy as np

def generate_random_histogram():

    # Random bin locations between 0 and 100
    bin_locations = np.random.rand(10,) * 100
    bin_locations.sort()

    # Random counts between 0 and 50 at those locations
    bin_counts = np.random.randint(50, size=len(bin_locations))
    return {'loc': bin_locations, 'count': bin_counts}

# We can assume that the bin size is either pre-defined or that
# the bin edges lie at the midpoints between consecutive bin locations.
hists = [generate_random_histogram() for x in xrange(3)]

How can I normalize these histograms so that I get PDFs, i.e. so that the integral of each PDF equals one over a given range (e.g. [0, 100])?

We can assume that the histogram counts events over a pre-defined bin size (e.g. 10).

Most implementations I have seen (for example, the Gaussian kernel density estimators in scipy and scikit-learn) start from the raw data. In my case, I need to do this from the histograms, since I don't have access to the original data.

Update:

Note that all current answers assume that we are looking at a random variable that lives in (-Inf, +Inf). That is fine as a rough approximation, but depending on the application the variable may be defined on some bounded range [a, b] (e.g. 0 and 100 in the above case).
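
For instance (a minimal sketch, not taken from any answer), a density estimate defined on (-Inf, +Inf), such as a scipy gaussian_kde, can be truncated and renormalized on [a, b]:

import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical stand-in data; in practice the estimate would be built from the histograms
kde = gaussian_kde(np.random.rand(1000) * 100)

a, b = 0, 100
x = np.linspace(a, b, 1000)
mass = kde.integrate_box_1d(a, b)  # probability mass the unbounded estimate puts on [a, b]
truncated_pdf = kde(x) / mass      # integrates to 1 on [a, b]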

Mullane answered 18/3, 2014 at 15:3 Comment(19)
I think this is a very good question, I'd vote, but I've run out of votes for the day. I agree, I don't think you should start out with the assumption of normality.Montfort
A KDE (gaussian or otherwise) works inherently on the data, so it won't help you in your case.Synchroscope
What will the pdf be used for? Are there any constraints on it (such as known lower or upper bounds on the data)? Do you know anything else about the distribution these data should have? In any case, I think you should think along the lines of mixture distributions. Which one you use specifically would depend on the answers to these questions.Refine
Thanks @Refine - I probably have a mixture of Gaussians, but I don't know the number of components. Does that help?Mullane
Do you expect that bin_locations describes the centers of bins? Otherwise, (if they are bin edges) you should have len(bin_counts) = len(bin_locations) - 1Synchroscope
@Synchroscope Yes, bin_locations are the center locationsMullane
There was a thread about this subject on the scikit learn mailing list a while ago. Here is a link: sourceforge.net/p/scikit-learn/mailman/message/31984328Refine
When you say "have same range of X" do you mean all the bin_locations should be the same, or do you just mean range as in the range keyword arg to np.histogram which indicates just the lower and upper bounds of the domain?Synchroscope
@askewchan: What I mean is that we can assume that all the probability is contained within a specified range (0 to 100 in the OP) for every histogram.Mullane
You've edited it slightly, but to me "PDF" is still not well defined in your question. Do you want a true function that's defined on the entire domain (like what gaussian_kde or an interpolation would return)? Or is it ok to have an array-like PDF with values at bin_locations, as long as it is properly normalized on the domain? Or maybe you want something in between, say a standardized set of evenly spaced bins that the values of all histograms are given at (like np.linspace(0,100))?Synchroscope
@Synchroscope I want a true PDF defined on a pre-specified range, in this case [0,100]. For it to be a PDF, the integral of the PDF within the range should be 1. Since we are working on the real domain, I should be able to discretize it at, say, regular intervals (e.g. using np.linspace).Mullane
If the PDF integrates to 1 on [0, 100], then it is not a mixture of Gaussians. Do you mean to imply that a support of [0, 100] is a hard constraint, or are you only trying to illustrate that it must be a true pdf for a continuous distribution with support that includes [0,100]?Refine
@Refine - You are correct, they wouldn't be a mixture of Gaussians, but if constraining the density to [a,b] is too difficult, we can relax the constraint to (-Inf,Inf)Mullane
Then you must interpolate. There are many options, and without some more information, there's no way to maximize entropy. The kde I posted earlier integrates to 1 on (-inf, inf), and the normalized array integrates to 1 on [0,100], but I assumed you found those unsatisfactory.Synchroscope
"We can assume that the histogram counts events on pre-defined bin size (e.g. 10)" We cannot if you give random bin centers. How could these bin_locations have uniform binsize: [.1, .2, 5, 50, 90]?Synchroscope
I just posted a GMM possibility. It is kind of weak because it involves resampling from the histogram and then fitting a GMM to the resampled data. Have you considered using pymc? With pymc it might be possible to do something a bit more efficient (statistically speaking). However, Bayesian inference can be a rabbit hole.Refine
@Synchroscope We can make the bin_size much smaller if that's causing a problem. Technically speaking, to make the problem well-defined, the histogram counts must correspond to intervals (otherwise it doesn't make sense to talk about PDFs). This data is fake after all, but each count must correspond to a well-defined interval.Mullane
Thanks @Refine - I am looking at it now. Yes - I tagged the question with pymc, since I thought pymc could help with it.Mullane
Oh, I hadn't noticed the tag. I'll see if I can whip up a quick pymc example.Refine

The main subtlety is defining bin_edges, since technically they could be anywhere. I chose the midpoint between each pair of bin centers (with 0 and 100 as the outer edges). There are probably other ways to do this, but here is one:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

hists = [generate_random_histogram() for x in xrange(3)]
for h in hists:
    bin_locations = h['loc']
    bin_counts = h['count']

    # Edges: 0 and 100 at the ends, midpoints between consecutive centers elsewhere
    bin_edges = np.concatenate([[0], (bin_locations[1:] + bin_locations[:-1])/2, [100]])
    bin_widths = np.diff(bin_edges)

    # Divide by the total area so the stepwise density integrates to 1 on [0, 100]
    bin_density = bin_counts.astype(float) / np.dot(bin_widths, bin_counts)
    h['density'] = bin_density

    # For the KDE, reconstruct pseudo-data by repeating each bin center count times
    # (requires integer counts)
    data = np.repeat(bin_locations, bin_counts)
    h['kde'] = gaussian_kde(data)

    plt.step(bin_locations, bin_density, where='mid', label='normalized')
    plt.plot(np.linspace(0,100), h['kde'](np.linspace(0,100)), label='kde')

Which would result in plots like the following (one for each histogram):
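
As a quick sanity check (a sketch added here, not part of the original answer), you can verify numerically that the stepwise density integrates to one on [0, 100], while the KDE leaks some mass outside that range:

# Uses h and bin_widths from the last loop iteration above
print(np.sum(h['density'] * bin_widths))   # stepwise density: integrates to 1.0 on [0, 100]
print(h['kde'].integrate_box_1d(0, 100))   # KDE mass inside [0, 100]: slightly below 1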

Synchroscope answered 18/3, 2014 at 15:45 Comment(11)
I am confused. What is this supposed to compute? When I run this, I get an array full of 0s for bin_counts.Mullane
Sorry, make your bin_counts have dtype of float first. I'll edit.Synchroscope
Thanks - What code exactly from your answer did you use to generate the bottom plot? (which looks roughly right). Was it the one using "fake" data?Mullane
All of it, run together. Be sure to run the 'kde' part before normalizing the bin_counts array (since I normalized in place); the kde weights the data using np.repeat and needs the counts to be integers. Note that the kde has finite probability outside the domain.Synchroscope
Thanks - Although it is still not clear to me. Would you mind, perhaps, re-organizing the answer to show what code snippet you run from end to end to get the plot?Mullane
Sorry. Edited --- now run all the formatted code, in order, and you should get the plot. (after you've created the hists list of random histograms, of course). This just plots the last entry in hists. Put the plt functions inside the loop if you want to plot each and every one.Synchroscope
Hmm I don't get the same result. I think you normalized the bin_counts before plt.step so that the range is roughly equivalent. How did you normalize the counts?Mullane
Oh sorry, I changed a variable name. plt.step uses bin_density nowSynchroscope
I updated your answer to reflect that change. I hope you don't mindMullane
Thanks for the accept. I think to get the full constraint on [a,b], you'll have to do a least-squares fit or something with the constraint that the integral equals one. The question then is, what functional form would you fit (here, it's a stepwise constant or a sum of Gaussians).Synchroscope
Yes - I also just ran into seaborn. I have used their violin plots, for example, which seem to estimate a PDF from histogram data. Not sure what's under the hood.Mullane

Here is one possibility. I'm not that crazy about it, but it kind of works. Note that the histograms are kind of weird, as the bin width is quite variable.

import numpy as np
from matplotlib import pyplot
from sklearn.mixture.gmm import GMM
from sklearn.grid_search import GridSearchCV

def generate_random_histogram():
    # Random bin locations between 0 and 100
    bin_locations = np.random.rand(10,) * 100
    bin_locations.sort()

    # Random counts on those locations
    bin_counts = np.random.randint(50, size=len(bin_locations))
    return {'loc': bin_locations, 'count':bin_counts}

def bin_widths(loc):
    widths = []
    for i in range(len(loc)-1):
        widths.append(loc[i+1] - loc[i])
    widths.append(widths[-1])
    widths = np.array(widths)
    return widths

def sample_from_hist(loc, count, size):
    n = len(loc)
    tot = np.sum(count)
    widths = bin_widths(loc)
    lowers = loc - widths
    uppers = loc + widths
    probs = count / float(tot)
    bins = np.argmax(np.random.multinomial(n=1, pvals=probs, size=(size,)),1)
    return np.random.uniform(lowers[bins], uppers[bins])

# Generate the histogram
hist = generate_random_histogram()

# Sample from the histogram
sample = sample_from_hist(hist['loc'],hist['count'],np.sum(hist['count']))

# Fit a GMM
param_grid = [{'n_components':[1,2,3,4,5]}]
def scorer(est, X, y=None):
    return np.sum(est.score(X))
grid_search = GridSearchCV(GMM(), param_grid, scoring=scorer).fit(np.reshape(sample,(len(sample),1)))
gmm = grid_search.best_estimator_

# Sample from the GMM
gmm_sample = gmm.sample(np.sum(hist['count']))

# Plot the resulting histograms and mixture distribution
fig = pyplot.figure()
ax1 = fig.add_subplot(311)
widths = bin_widths(hist['loc'])
ax1.bar(hist['loc'], hist['count'], width=widths)
ax2 = fig.add_subplot(312, sharex=ax1)
ax2.hist(gmm_sample, bins=hist['loc'])
x = np.arange(min(sample), max(sample), .1)
y = np.exp(gmm.score(x))
ax3 = fig.add_subplot(313, sharex=ax1)
ax3.plot(x, y)
pyplot.show()

resulting plot
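
If the density must integrate to one on [0, 100] specifically (a sketch under that assumption, not part of the original answer), the fitted GMM can be evaluated on a grid and renormalized numerically, since its Gaussian tails put some mass outside the range:

# Reuses the fitted gmm from the code above, mirroring its gmm.score usage
x_grid = np.arange(0, 100, .1)
density = np.exp(gmm.score(x_grid))        # per-point density values of the mixture
mass_in_range = np.trapz(density, x_grid)  # probability mass the GMM puts on [0, 100]
truncated_pdf = density / mass_in_range    # integrates to ~1 on [0, 100]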

Refine answered 18/3, 2014 at 17:25 Comment(1)
This is a great answer. Thanks @jcrudy. I certainly would be interested in seeing a solution with PyMC.Mullane

Here's a way to do it with pymc. This method uses a fixed number of components (n_components) in the mixture model. You could try attaching a prior to n_components and sampling over that prior, you could just pick something reasonable, or you could use the grid-search technique from my other answer to estimate the number of components. In the code below I used 10000 iterations with a burn-in of 9000, but that might not be sufficient to get good results; I would suggest using a much larger burn-in. I also chose the priors somewhat arbitrarily, so those might be something to look at. You'll have to experiment with it. Best of luck to you. Here is the code.

import numpy as np
import pymc as mc
import scipy.stats as stats
from matplotlib import pyplot

def generate_random_histogram():
    # Random bin locations between 0 and 100
    bin_locations = np.random.rand(10,) * 100
    bin_locations.sort()

    # Random counts on those locations
    bin_counts = np.random.randint(50, size=len(bin_locations))
    return {'loc': bin_locations, 'count':bin_counts}


def bin_widths(loc):
    widths = []
    for i in range(len(loc)-1):
        widths.append(loc[i+1] - loc[i])
    widths.append(widths[-1])
    widths = np.array(widths)
    return widths


def mixer(name, weights, value=None):
    n = value.shape[0]
    def logp(value, weights):
        # One-hot encode the chosen component for each data point (one 1 per row)
        vals = np.zeros(shape=(n, weights.shape[1]), dtype=int)
        vals[np.arange(n), value.astype(int)] = 1
        return mc.multinomial_like(x = vals, n=1, p=weights)

    def random(weights):
        return np.argmax(np.random.multinomial(n=1, pvals=weights[0,:], size=n), 1)

    result = mc.Stochastic(logp = logp,
                             doc = 'A kernel smoothing density node.',
                             name = name,
                             parents = {'weights': weights},
                             random = random,
                             trace = True,
                             value = value,
                             dtype = int,
                             observed = False,
                             cache_depth = 2,
                             plot = False,
                             verbose = 0)
    return result

def create_model(lowers, uppers, count, n_components):
    n = np.sum(count)
    lower = min(lowers)
    upper = max(uppers)
    locations = mc.Uniform(name='locations', lower=lower, upper=upper, value=np.random.uniform(lower, upper, size=n_components), observed=False)
    precisions = mc.Gamma(name='precisions', alpha=1, beta=1, value=.001*np.ones(n_components), observed=False)
    weights = mc.CompletedDirichlet('weights', mc.Dirichlet(name='weights_ind', theta=np.ones(n_components)))

    choice = mixer('choice', weights, value=np.ones(n).astype(int))

    @mc.observed
    def histogram_data(value=count, locations=locations, precisions=precisions, weights=weights):
        if hasattr(weights, 'value'):
            weights = weights.value

        # Sum over all mixture components (weights has shape (1, n_components))
        lower_cdfs = sum([weights[0,i]*stats.norm.cdf(lowers, loc=locations[i], scale=np.sqrt(1.0/precisions[i])) for i in range(weights.shape[1])])
        upper_cdfs = sum([weights[0,i]*stats.norm.cdf(uppers, loc=locations[i], scale=np.sqrt(1.0/precisions[i])) for i in range(weights.shape[1])])

        bin_probs = upper_cdfs - lower_cdfs
        bin_probs = np.array(list(upper_cdfs - lower_cdfs) + [1.0 - np.sum(bin_probs)])
        n = np.sum(count)
        return mc.multinomial_like(x=np.array(list(count) + [0]), n=n, p=bin_probs)

    @mc.deterministic
    def location(locations=locations, choice=choice):
        return locations[choice.astype(int)]

    @mc.deterministic
    def dispersion(precisions=precisions, choice=choice):
        return precisions[choice.astype(int)]

    data_generator = mc.Normal('data', mu=location, tau=dispersion)

    return locals()

# Generate the histogram
hist = generate_random_histogram()
loc = hist['loc']
count = hist['count']
widths = bin_widths(hist['loc'])
lowers = loc - widths
uppers = loc + widths

# Create the model
model = create_model(lowers, uppers, count, 5)

# Initialize to the MAP estimate
model = mc.MAP(model)
model.fit(method ='fmin')

# Now sample with MCMC
model = mc.MCMC(model)
model.sample(iter=10000, burn=9000, thin=300)

# Plot the mu and tau traces
mc.Matplot.plot(model.trace('locations'))
pyplot.show()

# Get the samples from the fitted pdf
sample = np.ravel(model.trace('data')[:])

# Plot the original histogram, sampled histogram, and pdf
lower = min(lowers)
upper = max(uppers)
kde = stats.gaussian_kde(sample)
x = np.arange(0,100,.1)
y = kde(x)
fig = pyplot.figure()
ax1 = fig.add_subplot(311)
pyplot.xlim(lower,upper)
ax1.bar(loc, count, width=widths)
ax2 = fig.add_subplot(312, sharex=ax1)
ax2.hist(sample, bins=loc)
ax3 = fig.add_subplot(313, sharex=ax1)
ax3.plot(x, y)
pyplot.show()

And as you can see, the two distributions don't look terribly alike. However, a histogram is not much to go on. I would play with different numbers of components and more iterations / burn-in, but it's a project in itself. Depending on your priorities, I suspect either @askewchan's or my other answer might serve you better.

resulting plot
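
As a possible follow-up (a sketch, not part of the original answer; it assumes the 'locations', 'precisions' and 'weights' traces are available after sampling), the posterior means can be turned into an explicit mixture density on a grid instead of running a KDE over the sampled data:

# Posterior-mean mixture pdf, renormalized on [0, 100] (see the question's update)
locs = model.trace('locations')[:].mean(axis=0)
precs = model.trace('precisions')[:].mean(axis=0)
w = model.trace('weights')[:].mean(axis=0).ravel()

grid = np.arange(0, 100, .1)
pdf = sum(w[i] * stats.norm.pdf(grid, loc=locs[i], scale=np.sqrt(1.0/precs[i]))
          for i in range(len(w)))
pdf /= np.trapz(pdf, grid)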

Refine answered 19/3, 2014 at 1:23 Comment(4)
This is a very interesting answer. Something I don't understand is, from a statistical standpoint, why @askewchan's answer can get away without having to estimate the number of mixture components. Is he using a non-parametric Bayesian method? How does your answer relate to his?Scherman
He certainly seems to be using a non-parametric method. This made me wonder, does PyMC include any built-ins to model non-parametric distributions?Scherman
I haven't looked at @askewchan's answer in detail, but if I understand correctly he is implicitly defining a GMM with one component for each bar in the histogram, where the weight for that component is the number of data points contributing to that bar and the variance is determined by the bandwidth of the KDE. In this answer, I simply choose the number of components in the mixture arbitrarily. In my other answer, I use a grid search to find the number of components. I think pymc now has a kernel density based stochastic node available. See https://mcmap.net/q/1315832/-solving-inverse-problems-with-pymc.Refine
From a statistical standpoint, I'm not sure how much any of these three methods can be justified. The question to ask is: how similar can we say the estimated pdf is to the pdf of the data that generated the original histogram? With these histograms, I think there is not much hope of a reliable reconstruction, simply because the bins are wide and there aren't that many data points. However, they are just examples. A histogram with narrower bins and more data would do better. The way to find out would be to do simulations with known pdfs.Refine
