Generating random numbers from arbitrary probability density function

Asked 15/1, 2017 at 4:3 Answered 14/3 at 19:46

I would like to be able to generate random numbers with a probability density function that comes from a drawn curve. These two below have the same area under the curve but should produce lists of random numbers with different characteristics.

My intuition is that one way would be to do it is to sample the curve, and then use the areas of those rectangles to feed an np.random.choice to pick a range to do an ordinary random in the range of that rectangle's range.

This doesn't feel like a very efficient way to do it. Is there a more 'correct' way to do it?

I had a crack at actually doing it:

import matplotlib.pyplot as plt
import numpy as np

areas = [4.397498, 4.417111, 4.538467, 4.735034, 4.990129, 5.292455, 5.633938,
         6.008574, 6.41175, 5.888393, 2.861898, 2.347887, 2.459234, 2.494357,
         2.502986, 2.511614, 2.520243, 2.528872, 2.537501, 2.546129, 7.223747,
         7.223747, 2.448148, 1.978746, 1.750221, 1.659351, 1.669999]
divisons = [0.0, 0.037037, 0.074074, 0.111111, 0.148148, 0.185185, 0.222222,
            0.259259, 0.296296, 0.333333, 0.37037, 0.407407, 0.444444, 0.481481,
            0.518519, 0.555556, 0.592593, 0.62963, 0.666667, 0.703704, 0.740741,
            0.777778, 0.814815, 0.851852, 0.888889, 0.925926, 0.962963, 1.0]
weights = [a/sum(areas) for a in areas]
indexes = np.random.choice(range(len(areas)), 50000, p=weights)
samples = []
for i in indexes:
    samples.append(np.random.uniform(divisons[i], divisons[i+1]))

binwidth = 0.02
binSize = np.arange(min(samples), max(samples) + binwidth, binwidth)
plt.hist(samples, bins=binSize)
plt.xlim(xmax=1)
plt.show()

The method seems to work, but is a bit heavy!

Oleneolenka answered 15/1, 2017 at 4:3 Comment(5)

Are you saying you just have an image file with that curve? Or do you actually have numbers representing the coordinates of points on the curve? – Quinnquinol 15/1, 2017 at 4:50

It could be either. It could be an image file, but more likely a drawn curve. Either svg or some kind of inking thing on a touch screen. – Oleneolenka 15/1, 2017 at 5:1

An SVG is an image file. If it's drawn on a screen, then how is your program accessing it? I'm asking what the data format is that your program will be using, not how the thing is created.' – Quinnquinol 15/1, 2017 at 5:3

It's quite hypothetical at the moment. I'm prototyping it in a CAD program, but it could end up anywhere. I assumed you meant a bitmap, one can access the coords in an SVG curve. (Eventually!) – Oleneolenka 15/1, 2017 at 5:10

Mathematically speaking, what I would do is integrate the PDF to get the cumulative distribution function. If you then invert that, you get a function into which you can plug a random number in [0, 1] and effectively get a value from the original distribution. How you actually do that depends on the format of your data. – Quinnquinol 15/1, 2017 at 5:40

One way to do it is to use rv_continuous from scipy.stats. The straightforward way to begin would be to approximate one of those pdf's with a a collection of splines with rv_continuous. In fact, you can generate pseudorandom deviates by defining either a pdf or a cdf with this thing.

Mohammedmohammedan answered 15/1, 2017 at 4:45 Comment(0)

For your case, it seems like histogram-based approach would definitely be easiest since you have a line that the user has drawn.

But since you're just trying to generate random numbers from that distribution, you can use the normalized y-values (sum the y-position of all pixels and divide by the total) as the probability_distribution directly in the function below and just take arrays the size of the number of pixels the user has drawn.

from numpy.random import choice
pde = choice(list_of_candidates, number_of_items_to_pick, p=probability_distribution)

probability_distribution (the normalized pixel y-values) is a sequence in the same order of list_of_candidates (the associated x-values). You can also use the keyword replace=False to change the behavior so that drawn items are not replaced.

see numpy docs here

This should be much faster since you're not actually generating an entire pde, just drawing random numbers that match the pde.

EDIT: your update looks like a solid approach. If you do want to generate the pde, you might consider investigating numba (http://numba.pydata.org) to vectorize your for loop.

Intimate answered 15/1, 2017 at 5:5 Comment(0)

Another method is to sample the inverse of the CDF. Then you use a uniform random generator to generate p values on the x-axis of the inverse CDF to generate random draws of your PDF. See this article: http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution

Ethology answered 23/11, 2017 at 12:33 Comment(1)

Your link is now broken. I added an explicit self-contained code in my answer to make your proposition more concrete – Guncotton 2/10, 2020 at 12:24

I was having trouble with rv_continuous, so I made my own little routine to sample from any continuous distribution with compact support, e.g. from a sum of two exponentials, or from any known discrete pdf (as asked in the question). This is essentially @Jan 's solution (a pretty classic solution).

My code is fully self-contained. To adapt it to any other distribution, you only need to change the formula in unnormalized_pdf, and make sure the boundaries of your support are correctly set (in my case from 0 to 10/lambda_max is enough.

import numpy as np
import matplotlib.pyplot as plt

plt.ion()

## The function may be any function, so long as it is with FINITE Support
def unnormalized_pdf(T, lambda1, intercept1, lambda2, intercept2):
    return np.exp(-lambda1 * T - intercept1) + np.exp(-lambda2 * T - intercept2)


lambda1, intercept1, lambda2, intercept2 = (
    0.0012941708402716523,
    8.435217547457713,
    0.0063804460354380385,
    6.712937938322769,
)

## defining the support of the pdf by hand
x0 = 0
xmax = max(1 / lambda1, 1 / lambda2) * 10

## the more bins, the higher the precision
Nbins = 1000000
xs = np.linspace(x0, xmax, Nbins)
dx = xs[1] - xs[0]
## other way to specify it:
# dx = min(1/lambda1, 1/lambda2)/100
# xs = np.arange(x0, xmax, dx)

## compute the (approximate) pdf and cdf of the thing to sample:
pdf = unnormalized_pdf(xs, lambda1, intercept1, lambda2, intercept2)
normalized_pdf = pdf / pdf.sum()
cdf = np.cumsum(normalized_pdf)

## sampling from the distro
Nsamples = 100000
r = np.random.random(Nsamples)
indices_in_cdf = np.searchsorted(cdf, r)
values_drawn = xs[indices_in_cdf]
histo, bins = np.histogram(values_drawn, 1000, density=True)
plt.semilogy(bins[:-1], histo, label="drawn from distro", color="blue")
plt.semilogy(xs, normalized_pdf / dx, label="exact pdf from which we sample", color="k", lw=3)
plt.legend()
plt.show()

Burtis answered 2/10, 2020 at 11:15 Comment(0)

Your intuition to sample with weights proportional to the density is fine as an approximation. I recommend to use tools to invert the distribution function after integrating the density. Here is an example:

import numpy as np
from scipy.stats.sampling import NumericalInversePolynomial
from matplotlib import pyplot as plt
from scipy.integrate import quad


class MyDist:
    def __init__(self, a):
        self.a = a

    def support(self):
        # distribution restricted to 0, 5, can be changed as needed
        return (0, 5)

    def pdf(self, x):
        # this is not a proper pdf, the normalizing
        # constant is missing (does not integrate to one)
        # this is ok for the method presented here
        return x * (x + np.sin(5*x) + 2) * np.exp(-x**self.a)


dist = MyDist(0.5)
gen = NumericalInversePolynomial(dist)

# compute the missing normalizing constant to plot the pdf
const_pdf = quad(dist.pdf, *dist.support())[0]

r = gen.rvs(size=50000)
x = np.linspace(r.min(), r.max(), 500)

# show histogram together with the pdf
plt.plot(x, dist.pdf(x) / const_pdf)
plt.hist(r, density=True, bins=100)
plt.show()

There are more tools to sample from custom continuous or discrete univariate distributions by just providing some information about the distribution, such as the density / pdf. Overview of the different methods: https://docs.scipy.org/doc/scipy/reference/stats.sampling.html

Wooster answered 14/3 at 19:46 Comment(0)

Recommended topics

Hot tags