Sample a truncated integer power law in Python?

Asked 4/7, 2014 at 18:21 Answered 17/4, 2020 at 11:28

What function can I use in Python if I want to sample a truncated integer power law?

That is, given two parameters a and m, generate a random integer x in the range [1,m) that follows a distribution proportional to 1/x^a.

I've been searching around numpy.random, but I haven't found this distribution.

Zing answered 4/7, 2014 at 18:21 Comment(1)

Why not just do rejection sampling with the built in power law distributions? – Unbolted 26/9, 2016 at 23:42

AFAIK, neither NumPy nor Scipy defines this distribution for you. However, using SciPy it is easy to define your own discrete distribution function using scipy.rv_discrete:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def truncated_power_law(a, m):
    x = np.arange(1, m+1, dtype='float')
    pmf = 1/x**a
    pmf /= pmf.sum()
    return stats.rv_discrete(values=(range(1, m+1), pmf))

a, m = 2, 10
d = truncated_power_law(a=a, m=m)

N = 10**4
sample = d.rvs(size=N)

plt.hist(sample, bins=np.arange(m)+0.5)
plt.show()

enter image description here

Brice answered 4/7, 2014 at 19:25 Comment(8)

Looks like you're integrating the pmf as if it were continuous, and taking the area between 1 and 2 to come up with p(1), between 2 and 3 for p(2), etc., is that right? If so, for your example I think you need to emulate Spinal Tap and go to 11 to get p(10). Your const would be adjusted by having (m+1)**k in the denominator. Or am I misunderstanding? – Disgust 4/7, 2014 at 19:46

@pjs: I'm taking the pdf to be the continous function 1/x**a. So there is no integration over intervals [1,2], [2,3], etc. However, I did integrate (by hand) to find the formulas for const and _ppf, the inverse of the cdf. I think I got it right, but I could be wrong. (I did try your suggestion, but it shifts the domain to [1, 11], so if I'm understanding you correctly, that does not pass a basic sanity check.) By the way, what is Spinal Tap referring to here? – Brice 4/7, 2014 at 19:59

Spinal Tap was a mockumentary film about a heavy metal band. They distinguished themselves from other bands by having their amplifiers go to 11. – Disgust 4/7, 2014 at 20:2

I'm not a pythonista so I can't directly check your results, but I've done a direct calculation for a,m = 2,10, and p(1) should be 0.6452579827864142. Is that what you get? – Disgust 4/7, 2014 at 20:4

Oh dear, I'm getting d.cdf(2) = p(1) = 0.555.... – Brice 4/7, 2014 at 20:12

@pjs: I get the same result (of 0.555...) by hand, integrating 1/x**2 from x=1 to x=2, and normalizing so the integral is 1 when integrated from x=1 to x=10. Is that what you are doing? – Brice 4/7, 2014 at 20:18

No, I'm actually calculating it directly as a truncated discrete distribution. See the answer I've posted for details. – Disgust 4/7, 2014 at 20:27

I've revised my answer to compute a discrete pmf. Now pmf[0] = p(1) = 0.64525798. – Brice 4/7, 2014 at 21:15

I don't use Python, so rather than risk syntax errors I'll try to describe the solution algorithmically. This is a brute-force discrete inversion. It should translate quite easily into Python. I'm assuming 0-based indexing for the array.

Setup:

Generate an array cdf of size m with cdf[0] = 1 as the first entry, cdf[i] = cdf[i-1] + 1/(i+1)**a for the remaining entries.
Scale all entries by dividing cdf[m-1] into each -- now they actually are CDF values.

Usage:

Generate your random values by generating a Uniform(0,1) and searching through cdf[] until you find an entry greater than your uniform. Return the index + 1 as your x-value.

Repeat for as many x-values as you want.

For instance, with a,m = 2,10, I calculate the probabilities directly as:

[0.6452579827864142, 0.16131449569660355, 0.07169533142071269, 0.04032862392415089, 0.02581031931145657, 0.017923832855178172, 0.013168530260947229, 0.010082155981037722, 0.007966147935634743, 0.006452579827864143]

and the CDF is:

[0.6452579827864142, 0.8065724784830177, 0.8782678099037304, 0.9185964338278814, 0.944406753139338, 0.9623305859945162, 0.9754991162554634, 0.985581272236501, 0.9935474201721358, 1.0]

When generating, if I got a Uniform outcome of 0.90 I would return x=4 because 0.918... is the first CDF entry larger than my uniform.

If you're worried about speed you could build an alias table, but with a geometric decay the probability of early termination of a linear search through the array is quite high. With the given example, for instance, you'll terminate on the first peek almost 2/3 of the time.

Disgust answered 4/7, 2014 at 20:25 Comment(2)

Doh, it only took me two hours (and reading your answer) to realize the OP is asking for a discrete probability distribution... – Brice 4/7, 2014 at 20:32

That's why I was asking about taking range areas to yield the discrete values. – Disgust 4/7, 2014 at 20:36

Use numpy.random.zipf and just reject any samples greater than or equal to m

Fang answered 17/4, 2020 at 11:28 Comment(0)

Recommended topics

Hot tags