How to draw a sample from a categorical distribution

Asked 12/7, 2020 at 13:48 Answered 24/7, 2023 at 9:21

I have a 3D numpy array with the probabilities of each category in the last dimension. Something like:

import numpy as np
from scipy.special import softmax

array = np.random.normal(size=(10, 100, 5))
probabilities = softmax(array, axis=2)

How can I sample from a categorical distribution with those probabilities?

EDIT: Right now I'm doing it like this:

def categorical(x):
    return np.random.multinomial(1, pvals=x)

samples = np.apply_along_axis(categorical, axis=2, arr=probabilities)

But it's very slow so I want to know if there's a way to vectorize this operation.

Magaretmagas answered 12/7, 2020 at 13:48 Comment(2)

Could you clarify "probabilities of each category in the last dimension."? Are there 5 categories and each of the 1000 points have a different probability distirbution? – Unfix 12/7, 2020 at 20:47

@Han-KwangNienhuys It's for a recommender system probabilities[i, j, k] is the probability that user i rated item j with the rating k – Nonviolence 13/7, 2020 at 10:50

Drawing samples from a given probability distribution is done by building the evaluating the inverse cumulative distribution for a random number in the range 0 to 1. For a small number of discrete categories - like in the question - you can find the inverse using a linear search:

## Alternative test dataset
probabilities[:, :, :] = np.array([0.1, 0.5, 0.15, 0.15, 0.1])

n1, n2, m = probabilities.shape

cum_prob = np.cumsum(probabilities, axis=-1) # shape (n1, n2, m)
r = np.random.uniform(size=(n1, n2, 1))

# argmax finds the index of the first True value in the last axis.
samples = np.argmax(cum_prob > r, axis=-1)

print('Statistics:')
print(np.histogram(samples, bins=np.arange(m+1)-0.5)[0]/(n1*n2))

For the test dataset, a typical test output was:

Statistics:
[0.0998 0.4967 0.1513 0.1498 0.1024]

which looks OK.

If you have many, many categories (thousands), it's probably better to do a bisection search using a numba compiled function.

Unfix answered 13/7, 2020 at 12:15 Comment(2)

Can't you just use np.random-choice and use the p argument? Seems both easier to understand and faster. – Succuss 30/10, 2022 at 16:51

@Succuss if you want to convert your comment into an answer, I am happy to upvote. Of course, the manual cdf way works too but a non-liner is always appreciated :smile – Isatin 23/7, 2023 at 22:48

You can simply use the np.random.choice function. It has a convenient p argument allowing you to specify the probability of each category. See working example below

import numpy as np
categories = ['apple', 'banana', 'kiwi']
probabilities = [0.2, 0.2, 0.6]

# draw 1000 samples
n = 1000
draw = np.random.choice(categories, n, p=probabilities)

# print counts to verify
from collections import Counter
print(Counter(draw))

Succuss answered 24/7, 2023 at 9:21 Comment(0)

Recommended topics

Hot tags