How do I generate data with specified mean, variance, skewness, kurtosis in Python?

Asked 13/4, 2019 at 16:42 Answered 16/5 at 19:53

I want to generate data in Python that behaves like real stock market data, which means I need to be able to specify and play around with all of the first four moments. Only being able to control skewness or only kurtosis is unfortunately not enough.

I found some answers here: How to generate a distribution with a given mean, variance, skew and kurtosis in Python? However, I can't seem to control these properties with the gengamma distribution.

I know there are many distributions listed here: https://docs.scipy.org/doc/scipy/reference/stats.html#continuous-distributions. Maybe I can use one of them in some clever way? Or is there another way?

Lacey answered 13/4, 2019 at 16:42

There are a number of potential choices of distribution family.

The classic example would be the Pearson family of distributions.

https://en.wikipedia.org/wiki/Pearson_distribution

These encompass scaled (including multiplication by negative values to get left-skewed distributions) and shifted versions of the beta, gamma, inverse gamma, t and F distributions, among others.
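As a partial illustration: as far as I know, SciPy does not ship a general four-moment Pearson sampler, but it does provide scipy.stats.pearson3 (Pearson type III, a shifted and scaled gamma). The sketch below uses it to match mean, standard deviation and skewness. The helper name pearson3_with_moments is made up for this example. For type III the excess kurtosis is not a free parameter: it is tied to the skewness as 1.5 * skew**2, so this covers only three of the four moments.

```python
import numpy as np
from scipy.stats import pearson3

def pearson3_with_moments(mean, std, skew):
    """Hypothetical helper: frozen Pearson type III distribution.

    scipy's pearson3 is standardized, so loc is the mean and
    scale is the standard deviation; skew may be negative.
    """
    return pearson3(skew, loc=mean, scale=std)

dist = pearson3_with_moments(mean=0.0, std=2.0, skew=-0.8)

# Theoretical mean, variance, skewness and excess kurtosis.
m, v, s, k = dist.stats(moments="mvsk")
print(float(m), float(v), float(s), float(k))

# Draw a reproducible sample from the distribution.
sample = dist.rvs(size=10_000, random_state=0)
```

Controlling kurtosis independently would need a different member of the Pearson family (for example type IV), which this sketch does not attempt.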

Laidlaw answered 24/12, 2023 at 22:38

I think you are better off using the gengamma distribution in scipy, since it gives you all the parameters you need to control the shape of the distribution.

from scipy.stats import gengamma

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gengamma.html
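A rough sketch of how that could look, under the assumption that you pick the two shape parameters a and c by hand (they jointly determine skewness and kurtosis) and then use loc and scale to hit the mean and variance exactly. The helper name gengamma_with_mean_var is invented for this example and is not part of scipy.

```python
import numpy as np
from scipy.stats import gengamma

def gengamma_with_mean_var(a, c, mean, var):
    """Hypothetical helper: shift and scale gengamma(a, c) to a target mean/variance."""
    # Mean and variance of the unshifted, unscaled distribution.
    m0, v0 = gengamma.stats(a, c, moments="mv")
    scale = np.sqrt(var / v0)
    loc = mean - scale * m0
    return gengamma(a, c, loc=loc, scale=scale)

dist = gengamma_with_mean_var(a=2.0, c=1.5, mean=5.0, var=9.5)

# Skewness and kurtosis depend only on a and c; loc and scale do not change them.
m, v, s, k = dist.stats(moments="mvsk")
print(float(m), float(v), float(s), float(k))

sample = dist.rvs(size=1000, random_state=0)
```

Hitting a specific skewness and kurtosis pair would still require solving for a and c numerically, and not every pair is reachable with this family.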

Hope this helps.

Hexapla answered 13/2, 2020 at 6:20

One way to generate such data is to repeatedly draw random samples between a specified minimum and maximum until the sample's statistics are within a given tolerance of the targets. See the following Python code for an example:

from scipy.stats import skew, kurtosis
import numpy as np

def Generator(lower, upper, m, v, kur, sk, n, tol=0.01):
    """
    Generate a list of random integers within a given range that meet specific statistical criteria.

    Parameters:
        lower (int): The lower limit (inclusive) of the range from which to draw integers.
        upper (int): The upper limit (exclusive) of the range from which to draw integers.
        m (float): The desired mean value for the generated data.
        v (float): The desired variance (population variance, as computed by np.var).
        kur (float): The desired excess kurtosis (scipy's default Fisher definition).
        sk (float): The desired skewness for the generated data.
        n (int): The number of random numbers to generate.
        tol (float, optional): The tolerance for the mean, variance, kurtosis, and skewness. Defaults to 0.01.

    Returns:
        list: A list of n random integers that meet the specified statistical criteria.

    """
    values = np.arange(lower, upper)  # candidate integers; upper itself is excluded
    while True:
        # Draw a fresh sample and accept it only if all four statistics are within tol.
        data = list(np.random.choice(values, n, replace=True))
        if (abs(np.mean(data) - m) < tol and abs(np.var(data) - v) < tol
                and abs(kurtosis(data) - kur) < tol and abs(skew(data) - sk) < tol):
            return data

The tighter the tolerance, the longer the search takes, because more candidate samples are rejected. For example, to generate 100 data points from the integers 0 through 9 (upper=10 is exclusive) with mean=5, variance=9.5, and so on:

g=Generator(lower=0, upper=10, m=5, v=9.5, kur=-1.5, sk=-0.3, n=100, tol=0.1)

Checking the result:

np.mean(g), np.var(g), kurtosis(g), skew(g), len(g), min(g), max(g)
(4.98, 9.5996, -1.428594751943574, -0.25566031308484666, 100, 0, 9)

[Figure: Distribution of Generated Data (g)]
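Because the loop only stops when a sample is accepted, unreachable targets or a very small tolerance make it run indefinitely. A minimal variant with a cap on the number of attempts might look like the following. The name generate_with_limit and the max_tries and seed parameters are assumptions added for this sketch, and it uses NumPy's default_rng instead of np.random.choice.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def generate_with_limit(lower, upper, m, v, kur, sk, n,
                        tol=0.01, max_tries=100_000, seed=None):
    """Rejection search with a bounded number of attempts (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    for _ in range(max_tries):
        # Integers in [lower, upper): upper is exclusive, as with np.arange.
        data = rng.integers(lower, upper, size=n)
        if (abs(data.mean() - m) < tol and abs(data.var() - v) < tol
                and abs(kurtosis(data) - kur) < tol
                and abs(skew(data) - sk) < tol):
            return data.tolist()
    raise RuntimeError(f"no sample within tolerance after {max_tries} tries")

# Targets close to the natural moments of uniform integers 0..9,
# so a sample is accepted quickly.
g = generate_with_limit(lower=0, upper=10, m=4.5, v=8.25, kur=-1.22,
                        sk=0.0, n=200, tol=0.5, max_tries=10_000, seed=0)
print(len(g), np.mean(g), np.var(g), kurtosis(g), skew(g))
```

Raising an error after max_tries makes it obvious when the requested moments are far from what uniform draws over the chosen range can produce.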

Medius answered 16/5 at 19:53
