Generate sample data with an exact Mean and Standard Deviation

I want to create a data set with a specific mean and standard deviation.

Using np.random.normal() only gives me approximate values. However, for what I want to test I need an exact mean and standard deviation.

I have tried using a combination of norm.pdf and np.linspace, but the generated data set doesn't match either (though I may just be misusing them).

It doesn't matter whether the data set is random or not, as long as I can set a specific sample size, mean and standard deviation.

Help would be much appreciated.

Insula answered 25/7, 2018 at 9:26 Comment(0)

The easiest approach is to generate some zero-mean samples with the desired standard deviation. Then subtract the sample mean from the samples so they are truly zero-mean, scale them so that the standard deviation is exactly the desired value, and finally add the desired mean.

Here is some example code:

import numpy as np

num_samples = 1000
desired_mean = 50.0
desired_std_dev = 10.0

samples = np.random.normal(loc=0.0, scale=desired_std_dev, size=num_samples)

actual_mean = np.mean(samples)
actual_std = np.std(samples)
print("Initial samples stats   : mean = {:.4f} stdv = {:.4f}".format(actual_mean, actual_std))

zero_mean_samples = samples - (actual_mean)

zero_mean_mean = np.mean(zero_mean_samples)
zero_mean_std = np.std(zero_mean_samples)
print("True zero samples stats : mean = {:.4f} stdv = {:.4f}".format(zero_mean_mean, zero_mean_std))

scaled_samples = zero_mean_samples * (desired_std_dev/zero_mean_std)
scaled_mean = np.mean(scaled_samples)
scaled_std = np.std(scaled_samples)
print("Scaled samples stats    : mean = {:.4f} stdv = {:.4f}".format(scaled_mean, scaled_std))

final_samples = scaled_samples + desired_mean
final_mean = np.mean(final_samples)
final_std = np.std(final_samples)
print("Final samples stats     : mean = {:.4f} stdv = {:.4f}".format(final_mean, final_std))

Which produces output similar to this:

Initial samples stats   : mean = 0.2946 stdv = 10.1609
True zero samples stats : mean = 0.0000 stdv = 10.1609
Scaled samples stats    : mean = 0.0000 stdv = 10.0000
Final samples stats     : mean = 50.0000 stdv = 10.0000
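
The same steps can be folded into one helper. This is only a minimal sketch: the function name exact_normal_samples and its ddof and seed parameters are assumptions for illustration, not part of the answer above. ddof=0 matches the default of np.std (population standard deviation), while ddof=1 would target the sample standard deviation that statistics.stdev reports.

```python
import numpy as np

def exact_normal_samples(n, mean, std, ddof=0, seed=None):
    """Draw n normal samples, then shift and scale them so the sample
    mean and standard deviation match the targets (up to float rounding)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    x = x - x.mean()             # sample mean is now zero
    x = x / x.std(ddof=ddof)     # sample std (with the chosen ddof) is now one
    return mean + std * x        # scale to the target std, shift to the target mean

samples = exact_normal_samples(1000, 50.0, 10.0, seed=0)
print(np.isclose(samples.mean(), 50.0), np.isclose(samples.std(), 10.0))
```

Because the adjustment is a linear shift and scale, the shape of the distribution is unchanged. Only the sample mean and standard deviation are pinned, and only up to floating-point rounding.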
Unbraid answered 25/7, 2018 at 9:44 Comment(3)
The original sample data has a mean of -0.005542 and a std of 0.06089, with data points ranging from -0.1208 to 0.14069. But after generating new data with the desired mean (-0.005542) and std (0.06089), the new data points range from -26.847 to 27.9262. Is there any way to restrict the new data points to deviate at most 1 std from the original data range? Hemophilia
Is there a way to limit the max and min for the final_samples? Bail

For others seeing this later, Python 3.8+ has the statistics.NormalDist class for exactly this purpose:

import statistics as s
n = s.NormalDist(mu=10, sigma=2)
samples = n.samples(100_000, seed=42)  # remove seed if desired
print(s.mean(samples))  # 10.004521585462394
print(s.stdev(samples))  # 2.0052615406360457

Methods from @Spoonless's answer can be used to tweak the exact mean and stdev of the samples if needed, or one can just use a large enough number of samples to get exceedingly close -- this is statistics, after all.
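
As a rough sketch of that tweak using only the standard library: the variable names below (target_mu, target_sigma, raw, adjusted) are illustrative assumptions. Note that statistics.stdev is the sample standard deviation (n - 1 denominator), so this pins that quantity rather than the population value that np.std computes by default.

```python
import statistics as s

target_mu, target_sigma = 10, 2
raw = s.NormalDist(mu=target_mu, sigma=target_sigma).samples(1000, seed=42)

m = s.fmean(raw)
sd = s.stdev(raw, xbar=m)  # sample standard deviation of the raw draw

# Shift to zero mean, rescale to the target stdev, then shift to the target mean.
adjusted = [target_mu + (x - m) * (target_sigma / sd) for x in raw]

print(round(s.fmean(adjusted), 6), round(s.stdev(adjusted), 6))
```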

Eats answered 10/12, 2021 at 13:39 Comment(0)

You can also do this with the random library.

import random as rand
mean = 20.9
stdd = 3
samples = 1000
data = [rand.normalvariate(mean, stdd) for i in range(samples)]

I also needed to generate data with residuals, so I simply added the product of rand.randrange(-1,1) and the residual.

data = [rand.normalvariate(mean, stdd)+(rand.randrange(-1,1)*residual) for i in range(samples)]

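Note that adding residuals will throw off the mean and stdd. rand.randrange(-1,1) returns only -1 or 0, so the mean shifts down by about half the residual; rand.uniform(-1,1) would give symmetric noise instead.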
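
If the exact mean and stdd still matter after the noise is added, the data can be re-standardized afterwards. The sketch below rests on several assumptions: the residual value of 0.5 is hypothetical, random.uniform(-1, 1) is swapped in for the noise factor as a guess at what was intended, and the population standard deviation (statistics.pstdev) is the one being pinned.

```python
import random
import statistics

mean, stdd, n = 20.9, 3, 1000
residual = 0.5  # hypothetical residual magnitude, not taken from the answer

rng = random.Random(0)  # seeded only to make the sketch repeatable
data = [rng.normalvariate(mean, stdd) + rng.uniform(-1, 1) * residual
        for _ in range(n)]

# Re-standardize: remove the actual mean and spread, then apply the targets.
m = statistics.fmean(data)
sd = statistics.pstdev(data, mu=m)
restored = [mean + (x - m) * (stdd / sd) for x in data]

print(round(statistics.fmean(restored), 6), round(statistics.pstdev(restored), 6))
```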

Sable answered 14/4, 2022 at 6:33 Comment(0)
