How to fit the best probability distribution model to my data in Python?

I have about 20,000 rows of data like this:

Id | value
1    30
2    3
3    22
..
n    27

I computed summary statistics for my data: mean 33.85, median 30.99, min 2.8, max 206, and a 95% confidence interval of 0.21. So most values are around 33, with a few outliers, which suggests a distribution with a long tail.

I am new to both probability distributions and Python. I tried the fitter package (https://pypi.org/project/fitter/) to test many distributions from SciPy, and the loglaplace distribution showed the lowest error (although I do not quite understand it).

I read almost all the questions on this topic and concluded that there are two approaches: (1) fit a distribution model and then draw random values from it in my simulation, or (2) compute the frequencies of different groups of values and sample from those, but that approach can never produce a value above 206, for example.

Given that my data are numeric values, what is the best approach to fit a distribution to them in Python? In my simulation I need to draw random numbers that follow the same pattern as my data. I also need to validate that the model represents my data well by plotting my data together with the model curve.

Rapid answered 16/6, 2019 at 8:39 Comment(0)

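One way is to select the best model according to the Bayesian information criterion (BIC), where a lower value indicates a better trade-off between fit and number of parameters. OpenTURNS implements an automatic selection method (see the documentation of FittingTest.BestModelBIC).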
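To make the criterion concrete, here is a minimal standard-library sketch of what BIC-based selection does, independent of OpenTURNS. It assumes the textbook definition BIC = k·ln(n) − 2·ln(L) (libraries may use a different normalization), compares only two candidates (normal and lognormal, both fitted by maximum likelihood), and uses synthetic right-skewed data as a stand-in for the real values. The function names are hypothetical.

```python
import math
import random
import statistics

def bic(log_likelihood, n_params, n_obs):
    """Textbook BIC: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def normal_loglik(data):
    """Maximized log-likelihood of a normal model (closed form at the MLE)."""
    n = len(data)
    mu = statistics.fmean(data)
    var = statistics.pvariance(data, mu)  # MLE uses the population variance
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def lognormal_loglik(data):
    """Maximized log-likelihood of a lognormal model (requires data > 0)."""
    logs = [math.log(v) for v in data]
    # Change of variables: f_X(x) = f_{ln X}(ln x) / x
    return normal_loglik(logs) - sum(logs)

# Synthetic long-tailed sample (NOT the asker's data); parameters are assumptions
# chosen so that the mean is near 33 and the median near 30.
rng = random.Random(0)
data = [rng.lognormvariate(3.4, 0.45) for _ in range(2000)]

# Both candidates have two parameters, so here BIC reduces to comparing likelihoods.
scores = {
    "normal": bic(normal_loglik(data), 2, len(data)),
    "lognormal": bic(lognormal_loglik(data), 2, len(data)),
}
best = min(scores, key=scores.get)
print(scores)
print("best model by BIC:", best)
```

The penalty term k·ln(n) only matters when the candidates have different numbers of parameters; it is what keeps a more flexible distribution from winning just because it has more freedom to fit.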

Suppose you have an array x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. Here is a quick example using OpenTURNS:

import openturns as ot

# Example data from the text above
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Define x as a Sample object. It is a sample of size 11 and dimension 1
sample = ot.Sample([[xi] for xi in x])

# Define the distribution factories you want to test on the sample
tested_distributions = [ot.WeibullMaxFactory(), ot.NormalFactory(), ot.UniformFactory()]

# Find the best distribution according to BIC and print its parameters
best_model, best_bic = ot.FittingTest.BestModelBIC(sample, tested_distributions)
print(best_model)
# Output: Uniform(a = -0.769231, b = 10.7692)
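The question also asks how to draw random values from the fitted model and how to check that the model represents the data. With OpenTURNS, a fitted distribution can generate values through its getSample method. The sketch below shows the same two steps in a library-agnostic way using only the standard library. It assumes a lognormal model purely for illustration (not the model selected above), and the data are synthetic stand-ins, not the asker's values.

```python
import math
import random
import statistics

rng = random.Random(1)

# Synthetic stand-in for the observed values (NOT the real data).
observed = [rng.lognormvariate(3.4, 0.45) for _ in range(5000)]

# Fit a lognormal by maximum likelihood: mean and std dev of log(values).
logs = [math.log(v) for v in observed]
mu = statistics.fmean(logs)
sigma = statistics.pstdev(logs, mu)

# Draw simulated values from the fitted model.
simulated = [rng.lognormvariate(mu, sigma) for _ in range(5000)]

# Validate by comparing deciles of observed and simulated values
# (a numeric version of a QQ-plot); close agreement suggests a reasonable fit.
obs_q = statistics.quantiles(observed, n=10)
sim_q = statistics.quantiles(simulated, n=10)
for i, (o, s) in enumerate(zip(obs_q, sim_q), start=1):
    print(f"{i * 10:>3d}%  observed={o:8.2f}  simulated={s:8.2f}")

# A fitted model is not capped at the observed maximum, unlike resampling
# the observed frequencies.
print("max observed:", round(max(observed), 2))
print("max simulated:", round(max(simulated), 2))
```

Plotting the pairs of deciles against each other gives the usual QQ-plot: points close to the diagonal indicate that the simulated values follow the same pattern as the data, while systematic departures in the upper deciles point to a poorly modelled tail.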
Coadjutress answered 28/10, 2020 at 11:58 Comment(1)
You may use GetContinuousUniVariateFactories to create the list of all continuous univariate factories, but the selection may then return the Histogram distribution, which can be a disappointing result in some cases. - Lasonde
