Fitting data to distributions?

I am not a statistician (more of a researchy web developer) but I've been hearing a lot about scipy and R these days. So out of curiosity I wanted to ask this question (though it might sound silly to the experts around here) because I am not sure of the advances in this area and want to know how people without a sound statistics background approach these problems.

Given a set of real numbers observed from an experiment, let us say they belong to one of the many distributions out there (like Weibull, Erlang, Cauchy, Exponential etc.), are there any automated ways of finding the right distribution and the distribution parameters for the data? Are there any good tutorials that walk me through the process?

Real-world Scenario: For instance, let us say I ran a small survey of 300 people and recorded how many people each person talks to every day, so I have the following information:

1 10
2 5
3 20
...
...

where X Y tells me that person X talked to Y people during the period of the survey. Now, using the information from the 300 people, I want to fit a model to this data. The question boils down to: are there any automated ways of finding the right distribution and its parameters for this data, or, if not, is there a good step-by-step procedure to achieve the same?

Rola answered 27/11, 2010 at 4:10 Comment(3)
You've failed to describe the most important part of your question: what do you want to do with the model? – Maricruzmaridel
This question would be better suited to stats.se – Froe
Eh, I'm willing to allow that he need not address what he'd do with the parametric model. Even simply working with synthetic data generated from an adequate parametric model is sufficient for asking such a question. The bootstrap is wonderful, but you have to retain or ship the data. – Damselfish

This is a complicated question, and there are no perfect answers. I'll try to give you an overview of the major concepts, and point you in the direction of some useful reading on the topic.

Assume that you have a one-dimensional set of data, and a finite set of probability distribution functions that you think the data may have been generated from. You can consider each distribution independently and try to find parameters that are reasonable given your data. There are two common methods for estimating the parameters of a probability distribution function from data:

  1. Least Squares
  2. Maximum Likelihood

In my experience, Maximum Likelihood has been preferred in recent years, although this may not be the case in every field.
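
To make the least-squares idea concrete before moving on to maximum likelihood, here is a minimal sketch (my own illustration, not part of the original workflow): bin the data into a histogram and fit a normal density to the bin heights with nls(). The bin count and the starting values are arbitrary choices.

# Least-squares sketch: fit a normal density curve to histogram bin heights.
y  = rnorm( 100 )                          # stand-in sample for this sketch
h  = hist( y, breaks = 20, plot = FALSE )  # bin the data
df = data.frame( mids = h$mids, dens = h$density )

# Nonlinear least squares: minimize the squared error between bin density and dnorm().
fit = nls( dens ~ dnorm( mids, mean = m, sd = s ),
           data = df,
           start = list( m = mean(y), s = sd(y) ) )
coef( fit )   # least-squares estimates of the mean and sd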

Here's a concrete example of how to estimate parameters in R. Consider a set of random points generated from a Gaussian distribution with mean of 0 and standard deviation of 1:

x = rnorm( n = 100, mean = 0, sd = 1 )

Assume that you know the data were generated using a Gaussian process, but you've forgotten (or never knew!) the parameters for the Gaussian. You'd like to use the data to give you reasonable estimates of the mean and standard deviation. In R, there is a standard library that makes this very straightforward:

library(MASS)
params = fitdistr( x, "normal" )
print( params )

This gave me the following output:

      mean           sd     
  -0.17922360    1.01636446 
 ( 0.10163645) ( 0.07186782)

Those are fairly close to the right answer, and the numbers in parentheses are the standard errors of the estimates. Remember that every time you generate a new set of points, you'll get a new answer for the estimates.

Mathematically, this is using maximum likelihood to estimate both the mean and standard deviation of the Gaussian. The likelihood is (in this case) the probability of the data given particular values of the parameters, and maximum likelihood estimation finds the parameter values that maximize that probability. For some distributions this requires numerical optimization; in R, most of the work is done by fitdistr, which in certain cases will call optim.
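
If you're curious what that looks like without fitdistr, here is a minimal hand-rolled sketch (my own illustration, not the package's actual code): write the negative log-likelihood of the normal model and hand it to optim(). The starting values and the positivity guard on the standard deviation are arbitrary choices.

# Hand-rolled maximum likelihood for the normal model, using optim().
negloglik = function( par, data ) {
    if ( par[2] <= 0 ) return( Inf )   # the sd must stay positive
    -sum( dnorm( data, mean = par[1], sd = par[2], log = TRUE ) )
}

fit = optim( par = c( 0, 1 ), fn = negloglik, data = x )
fit$par   # should be close to the estimates reported by fitdistr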

You can extract the log-likelihood from your parameters like this:

print( params$loglik )
[1] -139.5772

It's more common to work with the log-likelihood rather than the likelihood to avoid rounding errors. The joint probability of your data is a product of probabilities, each less than 1, so even for a small data set it approaches 0 very quickly; summing the log-probabilities is equivalent to multiplying the probabilities and avoids the underflow. Because those log terms are negative, a log-likelihood closer to 0 corresponds to a higher likelihood, and more negative values indicate a worse fit to your data.
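
As a quick numerical illustration (assuming the x and params objects from the fit above), you can see both the underflow problem and the equivalence directly:

# Per-point densities under the fitted normal model.
p = dnorm( x, mean = params$estimate["mean"], sd = params$estimate["sd"] )

prod( p )        # joint density: a tiny number that quickly underflows to 0
sum( log( p ) )  # log-likelihood: matches params$loglik up to rounding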

With computational tools like this, it's easy to estimate parameters for any distribution. Consider this example:

x = x[ x >= 0 ]

distributions = c("normal","exponential")

for ( dist in distributions ) {
    print( paste( "fitting parameters for ", dist ) )
    params = fitdistr( x, dist )
    print( params )
    print( summary( params ) )
    print( params$loglik )
}

The exponential distribution doesn't generate negative numbers, so I removed them in the first line. The output (which is stochastic) looked like this:

[1] "fitting parameters for  normal"
      mean          sd    
  0.72021836   0.54079027 
 (0.07647929) (0.05407903)
         Length Class  Mode   
estimate 2      -none- numeric
sd       2      -none- numeric
n        1      -none- numeric
loglik   1      -none- numeric
[1] -40.21074
[1] "fitting parameters for  exponential"
     rate  
  1.388468 
 (0.196359)
         Length Class  Mode   
estimate 1      -none- numeric
sd       1      -none- numeric
n        1      -none- numeric
loglik   1      -none- numeric
[1] -33.58996

The exponential distribution is actually slightly more likely to have generated this data than the normal distribution, probably because the exponential distribution doesn't have to assign any probability density to negative numbers.

All of these estimation problems get worse when you try to fit your data to more distributions. Distributions with more parameters are more flexible, so they'll fit your data better than distributions with fewer parameters. Also, some distributions are special cases of others (for example, the exponential is a special case of the gamma). Because of this, it's very common to use prior knowledge to constrain your choice of models to a subset of all possible models.
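
One common way to penalize extra parameters when comparing fits is an information criterion such as AIC. Here is a small sketch (my own addition, reusing the non-negative x from the loop above, with the fits recomputed for clarity):

# AIC = 2 * (number of parameters) - 2 * log-likelihood; lower is better.
fit_norm = fitdistr( x, "normal" )
fit_exp  = fitdistr( x, "exponential" )

aic = function( fit ) 2 * length( fit$estimate ) - 2 * fit$loglik
c( normal = aic( fit_norm ), exponential = aic( fit_exp ) )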

One trick to get around some problems in parameter estimation is to gather a lot of data and leave some of it out for cross-validation: estimate the parameters on the remaining data, then measure each model's likelihood on the left-out data.
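
Here is a minimal sketch of that idea (my own addition, with an arbitrary 80/20 split of the non-negative x from above):

# Hold out 20% of the data, fit on the rest, score on the held-out points.
set.seed( 1 )
idx   = sample( seq_along(x), size = round( 0.8 * length(x) ) )
train = x[ idx ]
test  = x[ -idx ]

fit_n = fitdistr( train, "normal" )
fit_e = fitdistr( train, "exponential" )

# Held-out log-likelihood: higher is better.
sum( dnorm( test, mean = fit_n$estimate["mean"], sd = fit_n$estimate["sd"], log = TRUE ) )
sum( dexp( test, rate = fit_e$estimate["rate"], log = TRUE ) )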

Bellaude answered 27/11, 2010 at 5:5 Comment(5)
+1 @James, excellent. Would you have a link or two for 2d distributions (besides normal)? – Prayer
When you assume that your probability distributions are Gaussian, the maximum likelihood objective function reduces to a (weighted) least squares function (i.e. least squares is a special case of maximum likelihood). – Danelaw
In cross-validation, how do you compare different models' likelihood values on the "left-out" data? I.e. is there a statistical test which takes likelihood values and states that one model is a better fit? – Jaine
@Jaine - There are methods, none of them perfect. We quickly run into overfitting problems, where models with more parameters fit the data better but don't generalize as well. Read about AIC and BIC, then post on the Cross Validated Stack Exchange if you have more questions. – Bellaude
The numbers in parentheses are standard deviations of the estimates, not confidence intervals. – Maskanonge

Take a look at fitdistrplus (http://cran.r-project.org/web/packages/fitdistrplus/index.html).

A couple of quick things to note:

  • Try the function descdist, which plots the skewness vs. kurtosis of the data and overlays the positions of some common distributions.
  • fitdist allows you to fit any distribution you can define in terms of a density and CDF.
  • You can then use gofstat, which computes goodness-of-fit statistics such as the Kolmogorov–Smirnov (KS) and Anderson–Darling (AD) statistics, measuring the distance of the fit from the data; a short usage sketch follows this list.
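
Here is a minimal sketch of how the three functions fit together (my own illustration; the gamma-distributed stand-in data and the candidate distributions are arbitrary choices):

library(fitdistrplus)

x <- rgamma(300, shape = 2, rate = 0.5)   # stand-in data

descdist(x, boot = 500)                   # skewness-kurtosis (Cullen and Frey) plot

fit_gamma <- fitdist(x, "gamma")          # candidate fits
fit_lnorm <- fitdist(x, "lnorm")

plot(fit_gamma)                           # density, CDF, Q-Q and P-P diagnostic plots

gofstat(list(fit_gamma, fit_lnorm),       # KS and AD statistics, plus AIC/BIC
        fitnames = c("gamma", "lognormal"))
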
Pullet answered 27/11, 2010 at 15:4 Comment(3)
It has been three years since I asked this question, and I now realize that your four lines contain a lot of useful information. I took a closer look at descdist. Can you please elaborate on how to use descdist along with fitdist and gofstat to formally present an analysis? I would be extremely thankful if you could at least point me to an existing tutorial on how to do this formally. Thank you for your time! – Rola
@Ramnath, someone needs your help here. – Fourcycle
I am looking into these functions as well; have you found an existing tutorial yet? – Satrap

This is probably a bit more general than you need, but might give you something to go on.

One way to estimate a probability density function from random data is to use an Edgeworth or Gram–Charlier expansion. These approximations use density function properties known as cumulants (the unbiased estimators of which are the k-statistics) and express the density function as a perturbation from a Gaussian distribution.

These both have some rather dire weaknesses such as producing divergent density functions, or even density functions that are negative over some regions. However, some people find them useful for highly clustered data, or as starting points for further estimation, or for piecewise estimated density functions, or as part of a heuristic.
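
For a concrete feel for the technique, here is a rough sketch of a fourth-order Gram–Charlier-style correction to a Gaussian, built from sample skewness and excess kurtosis (my own illustration; the gamma-distributed sample, the e1071 helper functions, and the plotting range are arbitrary choices):

library(e1071)   # skewness() and kurtosis(); the latter estimates excess kurtosis

samp <- rgamma(500, shape = 3)   # a skewed stand-in sample
m  <- mean(samp)
s  <- sd(samp)
g1 <- skewness(samp)
g2 <- kurtosis(samp)

# Gaussian density perturbed by Hermite-polynomial terms in the standardized variable.
gc_density <- function(t) {
    z <- (t - m) / s
    dnorm(z) / s * (1 + g1 / 6 * (z^3 - 3 * z) + g2 / 24 * (z^4 - 6 * z^2 + 3))
}

curve(gc_density(x), from = 0, to = 12)   # note it can dip below zero, as warned above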

M. G. Kendall and A. Stuart, The Advanced Theory of Statistics, vol. 1, Charles Griffin, 1963, was the most complete reference I found for this, with a whopping whole page dedicated to the topic; most other texts had a sentence on it at most, or listed the expansion in terms of the moments instead of the cumulants, which is a bit useless. Good luck finding a copy, though; I had to send my university librarian on a trip to the archives for it... but this was years ago, so maybe the internet will be more helpful today.

The most general form of your question is the topic of a field known as non-parametric density estimation, where given:

  • data from a random process with an unknown distribution, and
  • constraints on the underlying process

...you produce a density function that is the most likely to have produced the data. (More realistically, you create a method for computing an approximation to this function at any given point, which you can use for further work, e.g. comparing the density functions from two sets of random data to see whether they could have come from the same process.)

Personally, though, I have had little luck in using non-parametric density estimation for anything useful, but if you have a steady supply of sanity you should look into it.
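
If you do decide to look into it, base R already ships a simple kernel density estimator, which sidesteps the negativity problems above; here is a tiny sketch (my own illustration, with made-up bimodal data):

# Kernel density estimation: a standard non-parametric approach.
samp <- c(rnorm(200, mean = 0), rnorm(100, mean = 4))   # hypothetical bimodal sample

d <- density(samp)   # Gaussian kernel, bandwidth chosen automatically
plot(d, main = "Kernel density estimate")

# Approximate the estimated density at an arbitrary point, for further work.
approx(d$x, d$y, xout = 1.5)$y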

Mullet answered 27/11, 2010 at 7:25 Comment(0)

I'm not a scientist, but if you were doing it with a pencil and paper, the obvious way would be to make a graph, then compare the graph to a known standard distribution.

Going further with that thought, "comparing" means checking whether the curves of a standard distribution and your data look similar.

Trigonometry, tangents... would be my last thought.

I'm not an expert, just another humble Web Developer =)

Bespoke answered 27/11, 2010 at 4:29 Comment(1)
I am a scientist, and your idea of constructing a graph of your data and comparing it to well-known distributions is a really good one: it's the basis of both maximum likelihood and least squares fitting. The difference between the two is how they evaluate goodness-of-fit between your data and the distributions, but both are based on your intuitively attractive idea. :) – Bellaude

You essentially want to compare your real-world data to a set of theoretical distributions. There is the function qqnorm() in base R, which will do this for the normal distribution, but I prefer the probplot function in e1071, which allows you to test other distributions. Here is a code snippet that will plot your real data against each of the theoretical distributions that we put into the list. We use plyr to iterate over the list, but there are several other ways to do this as well.

library("plyr") 
library("e1071")

realData <- rnorm(1000) #Real data is normally distributed

distToTest <- list(qnorm = "qnorm", lognormal = "qlnorm", qexp =  "qexp")

#function to test real data against list of distributions above. Output is a jpeg for each distribution.
testDist <- function(x, data){
    jpeg(paste(x, ".jpeg", sep = ""))
    probplot(data, qdist = x)
    dev.off()
    }

l_ply(distToTest, function(x) testDist(x, realData))
Clementinaclementine answered 27/11, 2010 at 4:59 Comment(1)
Could you tell me if it is possible to add the "Negative Binomial" distribution to the test list as well? I've been trying, but I'm not sure how to specify a name that contains a space; the reference on R's website says that I need to use Negative Binomial, but I'm not sure how to add this to the list. – Rola

For what it's worth, it seems like you might want to look at the Poisson distribution.
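
If you go that route, MASS::fitdistr (used in the first answer above) can estimate the Poisson rate directly; here is a tiny sketch with made-up counts (my own addition, not part of the original answer):

library(MASS)

counts <- rpois(300, lambda = 4)   # stand-in for "people talked to per day"
fitdistr(counts, "Poisson")        # maximum-likelihood estimate of lambda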

Economizer answered 27/11, 2010 at 4:54 Comment(0)
