Fit distribution to given frequency values in R

I have frequency values changing with time (x-axis units), as shown in the plot below. After some normalization these values may be seen as data points of a density function for some distribution.

Q: Assuming that these frequency points come from a Weibull distribution T, how can I best fit a Weibull density function to the points so as to infer the parameters of T from it?

sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)

plot(1:length(sample), sample, type = "l")
points(1:length(sample), sample)

[Plot: the frequency values over the x-axis units, drawn as a line with points]

Update. To prevent being misunderstood, I would like to add a little more explanation. By saying that I have frequency values changing with time (x-axis units), I mean I have data which says that I have:

  • 7787 realizations of value 1
  • 3056 realizations of value 2
  • 2359 realizations of value 3 ... etc.

One way towards my goal (an incorrect one, I think) would be to create a set of these realizations:

# Loop to expand the frequencies into individual realizations
# (equivalent one-liner: set.values <- rep(seq_along(sample), times = sample))
set.values <- c()
for(i in 1:length(sample)){
  set.values <- c(set.values, rep(i, times = sample[i]))
}

hist(set.values)
lines(1:length(sample), sample)
points(1:length(sample), sample)

[Plot: histogram of set.values with the frequency curve and points overlaid]

and use fitdistr on set.values:

library(MASS)   # fitdistr comes from the MASS package
f2 <- fitdistr(set.values, 'weibull')
f2

Why do I think this is an incorrect approach, and why am I looking for a better solution in R?

  • in the distribution fitting approach presented above it is assumed that set.values is a complete set of my realisations from the distribution T

  • in my original question I only know the points from the first part of the density curve; I do not know its tail, and I want to estimate the tail (and the whole density function). A quick check of how much probability mass the naive fit places past the observed range is sketched below.
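
To make this concrete, here is a quick check (a sketch; it assumes MASS is loaded and reuses set.values and f2 from the code above): overlay the density implied by the naive fit on the normalized frequencies, and look at how much probability mass that fit places beyond the largest observed value.

# Normalize the frequencies so they sum to 1 and compare them with the
# density implied by the naive fitdistr fit
plot(seq_along(sample), sample / sum(sample), type = "b",
     xlab = "value", ylab = "normalized frequency")
curve(dweibull(x, shape = f2$estimate["shape"], scale = f2$estimate["scale"]),
      from = 1, to = length(sample), add = TRUE, col = "blue")

# Probability mass the naive fit assigns beyond the observed range;
# whatever lies past the last observed value is simply absent from the data
1 - pweibull(length(sample), shape = f2$estimate["shape"],
             scale = f2$estimate["scale"])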

Boothe answered 14/3, 2015 at 21:12 Comment(7)
I have updated my answer with histograms.Highwrought
Do you know the exact value where the first part of the density curve ends and the tail begins? Your sample ends at value 22: can I assume that the tail begins at 23?Entoil
I am afraid I do not understand (I am not aware of a formal definition of "distribution tail" I could use here). My eventual goal is to compute the expected value of the variable which has distribution T. Maybe it is reasonable to assume that the first part (the part between the 1st and 2nd points in the histogram above) is linear and the latter part Weibull (Weibull is an assumption I was given by someone who provided me with the data. I wouldn't bet my life on this, but I am inclined to assume the same.)Boothe
You say: "in my original question I know the points from the first part of the density curve". What do you mean exactly by "first part"? At what value does the "first part" stop? You also say: "I do not know its tail and I want to estimate the tail (and the whole density function)". For that you need (a criterion) to select where the tail begins.Entoil
I kind of think I have answered it. In what way is my solution not what you are looking for?Chondroma
@MikeWise , I agree with your opinion. I am very glad you got involved in this discussion - you not only provided a "working" solution, but also contributed by sharing your thoughts multiple times. Thank you a lot! :)Boothe
And thank you. Was actually the most fun question I have ever answered. I am also involved in a Predictive Maintenance project, if you want to share thoughts outside of SO, ping me.Chondroma

[Plot 1: first try, fitting all the points]

[Plot 2: second try, with the first point dropped]

Here is a better attempt. Like before, it uses optim to find the best parameter values constrained to a box (defined by the lower and upper vectors in the optim call). Notice that it scales x and y as part of the optimization, in addition to the Weibull shape parameter, so we have 3 parameters to optimize over.

Unfortunately, when using all the points it pretty much always finds something on the edges of the constraining box, which indicates to me that a Weibull is maybe not a good fit for all of the data. The problem is the first two points: they are just too large. You can see the attempted fit to all the data in the first plot.

If I drop those first two points and just fit the rest, we get a much better fit. You can see this in the second plot. I think this is a good fit; in any case it is a local minimum in the interior of the constraining box.

library(optimx)   # note: only base R's optim is actually used below
sample <- c(60953,7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)
t.sample <- 0:22

# Drop the first two points and fit the rest
s.fit <- sample[3:23]
t.fit <- t.sample[3:23]

# Model: (y-scale) * dweibull((x-scale) * t, shape)
wx <- function(param) {
  res <- param[2]*dweibull(t.fit*param[3], shape=param[1])
  return(res)
}
# Objective: root of the sum of squared residuals
minwx <- function(param){
  v <- s.fit - wx(param)
  sqrt(sum(v*v))
}

p0 <- c(1, 200, 1/20)   # starting values: shape, y-scale, x-scale
paramopt <- optim(p0, minwx, gr=NULL, method="L-BFGS-B",
                  lower=c(0.1,100,0.01), upper=c(1.1,5000,1))

popt <- paramopt$par
popt
rms <- paramopt$value
tit <- sprintf("Weibull - shape:%.3f  yscale:%.1f  xscale:%.5f  rms:%.1f",
               popt[1], popt[2], popt[3], rms)

plot(t.sample[2:23], sample[2:23], type = "p", col="darkred")
lines(t.fit, wx(popt), col="blue")
title(main=tit)
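
Since the eventual goal mentioned in the comments is the expected value of T, here is one way to read off the fitted parameters (a sketch, assuming the fit above is acceptable; shape.hat and scale.hat are just illustrative names): the model is yscale * dweibull(t * xscale, shape), which on the t axis corresponds to a Weibull with shape popt[1] and scale 1/popt[3].

# Implied Weibull parameters on the original t axis (uses popt from above)
shape.hat <- popt[1]
scale.hat <- 1 / popt[3]   # t*popt[3] has scale 1, so t itself has scale 1/popt[3]

# Mean of a Weibull(shape, scale) is scale * gamma(1 + 1/shape)
scale.hat * gamma(1 + 1/shape.hat)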
Chondroma answered 3/5, 2015 at 10:33 Comment(12)
Hi @Mike Wise, thank you for your interest and this full example! As you can see, it is hard to fit the curve this way; in my opinion the fitted curve does not fit well, as it is not "bent" enough. I believe it should be far more like the blue curve from here, shouldn't it?Boothe
Plus 1 for the full example, with possibly more general uses.Boothe
Yes, I think you have to use external knowledge to constrain the fit. For example, do you have prior knowledge of whether the failure rate is constant, increasing, or decreasing? That is venturing into the realm of Bayesian estimation, I believe, and is beyond me at this point, but I am working my way there :)Chondroma
There are some interesting discussions of fitting Weibulls using other tools on this forum. But I will need time to digest them.Chondroma
Thank you, @Mike Wise! I would be really grateful if you could mention here any interesting references if you find something. I really appreciate your help.Boothe
Took another shot at it.Chondroma
Wow, I have just realized that "only the tail is Weibull" may be a very good point! Thank you. I will investigate it and your solution further within a few days.Boothe
@Mike Wise: any reason why you set t.sample <- 0:22? Using t.sample <- 1:23 and starting t.fit, s.fit from 1:23 I obtain a decent fit throughout, including the first point.Entoil
Um, t.sample <- 0:22 was what the OP posted in the original problem, so I just used that and did not change it, as I considered it one of the givens. Now I see the OP has changed the problem around and things are a bit different; the t-axis scale can vary. I need to think about that a bit.Chondroma
I have some more ideas, may get around to trying them tomorrow or this evening.Chondroma
Tried to fit two Weibulls at once to handle those first two points, but could not get convergence.Chondroma
You can get other good fits by changing the x and y scales around a bit. It would be helpful to know more about the time scale (what the origin was, etc.). If this was my project I would probably do bootstrapping on these fits to get parameter bounds and distributions.Chondroma

You can directly calculate the maximum likelihood parameters, as described here.
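
For reference, these are the standard Weibull maximum-likelihood equations with the observed frequencies used as weights (a restatement of what the code below solves, where $x_i = 1,\dots,n$ are the values and $w_i$ the frequencies):

$$\frac{1}{k} + \frac{\sum_i w_i \ln x_i}{\sum_i w_i} = \frac{\sum_i w_i\, x_i^k \ln x_i}{\sum_i w_i\, x_i^k},
\qquad
\hat\lambda = \left(\frac{\sum_i w_i\, x_i^k}{\sum_i w_i}\right)^{1/k}$$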

# Defining the error of the implicit equation for the shape parameter k
# (weights are the observed frequencies, so value i effectively counts vec[i] times;
#  using the vec argument, rather than the global `sample`, keeps the function self-contained)
k.diff <- function(k, vec){
  x2 <- seq(length(vec))
  abs(k^-1 + weighted.mean(log(x2), w = vec) -
        weighted.mean(log(x2), w = x2^k * vec))
}

# Driving the error to (almost) zero, fulfilling the equation
k <- optimize(k.diff, vec = sample, interval = c(0.1, 5), tol = 10^-7)$minimum

# Calculate the scale lambda, given k (note the 1/k power in the ML estimator)
l <- weighted.mean(seq(length(sample))^k, w = sample)^(1/k)

# Plot the empirical density of the expanded realizations against the fitted Weibull
plot(density(rep(seq(length(sample)), sample)))
x <- 1:25
lines(x, dweibull(x, shape = k, scale = l))
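
As a rough sanity check (a sketch; it assumes the MASS package is available, and f.check is just an illustrative name), the weighted estimates above should approximately agree with fitdistr applied to the expanded realizations, since weighting the log-likelihood by the counts is equivalent to repeating each value count times:

library(MASS)
f.check <- fitdistr(rep(seq(length(sample)), sample), "weibull")
f.check$estimate   # should roughly match c(shape = k, scale = l) from above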
Liftoff answered 6/5, 2015 at 7:43 Comment(3)
It does not work until after I run my code. No idea why. The error message is: k <- optimize(k.diff, vec=sample, interval=c(0.1,5), tol=10^-7)$min Error in as.double(w) : cannot coerce type 'closure' to vector of type 'double'Chondroma
I get the error message: Error in as.double(w) : cannot coerce type 'closure' to vector of type 'double'Chondroma
Hi @Liftoff, thank you for your answer! I was able to reproduce your code. I also reproduced the code for the sample with the first element removed (as in the discussion there is an opinion that the first point does not "fit" the rest, and I am inclined to these thoughts), see here. Then I compared the shapes of these density plots and it seems Mike's solution gives more appropriate results in this case. Nevertheless, thank you a lot for sharing this approach!Boothe

Assuming the data are from a Weibull distribution, you can get estimates of the shape and scale parameters like this:

library(MASS)   # for fitdistr
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)
f <- fitdistr(sample, 'weibull')
f

If you are not sure whether it is Weibull distributed, I would recommend using ks.test. This tests whether your data come from a hypothesised distribution. Given your knowledge of the nature of the data, you could test a few selected distributions and see which one works best.

For your example this would look like this:

ks <- ks.test(sample, "pweibull", shape = f$estimate[1], scale = f$estimate[2])
ks

The p-value is not significant, hence you do not reject the hypothesis that the data come from a Weibull distribution.

Update: The histograms of either the Weibull or the exponential look like a good match to your data. I think the exponential distribution gives you a better fit. The Pareto distribution is another option.

# Simulate from the fitted Weibull and inspect the histogram
f <- fitdistr(sample, 'weibull')
z <- rweibull(10000, shape = f$estimate[1], scale = f$estimate[2])
hist(z)

# Same check for an exponential fit
f <- fitdistr(sample, 'exponential')
z <- rexp(10000, f$estimate[1])
hist(z)
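
To compare candidate distributions a bit more quantitatively than by looking at histograms, one option (a sketch; it reuses the log-likelihood that fitdistr reports, and f.weib / f.exp are just illustrative names) is to compute AIC for each fit and prefer the smaller value:

# AIC = 2 * (number of parameters) - 2 * log-likelihood; smaller is better
f.weib <- fitdistr(sample, 'weibull')
f.exp  <- fitdistr(sample, 'exponential')
c(weibull     = 2*2 - 2*f.weib$loglik,
  exponential = 2*1 - 2*f.exp$loglik)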
Highwrought answered 3/5, 2015 at 9:37 Comment(8)
Hmm, I am afraid I made a mistake in acknowledging this answer as correct. The fitdistr function treats the values (here: the values from the sample vector) as realizations from distribution T (in other words: points drawn from distribution T), not as data points of a density function curve for some distribution. See that when I use the estimated shape and scale parameters to draw points from the estimated T and then plot the density for those points (which is not the case of my question), I end up with a density like this, where the x-axis values are not right.Boothe
What do you mean by: "data points of a density function curve for some distribution"? In your question you said you think it is Weibull. The pdf is for a Weibull with the estimated parameters. If you want to compare it with your graph, you need to compare it with hist(sample). Your graph above does not look like a pdf.Highwrought
Hi @Highwrought , I kindly refer you to the update I have just added to my question.Boothe
What makes you think this is Weibull distributed?Highwrought
I think only the tail is.Chondroma
Can you post further details on: 1) the untransformed series (I understand that "sample", which you posted, is the transformed data following an unknown distribution. Can you post the untransformed one, which does follow a Weibull); 2) details about the transformation you performed to get "sample", i.e. what "some normalisation" means.Highwrought
Hi @TinaW, this is the original data I have obtained (data about the number of users who continued to use a service for a particular number of time units (x-axis)). By "some normalization" I meant only the idea that after normalizing the values (number of users) we would get some weights which may somehow correspond to the density function values. It's like when you plot a histogram with the hist function in R and you may choose between the freq=TRUE or probability=TRUE parameters. Here, on the plots in my original question, the data corresponds to the freq=TRUE type and ...Boothe
... and "some normalizing" would make the values be of probability=TRUE nature. However, I do not think it is some kind of a crucial issue here :)Boothe
