how to define your own distribution for fitdistr function in R with the help of lmomco function
Asked Answered
F

3

8

I would like to define my own distributions to use with the fitdistrplus function to fit my monthly precipitation data from now on refered as "month". I am using the “lmomco” function to help me define the distributions, but cannot manage to make it work. For example, I am defining the generalized extreme value (gev) distribution like the following:

dgev<-pdfgev   #functions which are included in lmomco
 pgev<-cdfgev
qgev<-quagev

Since "fitdistrplus" needs the argument "start", which consists of the initial parameter values for the desired distribution, I am estimating these initial values as the following:

lmom=lmoms(month,nmom=5)     #from lmomco package
para=pargev(lmom, checklmom=TRUE)

Now, I finally try using the “fitdist” function to fit “month” to the gev distribution as:

fitgev <- fitdist(month, "gev", start=para[2]) #fitdistrplus

I get an error like the one below. It does not matter which distribution I define with the help of “lmomco”, I get the same error. Could someone give me a hint on what am I doing wrong? Thanks!

fitgev <- fitdist(month, "gev", start=para[2])
[1] "Error in dgev(c(27.6, 97.9, 100.6, 107.3, 108.5, 109, 112.4, 120.9, 137.8,  : \n  unused arguments (para.xi = 196.19347977195, para.alpha = 91.9579520442104, para.kappa = -0.00762962879097294)\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in dgev(c(27.6, 97.9, 100.6, 107.3, 108.5, 109, 112.4, 120.9, 137.8, 138.4, 144.7, 156.8, 163.1, 168.9, 169.1, 171.4, 176.1, 177.1, 178.8, 178.9, 187.2, 190.2, 190.5, 190.8, 191.2, 193.1, 195.2, 198.5, 199.8, 201.7, 206.9, 213.4, 220.7, 240, 253.5, 254.5, 256.1, 256.4, 257.5, 258.3, 261.5, 263.7, 264.7, 279.1, 284.2, 313.1, 314.7, 319.4, 321.6, 328.9, 330.1, 332.2, 358.3, 366.8, 367.9, 403.5, 424.1, 425.9, 457.3, 459.7, 468, 497.1, 508.5, 547.1), para.xi = 196.19347977195, para.alpha = 91.9579520442104,     para.kappa = -0.00762962879097294): unused arguments (para.xi = 196.19347977195, para.alpha = 91.9579520442104, para.kappa = -0.00762962879097294)>
Error in fitdist(month, "gev", start = para[2]) : 
  the function mle failed to estimate the parameters, 
                with the error code 100
Fluctuate answered 27/4, 2015 at 13:53 Comment(0)
T
6

tl;dr this is fussy, and will probably always be fussy - fitting potentially unstable distributions to extremely small, noisy data sets, is just plain hard. I outline some strategies below that will get us an answer, but I don't really trust any of the answers I'm getting.

For the specific case here, @BelSmek's answer is best: evd::fgev(month) gives answers matching mle2/DEoptim below, and gives much more plausible standard error estimates. However, all of the machinations below may be useful stuff for people trying to fit parameters to distributions in general ...

fitdist is expecting a density/distribution function with named arguments, and a bunch more; we can make this work, although as I said I don't trust the answers.

library("lmomco")
library("fitdistrplus")
## reproducible:
month <- c(27.6, 97.9, 100.6, 107.3, 108.5,
              109, 112.4, 120.9, 137.8)

Setup:

lmom <- lmoms(month,nmom=5)     #from lmomco package
para <- pargev(lmom, checklmom=TRUE)

It turns out we need to redefine dgev with quite a few additional bits of plumbing to make everyone happy:

pgev <- function(q, xi, alpha, kappa) {
    if (length(q) == 0) return(numeric(0))
    r <- try(cdfgev(x = q, para = c(xi = xi, alpha = alpha, kappa = kappa)), 
           silent = TRUE)
    if (inherits(r, "try-error")) return(rep(NaN, length(q)))
    r
}
dgev <- function(x,xi,alpha,kappa, minval = 1e-8) {
    r <- pdfgev(x,list(type="gev",para=c(xi,alpha,kappa),source="pargev"))
    r[r==0] <- minval
    r
}

Possibly the most important thing here, besides changing the arguments from a vector to a list, is intercepting cases where the density function underflows to zero and replacing them by a small value. This is a hack that won't always work: the more principled approach is for the density function to compute the log-density directly (I'll try this below, although in this case it doesn't help that much).

fitgev <- fitdist(month, "gev", start=as.list(para[[2]]))

We get an answer ...

Parameters:
        estimate   Std. Error
xi    104.060486 0.0004131185
alpha  39.227041 0.0004150259
kappa   1.162644 0.0004105323

... but I don't trust this at all because the standard errors are unrealistically low (why would we think we could estimate the parameters this precisely when fitting a 3-parameter model to 9 data points ... ?)

An alternative approach uses bbmle::mle2 in conjunction with evd::dgev — the latter does have a log argument ...

## clean up
rm(dgev)
detach("package:lmomco")
## get new packages
library(evd)
library(bbmle) 

(in general it would be better to start a fresh R session here ...)

I again had to wrap the dgev function to substitute in something for impossible values (even though we're now working on the log scale, so things are somewhat more stable ...)

dgev <- function(..., log = FALSE, minval = 1e-8) {
    r <- evd::dgev(..., log = log)
    if (log) {
        r[r == -Inf] <- log(minval)
    }
    r
}
fit2 <- mle2(month ~ dgev(loc = xi, scale = alpha, shape = kappa), 
     data = data.frame(month),
     start = as.list(para[[2]]))
summary(fit2)

Note that the standard errors are now slightly more reasonable, but still surprisingly small, and that these answers are completely different from those we got from fitdistrplus.

Coefficients:
        Estimate Std. Error z value     Pr(z)    
xi    99.6720328  0.0765906 1301.36 < 2.2e-16 ***
alpha 30.7447099  0.3027090  101.57 < 2.2e-16 ***
kappa -0.7763013  0.0076273 -101.78 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

-2 log L: 82.063 

As a final brute-force approach, we'll try differential evolution

dgev_lik <- function(pars, minval = 1e-8) {
    r <- evd::dgev(month, pars[1], pars[2], pars[3], log = TRUE)
    r[r == -Inf] <- log(minval)
    -1*sum(r)
}

library(DEoptim)
set.seed(101)
d1 <- DEoptim(dgev_lik, lower = c(90, 10, -2),
        upper = c(130, 50, 2),
        control = DEoptim.control(NP = 1000, itermax = 1000))
d1$optim
$bestmem
      par1       par2       par3 
99.6299712 30.7704978 -0.7762563 

$bestval
[1] 41.03149

This is basically the same answer that mle2 got. Looking at the guts of fitgev, it claims to have a better log-likelihood than mle2 (logLik(fitgev) is -36.9, vs. -41 for mle2/DEoptim), but it appears to be computing a non-comparable value: plugging the fitgev parameters directly into our log-likelihood function gives a much worse answer (for negative loglikelihoods, higher values are worse ...)

dgev_lik(fitgev$estimate) ## 57.39
Thimblerig answered 27/4, 2015 at 14:19 Comment(5)
I just realized that when I try 'plot(fitgev)' or I try obtaining the goodness of fit information using 'gofstats(fitgev)' I get the error: Error in pgev(q = c(112.4, 168.9, 187.2, 198.5, 253.5, 263.7, 321.6, 403.5 : unused argument (q = c(112.4, 168.9, 187.2, 198.5, 253.5, 263.7, 321.6, 403.5)) I assume the definition of the 'qge' and the 'pgev' should also those additional bits, right? but anyway if I do this I keep obtaining this error. Do you know why this could be happening?Fluctuate
Hi @Fluctuate - I'm not sure what your plots are supposed to look like but I get a plot out of it if I use plot(fitgev[[1]])Genipap
Thanks for your comment @rg255. I am actually more interested on getting the goodness of fit statistics. When I run fitdist(month, "gev", start=para[[2]]) there seems to be no problem, but when I try to get the goodness of fit information --> gofstats(fitgev) I do get an error (see the error on my previous comment) . I get the same error if I try to plot, so it seems that something is not working in the fitting...Fluctuate
@Fluctuate looking at the str() and summary() of fitgev and an equivalent made using gamma distributions (fitgam <- fitdist(month, "gamma")) it appears the gev type is missing some information (standard errors etc.) - perhaps the problems lie within what the gev is not producing... but I'm out of my depth here, I'd only be able to find a solution by stumbling around in the dark :)Genipap
The Example provided above no longer works: Do you mind providing an example for gev fitting that works.Chaplain
K
1

Make sure the argument in you cumulative function has variable q: pgev(q, par1, par2) instead of pgev(x, par1, par2)

Because the error message essentially tells you that it couldn't find the variable q.

The keypoint is to use: x as pdf input ;q as cdf input ;p as inverse cdf input

For example, fitting a Gumble Distribution defined by yourself

# Data

x1 <- c(6.4,13.3,4.1,1.3,14.1,10.6,9.9,9.6,15.3,22.1,13.4,
13.2,8.4,6.3,8.9,5.2,10.9,14.4)

# Define pdf, cdf , inverse cdf

dgumbel <- function(x,a,b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b))
pgumbel <- function(q,a,b) exp(-exp((a-q)/b))
qgumbel <- function(p,a,b) a-b*log(-log(p))

# Fit with MLE
   f1c <- fitdist(x1,"gumbel",start=list(a=10,b=5))

# Goodness of Fit
   gofstat(f1c, fitnames = 'Gumbel MLE')

Reference: https://www.rdocumentation.org/packages/fitdistrplus/versions/0.2-1/topics/fitdist

Kourtneykovac answered 8/9, 2018 at 7:28 Comment(0)
C
1

here is another solution, if the example provided no longer works:

library(evd)
fitgev <- fgev(month) 

# e.g. extract log-likelihood
logLik(fitgev)[[1]]

Canvasback answered 21/9, 2022 at 11:14 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.