Is there a difference between gamma hurdle (two-part) models and zero-inflated gamma models?
Asked Answered
E

1

6

I have semicontinuous data (many exact zeros and continuous positive outcomes) that I am trying to model. I have largely learned about modeling data with substantial zero mass from Zuur and Ieno's Beginner's Guide to Zero-Inflated Models in R, which makes a distinction between zero-inflated gamma models and what they call "zero-altered" gamma models, which they describe as hurdle models that combine a binomial component for the zeros and a gamma component for the positive continuous outcome. I have been exploring the use of the ziGamma option in the glmmTMB package and comparing the resulting coefficients to a hurdle model that I built following the instructions in Zuur's book (pages 128-129), and they do not coincide. I'm having trouble understanding why not, as I know that the gamma distribution cannot take on the value of zero, so I suppose every zero-inflated gamma model is technically a hurdle model. Can anyone illuminate this for me? See more comments about the models below the code.

library(tidyverse)
library(boot)
library(glmmTMB)
library(parameters)

### DATA

id <- rep(1:75000)
age <- sample(18:88, 75000, replace = TRUE)
gender <- sample(0:1, 75000, replace = TRUE)
cost <- c(rep(0, 30000), rgamma(n = 37500, shape = 5000, rate = 1), 
          sample(1:1000000, 7500, replace = TRUE))
disease <- sample(0:1, 75000, replace = TRUE)
time <- sample(30:3287, 75000, replace = TRUE)

df <- data.frame(cbind(id, disease, age, gender, cost, time))

# create binary variable for non-zero costs

df <- df %>% mutate(cost_binary = ifelse(cost > 0, 1, 0))

### HURDLE MODEL (MY VERSION)

# gamma component

hurdle_gamma <- glm(cost ~ disease + gender + age + offset(log(time)), 
                    data = subset(df, cost > 0),
                    family = Gamma(link = "log"))

model_parameters(hurdle_gamma, exponentiate = T)

# binomial component

hurdle_binomial <-  glm(cost_binary ~ disease + gender + age + time, 
                        data = df, family = "binomial")

model_parameters(hurdle_binomial, exponentiate = T)

# predicted probability of use

df$prob_use <- predict(hurdle_binomial, type = "response")

# predicted mean cost for people with any cost

df_bin <- subset(df, cost_binary == 1)

df_bin$cost_gamma <- predict(hurdle_gamma, type = "response")

# combine data frames

df2 <- left_join(df, select(df_bin, c(id, cost_gamma)), by = "id")

# replace NA with 0

df2$cost_gamma <- ifelse(is.na(df2$cost_gamma), 0, df2$cost_gamma)

# calculate predicted cost for everyone

df2 <- df2 %>% mutate(cost_pred = prob_use * cost_gamma)

# mean predicted cost

mean(df2$cost_pred)

### glmmTMB with ziGamma

zigamma_model <- glmmTMB(cost ~ disease + gender + age + offset(log(time)),
                         family = ziGamma(link = "log"),
                         ziformula = ~ disease + gender + age + time,
                         data = df)

model_parameters(zigamma_model, exponentiate = T)

df <- df %>% predict(zigamma_model, new data = df, type = "response") # doesn't work
# "no applicable method for "predict" applied to an object of class "data.frame"

The coefficients from the gamma component of my hurdle model and the fixed effects components of the zigamma model are the same, but the SEs are different, which in my actual data has substantial implications for the significance of my predictor of interest. The coefficients on the zero-inflated model are different, and I also noticed that the z values in the binomial component are the negative inverse of those in my binomial model. I assume this has to do with my binomial model modeling the probability of presence (1 is a success) and glmmTMB presumably modeling the probability of absence (0 is a success)?

In sum, can anyone point out what I am doing wrong with the glmmTMB ziGamma model?

Extrados answered 16/1, 2021 at 0:8 Comment(8)
You could probably get this question reopened (if you want) by rephrasing as solving a problem rather than requesting a package. I think your question should be on-topic, but the consensus is that it isn't (see this meta question and answers)Asthmatic
I have edited it per your suggestion, which I hope will be sufficient for it to be re-opened.Extrados
Will work on this tomorrow. I think you're exactly right about the sign reversal (glmmTMB is predicting the probability of a zero rather than of a non-zero value). Can you say what kind of predictions you want to make (see glmmTMB::predict.glmmTMB) ?Asthmatic
I appreciated the discussion you linked to. I didn’t think my question was inappropriate, because I am not the most savvy navigator of the R universe and know there’s a lot out there that I don’t know about that others easily do - like your direction to the glmmTMB package. I had searched all over the internet for resources on gamma hurdle models and had come up with next to nothing, and posting here led me to your recommendation. I work hard to be thoughtful about my questions and feel like Stack Overflow often has an itchy trigger finger for abruptly closing questions without much feedback.Extrados
Sorry, your question wasn't inappropriate, it was just considered off-topic. As suggested in that discussion, I disagree with the consensus here, but feel bound to respect it. I think your question is probably re-openable now ...Asthmatic
Are your model and the version according to Zuur identical? If so, it might make this easier to read if you skip Zuur's ...Asthmatic
Yes, they are. Edited to remove Zuur's, which I had included just to bolster (my) confidence that I had done the hurdle model correctly, at least by their lights. Re: prediction, what I'm trying to do is to predict the cost for each person in my original data (pi * mu). I read the glmmTMB::predict overview but am not sure what I'm doing wrongExtrados
Let us continue this discussion in chat.Extrados
A
6

The glmmTMB package can do this:

glmmTMB(formula, family=ziGamma(link="log"), ziformula=~1, data= ...)

ought to do it. Maybe something in VGAM as well?


To answer the questions about coefficients and standard errors:

  • the change in sign of the binomial coefficients is exactly what you suspected (the difference between estimating the probability of 0 [glmmTMB] vs the probability of not-zero [your/Zuur's code])
  • The standard errors on the binomial part of the model are close but not identical: using broom.mixed::tidy,
round(1-abs(tidy(hurdle_g,component="zi")$statistic)/
      abs(tidy(hurdle_binomial)$statistic),3)
## [1] 0.057 0.001 0.000 0.000 0.295

6% for the intercept, up to 30% for the effect of age ...

  • the nearly twofold difference in the standard errors of the conditional (cost>0) component is definitely puzzling me; it holds up if we simply implement the Gamma/log-link in glmmTMB vs glm. It's hard to know how to check which is right/what the gold standard should be for this case. I might distrust Wald p-values in this case and try to get p-values with the likelihood ratio test instead (via drop1).

In this case the model is badly misspecified (i.e. the cost is uniformly distributed, nothing like Gamma); I wonder if that could be making things harder/worse?

More recent versions of glmmTMB can also use a Tweedie distribution for the conditional distribution of the response; this distribution allows for a combination of zeros and continuous positive values.

Asthmatic answered 16/1, 2021 at 0:18 Comment(6)
Thanks, @Ben Bolker! I have used VGAM for non-hurdle zero-truncated negative binomial models, but unless I've missed something more recent, it does not accommodate gamma hurdle models. Is there any more detailed guidance on using glmmTMB with zigamma beyond cran.r-project.org/web/packages/glmmTMB/vignettes/glmmTMB.pdf? I'm not 100% following the example of fitting the hurdle model as opposed to a zero-inflated model with a gamma distribution.Extrados
Maybe you can clarify: as far as I know the "zero-inflated model with gamma distribution" is the same as the hurdle model. (You could check the results of your hurdle code against glmmTMB results ...)Asthmatic
I have typically heard hurdle models described as "two-part" or "zero-altered." "Zero-inflated" to me means a model that does not consider the zeros and the continuous or count outcomes to come from two different processes.Extrados
When I compared my hurdle model to glmmTMB with ziGamma, the coefficient for my predictor of interest is the same in the fixed-effects model from glmmTMB and the gamma component of my hurdle model, but the standard error is twice as large in my hurdle model (enough to change the statistical significance of the coefficient). The coefficients from the glmmTMB zero-inflated model and the binomial part of my hurdle model are very different: one gives an exponentiated coefficient <1, and the other gives a coefficient >1. I don't understand how to parse this to figure out which model to use.Extrados
terminology problem, perhaps. Technically, since the Gamma distribution has no probability of producing a zero outcome (not only is Prob(0)=0, which it is for any continuous distribution, but the probability density is also zero if the shape parameter is > 1 (and infinite if shape < 1 ...). Can you edit your question to show the results of both approaches? (The zero "inflation" parameter in glmmTMB is fitted on the logit scale, in case that helps ...)Asthmatic
I updated the code to make the outcome a gamma-distributed variable -- thank you for pointing that out. It does not fix the differences between the two approaches, unfortunately.Extrados

© 2022 - 2024 — McMap. All rights reserved.