Regression for a Rate variable in R

I was tasked with developing a regression model of student enrollment in different programs. The data set is very clean, and the enrollment counts follow a Poisson distribution well. I fit models in R using both a Poisson GLM and a zero-inflated Poisson model, and the resulting residuals seemed reasonable.

However, I was then instructed to change the count of students to a "rate", calculated as students / school_population (each school has its own population). This is no longer a count variable but a proportion between 0 and 1, considered the "proportion of enrollment" in a program.

This "rate" (students/population) is no longer Poisson, but is certainly not normal either. So, I'm a bit lost as to the appropriate distribution, and subsequent model to represent it.

A log-normal distribution seems to fit this rate well; however, I have many 0 values, which a log-normal cannot accommodate, so it won't actually fit.

Any suggestions on the best form of distribution for this new parameter, and how to model it in R?

Thanks!

Kluge answered 16/4, 2013 at 20:43 Comment(2)
I think this is a case for an exposure/offset variable (en.wikipedia.org/wiki/…). It may also be a question for stats.stackexchange.com. – Anticipate
cross-posted to r-help: thread.gmane.org/gmane.comp.lang.r.general/291112 – Yahairayahata

As suggested in the comments, you could keep the Poisson model and add an offset:

glm(response~predictor1+predictor2+predictor3+ ... + offset(log(population)),
     family=poisson,data=...)
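
The offset works because the Poisson model uses a log link: log(E[count]) = log(population) + X·beta, which is the same as log(E[count]/population) = X·beta, so the coefficients describe the rate. Here is a minimal sketch on simulated data. The data frame d, its columns, and the true coefficients (-3 and 0.5) are made up for illustration and are not from the question.

```r
# Simulated data; every name and number here is illustrative only.
set.seed(1)
n <- 200
d <- data.frame(
  population = sample(200:2000, n, replace = TRUE),
  x = rnorm(n)
)
# Assumed true enrollment rate per school: exp(-3 + 0.5 * x).
d$students <- rpois(n, lambda = d$population * exp(-3 + 0.5 * d$x))

# Poisson GLM on the counts, with log(population) as an offset.
fit <- glm(students ~ x + offset(log(population)),
           family = poisson, data = d)
coef(fit)  # estimates should land near -3 and 0.5

# predict(type = "response") returns expected counts, offset included.
expected_count <- predict(fit, type = "response")

# Dividing by population gives the fitted rate.
fitted_rate <- expected_count / d$population

# Same thing: predict with population = 1, so the offset is log(1) = 0.
rate_alt <- predict(fit, newdata = transform(d, population = 1),
                    type = "response")
all.equal(fitted_rate, rate_alt)
```

Note that with an offset in a glm() formula, predicted values on the response scale are expected counts for each school's own population, not rates; divide by the population (or predict at population = 1) to get back to the rate scale.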

Or you could use a binomial GLM, either

glm(cbind(response,pop_size-response) ~ predictor1 + ... , family=binomial,
        data=...)

or

glm(response/pop_size ~ predictor1 + ... , family=binomial,
        weights=pop_size,
        data=...)

The latter form is sometimes more convenient, although it is less widely used. Be aware that switching from Poisson to binomial generally changes the link function from log to logit, so coefficients become log odds ratios rather than log rate ratios; you can use family=binomial(link="log") if you prefer to keep the log link.
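
The two binomial specifications should give the same fit. A quick check on simulated data, where the data frame d, its columns, and the true coefficients are assumptions for illustration only:

```r
# Simulated data; names and numbers are illustrative only.
set.seed(2)
n <- 150
d <- data.frame(
  pop_size = sample(100:1000, n, replace = TRUE),
  x = rnorm(n)
)
# Assumed true enrollment probability: plogis(-2 + 0.4 * x).
d$response <- rbinom(n, size = d$pop_size, prob = plogis(-2 + 0.4 * d$x))

# Form 1: two-column matrix of successes and failures.
fit_cbind <- glm(cbind(response, pop_size - response) ~ x,
                 family = binomial, data = d)

# Form 2: observed proportion, weighted by the number of trials.
fit_prop <- glm(response / pop_size ~ x,
                family = binomial, weights = pop_size, data = d)

# The coefficient estimates agree.
all.equal(coef(fit_cbind), coef(fit_prop))
```

The weights argument matters in the second form: without it, glm() has no way of knowing how many students each proportion is based on.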

Zero-inflation might be easier to model with the Poisson + offset combination, which is more commonly available than a zero-inflated binomial model. (I'm not sure whether the pscl package, the most common approach to ZIP, handles offsets, but I think it does.)
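
For what it's worth, a zero-inflated Poisson fit with an offset might look like the sketch below. It assumes the pscl package is installed and uses simulated data; the data frame d, the 20% structural-zero share, and the coefficients are all made up for illustration. The zero-inflation part here is intercept-only (the `| 1` in the formula), which is a simplification rather than a recommendation.

```r
# Requires the pscl package; the block is skipped if it is not installed.
if (requireNamespace("pscl", quietly = TRUE)) {
  # Simulated data; names and numbers are illustrative only.
  set.seed(3)
  n <- 500
  d <- data.frame(
    population = sample(200:2000, n, replace = TRUE),
    x = rnorm(n)
  )
  # Assume about 20% of rows are structural zeros (program not offered, say).
  structural_zero <- rbinom(n, 1, 0.2) == 1
  d$students <- ifelse(structural_zero, 0L,
                       rpois(n, d$population * exp(-3 + 0.5 * d$x)))

  # Count model with an offset | intercept-only zero-inflation model.
  zip_fit <- pscl::zeroinfl(students ~ x + offset(log(population)) | 1,
                            data = d, dist = "poisson")
  print(coef(zip_fit))  # count-model estimates should be near -3 and 0.5
}
```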

I think glmmADMB will do a zero-inflated binomial model, but I haven't tested it.

Yahairayahata answered 16/4, 2013 at 22:44 Comment(1)
Ben - great answer. You are correct in that the pscl package will allow an offset with the ZIP model. However, when I try to fit that with an offset, it doesn't fit as well as a model without an offset. That seems weird. Also, I don't know how the predicted values are affected. If I use the zeroinfl() function in pscl, does having an offset in the formula change the interpretation of the predicted values? – Kluge
