I have heard people talk about "modeling on the residuals" when they want to estimate some effect after an a-priori model has been fitted. For example, if we know that two variables, `var_1` and `var_2`, are correlated, we first fit a model with `var_1` and then model the effect of `var_2` afterwards. My problem is that I've never seen this done in practice.
I'm interested in the following:

1. If I'm using a `glm`, how do I account for the link function used?
2. What distribution do I choose when running a second `glm` with `var_2` as the explanatory variable? I assume this is related to 1.
3. Is this at all related to using the first model's prediction as an offset in the second model?
My attempt:

```r
library(data.table)

# I have a hypothesis that `mpg` is a function of both `cyl` and `wt`
dt <- data.table(mtcars)
dt[, cyl := as.factor(cyl)]

# I want to model `cyl` first
model <- stats::glm(mpg ~ cyl, family = Gamma(link = "log"), data = dt)
dt[, pred := stats::predict(model, type = "response", newdata = dt)]
dt[, res := mpg - pred]

# will this approach work?
model2_1 <- stats::glm(mpg ~ wt + offset(pred), family = Gamma(link = "log"), data = dt)
dt[, pred21 := stats::predict(model2_1, type = "response", newdata = dt)]

# or will this approach work?
model2_2 <- stats::glm(res ~ wt, family = gaussian(), data = dt)
dt[, pred22 := stats::predict(model2_2, type = "response", newdata = dt)]
```
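One more variant I considered, based on my (possibly wrong) understanding that `offset()` is added to the linear predictor, i.e. it lives on the link scale, so the offset should come from `predict(..., type = "link")` rather than the response-scale prediction. Self-contained version (it refits the first model so it runs on its own):

```r
library(data.table)

dt <- data.table(mtcars)
dt[, cyl := as.factor(cyl)]
model <- stats::glm(mpg ~ cyl, family = Gamma(link = "log"), data = dt)

# Pass the offset on the link (log) scale: offset() is added to the
# linear predictor, so use type = "link", not type = "response"
dt[, eta := stats::predict(model, type = "link", newdata = dt)]
model2_3 <- stats::glm(mpg ~ wt + offset(eta), family = Gamma(link = "log"), data = dt)
```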
My first suggested approach has convergence issues, but this is how my silly brain would approach this problem. Thanks for any help!