R: modeling on residuals

I have heard people talk about "modeling on the residuals" when they want to estimate some effect after an a priori model has been fitted. For example, if they know that two variables, var_1 and var_2, are correlated, they first fit a model with var_1 and then model the effect of var_2 afterwards. My problem is that I've never seen this done in practice.

I'm interested in the following:

  1. If I'm using a glm, how do I account for the link function used?
  2. What distribution do I choose when running a second glm with var_2 as explanatory variable? I assume this is related to 1.
  3. Is this at all related to using the first model's prediction as an offset in the second model?

My attempt:

library(data.table)
dt <- data.table(mtcars) # I have a hypothesis that `mpg` is a function of both `cyl` and `wt`
dt[, cyl := as.factor(cyl)]
model <- stats::glm(mpg ~ cyl, family=Gamma(link="log"), data=dt) # I want to model `cyl` first
dt[, pred := stats::predict(model, type="response", newdata=dt)]
dt[, res := mpg - pred]

# will this approach work?
model2_1 <- stats::glm(mpg ~ wt + offset(pred), family=Gamma(link="log"), data=dt)
dt[, pred21 := stats::predict(model2_1, type="response", newdata=dt) ]

# or will this approach work?
model2_2 <- stats::glm(res ~ wt, family=gaussian(), data=dt)
dt[, pred22 := stats::predict(model2_2, type="response", newdata=dt) ]

My first suggested approach has convergence issues, but this is how my silly brain would approach this problem. Thanks for any help!

Serene answered 4/5, 2021 at 12:51 Comment(5)
I'm wondering whether this question is more likely to find an answer on Cross Validated, assuming that you write the post with less focus on the code and more on the validity of the approach. – Unmusical
I don't have an answer, but a similar question was asked here. One commenter (on the accepted answer) adds a note on different types of residuals, which is further covered here. I found the answer by Maverick Meerkat particularly useful. – Unmusical
@Unmusical yes, maybe you are right, that would be a good idea. I feel like I've invested so many points, though, I'm committed :D – Serene
perhaps relevant stats.stackexchange.com/questions/368369/…, besjournals.onlinelibrary.wiley.com/doi/full/10.1046/…, stats.stackexchange.com/questions/244870/… – Interlingua
@Interlingua thank you for the tips, much appreciated! I think I have to make a Cross Validated post :) – Serene

In a sense, an ANCOVA is 'modeling on the residuals'. The model for ANCOVA is y_i = grand_mean + treatment_i + b * (covariate - covariate_mean_i) + error for each treatment i. The term (covariate - covariate_mean_i) can be seen as the residuals of a model with covariate as DV and treatment as IV.
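The claim that the group-centered covariate is a set of residuals can be checked directly. The following is only a minimal sketch on mtcars, taking wt as the covariate and cyl as the treatment; the variable names cov_res, cov_centered, b_raw and b_res are made up for illustration. It also shows that, as long as the treatment stays in the model, the slope on the residualized covariate matches the slope on the raw covariate.

```r
# Residuals of the covariate regressed on the treatment factor
cov_res <- resid(lm(wt ~ factor(cyl), data = mtcars))

# Covariate minus its per-treatment mean, computed directly
cov_centered <- mtcars$wt - ave(mtcars$wt, mtcars$cyl)

# The two should agree up to floating-point error
same_values <- isTRUE(all.equal(unname(cov_res), cov_centered))

# Common-slope model with the raw covariate vs. with its residuals;
# the treatment factor is kept in both models
b_raw <- coef(lm(mpg ~ factor(cyl) + wt, data = mtcars))[["wt"]]
b_res <- coef(lm(mpg ~ factor(cyl) + cov_res, data = mtcars))[["cov_res"]]

print(same_values)
print(c(raw = b_raw, residualized = b_res))
```

Only the intercept and treatment coefficients differ between the two fits, because centering moves the point at which the treatment effects are evaluated.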

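The following regression is closely related to this ANCOVA. Note that * also fits a treatment-by-covariate interaction (a separate slope per treatment), and scale(covariate, scale = FALSE) centers on the overall mean rather than the per-treatment mean: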

lm(y ~ treatment * scale(covariate, scale = FALSE))

Which applied to the data would look like this:

lm(mpg ~ factor(cyl) * scale(wt, scale = FALSE), data = mtcars)

And can be turned into a glm similar to the one you use in your example:

glm(mpg ~ factor(cyl) * scale(wt, scale = FALSE), 
    family=Gamma(link="log"), 
    data = mtcars)
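
Regarding the offset idea in the question: an offset is added to the linear predictor, so with a log link it needs to be on the log scale. Passing the response-scale prediction as the offset is a likely (though not verified here) cause of the convergence problems. A minimal sketch of a two-stage fit under that assumption, using a plain data.frame and the made-up names d, m1, m2, eta1 and pred2:

```r
d <- mtcars
d$cyl <- factor(d$cyl)

# Stage 1: model mpg on cyl only
m1 <- glm(mpg ~ cyl, family = Gamma(link = "log"), data = d)

# Linear predictor of stage 1, i.e. log(fitted mean) under the log link
d$eta1 <- as.numeric(predict(m1, type = "link"))

# Stage 2: the stage 1 linear predictor enters as a fixed offset,
# so only the intercept and the wt coefficient are estimated
m2 <- glm(mpg ~ wt + offset(eta1), family = Gamma(link = "log"), data = d)

# Response-scale predictions; these include the offset
d$pred2 <- as.numeric(predict(m2, type = "response"))

print(m2$converged)
print(coef(m2))
```

This keeps the stage 1 coefficients fixed, which is not the same as fitting cyl and wt jointly, so the wt estimate will generally differ from the one in the joint model.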
Interrupted answered 3/6, 2021 at 13:19 Comment(0)
