Dropping variable in lm formula still triggers contrast error
I'm trying to run lm() on only a subset of my data, and running into an issue.

library(data.table)
dt = data.table(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = as.factor(c(rep('men', 50), rep('women', 50)))) # sample data

lm( y ~ ., dt) # Use all x: Works
lm( y ~ ., dt[x3 == 'men']) # Use all x, limit to men: doesn't work (as expected)

The above doesn't work because the subset contains only men, so x3, the gender variable, has a single level and can't be included in the model. BUT...

lm( y ~ . -x3, dt[x3 == 'men']) # Exclude x3, limit to men: STILL doesn't work
lm( y ~ x1 + x2, dt[x3 == 'men']) # Exclude x3, with different notation: works great

Is this an issue with the "minus sign" notation in the formula? Please advise. Note: of course I could do it a different way; for example, I could exclude the variables before passing the data to lm(). But I'm teaching a class on this material, and I don't want to confuse the students, having already told them they can exclude variables using a minus sign in the formula.
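For reference, both failing calls stop with R's contrasts error; a minimal sketch reproducing it (assuming `data.table` is installed):

```r
library(data.table)
set.seed(1)  # only for reproducible sample data
dt <- data.table(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
                 x3 = as.factor(c(rep("men", 50), rep("women", 50))))

# Both of these raise:
#   contrasts can be applied only to factors with 2 or more levels
res1 <- try(lm(y ~ ., dt[x3 == "men"]), silent = TRUE)
res2 <- try(lm(y ~ . - x3, dt[x3 == "men"]), silent = TRUE)
inherits(res1, "try-error")  # TRUE
inherits(res2, "try-error")  # TRUE

# The explicit formula succeeds on the same subset:
fit <- lm(y ~ x1 + x2, dt[x3 == "men"])
```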

Charqui answered 12/2, 2020 at 23:0 Comment(5)
It's interesting that both model.matrix(y ~ . - x3, data = dt[x3 == "men"]) and model.matrix(y ~ x1 + x2, data = dt[x3 == "men"]) work (lm calls model.matrix internally). The only difference between the two model matrices is a "contrasts" attribute (which still contains x3), which gets picked up later in the lm routine and is likely causing the error you're seeing. So my feeling is that the issue has to do with how model.matrix creates and stores the design matrix when removing terms.Canonry
I was trying to "expand" the . to get a simplified formula with terms(y ~ . - x3, data = dt, simplify = TRUE), but oddly it still retains x3 in the variables attribute, which trips up lm.Hazardous
@Hazardous - it looks like the unimplemented-in-R neg.out= option might be related. From the S help files for terms, where neg.out= is implemented: flag controlling the treatment of terms entering with "-" sign. If TRUE, terms will be checked for cancellation and otherwise ignored. If FALSE, negative terms will be retained (with negative order).Drab
@MauritsEvers: lm calls model.matrix on a modified version of the data. At the very beginning, lm composes and evaluates the following expression: mf <- stats::model.frame( y ~ . -x3, dt[x3=="men"], drop.unused.levels=TRUE ). This causes x3 to become a single-level factor. model.matrix() is then called on mf, not the original data, resulting in the error we're observing.Millenary
@ArtemSokolov but the -x3 in the formula should exclude x3 from the data frame, so it shouldn't matter whether it's single-level or not. Why doesn't it exclude it?Holdall
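The claims in the comments above can be checked directly: `terms()` keeps x3 in the `variables` attribute even with `- x3` (while correctly dropping it from `term.labels`), so `model.frame()` still builds an x3 column, and `drop.unused.levels = TRUE` reduces it to a single level. A sketch, assuming `data.table` is installed:

```r
library(data.table)
dt <- data.table(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
                 x3 = as.factor(c(rep("men", 50), rep("women", 50))))

tt <- terms(y ~ . - x3, data = dt)
vars <- attr(tt, "variables")     # list(y, x1, x2, x3): x3 is still listed
labs <- attr(tt, "term.labels")   # "x1" "x2": x3 is dropped from the terms

# Reproduce lm()'s internal model.frame call:
mf <- stats::model.frame(y ~ . - x3, dt[x3 == "men"],
                         drop.unused.levels = TRUE)
nlev <- nlevels(mf$x3)            # 1: the single-level factor that trips
                                  # up the later model.matrix() call
```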
The error you are getting occurs because x3 remains in the model frame as a factor with the single level "men" (see the comment below from @Artem Sokolov).

One way to solve it is to subset ahead of time:

library(data.table)
dt = data.table(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = as.factor(c(rep('men', 50), rep('women', 50)))) # sample data

dmen <- dt[x3 == 'men'] # create a new subsetted dataset with just men

lm( y ~ ., dmen[,-"x3"]) # now drop the x3 column from the dataset (just for the model)

Or you can do both in the same step:

lm( y ~ ., dt[x3 == 'men',-"x3"])
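If you want to keep teaching programmatic variable exclusion, a sketch of an alternative is to build the formula explicitly with base R's reformulate(), so the excluded column never appears in the formula's variables and never reaches model.frame() (assumes the dt defined above):

```r
# Construct "y ~ x1 + x2" programmatically instead of writing "y ~ . - x3"
preds <- setdiff(names(dt), c("y", "x3"))
f <- reformulate(preds, response = "y")  # y ~ x1 + x2
fit <- lm(f, dt[x3 == "men"])            # works: x3 never enters the model frame
```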
Byway answered 12/2, 2020 at 23:33 Comment(2)
Overall, this is a nice solution. One thing to correct is that -x3 in a formula does not cause lm to think that you're trying to subtract the column. The "don't use x3 in the model" intent is communicated correctly, but the issue is that lm calls model.frame( ..., drop.unused.levels=TRUE ) causing x3 to become a single-level factor, leading to downstream problems in model.matrix().Millenary
Thanks for the clarification, Artem Sokolov; I have taken that incorrect explanation out of my answer.Byway
