I'm trying to run lm() on only a subset of my data, and running into an issue.
dt = data.table(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = as.factor(c(rep('men',50), rep('women',50)))) # sample data
lm( y ~ ., dt) # Use all x: Works
lm( y ~ ., dt[x3 == 'men']) # Use all x, limit to men: doesn't work (as expected)
The above doesn't work because the dataset now has only men, and we therefore can't include x3, the gender variable, into the model. BUT...
lm( y ~ . -x3, dt[x3 == 'men']) # Exclude x3, limit to men: STILL doesn't work
lm( y ~ x1 + x2, dt[x3 == 'men']) # Exclude x3, with different notation: works great
This is an issue with the "minus sign" notation in the formula? Please advice. Note: Of course I can do it a different way; for example, I could exclude the variables prior to putting them into lm(). But I'm teaching a class on this stuff, and I don't want to confuse the students, having already told them they can exclude variable using a minus sign in the formula.
model.matrix(y ~ . - x3, data = dt[x3 == "men"])
andmodel.matrix(y ~ x1 + x2, data = dt[x3 == "men"])
work (lm
callsmodel.matrix
internally). The only difference between both model matrices is a"contrasts"
attribute (which still containsx3
) and which gets picked up later on within thelm
routine, likely causing the error you're seeing. So my feeling is that the issue has to do with howmodel.matrix
creates and stores the design matrix when removing terms. – Canonry.
to get a simplified formula withterms(y ~ . -x3, data=dt, simplify=TRUE)
but oddly it still retainsx3
in the variables attribute which trips uplm
– Hazardousneg.out=
option might be related. From the S help files forterms
, whereneg.out=
is implemented: flag controlling the treatment of terms entering with "-" sign. If TRUE, terms will be checked for cancellation and otherwise ignored. If FALSE, negative terms will be retained (with negative order). – Drablm
callsmodel.matrix
on a modified version of the data. At the very beginning,lm
composes and evaluates the following expression:mf <- stats::model.frame( y ~ . -x3, dt[x3=="men"], drop.unused.levels=TRUE )
. This causesx3
to become a single-level factor.model.matrix()
is then called onmf
, not the original data, resulting in the error we're observing. – Millenary-x3
in the formula should excludex3
from the dataframe, so it doesn't matter whether it's single level or not. Why it doesn't exclude it? – Holdall