model.matrix(): why do I lose control of contrast in this case
Asked Answered
G

1

2

Suppose we have a toy data frame:

x <- data.frame(x1 = gl(3, 2, labels = letters[1:3]),
                x2 = gl(3, 2, labels = LETTERS[1:3]))

I would like to construct a model matrix

#    x1b x1c x2B x2C
# 1    0   0   0   0
# 2    0   0   0   0
# 3    1   0   1   0
# 4    1   0   1   0
# 5    0   1   0   1
# 6    0   1   0   1

by:

model.matrix(~ x1 + x2 - 1, data = x,
             contrasts.arg = list(x1 = contr.treatment(letters[1:3]),
                                  x2 = contr.treatment(LETTERS[1:3])))

but actually I get:

#   x1a x1b x1c x2B x2C
# 1   1   0   0   0   0
# 2   1   0   0   0   0
# 3   0   1   0   1   0
# 4   0   1   0   1   0
# 5   0   0   1   0   1
# 6   0   0   1   0   1
# attr(,"assign")
# [1] 1 1 1 2 2
# attr(,"contrasts")
# attr(,"contrasts")$x1
#   b c
# a 0 0
# b 1 0
# c 0 1

# attr(,"contrasts")$x2
#   B C
# A 0 0
# B 1 0
# C 0 1

I am sort of confused here:

  • I have passed in explicit contrast matrix to drop first factor levels;
  • I have asked for dropping intercept.

Then why am I getting a model matrix with 5 columns? How can I get the model matrix I want?

Godard answered 1/7, 2016 at 17:13 Comment(3)
I think it's doing the right thing. You can't have rows with only zeros.Protuberant
As you say in your answer, it's always possible to create the dummy variables yourself. In this way, you can get the exact model matrix that you want. However, if you have rows that contain only zeros the matrix will be singular (in statistical parlance, you have "perfect colinearity") which means that the parameter estimates cannot be obtained. See: stats.stackexchange.com/questions/70899Protuberant
If you really want the broken model matrix, let it have an intercept and then just drop the column: model.matrix(~ x1 + x2, data = x)[, -1]. Though I can't really imagine what this would be useful for...Roseline
G
1

Whenever we lose control of something at R level, there must be some default, unchangable behaviour at C level. C code for model.matrix.default() can be found in R source package at:

R-<release_number>/src/library/stats/src/model.c

We can find the explanation here:

/* If there is no intercept we look through the factor pattern */
/* matrix and adjust the code for the first factor found so that */
/* it will be coded by dummy variables rather than contrasts. */

Let's make a small test on this, with a data frame

x <- data.frame(x1 = gl(2, 2, labels = letters[1:2]), x2 = sin(1:4))
  1. if we only have x2 on the RHS, we can drop intercept successfully:

    model.matrix(~ x2 - 1, data = x)
    #          x2
    #1  0.8414710
    #2  0.9092974
    #3  0.1411200
    #4 -0.7568025
    
  2. if we have only x1 on the RHS, contrast is not applied:

    model.matrix(~ x1 - 1, data = x)
    #  x1a x1b
    #1   1   0
    #2   1   0
    #3   0   1
    #4   0   1
    
  3. when we have both x1 and x2, contrast is not applied:

    model.matrix(~ x1 + x2 - 1, data = x)
    #  x1a x1b         x2
    #1   1   0  0.8414710
    #2   1   0  0.9092974
    #3   0   1  0.1411200
    #4   0   1 -0.7568025
    

This implies that while there is difference between:

lm(y ~ x2, data = x)
lm(y ~ x2 - 1, data = x)

there is no difference between

lm(y ~ x1, data = x)
lm(y ~ x1 - 1, data = x)

or

lm(y ~ x1 + x2, data = x)
lm(y ~ x1 + x2 - 1, data = x)

The reason for such behaviour is not to ensure numerical stability, but to ensure the sensibility of estimation / prediction. If we really drop the intercept while applying contrast to x1, we end up with a model matrix:

    #  x1b
    #1   0
    #2   0
    #3   1
    #4   1

The effect is that we constrain estimation for level a to 0.

In this post: How can I force dropping intercept or equivalent in this linear model?, we have a dataset:

#           Y    X1    X2
#1  1.8376852  TRUE  TRUE
#2 -2.1173739  TRUE FALSE
#3  1.3054450 FALSE  TRUE
#4 -0.3476706  TRUE FALSE
#5  1.3219099 FALSE  TRUE
#6  0.6781750 FALSE  TRUE

There isn't joint existence (X1 = FALSE, X2 = FALSE) in this dataset. But in broad sense, model.matrix() has to do something safe and sensible. It is biased to assume that no joint existence of two factor levels in the training dataset implies that they need not be predicted. If we really drop intercept while applying contrast, such joint existence is constrained at 0. However, the OP of that post deliberately wants such non-standard behaviour (for some reason), in which case, a possible workaround was given in my answer there.

Godard answered 9/7, 2016 at 3:6 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.