`lm` summary does not display all factor levels

I am running a linear regression on a number of attributes including two categorical attributes, B and F, and I don't get a coefficient value for every factor level I have.

B has 9 levels and F has 6 levels. When I initially ran the model (with intercepts), I got 8 coefficients for B and 5 for F which I understood as the first level of each being included in the intercept.

I want to rank the levels within B and F based on their coefficients, so I added -1 after each factor to lock the intercept at 0 so that I could get coefficients for all levels.

Call:
lm(formula = dependent ~ a + B-1 + c + d + e + F-1 + g + h, data = input)

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
a     2.082e+03  1.026e+02  20.302  < 2e-16 ***
B1   -1.660e+04  9.747e+02 -17.027  < 2e-16 ***
B2   -1.681e+04  9.379e+02 -17.920  < 2e-16 ***
B3   -1.653e+04  9.254e+02 -17.858  < 2e-16 ***
B4   -1.765e+04  9.697e+02 -18.202  < 2e-16 ***
B5   -1.535e+04  1.388e+03 -11.059  < 2e-16 ***
B6   -1.677e+04  9.891e+02 -16.954  < 2e-16 ***
B7   -1.644e+04  9.694e+02 -16.961  < 2e-16 ***
B8   -1.931e+04  9.899e+02 -19.512  < 2e-16 ***
B9   -1.722e+04  9.071e+02 -18.980  < 2e-16 ***
c    -6.928e-01  6.977e-01  -0.993 0.321272    
d    -3.288e-01  2.613e+00  -0.126 0.899933    
e    -8.384e-01  1.171e+00  -0.716 0.474396    
F2    4.679e+02  2.176e+02   2.150 0.032146 *  
F3    7.753e+02  2.035e+02   3.810 0.000159 ***
F4    1.885e+02  1.689e+02   1.116 0.265046    
F5    5.194e+02  2.264e+02   2.295 0.022246 *  
F6    1.365e+03  2.334e+02   5.848 9.94e-09 ***
g     4.278e+00  7.350e+00   0.582 0.560847    
h     2.717e-02  5.100e-03   5.328 1.62e-07 ***

This worked in part: all levels of B are now displayed, but F1 still is not. Since there is no longer an intercept, I am confused about why F1 does not appear in the linear model.

Switching the order of the call so that + F - 1 precedes + B - 1 results in coefficients for all levels of F being visible, but not B1.

Does anybody know either how to display all levels of both B and F, or how to assess the relative weight of F1 compared to other levels of F from the outputs I have?

Fritzie asked 8/12, 2016 at 6:8 Comment(1)
Zheyuan Li gives a very good response. To say it simply: linear regression is sort of an orthogonal projection of your original function onto a set of simpler functions, your variables. If two (or more) variables are the same (for instance constant functions), only one is kept, and it seems like R keeps only the first one appearing. – Bread

This issue is raised over and over again, but unfortunately no satisfying answer has been written that could serve as an appropriate duplicate target. It looks like I need to write one.


Most people know this is related to "contrasts", but not everyone knows why contrasts are needed or how to understand their result. We have to look at the model matrix in order to fully digest this.

Suppose we are interested in a model with two factors: ~ f + g (numerical covariates do not matter, so I include none; the response does not appear in the model matrix, so drop it too). Consider the following reproducible example:

set.seed(0)

f <- sample(gl(3, 4, labels = letters[1:3]))
# [1] c a a b b a c b c b a c
#Levels: a b c

g <- sample(gl(3, 4, labels = LETTERS[1:3]))
# [1] A B A B C B C A C C A B
#Levels: A B C

We start with a model matrix with no contrasts at all:

X0 <- model.matrix(~ f + g, contrasts.arg = list(
                   f = contr.treatment(n = 3, contrasts = FALSE),
                   g = contr.treatment(n = 3, contrasts = FALSE)))

#   (Intercept) f1 f2 f3 g1 g2 g3
#1            1  0  0  1  1  0  0
#2            1  1  0  0  0  1  0
#3            1  1  0  0  1  0  0
#4            1  0  1  0  0  1  0
#5            1  0  1  0  0  0  1
#6            1  1  0  0  0  1  0
#7            1  0  0  1  0  0  1
#8            1  0  1  0  1  0  0
#9            1  0  0  1  0  0  1
#10           1  0  1  0  0  0  1
#11           1  1  0  0  1  0  0
#12           1  0  0  1  0  1  0

Note, we have:

unname( rowSums(X0[, c("f1", "f2", "f3")]) )
# [1] 1 1 1 1 1 1 1 1 1 1 1 1

unname( rowSums(X0[, c("g1", "g2", "g3")]) ) 
# [1] 1 1 1 1 1 1 1 1 1 1 1 1

Note that f1 + f2 + f3 = g1 + g2 + g3 = (Intercept), so the intercept column lies in both span{f1, f2, f3} and span{g1, g2, g3}. With this full specification, 2 columns are not identifiable, and X0 has column rank 1 + 3 + 3 - 2 = 5:

qr(X0)$rank
# [1] 5

So, if we fit a linear model with this X0, 2 coefficients out of 7 parameters will be NA:

y <- rnorm(12)  ## random `y` as a response
lm(y ~ X0 - 1)  ## drop intercept, as `X0` already has an intercept column

#X0(Intercept)           X0f1           X0f2           X0f3           X0g1  
#      0.32118        0.05039       -0.22184             NA       -0.92868  
#         X0g2           X0g3  
#     -0.48809             NA  
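
As a side check on the same fit, stats::alias() will report exactly which columns were treated as linear combinations of the others:

alias(lm(y ~ X0 - 1))
## the "Complete" part expresses the 2 aliased columns (here X0f3 and X0g3)
## in terms of the retained ones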

What this really implies is that we have to impose 2 linear constraints on the 7 parameters in order to get a full-rank model. It does not really matter what these 2 constraints are, but they must be linearly independent. For example, we can do any of the following:

  • drop any 2 columns from X0;
  • add two sum-to-zero constraints on the parameters, e.g. require the coefficients for f1, f2 and f3 to sum to 0, and the same for g1, g2 and g3;
  • use regularization, for example adding a ridge penalty on f and g.

Note, these three ways end up with three different solutions:

  • contrasts;
  • constrained least squares;
  • linear mixed models or penalized least squares.

The first two stay within the scope of fixed-effect modelling. With "contrasts", we reduce the number of parameters until we get a full-rank model matrix; the other two do not reduce the number of parameters, but they do reduce the effective degrees of freedom.


Now, you are certainly after the "contrasts" way. So, remember, we have to drop 2 columns. They can be (both options are illustrated below):

  • one column from f and one column from g, giving the model ~ f + g with both f and g contrasted;
  • the intercept plus one column from either f or g, giving the model ~ f + g - 1.
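
To see both options concretely with the f and g from above (assuming R's default treatment contrasts):

## Option 1: keep the intercept; one level of each factor is absorbed into it
colnames(model.matrix(~ f + g))       ## (Intercept), fb, fc, gB, gC

## Option 2: drop the intercept; only the first factor in the formula keeps
## all of its levels, the second one is still contrasted
colnames(model.matrix(~ f + g - 1))   ## fa, fb, fc, gB, gC

This mirrors the behaviour in the question: whichever factor comes second in the formula still loses one level.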

Now it should be clear that, within the framework of dropping columns, there is no way to get what you want, because you expect to drop only 1 column. The resulting model matrix would still be rank-deficient.

If you really want to have all coefficients there, use constrained least squares, or penalized regression / linear mixed models.
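
A rough sketch of those alternatives, reusing the f, g and y from above. Sum-to-zero coding is technically still a contrast, but dummy.coef() reports an effect for every level, which is often all you need for ranking; the mixed-model route (here via the lme4 package) genuinely shrinks every level instead:

## (a) sum-to-zero coding: one parameter per factor is still dropped
##     internally, but the omitted level equals minus the sum of the others,
##     and dummy.coef() reports the full set of level effects
fit_sum <- lm(y ~ f + g, contrasts = list(f = "contr.sum", g = "contr.sum"))
dummy.coef(fit_sum)

## (b) mixed model: treat each factor as a random effect so that every level
##     gets a (shrunken) estimate; with 12 observations of noise this is
##     illustrative only
# library(lme4)
# fit_mm <- lmer(y ~ 1 + (1 | f) + (1 | g))
# ranef(fit_mm)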


When we have interactions between factors, things are more complicated, but the idea is still the same. Given that this answer is already long enough, I will stop here.

Backwash answered 8/12, 2016 at 7:10 Comment(3)
Thank you Zheyuan for such a detailed explanation! I now understand why I cannot get away with just dropping one column (the intercept) using this method. I will look into the options you suggested to find the one most appropriate to my data :) – Fritzie
lm(y ~ X - 1) - supposed to be lm(y ~ X0 - 1)? – Finbar
I definitely have to bookmark this answer -- I keep not finding it when I need it ... – Harslet
