Comparing all factor levels to the grand mean: can I tweak contrasts in linear model fitting to show all levels?

Asked 30/6, 2022 at 18:6 Answered 7/1, 2023 at 19:10

Solved r regression linear-regression lm categorical-data

I am trying to tweak contrast coding on a linear model where I want to know if each level of a factor is significantly different from the grand mean.

Let’s say the factor has levels "A", "B" and "C". The most common control-treatment contrasts obviously set "A" as the reference level, and compare "B" and "C" to that. This is not what I want, because level "A" does not show up in model summary.

Deviation coding also doesn’t seem to give me what I want, since it sets the contrast matrix for level "C" to [-1,-1,-1], and now this level does not show up in model summary.

set.seed(1)
y <- rnorm(6, 0, 1)
x <- factor(rep(LETTERS[1:3], each = 2))
fit <- lm(y ~ x, contrasts = list(x = contr.sum))
summary(fit)

In addition, the reported level names have changed from "A", "B" to "1" and "2".

Call:
lm(formula = y ~ x, contrasts = list(x = contr.sum))

Residuals:
     1      2      3      4      5      6 
-0.405  0.405 -1.215  1.215  0.575 -0.575 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02902    0.46809  -0.062    0.954
x1          -0.19239    0.66198  -0.291    0.790
x2           0.40885    0.66198   0.618    0.581

Residual standard error: 1.147 on 3 degrees of freedom
Multiple R-squared:  0.1129,    Adjusted R-squared:  -0.4785 
F-statistic: 0.1909 on 2 and 3 DF,  p-value: 0.8355

Am I missing something? Should I add a dummy variable that is equal to the grand mean, so that I can use this as the reference level?

I saw a similar question (but maybe more demanding) asked last year, but without solution (yet): models with 'differences from mean' for all coefficients on categorical variables; get 'contrast coding' to do it?.

The accepted answer here works, but the author has not provided an explanation. I have asked about it on the stats SE: https://stats.stackexchange.com/questions/600798/understanding-the-process-of-tweaking-contrasts-in-linear-model-fitting-to-show

Epigraph answered 30/6, 2022 at 18:6 Comment(0)

This answer shows you how to obtain the following coefficient table:

#               Estimate Std. Error     t value  Pr(>|t|)
#(Intercept) -0.02901982  0.4680937 -0.06199574 0.9544655
#A           -0.19238543  0.6619845 -0.29061922 0.7902750
#B            0.40884591  0.6619845  0.61760645 0.5805485
#C           -0.21646049  0.6619845 -0.32698723 0.7651640

Amazing, isn't it? It mimics what you see from summary(fit), i.e.,

#               Estimate Std. Error     t value  Pr(>|t|)
#(Intercept) -0.02901982  0.4680937 -0.06199574 0.9544655
#x1          -0.19238543  0.6619845 -0.29061922 0.7902750
#x2           0.40884591  0.6619845  0.61760645 0.5805485

But now we have all factor levels displayed.

Why `lm` summary does not display all factor levels?

In 2016, I answered this Stack Overflow question: `lm` summary not display all factor levels and since then, it has become the target for marking duplicated questions on similar topics.

To recap, the basic idea is that in order to have a full-rank design matrix for least squares fitting, we must apply contrasts to a factor variable. Let's say that the factor has N levels, then no matter what type of contrasts we choose (see ?contrasts for a list), it reduces the raw N dummy variables to a new set of N - 1 variables. Therefore, only N - 1 coefficients are associated with an N-level factor.

However, we can transform the N - 1 coefficients back to the original N coefficients using the contrasts matrix. The transformation enables us to obtain a coefficient table for all factor levels. I will now demonstrate how to do this, based on OP's reproducible example:

set.seed(1)
y <- rnorm(6, 0, 1)
x <- factor(rep(LETTERS[1:3], each = 2))
fit <- lm(y ~ x, contrasts = list(x = contr.sum))

In this example, the sum-to-zero contrast is applied to factor x. To know more on how to control contrasts for model fitting, see my answer at How to set contrasts for my variable in regression analysis with R?.

R code walk-through

For a factor variable of N levels subject to sum-to-zero contrasts, we can use the following function to get the N x (N - 1) transformation matrix that maps the (N - 1) coefficients estimated by lm back to the N coefficients for all levels.

ContrSumMat <- function (fctr, sparse = FALSE) {
  if (!is.factor(fctr)) stop("'fctr' is not a factor variable!")
  N <- nlevels(fctr)
  Cmat <- contr.sum(N, sparse = sparse)
  dimnames(Cmat) <- list(levels(fctr), seq_len(N - 1))
  Cmat
}

For the example 3-level factor x, this matrix is:

Cmat <- ContrSumMat(x)
#   1  2
#A  1  0
#B  0  1
#C -1 -1

The fitted model fit reports 3 - 1 = 2 coefficients for this factor. We can extract them as:

## coefficients After Contrasts
coef_ac <- coef(fit)[2:3]
#        x1         x2 
#-0.1923854  0.4088459

Therefore, the level-specific coefficients are:

## coefficients Before Contrasts
coef_bc <- (Cmat %*% coef_ac)[, 1]
#         A          B          C 
#-0.1923854  0.4088459 -0.2164605 

## note that they sum to zero as expected
sum(coef_bc)
#[1] 0

Similarly, we can get the covariance matrix after contrasts:

var_ac <- vcov(fit)[2:3, 2:3]
#           x1         x2
#x1  0.4382235 -0.2191118
#x2 -0.2191118  0.4382235

and transform it to the one before contrasts:

var_bc <- Cmat %*% var_ac %*% t(Cmat)
#           A          B          C
#A  0.4382235 -0.2191118 -0.2191118
#B -0.2191118  0.4382235 -0.2191118
#C -0.2191118 -0.2191118  0.4382235

Interpretation:

The model intercept coef(fit)[1] is the grand mean.
The computed coef_bc is the deviation of each level's mean from the grand mean.
The diagonal entries of var_bc gives the estimated variance of these deviations.

We can then proceed to compute t-statistics and p-values for these coefficients, as follows.

## standard error of point estimate `coef_bc`
std.error_bc <- sqrt(diag(var_bc))
#        A         B         C 
#0.6619845 0.6619845 0.6619845 

## t-statistics (Null Hypothesis: coef_bc = 0)
t.stats_bc <- coef_bc / std.error_bc
#         A          B          C 
#-0.2906192  0.6176065 -0.3269872 

## p-values of the t-statistics
p.value_bc <- 2 * pt(abs(t.stats_bc), df = fit$df.residual, lower.tail = FALSE)
#        A         B         C 
#0.7902750 0.5805485 0.7651640 

## construct a coefficient table that mimics `coef(summary(fit))`
stats.tab_bc <- cbind("Estimate" = coef_bc,
                      "Std. Error" = std.error_bc,
                      "t value" = t.stats_bc,
                      "Pr(>|t|)" = p.value_bc)
#    Estimate Std. Error    t value  Pr(>|t|)
#A -0.1923854  0.6619845 -0.2906192 0.7902750
#B  0.4088459  0.6619845  0.6176065 0.5805485
#C -0.2164605  0.6619845 -0.3269872 0.7651640

We can also augment it by including the result for the grand mean (i.e., the model intercept).

## extract statistics of the intercept
intercept.stats <- coef(summary(fit))[1, , drop = FALSE]
#               Estimate Std. Error     t value  Pr(>|t|)
#(Intercept) -0.02901982  0.4680937 -0.06199574 0.9544655

## augment the coefficient table
stats.tab <- rbind(intercept.stats, stats.tab_bc)
#               Estimate Std. Error     t value  Pr(>|t|)
#(Intercept) -0.02901982  0.4680937 -0.06199574 0.9544655
#A           -0.19238543  0.6619845 -0.29061922 0.7902750
#B            0.40884591  0.6619845  0.61760645 0.5805485
#C           -0.21646049  0.6619845 -0.32698723 0.7651640

We can also print this table with significance stars.

printCoefmat(stats.tab)
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.02902    0.46809 -0.0620   0.9545
#A           -0.19239    0.66199 -0.2906   0.7903
#B            0.40885    0.66199  0.6176   0.5805
#C           -0.21646    0.66199 -0.3270   0.7652

Emm? Why are there no stars? Well, in this example all p-values are very large. The stars will show up if p-values are small. Here is a convincing demo:

fake.tab <- stats.tab
fake.tab[, 4] <- fake.tab[, 4] / 100
printCoefmat(fake.tab)
#            Estimate Std. Error t value Pr(>|t|)   
#(Intercept) -0.02902    0.46809 -0.0620 0.009545 **
#A           -0.19239    0.66199 -0.2906 0.007903 **
#B            0.40885    0.66199  0.6176 0.005805 **
#C           -0.21646    0.66199 -0.3270 0.007652 **
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Oh, this is so beautiful. For the meaning of these stars, see my answer at: Interpeting R significance codes for ANOVA table?

Closing Remarks

It should be possible to write a function (or even an R package) to perform such table transformation. However, it might take great effort to make such function flexible enough, to handle:

all type of contrasts (this is easy to do);
complicated model terms, like interaction between a factor and other numeric/factor variables (this seems really involving!!).

So, I will stop here for the moment.

Miscellaneous Replies

Are the model scores that I get from the lm's summary still accurate, even though it isn't displaying all levels of the factor?

Yes, they are. lm conducts accurate least squares fitting.

In addition, the transformation of coefficient table does not affect R-squares, degree of freedom, residuals, fitted values, F-statistics, ANOVA table, etc.

Fregger answered 30/6, 2022 at 18:51 Comment(2)

This is a great answer for those who want to understand the concept of contrasts. For those who want to compute contrasts as simply as possible, the best option might be to point them to the emmeans package. – Submersed 2/7, 2022 at 19:3

Do you have a citation for this methodology, or could you summarize it in a few sentences? I need to explain it in a research paper. Thanks! – Epigraph 3/1, 2023 at 4:11

This is merely a preliminary function-ized version @Zheyuan's amazing answer. As @Zheyuan also mentions, there are limitations to this answer, including complicated models with contrasts, etc. I don't believe the function works if the factor is ordered.

The teat data:

set.seed(1)
test.df <- data.frame(y = rnorm(10, 0, 1),
                     x = factor(rep(LETTERS[1:5], each = 2)))
test.fit <- lm(y ~ x, contrasts = list(x=contr.sum), data= test.df)

The functions to return the level stats:

# Output deviation coded factor matrix, used within `DevContrStats()`
ContrSumMat <- function (fctr, sparse = FALSE) {
  if (!is.factor(fctr)) stop("'fctr' is not a factor variable!")
  N <- nlevels(fctr)
  Cmat <- contr.sum(N, sparse = sparse)
  dimnames(Cmat) <- list(levels(fctr), seq_len(N - 1))
  Cmat
}

## `dat` is the original data
## `fit.model` is the linear model created from the data
## `fctr_col` is the colum you'd like to get all factor levels from
DevContrStats <- function(dat, fit.model, fctr_col) {
  
  fctr.mtx <- ContrSumMat(factor(dat[, fctr_col]))
  
  ## coefficients After Contrasts
  coef_a_contr <- coef(fit.model)[-1]

  ## coefficients Before Contrasts
  ## coef_bc tells how each factor level deviates from the grand mean.
  coef_b_contr <- (fctr.mtx %*% coef_a_contr)[, 1]

  ## Covariance matrix after contrasts:
  var_a_contr <- vcov(fit.model)[-1,-1]

  ## Transform covariance matrix (after contrasts) to the one before contrasts:
  ## The diagonal of var_bc gives the estimated variance of factor-level deviations.
  var_b_contr <- fctr.mtx %*% var_a_contr %*% t(fctr.mtx)

  ## standard error of point estimate `coef_bc`
  std.err_b_contr <- sqrt(diag(var_b_contr))

  ## t-statistics (Null Hypothesis: coef_bc = 0)
  t.stats_b_contr <- coef_b_contr / std.err_b_contr

  ## p-values of the t-statistics
  p.value_b_contr <- 2 * pt(abs(t.stats_b_contr),
                            df = fit.model$df.residual,
                            lower.tail = FALSE)

  ## construct a coefficient table that mimics `coef(summary(fit))`
  stats.tab_b_contr <- cbind("Estimate" = coef_b_contr,
                        "Std. Error" = std.err_b_contr,
                        "t value" = t.stats_b_contr,
                        "Pr(>|t|)" = p.value_b_contr)

  ## extract statistics of the intercept, which = grand mean
  intercept.stats <- coef(summary(fit.model))[1, , drop = FALSE]

  ## augment the coefficient table with the intercept stats
  stats.tab <- rbind(intercept.stats, stats.tab_b_contr)

  ## print stats table with significance stars
  printCoefmat(stats.tab)
}


17:20:20> DevContrStats(test.df, test.fit, 'x')
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.164183   0.245068  0.6699  0.53258  
A            0.448694   0.490137  0.9154  0.40195  
B           -0.028986   0.490137 -0.0591  0.95513  
C            0.786629   0.490137  1.6049  0.16942  
D           -1.582153   0.490137 -3.2280  0.02326 *
E            0.375816   0.490137  0.7668  0.47785  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Epigraph answered 30/6, 2022 at 22:32 Comment(2)

I just added an error I'm getting from my function. If you know anything about it, I'd appreciate it. If not, please take your richly-deserved vacation :-) – Epigraph 30/6, 2022 at 22:55

Ah, the issue was that in my data DA was an ordered factor, and that messed it up! – Epigraph 1/7, 2022 at 0:54

Contrasts are linear combinations of the model parameters. Calculating them requires a bit of linear algebra. You can do it by hand as demonstrated in the other answers. Or you can use one of several packages do the math for you.

I'll use the contrast package. Read its vignette here.

library("contrast")

set.seed(1)

xlevels <- LETTERS[1:3]
y <- rnorm(6, 0, 1)
x <- factor(rep(xlevels, each = 2))

NB. You don't need to change the parametrization of the model (by setting with contrasts argument in lm). You can compute any contrast of interest with any parametrization, so why not keep the default parametrization.

Fit a linear regression with the default parametrization.

fit <- lm(y ~ x)

The default parametrization has A as the reference level.

broom::tidy(fit)
#> # A tibble: 3 × 5
#>   term        estimate std.error statistic p.value
#>   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
#> 1 (Intercept)  -0.221      0.811   -0.273    0.803
#> 2 xB            0.601      1.15     0.524    0.636
#> 3 xC           -0.0241     1.15    -0.0210   0.985

Compare each level in turn to the average of the three levels.

for (xi in xlevels) {
  print(
    paste("Contrast", xi, "with the average")
  )
  print(
    contrast(
      fit,
      list(x = xi),
      list(x = xlevels),
      type = "average"
    )
  )
}
#> [1] "Contrast A with the average"
#> lm model parameter contrast
#> 
#>     Contrast      S.E.     Lower    Upper     t df Pr(>|t|)
#> 1 -0.1923854 0.6619845 -2.299116 1.914345 -0.29  3   0.7903
#> [1] "Contrast B with the average"
#> lm model parameter contrast
#> 
#>    Contrast      S.E.     Lower    Upper    t df Pr(>|t|)
#> 1 0.4088459 0.6619845 -1.697884 2.515576 0.62  3   0.5805
#> [1] "Contrast C with the average"
#> lm model parameter contrast
#> 
#>     Contrast      S.E.     Lower   Upper     t df Pr(>|t|)
#> 1 -0.2164605 0.6619845 -2.323191 1.89027 -0.33  3   0.7652

^{Created on 2023-01-07 with reprex v2.0.2}

Submersed answered 7/1, 2023 at 19:10 Comment(0)

What about just doing a series of t-tests, each one comparing the mean of the subset to the grand mean?

lapply(levels(x), \(i) t.test(y,y[x==i]))

Guild answered 30/6, 2022 at 18:15 Comment(1)

If I do that, I won’t be able to get overall model stats or do model competitions – Epigraph 30/6, 2022 at 18:17

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Why lm summary does not display all factor levels?

R code walk-through

Closing Remarks

Miscellaneous Replies

Recommended topics

Hot tags

Why `lm` summary does not display all factor levels?