lm: 1 coefficient not defined because of singularities while no major collinearity issues
I have seen similar posts, like this one, which say that the message Coefficients: (1 not defined because of singularities) is caused by nearly perfect correlation among the predictors used in the lm() call.

In my case, however, there is no nearly perfect pairwise correlation among the predictors, yet one coefficient (X_wthn_outcome) still comes back as NA in the lm() output.

What, then, is wrong with the coefficient that returns NA?

For reproducibility, the exact data and code are provided below.

library(dplyr)

set.seed(132)
(data <- expand.grid(study = 1:1e3, outcome = rep(1:50,2)))
data$X <- rnorm(nrow(data))
e <- rnorm(nrow(data), 0, 2)
data$yi <- .8 +.6*data$X + e

dat <- data %>% 
  group_by(study) %>% 
  mutate(X_btw_study = mean(X), X_wthn_study = X-X_btw_study) %>%
  group_by(outcome, .add = TRUE) %>%
  mutate(X_btw_outcome = mean(X), X_wthn_outcome = X-X_btw_outcome) %>% ungroup()
  
round(cor(select(dat,-study,-outcome,-X,-yi)),3)

#               X_btw_study X_wthn_study X_btw_outcome X_wthn_outcome
#X_btw_study          1.000        0.000         0.141           0.00
#X_wthn_study         0.000        1.000         0.698           0.71
#X_btw_outcome        0.141        0.698         1.000           0.00
#X_wthn_outcome       0.000        0.710         0.000           1.00

summary(lm(yi ~ 0 + X_btw_study + X_btw_outcome + X_wthn_study
   + X_wthn_outcome, data = dat))

#Coefficients: (1 not defined because of singularities)
#               Estimate Std. Error t value Pr(>|t|)    
#X_btw_study    0.524093   0.069610   7.529 5.15e-14 ***
#X_btw_outcome  0.014557   0.013694   1.063    0.288    
#X_wthn_study   0.589517   0.009649  61.096  < 2e-16 ***
#X_wthn_outcome       NA         NA      NA       NA  ## What's wrong with this variable
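One diagnostic that makes the problem visible even when the pairwise correlations look innocuous is the rank of the model matrix. The sketch below rebuilds a smaller version of the data in base R (ave() standing in for the dplyr pipeline, so the block is self-contained):

```r
# smaller base-R reconstruction of the data (ave() instead of dplyr)
set.seed(132)
d <- expand.grid(study = 1:20, outcome = rep(1:5, 2))
d$X <- rnorm(nrow(d))
d$X_btw_study    <- ave(d$X, d$study)             # study means
d$X_wthn_study   <- d$X - d$X_btw_study
d$X_btw_outcome  <- ave(d$X, d$study, d$outcome)  # study-by-outcome means
d$X_wthn_outcome <- d$X - d$X_btw_outcome

M <- as.matrix(d[, c("X_btw_study", "X_btw_outcome",
                     "X_wthn_study", "X_wthn_outcome")])
qr(M)$rank  # 3, not 4: one column is a linear combination of the others
```

A rank of 3 for four columns means one predictor is redundant, which is exactly why lm() reports one coefficient as not defined.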
Saprolite answered 29/8, 2021 at 18:36 Comment(2)
You say there was "no nearly perfect correlation among predictors", but two-way collinearity is not the same as multi-collinearity. If you have a lot of categorical predictors (or a few categorical ones with a small sample size), it's not that hard to get "perfect multicollinearity" without any two-way correlations that look worrisome.Illdefined
@IRTFM, I have no categorical predictors. But can you possibly show what you exactly mean by multi-collinearity in my data above?Saprolite
You have constructed a problem where the three-way combination of X_btw_study + X_btw_outcome + X_wthn_study perfectly predicts X_wthn_outcome:

lm(X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study , data = dat)
#------------------
Call:
lm(formula = X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study, 
    data = dat)

Coefficients:
  (Intercept)    X_btw_study  X_btw_outcome   X_wthn_study  
    1.165e-17      1.000e+00     -1.000e+00      1.000e+00  
#--------------
summary( lm(X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study , data = dat) )

#---------------
Call:
lm(formula = X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study, 
    data = dat)

Residuals:
       Min         1Q     Median         3Q        Max 
-3.901e-14 -6.000e-17  0.000e+00  5.000e-17  3.195e-13 

Coefficients:
                Estimate Std. Error    t value Pr(>|t|)    
(Intercept)    1.165e-17  3.242e-18  3.594e+00 0.000326 ***
X_btw_study    1.000e+00  3.312e-17  3.020e+16  < 2e-16 ***
X_btw_outcome -1.000e+00  6.515e-18 -1.535e+17  < 2e-16 ***
X_wthn_study   1.000e+00  4.590e-18  2.178e+17  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.025e-15 on 99996 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.582e+34 on 3 and 99996 DF,  p-value: < 2.2e-16

You have an adjusted R^2 of 1 with three predictors. So MULTI-collinearity, but not two-way collinearity. (R has caught on to your tricks and won't let you get away with this dplyr game of "hide the dependencies".) I think the dependencies might have been more apparent if you had built the variables in a sequence of separate steps rather than in a piped chain.
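The same conclusion can be read directly off the fitted model with alias(), which reports the exact linear combination behind the NA. (Again a base-R sketch of the data, with dat2 standing in for the original dat built with dplyr.)

```r
# base-R reconstruction of the data and the singular fit
set.seed(132)
dat2 <- expand.grid(study = 1:20, outcome = rep(1:5, 2))
dat2$X  <- rnorm(nrow(dat2))
dat2$yi <- 0.8 + 0.6 * dat2$X + rnorm(nrow(dat2), 0, 2)
dat2$X_btw_study    <- ave(dat2$X, dat2$study)
dat2$X_wthn_study   <- dat2$X - dat2$X_btw_study
dat2$X_btw_outcome  <- ave(dat2$X, dat2$study, dat2$outcome)
dat2$X_wthn_outcome <- dat2$X - dat2$X_btw_outcome

fit <- lm(yi ~ 0 + X_btw_study + X_btw_outcome + X_wthn_study + X_wthn_outcome,
          data = dat2)
alias(fit)$Complete
# the aliased term X_wthn_outcome equals X_btw_study - X_btw_outcome + X_wthn_study
```

alias() is the standard way to ask R which term it dropped and what combination of the retained terms reproduces it, matching the coefficients (1, -1, 1) found in the regression above.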

Illdefined answered 29/8, 2021 at 19:0 Comment(11)
Aha! +1. But why so? Why does X_btw_study + X_btw_outcome + X_wthn_study fully predict/overlap with X_wthn_outcome?Saprolite
Because of your sequence of steps in building those variables. You really only had three independent variables, yi, e and X, and everything else was built from them using means and differences.Illdefined
Sure, but I thought I broke X into four non-overlapping parts.Saprolite
So? That didn't remove the row-wise dependencies. Every single row had that problem.Illdefined
row-wise dependencies?Saprolite
Every single row had X_btw_study + X_btw_outcome + X_wthn_study == X_wthn_outcome. I don't read dplyr chains very well so I couldn't see it in your code but I think that you tricked yourself by not making separate lines for your variable creation.Illdefined
But if I do: with(dat, X_btw_study + X_btw_outcome + X_wthn_study) I don't get X_wthn_outcome?Saprolite
OK, I guess my assertion wasn't correct, but the R^2 for those three variables X_btw_study + X_btw_outcome + X_wthn_study predicting X_wthn_outcome is still 1.000, so there is a nearly perfect joint dependency. And as I said, I have a lot of trouble reading dplyr code, so I can't tell exactly what you did. (And it appears you can't either.)Illdefined
Here is what I did: For rows within each study, I took mean of X. Subtracted that mean from X values within each study. For rows within each study, I took the mean of X values for rows with the same outcome value. Subtracted those means from their respective X value. If that is confusing let me know.Saprolite
You have ended up with two variables with a sum of zero and two with the same sum: with(dat, sum(X_wthn_outcome)) is 2.834885e-14, with(dat, sum(X_wthn_study)) is -4.184847e-14, with(dat, sum(X_btw_outcome)) is -12.19032, and with(dat, sum(X_btw_study)) is -12.19032.Illdefined
Hmmm! So essentially sum(X_btw_outcome) == sum(X_btw_study), and on the other hand sum(X_wthn_outcome) and sum(X_wthn_study) are both effectively zero. Which also means that to recreate the original X, either within(dat, X_recreated <- X_wthn_outcome + X_btw_outcome) or within(dat, X_recreated <- X_wthn_study + X_btw_study) is enough.Saprolite
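The reconstruction claim in the last comment checks out numerically: by construction each between/within pair sums back to X, and subtracting the two decompositions gives the exact row-wise dependency that made the fit singular. (Sketch using the same base-R rebuild of the data.)

```r
# base-R rebuild of the decomposed data
set.seed(132)
d <- expand.grid(study = 1:20, outcome = rep(1:5, 2))
d$X <- rnorm(nrow(d))
d$X_btw_study    <- ave(d$X, d$study)
d$X_wthn_study   <- d$X - d$X_btw_study
d$X_btw_outcome  <- ave(d$X, d$study, d$outcome)
d$X_wthn_outcome <- d$X - d$X_btw_outcome

# both decompositions recreate X, which in turn yields the exact dependency
# X_wthn_outcome = X_btw_study + X_wthn_study - X_btw_outcome on every row
max(abs(d$X - (d$X_btw_study  + d$X_wthn_study)))
max(abs(d$X - (d$X_btw_outcome + d$X_wthn_outcome)))
max(abs(d$X_wthn_outcome -
        (d$X_btw_study + d$X_wthn_study - d$X_btw_outcome)))
```

All three maxima are zero up to floating-point noise, confirming that the four derived predictors carry only three columns' worth of information.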
