Difference between categorical variables (factors) and dummy variables

I was running a regression using categorical variables and came across this question. There, the user wanted to add a separate column for each dummy. This left me quite confused because I thought that having long data with all the categories stored in a single column via as.factor() was equivalent to having one dummy variable per category.

Could someone explain the difference between the following two linear regression models?

Linear Model 1, where Month is a factor:

dt_long
          Sales Period Month
   1: 0.4898943      1    M1
   2: 0.3097716      1    M1
   3: 1.0574771      1    M1
   4: 0.5121627      1    M1
   5: 0.6650744      1    M1
  ---                       
8108: 0.5175480     24   M12
8109: 1.2867316     24   M12
8110: 0.6283875     24   M12
8111: 0.6287151     24   M12
8112: 0.4347708     24   M12

M1 <- lm(data = dt_long,
         formula = Sales ~ Period + factor(Month))

Linear Model 2, where each month is an indicator variable:

    dt_wide
          Sales Period M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12
   1: 0.4898943      1  1  0  0  0  0  0  0  0  0   0   0   0
   2: 0.3097716      1  1  0  0  0  0  0  0  0  0   0   0   0
   3: 1.0574771      1  1  0  0  0  0  0  0  0  0   0   0   0
   4: 0.5121627      1  1  0  0  0  0  0  0  0  0   0   0   0
   5: 0.6650744      1  1  0  0  0  0  0  0  0  0   0   0   0
  ---                                                        
8108: 0.5175480     24  0  0  0  0  0  0  0  0  0   0   0   1
8109: 1.2867316     24  0  0  0  0  0  0  0  0  0   0   0   1
8110: 0.6283875     24  0  0  0  0  0  0  0  0  0   0   0   1
8111: 0.6287151     24  0  0  0  0  0  0  0  0  0   0   0   1
8112: 0.4347708     24  0  0  0  0  0  0  0  0  0   0   0   1

M2 <- lm(data = dt_wide,
         formula = Sales ~ Period + M1 + M2 + M3 + M4 + M5 + M6 +
           M7 + M8 + M9 + M10 + M11 + M12)

Judging by this previously asked question, both models seem exactly the same. However, after running both, I noticed that model M1 returns 11 dummy coefficients (because month M1 is used as the reference level), while M2 returns 12.

Is one model better than the other? Is M1 more efficient? Can I set the reference level in M1 to make both models exactly equivalent?

Trowbridge answered 1/2, 2019 at 0:9 Comment(0)

Defining a model as in M1 is just a shortcut for including dummy variables: if you wanted to compute those regression coefficients by hand, the categories would clearly have to be converted to numeric columns.
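
You can see the expansion lm performs internally with model.matrix(). A minimal sketch on a toy factor (the data here is made up purely for illustration):

## lm builds this design matrix behind the scenes:
## one indicator column per level, minus the reference level
Month <- factor(c("M1", "M1", "M2", "M3"))
model.matrix(~ Month)

#   (Intercept) MonthM2 MonthM3
# 1           1       0       0
# 2           1       0       0
# 3           1       1       0
# 4           1       0       1
# (contrasts attributes omitted)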

Now, something you perhaps didn't notice about M2 is that one of the dummies gets an NA coefficient. That is because you manually included all twelve dummies and kept the intercept, which produces perfect collinearity. Dropping one of the dummies, or adding -1 to the formula to eliminate the constant term, fixes the problem.

Some examples. Let

y <- rnorm(100)            # response (no seed set, so the numbers below won't reproduce exactly)
x0 <- rep(1:0, each = 50)  # indicator of the first group
x1 <- rep(0:1, each = 50)  # indicator of the second group (x0 + x1 is always 1)
x <- factor(x1)            # the same information encoded as a factor

In this way, x0 and x1 are a decomposition of x: they sum to 1 for every observation. Then

## Too much
lm(y ~ x0 + x1)

# Call:
# lm(formula = y ~ x0 + x1)

# Coefficients:
# (Intercept)           x0           x1  
#    -0.15044      0.07561           NA  

## One way to fix it
lm(y ~ x0 + x1 - 1)

# Call:
# lm(formula = y ~ x0 + x1 - 1)

# Coefficients:
#       x0        x1  
# -0.07483  -0.15044  

## Another one
lm(y ~ x1)

# Call:
# lm(formula = y ~ x1)

# Coefficients:
# (Intercept)           x1  
#    -0.07483     -0.07561  

## The same results
lm(y ~ x)

# Call:
# lm(formula = y ~ x)

# Coefficients:
# (Intercept)           x1  
#    -0.07483     -0.07561  
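
The rank deficiency can also be checked directly on the design matrix. A small sketch reusing x0 and x1 from above:

## The intercept column equals the sum of the two dummies,
## so the design matrix has rank 2 rather than 3
X <- cbind(Intercept = 1, x0 = x0, x1 = x1)
all(X[, "x0"] + X[, "x1"] == X[, "Intercept"])
# [1] TRUE
qr(X)$rank
# [1] 2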

Ultimately, all these models contain the same amount of information, but under perfect multicollinearity we face an identification problem.
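
One way to convince yourself of that: the parameterizations differ, but all of these models produce identical fitted values. A quick check, continuing with the same y, x0, x1 and x:

## All four parameterizations give the same fit
f1 <- fitted(lm(y ~ x0 + x1))      # the NA coefficient does not affect the fit
f2 <- fitted(lm(y ~ x0 + x1 - 1))
f3 <- fitted(lm(y ~ x1))
f4 <- fitted(lm(y ~ x))
all.equal(f1, f2)
# [1] TRUE
all.equal(f2, f3)
# [1] TRUE
all.equal(f3, f4)
# [1] TRUE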

Broil answered 1/2, 2019 at 0:16 Comment(3)
This is great. Thank you! Could you explain how you spotted the multicollinearity though? I understand that if I include one dummy for each month, then I basically have a column that's entirely filled with 1's, except it's spread (or decomposed, as you said) across 12 columns. I don't see why removing the intercept would fix this though. (Trowbridge)
@ArturoSbr, multicollinearity appears when one cannot distinguish between some variables. In this case, one such variable is the constant term, while the other is the sum of all the month dummies which, as you said, always equals one. So lm doesn't know how to uniquely distribute the contribution between the constant term and the sum of months (there are infinitely many ways to do it), and hence between the individual months. That's a classical case, just like with four seasons or seven days of the week. (Broil)
@Julius Vainora Thanks! So when writing this down with pen and paper, I would have a vector full of 1's to be multiplied by my constant, as well as a decomposed vector full of 1's to be multiplied by the dummy betas, which gives me perfect collinearity. Correct? (Trowbridge)

  1. Improper dummy coding.

When you change a categorical variable into dummy variables, you will have one fewer dummy variable than you had categories. That’s because the last category is already indicated by having a 0 on all other dummy variables. Including the last category just adds redundant information, resulting in multicollinearity. So always check your dummy coding if it seems you’ve got a multicollinearity problem.
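
R's treatment coding does this dropping automatically when you use a factor, and relevel() lets you choose which category serves as the reference. A minimal sketch (the factor m is made up for illustration):

## Default treatment contrasts drop the first level
m <- factor(c("M1", "M2", "M3"))
colnames(model.matrix(~ m))
# [1] "(Intercept)" "mM2"         "mM3"

## Pick a different reference level
m2 <- relevel(m, ref = "M3")
colnames(model.matrix(~ m2))
# [1] "(Intercept)" "m2M1"        "m2M2"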

Mirna answered 25/12, 2019 at 19:53 Comment(0)
