I was running a regression using categorical variables and came across this question. Here, the user wanted to add a column for each dummy. This left me quite confused because I though having long data with the column including all the dummies stored using as.factor()
was equivalent to having dummy variables.
Could someone explain the difference between the following two linear regression models?
Linear Model 1, where Month is a factor:
dt_long
Sales Period Month
1: 0.4898943 1 M1
2: 0.3097716 1 M1
3: 1.0574771 1 M1
4: 0.5121627 1 M1
5: 0.6650744 1 M1
---
8108: 0.5175480 24 M12
8109: 1.2867316 24 M12
8110: 0.6283875 24 M12
8111: 0.6287151 24 M12
8112: 0.4347708 24 M12
M1 <- lm(data = dt_long,
fomrula = Sales ~ Period + factor(Month)
Linear Model 2 where each month is an indicator variable:
dt_wide
Sales Period M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12
1: 0.4898943 1 1 0 0 0 0 0 0 0 0 0 0 0
2: 0.3097716 1 1 0 0 0 0 0 0 0 0 0 0 0
3: 1.0574771 1 1 0 0 0 0 0 0 0 0 0 0 0
4: 0.5121627 1 1 0 0 0 0 0 0 0 0 0 0 0
5: 0.6650744 1 1 0 0 0 0 0 0 0 0 0 0 0
---
8108: 0.5175480 24 0 0 0 0 0 0 0 0 0 0 0 1
8109: 1.2867316 24 0 0 0 0 0 0 0 0 0 0 0 1
8110: 0.6283875 24 0 0 0 0 0 0 0 0 0 0 0 1
8111: 0.6287151 24 0 0 0 0 0 0 0 0 0 0 0 1
8112: 0.4347708 24 0 0 0 0 0 0 0 0 0 0 0 1
M2 <- lm(data = data_wide,
formula = Sales ~ Period + M1 + M2 + M3 + ... + M11 + M12
Judging by this previously asked question, both models seem exactly the same. However, after running both models, I noticed that M1
returns 11 dummy estimators (because M1 is used as the reference level), while M2 returns 12 dummies.
Is one model better than the other? Is M1 more efficien? Can I set the reference level in M1 to make both models exactly equivalent?