R: Error in contrasts when fitting linear models with `lm`
I found the question "Error in contrasts when defining a linear model in R" and followed the suggestions there, but none of my factor variables takes only one value, and I still get the same error.

This is the dataset I'm using: https://www.dropbox.com/s/em7xphbeaxykgla/train.csv?dl=0.

This is the code I'm trying to run:

simplelm <- lm(log_SalePrice ~ ., data = train)

#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels

What is the issue?

Penuchle answered 19/5, 2018 at 0:20 Comment(3)
What makes you think none of your factors only have one value? I don't want to download, import, and inspect your data set, but could you post the output of sapply(train[!sapply(train, is.numeric)], function(x) length(unique(x)))? - Soke
Glancing at your data, both the Utilities and the PoolQC columns look pretty 1-level (didn't scroll very much though...) - Soke
I posted the verifiably correct answer 12 minutes after question was asked, but 4 years later some anonymous idiot decided to mysteriously downvote it, so I deleted the solution. Here it is: pastebin.com/8M05yt6V - Ester

Thanks for providing your dataset (I hope the link stays valid so that everyone can access it). I read it into a data frame called train.

Using the debug_contr_error, debug_contr_error2 and NA_preproc helper functions provided in the answer to 'How to debug "contrasts can be applied only to factors with 2 or more levels" error?', we can easily analyze the problem.

info <- debug_contr_error2(log_SalePrice ~ ., train)

## the data frame that is actually used by `lm`
dat <- info$mf

## number of cases in your dataset
nrow(train)
#[1] 1460

## number of complete cases used by `lm`
nrow(dat)
#[1] 1112

## number of levels for all factor variables in `dat`
info$nlevels
#     MSZoning        Street         Alley      LotShape   LandContour 
#            4             2             3             4             4 
#    Utilities     LotConfig     LandSlope  Neighborhood    Condition1 
#            1             5             3            25             9 
#   Condition2      BldgType    HouseStyle     RoofStyle      RoofMatl 
#            6             5             8             5             7 
#  Exterior1st   Exterior2nd    MasVnrType     ExterQual     ExterCond 
#           14            16             4             4             4 
#   Foundation      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1 
#            6             5             5             5             7 
# BsmtFinType2       Heating     HeatingQC    CentralAir    Electrical 
#            7             5             5             2             5 
#  KitchenQual    Functional   FireplaceQu    GarageType  GarageFinish 
#            4             6             6             6             3 
#   GarageQual    GarageCond    PavedDrive        PoolQC         Fence 
#            5             5             3             4             5 
#  MiscFeature      SaleType SaleCondition  MiscVal_bool      MoYrSold 
#            4             9             6             2            55 

As you can see, Utilities is the offending variable here as it has only 1 level.
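If the helper functions are not at hand, the same kind of check can be approximated in base R. The following is only a rough sketch on a hypothetical toy data frame (the names toy, y, x, g and h are made up for illustration); it is not the implementation of debug_contr_error2. It builds the model frame with the NA handling that lm uses and counts the levels that survive.

```r
# Toy data (hypothetical): `g` has two levels overall, but only one level
# survives once the rows with a missing `x` are dropped.
toy <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 6.3),
  x = c(0.5, 1.0, 1.5, 2.0, NA, NA),
  g = factor(c("a", "a", "a", "a", "b", "b")),
  h = factor(c("u", "v", "u", "v", "u", "v"))
)

# Build the model frame the way lm() does: incomplete rows are removed
# and factor levels that no longer occur are dropped.
mf <- model.frame(y ~ ., data = toy,
                  na.action = na.omit, drop.unused.levels = TRUE)

# Count the levels actually present in the retained rows
is_cat <- vapply(mf, function(col) is.factor(col) || is.character(col),
                 logical(1))
n_levels <- vapply(mf[is_cat], function(col) length(unique(col)),
                   integer(1))
n_levels
```

Any entry of n_levels below 2 marks a variable that would trigger the contrasts error. Note that the count has to be taken after incomplete rows are removed, which is why counting levels on the raw data can be misleading.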

Since you have many character / factor variables in train, I wonder whether you have NA for them. If we add NA as a valid level, we could possibly get more complete cases.

new_train <- NA_preproc(train)

new_info <- debug_contr_error2(log_SalePrice ~ ., new_train)

new_dat <- new_info$mf

nrow(new_dat)
#[1] 1121

new_info$nlevels
#     MSZoning        Street         Alley      LotShape   LandContour 
#            5             2             3             4             4 
#    Utilities     LotConfig     LandSlope  Neighborhood    Condition1 
#            1             5             3            25             9 
#   Condition2      BldgType    HouseStyle     RoofStyle      RoofMatl 
#            6             5             8             5             7 
#  Exterior1st   Exterior2nd    MasVnrType     ExterQual     ExterCond 
#           14            16             4             4             4 
#   Foundation      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1 
#            6             5             5             5             7 
# BsmtFinType2       Heating     HeatingQC    CentralAir    Electrical 
#            7             5             5             2             6 
#  KitchenQual    Functional   FireplaceQu    GarageType  GarageFinish 
#            4             6             6             6             3 
#   GarageQual    GarageCond    PavedDrive        PoolQC         Fence 
#            5             5             3             4             5 
#  MiscFeature      SaleType SaleCondition  MiscVal_bool      MoYrSold 
#            4             9             6             2            55

We do get more complete cases (1121 instead of 1112), but Utilities still has only one level. This means that most incomplete cases are actually caused by NA in your numerical variables, about which we can do nothing (unless you have a statistically valid way to impute those missing values).
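To illustrate the idea on a small scale, here is a hedged sketch using base R's addNA() on a hypothetical toy data frame. It is not the code of NA_preproc, whose implementation is not shown here; it only demonstrates that turning NA into a factor level recovers rows lost to categorical NAs, while rows with a missing numeric value are still dropped.

```r
# Toy data (hypothetical): NA appears in the factor `g` and in the numeric `x`
toy <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 6.3),
  g = factor(c("a", "b", NA, "a", NA, "b")),
  x = c(0.5, 1.0, 1.5, NA, 2.5, 3.0)
)

# Missing values per column: shows which variables cost rows
na_counts <- colSums(is.na(toy))
na_counts

# Treat NA as its own level, for factor columns only
toy_na <- toy
fac <- vapply(toy_na, is.factor, logical(1))
toy_na[fac] <- lapply(toy_na[fac], addNA)

# Complete cases before and after the conversion
n_before <- sum(complete.cases(toy))
n_after  <- sum(complete.cases(toy_na))
c(before = n_before, after = n_after)
```

In this toy example the row with the missing x remains incomplete after the conversion, which mirrors what happens with the numerical variables in train.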

As you have only one single-level factor variable, the same method as given in 'How to do a GLM when "contrasts can be applied only to factors with 2 or more levels"?' will work: replace the factor with the constant 1 and drop the intercept.

new_dat$Utilities <- 1

simplelm <- lm(log_SalePrice ~ 0 + ., data = new_dat)
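Why this works can be seen on a small example. The sketch below uses a hypothetical toy data frame (toy, y, x and g are illustrative names, not part of your data): once the single-level factor is replaced by a numeric constant and the implicit intercept is removed, the constant column simply takes over the role of the intercept.

```r
# Toy data (hypothetical): `g` is a factor with a single level
toy <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 6.3),
  x = c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0),
  g = factor(rep("only", 6))
)

# lm(y ~ ., data = toy) would stop with the contrasts error here.

# Replace the single-level factor by the constant 1 and drop the
# implicit intercept; the constant column then acts as the intercept.
toy_fix <- toy
toy_fix$g <- 1
fit_const <- lm(y ~ 0 + ., data = toy_fix)

# For comparison: leave the single-level column out of the model
fit_drop <- lm(y ~ x, data = toy)

coef(fit_const)
coef(fit_drop)
```

Both fits should give the same estimates; the only difference is that the intercept is labelled g in the first one. Dropping the column is therefore an equivalent alternative when only the fitted model matters.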

The model fitted on new_dat now runs successfully. However, it is rank-deficient: some coefficients cannot be estimated and are returned as NA. You may want to address this, but leaving it as it is will not break anything.

b <- coef(simplelm)

length(b)
#[1] 301

sum(is.na(b))
#[1] 9

simplelm$rank
#[1] 292
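To see which terms are affected, the NA coefficients can be listed by name. A minimal sketch on hypothetical toy data, where the column x2 is deliberately constructed as an exact multiple of x1 so that the fit is rank-deficient:

```r
# Toy data (hypothetical): x2 is an exact linear function of x1
toy <- data.frame(
  y  = c(1.0, 2.1, 2.9, 4.2, 5.1, 6.3),
  x1 = c(1, 2, 3, 4, 5, 6),
  x3 = c(1, 0, 1, 1, 0, 1)
)
toy$x2 <- 2 * toy$x1

fit <- lm(y ~ x1 + x2 + x3, data = toy)
b <- coef(fit)

# Names of the coefficients that could not be estimated
aliased <- names(b)[is.na(b)]
aliased

# The rank equals the number of estimable coefficients
fit$rank
length(b) - length(aliased)
```

Applied to simplelm, names(b)[is.na(b)] would list the 9 terms that are linear combinations of other columns in the design matrix.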
Lillian answered 28/7, 2018 at 19:10 Comment(0)
