Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor X has new levels
Asked Answered
B

2

8

I did a logistic regression:

 EW <- glm(everwrk~age_p + r_maritl, data = NH11, family = "binomial")

Moreover, I want to predict everwrk for each level of r_maritl.

r_maritl has the following levels:

levels(NH11$r_maritl)
 "0 Under 14 years" 
 "1 Married - spouse in household" 
 "2 Married - spouse not in household"
 "3 Married - spouse in household unknown" 
 "4 Widowed"                               
 "5 Divorced"                             
 "6 Separated"                             
 "7 Never married"                        
 "8 Living with partner"  
 "9 Unknown marital status"  

So I did:

predEW <- with(NH11,
expand.grid(r_maritl = c( "0 Under 14 years", "1 Married - 
spouse in household", "2 Married - spouse not in household", "3 Married - 
spouse in household unknown", "4 Widowed", "5 Divorced", "6 Separated", "7 
Never married", "8 Living with partner", "9 Unknown marital status"),
age_p = mean(age_p,na.rm = TRUE)))

cbind(predEW, predict(EW, type = "response",
                        se.fit = TRUE, interval = "confidence",
                        newdata = predEW))

The Problem is I get the following response:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor r_maritl has new levels 0 Under 14 years, Married - spouse in household unknown

Sample Data:

str(NH11$age_p)
num [1:33014] 47 18 79 51 43 41 21 20 33 56 ...

str(NH11$everwrk)
Factor w/ 2 levels "2 No","1 Yes": NA NA 2 NA NA NA NA NA 2 2 ...

str(NH11$r_maritl)
Factor w/ 10 levels "0 Under 14 years",..: 6 8 5 7 2 2 8 8 8 2 ...
Bascio answered 20/8, 2017 at 16:50 Comment(1)
Could you provide some example data? I cannot reproduce your problem with e.g. mtcars. Also, are there unused factors in your dataset NH11?Exchange
P
16

tl;dr it looks like you have some levels in your factor that are not represented in your data, that get dropped from the factors used in the model. In hindsight this isn't terribly surprising, since you won't be able to predict responses for these levels. That said, it's mildly surprising that R doesn't do something nice for you like generate NA values automatically. You can solve this problem by using levels(droplevels(NH11$r_maritl)) in constructing your prediction frame, or equivalently EW$xlevels$r_maritl.

A reproducible example:

maritl_levels <- c( "0 Under 14 years", "1 Married - spouse in household", 
  "2 Married - spouse not in household", "3 Married - spouse in household unknown", 
  "4 Widowed", "5 Divorced", "6 Separated", "7 Never married", "8 Living with partner", 
 "9 Unknown marital status")
set.seed(101)
NH11 <- data.frame(everwrk=rbinom(1000,size=1,prob=0.5),
                 age_p=runif(1000,20,50),
                 r_maritl = sample(maritl_levels,size=1000,replace=TRUE))

Let's make a missing level:

NH11 <- subset(NH11,as.numeric(NH11$r_maritl) != 3)

Fit the model:

EW <- glm(everwrk~r_maritl+age_p,data=NH11,family=binomial)
predEW <- with(NH11,
  expand.grid(r_maritl=levels(r_maritl),age_p=mean(age_p,na.rm=TRUE)))
predict(EW,newdata=predEW)

Success!

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor r_maritl has new levels 2 Married - spouse not in household

predEW <- with(NH11,
           expand.grid(r_maritl=EW$xlevels$r_maritl,age_p=mean(age_p,na.rm=TRUE)))
predict(EW,newdata=predEW)
Pinchcock answered 20/8, 2017 at 17:7 Comment(3)
I really, really doubt it [version differences]. This is all core R functionality that's been around for decades and exercised by thousands of users. My prior is very strongly on a subtle typo by the OP.Pinchcock
Cool, I learned something today. (I don't think I had encountered this particular problem before.)Pinchcock
that command doesn't do what you think it does. Use droplevels().Pinchcock
C
0

Thanks a lot for the answer, I was also facing the same problem of new levels. I made below changes in my code.

  1. I was using data.frame() and replace it by expand.grid() function
  2. Added na.rm=TRUE along with the variable in mean function
  3. Replaced factor(1:2) by glmoutput$xlevels$variablename

and the solution works!

Convalescence answered 8/9, 2021 at 7:29 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.