R error which says "Models were not all fitted to the same size of dataset"
V

6

17

I have created two generalised linear models as follows:

glm1 <-glm(Y ~ X1 + X2 + X3, family=binomial(link=logit))

glm2 <-glm(Y ~ X1 + X2, family=binomial(link=logit))

I then use the anova function:

anova(glm2,glm1)

but get an error message:

"Error in anova.glmlist(c(list(object),dotargs), dispersion = dispersion, :
models were not all fitted to the same size of dataset"

What does this mean and how can I fix this? I have attached the dataset at the start of my code so both models are working off of the same dataset.

Viera answered 22/8, 2013 at 17:39 Comment(10)
On a side note, don't use attach(). - Township
Also, I'm assuming you used glm(Y ~ X1...) and not just (Y ~ X1...)? And why do you have commas separating the variables? - Township
Yes, I used that. Apologies for not posting it correctly before. Any idea what might be wrong? - Viera
Without seeing your data or code, no. Using attach could definitely cause that problem. - Township
Instead of using attach, would I specify the glm as glm(Y ~ X1, X2, X3, family=binomial(link=logit), data.df) for each one? - Viera
You need to use data=YourData in the glm call, and you can't use commas to separate variables like that. - Township
Yes, again that was a silly error on my part. Using data=YourData worked. Thanks so much :) - Viera
Also, how do I get the p-value from the anova result? I only get the deviance in the output. Thanks again! :) - Viera
I think anova(glm1,glm2,test="Chisq") is what you want. - Jamboree
T
22

The usual cause of that error is missing values in one or more of the predictor variables. In recent versions of R the default action is to omit all rows that have any values missing (the previous default was to produce an error). For example, if the data frame has 100 rows and there is one missing value in X3, then your model glm1 will be fitted to 99 rows of data (dropping the row where X3 is missing), but glm2 will be fitted to the full 100 rows (since it does not use X3, no rows need to be deleted).

The anova function then gives you an error because the two models were fitted to different datasets, so quantities such as the degrees of freedom for the comparison are not well defined.
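
A small simulated example can make the mismatch visible. The data frame d and its columns are made up for illustration (they are not the asker's data); nobs() reports how many rows each fit actually used, and tryCatch() is only there to capture the error message instead of stopping the script.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
d <- data.frame(X1 = rnorm(100), X2 = rnorm(100), X3 = rnorm(100))
d$Y <- rbinom(100, size = 1, prob = plogis(0.5 * d$X1 - 0.5 * d$X2))
d$X3[1:5] <- NA   # five missing values, in X3 only

glm1 <- glm(Y ~ X1 + X2 + X3, family = binomial(link = logit), data = d)
glm2 <- glm(Y ~ X1 + X2,      family = binomial(link = logit), data = d)

# Number of rows each model was actually fitted to
n_used <- c(glm1 = nobs(glm1), glm2 = nobs(glm2))
print(n_used)

# anova() refuses to compare fits of different sizes; capture the message
msg <- tryCatch(anova(glm2, glm1), error = function(e) conditionMessage(e))
print(msg)
```

Comparing nobs() across the candidate models is a quick way to confirm that missing values are the culprit before changing anything else.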

One solution is to create a new data frame that has only the columns that will be used in at least one of your models and remove all the rows with any missing values (the na.omit or na.exclude function will make this easy), then fit both models to the same data frame that does not have any missing values.
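
As a rough sketch of that approach, assuming a data frame d with columns Y, X1, X2 and X3 (simulated here, since the original data were not posted), the complete-case subset is built once and passed to both fits. The name d_complete is just an illustrative choice.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
d <- data.frame(X1 = rnorm(100), X2 = rnorm(100), X3 = rnorm(100))
d$Y <- rbinom(100, size = 1, prob = plogis(0.5 * d$X1 - 0.5 * d$X2))
d$X3[1:5] <- NA   # five missing values, in X3 only

# Keep only the columns used by at least one model, then drop incomplete rows
vars <- c("Y", "X1", "X2", "X3")
d_complete <- na.omit(d[, vars])

# Both models now see exactly the same rows
glm1 <- glm(Y ~ X1 + X2 + X3, family = binomial(link = logit), data = d_complete)
glm2 <- glm(Y ~ X1 + X2,      family = binomial(link = logit), data = d_complete)

# test = "Chisq" adds a p-value column to the analysis of deviance table
cmp <- anova(glm2, glm1, test = "Chisq")
print(cmp)
```

Restricting the columns before calling na.omit() matters: otherwise rows would also be dropped for missing values in variables that neither model uses.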

Other options would be to look at tools for multiple imputation or other ways of dealing with missing data.

Teen answered 22/8, 2013 at 18:17 Comment(2)
Thanks for that, it was very helpful. It seemed to work once I stopped using attach and specified the data in each glm. Is this only working by chance? Also, I am looking to get a p-value from anova between glm1 and various other glms. How do I do this? - Viera
@Denis, with attach you may have had a variable of the same name in the global environment/workspace that was messing things up. That is one of the reasons not to use attach. To get a p-value from anova add test="Chisq"; see ?anova.glm for details (and make sure that you are happy with the assumptions). - Teen
P
10

To avoid the "models were not all fitted to the same size of dataset" error, you must fit both models on the exact same subset of data. There are two simple ways to do this:

  • either use data=glm1$model in the 2nd model fit
  • or retrieve the correctly subsetted dataset by using data=na.omit(orig.data[ , all.vars(formula(glm1))]) in the 2nd model fit

Here's a reproducible example using lm (for glm the same approach should work) and update:

# 1st approach
# define a convenience wrapper
update_nested <- function(object, formula., ..., evaluate = TRUE){
    update(object = object, formula. = formula., data = object$model, ..., evaluate = evaluate)
}

# prepare data with NAs
data(mtcars)
for(i in 1:ncol(mtcars)) mtcars[i,i] <- NA

xa <- lm(mpg~cyl+disp, mtcars)
xb <- update_nested(xa, .~.-cyl)
anova(xa, xb)
## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp
## Model 2: mpg ~ disp
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     26 256.91                              
## 2     27 301.32 -1   -44.411 4.4945 0.04371 *
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# 2nd approach
xc <- update(xa, .~.-cyl, data=na.omit(mtcars[ , all.vars(formula(xa))]))
anova(xa, xc)
## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp
## Model 2: mpg ~ disp
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     26 256.91                              
## 2     27 301.32 -1   -44.411 4.4945 0.04371 *
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
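
For completeness, here is a minimal glm sketch of the first approach, on simulated data (the data frame d and its columns are hypothetical). It assumes the formula uses plain column names; if the formula contains transformations such as log(X1), the model frame stores the transformed columns, so reusing it as data may not behave as expected and the second approach is probably the safer one.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
d <- data.frame(X1 = rnorm(100), X2 = rnorm(100), X3 = rnorm(100))
d$Y <- rbinom(100, size = 1, prob = plogis(0.5 * d$X1 - 0.5 * d$X2))
d$X3[1:5] <- NA   # five missing values, in X3 only

# Fit the larger model first; full$model holds only the rows it used
full <- glm(Y ~ X1 + X2 + X3, family = binomial(link = logit), data = d)

# Refit the nested model on that same model frame
reduced <- update(full, . ~ . - X3, data = full$model)

# Both fits now have the same size, so the comparison goes through
cmp <- anova(reduced, full, test = "Chisq")
print(cmp)
```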


Patricapatrice answered 24/5, 2016 at 14:20 Comment(0)
G
1

The solution is to use:

glm1 <-glm(Y ~ X1 + X2 + X3, family = binomial(link = logit), na.action = na.exclude)
glm2 <-glm(Y ~ X1 + X2, family = binomial(link = logit), na.action = na.exclude)

anova(glm2,glm1)

This will make R include the cases with missing data (NA) in the fitted model. This ensures that datasets are identical across different fit models no matter how missing data is distributed.

Gooey answered 19/7, 2015 at 1:7 Comment(3)
The na.exclude approach won't necessarily work. Try this: mtcars[1,2] <- NA; xa <- lm(mpg~cyl+disp, mtcars, na.action = na.exclude); xaa <- lm(mpg~disp, mtcars, na.action = na.exclude); anova(xa, xaa) - Patricapatrice
You are right. It doesn't work in this case. It does, however, ensure that predict inserts NAs for the data points that have missing values in the data. The documentation for na.exclude notes that it only affects naresid and napredict, which are used by resid and predict. Apparently anova uses something else, so one will have to subset to the complete cases first, e.g. using na.omit. At least, that's the advice given here. - Gooey
There are at least two straightforward approaches to doing this which I discuss in this answer. - Patricapatrice
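
The point made in these comments can be checked with a short sketch. It works on a copy of mtcars (cars_na is an illustrative name) and compares the padded residuals returned by resid() with the residuals actually stored in the fit, which is what the size check in anova appears to rely on.

```r
# Work on a copy so the built-in mtcars stays untouched
cars_na <- mtcars
cars_na[1, "cyl"] <- NA

fit <- lm(mpg ~ cyl + disp, data = cars_na, na.action = na.exclude)

sizes <- c(
  rows_in_data = nrow(cars_na),         # all rows, including the incomplete one
  rows_used    = nobs(fit),             # rows that actually entered the fit
  padded_resid = length(resid(fit)),    # resid() pads with NA under na.exclude
  stored_resid = length(fit$residuals)  # stored residuals are not padded
)
print(sizes)
```

So na.exclude changes how residuals and predictions are reported, but the row with the missing value is still left out of the fit itself.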
R
0

I'm guessing that you meant to type:

glm1 <-glm(Y ~ X1+X2+X3, family=binomial(link=logit))

glm2 <-glm(Y ~ X1 + X2, family=binomial(link=logit))

The formula interface for R regression functions does not recognize commas as adding covariates to the RHS of the formula. And don't use attach(); use the data argument to regression functions.

Ricardoricca answered 22/8, 2013 at 19:1 Comment(4)
Yes, I have done all of this and the error still seems to come up. I have also tried na.omit=na.pass to see whether the blank cells in my data would be treated differently in R, but to no avail. Any idea what I could be doing wrong? - Viera
I would use a data argument that was the same for both glm calls: na.omit(YourData[ , c("Y" , "X1","X2","X3")]). That way you remove the "extra cases" that are present for X1 and X2 but not for X3. - Ricardoricca
Do I put this glm1 <-glm(Y ~ X1+X2+X3, family=binomial(link=logit), na.omit(YourData[ , c("Y" , "X1","X2","X3")])) when defining both glms? - Viera
The data argument should be the same and the formulas should be different. - Ricardoricca
R
0

The cause is well described by Greg Snow. An alternative and very easy solution is to add a new variable that is NA wherever the problematic variable is NA and has the value 1 otherwise. Include it in both models and R will exclude the same rows from both, so the datasets will match.
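
A minimal sketch of that idea on simulated data follows; the data frame d and the indicator name has_X3 are hypothetical. Because has_X3 is constant among the retained rows, it is collinear with the intercept and its coefficient comes back as NA, which looks harmless for the comparison but is worth knowing about.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
d <- data.frame(X1 = rnorm(100), X2 = rnorm(100), X3 = rnorm(100))
d$Y <- rbinom(100, size = 1, prob = plogis(0.5 * d$X1 - 0.5 * d$X2))
d$X3[1:5] <- NA   # five missing values, in X3 only

# Indicator that is NA exactly where X3 is NA, and 1 elsewhere
d$has_X3 <- ifelse(is.na(d$X3), NA, 1)

# Including the indicator makes both fits drop the same rows
glm1 <- glm(Y ~ X1 + X2 + X3 + has_X3, family = binomial(link = logit), data = d)
glm2 <- glm(Y ~ X1 + X2 + has_X3,      family = binomial(link = logit), data = d)

cmp <- anova(glm2, glm1, test = "Chisq")
print(cmp)
```

A variant that avoids the aliased coefficient is to leave the formulas alone and pass subset = !is.na(X3) to the smaller model.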

Rose answered 3/7, 2017 at 11:56 Comment(1)
None of these answers seem to work in this case:
fit4=lm(ctmax~ta+ratectmax,data=x,na.action = na.exclude) # rectilinear
fit5=lme(ctmax~poly(ta,2)+ratectmax,random=~1|col,data=x,na.action = na.exclude)
fit6=lme(ctmax~poly(ta,3)+ratectmax,random=~1|col,data=x,na.action = na.exclude)
fit7=lme(ctmax~poly(ta,4)+ratectmax,random=~1|col,data=x,na.action = na.exclude)
anova(fit4,fit5,fit6,fit7)
It gives the same error, with no missing data. - Adept
I
0

I think that the easiest way to handle this situation without imputing the missing values is to create a new dataset using tidyr's drop_na() function.

For this function, put all the variables that you'll need in your final model inside the drop_na() call, and it will remove any rows that have missing values in any of those variables:

library(tidyr) #load in drop_na()
library(dplyr) #load in glimpse()

mtcars[1,2] <- NA #makes the first row of the cyl column become NA to illustrate

no_missing <- mtcars %>%
  drop_na(cyl)

glimpse(no_missing) #note, you only have 31 obs instead of 32 now

drop_na() also works across multiple columns:

library(tidyr)
library(dplyr) #load in glimpse()

mtcars[1,2] <- NA #makes the first row of the cyl column become NA to illustrate
mtcars[3,1] <- NA #makes the 3rd row of the mpg column become NA to illustrate

no_missing_2 <- mtcars %>%
  drop_na(mpg, cyl)

glimpse(no_missing_2) #now, you only have 30 obs

By running drop_na() with all the variables that you'll be using in your most complex model, you'll ensure that you're using the same dataset.
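
To round this off, here is a short sketch of fitting two nested models on the cleaned data. It assumes tidyr is installed, works on a copy of mtcars (cars_na and complete_cars are illustrative names), and uses lm for brevity; the same pattern should carry over to glm.

```r
library(tidyr)  # provides drop_na()

# Work on a copy so the built-in mtcars stays untouched
cars_na <- mtcars
cars_na[1, "cyl"] <- NA
cars_na[3, "mpg"] <- NA

# List every variable used by the most complex model
complete_cars <- drop_na(cars_na, mpg, cyl, disp)

# Fit both models to the same cleaned data frame
fit_full    <- lm(mpg ~ cyl + disp, data = complete_cars)
fit_reduced <- lm(mpg ~ disp,       data = complete_cars)

cmp <- anova(fit_reduced, fit_full)
print(cmp)
```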

Import answered 14/1, 2020 at 15:39 Comment(0)
