R - mice - machine learning: re-use imputation scheme from train to test set

Asked 3/11, 2015 at 13:12 Answered 17/12, 2020 at 13:39

I'm building a predictive model and am using the mice package for imputing NAs in my training set. Since I need to re-use the same imputation scheme for my test set, how can I re-apply it to my test data?

# generate example data
set.seed(333)
mydata <- data.frame(a = as.logical(rbinom(100, 1, 0.5)),
                     b = as.logical(rbinom(100, 1, 0.2)),
                     c = as.logical(rbinom(100, 1, 0.8)),
                     y = as.logical(rbinom(100, 1, 0.6)))

na_a <- as.logical(rbinom(100, 1, 0.3))
na_b <- as.logical(rbinom(100, 1, 0.3))
na_c <- as.logical(rbinom(100, 1, 0.3))
mydata$a[na_a] <- NA
mydata$b[na_b] <- NA
mydata$c[na_c] <- NA

# create train/test sets
library(caret)
inTrain <- createDataPartition(mydata$y, p = .8, list = FALSE)
train <- mydata[ inTrain, ] 
test <-  mydata[-inTrain, ]

# impute NAs in train set
library(mice)
imp <- mice(train, method = "logreg")
train_imp <- complete(imp)

# apply imputation scheme to test set (placeholder: this is the step I am looking for)
test_imp <- unknown_function(test, imp$unknown_data)

Lottielotto answered 3/11, 2015 at 13:12 Comment(2)

What are you trying to accomplish? When you say "re-use the same imputation scheme" it seems to imply you would simply use the same method for imputing missing data in your test set as you used in your training set. In this case you are doing multiple imputation using logistic regression as the underlying imputation method. – Vanna 21/12, 2016 at 0:0

I am actually trying to do the same. MICE fits a logistic regression model (at least with method "logreg"). You can get the model by following the instructions in gerkovink.com/miceVignettes/Convergence_pooling/… at step 7. Edit: The author of the method and package comments on this topic here: github.com/stefvanbuuren/mice/issues/32 – Extractive 7/11, 2018 at 1:53

As of version 3.12.0, mice::mice has an ignore parameter that covers most use cases.

Pass it a logical vector with FALSE for every row that should be used to fit the imputation model and TRUE for every row that should only be imputed (and ignored during fitting).

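# data must have 100 rows here: the first 99 fit the imputation model, the last row is only imputed
imp.ignore <- mice(data, ignore = c(rep(FALSE, 99), TRUE), maxit = 5, m = 2, seed = 1)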
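Applied to the train/test setup from the question, this could look roughly like the sketch below. It is an illustration under a few assumptions that are not in the original posts: the split uses sample() instead of caret, only the predictors a, b and c are imputed (the outcome y is left out), and the names predictors and is_test are made up for the example.

```r
library(mice)

# Example data along the lines of the question: three logical predictors
set.seed(333)
n <- 100
predictors <- data.frame(a = as.logical(rbinom(n, 1, 0.5)),
                         b = as.logical(rbinom(n, 1, 0.2)),
                         c = as.logical(rbinom(n, 1, 0.8)))

# Set roughly 30% of each column to NA
for (v in names(predictors)) {
  predictors[[v]][as.logical(rbinom(n, 1, 0.3))] <- NA
}

# Hold out 20 rows as the test set (hypothetical split, not caret)
is_test <- seq_len(n) %in% sample(n, size = 20)

# Rows with ignore = TRUE are imputed but not used to fit the imputation models
imp <- mice(predictors, method = "logreg", ignore = is_test,
            m = 5, maxit = 5, seed = 1, printFlag = FALSE)

# Take the first of the m completed data sets and split it again
completed <- complete(imp, 1)
train_imp <- completed[!is_test, ]
test_imp  <- completed[ is_test, ]
```

Because train and test rows go through a single mice() call, both sets are imputed from models fitted on the training rows only.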

Mooncalf answered 17/12, 2020 at 13:39 Comment(0)

prockenschaub has created a lovely function for that, called mice.reuse()

library(mice)
library(scorecard)

# function to impute new observations based on the previous imputation model
source("https://raw.githubusercontent.com/prockenschaub/Misc/master/R/mice.reuse/mice.reuse.R")

# split data into train and test
data_list <- split_df(airquality, y = NULL, ratio = 0.75, seed = 186)

imp <- mice(data = data_list$train, 
            seed = 500, 
            m = 5,
            method = "pmm",
            print = FALSE)


# impute test data based on train imputation model
test_imp <- mice.reuse(imp, data_list$test, maxit = 1)

Changteh answered 15/12, 2020 at 14:15 Comment(1)

Quote from github.com/amices/mice/issues/32: "mice.reuse was my own hacked function and is not part of the mice package (you can still find it here but I wouldn't recommend using it anymore). mice::mice version 3.12.0 contains the ignore parameter that does the same thing in one go. Simply pass it a vector with FALSE for all rows that should be used during training and TRUE for all rows that should only be imputed (but not used during training). See the proposed solution in my example two comments ago for an idea on how to use it." – Mannerheim 11/2, 2022 at 14:19

When you are training a model, you cannot use the test data in any way. You therefore cannot impute the complete dataset with MICE before splitting. The imputation of the test data must also be based only on the training data.
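To make the principle concrete, here is a minimal base-R sketch that deliberately swaps MICE for something much simpler: single mode imputation, where the fill values are learned from the training rows and then applied unchanged to the test rows. The data and the helper names col_mode and apply_fill are made up for illustration.

```r
# Toy data: two logical columns with missing values
set.seed(1)
x <- data.frame(a = sample(c(TRUE, FALSE, NA), 50, replace = TRUE),
                b = sample(c(TRUE, FALSE, NA), 50, replace = TRUE))
train <- x[1:40, ]
test  <- x[41:50, ]

# Most frequent observed value of a logical vector (table() drops NAs)
col_mode <- function(v) {
  tab <- table(v)
  names(tab)[which.max(tab)] == "TRUE"
}

# "Fit" step: the fill values come from the training rows only
fill <- lapply(train, col_mode)

# "Apply" step: replace NAs in any data frame with the stored fill values
apply_fill <- function(df, fill) {
  for (v in names(fill)) {
    df[[v]][is.na(df[[v]])] <- fill[[v]]
  }
  df
}

train_imp <- apply_fill(train, fill)
test_imp  <- apply_fill(test, fill)
```

The test rows never influence fill, which is the property the imputation step needs regardless of which imputation method is used.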

Jac answered 8/2, 2018 at 15:56 Comment(0)

-5

Run mice imputation on the combined dataset and only then split it into train and test. Fit the machine learning classifier on the train set and then apply it to the test set.

Fundus answered 31/8, 2017 at 17:37 Comment(1)

Ill-advised because of data leakage and underestimating test error: machinelearningmastery.com/data-leakage-machine-learning/… – Priapism 23/8, 2018 at 20:3
