Using mlr3-pipelines to impute data and encode factor columns in GraphLearner?
I have a few questions regarding the use of mlr3-pipelines. My goal is to create a pipeline that combines three graphs:

1 - A graph to process categorical variables: new-level imputation => encoding

imp_cat = po("imputenewlvl", param_vals = list(affect_columns = selector_name(my_cat_variables)))
encode  = po("encode",       param_vals = list(affect_columns = selector_name(my_cat_variables)))
cat = imp_cat %>>% encode

2 - A graph to process a subset of numeric variables: mean imputation => standardization

imp_mean = po("imputemean", param_vals = list(affect_columns = selector_name(my_first_set_of_numeric_variables)))
scale = po("scale", param_vals = list(affect_columns = selector_name(my_first_set_of_numeric_variables)))
num_mean = imp_mean %>>% scale

3 - A graph to process another subset of numeric variables: median imputation => min-max scaling

imp_median = po("imputemedian", param_vals = list(affect_columns = selector_name(my_second_set_of_numeric_variables)))
min_max = po("scalerange", param_vals = list(affect_columns = selector_name(my_second_set_of_numeric_variables)))
num_median = imp_median %>>% min_max

I then combine these graphs with the featureunion PipeOp:

graph = po("copy", 3) %>>%
   gunion(list(cat, num_mean, num_median )) %>>%
   po("featureunion")

and finally add a learner in a GraphLearner:

g1 = GraphLearner$new(graph %>>% po(lrn("classif.ranger")))

I have some missing values in my data, hence the imputers in each graph, and I have a binary classification task.

my_task = TaskClassif$new(id="classif", backend = data, target = "my_target")
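
The missing values can be confirmed directly on the task (Task$missings() reports the per-column NA counts):

my_task$missings()
# non-zero counts for the columns containing NAs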

In theory, I shouldn't get missing-value errors when I start training:

g1$train(my_task)

But I get different errors depending on the learner I choose. If I use ranger, for example, I get this error:

Error: Missing data in columns: ....

If I use svm, glmnet or xgboost, I get a problem due to the categorical variables: Error: ... has the following unsupported feature types: factor...

With my pipeline, I shouldn't have any categorical variables and I shouldn't have any missing values, so I don't see how to overcome this problem.

1 - I used an imputer in each graph; why do some algorithms still tell me there are missing values?

2 - How do I remove the categorical variables once they are encoded? Some algorithms do not support this type of variable.

Updated

I suspect that the modifications made by the pipeline are not persisted. In other words, the learners (svm, ranger, ...) train on the original task, not on the one transformed by the pipeline.
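
One way I could test this is to inspect what a single branch outputs (a sketch using the graphs defined above):

cat$train(my_task)[[1]]$missings()
# if columns outside my_cat_variables still report NAs here,
# each branch only fixes its own subset of columns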

Cate answered 10/3, 2020 at 14:53

Answer to the first question

I will try to explain why missing values remain in your workflow.

Let's load the required packages:

library(mlr3) 
library(mlr3pipelines)
library(mlr3learners)
library(mlr3tuning)
library(paradox)

Let's take the pima task, which has missing values:

task <- tsk("pima")
task$missings()
diabetes      age  glucose  insulin     mass pedigree pregnant pressure  triceps 
       0        0        5      374       11        0        0       35      227 

Since there are no categorical columns, I will convert triceps into one:

hb <- po("histbin",
         param_vals = list(affect_columns = selector_name("triceps")))

Now impute a new level and encode:

imp_cat <- po("imputenewlvl",
              param_vals = list(affect_columns = selector_name("triceps")))
encode <- po("encode",
             param_vals = list(affect_columns = selector_name("triceps")))

cat <- hb %>>% 
  imp_cat %>>%
  encode

When you use cat on the task:

cat$train(task)[[1]]$data()
#big output

Not only the columns you selected for transformation are returned, but also all the others.
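
To make this visible, check missingness rather than printing the whole table (a quick sketch):

cat$train(task)[[1]]$missings()
# the encoded triceps columns are complete, but glucose, insulin, mass and
# pressure should still report their original NA counts: this branch never imputed them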

The same happens for num_mean and num_median.

Let's create them:

imp_mean <- po("imputemean", param_vals = list(affect_columns = selector_name(c("glucose", "mass"))))
scale <- po("scale", param_vals = list(affect_columns = selector_name(c("glucose", "mass"))))
num_mean <- imp_mean %>>% scale


imp_median <- po("imputemedian", param_vals = list(affect_columns = selector_name(c("insulin", "pressure"))))
min_max <- po("scalerange", param_vals = list(affect_columns = selector_name(c("insulin", "pressure"))))
num_median <- imp_median %>>% min_max

Check what num_median does:

num_median$train(task)[[1]]$data()
#output
     diabetes    insulin  pressure age glucose mass pedigree pregnant triceps
  1:      pos 0.13341346 0.4897959  50     148 33.6    0.627        6      35
  2:      neg 0.13341346 0.4285714  31      85 26.6    0.351        1      29
  3:      pos 0.13341346 0.4081633  32     183 23.3    0.672        8      NA
  4:      neg 0.09615385 0.4285714  21      89 28.1    0.167        1      23
  5:      pos 0.18509615 0.1632653  33     137 43.1    2.288        0      35
 ---                                                                         
764:      neg 0.19951923 0.5306122  63     101 32.9    0.171       10      48
765:      neg 0.13341346 0.4693878  27     122 36.8    0.340        2      27
766:      neg 0.11778846 0.4897959  30     121 26.2    0.245        5      23
767:      pos 0.13341346 0.3673469  47     126 30.1    0.349        1      NA
768:      neg 0.13341346 0.4693878  23      93 30.4    0.315        1      31

So it did what it was supposed to on the "insulin" and "pressure" columns, but it also returned the rest unchanged.

By copying the data three times and applying one of these three preprocessors to each copy, every branch returns its transformed columns together with all the remaining columns untouched, three times over. After po("featureunion") merges the branches, the task therefore still carries untransformed versions of columns, including their missing values, which is what the learner then complains about.

What you should do is:

graph <- cat %>>%
  num_mean %>>%
  num_median

cat transforms its selected columns and returns all of them; then num_mean transforms its selected columns and returns all, and so on.

graph$train(task)[[1]]$data()

Looks good to me.
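
As a quick sanity check (a sketch), every incomplete column is now handled by one of the three chained steps, so there should be no NAs left:

graph$train(task)[[1]]$missings()
# expect 0 for every column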

And more importantly

g1 <- GraphLearner$new(graph %>>% po(lrn("classif.ranger")))
g1$train(task)

works

Answer to the second question

The answer is to use selector functions, in your case selector_type(). The selector

selector_invert(selector_type("factor"))

should do the trick if applied, via po("select"), just before piping into the learner.
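
For example, a minimal sketch using PipeOpSelect (assuming its selector parameter, and classif.svm from mlr3learners, which needs the e1071 package installed):

# drop every remaining factor column before the learner sees the task
drop_factors <- po("select",
                   param_vals = list(selector = selector_invert(selector_type("factor"))))

g2 <- GraphLearner$new(graph %>>% drop_factors %>>% po(lrn("classif.svm")))
g2$train(task)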

Houselights answered 11/3, 2020 at 11:40
Thank you for taking the time to answer me. Your solution works for me. However, by using copy or gunion, I wanted to parallelize the preprocessing. What do you recommend for doing parallel preprocessing, as we can with a scikit-learn pipeline? Thanks. – Cate
Glad I could help. This is another topic. I doubt parallelization is handled this way in mlr3, and I doubt you would benefit much from it in this use case even if it were. Parallelization can be used when you are performing resampling, where the pipeline is applied in each resample, and each resample is run on a different thread. I don't think parallelization is implemented within a pipeline unless the interfaced packages can use it internally. – Houselights
And if I want to do bagging by using gunion, learning each model, is it parallel? – Cate
I would assume it is. The pipeline is applied on each subsample and each subsample can be processed in a separate thread. But for a final verdict I advise waiting for one of the mlr3 team members to provide input. – Houselights
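For reference, a minimal sketch of the resampling-level parallelism described above (mlr3 runs resample() iterations in parallel once a future plan is set):

library(future)
plan("multisession")                             # spawn parallel workers
rr <- resample(task, g1, rsmp("cv", folds = 3))  # folds run in parallel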
I added the answer to your 2nd question on how to remove categorical variables. – Houselights
@FiesAtoS Did my answer solve the problem(s) you stated in the question? – Houselights
