Using mlr3-pipelines to impute data and encode factor columns in GraphLearner?
I have a few questions regarding the use of mlr3-pipelines. My goal is to create a pipeline that combines three graphs:

1 - A graph to process categorical variables: new-level imputation => encoding

imp_cat = po("imputenewlvl", param_vals = list(affect_columns = selector_name(my_cat_variables)))
encode  = po("encode",       param_vals = list(affect_columns = selector_name(my_cat_variables)))
cat = imp_cat %>>% encode

2 - A graph to process a subset of numeric variables: mean imputation => standardization

imp_mean = po("imputemean", param_vals = list(affect_columns = selector_name(my_first_set_of_numeric_variables)))
scale = po("scale", param_vals = list(affect_columns = selector_name(my_first_set_of_numeric_variables)))
num_mean = imp_mean %>>% scale

3 - A graph to process another subset of numeric variables: median imputation => min-max scaling

imp_median = po("imputemedian", param_vals = list(affect_columns = selector_name(my_second_set_of_numeric_variables)))
min_max = po("scalerange", param_vals = list(affect_columns = selector_name(my_second_set_of_numeric_variables)))
num_median = imp_median %>>% min_max

I then combine these graphs with the featureunion PipeOp:

graph = po("copy", 3) %>>%
   gunion(list(cat, num_mean, num_median )) %>>%
   po("featureunion")

and finally add a learner in a GraphLearner:

g1 = GraphLearner$new(graph %>>% po(lrn("classif.ranger")))

I have some missing values in my data, hence the imputers in each graph, and I have a binary classification task.

my_task = TaskClassif$new(id="classif", backend = data, target = "my_target")
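
The missing values can be confirmed directly on the task (Task$missings() reports the per-column NA counts):

my_task$missings()
# non-zero counts for the columns containing NAs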

In theory, I shouldn't get missing-value errors when I start training:

g1$train(my_task)

But I get different errors depending on the learner I choose. If I use ranger, for example, I get this error:

Error: Missing data in columns: ....

If I use svm, glmnet or xgboost, I get a problem due to the categorical variables: Error: ... has the following unsupported feature types: factor...

With my pipeline, I shouldn't have any categorical variables and I shouldn't have any missing values, so I don't see how to overcome this problem.

1 - I used an imputer in each graph; why do some algorithms still tell me there are missing values?

2 - How do I remove the categorical variables once they are encoded? Some algorithms do not support this type of variable.

Updated

I suspect that the modifications made by the pipeline are not persisted. In other words, the learners (svm, ranger, ...) train on the original task, not on the one transformed by the pipeline.
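
One way I could test this is to inspect what a single branch outputs (a sketch using the graphs defined above):

cat$train(my_task)[[1]]$missings()
# if columns outside my_cat_variables still report NAs here,
# each branch only fixes its own subset of columns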

Cate answered 10/3, 2020 at 14:53

Answer to the first question

I will try to explain why missing values remain in your workflow.

Let's load the required packages:

library(mlr3) 
library(mlr3pipelines)
library(mlr3learners)
library(mlr3tuning)
library(paradox)

Let's take the pima task, which has missing values:

task <- tsk("pima")
task$missings()
diabetes      age  glucose  insulin     mass pedigree pregnant pressure  triceps 
       0        0        5      374       11        0        0       35      227 

Since there are no categorical columns, I will convert triceps into one:

hb <- po("histbin",
         param_vals = list(affect_columns = selector_name("triceps")))

Now impute a new level and encode:

imp_cat <- po("imputenewlvl",
              param_vals = list(affect_columns = selector_name("triceps")))
encode <- po("encode",
             param_vals = list(affect_columns = selector_name("triceps")))

cat <- hb %>>% 
  imp_cat %>>%
  encode

When you use cat on the task:

cat$train(task)[[1]]$data()
#big output

Not only the columns you selected for transformation are returned, but also all the others.
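
To make this visible, check missingness rather than printing the whole table (a quick sketch):

cat$train(task)[[1]]$missings()
# the encoded triceps columns are complete, but glucose, insulin, mass and
# pressure should still report their original NA counts: this branch never imputed them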

The same happens for num_mean and num_median.

Let's create them:

imp_mean <- po("imputemean", param_vals = list(affect_columns = selector_name(c("glucose", "mass"))))
scale <- po("scale", param_vals = list(affect_columns = selector_name(c("glucose", "mass"))))
num_mean <- imp_mean %>>% scale


imp_median <- po("imputemedian", param_vals = list(affect_columns = selector_name(c("insulin", "pressure"))))
min_max <- po("scalerange", param_vals = list(affect_columns = selector_name(c("insulin", "pressure"))))
num_median <- imp_median %>>% min_max

Check what num_median does:

num_median$train(task)[[1]]$data()
#output
     diabetes    insulin  pressure age glucose mass pedigree pregnant triceps
  1:      pos 0.13341346 0.4897959  50     148 33.6    0.627        6      35
  2:      neg 0.13341346 0.4285714  31      85 26.6    0.351        1      29
  3:      pos 0.13341346 0.4081633  32     183 23.3    0.672        8      NA
  4:      neg 0.09615385 0.4285714  21      89 28.1    0.167        1      23
  5:      pos 0.18509615 0.1632653  33     137 43.1    2.288        0      35
 ---                                                                         
764:      neg 0.19951923 0.5306122  63     101 32.9    0.171       10      48
765:      neg 0.13341346 0.4693878  27     122 36.8    0.340        2      27
766:      neg 0.11778846 0.4897959  30     121 26.2    0.245        5      23
767:      pos 0.13341346 0.3673469  47     126 30.1    0.349        1      NA
768:      neg 0.13341346 0.4693878  23      93 30.4    0.315        1      31

So it did what it was supposed to on the "insulin" and "pressure" columns, but it also returned the rest unchanged.

By copying the data three times and applying one of these three preprocessors to each copy, every branch returns its transformed columns together with all the remaining columns untouched, three times over. After po("featureunion") merges the branches, the task therefore still carries untransformed versions of columns, including their missing values, which is what the learner then complains about.

What you should do is:

graph <- cat %>>%
  num_mean %>>%
  num_median

cat transforms its selected columns and returns all of them; then num_mean transforms its selected columns and returns all, and so on.

graph$train(task)[[1]]$data()

Looks good to me.
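
As a quick sanity check (a sketch), every incomplete column is now handled by one of the three chained steps, so there should be no NAs left:

graph$train(task)[[1]]$missings()
# expect 0 for every column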

And more importantly

g1 <- GraphLearner$new(graph %>>% po(lrn("classif.ranger")))
g1$train(task)

works

Answer to the second question

The answer is to use selector functions, in your case selector_type(). The selector

selector_invert(selector_type("factor"))

should do the trick if applied, via po("select"), just before piping into the learner.
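
For example, a minimal sketch using PipeOpSelect (assuming its selector parameter, and classif.svm from mlr3learners, which needs the e1071 package installed):

# drop every remaining factor column before the learner sees the task
drop_factors <- po("select",
                   param_vals = list(selector = selector_invert(selector_type("factor"))))

g2 <- GraphLearner$new(graph %>>% drop_factors %>>% po(lrn("classif.svm")))
g2$train(task)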

Houselights answered 11/3, 2020 at 11:40
Thank you for taking the time to answer me. Your solution works for me. However, by using copy or gunion, I wanted to parallelize the preprocessing. What do you recommend for doing parallel preprocessing, as we can with a scikit-learn pipeline? Thanks. – Cate
Glad I could help. This is another topic. I doubt parallelization is handled this way in mlr3, and I doubt you would benefit much from it in this use case even if it were. Parallelization can be used when you are performing resampling, where the pipeline is applied in each resample, and each resample is run on a different thread. I don't think parallelization is implemented within a pipeline unless the interfaced packages can use it internally. – Houselights
And if I want to do bagging by using gunion, learning each model, is it parallel? – Cate
I would assume it is. The pipeline is applied on each subsample and each subsample can be processed in a separate thread. But for a final verdict I advise waiting for one of the mlr3 team members to provide input. – Houselights
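For reference, a minimal sketch of the resampling-level parallelism described above (mlr3 runs resample() iterations in parallel once a future plan is set):

library(future)
plan("multisession")                             # spawn parallel workers
rr <- resample(task, g1, rsmp("cv", folds = 3))  # folds run in parallel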
I added the answer to your 2nd question on how to remove categorical variables. – Houselights
@FiesAtoS Did my answer solve the problem(s) you stated in the question? – Houselights
