I have a few questions regarding the use of mlr3pipelines. My goal is to create a pipeline that combines three graphs:
1 - A graph to process categorical variables: new-level imputation => encoding
imp_cat = po("imputenewlvl", param_vals = list(affect_columns = selector_name(my_cat_variables)))
encode = po("encode", param_vals = list(affect_columns = selector_name(my_cat_variables)))
cat = imp_cat %>>% encode
2 - A graph to process a subset of numeric variables: mean imputation => standardization
imp_mean = po("imputemean", param_vals = list(affect_columns = selector_name(my_first_set_of_numeric_variables)))
scale = po("scale", param_vals = list(affect_columns = selector_name(my_first_set_of_numeric_variables)))
num_mean = imp_mean %>>% scale
3 - A graph to process another subset of numeric variables: median imputation => min-max scaling
imp_median = po("imputemedian", param_vals = list(affect_columns = selector_name(my_second_set_of_numeric_variables)))
min_max = po("scalerange", param_vals = list(affect_columns = selector_name(my_second_set_of_numeric_variables)))
num_median = imp_median %>>% min_max
Then I combine these graphs with a featureunion PipeOp:
graph = po("copy", outnum = 3) %>>%
  gunion(list(cat, num_mean, num_median)) %>>%
  po("featureunion")
and finally I wrap the graph and a learner in a GraphLearner:
g1 = GraphLearner$new(graph %>>% po(lrn("classif.ranger")))
I have some missing values in my data, hence the use of imputers in each graph, and I have a binary classification task:
my_task = TaskClassif$new(id="classif", backend = data, target = "my_target")
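For reference, `data` and the column subsets used above are just placeholders; the setup I am assuming looks roughly like this (all column names here are hypothetical):

library(mlr3)
library(mlr3learners)
library(mlr3pipelines)

# Hypothetical stand-in for my real data: two factor columns and four numeric
# columns, all containing some NAs, plus a binary target.
set.seed(1)
data = data.frame(
  cat_a = factor(sample(c("x", "y", NA), 100, replace = TRUE)),
  cat_b = factor(sample(c("u", "v", NA), 100, replace = TRUE)),
  num_a = ifelse(runif(100) < 0.1, NA, rnorm(100)),
  num_b = ifelse(runif(100) < 0.1, NA, rnorm(100)),
  num_c = ifelse(runif(100) < 0.1, NA, runif(100)),
  num_d = ifelse(runif(100) < 0.1, NA, runif(100)),
  my_target = factor(sample(c("yes", "no"), 100, replace = TRUE))
)

# The column subsets passed to the selectors above are plain character
# vectors of column names:
my_cat_variables = c("cat_a", "cat_b")
my_first_set_of_numeric_variables = c("num_a", "num_b")
my_second_set_of_numeric_variables = c("num_c", "num_d")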
In theory, I shouldn't get any missing-value errors when I start training:
g1$train(my_task)
But I get different errors depending on the learner I choose. If I use, for example, ranger as the learner, I get this error:
Error: Missing data in columns: ....
If I use svm, glmnet or xgboost, I get a problem due to the presence of categorical variables:
Error : has the following unsupported feature types: factor...
After my pipeline, there shouldn't be any categorical variables or missing values left, so I don't see how to overcome this problem.
1 - I used an imputer in each graph, so why do some algorithms tell me that there are still missing values?
2 - How do I remove the categorical variables once they are encoded? Some algorithms do not support this type of variable.
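For question 2, would something like the following be the right way to drop any remaining factor columns after the union? This is only a sketch of what I have in mind (using `po("select")` with a selector from mlr3pipelines); I am not sure it is the intended approach.

# Sketch: append a select step that keeps everything except factor columns.
drop_factors = po("select",
  param_vals = list(selector = selector_invert(selector_type("factor"))))

graph_no_factors = graph %>>% drop_factors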
Update:
I think the modifications made by the pipeline are not being persisted. In other words, the learners (svm, ranger, ...) seem to train on the original task, not on the one transformed by the pipeline.
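To check what the learner actually receives, the preprocessing graph can be trained on its own and its output inspected (a quick sketch, reusing the `graph` and `my_task` objects from above):

# Train just the preprocessing graph; Graph$train() returns a list with one
# element per graph output, here the task produced by the featureunion.
out = graph$train(my_task)[[1]]

out$missings()      # number of missing values per column
out$feature_types   # which feature types (e.g. factor) are still present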
With `copy` and `gunion`, I wanted to parallelize the pre-processing. What do you recommend for doing parallel preprocessing, as one can with a scikit-learn pipeline? Thanks – Cate