Is it possible to toggle a certain step in sklearn pipeline?

2

25

I wonder if we can set up an "optional" step in sklearn.pipeline. For example, for a classification problem, I may want to try an ExtraTreesClassifier both with and without a PCA transformation ahead of it. In practice, this could be a pipeline with an extra parameter that toggles the PCA step, so that I can optimize over it via GridSearchCV and similar tools. I don't see such an implementation in the sklearn source, but is there any workaround?

Furthermore, since the valid parameter values of a later step in the pipeline might depend on the parameters of an earlier step (e.g., valid values of ExtraTreesClassifier.max_features depend on PCA.n_components), is it possible to specify such a conditional dependency in sklearn.pipeline and sklearn.grid_search?

Thank you!

Provitamin answered 9/10, 2013 at 3:34 Comment(0)
U
18
  • Pipeline steps cannot currently be made optional in a grid search, but as a quick workaround you could wrap the PCA class in your own OptionalPCA component with a boolean parameter that turns PCA off when requested. You might want to have a look at hyperopt to set up more complex search spaces. I think it has good sklearn integration that supports this kind of pattern by default, but I cannot find the doc anymore. Maybe have a look at this talk.

  • For the dependent parameters problem, GridSearchCV supports trees of parameters to handle this case as demonstrated in the documentation.
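A minimal sketch of both points, assuming a recent scikit-learn where GridSearchCV lives in sklearn.model_selection. OptionalPCA is not part of scikit-learn; it is a hypothetical wrapper written here for illustration, and the use_pca parameter name, the synthetic dataset, and the grid values are all assumptions. The list of dicts is how a "tree" of parameters is expressed: each dict is searched separately, so n_components is only explored when PCA is switched on.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class OptionalPCA(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: applies PCA only when use_pca is True."""

    def __init__(self, use_pca=True, n_components=None):
        self.use_pca = use_pca
        self.n_components = n_components

    def fit(self, X, y=None):
        # Store a fitted PCA, or None when the step is switched off.
        self.pca_ = PCA(n_components=self.n_components).fit(X) if self.use_pca else None
        return self

    def transform(self, X):
        # Pass the data through unchanged when PCA is disabled.
        return X if self.pca_ is None else self.pca_.transform(X)


# Synthetic data, only to make the sketch runnable.
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

pipe = Pipeline([
    ("reduce_dim", OptionalPCA()),
    ("clf", ExtraTreesClassifier(n_estimators=20, random_state=0)),
])

# Each dict is its own sub-grid, so dependent values stay consistent:
# max_features never exceeds the number of features reaching the classifier.
param_grid = [
    {"reduce_dim__use_pca": [False], "clf__max_features": [2, 5, 10]},
    {
        "reduce_dim__use_pca": [True],
        "reduce_dim__n_components": [5],
        "clf__max_features": [2, 5],
    },
]

grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

The first sub-grid contributes three candidates and the second two, so five parameter combinations are evaluated in total.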

Uncouth answered 9/10, 2013 at 7:4 Comment(1)
As a side remark, note that ExtraTreesClassifier.max_features can be a float between 0.0 and 1.0 instead of an integer. This is useful when the actual number of features is variable, as in your case. - Sultry
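To illustrate the comment above, here is a small sketch in which max_features is given as a fraction, so the same grid stays valid whatever PCA.n_components is. The dataset and the grid values are assumptions chosen only to make the example runnable.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data, only to make the sketch runnable.
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

pipe = Pipeline([
    ("reduce_dim", PCA()),
    ("clf", ExtraTreesClassifier(n_estimators=20, random_state=0)),
])

# A float max_features is interpreted as a fraction of the features that
# reach the classifier, so it does not depend on n_components.
param_grid = {
    "reduce_dim__n_components": [3, 6],
    "clf__max_features": [0.3, 0.7, 1.0],
}

grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

With fractions, a single flat grid is enough here; no separate sub-grid per n_components value is needed.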
E
20

From the docs:

Individual steps may also be replaced as parameters, and non-final steps may be ignored by setting them to None:

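from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])
params = dict(reduce_dim=[None, PCA(5), PCA(10)],
              clf=[SVC(), LogisticRegression()],
              clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=params)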
Erlene answered 30/4, 2017 at 22:17 Comment(1)
The most recent documentation states that you should skip non-final steps with the string 'passthrough' instead of None: scikit-learn.org/stable/modules/compose.html - Barmaid
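A sketch of the 'passthrough' variant mentioned in the comment, applied to the ExtraTreesClassifier case from the question. It assumes a scikit-learn version that accepts 'passthrough' as a pipeline step; the dataset and the n_components value are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data, only to make the sketch runnable.
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

pipe = Pipeline([
    ("reduce_dim", PCA()),
    ("clf", ExtraTreesClassifier(n_estimators=20, random_state=0)),
])

# Replace the whole step: 'passthrough' skips it, PCA(...) enables it.
param_grid = {"reduce_dim": ["passthrough", PCA(n_components=5)]}

grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

Here the toggle is simply which object occupies the reduce_dim slot, so no wrapper class is required.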
