Using scikit Pipeline for testing models but preprocessing data only once

Suppose I have a pipeline that preprocesses my data and has an estimator at the end. If I want to change only the estimator/model in the last step of the pipeline, how do I do that without preprocessing the same data all over again? Here is a code example:

pipe = make_pipeline(
    ColumnSelector(columns),
    CategoricalEncoder(categories),
    FunctionTransformer(pd.get_dummies, validate=False),
    StandardScaler(),
    LogisticRegression(),
)

Now I want to change the model to use Ridge or some other model instead of LogisticRegression. How do I do this without doing the preprocessing all over again?

EDIT: Can I get my transformed data out of a pipeline like the following?

pipe = make_pipeline(
        ColumnSelector(columns),
        CategoricalEncoder(categories),
        FunctionTransformer(pd.get_dummies, validate=False),
        StandardScaler()
    )
Supplant answered 20/11, 2017 at 5:52 Comment(7)
Use a jupyter notebook.Subbasement
@cᴏʟᴅsᴘᴇᴇᴅ Yeah running each step in a Jupyter notebook is what I have been doing, and then aggregating everything into a pipeline with the model. What I want to know is that can I extract transformed data from the pipeline before the logistic regression step?Supplant
Maybe separate the last step from the pipeline? I'm not sure but I think it could be done.Subbasement
As @cᴏʟᴅsᴘᴇᴇᴅ said, you can take out the last step from the pipeline. But if you want to send the whole pipeline into some other scikit-learn function like GridSearchCV or cross_val_score, then currently it's not possible. It's in active development though. See especially thisButterflies
@Vivek oh okay. Thanks. That's greatSupplant
@gopi1410, does below answer with caching, or the linked answer with GridSearchCV as a pipeline step suit your problem?Khan
@Supplant still interested, if any of the below solves your problem. Regarding your edit: I'm not completely sure what you mean by "Can I get my transformed data from a pipeline". What's wrong with: X_transformed = pipe.fit_transform(X), and then going with regular GridSearchCV from there, if you want to separate it?Khan

For the case that you have computationally expensive transformers, you can use caching. As you didn't provide your transformers, here is an extension of the sklearn example from the link, where two models are grid searched with a cached pipeline:

from tempfile import mkdtemp
from shutil import rmtree
from joblib import Memory  # sklearn.externals.joblib is removed in recent sklearn
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits

# Create a temporary folder to store the transformers of the pipeline
cachedir = mkdtemp()
memory = Memory(location=cachedir, verbose=10)  # `cachedir=` was renamed to `location=`

# the pipeline
pipe = Pipeline([('reduce_dim', PCA()),
                ('classify', LinearSVC())],
                memory=memory)
# models to try
param_grid = {"classify": [LinearSVC(), ElasticNet()]}

# do the gridsearch on the models
grid = GridSearchCV(pipe, param_grid=param_grid)
digits = load_digits()
grid.fit(digits.data, digits.target)

# delete the temporary cache before exiting
rmtree(cachedir)

Edit:

As your question focuses on models, and that question focuses on parameters, I wouldn't consider it an exact duplicate. However, the solution proposed there, in combination with the param_grid set up as here, would also be a good, maybe even better solution, depending on your exact problem.
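Alternatively, as mentioned in the comments, you can split the preprocessing off into its own pipeline and call fit_transform once. Here is a minimal sketch of that idea, using StandardScaler and PCA as stand-ins for your custom transformers (ColumnSelector, CategoricalEncoder, etc. are not shown since they weren't provided):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

# preprocessing-only pipeline: no estimator at the end
preprocess = make_pipeline(StandardScaler(), PCA(n_components=20))

# run the (possibly expensive) preprocessing exactly once
X_transformed = preprocess.fit_transform(X)

# reuse the already-transformed data for every candidate model
scores = {}
for model in [LogisticRegression(max_iter=1000), RidgeClassifier()]:
    model.fit(X_transformed, y)
    scores[type(model).__name__] = model.score(X_transformed, y)
    print(type(model).__name__, scores[type(model).__name__])
```

Note that with this approach the preprocessing is fit on all of X, so if you later cross-validate the models on X_transformed, the transformers have seen the validation folds; the cached-pipeline approach above avoids that leakage.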

Khan answered 20/11, 2017 at 10:7 Comment(0)

My understanding is that only the fitted pipeline is saved to the cache, and not the data, so this solution doesn't achieve the goal of preprocessing the data only once.

I haven't been able to find any features of sklearn that facilitate data caching. A good implementation would be to separately cache the output of each call to fit(), transform(), and fit_transform(), so that the underlying data cache is read each time the corresponding output object is accessed.

This implementation would make sense only if the output object is an iterable, in which case every call to iter(cached_output) would open cached_output's underlying cache file(s) for reading.

I just found cachetools; it might work.

Earthshaker answered 31/5, 2022 at 19:10 Comment(0)
