Using scikit Pipeline for testing models but preprocessing data only once

Suppose I have a pipeline that preprocesses my data and has an estimator at the end. If I want to change only the estimator/model in the last step of the pipeline, how do I do that without preprocessing the same data all over again? Here is a code example:

pipe = make_pipeline(
    ColumnSelector(columns),
    CategoricalEncoder(categories),
    FunctionTransformer(pd.get_dummies, validate=False),
    StandardScaler(),
    LogisticRegression(),
)

Now I want to change the model to use Ridge or some other model instead of LogisticRegression. How do I do this without doing the preprocessing all over again?

EDIT: Can I get my transformed data out of a pipeline like the following?

pipe = make_pipeline(
        ColumnSelector(columns),
        CategoricalEncoder(categories),
        FunctionTransformer(pd.get_dummies, validate=False),
        StandardScaler()
    )
Supplant answered 20/11, 2017 at 5:52 Comment(7)
Use a jupyter notebook.Subbasement
@cᴏʟᴅsᴘᴇᴇᴅ Yeah running each step in a Jupyter notebook is what I have been doing, and then aggregating everything into a pipeline with the model. What I want to know is that can I extract transformed data from the pipeline before the logistic regression step?Supplant
Maybe separate the last step from the pipeline? I'm not sure but I think it could be done.Subbasement
As @cᴏʟᴅsᴘᴇᴇᴅ said, you can take out the last step from the pipeline. But if you want to send the whole pipeline into some other scikit-learn function like GridSearchCV or cross_val_score, then currently it's not possible. It's in active development though. See especially thisButterflies
@Vivek oh okay. Thanks. That's greatSupplant
@gopi1410, does below answer with caching, or the linked answer with GridSearchCV as a pipeline step suit your problem?Khan
@Supplant still interested, if any of the below solves your problem. Regarding your edit: I'm not completely sure what you mean by "Can I get my transformed data from a pipeline". What's wrong with: X_transformed = pipe.fit_transform(X), and then going with regular GridSearchCV from there, if you want to separate it?Khan

For the case that you have computationally expensive transformers, you can use caching. As you didn't provide your transformers, here is an extension of the sklearn example from the link, where two models are grid searched with a cached pipeline:

from tempfile import mkdtemp
from shutil import rmtree
from joblib import Memory  # sklearn.externals.joblib is removed in recent sklearn
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits

# Create a temporary folder to store the transformers of the pipeline
cachedir = mkdtemp()
memory = Memory(location=cachedir, verbose=10)  # `cachedir=` was renamed to `location=`

# the pipeline
pipe = Pipeline([('reduce_dim', PCA()),
                ('classify', LinearSVC())],
                memory=memory)
# models to try
param_grid = {"classify": [LinearSVC(), ElasticNet()]}

# do the gridsearch on the models
grid = GridSearchCV(pipe, param_grid=param_grid)
digits = load_digits()
grid.fit(digits.data, digits.target)

# delete the temporary cache before exiting
rmtree(cachedir)

Edit:

As your question focuses on models, and that question focuses on parameters, I wouldn't consider it an exact duplicate. However, the solution proposed there, in combination with the param_grid set up as here, would also be a good, maybe even better solution, depending on your exact problem.
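Alternatively, as mentioned in the comments, you can split the preprocessing off into its own pipeline and call fit_transform once. Here is a minimal sketch of that idea, using StandardScaler and PCA as stand-ins for your custom transformers (ColumnSelector, CategoricalEncoder, etc. are not shown since they weren't provided):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

# preprocessing-only pipeline: no estimator at the end
preprocess = make_pipeline(StandardScaler(), PCA(n_components=20))

# run the (possibly expensive) preprocessing exactly once
X_transformed = preprocess.fit_transform(X)

# reuse the already-transformed data for every candidate model
scores = {}
for model in [LogisticRegression(max_iter=1000), RidgeClassifier()]:
    model.fit(X_transformed, y)
    scores[type(model).__name__] = model.score(X_transformed, y)
    print(type(model).__name__, scores[type(model).__name__])
```

Note that with this approach the preprocessing is fit on all of X, so if you later cross-validate the models on X_transformed, the transformers have seen the validation folds; the cached-pipeline approach above avoids that leakage.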

Khan answered 20/11, 2017 at 10:7 Comment(0)

My understanding is that only the fitted pipeline is saved to the cache, and not the data, so this solution doesn't achieve the goal of preprocessing the data only once.

I haven't been able to find any features of sklearn that facilitate data caching. A good implementation would be to separately cache the output of each call to fit(), transform(), and fit_transform(), so that the underlying data cache is read each time the corresponding output object is accessed.

This implementation would make sense only if the output object is an iterable, in which case every call to iter(cached_output) would open cached_output's underlying cache file(s) for reading.

I just found cachetools; it might work.

Earthshaker answered 31/5, 2022 at 19:10 Comment(0)
