Use sklearn's GridSearchCV with a pipeline, preprocessing just once

5

44

I'm using scikit-learn to tune a model's hyperparameters. I'm using a pipeline to chain the preprocessing with the estimator. A simple version of my problem would look like this:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

_ = grid.fit(X=np.random.rand(10, 3),
             y=np.random.randint(2, size=(10,)))

In my case the preprocessing (what would be StandardScaler() in the toy example) is time consuming, and I'm not tuning any of its parameters.

So, when I execute the example, the StandardScaler is executed 8 times: 2 (fit/predict) * 2 CV folds * 2 parameter values. But every time StandardScaler is executed for a different value of the parameter C on the same fold, it returns the same output, so it would be much more efficient to compute it once and then just run the estimator part of the pipeline.
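One way to check how often the scaler is actually refitted is to count the calls. The sketch below is only an illustration: `CountingScaler` is a hypothetical helper (not part of scikit-learn), the data is random, and it assumes the default single-process search (`n_jobs=None`) so that a class-level counter sees every clone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class CountingScaler(StandardScaler):
    """StandardScaler that counts fit() calls (hypothetical helper, illustration only)."""

    n_fits = 0  # class-level counter, shared by the clones GridSearchCV creates

    def fit(self, X, y=None, sample_weight=None):
        CountingScaler.n_fits += 1
        return super().fit(X, y, sample_weight=sample_weight)


rng = np.random.RandomState(0)
X = rng.rand(10, 3)
y = np.array([0, 1] * 5)  # balanced labels so both CV folds contain both classes

grid = GridSearchCV(make_pipeline(CountingScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)
grid.fit(X, y)

# One scaler fit per (candidate, fold) pair
print(CountingScaler.n_fits)
```

With two values of C and cv=2, the scaler is fitted four times, although only two distinct training folds exist. Each of those fits is followed by a transform of the training fold and a transform of the validation fold.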

I can manually split the pipeline into the preprocessing (no hyperparameters tuned) and the estimator. But the preprocessing must be fitted on the training set only, so I would have to implement the splits manually and not use GridSearchCV at all.
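For reference, the manual approach described above could look roughly like this. It is a minimal sketch on random data, not a drop-in replacement for GridSearchCV: the names `C_values`, `scores` and `best_C` are made up for the example, and accuracy is assumed as the metric.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = np.array([0, 1] * 10)

C_values = [0.1, 10.]
scores = {C: [] for C in C_values}

for train_idx, val_idx in StratifiedKFold(n_splits=2).split(X, y):
    # Fit the (expensive) preprocessing once per fold, on the training part only.
    scaler = StandardScaler().fit(X[train_idx])
    X_train = scaler.transform(X[train_idx])
    X_val = scaler.transform(X[val_idx])

    # Reuse the preprocessed fold for every candidate value of C.
    for C in C_values:
        model = LogisticRegression(C=C).fit(X_train, y[train_idx])
        scores[C].append(model.score(X_val, y[val_idx]))

mean_scores = {C: float(np.mean(s)) for C, s in scores.items()}
best_C = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_C)
```

This keeps the validation folds out of the scaler's fit and runs the preprocessing once per fold, at the cost of reimplementing the bookkeeping that GridSearchCV normally does.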

Is there a simple/standard way to avoid repeating the preprocessing while using GridSearchCV?

Jauch answered 12/4, 2017 at 10:10 Comment(1)
scikit-learn.org/stable/modules/compose.htmlMalfeasance
E
58

Update: Ideally, the answer below should not be used as it leads to data leakage as discussed in comments. In this answer, GridSearchCV will tune the hyperparameters on the data already preprocessed by StandardScaler, which is not correct. In most conditions that should not matter much, but algorithms which are too sensitive to scaling will give wrong results.


Essentially, GridSearchCV is also an estimator, implementing fit() and predict() methods, used by the pipeline.

So instead of:

grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

Do this:

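clf = make_pipeline(StandardScaler(),
                    GridSearchCV(LogisticRegression(),
                                 # the grid applies to LogisticRegression directly, so no step prefix
                                 param_grid={'C': [0.1, 10.]},
                                 cv=2,
                                 refit=True))

clf.fit(X, y)       # X, y: your training data
clf.predict(X_new)  # X_new: the data to predict on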

This calls StandardScaler() only once per call to clf.fit(), instead of multiple times as you described.
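A self-contained version of this idea, as a sketch on random data (and still subject to the leakage caveat noted at the top of this answer), might look like the following. It relies on make_pipeline naming each step after its lowercased class name, so the search is reachable as `named_steps['gridsearchcv']`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = np.array([0, 1] * 10)

clf = make_pipeline(StandardScaler(),
                    GridSearchCV(LogisticRegression(),
                                 param_grid={'C': [0.1, 10.]},
                                 cv=2,
                                 refit=True))
clf.fit(X, y)

# The fitted search is the last pipeline step.
search = clf.named_steps['gridsearchcv']
print(search.best_params_)
print(clf.predict(X[:3]))
```

Note that the final step of the pipeline is the GridSearchCV object itself, not the tuned LogisticRegression; the latter is available as `search.best_estimator_`.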

Edit:

Changed refit to True, when GridSearchCV is used inside a pipeline. As mentioned in documentation:

refit : boolean, default=True Refit the best estimator with the entire dataset. If “False”, it is impossible to make predictions using this GridSearchCV instance after fitting.

If refit=False, clf.fit() leaves the pipeline unable to predict, because the GridSearchCV object inside it keeps only the cross-validation results and no fitted estimator. When refit=True, GridSearchCV refits the best-scoring parameter combination on the whole data passed to fit().

So refit=False is appropriate only if you build the pipeline just to see the scores of the grid search. If you want to call clf.predict(), refit=True must be used; otherwise an error is raised.
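To see what a search with refit=False does and does not keep, here is a small sketch on random data. With a single scoring metric the best parameters and the CV results are still recorded, but no best_estimator_ is stored.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = np.array([0, 1] * 10)

grid = GridSearchCV(LogisticRegression(),
                    param_grid={'C': [0.1, 10.]},
                    cv=2,
                    refit=False)
grid.fit(X, y)

print(grid.best_params_)                     # still recorded for a single metric
print(grid.cv_results_['mean_test_score'])   # one mean score per candidate
print(hasattr(grid, 'best_estimator_'))      # False: nothing was refitted
```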

Eisen answered 12/4, 2017 at 10:21 Comment(23)
I didn't think about using GridSearchCV in the pipe itself, sounds like a brilliant idea. Thanks a lot!Jauch
@MarcGarcia But do make sure to turn the refit=True, else it will throw an error, when calling clf.predict()Eisen
@MarcGarcia Edited the answer to reflect the sameEisen
Doesn't this technique use all the data in the StandardScaler() instead of just the training set? I don't see how it avoids doing the splits manually.Ensample
@imad3v No. It will only set the scales according to the data given in fit(), and then use those scales to scale the data given in predict(), without fitting on that data. Hope you get the point. Please ask if not.Eisen
@VivekKumar Ok I see that. But then during fit(), GridSearchCV will tune the hyperparameter by a CV on the data preprocessed by StandardScaler(), so StandardScaler() will also be fitted on the validation set of GridSearchCV (not the test set passed to predict()), which isn't correct for me because the validation set shouldn't take part in fitting the preprocessing.Ensample
@VictorDeplasse Yes, I get your point. That is one caveat of using this approach. Thanks. I will update the answer for it.Eisen
@VivekKumar I tried the above solution for svc in the following manner: param_grid = {'SVC__C': [0.01, 0.1, 1],'SVC__gamma': [0.001, 0.01, 0.1, 1]} pipe = make_pipeline(Normalizer(), GridSearchCV(SVC(), param_grid = param_grid, cv=10, refit=True)) pipe.fit(X_train,y_train) gives the following error: ValueError: Invalid parameter SVC for estimator SVC Can you tell me how I can change the param_grid as I think that's where the problem is?Laccolith
@ShashwatSiddhant param_grid in your case goes inside the GridSearchCV. It has nothing to do with make_pipeline here. So in your case, param_grid should only contain 'C' and 'gamma'.Eisen
What would happen if we pass the Pipeline memory parameter instead?Freeloader
@Freeloader I'm sorry but I could not understand. Please describe in detail and if possible post a new question.Eisen
@VivekKumar sklearn.pipeline.Pipeline has a memory parameter that can be specified to cache the fitted transformers. I was wondering if that could be used to cache the fitted pipeline for each combination of hyperparameters, instead of passing GridSearchCV inside the pipeline, to avoid the problem of the scaler still being fitted on the validation folds.Freeloader
@Freeloader Ah ok. I understand now. Yes that can be done. But at the time of writing this answer it was not in stable scikit build I think. And there were some issues in how the pipeline will optimize them. See the answer below.Eisen
Does this approach work for anyone? I am getting some unexpected results...Treasurehouse
@teter123f Which approach are you talking about? The one present in answer or the one discussed in comments?Eisen
@VivekKumar The one accepted as the answer. Although, i think it might be working. I just thought that it doesn't work because Victor Deplasse's answer below discusses github issues.Treasurehouse
@VivekKumar for instance, I ran a make_pipeline with a couple feature transformations and then i had a gridsearchCV for the last extratreesregressor estimator. Training took quite some time - as expected - but i get a prediction R-squared that is much lower than the R-squared i get using a model I manually built that has the same hyperparameters as one set inside the GridSearchCV. Additionally, my Pipeline object says the final estimator is GridSearchCV instead of ExtratreesRegressor.Treasurehouse
@VivekKumar If we don't want a data leak we should not use this approach of placing Gridsearch inside a pipelineDogcatcher
@Dogcatcher Yes, I agree. This I have already mentioned on top of the answer.Eisen
The alternative strategy would be to perform hyperparameter tuning separately using grid search and cross-validation: 1. Get the best parameters. 2. Create a pipeline (pln) with the scaler and the classifier (mine is logistic regression). 3. Pass the best parameters to the classifier in pln. 4. pln.fit(train, y) 5. pred = pln.predict(test) 6. proba = pln.predict_proba(test) 7. rocauc = roc_auc_score(y_test, proba[:, 1]) Hopefully rocauc is not 1, since a perfect score would suggest a data leak. I am passing the pipeline to GridSearchCV and getting a ROC AUC score of 1, which is what I am trying to solve now.Dogcatcher
This is not generally the proper way to do it. Instead, the pipeline needs to go into GridSearchCV. See this paper for an explanation why your approach can be problematic, e.g. in the case of resampling: researchgate.net/publication/…Tinnitus
what about this scikit-learn.org/stable/modules/compose.html?Malfeasance
@Malfeasance Are you talking about Caching in the linked page?Eisen
R
16

For those who stumbled upon a slightly different problem, which I had as well.

Suppose you have this pipeline:

classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=SEED, n_jobs=-1))])

Then, when specifying parameters, you need to prefix them with 'clf__' (the step name you used for your estimator, followed by a double underscore). So the parameter grid is going to be:

params={'clf__max_features':[0.3, 0.5, 0.7],
        'clf__min_samples_leaf':[1, 2, 3],
        'clf__max_depth':[None]
        }
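
If you are unsure which names are valid, one option is to list them with get_params(). The snippet below is a small sketch using a simplified version of the pipeline above (default vectorizer settings and random_state=0 are assumptions made for brevity).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0))])

# Every tunable name has the form '<step name>__<parameter name>'.
clf_params = sorted(name for name in classifier.get_params()
                    if name.startswith('clf__'))
print(clf_params)
```

Any key printed here, such as 'clf__max_depth', can be used in the parameter grid passed to GridSearchCV.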
Radiolocation answered 28/3, 2019 at 15:31 Comment(0)
E
5

It is not possible to do this in the current version of scikit-learn (0.18.1). A fix has been proposed on the GitHub project:

https://github.com/scikit-learn/scikit-learn/issues/8830

https://github.com/scikit-learn/scikit-learn/pull/8322

Ensample answered 13/5, 2017 at 11:14 Comment(0)
A
2

Use the memory argument to make_pipeline, e.g. together with a temporary directory:


import shutil
import tempfile

cache_dir = tempfile.mkdtemp()
... make_pipeline(..., memory=cache_dir) ...

# after GridSearchCV
shutil.rmtree(cache_dir)
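
Filled in with the toy pipeline from the question, this could look like the sketch below. The data is random, and the try/finally cleanup is just one possible way to make sure the temporary directory is removed.

```python
import shutil
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = np.array([0, 1] * 10)

cache_dir = tempfile.mkdtemp()
try:
    # Fitted transformers are cached on disk, keyed on their parameters and input data.
    pipe = make_pipeline(StandardScaler(), LogisticRegression(), memory=cache_dir)
    grid = GridSearchCV(pipe,
                        param_grid={'logisticregression__C': [0.1, 10.]},
                        cv=2)
    grid.fit(X, y)
    best_params = grid.best_params_
finally:
    shutil.rmtree(cache_dir)  # remove the cached transformers

print(best_params)
```

Since the pipeline stays inside GridSearchCV, the scaler is only ever fitted on training folds. The cache lets later candidates reuse the scaler already fitted on the same fold; transforming the validation fold at scoring time is not cached.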
Ables answered 26/1 at 14:32 Comment(0)
T
1

I joined the party late, but I brought a new solution/insight using Pipeline():

  • sub-pipeline containing your model (regression/classifier) as a single component
  • main pipeline made of routine components:
    • pre-processing component e.g., scaler, dimension reduction, etc.
    • your refitted GridSearchCV(regressor, param) with the desired/best params for your model (Note: don't forget to set refit=True), based on @Vivek Kumar's remark
# Build an end-to-end pipeline: tune the regression model with a grid search,
# then chain the fitted search after the preprocessing in the main pipeline.
# The held-out test set stays out of training; note that the scaler in the main
# pipeline is still fitted on all of X_train, including the CV validation folds.

# Create and tune the sub-pipeline
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler

sgd_subpipeline = Pipeline(steps=[#('scaler', MinMaxScaler()), # better to not rescale internally
                                  ('SGD',    SGDRegressor(random_state=0)),
])

# Define the hyperparameter grid
param_grid = {
    'SGD__loss':     ['squared_error', 'epsilon_insensitive', 'squared_epsilon_insensitive', 'huber'],
    'SGD__penalty':  ['l2', 'l1', 'elasticnet'],
    'SGD__alpha':    [0.0001, 0.001, 0.01],
    'SGD__l1_ratio': [0.15, 0.25, 0.5]
}

# Perform grid search
grid_search = GridSearchCV(sgd_subpipeline, param_grid, cv=5, n_jobs=-1, verbose=True, refit=True)
grid_search.fit(X_train, y_train)

# Get the best model
best_sgd_reg = grid_search.best_estimator_

# Print the best hyperparameters
print('=========================================[Best Hyperparameters info]=====================================')
print(grid_search.best_params_)

# summarize best
print('Best CV score (R^2 by default): %.3f' % grid_search.best_score_)
print('Best Config: %s' % grid_search.best_params_)
print('==========================================================================================================')

# Create the main pipeline by chaining the refitted GridSearchCV sub-pipeline

sgd_pipeline = Pipeline(steps=[('scaler', MinMaxScaler()), # better to rescale externally
                               ('SGD',    grid_search),
])

# Fit the best model on the training data within pipeline (like fit any model/transformer): pipe.fit(traindf[features], traindf[labels]) #X, y

sgd_pipeline.fit(X_train, y_train)

#--------------------------------------------------------------
# Displaying a Pipeline with a Preprocessing Step and Regression
from sklearn import set_config
set_config(display="text")


Alternatively, you can use TransformedTargetRegressor (specifically if you need to descale y, as @mloning commented) and chain this component, including your regression model. Note:

  • you don't need to set the transformer argument unless you need descaling; in that case, check the related posts and its score method
  • Pay attention to this remark about not scaling here since:

... With scaling y you actually lose your units....

  • Here, it is recommended to:

... Do the transformation outside the pipeline. ...

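# Build an end-to-end pipeline: tune the regression model with a grid search,
# then wrap the fitted search in a TransformedTargetRegressor inside the main pipeline.
# The held-out test set stays out of training; note that the scaler in the main
# pipeline is still fitted on all of X_train, including the CV validation folds.

# Create and tune the sub-pipeline
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler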

sgd_subpipeline = Pipeline(steps=[#('scaler', MinMaxScaler()), # better to not rescale internally
                                  ('SGD',    SGDRegressor(random_state=0)),
])

# Define the hyperparameter grid
param_grid = {
    'SGD__loss':     ['squared_error', 'epsilon_insensitive', 'squared_epsilon_insensitive', 'huber'],
    'SGD__penalty':  ['l2', 'l1', 'elasticnet'],
    'SGD__alpha':    [0.0001, 0.001, 0.01],
    'SGD__l1_ratio': [0.15, 0.25, 0.5]
}

# Perform grid search
grid_search = GridSearchCV(sgd_subpipeline, param_grid, cv=5, n_jobs=-1, verbose=True, refit=True)
grid_search.fit(X_train, y_train)

# Get the best model
best_sgd_reg = grid_search.best_estimator_

# Print the best hyperparameters
print('=========================================[Best Hyperparameters info]=====================================')
print(grid_search.best_params_)

# summarize best
print('Best CV score (R^2 by default): %.3f' % grid_search.best_score_)
print('Best Config: %s' % grid_search.best_params_)
print('==========================================================================================================')



# Create the main pipeline using sub-pipeline made of TransformedTargetRegressor component
from sklearn.compose import TransformedTargetRegressor

TTR_sgd_pipeline = Pipeline(steps=[('scaler', MinMaxScaler()), # better to rescale externally
                                   #('SGD', SGDRegressor()),
                                    ('TTR', TransformedTargetRegressor(regressor= grid_search, #SGDRegressor(),
                                                                       #transformer=MinMaxScaler(),
                                                                       #func=np.log,
                                                                       #inverse_func=np.exp,
                                                                       check_inverse=False))
])



# Fit the best model on the training data within pipeline (like fit any model/transformer): pipe.fit(traindf[features], traindf[labels]) #X, y
#best_sgd_pipeline.fit(X_train, y_train)
TTR_sgd_pipeline.fit(X_train, y_train)

#--------------------------------------------------------------
# Displaying a Pipeline with a Preprocessing Step and Regression
from sklearn import set_config
set_config(display="diagram")


Tara answered 7/7, 2023 at 17:32 Comment(0)
