GridSearchCV - XGBoost - Early Stopping

I am trying to do a hyperparameter search using scikit-learn's GridSearchCV on XGBoost. During the grid search I'd like it to stop early, since that reduces search time drastically and (I expect) gives better results on my prediction/regression task. I am using XGBoost via its scikit-learn API.

    model = xgb.XGBRegressor()
    GridSearchCV(model, paramGrid, verbose=verbose, fit_params={'early_stopping_rounds': 42}, cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]), n_jobs=n_jobs, iid=iid).fit(trainX, trainY)

I tried to pass the early-stopping parameters using fit_params, but then it throws this error, which is essentially caused by the lack of a validation set, which early stopping requires:

/opt/anaconda/anaconda3/lib/python3.5/site-packages/xgboost/callback.py in callback(env=XGBoostCallbackEnv(model=<xgboost.core.Booster o...teration=4000, rank=0, evaluation_result_list=[]))
    187         else:
    188             assert env.cvfolds is not None
    189 
    190     def callback(env):
    191         """internal function"""
--> 192         score = env.evaluation_result_list[-1][1]
        score = undefined
        env.evaluation_result_list = []
    193         if len(state) == 0:
    194             init(env)
    195         best_score = state['best_score']
    196         best_iteration = state['best_iteration']

How can I apply GridSearchCV to XGBoost while using early_stopping_rounds?

Note: the model works without grid search, and GridSearchCV works without fit_params={'early_stopping_rounds': 42}.

Wrens answered 24/3, 2017 at 7:15 Comment(1)
GridSearchCV cannot perform a correct grid search while using early stopping because it will not set the eval_set validation set for us. Instead, we must grid search manually; see the sketch below.Antimony
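For illustration, here is a minimal sketch of such a manual search. The toy data, the grid, and n_splits=3 are made up for the example, and it assumes an xgboost version whose fit() accepts early_stopping_rounds, as in the answers below; each fold's hold-out split doubles as the early-stopping eval set:

import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import ParameterGrid, TimeSeriesSplit

# toy regression data, for illustration only
X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.arange(100, dtype=float)

best_score, best_params = float("inf"), None
for params in ParameterGrid({"subsample": [0.5, 0.8]}):
    fold_scores = []
    for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
        model = xgb.XGBRegressor(n_estimators=1000, **params)
        # the fold's hold-out split serves as the early-stopping eval set
        model.fit(X[train_idx], y[train_idx],
                  eval_set=[(X[val_idx], y[val_idx])],
                  eval_metric="mae",
                  early_stopping_rounds=42,
                  verbose=False)
        fold_scores.append(
            mean_absolute_error(y[val_idx], model.predict(X[val_idx])))
    if np.mean(fold_scores) < best_score:
        best_score, best_params = np.mean(fold_scores), params

print(best_params, best_score)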

When using early_stopping_rounds you also have to pass eval_metric and eval_set as input parameters to the fit method. Early stopping works by calculating the error on an evaluation set: the error has to improve at least once within every early_stopping_rounds rounds, otherwise the generation of additional trees is stopped early.

See the documentation of XGBoost's fit method for details.

Here is a minimal, fully working example:

import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit

cv = 2

trainX = [[1], [2], [3], [4], [5]]
trainY = [1, 2, 3, 4, 5]

# these are the evaluation sets
testX = trainX 
testY = trainY

paramGrid = {"subsample" : [0.5, 0.8]}

fit_params={"early_stopping_rounds":42, 
            "eval_metric" : "mae", 
            "eval_set" : [[testX, testY]]}

model = xgb.XGBRegressor()
gridsearch = GridSearchCV(model, paramGrid, verbose=1,
         fit_params=fit_params,
         # pass the splitter itself; .get_n_splits() returns a plain integer,
         # which would make GridSearchCV fall back to ordinary KFold
         cv=TimeSeriesSplit(n_splits=cv))
gridsearch.fit(trainX, trainY)
Unseasoned answered 25/3, 2017 at 8:23 Comment(7)
Thanks for the reply, it works. But giving a pre-defined eval_set is against the nature of cross-validation, I guess.Wrens
I guess what you mean is that in real applications you have to make sure the eval_set and the train set are not overlapping (or even identical, as they are here) - I should have added that. I used the train set just for the sake of simplicity. Early stopping based on the training data does not prevent overfitting.Unseasoned
@glao: the eval set should be the hold-out set of the cross-validation process to make everything work as intended.Holmquist
nowadays "fit_params" is not recommendable because it is going to be deprecated.Glyptodont
Thanks @MichaelM, and how exactly can we do that? Any helpSubstantialize
@MichaelM He is right. valid_set should be a hold-out set.Alchemy
@Wrens I think you are right. If we perform CV, we do not need a hold-out validation set. I think CV is designed to optimize traditional train-vaild split method.Jerlenejermain

An update to @glao's answer and a response to @Vasim's comment/question, as of sklearn 0.21.3 (note that fit_params has been removed from the instantiation of GridSearchCV; fit parameters are now passed directly to the fit() method; also, the import specifically pulls in the sklearn wrapper module from xgboost):

import xgboost.sklearn as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit

cv = 2

trainX = [[1], [2], [3], [4], [5]]
trainY = [1, 2, 3, 4, 5]

# these are the evaluation sets
testX = trainX 
testY = trainY

paramGrid = {"subsample" : [0.5, 0.8]}

fit_params={"early_stopping_rounds":42, 
            "eval_metric" : "mae", 
            "eval_set" : [[testX, testY]]}

model = xgb.XGBRegressor()

gridsearch = GridSearchCV(model, paramGrid, verbose=1,
         # pass the splitter itself; .get_n_splits() returns a plain integer,
         # which would make GridSearchCV fall back to ordinary KFold
         cv=TimeSeriesSplit(n_splits=cv))

gridsearch.fit(trainX, trainY, **fit_params)
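
Once fitted, the usual GridSearchCV attributes are available for inspecting the result, for example:

print(gridsearch.best_params_)   # winning point on the grid
print(gridsearch.best_score_)    # its mean cross-validated score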
Comp answered 16/8, 2019 at 18:59 Comment(8)
Hi - can this be done using StratifiedKFold as well?Kauri
@Sandeep: yup, that's actually the default for classifiers if you choose to simply specify the cv parameter in GridSearchCV as an integer (indicating how many folds you want to use). I'm afraid I'm not too familiar with the TimeSeriesSplit method though, so if you want to use that you should check out the docs.Comp
thanks for the reply, this solution was what i had been looking for.Wrens
good idea, just one question, xgboost will use a different validation set for each cv to check for early stopping?Veilleux
Is it intended that the training and evaluation sets are the same? I.e., you set testX = trainX.Formative
@YikeLu, I think I was just being lazy by not making a set of fake other arrays for the test data :) Sorry for the confusion.Comp
@Comp no problem, it's more the docs and behavior that are confusing. I have just run with early_stopping_rounds using the xgb.cv method and it does NOT ask for an eval set (I'm assuming it just uses the CV folds), and in fact does not require entry of eval_metric either; it just uses the objective by default. (Edited/reposted to remove a point about the return value which I figured out on my own).Formative
I don't think that this solution works as asked in the OP. It seems to use the same validation set for early stopping, not the CV fold.Holmquist

Here's a solution that works in a Pipeline with GridSearchCV. The challenge occurs when you have a pipeline that needs to pre-process your training data - for example, when X is a collection of text documents and you need TfidfVectorizer to vectorize them.

Override the XGBRegressor.fit() or XGBClassifier.fit() Function

  • This step uses train_test_split() to select the specified number of validation records from X for the eval_set and then passes the remaining records along to fit().
  • A new parameter eval_test_size is added to .fit() to control the number of validation records. (See the train_test_split test_size documentation.)
  • **kwargs passes along any other parameters added by the user for the XGBRegressor.fit() function.
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import train_test_split

class XGBRegressor_ES(XGBRegressor):
    
    def fit(self, X, y, *, eval_test_size=None, **kwargs):
        
        # default: train on all records, with no early-stopping eval set
        X_train, y_train = X, y
        
        if eval_test_size is not None:
        
            params = super(XGBRegressor, self).get_xgb_params()
            
            # carve the eval_set out of X so it is never trained on
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=eval_test_size, random_state=params['random_state'])
            
            eval_set = [(X_test, y_test)]
            
            # Could add (X_train, y_train) to eval_set 
            # to get .evals_result() for both train and test
            #eval_set = [(X_train, y_train), (X_test, y_test)] 
            
            kwargs['eval_set'] = eval_set
            
        return super(XGBRegressor_ES, self).fit(X_train, y_train, **kwargs) 

Example Usage

Below is a multi-step pipeline that includes multiple transformations of X. The pipeline's fit() function passes the new evaluation parameter to the XGBRegressor_ES class above as xgbr__eval_test_size=200. In this example:

  • X_train contains text documents passed to the pipeline.
  • XGBRegressor_ES.fit() uses train_test_split() to select 200 records from X_train for the validation set and early stopping. (This could also be a fraction, such as xgbr__eval_test_size=0.2.)
  • The remaining records in X_train are passed along to XGBRegressor.fit() for the actual fit().
  • Early stopping may now occur after 75 rounds without improvement, for each CV fold in a grid search.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.model_selection import GridSearchCV
   
xgbr_pipe = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                     ('vt', VarianceThreshold()),
                     # with_mean=False keeps the sparse TF-IDF matrix sparse;
                     # centering a sparse matrix would raise an error
                     ('scaler', StandardScaler(with_mean=False)),
                     ('Sp', SelectPercentile()),
                     ('xgbr', XGBRegressor_ES(n_estimators=2000,
                                              objective='reg:squarederror',
                                              eval_metric='mae',
                                              learning_rate=0.0001,
                                              random_state=7))    ])

# train_idxs: a DataFrame holding the raw text and the regression target
X_train = train_idxs['f_text'].values
y_train = train_idxs['Pct_Change_20'].values

Example Fitting the Pipeline:

%time xgbr_pipe.fit(X_train, y_train, 
                    xgbr__eval_test_size=200,
                    xgbr__eval_metric='mae', 
                    xgbr__early_stopping_rounds=75)

Example Fitting GridSearchCV:

learning_rate = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3]
param_grid = dict(xgbr__learning_rate=learning_rate)

grid_search = GridSearchCV(xgbr_pipe, param_grid, scoring="neg_mean_absolute_error", n_jobs=-1, cv=10)
grid_result = grid_search.fit(X_train, y_train, 
                    xgbr__eval_test_size=200,
                    xgbr__eval_metric='mae', 
                    xgbr__early_stopping_rounds=75)
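
To see where early stopping actually kicked in for the winning configuration, the refit best estimator can be inspected (a small sketch; it assumes your xgboost version exposes best_iteration on a wrapper fitted with early stopping):

best_xgbr = grid_result.best_estimator_.named_steps['xgbr']
print(grid_result.best_params_)
# boosting round at which the eval-set metric stopped improving
print(best_xgbr.best_iteration)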
Equilibrist answered 1/4, 2022 at 17:23 Comment(0)
