GridSearchCV - XGBoost - Early Stopping

Asked 24/3, 2017 at 7:15 Answered 1/4, 2022 at 17:23

Solved python-3.x scikit-learn regression data-science xgboost

I am trying to do a hyperparameter search on XGBoost using scikit-learn's GridSearchCV. During the grid search I'd like training to stop early, since that reduces search time drastically and (I expect) gives better results on my prediction/regression task. I am using XGBoost via its scikit-learn API.

    model = xgb.XGBRegressor()
    GridSearchCV(model, paramGrid, verbose=verbose ,fit_params={'early_stopping_rounds':42}, cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]), n_jobs=n_jobs, iid=iid).fit(trainX,trainY)

I tried to pass the early stopping parameter via fit_params, but then it throws the error below, which is basically caused by the missing validation set that early stopping requires:

/opt/anaconda/anaconda3/lib/python3.5/site-packages/xgboost/callback.py in callback(env=XGBoostCallbackEnv(model=<xgboost.core.Booster o...teration=4000, rank=0, evaluation_result_list=[]))
    187         else:
    188             assert env.cvfolds is not None
    189 
    190     def callback(env):
    191         """internal function"""
--> 192         score = env.evaluation_result_list[-1][1]
        score = undefined
        env.evaluation_result_list = []
    193         if len(state) == 0:
    194             init(env)
    195         best_score = state['best_score']
    196         best_iteration = state['best_iteration']

How can I apply GridSearchCV to XGBoost while using early_stopping_rounds?

Note: the model works without the grid search, and GridSearchCV works without fit_params={'early_stopping_rounds':42}.

Wrens answered 24/3, 2017 at 7:15 Comment(1)

GridSearchCV cannot perform a correct grid search while using early stopping because it will not set the eval_set validation set for us. Instead, we must grid search manually, see this example. – Antimony 15/5 at 23:58

An update to @glao's answer and a response to @Vasim's comment/question, as of sklearn 0.21.3 (note that fit_params has been moved out of the instantiation of GridSearchCV and been moved into the fit() method; also, the import specifically pulls in the sklearn wrapper module from xgboost):

import xgboost.sklearn as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit

cv = 2

trainX= [[1], [2], [3], [4], [5]]
trainY = [1, 2, 3, 4, 5]

# these are the evaluation sets
testX = trainX 
testY = trainY

paramGrid = {"subsample" : [0.5, 0.8]}

fit_params={"early_stopping_rounds":42, 
            "eval_metric" : "mae", 
            "eval_set" : [[testX, testY]]}

model = xgb.XGBRegressor()

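# Note: get_n_splits() returns the integer 2, so this is the same as cv=2.
# Pass the TimeSeriesSplit object itself to get time-ordered folds.
gridsearch = GridSearchCV(model, paramGrid, verbose=1,
         cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]))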

gridsearch.fit(trainX, trainY, **fit_params)

Comp answered 16/8, 2019 at 18:59 Comment(8)

hi - can this be done using stratifiedkfold as well ? – Kauri 18/8, 2019 at 15:25

@Sandeep: yup, that's actually the default if you choose to simply specify the cv parameter in GridSearchCV as an integer (indicating how many folds you want to use). i'm afraid I'm not too familiar with the TimeSeriesSplit method though, so if you want to use that you should check out the docs. – Comp 20/8, 2019 at 15:58

thanks for the reply, this solution was what i had been looking for. – Wrens 12/9, 2019 at 10:41

good idea, just one question, xgboost will use a different validation set for each cv to check for early stopping? – Veilleux 26/8, 2020 at 0:11

Is it intended that the training and evaluation sets are the same? IE, you set testX = trainX. – Formative 17/3, 2022 at 21:46

@YikeLu, I think I was just being lazy by not making a set of fake other arrays for the test data :) Sorry for the confusion. – Comp 19/3, 2022 at 19:27

@Comp no problem, it's more the docs and behavior that are confusing. I have just run with early_stopping_rounds using the xgb.cv method and it does NOT ask for an eval set (I'm assuming it just uses the CV folds), and in fact does not require entry of eval_metric either, it just uses objective by default. (Edit/reposted to remove point about return value which I figured out on my own). – Formative 23/3, 2022 at 1:58

I don't think that this solution works as asked in the OP. It seems to use the same validation set for early stopping, not the CV fold. – Holmquist 19/7, 2023 at 4:51

When using early_stopping_rounds you also have to pass eval_metric and eval_set to the fit method. Early stopping works by computing the error on an evaluation set. The error has to improve at least once within every early_stopping_rounds boosting rounds; otherwise the generation of additional trees is stopped early.
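To make the stopping rule concrete, here is a small pure-Python sketch of the idea. It is not XGBoost's actual implementation; the function name and the error values are made up for illustration, and it assumes a metric where lower is better.

```python
def early_stopping_round(errors, patience):
    """Return (stop_round, best_round) for per-round validation errors.

    Training stops once `patience` consecutive rounds have passed
    without beating the best error seen so far (lower is better).
    """
    best_error = float("inf")
    best_round = -1
    for round_idx, error in enumerate(errors):
        if error < best_error:
            best_error, best_round = error, round_idx
        elif round_idx - best_round >= patience:
            return round_idx, best_round
    # Never triggered: all rounds were run.
    return len(errors) - 1, best_round


# Hypothetical validation errors, one per boosting round.
validation_errors = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 3.05, 2.0]
print(early_stopping_round(validation_errors, patience=3))  # -> (5, 2)
```

Note that the improvement at round 7 is never seen, because training has already stopped at round 5. That is the trade-off when choosing early_stopping_rounds.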

See the documentation of XGBoost's fit method for details.

Here you see a minimal fully working example:

import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit

cv = 2

trainX= [[1], [2], [3], [4], [5]]
trainY = [1, 2, 3, 4, 5]

# these are the evaluation sets
testX = trainX 
testY = trainY

paramGrid = {"subsample" : [0.5, 0.8]}

fit_params={"early_stopping_rounds":42, 
            "eval_metric" : "mae", 
            "eval_set" : [[testX, testY]]}

model = xgb.XGBRegressor()
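# Note: get_n_splits() returns the integer 2, so this is the same as cv=2.
# Pass the TimeSeriesSplit object itself to get time-ordered folds.
gridsearch = GridSearchCV(model, paramGrid, verbose=1,
         fit_params=fit_params,
         cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX,trainY]))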
gridsearch.fit(trainX,trainY)

Unseasoned answered 25/3, 2017 at 8:23 Comment(7)

thanks for reply, it works. but giving pre-defined eval_set is against the nature of the cross validation i guess. – Wrens 31/3, 2017 at 13:14

I guess what you mean is that in real applications you have to make sure eval_set and train set are not overlapping or are the same as here - should have added that. I used the train set just for the sake of simplicity. Early stopping based on the train data does not prevent overfitting. – Unseasoned 31/3, 2017 at 13:25

@glao: the eval set should be the hold-out set of the cross-validation process to make everything work as intended. – Holmquist 23/11, 2017 at 8:58

nowadays "fit_params" is not recommendable because it is going to be deprecated. – Glyptodont 11/12, 2017 at 16:57

Thanks @MichaelM, and how exactly can we do that? Any help – Substantialize 22/3, 2019 at 6:14

@MichaelM He is right. valid_set should be a hold-out set. – Alchemy 2/9, 2019 at 23:59

@Wrens I think you are right. If we perform CV, we do not need a separate hold-out validation set. I think CV is designed to optimize the traditional train-valid split method. – Jerlenejermain 24/11, 2019 at 13:3

Here's a solution that works in a Pipeline with GridSearchCV. The challenge occurs when you have a pipeline that is required to pre-process your training data. For example, when X is a text document and you need TfidfVectorizer to vectorize it.

Override the XGBRegressor or XGBClassifier .fit() Function

This step uses train_test_split() to select the specified number of validation records from X for the eval_set and then passes the remaining records along to fit().
A new parameter eval_test_size is added to .fit() to control the number of validation records (see the train_test_split test_size documentation).
**kwargs passes along any other parameters added by the user for the XGBRegressor.fit() function.

from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import train_test_split

class XGBRegressor_ES(XGBRegressor):

    def fit(self, X, y, *, eval_test_size=None, **kwargs):

        # Train on all records unless a validation split is requested.
        X_train, y_train = X, y

        if eval_test_size is not None:

            # Hold out eval_test_size records for early stopping.
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=eval_test_size, random_state=self.random_state)

            eval_set = [(X_test, y_test)]

            # Could add (X_train, y_train) to eval_set
            # to get .evals_result() for both train and test
            #eval_set = [(X_train, y_train),(X_test, y_test)]

            kwargs['eval_set'] = eval_set

        return super().fit(X_train, y_train, **kwargs)

Example Usage

Below is a multistep pipeline that includes multiple transformations to X. The pipeline's fit() function passes the new evaluation parameter to the XGBRegressor_ES class above as xgbr__eval_test_size=200. In this example:

X_train contains text documents passed to the pipeline.
XGBRegressor_ES.fit() uses train_test_split() to select 200 records from X_train for the validation set and early stopping. (This could also be a percentage such as xgbr__eval_test_size=0.2)
The remaining records in X_train are passed along to XGBRegressor.fit() for the actual fit().
Early stopping may now occur after 75 boosting rounds without improvement on the validation set, for each cv fold in a grid search.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_regression
   
xgbr_pipe = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                     ('vt',VarianceThreshold()),
                     ('scaler', StandardScaler()),
                     ('Sp', SelectPercentile()),
                     ('xgbr',XGBRegressor_ES(n_estimators=2000,
                                             objective='reg:squarederror',
                                             eval_metric='mae',
                                             learning_rate=0.0001,
                                             random_state=7))    ])

X_train = train_idxs['f_text'].values
y_train = train_idxs['Pct_Change_20'].values

Example Fitting the Pipeline:

%time xgbr_pipe.fit(X_train, y_train, 
                    xgbr__eval_test_size=200,
                    xgbr__eval_metric='mae', 
                    xgbr__early_stopping_rounds=75)

Example Fitting GridSearchCV:

learning_rate = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3]
param_grid = dict(xgbr__learning_rate=learning_rate)

grid_search = GridSearchCV(xgbr_pipe, param_grid, scoring="neg_mean_absolute_error", n_jobs=-1, cv=10)
grid_result = grid_search.fit(X_train, y_train, 
                    xgbr__eval_test_size=200,
                    xgbr__eval_metric='mae', 
                    xgbr__early_stopping_rounds=75)

Equilibrist answered 1/4, 2022 at 17:23 Comment(0)
