How to improve the catboostregressor? [closed]

I am working on a data science regression problem with around 90,000 rows in the train set and 8,500 in the test set. There are 9 categorical columns and no missing data. For this case, I applied a CatBoostRegressor, which gave me a pretty good R2 (98.51) and MAE (3.77). Other models (LightGBM, XGBoost) performed below CatBoost.

Now I would like to increase the R2 value and decrease the MAE for more accurate results, which is what the project demands too.

I have tuned many times by passing 'loss_function': ['MAE'], 'l2_leaf_reg': [3], 'random_strength': [4], 'bagging_temperature': [0.5] with different values, but the performance stays the same.

Can anyone suggest how to boost the R2 value while minimizing MAE and MSE?

Chantalchantalle answered 2/3, 2021 at 8:32 Comment(2)
You can try to tune hyperparameters for CatBoost. The second option would be to try feature engineering, maybe you can add some combination of existing features to the data that will improve the performance. You can also try MLJAR AutoML github.com/mljar/mljar-supervised it has built-in feature engineering (golden features + kmeans features)Osborne
Hi pplonski, Thank you. I did the tuning and got better score.Chantalchantalle
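The feature-engineering idea from the comment above can be sketched in plain Python: one common trick for categorical data is to concatenate two existing columns into a single interaction feature. This is only an illustration; the column names ("city", "product") are hypothetical placeholders, not from the question.

```python
# Sketch of "add some combination of existing features":
# build a new categorical feature from the pair of two existing ones.
# Column names here are hypothetical examples.
rows = [
    {"city": "NY", "product": "A"},
    {"city": "LA", "product": "B"},
]

for row in rows:
    # The pair (city, product) becomes one categorical token.
    row["city_x_product"] = f'{row["city"]}_{row["product"]}'

print(rows[0]["city_x_product"])  # NY_A
```

CatBoost can also build such categorical combinations internally (controlled by its ctr-related settings), but explicit combinations let you control exactly which pairs the model sees.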

Simple method -

You can use Scikit-Learn's GridSearchCV to find the best hyperparameters for your CatBoostRegressor model. Pass it a dictionary of hyperparameter lists, and it will loop through every combination and report which parameters score best. You can use it like this -

from catboost import CatBoostRegressor
from sklearn.model_selection import GridSearchCV

model = CatBoostRegressor()
parameters = {'depth' : [6,8,10],
              'learning_rate' : [0.01, 0.05, 0.1],
              'iterations'    : [30, 50, 100]
              }

grid = GridSearchCV(estimator=model, param_grid = parameters, cv = 2, n_jobs=-1)
grid.fit(X_train, y_train)
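Under the hood, GridSearchCV enumerates the Cartesian product of the lists in the grid and fits one model per combination (times the number of CV folds). A minimal stdlib sketch of that enumeration, using the same grid as above, shows why the cost grows multiplicatively:

```python
from itertools import product

# Same grid as above: 3 * 3 * 3 = 27 candidate combinations,
# and with cv=2 GridSearchCV performs 54 fits in total.
parameters = {
    "depth": [6, 8, 10],
    "learning_rate": [0.01, 0.05, 0.1],
    "iterations": [30, 50, 100],
}

keys = list(parameters)
candidates = [dict(zip(keys, values)) for values in product(*parameters.values())]

print(len(candidates))  # 27
print(candidates[0])    # {'depth': 6, 'learning_rate': 0.01, 'iterations': 30}
```

After `grid.fit(...)`, the winning combination is available as `grid.best_params_` and the refitted model as `grid.best_estimator_`.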

Another method -

Nowadays, models are complex and have a lot of parameters to tune, so people use Bayesian optimization frameworks such as Optuna for hyperparameter tuning. The example below tunes a CatBoostClassifier; the same pattern works for CatBoostRegressor with a regression metric:

!pip install optuna
import catboost
import numpy as np
import optuna
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def objective(trial):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2)

    param = {
        "objective": trial.suggest_categorical("objective", ["Logloss", "CrossEntropy"]),
        'learning_rate' : trial.suggest_float('learning_rate', 0.001, 0.3, log = True),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.01, 0.1),
        "max_depth": trial.suggest_int("max_depth", 1, 15),
        "boosting_type": trial.suggest_categorical("boosting_type", ["Ordered", "Plain"]),
        "bootstrap_type": trial.suggest_categorical(
            "bootstrap_type", ["Bayesian", "Bernoulli", "MVS"]),
    }
    

    if param["bootstrap_type"] == "Bayesian":
        param["bagging_temperature"] = trial.suggest_float("bagging_temperature", 0, 10)
    elif param["bootstrap_type"] == "Bernoulli":
        param["subsample"] = trial.suggest_float("subsample", 0.1, 1)

    gbm = catboost.CatBoostClassifier(**param, iterations = 10000)

    gbm.fit(X_train, y_train, eval_set = [(X_val, y_val)], verbose = 0, early_stopping_rounds = 100)

    preds = gbm.predict(X_val)
    pred_labels = np.rint(preds)
    accuracy = accuracy_score(y_val, pred_labels)
    
    return accuracy

study = optuna.create_study(direction = "maximize")
study.optimize(objective, n_trials = 200, show_progress_bar = True)

This method takes a lot of time (1-2 hours, maybe). It is best when you have many parameters to tune; otherwise, use GridSearchCV.
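Note that the objective above returns classification accuracy; for the regression problem in the question you would return an error metric such as MAE and create the study with direction = "minimize". MAE itself is just the mean absolute difference between predictions and targets, as this quick sketch shows:

```python
def mae(y_true, y_pred):
    # Mean absolute error: the metric the question wants to minimize.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# (0.5 + 0.0 + 1.5) / 3
print(mae([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))  # 0.6666666666666666
```

In the Optuna objective you would compute this on the validation predictions (or use sklearn's mean_absolute_error) and return it instead of accuracy.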

Minuet answered 2/3, 2021 at 10:10 Comment(4)
Hi Adarsh Wase, I have implemented your suggestion with a few more parameters added. It is an improvement, but the model runs for a very long time. Thank you.Chantalchantalle
Is it possible to let me know the best parameters to add for a better score?Chantalchantalle
There are a number of parameters in CatBoostRegressor, and we can't say in advance which one matters most; it depends on the project you are working on. I also think you should read the CatBoostRegressor documentation, here - catboost.ai/docs/concepts/… It will give you a good sense of which hyperparameters are important to tune for your project. (Sorry for the late reply)Minuet
What's the difference between GridSearchCV and the catboost built-in grid_search?Huskey
