Currently, I'm building a new transformer-based model with huggingface-transformers, where the attention layer differs from the original one. I used run_glue.py
to check my model's performance on the GLUE benchmark. However, I found that the Trainer class of huggingface-transformers saves all the checkpoints, where I can only set the maximum number of checkpoints to keep. I want to save only the weights (or other state such as the optimizer) with the best performance on the validation dataset, and the current Trainer class doesn't seem to provide such a thing. (If we set the maximum number of checkpoints, it removes the older ones, not the ones with worse performance.) Someone already asked the same question on GitHub, but I can't figure out how to modify the script to do what I want. Currently, I'm thinking about making a custom Trainer class that inherits from the original one and changes the train()
method, and it would be great if there's an easy and simple way to do this. Thanks in advance.
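For reference, the kind of subclass I have in mind would look roughly like this (an untested sketch; hooking evaluate() rather than train() seemed simpler, and the BestOnlyTrainer name and the eval_loss key are just placeholders):

import os
from transformers import Trainer

class BestOnlyTrainer(Trainer):
    # Untested sketch: after every evaluation, keep a copy of the weights
    # only if the eval loss improved on the best value seen so far.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.best_metric = None  # best eval_loss observed so far

    def evaluate(self, *args, **kwargs):
        metrics = super().evaluate(*args, **kwargs)
        eval_loss = metrics.get("eval_loss")
        if eval_loss is not None and (self.best_metric is None or eval_loss < self.best_metric):
            self.best_metric = eval_loss
            # overwrite a single "best" directory instead of keeping numbered checkpoints
            self.save_model(os.path.join(self.args.output_dir, "best"))
        return metrics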
You may try the following parameters from the Trainer in huggingface:
training_args = TrainingArguments(
    output_dir='/content/drive/results',  # output directory
    do_predict=True,
    num_train_epochs=3,                   # total number of training epochs
    per_device_train_batch_size=4,        # batch size per device during training
    per_device_eval_batch_size=2,         # batch size for evaluation
    warmup_steps=1000,                    # number of warmup steps for the learning rate scheduler
    save_steps=1000,
    save_total_limit=10,
    load_best_model_at_end=True,
    weight_decay=0.01,                    # strength of weight decay
    logging_dir='./logs',                 # directory for storing logs
    logging_steps=0,
    evaluate_during_training=True)
There may be better ways to avoid keeping too many checkpoints while selecting the best model. So far you cannot save only the best model, but you can check whether an evaluation yields better results than the previous one.
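If it helps, the selection criterion can also be made explicit with metric_for_best_model and greater_is_better; a minimal sketch, assuming the eval loss is the metric you care about:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='steps',        # evaluate periodically so there is something to compare
    eval_steps=500,
    save_steps=500,                     # save at the same cadence as evaluation
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',  # which eval metric defines "best"
    greater_is_better=False)            # lower loss is better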
I have not seen any parameter for that. However, there is a workaround.
Use the following combination:
evaluation_strategy='steps',
eval_steps=10,                 # Evaluation and Save happens every 10 steps
save_total_limit=5,            # Only last 5 models are saved. Older ones are deleted.
load_best_model_at_end=True,
When I tried the above combination, at any time the 5 previous models are saved in the output directory, but if the best model is not one among them, it keeps the best model as well. So it will be 1 + 5 models. You can change save_total_limit = 1 so it will serve your purpose.
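Put together, a minimal usage sketch might look like this (the model, datasets, and output directory are placeholders):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./out',            # placeholder
    evaluation_strategy='steps',
    eval_steps=10,
    save_steps=10,                 # keep saving in step with evaluation
    save_total_limit=1,            # only the latest (plus the best) checkpoint is kept
    load_best_model_at_end=True)

trainer = Trainer(
    model=model,                   # your model
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset)

trainer.train()                    # the best checkpoint is reloaded at the end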
This answer could be useful
training_args = TrainingArguments(
    output_dir=repo_name,
    group_by_length=True,
    length_column_name='input_length',
    per_device_train_batch_size=24,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=20,
    fp16=True,
    save_steps=1000,
    save_strategy='steps',        # we cannot set it to "no"; otherwise, the model cannot guess the best checkpoint
    eval_steps=1000,
    logging_steps=1000,
    learning_rate=5e-5,
    warmup_steps=500,
    save_total_limit=3,
    load_best_model_at_end=True   # this will let the trainer save the best checkpoint
)
As indicated here as well, there are different ways to save the best checkpoint.
If you use save_total_limit=2 and load_best_model_at_end=True, then the latest and the best models will be saved. From the numbers in the names of these directories, one can infer which checkpoint is which. Even if save_total_limit=1, it is likely that two models will be saved again, the best and the latest (to resume training), if they are not the same.
When load_best_model_at_end=True, reading trainer.state.best_model_checkpoint after training can be used to get the best model.
If the best model is loaded at the end of training, then trainer.save_model(output_dir=custom_path) can also save the best model in a separate directory.
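For example, a short sketch after trainer.train() has finished (custom_path is a placeholder):

print(trainer.state.best_model_checkpoint)  # e.g. './out/checkpoint-3000', the directory of the best checkpoint
trainer.save_model(output_dir=custom_path)  # with load_best_model_at_end=True this writes the best weights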
If load_best_model_at_end = True, then save_steps and, obviously, save_total_limit will be ignored – Nonce