I can't seem to find an explanation of how the validation and training losses are calculated when we fine-tune a model using the Hugging Face Trainer. Does anyone know where to find this information?
In Short
Depending on what you want to do with the evaluation function, knowing the internal workings of the evaluation routine may or may not be necessary for training the model appropriately.
Scroll down to the Summary section of the answer and the Q&A section after it.
In Long
There are two common modes for training a model with Hugging Face transformers:

1. with the Trainer (batteries included; see the sketch after the example links below)
2. without the Trainer, using default PyTorch backpropagation

For example:

- https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py
- https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm_no_trainer.py
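To make mode (1) concrete, here is a minimal sketch of a Trainer-based fine-tuning setup. The model name, hyperparameters, and the train_dataset/eval_dataset variables are placeholders (assume they are tokenized datasets you prepared beforehand), not something taken from the linked examples:

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="outputs",
    evaluation_strategy="epoch",       # run the validation routine once per epoch
    per_device_train_batch_size=16,
    num_train_epochs=3,
    logging_steps=50,                  # how often the training loss is logged
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,       # assumed: a tokenized training split
    eval_dataset=eval_dataset,         # assumed: the validation split whose loss you care about
)

trainer.train()                        # logs training loss while training, eval loss whenever it evaluates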
For (2), it should be self-explanatory, as the evaluation/validation routine is explicitly coded out (other than the magical loss.backward() and optimizer.step()).
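For reference, in mode (2) the validation loss is usually just the average of the per-batch losses returned by the model's forward pass over the validation set. A bare-bones sketch (model and eval_dataloader are assumed to already exist, similar to what run_clm_no_trainer.py builds):

import torch

model.eval()
total_loss, num_batches = 0.0, 0
for batch in eval_dataloader:
    with torch.no_grad():
        outputs = model(**batch)        # batch includes "labels", so the model returns a loss
    total_loss += outputs.loss.item()
    num_batches += 1

validation_loss = total_loss / num_batches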
Q: Where is the validation routine in the Trainer object?
For (1), it is rather hard to find any blog post or detailed doc on how the Trainer object works, but you can take a look at the source code, so let's go down the rabbit hole...

In the Trainer object, there is an evaluate() function that runs the evaluation/validation routine: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L2925
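Note that you can also call this function yourself, outside of trainer.train(), to compute the validation loss on demand. A sketch, assuming a trainer built like the one above:

metrics = trainer.evaluate()       # runs the full validation routine on eval_dataset
print(metrics["eval_loss"])        # the validation loss (present when the eval set has labels)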
How/when is evaluate() called?

When you call trainer.train(), there's a lot going on, but in general it's doing:
def train(
    self,
    resume_from_checkpoint: Optional[Union[str, bool]] = None,
    trial: Union["optuna.Trial", Dict[str, Any]] = None,
    ignore_keys_for_eval: Optional[List[str]] = None,
    **kwargs,
):
    # blah blah, argparsing and reading kwargs,
    # then do a lot more model/args munging to check
    # if you want to load a model or create a new one from config.
    # Then finally the most important thing:
    return inner_training_loop(
        args=args,
        resume_from_checkpoint=resume_from_checkpoint,
        trial=trial,
        ignore_keys_for_eval=ignore_keys_for_eval,
    )
Hmmmm, oh okay, so trainer.train() calls inner_training_loop().

And inside inner_training_loop(), https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L1552, there are some 400-500 lines of code that eventually do:
def inner_training_loop(...):
    # Lots of code parsing args and checking stuff.
    # Then the training part of the code, which is out-of-scope
    # for this question, but eventually it does
    ...
    self.optimizer.step()
    ...
    # Then we see this after the gradients are computed
    # and the model is updated with optimizer.step()
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
    # Iterate through the training + evaluate/validation
    # loop, until eventually the trainer.train() returns
    ...
    return TrainOutput(self.state.global_step, train_loss, metrics)
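Note that _maybe_log_save_evaluate() is called after every optimizer step, but it only actually runs the evaluation when self.control.should_evaluate is set, and that flag is driven by your TrainingArguments (via the default flow callback). So, roughly speaking, the evaluation frequency is something you configure up front, e.g. (the values here are placeholders):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    evaluation_strategy="steps",   # or "epoch" / "no"
    eval_steps=500,                # run the validation routine every 500 steps
    logging_steps=100,             # how often the (running) training loss is logged
    save_steps=500,                # how often checkpoints are saved
)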
Hmmmm, oh okay, so trainer.train() calls inner_training_loop(), which calls _maybe_log_save_evaluate().

And inside _maybe_log_save_evaluate() is where you see the validation dataset get accessed:
def _maybe_log_save_evaluate(self, tr_loss, model, trial, epoch, ignore_keys_for_eval):
    # Somehow, we have to respect the user and check if the user
    # wants to log the metrics...
    if self.control.should_log:
        # Some log parsing for the loss,
        # emitting it to somewhere... Not that we care here =)
        ...

    ...

    # Then comes the part that we want to know,
    # the actual evaluation.
    if self.control.should_evaluate:
        if isinstance(self.eval_dataset, dict):
            metrics = {}
            for eval_dataset_name, eval_dataset in self.eval_dataset.items():
                dataset_metrics = self.evaluate(
                    eval_dataset=eval_dataset,
                    ignore_keys=ignore_keys_for_eval,
                    metric_key_prefix=f"eval_{eval_dataset_name}",
                )
                metrics.update(dataset_metrics)
        else:
            metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
        self._report_to_hp_search(trial, self.state.global_step, metrics)

        # Run delayed LR scheduler now that metrics are populated
        if isinstance(self.lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
            metric_to_check = self.args.metric_for_best_model
            if not metric_to_check.startswith("eval_"):
                metric_to_check = f"eval_{metric_to_check}"
            self.lr_scheduler.step(metrics[metric_to_check])

    # Then check more stuff to see if the user wants
    # to save the model before exiting the function.
    if self.control.should_save:
        ...
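One practical detail visible in that branch: eval_dataset can be a dict of datasets, and each one gets its own metric prefix. As a sketch (the split names here are made up):

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset={"wiki": wiki_validation, "news": news_validation},
)
# You would then see e.g. eval_wiki_loss and eval_news_loss logged separately,
# because metric_key_prefix becomes f"eval_{eval_dataset_name}" for each split.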
Note: _maybe_log_save_evaluate() calls evaluate() at this line:

self.evaluate(
    eval_dataset=eval_dataset,
    ignore_keys=ignore_keys_for_eval,
    metric_key_prefix=f"eval_{eval_dataset_name}",
)
So, trainer.train() calls inner_training_loop(), which calls _maybe_log_save_evaluate(), which calls evaluate().

Then we have evaluate() calling evaluation_loop():
def evaluate(
    self,
    eval_dataset: Optional[Dataset] = None,
    ignore_keys: Optional[List[str]] = None,
    metric_key_prefix: str = "eval",
) -> Dict[str, float]:
    ...
    # First, the function runs the forward pass through the
    # prediction/evaluation loop
    eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
    output = eval_loop(
        eval_dataloader,
        description="Evaluation",
        # No point gathering the predictions if there are no metrics, otherwise we defer to
        # self.args.prediction_loss_only
        prediction_loss_only=True if self.compute_metrics is None else None,
        ignore_keys=ignore_keys,
        metric_key_prefix=metric_key_prefix,
    )

    total_batch_size = self.args.eval_batch_size * self.args.world_size
    if f"{metric_key_prefix}_jit_compilation_time" in output.metrics:
        start_time += output.metrics[f"{metric_key_prefix}_jit_compilation_time"]
    output.metrics.update(
        speed_metrics(
            metric_key_prefix,
            start_time,
            num_samples=output.num_samples,
            num_steps=math.ceil(output.num_samples / total_batch_size),
        )
    )

    self.log(output.metrics)
    ...
    return output.metrics
Then, inside the evaluation_loop(), that is where you eventually see the actual loop over the validation batches:
def evaluation_loop(
    self,
    dataloader: DataLoader,
    description: str,
    prediction_loss_only: Optional[bool] = None,
    ignore_keys: Optional[List[str]] = None,
    metric_key_prefix: str = "eval",
) -> EvalLoopOutput:
    # Do lots of work parsing and optimizing with accelerate and GPUs
    ...
    # Then the meat of the evaluation process:

    # Main evaluation loop
    for step, inputs in enumerate(dataloader):
        # Update the observed num examples
        observed_batch_size = find_batch_size(inputs)
        if observed_batch_size is not None:
            observed_num_examples += observed_batch_size
            # For batch samplers, batch_size is not known by the dataloader in advance.
            if batch_size is None:
                batch_size = observed_batch_size

        # Prediction step
        loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
        main_input_name = getattr(self.model, "main_input_name", "input_ids")
        inputs_decode = self._prepare_input(inputs[main_input_name]) if args.include_inputs_for_metrics else None

        # Then lots of code parsing the different prediction outputs of the different models supported by Huggingface
        ...

    # And eventually emitting and returning the metrics numbers
    if self.compute_metrics is not None and all_preds is not None and all_labels is not None:
        if args.include_inputs_for_metrics:
            metrics = self.compute_metrics(
                EvalPrediction(predictions=all_preds, label_ids=all_labels, inputs=all_inputs)
            )
        else:
            metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
    else:
        metrics = {}

    # Then some more code, parsing the metrics outputs
    ...

    # Finally, return the outputs.
    return EvalLoopOutput(predictions=all_preds, label_ids=all_labels, metrics=metrics, num_samples=num_samples)
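So the eval_loss you see in the logs is essentially the mean of the per-batch losses returned by prediction_step(). Roughly (a simplified sketch, not the actual source, which also handles label smoothing, models without labels, and distributed gathering), each prediction step boils down to:

with torch.no_grad():
    outputs = model(**inputs)   # inputs contain "labels", so the model computes
                                # its own (usually cross-entropy) loss internally
    loss = outputs.loss
    logits = outputs.logits
# evaluation_loop() accumulates these per-batch losses and averages them over
# the whole eval dataset to produce the reported eval_loss.

The training loss that gets logged during training is analogous: it is the model's forward-pass loss on each training batch, accumulated and (roughly) averaged over the logging window.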
Summary
With the Trainer object, there is a lot of code written to support different user modes, different trainer arguments, and different models and evaluation routines.

In short, the call chain is:

- trainer.train() calls inner_training_loop()
- inner_training_loop() calls _maybe_log_save_evaluate()
- _maybe_log_save_evaluate() calls evaluate()
- evaluate() eventually calls the evaluation_loop() function that does the loss/metric computation
Q: If I want to customize the validation routine, should I change the Trainer object and the evaluate() function?

You can try overriding the Trainer object's evaluate() function if you want to.
A: Try a custom compute_metrics function

But since the object is loaded with code to support generic use, if you want a customized validation loop, first try changing how compute_metrics works (most probably your task is a common, supported one, so it's easy), e.g. https://www.kaggle.com/code/alvations/how-to-fine-tune-an-opus-mt-model/#The-Metric:-Lets-go-with-the-classic-BLEU-and-ChrF
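As a sketch of what that looks like for a simple classification task (for a generation task like the linked OPUS-MT notebook you would decode the predictions and compute BLEU/ChrF instead):

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred is an EvalPrediction with .predictions (logits) and .label_ids
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,   # whatever you return here shows up with the "eval_" prefix
)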
A: Try using TrainerCallback
Or you can try https://huggingface.co/docs/transformers/main_classes/callback (take a look at https://oongjoon.github.io/huggingface/Trainer-Callback_en/, it's a little old but worth the read)
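For example, a callback that hooks into the evaluation event might look roughly like this (a sketch; on_evaluate is a real TrainerCallback hook, the print is just illustrative):

from transformers import TrainerCallback

class PrintEvalLossCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        # Called right after the evaluation routine finishes; metrics holds eval_loss etc.
        if metrics is not None:
            print(f"step {state.global_step}: eval_loss = {metrics.get('eval_loss')}")

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[PrintEvalLossCallback()],
)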
A: Train without the Trainer object

And if you really need the forward pass through the model to be different, and/or need to handle the outputs of the forward pass differently, then it might be easier to skip the Trainer and roll your own PyTorch loop (blah blah; loss.backward(); optimizer.step()), e.g. https://github.com/huggingface/transformers/blob/main/examples/pytorch/translation/run_translation_no_trainer.py