Validation and Training Loss when using HuggingFace

I cannot seem to find an explanation of how the validation and training losses are calculated when we fine-tune a model using the Hugging Face Trainer. Does anyone know where to find this information?

Elsa answered 16/8, 2023 at 13:36 Comment(2)
Good question and indeed an under-documented feature. Skysail
Yes :). The only way is to spend hours trying to understand the source code, which is not that practical if you just want to quickly understand how the losses are calculated. Elsa

In Short

Depending on what you want to do with the evaluation function, knowing the internal workings of the evaluation routine may or may not be necessary for training the model appropriately.

Scroll down to the Summary section of the answer and the QnA section after.


In Long

There are two common modes for training a model with Hugging Face transformers:

  1. with the Trainer (batteries included)
  2. without the Trainer, using a plain PyTorch training loop and backpropagation

For example:

  1. https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py
  2. https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm_no_trainer.py

For (2), it should be self-explanatory, as the evaluation/validation routine is explicitly coded out (other than the magical loss.backward() and optimizer.step()).
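
For illustration, here is a minimal sketch of what that explicit validation routine typically looks like in plain PyTorch (the names model, eval_dataloader and device are placeholders, not taken from the linked script): the validation loss is just the model's own loss (e.g. cross-entropy for a causal LM), averaged over the evaluation batches.

    import torch

    def compute_validation_loss(model, eval_dataloader, device="cuda"):
        """Average the model's own loss over the validation set."""
        model.eval()
        total_loss, num_batches = 0.0, 0
        with torch.no_grad():
            for batch in eval_dataloader:
                batch = {k: v.to(device) for k, v in batch.items()}
                # Hugging Face models return outputs.loss when `labels` are in the batch
                outputs = model(**batch)
                total_loss += outputs.loss.item()
                num_batches += 1
        model.train()
        return total_loss / max(num_batches, 1)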

Q: Where is the validation routine in the Trainer object?

For (1), it is rather hard to find any blog post or detailed documentation on how the Trainer object works, but you can take a look at the source code, so let's go down the rabbit hole...

In the Trainer object, there is an evaluate() function that runs the evaluation/validation routine, https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L2925

How/when is evaluate() called?

When you call trainer.train(), there is a lot going on, but in general it does this:

  def train(
        self,
        resume_from_checkpoint: Optional[Union[str, bool]] = None,
        trial: Union["optuna.Trial", Dict[str, Any]] = None,
        ignore_keys_for_eval: Optional[List[str]] = None,
        **kwargs,
    ):
        # blah blah, argparsing and reading kwargs,
        # then a lot more model/args munging to check whether
        # you want to load a model or create a new one from config

        # Then finally the most important thing:

            return inner_training_loop(
                args=args,
                resume_from_checkpoint=resume_from_checkpoint,
                trial=trial,
                ignore_keys_for_eval=ignore_keys_for_eval,
            )

Hmmmm, oh okay, trainer.train() calls inner_training_loop()

And inside inner_training_loop(), https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L1552, there are roughly 400-500 lines of code that eventually do:

    def inner_training_loop(...):
        # Lots of code parsing args and checking stuff.

        # Then the training part of the code, that is out-of-scope
        # for this question but eventually, it does
        ... 
        self.optimizer.step()

        ...
        # Then we see this after the gradients are computed
        # and model updated with optimizer.step()
         self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)

        # Iterate through the training + evaluate/validation
        # loop, until eventually the trainer.train() returns
        ...
        return TrainOutput(self.state.global_step, train_loss, metrics)

Hmmmm, oh okay, trainer.train() calls inner_training_loop(), that calls _maybe_log_save_evaluate()

And inside _maybe_log_save_evaluate() is where you see the validation dataset being accessed:


    def _maybe_log_save_evaluate(self, tr_loss, model, trial, epoch, ignore_keys_for_eval):
        # Somehow, we have to respect the user and check if the user
        # wants to log the metrics...
        if self.control.should_log:
            # Some log parsing for the loss,
            # and emits it somewhere... Not that we care here =)
            ...
    
        ...
        # Then comes the part that we want to know, 
        # the actual evaluation.
        if self.control.should_evaluate:
            if isinstance(self.eval_dataset, dict):
                metrics = {}
                for eval_dataset_name, eval_dataset in self.eval_dataset.items():
                    dataset_metrics = self.evaluate(
                        eval_dataset=eval_dataset,
                        ignore_keys=ignore_keys_for_eval,
                        metric_key_prefix=f"eval_{eval_dataset_name}",
                    )
                    metrics.update(dataset_metrics)
            else:
                metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
            self._report_to_hp_search(trial, self.state.global_step, metrics)
    
            # Run delayed LR scheduler now that metrics are populated
            if isinstance(self.lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
                metric_to_check = self.args.metric_for_best_model
                if not metric_to_check.startswith("eval_"):
                    metric_to_check = f"eval_{metric_to_check}"
                self.lr_scheduler.step(metrics[metric_to_check])


        # Then check more stuff to see if user wants 
        # to save the model before exiting the function.
        if self.control.should_save:
            ...

Note: _maybe_log_save_evaluate() calls evaluate() at these lines:

    self.evaluate(
        eval_dataset=eval_dataset,
        ignore_keys=ignore_keys_for_eval,
        metric_key_prefix=f"eval_{eval_dataset_name}",
    )

So trainer.train() calls inner_training_loop(), which calls _maybe_log_save_evaluate(), which calls evaluate().

Then we have evaluate() calling evaluation_loop():

    def evaluate(
        self,
        eval_dataset: Optional[Dataset] = None,
        ignore_keys: Optional[List[str]] = None,
        metric_key_prefix: str = "eval",
    ) -> Dict[str, float]:
        ...
        # First, the function runs the forward pass through the
        # prediction_loop

        eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop

        output = eval_loop(
            eval_dataloader,
            description="Evaluation",
            # No point gathering the predictions if there are no metrics, otherwise we defer to
            # self.args.prediction_loss_only
            prediction_loss_only=True if self.compute_metrics is None else None,
            ignore_keys=ignore_keys,
            metric_key_prefix=metric_key_prefix,
        )

        total_batch_size = self.args.eval_batch_size * self.args.world_size
        if f"{metric_key_prefix}_jit_compilation_time" in output.metrics:
            start_time += output.metrics[f"{metric_key_prefix}_jit_compilation_time"]
        output.metrics.update(
            speed_metrics(
                metric_key_prefix,
                start_time,
                num_samples=output.num_samples,
                num_steps=math.ceil(output.num_samples / total_batch_size),
            )
        )

        self.log(output.metrics)
        ...

        return output.metrics
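
Note that you can also call evaluate() yourself; it returns the metrics dict, which (when your eval batches contain labels) includes the averaged validation loss under the eval_loss key, plus runtime/speed metrics. A small hedged usage sketch, assuming trainer is an already-constructed Trainer with an eval_dataset:

    metrics = trainer.evaluate()   # runs the evaluation_loop() below
    print(metrics["eval_loss"])    # mean validation loss
    print(metrics)                 # e.g. eval_runtime, eval_samples_per_second, ...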

Then inside evaluation_loop() is where you eventually see the actual forward passes over the validation data and the per-batch loss being collected:

    def evaluation_loop(
        self,
        dataloader: DataLoader,
        description: str,
        prediction_loss_only: Optional[bool] = None,
        ignore_keys: Optional[List[str]] = None,
        metric_key_prefix: str = "eval",
    ) -> EvalLoopOutput:
        # Do lots of work parsing and optimizing with accelerate and GPUs
        ...

        # Then the meat of the evaluation process:
        # Main evaluation loop
        for step, inputs in enumerate(dataloader):
            # Update the observed num examples
            observed_batch_size = find_batch_size(inputs)
            if observed_batch_size is not None:
                observed_num_examples += observed_batch_size
                # For batch samplers, batch_size is not known by the dataloader in advance.
                if batch_size is None:
                    batch_size = observed_batch_size

            # Prediction step
            loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
            main_input_name = getattr(self.model, "main_input_name", "input_ids")
            inputs_decode = self._prepare_input(inputs[main_input_name]) if args.include_inputs_for_metrics else None


        # Then lots of code parsing the different prediction outputs of the different models supported by Huggingface
        ...

        # And eventually emitting and returning the metrics numbers
        
        if self.compute_metrics is not None and all_preds is not None and all_labels is not None:
            if args.include_inputs_for_metrics:
                metrics = self.compute_metrics(
                    EvalPrediction(predictions=all_preds, label_ids=all_labels, inputs=all_inputs)
                )
            else:
                metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
        else:
            metrics = {}

        # Then some more code, parsing the metrics outputs
        ...

        # Finally, return the outputs.
        return EvalLoopOutput(predictions=all_preds, label_ids=all_labels, metrics=metrics, num_samples=num_samples)
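
That prediction_step() call is where the per-batch validation loss actually comes from: conceptually, it runs a forward pass with gradients disabled and reads the loss the model itself computes from the labels. A heavily simplified sketch of that idea (not the real implementation, which also handles label smoothing, models without labels, and various runtime backends):

    import torch

    def simplified_prediction_step(model, inputs):
        # Conceptual sketch only, not the actual Trainer.prediction_step().
        with torch.no_grad():
            outputs = model(**inputs)
        # For models given `labels`, outputs.loss is the task loss defined by
        # the model itself (e.g. cross-entropy for causal/masked LM heads).
        loss = outputs.loss.detach()
        return loss, outputs.logits, inputs.get("labels")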

Summary

With the Trainer object, there is a lot of code written to support different user modes, different trainer arguments, and different models and evaluation routines.

In short, the call chain is:

  • trainer.train() calls inner_training_loop()
  • inner_training_loop() calls _maybe_log_save_evaluate(),
  • _maybe_log_save_evaluate() calls evaluate(),
  • evaluate() eventually calls evaluation_loop(), which does the actual loss/metric computation
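
To tie this back to the original question, here is a minimal hedged example of wiring a validation set into that chain (model, train_dataset and eval_dataset are placeholders assumed to exist): pass an eval_dataset plus an evaluation strategy, and the chain above will compute and log eval_loss for you, alongside the running training loss.

    from transformers import Trainer, TrainingArguments

    # `model`, `train_dataset`, `eval_dataset` are assumed to exist already.
    args = TrainingArguments(
        output_dir="out",
        evaluation_strategy="epoch",   # trigger the evaluate() chain every epoch
        logging_strategy="epoch",      # log the averaged training loss every epoch
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    trainer.train()   # logs `loss` (training) and `eval_loss` (validation)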

Q: If I want to customize the validation routine, should I change the Trainer object and the evaluate() function?

You can try overriding the Trainer's evaluate() function (by subclassing Trainer) if you want to.

A: Try a custom compute_metrics function

But since the Trainer is built for generic use, if you want a customized validation loop, first try changing how compute_metrics works (most probably your task is a common, supported one, so it's easy), e.g. https://www.kaggle.com/code/alvations/how-to-fine-tune-an-opus-mt-model/#The-Metric:-Lets-go-with-the-classic-BLEU-and-ChrF
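
For instance, a hedged sketch of a custom compute_metrics for a classification-style task; the Trainer hands it an EvalPrediction with predictions and label_ids, and whatever dict it returns gets merged into the eval metrics:

    import numpy as np

    def compute_metrics(eval_pred):
        # eval_pred is an EvalPrediction(predictions=..., label_ids=...)
        logits, labels = eval_pred.predictions, eval_pred.label_ids
        preds = np.argmax(logits, axis=-1)
        return {"accuracy": float((preds == labels).mean())}

    # trainer = Trainer(..., compute_metrics=compute_metrics)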

A: Try using TrainerCallback

Or you can try https://huggingface.co/docs/transformers/main_classes/callback (take a look at https://oongjoon.github.io/huggingface/Trainer-Callback_en/, it's a little old but worth the read)
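
As a hedged sketch, here is a small callback that just prints whatever metrics the evaluation produced; on_evaluate receives the metrics dict coming out of the chain above:

    from transformers import TrainerCallback

    class PrintEvalMetricsCallback(TrainerCallback):
        def on_evaluate(self, args, state, control, metrics=None, **kwargs):
            # `metrics` is the dict returned by the evaluation loop (e.g. eval_loss)
            if metrics is not None:
                print(f"step {state.global_step}: {metrics}")

    # trainer = Trainer(..., callbacks=[PrintEvalMetricsCallback()])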

A: Train without the Trainer object

And if you really need the forward pass through the model to be different, and/or you need to handle the outputs of the forward pass differently, then it might be easier to not use the Trainer and roll your own PyTorch loop with loss.backward(); optimizer.step(), e.g. https://github.com/huggingface/transformers/blob/main/examples/pytorch/translation/run_translation_no_trainer.py
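
In that case the training loss is simply whatever you compute and backpropagate yourself; a minimal hedged sketch (the model, dataloader and optimizer names are placeholders):

    def train_one_epoch(model, train_dataloader, optimizer, device="cuda"):
        model.train()
        for batch in train_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss          # the training loss you would log/average
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()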

Skysail answered 17/8, 2023 at 15:20 Comment(0)
