How to get the accuracy per epoch or step for the huggingface.transformers Trainer?

I'm using the Hugging Face Trainer with a BertForSequenceClassification.from_pretrained("bert-base-uncased") model.

Simplified, it looks like this:

from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

training_args = TrainingArguments(
        output_dir="bert_results",
        num_train_epochs=3,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=32,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir="bert_results/logs",
        logging_steps=10
        )

trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics
        )

The logs contain the loss every 10 steps, but I can't find the training accuracy. Does anyone know how to get it, for example by changing the verbosity of the logger? I can't find anything about it online.

Complot answered 9/5, 2021 at 12:5 Comment(0)

You can load the accuracy metric and use it in your compute_metrics function. For example:

import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

This compute_metrics function is based on Hugging Face's text classification tutorial. It worked in my tests.
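To have the accuracy reported during training, pass compute_metrics to the Trainer and enable periodic evaluation; a minimal sketch, assuming the model, datasets, and output_dir from the question:

training_args = TrainingArguments(
    output_dir="bert_results",
    num_train_epochs=3,
    evaluation_strategy="epoch",   # run evaluation (and compute_metrics) once per epoch
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,   # accuracy appears in the logs as eval_accuracy
)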

Kwang answered 8/11, 2021 at 0:58 Comment(0)

I had the same problem and solved it by adding a custom callback that calls the evaluate() method on the train_dataset at the end of every epoch.

from copy import deepcopy

from transformers import TrainerCallback


class CustomCallback(TrainerCallback):

    def __init__(self, trainer) -> None:
        super().__init__()
        self._trainer = trainer

    def on_epoch_end(self, args, state, control, **kwargs):
        if control.should_evaluate:
            # Evaluate on the training set too; metric_key_prefix="train" makes the
            # metrics show up as train_* instead of the default eval_*.
            control_copy = deepcopy(control)
            self._trainer.evaluate(eval_dataset=self._trainer.train_dataset, metric_key_prefix="train")
            return control_copy

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
    tokenizer=tokenizer
)
trainer.add_callback(CustomCallback(trainer)) 
train = trainer.train()

This gives the train metrics like the following:

{'train_loss': 0.7159061431884766, 'train_accuracy': 0.4, 'train_f1': 0.5714285714285715, 'train_runtime': 6.2973, 'train_samples_per_second': 2.382, 'train_steps_per_second': 0.159, 'epoch': 1.0}
{'eval_loss': 0.8529007434844971, 'eval_accuracy': 0.0, 'eval_f1': 0.0, 'eval_runtime': 2.0739, 'eval_samples_per_second': 0.964, 'eval_steps_per_second': 0.482, 'epoch': 1.0}
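
As far as I can tell, control.should_evaluate is only set when the Trainer is configured to run evaluation periodically, so this callback assumes something like the following training arguments; a minimal sketch:

training_args = TrainingArguments(
    output_dir="bert_results",
    evaluation_strategy="epoch",   # needed so control.should_evaluate fires at the end of each epoch
    logging_strategy="epoch",
)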



Another way to get train accuracy is to extend the base Trainer class and override the compute_loss() method like the following:

import torch
from sklearn.metrics import accuracy_score
from transformers import Trainer


class CustomTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def compute_loss(self, model, inputs, return_outputs=False):
        """
        How the loss is computed by Trainer. By default, all models return the loss in the first element.
        Subclass and override for custom behavior.
        """
        if self.label_smoother is not None and "labels" in inputs:
            labels = inputs.pop("labels")
        else:
            labels = None
        outputs = model(**inputs)

        # code for calculating accuracy on the current training batch
        if "labels" in inputs:
            preds = outputs.logits.detach()
            batch_labels = inputs["labels"].detach()
            # scikit-learn expects CPU arrays
            acc1 = accuracy_score(batch_labels.cpu(), preds.argmax(axis=-1).cpu())
            self.log({"accuracy_score": acc1})
            acc = (
                (preds.argmax(axis=-1) == batch_labels)
                .type(torch.float)
                .mean()
                .item()
            )
            self.log({"train_accuracy": acc})
        # end code for calculating accuracy
                    
        # Save past state if it exists
        # TODO: this needs to be fixed and made cleaner later.
        if self.args.past_index >= 0:
            self._past = outputs[self.args.past_index]

        if labels is not None:
            loss = self.label_smoother(outputs, labels)
        else:
            # We don't use .loss here since the model may return tuples instead of ModelOutput.
            loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]

        return (loss, outputs) if return_outputs else loss

Then, instead of Trainer, use CustomTrainer like this:

trainer = CustomTrainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
    tokenizer=tokenizer
)
Idolater answered 3/1, 2022 at 9:53 Comment(3)
What is deepcopy(control)? - Raffinate
@Raffinate We are making a copy of the control object from the trainer class into control_copy to return later; as far as I remember, directly changing the control object was giving me an error. - Idolater
Without deep-copying control, the trainer would not evaluate the evaluation set. I cannot understand why, but that is how it turns out. - Staciestack

A function that returns the needed metrics is required. Here is the one I wrote, which returns several metrics (more is better, right?):

import numpy as np
from datasets import load_metric


def compute_metrics(eval_pred):
    metrics = ["accuracy", "recall", "precision", "f1"]  # List of metrics to return
    metric = {}
    for met in metrics:
        metric[met] = load_metric(met)
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    metric_res = {}
    for met in metrics:
        metric_res[met] = metric[met].compute(predictions=predictions, references=labels)[met]
    return metric_res
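
Note: in newer library versions, datasets.load_metric is deprecated in favor of the separate evaluate package; a roughly equivalent sketch for the accuracy part, assuming evaluate is installed:

import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)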

Also, if metrics need to be calculated per epoch, this has to be configured in the training args:

training_args = TrainingArguments(
    ...,
    evaluation_strategy="epoch",  # To calculate metrics per epoch
    logging_strategy="epoch",     # Extra: to also log the training loss per epoch
)

The last step is to add it to the trainer:

trainer = Trainer(
    ...,
    compute_metrics=compute_metrics,
)
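
If you want metrics per step rather than per epoch (as the question title asks), the same arguments accept a steps-based strategy; a sketch, assuming an interval of 50 steps:

training_args = TrainingArguments(
    output_dir="bert_results",
    evaluation_strategy="steps",   # evaluate (and run compute_metrics) every eval_steps
    eval_steps=50,
    logging_strategy="steps",
    logging_steps=50,
)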
Meara answered 5/4, 2022 at 22:5 Comment(0)

This is late, but for the benefit of those who were not successful with the previous answers: another method I found is to override the evaluate method of the Trainer class in the Transformers library. The idea is to run evaluation on the training set as well and add the results to the logs. Make sure to combine the eval and train dictionaries into one when returning.

Extend the trainer class and override as follows:

    import math
    import time
    from typing import Dict, List, Optional

    from torch.utils.data import Dataset
    from transformers import Trainer
    from transformers.debug_utils import DebugOption
    from transformers.trainer_utils import speed_metrics


    class CTCTrainer(Trainer):
        def evaluate(
            self,
            eval_dataset: Optional[Dataset] = None,
            ignore_keys: Optional[List[str]] = None,
            metric_key_prefix: str = "eval",
        ) -> Dict[str, float]:
            """
            Run evaluation and returns metrics.
            The calling script will be responsible for providing a method to compute metrics, as they are task-dependent
            (pass it to the init `compute_metrics` argument).
            You can also subclass and override this method to inject custom behavior.
            Args:
                eval_dataset (`Dataset`, *optional*):
                    Pass a dataset if you wish to override `self.eval_dataset`. If it is a [`~datasets.Dataset`], columns
                    not accepted by the `model.forward()` method are automatically removed. It must implement the `__len__`
                    method.
                ignore_keys (`List[str]`, *optional*):
                    A list of keys in the output of your model (if it is a dictionary) that should be ignored when
                    gathering predictions.
                metric_key_prefix (`str`, *optional*, defaults to `"eval"`):
                    An optional prefix to be used as the metrics key prefix. For example the metrics "bleu" will be named
                    "eval_bleu" if the prefix is "eval" (default)
            Returns:
                A dictionary containing the evaluation loss and the potential metrics computed from the predictions. The
                dictionary also contains the epoch number which comes from the training state.
            """
            # memory metrics - must set up as early as possible
            self._memory_tracker.start()
    
            eval_dataloader = self.get_eval_dataloader(eval_dataset)
            train_dataloader = self.get_train_dataloader()
            start_time = time.time()
    
            eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
            eval_output = eval_loop(
                eval_dataloader,
                description="Evaluation",
                # No point gathering the predictions if there are no metrics, otherwise we defer to
                # self.args.prediction_loss_only
                prediction_loss_only=True if self.compute_metrics is None else None,
                ignore_keys=ignore_keys,
                metric_key_prefix=metric_key_prefix,
            )
    
            train_output = eval_loop(
                train_dataloader,
                description='Training Evaluation',
                prediction_loss_only=True if self.compute_metrics is None else None,
                ignore_keys=ignore_keys,
                metric_key_prefix="train",
            )
    
            total_batch_size = self.args.eval_batch_size * self.args.world_size
            if f"{metric_key_prefix}_jit_compilation_time" in eval_output.metrics:
                start_time += eval_output.metrics[f"{metric_key_prefix}_jit_compilation_time"]
            eval_output.metrics.update(
                speed_metrics(
                    metric_key_prefix,
                    start_time,
                    num_samples=eval_output.num_samples,
                    num_steps=math.ceil(eval_output.num_samples / total_batch_size),
                )
            )
    
            train_n_samples = len(self.train_dataset)
            train_output.metrics.update(speed_metrics('train', start_time, train_n_samples))
            self.log(train_output.metrics | eval_output.metrics)
    
            if DebugOption.TPU_METRICS_DEBUG in self.args.debug:
                # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
                xm.master_print(met.metrics_report())
    
            self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, train_output.metrics)
            self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, eval_output.metrics)
    
            self._memory_tracker.stop_and_update_metrics(eval_output.metrics)
            self._memory_tracker.stop_and_update_metrics(train_output.metrics)
    
            # only works in Python >= 3.9
            return train_output.metrics | eval_output.metrics

Remember to use your custom extended class to train your model: trainer = CTCTrainer(args) and trainer.train(). The code above will produce the following output in your log history.

"log_history": [
    {
      "epoch": 0.67,
      "learning_rate": 6.428571428571429e-05,
      "loss": 2.1279,
      "step": 5
    },
    {
      "epoch": 0.67,
      "eval_accuracy": 0.13333334028720856,
      "eval_loss": 2.1077311038970947,
      "eval_runtime": 10.683,
      "eval_samples_per_second": 5.616,
      "eval_steps_per_second": 1.404,
      "step": 5,
      "train_accuracy": 0.13333334028720856,
      "train_loss": 2.086669921875,
      "train_runtime": 10.683,
      "train_samples_per_second": 5.616
    }
]
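
If you prefer to inspect the numbers programmatically after training, the logged records are also kept on the trainer state; a small sketch, assuming training has finished:

history = trainer.state.log_history  # list of dicts, one entry per logged step/evaluation

train_acc = [(h["epoch"], h["train_accuracy"]) for h in history if "train_accuracy" in h]
eval_acc = [(h["epoch"], h["eval_accuracy"]) for h in history if "eval_accuracy" in h]
print(train_acc)
print(eval_acc)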
Cudgel answered 15/2, 2023 at 23:54 Comment(0)
