HuggingFace Evaluate a Fine-tuned Zero-Shot Model

I am fine-tuning the Hugging Face facebook/bart-large-mnli model to suit my needs, and I use the following parameters:

training_args = TrainingArguments(
    output_dir=model_directory,     # output directory
    num_train_epochs=30,            # total number of training epochs
    per_device_train_batch_size=1,  # batch size per device during training (16 was intended, but anything over 1 runs out of memory)
    per_device_eval_batch_size=2,   # batch size for evaluation (64 was intended, but anything over 2 runs out of memory)
    warmup_steps=500,               # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,              # strength of weight decay
)

model = BartForSequenceClassification.from_pretrained("facebook/bart-large-mnli")

trainer = Trainer(
    model=model,                          # the instantiated πŸ€— Transformers model to be trained
    args=training_args,                   # training arguments, defined above
    compute_metrics=compute_metrics,      # a function to compute the metrics
    train_dataset=train_dataset,          # training dataset
    eval_dataset=test_dataset             # evaluation dataset
)

# Train the model
trainer.train()

The compute_metrics I use is:

import numpy as np
from datasets import Dataset, load_metric
from transformers import EvalPrediction

def compute_metrics(p: EvalPrediction):
  metric_acc = load_metric("accuracy")
  preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
  preds = np.argmax(preds, axis=1)
  result = {}
  result["accuracy"] = metric_acc.compute(predictions=preds, references=p.label_ids)["accuracy"]
  return result

But no matter how much training or test data I use, or how many epochs I train for, trainer.evaluate() always reports an accuracy of 0.5.

My questions are:

  1. How do I improve it?
  2. How do I implement other metrics for the evaluation, for example the F1 score?

I tried changing (adding) the metrics to this:

def compute_metrics(p: EvalPrediction):
  load_accuracy = load_metric("accuracy")
  load_f1 = load_metric("f1")
  preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
  preds = np.argmax(preds, axis=1)
  result = {}
  result["accuracy"] = load_accuracy.compute(predictions=preds, references=p.label_ids)["accuracy"]
  result["f1"] = load_f1.compute(predictions=preds, references=p.label_ids)["f1"]
  return result

But then I got this error while running trainer.evaluate():

ValueError: pos_label=1 is not a valid label. It should be one of [0, 2]


You can refer to my previous question for more details about my fine-tuning here


Update:

This is the tokenizer I used:

from transformers import BartTokenizerFast
tokenizer = BartTokenizerFast.from_pretrained('facebook/bart-large-mnli')

And as stated in my other linked questions, this is what I used in order to create and convert my dataset.

As I wrote above, you can refer to my earlier linked questions for more details about my whole process. I feel it's unnecessary to repeat everything in every single question, but correct me if I'm wrong.

Armendariz answered 18/5, 2023 at 5:30
Hi! I was not able to replicate the error you're getting. Would you mind providing a Minimal, Reproducible Example so it's possible to investigate what's going on? Also, why does your instantiated Trainer class not include a tokenizer? – Southdown
Hello @SimonDavid, what other information do you need? I added the parameters; the dataset itself I can't share, but I used this answer to prepare it. I have now added the tokenizer to the body of the question, and I linked my earlier questions, which you can refer to for all the links and the full process :) – Armendariz

An accuracy of 0.5 is not a satisfactory score; for a two-class setup it is chance level, which suggests the model is not actually learning.

Answer to your first question: how do you improve it?

As you mentioned, you have already tried increasing the number of epochs and the amount of data. You can also try training with a different optimizer instead of the default AdamW, and with a different weight decay.

Try using SGD or Adagrad.

  • They have fewer hyperparameters to tune than AdamW, which can make them easier to configure and more robust across datasets and architectures.

  • Adagrad in particular can help the model converge to a better solution by adapting the learning rate of each parameter based on its historical gradients. This is useful when the loss landscape is complex and the model has to navigate many local minima. (A small sketch of each update rule follows the code below.)

# For using SGD. Note that TrainingArguments has no `optimizer_type` or
# `optimizer_params` arguments; the supported way to swap the optimizer is to
# build it yourself and pass it to Trainer via its `optimizers` argument.
# (Recent transformers versions also accept optim="sgd" in TrainingArguments,
# but that does not let you set momentum.)
import torch.optim as optim
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir=model_directory,
    num_train_epochs=30,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=2,
    warmup_steps=500,
    weight_decay=0.01,
)

optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,  # specify the optimizer hyperparameters here
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    optimizers=(optimizer, None),  # scheduler left as None: Trainer builds its default warmup schedule
)


Here 'momentum' accelerates convergence by adding a fraction of the previous update to the current update of the model parameters. Vary the value (for example between 0.8 and 0.99) to find the best fit.
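
To make that concrete, here is a minimal toy sketch of the SGD-with-momentum update rule (illustrative numbers only, not part of the training code above):

lr, mu = 0.01, 0.9
v, p = 0.0, 1.0                  # velocity and a single toy parameter
for grad in [0.5, 0.5, 0.5]:     # pretend gradients
    v = mu * v + grad            # accumulate velocity (PyTorch's convention)
    p = p - lr * v               # the step grows as velocity builds up
    print(p)                     # roughly 0.995, 0.9855, 0.97195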

# For using Adagrad: the same pattern applies. Build the optimizer yourself
# and hand it to Trainer, reusing the training_args defined above.
import torch.optim as optim

optimizer = optim.Adagrad(
    model.parameters(),
    lr=0.01,
    initial_accumulator_value=0.1,  # specify the optimizer hyperparameters here
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    optimizers=(optimizer, None),
)

Here 'initial_accumulator_value' is the initial value of the historical gradient accumulator for each parameter. The accumulator is a running sum of the squares of the gradients for each parameter, and it is used to adapt that parameter's learning rate during training. Try varying the value.
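
To see what the accumulator does, here is a minimal toy sketch of the Adagrad update for a single parameter (illustrative numbers only):

import math

lr, eps = 0.01, 1e-10
acc, p = 0.1, 1.0                # accumulator starts at initial_accumulator_value
for grad in [0.5, 0.5, 0.5]:     # pretend gradients
    acc += grad ** 2             # running sum of squared gradients
    p -= lr * grad / (math.sqrt(acc) + eps)  # effective step shrinks as acc grows
    print(p)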

Answer to your second question.

Try setting the average parameter to 'macro' when computing the F1 score for multi-class classification. The error happens because the f1 metric defaults to average='binary' with pos_label=1, while your encoded references only contain the labels 0 and 2, which suggests something went wrong when your dataset labels were encoded. If you are actually doing binary classification, 0 or 2 are the valid values for pos_label, depending on which class you want to treat as positive (see the binary variant after the multi-class snippet below).

# for Multi-class classification
import numpy as np
from datasets import Dataset, load_metric
from transformers import EvalPrediction

def compute_metrics(p: EvalPrediction):
  metric_acc = load_metric("accuracy")
  metric_f1 = load_metric("f1")
  preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
  preds = np.argmax(preds, axis=1)
  result = {}
  result["accuracy"] = metric_acc.compute(predictions=preds, references=p.label_ids)["accuracy"]
  result["f1"] = metric_f1.compute(predictions=preds, references=p.label_ids, average='macro')["f1"]
  return result
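
If you keep binary averaging instead, the metric accepts a pos_label argument. Assuming 2 is the class you treat as positive (entailment in the MNLI label scheme), the F1 line becomes:

# For binary classification with labels encoded as {0, 2}
result["f1"] = metric_f1.compute(
    predictions=preds, references=p.label_ids, pos_label=2
)["f1"]
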
Ottoottoman answered 28/5, 2023 at 11:19
I see you also imported get_linear_schedule_with_warmup; is there a reason for that? You didn't use it. Also, I get 2 errors from your code: the first is that it's unable to import SGD, and the second is TypeError: __init__() got an unexpected keyword argument 'optimizer_type'. – Armendariz
It is not necessary, but if you want to optimize the model further you can try using it. get_linear_schedule_with_warmup adjusts the learning rate during training: a higher learning rate is used during the warmup period and then gradually decreases, so that the model can explore a wider range of parameter values during warmup and then fine-tune its parameters during the rest of training. – Ottoottoman
SGD can be imported from torch: import torch.optim as optim, then optimizer = optim.SGD(model.parameters(), lr=0.01). If you want to use get_linear_schedule_with_warmup, this is how: num_training_steps = 1000; num_warmup_steps = 100; scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps). – Ottoottoman
I also get the error TypeError: __init__() got an unexpected keyword argument 'optimizer_type'. Do you know why? – Armendariz
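
To see the warmup-then-decay shape described in the comments above, you can print the schedule from a dummy optimizer (the step counts are purely illustrative):

import torch
from transformers import get_linear_schedule_with_warmup

# A dummy parameter just to instantiate an optimizer.
opt = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=0.01)
sched = get_linear_schedule_with_warmup(opt, num_warmup_steps=100, num_training_steps=1000)

lrs = []
for _ in range(1000):
    opt.step()    # step the optimizer first to avoid a scheduler-order warning
    sched.step()
    lrs.append(sched.get_last_lr()[0])

print(lrs[0], lrs[99], lrs[999])  # ramps up to 0.01 by step 100, then decays toward 0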

I am also working on the same use case. You can use the snippet below to integrate the F1 score into your code.

import numpy as np
from sklearn.metrics import f1_score
from transformers import EvalPrediction

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(preds, axis=1)
    # 'weighted' averages the per-class F1 scores, weighted by class support
    f1 = f1_score(p.label_ids, preds, average='weighted')
    return {"f1": f1}
Gombach answered 14/8, 2023 at 18:59
