How to make a Trainer pad inputs in a batch with huggingface-transformers?

I'm trying to train a model using a Trainer. According to the documentation (https://huggingface.co/transformers/master/main_classes/trainer.html#transformers.Trainer), I can specify a tokenizer:

tokenizer (PreTrainedTokenizerBase, optional) – The tokenizer used to preprocess the data. If provided, will be used to automatically pad the inputs to the maximum length when batching inputs, and it will be saved along the model to make it easier to rerun an interrupted training or reuse the fine-tuned model.

So padding should be handled automatically, but when trying to run it I get this error:

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

The tokenizer is created this way:

tokenizer = BertTokenizerFast.from_pretrained(pretrained_model)

And the Trainer like that:

trainer = Trainer(
    tokenizer=tokenizer,
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=dev,
    compute_metrics=compute_metrics
)

I've tried putting the padding and truncation parameters in the tokenizer, in the Trainer, and in the training_args. Nothing works. Any ideas?

Lankester answered 24/9, 2020 at 13:13 Comment(2)
Have you tried looking in the model config? config = AutoConfig.from_pretrained(...) – Mariselamarish
Same issue here – have you been able to find a solution? – Tav

Look at the columns your tokenizer is returning. You might want to limit them to only the required columns.

For example:

def preprocess_function(examples):
    # Tokenize the dataset (single sentences, or sentence pairs when sentence2_key is set).
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True, padding=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True, padding=True)


encoded_dataset = dataset.map(preprocess_function, batched=True, load_from_cache_file=False)


# Then keep only the columns the model actually needs:

columns_to_return = ['input_ids', 'label', 'attention_mask']
encoded_dataset.set_format(type='torch', columns=columns_to_return)
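
As for sentence1_key and sentence2_key: they are assumed here to be the names of the text columns in your dataset (sentence2_key being None for single-sentence tasks), as in the Hugging Face GLUE fine-tuning examples. A minimal sketch, with the task names and column names only as illustrations:

# Map each task to its text column names; the second entry is None
# when the task has a single input sentence.
task_to_keys = {
    "cola": ("sentence", None),
    "mrpc": ("sentence1", "sentence2"),
}
sentence1_key, sentence2_key = task_to_keys["mrpc"]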
Asquith answered 1/2, 2021 at 10:43 Comment(1)
What are sentence1_key and sentence2_key here? – Sidonie

I was able to solve this problem by adding a data collator to the Trainer:

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=...,
    eval_dataset=...,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    tokenizer=tokenizer,
    optimizers=(optimizer, None),
)
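
For plain sequence-classification tasks (rather than token classification), the analogous collator is DataCollatorWithPadding, which pads each batch dynamically to its longest example. A sketch along the same lines, not part of the original answer:

from transformers import DataCollatorWithPadding

# Pads every batch to the length of its longest example at collation time.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=...,   # fill in your tokenized datasets
    eval_dataset=...,
    data_collator=data_collator,
    tokenizer=tokenizer,
)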
Exanimate answered 3/2, 2022 at 20:44 Comment(0)

I solved it by setting remove_unused_columns=True in TrainingArguments:

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-v1",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=False,
    fp16=False,
    logging_steps=logging_steps,
    remove_unused_columns=True,
)
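
The error in the question often comes from leftover dataset columns (for example, raw text strings) that cannot be converted to tensors. A quick way to see what the tokenized dataset still carries, sketched with a hypothetical raw column named "text":

# Columns holding plain strings cannot be collated into tensors,
# so drop them (or let the Trainer remove the unused ones).
print(encoded_dataset.column_names)
encoded_dataset = encoded_dataset.remove_columns(["text"])  # "text" is a hypothetical column name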
Lelia answered 9/5, 2023 at 12:33 Comment(0)

I had the same error when one of the inputs to the tokenizer was None.

My tokenizer takes two texts at the same time (so BERT will add [SEP] between them).
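
One way to guard against that failure mode is to filter such rows out before tokenizing. A minimal sketch with hypothetical column names text_a and text_b:

# Drop examples where either text field is missing, since passing None
# to the tokenizer raises this ValueError.
dataset = dataset.filter(
    lambda example: example["text_a"] is not None and example["text_b"] is not None
)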

Gravelblind answered 4/2, 2022 at 19:44 Comment(0)
