I'm trying to train a model using a Trainer. According to the documentation (https://huggingface.co/transformers/master/main_classes/trainer.html#transformers.Trainer), I can specify a tokenizer:
tokenizer (PreTrainedTokenizerBase, optional) – The tokenizer used to preprocess the data. If provided, will be used to automatically pad the inputs to the maximum length when batching inputs, and it will be saved along the model to make it easier to rerun an interrupted training or reuse the fine-tuned model.
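As far as I can tell, this means the Trainer wraps the tokenizer in a padding collator and pads each batch to the length of its longest example, i.e. something like this (a sketch of my understanding of the docs, not code from them; the checkpoint name is a stand-in):

from transformers import BertTokenizerFast, DataCollatorWithPadding

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # stand-in checkpoint

# My understanding: with a tokenizer set, the Trainer pads batches
# dynamically, the way this collator does.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
batch = data_collator([
    {"input_ids": [101, 7592, 102]},                  # [CLS] hello [SEP]
    {"input_ids": [101, 7592, 2088, 999, 102]},       # [CLS] hello world ! [SEP]
])
print(batch["input_ids"].shape)  # torch.Size([2, 5]) — padded to the longest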
So padding should be handled automatically, but when trying to run it I get this error:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
The tokenizer is created this way:
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained(pretrained_model)
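For context, the train/dev datasets are tokenized without padding, on the assumption that the Trainer would pad per batch. A minimal sketch of that kind of preprocessing (the dataset contents and column names are placeholders, not my actual data):

from datasets import Dataset

# Placeholder data; my real datasets are built the same way, minus padding.
raw = Dataset.from_dict({
    "text": ["a short example", "a somewhat longer example sentence"],
    "label": [0, 1],
})

def tokenize(batch):
    # No padding here on purpose: examples keep their own lengths,
    # expecting the Trainer to pad them when it builds batches.
    return tokenizer(batch["text"], truncation=True)

train = raw.map(tokenize, batched=True, remove_columns=["text"])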
And the Trainer like that:
trainer = Trainer(
    tokenizer=tokenizer,
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=dev,
    compute_metrics=compute_metrics,
)
I've tried putting the padding and truncation parameters in the tokenizer, in the Trainer, and in the training_args, but nothing works.
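For example, the tokenizer-side variant looked roughly like this (again a sketch, not my exact code):

def tokenize(batch):
    return tokenizer(
        batch["text"],   # placeholder column name, as in the sketch above
        padding=True,    # also tried padding="max_length"
        truncation=True,
    )

Any idea what I'm missing?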