"AssertionError: Cannot handle batch sizes > 1 if no padding token is > defined" and pad_token = eos_token
I am trying to fine-tune a pre-trained GPT-2 model. When applying the corresponding tokenizer, I originally got the message:

Using pad_token, but it is not set yet.

Thus, I changed my code to:

GPT2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
GPT2_tokenizer.pad_token = GPT2_tokenizer.eos_token
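
For reference, with the pad token set this way, batch encoding with padding works at the tokenizer level (illustrative snippet, not my actual data):

# illustrative only: with pad_token set, the tokenizer can pad a batch
batch = GPT2_tokenizer(
    ["a short example", "a somewhat longer example sentence"],
    padding=True,            # shorter sequences are padded with eos_token_id
    return_tensors="pt",
)
print(batch["input_ids"].shape)   # e.g. torch.Size([2, N]), N = longest sequence
print(batch["attention_mask"])    # padded positions are 0 in the mask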

When later calling trainer.train(), I end up with the following error:

AssertionError: Cannot handle batch sizes > 1 if no padding token is defined.

Since I specifically defined the pad_token above, I expect these errors (or rather my fix of the original error and this new error) to be related, although I could be wrong. Is this a known problem where eos_token and pad_token somehow interfere? Is there an easy work-around?

Thanks a lot!

Southeaster answered 22/6, 2021 at 13:19 Comment(0)

I've been running into a similar problem, producing the same error message you were receiving. I can't be sure if your problem and my problem were caused by the same issue, since I can't see your full stack trace, but I'll post my solution in case it can help you or someone else who comes along.

You were totally correct to fix the first issue you described with your tokenizer by setting its pad token with the code provided. However, I also had to set the pad_token_id of my model's configuration to get my GPT2 model to function properly. I did this in the following way:

from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

# set up your tokenizer, just like you described, and set the pad token
GPT2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
GPT2_tokenizer.pad_token = GPT2_tokenizer.eos_token
# load the pretrained model; from_pretrained is a class method, so there is no
# need to instantiate the model from a GPT2Config first
# (model_name is e.g. "gpt2", device is your torch device)
model = GPT2ForSequenceClassification.from_pretrained(model_name).to(device)
# set the pad token of the model's configuration
model.config.pad_token_id = model.config.eos_token_id

I suppose this is because the tokenizer and the model are configured separately, and each needs to know which ID is being used for the pad token. I can't tell whether this will fix your problem (since this post is six months old, it may not matter anyway), but hopefully my answer can help someone else.
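
For example, after the snippet above, a quick sanity check (just a sketch, reusing the variables defined there) confirms the two agree:

# both should print 50256, the id of GPT-2's eos token
print(GPT2_tokenizer.pad_token_id)
print(model.config.pad_token_id)
assert GPT2_tokenizer.pad_token_id == model.config.pad_token_id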

Cockeye answered 8/12, 2021 at 16:6 Comment(2)
To match the tokenizer, you could also consider model.config.pad_token_id = GPT2_tokenizer.pad_token_id – Seneschal
I'm new to LLMs and the HF library. What exactly does assigning eos_token to pad_token do? Won't it affect the output of model inference? – Microelement
