I am using the T5 model and tokenizer for a downstream task. I want to add certain whitespace tokens to the tokenizer, like line ending (\n) and tab (\t). Adding these tokens works, but somehow the tokenizer always ignores every second whitespace. So it tokenizes the sequence "\n\n" as a single line ending, the sequence "\n\n\n\n" as two line endings, and so on. See below to reproduce.
from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-large")
tokenizer.add_tokens(["\n"])
tokenizer.encode("\n") # returns [32100, 1] as expected
tokenizer.encode("\n\n") # returns [32100, 1] but expected would be [32100, 32100, 1]
tokenizer.encode("\n\n\n\n") # returns [32100, 32100, 1] but expected would be [32100, 32100, 32100, 32100, 1]
What is the reasoning behind this behaviour? Is it a bug, or is it related to how the tokenizer works? I noticed that this only happens for added whitespace tokens but not for other characters.
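For comparison, adding a non-whitespace token keeps the repeats on my end (the string "[NL]" is just an arbitrary placeholder I picked for this test):

tokenizer.add_tokens(["[NL]"])  # arbitrary non-whitespace token, gets id 32101
tokenizer.encode("[NL][NL]")  # returns [32101, 32101, 1], i.e. the repeat is kept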
Is there a way to prevent the tokenizer from ignoring the repeated whitespaces?
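For now I can work around it by splitting the text on "\n" myself and splicing the id back in (a minimal sketch building on the snippet above; encode_keeping_newlines is my own helper, not a transformers API), but I would prefer a fix on the tokenizer side:

newline_id = tokenizer.convert_tokens_to_ids("\n")  # 32100

def encode_keeping_newlines(text):
    ids = []
    for i, piece in enumerate(text.split("\n")):
        if i > 0:
            ids.append(newline_id)  # emit one id per "\n", including consecutive ones
        if piece:
            ids.extend(tokenizer.encode(piece, add_special_tokens=False))
    ids.append(tokenizer.eos_token_id)  # re-append </s>, as encode() normally would
    return ids

encode_keeping_newlines("\n\n")  # returns [32100, 32100, 1]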