Why does the Hugging Face T5 tokenizer ignore some of the whitespaces?
I am using the T5 model and tokenizer for a downstream task. I want to add certain whitespace tokens to the tokenizer, such as line ending (\n) and tab (\t). Adding these tokens works, but somehow the tokenizer always ignores the second whitespace. So it tokenizes the sequence "\n\n" as a single line ending, the sequence "\n\n\n\n" as two line endings, and so on. See below to reproduce.

from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-large")
tokenizer.add_tokens(["\n"])

tokenizer.encode("\n") # returns [32100, 1] as expected
tokenizer.encode("\n\n") # returns [32100, 1] but expected would be [32100, 32100, 1]
tokenizer.encode("\n\n\n\n") # returns [32100, 32100, 1] but expected would be [32100, 32100, 32100, 32100, 1]

What is the reasoning behind this behaviour? Is it a bug or something related to how the tokenizer works? I noticed that this only happens for added whitespace tokens but not for other characters.

Is there a way to prevent the tokenizer from ignoring the repeated whitespaces?

Laager answered 12/5, 2022 at 11:4 Comment(1)
related, do you know how to do this: #73322962? – Abutter
The behaviour is explained by how the tokenize method in T5Tokenizer strips tokens by default. What one can do is add the token '\n' as a special token to the tokenizer. Because special tokens are never separated, it works as expected.

It is a bit hacky but seems to work.

from tokenizers import AddedToken
tokenizer.add_special_tokens({"additional_special_tokens": [AddedToken("\n")]})
print(tokenizer.special_tokens_map)

Then it tokenizes the '\n' without skipping any occurrences. Note that AddedToken is important, because somehow the following does NOT work:

tokenizer.add_special_tokens({"additional_special_tokens": ["\n"]})

Edit

After spending more time on it, I actually found a way to add it as a normal token without using special tokens. The main reason for the issue is the normalization process that happens behind the scenes even before the tokenization. When you add a new token, you can specify whether it should be normalized or not. By setting normalized to False, you prevent the tokenizer from stripping consecutive occurrences of the added token.

from tokenizers import AddedToken
tokenizer.add_tokens(AddedToken("\n", normalized=False))

You can find more information at this link: https://huggingface.co/course/en/chapter6/4?fw=pt

Laager answered 19/5, 2022 at 13:52 Comment(2)
related, do you know how to do this: #73322962? – Abutter
I read it but did not understand what you are trying to do – Laager