What's the meaning of "Using bos_token, but it is not set yet."?

When I run demo.py:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
model = AutoModel.from_pretrained("distilbert-base-multilingual-cased", return_dict=True)
# print(model)
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
print(count_parameters(model))
inputs = tokenizer("史密斯先生不在,他去看电影了。Mr Smith is not in. He ________ ________to the cinema", return_tensors="pt")
print(inputs)
outputs = model(**inputs)
print(outputs)

the code prints:

{'input_ids': tensor([[  101,  2759,  3417,  4332,  2431,  5600,  2080,  3031, 10064,  2196,
      2724,  5765,  5614,  3756,  2146,  1882, 12916, 11673, 10124, 10472,
     10106,   119, 10357,   168,   168,   168,   168,   168,   168,   168,
       168,   168,   168,   168,   168,   168,   168,   168,   168, 10114,
     10105, 18458,   119,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.

Why is bos_token printed?

Hamner answered 21/12, 2020 at 3:6 Comment(2)
If you want to add the standard special tokens, see: #73322962 – Blunder

The __call__ method of the tokenizer has an argument add_special_tokens, which defaults to True. This means a BOS (beginning-of-sentence) token is added at the beginning and an EOS (end-of-sentence) token at the end of the encoded sequence. If you do not want these symbols, set add_special_tokens to False.
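
For example, a minimal sketch with the same tokenizer as in the question (the two printed token lists should differ only by the special symbols):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

# default add_special_tokens=True: special symbols wrap the sentence
with_special = tokenizer("Mr Smith is not in.")
# add_special_tokens=False: only the tokens of the text itself
without_special = tokenizer("Mr Smith is not in.", add_special_tokens=False)

print(tokenizer.convert_ids_to_tokens(with_special["input_ids"]))
print(tokenizer.convert_ids_to_tokens(without_special["input_ids"]))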

However, note that models perform best when they use the same tokenization and special symbols as during training. From your example, it seems you want to feed the model a pair of sentences in different languages. Such pairs are typically separated by a special [SEP] token. You might therefore want to use the encode_plus method of the tokenizer, which encodes a sentence pair correctly for you, as in the sketch below.
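
A sketch using the two sentences from your question: calling the tokenizer with two positional text arguments encodes them as a pair (encode_plus accepts the same arguments), and the separator token is inserted between the two segments for you.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

encoded = tokenizer(
    "史密斯先生不在,他去看电影了。",
    "Mr Smith is not in. He ________ ________to the cinema",
    return_tensors="pt",
)
# inspect how the two segments were joined with special tokens
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist()))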

Booklet answered 21/12, 2020 at 9:40 Comment(1)
I want all the standard special tokens to be available, but they seem to be missing. How do I add them? I will fine-tune later to make sure they are in my model. For details see: #73322962 – Blunder

I think this is the right way to do it. Let me know if not:

from typing import Union

from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast


def add_special_all_special_tokens(tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast]):
    """
        special_tokens_dict = {"cls_token": "<CLS>"}

        num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
        print("We have added", num_added_toks, "tokens")
        # Notice: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e., the length of the tokenizer.
        model.resize_token_embeddings(len(tokenizer))

        assert tokenizer.cls_token == "<CLS>"

    """
    original_len: int = len(tokenizer)
    # collect only the special tokens the tokenizer is actually missing
    num_added_toks: dict = {}
    if tokenizer.bos_token is None:
        num_added_toks['bos_token'] = "<bos>"
    if tokenizer.cls_token is None:
        num_added_toks['cls_token'] = "<cls>"
    if tokenizer.sep_token is None:
        num_added_toks['sep_token'] = "<s>"
    if tokenizer.mask_token is None:
        num_added_toks['mask_token'] = "<mask>"
    num_new_tokens: int = tokenizer.add_special_tokens(num_added_toks)
    # every token we just registered must now be set on the tokenizer
    for attr_name, token in num_added_toks.items():
        assert getattr(tokenizer, attr_name) == token
    err_msg = f"Error, not equal: {len(tokenizer)=}, {original_len + num_new_tokens=}"
    assert len(tokenizer) == original_len + num_new_tokens, err_msg
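
For what it's worth, a usage sketch with the tokenizer and model from the question (assuming the function above; resizing the embeddings afterwards is needed so the newly added ids get vectors):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
model = AutoModel.from_pretrained("distilbert-base-multilingual-cased")

add_special_all_special_tokens(tokenizer)
# grow the embedding matrix to cover the newly added special tokens
model.resize_token_embeddings(len(tokenizer))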
Blunder answered 21/12, 2020 at 3:6 Comment(0)
