How to encode multiple sentences using transformers.BertTokenizer?

Asked 1/7, 2020 at 3:32 Answered 19/7, 2022 at 7:42

Solved word-embedding huggingface-transformers huggingface-tokenizers

I would like to create a minibatch by encoding multiple sentences using transformers.BertTokenizer. It seems to work for a single sentence. How can I make it work for several sentences?

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# tokenizing a single sentence seems to work
tokenizer.encode('this is the first sentence')
>>> [2023, 2003, 1996, 2034, 6251]

# tokenize two sentences
tokenizer.encode(['this is the first sentence', 'another sentence'])
>>> [100, 100] # expecting 7 tokens

Lexi answered 1/7, 2020 at 3:32 Comment(0)

transformers >= 4.0.0:
Use the __call__ method of the tokenizer. It returns a dictionary containing the input_ids, token_type_ids and attention_mask, each as a list with one entry per input sentence:

tokenizer(['this is the first sentence', 'another setence'])

Output:

{'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 2275, 10127, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

transformers < 4.0.0:
Use tokenizer.batch_encode_plus (documentation). It returns a dictionary containing the input_ids, token_type_ids and attention_mask, each as a list with one entry per input sentence:

tokenizer.batch_encode_plus(['this is the first sentence', 'another setence'])

Output:

{'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 2275, 10127, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

Applies to __call__ and batch_encode_plus:
If you only want the input_ids, set return_token_type_ids and return_attention_mask to False:

tokenizer.batch_encode_plus(['this is the first sentence', 'another setence'], return_token_type_ids=False, return_attention_mask=False)

Output:

{'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 2275, 10127, 102]]}
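
Note that the lists under input_ids in the outputs above have different lengths (7 and 5), so they cannot be stacked into one tensor as they are. Below is a minimal sketch, in plain Python, of padding them into a rectangular minibatch. The helper name pad_batch is hypothetical, and pad_id=0 assumes the [PAD] id of bert-base-uncased. In recent transformers versions, passing padding=True to the tokenizer call should make this manual step unnecessary.

```python
def pad_batch(input_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build a matching attention mask.

    pad_batch is a hypothetical helper; pad_id=0 is assumed to be the [PAD] id.
    """
    max_len = max(len(ids) for ids in input_ids)
    padded, mask = [], []
    for ids in input_ids:
        n_pad = max_len - len(ids)
        padded.append(list(ids) + [pad_id] * n_pad)   # real tokens, then padding
        mask.append([1] * len(ids) + [0] * n_pad)     # 1 = real token, 0 = padding
    return {"input_ids": padded, "attention_mask": mask}


# The two encoded sentences from the output above.
batch = pad_batch([
    [101, 2023, 2003, 1996, 2034, 6251, 102],
    [101, 2178, 2275, 10127, 102],
])
print(batch["input_ids"][1])       # [101, 2178, 2275, 10127, 102, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 1, 1, 0, 0]
```

The attention mask marks the padded positions with 0 so that a model can ignore them.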

Ricoriki answered 2/7, 2020 at 2:56 Comment(7)

Thank you for the answer. I got the error: AttributeError: 'BertTokenizer' object has no attribute 'batch_encode_plus'. Does the tokenizer in your answer refer to some other object? – Lexi 4/7, 2020 at 14:17

@LeiHao No, maybe you are using an older transformers version? Which version do you use? – Ricoriki 4/7, 2020 at 15:49

It's 2.1.1. I installed it using Conda. Which version are you using? – Lexi 5/7, 2020 at 14:19

3.0.0 but you need to install it from pip directly and not conda. Create a new conda environment and install everything via pip. – Ricoriki 5/7, 2020 at 15:36

Can you provide a link to the documentation for tokenizer.batch_encode_plus? It would add more value to the answer. – Andee 28/3, 2021 at 11:16

I think we can also do without batch_encode_plus... tokenizer(['this is the first sentence', 'another setence'], return_token_type_ids=False, return_attention_mask=False) – Slowpoke 18/10, 2022 at 19:34

For decoding of multiple sentences: tokenizer.batch_decode(ids) – Phyllys 26/11, 2022 at 9:43

What you did is almost correct. You can pass the sentences as a list to the tokenizer.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
two_sentences = ['this is the first sentence', 'another sentence']


tokenized_sentences = tokenizer(two_sentences)

The last line of code makes the difference.

tokenized_sentences is a dict containing the following information:

{'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 6251, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1]]}

where the list of sentences produces a list of tokenized sentences stored under the input_ids key.

'this is the first sentence' = [101, 2023, 2003, 1996, 2034, 6251, 102] and 'another sentence' = [101, 2178, 6251, 102].

101 is the start token ([CLS]) and 102 is the stop token ([SEP]).
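
To connect this to the question: the five ids shown there for the single sentence are the same ids as above with 101 and 102 wrapped around them, which is where the expected 7 tokens come from. A small sketch in plain Python, where wrap_with_special_tokens and strip_special_tokens are hypothetical helpers and the ids 101 ([CLS]) and 102 ([SEP]) are assumed to be those of bert-base-uncased:

```python
CLS_ID = 101  # assumed id of [CLS] in bert-base-uncased
SEP_ID = 102  # assumed id of [SEP] in bert-base-uncased


def wrap_with_special_tokens(ids):
    """Hypothetical helper: put the start and stop ids around a plain id list."""
    return [CLS_ID] + list(ids) + [SEP_ID]


def strip_special_tokens(ids):
    """Hypothetical helper: drop the start and stop ids again."""
    return [i for i in ids if i not in (CLS_ID, SEP_ID)]


plain = [2023, 2003, 1996, 2034, 6251]      # ids from the question
wrapped = wrap_with_special_tokens(plain)
print(wrapped)                               # [101, 2023, 2003, 1996, 2034, 6251, 102]
print(strip_special_tokens(wrapped) == plain)  # True
```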

Hypogeous answered 19/7, 2022 at 7:42 Comment(0)
