transformers >= 4.0.0:
Use the __call__ method of the tokenizer. It generates a dictionary that contains the input_ids, the token_type_ids and the attention_mask as lists, one per input sentence:
tokenizer(['this is the first sentence', 'another sentence'])
Output:
{'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 6251, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1]]}
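For reference, a minimal runnable sketch of the above. The checkpoint name bert-base-uncased is an assumption, but it is the vocabulary that produces exactly the IDs shown (101 = [CLS], 102 = [SEP], 6251 = "sentence"):
from transformers import AutoTokenizer

# Assumption: bert-base-uncased; its vocabulary yields the IDs shown above.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
encoded = tokenizer(['this is the first sentence', 'another sentence'])
print(encoded['input_ids'])  # [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 6251, 102]]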
transformers < 4.0.0:
Use tokenizer.batch_encode_plus (documentation). It generates a dictionary that contains the input_ids, the token_type_ids and the attention_mask as lists, one per input sentence:
tokenizer.batch_encode_plus(['this is the first sentence', 'another sentence'])
Output:
{'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 6251, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1]]}
Applies to both __call__ and batch_encode_plus:
In case you only want to generate the input_ids, set return_token_type_ids and return_attention_mask to False:
tokenizer.batch_encode_plus(['this is the first sentence', 'another sentence'], return_token_type_ids=False, return_attention_mask=False)
Output:
{'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 6251, 102]]}
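Since these keyword arguments also work with __call__, the equivalent for transformers >= 4.0.0 is (a sketch, same tokenizer assumed):
tokenizer(['this is the first sentence', 'another sentence'], return_token_type_ids=False, return_attention_mask=False)
# {'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 6251, 102]]}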