In the Tokenizer documentation from Hugging Face, the __call__ function accepts List[List[str]] and says:
text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
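So, as I read it, a "pretokenized string" is a single sequence given as a list of words, e.g. (a minimal sketch, using the same model as in my snippets below):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

# one pretokenized sequence: a list of words, flagged as such
tokenizer(["hello", "this", "is", "a", "test"], is_split_into_words=True)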
Things run normally if I run:
test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")
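This produces the expected padded batch. A quick sanity check, assuming the snippet above (distilbert's model_max_length is 512, so padding="max_length" pads every row to 512 tokens):

print(tokenized_test["input_ids"].shape)       # torch.Size([5, 512])
print(tokenized_test["attention_mask"].shape)  # torch.Size([5, 512])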
But if I try to emulate batches of sentences:
test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]
test = [test, test]
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")
I get:
Traceback (most recent call last):
File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/modify_scores.py", line 53, in <module>
tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")
File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2548, in __call__
encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2634, in _call_one
return self.batch_encode_plus(
File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2825, in batch_encode_plus
return self._batch_encode_plus(
File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 428, in _batch_encode_plus
encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
Is the documentation wrong? I just need a way to tokenize and predict in batches; it shouldn't be that hard.
Is it something to do with the is_split_into_words argument?
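My current guess: List[List[str]] is only ever interpreted as a batch of pretokenized sequences, i.e. it requires is_split_into_words=True, and each inner list must be the words of one sequence, not a list of whole sentences. A sketch of that interpretation (this variant does run):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

# each inner list is ONE sequence already split into words
pretokenized = [["hello", "this", "is", "a", "test"],
                ["another", "pretokenized", "sequence"]]
tokenized = tokenizer(pretokenized, is_split_into_words=True,
                      padding="max_length", truncation=True, return_tensors="pt")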
Contextualizing
I will feed that into a sentiment score model (the one defined in the code snippets). I am facing OOM problems when predicting, so I need to feed the data to the model in batches.
The documentation (referred to above) states that I can feed List[List[str]] to the tokenizer, which does not seem to be the case. The question remains the same: how do I tokenize batches of sentences?
Note: I don't strictly need the tokenizing process itself to happen in batches, although that would yield batches of tokens/attention masks and solve my problem: using the model for prediction with batches like this:
import torch

# `model` is the sentiment classification model mentioned above;
# `tokenized_test` comes from the tokenizer call in the first snippet
with torch.no_grad():
    logits = model(**tokenized_test).logits
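In case it helps frame the question, what I'm ultimately after is something like this sketch: chunk the sentence list and run the model batch by batch (batch_size is arbitrary; names other than test are mine):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

batch_size = 2  # small enough to avoid OOM
all_logits = []
for i in range(0, len(test), batch_size):
    chunk = test[i:i + batch_size]  # a plain List[str] per batch
    tokens = tokenizer(chunk, padding="max_length", truncation=True, return_tensors="pt")
    with torch.no_grad():
        all_logits.append(model(**tokens).logits)
logits = torch.cat(all_logits)  # shape: (len(test), num_labels)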
Comments:

'distilbert-base-uncased-finetuned-sst-2-english' model to do classification? We're asking because that specific model and tokenizer supports text_pairs. Be nice, we're not mind-readers ;P and just volunteers. – Raft

AutoModel, AutoModelForSequenceClassification, or something else? – Raft