How to do Tokenizer Batch processing? - HuggingFace

In the Tokenizer documentation from Hugging Face, the call function accepts List[List[str]] and says:

text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

things run normally if I run:

 test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]
 tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
 tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")

but if I try to emulate batches of sentences:

 test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]
 test = [test, test]
 tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
 tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")

I get:

Traceback (most recent call last):
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/modify_scores.py", line 53, in <module>
    tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2548, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2634, in _call_one
    return self.batch_encode_plus(
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2825, in batch_encode_plus
    return self._batch_encode_plus(
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 428, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

Is the documentation wrong? I just need a way to tokenize and predict using batches; it shouldn't be that hard.

Is it something to do with the is_split_into_words argument?


Contextualizing

I will feed the tokenized output into a sentiment score model (the one defined in the code snippets). I am facing OOM problems during prediction, so I need to feed the data to the model in batches.

The documentation (quoted above) states that I can feed List[List[str]] into the tokenizer, which does not seem to be the case. The question remains the same: how do I tokenize batches of sentences?

Note: I don't strictly need the tokenizing process itself to happen in batches (although that would yield batches of tokens/attention masks and would solve my problem); what I ultimately need is to use the model for prediction in batches, like this:

with torch.no_grad():
    logits = model(**tokenized_test).logits
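
For reference, this is roughly the kind of manual chunking I could fall back on if there is no built-in way (a minimal sketch; the batch size of 2 is just a placeholder, and tokenizer/model are the ones referenced above):

import torch

batch_size = 2  # placeholder; would be tuned to whatever fits in memory
all_logits = []
with torch.no_grad():
    for i in range(0, len(test), batch_size):
        batch = test[i:i + batch_size]  # a plain List[str] slice
        encoded = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
        all_logits.append(model(**encoded).logits)
logits = torch.cat(all_logits)  # one (len(test), num_labels) tensor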

Billon answered 7/6, 2023 at 10:15 Comment(17)
Please add to the question a little more detail on the inputs and the expected outputs. If I'm understanding it correctly, you are trying to run the tokenizer on a list of strings, and inside the list of strings there are some multi-word expressions?Raft
No, I don't know why you are assuming so many things and changing my question so many times. The question is clear: I need to tokenize my dataset (collection of sentences) into batches. That is all. Please stop changing my question.Billon
What is the NLP task you're working on? Which model are you using eventually? And is it for classification/similarity? What does your data look like before feeding it to the tokenizer? Having that information will help us help you better.Raft
It's because the question is ambiguous and I'm trying to get more clarification, otherwise we'll all be guessing. Please fill in the information asked in the comment above.Raft
1. Different tokenizers work differently in Hugging Face (unlike non-pretrained NLP models). 2. The task you are working on determines how the tokenizer function works. 3. Not having an example of the input and expected output will not help us help you.Raft
The example IS there. The model name IS there.Billon
And you're using the 'distilbert-base-uncased-finetuned-sst-2-english' model to do classification? We're asking because that specific model and tokenizer support text_pairs. Be nice, we're not mind-readers ;P, just volunteers.Raft
How is the model initialized? Using AutoModel, AutoModelForSequenceClassification or something else?Raft
yes, AutoModelForSequenceClassification. For Sentiment Classification but I use the logits to add a third class (NEUTRAL). That might be relevant context.Billon
One last question, for inference/training? It's different cos for inference, you have some nice feature to automate more things.Raft
Inference (prediction). As stated in the Contextualizing section of the question.Billon
For OOM issues, it's best to specify which GPU type and how many parallel GPU instances you have and how much RAM for each. And how many data points you need to infer/predict? Otherwise it's hard to solve the problem.Raft
You still haven't answered the only question I've posed.Billon
I think you're not using the function as intended. The documentation could be wrong, but isn't it easier to achieve the task you need rather than asking whether the documentation is wrong? Or is the question "is the documentation wrong?" (I'm not sure the documentation is wrong, though.)Raft
The question reads: "How to do Tokenizer Batch processing?". Of course that assumes more than one batch.Billon
Did the pipeline batch_size help? It'll automatically batch-process and run the inputs through the model, and it avoids manually fitting batches with a custom function. If it didn't, then maybe someone else can come along and help you with the question.Raft
I need to access the logits from the prediction, not only the result.Billon

Use pipelines, but there is a catch.

Because the pipeline wraps all the processing steps, you need to pass the args for each of them when needed. For the tokenizer, we define:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline, pipeline

selected_model = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(selected_model)
tokenizer_kwargs = {'padding': True, 'truncation': True, 'max_length': 512}

The model is straightforward:

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

Then finally:

classifier = pipeline("text-classification", model=model, batch_size=32, tokenizer=tokenizer)

Specific to my application:

Since I need the logits and not the predicted classes, I will have to modify the pipeline class. The documentation says that in order to create a custom pipeline class, I need to implement four mandatory methods: preprocess, _forward, postprocess, and _sanitize_parameters... OR I can simply override the postprocess method of TextClassificationPipeline:

class MyPipeline(TextClassificationPipeline):
    def postprocess(self, model_outputs):
        return model_outputs["logits"][0]

and modify the call:

classifier = pipeline("text-classification", model=model, batch_size=32, tokenizer=tokenizer, pipeline_class=MyPipeline)

logits = classifier(text, **tokenizer_kwargs)
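
If a single tensor is needed afterwards, the list returned by the pipeline can be stacked (a small follow-up sketch; it assumes each element is the 1-D logits tensor returned by the postprocess above):

import torch

logits_tensor = torch.stack(logits)  # shape: (len(text), num_labels)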
Billon answered 12/6, 2023 at 12:40 Comment(0)

How to tokenize a list of sentences?

If it's just tokenizing a list of sentences, do this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]
 
tokenizer(test)

It does the batching automatically:

{'input_ids': [
 [101, 7592, 2023, 2003, 1037, 3231, 102], [101, 2008, 21743, 1037, 2862, 1997, 11746, 102], 
 [101, 2046, 1037, 2862, 1997, 2862, 1997, 11746, 102], 
 [101, 1999, 2344, 2000, 7861, 9869, 1010, 1999, 2023, 2553, 1010, 2048, 14108, 2229, 1997, 1996, 2168, 18798, 13900, 102], 
 [101, 2000, 2022, 19204, 3550, 2011, 1996, 1044, 2546, 19204, 17629, 2005, 1996, 4225, 2944, 102]], 

'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

How to use it with the AutoModelForSequenceClassification?

To use it with AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english'), do this:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]

model(**tokenizer(test, return_tensors='pt', padding=True, truncation=True))

[out]:

SequenceClassifierOutput(loss=None, logits=tensor([[ 1.5094, -1.2056],
        [-3.4114,  3.5229],
        [ 1.8835, -1.6886],
        [ 3.0780, -2.5745],
        [ 2.5383, -2.1984]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
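
If you want per-class probabilities from those logits (this model has two labels, NEGATIVE and POSITIVE), a softmax is enough; a minimal sketch reusing the call above:

import torch

with torch.no_grad():
    outputs = model(**tokenizer(test, return_tensors='pt', padding=True, truncation=True))

probs = torch.softmax(outputs.logits, dim=-1)  # one [NEGATIVE, POSITIVE] pair per sentence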

How to use the distilbert-base-uncased-finetuned-sst-2-english model for sentiment classification?

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)


text = ['hello this is a test',
 'that transforms a list of sentences',
 'into a list of list of sentences',
 'in order to emulate, in this case, two batches of the same lenght',
 'to be tokenized by the hf tokenizer for the defined model']
 
classifier(text)

[out]:

[{'label': 'NEGATIVE', 'score': 0.9379092454910278},
 {'label': 'POSITIVE', 'score': 0.9990271329879761},
 {'label': 'NEGATIVE', 'score': 0.9726701378822327},
 {'label': 'NEGATIVE', 'score': 0.9965035915374756},
 {'label': 'NEGATIVE', 'score': 0.9913086891174316}]
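
If you want the scores for both classes rather than just the top label, the text-classification pipeline accepts a top_k argument in recent transformers versions (return_all_scores=True in older ones); assuming that is available:

classifier(text, top_k=None)  # one list of {'label': ..., 'score': ...} dicts per input, covering both classes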

What happens when I've OOM issues with GPU?

If it's distilbert-base-uncased-finetuned-sst-2-english, you can often just use the CPU; with that model you won't face many OOM issues.

If you need to use a GPU, consider using pipeline(...) for inference; it comes with a batch_size option, e.g.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)


text = ['hello this is a test',
 'that transforms a list of sentences',
 'into a list of list of sentences',
 'in order to emulate, in this case, two batches of the same lenght',
 'to be tokenized by the hf tokenizer for the defined model']

classifier(text, batch_size=2, truncation="only_first")

When you face OOM issues, it is usually not the tokenizer causing the problem, unless you have loaded the full dataset onto the device.

If it is the model not being able to predict when you feed it the large dataset at once, consider using pipeline instead of model(**tokenizer(text)).

Take a look at https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching
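
That page also shows that a pipeline can consume an iterator and yield results lazily, so the whole corpus never has to be held in device memory at once. A minimal sketch under that assumption (text_stream is just an illustrative generator; in practice it could read sentences from a file):

from transformers import pipeline

classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

def text_stream():
    # stands in for lazily reading your real data
    for sentence in text:
        yield sentence

# results are yielded one by one while the model runs batch_size texts at a time
for prediction in classifier(text_stream(), batch_size=2, truncation="only_first"):
    print(prediction)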


If the question is about the is_split_into_words argument, then from the docs:

text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

And from the code

if is_split_into_words:
    is_batched = isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))
else:
    is_batched = isinstance(text, (list, tuple))

And if we run that check to see whether your input counts as is_batched (with is_split_into_words=True):

text = ["hello", "this", "is a test"]
isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))

[out]:

False

But when you wrap the tokens in a list,

text = [["hello", "this", "is a test"]]
isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))

[out]:

True

Therefore, using the tokenizer with is_split_into_words=True to get batch processing working properly would look something like this:

from transformers import AutoTokenizer
from sacremoses import MosesTokenizer

moses = MosesTokenizer()
sentences = ["this is a test", "hello world"]
pretokenized_sents = [moses.tokenize(s) for s in sentences]

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

tokenizer(
  text=pretokenized_sents, 
  padding="max_length", 
  is_split_into_words=True, 
  truncation=True, 
  return_tensors="pt"
)

[out]:

{'input_ids': tensor([[ 101, 2023, 2003,  ...,    0,    0,    0],
        [ 101, 7592, 2088,  ...,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

Note: the is_split_into_words argument is not for processing batches of sentences; it is used to indicate that your input to the tokenizer is already pre-tokenized.

Raft answered 7/6, 2023 at 11:8 Comment(15)
This argument is used when you have already pretokenized the input. Thus, your input is split into WORDS. Mine isn't. My input is split into sentences, not into words. The value for the argument should be FALSE. You don't have batches in your solution. You basically modify the code to fit a solution that doesn't suit my problem. I need to do batch processing, not just a justification for using the is_split_into_words argument.Billon
Hmmm, I think it'll work too =) Let me edit my answer. BTW, be nice, we're all volunteers trying to help others with some answer.Raft
So you are saying that I need to tokenize twice to do batch processing?Billon
Nope, just once, but it depends on what you want to achieve. If you already pre-tokenized the input, is_split_into_words actually helps you stitch it back together, not to "activate the batching".Raft
I will edit the questionBillon
The last question you've added is the one I posed originally, "not really caring whether the inputs are pre-tokenized or not". But you don't provide a multi-batch solution, only an example similar to what I've already said works fine. Duplicating this 'single batch' is where I face the error.Billon
I am not trying to do sentence similarity. I am trying to get sentiment scores. The model I'm using is for sentiment scores.Billon
Then I think it works a different way, which model exactly are you trying to use eventually? The tokenizer is tied very much to the model you want to use.Raft
Also, what does your input actually look like, and are you doing training or inference after the tokenizer? The answer would be slightly different.Raft
You say that the tokenizer does the batching automatically and then proceed to feed a single batch of sentences, completely ignoring the problem I'm asking for help with, i.e., when there's actual batching involved (num_batches > 1). My input is literally a list of sentences, exactly the way the question depicts.Billon
I find it very ironic how you answer a lot of surrounding questions BUT not the one I've posed.Billon
Please check again whether the "How to ..." sections answer your question, after the comments on the question.Raft
Does the classifier(text, batch_size=2, truncation="only_first") pipeline work for you? Or do you need to use the raw tokenizer to run through the batches?Raft
That could be a potential solution; the problem is that I don't want the classification from the model. I want the logits, as said in the question. Is there a way to access the logits using pipeline? Maybe I will have to set up a custom pipeline?Billon
See the other answer from your other question =)Raft
