How to do Tokenizer Batch processing? - HuggingFace

In the Tokenizer documentation from Hugging Face, the call function accepts List[List[str]] and says:

text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

things run normally if I run:

 test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]
 tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
 tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")

but if I try to emulate batches of sentences:

 test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]
 test = [test, test]
 tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
 tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")

I get:

Traceback (most recent call last):
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/modify_scores.py", line 53, in <module>
    tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2548, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2634, in _call_one
    return self.batch_encode_plus(
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2825, in batch_encode_plus
    return self._batch_encode_plus(
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 428, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

Is the documentation wrong? I just need a way to tokenize and predict using batches; it shouldn't be that hard.

Is it something to do with the is_split_into_words argument?


Contextualizing

I will feed the tokenized output into a sentiment score model (the one defined in the code snippets). I am facing OOM problems during prediction, so I need to feed the data to the model in batches.

The documentation (quoted above) states that I can feed List[List[str]] into the tokenizer, which does not seem to be the case. The question remains the same: how do I tokenize batches of sentences?

Note: I don't strictly need the tokenizing process itself to happen in batches (although that would yield batches of tokens/attention masks and would solve my problem); what I ultimately need is to use the model for prediction in batches, like this:

with torch.no_grad():
    logits = model(**tokenized_test).logits
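
For reference, this is roughly the kind of manual chunking I could fall back on if there is no built-in way (a minimal sketch; the batch size of 2 is just a placeholder, and tokenizer/model are the ones referenced above):

import torch

batch_size = 2  # placeholder; would be tuned to whatever fits in memory
all_logits = []
with torch.no_grad():
    for i in range(0, len(test), batch_size):
        batch = test[i:i + batch_size]  # a plain List[str] slice
        encoded = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
        all_logits.append(model(**encoded).logits)
logits = torch.cat(all_logits)  # one (len(test), num_labels) tensor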

Billon answered 7/6, 2023 at 10:15 Comment(17)
Please add to the question a little more detail on the inputs and the expected outputs. If I'm understanding it correctly, you are trying to run the tokenizer on a list of strings, and inside the list of strings there are some multi-word expressions?Raft
No, I don't know why you are assuming so many things and changing my question so many times. The question is clear: I need to tokenize my dataset (collection of sentences) into batches. That is all. Please stop changing my question.Billon
What is the NLP task you're working on? Which model are you using eventually? And is it for classification/similarity? What does your data look like before feeding it to the tokenizer? Having that information will help us help you better.Raft
It's because the question is ambiguous and I'm trying to get more clarification, otherwise we'll all be guessing. Please fill in the information asked in the comment above.Raft
1. Different tokenizers work differently in Hugging Face (unlike non-pretrained NLP models). 2. The task you are working on determines how the tokenizer function works. 3. Not having an example of the input and expected output will not help us help you.Raft
The example IS there. The model name IS there.Billon
And you're using the 'distilbert-base-uncased-finetuned-sst-2-english' model to do classification? We're asking because that specific model and tokenizer support text_pairs. Be nice, we're not mind-readers ;P, just volunteers.Raft
How is the model initialized? Using AutoModel, AutoModelForSequenceClassification or something else?Raft
yes, AutoModelForSequenceClassification. For Sentiment Classification but I use the logits to add a third class (NEUTRAL). That might be relevant context.Billon
One last question, for inference/training? It's different cos for inference, you have some nice feature to automate more things.Raft
Inference (prediction). As stated in the Contextualizing section of the question.Billon
For OOM issues, it's best to specify which GPU type and how many parallel GPU instances you have and how much RAM for each. And how many data points you need to infer/predict? Otherwise it's hard to solve the problem.Raft
You still haven't answered the only question I've posed.Billon
I think you're not using the function as intended. The documentation could be wrong, but isn't it easier to achieve the task you need rather than asking whether the documentation is wrong? Or is the question "is the documentation wrong?" (I'm not sure the documentation is wrong, though.)Raft
The question reads: "How to do Tokenizer Batch processing?". Of course that assumes more than one batch.Billon
Did the pipeline batch_size help? It'll automatically batch-process and run the inputs through the model, and it avoids manually fitting batches with a custom function. If it didn't, then maybe someone else can come along and help you with the question.Raft
I need to access the logits from the prediction, not only the result.Billon

Use pipelines, but there is a catch.

Because the pipeline wraps all the processing steps, you need to pass the args for each of them when needed. For the tokenizer, we define:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline, pipeline

selected_model = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(selected_model)
tokenizer_kwargs = {'padding': True, 'truncation': True, 'max_length': 512}

The model is straightforward:

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

Then finally:

classifier = pipeline("text-classification", model=model, batch_size=32, tokenizer=tokenizer)

Specific to my application:

Since I need the logits and not the predicted classes, I will have to modify the pipeline class. The documentation says that in order to create a custom pipeline class, I need to implement four mandatory methods: preprocess, _forward, postprocess, and _sanitize_parameters... OR I can simply override the postprocess method of TextClassificationPipeline:

class MyPipeline(TextClassificationPipeline):
    def postprocess(self, model_outputs):
        return model_outputs["logits"][0]

and modify the call:

classifier = pipeline("text-classification", model=model, batch_size=32, tokenizer=tokenizer, pipeline_class=MyPipeline)

logits = classifier(text, **tokenizer_kwargs)
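
If a single tensor is needed afterwards, the list returned by the pipeline can be stacked (a small follow-up sketch; it assumes each element is the 1-D logits tensor returned by the postprocess above):

import torch

logits_tensor = torch.stack(logits)  # shape: (len(text), num_labels)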
Billon answered 12/6, 2023 at 12:40 Comment(0)

How to tokenize a list of sentences?

If it's just tokenizing a list of sentences, do this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]
 
tokenizer(test)

It does the batching automatically:

{'input_ids': [
 [101, 7592, 2023, 2003, 1037, 3231, 102], [101, 2008, 21743, 1037, 2862, 1997, 11746, 102], 
 [101, 2046, 1037, 2862, 1997, 2862, 1997, 11746, 102], 
 [101, 1999, 2344, 2000, 7861, 9869, 1010, 1999, 2023, 2553, 1010, 2048, 14108, 2229, 1997, 1996, 2168, 18798, 13900, 102], 
 [101, 2000, 2022, 19204, 3550, 2011, 1996, 1044, 2546, 19204, 17629, 2005, 1996, 4225, 2944, 102]], 

'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

How to use it with the AutoModelForSequenceClassification?

To use it with AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english'), do this:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]

model(**tokenizer(test, return_tensors='pt', padding=True, truncation=True))

[out]:

SequenceClassifierOutput(loss=None, logits=tensor([[ 1.5094, -1.2056],
        [-3.4114,  3.5229],
        [ 1.8835, -1.6886],
        [ 3.0780, -2.5745],
        [ 2.5383, -2.1984]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
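
If you want per-class probabilities from those logits (this model has two labels, NEGATIVE and POSITIVE), a softmax is enough; a minimal sketch reusing the call above:

import torch

with torch.no_grad():
    outputs = model(**tokenizer(test, return_tensors='pt', padding=True, truncation=True))

probs = torch.softmax(outputs.logits, dim=-1)  # one [NEGATIVE, POSITIVE] pair per sentence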

How to use the distilbert-base-uncased-finetuned-sst-2-english model for sentiment classification?

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)


text = ['hello this is a test',
 'that transforms a list of sentences',
 'into a list of list of sentences',
 'in order to emulate, in this case, two batches of the same lenght',
 'to be tokenized by the hf tokenizer for the defined model']
 
classifier(text)

[out]:

[{'label': 'NEGATIVE', 'score': 0.9379092454910278},
 {'label': 'POSITIVE', 'score': 0.9990271329879761},
 {'label': 'NEGATIVE', 'score': 0.9726701378822327},
 {'label': 'NEGATIVE', 'score': 0.9965035915374756},
 {'label': 'NEGATIVE', 'score': 0.9913086891174316}]
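
If you want the scores for both classes rather than just the top label, the text-classification pipeline accepts a top_k argument in recent transformers versions (return_all_scores=True in older ones); assuming that is available:

classifier(text, top_k=None)  # one list of {'label': ..., 'score': ...} dicts per input, covering both classes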

What happens when I've OOM issues with GPU?

If it's distilbert-base-uncased-finetuned-sst-2-english, you can often just use the CPU; with that model you won't face many OOM issues.

If you need to use a GPU, consider using pipeline(...) for inference; it comes with a batch_size option, e.g.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)


text = ['hello this is a test',
 'that transforms a list of sentences',
 'into a list of list of sentences',
 'in order to emulate, in this case, two batches of the same lenght',
 'to be tokenized by the hf tokenizer for the defined model']

classifier(text, batch_size=2, truncation="only_first")

When you face OOM issues, it is usually not the tokenizer causing the problem, unless you have loaded the full dataset onto the device.

If it is the model not being able to predict when you feed it the large dataset at once, consider using pipeline instead of model(**tokenizer(text)).

Take a look at https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching
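
That page also shows that a pipeline can consume an iterator and yield results lazily, so the whole corpus never has to be held in device memory at once. A minimal sketch under that assumption (text_stream is just an illustrative generator; in practice it could read sentences from a file):

from transformers import pipeline

classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

def text_stream():
    # stands in for lazily reading your real data
    for sentence in text:
        yield sentence

# results are yielded one by one while the model runs batch_size texts at a time
for prediction in classifier(text_stream(), batch_size=2, truncation="only_first"):
    print(prediction)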


If the question is about the is_split_into_words argument, then from the docs:

text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

And from the code

if is_split_into_words:
    is_batched = isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))
else:
    is_batched = isinstance(text, (list, tuple))

And if we run that check to see whether your input counts as is_batched (with is_split_into_words=True):

text = ["hello", "this", "is a test"]
isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))

[out]:

False

But when you wrap the tokens in a list,

text = [["hello", "this", "is a test"]]
isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))

[out]:

True

Therefore, using the tokenizer with is_split_into_words=True to get batch processing working properly would look something like this:

from transformers import AutoTokenizer
from sacremoses import MosesTokenizer

moses = MosesTokenizer()
sentences = ["this is a test", "hello world"]
pretokenized_sents = [moses.tokenize(s) for s in sentences]

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

tokenizer(
  text=pretokenized_sents, 
  padding="max_length", 
  is_split_into_words=True, 
  truncation=True, 
  return_tensors="pt"
)

[out]:

{'input_ids': tensor([[ 101, 2023, 2003,  ...,    0,    0,    0],
        [ 101, 7592, 2088,  ...,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

Note: the is_split_into_words argument is not for processing batches of sentences; it is used to indicate that your input to the tokenizer is already pre-tokenized.

Raft answered 7/6, 2023 at 11:8 Comment(15)
This argument is used when you have already pretokenized the input. Thus, your input is split into WORDS. Mine isn't. My input is split into sentences, not into words. The value for the argument should be FALSE. You don't have batches in your solution. You basically modify the code to fit a solution that doesn't suit my problem. I need to do batch processing, not just a justification for using the is_split_into_words argument.Billon
Hmmm, I think it'll work too =) Let me edit my answer. BTW, be nice, we're all volunteers trying to help others with some answer.Raft
So you are saying that I need to tokenize twice to do batch processing?Billon
Nope, just once, but it depends on what you want to achieve. If you already pre-tokenized the input, is_split_into_words actually helps you stitch it back together, not to "activate the batching".Raft
I will edit the questionBillon
The last question you've added is the one I posed originally, "not really caring whether the inputs are pre-tokenized or not". But you don't provide a multi-batch solution, only an example similar to what I've already said works fine. Duplicating this 'single batch' is where I face the error.Billon
I am not trying to do sentence similarity. I am trying to get sentiment scores. The model I'm using is for sentiment scores.Billon
Then I think it works a different way, which model exactly are you trying to use eventually? The tokenizer is tied very much to the model you want to use.Raft
Also, what does your input actually look like, and are you doing training or inference after the tokenizer? The answer would be slightly different.Raft
You say that the tokenizer does the batching automatically and then proceed to feed a single batch of sentences, completely ignoring the problem I'm asking for help with, i.e., when there's actual batching involved (num_batches > 1). My input is literally a list of sentences, exactly the way the question depicts.Billon
I find it very ironic how you answer a lot of surrounding questions BUT not the one I've posed.Billon
Please check again whether the "How to ..." sections answer your question, after the comments on the question.Raft
Does the classifier(text, batch_size=2, truncation="only_first") pipeline work for you? Or do you need to use the raw tokenizer to run through the batches?Raft
That could be a potential solution; the problem is that I don't want the classification from the model. I want the logits, as said in the question. Is there a way to access the logits using pipeline? Maybe I will have to set up a custom pipeline?Billon
See the other answer from your other question =)Raft
