Efficiently using Hugging Face transformers pipelines on GPU with large datasets

I'm relatively new to Python and facing some performance issues while using Hugging Face Transformers for sentiment analysis on a relatively large dataset. I've created a DataFrame with 6000 rows of text data in Spanish, and I'm applying a sentiment analysis pipeline to each row of text. Here's a simplified version of my code:

import pandas as pd
import torch
from tqdm import tqdm
from transformers import pipeline


data = {
    'TD': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'text': [
        # ... (your text data here)
    ]
}

df_model = pd.DataFrame(data)

device = 0 if torch.cuda.is_available() else -1
py_sentimiento = pipeline("sentiment-analysis", model="finiteautomata/beto-sentiment-analysis", tokenizer="finiteautomata/beto-sentiment-analysis", device=device, truncation=True)

tqdm.pandas()
df_model['py_sentimiento'] = df_model['text'].progress_apply(py_sentimiento)
df_model['py_sentimiento'] = df_model['py_sentimiento'].apply(lambda x: x[0]['label'])

However, I've encountered a warning message that suggests I should use a dataset for more efficient processing. The warning message is as follows:

"You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset."

I have two questions:

What does this warning mean, and why should I use a dataset for efficiency?

How can I modify my code to batch my data and use parallel computing to make better use of my GPU resources? What code, function, or library should I use with Hugging Face Transformers?

I'm eager to learn and optimize my code.

Dionysian answered 22/9, 2023 at 15:57 Comment(3)
For starters, can you try defining a batch_size in your pipeline and see if it speeds things up on the GPU?Feer
Is your data stored in a JSON file? I ask so I can help you define a dataset (and use batching at the same time).Brosine
Ciao Marco! My data is stored in a DataFrame with a column named "text", plus a column "TD" that serves as an ID for each text, so it's tabular data.Dionysian

I think you can ignore this message. I found it reported on several websites this year, and if I understand it correctly, this GitHub issue on the Hugging Face transformers repository (https://github.com/huggingface/transformers/issues/22387) shows that the warning can be safely ignored. In addition, batching or using datasets might not remove the warning or automatically use your resources in the best way. You can set call_count = 0 here (https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/base.py#L1100) to silence the warning, as explained by Martin Weyssow above.
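
If you prefer not to edit the library source, the same counter is exposed as an attribute on the pipeline object, so something along these lines should silence the warning (a sketch, assuming the call_count attribute in the linked base.py is still present in your transformers version):

from transformers import pipeline

py_sentimiento = pipeline("sentiment-analysis",
                          model="finiteautomata/beto-sentiment-analysis",
                          tokenizer="finiteautomata/beto-sentiment-analysis",
                          device=0,
                          truncation=True)

def classify(text):
    # The warning fires once call_count exceeds 10 on GPU; resetting the
    # counter before each call keeps it from ever reaching that threshold.
    py_sentimiento.call_count = 0
    return py_sentimiento(text)[0]["label"]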

How can I modify my code to batch my data and use parallel computing to make better use of my GPU resources:

You can add batching like this:

py_sentimiento = pipeline("sentiment-analysis", model="finiteautomata/beto-sentiment-analysis", tokenizer="finiteautomata/beto-sentiment-analysis", batch_size=8, device=device, truncation=True)

and, most importantly, you can experiment to find the batch size that gives the highest GPU utilization on your device for your particular task.

Hugging Face provides some rules of thumb to help users figure out how to batch: https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching. Getting the best resource/GPU usage might take some experimentation, and it depends on the use case at hand.
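
As a rough illustration of that experimentation, a sketch like the following compares a few candidate batch sizes (the values are placeholders, and df_model and device are assumed to come from the question); you can watch nvidia-smi while it runs:

import time
from transformers import pipeline

texts = df_model["text"].tolist()

for batch_size in [1, 4, 8, 16, 32]:  # candidate values, adjust to your GPU
    # Rebuilding the pipeline each iteration reloads the model; it keeps the sketch simple.
    py_sentimiento = pipeline("sentiment-analysis",
                              model="finiteautomata/beto-sentiment-analysis",
                              tokenizer="finiteautomata/beto-sentiment-analysis",
                              batch_size=batch_size,
                              device=device,
                              truncation=True)
    start = time.perf_counter()
    _ = py_sentimiento(texts)  # the pipeline batches the list internally
    print(f"batch_size={batch_size}: {time.perf_counter() - start:.1f}s")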

What does this warning mean, and why should I use a dataset for efficiency?

This means the GPU utilization is not optimal: the data is not grouped together, so it is not processed efficiently. Using a dataset from the Hugging Face datasets library will utilize your resources more efficiently. However, it is not so easy to tell what exactly is going on, especially since we don't know exactly what the data looks like, what the device is, or how the model handles the data internally. The warning might go away when you use the datasets library, but that does not necessarily mean the resources are optimally used.

What code, function, or library should be used with Hugging Face Transformers?

Here is a code example with pipelines and the datasets library: https://huggingface.co/docs/transformers/v4.27.1/pipeline_tutorial#using-pipelines-on-a-dataset. It mentions that iterating over a dataset keeps the GPU fed as fast as possible and that batching may also speed up computation.
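
To tie that tutorial back to the DataFrame from the question, one option (a sketch, assuming the df_model and device defined in the post) is to pass the whole text column at once and write the labels back afterwards:

from transformers import pipeline

py_sentimiento = pipeline("sentiment-analysis",
                          model="finiteautomata/beto-sentiment-analysis",
                          tokenizer="finiteautomata/beto-sentiment-analysis",
                          batch_size=8,
                          device=device,
                          truncation=True)

# One call over the whole column instead of one call per row, which is what triggered the warning.
results = py_sentimiento(df_model["text"].tolist())
df_model["py_sentimiento"] = [r["label"] for r in results]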

In your case it seems you are doing a relatively small POC (inference on under 10,000 documents with a medium-sized model), so I don't think you need to use pipelines. I assume the sentiment analysis model is a classifier and that you want to keep using Pandas as shown in the post, so here is how you can combine both. This is usually fast enough for my experiments and prints no warnings about the resources.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm
import torch as t
import pandas as pd

tqdm.pandas()  # required for DataFrame.progress_apply

device = t.device("cuda" if t.cuda.is_available() else "cpu")
model = AutoModelForSequenceClassification.from_pretrained("finiteautomata/beto-sentiment-analysis").to(device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained("finiteautomata/beto-sentiment-analysis")

def classify_dataframe_row(
    example: pd.Series,
):
    # Tokenize one row, run the classifier, and return the predicted class index
    # (model.config.id2label maps the index back to a label string if needed).
    inputs = tokenizer(example["text"], return_tensors="pt", truncation=True).to(device)
    with t.no_grad():
        output = model(**inputs)
    prediction = t.argmax(output.logits, dim=-1).item()
    return prediction

dataset = pd.read_csv("file")
dataset = dataset.assign(
    prediction=dataset.progress_apply(classify_dataframe_row, axis=1)
)

As soon as your inference starts, either with this snippet or with the datasets library code, you can run nvidia-smi in a terminal, check the GPU usage, and play with the parameters to optimize it. Beware that running the code on your local machine vs. on a larger machine (e.g., a Linux server with a more powerful GPU) may give different performance and may need different tuning. If you want to run the code for larger document collections, you can split the data, either to avoid GPU memory errors locally or to speed up inference with concurrent runs on a server.
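
For larger collections, the splitting mentioned above can be as simple as slicing the DataFrame into chunks and classifying them one after another (a sketch building on classify_dataframe_row above; the chunk size is arbitrary):

import pandas as pd

chunk_size = 1000  # arbitrary, tune to your GPU memory
chunks = []
for start in range(0, len(dataset), chunk_size):
    chunk = dataset.iloc[start:start + chunk_size].copy()
    chunk["prediction"] = chunk.apply(classify_dataframe_row, axis=1)
    chunks.append(chunk)

dataset = pd.concat(chunks)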

Less answered 9/11, 2023 at 11:31 Comment(2)
When I try pipeline(task='feature-extraction', model=model_path, device=0) I keep getting an OOM error. Is this the correct way of utilizing the GPU? It seems like it is loading the whole model onto the GPU; I thought only the computation should be done on the GPU.Bushwhacker
@Bushwhacker without context, it is difficult to figure out why the OOM error appears. I don't think it is uncommon for a model to be loaded onto the GPU during inference, but huggingface.co/docs/accelerate/en/usage_guides/big_modeling has some information from Hugging Face about how to deal with big model inference.Less

What does this warning mean, and why should I use a dataset for efficiency?

It means you will gain efficiency by wrapping your data in a Dataset object:

from datasets import Dataset
data = {
    'TD': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'text': [
        # ... (your text data here)
    ]
}
dataset = Dataset.from_dict(data)

If you want to understand why you get this warning, you can have a look at this file: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/base.py#L1100. The pipeline counts how many times it has been called and emits the warning once it detects more than 10 calls. Calling the pipeline row by row is inefficient compared to passing a Dataset object, because the __call__ function of the pipeline's base class is evaluated for each single input example.

If you use an iterable, e.g., a Dataset object, the pipeline can iterate over all the examples itself and batch the preprocessing and inference steps:

from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

classifier = pipeline("sentiment-analysis",
                      model="finiteautomata/beto-sentiment-analysis",
                      device=device,
                      truncation=True,
                      batch_size=4)

# KeyDataset exposes the "text" column; the pipeline batches it internally.
for out in classifier(KeyDataset(dataset, "text")):
    print(out)
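
Since KeyDataset preserves the row order, you can collect the outputs into a list instead of printing them and attach the labels back to the data (a small sketch building on the snippet above; with default settings each output should be a dict with "label" and "score" keys):

labels = []
for out in classifier(KeyDataset(dataset, "text")):
    # Wrap in out[0] if your transformers version yields a one-element list per row.
    labels.append(out["label"] if isinstance(out, dict) else out[0]["label"])

dataset = dataset.add_column("py_sentimiento", labels)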

How can I modify my code to batch my data and use parallel computing to make better use of my GPU resources? What code, function, or library should be used with Hugging Face Transformers?

In the above solution, you can tune the batch_size to fit your available GPU memory and speed up inference. Another option is to leverage Accelerate for distributed inference: https://huggingface.co/docs/accelerate/usage_guides/distributed_inference
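
For completeness, here is a rough sketch of the distributed-inference pattern from that guide. It assumes Accelerate's PartialState.split_between_processes API, is meant to be launched with accelerate launch rather than run in a notebook, and texts stands in for your own list of strings:

from accelerate import PartialState
from transformers import pipeline

state = PartialState()

# Each process loads its own copy of the pipeline on its own device.
classifier = pipeline("sentiment-analysis",
                      model="finiteautomata/beto-sentiment-analysis",
                      device=state.device,
                      truncation=True,
                      batch_size=8)

texts = ["..."]  # your list of strings

# Each process classifies only its slice of the inputs.
with state.split_between_processes(texts) as subset:
    results = classifier(subset)
    print(f"process {state.process_index} classified {len(results)} texts")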

Springhalt answered 8/11, 2023 at 3:6 Comment(0)
