Transformers: How to use CUDA for inferencing?

I have fine-tuned my models on a GPU, but inference is very slow. I think this is because inference runs on the CPU by default. Here is my inference code:

txt = "This was nice place"
model = transformers.BertForSequenceClassification.from_pretrained(model_path, num_labels=24)
tokenizer = transformers.BertTokenizer.from_pretrained('TurkuNLP/bert-base-finnish-cased-v1')
encoding = tokenizer.encode_plus(txt, add_special_tokens = True, truncation = True, padding = "max_length", return_attention_mask = True, return_tensors = "pt")
output = model(**encoding)
output = output.logits.softmax(dim=-1).detach().cpu().flatten().numpy().tolist()

Here is my second inference snippet, which uses a pipeline (for a different model):

classifier = transformers.pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier(txt)

How can I force the transformers library to run inference on the GPU? I have tried adding model.to(torch.device("cuda")), but that throws an error:

Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu

I suppose the problem is that the data is not being sent to the GPU. There is a similar issue here: pytorch summary fails with huggingface model II: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu

How would I send the data to the GPU, with and without a pipeline? Any advice is highly appreciated.

Huntsville answered 9/2, 2022 at 13:44 Comment(2)
Are you sure you tried model.to(torch.device("cuda"))? At the very least it should return an error, as your data doesn't seem to be on the same device. – Hewes
It was in the wrong part of the code; I have edited my question. – Huntsville

You should transfer your input to CUDA as well before running inference:

import torch

device = torch.device('cuda')

# transfer model
model.to(device)

# define input and transfer to device
encoding = tokenizer.encode_plus(txt, 
     add_special_tokens=True, 
     truncation=True, 
     padding="max_length", 
     return_attention_mask=True, 
     return_tensors="pt")

encoding = encoding.to(device)

# inference
output = model(**encoding)

Be aware that nn.Module.to is in-place, while torch.Tensor.to is not (it returns a new tensor).
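
As a minimal sketch of that distinction (assuming PyTorch is installed and a CUDA device is available): the module moves in place, while the tensor result must be reassigned.

import torch
import torch.nn as nn

device = torch.device('cuda')

model = nn.Linear(4, 2)
model.to(device)                        # moves the module's parameters in place
print(next(model.parameters()).device)  # cuda:0

x = torch.randn(1, 4)
x.to(device)                            # returns a copy; x itself is still on the CPU
print(x.device)                         # cpu

x = x.to(device)                        # reassign to actually keep the GPU copy
print(x.device)                         # cuda:0

output = model(x)                       # module and input are now both on cuda:0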

Hewes answered 9/2, 2022 at 15:28 Comment(2)
Oh thank you! Can I do something like result = classifier(txt).to(device) in order to do the same with the pipeline? – Huntsville
Since classifier is a model, classifier.to(device) should work. – Hewes

For the pipeline code question

The problem is that transformers.pipeline defaults to running on the CPU. You can pass the device parameter to select a GPU, for example:

  • device=0 to utilize GPU cuda:0
  • device=1 to utilize GPU cuda:1

pipeline = pipeline(TASK, model=MODEL_PATH, device=0)

Your code becomes:

classifier = transformers.pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device=0)
result = classifier(txt)
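
As a hedged variant of the same idea, you can also pick the device at runtime so the code still works on a machine without a GPU; device=-1 keeps the pipeline on the CPU.

import torch
import transformers

# use the first GPU (cuda:0) if one is available, otherwise fall back to the CPU (device=-1)
device = 0 if torch.cuda.is_available() else -1

classifier = transformers.pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=device,
)
result = classifier("This was nice place")
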
Manufactory answered 28/12, 2022 at 12:59 Comment(0)