Transformers: How to use CUDA for inferencing?

I have fine-tuned my models on a GPU, but inference is very slow. I think this is because inference runs on the CPU by default. Here is my inference code:

txt = "This was nice place"
model = transformers.BertForSequenceClassification.from_pretrained(model_path, num_labels=24)
tokenizer = transformers.BertTokenizer.from_pretrained('TurkuNLP/bert-base-finnish-cased-v1')
encoding = tokenizer.encode_plus(txt, add_special_tokens = True, truncation = True, padding = "max_length", return_attention_mask = True, return_tensors = "pt")
output = model(**encoding)
output = output.logits.softmax(dim=-1).detach().cpu().flatten().numpy().tolist()

Here is my second inference snippet, which uses a pipeline (for a different model):

classifier = transformers.pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier(txt)

How can I force the transformers library to run inference on the GPU? I have tried adding model.to(torch.device("cuda")), but that throws an error:

Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu

I suppose the problem is that the data is not being sent to the GPU. There is a similar issue here: pytorch summary fails with huggingface model II: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu

How would I send the data to the GPU, with and without a pipeline? Any advice is highly appreciated.

Huntsville answered 9/2, 2022 at 13:44 Comment(2)
Are you sure you tried model.to(torch.device("cuda"))? At the very least it should return an error, as your data doesn't seem to be on the same device. – Hewes
It was in the wrong part of the code; I have edited my question. – Huntsville

You should transfer your input to CUDA as well before running inference:

import torch

device = torch.device('cuda')

# transfer model
model.to(device)

# define input and transfer to device
encoding = tokenizer.encode_plus(txt, 
     add_special_tokens=True, 
     truncation=True, 
     padding="max_length", 
     return_attention_mask=True, 
     return_tensors="pt")

encoding = encoding.to(device)

# inference
output = model(**encoding)

Be aware that nn.Module.to is in-place, while torch.Tensor.to is not (it returns a new tensor).
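
As a minimal sketch of that distinction (assuming PyTorch is installed and a CUDA device is available): the module moves in place, while the tensor result must be reassigned.

import torch
import torch.nn as nn

device = torch.device('cuda')

model = nn.Linear(4, 2)
model.to(device)                        # moves the module's parameters in place
print(next(model.parameters()).device)  # cuda:0

x = torch.randn(1, 4)
x.to(device)                            # returns a copy; x itself is still on the CPU
print(x.device)                         # cpu

x = x.to(device)                        # reassign to actually keep the GPU copy
print(x.device)                         # cuda:0

output = model(x)                       # module and input are now both on cuda:0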

Hewes answered 9/2, 2022 at 15:28 Comment(2)
Oh thank you! Can I do something like result = classifier(txt).to(device) in order to do the same with the pipeline? – Huntsville
Since classifier is a model, classifier.to(device) should work. – Hewes

For the pipeline code question

The problem is that transformers.pipeline defaults to running on the CPU. You can pass the device parameter to select a GPU, for example:

  • device=0 to utilize GPU cuda:0
  • device=1 to utilize GPU cuda:1

pipeline = pipeline(TASK, model=MODEL_PATH, device=0)

Your code becomes:

classifier = transformers.pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device=0)
result = classifier(txt)
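
As a hedged variant of the same idea, you can also pick the device at runtime so the code still works on a machine without a GPU; device=-1 keeps the pipeline on the CPU.

import torch
import transformers

# use the first GPU (cuda:0) if one is available, otherwise fall back to the CPU (device=-1)
device = 0 if torch.cuda.is_available() else -1

classifier = transformers.pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=device,
)
result = classifier("This was nice place")
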
Manufactory answered 28/12, 2022 at 12:59 Comment(0)