I have fine-tuned my models on a GPU, but inference is very slow. I think this is because inference runs on the CPU by default. Here is my inference code:
txt = "This was nice place"
model = transformers.BertForSequenceClassification.from_pretrained(model_path, num_labels=24)
tokenizer = transformers.BertTokenizer.from_pretrained('TurkuNLP/bert-base-finnish-cased-v1')
encoding = tokenizer.encode_plus(txt, add_special_tokens = True, truncation = True, padding = "max_length", return_attention_mask = True, return_tensors = "pt")
output = model(**encoding)
output = output.logits.softmax(dim=-1).detach().cpu().flatten().numpy().tolist()
Here is my second inference code, which uses a pipeline (for a different model):
classifier = transformers.pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier(txt)
How can I force the transformers library to do faster inference on the GPU? I have tried adding

model.to(torch.device("cuda"))

but that throws the following error:
Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu
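For reference, I believe the failing attempt looks roughly like this (tokenizer, model_path and txt are the same as in the first snippet); the model is moved to CUDA, but the encoding tensors produced by the tokenizer stay on the CPU, which is presumably what triggers the mixed-device error:

import torch
import transformers

device = torch.device("cuda")

# Model weights are moved to the GPU
model = transformers.BertForSequenceClassification.from_pretrained(model_path, num_labels=24)
model.to(device)

# encode_plus returns CPU tensors, so the forward pass mixes cuda:0 and cpu
encoding = tokenizer.encode_plus(txt, add_special_tokens=True, truncation=True, padding="max_length", return_attention_mask=True, return_tensors="pt")
output = model(**encoding)  # RuntimeError: Expected all tensors to be on the same device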
I suppose the problem is that the data is not being sent to the GPU. There is a similar issue here: pytorch summary fails with huggingface model II: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu
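My guess is that something like the following is needed, moving the encoding tensors to the same device as the model and passing a device index to the pipeline (device=0 for the first GPU is my assumption), but I am not sure this is the right approach:

import torch
import transformers

device = torch.device("cuda")

# Without a pipeline: move the model and every input tensor to the GPU?
model.to(device)
encoding = {k: v.to(device) for k, v in encoding.items()}
output = model(**encoding)
output = output.logits.softmax(dim=-1).detach().cpu().flatten().numpy().tolist()

# With a pipeline: ask the pipeline itself to run on the GPU?
classifier = transformers.pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device=0)
result = classifier(txt)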
How would I send the data to the GPU, with and without a pipeline? Any advice is highly appreciated.
Comment: model.to(torch.device("cuda")) should at the very least return you an error, as your data doesn't seem to be on the same device. – Hewes