I have fine-tuned my models on a GPU, but the inference process is very slow; I think this is because inference runs on the CPU by default. Here is my inference code:
txt = "This was nice place"
model = transformers.BertForSequenceClassification.from_pretrained(model_path, num_labels=24)
tokenizer = transformers.BertTokenizer.from_pretrained('TurkuNLP/bert-base-finnish-cased-v1')
encoding = tokenizer.encode_plus(txt, add_special_tokens = True, truncation = True, padding = "max_length", return_attention_mask = True, return_tensors = "pt")
output = model(**encoding)
output = output.logits.softmax(dim=-1).detach().cpu().flatten().numpy().tolist()
Here is my second inference code, which uses a pipeline (for a different model):
classifier = transformers.pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier(txt)
How can I force the transformers library to do faster inference on the GPU? I have tried adding model.to(torch.device("cuda")),
but that throws the following error:
Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu
I suppose the problem is related to the data not being sent to the GPU. There is a similar issue here: pytorch summary fails with huggingface model II: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu
How would I send the data to the GPU with and without a pipeline? Any advice is highly appreciated.
You should transfer your input to CUDA as well before running inference:
import torch

device = torch.device('cuda')
# transfer the model to the GPU
model.to(device)
# define input and transfer to device
encoding = tokenizer.encode_plus(txt,
add_special_tokens=True,
truncation=True,
padding="max_length",
return_attention_mask=True,
return_tensors="pt")
encoding = encoding.to(device)
# inference
output = model(**encoding)
Be aware that nn.Module.to is in-place, while torch.Tensor.to is not (it returns a copy!).
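A minimal sketch of that distinction (assuming a CUDA device is available; the layer and tensor are just illustrative):

import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
layer.to(device)                         # nn.Module.to moves the parameters in place
print(next(layer.parameters()).device)   # cuda:0

t = torch.ones(4)
t.to(device)                             # torch.Tensor.to returns a copy; t itself is unchanged
print(t.device)                          # cpu
t = t.to(device)                         # reassign to keep the GPU copy
print(t.device)                          # cuda:0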
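For the pipeline case from your question, a sketch: the pipeline factory accepts a device argument, so you should not need to move the model or the inputs manually (assuming GPU 0 is the one you want to use):

classifier = transformers.pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0)   # 0 selects the first GPU; -1 selects the CPU
result = classifier(txt)

The pipeline then takes care of placing both the model and each batch of tokenized inputs on that device before the forward pass.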