python pytorch huggingface-transformers inference

Transformers: How to use CUDA for inferencing?


I have fine-tuned my model on a GPU, but inference is very slow. I think this is because inference runs on the CPU by default. Here is my inference code:

txt = "This was nice place"
model = transformers.BertForSequenceClassification.from_pretrained(model_path, num_labels=24)
tokenizer = transformers.BertTokenizer.from_pretrained('TurkuNLP/bert-base-finnish-cased-v1')
encoding = tokenizer.encode_plus(txt, add_special_tokens = True, truncation = True, padding = "max_length", return_attention_mask = True, return_tensors = "pt")
output = model(**encoding)
output = output.logits.softmax(dim=-1).detach().cpu().flatten().numpy().tolist()

Here is my second inference example, which uses a pipeline (for a different model):

classifier = transformers.pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier(txt)

How can I force the transformers library to do faster inference on the GPU? I have tried adding model.to(torch.device("cuda")), but that throws an error:

Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu

I suppose the problem is related to the data not being sent to the GPU. There is a similar issue here: pytorch summary fails with huggingface model II: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu

How would I send the data to the GPU, with and without a pipeline? Any advice is highly appreciated.


Solution

  • You should transfer your input to the CUDA device as well before running inference:

    import torch

    device = torch.device('cuda')

    # transfer the model's parameters to the GPU (in place)
    model.to(device)

    # define the input and transfer it to the same device
    encoding = tokenizer.encode_plus(txt,
         add_special_tokens=True,
         truncation=True,
         padding="max_length",
         return_attention_mask=True,
         return_tensors="pt")

    encoding = encoding.to(device)

    # inference now runs on the GPU
    output = model(**encoding)
    

    Be aware that nn.Module.to works in place, while torch.Tensor.to does not (it returns a copy!).
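
    A small sketch of that in-place vs. copy difference, reusing model and device from above (the tensor is just a throwaway example):

    t = torch.zeros(3)   # tensor created on the CPU
    t.to(device)         # returns a CUDA copy; t itself stays on the CPU
    t = t.to(device)     # reassign to actually keep the CUDA tensor

    model.to(device)     # in contrast, this moves the module's parameters in place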
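
  • For the pipeline from the question, a rough sketch, assuming a CUDA device is available: transformers.pipeline accepts a device argument (a CUDA device index such as 0, or -1 for the CPU) and then handles moving the inputs to that device itself:

    classifier = transformers.pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
        device=0)  # 0 = first CUDA device, -1 = CPU

    result = classifier(txt)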