I am using NVIDIA Triton Inference Server with an ONNX model for inference on a GPU instance.
The Dockerfile that defines the environment, the inference server, and the models contains the
following FROM/pip lines:
FROM --platform=linux/amd64 nvcr.io/nvidia/tritonserver:23.12-py3
RUN pip install torch transformers onnx onnxruntime-gpu onnxruntime
The model.py for the Triton Inference Server has been simplified to the following:
import onnxruntime as ort
import torch
import numpy as np
session = ort.InferenceSession("path/to/onnx.model", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
...
io_binding = session.io_binding()

# The input tensor already lives on the GPU; bind its buffer directly so no
# host/device copy is needed.
pt_script_embeddings = torch.rand(
    size=(100, 2010), dtype=torch.float32, device="cuda:0"
).contiguous()
io_binding.bind_input(
    name="np_script_embeddings",
    device_type="cuda",
    device_id=0,
    element_type=np.float32,
    shape=tuple(pt_script_embeddings.shape),
    buffer_ptr=pt_script_embeddings.data_ptr(),
)

# Pre-allocate the output buffer on the GPU and bind it as well.
logit_output_shape = (100, 2)
logit_output = torch.empty(logit_output_shape, dtype=torch.float32, device="cuda:0").contiguous()
io_binding.bind_output(
    name="np_logits",
    device_type="cuda",
    device_id=0,
    element_type=np.float32,
    shape=tuple(logit_output.shape),
    buffer_ptr=logit_output.data_ptr(),
)

# Run inference against the bound buffers, then copy the logits back to host.
session.run_with_iobinding(io_binding)
outputs = logit_output.cpu().numpy()
Unfortunately, the error below is triggered at the io_binding.bind_input line, which is causing
me a lot of grief:
RuntimeError: Error when binding input: There's no data transfer registered for copying tensors from Device:[DeviceType:1 MemoryType:0 DeviceId:0] to Device:[DeviceType:0 MemoryType:0 DeviceId:0]
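While debugging, it helps to confirm whether the CUDA execution provider actually loaded, since a session that silently fell back to CPU cannot accept CUDA-resident buffers. A quick check (a diagnostic sketch, reusing the placeholder model path from above):

import onnxruntime as ort

# A CUDA/cuDNN version mismatch, or the CPU-only onnxruntime package shadowing
# onnxruntime-gpu, both make the CUDA provider unavailable.
print(ort.get_available_providers())  # should include "CUDAExecutionProvider"

session = ort.InferenceSession(
    "path/to/onnx.model",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # the providers actually enabled for this session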
Note: I reviewed the related articles I could find before submitting this question.
To resolve the issue I needed to carefully match the versions of cuda, pytorch, and onnxruntime
provided by the tritonserver Docker image with the Python packages of torch and onnxruntime-gpu
installed manually. Here is the process in detail:
1. Find the CUDA version required by onnxruntime-gpu by visiting the ONNX CUDA execution provider requirements page. In my case it was cuda==12.2.
2. Find the Triton Container Version with the matching CUDA version from the prior step. In my case it was tritonserver:23.10-py3.
3. Find the torch build compatible with that CUDA version. In my case it was torch 2.1.
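To verify these versions from inside a container, a quick check along the following lines can help (a sketch; it only prints what is installed):

import torch
import onnxruntime as ort

# torch must be built against the same CUDA major.minor that onnxruntime-gpu
# targets and that the Triton base image ships.
print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)
print("onnxruntime:", ort.__version__)
print("available providers:", ort.get_available_providers())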
Based on the collected versions, update the environment. In my case that meant the following changes to the Docker image:
FROM --platform=linux/amd64 nvcr.io/nvidia/tritonserver:23.10-py3
RUN pip install transformers
RUN pip install torch==2.1
# https://onnxruntime.ai/docs/install/
# https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements
# Note: the CPU-only onnx/onnxruntime packages are intentionally no longer
# installed, so they cannot shadow onnxruntime-gpu.
RUN pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
NOTE: if your build environment has no access to the Azure repo https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/, then retrieve the files manually from https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-cuda-12 and install them yourself (make sure to adjust cuda-12 to your CUDA version).
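In that offline scenario, the downloaded wheel can be copied into the image and installed directly; a sketch, where the wheel filename is purely illustrative and must match the onnxruntime and Python versions you actually fetched:

# Wheel downloaded beforehand from the onnxruntime-cuda-12 feed;
# the filename below is illustrative, adjust it to your setup.
COPY onnxruntime_gpu-<version>-cp310-cp310-linux_x86_64.whl /tmp/
RUN pip install /tmp/onnxruntime_gpu-<version>-cp310-cp310-linux_x86_64.whl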