pytorch, tritonserver

CUDA error: device-side assert triggered on tensor.to(device='cuda')


An ML model is running under Triton Inference Server on a GPU instance group, and after a certain number of successful inferences it starts throwing the exception: CUDA error: device-side assert triggered

With export CUDA_LAUNCH_BLOCKING=1, the stack trace points to {key: val.to(device=COMPUTE_DEVICE) for key, val in inputs.items()}:

Traceback (most recent call last):
  File "/opt/triton_models/feature_based_pwsh_classifier/1/script_embeddings.py", line 129, in compute_code_embeddings
    inputs = {key: val.to(device=COMPUTE_DEVICE) for key, val in inputs.items()}
  File "/opt/triton_models/feature_based_pwsh_classifier/1/script_embeddings.py", line 129, in <dictcomp>
    inputs = {key: val.to(device=COMPUTE_DEVICE) for key, val in inputs.items()}
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
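Because device-side asserts are raised asynchronously, the line in the trace (the .to(device) call) is usually not the real culprit; making kernel launches synchronous, as done above, is the standard way to localize the failing op. A minimal sketch of doing the same from Python (the key point being that the variable must be set before CUDA is initialized):

```python
import os

# Must be set before torch initializes CUDA, so kernel launches become
# synchronous and the stack trace points at the op that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # only import torch after the environment variable is set
```

Another common trick is to re-run the same batch on CPU, where the out-of-bounds lookup raises a plain IndexError naming the offending index instead of an opaque device-side assert.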

Here is a simplified form of the problematic code:

max_length = llm.config.max_position_embeddings

# inputs is a dict with keys: [input_ids, attention_mask]
inputs = tokenizer(text, return_tensors='pt', max_length=max_length, truncation=True, padding=True)

# Move the inputs to the CUDA device
inputs = {key: val.to(device=COMPUTE_DEVICE) for key, val in inputs.items()}

with torch.no_grad():
    outputs = llm(**inputs)
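The failure mode can be illustrated without a GPU. RoBERTa-derived models (GraphCodeBERT among them) compute position ids starting at padding_idx + 1, so a sequence of length max_position_embeddings produces a position id one past the end of the position-embedding table. The padding_idx value below is an assumption based on the RoBERTa convention, not something stated in the question:

```python
MAX_POSITION_EMBEDDINGS = 514   # value from the graphcodebert config
PADDING_IDX = 1                 # RoBERTa convention (assumption)

def position_ids(seq_len):
    # RoBERTa-style models offset position ids by padding_idx + 1
    return [PADDING_IDX + 1 + i for i in range(seq_len)]

# The tokenizer was allowed to emit a 514-token sequence...
ids = position_ids(MAX_POSITION_EMBEDDINGS)

# ...whose largest position id (515) is out of range for a 514-row
# embedding table -> device-side assert on the GPU, IndexError on CPU.
assert max(ids) >= MAX_POSITION_EMBEDDINGS
```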

Help and recommendations are appreciated!


Solution

  • The issue was caused by passing max_position_embeddings (514, from the GraphCodeBERT config) as the tokenizer's max_length:

    max_length = llm.config.max_position_embeddings
    inputs = tokenizer(text, return_tensors='pt', max_length=max_length, truncation=True, padding=True)
    

    while in fact a max_length of 512, the standard limit for BERT-family models, lets the tokenizer produce inputs the model can actually embed. RoBERTa-derived models such as GraphCodeBERT reserve two extra position-embedding slots for the padding offset, so the usable sequence length is max_position_embeddings - 2 = 512.
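    A hedged sketch of the fix: derive the usable length from the config instead of passing max_position_embeddings through directly. The "- padding_idx - 1" term reflects the RoBERTa padding-offset convention and is an assumption based on the model family, not something spelled out in the thread:

    ```python
    def usable_max_length(max_position_embeddings, padding_idx=1):
        # RoBERTa-style models reserve padding_idx + 1 position slots,
        # so the longest sequence the model can embed is shorter than
        # the raw size of the position-embedding table.
        return max_position_embeddings - padding_idx - 1

    max_length = usable_max_length(514)  # -> 512, the standard BERT limit
    # inputs = tokenizer(text, return_tensors='pt', max_length=max_length,
    #                    truncation=True, padding=True)
    ```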

    A few notes on the debugging process:

    Helpful discussion: