I'm trying to load the Qwen2.5-VL-7B-Instruct model from Hugging Face with 4-bit weight-only quantization using TorchAoConfig (similar to the example in the documentation), but I'm getting a CUDA-related runtime error.
Code:
from transformers import Qwen2_5_VLForConditionalGeneration, TorchAoConfig, AutoProcessor
import torch
torch.cuda.empty_cache()
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
Environment:
Python: 3.11
Transformers: latest
GPU: Google Colab T4
I got the following error:
RuntimeError: CUDA error: named symbol not found
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I am new to this and probably missing something simple. Any help or insights would be appreciated!
The error is most likely caused by the GPU rather than your code: torchao's "int4_weight_only" scheme relies on int4 CUDA kernels (and bfloat16) that require an Ampere-or-newer GPU (compute capability 8.0+), while the Colab T4 is a Turing GPU (compute capability 7.5), so the required CUDA symbol simply is not present. I have two solutions to your problem.
In this solution, we quantize with the bitsandbytes library instead of torchao; its 4-bit NF4 scheme works on the T4. For background, see the Transformers quantization guide, and if you would rather load the model with 8-bit quantization, read about the BitsAndBytesConfig class (a minimal 8-bit sketch is shown after the code below).
!pip install bitsandbytes accelerate -q
from transformers import Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig, AutoProcessor
import torch
torch.cuda.empty_cache()
# 4-bit NF4 quantization; fp16 compute suits the T4 (which has no bfloat16 support).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 on the T4
    bnb_4bit_use_double_quant=True         # also quantize the quantization constants
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=quant_config,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
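If you want the 8-bit variant mentioned above, here is a minimal sketch (the config object name is just illustrative); 8-bit uses roughly twice the weight memory of 4-bit, which should still fit on a T4, and can be slightly more accurate:

# Hypothetical 8-bit alternative: swap this config into from_pretrained above.
quant_config_8bit = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=quant_config_8bit,
    device_map="auto"
)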
In this solution, we load the model without quantization, i.e. all weights in half precision (fp16) rather than compressed to 4 bits.
Advantage: no quantization error, so output quality stays as close as possible to the original model.
Disadvantage: fp16 weights for a ~7B-parameter vision-language model take roughly 15-16 GB, which may not fit in the T4's 16 GB of VRAM; with device_map="auto", some layers may be offloaded to CPU RAM, which makes inference much slower.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch
torch.cuda.empty_cache()
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,  # fp16: the T4 has no native bfloat16 support
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
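Whichever option you choose, you can sanity-check the loaded model with a short text-only generation, and also check whether any weights were offloaded to CPU. This is only a minimal sketch assuming the standard Qwen2.5-VL chat-template workflow; for image inputs, follow the usage example on the model card.

# Optional: see where the weights ended up and how much VRAM is used
print(getattr(model, "hf_device_map", None))
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB")

# Minimal text-only smoke test
messages = [{"role": "user", "content": [{"type": "text", "text": "Say hello in one sentence."}]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])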
If you have any questions, feel free to ask and I will help.