I'm trying to load the Qwen2.5-VL-7B-Instruct model from Hugging Face with 4-bit weight-only quantization using TorchAoConfig (similar to the example in the documentation), but I'm getting a CUDA-related runtime error.
Code:
from transformers import Qwen2_5_VLForConditionalGeneration, TorchAoConfig, AutoProcessor
import torch
torch.cuda.empty_cache()
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
Environment:
Python: 3.11
Transformers: latest
GPU: Google Colab T4
I got the following error:
RuntimeError: CUDA error: named symbol not found
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I am new to this and probably missing something simple. Any help or insights would be appreciated!
The error is most likely caused by the GPU rather than your code: torchao's "int4_weight_only" scheme relies on int4 CUDA kernels (and bfloat16) that require an Ampere-or-newer GPU (compute capability 8.0+), while the Colab T4 is a Turing GPU (compute capability 7.5), so the required CUDA symbol simply is not present. I have two solutions to your problem.
In this solution, we quantize with the bitsandbytes library instead of torchao; its 4-bit NF4 scheme works on the T4. For background, see the Transformers quantization guide, and if you would rather load the model with 8-bit quantization, read about the BitsAndBytesConfig class (a minimal 8-bit sketch is shown after the code below).
!pip install bitsandbytes accelerate -q
from transformers import Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig, AutoProcessor
import torch
torch.cuda.empty_cache()
# 4-bit NF4 quantization; fp16 compute suits the T4 (which has no bfloat16 support).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 on the T4
    bnb_4bit_use_double_quant=True         # also quantize the quantization constants
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=quant_config,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
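If you want the 8-bit variant mentioned above, here is a minimal sketch (the config object name is just illustrative); 8-bit uses roughly twice the weight memory of 4-bit, which should still fit on a T4, and can be slightly more accurate:

# Hypothetical 8-bit alternative: swap this config into from_pretrained above.
quant_config_8bit = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=quant_config_8bit,
    device_map="auto"
)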
In this solution, we load the model without quantization, i.e. all weights in half precision (fp16) rather than compressed to 4 bits.
Advantage: no quantization error, so output quality stays as close as possible to the original model.
Disadvantage: fp16 weights for a ~7B-parameter vision-language model take roughly 15-16 GB, which may not fit in the T4's 16 GB of VRAM; with device_map="auto", some layers may be offloaded to CPU RAM, which makes inference much slower.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch
torch.cuda.empty_cache()
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,  # fp16: the T4 has no native bfloat16 support
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
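Whichever option you choose, you can sanity-check the loaded model with a short text-only generation, and also check whether any weights were offloaded to CPU. This is only a minimal sketch assuming the standard Qwen2.5-VL chat-template workflow; for image inputs, follow the usage example on the model card.

# Optional: see where the weights ended up and how much VRAM is used
print(getattr(model, "hf_device_map", None))
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB")

# Minimal text-only smoke test
messages = [{"role": "user", "content": [{"type": "text", "text": "Say hello in one sentence."}]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])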
If you have any questions, feel free to ask and I will help.