import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda:3",
)
There are many GPUs on the server, but I can only use two of them. How should I configure device_map (or other parameters) so that the model runs on both GPUs?
To make use of multiple GPUs with Hugging Face Transformers, you need to understand two main approaches:
First, a key step: since you mentioned many GPUs are available but you can only use two (e.g., GPUs 3 and 4), you must restrict PyTorch's visibility to only those GPUs using the CUDA_VISIBLE_DEVICES environment variable. This ensures Accelerate/Transformers only see and use the allowed ones.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"  # must be set before torch initializes CUDA
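A quick way to verify that the restriction took effect (a minimal sketch; note that the two physical GPUs are renumbered inside the process, so GPU 3 becomes cuda:0 and GPU 4 becomes cuda:1):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"  # must happen before importing torch

import torch
print(torch.cuda.device_count())      # expected: 2
print(torch.cuda.get_device_name(0))  # physical GPU 3
print(torch.cuda.get_device_name(1))  # physical GPU 4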
Alternatively, run export CUDA_VISIBLE_DEVICES="3,4" in the shell before launching your script. Now that the environment variable is set, let's look at the two approaches.
device_map for Big Model Inference

This comes from the Hugging Face Accelerate library and is designed for loading large models that don't fit on a single device. It splits the model's layers/parameters across the available GPUs (or even CPU/disk if needed) for inference, but it's not a true parallelism strategy; it's more about memory management and sequential execution with offloading.
device_map="auto", Accelerate automatically detects visible GPUs and distributes the model (e.g., some layers on GPU 3, others on GPU 4).Updated code example:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"  # restrict to your allowed GPUs, before importing torch

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # automatically split across the visible GPUs
)
After loading, check the map with print(model.hf_device_map).
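For completeness, a minimal inference sketch with the device-mapped model (the prompt and max_new_tokens are just placeholders; inputs are sent to model.device, which for a sharded model is typically the device holding the first layer, and Accelerate's hooks move activations between GPUs as needed):

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)  # device of the first shard
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

If the automatic split is not to your liking, from_pretrained also accepts a max_memory dict (e.g., max_memory={0: "20GiB", 1: "20GiB"}, keyed by the visible device indices) to cap how much each GPU is allowed to hold.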
If you want actual parallelism (i.e., sharding the weight matrices so the computation itself is split across GPUs), use Transformers' tensor parallelism.
Unlike device_map, tensor parallelism (TP) focuses on computational parallelism, not just memory offloading. With tp_plan="auto", the model is automatically sharded across the visible GPUs (i.e., a TP degree of 2 for your two GPUs). Updated code example (requires a torch.distributed setup):
# Run with: torchrun --nproc_per_node=2 run_model_dist.py
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"  # restrict to your allowed GPUs, before importing torch

import torch
from torch import distributed as dist
from transformers import AutoTokenizer, AutoModelForCausalLM

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun: 0 and 1 here
torch.cuda.set_device(local_rank)           # bind each process to one visible GPU
dist.init_process_group(backend="nccl")

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    tp_plan="auto",  # automatically apply tensor parallelism across the visible GPUs
)
Note: not all models support TP out of the box, so check whether yours does. For multi-process setups (e.g., via torchrun), set the number of processes (--nproc_per_node, which determines the world size) to the number of GPUs you want to shard across.
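For completeness, a minimal usage sketch you could append to the script above (the prompt and max_new_tokens are placeholders; if your Transformers version does not support generate with a TP-sharded model, a plain forward pass model(**inputs) follows the same pattern):

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(f"cuda:{local_rank}")
outputs = model.generate(**inputs, max_new_tokens=20)
if local_rank == 0:  # avoid printing the same output from both ranks
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

dist.destroy_process_group()  # clean shutdown of the two torchrun processes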