python machine-learning huggingface-transformers transformer-model

Choose available GPU devices with device_map


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda:3",
)

There are many GPUs on the server, but I can only use two of them. How should I configure device_map (or other parameters) so that the model runs on both GPUs?


Solution

  • To make use of multiple GPUs with Hugging Face Transformers, you need to understand two main approaches:

    1. device mapping for efficient model loading
    2. tensor parallelism for sharding the model for parallel computation

    First, a key step: since many GPUs are available but you may only use two (e.g., GPUs 3 and 4), restrict PyTorch's visibility to those GPUs with the CUDA_VISIBLE_DEVICES environment variable. This ensures that Accelerate and Transformers only see and use the allowed devices. Set it before any CUDA context is created, either at the very top of the script (before the first CUDA call) or by exporting it in the shell when launching, e.g. CUDA_VISIBLE_DEVICES=3,4 python script.py.

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"
    

    Now that we have the environment variable set, let's look at the two approaches.

    Approach 1: Using device_map for Big Model Inference

    This comes from the Hugging Face Accelerate library and is designed for loading large models that don't fit on a single device. It splits the model's layers/parameters across the available GPUs (and CPU/disk if needed) for inference. Note that this is not a true parallelism strategy; it is primarily about memory management, with layers executing sequentially on whichever device holds them, plus offloading if memory runs out.

    Updated code example:

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"  # Restrict to your allowed GPUs
    
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM
    
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # Automatically split across visible GPUs
    )
    

    After loading, check how modules were assigned with print(model.hf_device_map); it is a dict mapping module names to device indices, and with two visible GPUs you should see entries on devices 0 and 1 (which correspond to physical GPUs 3 and 4).
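
    To confirm that inference works with the split model, you can run a short generation after loading. This is a minimal sketch under the setup above; the prompt and generation length are arbitrary, and model_id is assumed to name a causal LM checkpoint whose tokenizer loads with AutoTokenizer:

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Inputs go to the device holding the first model shard; Accelerate's hooks
    # move intermediate tensors between GPUs automatically during the forward pass.
    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)

    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))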

    Approach 2: Using Tensor Parallelism for Multi-GPU Inference

    If you want actual parallel computation (each GPU holds a shard of every weight matrix and works on it simultaneously), use Transformers' built-in tensor parallelism via the tp_plan argument.

    Updated code example (requires torch.distributed setup):

    # Run: torchrun --nproc_per_node=2 run_model_dist.py
    
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"
    
    import torch
    from torch import distributed as dist
    from transformers import AutoTokenizer, AutoModelForCausalLM
    
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
    torch.cuda.set_device(local_rank)  # rank 0 -> physical GPU 3, rank 1 -> GPU 4 (after remapping)
    dist.init_process_group(backend="nccl")
    
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        tp_plan="auto",  # Automatically apply tensor parallelism across visible GPUs
    )
    

    Note: not all architectures support tensor parallelism out of the box; check whether your model defines a TP plan. When launching with torchrun, set --nproc_per_node (which determines the world size) to the number of GPUs you want to use, i.e. 2 here.
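
    To exercise the TP setup end to end, you can add a short forward pass after loading, in the same distributed script. This is a minimal sketch under the same assumptions (model_id set, launched via the torchrun command shown); the prompt is a placeholder, and output is printed from rank 0 only to avoid duplicates:

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer("Hello, my name is", return_tensors="pt").input_ids.to(f"cuda:{local_rank}")

    with torch.no_grad():
        outputs = model(inputs)  # each rank computes on its weight shards; communication goes over NCCL

    if local_rank == 0:
        print("logits:", outputs.logits.shape)  # sanity check on the output shape

    dist.destroy_process_group()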