python, deep-learning, huggingface-transformers, torch, multi-gpu

Loading a HuggingFace model on multiple GPUs using model parallelism for inference


I have access to six 24GB GPUs. When I try to load some HuggingFace models, for example the following:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("google/ul2")
model = AutoModelForSeq2SeqLM.from_pretrained("google/ul2")

I get an out-of-memory error, because the model only seems to load onto a single GPU. While the whole model cannot fit on one 24GB card, I have six of them, and I would like to know whether there is a way to distribute the model across multiple cards in order to perform inference.

HuggingFace seems to have a webpage where they explain how to do this, but it has no useful content as of today.


Solution

  • When you load the model using from_pretrained(), you need to specify onto which device(s) you want the model loaded. Add the following argument, and the transformers library will take care of the rest (note that using device_map requires the accelerate package to be installed):

    model = AutoModelForSeq2SeqLM.from_pretrained("google/ul2", device_map="auto")

    Passing "auto" here will automatically split the model across your hardware in the following priority order: GPU(s) > CPU (RAM) > Disk.

    Of course, this answer assumes you have CUDA installed and that your environment can see the available GPUs. Running nvidia-smi from the command line will confirm this. Please report back if you run into further issues.
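
    If you would rather verify this from Python than from nvidia-smi, a quick sanity check with plain PyTorch (nothing transformers-specific here) is:

    import torch

    # Both should confirm that CUDA is usable and that all six cards are visible
    print(torch.cuda.is_available())
    print(torch.cuda.device_count())
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))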