I'm running LLM code on 8 NVIDIA A100 GPUs (1 node).
When I try to load a big model (70B), I get a CUDA out-of-memory
error.
from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    generation_kwargs={
        "max_new_tokens": 100,
        # tried with and without the device_map set to auto!
        "device_map": "auto",
    },
)
Then, when I run query_pipeline.run(), the model starts loading, and I can see in the nvidia-smi output (run automatically every 3 seconds) that at first all 8 GPUs are idle (1-2 MB used), then only the first one starts to fill up while the others stay empty, until the end when it crashes.
How can I make sure the model loads on all GPUs?
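For reference, this is the plain-transformers loading I would expect device_map="auto" to do under the hood (Accelerate spreading the layers over all visible GPUs); the bf16 dtype here is just my assumption so the 70B weights fit:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" lets Accelerate place the layers across all visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # assumption: bf16 so the checkpoint fits across 8 A100s
)
print(model.hf_device_map)  # mapping of layers to GPU indices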
Thanks, @Stefano Fiorucci - anakin87, your suggestion solved the problem. The right way to run the generator is:
generator = HuggingFaceLocalGenerator(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    generation_kwargs={"max_new_tokens": 100},
    huggingface_pipeline_kwargs={"device_map": "auto"},
)
That way, I can see in the nvidia-smi
output that the load is distributed across all GPUs (each at roughly 50% memory usage), and the model runs without crashing.
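To double-check the sharding from Python instead of watching nvidia-smi, something like this should work; the generator.pipeline attribute and the prompt below are assumptions on my part, not something from this thread:

import torch

generator.warm_up()  # loads the model with device_map="auto"

# assumption: the loaded transformers pipeline is exposed as generator.pipeline
print(generator.pipeline.model.hf_device_map)  # which layer sits on which GPU

# memory actually allocated by PyTorch on each GPU
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i}: {torch.cuda.memory_allocated(i) / 1024**3:.1f} GiB")

result = generator.run("Explain tensor parallelism in one sentence.")  # placeholder prompt
print(result["replies"][0])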