I'm running LLM code on 8 NVIDIA A100 GPUs (1 node).
When I try to load a big model (70B), I get a CUDA out-of-memory
error.
from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    generation_kwargs={
        "max_new_tokens": 100,
        # tried with and without the device_map set to auto!
        "device_map": "auto",
    },
)
Then, when I run query_pipeline.run(), the model starts loading, and I can see in the nvidia-smi output (run automatically every 3 seconds) that at first all 8 GPUs are idle (1-2 MB used), then only the first one starts to fill up while the others stay empty, until the end when it crashes.
How can I make sure the model loads on all GPUs?
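For reference, this is the plain-transformers loading I would expect device_map="auto" to do under the hood (Accelerate spreading the layers over all visible GPUs); the bf16 dtype here is just my assumption so the 70B weights fit:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" lets Accelerate place the layers across all visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # assumption: bf16 so the checkpoint fits across 8 A100s
)
print(model.hf_device_map)  # mapping of layers to GPU indices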
Thanks, @Stefano Fiorucci - anakin87, your suggestion solved the problem. The right way to run the generator is:
generator = HuggingFaceLocalGenerator(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    generation_kwargs={"max_new_tokens": 100},
    huggingface_pipeline_kwargs={"device_map": "auto"},
)
That way, I can see in the nvidia-smi
output that the load is distributed across all GPUs (each at roughly 50% memory usage), and the model runs without crashing.
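To double-check the sharding from Python instead of watching nvidia-smi, something like this should work; the generator.pipeline attribute and the prompt below are assumptions on my part, not something from this thread:

import torch

generator.warm_up()  # loads the model with device_map="auto"

# assumption: the loaded transformers pipeline is exposed as generator.pipeline
print(generator.pipeline.model.hf_device_map)  # which layer sits on which GPU

# memory actually allocated by PyTorch on each GPU
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i}: {torch.cuda.memory_allocated(i) / 1024**3:.1f} GiB")

result = generator.run("Explain tensor parallelism in one sentence.")  # placeholder prompt
print(result["replies"][0])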