
Problems profiling LLM training launched with "huggingface/accelerate" using Nsight Systems


I am training the Llama model in a multi-node environment using huggingface/accelerate. When I run it as follows to profile it, the program dies because of a problem with the ssh connection to another node.

$ nsys profile accelerate launch train.py -b 1 -m Llama-2-7b-chat-hf -o sgd -t


I know this is not the proper way to profile a multi-node job, but I expected that profiling would at least run. Instead, I cannot connect to the other nodes once I use the nsys command…

Also, after that, even if I drop the nsys command and just run the application, it fails with the same issue. In the end I have to stop the Docker container and start it again to fix it… What is going on?


Solution

  • Create a bash script and put the nsys command inside it, rather than wrapping accelerate launch with nsys. Use the -o/--output switch with a report name that contains %p (which expands to the process ID), so that reports from different ranks do not collide.

    Then launch the bash script through accelerate with the --no_python option, e.g., accelerate launch --no_python <bash script>.

    These steps are also described, for similar parallel job launchers, at docs.nvidia.com/nsight-systems/UserGuide/index.html#deepspeed.
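
For illustration, the wrapper approach described above could look roughly like the sketch below. The file name nsys_wrapper.sh and the report name report_%p are hypothetical, and it assumes the training script is normally started as "python train.py"; adjust these to your setup.

```shell
# Write a hypothetical wrapper script (the name nsys_wrapper.sh is an assumption).
# The quoted 'EOF' keeps "$@" and %p literal inside the generated file.
cat > nsys_wrapper.sh <<'EOF'
#!/bin/bash
# Each rank executes this script, so each rank gets its own nsys instance.
# %p expands to the process ID, giving every rank a distinct report name.
# "python train.py" is assumed to be how the training script is normally run.
exec nsys profile -o report_%p python train.py "$@"
EOF
chmod +x nsys_wrapper.sh

# Then launch the wrapper through accelerate instead of wrapping accelerate with nsys:
#   accelerate launch --no_python ./nsys_wrapper.sh -b 1 -m Llama-2-7b-chat-hf -o sgd -t
```

With this layout accelerate still sets up the multi-node job itself, and nsys only wraps the per-rank training process. The arguments after the script name are passed through to train.py via "$@".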