amazon-web-services, pytorch, huggingface-transformers, amazon-sagemaker, distributed-computing

Using torchrun with AWS sagemaker estimator on multi-GPU node


I would like to run a training job on an ml.p4d.24xlarge machine on AWS SageMaker. I ran into a similar issue to the one described here, with significant slowdowns in training time. I now understand that I should run the job with torchrun. My constraint is that I don't want to use the HuggingFace or PyTorch estimators from SageMaker (for customizability and to properly understand the stack).

Currently, the entrypoint to my container is set as such in my Dockerfile:

ENTRYPOINT ["python3", "/opt/program/entrypoint.py"]

How should I change it so that it uses torchrun instead? Is it just a matter of setting:

ENTRYPOINT ["torchrun --nproc_per_node 8", "/opt/program/entrypoint.py"]


Solution

  • The SageMaker Training Toolkit contains the implementation that builds and calls the torchrun command on behalf of the SageMaker Python SDK estimator classes.

    You can refer to TorchDistributedRunner._create_command() in that toolkit to see how it constructs the torchrun command and its arguments; a sketch of the same approach for a custom container follows below.

    Please also refer to the PyTorch documentation on how to use the torchrun command: https://pytorch.org/docs/stable/elastic/run.html
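
    As a rough illustration, here is a minimal launcher sketch for a custom container that mimics what the toolkit does: it reads SageMaker's resource configuration from /opt/ml/input/config/resourceconfig.json (the standard file SageMaker writes into training containers, with "current_host" and "hosts" keys) and then execs torchrun with single-node or multi-node arguments. The launcher script name (launcher.py), the choice of port 29500, and the hard-coded process count of 8 are assumptions for a p4d.24xlarge, not code taken from the toolkit itself.

    # /opt/program/launcher.py -- minimal sketch, not the toolkit's actual implementation.
    # Assumes the training script lives at /opt/program/entrypoint.py; adjust paths,
    # the port, and the GPU count for your own setup.
    import json
    import os
    import sys

    RESOURCE_CONFIG = "/opt/ml/input/config/resourceconfig.json"

    def main():
        with open(RESOURCE_CONFIG) as f:
            cfg = json.load(f)

        hosts = sorted(cfg["hosts"])          # e.g. ["algo-1", "algo-2", ...]
        current_host = cfg["current_host"]    # e.g. "algo-1"
        nnodes = len(hosts)
        node_rank = hosts.index(current_host)

        cmd = [
            "torchrun",
            "--nproc_per_node", "8",          # 8 GPUs on a p4d.24xlarge (assumed fixed here)
            "--nnodes", str(nnodes),
            "--node_rank", str(node_rank),
            "--master_addr", hosts[0],        # rank-0 host acts as the rendezvous master
            "--master_port", "29500",         # arbitrary port, assumed reachable between nodes
            "/opt/program/entrypoint.py",
        ] + sys.argv[1:]                      # pass any script arguments through

        # Replace this process with torchrun so signals from SageMaker reach it directly.
        os.execvp(cmd[0], cmd)

    if __name__ == "__main__":
        main()

    With a launcher like this in place, the Dockerfile entrypoint would point at the launcher instead of the training script, e.g. ENTRYPOINT ["python3", "/opt/program/launcher.py"], and torchrun would then spawn one worker process per GPU.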