I am currently doing a project and coding up an ML model using PyTorch Lightning. The dataset I am training on is reasonably large, and hence infeasible to train on my local GPU. For this reason, I am thinking of using AWS cloud GPUs for training. I've heard a few terms thrown around, e.g. SageMaker, Ray Lightning, Docker, but beyond that I'm not entirely sure where to start or which is best for my use case.
I guess my question is: if I want to do multi-node cloud GPU training using PyTorch Lightning, what libraries/frameworks/tools should I be looking at?
I have researched different libraries/frameworks, but I'm currently unsure which one is best for my use case.
Amazon SageMaker and Bedrock are services you could use. This link will tell you about the differences (in a nutshell, Bedrock is less flexible but handles the GPU work for you): https://repost.aws/questions/QURQ0DJ5oPSUyyaLv0jjS4vw/bedrock-vs-sagemaker
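If you go the SageMaker route, the SageMaker Python SDK can launch your existing Lightning script on a multi-node GPU cluster for you. Here's a rough sketch, assuming a script called train.py, a placeholder IAM role and S3 path, and example instance/version choices (the exact framework_version and distribution options depend on which containers and SDK version you have available):

```python
# Sketch: kicking off a 2-node GPU training job with the SageMaker Python SDK.
# train.py is your own Lightning script; the role ARN, S3 path, versions and
# instance types below are placeholders/examples, not the only valid choices.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",           # your Lightning training script
    source_dir=".",                   # folder containing train.py + requirements.txt
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder IAM role
    framework_version="2.1",          # PyTorch version of the prebuilt container
    py_version="py310",
    instance_count=2,                 # number of nodes
    instance_type="ml.g5.12xlarge",   # 4 GPUs per node (example choice)
    distribution={"pytorchddp": {"enabled": True}},  # multi-node DDP launcher
    hyperparameters={"epochs": 10, "batch_size": 64},
)

estimator.fit({"training": "s3://your-bucket/your-dataset/"})  # placeholder S3 path
```

The script itself is just a normal Lightning training script, like the one sketched further down; SageMaker stages the S3 data into the containers and runs your entry point on every node.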
If you want to save money, you could spin up GPU instances on EC2 yourself, or run your training containers with Docker on EKS (or self-managed Kubernetes). The more DIY you go, the cheaper you can make things, but the more work you'll have to put into managing the GPU cluster.
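Whichever way you provision the machines, the Lightning side of multi-node training is mostly just Trainer configuration. A minimal sketch assuming 2 nodes with 4 GPUs each (the ToyModel and random data are stand-ins for your own LightningModule and dataloaders):

```python
# train.py -- minimal multi-node DDP sketch with PyTorch Lightning.
# Assumes 2 nodes x 4 GPUs each; swap in your own model and data.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    loader = DataLoader(dataset, batch_size=64)

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,        # GPUs per node
        num_nodes=2,      # number of machines in the cluster
        strategy="ddp",   # DistributedDataParallel across all 8 GPUs
        max_epochs=5,
    )
    trainer.fit(ToyModel(), loader)
```

You then launch the same script on every node, e.g. `torchrun --nnodes=2 --nproc_per_node=4 --node_rank=<0 or 1> --master_addr=<head-node-ip> --master_port=29500 train.py`; Lightning picks up the rank and world-size information from the torchrun environment.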
You could still use Ray in SageMaker. https://aws.amazon.com/blogs/machine-learning/orchestrate-ray-based-machine-learning-workflows-using-amazon-sagemaker/
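For the Ray route (whether on SageMaker, EC2, or EKS), recent Ray versions ship a Lightning integration in ray.train.lightning. A rough sketch under Ray 2.x, reusing the ToyModel from the train.py sketch above (on older stacks, the separate ray_lightning package exposes a RayStrategy you pass to the Trainer instead):

```python
# Sketch: driving the same Lightning training loop from a Ray cluster (Ray 2.x Train API).
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.lightning import (
    RayDDPStrategy,
    RayLightningEnvironment,
    RayTrainReportCallback,
    prepare_trainer,
)

from train import ToyModel  # the toy LightningModule from the EC2 sketch above


def train_func(config):
    # Each Ray worker runs this function with one GPU assigned to it.
    loader = DataLoader(
        TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1)), batch_size=64
    )
    trainer = pl.Trainer(
        max_epochs=5,
        accelerator="gpu",
        devices="auto",
        strategy=RayDDPStrategy(),            # lets Ray coordinate the DDP process group
        plugins=[RayLightningEnvironment()],  # reads rank/world size from Ray
        callbacks=[RayTrainReportCallback()], # reports metrics/checkpoints back to Ray
    )
    trainer = prepare_trainer(trainer)
    trainer.fit(ToyModel(), loader)


if __name__ == "__main__":
    # 8 GPU workers, spread across however many nodes the Ray cluster has.
    TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
    ).fit()
```

On SageMaker the blog post above handles standing up the Ray cluster for you; on EC2/EKS you'd bring up the cluster yourself (e.g. with Ray's cluster launcher) and run this script against the head node.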
You could likewise use Horovod. https://aws.amazon.com/blogs/machine-learning/multi-gpu-and-distributed-training-using-horovod-in-amazon-sagemaker-pipe-mode/
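The Horovod path looks similar from the Lightning side, but as far as I know the built-in "horovod" strategy only exists in PyTorch Lightning 1.x (it was removed in 2.0), so check your versions before committing to it. A rough sketch, again reusing the toy module from above:

```python
# Sketch: Horovod with PyTorch Lightning 1.x (strategy="horovod" is gone in 2.x).
# Launch the same script across the cluster with something like:
#   horovodrun -np 8 -H node1:4,node2:4 python train_hvd.py
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

from train import ToyModel  # the toy LightningModule from the EC2 sketch above

if __name__ == "__main__":
    loader = DataLoader(
        TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1)), batch_size=64
    )
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,             # one GPU per Horovod-launched process
        strategy="horovod",    # Lightning 1.x built-in Horovod integration
        max_epochs=5,
    )
    trainer.fit(ToyModel(), loader)
```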