python, tensorflow, keras, amazon-sagemaker, horizontal-scaling

SageMaker: horizontally scaling a TensorFlow (Keras) model


I am roughly following this script: fashion-MNIST-sagemaker.

I see that in the notebook

from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(entry_point='mnist_keras_tf.py', 
                          role=role,
                          train_instance_count=1, 
                          train_instance_type='local',
                          framework_version='1.12', 
                          py_version='py3',
                          script_mode=True,
                          hyperparameters={'epochs': 1}
                         )

I am wondering to what extent I can and should use the train_instance_count parameter. Will it distribute training along some dimension automatically, and if so, along which dimension?

Further, does it generally make sense to distribute training horizontally in a Keras (with TensorFlow) based setting?


Solution

  • Distributed training is model- and framework-specific. Not all models are easy to distribute, and things are not equally easy from one ML framework to another. It is rarely automatic, even less so with TensorFlow and Keras.

    Neural nets are conceptually easy to distribute under the data-parallel paradigm, whereby the gradient computation for a given mini-batch is split among workers, which can be multiple devices in the same host (multi-device) or multiple hosts, each with multiple devices (multi-device, multi-host). The D2L.ai course provides an in-depth view of how neural nets are distributed here and here.
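    To make the data-parallel idea concrete, here is a tiny NumPy sketch (illustrative only, with a toy linear model and made-up data) showing that averaging per-worker gradients computed on equal shards of one mini-batch reproduces the single-device gradient:

    import numpy as np

    # Toy linear model y = X @ w with mean squared error loss.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 4))          # one global mini-batch of 32 examples
    y = rng.normal(size=(32,))
    w = np.zeros(4)

    n_workers = 4
    shards = np.array_split(np.arange(32), n_workers)   # split the batch across workers

    def grad_mse(Xs, ys, w):
        # gradient of mean((Xs @ w - ys)**2) with respect to w
        return 2.0 * Xs.T @ (Xs @ w - ys) / len(ys)

    per_worker_grads = [grad_mse(X[idx], y[idx], w) for idx in shards]
    global_grad = np.mean(per_worker_grads, axis=0)      # the "all-reduce" step

    # With equal shard sizes this matches the single-device gradient.
    assert np.allclose(global_grad, grad_mse(X, y, w))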

    Keras used to be trivial to distribute in a multi-device, single-host fashion with multi_gpu_model, which will sadly be deprecated in 4 months. In your case, you seem to be referring to a multi-host setup (more than one machine), and that requires writing ad-hoc synchronization code such as the one shown in this official tutorial.
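    For reference, a minimal single-host, multi-GPU sketch with multi_gpu_model (assuming TensorFlow 1.12's tf.keras and a machine with at least 2 visible GPUs, e.g. a p3 instance; it will error out on a single-GPU box):

    import numpy as np
    import tensorflow as tf

    (x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype(np.float32) / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

    # Replicates the model on each GPU and splits every input batch across
    # the replicas; the results are merged back on the CPU.
    parallel_model = tf.keras.utils.multi_gpu_model(model, gpus=2)
    parallel_model.compile(optimizer='adam',
                           loss='sparse_categorical_crossentropy',
                           metrics=['accuracy'])
    parallel_model.fit(x_train, y_train, batch_size=256, epochs=1)  # 128 per GPU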

    Now let's look at how this relates to SageMaker.

    SageMaker comes with 3 options for algorithm development. Using distributed training may require a varying amount of custom work depending on the option you choose:

    1. The built-in algorithms are a library of 18 pre-written algorithms. Many of them are written to be distributed in single-host multi-GPU or multi-host multi-GPU fashion. With this first option, you don't have anything to do apart from setting train_instance_count > 1 to distribute over multiple instances (see the short sketch after this list).

    2. The Framework containers (the option you are using) are containers developed for popular frameworks (TensorFlow, PyTorch, Scikit-learn, MXNet) that provide pre-written Docker environments in which you can run arbitrary code. With this option, some containers support one-click creation of ephemeral training clusters for distributed training; however, setting train_instance_count greater than one is not enough to distribute the training of your model. It will just run your script on multiple machines. In order to distribute your training, you must write appropriate distribution and synchronization code in your mnist_keras_tf.py script. For some frameworks such code modifications are very simple: for TensorFlow and Keras, SageMaker comes with Horovod pre-installed. Horovod is a peer-to-peer, ring-allreduce communication mechanism that requires very little code modification and is highly scalable (initial announcement from Uber, SageMaker doc, SageMaker example, SageMaker blog post). My recommendation would be to try Horovod to distribute your code (see the Horovod sketch after this list). Similarly, in Apache MXNet you can easily create parameter stores to host model parameters in a distributed fashion and sync with them from multiple nodes. MXNet's scalability and ease of distribution is one of the reasons Amazon loves it.

    3. The Bring-Your-Own Container option requires you to write both the Docker container and the algorithm code. In this situation, you can of course distribute your training over multiple machines, but you also have to write the machine-to-machine communication code yourself.
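    Here is the short sketch referenced in option 1, using SageMaker's built-in KMeans as an example (illustrative; same v1 SDK and role as in your notebook, hypothetical instance type and output bucket):

    from sagemaker import KMeans

    kmeans = KMeans(role=role,
                    train_instance_count=2,                      # that's all it takes to distribute
                    train_instance_type='ml.c5.xlarge',
                    k=10,
                    output_path='s3://my-bucket/kmeans-output')  # hypothetical bucket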
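    And here is roughly what option 2 with Horovod could look like: a sketch under assumptions (same v1 SDK, framework version and role as your notebook, a hypothetical ml.p3.2xlarge instance type, a Fashion-MNIST toy model in the entry script; check the SageMaker doc linked above for the exact distributions options supported by your framework version). First the estimator:

    from sagemaker.tensorflow import TensorFlow

    tf_estimator = TensorFlow(entry_point='mnist_keras_tf.py',
                              role=role,
                              train_instance_count=2,            # two hosts instead of one
                              train_instance_type='ml.p3.2xlarge',
                              framework_version='1.12',
                              py_version='py3',
                              script_mode=True,
                              hyperparameters={'epochs': 1},
                              distributions={'mpi': {'enabled': True,
                                                     'processes_per_host': 1}})

    and the minimal Horovod additions inside mnist_keras_tf.py:

    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()                                           # one process per GPU/host

    # Pin each process to its own GPU (TF 1.x session API).
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    tf.keras.backend.set_session(tf.Session(config=config))

    (x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype(np.float32) / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

    # Scale the learning rate with the number of workers (see the note below) and
    # wrap the optimizer so gradients are averaged across workers at every step.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(lr=0.001 * hvd.size()))
    model.compile(optimizer=opt, loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights
    model.fit(x_train, y_train, batch_size=128, epochs=1,
              callbacks=callbacks, verbose=2 if hvd.rank() == 0 else 0)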

    For your specific situation, my recommendation would be to scale horizontally first within a single node with multiple GPUs, moving to bigger and bigger machine types, because latency and complexity increase drastically as you switch from a single-host to a multi-host context. If truly necessary, use a multi-node setup; things may be easier if that's done with Horovod. In any case, things are still much easier to do with SageMaker, since it manages the creation of ephemeral, billed-per-second clusters with built-in logging, metadata and artifact persistence, and also handles fast loading of training data from S3, sharded over the training nodes.

    Note on the relevance of distributed training: Keep in mind that when you distribute a model that was running fine on one device over N devices, you usually grow the batch size by a factor of N so that the per-device batch size stays constant and every device stays busy. This will disturb your model's convergence, because bigger batches mean a less noisy SGD. A common heuristic is to also grow the learning rate by N (more info in this great paper from Priya Goyal et al.), but this in turn induces instability during the first couple of epochs, so it is sometimes paired with a learning rate warmup. Scaling SGD to work well with very large batches is still an active research problem, with new ideas coming up frequently. Reaching good model performance with very large batches sometimes requires ad-hoc research and a fair amount of parameter tuning, occasionally to the point where the extra money spent on figuring out how to distribute well outweighs the benefits of the faster training you eventually manage to run.

    A situation where distributed training does make sense is when an individual record represents too much compute to form a big enough physical batch on a single device, as seen with big input sizes (e.g. vision over HD pictures) or big parameter counts (e.g. BERT). That being said, for models requiring a very big logical batch you don't necessarily have to distribute things physically: you can run N batches sequentially through your single GPU and wait N per-device batches before doing the gradient averaging and parameter update, simulating an N-times-bigger GPU (a clever hack sometimes called gradient accumulation; see the sketch below).
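    A minimal sketch of gradient accumulation (using the TF 2.x eager API for brevity, which differs from the 1.12 script-mode code above; model and data are toy placeholders):

    import tensorflow as tf

    # Toy model and data, just to make the loop runnable.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(8,))])
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    dataset = tf.data.Dataset.from_tensor_slices(
        (tf.random.normal([64, 8]),
         tf.random.uniform([64], maxval=10, dtype=tf.int64))).batch(8)

    ACCUM_STEPS = 4  # simulate a logical batch 4x bigger than what fits on the GPU
    accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]

    for step, (x, y) in enumerate(dataset):   # dataset yields per-device-sized micro-batches
        with tf.GradientTape() as tape:
            # Scale the loss so that summing ACCUM_STEPS gradients averages them.
            loss = loss_fn(y, model(x, training=True)) / ACCUM_STEPS
        grads = tape.gradient(loss, model.trainable_variables)
        accum_grads = [a + g for a, g in zip(accum_grads, grads)]

        if (step + 1) % ACCUM_STEPS == 0:
            # One parameter update per ACCUM_STEPS micro-batches.
            optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
            accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]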