tensorflow machine-learning distributed-training

Train on multiple devices


I know that TensorFlow offers a Distributed Training API that can train on multiple devices such as multiple GPUs, CPUs, TPUs, or multiple computers (workers), following this doc: https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras

But my question is: is there any possible way to split the training using data parallelism across multiple machines (including mobile devices as well as computers)?

I would be really grateful for any tutorial/instructions.


Solution

  • As far as I know, TensorFlow only supports CPUs, TPUs, and GPUs for distributed training, and all the devices should be on the same network.

    For connecting multiple machines, you can follow the Multi-worker training tutorial you mentioned (see the first sketch after this list).

    tf.distribute.Strategy is integrated into tf.keras, so when model.fit is used with a tf.distribute.Strategy instance, building your model inside strategy.scope() creates distributed variables. This allows the strategy to divide your input data equally across your devices (second sketch below). You can follow this tutorial for more details.
    The Distributed input guide could also help you (last sketch below).
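
A minimal sketch of the multi-worker setup from the tutorial above. The host addresses and port are placeholders I made up, not values from the tutorial; each machine sets TF_CONFIG to describe the cluster and its own role before creating the strategy:

```python
import json
import os

import tensorflow as tf

# Each worker sets TF_CONFIG before creating the strategy. "cluster"
# lists every worker's host:port; "task" identifies this machine.
# Addresses and port here are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["192.168.1.10:12345", "192.168.1.11:12345"],
    },
    "task": {"type": "worker", "index": 0},  # use index 1 on the second machine
})

# Synchronous data-parallel training across all the workers in the cluster.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
```

You then run the same script on every machine, changing only task.index.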
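
A sketch of strategy.scope() with model.fit, continuing from the setup above; the model architecture and the toy data are invented purely for illustration:

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Variables created inside strategy.scope() are distributed variables,
# mirrored across all workers.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Toy data for illustration; in multi-worker training Keras autoshards
# the dataset so each worker trains on a different slice of every batch.
x = np.random.random((256, 10)).astype("float32")
y = np.random.random((256, 1)).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)

# model.fit runs the usual Keras loop; gradients are all-reduced
# across replicas at each step.
model.fit(dataset, epochs=2)
```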
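
And a rough sketch of explicit distributed input with a custom training loop, following the pattern in the Distributed input guide. I use MirroredStrategy (single machine, multiple GPUs) to keep it self-contained, and the model and data are placeholders:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

GLOBAL_BATCH_SIZE = 16

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.SGD()
    # NONE reduction so we can average over the *global* batch ourselves.
    loss_fn = tf.keras.losses.MeanSquaredError(
        reduction=tf.keras.losses.Reduction.NONE)

# Placeholder data; experimental_distribute_dataset splits each batch
# across the replicas.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([64, 4]), tf.random.normal([64, 1]))
).batch(GLOBAL_BATCH_SIZE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

def train_step(inputs):
    x, y = inputs
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        per_example_loss = loss_fn(y, pred)
        loss = tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_train_step(batch):
    # Runs train_step on every replica with that replica's data shard,
    # then sums the per-replica losses.
    per_replica_losses = strategy.run(train_step, args=(batch,))
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

for batch in dist_dataset:
    loss = distributed_train_step(batch)
```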