optimization · gpu · tensorflow · multi-gpu

What is the advantage of doing multi-GPU training in TensorFlow?


In this TensorFlow tutorial, you use N GPUs and distribute N mini-batches (each containing M training samples), one to each GPU, calculating the gradients concurrently.

Then you average the gradients collected from the N GPUs and update the model parameters.
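Roughly, the per-step computation looks like this (a minimal TF 2-style sketch of the same idea, not the tutorial's actual code, which is older graph-mode style; the two-GPU device list and the toy model are just placeholders):

```python
import tensorflow as tf

# Assumes two GPUs are visible; adjust the device list to your machine.
gpus = ["/gpu:0", "/gpu:1"]                      # N devices
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(0.01)

def averaged_gradient_step(per_gpu_batches):
    """per_gpu_batches: list of (x, y) pairs, one mini-batch of M samples per GPU."""
    all_grads = []
    for device, (x, y) in zip(gpus, per_gpu_batches):
        with tf.device(device):
            with tf.GradientTape() as tape:
                loss = loss_fn(y, model(x))      # per-GPU loss on its own mini-batch
            all_grads.append(tape.gradient(loss, model.trainable_variables))
    # Average the N gradient lists element-wise, then do a single update.
    avg_grads = [tf.reduce_mean(tf.stack(g), axis=0) for g in zip(*all_grads)]
    optimizer.apply_gradients(zip(avg_grads, model.trainable_variables))

# Example: two mini-batches of M=64 random samples, one per GPU.
batches = [(tf.random.normal([64, 32]),
            tf.random.uniform([64], maxval=10, dtype=tf.int32)) for _ in gpus]
averaged_gradient_step(batches)
```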

But this has the same effect as using a single GPU to calculate the gradients of N*M training samples, then updating the parameters.
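Concretely, assuming the loss is a plain per-sample average, averaging the N per-GPU gradients gives exactly the gradient of the combined batch:

```latex
\frac{1}{N}\sum_{i=1}^{N}\left(\frac{1}{M}\sum_{j=1}^{M}\nabla_\theta\,\ell(x_{ij};\theta)\right)
  \;=\; \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\nabla_\theta\,\ell(x_{ij};\theta)
```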

So the only advantage, it seems to me, is that you can process a larger mini-batch in the same amount of time.

But is a larger mini-batch necessarily better?

I thought you shouldn't use a large mini-batch, so that the optimization stays more robust to saddle points.

If a larger mini-batch is indeed not better, why would you care about multi-GPU learning, or even multi-server learning?

(The tutorial above uses synchronous training. If it were asynchronous training, I could see the merit, since the parameters would be updated without averaging the gradients calculated by each GPU.)


Solution

  • The main purpose of multi-GPU learning is to enable you to train on a large data set in a shorter time. A larger mini-batch is not necessarily better, but at least you can finish training in a more feasible time.

    More precisely, those N mini-batches are not processed in a synchronized way if you use an asynchronous SGD algorithm. Since the algorithm changes when using multiple GPUs, it is not equivalent to using an M×N-sized mini-batch on a single GPU with plain SGD.

    If you use synchronous multi-GPU training, the benefit is mainly time reduction. You could use an M/N-sized mini-batch per GPU to keep the effective mini-batch size unchanged, although the scalability is limited because a smaller per-GPU mini-batch leads to proportionally more overhead. Data exchange and synchronization across a large number of computing nodes also become serious bottlenecks. (See the sketch after this answer for what synchronous data-parallel training looks like in code.)

    Finally, to solve the scalability issue, people move to asynchronous SGD (A-SGD) when using a large number of GPUs concurrently. So you probably won't see anyone using synchronous multi-GPU training on hundreds (or even tens) of GPUs.
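As referenced above, here is a minimal sketch of synchronous data-parallel training with TensorFlow's current tf.distribute.MirroredStrategy API (not the tutorial's code; the model, batch sizes, and random placeholder data are assumptions for illustration):

```python
import tensorflow as tf

# Synchronous data parallelism: each replica processes its own slice of the
# global batch, gradients are all-reduced, and the single mirrored set of
# weights is updated once per step.
strategy = tf.distribute.MirroredStrategy()            # uses all visible GPUs
per_replica_batch = 64
global_batch = per_replica_batch * strategy.num_replicas_in_sync

# Random placeholder data just to keep the example self-contained.
x = tf.random.normal([10_000, 32])
y = tf.random.uniform([10_000], maxval=10, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(global_batch)

with strategy.scope():                                 # variables are mirrored
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(0.01),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Each step consumes one global batch split evenly across replicas, so the
# wall-clock time per epoch drops roughly with the number of GPUs — the
# time-reduction benefit described above.
model.fit(dataset, epochs=1)
```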