deep-learning · pytorch · nvidia · distributed-computing · nvlink

Does NVLink accelerate training with DistributedDataParallel?


Nvidia's NVLink accelerates data transfer between several GPUs on the same machine. I train large models on such a machine using PyTorch.

I see why NVLink would make model-parallel training faster, since one pass through a model will involve several GPUs.

But would it accelerate a data-parallel training process using DistributedDataParallel?


Solution

  • How does data-parallel training on k GPUs work?
    You split your mini-batch into k parts, each part is forwarded through a replica of the model on a different GPU, and gradients are computed on each GPU. However (and this is the crucial point), the gradients must be synchronized (all-reduced) across all GPUs before the weights are updated, so every replica stays identical. That synchronization is pure inter-GPU communication, and this is where NVLink becomes important for data-parallel training as well. A minimal sketch of such a setup follows this list.
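Below is a minimal, hedged sketch (not code from the original answer) of a DistributedDataParallel training loop, assuming one process per GPU launched with `torchrun` and the NCCL backend, which uses NVLink for the gradient all-reduce when it is available. The model, tensor shapes, and hyperparameters are placeholders for illustration.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK for each spawned process (one per GPU)
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")  # NCCL routes traffic over NVLink when present
    torch.cuda.set_device(local_rank)

    # Placeholder model; each process holds its own replica
    model = nn.Linear(1024, 1024).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for _ in range(10):
        # Each rank processes its own shard of the mini-batch
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randn(32, 1024, device=local_rank)

        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        # backward() triggers the gradient all-reduce across all GPUs;
        # this inter-GPU transfer is the part NVLink accelerates
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched, for example, with `torchrun --nproc_per_node=<num_gpus> train.py`, each process computes gradients locally and the all-reduce during `backward()` is the step whose bandwidth depends on the GPU interconnect.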