neural-network, deep-learning, pytorch, libtorch

Using torch.nn.DataParallel with a custom CUDA extension


To my understanding, the built-in PyTorch operations all automatically handle batches through implicit vectorization, allowing parallelism across multiple GPUs.

However, the LLTM example given in the documentation for writing a custom CUDA operation only performs operations that are batch invariant, for example computing the gradient of the Sigmoid function elementwise.

My use case, however, is neither batch-element invariant nor vectorizable. Running on a single GPU, I currently (inefficiently) loop over each element in the batch, performing a kernel launch for each, like so (written in the browser, just to demonstrate):

std::vector<at::Tensor> op_cuda_forward(at::Tensor input,
                                        at::Tensor elementSpecificParam) {

    auto output = at::zeros(torch::CUDA(/* TYPE */), {/* DIMENSIONS */});

    const size_t blockDim = /* ... */;
    const size_t gridDim = /* ... */;
    const size_t numBatches = /* ... */;

    // One kernel launch per batch element, since the op is not batch invariant.
    for (size_t i = 0; i < numBatches; i++) {
        op_cuda_forward_kernel<T><<<gridDim, blockDim>>>(input[i],
                                                         elementSpecificParam[i],
                                                         output[i]);
    }

    return {output};
}

However, I wish to split this operation over multiple GPUs by batch element.

How would the allocation of the output Tensor work in a multi-GPU scenario?

Of course, one may create intermediate Tensors on each GPU before launching the appropriate kernel; however, the overhead of copying the input data to each GPU and back again would be problematic.
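
To make that concrete, here is a hypothetical Python-side sketch of what I mean, where op_forward merely stands in for the extension's forward function: distribute the batch elements round-robin over the available devices, run the op there, and copy every result back to one device.

import torch

def manual_multi_gpu(op_forward, input, element_specific_param):
    n_gpus = torch.cuda.device_count()
    outputs = []
    for i in range(input.size(0)):                    # one kernel launch per element
        dev = torch.device("cuda", i % n_gpus)        # round-robin over the GPUs
        out = op_forward(input[i].to(dev),            # copy the slice to that GPU
                         element_specific_param[i].to(dev))
        outputs.append(out.to("cuda:0"))              # ...and copy the result back
    return torch.stack(outputs)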

Is there a simpler way to launch the kernels without first probing the environment for GPU information (number of GPUs, etc.)?

The end goal is to have a CUDA operation that works with torch.nn.DataParallel.


Solution

  • This is somewhat unusual, since "batch" is commonly defined precisely as the dimension along which all operations of the network are invariant. So you could, for example, just introduce another dimension: keep your current implementation for the "former batch dimension", in which your operation is not invariant, and parallelize over the new dimension of multiple "actual batches" of data (see the sketch at the end of this answer).

    But, to stay closer to the question you asked, I see two options; the code below sketches the latter, which passes the element-specific parameter through forward so that nn.DataParallel scatters it along with the input:

    class Network(nn.Module):
        ...
        def forward(self, x, parameter):
            x = self.pre_modules(x)
            x = self.custom_module(x, parameter)
            return x

    model = Network()
    parameter = torch.zeros(16, requires_grad=True)
    net = nn.DataParallel(model)
    net(input, parameter)
    

    If you are willing to accept that this will be a leaky abstraction of the network and are mainly interested in getting things to work, I would try out the latter approach first.
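
    To illustrate the first suggestion (the extra dimension), here is a rough sketch under my own assumptions: Wrapper and custom_op are placeholder names, and custom_op merely stands in for your CUDA extension, which would itself loop over the inner, non-invariant dimension.

    import torch
    import torch.nn as nn

    def custom_op(x, p):
        # Stand-in for the CUDA extension: x is one "former batch" of shape
        # (former_batch, features), p holds one parameter per element.
        return x * p.unsqueeze(-1)

    class Wrapper(nn.Module):
        def __init__(self, op):
            super().__init__()
            self.op = op

        def forward(self, x, parameter):
            # x: (actual_batch, former_batch, features). nn.DataParallel
            # scatters along dim 0, so each replica only sees its own chunk
            # of "actual batches" and applies the op to each of them.
            return torch.stack([self.op(xi, pi) for xi, pi in zip(x, parameter)])

    x = torch.randn(4, 16, 32)                          # 4 "actual batches" of 16 elements
    parameter = torch.zeros(4, 16, requires_grad=True)  # one parameter per inner element
    net = nn.DataParallel(Wrapper(custom_op))
    out = net(x, parameter)                             # out: (4, 16, 32)

    With two GPUs, nn.DataParallel hands each replica two of the four "actual batches", and each replica keeps performing one kernel launch per element of its inner dimension, much as in your current implementation.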