To my understanding, the built-in PyTorch
operations all automatically handle batches through implicit vectorization, allowing parallelism across multiple GPUs.
However, when writing a custom CUDA operation as described in the documentation, the LLTM example given performs only batch-invariant operations, for example computing the gradient of the sigmoid function elementwise.
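For reference, the kind of helper I mean is an elementwise device function roughly like the following (paraphrased from memory from the tutorial, not copied verbatim); each thread touches a single value, so the batch dimension never matters:

template <typename scalar_t>
__device__ __forceinline__ scalar_t sigmoid(scalar_t z) {
  return 1.0 / (1.0 + exp(-z));
}

// Elementwise gradient of the sigmoid: each thread evaluates this for one
// value, independent of which batch element that value belongs to.
template <typename scalar_t>
__device__ __forceinline__ scalar_t d_sigmoid(scalar_t z) {
  const auto s = sigmoid(z);
  return (1.0 - s) * s;
}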
My use case, by contrast, is neither batch-element invariant nor vectorizable. Running on a single GPU, I currently (inefficiently) loop over each element in the batch, performing a kernel launch for each, like so (written in the browser, just to demonstrate):
std::vector<at::Tensor> op_cuda_forward(at::Tensor input,
                                        at::Tensor elementSpecificParam) {
  auto output = torch::zeros({/* DIMENSIONS */}, input.options());
  const size_t blockDim = /* threads per block */;
  const size_t gridDim = /* blocks per launch */;
  const int64_t numBatches = input.size(0);
  // One kernel launch per batch element; dispatch over scalar types and the
  // conversion of the sub-tensors to raw pointers/accessors is omitted here.
  for (int64_t i = 0; i < numBatches; i++) {
    op_cuda_forward_kernel<T><<<gridDim, blockDim>>>(input[i],
                                                     elementSpecificParam[i],
                                                     output[i]);
  }
  return {output};
}
However, I wish to split this operation over multiple GPUs by batch element.
How would the allocation of the output
Tensor work in a multi-GPU scenario?
Of course, one could create intermediate Tensors on each GPU before launching the appropriate kernel; however, the overhead of copying the input data to each GPU and back again would be problematic.
Is there a simpler way to launch the kernels without first probing the environment for GPU information (number of GPUs, etc.)?
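To make concrete the copy-heavy approach I would like to avoid, here is a rough sketch (op_cuda_forward_single is a hypothetical per-element wrapper around the kernel, and I am assuming the batch is simply distributed round-robin across devices):

#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>

// Hypothetical per-element wrapper around the CUDA kernel, defined elsewhere.
at::Tensor op_cuda_forward_single(at::Tensor input, at::Tensor param);

std::vector<at::Tensor> op_cuda_forward_multi(at::Tensor input,
                                              at::Tensor elementSpecificParam) {
  const int64_t numBatches = input.size(0);
  // Explicitly probing the environment for the number of GPUs.
  const int64_t numDevices = torch::cuda::device_count();
  std::vector<at::Tensor> outputs;
  outputs.reserve(numBatches);
  for (int64_t i = 0; i < numBatches; i++) {
    const int device = static_cast<int>(i % numDevices);
    // Copy this batch element and its parameter to the target GPU
    // (this is the transfer overhead in question).
    auto inputOnDevice = input[i].to(torch::Device(torch::kCUDA, device));
    auto paramOnDevice = elementSpecificParam[i].to(torch::Device(torch::kCUDA, device));
    // Make the kernel launch go to that GPU.
    c10::cuda::CUDAGuard guard(device);
    auto out = op_cuda_forward_single(inputOnDevice, paramOnDevice);
    // Copy the per-element result back to device 0 so it can be stacked.
    outputs.push_back(out.to(torch::Device(torch::kCUDA, 0)));
  }
  return {torch::stack(outputs)};
}

This works in principle, but the per-element copies to and from each device are exactly the overhead I would like to avoid.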
The end goal is to have a CUDA operation that works with torch.nn.DataParallel.