To my understanding, the built-in PyTorch
operations all automatically handle batches through implicit vectorization, allowing parallelism across multiple GPUs.
However, when writing a custom CUDA operation as described in the documentation, the LLTM example given performs only batch-invariant operations, for example computing the gradient of the sigmoid function elementwise.
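For reference, the kind of helper I mean is an elementwise device function roughly like the following (paraphrased from memory from the tutorial, not copied verbatim); each thread touches a single value, so the batch dimension never matters:

template <typename scalar_t>
__device__ __forceinline__ scalar_t sigmoid(scalar_t z) {
  return 1.0 / (1.0 + exp(-z));
}

// Elementwise gradient of the sigmoid: each thread evaluates this for one
// value, independent of which batch element that value belongs to.
template <typename scalar_t>
__device__ __forceinline__ scalar_t d_sigmoid(scalar_t z) {
  const auto s = sigmoid(z);
  return (1.0 - s) * s;
}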
My use case, by contrast, is neither batch-element invariant nor vectorizable. Running on a single GPU, I currently (inefficiently) loop over each element in the batch, performing a kernel launch for each, like so (written in the browser, just to demonstrate):
std::vector<at::Tensor> op_cuda_forward(at::Tensor input,
                                        at::Tensor elementSpecificParam) {
  auto output = torch::zeros({/* DIMENSIONS */}, input.options());
  const size_t blockDim = /* threads per block */;
  const size_t gridDim = /* blocks per launch */;
  const int64_t numBatches = input.size(0);
  // One kernel launch per batch element; dispatch over scalar types and the
  // conversion of the sub-tensors to raw pointers/accessors is omitted here.
  for (int64_t i = 0; i < numBatches; i++) {
    op_cuda_forward_kernel<T><<<gridDim, blockDim>>>(input[i],
                                                     elementSpecificParam[i],
                                                     output[i]);
  }
  return {output};
}
However, I wish to split this operation over multiple GPUs by batch element.
How would the allocation of the output
Tensor work in a multi-GPU scenario?
Of course, one could create intermediate Tensors on each GPU before launching the appropriate kernel; however, the overhead of copying the input data to each GPU and back again would be problematic.
Is there a simpler way to launch the kernels without first probing the environment for GPU information (number of GPUs, etc.)?
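To make concrete the copy-heavy approach I would like to avoid, here is a rough sketch (op_cuda_forward_single is a hypothetical per-element wrapper around the kernel, and I am assuming the batch is simply distributed round-robin across devices):

#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>

// Hypothetical per-element wrapper around the CUDA kernel, defined elsewhere.
at::Tensor op_cuda_forward_single(at::Tensor input, at::Tensor param);

std::vector<at::Tensor> op_cuda_forward_multi(at::Tensor input,
                                              at::Tensor elementSpecificParam) {
  const int64_t numBatches = input.size(0);
  // Explicitly probing the environment for the number of GPUs.
  const int64_t numDevices = torch::cuda::device_count();
  std::vector<at::Tensor> outputs;
  outputs.reserve(numBatches);
  for (int64_t i = 0; i < numBatches; i++) {
    const int device = static_cast<int>(i % numDevices);
    // Copy this batch element and its parameter to the target GPU
    // (this is the transfer overhead in question).
    auto inputOnDevice = input[i].to(torch::Device(torch::kCUDA, device));
    auto paramOnDevice = elementSpecificParam[i].to(torch::Device(torch::kCUDA, device));
    // Make the kernel launch go to that GPU.
    c10::cuda::CUDAGuard guard(device);
    auto out = op_cuda_forward_single(inputOnDevice, paramOnDevice);
    // Copy the per-element result back to device 0 so it can be stacked.
    outputs.push_back(out.to(torch::Device(torch::kCUDA, 0)));
  }
  return {torch::stack(outputs)};
}

This works in principle, but the per-element copies to and from each device are exactly the overhead I would like to avoid.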
The end goal is to have a CUDA operation that works with torch.nn.DataParallel.