GPUDirect RDMA transfer from GPU to remote host

Scenario:

I have two machines, a client and a server, connected with Infiniband. The server machine has an NVIDIA Fermi GPU, but the client machine has no GPU. I have an application running on the GPU machine that uses the GPU for some calculations. The result data on the GPU is never used by the server machine, but is instead sent directly to the client machine without any processing. Right now I'm doing a cudaMemcpy to get the data from the GPU to the server's system memory, then sending it off to the client over a socket. I'm using SDP to enable RDMA for this communication.
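
For reference, a minimal sketch of the staging path described above (the buffer, socket, and size names are placeholders, not from the original code):

    #include <stdlib.h>
    #include <sys/socket.h>
    #include <cuda_runtime.h>

    /* Current path (sketch): GPU memory -> host staging buffer -> socket send.
       With SDP in use, the send() below is carried over RDMA transparently. */
    void send_result(int sock_fd, const float *dev_result, size_t n)
    {
        size_t bytes = n * sizeof(float);
        float *host_buf = (float *)malloc(bytes);

        /* The extra hop the question wants to eliminate: device -> server RAM. */
        cudaMemcpy(host_buf, dev_result, bytes, cudaMemcpyDeviceToHost);

        send(sock_fd, host_buf, bytes, 0);
        free(host_buf);
    }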

Question:

Is it possible for me to take advantage of NVIDIA's GPUDirect technology to get rid of the cudaMemcpy call in this situation? I believe I have the GPUDirect drivers correctly installed, but I don't know how to initiate the data transfer without first copying it to the host.

My guess is that it isn't possible to use SDP in conjunction with GPUDirect, but is there some other way to initiate an RDMA data transfer from the server machine's GPU to the client machine?

Bonus: If someone has a simple way to test whether I have the GPUDirect dependencies correctly installed, that would be helpful as well!

Paphos answered 14/8, 2012 at 10:47 Comment(6)
In the CUDA SDK code samples you can find sample code that demonstrates what you want - developer.nvidia.com/cuda/cuda-cc-sdk-code-samples. You would need to use cudaMemcpyAsync to copy asynchronously with respect to the host.Telegraphese
I have the CUDA SDK, but I don't see any examples using GPUDirect technology. Do you know of a specific sample program I should look at?Paphos
I currently don't have it downloaded, but I think the "Simple Peer-to-Peer Transfers with Multi-GPU" example in the link I gave is what you want.Telegraphese
I'll go take a look at that and post back if I'm wrong, but I'm not looking for GPU-to-GPU (P2P) transfers. I'm pretty sure I can do that with the normal cudaMemcpy call. What I'm looking for is a way to transfer directly from the GPU to memory on another host using RDMA and Infiniband.Paphos
Okay, in that case you would definitely need to use pinned memory (allocated via cudaMallocHost), or use the cudaHostRegister function. I guess you just have to pin the memory, and GPUDirect would enable an RDMA transfer if the setup is okay (if your throughput after doing this is better than it is now, you can be certain of the improvement). And as far as I know, GPUDirect only accelerates cudaMemcpy; the copy itself cannot be removed. If you have many memcpy calls (H2D, D2H), you could just use cudaMemcpyDefault (a sketch follows this comment thread).Telegraphese
Thanks! I'll look into using cudaHostRegister to set up the client as a remote host and then do a cudaMemcpy call to transfer directly from the GPU to the client.Paphos
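
A rough sketch of the pinned-memory variant discussed in the comments above (all names are illustrative; error handling omitted):

    #include <sys/socket.h>
    #include <cuda_runtime.h>

    /* Staging through page-locked (pinned) host memory so the D2H copy is a true
       DMA and can overlap other work via a stream; the socket send still goes
       from host memory. */
    void send_result_pinned(int sock_fd, const float *dev_result, size_t n)
    {
        size_t bytes = n * sizeof(float);
        float *pinned_buf = NULL;
        cudaMallocHost((void **)&pinned_buf, bytes);   /* page-locked allocation */

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        /* Asynchronous with respect to the host; completes in the background. */
        cudaMemcpyAsync(pinned_buf, dev_result, bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);                 /* wait before reading the buffer */

        send(sock_fd, pinned_buf, bytes, 0);

        cudaStreamDestroy(stream);
        cudaFreeHost(pinned_buf);
    }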

Yes, it is possible with supporting networking hardware. See the GPUDirect RDMA documentation.
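
If the hardware and drivers support it, the usual pattern is to register device memory directly with the verbs API so the HCA can DMA to and from GPU memory. A hedged sketch (it assumes the GPUDirect RDMA kernel module, e.g. nv_peer_mem / nvidia-peermem, is loaded, and that pd is an already-allocated protection domain; error handling omitted):

    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>

    /* Register a cudaMalloc'd buffer with the HCA. With GPUDirect RDMA in place,
       ibv_reg_mr pins the GPU pages and the NIC reads/writes them directly,
       with no host staging copy. */
    struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t bytes, void **dev_ptr_out)
    {
        void *dev_ptr = NULL;
        cudaMalloc(&dev_ptr, bytes);                   /* ordinary device allocation */

        struct ibv_mr *mr = ibv_reg_mr(pd, dev_ptr, bytes,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);

        *dev_ptr_out = dev_ptr;
        return mr;   /* use mr->lkey / mr->rkey in work requests as usual */
    }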

Ivetteivetts answered 31/8, 2012 at 4:27 Comment(8)
I've seen that feature, but it looks like it targets GPU P2P transfers. Will it also allow me to copy data directly to a remote node without involving the CPU on the source node?Paphos
Yes, that is what RDMA means -- "Remote Direct Memory Access".Ivetteivetts
To quote from the page you linked to: "Eliminate CPU bandwidth and latency bottlenecks using direct memory access (DMA) between GPUs and other PCIe devices ..." This leaves me unclear as to whether the CUDA driver has RDMA support for the situation I described above, or if it's only for P2P transfers. It seems like it would be easily supported, but that page isn't very explicit on the matter. This still seems like a good answer though, so I'll accept it.Paphos
The key word here is "Remote", i.e. not peers on the same PCI-e bus. This will require support from specific Infiniband card makers that NVIDIA partners with.Ivetteivetts
@Ivetteivetts But can we do peer-to-peer access over Infiniband RDMA, i.e. can a GPU1 core access GPU2's RAM through a pointer inside a kernel<<<>>> function? GPU1-Core <-Infiniband-> GPU2-RAM.Avalos
@Alex, no, GPU1 of PC1 can't access the RAM (GPU2-RAM) of remote PC2 with normal memory read operations. RDMA means that PC1 can post requests over Infiniband to copy some memory from PC2 (or GPU2-RAM) into some local memory (PC1 RAM or GPU1 RAM) without the remote PC2 taking an interrupt or doing a memcpy. The request is posted explicitly on a QP: mellanox.com/related-docs/prod_software/… page 106 "5.2.7 rdma_post_read... The contents of the remote memory region will be read into the local data buffer". You may access the local copy of the data only after this request completes (see the sketch after these comments).Davedaveda
I ended up here in 2021; I don't think this is the answer any longer 😂Coracoid
@Grant edited to provide the current docs.Ivetteivetts
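
To illustrate the one-sided semantics described in the rdma_post_read comment above, a hedged sketch using librdmacm (connection setup, memory registration, and the out-of-band exchange of remote_addr/rkey are omitted; names are illustrative):

    #include <stdint.h>
    #include <rdma/rdma_verbs.h>

    /* Post an RDMA READ: the local node pulls 'bytes' from the remote buffer into
       local_buf; the remote CPU is not interrupted and performs no memcpy. */
    int pull_remote_buffer(struct rdma_cm_id *id, void *local_buf, size_t bytes,
                           struct ibv_mr *local_mr, uint64_t remote_addr, uint32_t rkey)
    {
        int ret = rdma_post_read(id, NULL, local_buf, bytes, local_mr,
                                 IBV_SEND_SIGNALED, remote_addr, rkey);
        if (ret)
            return ret;

        /* The local buffer is valid only after the work request completes. */
        struct ibv_wc wc;
        return (rdma_get_send_comp(id, &wc) <= 0) ? -1 : 0;
    }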

I would like to share my investigation regarding this question. To use GPUDirect between a GPU and a NIC, your network card must support RDMA; for example, you could pair an NVIDIA Mellanox MCX623106AN-CDAT ConnectX®-6 Dx network card with an NVIDIA Quadro card with RDMA support. You can use this example for sending data between the GPU and the NIC:

https://github.com/Mellanox/gpu_direct_rdma_access
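
On the bonus question of checking the GPU side of the setup, recent CUDA runtimes (11.3 and later, if I recall correctly) expose a device attribute for this. A small sketch; note it does not verify the NIC/driver side (e.g. that the nvidia-peermem module is loaded), which must be checked separately:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int supported = 0;
        /* Reports whether device 0 and the driver advertise GPUDirect RDMA support. */
        cudaDeviceGetAttribute(&supported, cudaDevAttrGPUDirectRDMASupported, 0);
        printf("GPUDirect RDMA supported on device 0: %s\n", supported ? "yes" : "no");
        return 0;
    }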

Staphylococcus answered 12/10, 2023 at 8:55 Comment(0)
