Does nVidia RDMA GPUDirect always operate only on physical addresses (in the physical address space of the CPU)?

As we know: http://en.wikipedia.org/wiki/IOMMU#Advantages

Peripheral memory paging can be supported by an IOMMU. A peripheral using the PCI-SIG PCIe Address Translation Services (ATS) Page Request Interface (PRI) extension can detect and signal the need for memory manager services.


But when we use an nVidia GPU with CUDA >= 5.0, we can use RDMA for GPUDirect, and we know that:

http://docs.nvidia.com/cuda/gpudirect-rdma/index.html#how-gpudirect-rdma-works

Traditionally, resources like BAR windows are mapped to user or kernel address space using the CPU's MMU as memory mapped I/O (MMIO) addresses. However, because current operating systems don't have sufficient mechanisms for exchanging MMIO regions between drivers, the NVIDIA kernel driver exports functions to perform the necessary address translations and mappings.

http://docs.nvidia.com/cuda/gpudirect-rdma/index.html#supported-systems

RDMA for GPUDirect currently relies upon all physical addresses being the same from the PCI devices' point of view. This makes it incompatible with IOMMUs and hence they must be disabled for RDMA for GPUDirect to work.

And if we allocate CPU-RAM and map it into the UVA, as here:

#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

int main() {
    // Allow the host to map pinned allocations into the device address space
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Allocate 1 MB of pinned, mapped host memory
    unsigned char *host_src_ptr = NULL;
    cudaHostAlloc(&host_src_ptr, 1024*1024, cudaHostAllocMapped);
    std::cout << "host_src_ptr = " << (size_t)host_src_ptr << std::endl;

    // Get the UVA device pointer for the same pinned allocation
    unsigned char *uva_src_ptr = NULL;
    cudaHostGetDevicePointer(&uva_src_ptr, host_src_ptr, 0);
    std::cout << "uva_src_ptr  = " << (size_t)uva_src_ptr << std::endl;

    int b;  std::cin >> b;
    return 0;
}

We get equal pointers on Windows 7 x64, which means that cudaHostGetDevicePointer() does nothing:

host_src_ptr = 68719476736

uva_src_ptr = 68719476736

What does "sufficient mechanisms for exchanging MMIO regions between drivers" mean, which mechanism is meant here, and why can I not use the IOMMU, with a virtual address, to access over PCIe the physical BAR region of another memory-mapped PCIe device?

And does this mean that RDMA for GPUDirect always operates only on physical addresses (in the physical address space of the CPU)? But then why do we pass to the kernel function uva_src_ptr, which is equal to host_src_ptr, a plain pointer in the CPU's virtual address space?

Gibbs asked 7/11, 2013 at 16:50

The IOMMU is very useful in that it provides a set of mapping registers. It can arrange for any physical memory to appear within the address range accessible by a device, and it can make physically scattered buffers look contiguous to devices, too. This is not good for 3rd-party PCI/PCI-Express cards or remote machines attempting to access the raw physical offset of an nVidia GPU, since it may result in them not actually accessing the intended regions of memory, or in the IOMMU inhibiting/restricting such accesses on a per-card basis. The IOMMU must therefore be disabled, because

"RDMA for GPUDirect currently relies upon all physical addresses being the same from the PCI devices' point of view."

-nVidia, Design Considerations for rDMA and GPUDirect

When drivers attempt to utilize the CPU's MMU and map regions of memory-mapped I/O (MMIO) for use within kernel-space, they typically keep the returned address from the memory mapping to themselves. Because each driver operates within its own context or namespace, exchanging these mappings between nVidia's driver(s) and other 3rd-party vendors' drivers that wish to support rDMA+GPUDirect would be very difficult, and would result in a vendor-specific solution (possibly even product-specific, if drivers vary greatly between products from the 3rd party). Also, today's operating systems currently don't have any good solution for exchanging MMIO mappings between drivers, thus nVidia exports several functions that allow 3rd-party drivers to easily access this information from within kernel-space itself.

nVidia enforces the use of "physical addressing" to access each card via rDMA for GPUDirect. This greatly simplifies the process of moving data from one computer to a remote system's PCI-Express bus by using that machine's physical addressing scheme, without having to worry about problems related to virtual addressing (e.g. resolving virtual addresses to physical ones). Each card has a physical address it resides at and can be accessed at that offset; only a small bit of logic must be added to the 3rd-party driver attempting to perform rDMA operations. Also, these 32- or 64-bit Base Address Registers are part of the standard PCI configuration space, so the physical address of the card can easily be obtained by simply reading from its BARs, rather than having to obtain a mapped address that nVidia's driver acquired upon attaching to the card. nVidia's Universal Virtual Addressing (UVA) takes care of mapping the aforementioned physical addresses to a seemingly contiguous region of memory for user-space applications, like so:

CUDA Virtual Address Space

These regions of memory are further divided into three types: CPU, GPU, and FREE, which are all documented here.

Back to your usage case, though: since you're in user-space, you don't have direct access to the physical address space of the system, and the addresses you're using are probably virtual addresses provided to you by nVidia's UVA. Assuming no previous allocations were made, your memory allocation should reside at offset +0x00000000, which would result in you seeing the same offset as the GPU itself. If you were to allocate a second buffer, I imagine you'd see this buffer start immediately after the end of the first buffer (at offset +0x00100000 from the base virtual address of the GPU, in your case of 1 MB allocations).

If you were in kernel-space, however, and were writing a driver for your company's card to utilize rDMA for GPUDirect, you would use the 32- or 64-bit physical addresses assigned to the GPU by the system's BIOS and/or OS to rDMA data directly to and from the GPU, itself.
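
For a rough sense of where such a physical address comes from, here is a minimal sketch, assuming a Linux PCI driver; it is my own illustration rather than anything from nVidia, and the vendor-ID match and the choice of BAR0 are placeholders. It only reports the physical address and length of BAR0 as the OS assigned them:

#include <linux/module.h>
#include <linux/pci.h>

/* Hypothetical example: match nVidia's vendor ID purely for illustration */
static const struct pci_device_id example_ids[] = {
    { PCI_DEVICE(0x10de, PCI_ANY_ID) },
    { 0, }
};
MODULE_DEVICE_TABLE(pci, example_ids);

static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    resource_size_t bar0_start, bar0_len;

    if (pci_enable_device(pdev))
        return -ENODEV;

    /* Physical (bus) address and size of BAR0, straight from the PCI resources */
    bar0_start = pci_resource_start(pdev, 0);
    bar0_len   = pci_resource_len(pdev, 0);
    dev_info(&pdev->dev, "BAR0 at %pa, length %pa\n", &bar0_start, &bar0_len);

    /* A CPU-side driver would ioremap() this window for MMIO access;
     * an rDMA peer would instead program its DMA engine with bar0_start. */
    return 0;
}

static void example_remove(struct pci_dev *pdev)
{
    pci_disable_device(pdev);
}

static struct pci_driver example_driver = {
    .name     = "bar_probe_example",
    .id_table = example_ids,
    .probe    = example_probe,
    .remove   = example_remove,
};
module_pci_driver(example_driver);
MODULE_LICENSE("GPL");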

Additionally, it may be worth noting that not all DMA engines actually support virtual addresses for transfers -- in fact, most require physical addresses, as handling virtual addressing from a DMA engine can get complex (page 7), thus many DMA engines lack support for this.

To answer the question from your post's title, though: nVidia currently only supports physical addressing for rDMA+GPUDirect in kernel-space. For user-space applications, you will always be using the virtual address of the GPU given to you by nVidia's UVA, which is in the virtual address space of the CPU.
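
If it helps to see that from user-space, here is a small sketch along the lines of the code in the question (my own illustration; device 0, the 1 MB sizes, and the missing error checking are simplifying assumptions). Under UVA, the host and device views of each pinned buffer print as the same virtual address:

#include <stdio.h>
#include <cuda_runtime.h>       /* build with nvcc, e.g. nvcc uva_check.cu */

int main(void)
{
    struct cudaDeviceProp prop;
    unsigned char *host_a = NULL, *host_b = NULL;
    unsigned char *dev_a  = NULL, *dev_b  = NULL;

    /* Allow pinned host allocations to be mapped into the device address space */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    /* Does the device report Unified Virtual Addressing at all? */
    cudaGetDeviceProperties(&prop, 0);          /* assumes device 0 */
    printf("unifiedAddressing = %d\n", prop.unifiedAddressing);

    /* Two 1 MB pinned, mapped host buffers */
    cudaHostAlloc((void **)&host_a, 1 << 20, cudaHostAllocMapped);
    cudaHostAlloc((void **)&host_b, 1 << 20, cudaHostAllocMapped);

    /* Under UVA the "device pointer" is the very same virtual address */
    cudaHostGetDevicePointer((void **)&dev_a, host_a, 0);
    cudaHostGetDevicePointer((void **)&dev_b, host_b, 0);

    printf("buffer A: host %p  device %p\n", (void *)host_a, (void *)dev_a);
    printf("buffer B: host %p  device %p\n", (void *)host_b, (void *)dev_b);

    cudaFreeHost(host_a);
    cudaFreeHost(host_b);
    return 0;
}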


Relating to your application, here's a simplified breakdown of the process you can follow for rDMA operations:

  1. Your user-space application creates buffers, which are in the scope of the Unified Virtual Addressing space nVidia provides (virtual addresses).
  2. Make a call to cuPointerGetAttribute(...) to obtain P2P tokens; these tokens pertain to memory inside the context of CUDA.
  3. Send all this information to kernel-space somehow (e.g. IOCTLs, read/writes to your driver, etc.). At a minimum, you'll want these three things to end up in your kernel-space driver:
    • P2P token(s) returned by cuPointerGetAttribute(...)
    • UVA virtual address(es) of the buffer(s)
    • Size of the buffer(s)
  4. Now translate those virtual addresses to their corresponding physical addresses by calling nVidia's kernel-space functions, as these addresses are held in nVidia's page tables and can be accessed with functions nVidia has exported, such as: nvidia_p2p_get_pages(...), nvidia_p2p_put_pages(...), and nvidia_p2p_free_page_table(...) (see the sketches below).
  5. Use these physical addresses acquired in the previous step to initialize your DMA engine that will be manipulating those buffers.

A more in-depth explanation of this process can be found here.
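
To make steps 2 through 5 above a little more concrete, here are two sketches of my own; they illustrate the flow described in this answer and are not code lifted from nVidia's documentation, and the device index, buffer size, and error handling are simplifying assumptions. First, the user-space side: allocate GPU memory with the driver API and fetch the P2P tokens from step 2 (the CUDA_POINTER_ATTRIBUTE_P2P_TOKENS fields are as I understand them from the GPUDirect RDMA docs):

#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr gpu_buf;
    CUDA_POINTER_ATTRIBUTE_P2P_TOKENS tokens;
    size_t bytes = 1 << 20;                        /* 1 MB, arbitrary */

    cuInit(0);
    cuDeviceGet(&dev, 0);                          /* assumes device 0 */
    cuCtxCreate(&ctx, 0, dev);

    cuMemAlloc(&gpu_buf, bytes);                   /* UVA address of GPU memory */

    /* Step 2: tokens identifying this allocation's CUDA VA space */
    cuPointerGetAttribute(&tokens, CU_POINTER_ATTRIBUTE_P2P_TOKENS, gpu_buf);

    /* Step 3: hand the address, size and tokens to your kernel driver
     * (via an ioctl, a write(), or whatever interface your driver exposes) */
    printf("va=0x%llx size=%zu p2pToken=0x%llx vaSpaceToken=0x%x\n",
           (unsigned long long)gpu_buf, bytes,
           (unsigned long long)tokens.p2pToken, tokens.vaSpaceToken);

    cuMemFree(gpu_buf);
    cuCtxDestroy(ctx);
    return 0;
}

And here is a fragment of what the kernel-space side of step 4 might look like. The names and macros are mine; the 64 KB GPU page size and the nvidia_p2p_* signatures follow nVidia's nv-p2p.h header as I remember it, so check them against the header shipped with your driver:

#include <linux/kernel.h>
#include <linux/types.h>
#include "nv-p2p.h"                 /* shipped with the nVidia kernel driver */

#define GPU_PAGE_SHIFT 16           /* GPU pages are 64 KB */
#define GPU_PAGE_SIZE  (1ULL << GPU_PAGE_SHIFT)
#define GPU_PAGE_MASK  (~(GPU_PAGE_SIZE - 1))

struct example_pin {
    struct nvidia_p2p_page_table *page_table;
};

/* Called by the nVidia driver if the GPU mapping is torn down behind our back */
static void example_free_callback(void *data)
{
    struct example_pin *pin = data;

    nvidia_p2p_free_page_table(pin->page_table);
    pin->page_table = NULL;
}

/* va, len and the two tokens arrive from user-space (step 3) */
static int example_pin_gpu_buffer(struct example_pin *pin, uint64_t va, uint64_t len,
                                  uint64_t p2p_token, uint32_t va_space_token)
{
    uint64_t aligned_va  = va & GPU_PAGE_MASK;
    uint64_t aligned_end = (va + len + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK;
    uint32_t i;
    int ret;

    /* Step 4: resolve the UVA range to GPU physical addresses */
    ret = nvidia_p2p_get_pages(p2p_token, va_space_token, aligned_va,
                               aligned_end - aligned_va,
                               &pin->page_table, example_free_callback, pin);
    if (ret)
        return ret;

    /* Step 5: each entry is a physical address a DMA engine can target */
    for (i = 0; i < pin->page_table->entries; i++)
        pr_info("GPU page %u at physical 0x%llx\n", i,
                (unsigned long long)pin->page_table->pages[i]->physical_address);

    /* When finished, release the pinning with:
     * nvidia_p2p_put_pages(p2p_token, va_space_token, aligned_va, pin->page_table); */
    return 0;
}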

Prosaic answered 21/11, 2013 at 23:27. Comments (7):
Big thanks! 1. I.e., in kernel-space we must always use physical addressing and disable the IOMMU, but in user-space we must always use virtual addressing via an enabled IOMMU; yet we disable the IOMMU at boot time, so how does virtual addressing (UVA) work then? 2. "Because each driver operates within its own context or namespace" – but as far as I know, all kernel-space drivers operate within a single address space (context), so what do you mean by "own context"? – Gibbs
3. "This is not good for 3rd party PCI/PCI-Express cards or remote machines attempting to access the raw physical offset of an nVidia GPU" - do you mean that 3rd party cards may don't use IOMMU, but 3rd cards operates within physical addresses, and what problem if GPU uses IOMMU or does not, if in any case GPU still operates virtual addresses? For comparison: We do not see any obstacles to the use of conventional MMU when the CPU is working with virtual addresses, even if we use RDMA (CPU-CPU).Gibbs
1: On many systems (such as x86), there actually is no IOMMU for all devices to use. Most of the time the devices, themselves, will have functional blocks that serve to do page-table lookups and basically function as an IOMMU. Also, you can use both virtual and physical addressing in kernel-space. This relates to question 2, since each driver is now in charge of setting up these mappings within that device's IOMMU -- 3rd party drivers don't have access to this. And by "context" I simply mean the private resources referenced by the functional driver, itself. – Prosaic
3: Since most devices actually implement their own IOMMU, the driver for that device must manage the address resolutions happening within that device's IOMMU. What nVidia is saying is that if a generic IOMMU exists in the system that all devices use for accessing memory regions of other cards and/or physical memory, it needs to be disabled, since the physical address of the card should be used, and not some other address that it would otherwise get resolved to in the generic IOMMU of the system. – Prosaic
Having a generic IOMMU present would break the assumption of all physical addresses being the same from the PCI devices' point of view; thus the need to disable it. I'm sure that if an IOMMU were present and simply disabled, or were configured to not actually resolve addresses at all, this scenario would work, but this is not usually the case :( – Prosaic
Big thanks! I.e., for RDMA (over InfiniBand) I must disable the CPU IOMMU and use physical addressing between InfiniBand and the GPU, because the InfiniBand IOMMU and the GPU IOMMU use different virtual addresses? I make a pinned memory region in GPU-RAM and get the page_table/SGL (scatter-gather list)/SGEs (scatter/gather entries) by using nvidia_p2p_get_pages() in kernel-space, then give them to InfiniBand's ibv_post_send() and send the data to remote CPU-RAM? – Gibbs
The MMU for the CPU may remain on, but all generic IOMMUs and device-specific IOMMUs must be disabled and physical addressing should be used. Also, I put a brief explanation of how to do what you're talking about at the end of my post :) – Prosaic
