4000% Performance Decrease in SYCL when using Unified Shared Memory instead of Device Memory

In SYCL, there are three types of memory: host memory, device memory, and Unified Shared Memory (USM). For host and device memory, data exchange requires explicit copying; data movement to and from USM is managed implicitly by the SYCL runtime.
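
For example (a minimal sketch, not the kernel from this question - the 1024-element array and the trivial kernel are placeholders): with device memory the host must move data through explicit memcpy() calls, while a shared allocation can be dereferenced on both sides and the runtime migrates the pages.

#include <sycl/sycl.hpp>
#include <vector>

int main()
{
    sycl::queue Q;
    const size_t n = 1024;

    // Device memory: the pointer is only dereferenceable on the device,
    // host <-> device transfers must be requested explicitly.
    std::vector<float> host(n, 1.0f);
    float *dev = sycl::malloc_device<float>(n, Q);
    Q.memcpy(dev, host.data(), n * sizeof(float)).wait();
    Q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) { dev[i] *= 2.0f; }).wait();
    Q.memcpy(host.data(), dev, n * sizeof(float)).wait();
    sycl::free(dev, Q);

    // Unified Shared Memory: one pointer, valid on host and device;
    // the runtime moves the data implicitly.
    float *shared = sycl::malloc_shared<float>(n, Q);
    for (size_t i = 0; i < n; i++) shared[i] = 1.0f;   // host write
    Q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) { shared[i] *= 2.0f; }).wait();
    float first = shared[0];                            // host read
    sycl::free(shared, Q);
    return first > 0.0f ? 0 : 1;
}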

Unfortunately, while implementing GPU acceleration for a numerical kernel using SYCL, I found an up to 4000% performance decrease just by switching from sycl::malloc_device() to sycl::malloc_shared() - even though all I do is repeatedly resubmit the same SYCL kernel, without any attempt to access the data from the host.

When building the code with sycl::malloc_device() using OpenSYCL targeting AMD HIP GFX906 (Radeon VII / Instinct MI50), the program finishes in 0.27 seconds:

$ time ./fdtd.elf 
simulate 16974593 cells for 10 timesteps.

real    0m0.271s
user    0m0.253s
sys     0m0.020s

When building the same code with sycl::malloc_shared(), the program takes 10.6 seconds to complete:

simulate 16974593 cells for 10 timesteps.

real    0m10.649s
user    0m15.172s
sys     0m0.196s

This is a 3925% slowdown.

After enabling "Above 4G Decoding" and "Re-sizable BAR" support in the BIOS, it now takes 3.8 seconds instead of 10.6 seconds. But this doesn't fix the actual problem of needless memory transfers - a 1300% performance hit is still quite significant.

I also tested a similar kernel using the Intel DPC++ compiler previously, and saw similar results on the same hardware.

I suspect that the slowdown is caused by needless copying between the host and the device, but I'm not sure. What heuristics does a SYCL runtime use to determine whether copying is needed?

The sample code is attached below.

ArrayNXYZ.hpp: 4-dimensional array (n, x, y, z) wrapper class.

#include <sycl/sycl.hpp>

template <typename T>
struct ArrayXYZN
{
    ArrayXYZN() {}

    inline T& operator() (const unsigned int n, const unsigned int x, const unsigned int y, const unsigned int z) const
    {
        size_t offset = n * n_stride + x * x_stride + y * y_stride + z;
        return array[offset];
    }

    unsigned long n_stride, x_stride, y_stride, size;
    T *array;
};

template <typename T>
ArrayXYZN<T>* CreateArrayXYZN(sycl::queue Q, const unsigned int* numLines)
{
    unsigned int n_max = 3;
    unsigned int x_max = numLines[0];
    unsigned int y_max = numLines[1];
    unsigned int z_max = numLines[2];

    unsigned long n_stride = x_max * y_max * z_max;
    unsigned long x_stride = y_max * z_max;
    unsigned long y_stride = z_max;

    if (n_stride % 128 != 0)
    {
        n_stride += 128 - (n_stride % 128);
    }

    // allocate 1D linear buffer
    size_t size = n_stride * n_max;

#ifdef USM
    T *buf = sycl::malloc_shared<T>(size, Q);
#else
    T *buf = sycl::malloc_device<T>(size, Q);
#endif

    // zero memory
    Q.submit([&](sycl::handler& h) {
        h.memset(buf, 0, size * sizeof(T));
    });
    Q.wait();

    // allocate wrapper class
    ArrayXYZN<T>* array = new ArrayXYZN<T>();
    array->n_stride = n_stride;
    array->x_stride = x_stride;
    array->y_stride = y_stride;
    array->size = size * sizeof(T);
    array->array = buf;

    return array;
}

fdtd.cpp:

#include <sycl/sycl.hpp>
#include "ArrayNXYZ.hpp"

/*
 * UpdateVoltages
 *
 * Using Finite Difference Time Domain (FDTD) method,
 * calculate new electric field array "volt" based on
 * magnetic field "curr" and two electromagnetic field
 * operators "vv" and "vi", precalculated from the
 * physical materials before starting up simulation.
 */
void UpdateVoltages(
        const ArrayXYZN<float>& volt,
        const ArrayXYZN<float>& curr,
        const ArrayXYZN<float>& vv,
        const ArrayXYZN<float>& vi,
        int x, int y, int z
)
{
    // note: each (x, y, z) cell has three polarizations
    // x, y, z, these are different from the cell's
    // coordinates (x, y, z)

    //for x polarization
    float volt0 = volt(0, x, y, z);
    volt0 *= vv(0, x, y, z);
    volt0 +=
        vi(0, x, y, z) * (
        curr(2, x, y  , z  ) -
        curr(2, x, y-1, z  ) -
        curr(1, x, y  , z  ) +
        curr(1, x, y  , z-1)
        );

    //for y polarization
    float volt1 = volt(1, x, y, z);
    volt1 *= vv(1, x, y, z);
    volt1 +=
        vi(1, x, y, z) * (
        curr(0, x  , y, z  ) -
        curr(0, x  , y, z-1) -
        curr(2, x  , y, z  ) +
        curr(2, x-1, y, z  )
        );

    //for z polarization
    float volt2 = volt(2, x, y, z);
    volt2 *= vv(2, x, y, z);
    volt2 +=
        vi(2, x, y, z) * (
        curr(1, x  , y  , z) -
        curr(1, x-1, y  , z) -
        curr(0, x  , y  , z) +
        curr(0, x  , y-1, z)
        );

    volt(0, x, y, z) = volt0;
    volt(1, x, y, z) = volt1;
    volt(2, x, y, z) = volt2;
}

int main(void)
{
    const unsigned int numLines[3] = {257, 257, 257};
    const int timesteps = 10;

    sycl::queue Q;

    ArrayXYZN<float>& volt = *CreateArrayXYZN<float>(Q, numLines);
    ArrayXYZN<float>& curr = *CreateArrayXYZN<float>(Q, numLines);
    ArrayXYZN<float>& vv = *CreateArrayXYZN<float>(Q, numLines);
    ArrayXYZN<float>& vi = *CreateArrayXYZN<float>(Q, numLines);

    size_t size = numLines[0] * numLines[1] * numLines[2];
    fprintf(stderr, "simulate %zu cells for %d timesteps.\n", size, timesteps);

    for (int i = 0; i < timesteps; i++) {
        Q.submit([&](sycl::handler &h) {
            h.parallel_for<class Voltage>(
                sycl::range(numLines[0] - 1, numLines[1] - 1, numLines[2] - 1),
                [=](sycl::item<3> itm) {
                    /*
                     * The first cell on each dimension has data dependency
                     * outside the simulation box (boundary condition).
                     * Ignore them for now.
                     */
                    int x = itm.get_id(0) + 1;
                    int y = itm.get_id(1) + 1;
                    int z = itm.get_id(2) + 1;

                    UpdateVoltages(volt, curr, vv, vi, x, y, z);
                }
            );
        });
        Q.wait();
    }
}
Frequent answered 16/7, 2023 at 20:36

I've solved the problem myself. There are three problems.

XNACK

The fundamental problem is that xnack is disabled.

Exactly what XNACK does, or how it can be enabled, is poorly documented in all but a few places. I believe this answer is the only comprehensive guide on the entire Web.

XNACK (GPU retry on page fault) is needed for on-demand page migration between the host and the GPU. Without it, HIP's shared memory operates in a degraded mode - memory will not be automatically migrated based on access patterns. Thus, if USM is to be used, XNACK must be enabled. You can check whether XNACK is enabled by looking at the GPU's ISA name (as reported by rocminfo or clinfo). If it contains xnack- (e.g. gfx906:sramecc+:xnack-), XNACK is disabled; if it contains xnack+, XNACK is enabled.

Unfortunately, not all dedicated GPUs support it. Most GPUs from the GFX10/GFX11 series (i.e. since RDNA) do not support XNACK. Thus, Unified Shared Memory, which is the recommended practice and heavily used in SYCL programming, takes a serious hit.

If you're lucky, many dedicated GPUs in the GFX9 series support XNACK (based on the ISA table in ROCm), but it is disabled by the amdgpu kernel driver by default, possibly due to stability concerns, as it's still an experimental feature. It can be manually enabled via the amdgpu kernel module parameter noretry=0 or the boot-time kernel argument amdgpu.noretry=0.

To enable XNACK:

  1. It must be supported by the hardware.

  2. It must be enabled in the Linux kernel via the noretry=0 flag. After enabling, clinfo or rocminfo should report xnack+ in the GPU's ISA name.

  3. It must be enabled in the runtime via the environmental variable HSA_XNACK=1 before running a HIP program.

  4. (Optional) Compile your code for an xnack+ target (e.g. use gfx906:xnack+ instead of the plain target name gfx906). This should maximize performance, but the binary will no longer run on devices without XNACK. In my case, I found almost no performance difference.

To check whether XNACK is really enabled, AMD has a small demo program, /opt/rocm/hip/bin/hipInfo. Run it with AMD_LOG_LEVEL=4 HSA_XNACK=1 ./hipInfo; it should report xnack: 1 at the beginning of the output.

Initializing HSA stack.
Loading COMGR library.
Numa selects cpu 
HMM support: 1, xnack: 1, direct host access: 0

On my particular distro (Gentoo), one needs to build dev-libs/rocr-runtime with USE=debug to allow debugging. The hipInfo program is not built by default, but it can be found in /usr/share/hip/samples/1_Utils/hipInfo. Change Makefile's HIPCC=$(HIP_PATH)/bin/hipcc to HIPCC=hipcc and run make.
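
If hipInfo isn't available, a similar check can be done from a few lines of HIP code. This is only a sketch; it assumes the hipDeviceProp_t::gcnArchName field (present in recent ROCm releases), which carries the full ISA name including the xnack suffix. Compile it with hipcc.

#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstring>

int main()
{
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
        fprintf(stderr, "no HIP devices found\n");
        return 1;
    }
    for (int i = 0; i < count; i++) {
        hipDeviceProp_t prop;
        if (hipGetDeviceProperties(&prop, i) != hipSuccess)
            continue;
        // gcnArchName looks like "gfx906:sramecc+:xnack-" or "...:xnack+"
        const char *state = strstr(prop.gcnArchName, "xnack+") ? "enabled" :
                            strstr(prop.gcnArchName, "xnack-") ? "disabled" : "unknown";
        printf("device %d: %s (%s), XNACK %s\n", i, prop.name, prop.gcnArchName, state);
    }
    return 0;
}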

After enabling XNACK, the performance of my code became normal, and the performance hit is now only 200%, not 1000% or 4000%.

Without XNACK:

$ time HSA_XNACK=0 ./fdtd_usm.elf 
simulate 16974593 cells for 10 timesteps.

real    0m3.345s
user    0m4.272s
sys     0m0.223s

With XNACK:

$ time HSA_XNACK=1 ./fdtd_usm.elf 
simulate 16974593 cells for 10 timesteps.

real    0m0.385s
user    0m0.343s
sys     0m0.050s

prefetch() and mem_advise()

The next problem is how to achieve good performance without XNACK. The answer is to use the performance hints prefetch() and mem_advise(). When XNACK is disabled, these are essentially a manual copy from the host to the GPU.

Also, since the overhead of USM is not zero and page migration is imperfect, these hints are also needed to maximize performance even when XNACK is supported.

prefetch()

One should prefetch data before the GPU needs it. Add the following lines immediately after CreateArrayXYZN():

/* 
 * Prefetch memory into the GPU. Performance critical!
 */
Q.prefetch(volt.array, volt.size);
Q.prefetch(curr.array, curr.size);
Q.prefetch(vv.array, vv.size);
Q.prefetch(vi.array, vi.size);
Q.wait();

After this change, the performance hit was reduced to only 200%, not 1000% or 4000%.

mem_advise()

Then, one uses platform-specific performance hints to tell the underlying runtime that we want the data to stay on the GPU. Unfortunately, there's no standard set of available hints - they are device-specific, and you may need to use a lookup table in your program.

For OpenSYCL with the AMD HIP backend, mem_advise() hints are passed directly to HIP's hipMemAdvise() (source code). AMD HIP provides the following hints of interest:

  • hipMemAdviseSetReadMostly: Data will mostly be read and only occasionally be written to
  • hipMemAdviseSetPreferredLocation: Set the preferred location for the data as the specified device
  • hipMemAdviseSetCoarseGrain: The default memory model is fine-grain. That allows coherent operations between host and device, while executing kernels. The coarse-grain can be used for data that only needs to be coherent at dispatch boundaries for better performance.

Thus, I added the following lines:

#define hipMemAdviseSetReadMostly 1
#define hipMemAdviseSetPreferredLocation 3
#define hipMemAdviseSetCoarseGrain 100

/*
 * Tell the runtime that we prefer data to stay on the GPU, and that
 * data coherency during simultaneously execution on both host and device
 * is not necessary.
 *
 * Note: Q.mem_advise() is the standard but OpenSYCL hasn't added its
 * support yet, so the OpenSYCL synchronous extension sycl::mem_advise
 * is used instead. The advice is hardware-specific! Here we use AMD HIP
 * advice values.
 */
sycl::mem_advise(volt.array, volt.size, hipMemAdviseSetPreferredLocation, Q);
sycl::mem_advise(volt.array, volt.size, hipMemAdviseSetCoarseGrain, Q);

sycl::mem_advise(curr.array, curr.size, hipMemAdviseSetReadMostly, Q);
sycl::mem_advise(curr.array, curr.size, hipMemAdviseSetPreferredLocation, Q);
sycl::mem_advise(curr.array, curr.size, hipMemAdviseSetCoarseGrain, Q);

sycl::mem_advise(vv.array, vv.size, hipMemAdviseSetReadMostly, Q);
sycl::mem_advise(vv.array, vv.size, hipMemAdviseSetPreferredLocation, Q);
sycl::mem_advise(vv.array, vv.size, hipMemAdviseSetCoarseGrain, Q);

sycl::mem_advise(vi.array, vi.size, hipMemAdviseSetReadMostly, Q);
sycl::mem_advise(vi.array, vi.size, hipMemAdviseSetPreferredLocation, Q);
sycl::mem_advise(vi.array, vi.size, hipMemAdviseSetCoarseGrain, Q);

After this modification, USM performance is now almost as good as device memory.

I found hipMemAdviseSetReadMostly and hipMemAdviseSetPreferredLocation had no effect, but hipMemAdviseSetCoarseGrain was able to close the final performance gap between device memory and USM - at the expense of data coherency during simultaneous execution on the host and device, which may or may not be acceptable for your application. I believe that for my use case, explicit Q.wait() is adequate.
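
Since the same group of hints is applied to every array, a small helper can keep the calls in one place. This is just a refactoring sketch of the code above (the name AdviseDeviceResident is illustrative), using the same sycl::mem_advise extension and the HIP advice values defined earlier; hipMemAdviseSetReadMostly is skipped for volt because it is written every timestep.

template <typename T>
void AdviseDeviceResident(sycl::queue& Q, const ArrayXYZN<T>& a, bool read_mostly)
{
    // HIP-specific advice values, as #defined above.
    if (read_mostly)
        sycl::mem_advise(a.array, a.size, hipMemAdviseSetReadMostly, Q);
    sycl::mem_advise(a.array, a.size, hipMemAdviseSetPreferredLocation, Q);
    sycl::mem_advise(a.array, a.size, hipMemAdviseSetCoarseGrain, Q);
}

// Usage:
AdviseDeviceResident(Q, volt, false);  // output array, written every timestep
AdviseDeviceResident(Q, curr, true);
AdviseDeviceResident(Q, vv, true);
AdviseDeviceResident(Q, vi, true);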

Above 4G Decoding & Resizable BAR

Finally, enabling "Above 4G Decoding" and "Re-sizable BAR" can improve the performance of host-to-GPU data transfers. After enabling these features in firmware, the performance hit without XNACK or prefetching dropped from 4000% to 1300%. It's not a real solution to the problem, but it helps maximize performance once USM is fixed using the previous methods.

Discussion

Lack of XNACK

The fundamental problem appears to be that most AMD discrete GPUs either disable XNACK by default or do not support it at all, even though the silicon theoretically appears to have had this capability since GFX8, according to the ISA table in ROCm.

Exactly what XNACK does, or how it can be enabled, is poorly documented in all but a few places.

What is XNACK

According to AMD's tutorial:

On MI200 GPUs there is an option to automatically migrate pages of memory between host and device. This is important for managed memory, where the locality of the data is important for performance. Depending on the system, page migration may be disabled by default in which case managed memory will act like pinned host memory and suffer degraded performance.

Enabling page migration allows the GPU (or host) to retry after a page fault (normally a memory access error), and instead retrieve the missing page. On MI200 platforms we can enable page migration by setting the environment variable HSA_XNACK=1. While this environment variable is required at kernel runtime to enable page migration, it is also helpful to enable this environment variable at compile time, which can change the performance of any compiled kernels

The documentation of Oak Ridge National Laboratory's supercomputer also states:

XNACK (pronounced X-knack) refers to the AMD GPU’s ability to retry memory accesses that fail due to a page fault. The XNACK mode of an MI250X can be changed by setting the environment variable HSA_XNACK before starting a process that uses the GPU. Valid values are 0 (disabled) and 1 (enabled), and all processes connected to a GPU must use the same XNACK setting. The default MI250X on Crusher is HSA_XNACK=0.

If HSA_XNACK=0, page faults in GPU kernels are not handled and will terminate the kernel. Therefore all memory locations accessed by the GPU must either be resident in the GPU HBM or mapped by the HIP runtime. Memory regions may be migrated between the host DDR4 and GPU HBM using explicit HIP library functions such as hipMemAdvise and hipPrefetchAsync, but memory will not be automatically migrated based on access patterns alone.

If HSA_XNACK=1, page faults in GPU kernels will trigger a page table lookup. If the memory location can be made accessible to the GPU, either by being migrated to GPU HBM or being mapped for remote access, the appropriate action will occur and the access will be replayed. Page migration will happen between CPU DDR4 and GPU HBM according to page touch. The exceptions are if the programmer uses a HIP library call such as hipPrefetchAsync to request migration, or if a preferred location is set via hipMemAdvise, or if GPU HBM becomes full and the page must forcibly be evicted back to CPU DDR4 to make room for other data.

How do I enable XNACK

  1. It must be supported by the hardware.

  2. It must be enabled in the Linux kernel via the noretry=0 flag. After enabling, clinfo or rocminfo should report xnack+ in the GPU's ISA name.

  3. It must be enabled in the runtime via the environmental variable HSA_XNACK=1 before running a HIP program.

Unfortunately, many dedicated desktop GPUs do not support it, making USM almost useless.

If you're lucky, many dedicated GPUs in the GFX9 series support XNACK. According to the Linux kernel source code:

bool kfd_process_xnack_mode(struct kfd_process *p, bool supported)
{
    int i;

    /* On most GFXv9 GPUs, the retry mode in the SQ must match the
     * boot time retry setting. Mixing processes with different
     * XNACK/retry settings can hang the GPU.
     *
     * Different GPUs can have different noretry settings depending
     * on HW bugs or limitations. We need to find at least one
     * XNACK mode for this process that's compatible with all GPUs.
     * Fortunately GPUs with retry enabled (noretry=0) can run code
     * built for XNACK-off. On GFXv9 it may perform slower.
     *
     * Therefore applications built for XNACK-off can always be
     * supported and will be our fallback if any GPU does not
     * support retry.
     */
    for (i = 0; i < p->n_pdds; i++) {
        struct kfd_dev *dev = p->pdds[i]->dev;

        /* Only consider GFXv9 and higher GPUs. Older GPUs don't
         * support the SVM APIs and don't need to be considered
         * for the XNACK mode selection.
         */
        if (!KFD_IS_SOC15(dev))
            continue;
        /* Aldebaran can always support XNACK because it can support
         * per-process XNACK mode selection. But let the dev->noretry
         * setting still influence the default XNACK mode.
         */
        if (supported && KFD_SUPPORT_XNACK_PER_PROCESS(dev))
            continue;

        /* GFXv10 and later GPUs do not support shader preemption
         * during page faults. This can lead to poor QoS for queue
         * management and memory-manager-related preemptions or
         * even deadlocks.
         */
        if (KFD_GC_VERSION(dev) >= IP_VERSION(10, 1, 1))
            return false;

        if (dev->noretry)
            return false;
    }

    return true;
}

It turns out that many dedicated GPUs in the GFX9 series support XNACK (based on the ISA table in ROCm), but it is disabled by the amdgpu kernel driver by default, possibly due to stability concerns, as it's still an experimental feature. It can be manually enabled via the kernel module parameter noretry=0 or the boot-time kernel argument amdgpu.noretry=0 (see the example after the quoted documentation below).

The Linux kernel documentation says:

noretry (int): Disable XNACK retry in the SQ by default on GFXv9 hardware. On ASICs that do not support per-process XNACK this also disables retry page faults. (0 = retry enabled, 1 = retry disabled, -1 auto (default))
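
For reference, this is how the setting is typically made persistent - either on the kernel command line or through a modprobe configuration file. The exact file names and initramfs tooling vary by distribution, so treat this as a sketch:

# boot-time kernel argument, e.g. appended to GRUB_CMDLINE_LINUX in /etc/default/grub
amdgpu.noretry=0

# or as a module option (regenerate the initramfs if amdgpu is loaded from it)
$ echo "options amdgpu noretry=0" | sudo tee /etc/modprobe.d/amdgpu.conf

Afterwards, rocminfo or clinfo should report xnack+ in the ISA name, as described above.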

Support Status

Unfortunately, many dedicated GPUs since RDNA (most GPUs from the GFX10/GFX11 series) do not support XNACK.

Page migration is not always available – e.g. on the AMD RDNA™ 2 GPUs or in operating systems that do not support heterogeneous memory management (HMM).

Even on supercomputer cards like the MI100 or the MI250X, support was non-existent until recently, and even now it is still experimental. According to a 2020 research paper:

UM only works on recent AMD GPUs, including Vega10 and MI100. Older GPUs such as Fiji and Polaris are not supported. There are two flavors of the support: XNACK-enabled and XNACK-disabled. In the XNACK-enabled mode [...] The XNACK-enabled mode only has experimental support. Not all the math libraries included in ROCm support the XNACK-enabled mode on current hardware. A mode can be chosen at boot-time, and the default is XNACK-disabled. Due to the uncertainties of the XNACK-enabled mode, our evaluation is limited to the XNACK-disabled mode. We would like to investigate the XNACK-enabled mode in our future work

AMD ROCm developers currently state that the feature is still experimental:

Occasionally, HSA_XNACK forces some page faults, and memory may not initialize as expected in specific cases. This issue is under investigation and will be fixed in a future release.

SYCL

It appears that the SYCL runtime has very limited capability for managing implicit memory transfers. I asked, "What heuristics does a SYCL runtime use to determine whether copying is needed?" The answer is that there is currently little to none (unlike, say, a CPU's memory controller). On AMD GPUs, OpenSYCL's USM is implemented on top of hipMallocManaged(); thus, SYCL's on-demand paging depends entirely on HIP.

The DPC++ runtime is not an oracle—it cannot predict what data an application will access before it does it. Additionally, pointer analysis remains a very difficult problem for compilers, which may not be able to accurately analyze and identify every allocation that might be used inside a kernel.

  • Data Parallel C++, Chapter 6.

Thus, good performance is achieved by using programmer-provided hints.

DPC++ gives us a way to modify the performance of the automatic migration mechanisms. It does this by defining two functions: prefetch and mem_advise. Figure 6-8 shows a simple utilization of each. These functions let us give hints to the runtime about how kernels will access data so that the runtime can choose to start moving data before a kernel tries to access it. Note that this example uses the queue shortcut methods that directly invoke parallel_for on the queue object instead of inside a lambda passed to the submit method (a command group).

The simplest way for us to do this is by invoking prefetch. This function is invoked as a member function of the handler or queue class and takes a base pointer and number of bytes. This lets us inform the runtime that certain data is about to be used on a device so that it can eagerly start migrating it. Ideally, we would issue these prefetch hints early enough such that by the time the kernel touches the data, it is already resident on the device, eliminating the latency we previously described.

The other function provided by DPC++ is mem_advise. This function allows us to provide device-specific hints about how memory will be used in kernels. An example of such possible advice that we could specify is that the data will only be read in a kernel, not written. In that case, the system could realize it could copy, or duplicate, the data on the device, so that the host’s version does not need to be updated after the kernel is complete. However, the advice passed to mem_advise is specific to a particular device, so be sure to check the documentation for hardware before using this function.

  • Data Parallel C++, Chapter 6.

The SYCL 2020 specification also says:

Performance hints for shared allocations may be specified by the user by enqueueing prefetch operations on a device. These operations inform the SYCL runtime that the specified shared allocation is likely to be accessed on the device in the future, and that it is free to migrate the allocation to the device. More about prefetch is found in Table 28 and Table 135. If a device supports concurrent access to shared allocations, then prefetch operations may be overlapped with kernel execution.

Additionally, users may use the mem_advise member function to annotate shared allocations with advice. Valid advice is defined by the device and its associated backend. See Table 28 and Table 135 for more information.
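
For completeness, the handler form of prefetch mentioned by the specification looks like this (a sketch, equivalent to the queue shortcut used earlier in this answer):

Q.submit([&](sycl::handler& h) {
    h.prefetch(volt.array, volt.size);
});
Q.wait();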

Frequent answered 17/7, 2023 at 0:05
