In CUDA, what instruction is used to load data from global memory to shared memory?

Asked 15/11, 2022 at 5:5 Answered 15/11, 2022 at 8:4

Solved cuda gpu nvidia gpu-shared-memory

I am currently studying CUDA and learned that there are global memory and shared memory.

I have checked the CUDA document and found that GPUs can access shared memory and global memory using ld.shared/st.shared and ld.global/st.global instructions, respectively.

What I am curious about is what instruction is used to load data from global memory to shared memory?

It would be great if someone could let me know.

Thanks!

__global__ void my_function(int* global_mem)
{
    __shared__ int shared_mem[10];
    for(int i = 0; i < 10; i++) {
        shared_mem[i] = global_mem[i];  // What instrcuton is used for this load operation?
    }
}

Carnay answered 15/11, 2022 at 5:5 Comment(1)

The easy way to find out is to run the executable with the kernel in the question through cuobjdump --dump-sass and look at the machine code (SASS). – Facia 15/11, 2022 at 5:17

In the case of

__shared__ float smem[2];
smem[0] = global_memory[0];

Then the operation is (in SASS)

LDG  Rx, [Ry]
STS  [Rz], Rx

To expand a bit more, read https://forums.developer.nvidia.com/t/whats-different-between-ld-and-ldg-load-from-generic-memory-vs-load-from-global-memory/40856/2

Summary:

instruction	meaning
`LDS`	load from shared space
`LDC`	load from constant space
`LDG`	load from global space
`LD`	generic load - space deduced from the supplied address

Murphree answered 15/11, 2022 at 5:9 Comment(1)

Oh, I couldn't think about that sequence of instructions. Thanks for your reply! – Carnay 15/11, 2022 at 5:13

With NVIDIA's Ampere microarchitecture, pipelining functionality was introduced to improve, among other things, the performance of copying from global to shared memory. Thus, we no longer need two instructions per element loaded, which keep the thread busier than it needs to be. Instead, you could write something like this:

#define NO_ZFILL 0

// ...

for(int i = 0; i < 10; i++) {
    __pipeline_memcpy_async(&shared_mem[i], &global_mem[i], sizeof(int), NO_ZFILL);
}
__pipeline_commit();
__pipeline_wait_prior(0); // wait for the first commited batch of pipeline ops

And the resulting PTX code looks like this:

{
        ld.param.u64    %rd1, [my_function(int*)_param_0];
        mov.u32         %r1, my_function(int*)::shared_mem;
        cp.async.ca.shared.global [%r1], [%rd1], 4, 4;
        add.s64         %rd2, %rd1, 4;
        add.s32         %r2, %r1, 4;
        cp.async.ca.shared.global [%r2], [%rd2], 4, 4;
        add.s64         %rd3, %rd1, 8;
        add.s32         %r3, %r1, 8;
        cp.async.ca.shared.global [%r3], [%rd3], 4, 4;
        add.s64         %rd4, %rd1, 12;
        add.s32         %r4, %r1, 12;
        cp.async.ca.shared.global [%r4], [%rd4], 4, 4;
        add.s64         %rd5, %rd1, 16;
        add.s32         %r5, %r1, 16;
        cp.async.ca.shared.global [%r5], [%rd5], 4, 4;
        add.s64         %rd6, %rd1, 20;
        add.s32         %r6, %r1, 20;
        cp.async.ca.shared.global [%r6], [%rd6], 4, 4;
        add.s64         %rd7, %rd1, 24;
        add.s32         %r7, %r1, 24;
        cp.async.ca.shared.global [%r7], [%rd7], 4, 4;
        add.s64         %rd8, %rd1, 28;
        add.s32         %r8, %r1, 28;
        cp.async.ca.shared.global [%r8], [%rd8], 4, 4;
        add.s64         %rd9, %rd1, 32;
        add.s32         %r9, %r1, 32;
        cp.async.ca.shared.global [%r9], [%rd9], 4, 4;
        add.s64         %rd10, %rd1, 36;
        add.s32         %r10, %r1, 36;
        cp.async.ca.shared.global [%r10], [%rd10], 4, 4;
        cp.async.commit_group;
        cp.async.wait_group 0;
        ret;

}

Notes about the PTX:

The key instructions are those beginning with cp.async, and the the add's are address computations.
Compiled with target virtual architecture sm_80.
The compiler has unrolled the loop (although it didn't have to).
This still needs to be compiled further into actual assembly instructions.

For more details, see Section B.27.3 Pipeline Primitives in the CUDA Programming Guide.

There is a fancier, but more opaque, way of doing this using the "cooperative groups" C++ interface bundled

Tonl answered 15/11, 2022 at 8:4 Comment(0)

As Kryrene already said, loading data from global memory to shared memory typically requires two instructions.

However, GPUs since the Ampere architecture (CC >= 8.0) also support loading data directly from global memory into shared memory with a single instruction which issues an asynchronous copy. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#async_data_operations

In ptx this could be cp.async.ca.shared.global , in SASS it could be LDGSTS

Portecochere answered 15/11, 2022 at 7:50 Comment(0)

Recommended topics

Hot tags