With NVIDIA's Ampere microarchitecture, pipelining functionality was introduced that improves, among other things, the performance of copying from global to shared memory. We no longer need two instructions per element loaded (a load from global memory into a register, then a store from the register into shared memory), which keeps the thread busier than it needs to be. Instead, you can write something like this:
#include <cuda_pipeline.h>

#define NO_ZFILL 0

__global__ void my_function(int* global_mem) {
    __shared__ int shared_mem[10];
    for (int i = 0; i < 10; i++) {
        // Issue an asynchronous copy of one int from global to shared memory.
        __pipeline_memcpy_async(&shared_mem[i], &global_mem[i], sizeof(int), NO_ZFILL);
    }
    __pipeline_commit();      // commit the issued copies as one batch
    __pipeline_wait_prior(0); // wait until all committed batches have completed
    // ... use shared_mem ...
}
And the resulting PTX code looks like this:
{
ld.param.u64 %rd1, [my_function(int*)_param_0];
mov.u32 %r1, my_function(int*)::shared_mem;
cp.async.ca.shared.global [%r1], [%rd1], 4, 4;
add.s64 %rd2, %rd1, 4;
add.s32 %r2, %r1, 4;
cp.async.ca.shared.global [%r2], [%rd2], 4, 4;
add.s64 %rd3, %rd1, 8;
add.s32 %r3, %r1, 8;
cp.async.ca.shared.global [%r3], [%rd3], 4, 4;
add.s64 %rd4, %rd1, 12;
add.s32 %r4, %r1, 12;
cp.async.ca.shared.global [%r4], [%rd4], 4, 4;
add.s64 %rd5, %rd1, 16;
add.s32 %r5, %r1, 16;
cp.async.ca.shared.global [%r5], [%rd5], 4, 4;
add.s64 %rd6, %rd1, 20;
add.s32 %r6, %r1, 20;
cp.async.ca.shared.global [%r6], [%rd6], 4, 4;
add.s64 %rd7, %rd1, 24;
add.s32 %r7, %r1, 24;
cp.async.ca.shared.global [%r7], [%rd7], 4, 4;
add.s64 %rd8, %rd1, 28;
add.s32 %r8, %r1, 28;
cp.async.ca.shared.global [%r8], [%rd8], 4, 4;
add.s64 %rd9, %rd1, 32;
add.s32 %r9, %r1, 32;
cp.async.ca.shared.global [%r9], [%rd9], 4, 4;
add.s64 %rd10, %rd1, 36;
add.s32 %r10, %r1, 36;
cp.async.ca.shared.global [%r10], [%rd10], 4, 4;
cp.async.commit_group;
cp.async.wait_group 0;
ret;
}
Notes about the PTX:
- The key instructions are those beginning with cp.async; the add instructions are address computations.
- Compiled with target virtual architecture sm_80.
- The compiler has unrolled the loop (although it didn't have to; a sketch of how to keep it rolled follows this list).
- This is still PTX; it must be compiled further (by ptxas or the driver) into the actual machine instructions (SASS).
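If you would rather keep the loop rolled, for example to keep the PTX compact, the standard CUDA unrolling pragma does that. This is an illustration of mine, not something from the original code:

#pragma unroll 1  // ask the compiler not to unroll this loop
for (int i = 0; i < 10; i++) {
    __pipeline_memcpy_async(&shared_mem[i], &global_mem[i], sizeof(int), NO_ZFILL);
}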
For more details, see Section B.27.3 Pipeline Primitives in the CUDA Programming Guide.
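The batch semantics of __pipeline_commit() and __pipeline_wait_prior(N) (wait until at most N of the most recently committed batches are still in flight) are what make actual pipelining possible. Here is a minimal double-buffered sketch of mine, assuming blockDim.x == CHUNK; CHUNK, n_chunks, and the consume step are illustrative placeholders, not part of the original answer:

#include <cuda_pipeline.h>

#define CHUNK 256  // illustrative chunk size; assumes blockDim.x == CHUNK

__global__ void double_buffered(const int* global_mem, int n_chunks) {
    __shared__ int buf[2][CHUNK];
    int tid = threadIdx.x;

    // Prefetch chunk 0 into buffer 0 and commit it as the first batch.
    __pipeline_memcpy_async(&buf[0][tid], &global_mem[tid], sizeof(int), 0);
    __pipeline_commit();

    for (int c = 0; c < n_chunks; c++) {
        if (c + 1 < n_chunks) {
            // Start fetching the next chunk into the other buffer.
            __pipeline_memcpy_async(&buf[(c + 1) % 2][tid],
                                    &global_mem[(c + 1) * CHUNK + tid],
                                    sizeof(int), 0);
            __pipeline_commit();
        }
        // Wait until the batch that loaded the current chunk has landed:
        // allow 1 outstanding batch while the next copy is in flight,
        // 0 on the last iteration when nothing newer was committed.
        __pipeline_wait_prior(c + 1 < n_chunks ? 1 : 0);
        __syncthreads();
        // ... consume buf[c % 2][tid] here (placeholder) ...
        __syncthreads(); // everyone is done with this buffer before it is reused
    }
}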
There is a fancier, but more opaque, way of doing this using the "cooperative groups" C++ interface bundled with CUDA.
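As a rough sketch of what that interface looks like (my illustration of the cooperative groups memcpy_async API, not code from the original answer):

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void my_function_cg(const int* global_mem) {
    __shared__ int shared_mem[10];
    auto block = cg::this_thread_block();
    // The whole thread block cooperates on a single asynchronous copy.
    cg::memcpy_async(block, shared_mem, global_mem, sizeof(int) * 10);
    cg::wait(block);  // wait for the copy to land and synchronize the block
    // ... use shared_mem ...
}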
To inspect the final machine code, run cuobjdump --dump-sass on the compiled binary and look at the SASS.