CUDA coalesced access to global memory

I have read the CUDA programming guide, but I missed one thing. Let's say I have an array of 32-bit ints in global memory and I want to copy it to shared memory with coalesced access. The global array has indices from 0 to 1023, and let's say I have 4 blocks, each with 256 threads.

__shared__ int sData[256];

When is coalesced access performed?

1.

sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x + gridDim.x * blockIdx.y];

Addresses 0 to 255 in global memory are copied, 32 at a time by the threads of each warp, so is that OK here?

2.

sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x + gridDim.x * blockIdx.y + someIndex];

If someIndex is not a multiple of 32, is the access not coalesced? Misaligned addresses? Is that correct?

Vitalize answered 25/4, 2012 at 23:30 Comment(1)
Neither of these can be coalesced, except for the first block in the grid. Threads are numbered in column-major order. - Brockman

What you want ultimately depends on whether your input data is a 1D or 2D array, and whether your grid and blocks are 1D or 2D. The simplest case is both 1D:

shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + threadIdx.x];

This is coalesced. The rule of thumb I use is that the most rapidly varying coordinate (the threadIdx) is added on as an offset to the block offset (blockDim * blockIdx). The end result is that the indexing stride between adjacent threads in the block is 1. If the stride gets larger, you lose coalescing.
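
As a concrete sketch of that 1D pattern (illustrative only; the kernel name, gOut, and the fixed 256-thread block are my assumptions, not something from the question):

__global__ void copyToShared(const int *gData, int *gOut, int n)
{
    // One element of shared storage per thread; assumes blockDim.x == 256.
    __shared__ int sData[256];

    int i = blockDim.x * blockIdx.x + threadIdx.x;   // stride of 1 across a warp
    if (i < n)
        sData[threadIdx.x] = gData[i];               // coalesced global load
    __syncthreads();

    // ... work on sData ...; write something back so the load is not optimised away
    if (i < n)
        gOut[i] = sData[threadIdx.x];
}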

The simple rule (on Fermi and later GPUs) is that if the addresses for all threads in a warp fall into the same aligned 128-byte range, then a single memory transaction will result (assuming caching is enabled for the load, which is the default). If they fall into two aligned 128-byte ranges, then two memory transactions result, etc.
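
To put numbers on that: a warp of 32 threads reading 4-byte ints touches 32 * 4 = 128 bytes. With the indexing above, a cudaMalloc'd pointer, and a block size that is a multiple of 32, the warp's first address is 128-byte aligned, so those 128 bytes fit in one aligned segment: one transaction. Shift the same access by one int and the 128 bytes straddle two aligned segments: two transactions.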

On GT2xx and earlier GPUs, it gets more complicated. But you can find the details of that in the programming guide.

Additional examples:

Not coalesced:

shmem[threadIdx.x] = gmem[blockDim.x + blockIdx.x * threadIdx.x];

Not coalesced, but not too bad on GT200 and later:

stride = 2;
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];

Not coalesced at all:

stride = 32;
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];
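
To spell out the last case: with 4-byte ints and a stride of 32, consecutive threads in a warp are 128 bytes apart, so each of the 32 addresses lands in its own aligned 128-byte segment and the warp's load costs 32 transactions instead of 1.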

Coalesced, 2D grid, 1D block:

int elementPitch = blockDim.x * gridDim.x;
shmem[threadIdx.x] = gmem[blockIdx.y * elementPitch + 
                          blockIdx.x * blockDim.x + threadIdx.x]; 

Coalesced, 2D grid and block:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int elementPitch = blockDim.x * gridDim.x;
shmem[threadIdx.y * blockDim.x + threadIdx.x] = gmem[y * elementPitch + x];
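
For completeness, a hedged end-to-end sketch of that 2D case (the kernel name, the 32x8 block shape, and the assumption that width and height are multiples of the block dimensions are mine):

__global__ void copy2D(const int *gmem, int *gout, int elementPitch)
{
    __shared__ int shmem[8][32];                    // matches blockDim = (32, 8)

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    shmem[threadIdx.y][threadIdx.x] = gmem[y * elementPitch + x];   // coalesced load
    __syncthreads();

    gout[y * elementPitch + x] = shmem[threadIdx.y][threadIdx.x];   // coalesced store
}

// launch:
// dim3 block(32, 8);                             // 32 wide so each warp reads a contiguous row segment
// dim3 grid(width / block.x, height / block.y);
// copy2D<<<grid, block>>>(d_in, d_out, width);   // elementPitch == width here
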
Im answered 26/4, 2012 at 5:29 Comment(1)
Added more rigor and examples. - Im

Your indexing in case 1 is wrong (or intentionally so strange that it seems wrong): in some blocks every thread accesses the same element, so there is no way for the access to be coalesced in those blocks.

Proof by example:

Grid = dim3(2,2,1), i.e. a 2x2 grid of blocks

where t(blockIdx.x, blockIdx.y) denotes a block at that grid position:

//complete block reads at index 0
t(0,0) -> sData[threadIdx.x] = gData[0];
//complete block reads at index 2
t(0,1) -> sData[threadIdx.x] = gData[2];
//definitely coalesced
t(1,0) -> sData[threadIdx.x] = gData[threadIdx.x];
//not coalesced, since the offset 2 is not a multiple of half the warp size (16)
t(1,1) -> sData[threadIdx.x] = gData[threadIdx.x + 2];

So its a "luck" game if a block is coalesced, so in general No
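
For what it's worth, the index arithmetic above is easy to reproduce with a quick host-side loop (illustrative only; it just evaluates the question's formula for every block of a 2x2 grid):

#include <stdio.h>

int main(void)
{
    int gridDimX = 2;
    // Evaluate threadIdx.x * blockIdx.x + gridDim.x * blockIdx.y
    // for every block and the first few threads of each block.
    for (int by = 0; by < 2; ++by)
        for (int bx = 0; bx < 2; ++bx)
            for (int tx = 0; tx < 4; ++tx)
                printf("block(%d,%d) thread %d -> gData[%d]\n",
                       bx, by, tx, tx * bx + gridDimX * by);
    return 0;
}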

But the coalescing rules for memory reads are not as strict on newer CUDA architectures as they used to be.
For compatibility, though, you should try to optimise your kernels for the oldest CUDA architecture you need to support, if possible.

Here is a good reference:

http://mc.stanford.edu/cgi-bin/images/0/0a/M02_4.pdf

Laurent answered 26/4, 2012 at 3:17 Comment(0)

The rules for which accesses can be coalesced are somewhat complicated and they have changed over time. Each new CUDA architecture is more flexible in what it can coalesce. I would say not to worry about it at first. Instead, do the memory accesses in whatever way is the most convenient and then see what the CUDA profiler says.
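
For example, with a newer toolkit the command-line profiler can report load efficiency directly, along the lines of (the application name is a placeholder):

nvprof --metrics gld_efficiency,gld_transactions_per_request ./myApp

On older toolkits the Visual Profiler exposes similar per-kernel memory counters.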

Vasculum answered 25/4, 2012 at 23:53 Comment(0)

Your examples are correct if you intended to use a 1D grid and 1D thread geometry. I think the indexing you intended to use is [blockIdx.x*blockDim.x + threadIdx.x].

With #1, the 32 threads in a warp execute that instruction 'simultaneously', so their requests, which are sequential and aligned to 128 B (32 x 4 bytes), are coalesced on both the Tesla and Fermi architectures, I believe.

With #2, it is a bit blurry. If someIndex is 1, it won't coalesce all 32 requests of the warp, but it might do partial coalescing. I believe Fermi devices will coalesce the accesses for threads 1-31 of the warp as part of a 128 B sequential segment of memory (and the first 4 B, which no thread needs, are wasted). I think Tesla-architecture devices would make that an uncoalesced access due to the misalignment, but I am not sure.

With someIndex as, say, 8, Tesla will have 32 B-aligned addresses, and Fermi might group them as 32 B, 64 B, and 32 B segments. But the bottom line is that, depending on the value of someIndex and the architecture, what happens is blurry, and it won't necessarily be terrible.
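
To make that last point concrete: with someIndex = 8 and 4-byte ints, and assuming the array base is 128-byte aligned (cudaMalloc guarantees that), the warp touches bytes 32 through 159 of the array: 8 * 4 = 32 up to (8 + 31) * 4 + 3 = 159. Relative to 128-byte boundaries that is the tail of one segment plus the head of the next, which can be serviced as aligned segments of 32 bytes (32-63), 64 bytes (64-127), and 32 bytes (128-159), i.e. the 32 B / 64 B / 32 B grouping described above.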

Breland answered 26/4, 2012 at 3:0 Comment(5)
It cannot be said, since his indexing is wrong or very strange; see my answer. - Laurent
Hmm, you're right, nice catch. @Hlavson, based on your question, I'm assuming you have a 1D grid and 1D thread geometry. So you'll want to index with [blockIdx.x*blockDim.x + threadIdx.x]. - Breland
This answer is completely wrong, I am afraid. Thread numbering is column-major within a block, and every thread has threadIdx.x multiplied by a stride (blockIdx.x). Full coalescing will happen for the first block in the first case, but not after that. The second case is the same as the first with an offset. - Brockman
Sorry, it is not. For case #1: if you have a 1D block, then the first block has a read stride of 1 word, which will be coalesced. The second block has a read stride of 2, which won't coalesce, the third block has a stride of 3, and so on. The equivalent formula for case #1 with a 1D block is threadIdx.x * blockIdx.x + gridDim.x. That will never fully coalesce. Case #2 is just case #1 with an additional offset. - Brockman
I'm sorry, I have no clue what you're talking about. The only difference, in any block, between two threads is the difference in threadIdx.x; so within a warp, if it starts aligned, it coalesces, and if it doesn't, it does strange things. I agree that the index in his question was wrong, and I addressed that in my comment. But that's no reason to ignore the question at hand, which is about when memory accesses coalesce and when they do not. - Breland
