In CUDA, what is memory coalescing, and how is it achieved?
A

4

106

What does "coalesced" mean for a CUDA global memory transaction? I couldn't understand it even after going through my CUDA guide. How do I achieve it? In the CUDA programming guide's matrix example, is accessing the matrix row by row called "coalesced", or is it column by column? Which is correct, and why?

Aufmann answered 18/2, 2011 at 12:33 Comment(0)
C
196

It's likely that this information applies only to compute capability 1.x, or CUDA 2.0. More recent architectures and CUDA 3.0 have more sophisticated global memory access, and in fact "coalesced global loads" are not even profiled for these chips.

Also, this logic can be applied to shared memory to avoid bank conflicts.


A coalesced memory transaction is one in which all of the threads in a half-warp access global memory at the same time. This is an oversimplification, but the way to achieve it is to have consecutive threads access consecutive memory addresses.

So, if threads 0, 1, 2, and 3 read global memory addresses 0x0, 0x4, 0x8, and 0xc, it should be a coalesced read.
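To make that concrete, here is a minimal sketch in Python (a host-side model of the addresses, not CUDA code) that counts how many aligned memory segments a group of threads touches in one access. The function name `segments_touched` and the 128-byte segment size are assumptions for illustration; the real transaction size depends on the architecture.

```python
def segments_touched(addresses, segment_bytes=128):
    """Count the distinct aligned segments covered by a group of byte addresses."""
    return len({addr // segment_bytes for addr in addresses})

# 16 threads (one half-warp), each reading one 4-byte word.
consecutive = [4 * t for t in range(16)]    # 0x0, 0x4, 0x8, 0xc, ...
strided = [128 * t for t in range(16)]      # neighbouring threads are 128 bytes apart

print("consecutive:", segments_touched(consecutive))
print("strided:    ", segments_touched(strided))
```

Under this model the consecutive pattern falls into a single segment, while the strided pattern needs one segment per thread, which is the intuition behind "consecutive threads, consecutive addresses".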

In a matrix example, keep in mind that you want your matrix to reside linearly in memory. You can do this however you want, and your memory access should reflect how your matrix is laid out. So, the 3x4 matrix below

0 1 2 3
4 5 6 7
8 9 a b

could be stored row after row, like this, so that (r,c) maps to memory (r*4 + c)

0 1 2 3 4 5 6 7 8 9 a b
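As a small sketch of that mapping (Python, with a hypothetical helper `flat_index` that is not part of any CUDA API), flattening the 3x4 matrix row after row looks like this:

```python
ROWS, COLS = 3, 4

def flat_index(r, c, cols=COLS):
    """Row-major mapping: (r, c) -> r * cols + c."""
    return r * cols + c

matrix = [[0x0, 0x1, 0x2, 0x3],
          [0x4, 0x5, 0x6, 0x7],
          [0x8, 0x9, 0xa, 0xb]]

# Copy every element to its position in the linear array.
flat = [None] * (ROWS * COLS)
for r in range(ROWS):
    for c in range(COLS):
        flat[flat_index(r, c)] = matrix[r][c]

print(" ".join(format(v, "x") for v in flat))
```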

Suppose you need to access each element once, and say you have four threads. Which threads will be used for which element? Probably either

thread 0:  0, 1, 2
thread 1:  3, 4, 5
thread 2:  6, 7, 8
thread 3:  9, a, b

or

thread 0:  0, 4, 8
thread 1:  1, 5, 9
thread 2:  2, 6, a
thread 3:  3, 7, b

Which is better? Which will result in coalesced reads, and which will not?

Either way, each thread makes three accesses. Let's look at the first access and see if the threads access memory consecutively. In the first option, the first access is 0, 3, 6, 9. Not consecutive, not coalesced. In the second option, it's 0, 1, 2, 3. Consecutive! Coalesced! Yay!
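The same check can be run for every access step, not just the first. This is a minimal Python sketch of the two assignments above; the function names are made up for illustration, and "consecutive" here is only a stand-in for what the hardware actually checks.

```python
NUM_THREADS = 4
NUM_ELEMENTS = 12
PER_THREAD = NUM_ELEMENTS // NUM_THREADS  # 3 accesses per thread

def option_one(thread, step):
    """Each thread owns a contiguous chunk: thread 0 gets 0, 1, 2."""
    return thread * PER_THREAD + step

def option_two(thread, step):
    """Threads interleave: thread 0 gets 0, 4, 8."""
    return step * NUM_THREADS + thread

def is_consecutive(indices):
    """True if each index is exactly one more than the previous one."""
    return all(b - a == 1 for a, b in zip(indices, indices[1:]))

for name, option in (("option 1", option_one), ("option 2", option_two)):
    for step in range(PER_THREAD):
        accessed = [option(t, step) for t in range(NUM_THREADS)]
        print(name, "step", step, accessed, "consecutive:", is_consecutive(accessed))
```

In a real kernel, the second option corresponds to indexing with something like `step * blockDim.x + threadIdx.x` rather than `threadIdx.x * PER_THREAD + step`.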

The best way is probably to write your kernel and then profile it to see if you have non-coalesced global loads and stores.

Communalism answered 18/2, 2011 at 17:20 Comment(12)
Thanks for the explanation looking at which thread accesses which element. Currently I have the first option (thread 0: 0, 1, 2 etc...) so I'm looking out for a better option now :-) - Pieper
@Communalism - I want to ask how to profile a kernel to see non-coalesced global loads and stores. - Pulitzer
@Pulitzer Can you use the Visual Profiler? developer.nvidia.com/nvidia-visual-profiler - Communalism
@Communalism - Since I work in a non-graphical environment, I searched and found nvprof in command-line mode. But when I tried to run it there was an error: nvprof couldn't load libcuda.so.1, there is no such file or directory! Do you know why? - Pulitzer
@jmilloy: Hello, very nice example! Thanks! I wanted to ask you: when you say you can run the profiler to see whether your accesses are coalesced or not, how can you do it? For example, by running nvprof --metrics gld_efficiency? And the higher the better? - Rasbora
@Rasbora I was using the visual profiler. nvprof seems like a powerful tool that will work for you as well. I want to emphasize that the metrics that are important depend on your device compute capability and CUDA version, and nvprof should allow you to monitor any of them. Get your kernel working first, and then optimize with any one of the available profilers. - Communalism
@jmilloy: OK, thanks, I just wanted to know if it is the "gld_efficiency" command for this. - Rasbora
@Communalism A relatively stupid question, but what is the issue if the memory access is non-coalesced (Option 1 in your example)? The threads still access the data and there are no race conditions. - Preengage
@filtfilt sequential (instead of simultaneous) reads, so inefficiency. - Communalism
btw I think in 2018 (after like compute 2) "half warp" should be changed to "warp" in this answer. - Besought
What if all threads access the same (global) memory location at the same time? Is that also a coalesced access, or is that slower? - Observable
Given that this information is outdated, do you have any resources for optimizing memory loads on newer architectures? In particular, is it still important for threads to be reading consecutive memory addresses? This article says it's not, at least for some architectures. cvw.cac.cornell.edu/gpu/coalesced - Arrangement
I
18

Memory coalescing is a technique that allows optimal use of the global memory bandwidth. That is, when parallel threads running the same instruction access consecutive locations in global memory, the most favorable access pattern is achieved.

[Figure: (a) linear storage of n vectors of length m; (b) coalesced storage arrangement]

The example in the figure above helps explain the coalesced arrangement:

In Fig. (a), n vectors of length m are stored in a linear fashion. Element i of vector j is denoted by v j i. Each thread in the GPU kernel is assigned to one m-length vector. Threads in CUDA are grouped in an array of blocks, and every thread has a unique id which can be defined as indx = bd × bx + tx, where bd is the block dimension, bx is the block index, and tx is the thread index within each block.

Vertical arrows show the case where parallel threads access the first component of each vector, i.e. addresses 0, m, 2m... of the memory. As shown in Fig. (a), in this case the memory access is not consecutive. By zeroing the gap between these addresses (red arrows in the figure above), the memory access becomes coalesced.

However, the problem gets slightly tricky here, since the number of threads residing in a GPU block is limited to bd. Therefore the coalesced data arrangement can be done by storing the first elements of the first bd vectors in consecutive order, followed by the first elements of the second bd vectors, and so on. The remaining vector elements are stored in a similar fashion, as shown in Fig. (b). If n (the number of vectors) is not a multiple of bd, the remaining data in the last block must be padded with some trivial value, e.g. 0.

In the linear data storage in Fig. (a), component i (0 ≤ i < m) of vector indx (0 ≤ indx < n) is addressed by m × indx + i; the same component in the coalesced storage pattern in Fig. (b) is addressed as

(m × bd) × ixC + bd × ixB + ixA,

where ixC = floor[(m × indx + i) / (m × bd)] = bx, ixB = i, and ixA = mod(indx, bd) = tx.

In summary, in the example of storing a number of vectors of size m, linear indexing is mapped to coalesced indexing according to:

m × indx + i → m × bd × bx + i × bd + tx

This data rearrangement can lead to significantly higher memory bandwidth for GPU global memory.
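The mapping can be sketched on the host side in a few lines of Python. This is only an illustration under the definitions above: the function names, the example sizes (n = 5, m = 3, bd = 4), and the choice of 0 as the padding value are assumptions, not code from the cited paper.

```python
def linear_index(m, indx, i):
    """Linear storage: component i of vector indx lives at m * indx + i."""
    return m * indx + i

def coalesced_index(m, bd, indx, i):
    """Coalesced storage: m * bd * bx + i * bd + tx, with bx = indx // bd and tx = indx % bd."""
    bx, tx = divmod(indx, bd)
    return m * bd * bx + i * bd + tx

def rearrange(data, n, m, bd, pad=0):
    """Copy n vectors of length m from the linear to the coalesced layout, padding the last block."""
    n_padded = -(-n // bd) * bd  # round n up to a multiple of bd
    out = [pad] * (n_padded * m)
    for indx in range(n):
        for i in range(m):
            out[coalesced_index(m, bd, indx, i)] = data[linear_index(m, indx, i)]
    return out

n, m, bd = 5, 3, 4
data = list(range(n * m))
layout = rearrange(data, n, m, bd)

# Addresses read by the threads of block 0 when each fetches component 0 of its vector.
first_block = [coalesced_index(m, bd, indx, 0) for indx in range(bd)]
print(first_block)
```

For a fixed component i, the threads of one block now read bd adjacent locations, whereas in the linear layout the same threads would read locations m apart.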


source: "GPU‐based acceleration of computations in nonlinear finite element deformation analysis." International journal for numerical methods in biomedical engineering (2013).

Isochronal answered 14/2, 2014 at 20:39 Comment(0)
E
12

If the threads in a block are accessing consecutive global memory locations, then all the accesses are combined into a single request (or coalesced) by the hardware. In the matrix example, the matrix elements in a row are arranged linearly, followed by the next row, and so on. For example, for a 2x2 matrix and 2 threads in a block, the memory locations are arranged as:

(0,0) (0,1) (1,0) (1,1)

In row access, thread1 accesses (0,0) and (1,0) which cannot be coalesced. In column access, thread1 accesses (0,0) and (0,1) which can be coalesced because they are adjacent.

Ecru answered 18/2, 2011 at 18:8 Comment(1)
Nice and concise, but remember that coalescing is not about two serial accesses by thread1, but a simultaneous access by thread1 and thread2 in parallel. In your row access example, if thread1 accesses (0,0) and (1,0), then I assume thread2 is accessing (0,1) and (1,1). Thus, the first parallel access is 1:(0,0) and 2:(0,1) --> coalesced! - Communalism
U
4

The criteria for coalescing are nicely documented in the CUDA 3.2 Programming Guide, Section G.3.2. The short version is as follows: threads in the warp must access the memory in sequence, and the words being accessed should be at least 32 bits wide. Additionally, the base address being accessed by the warp should be 64-, 128-, or 256-byte aligned for 32-, 64- and 128-bit accesses, respectively.
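As a rough illustration of the short version above, here is a Python sketch that checks one group of byte addresses against those rules. It is a simplified model, not the hardware's actual logic: the function name is made up, and the 16-thread group size is an assumption chosen so that 16 threads times 4, 8, or 16 bytes gives the 64-, 128-, and 256-byte alignments quoted above.

```python
GROUP_SIZE = 16  # assumed: one half-warp, so 16 x 4 bytes = 64-byte alignment

def meets_short_criteria(addresses, word_bytes):
    """Check a group of byte addresses against the simplified rules above."""
    if word_bytes not in (4, 8, 16):            # 32-, 64-, or 128-bit words only
        return False
    base = addresses[0]
    if base % (GROUP_SIZE * word_bytes) != 0:   # 64-, 128-, or 256-byte aligned base
        return False
    # Thread k must access the k-th word after the base address.
    return all(a == base + k * word_bytes for k, a in enumerate(addresses))

aligned = [256 + 4 * k for k in range(GROUP_SIZE)]
misaligned = [4 + 4 * k for k in range(GROUP_SIZE)]
strided = [8 * k for k in range(GROUP_SIZE)]

print(meets_short_criteria(aligned, 4))
print(meets_short_criteria(misaligned, 4))
print(meets_short_criteria(strided, 4))
```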

Tesla2 and Fermi hardware does an okay job of coalescing 8- and 16-bit accesses, but they are best avoided if you want peak bandwidth.

Note that despite improvements in Tesla2 and Fermi hardware, coalescing is BY NO MEANS obsolete. Even on Tesla2 or Fermi class hardware, failing to coalesce global memory transactions can result in a 2x performance hit. (On Fermi class hardware, this seems to be true only when ECC is enabled. Contiguous-but-uncoalesced memory transactions take about a 20% hit on Fermi.)

Unship answered 23/4, 2011 at 18:12 Comment(0)
