Why is it possible to have bank conflicts with shared memory, but not in global memory?
Bank conflicts and channel conflicts do in fact exist for global memory accesses. Maximum global memory bandwidth is only achieved when the memory channels and banks are accessed evenly, in a round-robin manner. For linear accesses to a single 1D array, the memory controller is usually designed to interleave the requests across each bank and channel evenly and automatically. However, when multiple 1D arrays (or different rows of a multi-dimensional array) are accessed at the same time, and their base addresses happen to be multiples of the size of a memory channel or bank, the interleaving can become imperfect. In this case one channel or bank is hit harder than the others, memory accesses are partially serialized, and the available global memory bandwidth drops.
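As a rough illustration (this simplified mapping is my own assumption, not something taken from vendor documentation): with a 256-byte channel interleave and a power-of-two number of channels, the channel is effectively selected by the address bits just above the interleave, so two arrays whose base addresses differ by a multiple of num_channels * 256 bytes start on the same channel.

    /* Hypothetical, simplified address-to-channel mapping. Real hardware
     * (particularly with a non-power-of-two channel count, or with address
     * hashing) uses a more complicated scheme. */
    static unsigned channel_of(unsigned long long addr,
                               unsigned num_channels,     /* assumed power of two */
                               unsigned interleave_bytes) /* e.g. 256 */
    {
        return (unsigned)((addr / interleave_bytes) % num_channels);
    }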
Due to the lack of documentation, I don't entirely understand how it works, but it certainly exists. In my experiments, I've observed a 20% performance degradation due to unlucky memory base addresses. This problem can be rather insidious: depending on the allocation sizes, the degradation may appear to occur randomly. Sometimes the default alignment of the memory allocator is too clever for its own good - when every array's base address is aligned to a large size, the chance of channel/bank conflicts increases, sometimes to the point of happening 100% of the time. I also found that allocating one large pool of memory and then adding manual offsets to "misalign" the smaller arrays away from the same channel/bank helps mitigate the problem, as sketched below.
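Here is a minimal sketch of that pool trick, with my own naming and an illustrative skew value (the skew that actually helps is hardware-dependent and has to be found by experiment):

    #include <stddef.h>

    /* Place n equally-sized arrays inside one large pool, skewing each base
     * by an extra multiple of 'skew' bytes (for example the reported bank
     * width) so the base addresses don't all land on the same channel/bank.
     * For device memory, the same arithmetic is applied to byte offsets into
     * a single large buffer, which are then passed to the kernels. */
    static void layout_arrays(size_t bytes_each, size_t skew,
                              size_t *offsets, int n)
    {
        for (int i = 0; i < n; ++i)
            offsets[i] = (size_t)i * (bytes_each + skew);
        /* required pool size: n * (bytes_each + skew) bytes */
    }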
The memory interleaving pattern itself can be tricky. For example, AMD's manual says the Radeon HD 79XX-series GPUs have 12 memory channels - not a power of 2, so the channel mapping is far from intuitive without documentation, since it cannot be deduced from the memory address bits alone. Unfortunately, I found this is often poorly documented by the GPU vendors, so some trial-and-error may be required. For example, AMD's OpenCL optimization manual only covers GCN hardware and provides no information for anything newer than the Radeon HD 7970 - information about later GCN GPUs with HBM VRAM such as Vega, or about the newer RDNA/CDNA architectures, is completely absent. However, AMD provides OpenCL extensions that report the channel and bank sizes of the hardware, which may help with experiments. On my Radeon VII / Instinct MI50, they're:
Global memory channels (AMD) 128
Global memory banks per channel (AMD) 4
Global memory bank width (AMD) 256 bytes
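These numbers are reported through the cl_amd_device_attribute_query extension. A minimal host-side query (error handling and extension checks omitted) might look like this, assuming the CL_DEVICE_GLOBAL_MEM_*_AMD tokens from CL/cl_ext.h are available:

    #include <stdio.h>
    #include <CL/cl.h>
    #include <CL/cl_ext.h>  /* CL_DEVICE_GLOBAL_MEM_*_AMD tokens */

    /* Print the AMD-specific global memory channel/bank properties of a
     * device; only meaningful when cl_amd_device_attribute_query is supported. */
    static void print_amd_mem_info(cl_device_id dev)
    {
        cl_uint channels = 0, banks = 0, bank_width = 0;
        clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_CHANNELS_AMD,
                        sizeof channels, &channels, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_CHANNEL_BANKS_AMD,
                        sizeof banks, &banks, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_CHANNEL_BANK_WIDTH_AMD,
                        sizeof bank_width, &bank_width, NULL);
        printf("channels %u, banks/channel %u, bank width %u bytes\n",
               channels, banks, bank_width);
    }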
The huge number of channels is likely a result of the 4096-bit HBM2 memory.
AMD's Optimization Manual
AMD's old AMD APP SDK OpenCL Optimization Guide provides the following explanation:
2.1 Global Memory Optimization
[...] If two memory access requests are directed to the same controller, the hardware
serializes the access. This is called a channel conflict. Similarly, if two memory
access requests go to the same memory bank, hardware serializes the access.
This is called a bank conflict. From a developer’s point of view, there is not much
difference between channel and bank conflicts. Often, a large power of two stride
results in a channel conflict. The size of the power of two stride that causes a specific type of conflict depends on the chip. A stride that results in a channel
conflict on a machine with eight channels might result in a bank conflict on a
machine with four.
In this document, the term bank conflict is used to refer to either kind of conflict.
2.1.1 Channel Conflicts
The important concept is memory stride: the increment in memory address,
measured in elements, between successive elements fetched or stored by
consecutive work-items in a kernel. Many important kernels do not exclusively
use simple stride one accessing patterns; instead, they feature large non-unit
strides. For instance, many codes perform similar operations on each dimension
of a two- or three-dimensional array. Performing computations on the low
dimension can often be done with unit stride, but the strides of the computations
in the other dimensions are typically large values. This can result in significantly
degraded performance when the codes are ported unchanged to GPU systems.
A CPU with caches presents the same problem, large power-of-two strides force
data into only a few cache lines.
One solution is to rewrite the code to employ array transpositions between the
kernels. This allows all computations to be done at unit stride. Ensure that the
time required for the transposition is relatively small compared to the time to
perform the kernel calculation.
For many kernels, the reduction in performance is sufficiently large that it is
worthwhile to try to understand and solve this problem.
In GPU programming, it is best to have adjacent work-items read or write
adjacent memory addresses. This is one way to avoid channel conflicts.
When the application has complete control of the access pattern and address
generation, the developer must arrange the data structures to minimize bank
conflicts. Accesses that differ in the lower bits can run in parallel; those that differ
only in the upper bits can be serialized.
In this example:
for (ptr=base; ptr<max; ptr += 16KB)
    R0 = *ptr ;
where the lower bits are all the same, the memory requests all access the same bank on the same channel and are processed serially. This is a low-performance pattern to be avoided. When the stride is a power of
2 (and larger than the channel interleave), the loop above only accesses one
channel of memory.
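To make the stride discussion concrete, here is a hedged OpenCL sketch (kernel and parameter names are mine): walking down a column of a row-major 2D array gives consecutive work-items a large power-of-two stride, while walking along a row gives them unit stride.

    /* Assume width is a large power of two, e.g. 4096 floats = 16 KB per row. */

    /* Conflict-prone: consecutive work-items read addresses 16 KB apart, so
     * their requests differ only in the upper address bits and tend to pile
     * onto the same channel/bank. */
    __kernel void read_column(__global const float *a, __global float *out,
                              int width, int col)
    {
        int row = get_global_id(0);
        out[row] = a[row * width + col];   /* stride of 'width' elements */
    }

    /* Preferred: consecutive work-items read consecutive addresses. */
    __kernel void read_row(__global const float *a, __global float *out,
                           int width, int row)
    {
        int col = get_global_id(0);
        out[col] = a[row * width + col];   /* unit stride */
    }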
It's also worth noting that distributing memory accesses across all channels does not always help performance; it can degrade it instead. AMD points out that it can actually be better for a single workgroup to keep hitting the same memory channel/bank - since the GPU runs many workgroups simultaneously, ideal interleaving is still achieved across the whole device. Conversely, a single workgroup spreading its accesses over multiple channels/banks can degrade performance.
If every work-item in a work-group references consecutive memory addresses
and the address of work-item 0 is aligned to 256 bytes and each work-item
fetches 32 bits, the entire wavefront accesses one channel. Although this seems
slow, it actually is a fast pattern because it is necessary to consider the memory
access over the entire device, not just a single wavefront.
[...]
At any time, each compute unit is executing an instruction from a single
wavefront. In memory intensive kernels, it is likely that the instruction is a
memory access. Since there are 12 channels on the AMD Radeon HD 7970
GPU, at most 12 of the compute units can issue a memory access operation in
one cycle. It is most efficient if the accesses from 12 wavefronts go to different
channels. One way to achieve this is for each wavefront to access consecutive
groups of 256 = 64 * 4 bytes. Note, as shown in Figure 2.1, fetching 256 * 12
bytes in a row does not always cycle through all channels.
An inefficient access pattern is if each wavefront accesses all the channels. This
is likely to happen if consecutive work-items access data that has a large power
of two strides.
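In other words, the recommended layout is just the ordinary coalesced pattern viewed per wavefront: with a work-group size of 64 (one GCN wavefront) and 32-bit loads, every wavefront covers one consecutive 256-byte block and different wavefronts fall on different blocks, so the channels end up covered evenly across the whole device. A minimal sketch under those assumptions:

    /* Launched with a local size of 64 (one GCN wavefront). Each wavefront
     * loads one consecutive 256-byte block (64 work-items * 4 bytes); across
     * the many wavefronts running simultaneously the channels are hit evenly,
     * even though any single wavefront touches only one channel per fetch. */
    __kernel void scale(__global const float *in, __global float *out, float k)
    {
        size_t i = get_global_id(0);   /* = group_id * 64 + local_id */
        out[i] = k * in[i];
    }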
Read the original manual for more hardware implementation details, which are omitted here.