Why is it possible to have bank conflicts with shared memory, but not in global memory?
Bank conflicts and channel conflicts do in fact exist for global memory accesses. Maximum global memory bandwidth is only achieved when the memory channels and banks are accessed evenly, in a round-robin manner. For linear accesses to a single 1D array, the memory controller is usually designed to interleave the requests across each bank and channel evenly and automatically. However, when multiple 1D arrays (or different rows of a multi-dimensional array) are accessed at the same time, and their base addresses happen to be multiples of the size of a memory channel or bank, the interleaving can become imperfect. In this case one channel or bank is hit harder than the others, memory accesses are partially serialized, and the available global memory bandwidth drops.
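As a rough illustration (this simplified mapping is my own assumption, not something taken from vendor documentation): with a 256-byte channel interleave and a power-of-two number of channels, the channel is effectively selected by the address bits just above the interleave, so two arrays whose base addresses differ by a multiple of num_channels * 256 bytes start on the same channel.

    /* Hypothetical, simplified address-to-channel mapping. Real hardware
     * (particularly with a non-power-of-two channel count, or with address
     * hashing) uses a more complicated scheme. */
    static unsigned channel_of(unsigned long long addr,
                               unsigned num_channels,     /* assumed power of two */
                               unsigned interleave_bytes) /* e.g. 256 */
    {
        return (unsigned)((addr / interleave_bytes) % num_channels);
    }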
Due to the lack of documentation, I don't entirely understand how it works, but it certainly exists. In my experiments, I've observed a 20% performance degradation due to unlucky memory base addresses. This problem can be rather insidious: depending on the allocation sizes, the degradation may appear to occur randomly. Sometimes the default alignment of the memory allocator is too clever for its own good - when every array's base address is aligned to a large size, the chance of channel/bank conflicts increases, sometimes to the point of happening 100% of the time. I also found that allocating one large pool of memory and then adding manual offsets to "misalign" the smaller arrays away from the same channel/bank helps mitigate the problem, as sketched below.
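Here is a minimal sketch of that pool trick, with my own naming and an illustrative skew value (the skew that actually helps is hardware-dependent and has to be found by experiment):

    #include <stddef.h>

    /* Place n equally-sized arrays inside one large pool, skewing each base
     * by an extra multiple of 'skew' bytes (for example the reported bank
     * width) so the base addresses don't all land on the same channel/bank.
     * For device memory, the same arithmetic is applied to byte offsets into
     * a single large buffer, which are then passed to the kernels. */
    static void layout_arrays(size_t bytes_each, size_t skew,
                              size_t *offsets, int n)
    {
        for (int i = 0; i < n; ++i)
            offsets[i] = (size_t)i * (bytes_each + skew);
        /* required pool size: n * (bytes_each + skew) bytes */
    }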
The memory interleaving pattern itself can be tricky. For example, AMD's manual says the Radeon HD 79XX-series GPUs have 12 memory channels - not a power of 2, so the channel mapping is far from intuitive without documentation, since it cannot be deduced from the memory address bits alone. Unfortunately, I found this is often poorly documented by the GPU vendors, so some trial-and-error may be required. For example, AMD's OpenCL optimization manual only covers GCN hardware and provides no information for anything newer than the Radeon HD 7970 - information about later GCN GPUs with HBM VRAM such as Vega, or about the newer RDNA/CDNA architectures, is completely absent. However, AMD provides OpenCL extensions that report the channel and bank sizes of the hardware, which may help with experiments. On my Radeon VII / Instinct MI50, they're:
Global memory channels (AMD) 128
Global memory banks per channel (AMD) 4
Global memory bank width (AMD) 256 bytes
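These numbers are reported through the cl_amd_device_attribute_query extension. A minimal host-side query (error handling and extension checks omitted) might look like this, assuming the CL_DEVICE_GLOBAL_MEM_*_AMD tokens from CL/cl_ext.h are available:

    #include <stdio.h>
    #include <CL/cl.h>
    #include <CL/cl_ext.h>  /* CL_DEVICE_GLOBAL_MEM_*_AMD tokens */

    /* Print the AMD-specific global memory channel/bank properties of a
     * device; only meaningful when cl_amd_device_attribute_query is supported. */
    static void print_amd_mem_info(cl_device_id dev)
    {
        cl_uint channels = 0, banks = 0, bank_width = 0;
        clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_CHANNELS_AMD,
                        sizeof channels, &channels, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_CHANNEL_BANKS_AMD,
                        sizeof banks, &banks, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_CHANNEL_BANK_WIDTH_AMD,
                        sizeof bank_width, &bank_width, NULL);
        printf("channels %u, banks/channel %u, bank width %u bytes\n",
               channels, banks, bank_width);
    }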
The huge number of channels is likely a result of the 4096-bit HBM2 memory.
AMD's Optimization Manual
AMD's old AMD APP SDK OpenCL Optimization Guide provides the following explanation:
2.1 Global Memory Optimization
[...] If two memory access requests are directed to the same controller, the hardware
serializes the access. This is called a channel conflict. Similarly, if two memory
access requests go to the same memory bank, hardware serializes the access.
This is called a bank conflict. From a developer’s point of view, there is not much
difference between channel and bank conflicts. Often, a large power of two stride
results in a channel conflict. The size of the power of two stride that causes a specific type of conflict depends on the chip. A stride that results in a channel
conflict on a machine with eight channels might result in a bank conflict on a
machine with four.
In this document, the term bank conflict is used to refer to either kind of conflict.
2.1.1 Channel Conflicts
The important concept is memory stride: the increment in memory address,
measured in elements, between successive elements fetched or stored by
consecutive work-items in a kernel. Many important kernels do not exclusively
use simple stride one accessing patterns; instead, they feature large non-unit
strides. For instance, many codes perform similar operations on each dimension
of a two- or three-dimensional array. Performing computations on the low
dimension can often be done with unit stride, but the strides of the computations
in the other dimensions are typically large values. This can result in significantly
degraded performance when the codes are ported unchanged to GPU systems.
A CPU with caches presents the same problem, large power-of-two strides force
data into only a few cache lines.
One solution is to rewrite the code to employ array transpositions between the
kernels. This allows all computations to be done at unit stride. Ensure that the
time required for the transposition is relatively small compared to the time to
perform the kernel calculation.
For many kernels, the reduction in performance is sufficiently large that it is
worthwhile to try to understand and solve this problem.
In GPU programming, it is best to have adjacent work-items read or write
adjacent memory addresses. This is one way to avoid channel conflicts.
When the application has complete control of the access pattern and address
generation, the developer must arrange the data structures to minimize bank
conflicts. Accesses that differ in the lower bits can run in parallel; those that differ
only in the upper bits can be serialized.
In this example:
for (ptr=base; ptr<max; ptr += 16KB)
    R0 = *ptr ;
where the lower bits are all the same, the memory requests all access the same bank on the same channel and are processed serially. This is a low-performance pattern to be avoided. When the stride is a power of
2 (and larger than the channel interleave), the loop above only accesses one
channel of memory.
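To make the stride discussion concrete, here is a hedged OpenCL sketch (kernel and parameter names are mine): walking down a column of a row-major 2D array gives consecutive work-items a large power-of-two stride, while walking along a row gives them unit stride.

    /* Assume width is a large power of two, e.g. 4096 floats = 16 KB per row. */

    /* Conflict-prone: consecutive work-items read addresses 16 KB apart, so
     * their requests differ only in the upper address bits and tend to pile
     * onto the same channel/bank. */
    __kernel void read_column(__global const float *a, __global float *out,
                              int width, int col)
    {
        int row = get_global_id(0);
        out[row] = a[row * width + col];   /* stride of 'width' elements */
    }

    /* Preferred: consecutive work-items read consecutive addresses. */
    __kernel void read_row(__global const float *a, __global float *out,
                           int width, int row)
    {
        int col = get_global_id(0);
        out[col] = a[row * width + col];   /* unit stride */
    }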
It's also worth noting that distributing memory accesses across all channels does not always help performance; it can degrade it instead. AMD points out that it can actually be better for a single workgroup to keep hitting the same memory channel/bank - since the GPU runs many workgroups simultaneously, ideal interleaving is still achieved across the whole device. Conversely, a single workgroup spreading its accesses over multiple channels/banks can degrade performance.
If every work-item in a work-group references consecutive memory addresses
and the address of work-item 0 is aligned to 256 bytes and each work-item
fetches 32 bits, the entire wavefront accesses one channel. Although this seems
slow, it actually is a fast pattern because it is necessary to consider the memory
access over the entire device, not just a single wavefront.
[...]
At any time, each compute unit is executing an instruction from a single
wavefront. In memory intensive kernels, it is likely that the instruction is a
memory access. Since there are 12 channels on the AMD Radeon HD 7970
GPU, at most 12 of the compute units can issue a memory access operation in
one cycle. It is most efficient if the accesses from 12 wavefronts go to different
channels. One way to achieve this is for each wavefront to access consecutive
groups of 256 = 64 * 4 bytes. Note, as shown in Figure 2.1, fetching 256 * 12
bytes in a row does not always cycle through all channels.
An inefficient access pattern is if each wavefront accesses all the channels. This
is likely to happen if consecutive work-items access data that has a large power
of two strides.
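In other words, the recommended layout is just the ordinary coalesced pattern viewed per wavefront: with a work-group size of 64 (one GCN wavefront) and 32-bit loads, every wavefront covers one consecutive 256-byte block and different wavefronts fall on different blocks, so the channels end up covered evenly across the whole device. A minimal sketch under those assumptions:

    /* Launched with a local size of 64 (one GCN wavefront). Each wavefront
     * loads one consecutive 256-byte block (64 work-items * 4 bytes); across
     * the many wavefronts running simultaneously the channels are hit evenly,
     * even though any single wavefront touches only one channel per fetch. */
    __kernel void scale(__global const float *in, __global float *out, float k)
    {
        size_t i = get_global_id(0);   /* = group_id * 64 + local_id */
        out[i] = k * in[i];
    }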
Read the original manual for more hardware implementation details, which are omitted here.