Memory Coalescing vs. Vectorized Memory Access

I am trying to understand the relationship between memory coalescing on NVIDIA GPUs/CUDA and vectorized memory access on x86-SSE/C++.

It is my understanding that:

  • Memory coalescing is a run-time optimization of the memory controller (implemented in hardware). How many memory transactions are required to fulfill the load/store of a warp is determined at run-time. A load/store instruction of a warp may be issued repeatedly unless there is perfect coalescing.
  • Memory vectorization is a compile-time optimization. The number of memory transactions for a vectorized load/store is fixed. Each vector load/store instruction is issued exactly once.
  • Coalescable GPU load/store instructions are more expressive than SSE vector load/store instructions. E.g., a st.global.s32 PTX instruction may store into 32 arbitrary memory locations (warp size 32), whereas a movdqa SSE instruction can only store into a consecutive block of memory (see the sketch after this list).
  • Memory coalescing in CUDA seems to guarantee efficient vectorized memory access (when accesses are coalescable), whereas on x86-SSE, we have to hope that the compiler actually vectorizes the code (it may fail to do so) or vectorize code manually with SSE intrinsics, which is more difficult for programmers.
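
To make that contrast concrete, here is a rough sketch (the kernel and function names are made up for illustration): in the CUDA kernel, whether a warp's 32 loads/stores coalesce is decided at run time from the addresses the threads supply; the SSE version can only ever touch one contiguous, aligned 16-byte block per load/store, because that is all movdqa can express.

    #include <emmintrin.h>   // SSE2 intrinsics for the host-side sketch

    // CUDA device code: each thread computes its own address. With consecutive
    // idx values, a warp's 32 accesses fall into a few contiguous segments and
    // the hardware coalesces them; with a large stride the very same instruction
    // costs many more memory transactions.
    __global__ void scale2x(const int* in, int* out, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) out[idx] = 2 * in[idx];   // one ld/st.global.s32 per thread
    }

    // Host-side SSE equivalent: the "coalescing" is fixed at compile time.
    // Assumes 16-byte-aligned pointers and n being a multiple of 4.
    void scale2x_sse(const int* in, int* out, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128i v = _mm_load_si128((const __m128i*)(in + i));       // movdqa load
            _mm_store_si128((__m128i*)(out + i), _mm_add_epi32(v, v));  // 2*x as x+x, movdqa store
        }
    }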

Is this correct? Did I miss an important aspect (thread masking, maybe)?

Now, why do GPUs have run-time coalescing? This probably requires extra circuits in hardware. What are the main benefits over compile-time coalescing as in CPUs? Are there applications/memory access patterns that are harder to implement on CPUs because of missing run-time coalescing?

Elissa asked 10/7, 2019 at 8:26 Comment(3)
Characterization of memory coalescing as a "runtime optimization" isn't really correct. It is much more like the default execution model of the GPU. Divergence is what happens whenever an instruction can't be executed in lock-step fashion across a warp, and that is what happens when a memory instruction can't be serviced in a single transaction. – Autotruck
Coalescing takes place (or not) as a function of the addresses supplied by each thread in the warp to the LD/ST unit for a given LD or ST instruction. The memory controller looks at the actual addresses and determines which can be grouped together into specific cache lines or memory segments, and then issues the requests to those lines/segments. Since the addresses indicated by each thread in the warp cannot be known until runtime (in the general case), this activity cannot be pre-computed at compile time. – Etom
The addressing flexibility afforded by memory coalescing (as compared to your movdqa example) allows the CUDA programmer to write arbitrary thread code and expect functionally correct results. There is presumably some value to this. The programmer is allowed to do inefficient things for ease of programming, but has a roadmap to maximum performance/efficiency of use of the memory subsystem. Giving the programmer both options is considered to be of value. – Etom

caveat: I don't really know / understand the architecture / microarchitecture of GPUs very well. Some of this understanding is cobbled together from the question + what other people have written in comments / answers here.

The way GPUs let one instruction operate on multiple data is very different from CPU SIMD. That's why they need special support for memory coalescing at all. CPU-SIMD can't be programmed in a way that needs it.

BTW, CPUs have cache to absorb multiple accesses to the same cache line before the actual DRAM controllers get involved. GPUs have cache too, of course.


Yes, memory-coalescing basically does at runtime what short-vector CPU SIMD does at compile time, within a single "core". The CPU-SIMD equivalent would be gather/scatter loads/stores that could optimize to a single wide access to cache for indices that were adjacent. Existing CPUs don't do this: each element accesses cache separately in a gather. You shouldn't use a gather load if you know that many indices will be adjacent; it will be faster to shuffle 128-bit or 256-bit chunks into place. For the common case where all your data is contiguous, you just use a normal vector load instruction instead of a gather load.
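
For example, assuming AVX2 (a minimal sketch; the function names are made up): the gather issues one cache access per element, while the contiguous load is a single access, even though both return the same eight ints when the indices happen to be i, i+1, ..., i+7.

    #include <immintrin.h>

    // Gather: 8 arbitrary 32-bit indices; each element accesses L1d separately.
    __m256i gather8(const int* base, __m256i vindex) {
        return _mm256_i32gather_epi32(base, vindex, 4);   // scale = sizeof(int)
    }

    // If you know the indices are consecutive, a plain contiguous load is one
    // access to L1d instead of eight.
    __m256i load8(const int* base, int i) {
        return _mm256_loadu_si256((const __m256i*)(base + i));
    }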

The point of modern short-vector CPU SIMD is to feed more work through a fetch/decode/exec pipeline without making it wider in terms of having to decode + track + exec more CPU instructions per clock cycle. Making a CPU pipeline wider quickly hits diminishing returns for most use-cases, because most code doesn't have a lot of ILP.

A general-purpose CPU spends a lot of transistors on instruction-scheduling / out-of-order execution machinery, so just making it wider to be able to run many more uops in parallel isn't viable. (https://electronics.stackexchange.com/questions/443186/why-not-make-one-big-cpu-core).

To get more throughput, we can raise the frequency, raise IPC, and use SIMD to do more work per instruction/uop that the out-of-order machinery has to track. (And we can build multiple cores on a single chip, but cache-coherent interconnects between them + L3 cache + memory controllers are hard). Modern CPUs use all of these things, so we get a total throughput capability of frequency * IPC * SIMD, and times number of cores if we multithread. They aren't viable alternatives to each other, they're orthogonal things that you have to do all of to drive lots of FLOPs or integer work through a CPU pipeline.
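
As a rough, hypothetical back-of-the-envelope figure: a core at 4 GHz with two 256-bit FMA units can do 2 × 8 × 2 = 32 single-precision FLOPs per cycle, i.e. 128 GFLOP/s; eight such cores give about 1 TFLOP/s. Drop any one factor (frequency, per-clock issue, SIMD width, or core count) and the total shrinks proportionally.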

This is why CPU SIMD has wide fixed-width execution units, instead of a separate instruction for each scalar operation. There isn't a mechanism for one scalar instruction to flexibly be fed to multiple execution units.

Taking advantage of this requires vectorization at compile time, not just of your loads / stores but also your ALU computation. If your data isn't contiguous, you have to gather it into SIMD vectors either with scalar loads + shuffles, or with AVX2 / AVX512 gather loads that take a base address + vector of (scaled) indices.
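
A tiny sketch of the "scalar loads + shuffles" option (pack4 is a made-up name): the compiler turns _mm_set_epi32 into individual loads plus shuffles/inserts, and nothing at run time would notice if the four indices happened to be consecutive and merge them into one wide load.

    #include <emmintrin.h>

    // Build a 4 x int32 vector from non-contiguous elements.
    __m128i pack4(const int* a, int i0, int i1, int i2, int i3) {
        return _mm_set_epi32(a[i3], a[i2], a[i1], a[i0]);   // element 0 = a[i0]
    }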


But GPU SIMD is different. It's for massively parallel problems where you do the same thing to every element. The "pipeline" can be very lightweight because it doesn't need to support out-of-order exec or register renaming, or especially branching and exceptions. This makes it feasible to just have scalar execution units without needing to handle data in fixed chunks from contiguous addresses.

These are two very different programming models. They're both SIMD, but the details of the hardware that runs them are very different.


Each vector load/store instruction is issued exactly once.

Yes, that's logically true. In practice the internals can be slightly more complicated, e.g. AMD Ryzen splitting 256-bit vector operations into 128-bit halves, or Intel Sandybridge/IvB doing that for just loads+stores while having 256-bit wide FP ALUs.

There's a slight wrinkle with misaligned loads/stores on Intel x86 CPUs: on a cache-line split, the uop has to get replayed (from the reservation station) to do the other part of the access (to the other cache line).

In Intel terminology, the uop for a split load gets dispatched twice, but only issues + retires once.

Aligned loads/stores like movdqa, or movdqu when the memory happens to be aligned at runtime, are just a single access to L1d cache (assuming a cache hit). Unless you're on a CPU that decodes a vector instruction into two halves, like AMD for 256-bit vectors.
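
A minimal illustration, assuming 64-byte cache lines (the buffer and the offset 56 are arbitrary): the 16-byte load below starts 8 bytes before a line boundary, so it spans two lines; movdqa at this address would fault on alignment, and the movdqu load uop gets dispatched twice as described above.

    #include <emmintrin.h>

    alignas(64) static char buf[128];

    // Loads bytes 56..71: the access crosses from the first 64-byte line into
    // the second one.
    __m128i split_load() {
        return _mm_loadu_si128((const __m128i*)(buf + 56));
    }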


But that stuff is purely inside the CPU core, for access to L1d cache. CPU <-> memory transactions are in whole cache lines, with write-back private L1d / L2 caches and a shared L3 on modern x86 CPUs - see "Which cache mapping technique is used in intel core i7 processor?". (Intel has had a shared L3 since Nehalem, the start of the i3/i5/i7 series; AMD, I think, since Bulldozer.)

In a CPU, it's the write-back L1d cache that basically coalesces transactions into whole cache lines, whether you use SIMD or not.
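
A tiny illustration, assuming p is 64-byte aligned: the sixteen scalar stores below all land in the same 64-byte line in L1d, and that line is written back to memory at most once, when it is eventually evicted - no SIMD involved.

    // Fill one 64-byte cache line with plain scalar stores.
    void fill_line(int* p) {   // p assumed 64-byte aligned
        for (int i = 0; i < 16; ++i)
            p[i] = i;
    }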

What SIMD helps with is getting more work done inside the CPU, to keep up with faster memory. Or for problems where the data fits in L2 or L1d cache, to go really fast over that data.

Waller answered 10/7, 2019 at 8:46 Comment(0)

Memory coalescing is related to parallel accesses: when the cores in an SM access consecutive memory locations, the memory accesses are combined and optimized.

Conversely, SIMD is a single-core optimization: a vector register is filled with operands and an SSE operation is performed; the parallelism is inside the CPU core, with one operation performed on each internal logic unit per clock cycle.

However, you are right: coalesced/uncoalesced memory access is a runtime aspect, while SIMD operations are compiled in, so I don't think the two compare well.

If I were to draw a parallel, I would compare coalescing in GPUs to memory prefetching in CPUs. Prefetching is a very important runtime optimization as well, and I believe it is active behind the scenes when using SSE, too.

However, there is nothing similar to coalescing in Intel CPU cores. Because of cache coherency, the best you can do when optimizing parallel memory accesses is to let each core access independent memory regions.

Now, why do GPUs have run-time coalescing?

Graphics processing is optimized for executing a single task in parallel on adjacent elements.

For example, think of performing an operation on every pixel of an image, assigning each pixel to a different core. It is clear that you want an optimal path for loading the image that hands one pixel to each core.
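
A minimal sketch of such a per-pixel kernel (the name brighten and the +32 offset are just for illustration): adjacent threads in x read adjacent pixels, so each warp's 32 byte-loads fall into a few contiguous segments that the hardware can coalesce.

    __global__ void brighten(const unsigned char* in, unsigned char* out,
                             int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive x within a warp
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            int i = y * width + x;   // row-major: adjacent threads -> adjacent bytes
            out[i] = (unsigned char)min(in[i] + 32, 255);
        }
    }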

That's why memory coalescing is built so deeply into GPU architectures.

Travelled answered 10/7, 2019 at 9:53 Comment(16)
This answer starts out well, but then makes some strange analogies / comparisons / conclusions. It's just not plausible that the HW prefetchers took up 50% of the total die area of a Pentium III or Pentium 4, or any other modern out-of-order x86. They have significant amounts of on-die cache (which probably takes at least 50% of the die area). Maybe that's what you were thinking of? Because HW prefetch does fetch into cache. If you don't count cache area, OoO machinery and SIMD execution units (MMX + SSE1) take up significant area, and so does the front-end / decoders. – Waller
I think the analogy to HW prefetching doesn't work at all. That's a separate concept. Your description of coalescing makes me think it is just the GPU equivalent of CPU-SIMD vector load/store. But GPUs expose each scalar execution unit separately, while CPU SIMD hides 4x 32-bit adders behind one vector add instruction. So you have to do a vector load + vector add instruction. GPU coalescing just lets 4x load + 4x add turn into a wide load instead of 4 accesses, but CPU SIMD never has that problem in the first place because you have to vectorize at compile time. – Waller
If you say x86-SIMD is a single-core optimization, I would argue that memory coalescing is also a single-core optimization. Even though every GPU SM consists of many CUDA cores, let's say 128 CUDA cores as an example, it is my understanding that in hardware we have only 128/32 = 4 physical cores. And each full core just operates on a very large register (128 bytes, large enough for 32 scalars). I.e., we have a program counter, instruction decoder, etc. only for physical cores and not for every CUDA core. – Elissa
It obviously doesn't make sense that the prefetchers occupy anywhere close to 50% of the die area (more like less than 1%). Early Itanium processors didn't include hardware prefetchers because the VLIW architecture itself emphasizes compiler optimizations. Also, Itanium has better support for software prefetching compared to x86 CPUs (in particular, implicit prefetching). All Itanium processors since Poulson (2012) include a hardware data prefetcher. – Toponymy
@PeterCordes 50% of silicon die for the prefetcher was something I really thought I remembered from some time ago - so I will either find a reference for it or fix my answer: it really was a long time ago, actually! On the other hand, as the OP said, coalescing is a runtime/hw optimization: there is no assembly instruction for it; it's performed by the load/store units. If independent cores within an SM access independent memory locations, those accesses are serialized into independent transactions. If the load/store unit recognizes a pattern at runtime, it is able to coalesce them into one single transaction. – Travelled
In this it's very similar to HW prefetchers: if the hw is able to apply a regular access pattern to pre-fetch cache lines, this will result in an optimized load of data elements. Clearly what is a "time" optimization for a sequential core is a "spatial" optimization for GPUs, where many cores are involved in facing the same problem: hiding memory latency. – Travelled
Ok, I see why you're connecting them, because they benefit the same access pattern. But it's a totally separate microarchitectural mechanism that benefits it in a different way. As far as understanding how they work under the hood, there's zero connection. – Waller
@MatthiasSpringer: Yes, memory-coalescing basically does at runtime what short-vector CPU SIMD does at compile time, within a single "core". That's because CPU SIMD is done very differently in general, where you have wide fixed-width execution units, rather than GPU-style multiple execution units that can each handle one scalar and be flexibly driven for data that's contiguous or not. – Waller
@MatthiasSpringer the point is that there is a load/store unit for every GPU SM, serving all the warps/cores. So all the threads executed by a warp can access memory with coalesced accesses. Whether this happens or not depends on the card being used, on how many cores are in an SM, and on the nvidia architecture (e.g., on G80 only regular accesses were coalesced, but this condition has been relaxed in newer architectures). I cannot see how memory coalescing is a single-core optimization. – Travelled
@MatthiasSpringer: Updated my answer to better explain why CPU SIMD has to hand off vectorization to the compiler, instead of just doing it at runtime when data happens to be contiguous. It's because CPUs are designed to run one instruction stream of arbitrary instructions including branches, and SIMD is a way to feed more work through that instruction pipeline without widening it, orthogonal to superscalar / out-of-order exec. GPUs are designed from the ground up to only be good at this kind of workload, so their pipeline overhead vs. execution units + cache is tiny compared to CPUs. – Waller
@PeterCordes: The question is how do we define "core" on GPUs? NVIDIA tells us that a Titan Xp has 3840 CUDA cores. But these cores are different from CPU cores. To explain certain performance characteristics of GPUs, I like to think of a Titan Xp as having only 120 physical cores, each operating on 128-byte vector registers. In fact, I believe that this is how it is implemented in hardware (not sure about it though). CUDA just gives us the illusion of having 3840 cores, i.e., this is only a property of the programming model. Then, memory coalescing is a single-(physical)core optimization. – Elissa
@PeterCordes Why illusion? They are physical. CUDA doesn't guarantee to use all of them (see occupancy) because of restrictions (available registers being one of them, if not the most important). So you have 120 SMs, each with common load/store units capable of optimizing memory accesses when instructions (by means of warps) concurrently access consecutive memory locations (and not only those). I think you are confusing the SIMT programming model (that's an abstraction with the illusion of having many more cores than 3840, dispatching threads over cores) with the physical hardware. – Travelled
@PeterCordes images.nvidia.com/content/pdf/tesla/whitepaper/… – Travelled
Sigismondo: I think you meant to reply to @MatthiasSpringer re: "illusion". – Waller
@Sigismondo: Let me phrase it this way. Why does NVIDIA call them CUDA cores and not just cores? CUDA is a programming language (software) and I would not expect this kind of terminology in the description of the hardware. Now, on a GPU, what does most closely resemble what we call a core on CPUs? I would say not a single CUDA core but a group of 32 CUDA cores (warp). Mainly because a single CUDA core does not have a control unit (for fetch/decode/execute) but a warp does. I do not fully understand the hardware, though, so there may be flaws in this analogy. – Elissa
Independently of how they call them and why, the software counterpart of cores are threads, and NVidia is consistent with this classical naming scheme. A warp, on the other hand, is a group of 32 threads and is a software component. The corresponding hardware component is the "warp scheduler", which takes care of dispatching the threads in warps onto the cores. I don't see the need to draw a parallel in the "completeness" or not of the CPU/GPU cores. A "core" is a "computational core". You can find a lot of information on the hardware in the whitepaper for the Pascal architecture (Titan Xp's). – Travelled
