What you assume from the context is correct.
On Intel and nVidia GPUs, the hardware SIMD width is 32. On AMD it’s often 64, but on newer AMD GPUs it can also be 32. The approach helps with power consumption, and therefore with performance, because GPU cores share the transistors doing instruction fetch and decode across these 32 or 64 threads. The current instruction pointer is also shared across the complete wavefront.
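For example, here’s what that shared instruction pointer means for branches; the function below is made up, just to illustrate:

    // Because the instruction pointer is shared by the whole wavefront, a divergent
    // branch can't send lanes to different places. Instead the hardware runs both
    // sides one after another, masking off the lanes that didn't take each side.
    float shade( float value )  // made-up function, for illustration only
    {
        if( value > 0.5f )
            value = sqrt( value );      // runs with the "else" lanes masked off
        else
            value = value * value;      // then this runs with the "if" lanes masked off
        return value;
    }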
On the hardware level, GPUs actually have that many execution units. With a few exceptions like FP64 math instructions, lanes in these SIMD registers are computed in parallel by different execution units. GPU cores are missing many pieces found in CPU ones: GPUs don’t do branch prediction, speculative execution, or instruction reordering. Their RAM access is much simpler because it’s optimized for throughput and doesn’t care too much about latency, and their cache coherency guarantees are very limited. That’s how they can afford to spend a much larger percentage of their transistors on the execution units that actually compute stuff. For instance, my old 1080Ti GPU has 12 billion transistors and 3584 shader units (organized into 28 cores; when doing FP32 math, each one can handle 4 wavefronts = 128 threads in parallel), and delivers up to 11 TFlops FP32. My CPU has about the same transistor count, but only delivers up to 1 TFlops FP32.
On recent hardware (feature level 12.2), these wavefronts are even exposed to programmers through wave intrinsics, for both pixel and compute shaders.
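In HLSL that’s Shader Model 6.0 and newer. A minimal compute shader sketch, with made-up buffer names:

    // Minimal HLSL compute shader using wave intrinsics; buffer names are made up.
    StructuredBuffer<float> sourceData : register( t0 );
    RWStructuredBuffer<float> waveSums : register( u0 );

    [numthreads( 64, 1, 1 )]
    void main( uint3 id : SV_DispatchThreadID )
    {
        uint laneCount = WaveGetLaneCount();             // 32 on nVidia and Intel, 32 or 64 on AMD
        float sum = WaveActiveSum( sourceData[ id.x ] ); // horizontal add across the whole wavefront
        if( WaveIsFirstLane() )                          // one lane per wavefront writes the reduced value
            waveSums[ id.x / laneCount ] = sum;
    }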
For compute shaders things are straightforward. If you write [numthreads( 64, 1, 1 )] and dispatch a thread count that’s a multiple of 64, each thread group of the compute shader will run as 2 wavefronts on nVidia and 1 wavefront on AMD. If you dispatch that shader with a thread count that’s not a multiple of 64, the last wavefront will contain fewer threads, and some of its lanes will be inactive. GPUs maintain a bit mask of active threads in each running wavefront.
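Here’s roughly what that looks like in practice, with made-up names; I’m assuming the common pattern of rounding the group count up on the CPU side and discarding the extra threads in the shader:

    // The usual pattern when the element count is not a multiple of 64. The C++
    // side would dispatch ( elementCount + 63 ) / 64 thread groups; the extra
    // threads of the last group exit early and their lanes become inactive.
    cbuffer Constants : register( b0 )
    {
        uint elementCount;   // made-up constant buffer field
    };
    RWStructuredBuffer<float> data : register( u0 );

    [numthreads( 64, 1, 1 )]
    void main( uint3 id : SV_DispatchThreadID )
    {
        if( id.x >= elementCount )
            return;   // these lanes are masked off for the rest of the wavefront
        // That bit mask of active threads is visible through wave intrinsics:
        uint4 activeMask = WaveActiveBallot( true );     // one bit per active lane
        uint activeCount = WaveActiveCountBits( true );  // equals the wave size for full wavefronts, fewer in the last one
        data[ id.x ] *= 2.0f;
    }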
For pixel shaders things are less straightforward because GPUs need partial derivatives.
For this reason, pixel shader wavefronts are organized as 2x2 squares. Pixels outside of triangles are computed as usual, but their output values aren’t written anywhere. And wave intrinsics for pixel shaders include functions to read from other pixels of these 2x2 squares.
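For example, something like this; the texture and sampler names are made up:

    // Pixel shader reading from its neighbours within the 2x2 quad;
    // requires Shader Model 6.0 for the quad intrinsics.
    Texture2D tex : register( t0 );
    SamplerState smp : register( s0 );

    float4 main( float2 uv : TEXCOORD0 ) : SV_Target
    {
        float v = tex.Sample( smp, uv ).x;
        float acrossX = QuadReadAcrossX( v );  // value of the horizontally adjacent pixel of the quad
        float acrossY = QuadReadAcrossY( v );  // value of the vertically adjacent pixel of the quad
        // The partial derivatives the hardware needs are essentially these same differences:
        float dx = ddx( v );
        return float4( v, acrossX - v, acrossY - v, dx );
    }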
For vertex shaders and the rest of them, how things are assigned to wavefronts is a moot point. Not only is it implementation dependent, it even depends on things besides the GPU model and driver. If there’s a geometry shader down the pipeline after the VS, GPUs organize the work in such a way that the outputs of the vertex shader stay in on-chip memory before being passed to the geometry shader. The same applies to tessellation shaders. Also, most real-life meshes are indexed; GPUs are aware of that, and they have a cache for transformed vertices. The count of vertex shader calls per vertex depends on the size of that cache and on the mesh topology in the index buffer. GPUs do whatever they can to avoid marshalling data between shader stages through external VRAM. At their scale, the external memory is very expensive to access, in terms of both latency and electricity.
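To illustrate the vertex reuse part, here’s a made-up example with an indexed quad:

    // Ordinary pass-through vertex shader; the matrix name is made up.
    // With an index buffer like { 0, 1, 2,  2, 1, 3 } for a two-triangle quad,
    // vertices 1 and 2 are referenced twice, but thanks to the post-transform
    // cache this shader may only run 4 times instead of 6.
    cbuffer Transform : register( b0 )
    {
        float4x4 viewProj;
    };

    float4 main( float3 position : POSITION ) : SV_Position
    {
        return mul( viewProj, float4( position, 1.0f ) );
    }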