Difference Between Calling numthreads and Dispatch in a Unity Compute Shader

Asked 22/7, 2020 at 12:45 Answered 25/7, 2020 at 3:53

Solved unity-game-engine shader hlsl compute-shader

Hypothetically, say I wanted to use a compute shader to run Kernel_X using thread dimensions of (8, 1, 1).

I could set it up as:

In Script:

Shader.Dispatch(Kernel_X, 8, 1, 1);

In Shader:

[numthreads(1,1,1)]
void Kernel_X(uint id : SV_DispatchThreadID) { ... }

or I could set it up like this:

In Script:

Shader.Dispatch(Kernel_X, 1, 1, 1);

In Shader:

[numthreads(8,1,1)]
void Kernel_X(uint id : SV_DispatchThreadID) { ... }

I understand that in both cases the total dimensions come out to (8, 1, 1); however, I was wondering how the two setups actually differ from each other. My guess would be that calling Dispatch(Kernel_X, 8, 1, 1) runs a 1x1x1 kernel 8 times, while numthreads(8,1,1) runs an 8x1x1 kernel once.

Volscian answered 22/7, 2020 at 12:45 Comment(0)

To understand the difference, a bit of hardware knowledge is required:

Internally, a GPU works on so-called wave fronts (called warps on NVIDIA hardware), which are SIMD-style processing units: a group of threads where each thread can have its own data, but all of them have to execute the exact same instruction at the exact same time, always. The number of threads per wave front is hardware dependent, but is usually either 32 (NVIDIA) or 64 (AMD).

Now, with [numthreads(8,1,1)] you request a shader thread group size of 8 x 1 x 1 = 8 threads, which the hardware is free to distribute among its wave fronts. So, with 32 threads per wave front, the hardware would schedule one wave front per shader group, with 8 active threads in that wave front (the other 24 threads are "inactive", meaning they step through the same instructions, but any memory writes they make are discarded). Then, with Dispatch(1, 1, 1), you are dispatching one such shader group, meaning there will be one wave front running on the hardware.

If you used [numthreads(1,1,1)] instead, only one thread in a wave front could be active. So, by calling Dispatch(8, 1, 1) on that kernel, the hardware would have to run 8 shader groups (= 8 wave fronts), each one with just 1 of its 32 threads active. You would get the same result, but you would waste a lot more computational power.

So, in general, for optimal performance you want shader group sizes that are multiples of 32 (or 64), while calling Dispatch with numbers as low as reasonably possible.
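
To put rough numbers on the explanation above, here is a minimal sketch in plain C# (no Unity required). The class and method names are hypothetical, and the wave front size of 32 is an assumption; real hardware may use 64 and may schedule groups differently than this simple model suggests.

```csharp
using System;

static class WaveOccupancy
{
    // Wave fronts needed to hold one thread group of the given size
    // (integer ceiling of groupSize / waveSize).
    public static int WavesPerGroup(int groupSize, int waveSize)
        => (groupSize + waveSize - 1) / waveSize;

    // Fraction of SIMD lanes that do useful work for one thread group.
    public static double LaneUtilization(int groupSize, int waveSize)
        => (double)groupSize / (WavesPerGroup(groupSize, waveSize) * waveSize);

    static void Main()
    {
        const int waveSize = 32; // assumption: 32 lanes per wave front

        foreach (int groupSize in new[] { 1, 8, 32, 64 })
        {
            Console.WriteLine(
                $"numthreads({groupSize},1,1): " +
                $"{WavesPerGroup(groupSize, waveSize)} wave front(s) per group, " +
                $"utilization {LaneUtilization(groupSize, waveSize)}");
        }
    }
}
```

Under this model, a group of 1 thread keeps 1 of 32 lanes busy, a group of 8 keeps a quarter of them busy, and groups of 32 or 64 fill their wave fronts completely.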

Auberge answered 25/7, 2020 at 3:53 Comment(3)

If I only take your description, why should I ever use something else than Dispatch(1, 1, 1) and numthreads(a * 32, b * 32, c * 32) for the a, b, c that are suitable for me? In fact, say the compute shader is simply rescaling the values of a texture2d, wouldn't it be best to set numthreads(width, height, 1), where width and height are the width and height of the texture? I asked a separate question for this: https://mcmap.net/q/1922137/-how-to-properly-parallelize-rescaling-of-a-texture-in-a-compute-shader/547231. – Shellbark 4/4 at 18:25

(I know that there is a thread limit of 1024 (cs_5_0), but conceptually I don't get why I shouldn't always set numthreads to be numthreads(32, 32, 1), for example) – Shellbark 4/4 at 18:43

@Shellbark The main reason is that the input to the Dispatch function can be a variable that is decided at runtime, while the values inside numthreads are constants decided when the shader is compiled. And as long as the group size is a multiple of the wave front size, there is no performance benefit to larger group sizes (in fact, if the shader uses group memory barriers or group shared memory, there can even be a slight penalty if too few, too large groups are dispatched) – Auberge 19/4 at 8:32

The Dispatch() call determines the number of thread groups you are invoking. This way you invoke 8 times 1 times 1 = 8 groups.

Shader.Dispatch(Kernel_X, 8, 1, 1);

And in the shader the [numthreads] tag specifies the size of the thread groups. This for example declares 8 times 1 times 1 = 8 threads for every group.

[numthreads(8,1,1)]
void Kernel_X(uint id : SV_DispatchThreadID) { }

If you want to achieve a total of 8 threads, you can invoke a single group with 8 threads per group, or 8 groups with a single thread per group. The end result is going to be the same, though the performance is not. Usually, you want a thread group size that is a power of 2; on NVIDIA cards you usually set it to at least 32, while AMD cards are optimized for at least 64 threads per group.
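
As a minimal sketch of why the results match, the following plain C# snippet (hypothetical names, not Unity API) enumerates the x component of SV_DispatchThreadID for a one-dimensional dispatch, using the HLSL rule that the dispatch thread ID equals the group ID times the group size plus the thread ID within the group.

```csharp
using System;
using System.Collections.Generic;

static class DispatchIds
{
    // Lists the x component of SV_DispatchThreadID for a 1D dispatch:
    // id = groupId * threadsPerGroup + threadIdInGroup.
    public static List<int> ThreadIds(int groupCount, int threadsPerGroup)
    {
        var ids = new List<int>();
        for (int group = 0; group < groupCount; group++)
            for (int local = 0; local < threadsPerGroup; local++)
                ids.Add(group * threadsPerGroup + local);
        return ids;
    }

    static void Main()
    {
        // Dispatch(8, 1, 1) with [numthreads(1,1,1)]
        Console.WriteLine(string.Join(",", ThreadIds(8, 1))); // 0,1,2,3,4,5,6,7

        // Dispatch(1, 1, 1) with [numthreads(8,1,1)]
        Console.WriteLine(string.Join(",", ThreadIds(1, 8))); // 0,1,2,3,4,5,6,7
    }
}
```

Both configurations hand the kernel the same eight IDs; only the way those threads are grouped on the hardware differs.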

Btw, you usually dispatch far more than 8 threads, since it's rather pointless to write a compute shader for just 8 threads; your CPU would probably be faster. So, you may want to call:

Shader.Dispatch(Kernel_X, Mathf.CeilToInt((float)wantedThreadNumber/wantedGroupSize), 1, 1);
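
The same group count can also be computed with integer arithmetic, which avoids the float cast. The sketch below is plain C# with hypothetical names (GroupCount.For is not a Unity function), and the workload numbers are made up for illustration.

```csharp
using System;

static class GroupCount
{
    // Smallest number of groups such that groups * groupSize >= threadCount.
    public static int For(int threadCount, int groupSize)
    {
        if (threadCount < 0) throw new ArgumentOutOfRangeException(nameof(threadCount));
        if (groupSize <= 0) throw new ArgumentOutOfRangeException(nameof(groupSize));
        return (threadCount + groupSize - 1) / groupSize;
    }

    static void Main()
    {
        int wantedThreadNumber = 1000; // hypothetical workload size
        int wantedGroupSize = 64;      // assumed to match [numthreads(64,1,1)] in the shader

        int groups = For(wantedThreadNumber, wantedGroupSize);
        Console.WriteLine($"{groups} groups, {groups * wantedGroupSize} threads launched");

        // In Unity, this value would be passed as the first group count:
        // shader.Dispatch(kernel, groups, 1, 1);
    }
}
```

Because the count is rounded up, more threads may be launched than requested (here 1024 for 1000 items), so the kernel should return early for any thread whose ID is past the end of the data. In Unity, the group size does not have to be hard-coded on the script side; ComputeShader.GetKernelThreadGroupSizes reports the numthreads values of a kernel.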

Canady answered 22/7, 2020 at 20:51 Comment(0)
