How to properly parallelize rescaling of a texture in a compute shader?

Asked 3/4 at 20:10 Answered 23/4 at 23:27

I have a RWTexture2D<float4> which was filled by a ray generation shader. I need to scale every pixel by a common constant value, which is only known after the ray generation shader has finished. So, I'm doing this rescaling in a compute shader.

Unfortunately, I'm not very familiar with compute shaders. I clearly want the rescaling operation to be as fast as possible, so I think I want to use the maximal parallelization which is available. I've seen that there are things like threads and groups and the corresponding system values SV_GroupID, SV_GroupThreadID, SV_GroupIndex and SV_DispatchThreadID. But it is still not clear to me what the optimal choice for [numthreads(THREAD_COUNT_X, THREAD_COUNT_Y, 1)] and the command list Dispatch call would be.

For the implementation, I've tried the following:

uint const stride_size_x = texture_width / THREAD_COUNT_X,
    stride_size_y = texture_height / THREAD_COUNT_Y,
    offset_x = thread_id.x * stride_size_x,
    offset_y = thread_id.y * stride_size_y;
for (uint v = offset_y; v < offset_y + stride_size_y; ++v)
{
    for (uint u = offset_x; u < offset_x + stride_size_x; ++u)
        mytexture[uint2(u, v)] *= myscaling;
}

But, to my surprise, this is not working correctly. A small part of the image (at the bottom) seems not to be captured by my loop. What am I doing wrong here and/or should I implement this differently?

Remark: During the loop I will also write a transform of mytexture[uint2(u, v)] to another texture per (u, v). So, in case this matters, it's not only the rescaling which I want to do here.

Proliferation answered 3/4 at 20:10

The optimal answer depends on the hardware you are targeting. AMD hardware traditionally executes threads in wavefronts of 64, while NVIDIA uses warps of 32 threads. So the total number of threads in numthreads should be 32 if you are targeting only NVIDIA and 64 otherwise (possibly slightly less efficient on NVIDIA, but it should barely make a difference). You can use SV_DispatchThreadID to easily convert the thread index to pixel coordinates. Then all you have to do in the shader is the actual scaling.

[numthreads(8, 8, 1)] // 8 * 8 = 64
void main(uint3 id : SV_DispatchThreadID)
{
    mytexture[id.xy] *= myscaling;
}

In the kernel above, each group will spawn 8x8 = 64 threads. On NVIDIA this means that each group is split into two warps of 32 threads, which the hardware schedules independently. The groups themselves are distributed across the GPU and run in parallel. Each group will work on a square 8x8 pixel tile of the texture. You can of course change this to e.g. 4x16 or 1x64. For a simple per-pixel operation like this, that should make little difference to performance, but a small square tile is preferable here, as it's easier to have texture dimensions that are a multiple of 8. If your texture dimensions are not a multiple of 8, you need to add a check and only apply the scaling if you're within bounds. Effectively, threads outside the texture will no-op:

[numthreads(8, 8, 1)]
void main(uint3 id : SV_DispatchThreadID)
{
    uint2 dimensions;
    mytexture.GetDimensions(dimensions.x, dimensions.y); // size of the texture being scaled

    if (id.x < dimensions.x && id.y < dimensions.y)
        mytexture[id.xy] *= myscaling;
}

The semantic value SV_DispatchThreadID refers to the 3D index of the thread within the whole dispatch, so its xy part can be directly mapped to the pixel position. The docs contain more info on how it is derived.
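
To make that derivation concrete, here is a small CPU-side sketch of the documented relation SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID. The names Uint3 and dispatch_thread_id are made up for illustration and are not part of HLSL or D3D12:

```cpp
#include <array>
#include <cstdint>

using Uint3 = std::array<std::uint32_t, 3>;

// Hypothetical helper (not a D3D12 or HLSL API): reproduces, per component,
// SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID.
inline Uint3 dispatch_thread_id(Uint3 const& group_id,
                                Uint3 const& group_thread_id,
                                Uint3 const& numthreads)
{
    return { group_id[0] * numthreads[0] + group_thread_id[0],
             group_id[1] * numthreads[1] + group_thread_id[1],
             group_id[2] * numthreads[2] + group_thread_id[2] };
}
```

With numthreads(8, 8, 1), the thread (5, 1, 0) inside group (2, 3, 0) therefore gets the ID (21, 25, 0), i.e. it handles pixel (21, 25).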

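As for dispatch size, you have to provide the number of groups (not threads) to spawn. Round up, so that texture dimensions which are not a multiple of 8 are still fully covered (the bounds check above discards the excess threads):

auto const desc = texture->GetDesc();
commandList->Dispatch((static_cast<UINT>(desc.Width) + 7) / 8, (desc.Height + 7) / 8, 1);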
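
Since Dispatch takes whole groups, a dimension that is not a multiple of the group size needs the group count rounded up, otherwise the last partial tile is never launched. A minimal sketch of that rounding, where group_count is a hypothetical helper name and not something D3D12 provides:

```cpp
#include <cstdint>

// Hypothetical helper (not a D3D12 API): smallest number of thread groups
// such that group_count * threads_per_group >= extent (ceiling division).
// Assumes threads_per_group > 0 and that extent + threads_per_group - 1
// does not overflow 32 bits, which holds for any valid texture size.
inline std::uint32_t group_count(std::uint32_t extent,
                                 std::uint32_t threads_per_group)
{
    return (extent + threads_per_group - 1) / threads_per_group;
}
```

For a 1000x750 texture and 8x8 groups this gives 125 x 94 groups. Plain integer division would give 750 / 8 = 93 groups vertically, which covers only 744 rows and leaves the last 6 rows unscaled.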

Samarasamarang answered 22/4 at 10:27

The simplest way to do it is to have one thread per output pixel.

void Dispatch(
  [in] UINT ThreadGroupCountX,
  [in] UINT ThreadGroupCountY,
  [in] UINT ThreadGroupCountZ
);
// hlsl
[numthreads(THREAD_COUNT_X, THREAD_COUNT_Y, 1)]

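Then you want ThreadGroupCountX * THREAD_COUNT_X >= TextureWidth and ThreadGroupCountY * THREAD_COUNT_Y >= TextureHeight, with equality when the texture size is an exact multiple of the group size. Otherwise, round the group counts up and skip the threads that fall outside the texture.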
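
As far as I can tell, the missing strip at the bottom in the question comes from the same kind of truncation: texture_height / THREAD_COUNT_Y is an integer division, so the strides do not add up to the full height unless it divides evenly. A rough sketch of the arithmetic, assuming exactly thread_count threads share the whole extent (which the question does not state explicitly); both function names are made up:

```cpp
#include <cstdint>

// Rows (or columns) visited by the strided loop from the question,
// ASSUMING exactly thread_count threads share the whole extent and each
// one handles extent / thread_count consecutive rows. Assumes thread_count > 0.
inline std::uint32_t covered_by_strides(std::uint32_t extent,
                                        std::uint32_t thread_count)
{
    std::uint32_t const stride = extent / thread_count; // truncates
    return stride * thread_count;
}

// Rows at the end of the range that no thread ever touches.
inline std::uint32_t missed_by_strides(std::uint32_t extent,
                                       std::uint32_t thread_count)
{
    return extent - covered_by_strides(extent, thread_count);
}
```

For example, with a height of 1080 and 32 threads along y, each thread gets 33 rows, so only 1056 rows are covered and the last 24 are skipped. With one thread per pixel plus a bounds check, this problem does not arise.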

Return answered 23/4 at 23:27
