I have a RWTexture2D<float4> which was filled by a ray generation shader. I need to scale every pixel by a common constant value, which is only known after the ray generation shader has finished. So, I'm doing this rescaling in a compute shader.
Unfortunately, I'm not very familiar with compute shaders. I want the rescaling operation to be as fast as possible, so I think I want to use the maximal parallelization that is available. I've seen that there are things like threads and groups, and the corresponding system values SV_GroupID, SV_GroupThreadID, SV_GroupIndex and SV_DispatchThreadID. But it is still not clear to me what the optimal choice for [numthreads(THREAD_COUNT_X, THREAD_COUNT_Y, 1)] and the command list Dispatch call would be.
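For context, a pattern that is commonly used for this kind of per-pixel pass (not necessarily the optimal one here) is to let each thread handle exactly one pixel, indexed by SV_DispatchThreadID, and to dispatch enough groups to cover the whole texture. The group counts then come from a rounded-up division. Below is a minimal host-side sketch of that arithmetic in C++; `group_count` is a hypothetical helper, and the 8x8 group size and 1920x1080 texture size mentioned afterwards are assumed values for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of thread groups needed along one axis so that
// groups * threads_per_group >= texture_extent (ceiling division).
inline std::uint32_t group_count(std::uint32_t texture_extent,
                                 std::uint32_t threads_per_group)
{
    assert(threads_per_group > 0);
    return (texture_extent + threads_per_group - 1) / threads_per_group;
}
```

With an assumed 8x8 group size and a 1920x1080 texture, this gives `group_count(1920, 8)` = 240 and `group_count(1080, 8)` = 135, which would be passed as `Dispatch(240, 135, 1)`. Because of the rounding up, some threads can fall outside the texture when the size is not a multiple of the group size, so a shader written this way would have to skip any thread whose id is not inside the texture bounds.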
For the implementation, I've tried the following:
uint const stride_size_x = texture_width / THREAD_COUNT_X,
    stride_size_y = texture_height / THREAD_COUNT_Y,
    offset_x = thread_id.x * stride_size_x,
    offset_y = thread_id.y * stride_size_y;
for (uint v = offset_y; v < offset_y + stride_size_y; ++v)
{
    for (uint u = offset_x; u < offset_x + stride_size_x; ++u)
        mytexture[uint2(u, v)] *= myscaling;
}
But, to my surprise, this is not working correctly. A small part of the image (at the bottom) seems not to be captured by my loop. What am I doing wrong here and/or should I implement this differently?
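One way to narrow this down is to replay the index arithmetic of the loop on the CPU. The following is a minimal sketch in C++, under the assumption that thread_id runs over 0 .. THREAD_COUNT - 1 along each axis (for example, a single dispatched group). `covered_extent` is a hypothetical helper that is not part of the shader. It counts how many texels along one axis the strided loop visits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of texels along one axis visited by the strided
// loop, assuming thread ids 0 .. thread_count - 1 along that axis.
inline std::uint32_t covered_extent(std::uint32_t texture_extent,
                                    std::uint32_t thread_count)
{
    assert(thread_count > 0);
    // Integer division rounds down, exactly as in the shader code.
    std::uint32_t const stride_size = texture_extent / thread_count;

    std::uint32_t covered = 0;
    for (std::uint32_t thread_id = 0; thread_id < thread_count; ++thread_id)
    {
        std::uint32_t const offset = thread_id * stride_size;
        for (std::uint32_t u = offset; u < offset + stride_size; ++u)
            ++covered; // one texel visited by this thread
    }
    return covered;
}
```

Under that assumption, whenever the texture extent is not a multiple of the thread count, the truncated stride leaves the last `texture_extent % thread_count` texels unvisited. For example, with an assumed height of 1080 and 32 threads the stride is 33, so only 33 * 32 = 1056 rows are covered, while an assumed width of 1920 with 32 threads is covered completely.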
Remark: During the loop I will also write a transform of mytexture[uint2(u, v)] to another texture per (u, v). So, in case this matters, it's not only the rescaling which I want to do here.