DirectCompute optimal numthreads setup
I've recently been playing with compute shaders and I'm trying to determine the best way to set up my [numthreads(x,y,z)] and Dispatch() calls. My demo window is 800x600 and I launch one thread per pixel. I am performing 2D texture modifications, nothing too heavy.

My first try was to specify

[numthreads(32,32,1)]

My Dispatch() calls are always

Dispatch(ceil(screenWidth/numThreads.x),ceil(screenHeight/numThreads.y),1)

So for the first instance that would be

Dispatch(25,19,1)
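One thing to watch in the expression above: `ceil` only rounds up if the division is done in floating point. With integer operands the fraction is truncated before `ceil` ever sees it. A minimal sketch of an integer ceiling division for C++ host code follows; `groupCount` is a hypothetical helper name, not part of the Direct3D API.

```cpp
#include <cstdint>

// Hypothetical helper (not a D3D11 function): number of thread groups
// needed to cover `pixels` along one axis when each group spans
// `groupSize` threads. Uses integer ceiling division, so no float
// conversion is involved. Precondition: groupSize > 0.
constexpr std::uint32_t groupCount(std::uint32_t pixels, std::uint32_t groupSize) {
    return (pixels + groupSize - 1) / groupSize;
}
```

For the 800x600 window with [numthreads(32,32,1)] this gives 25 and 19 groups, matching Dispatch(25,19,1). Note that 19 groups of 32 rows cover 608 rows, so the last row of groups extends past the bottom of the image.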

This ran at 25-26 fps. I then reduced to [numthreads(4,4,1)], which ran at 16 fps. Increasing that to [numthreads(16,16,1)] started yielding nicer results of about 30 fps. Adjusting the Y dimension of the thread group to [numthreads(16,8,1)] managed to push it to 32 fps.
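Part of the difference between these configurations can be seen with simple arithmetic. The sketch below, assuming C++ host code, computes for each group shape how many threads land outside the image and how many hardware lanes a group occupies once it is rounded up to the SIMD width. `GroupStats`, `groupStats`, and the `simdWidth` parameter are hypothetical names; the width itself is an assumption about the GPU (32 on NVIDIA hardware, 64 on AMD GCN).

```cpp
#include <cstdint>

// Hypothetical summary of one numthreads configuration.
struct GroupStats {
    std::uint32_t threadsPerGroup;      // x * y threads declared in numthreads
    std::uint32_t lanesPerGroup;        // threadsPerGroup rounded up to the SIMD width
    std::uint32_t threadsOutsideImage;  // launched threads with no pixel to process
};

// Assumes one thread per pixel and a z dimension of 1.
// simdWidth is an assumption about the target GPU, supplied by the caller.
inline GroupStats groupStats(std::uint32_t width, std::uint32_t height,
                             std::uint32_t groupX, std::uint32_t groupY,
                             std::uint32_t simdWidth) {
    const std::uint32_t groupsX = (width + groupX - 1) / groupX;   // ceiling division
    const std::uint32_t groupsY = (height + groupY - 1) / groupY;
    const std::uint32_t perGroup = groupX * groupY;
    const std::uint32_t lanes = (perGroup + simdWidth - 1) / simdWidth * simdWidth;
    const std::uint32_t launched = groupsX * groupsY * perGroup;
    return GroupStats{perGroup, lanes, launched - width * height};
}
```

For the 800x600 case, the 32x32 and 16x16 shapes each launch 6,400 threads outside the image, while 4x4 and 16x8 launch none. With an assumed SIMD width of 32, a 4x4 group fills only 16 of the 32 lanes it occupies, which is consistent with it being the slowest configuration. This is only one factor; it does not by itself explain why 16x16 beat 32x32.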

My question: is there an optimal way to determine the thread count so I can utilize the GPU most effectively, or is it just good ol' trial and error?

Shepard answered 24/10, 2013 at 7:53 Comment(0)

It's pretty GPU-specific but if you are on NVIDIA hardware you can try using the CUDA Occupancy Calculator.

I know you are using DirectCompute, but they map to the same underlying hardware. If you look at the output of FXC you can see the shared memory size and registers per thread in the assembly. Also you can deduce the compute capability from which card you have. Compute capability is the CUDA equivalent of profiles like cs_4_0, cs_4_1, cs_5_0, etc.

The goal is to increase the "occupancy"; in other words, occupancy == 100% - %idle-due-to-HW-overhead.
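To make the idea concrete, here is a rough sketch of the kind of estimate an occupancy calculator performs: find how many thread groups fit on one multiprocessor under each resource limit, then compare the resident thread count with the maximum. This is a simplification and not the CUDA Occupancy Calculator's actual algorithm; it ignores warp rounding and allocation granularity. All type and function names are hypothetical, and the hardware limits must be looked up for the specific GPU.

```cpp
#include <algorithm>
#include <cstdint>

// Per-multiprocessor limits. Hypothetical struct: the values are NOT
// supplied here and must come from the vendor's documentation.
struct HardwareLimits {
    std::uint32_t maxThreads;      // resident threads per multiprocessor
    std::uint32_t maxGroups;       // resident thread groups per multiprocessor
    std::uint32_t registers;       // 32-bit registers per multiprocessor
    std::uint32_t sharedMemBytes;  // group-shared memory per multiprocessor
};

// Resource usage of one compute shader (hypothetical struct).
struct KernelUsage {
    std::uint32_t threadsPerGroup;     // product of the numthreads dimensions, > 0
    std::uint32_t registersPerThread;  // > 0
    std::uint32_t sharedMemPerGroup;   // bytes; 0 if no shared memory is used
};

// Simplified estimate: fraction of the thread slots that can be occupied.
inline double estimateOccupancy(const HardwareLimits& hw, const KernelUsage& k) {
    // Groups that fit under the thread-count and group-count limits.
    std::uint32_t groups = std::min(hw.maxGroups, hw.maxThreads / k.threadsPerGroup);
    // Groups that fit in the register file.
    groups = std::min(groups, hw.registers / (k.registersPerThread * k.threadsPerGroup));
    // Groups that fit in shared memory, if any is used.
    if (k.sharedMemPerGroup > 0) {
        groups = std::min(groups, hw.sharedMemBytes / k.sharedMemPerGroup);
    }
    return static_cast<double>(groups * k.threadsPerGroup) / hw.maxThreads;
}
```

The sketch shows why very small groups hurt: when the group-count limit is hit first, tiny groups leave most thread slots empty, and a high per-thread register count has a similar effect through the register limit.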

Teilo answered 24/10, 2013 at 9:7 Comment(2)
What compile options do I have to set in order to get the FXC assembly output? I tried /Fc, but nothing in the output file gives me the information you described. I'm using msdn.microsoft.com/en-us/library/windows/desktop/… for reference. - Shepard
If I compile with just the profile (/T), I can see the assembly code. dcl_temps tells you the register count and the dcl_tgsm_* statements tell you the shared memory size. - Teilo
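If the listing needs to be checked repeatedly, the `dcl_temps` value can be pulled out of the text with a few lines of code. The sketch below is a hypothetical helper that assumes the listing has already been loaded into a string and that the declaration appears as `dcl_temps N`. Keep in mind that this is the temporary count at the Direct3D bytecode level; the driver compiles the bytecode again, so the hardware register count may differ.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns the number that follows the first
// `dcl_temps` token in an FXC assembly listing, or -1 if the
// declaration is absent or malformed.
inline int tempRegisterCount(const std::string& listing) {
    std::istringstream in(listing);
    std::string token;
    while (in >> token) {
        if (token == "dcl_temps") {
            int count = 0;
            if (in >> count) {
                return count;
            }
            return -1;  // token present but no number after it
        }
    }
    return -1;  // no dcl_temps declaration found
}
```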

Profiling is the only way to guarantee maximum performance on a particular piece of hardware. But as a general rule, as long as you keep your live register count low (16 or lower) and don't use a ton of shared memory, thread groups of exactly 256 threads should be able to saturate most compute hardware (assuming you're dispatching at least 8 or so groups).
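Since profiling is still needed, it can help to enumerate the shapes worth timing. A minimal sketch, assuming C++ host code: list every power-of-two (x, y) pair whose product equals the target group size. `candidateShapes` is a hypothetical helper name. Because the numthreads values must be compile-time constants, each shape requires its own shader compile, for example by passing the dimensions as preprocessor defines.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: all (x, y) pairs where both values are powers of
// two and x * y == total. Intended for small totals such as 256.
inline std::vector<std::pair<std::uint32_t, std::uint32_t>>
candidateShapes(std::uint32_t total) {
    std::vector<std::pair<std::uint32_t, std::uint32_t>> shapes;
    for (std::uint32_t x = 1; x <= total; x *= 2) {
        if (total % x != 0) {
            continue;
        }
        const std::uint32_t y = total / x;
        // y is a power of two when exactly one bit is set.
        if ((y & (y - 1)) == 0) {
            shapes.emplace_back(x, y);
        }
    }
    return shapes;
}
```

For a total of 256 this yields nine shapes, from (1, 256) through (16, 16) to (256, 1). Each one can be compiled, dispatched with the matching group counts, and timed on the target GPU.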

Bibulous answered 24/10, 2013 at 19:50 Comment(0)
