HLSL branch avoidance

Asked 17/9, 2012 at 13:47 Answered 14/8, 2015 at 17:5

I have a shader where I want to move half of the vertices in the vertex shader. I'm trying to decide the best way to do this from a performance standpoint, because we're dealing with well over 100,000 verts, so speed is critical. I've looked at 3 different methods (pseudo-code, but enough to give you the idea). I can't give out the <complex formula>, but I can say that it involves a sin() call, another function call (it just returns a number, but it's still a function call), and a bunch of basic arithmetic on floating point numbers.

if (y < 0.5)
{
    x += <complex formula>;
}

This has the advantage that the <complex formula> is only executed half the time, but the downside is that it definitely causes a branch, which may actually be slower than the formula. It is the most readable, but we care more about speed than readability in this context.

x += step(y, 0.5) * <complex formula>;

Using HLSL's step() function (step(a, b) returns 1 if b >= a and 0 otherwise, so step(y, 0.5) is 1 when y <= 0.5), you can eliminate the branch, but now the <complex formula> is evaluated every time, and its result is multiplied by 0 (thus wasted effort) half of the time.

x += (y < 0.5) ? <complex formula> : 0;

This I don't know about. Does the ?: cause a branch? And if not, are both sides of the expression evaluated, or only the one that is relevant?

The final possibility is that the <complex formula> could be offloaded back to the CPU instead of the GPU, but I worry that it will be slower in calculating sin() and other operations, which might result in a net loss. Also, it means one more number has to be passed to the shader, and that could cause overhead as well. Anyone have any insight as to which would be the best course of action?

Addendum:

According to http://msdn.microsoft.com/en-us/library/windows/desktop/bb509665%28v=vs.85%29.aspx

the step() function uses a ?: internally, so it's probably no better than my 3rd solution, and potentially worse since <complex formula> is definitely called every time, whereas it may be only called half the time with a straight ?:. (Nobody's answered that part of the question yet.) Though avoiding both and using:

x += (1.0 - y) * <complex formula>;

may well be better than any of them, since there's no comparison being made anywhere. (And y is always either 0 or 1.) Still executes the <complex formula> needlessly half the time, but might be worth it to avoid branches altogether.
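To make the four variants concrete, here is a minimal CPU-side sketch in C++ (not HLSL, and not the real shader). The function complex_formula is a hypothetical stand-in for the undisclosed <complex formula>, and step_hlsl mimics HLSL's step(). It only illustrates that the variants produce the same x when y is exactly 0 or 1; it says nothing about how a GPU compiler would schedule them.

```cpp
#include <cmath>

// Hypothetical stand-in for the undisclosed <complex formula>.
static float complex_formula(float t) { return std::sin(t) * 2.0f + 1.0f; }

// Mimics HLSL step(edge, x): 1 when x >= edge, otherwise 0.
static float step_hlsl(float edge, float x) { return x >= edge ? 1.0f : 0.0f; }

// Method 1: explicit if.
static float move_if(float x, float y, float t) {
    if (y < 0.5f) { x += complex_formula(t); }
    return x;
}

// Method 2: step() used as a 0/1 mask.
static float move_step(float x, float y, float t) {
    return x + step_hlsl(y, 0.5f) * complex_formula(t);
}

// Method 3: ternary operator.
static float move_ternary(float x, float y, float t) {
    return x + ((y < 0.5f) ? complex_formula(t) : 0.0f);
}

// Method 4: (1 - y) as the mask; only valid because y is exactly 0 or 1.
static float move_one_minus_y(float x, float y, float t) {
    return x + (1.0f - y) * complex_formula(t);
}
```

One detail the sketch exposes: at y == 0.5 exactly, step(y, 0.5) yields 1 while (y < 0.5) is false, so methods 1 and 2 differ there. That does not matter when y is only ever 0 or 1.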

Silky answered 17/9, 2012 at 13:47 Comment(5)

The approach depends on the target hardware. You can compare the assembly generated for these variants (for example, RenderMonkey can analyze performance for Radeon cards). In addition, is the bottleneck actually in the vertex shader? Maybe all variants will give the same results :) – Rabbin 17/9, 2012 at 14:33

Shaders on some cores must execute in lockstep, so complex formula gets evaluated no matter what in any case. – Brookite 17/9, 2012 at 15:45

Also, if y can be computed as a function of the mesh, maybe you just split the mesh or the scene geometry and run two different shaders. – Brookite 17/9, 2012 at 15:46

Target platforms are XBox360, PS3, and PC. (The PC build is only for testing, so less critical. In general, it uses same code as PS3.) It might help to know that y will always be either exactly 0 or 1, nothing in between. (It's the texture coords of a quad.) I suppose the step() function could be replaced with simply (1.0 - y) and have the same effect. Still causes the formula to calculate twice as much as strictly necessary... – Silky 17/9, 2012 at 16:40

Oh, and per above, two different shaders is not an option, since every poly would cross over the threshold. It's just rendering a lot of quads. Think of it like a particle effect. It's not exactly that, but that's as close as I think I can safely get without violating NDA. – Silky 17/9, 2012 at 16:51

Perhaps look at this answer.

My guess (this is a performance question: measure it!) is that you are best off keeping the if statement.

Reason number one: The shader compiler, in theory (and if invoked correctly), should be clever enough to make the best choice between a branch instruction, and something similar to the step function, when it compiles your if statement. The only way to improve on it is to profile^[1]. Note that it's probably hardware-dependent at this level of granularity.

[1] Or if you have specific knowledge about how your data is laid out, read on...

Reason number two is the way shader units work: If even one fragment or vertex in the unit takes a different branch to the others, then the shader unit must take both branches. But if they all take the same branch - the other branch is ignored. So while it is per-unit, rather than per-vertex - it is still possible for the expensive branch to be skipped.

For fragments, the shader units have on-screen locality - meaning you get best performance with groups of nearby pixels all taking the same branch (see the illustration in my linked answer). To be honest, I don't know how vertices are grouped into units - but if your data is grouped appropriately - you should get the desired performance benefit.
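The "per-unit, rather than per-vertex" point can be illustrated with a toy C++ model. This is a sketch under loud assumptions: the group size of 4 is arbitrary, the function names are hypothetical, and real hardware may group vertices quite differently. A group pays for the expensive branch if any of its lanes has y < 0.5.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy model: counts how many lockstep groups must execute the expensive
// branch. A group executes it if at least one lane in it has y < 0.5.
static std::size_t groups_running_formula(const std::vector<float>& ys,
                                          std::size_t group_size) {
    if (group_size == 0) { return 0; }  // guard against a non-advancing loop
    std::size_t count = 0;
    for (std::size_t start = 0; start < ys.size(); start += group_size) {
        const std::size_t end = std::min(start + group_size, ys.size());
        bool any_lane_takes_branch = false;
        for (std::size_t i = start; i < end; ++i) {
            if (ys[i] < 0.5f) { any_lane_takes_branch = true; break; }
        }
        if (any_lane_takes_branch) { ++count; }
    }
    return count;
}

// y values in a 0,0,1,1,0,0,1,1,... pattern.
static std::vector<float> make_interleaved(std::size_t n) {
    std::vector<float> ys(n);
    for (std::size_t i = 0; i < n; ++i) { ys[i] = ((i / 2) % 2 == 0) ? 0.0f : 1.0f; }
    return ys;
}

// All y == 0 vertices first, then all y == 1 vertices.
static std::vector<float> make_sorted(std::size_t n) {
    std::vector<float> ys(n);
    for (std::size_t i = 0; i < n; ++i) { ys[i] = (i < n / 2) ? 0.0f : 1.0f; }
    return ys;
}
```

In this model, with 64 vertices and groups of 4, the interleaved layout makes all 16 groups run the formula, while the sorted layout needs only 8. Whether a real GPU behaves this way for vertices is exactly what would need to be measured.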

Finally, it's worth pointing out that if your <complex formula> can be hoisted out of your HLSL manually, it may well get hoisted into a CPU-based pre-shader anyway (on PC at least; from memory, Xbox 360 doesn't support this, and I have no idea about PS3). You can check this by decompiling the shader. If it is something that you only need to calculate once per draw (rather than per vertex/fragment), it is probably best for performance to do it on the CPU.
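As a rough illustration of the once-per-draw case, here is a CPU-side C++ sketch. It assumes (hypothetically) that the formula depends only on a per-draw input such as time; per_draw_formula, Vertex, and apply_offset are made-up names, and the loop merely stands in for what the vertex shader would do with a value passed in as a shader constant.

```cpp
#include <cmath>
#include <vector>

struct Vertex { float x; float y; };

// Hypothetical stand-in for a formula that depends only on per-draw inputs.
static float per_draw_formula(float time) { return std::sin(time) * 2.0f + 1.0f; }

// Stand-in for the per-vertex work left in the shader: one multiply-add.
// Assumes y is exactly 0 or 1, as in the question.
static void apply_offset(std::vector<Vertex>& verts, float offset) {
    for (Vertex& v : verts) {
        v.x += (1.0f - v.y) * offset;
    }
}
```

The formula would be evaluated once on the CPU per draw call, and the result handed to apply_offset (or, in the real case, to the shader). If the formula depends on per-quad data, as the question suggests it might, this does not apply directly.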

Bezique answered 18/9, 2012 at 11:38 Comment(3)

Fortunately, I'm not dealing with the frag shader here, so that's not an issue. For the verts, it's probably following a 00110011 pattern - the top 2 and then bottom 2 verts of every quad. I'm pretty sure all quads are being drawn in one single draw-call, but the vertex movement is on a per-quad basis. It honestly doesn't really matter what happens on the PC, since that's not the shippable. Others here have said that if is bad and should be avoided at all costs, but I guess I won't know for sure until we can get this profiling. It may well get moved to CPU, but that is a bit trickier... – Silky 18/9, 2012 at 14:22

The numbers are in for the XBox: Control (no vertex movement): 830K GPU cycles. Method #1 (if): 834K cycles. Method #2 (step()): 836K cycles. Method #3 (?:): 835K cycles. Method #4 (1-y): 844K cycles. I'm really surprised the last was slowest, since it's the only one with no branching. But you were correct about if - at least in this case. I'm told the PS3 will be another story, though. We'll see how that goes. – Silky 19/9, 2012 at 19:24

Whoops - hold up. I redid the tests with a different camera angle (a distant shot showing many more quads but much further away), and the results got flip-turned upside-down. Control: 566K, Method #1: 595K, Method #2: 590K, Method #3: 591K, Method #4: 611K. 1-y is still surprisingly the worst, but the other 3 are completely reversed. This is probably a more typical view angle, so I'll have to cast my vote with step() as the best solution in this case, though ?: is very close in both cases. Jury's still out on PS3. – Silky 19/9, 2012 at 20:2

I got tired of my conditionals being ignored, so I just made another kernel and did an override in c execution. If you need it to be accurate all the time, I suggest this fix.

Cartoon answered 14/8, 2015 at 17:5 Comment(4)

Not sure how this is in any way relevant? Also, this question was 3 years and 2 jobs ago, so I no longer even have access to the code in question... – Silky 14/8, 2015 at 18:30

Just putting it out there. – Cartoon 14/8, 2015 at 22:3

Are you saying you moved the logic to a compute kernel on GPU? Or you had access to source of existing GPU kernel that executes shaders, and modified that? Suggested links for learning how to do what you did? – Bonkers 22/1, 2018 at 22:8

It's been a while, sorry, but I worked it out myself and moved the logic to the GPU. You simply make a new, linear kernel (so indexed by thread.x); you do this by buffer transfer between control and compute, which allows for additional verification/logging. – Cartoon 8/2, 2018 at 0:46
