Speed of cos() and sin() function in GLSL shaders?
I'm interested in information about the speed of sin() and cos() in the OpenGL Shading Language (GLSL).

The GLSL Specification Document indicates that:

The built-in functions basically fall into three categories:

  • ...
  • ...
  • They represent an operation graphics hardware is likely to accelerate at some point. The trigonometry functions fall into this category.

EDIT:

As has been pointed out, counting clock cycles of individual operations like sin() and cos() doesn't really tell the whole performance story.

So to clarify my question, what I'm really interested in is whether it's worthwhile to optimize away sin() and cos() calls for common cases.

For example, in my application it'll be very common for the argument to be 0. So does something like this make sense:

float sina, cosa;

if ( rotation == 0 )
{
   sina = 0;
   cosa = 1;
}
else
{
   sina = sin( rotation );
   cosa = cos( rotation );
}

Or will the GLSL compiler or the sin() and cos() implementations take care of optimizations like that for me?

Lather answered 14/4, 2012 at 15:54 Comment(10)
What do you mean by "modern GPUs provide hardware acceleration for sin() and cos()"? If it's running on the GPU, it can be said to be hardware accelerated. In any event, your best bet is to try it out and profile it, as clock cycles on a GPU are somewhat meaningless without more context as to what you're doing. Even between different cards from the same vendor there can be differences in the number of execution units, so cycles only tell you part of the story.Vicereine
With those GPUs, I think you'll have the fastest possible execution time of those trigonometric functions. Interesting question...Merritt
As pointed out in this and this question, this question is essentially unanswerable. A particular use of sin might cost nothing, depending on where you use it and the hardware.Woodward
@Vicereine Good points. I've modified my question to try to make it a little more explicit.Lather
@NicolBolas Thanks for the links. #8415751 is particularly informative regarding why simply counting gpu execution unit clock cycles doesn't tell the whole performance story. I've edited my question to try to more explicitly address whether the particular optimization that I'm thinking about making is worthwhile.Lather
For the above, you might find the shader executes both branches and only then decides which result to make use of. The kind of optimisation you're making here is, in my opinion, not worth the trouble and may even result in a reduction in performance, not an increase.Operatic
Hmm, don't know if it is reasonable to assume some kind of optimization for specific uniform vars. Doesn't make sense for in/attribute vars, though.Goa
I'm voting to close this question as off-topic because this question is basically asking "how fast is this operation in this language", which is unanswerable, because it depends on compiler, platform, and a bunch of other things, none of which were specified.Todd
@Operatic that's not good advice for a long time now. If the branch is on a uniform, or even dynamic but a lot of waves in the wavefront take the same path, it can be faster. Whether it's worth it in the case of sin/cos is up to measurement, though.Pit
I think at 9 years old, questions and replies about performance are somewhat out of date, yes.Operatic

For example, in my application it'll be very common for the argument to be 0. So does something like this make sense:

No.

Your compiler will do one of two things.

  1. It will issue an actual conditional branch. In the best possible case, if 0 is a value that is coherent locally (such that groups of shaders will often hit 0 or non-zero together), then you might get improved performance.
  2. It will evaluate both sides of the condition, and only store the result for the correct one of them. In which case, you've gained nothing.
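The second case is effectively a select: both values are computed and one is picked. Written out explicitly in GLSL, a minimal sketch (reusing the question's rotation variable) might look like this:

```glsl
// Branch-free version of the question's snippet: sin() and cos() are
// always evaluated, and the special-case values are chosen with
// step()/mix(). This mirrors what the compiler may emit anyway.
float isZero = step(abs(rotation), 0.0);        // 1.0 only when rotation == 0.0
float sina   = mix(sin(rotation), 0.0, isZero);
float cosa   = mix(cos(rotation), 1.0, isZero);
```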

In general, it's not a good idea to use conditional logic to dance around small performance costs like this. The cost being avoided needs to be really big to be worthwhile, like a discard or something.

Also, do note that floating-point equivalence is not likely to work. Not unless you actually pass a uniform or vertex attribute containing exactly 0.0 to the shader. Even interpolating between 0 and non-zero will likely never produce exactly 0 for any fragment.
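If the test is kept at all, comparing against a small epsilon is safer than exact equality. A sketch of the question's snippet with that change; the tolerance value here is an arbitrary assumption:

```glsl
const float EPS = 1e-6;   // placeholder tolerance; tune for your data
float sina, cosa;
if (abs(rotation) < EPS)
{
    sina = 0.0;
    cosa = 1.0;
}
else
{
    sina = sin(rotation);
    cosa = cos(rotation);
}
```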

Woodward answered 14/4, 2012 at 19:46 Comment(4)
I would actually be passing the 0.0 value to the shader as a vertex attribute. But good point: if I weren't, testing whether the value is within some small epsilon of 0 would probably be necessary. Point taken, though, about it probably not being worthwhile in the first place.Lather
Depending on the amount of work each shader has to do, you might win by having two variants of it, one for where you know it's zero and one where it isn't. But switching shader isn't cheap, so it depends on the workload.Operatic
@NicolBolas And actually, after reading your answer and remembering some of my CUDA, I think there's a third option: the shader may evaluate the first side of the condition for the threads where rotation==0 while the others block (or noop), then evaluate the second side while the first block. Which would obviously be bad as well. Although that's assuming shaders evaluate similarly to CUDA kernels.Lather
Sometimes discard is really expensive too. If you don't mind writing Z, or aren't writing Z anyway, a zero alpha write can be much faster. (I've gotten 100+ percent speed-ups replacing discards with 0 alpha draws.) GPUs love it when all of the threads are doing the same thing.Comte

This is a good question. I too wondered this.

Links found via Google say cos and sin have been single-cycle on mainstream cards since 2005 or so.

Unequivocal answered 6/8, 2012 at 6:23 Comment(0)

You'd have to test this out yourself, but I'm pretty sure that branching in a shader is far more expensive than a sin or cos calculation. GLSL compilers are pretty good about optimizing shaders; worrying about this is premature optimization. If you later find that, across your entire program, your shaders are the bottleneck, then you can worry about optimizing this.

If you want to take a look at the assembly code of your shader for a specific platform, I would recommend AMD GPU ShaderAnalyzer.

Vienna answered 14/4, 2012 at 19:46 Comment(3)
"at an assembly code". There is no "the assembly" for shaders. It changes from platform to platform. And even from driver revision to driver revision.Woodward
A branch on a bool uniform is likely to be free of cost. I've used that technique in this type of situation when it was appropriate.Linn
broken link, here's an update to the URL: developer.amd.com/tools-and-sdks/graphics-development/…Favourable

Not sure if this answers your question, but it's very difficult to tell you how many clocks/slots an instruction takes, as it depends very much on the GPU. Usually it's a single cycle. But even if not, the compiler may rearrange the order of instruction execution to hide the true cost. It's certainly slower to use texture lookups for sin/cos than to execute the instructions.
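For reference, the texture-lookup approach mentioned above looks roughly like this. The sampler name and table size are assumptions, and the table is presumed to be filled with precomputed sin values on the CPU:

```glsl
uniform sampler2D sinLUT;      // hypothetical 256x1 texture of sin values over [0, 2*pi)
const float TWO_PI = 6.28318530718;

// Old-school LUT-based sine: map the angle into [0, 1) and fetch.
// On modern hardware the native sin() instruction beats this fetch.
float sinLookup(float a)
{
    float u = fract(a / TWO_PI);              // wrap angle to one period
    return texture2D(sinLUT, vec2(u, 0.5)).r; // value stored in the red channel
}
```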

Operatic answered 14/4, 2012 at 16:28 Comment(3)
I don't see any mention of sincos() in the spec opengl.org/registry/doc/GLSLangSpec.Full.1.40.05.pdf what is the actual function name? Is that an extension?Lather
My apologies, actually I think that might be D3D only, and even then I think the compiler implicitly generates a sin and a cos instruction for it.Operatic
FWIW, there's an ARB Fragment instruction SCS <operand> which returns sine(input.x) in the x component and cos(input.x) in the y component.Vicereine

See how many sin() calls in a row you can fit in one shader, compared to abs(), fract(), etc. I think a GTX 470 can handle 200 sin() calls per fragment with no problem; the frame will only be about 10 percent slower than with an empty shader. It's fairly fast, and you can feed the results back in. A test like this is a good indicator of computational efficiency.
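A micro-benchmark along these lines might look like the following fragment shader sketch. The uniform name and loop count are arbitrary; the point is that each call depends on the previous result, so the compiler can't eliminate the work:

```glsl
uniform float seed;   // hypothetical uniform so the chain can't be folded at compile time

void main()
{
    float v = seed;
    for (int i = 0; i < 200; ++i)
        v = sin(v);               // each call depends on the previous result
    gl_FragColor = vec4(v);       // the result must be written out or the loop is dead code
}
```

Compare the frame time against the same shader with the loop removed to estimate the per-call cost.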

Overrun answered 19/8, 2014 at 15:45 Comment(0)

The compiler evaluates both branches, which makes conditions quite expensive. If you use both sin and cos in your shader, you can calculate only sin(a) and derive cos(a) = sqrt(1.0 - sin(a)*sin(a)) (up to sign), since sin(x)*sin(x) + cos(x)*cos(x) is always 1.0.
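A sketch of that substitution in GLSL, with the sign caveat spelled out:

```glsl
float sina = sin(a);
// sqrt() discards the sign of cos(a): this is only valid when cos(a)
// is known to be non-negative, i.e. a in [-pi/2, pi/2]. The sqrt()
// may also cost about as much as the cos() it replaces.
float cosa = sqrt(1.0 - sina * sina);
```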

Foskett answered 14/8, 2015 at 17:49 Comment(2)
sin(x) + cos(x) is not generally 1.0. You're probably thinking of the identity that sin(x) * sin(x) + cos(x) * cos(x) is 1.0. While that identity can be used to calculate one value from the other, this involves a square root, which is probably just as expensive as calculating the value. So it's not really useful. Also, modern GPUs don't typically evaluate both branches as long as the condition values are the same for all fragment values that are processed together.Redcoat
Yes, I was thinking of cos^2(x)+ sin^2(x) = 1 from Pythagoras' theorem. My bad.Foskett

© 2022 - 2024 — McMap. All rights reserved.