GLSL shader not unrolling loop when needed

My 9600GT hates me.

Fragment shader:

#version 130

uint aa[33] = uint[33](
    0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,
    0,0,0
);

void main() {
    int i=0;
    int a=26;

    for (i=0; i<a; i++) aa[i]=aa[i+1];

    gl_FragColor=vec4(1.0,0.0,0.0,1.0);

}

If a=25 program runs at 3000 fps.
If a=26 program runs at 20 fps.
If the size of aa is <=32, the issue doesn't appear.
Viewport size is 1000x1000.
The threshold value of a varies with the array accesses inside the loop (aa[i]=aa[i+1]+aa[i-1] gives a different threshold).
I know gl_FragColor is deprecated. But that's not the issue.

My guess is that the GLSL compiler doesn't automatically unroll the loop if a>25 and size(aa)>32. But why? The reason it depends on the size of the array is unknown to mankind.

A quite similar behavior is explained here:
http://www.gamedev.net/topic/519511-glsl-for-loops/

Unrolling the loop manually does solve the issue (3000 fps), even if the size of aa is >32:

    aa[0]=aa[1];
    aa[1]=aa[2];
    aa[2]=aa[3];
    aa[3]=aa[4];
    aa[4]=aa[5];
    aa[5]=aa[6];
    aa[6]=aa[7];
    aa[7]=aa[8];
    aa[8]=aa[9];
    aa[9]=aa[10];
    aa[10]=aa[11];
    aa[11]=aa[12];
    aa[12]=aa[13];
    aa[13]=aa[14];
    aa[14]=aa[15];
    aa[15]=aa[16];
    aa[16]=aa[17];
    aa[17]=aa[18];
    aa[18]=aa[19];
    aa[19]=aa[20];
    aa[20]=aa[21];
    aa[21]=aa[22];
    aa[22]=aa[23];
    aa[23]=aa[24];
    aa[24]=aa[25];
    aa[25]=aa[26];
    aa[26]=aa[27];
    aa[27]=aa[28];
    aa[28]=aa[29];
    aa[29]=aa[30];
    aa[30]=aa[31];
    aa[31]=aa[32];
    // aa[32]=aa[33] is omitted: aa has 33 elements (indices 0..32), so aa[33] would be out of bounds.
Gird answered 1/9, 2013 at 10:54 Comment(7)
So... what's your question? Jamilla
@NicolBolas Why does the framerate drop drastically when a=26? Gird
The only person who would know that is the person who implemented your OpenGL compiler. Jamilla
Is there a command like #pragma optionNV(unroll none) that forces OpenGL to always perform unrolling? Gird
@user2464424: Yes, NV does have quite a few proprietary GLSL #pragma directives, and unrolling loops is one of them. The particular non-portable pragma you want is #pragma optionNV (unroll all). It's usually better to unroll the stuff yourself, since AMD/Intel/... don't know what this #pragma is. The joys of each vendor implementing their own compiler. One thing I sort of like about HLSL: Microsoft implements the one and only compiler, so everything is pretty consistent there. Cogency
I expect the speed difference to be primarily due to the loop code all being dead-code eliminated after unrolling (and not being eliminated if it's not unrolled), rather than due to loop overhead. If you replace the loop with something that can't be dead-code eliminated or constant folded to almost nothing, I expect you won't see much difference in speed between unrolled and non-unrolled code... Cretinism
Try porting your GLSL code to the core profile. This usually helps because newer drivers don't support the older functionality as well as they used to. It has solved a lot of weird compiler-related problems for me in the past (on NVIDIA and also ATI cards). Equipoise

I am just summarizing the comments in an answer here so this no longer shows up as unanswered.

"#pragma optionNV (unroll all)"

fixes the immediate issue on NVIDIA.
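
For illustration, here is a minimal sketch of where the pragma would sit in the shader from the question, assuming an NVIDIA driver. The `0u` literals are my own adjustment (GLSL 1.30 has no implicit int-to-uint conversion); everything else mirrors the original. Whether the hint actually removes the slowdown on a given driver is something to measure, not something this sketch guarantees.

```glsl
#version 130
// NVIDIA-specific hint. #version has to stay first; compilers that do not
// recognize the tokens after #pragma are supposed to ignore the directive.
#pragma optionNV (unroll all)

uint aa[33] = uint[33](
    0u,0u,0u,0u,0u,0u,0u,0u,0u,0u,
    0u,0u,0u,0u,0u,0u,0u,0u,0u,0u,
    0u,0u,0u,0u,0u,0u,0u,0u,0u,0u,
    0u,0u,0u
);

void main() {
    int a = 26;

    // Same loop as in the question; the pragma asks the compiler to unroll it.
    for (int i = 0; i < a; i++) aa[i] = aa[i + 1];

    gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0);
}
```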

In general though, GLSL compilers are very implementation-dependent. The drop at exactly 32 is plausibly explained by hitting a compiler heuristic like "don't unroll loops longer than 32". The huge speed difference might also come from an unrolled loop indexing the array with constants, while a dynamic loop requires addressable array memory. Another reason could be that, once the loop is unrolled, dead-code elimination and constant folding kick in and reduce the entire loop to nothing.
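
One way to probe the dead-code hypothesis is to make the loop's result observable, so the compiler cannot simply discard it. The following is a rough sketch under my own assumptions: `seed` is a hypothetical uniform that the application would have to set, and the final sum exists only to keep the array live. A sufficiently clever compiler could still simplify parts of it, so treat any timing from it as a hint rather than proof.

```glsl
#version 130

// Hypothetical uniform (not in the original shader), set by the application.
// It keeps the array contents from being compile-time constants.
uniform uint seed;

uint aa[33];

void main() {
    // Fill the array from the uniform so its values are unknown at compile time.
    for (int i = 0; i < 33; i++) aa[i] = seed + uint(i);

    // The loop from the question.
    int a = 26;
    for (int i = 0; i < a; i++) aa[i] = aa[i + 1];

    // Consume every element so the shifted values feed into the output.
    uint sum = 0u;
    for (int i = 0; i < 33; i++) sum += aa[i];

    gl_FragColor = vec4(float(sum & 255u) / 255.0, 0.0, 0.0, 1.0);
}
```

If the rolled and manually unrolled variants of this shader run at similar speeds, that would point toward dead-code elimination, rather than loop overhead, as the source of the original 3000 fps figure.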

The most portable fix is really manual unrolling or, even better, manual constant folding. It is always questionable to compute constants in a fragment shader when they can be computed outside it. Some drivers might catch this in some cases, but it is better not to rely on that.
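
As a sketch of what manual constant folding could look like for the shader in the question: the shift is worked out by hand, offline, and the result is written directly as a `const` array, so no loop remains. This assumes the array contents really are known ahead of time, as they are in the posted code, where every element is 0 and the shifted array is therefore identical to the original.

```glsl
#version 130

// Result of shifting aa[0..25] down by one element, precomputed by hand.
// With an all-zero input the shifted array is unchanged; with other
// constant data the shifted values would be written out here instead.
const uint aa[33] = uint[33](
    0u,0u,0u,0u,0u,0u,0u,0u,0u,0u,
    0u,0u,0u,0u,0u,0u,0u,0u,0u,0u,
    0u,0u,0u,0u,0u,0u,0u,0u,0u,0u,
    0u,0u,0u
);

void main() {
    // No per-fragment loop is left for the compiler to unroll or eliminate.
    gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0);
}
```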

Outfoot answered 16/2, 2014 at 3:50 Comment(1)
I should point out that the loop length is 26, not 32; 32 is the size of the array. Basically, this is a hellish extreme case. There are at least 3 factors conflicting with each other: 1) the compiler deleting dead code only if the loop is unrolled; 2) the constant array turning into a VRAM-allocated array beyond 32 elements; 3) the loop-length threshold for automatic unrolling depending on the variable types involved in the loop itself. Conclusion: no one should EVER end up in this case, as the code I posted is very bad practice and should never be used in any real-world scenario. Gird
