Why is a texture lookup so much slower than a direct computation?
I'm working on an OpenGL implementation of the Oculus Rift distortion shader. The shader takes the input texture coordinate (into a texture containing a previously rendered scene), transforms it using distortion coefficients, and then uses the transformed coordinate to determine the fragment color.

I'd hoped to improve performance by pre-computing the distortion and storing it in a second texture, but the result is actually slower than the direct computation.

The direct calculation version looks basically like this:

float distortionFactor(vec2 point) {
    float rSq = lengthSquared(point);
    float factor =  (K[0] + K[1] * rSq + K[2] * rSq * rSq + K[3] * rSq * rSq * rSq);
    return factor;
}

void main()
{
    vec2 distorted = vRiftTexCoord * distortionFactor(vRiftTexCoord);
    vec2 screenCentered = lensToScreen(distorted);
    vec2 texCoord = screenToTexture(screenCentered);
    vec2 clamped = clamp(texCoord, ZERO, ONE);
    if (!all(equal(texCoord, clamped))) {
        vFragColor = vec4(0.5, 0.0, 0.0, 1.0);
        return;
    }
    vFragColor = texture(Scene, texCoord);
}

where K is a vec4 that's passed in as a uniform.
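For reference, the polynomial the shader evaluates can be sketched in Python. The coefficient values below are placeholders, not real Rift constants:

```python
# Hypothetical sketch of the shader's distortion math. K stands in for the
# vec4 uniform; these values are made up for illustration.
K = [1.0, 0.22, 0.24, 0.0]

def distortion_factor(x, y):
    r_sq = x * x + y * y  # lengthSquared(point)
    return K[0] + K[1] * r_sq + K[2] * r_sq ** 2 + K[3] * r_sq ** 3

def distort(x, y):
    # Scale the point radially outward from the lens center.
    f = distortion_factor(x, y)
    return x * f, y * f
```

At the lens center (r = 0) every higher-order term vanishes and the factor reduces to K[0], so a point at the center is not displaced at all.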

On the other hand, the displacement map lookup looks like this:

void main() {
    vec2 texCoord = vTexCoord;
    if (Mirror) {
        texCoord.x = 1.0 - texCoord.x;
    }
    texCoord = texture(OffsetMap, texCoord).rg;
    vec2 clamped = clamp(texCoord, ZERO, ONE);
    if (!all(equal(texCoord, clamped))) {
        discard;
    }
    if (Mirror) {
        texCoord.x = 1.0 - texCoord.x;
    }
    FragColor =  texture(Scene, texCoord);
}

There's a couple of other operations for correcting the aspect ratio and accounting for the lens offset, but they're pretty simple. Is it really reasonable to expect this to outperform a simple texture lookup?

Manutius answered 15/12, 2013 at 6:30 Comment(0)
GDDR memory has pretty high latency, and modern GPU architectures have plenty of number-crunching capability. It used to be the other way around: GPUs were so ill-equipped to do calculations that normalizing a vector was cheaper to do by fetching from a cube map.

Throw in the fact that you are not doing a regular texture lookup here, but rather a dependent lookup and it comes as no surprise. Since the location you are fetching from depends on the result of another fetch, it is impossible to pre-fetch / efficiently cache (an effective latency hiding strategy) the memory needed by your shader. That is no "simple texture lookup."
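The scheduling effect can be sketched with a toy cost model in Python. The cycle counts below are invented for illustration, not real hardware figures:

```python
# Toy latency model. A fetch whose address is known up front can overlap
# with ALU work; a fetch whose address comes from another fetch cannot.
FETCH_LATENCY = 400  # cycles for a texture fetch to return (made up)
ALU_COST = 20        # cycles of arithmetic available to overlap (made up)

def compute_then_fetch():
    # First shader: the fetch address is computed from varyings, so the
    # fetch can be issued early and only the uncovered latency stalls.
    return ALU_COST + max(FETCH_LATENCY - ALU_COST, 0)

def fetch_then_fetch():
    # Second shader: the second fetch's address depends on the first
    # fetch's result, so the two latencies serialize.
    return FETCH_LATENCY + FETCH_LATENCY
```

Under this (crude) model the dependent-lookup shader stalls for roughly twice as long per fragment, which matches the intuition that the second fetch cannot begin until the first one completes.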

What is more, in addition to doing a dependent texture lookup your second shader also includes the discard keyword. This will effectively eliminate the possibility of early depth testing on a lot of hardware.

Honestly, I do not see why you want to "optimize" the distortionFactor (...) function into a lookup. It uses squared length, so you are not even dealing with a sqrt, just a bunch of multiplication and addition.

Greenhead answered 15/12, 2013 at 8:49 Comment(4)
It's interesting, though. The lookup that takes the actual fragment color from the framebuffer-rendered texture is always conditional; I never use the input texture coordinates without modifying them first. However, I suppose it's possible that the driver is smart enough to pre-fetch input pixels near the source of an adjacent fragment as it moves over the whole output... but then that would also hold for the method using a texture lookup of the coordinates instead of a computation.Manutius
@Jherico: You are correct, technically anything where the texture coordinates are computed and not taken from stage input is considered a dependent texture lookup. However, thread scheduling units (warps on NV hardware, wavefronts on AMD hardware) can be re-scheduled on modern GPUs so that useful calculations are done while waiting for a memory fetch. A shader that has to do a texture lookup to find the coordinates for a second texture lookup, rather than compute the coordinates is going to stall more frequently.Greenhead
@Jherico: The GPU will try to counteract this by switching to another warp/wavefront, but they will quickly stall too because there is only about 1 calculation that can be performed before the first texture lookup in your shader. Ideally you would want to do several calculations before the first texture lookup so that you can have warps/wavefronts actively working on different parts of your shader anytime they have to stall for a memory fetch. There have been a lot of papers written on this topic, and modern GPUs are starting to adopt a multi-level cache architecture similar to CPUs.Greenhead
Oh, the bit about discard disabling early depth testing is a handy thing to know, but I should have pointed out that Oculus Rift distortion is taking a 2D image and doing lens correction. Depth testing would be completely turned off anyway as all the rendering is working in a 2D plane.Manutius
Andon M. Coleman already explained what's going on. Essentially, memory bandwidth and, more importantly, memory latency are the main bottlenecks of modern GPUs; hence on everything built from about 2007 to today, simple calculations are often far faster than a texture lookup.

In fact, memory access patterns have such a large impact on efficiency that slightly rearranging the access pattern and ensuring proper alignment can easily give performance boosts of a factor of 1000 (BT;DT, although that was CUDA programming). A dependent lookup is not necessarily a performance killer, though: if the dependent texture coordinate varies monotonically with the controlling texture, it's usually not so bad.


That being said, have you never heard of Horner's method? You can rewrite

float factor =  (K[0] + K[1] * rSq + K[2] * rSq * rSq + K[3] * rSq * rSq * rSq);

trivially to

float factor =  K[0]  + rSq * (K[1] + rSq * (K[2] + rSq * K[3]) );

saving you three multiplications.
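As a quick sanity check, the two forms can be compared in Python (placeholder coefficients; the point is that they agree numerically while the Horner form needs only three multiplications instead of six):

```python
# Placeholder coefficients, not real Rift constants.
K = [1.0, 0.22, 0.24, 0.1]

def naive(r_sq):
    # Expanded form: 6 multiplications, 3 additions.
    return K[0] + K[1] * r_sq + K[2] * r_sq * r_sq + K[3] * r_sq * r_sq * r_sq

def horner(r_sq):
    # Horner form: 3 multiplications, 3 additions.
    return K[0] + r_sq * (K[1] + r_sq * (K[2] + r_sq * K[3]))
```

The two can differ in the last floating-point bits because the rounding order changes, which is also why a strict compiler is not free to make this rewrite on its own.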

Actinic answered 15/12, 2013 at 10:45 Comment(4)
I'd be surprised if the shader compiler/optimizer didn't reorder that expression to take advantage of Horner's rule automatically. But no reason not to be explicit about it.Monitory
@DrewHall: Actually, a compiler cannot optimize this kind of thing. To illustrate, throw the following into a C compiler: ` #include <stdio.h> int main() { printf("%d %d %d\n", 5 * 10 / 20, (5*10)/20, 5*(10/20)); return 0; } ` and run it. Although it uses ints instead of floats, it showcases the same principle: a C-style compiler parses an expression "by the operator", and the intermediate values are taken as actual l-values for the next subexpression. So while they are the same purely mathematically, at the C code level a Horner-method contraction differs from the expanded form.Actinic
I think it depends heavily on the compiler itself & on the flags you pass (e.g. how strict you tell it to be with floating point calcs). I would expect a GPU compiler to be a little more liberal with floating point reordering vs. ensuring strict accuracy. As for the intermediate/L-values, I agree that that's true for the parser part of the compiler, but the optimizer can still do a lot with control flow without altering observable behavior.Monitory
At least NVIDIA's GLSL compiler is so liberal that e.g. double-single tricks don't work without hiding some values to prevent too aggressive optimization. See an example of my hack here. What NVIDIA's compiler is doing for GLSL is essentially what gcc does with -ffast-math for C, C++ and other CPU-targeted languages.Salmonella
The GPU is massively parallel and can compute thousands of results in a single clock cycle, while memory reads are comparatively serial. If, for example, the multiplications take 5 clocks to compute, the GPU can calculate 1000 results in 5 clock cycles. If instead the data has to be read from memory at, say, 10 values per clock cycle, acquiring 1000 values would take 100 clock cycles rather than 5. The numbers are made up, just to illustrate the point :)
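The made-up numbers above can be written out explicitly (Python, purely illustrative):

```python
# Illustrative arithmetic only; these figures model no real hardware.
results = 1000

# All 1000 lanes compute in parallel, so the whole batch finishes
# in the 5 cycles one multiplication chain takes.
compute_cycles = 5

# Reading the same 1000 values at 10 per clock cycle serializes:
fetch_cycles = results // 10  # 100 cycles
```

So under these toy assumptions the memory-bound path is 20x slower than the compute-bound one, which is the answer's point.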

Nip answered 31/3, 2014 at 4:24 Comment(0)
