How can adding code to a loop make it faster?

Asked 27/3, 2009 at 2:38 Answered 29/10, 2009 at 5:5

I have a simple function with an inner loop - it scales the input value, looks up an output value in a lookup table, and copies it to the destination. (ftol_ambient is a trick I copied from the web for fast conversion of float to int).

for (i = 0;  i < iCount;  ++i)
{
    iScaled = ftol_ambient(*pSource * PRECISION3);
    if (iScaled <= 0)
        *pDestination = 0;
    else if (iScaled >= PRECISION3)
        *pDestination = 255;
    else
    {
        iSRGB = FloatToSRGBTable3[iScaled];
        *pDestination = iSRGB;
    }
    pSource++;
    pDestination++;
}

Now my lookup table is finite, and floats are infinite, so there's a possibility of off-by-one errors. I created a copy of the function with some code to handle that case. Notice that the only difference is the added 2 lines of code - please ignore the ugly pointer casting.

for (i = 0;  i < iCount;  ++i)
{
    iScaled = ftol_ambient(*pSource * PRECISION3);
    if (iScaled <= 0)
        *pDestination = 0;
    else if (iScaled >= PRECISION3)
        *pDestination = 255;
    else
    {
        iSRGB = FloatToSRGBTable3[iScaled];
        if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource))
            ++iSRGB;
        *pDestination = (unsigned char) iSRGB;
    }
    pSource++;
    pDestination++;
}

Here's the strange part. I'm testing both versions with identical input of 100000 elements, repeated 100 times. On my Athlon 64 1.8 GHz (32 bit mode), the first function takes 0.231 seconds, and the second (longer) function takes 0.185 seconds. Both functions are adjacent in the same source file, so there's no possibility of different compiler settings. I've run the tests many times, reversing the order they're run in, and the timings are roughly the same every time.

I know there's a lot of mystery in modern processors, but how is this possible?

Here for comparison are the relevant assembler outputs from the Microsoft VC++6 compiler.

; 173  :    for (i = 0;  i < iCount;  ++i)

$L4455:

; 174  :    {
; 175  :        iScaled = ftol_ambient(*pSource * PRECISION3);

    fld DWORD PTR [esi]
    fmul    DWORD PTR __real@4@400b8000000000000000
    fstp    QWORD PTR $T5011[ebp]

; 170  :    int i;
; 171  :    int iScaled;
; 172  :    unsigned int iSRGB;

    fld QWORD PTR $T5011[ebp]

; 173  :    for (i = 0;  i < iCount;  ++i)

    fistp   DWORD PTR _i$5009[ebp]

; 176  :        if (iScaled <= 0)

    mov edx, DWORD PTR _i$5009[ebp]
    test    edx, edx
    jg  SHORT $L4458

; 177  :            *pDestination = 0;

    mov BYTE PTR [ecx], 0

; 178  :        else if (iScaled >= PRECISION3)

    jmp SHORT $L4461
$L4458:
    cmp edx, 4096               ; 00001000H
    jl  SHORT $L4460

; 179  :            *pDestination = 255;

    mov BYTE PTR [ecx], 255         ; 000000ffH

; 180  :        else

    jmp SHORT $L4461
$L4460:

; 181  :        {
; 182  :            iSRGB = FloatToSRGBTable3[iScaled];
; 183  :            *pDestination = (unsigned char) iSRGB;

    mov dl, BYTE PTR _FloatToSRGBTable3[edx]
    mov BYTE PTR [ecx], dl
$L4461:

; 184  :        }
; 185  :        pSource++;

    add esi, 4

; 186  :        pDestination++;

    inc ecx
    dec edi
    jne SHORT $L4455

$L4472:

; 199  :    {
; 200  :        iScaled = ftol_ambient(*pSource * PRECISION3);

    fld DWORD PTR [esi]
    fmul    DWORD PTR __real@4@400b8000000000000000
    fstp    QWORD PTR $T4865[ebp]

; 195  :    int i;
; 196  :    int iScaled;
; 197  :    unsigned int iSRGB;

    fld QWORD PTR $T4865[ebp]

; 198  :    for (i = 0;  i < iCount;  ++i)

    fistp   DWORD PTR _i$4863[ebp]

; 201  :        if (iScaled <= 0)

    mov edx, DWORD PTR _i$4863[ebp]
    test    edx, edx
    jg  SHORT $L4475

; 202  :            *pDestination = 0;

    mov BYTE PTR [edi], 0

; 203  :        else if (iScaled >= PRECISION3)

    jmp SHORT $L4478
$L4475:
    cmp edx, 4096               ; 00001000H
    jl  SHORT $L4477

; 204  :            *pDestination = 255;

    mov BYTE PTR [edi], 255         ; 000000ffH

; 205  :        else

    jmp SHORT $L4478
$L4477:

; 206  :        {
; 207  :            iSRGB = FloatToSRGBTable3[iScaled];

    xor ecx, ecx
    mov cl, BYTE PTR _FloatToSRGBTable3[edx]

; 208  :            if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource))

    mov edx, DWORD PTR _SRGBCeiling[ecx*4]
    cmp edx, DWORD PTR [esi]
    jg  SHORT $L4481

; 209  :                ++iSRGB;

    inc ecx
$L4481:

; 210  :            *pDestination = (unsigned char) iSRGB;

    mov BYTE PTR [edi], cl
$L4478:

; 211  :        }
; 212  :        pSource++;

    add esi, 4

; 213  :        pDestination++;

    inc edi
    dec eax
    jne SHORT $L4472

Edit: Trying to test Nils Pipenbrinck's hypothesis, I added a couple of lines before and inside of the loop of the first function:

int one = 1;
int two = 2;

        if (one == two)
            ++iSRGB;

The run time of the first function is now down to 0.152 seconds. Interesting.

Edit 2: Nils pointed out that the comparison would be optimized out of a release build, and indeed it is. The changes in the assembly code are very subtle, I will post it here to see if it provides any clues. At this point I'm wondering if it's code alignment?

; 175  :    for (i = 0;  i < iCount;  ++i)

$L4457:

; 176  :    {
; 177  :        iScaled = ftol_ambient(*pSource * PRECISION3);

    fld DWORD PTR [edi]
    fmul    DWORD PTR __real@4@400b8000000000000000
    fstp    QWORD PTR $T5014[ebp]

; 170  :    int i;
; 171  :    int iScaled;
; 172  :    int one = 1;

    fld QWORD PTR $T5014[ebp]

; 173  :    int two = 2;

    fistp   DWORD PTR _i$5012[ebp]

; 178  :        if (iScaled <= 0)

    mov esi, DWORD PTR _i$5012[ebp]
    test    esi, esi
    jg  SHORT $L4460

; 179  :            *pDestination = 0;

    mov BYTE PTR [edx], 0

; 180  :        else if (iScaled >= PRECISION3)

    jmp SHORT $L4463
$L4460:
    cmp esi, 4096               ; 00001000H
    jl  SHORT $L4462

; 181  :            *pDestination = 255;

    mov BYTE PTR [edx], 255         ; 000000ffH

; 182  :        else

    jmp SHORT $L4463
$L4462:

; 183  :        {
; 184  :            iSRGB = FloatToSRGBTable3[iScaled];

    xor ecx, ecx
    mov cl, BYTE PTR _FloatToSRGBTable3[esi]

; 185  :            if (one == two)
; 186  :                ++iSRGB;
; 187  :            *pDestination = (unsigned char) iSRGB;

    mov BYTE PTR [edx], cl
$L4463:

; 188  :        }
; 189  :        pSource++;

    add edi, 4

; 190  :        pDestination++;

    inc edx
    dec eax
    jne SHORT $L4457

Delwin answered 27/3, 2009 at 2:38 Comment(14)

Can you post the datatypes of the variables you use? – Amuck 27/3, 2009 at 2:42

Sure, sorry. unsigned char * pDestination, const float * pSource, int iCount; int i; int iScaled; unsigned int iSRGB; – Delwin 27/3, 2009 at 2:50

Almost forgot #define PRECISION3 4096 – Delwin 27/3, 2009 at 2:51

And you probably want to know static unsigned char FloatToSRGBTable3[PRECISION3] and static float SRGBCeiling[256], too. – Delwin 27/3, 2009 at 2:52

I suspect caching problems (maybe; I have no idea). Can you disable caching (e.g. use valgrind) and see how that affects the difference in performance between the two methods? – Amuck 27/3, 2009 at 3:0

Sorry, running under Windows, no valgrind for me. – Delwin 27/3, 2009 at 3:15

The optimizer will remove your extra code in a release build btw.. the branch is based on a constant, so it's safe to do so. If you want your branch to be evaluated all the time make either of the variables one or two volatile. – Splitlevel 27/3, 2009 at 3:43

You're right! I thought the changed timing was enough evidence that the branch had not been optimized out, but when I look in the assembly I don't see it. Register allocation has changed though, and there's one extra instruction: xor ecx,ecx. – Delwin 27/3, 2009 at 4:5

yea - all these out of order and speculative execution features of a modern CPU are funky, aren't they? You change something that shouldn't make a difference and the code suddenly runs faster.. I see this all to often. – Splitlevel 27/3, 2009 at 4:21

btw - have you ever considered to change to a newer compiler? VC6 is quite old. The newer compilers avoid to place branch-targets in the same 32 byte chunk. That avoids most of the effects you're seeing. – Splitlevel 27/3, 2009 at 4:41

This is for my home hobby project. I can't justify the bucks for a new compiler, and I'm still dependent on MFC so I can't use Microsoft's freebies or gcc. – Delwin 27/3, 2009 at 12:22

Sanity test: if you swap the order of the functions, do the timings remain the same? (I remember a few very nasty surprises here) – Mozellamozelle 27/3, 2009 at 15:34

Ah! I didn't knew that the MS freebies can't do MFC. If rarely do GUI stuff. If I do I use WTL, which works fine with the express edition. – Splitlevel 27/3, 2009 at 16:35

hello mark i see your blog your blog is under construction when you made your blog completely inform me then i can read your blog – Intercostal 7/11, 2009 at 7:46

My guess is, that in the first case two different branches end up in the same branch-prediction slot on the CPU. If these two branches predict different each time the code will slow down.

In the second loop, the added code may just be enough to move one of the branches to a different branch prediction slot.

To be sure you can give the Intel VTune analyzer or the AMD CodeAnalyst tool a try. These tools will show you what's exactly going on in your code.

However, keep in mind that it's most probably not worth to optimize this code further. If you tune your code to be faster on your CPU it may at the same time become slower on a different brand.

EDIT:

If you want to read on the branch-prediction give Agner Fog's excellent web-site a try: http://www.agner.org/optimize/

This pdf explains the branch-prediction slot allocation in detail: http://www.agner.org/optimize/microarchitecture.pdf

Splitlevel answered 27/3, 2009 at 3:20 Comment(2)

Of course it's not worth it, why should I complain that the better code is faster? But this is my hobby code, and I hate not understanding what's going on. P.S. good thinking on the branch-prediction slot, I'll try to make up a test for that. – Delwin 27/3, 2009 at 3:25

See my edit to the question. And thanks for the suggestion on AMD CodeAnalyst - I'll have to try it if I can find the time. Doesn't look like I can use the integrated version though. – Delwin 27/3, 2009 at 3:39

My first guess is that the branch is being predicted better in the second case. Possibly because the nested if gives whatever algorithm the processor's using more information to guess from. Just out of curiousity, what happens when you remove the line

if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource))

Shenashenan answered 27/3, 2009 at 2:47 Comment(3)

Commenting out that line makes it roughly the same speed as the other code, within the variation I see in the timing results. – Delwin 27/3, 2009 at 2:56

How do you think a well predicted branch can be faster than no branch at all? – Delwin 27/3, 2009 at 2:58

/Two/ well predicted branches faster than /one/ badly predicted one. (Or some other numbers; there are 3 branches in the first listing and 4 in the second) – Contamination 27/3, 2009 at 3:23

How are you timing these routines? I wonder if paging or caching is having an effect on the timings? It's possible that calling the first routine loads both into memory, crosses a page boundary or causes the stack to cross into an invalid page (causing a page-in), but only the first routine pays the price.

You may want to to run through both functions once before making the calls that take the measurements to reduce the effects that virtual memory and caching might have.

Cyprus answered 27/3, 2009 at 3:50 Comment(1)

I thought of that, but the loop is 100000, and it's being repeated 100 times. Plus I swapped the order of the two functions. I'm using QueryPerformanceCounter for timing. – Delwin 27/3, 2009 at 4:14

Are you just testing this inner loop, or are you testing your undisclosed outer loop as well? If so, look at these three lines:

if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource))  
    ++iSRGB;
*pDestination = (unsigned char) iSRGB;

Now, it looks like *pDestination is the counter for the outer loop. So by sometimes doing an extra increment of the iSRGB value you get to skip some of the iterations in the outer loop, thereby reducing the total amount of work the code needs to do.

Ignition answered 27/3, 2009 at 3:15 Comment(2)

I included the loop statement in the samples; i is the loop counter. Although it's assigned to different registers in the assembly output, I don't see how it would make a difference. The ++iSRGB condition is only hit 3% of the time. – Delwin 27/3, 2009 at 3:20

I see you clarified your answer. No, my outer loop is just as simple as the inner one: for (int i = 0; i < repeats; ++i) – Delwin 27/3, 2009 at 3:23

I once had a similar situation. I hoisted some code out of a loop to make it faster, but it got slower. Confusing. Turns out, the average number of times though the loop was less than 1.

The lesson (which you don't need, obviously) is that a change doesn't make your code faster unless you measure it actually running faster.

Lubric answered 29/10, 2009 at 5:5 Comment(0)

Recommended topics

Hot tags