4 and 8 are scale factors that x86 addressing modes support, tempting GCC into using a slow-LEA on the critical-path dependency chain when adding 4*i or 8*i to the sum, along with the constant sum of 1..4 or 1..8 (either way just a constant; it's irrelevant what it is). Apparently that's also part of the dep chain for 16. And you used inline asm to force the sum dep chain to include a store/reload.
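For reference, a hedged guess at the shape of source that produces this (the inner-loop bound, the i + j body, and the function name are assumptions, not your actual code): GCC fully unrolls and folds the inner loop, leaving only the 4*i-plus-constant you see in the asm below.

#include <cstdint>

int64_t test1_sketch(int64_t iterations) {         // hypothetical name
    int64_t sum = 0;
    for (int64_t i = 0; i < iterations; i++) {
        for (int64_t j = 0; j < 4; j++)            // 8 (or 16) in the other functions
            sum += i + j;                          // folds to sum += 4*i + 6
        asm volatile("" :: "g"(&sum) : "memory");  // forces sum to be stored and reloaded
    }
    return sum;
}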
You wrote: "Analyzing the assembly, I am observing that the inner loops in both functions are eliminated and replaced by a few arithmetic operations done as operands of lea. So in this case, it would make sense if both functions ran at the same speed."
They're both fast, but the different multiples of i take different instructions, so there's little reason to assume they'd be the same. The interesting thing here is that the one with more total instructions is faster, because it has a shorter dependency chain through sum.
And you forced a store/reload of it with the somewhat-clunky asm volatile("" :: "g"(&sum) : "memory"), instead of just requiring the compiler to have the value in a register with asm volatile("" : "+r"(sum)). That dep chain therefore includes store-forwarding latency (typically 3 to 5 cycles), making it a bigger bottleneck than front-end or ALU throughput of the independent work.
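A minimal sketch of that register-only alternative (loop body assumed, same as the sketch above):

for (int64_t i = 0; i < iterations; i++) {
    for (int64_t j = 0; j < 4; j++)
        sum += i + j;
    asm volatile("" : "+r"(sum));  // sum merely has to be live in a register here:
                                   // no store/reload, so no store-forwarding latency
}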
test1():
xor eax, eax # i = 0
xor ecx, ecx # sum = 0
lea rsi, [rsp-8] # operand for inline asm
jmp .L3
.L5:
mov rcx, QWORD PTR [rsp-8] # load sum
mov rax, rdx # i = i+1
.L3:
lea rdx, [rax+1] # i+1, leaving the old value in RAX
lea rax, [rcx+6+rax*4] # sum += 4*i + 6. (1+2+3)
mov QWORD PTR [rsp-8], rax # store sum
cmp rdx, 100000
jne .L5
ret
An LEA with 3 components (two + operations) in the addressing mode has 3 cycle latency on your Skylake-derived CPU. See x86 Multiplication with 3: IMUL vs SHL + ADD. So the loop-carried dependency chain is a slow LEA (3 cycles) between the load and the store.
test2():
xor eax, eax
xor ecx, ecx
lea rdi, [rsp-8] # operand for inline asm
jmp .L8
.L9:
mov rcx, QWORD PTR [rsp-8] # load sum
mov rax, rdx
.L8:
lea rsi, [rax+36+rax*8]
lea rdx, [rax+1]
lea rax, [rax+9+rsi] # prepare some stuff to be added
add rax, rcx # first use of the RCX load result (sum)
mov QWORD PTR [rsp-8], rax # store sum
cmp rdx, 100000
jne .L9
ret
So the loop-carried dep chain through the store/reload only includes an add: 1 cycle of latency instead of 3.
I assume your performance ratios were something like 3:4 or so, from roughly 5 cycles of store-forwarding plus 1 (6 total) vs. plus 3 (8 total).
See What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? for more details.
The compiler could have spent more instructions in test1 to reduce the critical-path latency to 1 cycle there, instead of folding more work into an lea on the critical path. For a loop running 100k iterations, this is pretty much a missed optimization. But I assume the optimizer isn't tuned for an artificially-induced spill/reload from inline asm; normally it would only have to introduce that store/reload if it ran out of registers from doing a lot of work in a loop, and that many different live values would usually mean there was some instruction-level parallelism.
GCC makes better asm from simpler source on Quick-bench
@TedLyngmo linked the code on Quick-bench without -fno-tree-vectorize, using benchmark::DoNotOptimize(sum); which only forces GCC to materialize the value in a register, without blocking constant-propagation through it or as many other optimizations. That's unlike taking its address and telling GCC it was potentially globally visible, like the current custom asm does.
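A sketch of what that Quick-bench version presumably looks like (loop shape and the benchmark name are assumptions; benchmark::DoNotOptimize is Google Benchmark's standard optimization barrier):

#include <benchmark/benchmark.h>

static void test1_qb(benchmark::State& state) {   // hypothetical name
    for (auto _ : state) {
        long sum = 0;
        for (long i = 0; i < 100000; i++) {
            for (long j = 0; j < 4; j++)
                sum += i + j;
            benchmark::DoNotOptimize(sum);  // sum must exist somewhere, but there's no forced
                                            // store/reload through a globally-visible address
        }
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(test1_qb);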
The inner loop body is just add %rdx,%rax / add $0x4,%rdx (plus cmp rdx + jne as the loop branch), if you look at the asm on Quick-bench. Or with rdx += 10 for the other loop. So: same loops, different constants. Same performance, of course.
The current source is essentially compiling to asm that does
for (int i = 0 ; i < iterations ; i++){
    sum += 8*i + 1234;
    force_store_reload(&sum);
}
But if you actually write it that way (https://godbolt.org/z/4ja38or9j), we get asm like quick-bench, except for keeping the value in memory instead of a register. (So about 6x slower.)
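Spelled out (force_store_reload is not a real function; here it's a hypothetical wrapper around the same asm barrier the question uses), that would be something like:

static inline void force_store_reload(long *p) {
    asm volatile("" :: "g"(p) : "memory");  // make *p potentially globally visible
}

long test_rewritten(long iterations) {      // hypothetical name
    long sum = 0;
    for (long i = 0; i < iterations; i++) {
        sum += 8*i + 1234;
        force_store_reload(&sum);
    }
    return sum;
}

which GCC turns into a memory-destination add loop like the one below (exact constants depend on the multiplier):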
.L6:
add QWORD PTR [rsp-8], rax # memory destination add
add rax, 4
cmp rax, 401234
jne .L6
It seems to be a missed-optimization bug that GCC doesn't compile your existing source to that; specifically, it misses the strength-reduction from 8*i re-evaluated for every i into tmp += 8.
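Hand-applying that strength reduction to the source (names and constants are placeholders) is the transform GCC misses on your original code:

long sum = 0;
long term = 1234;                              // current value of 8*i + 1234
for (long i = 0; i < iterations; i++) {
    sum += term;
    term += 8;                                 // tmp += 8 instead of recomputing 8*i
    asm volatile("" :: "g"(&sum) : "memory");  // keep the forced store/reload
}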
BTW, it looks like omitting -fno-tree-vectorize makes your original test1() compile even worse. It starts without jumping into the middle of the loop, but it has a longer dep chain.
#gcc 10.3 -O3 (without -fno-tree-vectorize)
test1_orig():
mov QWORD PTR [rsp-8], 6 # init sum
lea rsi, [rsp-8] # operand for inline asm
mov eax, 1
.L2:
mov rdx, QWORD PTR [rsp-8] # load sum
mov rcx, rax
add rdx, rax # used: 1 cycle latency
add rax, 1
add rdx, rax # used: 1 cycle latency
lea rdx, [rdx+5+rcx*2] # used: 3 cycle latency
mov QWORD PTR [rsp-8], rdx # store
cmp rax, 100000
jne .L2
ret