4 and 8 are scale factors that x86 addressing modes support, tempting GCC into using a slow-LEA on the critical-path dependency chain when adding 4*i or 8*i to the sum, along with the constant sum of 1..4 or 1..8 (either way just a constant; it's irrelevant what it is). Apparently that's also part of the dep chain for 16. And you used inline asm to force the sum dep chain to include a store/reload.
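For reference, a hedged guess at the shape of source that produces this (the inner-loop bound, the i + j body, and the function name are assumptions, not your actual code): GCC fully unrolls and folds the inner loop, leaving only the 4*i-plus-constant you see in the asm below.

#include <cstdint>

int64_t test1_sketch(int64_t iterations) {         // hypothetical name
    int64_t sum = 0;
    for (int64_t i = 0; i < iterations; i++) {
        for (int64_t j = 0; j < 4; j++)            // 8 (or 16) in the other functions
            sum += i + j;                          // folds to sum += 4*i + 6
        asm volatile("" :: "g"(&sum) : "memory");  // forces sum to be stored and reloaded
    }
    return sum;
}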
You wrote: "Analyzing the assembly, I am observing that the inner loops in both functions are eliminated and replaced by a few arithmetic operations done as operands of lea. So in this case, it would make sense if both functions ran at the same speed."
They're both fast, but the different multiples of i take different instructions, so there's little reason to assume they'd be the same. The interesting thing here is that the one with more total instructions is faster, because it has a shorter dependency chain through sum.
And you forced a store/reload of it with the somewhat-clunky asm volatile("" :: "g"(&sum) : "memory"), instead of just requiring the compiler to have the value in a register with asm volatile("" : "+r"(sum)). That dep chain therefore includes store-forwarding latency (typically 3 to 5 cycles), making it a bigger bottleneck than front-end or ALU throughput of the independent work.
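A minimal sketch of that register-only alternative (loop body assumed, same as the sketch above):

for (int64_t i = 0; i < iterations; i++) {
    for (int64_t j = 0; j < 4; j++)
        sum += i + j;
    asm volatile("" : "+r"(sum));  // sum merely has to be live in a register here:
                                   // no store/reload, so no store-forwarding latency
}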
test1():
xor eax, eax # i = 0
xor ecx, ecx # sum = 0
lea rsi, [rsp-8] # operand for inline asm
jmp .L3
.L5:
mov rcx, QWORD PTR [rsp-8] # load sum
mov rax, rdx # i = i+1
.L3:
lea rdx, [rax+1] # i+1, leaving the old value in RAX
lea rax, [rcx+6+rax*4] # sum += 4*i + 6. (1+2+3)
mov QWORD PTR [rsp-8], rax # store sum
cmp rdx, 100000
jne .L5
ret
An LEA with 3 components (two + operations) in the addressing mode has 3 cycle latency on your Skylake-derived CPU. See x86 Multiplication with 3: IMUL vs SHL + ADD. So the loop-carried dependency chain is a slow LEA (3 cycles) between the load and the store.
test2():
xor eax, eax
xor ecx, ecx
lea rdi, [rsp-8] # operand for inline asm
jmp .L8
.L9:
mov rcx, QWORD PTR [rsp-8] # load sum
mov rax, rdx
.L8:
lea rsi, [rax+36+rax*8]
lea rdx, [rax+1]
lea rax, [rax+9+rsi] # prepare some stuff to be added
add rax, rcx # first use of the RCX load result (sum)
mov QWORD PTR [rsp-8], rax # store sum
cmp rdx, 100000
jne .L9
ret
So the loop-carried dep chain through the store/reload only includes an add: 1 cycle of latency instead of 3.
I assume your performance ratios were something like 3:4 or so, from roughly 5 cycles of store-forwarding plus 1 (6 total) vs. plus 3 (8 total).
See What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? for more details.
The compiler could have spent more instructions in test1 to reduce the critical-path latency to 1 cycle there, instead of folding more work into an lea on the critical path. For a loop running 100k iterations, this is pretty much a missed optimization. But I assume the optimizer isn't tuned for an artificially-induced spill/reload from inline asm; normally it would only have to introduce that store/reload if it ran out of registers from doing a lot of work in a loop, and that many different live values would usually mean there was some instruction-level parallelism.
GCC makes better asm from simpler source on Quick-bench
@TedLyngmo linked the code on Quick-bench without -fno-tree-vectorize, using benchmark::DoNotOptimize(sum); which only forces GCC to materialize the value in a register, without blocking constant-propagation through it or as many other optimizations. That's unlike taking its address and telling GCC it was potentially globally visible, like the current custom asm does.
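A sketch of what that Quick-bench version presumably looks like (loop shape and the benchmark name are assumptions; benchmark::DoNotOptimize is Google Benchmark's standard optimization barrier):

#include <benchmark/benchmark.h>

static void test1_qb(benchmark::State& state) {   // hypothetical name
    for (auto _ : state) {
        long sum = 0;
        for (long i = 0; i < 100000; i++) {
            for (long j = 0; j < 4; j++)
                sum += i + j;
            benchmark::DoNotOptimize(sum);  // sum must exist somewhere, but there's no forced
                                            // store/reload through a globally-visible address
        }
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(test1_qb);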
The inner loop body is just add %rdx,%rax / add $0x4,%rdx (plus cmp rdx + jne as the loop branch), if you look at the asm on Quick-bench. Or with rdx += 10 for the other loop. So: same loops, different constants. Same performance, of course.
The current source is essentially compiling to asm that does
for (int i = 0 ; i < iterations ; i++){
    sum += 8*i + 1234;
    force_store_reload(&sum);
}
But if you actually write it that way (https://godbolt.org/z/4ja38or9j), we get asm like quick-bench, except for keeping the value in memory instead of a register. (So about 6x slower.)
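Spelled out (force_store_reload is not a real function; here it's a hypothetical wrapper around the same asm barrier the question uses), that would be something like:

static inline void force_store_reload(long *p) {
    asm volatile("" :: "g"(p) : "memory");  // make *p potentially globally visible
}

long test_rewritten(long iterations) {      // hypothetical name
    long sum = 0;
    for (long i = 0; i < iterations; i++) {
        sum += 8*i + 1234;
        force_store_reload(&sum);
    }
    return sum;
}

which GCC turns into a memory-destination add loop like the one below (exact constants depend on the multiplier):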
.L6:
add QWORD PTR [rsp-8], rax # memory destination add
add rax, 4
cmp rax, 401234
jne .L6
It seems to be a missed-optimization bug that GCC doesn't compile your existing source to that; specifically, it misses the strength-reduction from 8*i re-evaluated for every i into tmp += 8.
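Hand-applying that strength reduction to the source (names and constants are placeholders) is the transform GCC misses on your original code:

long sum = 0;
long term = 1234;                              // current value of 8*i + 1234
for (long i = 0; i < iterations; i++) {
    sum += term;
    term += 8;                                 // tmp += 8 instead of recomputing 8*i
    asm volatile("" :: "g"(&sum) : "memory");  // keep the forced store/reload
}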
BTW, it looks like omitting -fno-tree-vectorize makes your original test1() compile even worse. It starts without jumping into the middle of the loop, but it has a longer dep chain.
#gcc 10.3 -O3 (without -fno-tree-vectorize)
test1_orig():
mov QWORD PTR [rsp-8], 6 # init sum
lea rsi, [rsp-8] # operand for inline asm
mov eax, 1
.L2:
mov rdx, QWORD PTR [rsp-8] # load sum
mov rcx, rax
add rdx, rax # used: 1 cycle latency
add rax, 1
add rdx, rax # used: 1 cycle latency
lea rdx, [rdx+5+rcx*2] # used: 3 cycle latency
mov QWORD PTR [rsp-8], rdx # store
cmp rax, 100000
jne .L2
ret