Update: Skylake store/reload latency is as low as 3c, but only if the timing is right. Consecutive loads involved in a store-forwarding dependency chain that are naturally spaced out by 3 or more cycles will experience the faster latency (e.g. with 4 imul eax,eax
in the loop, mov [rdi], eax
/ mov eax, [rdi]
only takes the cycle count up from 12 to 15 cycles per iteration.) but when the loads are allow to execute more densely than that, some type of contention is suffered and you get about 4.5 cycles per iteration. The non-integer average throughput is also a big clue that there's something unusual.
I saw the same effect for 32B vectors (best case 6.0c, back-to-back 6.2 to 6.9c), but 128b vectors were always around 5.0c. See details on Agner Fog's forum.
Update2: Adding a redundant assignment speeds up code when compiled without optimization and a 2013 blog post indicate that this effect is present on all Sandybridge-family CPUs.
The back-to-back (worst case) store-forwarding latency on Skylake is 1 cycle better than on previous uarches, but the variability when the load can't execute right away is similar.
With the right (mis-)alignment, the extra call
in the loop can actually help Skylake observe lower store-forwarding latency from push to pop. I was able to reproduce this with perf counters (Linux perf stat -r4
), using YASM. (I've heard it's less convenient to use perf counters on Windows, and I don't have a Windows dev machine anyway. Fortunately the OS isn't really relevant to the answer; anyone should be able to reproduce my perf-counter results on Windows with VTune or something.)
I saw the faster times at offset = 0..10, 37, 63-74, 101, and 127 following an align 128
at the spot specified in the question. L1I cache lines are 64B, and the uop-cache is cares about 32B boundaries. It looks alignment relative to a 64B boundary is all that matters.
The no-call loop is a steady 5 cycles always, but the call
loop can get down to 4c per iteration from its usual almost-exactly-5 cycles. I saw slower-than-usual performance at offset=38 (5.68 +- 8.3% cycles per iteration). There are small glitches at other points, like 5.17c +- 3.3%, according to perf stat -r4
(which does 4 runs and averaging).
It seems to be an interaction between the front-end not queueing up so many uops ahead, causing the back end to have lower latency for store-forwarding from push to pop.
IDK if reusing the same address repeatedly for store-forwarding makes it slower (with multiple store-address uops already executed ahead of the corresponding store-data uops), or what.
Test code: bash
shell loop to build & profile the asm with every different offset:
(set -x; for off in {0..127};do
asm-link -m32 -d call-tight-loop.asm -DFUNC=normal_call -DOFFSET=$off &&
ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults:u,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,dsb2mite_switches.penalty_cycles -r4 ./call-tight-loop;
done ) |& tee -a call-tight-loop.call.offset-log
(set -x)
in a subshell is a handy way to log commands along with their output when redirecting to a log file.
asm-link
is a script that runs yasm -felf32 -Worphan-labels -gdwarf2 call-tight-loop.asm "$@" && ld -melf_i386 -o call-tight-loop call-tight-loop.o
, then runs objdumps -drwC -Mintel
on the result.
NASM / YASM Linux test program (assembles into a complete static binary that runs the loop and then exits, so you can profile the whole program.) Direct port of the OP's FASM source, with no optimizations to the asm.
CPU p6 ; YASM directive. For NASM, %use smartalign.
section .text
iter equ 100000000
%ifndef OFFSET
%define OFFSET 0
%endif
align 128
;;offset equ 23 ; this is the number I am changing
times OFFSET nop
times 16 nop
no_call:
mov ecx, iter
.loop:
push ecx
pop ecx
dec ecx
cmp ecx, 0
jne .loop
ret
times 55 nop
normal_function:
ret
times 58 nop
normal_call:
mov ecx, iter
.loop:
push ecx
call normal_function
pop ecx
dec ecx
cmp ecx, 0
jne .loop
ret
%ifndef FUNC
%define FUNC no_call
%endif
align 64
global _start
_start:
call FUNC
mov eax,1 ; __NR_exit from /usr/include/asm/unistd_32.h
xor ebx,ebx
int 0x80 ; sys_exit(0), 32-bit ABI
Sample output from a fast call
run:
+ asm-link -m32 -d call-tight-loop.asm -DFUNC=normal_call -DOFFSET=3
...
080480d8 <normal_function>:
80480d8: c3 ret
...
08048113 <normal_call>:
8048113: b9 00 e1 f5 05 mov ecx,0x5f5e100
08048118 <normal_call.loop>:
8048118: 51 push ecx
8048119: e8 ba ff ff ff call 80480d8 <normal_function>
804811e: 59 pop ecx
804811f: 49 dec ecx
8048120: 83 f9 00 cmp ecx,0x0
8048123: 75 f3 jne 8048118 <normal_call.loop>
8048125: c3 ret
...
Performance counter stats for './call-tight-loop' (4 runs):
100.646932 task-clock (msec) # 0.998 CPUs utilized ( +- 0.97% )
0 context-switches # 0.002 K/sec ( +-100.00% )
0 cpu-migrations # 0.000 K/sec
1 page-faults:u # 0.010 K/sec
414,143,323 cycles # 4.115 GHz ( +- 0.56% )
700,193,469 instructions # 1.69 insn per cycle ( +- 0.00% )
700,293,232 uops_issued_any # 6957.919 M/sec ( +- 0.00% )
1,000,299,201 uops_executed_thread # 9938.695 M/sec ( +- 0.00% )
83,212,779 idq_mite_uops # 826.779 M/sec ( +- 17.02% )
5,792 dsb2mite_switches_penalty_cycles # 0.058 M/sec ( +- 33.07% )
0.100805233 seconds time elapsed ( +- 0.96% )
Old answer before noticing the variable store-forwarding latency
You push/pop your loop counter, so everything except the call
and ret
instructions (and the cmp
/jcc
) are part of the critical path loop-carried dependency chain involving the loop counter.
You'd expect that pop
would have to wait for updates to the stack pointer by call
/ret
, but the stack engine handles those updates with zero latency. (Intel since Pentium-M, AMD since K10, according to Agner Fog's microarch pdf, so I'm assuming your CPU has one, even though you didn't say anything about what CPU microarchitecture you ran your tests on.)
The extra call
/ret
still need to execute, but out-of-order execution can keep the critical path instructions running at their max throughput. Since this includes the latency of a store->load forwarding from push/pop + 1 cycle for dec
, this is not high throughput on any CPU, and it's a surprise that the front-end can ever be a bottleneck with any alignment.
push
->pop
latency is 5 cycles on Skylake, according to Agner Fog, so on that uarch your loop can only run at best one iteration per 6 cycles.
This is plenty of time for out-of-order-execution to run the call
and ret
instructions. Agner lists a max throughput for call
of one per 3 cycles, and ret
at one per 1 cycle. Or on AMD Bulldozer, 2 and 2. His tables don't list anything about the throughput of a call
/ret
pair, so IDK whether those can overlap or not. On AMD Bulldozer, store/reload latency with mov
is 8 cycles. I assume it's about the same with push/pop.
It seems that different alignments for the top of the loop (i.e. no_call.loop_start:
) are causing front-end bottlenecks. The call
version has 3 branches per iteration: the call, the ret, and the loop-branch. Note that the ret
's branch target is the instruction right after the call
. Each of these potentially disrupts the front-end. Since you're seeing an actual slow-down in practice, we must be seeing more than 1 cycle delay per branch. Or for the no_call version, a single fetch/decode bubble worse than about 6 cycles, leading to an actual wasted cycle in issuing uops into the out-of-order part of the core. That's weird.
It's too complicated to guess about what the actual microarchitectural details are for every possible uarch, so let us know what CPU you tested on.
I will mention though that push
/pop
inside a loop on Skylake stops it from issuing from the Loop Stream Detector, and has to be re-fetched from the uop cache every time. Intel's optimization manual says that for Sandybridge, a mismatched push/pop inside a loop stops it from using the LSD. That implies that it can use the LSD for loops with balanced push/pop. In my testing, that's not the case on Skylake (using the lsd.uops
performance counter), but I haven't seen any mention of whether that was a change, or whether SnB was actually like that, too.
Also, unconditional branches always end a uop-cache line. It's possible that with normal_function:
in the same naturally-aligned 32B chunk of machine code as the call
and jne
, maybe the block of code doesn't fit in the uop cache. (Only 3 uop-cache lines can cache decoded uops for a single 32B chunk of x86 code). But that wouldn't explain the possibility of problems for the no_call loop, so you're probably not running on an Intel SnB-family microarchitecture.
(Update, yes, the loop sometimes runs mostly from legacy decode (idq.mite_uops
), but usually not exclusively. dsb2mite_switches.penalty_cycles
is usually ~8k, and probably only happens on timer interrupts. The runs where the call
loop runs faster seem to be correlated with lower idq.mite_uops
, but it's still 34M +- 63% for the offset=37 case where the 100M iterations took 401M cycles.)
This is really one of those "don't do that" cases: inline tiny functions instead of calling them from inside very tight loops.
You might see different results if you push
/pop
a register other than your loop counter. That would separate the push/pop from the loop counter, so there'd be 2 separate dependency chains. It should speed up both the call and no_call versions, but maybe no equally. It could just make a front-end bottleneck more obvious.
You should see a huge speedup if you push edx
but pop eax
, so the push/pop instructions don't form a loop-carried dependency chain. Then the extra call
/ret
would definitely be a bottleneck.
Side-note: dec ecx
already sets ZF the way you want, so you could have just used dec ecx / jnz
. Also, cmp ecx,0
is less efficient than test ecx,ecx
(larger code size and can't macro-fuse on as many CPUs). Anyway, totally irrelevant to the question about relative performance of your two loops. (Your lack of an ALIGN
directive between functions means that changing the first one would have changed the alignment of the loop branch in the 2nd, but you already explored different alignments.)
clock
is not the best choice to perform this analysis. – Bankableclock
is fine. Try looking at the resulting assembly of the compiled C code. Also it looks (judging the fact the linking order matters) that some link-time optimizations are taking place. – Bankableno_call
first aligned on 0x00401635 andnormal_call
on 0x00401644 – Humicnormal_call
would be on a 4 byte alignment – Humicjne @b
) is important. Unfortunately you didn't name them explicitly. Theno_call
andnormal_call
are used just once, so any unaligned penalty there is not important (far beyond the [im]precision ofclock
timing). And asnormal_function
is called extensively, having aligned that one MAY help too. Usually 4 or 8 boundary is enough, but feel free to experiment up to 64 (I think the modern cache lines are 32B long? But 64 is for sure enough for anything). – Abdullahno_call
+normal_call
first, to both ramp-up the CPU freq. and make the cache state similar for both variants (pre-cached). – Abdullahinc eax
into no_call-loop and normal_function? – Hutiter equ 1000000000
to run 10 times longer. I'm getting about 1.55 seconds run time for both functions. I triedalign 16
before the loops, but it didn't make a significant different. The entire program fits inside the code cache, which may be why aligning didn't help. – Poilu162.168
and131.578
seconds. – Humicalign 16
directive, then I'd suggest porting your code to NASM or YASM for playing with alignment stuff. – Eyelashi7-6700k
ori5-4310U
. (Just saying "i5" or "i3" would be useless). – Eyelashi5-7200U
, would you mind verifying that what you said applies for that? Also I did mean 64 byte alignment – Humic