Loop with function call faster than an empty loop

I linked some assembly with some C to test the cost of a function call, with the following assembly and C source (using fasm and gcc respectively).

assembly:

format ELF

public no_call as "_no_call"
public normal_call as "_normal_call"

section '.text' executable

iter equ 100000000

no_call:
    mov ecx, iter
@@:
    push ecx
    pop ecx
    dec ecx
    cmp ecx, 0
    jne @b
    ret

normal_function:
    ret

normal_call:
    mov ecx, iter
@@:
    push ecx
    call normal_function
    pop ecx
    dec ecx
    cmp ecx, 0
    jne @b
    ret

c source:

#include <stdio.h>
#include <time.h>

extern int no_call();
extern int normal_call();

int main()
{
    clock_t ct1, ct2;

    ct1 = clock();
    no_call();
    ct2 = clock();
    printf("\n\n%d\n", ct2 - ct1);

    ct1 = clock();
    normal_call();
    ct2 = clock();
    printf("%d\n", ct2 - ct1);

    return 0;
}

The results I got were surprising. First of all, the speed depended on the order in which I linked the object files. If I linked as gcc intern.o extern.o, a typical output is

162
181

But linking in the opposite order gcc extern.o intern.o, I got an output more like:

162
130

That they are different was very surprising, but it is not the question I am asking. (relevant question here)

The question I am asking is: how is it that, in the second run, the loop with the function call was faster than the loop without one? How was the cost of calling a function apparently negative?

Edit: Just to mention some of the things tried in the comments:

  • In the compiled machine code the function calls were not optimized away.
  • Adjusting the alignment of the functions and loops to be on everything from 4 to 64 byte boundaries did not speed up no_call, though some alignments did slow down normal_call
  • Giving the CPU/OS a chance to warm up by calling the functions multiple times rather than just once had no noticeable effect on the measured times, nor did changing the order of the calls or running them separately
  • Running for longer does not affect the ratio; for example, running 1000 times longer I got run times of 162.168 and 131.578 seconds

Additionally, after modifying the assembly code to align on bytes, I tested giving the set of functions an additional offset and came to some more strange conclusions. Here is the updated code:

format ELF

public no_call as "_no_call"
public normal_call as "_normal_call"

section '.text' executable

iter equ 100000000

offset equ 23 ; this is the number I am changing
times offset nop

times 16 nop
no_call:
    mov ecx, iter
no_call.loop_start:
    push ecx
    pop ecx
    dec ecx
    cmp ecx, 0
    jne no_call.loop_start
    ret

times 55 nop
normal_function:
    ret


times 58 nop
normal_call:
    mov ecx, iter
normal_call.loop_start:
    push ecx
    call normal_function
    pop ecx
    dec ecx
    cmp ecx, 0
    jne normal_call.loop_start
    ret

I had to manually (and non-portably) force the 64-byte alignment, since FASM does not support more than 4-byte alignment for the executable section, at least on my machine. Offsetting the program by offset bytes, here is what I found.

if (20 <= offset mod 128 <= 31) then we get an output of (approximately):

162
131

else

162 (+/- 10)
162 (+/- 10)

Not sure at all what to make of it, but that's what I've discovered so far

Edit 2:

Another thing I noticed is that if you remove push ecx and pop ecx from both functions, the output becomes

30
125

which indicates that that is the most expensive part of it. The stack alignment is the same both times, so that is not the reason for the discrepancy. My best guess is that somehow the hardware is optimized to expect a call after a push, or something similar, but I do not know of anything like that.
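
For reference, this is what the two loops look like with the push ecx and pop ecx removed (the nop padding and normal_function are unchanged):

no_call:
    mov ecx, iter
no_call.loop_start:
    dec ecx
    cmp ecx, 0
    jne no_call.loop_start
    ret

normal_call:
    mov ecx, iter
normal_call.loop_start:
    call normal_function
    dec ecx
    cmp ecx, 0
    jne normal_call.loop_start
    ret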

Humic answered 1/8, 2017 at 15:54 Comment(24)
Were these average results, or just a singular run each?Bogtrotter
clock is not the best choice to perform this analysis.Bankable
Not average, but I have run each 20+ times with the same resultsHumic
@Eugene Sh. What would you recommend instead?Humic
Well, on second thought I guess clock is fine. Try looking at the resulting assembly of the compiled C code. Also it looks (judging from the fact that the linking order matters) like some link-time optimizations are taking place.Bankable
I did, there was no optimization of any kind. Just call to clock, saving variable, call to method, then call to clock again for both functionsHumic
Alignment of loop is the same? Check in debugger, where the loop address lands after loading the binary.Abdullah
@Abdullah Just to make sure I'm doing this right, what address is important? Is it where the JMP instruction goes to? or somewhere else. If it is that, neither are aligned well, with the no_call first aligned on 0x00401635 and normal_call on 0x00401644Humic
actually, I guess normal_call would be on a 4 byte alignmentHumic
The address where most of the jumps land (target of jne @b) is important. Unfortunately you didn't name them explicitly. The no_call and normal_call are used just once, so any unaligned penalty there is not important (far beyond the [im]precision of clock timing). And as normal_function is called extensively, having aligned that one MAY help too. Usually 4 or 8 boundary is enough, but feel free to experiment up to 64 (I think the modern cache lines are 32B long? But 64 is for sure enough for anything).Abdullah
Another thing skewing the results may be dynamic change of CPU frequencies under load, maybe the no-call loop is understood as idle-loop and the CPU+OS does switch the freq. down, although I think this is quite unlikely to have such sophisticated code analysis in CPU. But you are skipping the warm-up phase, it may take the OS a while to detect the 100% CPU core usage before upping up the power, so maybe do one un-clocked run of both no_call + normal_call first, to both ramp-up the CPU freq. and make the cache state similar for both variants (pre-cached).Abdullah
Interesting question. What happens if you add inc eax into no_call-loop and normal_function?Hut
@Hut the no_call time remains essentially unchanged and the normal_call time increases to about 210Humic
@Abdullah The warmup does not seem to have an effect, I get the same result with running the loops twice as you suggested. I'm working on getting the alignment right, but I'm new to assembly and it's taking me a whileHumic
On everything from a 4 to a 64 byte alignment, the speed of no_call never decreased, though some alignments caused normal_call to increase. It does not seem to be an alignment issueHumic
@Humic - I tried the same code with Visual Studio / Windows. I added a zero, changing to iter equ 1000000000 to run 10 times longer. I'm getting about 1.55 seconds run time for both functions. I tried align 16 before the loops, but it didn't make a significant difference. The entire program fits inside the code cache, which may be why aligning didn't help.Poilu
1.55 sec is too short to make any good statistical conclusion. Can you increase it more? Can you split it into 2 separate programs and run them sequentially? Does scheduling priority make any difference?Abducent
@Abducent Installed 64 bit version of Mingw to test longer times (can't store that large of value in ecx). I also ran individually as you suggested. I got 162.168 and 131.578 seconds.Humic
@Abducent Also, how could the two possibly have different scheduling priorities, especially when run as part of the same program?Humic
"force the 64 bit alignment": did you mean 64 byte alignment? You earlier said something about modifying the asm to "align on bytes", which makes no sense because x86 machine code always uses whole bytes. Everything is always 1-byte aligned; it's not a bitstream. Anyway, if FASM doesn't support an align 16 directive, then I'd suggest porting your code to NASM or YASM for playing with alignment stuff.Eyelash
What CPU are you testing on? Intel Haswell? Skylake? AMD Bulldozer? Ryzen? Different microarchitectures perform very differently for microbenchmarks like this. If you don't know what uarch, then just say the exact model-number of your CPU, like i7-6700k or i5-4310U. (Just saying "i5" or "i3" would be useless).Eyelash
@PeterCordes my specific cpu is an i5-7200U, would you mind verifying that what you said applies for that? Also I did mean 64 byte alignmentHumic
@rtpax: Interesting. That's a Kaby Lake, which is identical clock-for-clock to Skylake, so I can try on my desktop and should be able to reproduce your results. Apparently my conclusion was wrong that you wouldn't see all of these effects on a CPU with a uop cache, unless turbo / frequency-scaling effects changed the time but not the clock-cycle count.Eyelash
I was able to reproduce this. updated my answer.Eyelash

Update: Skylake store/reload latency is as low as 3c, but only if the timing is right. Consecutive loads involved in a store-forwarding dependency chain that are naturally spaced out by 3 or more cycles will experience the faster latency (e.g. with 4 imul eax,eax in the loop, mov [rdi], eax / mov eax, [rdi] only takes the cycle count up from 12 to 15 cycles per iteration), but when the loads are allowed to execute more densely than that, some type of contention is suffered and you get about 4.5 cycles per iteration. The non-integer average throughput is also a big clue that there's something unusual.

I saw the same effect for 32B vectors (best case 6.0c, back-to-back 6.2 to 6.9c), but 128b vectors were always around 5.0c. See details on Agner Fog's forum.
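
A minimal sketch (NASM, 64-bit Linux, assembled with nasm -felf64 and linked with ld) of the kind of test loop described above: a store/reload kept in a dependency chain with a few imul instructions to space out the loads. The registers, scratch location, and iteration count are illustrative, not the exact test code:

section .text
global _start
_start:
    mov  eax, 1
    mov  rdx, 100000000
    lea  rdi, [rsp-16]       ; scratch memory in the red zone below the stack pointer
align 64
.loop:
    times 4 imul eax, eax    ; ~12 cycles of dependent ALU work (3c each on Skylake)
    mov  [rdi], eax          ; store ...
    mov  eax, [rdi]          ; ... reloaded right away: store-forwarding is on the critical path
    dec  rdx
    jnz  .loop

    mov  eax, 60             ; __NR_exit (64-bit ABI)
    xor  edi, edi
    syscall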

Update2: Adding a redundant assignment speeds up code when compiled without optimization and a 2013 blog post indicate that this effect is present on all Sandybridge-family CPUs.

The back-to-back (worst case) store-forwarding latency on Skylake is 1 cycle better than on previous uarches, but the variability when the load can't execute right away is similar.


With the right (mis-)alignment, the extra call in the loop can actually help Skylake observe lower store-forwarding latency from push to pop. I was able to reproduce this with perf counters (Linux perf stat -r4), using YASM. (I've heard it's less convenient to use perf counters on Windows, and I don't have a Windows dev machine anyway. Fortunately the OS isn't really relevant to the answer; anyone should be able to reproduce my perf-counter results on Windows with VTune or something.)

I saw the faster times at offset = 0..10, 37, 63-74, 101, and 127 following an align 128 at the spot specified in the question. L1I cache lines are 64B, and the uop cache cares about 32B boundaries. It looks like alignment relative to a 64B boundary is all that matters.

The no-call loop always runs at a steady 5 cycles per iteration, but the call loop can get down to 4c per iteration from its usual almost-exactly-5 cycles. I saw slower-than-usual performance at offset=38 (5.68 +- 8.3% cycles per iteration). There are small glitches at other points, like 5.17c +- 3.3%, according to perf stat -r4 (which does 4 runs and averages them).

It seems to be an interaction with the front-end not queueing up so many uops ahead of time, which lets the back end see lower latency for the store-forwarding from push to pop.

IDK if reusing the same address repeatedly for store-forwarding makes it slower (with multiple store-address uops already executed ahead of the corresponding store-data uops), or what.


Test code: bash shell loop to build & profile the asm with every different offset:

(set -x; for off in {0..127};do 
    asm-link -m32 -d call-tight-loop.asm -DFUNC=normal_call -DOFFSET=$off && 
    ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults:u,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,dsb2mite_switches.penalty_cycles -r4 ./call-tight-loop;
done ) |& tee -a call-tight-loop.call.offset-log

(set -x) in a subshell is a handy way to log commands along with their output when redirecting to a log file.

asm-link is a script that runs yasm -felf32 -Worphan-labels -gdwarf2 call-tight-loop.asm "$@" && ld -melf_i386 -o call-tight-loop call-tight-loop.o, then runs objdump -drwC -Mintel on the result.

NASM / YASM Linux test program (assembles into a complete static binary that runs the loop and then exits, so you can profile the whole program.) Direct port of the OP's FASM source, with no optimizations to the asm.

CPU p6    ; YASM directive.  For NASM, %use smartalign.
section .text
iter equ 100000000

%ifndef OFFSET
%define OFFSET 0
%endif

align 128
;;offset equ 23 ; this is the number I am changing
times OFFSET nop

times 16 nop
no_call:
    mov ecx, iter
.loop:
    push ecx
    pop ecx
    dec ecx
    cmp ecx, 0
    jne .loop
    ret

times 55 nop
normal_function:
    ret

times 58 nop
normal_call:
    mov ecx, iter
.loop:
    push ecx
    call normal_function
    pop ecx
    dec ecx
    cmp ecx, 0
    jne .loop
    ret

%ifndef FUNC
%define FUNC no_call
%endif

align 64
global _start
_start:
    call FUNC

    mov eax,1             ; __NR_exit from /usr/include/asm/unistd_32.h
    xor ebx,ebx
    int 0x80              ; sys_exit(0), 32-bit ABI

Sample output from a fast call run:

+ asm-link -m32 -d call-tight-loop.asm -DFUNC=normal_call -DOFFSET=3
...

080480d8 <normal_function>:
 80480d8:       c3                      ret    
...

08048113 <normal_call>:
 8048113:       b9 00 e1 f5 05          mov    ecx,0x5f5e100
08048118 <normal_call.loop>:
 8048118:       51                      push   ecx
 8048119:       e8 ba ff ff ff          call   80480d8 <normal_function>
 804811e:       59                      pop    ecx
 804811f:       49                      dec    ecx
 8048120:       83 f9 00                cmp    ecx,0x0
 8048123:       75 f3                   jne    8048118 <normal_call.loop>
 8048125:       c3                      ret    

 ...

 Performance counter stats for './call-tight-loop' (4 runs):

    100.646932      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.97% )
             0      context-switches          #    0.002 K/sec                    ( +-100.00% )
             0      cpu-migrations            #    0.000 K/sec                  
             1      page-faults:u             #    0.010 K/sec                  
   414,143,323      cycles                    #    4.115 GHz                      ( +-  0.56% )
   700,193,469      instructions              #    1.69  insn per cycle           ( +-  0.00% )
   700,293,232      uops_issued_any           # 6957.919 M/sec                    ( +-  0.00% )
 1,000,299,201      uops_executed_thread      # 9938.695 M/sec                    ( +-  0.00% )
    83,212,779      idq_mite_uops             #  826.779 M/sec                    ( +- 17.02% )
         5,792      dsb2mite_switches_penalty_cycles #    0.058 M/sec                    ( +- 33.07% )

   0.100805233 seconds time elapsed                                          ( +-  0.96% )

Old answer before noticing the variable store-forwarding latency

You push/pop your loop counter, so everything except the call and ret instructions (and the cmp/jcc) are part of the critical path loop-carried dependency chain involving the loop counter.

You'd expect that pop would have to wait for updates to the stack pointer by call/ret, but the stack engine handles those updates with zero latency. (Intel since Pentium-M, AMD since K10, according to Agner Fog's microarch pdf, so I'm assuming your CPU has one, even though you didn't say anything about what CPU microarchitecture you ran your tests on.)

The extra call/ret still need to execute, but out-of-order execution can keep the critical path instructions running at their max throughput. Since this includes the latency of a store->load forwarding from push/pop + 1 cycle for dec, this is not high throughput on any CPU, and it's a surprise that the front-end can ever be a bottleneck with any alignment.

push->pop latency is 5 cycles on Skylake, according to Agner Fog, so on that uarch your loop can only run at best one iteration per 6 cycles. This is plenty of time for out-of-order execution to run the call and ret instructions. Agner lists a max throughput for call of one per 3 cycles, and ret at one per 1 cycle. Or on AMD Bulldozer, 2 and 2. His tables don't list anything about the throughput of a call/ret pair, so IDK whether those can overlap or not. On AMD Bulldozer, store/reload latency with mov is 8 cycles. I assume it's about the same with push/pop.
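
Roughly, the loop-carried dependency chain in the no_call loop looks like this (the annotations are approximate Skylake numbers based on the above):

no_call:
    mov ecx, iter
.loop:
    push ecx        ; store ecx at [esp-4]; the stack engine handles the esp update itself
    pop  ecx        ; reload it: ~5 cycle store-forwarding latency, loop-carried
    dec  ecx        ; +1 cycle on the same chain through ecx: ~6 cycles per iteration
    cmp  ecx, 0     ; not on the critical path; only feeds the branch
    jne  .loop      ; correctly predicted after the first iteration
    ret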

It seems that different alignments for the top of the loop (i.e. no_call.loop_start:) are causing front-end bottlenecks. The call version has 3 branches per iteration: the call, the ret, and the loop-branch. Note that the ret's branch target is the instruction right after the call. Each of these potentially disrupts the front-end. Since you're seeing an actual slow-down in practice, we must be seeing more than 1 cycle delay per branch. Or for the no_call version, a single fetch/decode bubble worse than about 6 cycles, leading to an actual wasted cycle in issuing uops into the out-of-order part of the core. That's weird.

It's too complicated to guess about what the actual microarchitectural details are for every possible uarch, so let us know what CPU you tested on.

I will mention though that push/pop inside a loop on Skylake stops it from issuing from the Loop Stream Detector, and has to be re-fetched from the uop cache every time. Intel's optimization manual says that for Sandybridge, a mismatched push/pop inside a loop stops it from using the LSD. That implies that it can use the LSD for loops with balanced push/pop. In my testing, that's not the case on Skylake (using the lsd.uops performance counter), but I haven't seen any mention of whether that was a change, or whether SnB was actually like that, too.

Also, unconditional branches always end a uop-cache line. It's possible that with normal_function: in the same naturally-aligned 32B chunk of machine code as the call and jne, maybe the block of code doesn't fit in the uop cache. (Only 3 uop-cache lines can cache decoded uops for a single 32B chunk of x86 code). But that wouldn't explain the possibility of problems for the no_call loop, so you're probably not running on an Intel SnB-family microarchitecture.

(Update, yes, the loop sometimes runs mostly from legacy decode (idq.mite_uops), but usually not exclusively. dsb2mite_switches.penalty_cycles is usually ~8k, and probably only happens on timer interrupts. The runs where the call loop runs faster seem to be correlated with lower idq.mite_uops, but it's still 34M +- 63% for the offset=37 case where the 100M iterations took 401M cycles.)

This is really one of those "don't do that" cases: inline tiny functions instead of calling them from inside very tight loops.


You might see different results if you push/pop a register other than your loop counter. That would separate the push/pop from the loop counter, so there'd be 2 separate dependency chains. It should speed up both the call and no_call versions, but maybe not equally. It could just make a front-end bottleneck more obvious.

You should see a huge speedup if you push edx but pop eax, so the push/pop instructions don't form a loop-carried dependency chain. Then the extra call/ret would definitely be a bottleneck.
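
A sketch of that experiment, changing only which register the call version pushes and which it pops (the label name is just for illustration):

normal_call_nodep:
    mov ecx, iter
.loop:
    push edx                ; store edx ...
    call normal_function
    pop  eax                ; ... but reload into eax: no loop-carried chain through memory
    dec  ecx                ; the only loop-carried dependency left is ecx itself
    cmp  ecx, 0
    jne  .loop
    ret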


Side-note: dec ecx already sets ZF the way you want, so you could have just used dec ecx / jnz. Also, cmp ecx,0 is less efficient than test ecx,ecx (larger code size and can't macro-fuse on as many CPUs). Anyway, totally irrelevant to the question about relative performance of your two loops. (Your lack of an ALIGN directive between functions means that changing the first one would have changed the alignment of the loop branch in the 2nd, but you already explored different alignments.)
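
For example, the loop tail could be written like this (a sketch; it doesn't change any of the conclusions above):

.loop:
    push ecx
    pop  ecx
    dec  ecx         ; dec already set ZF ...
    jnz  .loop       ; ... so no separate cmp (or test ecx,ecx) is needed at all
    ret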

Eyelash answered 6/8, 2017 at 7:4 Comment(13)
Somehow I always know when it is one of your answers -- before I even scroll far enough to see the author. :) (I guess because of all the good learning that occurs on the way down the page)Remonstrance
@DavidC.Rankin: I think I have a fairly distinctive writing style (and formatting), so that's a hint even apart from the information content. In a lot of answers, I bold some key points for the benefit of people skimming through, and most people don't do that either.Eyelash
Really interesting result about store-forwarding being as fast as 3 cycles in some scenarios. I can think of a couple of possibilities: perhaps the "will forward" predictor can only issue a prediction every N cycles (where N is 5 or 6 or something) or every N uops. A more likely possibility is that the store forwarding has two phases: searching the store buffer, and then the actual forwarding. Since the address [rdi] is unmodified and not part of the dep chain, the first part can overlap with the imul work and is hidden. If you are doing it back-to-back, then it can't overlap with itself.Pugliese
@BeeOnRope: searching the store-buffer seems most likely to me. Perhaps having a lot of store-address uops executed beyond the load uop means store-forwarding has to search past those entries to find the most recent store in program order. I guess the store buffer is big enough that they don't make it fixed-latency in SKL. Maybe it is in some earlier uarches. Hmm, that only explains it if the searching can start before the store-data uop runs, though, and then just wait for data or set up the forwarding, since imul latency won't stop store-address uops from executing ahead of store-data.Eyelash
According to David Kanter, "store-address" uops write the address into the store buffer.Eyelash
@rtpax: thanks for asking the question. Nice job getting started with the alignment experiment. I probably wouldn't have taken the time to try it myself if I didn't have your code as a starting-point, and your data to show that there was an alignment-sensitive effect. I still don't know exactly what difference the alignment is making; front-end alignment stuff gets really complicated really really fast, especially when it sometimes stops the loop from running from the uop-cache!Eyelash
@PeterCordes - I found that Skylake has store-forwarding latency as low as 3c even if they are "back to back", as long as they are timed/spaced out correctly. For example, the loop mov rcx, [rsp - 8] ; mov [rsp - 8], rcx ; times 9 nop ; dec rdi ; jne .top runs at 3 cycles per iteration on my Skylake and there is one store-forward per loop. If you remove the nops it gets much slower.Pugliese
You can also space them out with dependent instructions instead of nops - if the loads are spaced out exactly by 3 cycles, say using a series of add rsp, 0 on the address register, it also works out. I guess what happens is that if the store is "ready" it can be immediately forwarded to the load, but if the load tries too early, it has to retry, and the retry either doesn't occur every cycle, or it does and competes with resources needed by the stores. For the "too early" cases that are slower, the port4 (store) uops show 4.5x the expected count which is weird, as if the stores were retrying.Pugliese
I updated the top of your answer to reflect what I believe to be the case, mostly emphasizing that it's the load timing that matters (or the relative load and store timing), not really how frequently you do the store forwarding (in the fastest case you are doing one forwarding every 3 cycles). See if it's OK. Here's the change I added to uarch-bench to test this. It also shows there is no apparent throughput bottleneck for store forwarding: you can do 1 per cycle if you do enough in parallel (about 5).Pugliese
@BeeOnRope: thanks, yeah this answer is a good place to put your findings. I saw your interesting comments before that edit. With fewer in parallel, port 4 bottlenecks on "retries" that aren't ready yet?Eyelash
Well, the main (latency) test doesn't involve any store/reload in parallel necessarily: only one load/store pair, which presumably executes serially across iterations (since they have a dependency through memory). I see results like this. The later columns are uops dispatched to ports 2, 3, 4, 7. The port4 number is the weird one: showing numbers about 4.3x higher than expected. I don't know if that's "real" or just a quirk of the perf counters though. It is weird for the store to retry though? I would expect the load...Pugliese
Also interesting: the delay 3 case (the fastest one) shows 2 uops per iteration for port 4. However, this result is alignment or otherwise state dependent, since adding nops before the test (outside the loop) causes it to alternate between 1 and 2 (i.e., every time a nop is added it changes), although the timing is otherwise the same in either case. One possible cause is that the "too fast" cases trigger the memory speculation stuff to prevent the load from occurring before the store, and maybe this mechanism adds the latency.Pugliese
The "just right" (or slower cases) maybe never get a mis-speculation at all, so don't need to trigger the special "hold this load back" scenario. BTW, the "tput" tests there have N parallel store/forward pairs to different locations, and they scale in the way you'd expect if store forwarding has throughput of at least one per cycle.Pugliese

The call to normal_function and the return from it will be correctly predicted every time except the first, so I wouldn't expect to see any difference in timing due to the presence of the call. Thus all of the differences in timing that you see (whether faster or slower) are due to other effects (such as those mentioned in the comments) rather than to the difference in code that you're actually trying to measure.

Enclosure answered 6/8, 2017 at 7:13 Comment(1)
Even correctly-predicted branches can cause instruction-fetch delays. If the loop body wasn't so slow, you'd see an even bigger effect.Eyelash
