Why does my empty loop run twice as fast if called as a function, on Intel Skylake CPUs?
I was running some tests to compare C to Java and ran into something interesting. Running exactly the same benchmark code at optimization level 1 (-O1), but in a function called by main rather than in main itself, resulted in roughly double the performance. I print the size of size_t to verify beyond any doubt that the code is being compiled as x64.

I sent the executables to my friend who's running an i7-7700HQ and got similar results. I'm running an i7-6700.


Here's the slower code:

#include <stdio.h>
#include <time.h>
#include <stdint.h>

int main() {
    printf("Size = %I64u\n", sizeof(size_t));
    int start = clock();
    for(int64_t i = 0; i < 10000000000L; i++) {
        
    }
    printf("%ld\n", clock() - start);
    return 0;
}

And the faster:

#include <stdio.h>
#include <time.h>
#include <stdint.h>

void test() {
    printf("Size = %I64u\n", sizeof(size_t));
    int start = clock();
    for(int64_t i = 0; i < 10000000000L; i++) {
        
    }
    printf("%ld\n", clock() - start);
}

int main() {
    test();
    return 0;
}

I'll also provide the assembly code for you to dig into; I don't know assembly myself. Slower:

    .file   "dummy.c"
    .text
    .def    __main; .scl    2;  .type   32; .endef
    .section .rdata,"dr"
.LC0:
    .ascii "Size = %I64u\12\0"
.LC1:
    .ascii "%ld\12\0"
    .text
    .globl  main
    .def    main;   .scl    2;  .type   32; .endef
    .seh_proc   main
main:
    pushq   %rbx
    .seh_pushreg    %rbx
    subq    $32, %rsp
    .seh_stackalloc 32
    .seh_endprologue
    call    __main
    movl    $8, %edx
    leaq    .LC0(%rip), %rcx
    call    printf
    call    clock
    movl    %eax, %ebx
    movabsq $10000000000, %rax
.L2:
    subq    $1, %rax
    jne .L2
    call    clock
    subl    %ebx, %eax
    movl    %eax, %edx
    leaq    .LC1(%rip), %rcx
    call    printf
    movl    $0, %eax
    addq    $32, %rsp
    popq    %rbx
    ret
    .seh_endproc
    .ident  "GCC: (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0"
    .def    printf; .scl    2;  .type   32; .endef
    .def    clock;  .scl    2;  .type   32; .endef

Faster:

    .file   "dummy.c"
    .text
    .section .rdata,"dr"
.LC0:
    .ascii "Size = %I64u\12\0"
.LC1:
    .ascii "%ld\12\0"
    .text
    .globl  test
    .def    test;   .scl    2;  .type   32; .endef
    .seh_proc   test
test:
    pushq   %rbx
    .seh_pushreg    %rbx
    subq    $32, %rsp
    .seh_stackalloc 32
    .seh_endprologue
    movl    $8, %edx
    leaq    .LC0(%rip), %rcx
    call    printf
    call    clock
    movl    %eax, %ebx
    movabsq $10000000000, %rax
.L2:
    subq    $1, %rax
    jne .L2
    call    clock
    subl    %ebx, %eax
    movl    %eax, %edx
    leaq    .LC1(%rip), %rcx
    call    printf
    nop
    addq    $32, %rsp
    popq    %rbx
    ret
    .seh_endproc
    .def    __main; .scl    2;  .type   32; .endef
    .globl  main
    .def    main;   .scl    2;  .type   32; .endef
    .seh_proc   main
main:
    subq    $40, %rsp
    .seh_stackalloc 40
    .seh_endprologue
    call    __main
    call    test
    movl    $0, %eax
    addq    $40, %rsp
    ret
    .seh_endproc
    .ident  "GCC: (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0"
    .def    printf; .scl    2;  .type   32; .endef
    .def    clock;  .scl    2;  .type   32; .endef

Here's my batch script for compilation:

@echo off
set /p file= File to compile: 
del compiled.exe
gcc -Wall -Wextra -std=c17 -O1 -o compiled.exe %file%.c
compiled.exe
PAUSE

And for compilation to assembly:

@echo off
set /p file= File to compile: 
del %file%.s
gcc -S -Wall -Wextra -std=c17 -O1 %file%.c
PAUSE
Adena asked 7/6, 2021 at 19:52 Comment(26)
I don't believe it. I can't see how the empty loop is not optimized out.Autoplasty
Did you run the tests in random order, several times, discarding outliers?Covin
@Autoplasty It did get optimized out if I used anything higher than -O1, which is why I used -O1 and not -O3.Adena
@Covin I ran both tests multiple times with a few minutes in between and with relatively arbitrary/random order whilst discussing this strangeness with my friends. There was some variation, of course, but not very much. The general trend was that the faster code was roughly twice as fast as the slower code. I also tried doubling the maximum value from 10bil to 20bil to the same effect. It's also worth noting that the results were on the order of seconds, so it's not like tiny amounts of overhead or random delays could've caused this.Adena
It could be due to the address of the branch (which is different in both versions, as a random side effect) combined with the fix for the JCC erratum. If you provide the address of L2 in both cases, that can be checked.Hydrate
@harold How would you like me to go about doing that? I'd be happy to provide you with any info you need... I just don't know how.Adena
You should be able to see the address in a debugger. Your choice of tooling confuses me slightly (it seems like you're using GCC on Windows, so maybe you don't use Visual Studio as your debugger?), but any of them should be able to tell you this information. Btw, ASLR doesn't matter: the alignment relevant for the fix of the JCC erratum is 32, so it's not affected by ASLR.Hydrate
@harold I don't code in C very often... I'm using VS Code and I don't really have a debugger. What's ASLR?Adena
IDK how to set up debugging in VS Code but surely it is possible somehow? Or I suppose you could use x64dbg. Anyway ASLR randomizes addresses somewhat, so I mentioned that to avoid you getting worried about that, it shouldn't affect this question.Hydrate
@AnonymousNoob Probably you ran this program while your computer was under a different load. These two versions of the code will give you almost exactly the same results. Your results make no sense, so I believe that your experimental methodology is invalid.Moulder
I can't reproduce it.Moulder
@Moulder I don't want to conclude yet that it is the JCC erratum thing, but if it is, then it's also difficult to reproduce, because it relies on exact code placement (which your compiler won't necessarily reproduce) and on running it on a particular family of processors.Hydrate
@harold it will not double execution time. It will affect both versions the same.Moulder
@Moulder I sent the executables to my friend who's running an i7-7700HQ and got similar results. I'm running an i7-6700.Adena
I am running on an i9-something. Maybe you got a virus which affects the execution :)Moulder
@Moulder there's no a-priori reason why both versions should be affected the same. It really comes down to where exactly those branches are. If they're nowhere near a 32-byte boundary, sure, it won't matter. But the addresses of those branches are for sure not equal in both versions (different offset from the starting label), and if one of them is near a 32-byte boundary in one of the bad ways and the other is in the clear, then that's a big difference.Hydrate
@harold I don't really know what I'm looking at here... It's a lot more complicated than Java debuggers I've used! Hopefully this helps. This is the fast version: i.imgur.com/UWrGVVw.png And the slow version: i.imgur.com/e7RJi28.pngAdena
I'm running an i7-920 and I do not get your results. I'm running under Linux. So, I'd suggest that you boot Linux and see if you get similar results on the same processor/system. That will tell you something. (You could probably get the results by booting a shell off a "live" USB installer if you don't want to install to your hard disk(s)). I get a max variation of 100,000 clocks over a few hundred trials, nowhere near 2x. And, I take the minimum time of N trials for each program (usually 3-10 trials) to eliminate cache variations and swappiness (a C sketch of this min-of-N-trials approach follows these comments).Lauricelaurie
@AnonymousNoob please press F9 once, to skip out of this system code into your own code.Hydrate
@harold Fast: i.imgur.com/ddlUwLk.png Slow: i.imgur.com/wdZQS2Q.png (ayy figured out dark theme)Adena
@AnonymousNoob OK that wasn't it either, closer I suppose, but the loop is not in sight, you could look for it but if you somehow give me the executable I'd be willing to take a look (E: btw I won't be able to reproduce it, my CPU is from the Haswell family so not affected by the JCC erratum, but I can find out if the branch could be affected by that)Hydrate
@harold Sure. Slow: file.io/download/NEYMbrxFIivX Fast: file.io/mnFha7kqY72n Thank you for taking the time to help me try to understand this. My friend and I will be very happy once we get to the bottom of this mystery.Adena
@harold (This is in another comment to make sure you see it) I added a getchar() to the end of both executables because I was sending them to my friend and wanted him to just be able to run them without any trouble. It doesn't seem to affect the results.Adena
@Moulder I did get my answer... but you had no reason to invalidate my methodology just because my results didn't make sense. Apparently, they DID make sense. :)Adena
@0___________: Maybe you got a virus which affects the execution - Actually a microcode update from Intel which introduces a performance pothole to work around a bug they missed while testing their CPUs. Quality control in Skylake was worse than in previous generations, and there have been at least a couple new performance potholes introduced by microcode updates since release. (Disabling LSD, and the JCC erratum workaround, are the most notable.)Hanus
And Ice Lake recently disabled mov-elimination to work around another design bug. :( Anyway, it's normal for weird perf effects in very small-scale microbenchmarks to be specific to a CPU version. @0___________, if you'd said what i9 version you tested on, your comment wouldn't have been totally useless.Hanus
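
Following up on the comment above about taking the minimum of N trials: here is a minimal C sketch of that approach. The trial count of 5 and the conversion to seconds are arbitrary choices of this sketch, and like the original benchmark it relies on -O1 keeping the empty loop rather than deleting it.

#include <stdio.h>
#include <time.h>
#include <stdint.h>

/* Time one run of the empty loop, in clock ticks. The loop is the same as in
 * the question: it survives at -O1 but would be removed at -O2 and above. */
static clock_t time_loop(void)
{
    clock_t start = clock();
    for (int64_t i = 0; i < 10000000000LL; i++) {
    }
    return clock() - start;
}

int main(void)
{
    /* Take the minimum of several trials to filter out scheduling noise and
     * other one-off costs, as suggested in the comment above. */
    enum { TRIALS = 5 };
    clock_t best = time_loop();
    for (int t = 1; t < TRIALS; t++) {
        clock_t ticks = time_loop();
        if (ticks < best)
            best = ticks;
    }
    printf("best of %d trials: %ld ticks (%.3f s)\n",
           TRIALS, (long)best, (double)best / CLOCKS_PER_SEC);
    return 0;
}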

The slow version:

[screenshot: disassembly of the slow version around the loop]

Note that the sub rax, 1 \ jne pair goes right across the ..80 boundary (which is a 32-byte boundary). This is one of the cases mentioned in Intel's document regarding this issue, namely this diagram:

[diagram from Intel's document: jump placements relative to a 32-byte boundary that are affected]

So this op/branch pair is affected by the fix for the JCC erratum (which causes it not to be cached in the µop cache). I'm not sure whether that is the reason, since there are other things at play too, but it's a factor.
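
For concreteness, the affected condition can be written as a small check: a jump, or a macro-fused op/jump pair treated as one run of bytes, is hit by the mitigation when it crosses a 32-byte boundary or when its last byte sits exactly on one (as I read Intel's description). A minimal C sketch of that check follows; the start address and the 6-byte length of the fused sub/jne pair are illustrative values, not taken from the poster's binary.

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Rough check for the condition in Intel's mitigation: a jump, or a
 * macro-fused op/jump pair treated as one run of bytes, is affected when it
 * crosses a 32-byte boundary or when its last byte lands exactly on one. */
static bool touches_32byte_boundary(uint64_t start, unsigned len)
{
    uint64_t last = start + len - 1;      /* address of the final byte */
    return (start / 32) != (last / 32)    /* spans a 32-byte boundary */
        || (last % 32) == 31;             /* or ends right on one */
}

int main(void)
{
    /* Illustrative only: a 4-byte sub rax,1 fused with a 2-byte jne, with the
     * pair starting a few bytes before a ..80 boundary, as in the slow case. */
    uint64_t fused_start = 0x14000157dULL;   /* hypothetical address */
    printf("affected: %s\n",
           touches_32byte_boundary(fused_start, 6) ? "yes" : "no");
    return 0;
}

With these illustrative numbers the 6-byte pair spans the ..80 boundary, so the check reports it as affected, matching the slow case in the screenshot.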

In the fast version, the branch is not "touching" a 32-byte boundary, so it is not affected.

[screenshot: disassembly of the fast version around the loop]

There may be other effects that apply. Still, because it crosses a 32-byte boundary, the loop in the slow case is spread across 2 chunks in the µop cache. Even without the fix for the JCC erratum, that can make it run at 2 cycles per iteration if the loop cannot execute from the Loop Stream Detector (which is disabled on some processors by another fix for another erratum, SKL150). See e.g. this answer about loop performance.
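
One way to experiment with this from the C side (my suggestion, not something the original poster did) is to align the loop code to a 32-byte boundary, so the fused pair can neither straddle a boundary nor be split across two µop-cache chunks. The sketch below assumes GCC emits the basic asm statement right before the loop and generates the same movabs + sub/jne sequence as in the question's listing; neither is guaranteed, so the generated .s output should be checked.

#include <stdio.h>
#include <time.h>
#include <stdint.h>

int main(void)
{
    clock_t start = clock();
    /* Ask the assembler to pad with NOPs so the code that follows begins on a
     * 32-byte boundary. Assuming the same codegen as in the question, the loop
     * setup (movabs, ~10 bytes) plus the fused sub/jne pair (~6 bytes) then
     * sits inside a single 32-byte chunk, so it can neither trip the
     * JCC-erratum workaround nor be split across two uop-cache lines. GCC gives
     * no hard guarantee about where a basic asm statement lands, so inspect the
     * .s output; compiling with -falign-loops=32 is another way to experiment. */
    __asm__ (".p2align 5");
    for (int64_t i = 0; i < 10000000000LL; i++) {
    }
    printf("%ld\n", (long)(clock() - start));
    return 0;
}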

To address the various comments saying they cannot reproduce this: yes, there are several ways that could happen:

  • Whichever effect is responsible for the slowdown, it is likely caused by the exact placement of the op/branch pair across a 32-byte boundary, which happened by pure accident. Compiling from source is unlikely to reproduce the same circumstances unless you use the same compiler with the same setup as the original poster.
  • Even with the same binary, and regardless of which effect is responsible, the weird slowdown will only happen on particular processors.
Hydrate answered 7/6, 2021 at 21:19 Comment(3)
Fun fact: even without the JCC erratum, tiny loops of more than 1 uop that get split across 2 cache lines can run half speed on Skylake-family CPUs (which have their loop buffer disabled by a microcode update in SKL, and in initial microcode on later CPUs until Ice Lake). Because the uop cache can only fetch from 1 line per cycle, so it has to alternate between lines to fetch the whole loop body. (Oh, you already mentioned that effect.)Hanus
I think even in this sub/jcc loop (which can macro-fuse into a single uop), it could actually fail to macro-fuse and end up as 2 uops in 2 separate cache lines even without the JCC erratum. IIRC, Macro-fusion doesn't work when split perfectly across an I-cache line boundary (64 bytes). But on CPUs with a working LSD, those 2 uops can still be "unrolled" and issued in groups of 4. (On Haswell and later, or at least in groups of 2 on SnB)Hanus
Note that similar effects apply to AMD Ryzen processors: instruction decoding is slower when a small loop is split across two cache lines, resulting in an overall slow loop (due to the instruction cache).Chiu
