I'm currently trying to understand the performance properties of certain loops on x86_64 (specifically, on my Intel(R) Core(TM) i3-8145U CPU @ 2.10GHz processor). In particular, adding an extra instruction that reads memory inside the body of the loop almost doubles the performance, and the details of that instruction don't seem to be particularly important.
I've been using a test program that consists of two main parts: a testing loop, and a function under test. The testing loop runs the function under test 2^32 times, once with each signed 32-bit integer as argument (in order from INT_MIN to INT_MAX). The function under test (named body) is a small function that checks whether it was called with the expected argument, and records the error in a global variable otherwise. The test program touches a sufficiently small amount of memory that everything is likely to fit in the L1 cache.
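Before getting to the assembly, here's a rough C equivalent of the two parts. This is a sketch only; the actual test uses the hand-written assembly shown below, precisely so that compiler behaviour can't affect the results.

#include <limits.h>

unsigned next_expected = (unsigned)INT_MIN;  /* argument value we expect next */
unsigned any_errors = 0;                     /* becomes nonzero on any mismatch */

/* The function under test: check the argument, record any mismatch, advance. */
void body(signed arg)
{
    any_errors |= (unsigned)arg ^ next_expected;  /* zero iff arg == next_expected */
    next_expected++;
}

/* The testing loop: call f once with every signed 32-bit integer, in order. */
void loop_over_all_ints(void (*f)(signed))
{
    signed i = INT_MIN;
    for (;;) {                 /* 2^32 calls in total */
        f(i);
        if (i == INT_MAX)
            break;
        i++;
    }
}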
In order to eliminate any speed differences that might be caused by the behaviour of a compiler, I wrote both of the functions in question in assembly language (I'm using clang as the assembler), and forced them to start at fixed addresses (the performance of this sort of test loop often gets dominated by effects related to alignment or caching, so using fixed addresses eliminates any alignment or cache effects unrelated to the change).
Here's the test loop, disassembled (it takes the address of the function to loop over in %rdi):
401300: 53 push %rbx
401301: 55 push %rbp
401302: 51 push %rcx
401303: 48 89 fd mov %rdi,%rbp
401306: bb 00 00 00 80 mov $0x80000000,%ebx
loop:
40130b: 89 df mov %ebx,%edi
40130d: ff d5 callq *%rbp
40130f: 83 c3 01 add $0x1,%ebx
401312: 71 f7 jno 40130b <loop>
401314: 59 pop %rcx
401315: 5d pop %rbp
401316: 5b pop %rbx
401317: c3 retq
and here's the simplest version of body, the function under test:
401200: 33 3d 3a 3e 00 00 xor 0x3e3a(%rip),%edi # 405040 <next_expected>
401206: 09 3d 30 3e 00 00 or %edi,0x3e30(%rip) # 40503c <any_errors>
40120c: ff 05 2e 3e 00 00 incl 0x3e2e(%rip) # 405040 <next_expected>
401212: c3 retq
(The basic idea is that body checks to see if its argument %edi is equal to next_expected, and sets any_errors to a nonzero value if it isn't, otherwise leaving it unchanged. Then it increments next_expected.)
The test loop, with this version of body as %rdi, takes approximately 11 seconds to run on my processor. However, adding in an extra read of memory causes it to run in under 6 seconds (a difference much too large to be explained by random variation):
401200: 33 3d 3a 3e 00 00 xor 0x3e3a(%rip),%edi # 405040 <next_expected>
401206: 09 3d 30 3e 00 00 or %edi,0x3e30(%rip) # 40503c <any_errors>
40120c: 33 3d 2e 3e 00 00 xor 0x3e2e(%rip),%edi # 405040 <next_expected>
401212: ff 05 28 3e 00 00 incl 0x3e28(%rip) # 405040 <next_expected>
401218: c3 retq
I've tried lots of different variants of this code to see what specific properties of the additional statement (labelled 401212 above) give the "fast" behaviour. The common pattern seems to be that the statement needs to read from memory. Here are the various statements I've tried there (note: each of these is a single statement that's exactly 6 bytes long, so there are no alignment considerations to worry about):
These statements run quickly (~6 seconds):
- It doesn't seem to matter what operation we read the memory with:
xor 0x3e2e(%rip),%edi # 405040 <next_expected>
and 0x3e2e(%rip),%edi # 405040 <next_expected>
mov 0x3e2e(%rip),%edi # 405040 <next_expected>
- or what register we read into:
and 0x3e2e(%rip),%eax # 405040 <next_expected>
- or (in most cases) what we're reading:
xor 0x11c7(%rip),%edi # 4023d9 <main>
or -0x12(%rip),%edi # 401200 <body>
- or whether we're writing to memory in addition to reading it:
xor %edi,0x3e2a(%rip) # 40503c <junk>
- Additionally, adding xor 0x11c7(%rip),%edi # 4023d9 <main>, but after rather than before the incl command, also gave fast performance.
These statements run slowly (~11 seconds):
- It isn't sufficient to use an instruction that's 6 bytes long but doesn't read memory:
nopw %cs:(%rax,%rax,1)
lea 0x3e2e(%rip),%edi # 405040 <next_expected>
- It isn't sufficient to write memory without reading it:
mov %edi,0x3e2a(%rip) # 40503c <junk>
Additionally, I tried writing the read value back to next_expected rather than incrementing it in place:
401200: 33 3d 3a 3e 00 00 xor 0x3e3a(%rip),%edi # 405040 <next_expected>
401206: 09 3d 30 3e 00 00 or %edi,0x3e30(%rip) # 40503c <any_errors>
40120c: 8b 3d 2e 3e 00 00 mov 0x3e2e(%rip),%edi # 405040 <next_expected>
401212: ff c7 inc %edi
401214: 89 3d 26 3e 00 00 mov %edi,0x3e26(%rip) # 405040 <next_expected>
40121a: c3 retq
This had a performance very close to the original 11-second program.
One anomaly is the statement xor 0x3e2a(%rip),%edi # 40503c <any_errors>; adding that as the 401212 statement consistently gave a performance of 7.3 seconds, not matching either of the other two performances. I suspect that what's happening here is that the read of memory is sufficient to send the function onto the "fast path", but the statement itself is slow (because we just wrote any_errors on the preceding line; writing and then immediately reading the same memory address is something that processors can struggle with), and thus we're getting fast-path performance plus a slowdown from the use of a slow statement. Something similar happens if I add a read of next_expected (rather than main) after rather than before the incl statement (again, we're reading a memory address that was just written, so the similar behaviour is not surprising).
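For concreteness, here's the ~7.3-second variant reconstructed in source form, in the same inline-asm style as the minimal reproducer further down (the reconstruction and the label name are mine, based on the disassembly offsets above):

asm (
    ".p2align 8\n"
    "body_rereads_any_errors:\n"
    " xor next_expected(%rip), %edi\n"   /* the usual check of the argument */
    " or %edi, any_errors(%rip)\n"       /* store to any_errors */
    " xor any_errors(%rip), %edi\n"      /* extra read: reloads the address just written */
    " incl next_expected(%rip)\n"
    " retq\n"
);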
Another experiment was adding xor next_expected(%rip), %eax earlier in the function (before the write to %edi or right at the start, before the read of next_expected). These gave a performance of around 8.5 seconds.
Anyway, at this point it seems fairly clear what is causing the fast behaviour (adding an additional read from memory is making the function run faster, at least when it's combined with the specific test loop shown here; it wouldn't surprise me if the details of the test loop were relevant). What I don't understand, though, is why the processor would behave like this. In particular, is there a general rule that can be used to work out when adding an extra read to a program will make it run (this much) faster?
If you want to experiment with this yourself
Here's a minimal version of the program that can be compiled and run, and exhibits this problem (this is C with gcc/clang extensions, and specific to x86_64 processors):
#include <limits.h>
unsigned any_errors = 0;
unsigned next_expected = INT_MIN;
extern void body(signed);
extern void loop_over_all_ints(void (*f)(signed));
asm (
".p2align 8\n"
"body:\n"
" xor next_expected(%rip), %edi\n"
" or %edi, any_errors(%rip)\n"
// " xor next_expected(%rip), %edi\n"
" addl $2, next_expected(%rip)\n"
" retq\n"
".p2align 8\n"
"loop_over_all_ints:\n"
" push %rbx\n"
" push %rbp\n"
" push %rcx\n"
" mov %rdi, %rbp\n"
" mov $0x80000000, %ebx\n"
"loop:\n"
" mov %ebx, %edi\n"
" call *%rbp\n"
" inc %ebx\n"
" jno loop\n"
" pop %rcx\n"
" pop %rbp\n"
" pop %rbx\n"
" retq\n"
);
int
main(void)
{
    loop_over_all_ints(&body);
    return 0;
}
(The commented-out line is an example of an extra memory read that makes the program run faster.)
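The timings quoted above were presumably measured externally (for example with time or perf stat); if you'd rather have the program report its own elapsed time, a minimal sketch of a replacement main is below, assuming a POSIX system with clock_gettime (this is my addition, not part of the original test):

#include <stdio.h>
#include <time.h>

int
main(void)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    loop_over_all_ints(&body);
    clock_gettime(CLOCK_MONOTONIC, &end);
    double seconds = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) * 1e-9;
    printf("elapsed: %.2f s\n", seconds);
    return 0;
}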
Further experiments
After posting the question, I tried some further experiments in which the testing loop was unrolled to depth 2, and modified so that the two calls to the function under test could be made to go to two different functions. When calling the loop with body as both functions, there was still a clear performance difference between the versions of the code with and without the extra memory read (6-7 seconds with, >11 seconds without), giving a clearer platform for looking at the differences.
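The question doesn't show the unrolled loop itself, so here's a hedged sketch of what it might look like, in the same style as loop_over_all_ints above (the label names, taking the second function pointer in %rsi, and the use of %r12 are my assumptions, not from the original):

asm (
    ".p2align 8\n"
    "loop_over_all_ints_unrolled:\n"   /* first function in %rdi, second in %rsi */
    " push %rbx\n"
    " push %rbp\n"
    " push %r12\n"
    " mov %rdi, %rbp\n"
    " mov %rsi, %r12\n"
    " mov $0x80000000, %ebx\n"
    "loop2:\n"
    " mov %ebx, %edi\n"
    " call *%rbp\n"                    /* first call of the unrolled pair */
    " inc %ebx\n"
    " mov %ebx, %edi\n"
    " call *%r12\n"                    /* second call of the unrolled pair */
    " inc %ebx\n"                      /* the increment after INT_MAX overflows */
    " jno loop2\n"
    " pop %r12\n"
    " pop %rbp\n"
    " pop %rbx\n"
    " retq\n"
);

Because the total number of calls (2^32) is even and the count starts at INT_MIN, the increment that overflows is always the second one of the pair, so a single jno at the bottom of the unrolled body is enough.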
Here are the results of the tests with two separate body functions:
- Same any_errors/next_expected variables for both, no extra read: ~11 seconds
- Same any_errors/next_expected variables for both, extra read for both: 6-7 seconds
- Same any_errors/next_expected variables for both, extra read in one but not the other: 6-7 seconds
- Same next_expected variable but different any_errors variables, no extra reads: ~11 seconds
- Same any_errors variable but different next_expected variables (thus an error is reported), no extra reads: 5-5½ seconds (noticeably faster than any case so far)
- Same any_errors variable but different next_expected variables, addl $2 rather than incl on next_expected (so that no error is reported), no extra reads: 5-5½ seconds
- Same as previous case, but with the extra reads: 5-5½ seconds (and an almost identical cycle count: it's different by only tens of millions compared to the billions of iterations, so the number of cycles per iteration must be the same)
It looks a lot like whatever is going on here is somehow related to the dependency chain on next_expected, because breaking that chain gives faster performance than anything that's possible with the chain present.
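To make the chain-breaking cases concrete, here's a C-level sketch of the "different next_expected variables, addl $2" variant (the real experiment was in assembly, and the variable and function names here are mine): each of the two unrolled calls maintains its own counter, so the store-to-load dependency through any single memory location only advances every other iteration.

#include <limits.h>

extern unsigned any_errors;                        /* shared, as in the program above */

unsigned next_expected_a = (unsigned)INT_MIN;      /* checked by the first call of each pair */
unsigned next_expected_b = (unsigned)INT_MIN + 1;  /* checked by the second call of each pair */

void body_a(signed arg)
{
    any_errors |= (unsigned)arg ^ next_expected_a;
    next_expected_a += 2;   /* the "addl $2 rather than incl" variant */
}

void body_b(signed arg)
{
    any_errors |= (unsigned)arg ^ next_expected_b;
    next_expected_b += 2;
}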
Further experiments #2
I've been trying a lot more variations of the program to try to eliminate possibilities. I've now managed to shrink a test case reproducing this behaviour down to the following asm file (built with gas via as test.s -o test.o; ld test.o; this isn't linked against libc, and is thus Linux-specific):
.bss
.align 256
a:
.zero 4
b:
.zero 4
c:
.zero 4
.text
.p2align 8, 0
.globl _start
_start:
mov $0x80000000, %ebx
1:
// .byte 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90
// .byte 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x66, 0x90
mov a(%rip), %esi    # 1: read memory into a register
or %esi, b(%rip)     # 2: read-modify-write a different address, using that register
or $1, a(%rip)       # 3: read-modify-write the original address
mov c(%rip), %eax    # 4: read another, unrelated address
add $1, %ebx
jno 1b
mov $60, %rax
mov $0, %rdi
syscall
There are three versions of the program to compare: the version as written (both .byte lines commented out), the version with the first .byte line uncommented (12 single-byte NOP instructions), and the version with the second .byte line uncommented (11 NOP instructions; I made one of them two-byte so that the total length, and thus the alignment, matches the 12-NOP case, but it turns out not to matter).
When running the program with no NOPs, or with 11 NOPs, it runs in 11 seconds. When 12 single-byte NOPs are used, it runs in 7 seconds.
At this point, I think it's obvious that something is going wrong when the loop in question runs "too fast", and fixes itself when the loop is artificially slowed down. The original version of the program was presumably bottlenecked on the bandwidth for reading memory from the L1 cache, so the problem fixed itself when we added an extra read. This version of the program speeds up when it is (artificially) bottlenecked on the front end; the only difference between "12 single-byte NOPs" and "10 single-byte NOPs plus a 2-byte NOP" is how quickly the NOP instructions can get through the front end of the processor. So it seems that the loop runs faster if it's artificially slowed down, and it doesn't seem to matter what mechanism is used to slow it down.
Some performance counter information to exclude possibilities: the loop is running out of the loop stream decoder (lsd.cycles_active is over 25 billion, while idq.dsb_cycles and idq.mite_cycles are less than 10 million, in both the 11-NOP and 12-NOP cases), which eliminates the possibility that the huge number of NOPs added here is overloading the instruction caching mechanisms; and ld_blocks.store_forward is a single digit (I thought store-forwarding might be involved, and it still might be, but this is the only performance counter related to it, so we won't get any more information about it this way).
The specific pattern of reads and writes used above is:
- read memory into a register;
- read-modify-write a different address, using the same register;
- read-modify-write the original address;
- read another address (in the original example, the pop of the instruction pointer served as the unrelated read).
This seems to be the simplest pattern that reproduces the behaviour; I haven't found any further simplifications which cause the behaviour to reproduce.
I still have no idea what's going on here, but hopefully this information will be useful to anyone who wants to try to figure it out.
Comments
- […] call/ret loads/stores, like in "Loop with function call faster than an empty loop?" I assume your timing methods are controlling for idle vs. turbo frequency? "Idiomatic way of performance evaluation?" – Bixler
- […] next_expected to have different effects from a read of some other location. The xor %edi,0x3e2a(%rip) vs. mov %edi,0x3e2a(%rip) result may help to rule out a lot of possibilities. – Koine
- […] perf stat --all-user to clean that up, equivalent of putting :u on all your events. We certainly don't expect any I-cache misses, assuming branch prediction is working and not speculatively fetching from crazy locations. (No reason to expect that either.) – Bixler
- […] :u (my perf is too old to support --all-user) fixed that issue. I've been trawling the detailed performance counters looking for other oddities, but haven't found much. cycle_activity.stalls_mem_any:u is at 7,625,732,425 on the slow path and 373,451,069 on the fast path, which is very likely relevant but too general to deduce much (the total cpu-cycles:u is 80,967,248,265 on the slow path and 45,281,021,934 on the fast path), and I haven't yet found an appropriate "more precise" counter to explain the stalls. – Koine
- […] any_errors or next_expected. Double time might be a uop replay, but IDK exactly why that would be happening. Might be possible to figure out which of the 2 memory vars is actually the critical bottleneck, or if either one could be depending on circumstances. E.g. turn inc mem into load / inc / store, and try adding some latency to that value while you have it in a register (e.g. imul $1, %eax, %eax or or $0, %eax) to see if that slows you down by another cycle. – Bixler
- […] next_expected that's the performance bottleneck. – Koine
- […] perf stat -r 5 ./myexe and show the output. BTW, if I compile the code with gcc 9.3.0 at -O0, the assembly gets emitted in the non-executable .data section, so it crashes. But if I compile it with -O3, it goes into the executable .text section and works fine. – Publias
- […] perf's default output format is that useful (except inasmuch as it shows the standard deviations for the various counts), but in case it helps you, here it is. I ran the blocks of 5 runs, for each of the two versions of the program, 3 times, in an attempt to control for CPU frequency scaling effects, cache warming, etc. (although the 3 runs have very consistent timings anyway). – Koine
- […] 0 for lsd.cycles_active, with uops coming from the DSB at best. Anyway, that's a tangent and probably unrelated to your problem. The fact that uops are still coming from the LSD, not MITE, proves you didn't overflow the uop cache limit for a 32-byte region with that many single-byte NOPs. – Bixler
- […] test.s) I am seeing "fast" time for all 3 versions. Tested on my Whiskey Lake and seeing the same results. Worth noting that with the two slow versions the loads are overscheduled on port 3 (p2 = 9.9 billion, p3 = 11.5 billion), as opposed to the fast version where they are evenly split between ports 2 and 3 (p2 = p3 = 10.7 billion). Can't check load port distribution on Tigerlake as they seem to have compressed the p2/p3 events into one. I don't think port distribution can explain a near-2x speedup, but it does explain some of it. – Siracusa