I have been experimenting with a simple true/false sharing benchmark that does a plain load+increment+store through a pointer. Basically this:
static void do_increments(volatile size_t *buffer, size_t iterations)
{
    while (iterations) {
        buffer[0]++;
        --iterations;
    }
}
This function is called from two threads, pinned to different physical cores, right after they wait on a barrier. Depending on the value of the buffer pointer, the benchmark can be compiled to exhibit true sharing (same address for both cores), false sharing (same cacheline, but different addresses), or no sharing (different cachelines).
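The exact pointer setup isn't shown above; here is a minimal sketch of how the per-thread pointers could be derived for the three modes, assuming 64-byte cachelines and the SHARING_MODE values mentioned further down (the storage array and buffer_for_thread helper are hypothetical, not the actual code):

#include <stddef.h>

#define CACHELINE 64

/* Hypothetical backing storage: two cachelines, aligned so offsets are predictable. */
static _Alignas(CACHELINE) volatile size_t storage[2 * CACHELINE / sizeof(size_t)];

static volatile size_t *buffer_for_thread(int thread_idx, int sharing_mode)
{
    switch (sharing_mode) {
    case 0:  /* true sharing: both threads increment the same address */
        return &storage[0];
    case 1:  /* false sharing: same cacheline, different addresses (offsets 0 and 8) */
        return &storage[thread_idx];
    default: /* no sharing: different cachelines (offsets 0 and 64) */
        return &storage[thread_idx * (CACHELINE / sizeof(size_t))];
    }
}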
When run on x86, the true and false sharing scenarios show a slowdown compared to no sharing. However, on some ARM cores, like the Cortex A73, no slowdown is seen regardless of the address of buffer. I have also seen some RISC-V cores exhibit the same behaviour (no slowdown).
To understand why some platforms slow down and others don't, I tried to gain a deeper understanding of why exactly false sharing causes a slowdown in the first place; for x86 it is nicely explained in Why does false sharing still affect non atomics, but much less than atomics?
Basically, on x86 chips you get stalls from false sharing for either of two reasons (please correct me if I'm wrong on this!):
- Memory ordering machine clears, which happen when a write from a different core becomes visible after a load from our core has started, but before it has determined its value
- When our store buffer fills up, meaning we have to drain it and do the rounds to keep the caches coherent (i.e. invalidate the cacheline in other cores and wait for the invalidate acks)
The ARM cores I tested all seem to have the same implementation details: a store buffer that can be forwarded from, and coherent caches. Maybe ARM's more relaxed memory ordering rules help prevent these stalls?
Moreover, some other cores I tested (e.g. the Cortex A76) do show a slowdown from false sharing. Presumably they obey the same architectural memory ordering rules, so it has to be some microarchitectural detail that causes the slowdown from false sharing?
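One sanity check that points at microarchitecture rather than the memory model (my own suggestion, not something from the measurements above): rewrite the loop with C11 relaxed atomics. On AArch64 a relaxed atomic load/store of an aligned size_t should compile to the same plain ldr/str as the volatile version, so the instruction stream the A73 and A76 have to order is essentially identical either way:

#include <stdatomic.h>
#include <stddef.h>

/* Same loop, but with explicitly relaxed atomics instead of volatile.
 * GCC emits an ldr/add/str sequence much like the listing below, so any
 * A73-vs-A76 difference cannot come from ordering rules visible in the
 * generated code. */
static void do_increments_relaxed(_Atomic size_t *buffer, size_t iterations)
{
    while (iterations) {
        size_t v = atomic_load_explicit(buffer, memory_order_relaxed);
        atomic_store_explicit(buffer, v + 1, memory_order_relaxed);
        --iterations;
    }
}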
More details
This snippet, when cross-compiled for ARM with aarch64-linux-gnu-gcc version 13.2.0 (Debian 13.2.0-12) at -O2, produces the following assembly for the inner loop:
a68: f9408060 ldr x0, [x3, #256] // load
a6c: f1000442 subs x2, x2, #0x1 // dec loop counter
a70: 91000400 add x0, x0, #0x1 // inc
a74: f9008060 str x0, [x3, #256] // store
a78: 54ffff81 b.ne a68 <worker+0x88> // jump
The number of iterations is set to 1 billion in the source.
Time is measured as the sum of the wall-clock times each thread took to do its increments.
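The surrounding harness isn't shown either; a minimal sketch of the pinning, barrier, and timing side, reusing do_increments from above and assuming Linux pthreads (the worker_arg struct and now_ns helper are mine, not the original code), could look like this:

#define _GNU_SOURCE            /* for CPU_SET and pthread_setaffinity_np */
#include <pthread.h>
#include <sched.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

static pthread_barrier_t barrier;

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

struct worker_arg {
    volatile size_t *buffer;   /* chosen according to SHARING_MODE */
    size_t iterations;
    int cpu;                   /* physical core to pin to */
    uint64_t elapsed_ns;
};

static void *worker(void *p)
{
    struct worker_arg *arg = p;

    /* Pin this thread to its core before the measurement starts. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(arg->cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* Release both threads as close to simultaneously as possible. */
    pthread_barrier_wait(&barrier);

    uint64_t start = now_ns();
    do_increments(arg->buffer, arg->iterations);
    arg->elapsed_ns = now_ns() - start;
    return NULL;
}

The main thread would initialise the barrier with pthread_barrier_init(&barrier, NULL, 2), start two workers pinned to different cores, join them, and report the sum of the two elapsed_ns fields divided by the total number of increments.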
Compiling with SHARING_MODE set to 0/1/2 and running on a Cortex A73 produces the following output:
user@a73:~$ ./truesharing_arm
5441755153 ns total, 5.44 ns/op
user@a73:~$ ./falsesharing_arm
5429813288 ns total, 5.43 ns/op
user@a73:~$ ./nosharing_arm
5057420129 ns total, 5.06 ns/op
user@a73:~$ ./singlethread_arm
5469989462 ns total, 5.47 ns/op // added as reference
With the same binary on a Cortex A55 there is no slowdown either:
user@a55:~$ ./truesharing_arm
4066713396 ns total, 4.07 ns/op
user@a55:~$ ./falsesharing_arm
4066216996 ns total, 4.07 ns/op
user@a55:~$ ./nosharing_arm
4068883325 ns total, 4.07 ns/op
user@a55:~$ ./singlethread_arm
4065773094 ns total, 4.07 ns/op
While on a Cortex A76, the same binary shows an approximately 2x slowdown in the sharing cases:
user@a76:~$ ./truesharing_arm
4798097347 ns total, 4.80 ns/op
user@a76:~$ ./falsesharing_arm
4747878672 ns total, 4.75 ns/op
user@a76:~$ ./nosharing_arm
2348259956 ns total, 2.35 ns/op
user@a76:~$ ./singlethread_arm
2347031787 ns total, 2.35 ns/op
As for x86, compiling with gcc version 13.3.0 (Debian 13.3.0-1) at -O2 produces this assembly:
1278: mov rax,QWORD PTR [rcx] // load
127b: add rax,0x1 // inc
127f: mov QWORD PTR [rcx],rax // store
1282: sub rdx,0x1 // dec loop counter
1286: jne 1278 <worker+0x48> // jump
This does show a slowdown in the sharing scenarios, as expected:
user@x86:~$ ./truesharing_x86
703032167 ns total, 0.70 ns/op
user@x86:~$ ./falsesharing_x86
836444486 ns total, 0.84 ns/op
user@x86:~$ ./nosharing_x86
190924220 ns total, 0.19 ns/op
user@x86:~$ ./singlethread_x86
186088826 ns total, 0.19 ns/op
Comments
[…] iterations if it is on the stack instead of in a register. Again, need to see the asm to know what is going on. I'd like to see firm evidence that the effect is real before we start to speculate (no pun intended) about the cause. – Leoni
[…] volatile to buffer in the original version of the post. I added it, as it is, of course, present in the actual code. – Touraco
[…] _Atomic, and it doesn't give you an atomic increment instruction either. Still, not really relevant to the point, as we are really looking at the asm instead. – Leoni
[…] stride to a significantly larger number? – Leoni
[…] buffer[0] being cached in a register. – Touraco
[…] stride are you thinking, Nate? Different pages? – Touraco
[…] ns/op metric is off by two right now, since we give each core N ops. I will update the numbers so that each core does N/nthreads instead. This makes it easier to compare to the single core example. – Touraco
[…] std::thread or pthreads will start threads across cores that aren't cache-coherent. GCC and Clang assume that __atomic operations only need to sync with other cores in the same Inner Shareable domain. And as the OP found, the Linux kernel also assumes that volatile load/store works like std::atomic with relaxed memory order, i.e. visibility to other cores via HW cache coherency without manual flushing. – Estate
[…] relaxed, but don't because it's not better than std::atomic.) – Estate