How can a weaker memory model prevent slowdown from false sharing?
I have been experimenting with a simple true/false-sharing benchmark, which does a regular load+increment+store through a pointer. Basically this:

static void do_increments(volatile size_t *buffer, size_t iterations)
{
    while (iterations) {
        buffer[0]++;
        --iterations;
    }
}

This function is called from two threads, pinned to different physical cores, right after they wait on a barrier. Depending on the value of the buffer pointer, this can be compiled to exhibit true sharing (same address for both cores), false sharing (same cache line, but different addresses), or no sharing (different cache lines).
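For clarity, here is a minimal sketch of how the three variants could derive their per-thread pointers. The `thread_ptr` helper, the 64-byte line size, and the exact offsets are illustrative assumptions, not necessarily the layout in my actual source:

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

/* One backing buffer, aligned so that cache-line boundaries are known. */
static alignas(CACHELINE) volatile size_t storage[2 * CACHELINE / sizeof(size_t)];

/* Pointer each thread increments, by mode:
   0 = true sharing  (both threads use the same address),
   1 = false sharing (same line, different addresses),
   2 = no sharing    (different lines). */
static volatile size_t *thread_ptr(int mode, int thread)
{
    switch (mode) {
    case 0:  return &storage[0];                                    /* both hit word 0 */
    case 1:  return &storage[thread];                               /* words 0 and 1, same line */
    default: return &storage[thread * (CACHELINE / sizeof(size_t))]; /* separate lines */
    }
}
```

With mode 1, the two pointers land on different words of the same 64-byte line; with mode 2, they land on different lines.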

When run on x86, the true and false sharing scenarios show a slowdown compared to no sharing. However, on some ARM cores, like the Cortex-A73, no slowdown is seen regardless of the address of buffer. I have also seen some RISC-V cores exhibit the same behaviour (no slowdown).

To understand why some platforms slow down and others don't, I tried to gain a deeper understanding of why exactly false sharing causes a slowdown. For x86 it is nicely explained in Why does false sharing still affect non atomics, but much less than atomics?

Basically, on x86 chips you get stalls from false sharing for one of two reasons:

(please correct me if I'm wrong on this!)

  1. Memory-ordering machine clears, which happen when a write from a different core becomes visible after a load from our core has started, but before the load has determined its value
  2. The store buffer filling up, which means we have to drain it and do the coherency round trips (i.e. invalidate the cache line in the other cores and wait for the invalidate acks)

The ARM cores I tested all seem to have the same implementation details: a store buffer that can be forwarded from, and coherent caches. Maybe there are some relaxed memory ordering rules that help prevent these stalls on an ARM core?

Moreover, some other cores I tested (e.g. the Cortex-A76) do show a slowdown from false sharing. Presumably they obey the same memory ordering rules, so it must be some microarchitectural detail that causes the slowdown from false sharing?


More details

This snippet, cross-compiled for ARM with aarch64-linux-gnu-gcc 13.2.0 (Debian 13.2.0-12) at -O2, produces the following assembly for the inner loop:

a68:   f9408060        ldr     x0, [x3, #256]     // load
a6c:   f1000442        subs    x2, x2, #0x1       // dec loop counter
a70:   91000400        add     x0, x0, #0x1       // inc
a74:   f9008060        str     x0, [x3, #256]     // store
a78:   54ffff81        b.ne    a68 <worker+0x88>  // jump

The number of iterations is set to 1 billion in the source.

Time is measured as the sum of the wallclock times it took each thread to do its increments.

Compiling with SHARING_MODE set to 0/1/2 and running on a Cortex-A73 produces the following output:

user@a73:~$ ./truesharing_arm 
5441755153 ns total, 5.44 ns/op
user@a73:~$ ./falsesharing_arm 
5429813288 ns total, 5.43 ns/op
user@a73:~$ ./nosharing_arm 
5057420129 ns total, 5.06 ns/op
user@a73:~$ ./singlethread_arm 
5469989462 ns total, 5.47 ns/op // added as reference

Running the same binary on a Cortex-A55, there is likewise no slowdown:

user@a55:~$ ./truesharing_arm 
4066713396 ns total, 4.07 ns/op
user@a55:~$ ./falsesharing_arm 
4066216996 ns total, 4.07 ns/op
user@a55:~$ ./nosharing_arm 
4068883325 ns total, 4.07 ns/op
user@a55:~$ ./singlethread_arm 
4065773094 ns total, 4.07 ns/op

While on a Cortex-A76, the same binary shows an approximately 2x slowdown:

user@a76:~$ ./truesharing_arm
4798097347 ns total, 4.80 ns/op
user@a76:~$ ./falsesharing_arm 
4747878672 ns total, 4.75 ns/op
user@a76:~$ ./nosharing_arm 
2348259956 ns total, 2.35 ns/op
user@a76:~$ ./singlethread_arm 
2347031787 ns total, 2.35 ns/op

As for x86, compiling with gcc 13.3.0 (Debian 13.3.0-1) at -O2 produces this assembly:

1278:       mov    rax,QWORD PTR [rcx]    // load
127b:       add    rax,0x1                // inc
127f:       mov    QWORD PTR [rcx],rax    // store
1282:       sub    rdx,0x1                // dec loop counter
1286:       jne    1278 <worker+0x48>     // jump

This does show a slowdown in the sharing scenarios, as expected:

user@x86:~$ ./truesharing_x86 
703032167 ns total, 0.70 ns/op
user@x86:~$ ./falsesharing_x86 
836444486 ns total, 0.84 ns/op
user@x86:~$ ./nosharing_x86 
190924220 ns total, 0.19 ns/op
user@x86:~$ ./singlethread_x86 
186088826 ns total, 0.19 ns/op
Touraco answered 30/6, 2024 at 10:55
You might want to clarify whether you are actually running C code, or assembly for which the C function you gave is just meant as descriptive pseudocode. In the former case, compiler optimizations could be confusing things, especially in the "true sharing" case, where you have a data race. – Leoni
Note also that since you have a non-atomic increment, the "true sharing" case won't work as expected. – Leoni
I think what would be helpful to see here would be the assembly (whether handwritten or compiler-generated) that you are executing in your A73 and/or RISC-V tests, and a description of your method for measuring the time taken (including how many iterations, etc.). – Leoni
ARM certainly does have weaker memory ordering rules, but they should only come into play if the cores are accessing some other location in memory, such as iterations if it is on the stack instead of in a register. Again, we need to see the asm to know what is going on. I'd like to see firm evidence that the effect is real before we start to speculate (no pun intended) about the cause. – Leoni
I will add the generated assembly and a runnable example shortly. Should have done that from the start, my bad. – Touraco
Updated the post with more details, including full source, disassembly, and test results from different platforms. – Touraco
I also forgot to add volatile to buffer in the original version of the post. I have added it, as it is, of course, present in the actual code. – Touraco
Volatile doesn't really make it better. ISO C doesn't allow volatile to fix a data race, only _Atomic, and it doesn't give you an atomic increment instruction either. Still, not really relevant to the point, as we are really looking at the asm instead. – Leoni
Does anything change if you change stride to a significantly larger number? – Leoni
ARM's relaxed memory model allows out-of-order commit of stores, potentially allowing a CPU to coalesce multiple stores to the same location into one commit once it does get ownership of the cache line. And potentially avoiding filling up the store buffer, depending on how/where the coalescing happens. And the lack of load-load ordering is also huge, allowing store-forwarding to just work without caring about ownership or invalidations of the cache line. (I think x86 HW loads early, but does a memory-order machine clear if the line was invalidated between then and when architecturally allowed.) – Estate
I added volatile here only to prevent buffer[0] from being cached in a register. – Touraco
Peter, hello! This was my theory as well; however, I don't understand memory ordering quite well enough to connect it with implementation details like store forwarding. – Touraco
But most importantly, if write coalescing were the culprit, why would a more recent ARM core (the A76) not do it? – Touraco
How big of a stride are you thinking, Nate? Different pages? – Touraco
Sure, for instance. Just in case there is something special about cache lines that are adjacent or very close. – Leoni
Because we can interpret the results in two ways. One is that the older cores don't suffer a slowdown in the sharing cases, and are thus actually better than the new ones. That would be surprising. The other is that the old cores are failing to see a speedup in the "no sharing" case, i.e. you're seeing the "penalized" times in all your tests. So I'd like to see if maybe there is a different case that doesn't suffer this penalty. – Leoni
Just to eliminate some other variables: these are in fact all multi-core machines, right? There's no other significant CPU load? And you've made sure they are not doing thermal throttling or anything like that? – Leoni
Yes, these are all multicore machines with very little background activity. I have tested a stride of 4096 bytes, and it behaves the same way a 64-byte stride does. – Touraco
Considering the "speedup vs slowdown" point – very interesting! I have some more data about this: 1. Looking a bit closer, there appears to be a marginal slowdown of about 7% on the A73 (but not the A55). 2. I've run additional tests with only one thread active, and added those as a baseline to the post. This should make it clear that we are seeing a "slowdown", not a "speedup". – Touraco
As for "newer cores being slower", this is not that rare when looking at isolated metrics like this. The A73 is a smaller core with no L3, while the A76 has an L3. This leads to much lower DRAM latency on the A73, for example, as it doesn't have to go through the expensive L3-miss step. – Touraco
I'm testing this now on an RPi 4 (Cortex-A72 x 4) and an RPi 5 (Cortex-A76 x 4). On the A72, the "non-sharing" case is consistently about 10% slower, which is extra bizarre. The A76 has the non-sharing case about 2x faster, similar to what you saw. (Probably unrelated, but your code is pinning the threads to CPUs 4 and 5; do your machines actually have 6 or more cores?) – Leoni
Here's a completely wild guess: perhaps the older cores aggressively coalesce stores, so when they can't get the cache line, they just overwrite their own store-buffer entries. Maybe this was found to be undesirable in the bigger picture, because if you're executing a store to a shared object, you probably want it to be globally visible sooner rather than later. So maybe on the newer cores, they deliberately backed this off and made them stall more often, so as to ensure the stores actually make it out of the store buffer. – Leoni
I wonder if these machines have any performance counters that could be helpful. I haven't ever looked into them. – Leoni
> do your machines actually have 6 or more cores – Yes, the A76 tests are being run on a RADAXA ROCK5 Model B, which has 4x A55 cores and 4x A76 cores, so cores 4 and 5 are the first two A76 cores. I was too lazy to implement CLI args for this example, so I just hardcode the affinity in the source. – Touraco
Also, the ns/op metric is off by two right now, since we give each core N ops. I will update the numbers so that each core does N/nthreads instead. This makes it easier to compare to the single-thread example. – Touraco
Are you sure this is about the memory ordering rather than the cache consistency model? The problem with false sharing is the ping-ponging of the line in the exclusive state between the two cores. Once the SB is full, it will take an RFO to drain each entry for your shared variable. IIRC, ARM has a more lax cache model, where you can have different consistency domains that must be synchronized explicitly in software. – Prelature
Hello, Margaret! I had not considered a different cache consistency model! Where would I look for details? – Touraco
It looks like Linux requires all cores to be in the same "Inner Domain": https://mcmap.net/q/1918643/-how-is-cache-coherency-maintained-on-armv8-big-little-system – Touraco
@MargaretBloom: I'm not aware of any systems (ARM or otherwise) where std::thread or pthreads will start threads across cores that aren't cache-coherent. GCC and Clang assume that __atomic operations only need to sync with other cores in the same Inner Shareable domain. And as the OP found, the Linux kernel also assumes that volatile load/store works like std::atomic with relaxed memory order, i.e. visibility to other cores via HW cache coherency without manual flushing. – Estate
ARM boards exist with cores that aren't cache-coherent (e.g. microcontroller + DSP), but they don't run a single Linux kernel (or freestanding threads) across those cores. (Semi-related: When to use volatile with multi threading? It does work in practice as somewhat like relaxed, but don't use it, because it's not better than std::atomic.) – Estate
