Why does using the same cache-line from multiple threads not cause serious slowdown?

Look at this snippet:

#include <atomic>
#include <thread>

typedef volatile unsigned char Type;
// typedef std::atomic_uchar Type;

void fn(Type *p) {
    for (int i=0; i<500000000; i++) {
        (*p)++;
    }
}

int main() {
    const int N = 4;

    std::thread thr[N];
    alignas(64) Type buffer[N*64];

    for (int i=0; i<N; i++) {
        thr[i] = std::thread(&fn, &buffer[i*1]);
    }

    for (int i=0; i<N; i++) {
        thr[i].join();
    }

}

This little program increments four adjacent bytes a lot of times from four different threads. Until now, I followed the rule: don't use the same cache line from different threads, because cache-line sharing is bad. So I expected the four-thread version (N=4) to be much slower than the one-thread version (N=1).

However, these are my measurements (on a Haswell CPU):

  • N=1: 1 sec
  • N=4: 1.2 sec

So N=4 is not much slower. If I use different cache lines (replace *1 with *64), then N=4 becomes a little faster: 1.1 sec.

The same measurements for atomic access (swap the comments at typedef), same cache line:

  • N=1: 3.1 sec
  • N=4: 48 sec

So the N=4 case is much slower (as I expected). If different cache lines are used, then N=4 has performance similar to N=1: 3.3 sec.
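
For reference, the "different cache lines" variant mentioned above changes only the address passed to each thread (a sketch; everything else is as in the program above):

int main() {
    const int N = 4;

    std::thread thr[N];
    alignas(64) Type buffer[N*64];

    for (int i=0; i<N; i++) {
        // each thread now gets the first byte of its own 64-byte-aligned slot,
        // so no cache line is shared between threads
        thr[i] = std::thread(&fn, &buffer[i*64]);
    }

    for (int i=0; i<N; i++) {
        thr[i].join();
    }
}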

I don't understand the reason behind these results. Why don't I get a serious slowdown in the non-atomic N=4 case? The four cores have the same memory in their caches, so they must synchronize it somehow, don't they? How can they run almost perfectly in parallel? Why does only the atomic case get a serious slowdown?


I think I need to understand how memory gets updated in this case. In the beginning, no core has buffer in its cache. After one for iteration (in fn), all 4 cores have buffer's cache line in their caches, but each core writes a different byte. How do these cache lines get synchronized (in the non-atomic case)? How does the cache know which byte is dirty? Or is there some other mechanism to handle this case? Why is this mechanism a lot cheaper (actually, it is almost free) than the atomic one?

Ladybird answered 24/10, 2017 at 19:57 Comment(9)
@Mysticial: it doesn't. It is volatile, but I've checked the assembly too: it uses proper memory load/stores.Ladybird
I'd call 1.0 vs 1.2 sec (a 20% run-time increase) a serious slowdown for any code where performance matters.Refrigerant
@JesperJuhl: it is not serious compared to the slowdown for the atomic case. 20% vs. 1400%.Ladybird
This would be processor dependent, but store to load forwarding might mean the cache is not even in play here.Benita
Your test program doesn't really test anything since the four threads just stomp all over each other randomly. The final values of these increments is unpredictable.Fruge
@DavidSchwartz: No. Final values are completely predictable. Threads are incrementing different bytes. Final values will be the same as initial values, as the iteration count is divisible by 256.Ladybird
@Ladybird - a note that this has nothing to do with the threads writing different bytes: there is no magic that kicks in because the writes are independent: just try having them all write to buffer[0]: the performance is the same!Tho
Yes, I wasn't actually referring to the comment chain (haven't read it yet), just your note at the end of the question: "I think I need to understand...".Tho
@DavidSchwartz - you are wrong about what the C++ standard guarantees, at least as of C++11 (before that there was no memory model, so it was neither wrong nor right). Separate array elements are guaranteed to be independently accessible by different threads, per the memory model. See for example this answer. Without this guarantee, many concurrent structures would be tough to write! That said, it's all beside the point here: this code performs about the same whether the array elements are independent or the same, since coherence works on cache-line units.Tho

What you are seeing is basically the effect of the store buffer combined with store-to-load forwarding allowing each core to work mostly independently, despite sharing a cache line. As we will see below, it is truly a weird case where more contention is bad, up to a point, then even more contention suddenly makes things really fast!

Now with the conventional view of contention your code seems like something that will be high contention and therefore much slower than ideal. What happens, however, is that as soon as each core gets a single pending write in its write buffer, all later reads can be satisfied from the write buffer (store forwarding), and later writes just go into the buffer as well even after the core has lost ownership of the cache line. This turns most of the work into a totally local operation. The cache line is still bouncing around between the cores, but it's decoupled from the core execution path and is only needed to actually commit the stores now and then1.

The std::atomic version can't use this magic at all since it has to use locked operations to maintain atomicity and defeat the store buffer, so you see both the full cost of contention and the cost of the long-latency atomic operations2.
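
As an illustration (a sketch of my own, equivalent to swapping the typedef in the question to std::atomic_uchar), the atomic loop looks like this at the source level; on x86 each increment becomes a locked read-modify-write (e.g. lock add), which has to own the cache line and cannot hide in the store buffer:

#include <atomic>

void fn_atomic(std::atomic<unsigned char> *p) {
    for (int i = 0; i < 500000000; i++) {
        // (*p)++ on an atomic is equivalent to this seq_cst fetch_add;
        // each iteration is a full locked RMW that must commit to L1D
        p->fetch_add(1, std::memory_order_seq_cst);
    }
}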

Let's try to actually collect some evidence that this is what's occurring. All of the discussion below deals with the non-atomic version of the benchmark that uses volatile to force reads and writes from buffer.

Let's first check the assembly, to make sure it's what we expect:

0000000000400c00 <fn(unsigned char volatile*)>:
  400c00:   ba 00 65 cd 1d          mov    edx,0x1dcd6500
  400c05:   0f 1f 00                nop    DWORD PTR [rax]
  400c08:   0f b6 07                movzx  eax,BYTE PTR [rdi]
  400c0b:   83 c0 01                add    eax,0x1
  400c0e:   83 ea 01                sub    edx,0x1
  400c11:   88 07                   mov    BYTE PTR [rdi],al
  400c13:   75 f3                   jne    400c08 <fn(unsigned char volatile*)+0x8>
  400c15:   f3 c3                   repz ret 

It's straightforward: a five instruction loop with a byte load, an increment of the loaded byte, a byte store, and finally the loop counter decrement and conditional jump back to the top. Here, gcc has missed an optimization by separating the sub and the jne (the store is scheduled between them), inhibiting macro-fusion, but overall it's OK and the store-forwarding latency is going to limit the loop in any case.
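
The disassembly above can be reproduced with something like the following (demangled names, Intel syntax):

$ objdump -d -C -M intel ./cache-line-increment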

Next, let's take a look at the number of L1D misses. Every time a core needs to write into the line that has been stolen away, it will suffer an L1D miss, which we can measure with perf. First, the single threaded (N=1) case:

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment 

 Performance counter stats for './cache-line-increment':

       1070.188749      task-clock (msec)         #    0.998 CPUs utilized          
     2,775,874,257      cycles                    #    2.594 GHz                    
     2,504,256,018      instructions              #    0.90  insn per cycle         
       501,139,187      L1-dcache-loads           #  468.272 M/sec                  
            69,351      L1-dcache-load-misses     #    0.01% of all L1-dcache hits  

       1.072119673 seconds time elapsed

It is about what we expect: essentially zero L1D misses (0.01% of the total, probably mostly from interrupts and other code outside the loop), and just over 500,000,000 hits (matching almost exactly the number of loop iterations). Note also that we can easily calculate the cycles per iteration: 2,775,874,257 cycles / ~500,000,000 iterations ≈ 5.55. This primarily reflects the cost of store-to-load forwarding (about 4.5 cycles on this hardware5), plus one cycle for the increment, forming a carried dependency chain as the same location is repeatedly updated (and volatile means it can't be hoisted into a register).

Let's take a look at the N=4 case:

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment 

 Performance counter stats for './cache-line-increment':

       5920.758885      task-clock (msec)         #    3.773 CPUs utilized          
    15,356,014,570      cycles                    #    2.594 GHz                    
    10,012,249,418      instructions              #    0.65  insn per cycle         
     2,003,487,964      L1-dcache-loads           #  338.384 M/sec                  
        61,450,818      L1-dcache-load-misses     #    3.07% of all L1-dcache hits  

       1.569040529 seconds time elapsed

As expected, the L1 loads jump from 500 million to 2 billion, since there are 4 threads each doing 500 million loads. The number of L1D misses also jumped by about a factor of 1,000, to about 60 million. Still, that number is not a lot compared to the 2 billion loads (and 2 billion stores - not shown, but we know they are there). That's ~33 loads and ~33 stores for every miss. It also means about 250 cycles between each miss.

That doesn't really fit the model of the cache line bouncing around erratically between the cores, where as soon as a core gets the line another core demands it. We know that lines bounce between cores sharing an L2 in perhaps 20-50 cycles, so a rate of one miss every 250 cycles seems way too low.

Two Hypotheses

A couple of ideas spring to mind for the behavior described above:

  • Perhaps the MESI protocol variant used in this chip is "smart" and recognizes that one line is hot among several cores, but only a small amount of work is being done each time a core gets the line, and the line spends more time moving between L1 and L2 than actually satisfying loads and stores for some core. In light of this, some smart component in the coherence protocol decides to enforce some kind of minimum "ownership time" for each line: after a core gets the line, it will keep it for N cycles, even if demanded by another core (the other cores just have to wait).

    This would help balance out the overhead of cache line ping-pong with real work, at the cost of "fairness" and responsiveness of the other cores, kind of like the trade-off between unfair and fair locks, and counteracting the effect described here, where the faster & fairer the coherency protocol is, the worse some (usually synthetic) loops may perform.

    Now I've never heard of anything like that (and the immediately previous link shows that at least in the Sandy-Bridge era things were moving in the opposite direction), but it's certainly possible!

  • The store-buffer effect described is actually occurring, so most operations can complete almost locally.

Some Tests

Let's try to distinguish between these two hypotheses with some modifications.

Reading and Writing Distinct Bytes

The obvious approach is to change the fn() work function so that the threads still contend on the same cache line, but where store-forwarding can't kick in.

How about we just read from location x and then write to location x + 1? We'll give each thread two consecutive locations (i.e., thr[i] = std::thread(&fn, &buffer[i*2])) so each thread is operating on two private bytes. The modified fn() looks like:

for (int i=0; i<500000000; i++) {
    unsigned char temp = p[0];
    p[1] = temp + 1;
}

The core loop is pretty much identical to earlier:

  400d78:   0f b6 07                movzx  eax,BYTE PTR [rdi]
  400d7b:   83 c0 01                add    eax,0x1
  400d7e:   83 ea 01                sub    edx,0x1
  400d81:   88 47 01                mov    BYTE PTR [rdi+0x1],al
  400d84:   75 f2                   jne    400d78

The only thing that's changed is that we write to [rdi+0x1] rather than [rdi].

Now as I mentioned above, the original (same location) loop is actually running fairly slowly at about 5.5 cycles per iteration even in the best-case single-threaded case, because of the loop-carried load->add->store->load... dependency. This new code breaks that chain! The load no longer depends on the store so we can execute everything pretty much in parallel and I expect this loop to run at about 1.25 cycles per iteration (5 instructions / CPU width of 4).

Here's the single threaded case:

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment 

 Performance counter stats for './cache-line-increment':

        318.722631      task-clock (msec)         #    0.989 CPUs utilized          
       826,349,333      cycles                    #    2.593 GHz                    
     2,503,706,989      instructions              #    3.03  insn per cycle         
       500,973,018      L1-dcache-loads           # 1571.815 M/sec                  
            63,507      L1-dcache-load-misses     #    0.01% of all L1-dcache hits                 

       0.322146774 seconds time elapsed

So about 1.65 cycles per iteration3, roughly three times faster than incrementing the same location.

How about 4 threads?

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment 

 Performance counter stats for './cache-line-increment':

      22299.699256      task-clock (msec)         #    3.469 CPUs utilized          
    57,834,005,721      cycles                    #    2.593 GHz                    
    10,038,366,836      instructions              #    0.17  insn per cycle         
     2,011,160,602      L1-dcache-loads           #   90.188 M/sec                  
       237,664,926      L1-dcache-load-misses     #   11.82% of all L1-dcache hits  


       6.428730614 seconds time elapsed

So it's about 4 times slower than the same-location case. Rather than being just a bit slower than the single-threaded case, it is now about 20 times slower. This is the contention you've been looking for! Note also that the number of L1D misses has increased by about a factor of 4 as well, nicely explaining the performance degradation and consistent with the idea that when store-to-load forwarding can't hide the contention, misses will increase by a lot.

Increasing the Distance Between Stores

Another approach would be to increase the distance in time/instructions between the store and the subsequent load. We can do this by incrementing SPAN consecutive locations in the fn() method, rather than always the same location. E.g., if SPAN is 4, increment 4 consecutive locations like:

for (long i=0; i<500000000 / 4; i++) {
    p[0]++;
    p[1]++;
    p[2]++;
    p[3]++;
}

Note that we still perform 500 million increments in total, just spread out among 4 bytes. Intuitively you would expect overall performance to improve, since you now have SPAN parallel dependency chains, each 1/SPAN as long; so in the case above you might expect performance to improve by a factor of 4, since the 4 parallel chains can proceed at about 4 times the total throughput.

Here's what we actually get for the time (measured in cycles) for the 1-thread and 3-thread cases4, for SPAN values from 1 to 20:

[Chart: Time vs Increment Distance - cycles and L1D misses vs SPAN, 1-thread and 3-thread cases]

Initially you see performance increase substantially in both the single- and multi-threaded cases; the increase from a SPAN of one to two and then three is close to the theoretical expectation assuming perfect parallelism, in both cases.

The single-threaded case reaches an asymptote of about 4.25x faster than the single-location write: at this point the store-forwarding latency isn't the bottleneck and other bottlenecks have taken over (max IPC and store port contention, mostly).

The multi-threaded case is very different, however! Once you hit a SPAN of about 7, the performance rapidly gets worse, leveling out at about 2.5 times worse than the SPAN=1 case and almost 10x worse than the best performance at SPAN=5. What happens is that store-to-load forwarding stops occurring, because the store and the subsequent load are far enough apart in time/cycles that the store has already been committed to L1, so the load actually has to get the line and participate in MESI.

Also plotted are the L1D misses, which as mentioned above are indicative of "cache line transfers" between cores. The single-threaded case has essentially zero, and they are uncorrelated with the performance. The performance of the multi-threaded case, however, pretty much exactly tracks the cache misses. With SPAN values in the 2 to 6 range, where store-forwarding is still working, there are proportionally fewer misses. Evidently the core is able to "buffer up" more stores between each cache line transfer, since the core loop is faster.

Another way to think of it is that in the contended case L1D misses are basically constant per unit-time (which makes sense, since they are basically tied to the L1->L2->L1 latency, plus some coherency protocol overhead), so the more work you can do in between the cache line transfers, the better.

Here's the code for the multi-span case:

void fn(Type *p) {
    for (long i=0; i<500000000 / SPAN; i++) {
        for (int j = 0; j < SPAN; j++) {
            p[j]++;
        }
    }
}

The bash script to run perf for all SPAN values from 1 to 20:

PERF_ARGS=${1:--x, -r10}

for span in {1..20}; do
    g++ -std=c++11 -g -O2 -march=native -DSPAN=$span  cache-line-increment.cpp  -lpthread -o cache-line-increment
    perf stat ${PERF_ARGS} -e cycles,L1-dcache-loads,L1-dcache-load-misses,machine_clears.count,machine_clears.memory_ordering ./cache-line-increment
done

Finally, "transpose" the results into proper CSV:

FILE=result1.csv; for metric in cycles L1-dcache-loads L1-dcache-load-misses; do { echo $metric; grep $metric $FILE | cut -f1 -d,; } > ${metric}.tmp; done && paste -d, *.tmp

A Final Test

There's a final test that you can do to show that each core is effectively doing most of its work in private: use the version of the benchmark where the threads work on the same location (which doesn't change the performance characteristics) and examine the sum of the final counter values (you'd need int counters rather than char). If everything were atomic, you'd have a sum of 2 billion, and in the non-atomic case how close the total is to that value is a rough measure of how frequently the cores were passing the line around. If the cores are working almost totally privately, the value would be closer to 500 million than to 2 billion, and I guess that's what you'll find (a value fairly close to 500 million).
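
Here is a minimal sketch of such a test (my own harness, not code from the question; it deliberately contains a data race on a plain volatile int, which is formally undefined behavior but fine as an experiment):

#include <cstdio>
#include <thread>

typedef volatile int Counter;   // int, so up to 2 billion increments don't wrap

void hammer(Counter *p) {
    for (int i = 0; i < 500000000; i++) {
        (*p)++;                 // plain load + add + store, not atomic
    }
}

int main() {
    const int N = 4;
    Counter counter = 0;

    std::thread thr[N];
    for (int i = 0; i < N; i++) thr[i] = std::thread(&hammer, &counter);
    for (int i = 0; i < N; i++) thr[i].join();

    // 2,000,000,000 would mean no increment was ever lost; a value much
    // closer to 500,000,000 means each core mostly saw only its own stores.
    std::printf("final value: %d\n", counter);
}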

With some more clever incrementing, you could even have each thread track how often the value it incremented came from its own last increment rather than from another thread's increment (e.g., by using a few bits of the value to stash a thread identifier). With an even more clever test you could practically reconstruct the way the cache line moved around between the cores (is there a pattern, e.g., does core A prefer to hand off to core B?) and which cores contributed most to the final value, etc.

That's all left as an exercise :).


1 On top of that, if Intel has a coalescing store buffer where later stores that fully overlap earlier ones kill the earlier stores, it would only have to commit one value to L1 (the latest store) every time it gets the line.

2 You can't really separate the two effects here, but we will do it later by defeating store-to-load forwarding.

3 A bit more than I expected, perhaps due to bad scheduling leading to port pressure. If gcc would just allow the sub and jne to fuse, it runs at 1.1 cycles per iteration (still worse than the 1.0 I'd expect). It will do that if I use -march=haswell instead of -march=native, but I'm not going to go back and change all the numbers.

4 The results hold with 4 threads as well, but I only have 4 cores and I'm running stuff like Firefox in the background, so using one fewer core makes the measurements a lot less noisy. Measuring time in cycles helps a lot too.

5 On this CPU architecture, store forwarding where the load arrives before the store data is ready seems to alternate between 4 and 5 cycles, for an average of 4.5 cycles.

Tho answered 26/10, 2017 at 22:25 Comment(9)
Thanks BeeOnRope, now I actually understand what's going on, great answer! Maybe you omitted the perf results for the N=4 case by accident? (Don't bother with it if you don't have them, as your answer is understandable without it.)Ladybird
What's the typical size of this write buffer?Ladybird
@Ladybird - sorry got distracted IRL and truncated my answer. I'm still finishing it - there isn't much "proof" there yet :). The write queue (aka store buffer) is 42 entries in Haswell.Tho
Did you use an old gcc that doesn't know -march=skylake? It's unfortunate that gcc -march=native just enables instruction-sets and uses tune=generic in that case. It's also unfortunate that gcc doesn't default to optimizing for macro-fusion (at least of test/cmp) now that AMD / Intel mainstream chips have had it for years. But they aren't willing to unless someone has test data showing it doesn't hurt other chips much. gcc.gnu.org/bugzilla/show_bug.cgi?id=78855.Armstrong
@PeterCordes - well yes, I'm using the "old" version of gcc (5.4.0) that comes with my distribution, Ubuntu 16.04 which was released almost a year after Skylake was released (and no doubt more than a year after the family & model IDs became available). So "old" in the sense that it doesn't support Skylake, but quite "normal" in the sense that it's just my disto-included gcc which is probably how the large majority of developers use it. So yes, it totally sucks that -march=native silently falls back on some totally obsolete tuning profile.Tho
I mean the advice to "just use -march=native" can often be pretty much totally wrong based on this behavior: it only applies to people with new-enough gcc versions; other people will silently get sometimes much worse code. It also leads to the odd situation where you compile stuff from source and then upgrade your computer, and stuff starts running slower as it gets recompiled, because -march=native is now producing bad code (but the existing binaries you compiled on your old box work great on the new box, because tuning for the old platform is a good bet for the new).Tho
A reasonable fallback when the CPU version is "too new" would just be to use the tuning profile from the newest known version of the same "product line" which sounds complicated but we are really just talking about normal Intel, Atom/Silvermont Intel stuff, KNL/KNC stuff and AMD, right? Bleh. Oddly this doc claims that gcc 5.3 supports all the way up to skylake-avx512, but my 5.4 version doesn't even support skylake.Tho
re: "and counteracting the effect [described here]" ... "(and the immediately previous link shows that at least in the Sandy-Bridge era things were moving in the opposite direction)" link is missingJunitajunius
@Junitajunius - good catch. Link fixed (at least to the limit of my recollection about what I wanted to link there 3.5 years ago).Tho

The atomic version has to ensure that some other thread will be able to read the result in a sequentially consistent fashion. So there are fences for each write.

The volatile version does not establish any ordering relationship visible to the other cores, so it does not attempt to synchronize memory so that it becomes visible to them. In a multi-threaded program using C++11 or newer, volatile is not a mechanism for communicating between threads.
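
As a small illustration of that last point (my own sketch, not part of this answer): to make one thread's write reliably visible to another, use std::atomic with acquire/release (or the default seq_cst) ordering instead of volatile:

#include <atomic>
#include <thread>

std::atomic<bool> ready{false};
int payload = 0;

void producer() {
    payload = 42;                                  // ordinary write
    ready.store(true, std::memory_order_release);  // publishes payload
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { }  // spin until published
    // Here payload is guaranteed to be 42; a volatile flag would give
    // no such guarantee.
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}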

Dorn answered 24/10, 2017 at 20:23 Comment(8)
How does the CPU synchronize caches? Are there byte-level dirty flags in it? That would be an explanation for this behavior (if there aren't, how can the cache know which byte is valid and which isn't?)Ladybird
@geza, you may want to Google "Cache coherency protocols"Phelgon
@Ladybird You might also be surprised that your threads need not run on NUMA-different CPU cores at all, so they may still live under the same L1d/L2/L3 cache hierarchy.Candis
@user3666197: I'm learning these low-level things right now. But I don't think that's the case. Then the atomic case would not get a slowdown, would it?Ladybird
@WhiZTiM: thanks for the suggestion! I'm familiar with MESI at a basic level. However, that doesn't explain this behavior. Can you point me in a more specific direction, so I can actually understand what's going on in my case? I already knew what mksteve writes here, so I still don't know the exact cause of the no-slowdown. I mean, all four cores have dirty cache lines for the same 64-byte memory segment, but only one byte is dirty in each of them. How do they update memory with only the dirty byte?Ladybird
@Ladybird might like these few views: stackoverflow.com/a/995071 + stackoverflow.com/a/14761230Candis
The atomic version isn't just slower because of fences, but because the increment is performed atomically (RMW). That requires the cache line to be locked during the operation, effectively blocking it from being accessed by any other thread. The same code using atomics on different cache lines may be 10x faster.Blacklist
On x86, every atomic RMW instruction is also a full barrier (i.e. fetch_add(1, mo_seq_cst) compiles to exactly the same asm on x86 as fetch_add(1, mo_relaxed), unless there's any compile-time reordering that relaxed allows.) This is not the case for other ISAs like PowerPC where you can do atomic RMW without ordering stores to other locations. But PowerPC relaxed would still be slow for this test, because it still does require the store to commit to L1D instead of just being in the store buffer, so you would suffer cache-line ping-pong.Armstrong
