What are the CPU-internal characteristics of CAS collision?

I'm trying to understand the low-level mechanics of CAS on x86/x64 and I'd really appreciate some help/insight.

The reason I've been thinking about this is that I'm trying to reason about exponential backoff, and figure out in principle what the correct single unit of backoff delay should be.
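For concreteness, the loop I have in mind looks something like the sketch below (C++; the base unit and the cap are placeholder values of mine, not anything from the benchmark). The open question is what that base unit of delay should be.

    #include <atomic>
    #include <immintrin.h>   // _mm_pause()

    // Minimal sketch of a CAS retry loop with exponential backoff.
    void increment_with_backoff(std::atomic<unsigned long> &counter)
    {
        unsigned delay = 1;                       // hypothetical base unit of backoff
        unsigned long old = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(old, old + 1,
                                              std::memory_order_relaxed)) {
            // CAS failed at the data structure level: back off, then retry.
            for (unsigned i = 0; i < delay; ++i)
                _mm_pause();
            if (delay < 1024)                     // arbitrary cap
                delay *= 2;
        }
    }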

If I look at a lock-free freelist benchmark, without exponential backoff, I see that as the number of threads increases, performance rapidly flatlines.

Release 7 Lock-Free Freelist Benchmark #1

   M
   N
   S
  L3U
L2U L2U
L1D L1D
L1I L1I
 P   P
L L L L total ops,mean ops/sec per thread,standard deviation,scalability
0 0 0 1 310134488,31013449,0,1.00
0 1 0 1 136313300,6815665,38365,0.22

0 1 0 1 136401284,6820064,50706,0.22
1 1 1 1 111134328,2778358,23851,0.09

0 0 1 1 334747444,16737372,2421,0.54
1 1 1 1 111105898,2777647,40399,0.09

As we know, live-lock can occur, where each thread prevents the others from progressing.

My original - and I believe now mistaken - thought was that CAS was interfering with CAS. By that I mean, the CAS instruction itself would destructively collide with another CAS, should they occur concurrently. Both would fail. (Probably because, in the back of my mind, I was thinking about Ethernet.)

This 'obviously' explains the results - with all those CAS instructions operating concurrently, very few have a chance to fully execute before being destructively interrupted.

Having thought about it some more, I believe now this cannot be the case. The CAS instruction does not in fact HAVE a failure mode. It will tell you the destination is equal or not equal to the comparand. That's all. It doesn't come back and say "oh, sorry, bumped into someone else".

Destructive interference IS occurring, but it's occurring at a higher level, in the data structure algorithm itself. When we push to or pop from the freelist, we are actually TRYING to swap. We need the destination to be stable for long enough that we can read it, do whatever work we need to do, and then find it unchanged so we can complete our push/pop.

If other threads keep CASing, the destination isn't stable - it keeps changing - and we keep having to retry our operation.
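To make that concrete, here is a minimal sketch (my illustration, not the benchmark's code) of a lock-free freelist push. The window between the load of the head and the CAS is exactly the period during which the destination must remain stable; if another thread changes the head in that window, the CAS reports "not equal" and we go round again.

    #include <atomic>

    struct Node { Node *next; };

    // Push a node onto a Treiber-style freelist.
    void freelist_push(std::atomic<Node*> &head, Node *node)
    {
        Node *old_head = head.load(std::memory_order_relaxed);
        do {
            node->next = old_head;   // work done against our snapshot of head
        } while (!head.compare_exchange_weak(old_head, node,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
        // On failure, old_head is reloaded with the current head and we retry -
        // a retry at the data structure level, not a failure of CAS itself.
    }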

But now I'm confused.

What we see is that a single thread performs about 30 million push/pop operations. The destination has to be stable for the duration of one of these operations for the operation to succeed, so we see there are 30 million 'slots'. If we have two threads, then the maximum theoretical performance we can have is 15 million operations per thread, each thread using half the slots.

Now let's come back to CAS. CAS has no failure mode. So what happens when a second thread tries to CAS while another thread is already CASing? Well, the second thread will fail at the data structure level, since the swap couldn't occur, so it will retry the swap.

But now imagine we have LOTS of threads. The first thread to begin a CAS will succeed (assuming each CAS takes exactly the same time - not true, but that assumption doesn't change anything fundamental, so it's okay to reason with). All the others will fail.

But once the first thread has finished, the very next thread who reads the new destination value will have his CAS succeed (and all the other threads, still executing their CASs or now beginning new CASs, will fail).

So why do we not see perfect scaling? Because every 'slot' should be being used!

I think therefore I do not understand CAS properly.

Reading Intel's Architecture Software Developer's Manual, I find that if all data is present in cache (which is the situation I'm interested in), the cache coherency protocol takes care of CAS.

Drepper in his white paper describes LL/SC and how it works using MESI.

It seems reasonable to me for CAS to operate in a similar manner.

Let's consider the two thread case. First thread begins its CAS. The cache line with the destination is in its cache and marked exclusive.

Second thread begins to CAS. The first core sends its cache line over to the second core and both cores have that cache line marked shared.

First thread completes the CAS and writes to the cache line (the write always occurs on x86/x64, even if the compare was false; it just writes the original value).

The act of writing requires exclusive ownership, so an RFO occurs, causing the second core to mark its copy of the cache line invalid; the first core's line is then marked modified.

The second thread comes to complete its CAS and notices its cache line is invalid... and then what? I find it hard to believe the instruction is internally looped inside the CPU until it succeeds - although I wonder, because LL/SC on ARM requires you to write that retry loop yourself in assembly. The CAS instruction knows that the value of the destination has changed, so the result of its compare is invalid. But there's no error possible with CAS; it always returns true or false for the compare. Yet even if the instruction does loop internally until it completes, I'd still expect perfect scaling. Each 'slot' should still be used.
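For reference, this is how I understand the architectural behaviour of LOCK CMPXCHG, written out as a C++ model (a sketch of the SDM's description, not real code). In particular, note the destination receives a write cycle even when the compare fails, as mentioned above.

    #include <cstdint>

    struct CmpxchgResult { bool zf; uint64_t rax; };

    // Model of LOCK CMPXCHG m64, r64: everything inside happens as one
    // indivisible step with respect to other processors.
    CmpxchgResult lock_cmpxchg_model(uint64_t *dest, uint64_t rax, uint64_t src)
    {
        uint64_t observed = *dest;
        if (observed == rax) {
            *dest = src;                 // compare succeeded: store the new value
            return {true, rax};          // ZF = 1
        } else {
            *dest = observed;            // compare failed: destination still written
            return {false, observed};    // ZF = 0, RAX receives the observed value
        }
    }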

So what happens? what is happening with CAS?

What I do see is that as the thread count rises, less and less work is done - all available 'slots' certainly are not being used. Something is causing this. Is it destructive interference between CAS instructions? Or is it a large number of RFOs hogging the CPU-to-northbridge bus?

What I do note with considerable interest is that two threads on the same physical core scale perfectly. Something special and different is happening in that case - two threads on separate physical cores scale half as well. But it's not enough of a clue to explain it all.

Funderburk answered 19/4, 2011 at 17:5 Comment(0)

What you're seeing here is the cost of moving the data between the L1 caches of the two physical cores. When only using one core, the data sits in that L1 cache and each CAS operates at full speed with the data in the cache. When there are two cores active, on the other hand, each time a core succeeds in writing to the data, it will invalidate the other cache, which means the data has to be copied between the caches before the other core can do anything (generally, it will block waiting for the load before the CAS to complete). This is much more expensive than the actual CAS (the data needs to move up to the L3 cache at least and then back down to the other L1 cache), and results in the slowdown you see, as the data ends up ping-ponging back and forth between the two L1 caches.
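A quick way to see this cost in isolation (a minimal sketch of mine, not the benchmark) is to CAS-increment either one shared atomic or two atomics padded onto separate cache lines, and compare the timings:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    struct alignas(64) Padded { std::atomic<long> v{0}; };  // one cache line each

    static void spin(std::atomic<long> &c, long iters)
    {
        for (long i = 0; i < iters; ++i) {
            long old = c.load(std::memory_order_relaxed);
            // compare_exchange_weak compiles to LOCK CMPXCHG on x86-64
            while (!c.compare_exchange_weak(old, old + 1, std::memory_order_relaxed)) { }
        }
    }

    static double run(std::atomic<long> &a, std::atomic<long> &b, long iters)
    {
        auto t0 = std::chrono::steady_clock::now();
        std::thread t1(spin, std::ref(a), iters), t2(spin, std::ref(b), iters);
        t1.join(); t2.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    }

    int main()
    {
        const long iters = 10000000;
        Padded shared, sep1, sep2;
        std::printf("same cache line:      %.3fs\n", run(shared.v, shared.v, iters));
        std::printf("separate cache lines: %.3fs\n", run(sep1.v, sep2.v, iters));
    }

When the two threads hit the same line, each successful CAS invalidates the other core's copy and the line ping-pongs between the L1 caches; with separate lines, each core keeps its line in its own L1 and the loop runs at full speed.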

Sheriff answered 19/4, 2011 at 18:38 Comment(2)
Thanks for the reply, Chris. I'm currently in the process of digesting the responses I've had, here and on Usenet. One question - when moving data between L1 caches on separate cores, presumably the data needs to move up to at least the lowest shared cache? If the two cores had a shared L2, then it would only go up to there (just as two logical processors in one core only need to go up to the L1 cache, since they share that).Funderburk
Yes, if they shared L2, that would be as far as it needed to go, but in the diagram in the question, only L3 and above is shared -- L2 is split between the two cores (though it is unified between I and D, unlike the L1, a detail that is irrelevant for this as only D matters)Sheriff

By CAS, I assume you're talking about LOCK CMPXCHG.

Second thread begins to CAS. The first core sends its cache line over to the second core and both cores have that cache line marked shared.

You seem to think the operation starts, gets interrupted, and continues. CAS is atomic with respect to the memory subsystem. It reads the value, compares and writes in a single go. There is no window in which it can lose the cache line to another core once it has acquired it. How does that work? It asserts a processor lock signal for the duration of the instruction, so that other operations on the memory subsystem stall until the cache line is available again. That's why there is a LOCK prefix on the CMPXCHG instruction. You can read the LOCK description for further details.
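As an illustration (a sketch, not code from this thread), this is roughly how that LOCK prefix appears if you write the CAS yourself with GCC-style inline assembly on x86-64; compilers emit essentially the same instruction for the usual atomic builtins:

    #include <cstdint>

    // Illustrative 64-bit CAS with the LOCK prefix written out explicitly.
    static inline bool cas_u64(volatile uint64_t *dst, uint64_t expected, uint64_t desired)
    {
        bool ok;
        __asm__ __volatile__(
            "lock cmpxchgq %3, %1\n\t"  // atomically: if (*dst == rax) { *dst = desired; ZF = 1 }
            "sete %0"                   //             else             { rax = *dst;     ZF = 0 }
            : "=q"(ok), "+m"(*dst), "+a"(expected)
            : "r"(desired)
            : "memory", "cc");
        return ok;                      // true if the compare matched and the swap happened
    }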

So most of the contention happens at the L1, trying to get exclusive ownership of the cache line while that lock signal is asserted almost all the time. If the L1 already has the cache line (as in the case of two threads on the same core), the only contention is over the duration of the CAS itself, not including the cache line transfer across cores (since it's already there). And that's a lot faster.

Bainmarie answered 20/4, 2011 at 9:49 Comment(6)
Yes, I do mean LOCK CMPXCHG. The Architecture Manual describes the bus lock signal, but indicates that if all data is in cache, it will not be used and instead the MESI protocol will be used. I can understand how bus lock will work and I can see that will enforce serialisation of the complete operation across cores; but when bus lock is not used, another mechanism is in use - and that is, I think, the case in the benchmark in the OP. I suspect Chris Dodd's comments about RFOs explain the matter. I need to properly think about both your answers though - neither are fully digested yet.Funderburk
Yes, it does say that for P6 and later, the external LOCK bus signal is not asserted; rather, a cache lock is used. But the principle still applies. The cache line does not move while the cache lock is held (and that's not something MESI by itself provides). I cannot find any official reference to this, but this URL helps: engineering.purdue.edu/~eigenman/ECE563/Handouts/x86_memory.pdf (note the bits on load_lock and write_unlock, slide 11).Bainmarie
Really? I've not come across the notion of 'cache lock'. I thought if MESI was in use, well, you could tell if someone else touched your cache line, because you're no longer exclusive. So you don't need a cache lock, whatever it is. But perhaps it could be more efficient, or I may not understand properly. I'll read the PDF and see where it takes me.Funderburk
well, cache lock is actually mentioned in 8.1.4 of Intel Vol3a (Effects of a LOCK Operation on Internal Processor Caches). It's just that it does not explain what it is, mentioning a handwavy "allow it’s cache coherency mechanism to ensure that the operation is carried out atomically".Bainmarie
Yeah - I can believe it. I'm rather disappointed with those manuals. I was hoping to be able to find out the answers to the questions - and these are answers necessary for correct software design.Funderburk
Okay, I've just slowly and carefully re-read 8.1.4 and I see what you mean. I originally took cache coherency mechanism to mean MESI...Funderburk

So, I've been thinking all this over.

Currently, I have two separate proposals for how CAS is handled - 'cache lock' and MESI.

This post is purely with regard to cache lock.

Cache lock posits that a core locks the given cache line, and other cores attempting to CAS on that cache line stall until the lock is released.

Furthermore, I believe that CAS always writes its result back to memory before completing.

Taking that theory, let's look at the benchmark and try to interpret the results.

Release 7 Lock-Free Freelist Benchmark #1

   M
   N
   S
  L3U
L2U L2U
L1D L1D
L1I L1I
 P   P
L L L L total ops,mean ops/sec per thread,standard deviation,scalability
0 0 0 1 310134488,31013449,0,1.00
0 1 0 1 136313300,6815665,38365,0.22

0 1 0 1 136401284,6820064,50706,0.22
1 1 1 1 111134328,2778358,23851,0.09

0 0 1 1 334747444,16737372,2421,0.54
1 1 1 1 111105898,2777647,40399,0.09

So, first the single thread case:

L L L L total ops,mean ops/sec per thread,standard deviation,scalability
0 0 0 1 310134488,31013449,0,1.00

Here we have maximum performance. Every 'slot' is used by the single thread.

Now we come to two threads on the same core:

L L L L total ops,mean ops/sec per thread,standard deviation,scalability
0 0 1 1 334747444,16737372,2421,0.54

Here we still of course have the same number of 'slots' - a CAS takes as long as it takes - but we see they're evenly distributed between the two logical processors. This makes sense: one logical processor locks the cache line, the other stalls, the first completes, the second gets the lock... they alternate. The destination remains in L1 cache with the cache line in the modified state; we never need to re-read the destination from memory, so in that sense we're just like the one thread case.

Now we come to two threads on different cores:

L L L L total ops,mean ops/sec per thread,standard deviation,scalability
0 1 0 1 136401284,6820064,50706,0.22

Here we see our first big slowdown. Our maximum theoretical scaling is 0.5, but we're at 0.22. How come? Well, each thread is trying to lock the same cache line (in its own cache, of course), which is fine - but the problem is that when a core gets the lock, it needs to re-read the destination from memory, because its cache line will have been marked invalid by the other core modifying its copy of the data. So we attribute the slowdown to the memory reads we're having to do.

Now we come to four threads, two per core:

L L L L total ops,mean ops/sec per thread,standard deviation,scalability
1 1 1 1 111105898,2777647,40399,0.09

Here we see the total number of ops is actually only slightly less than one thread per core, although of course the scaling is much worse, since we now have four threads, not two.

In the one thread per core scenario, every CAS begins with a read of memory, since the other core has invalidated the CASing core's cache line.

In this scenario, when a core finishes a CAS and releases the cache lock, three threads are competing for the lock, two on another core, one on the same core. So two thirds of the time we need to re-read memory at the start of CAS; one third of the time we do not.

So we should be FASTER. But we are in fact SLOWER.

0% memory re-reading gives 33,474,744.4 total ops per second (two threads, same core)
66% memory re-reading gives 11,110,589.8 total ops per second (four threads, two per core)
100% memory re-reading gives 13,640,128.4 total ops per second (two threads, one per core)

And this puzzles me. The observed facts do not fit the theory.

Funderburk answered 22/4, 2011 at 14:22 Comment(1)
One note; the writeback of the CAS value to memory will cause it to be updated in the L3 unified cache, so the read by a second core will in fact only be from L3 cache, not from memory.Funderburk
