Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock?

For the purposes of this question, I assume a simple spinlock that does not go to the OS to wait.

I see that a simple spinlock is often implemented using lock xchg or lock bts instead of lock cmpxchg.

But doesn't cmpxchg avoid writing the value if the expectation does not match? So aren't failed attempts cheaper with cmpxchg?

Or does cmpxchg write the data and invalidate the cache line in other cores even on failure?

This question is similar to What specifically marks an x86 cache line as dirty - any write, or is an explicit change required?, but it is specific to cmpxchg, not writes in general.
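
For concreteness, here is a minimal sketch of the two acquire loops I mean, written with C++ std::atomic (my own illustration; the lock word and its 0/1 encoding are just assumptions):

#include <atomic>

std::atomic<int> lock_word{0};   // hypothetical lock word: 0 = free, 1 = busy

// Variant 1: spin on an unconditional RMW; this compiles to (implicitly locked) xchg.
void acquire_xchg() {
    while (lock_word.exchange(1, std::memory_order_acquire) != 0) {
        // every failed attempt is still a locked write to the line
    }
}

// Variant 2: spin on a CAS; this compiles to lock cmpxchg.
void acquire_cmpxchg() {
    int expected = 0;
    while (!lock_word.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
        expected = 0;   // on failure, the CAS stored the observed value here
        // the question: does a failed lock cmpxchg still dirty the cache line?
    }
}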

Babylonian answered 21/7, 2020 at 6:46 Comment(10)
I think all atomic RMWs do effectively count as stores, including lock cmpxchg. At least historically (for externally-visible effects), felixcloutier.com/x86/cmpxchg says "The processor never produces a locked read without also producing a locked write." But that doesn't rule out optimization of a cache-lock for cacheable memory in modern CPUs.Sunrise
It has to at least bring the cache line into E state, invalidating other copies, before attempting lock cmpxchg, and that's where the cost comes from when spinning on it instead of spinning read-only until it looks like the lock is available. A lock variable is already normally going to be dirty (not in sync with DRAM)Sunrise
@Peter, Oh, I see. Then it does not matter whether the actual store optimizes away or not.Babylonian
Or... maybe the cache line may still become shared faster if skipping M state and not waiting for store to complete?Babylonian
If you want a core to be able to read but not write a line while checking lock availability, spin read-only with a load separate from the CAS attempt, xchg, or lock bts. This is clearly better because it leaves the line in S state, not E, and is (or should be) a well known fact among lock and other spin-loop implementers (on par with using pause in the spin-retry part). e.g. Locks around memory manipulation via inline assembly is an example.Sunrise
I have an answer partially written; I went looking for info on whether anyone's tested lock cmpxchg failure dirtying a cache line. I found atomic operation cost which makes the interesting point that pure-load + CAS might cause 2 cache misses: one to get Shared state for the load, another to get Exclusive. I'm still pretty sure that spinning read-only with pause after seeing it locked is a good idea, but I'm not totally sure that a pure load as the first operation is a good idea. To make the fast case faster, it might be best to start with lock CASSunrise
@FrancescoMenzani: Yes, spinning read-only and only attempting an atomic RMW when it looks possible is definitely better. (Thanks for the link that confirms that with experimental numbers). The interesting question is whether it hurts significantly in a somewhat-low contention situation for the first check to be read-only.Sunrise
@PeterCordes How can the first check be read-only? Can you please elaborate on the exact implementation?Hoagland
@FrancescoMenzani: Like the try_lock in the post you linked. if(load() == already_locked) goto read-only-spin-loop before attempting the first xchg or CAS. Locks around memory manipulation via inline assembly which I linked earlier is written that way (NASM syntax).Sunrise
Someone suggests doing xchg first, then relaxed load.Hoagland
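
To illustrate the two orderings discussed in these comments, here is a minimal sketch (assuming a hypothetical std::atomic<int> lock word; this is not the code from the linked answers):

#include <atomic>
#include <immintrin.h>   // _mm_pause

std::atomic<int> lock_word{0};   // hypothetical: 0 = free, 1 = busy

// Load-first: the initial check is a plain read, so the line can stay in Shared state.
void acquire_load_first() {
    for (;;) {
        while (lock_word.load(std::memory_order_relaxed) != 0)
            _mm_pause();                              // read-only spin while busy
        if (lock_word.exchange(1, std::memory_order_acquire) == 0)
            return;                                   // got the lock
    }
}

// RMW-first (optimistic): one locked op in the uncontended case,
// then fall back to a read-only + pause spin if the lock was taken.
void acquire_rmw_first() {
    while (lock_word.exchange(1, std::memory_order_acquire) != 0) {
        while (lock_word.load(std::memory_order_relaxed) != 0)
            _mm_pause();
    }
}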

On most or all current Intel x86 processors, a lock cmpxchg to a location whose memory type is WB and is fully contained within a single L1D cache line is executed as follows:

  • A lock-read request is issued to the L1D, which brings the target line into a locked-exclusive cache coherence state and provides the requested bytes as input to one of the execution ports to perform the comparison. (Cache locking has been supported since the P6.) A line in a locked state cannot be invalidated or evicted for any reason.
  • Perform the comparison for equality.
  • Whatever the result is, issue an unlock-write request to the L1D, which changes the state of the cache line to Modified and unlocks the line, thereby allowing other access or coherence requests to replace or invalidate the line.

The first and last steps can be observed empirically using either certain performance events or latency-based measurements. One way would be to allocate a large array of atomic variables and then execute lock cmpxchg in a loop over that array. The lock-read request type is one of the types of RFO requests. So the L2_TRANS.RFO event (or what's equivalent), which is reliable on most microarchitectures, can be used to measure the number of lock-reads to the L2. (L2_TRANS.RFO counts demand RFOs, so it's better to turn off the hardware prefetchers to avoid unwanted hits in the L2. This also applies to L2_RQSTS.RFO_*.)
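
A rough sketch of that measurement loop (illustrative only; the array size and the always-failing expected value are arbitrary choices):

#include <atomic>

// One atomic per cache line so every lock cmpxchg touches a distinct line.
struct alignas(64) Slot { std::atomic<long> v{0}; };
static Slot arr[1 << 20];                      // ~64 MiB, far larger than L2

int main() {
    for (auto& s : arr) {
        long expected = 1;                     // never matches the stored 0,
        s.v.compare_exchange_strong(expected,  // so every lock cmpxchg fails
                                    2, std::memory_order_relaxed);
    }
}

Run this under something like perf stat -e l2_trans.rfo (or whatever the equivalent event is named on the machine), with the hardware prefetchers disabled as noted above.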

There are also events for measuring the number of writebacks, such as L2_TRANS.L1D_WB, L2_TRANS.L2_WB, and others. Unfortunately, across many microarchitectures, many of these events either undercount, overcount, or count accurately but not necessarily all/only dirty cache line writebacks. So they are more difficult to reason about and, in general, not reliable.

A better way would be to execute lock cmpxchg on one section of the array on a particular physical core, then migrate the thread to another physical core (in the same L3 sharing domain) and execute a loop in which the elements of that section are read (normal reads). If the lock cmpxchg instruction puts the target line in the M state, a read request from another physical core in the same L3 sharing domain should hit in the L3 and also hit-modified in the private caches of the core on which lock cmpxchg was executed. These events can be counted using OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE (or what's equivalent), which is reliable on most/all microarchitectures.
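
A sketch of that experiment (assuming Linux; the core numbers are placeholders and would need to be two different physical cores in the same L3 sharing domain, with the offcore-response counts taken over the second phase only):

#include <sched.h>      // sched_setaffinity; may need _GNU_SOURCE on some setups
#include <atomic>
#include <cstdio>

struct alignas(64) Slot { std::atomic<long> v{0}; };
static Slot arr[1 << 16];

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);   // pin the calling thread
}

int main() {
    pin_to_cpu(0);                              // phase 1: physical core A
    for (auto& s : arr) {
        long expected = 1;                      // never matches, so the CAS fails
        s.v.compare_exchange_strong(expected, 2, std::memory_order_relaxed);
    }

    pin_to_cpu(2);                              // phase 2: another physical core
    long sum = 0;
    for (auto& s : arr)                         // plain demand loads
        sum += s.v.load(std::memory_order_relaxed);
    std::printf("%ld\n", sum);
    // If the failed lock cmpxchg left the lines in M state, phase 2 should show
    // up as hit-modified (HITM) in the OFFCORE_RESPONSE counts described above.
    return 0;
}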

A locked instruction is an expensive operation for three reasons: (1) it requires bringing the line into an exclusive state, (2) it makes the line dirty (possibly unnecessarily), and too many writebacks can have a significant impact on execution time, even more so when they end up stealing main memory bandwidth from long stretches of read requests, and even more so when the writes are to persistent memory, and (3) it is architecturally serializing, which puts the instruction on the critical path.

Intel has a patent that proposes an optimization for the last one, where the core optimistically assumes that there is no lock contention and issues a speculative normal load to the target line. If the line is not present in any other physical core, the line will be in an exclusive state in the requesting core. Then when the locked instruction executes and issues the lock-read request, the line would hopefully still be in the exclusive state, in which case the total latency of the locked instruction would be reduced. I don't know if any processor implements this optimization. If it's implemented, the number of L2_TRANS.RFO events would be much smaller than the number of lines locked.

Doy answered 11/8, 2020 at 1:0 Comment(2)
If the patent is implemented, is it likely that it is implemented equally for all locked instructions?Babylonian
@AlexGuteniev Yes, it's applicable to all.Doy

I made some tests. They were very synthetic though: I did very little work under the lock and measured the throughput of a highly contended scenario.

So far, I have observed no consistent difference between lock bts, xchg, and lock cmpxchg.

Other things, however, had some effect:

  • An inner load loop is definitely helpful, both with and without pause
  • One pause in the loop is helpful, both with and without a load loop
  • A load loop helps more than pause
  • The best results are achieved by applying the "Improved Version" from the Intel® 64 and IA-32 Architectures Optimization Reference Manual (see below)
  • Starting with a load instead of RMW/CAS has a mixed effect: it is helpful for tests without pause, but degrades performance in tests with pause

The Intel® 64 and IA-32 Architectures Optimization Reference Manual recommends using pause.

Example 2-4 (Contended Locks with Increasing Back-off) shows the baseline version:

/*******************/
/*Baseline Version */
/*******************/
// atomic {if (lock == free) then change lock state to busy}
while (cmpxchg(lock, free, busy) == fail)
{
    while (lock == busy)
    {
        __asm__ ("pause");
    }
}

and improved version:

/*******************/
/*Improved Version */
/*******************/
int mask = 1;
int const max = 64; //MAX_BACKOFF
while (cmpxchg(lock, free, busy) == fail)
{
    while (lock == busy)
    {
        for (int i = mask; i; --i) {
            __asm__ ("pause");
        }
        mask = mask < max ? mask << 1 : max;
    }
}
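
For reference, here is a compilable C++ rendering of the improved version (my own translation using std::atomic and _mm_pause; the FREE/BUSY encoding is an assumption, this is not Intel's code):

#include <atomic>
#include <immintrin.h>   // _mm_pause

enum : int { FREE = 0, BUSY = 1 };             // assumed lock encoding
std::atomic<int> lock_word{FREE};

void lock_with_backoff() {
    int mask = 1;
    const int max_backoff = 64;                // MAX_BACKOFF
    int expected = FREE;
    while (!lock_word.compare_exchange_strong(expected, BUSY,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
        while (lock_word.load(std::memory_order_relaxed) == BUSY) {
            for (int i = mask; i; --i)
                _mm_pause();                   // exponentially growing pause burst
            mask = mask < max_backoff ? mask << 1 : max_backoff;
        }
        expected = FREE;                       // the CAS overwrote it with the observed value
    }
}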

Windows SRWLOCK may also be a good example to follow. It uses a load loop and pause. It starts with an interlocked operation: lock bts for acquiring exclusive, lock cmpxchg for acquiring shared. Even TryAcquireSRWLockExclusive does only a lock bts:

RtlTryAcquireSRWLockExclusive:
00007FFA86D71370  lock bts    qword ptr [rcx],0  
00007FFA86D71376  setae       al  
00007FFA86D71379  ret  

It doesn't, however, implement an exponentially growing pause in the waiting versions. It does a small number of loads with one pause, then goes to an OS wait.
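
For comparison, a portable approximation of that try-acquire shape (a sketch only; std::atomic fetch_or of bit 0 has the same test-and-set effect, though compilers don't always lower it to lock bts):

#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> srw_like_word{0};   // hypothetical lock word, bit 0 = exclusive

// Roughly the shape of the lock bts / setae sequence shown above:
// set bit 0 atomically and succeed if it was previously clear.
bool try_acquire_exclusive() {
    return (srw_like_word.fetch_or(1, std::memory_order_acquire) & 1) == 0;
}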

Babylonian answered 6/8, 2020 at 5:12 Comment(9)
I assume you were just testing multiple threads doing nothing but spamming attempts to take the lock; IDK if a read-only test before the first atomic-RMW might be qualitatively different in a (hopefully) more typical medium to low contention situation. (Like actually better instead of just less-bad, in a properly-written implementation with a read-only + pause spin loop after failure.) It might always be bad, I hadn't considered the fact that a read-only access would probably get the line in Shared state, and then RMW would need an RFO.Sunrise
Attempting RMW first is the optimistic option, so it's probably even better in low-contention cases.Sunrise
@PeterCordes, I did an increment of a shared variable under the lock, to model lock use, and a couple of integer divisions outside to model something done not under the lock. Though probably just a couple of divisions isn't too much workBabylonian
If they're 64-bit divisions on an Intel CPU, that's maybe starting to be meaningful like 24 cycles / 56 uops for idiv r64 on SKL, although OoO exec can overlap the div / idiv microcode with the execution of a locked instruction's microcode. (Unlike lfence, locked instructions are only memory barriers, not execution barriers).Sunrise
@PeterCordes, made them 64-bit divisions; now the difference is less dramatic, but starting with a load is still a bit worse, using both a load loop and pause is better, and Intel's recommendation works best. I still think the "load first penalty" is small enough to do it in try_lock where a negative result is also a result.Babylonian
I don't want to block OoO in a benchmark, since OoO will also happen in real code, so let's assume the lock instruction overlaps with surrounding codeBabylonian
Hypothesis: A use-case where load-first could win is when the critical section takes longer, so it's more likely that multiple threads all see the lock taken during one "taking" of it. That means they cause less contention for the unlocking thread. (And maybe less overall traffic, but your benchmark doesn't measure that; something bandwidth-limited on the other hyperthread of each core might be interesting). Or maybe not, maybe a longer critical section gives time for things to settle down so all other threads get into their read-only loop and aren't in the middle of an RMW.Sunrise
Correct, you don't want to block OoO exec here, I'm just pointing out / thinking out loud about the fact that even though idiv r64 throughput is ~24 cycles, that doesn't mean that n * 24 cycles between unlock and attempt to retake the lock, because of OoO exec. But real-world use-cases might end with a store to memory right before attempting to take a lock, not just ALU instructions, which a locked instruction would have to wait for. (Because of store ordering, it has to drain the SB, and that includes waiting for in-flight stores to get their store-data, retire, and commit.)Sunrise
Yeah, I think normally you'd want to optimize so the uncontended case is fast, even for a TryLock function. Hopefully in most programs it still usually succeeds on the first try. I wondered if maybe the CPU might manage to upgrade the plain read to an RFO before it was "too late" (in which case there'd be no perf penalty), but your testing seems to prove that's not the case. Thanks for doing that; maybe I'll finish my partial answer and post it at some point.Sunrise
