On most or all current Intel x86 processors, a lock cmpxchg
to a location whose memory type is WB and is fully contained within a single L1D cache line is executed as follows:
- A lock-read request is issued to the L1D, which brings the target line in a locked-exclusive cache coherence state and provides the requested bytes as input to one of the execution ports to perform the comparison. (Cache locking is supported since the P6.) A line in a locked state cannot be invalidated or evicted for any reason.
- Perform the comparison for equality.
- Whatever the result is, issue a an unlock-write request to the L1D, which changes the state of the cache line to Modified and unlocks the line, thereby allowing other access or coherence requests to replace or invalidate the line.
The first and last steps can be observed empirically using either certain performance events or latency-based measurements. One way would be to allocate a large array of atomic variables and then execute lock cmpxchg
in a loop over that array. The lock-read request type is one of the types of RFO requests. So the L2_TRANS.RFO
event (or what's equivalent), which is reliable on most microarchitectures, can be used to measure the number of lock-reads to the L2. (L2_TRANS.RFO
counts demand RFOs, so it's better to turn off the hardware prefetchers to avoid unwanted hits in the L2. This also applies to L2_RQSTS.RFO_*
.)
There are also events for measuring the number of writebacks, such as L2_TRANS.L1D_WB
, L2_TRANS.L2_WB
, and others. Unfortunately, many of these events and across many microarchtiectures either undercount, overcount, or they count accurately but not necessarily all/only dirty cache line writebacks. So they are more difficult to reason with and in general not reliable.
A better way would be to execute lock cmpxchg
on one section of the array on a particular physical core, then migrate the thread to another physical core (in the same L3 sharing domain) and execute a loop in which the elements of that section are read (normal reads). If the lock cmpxchg
instruction puts the target line in the M state, a read request from another physical core in the same L3 sharing domain should hit in the L3 and also hit-modified in the private caches of the core on which lock cmpxchg
was executed. These events can be counted using OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE
(or what's equivalent), which is reliable on most/all microarchitectures.
A locked instruction is an expensive operation for three reasons: (1) Requires bringing the line in an exclusive state, (2) Makes the line dirty (possibly unnecessarily) and too many writebacks can have a significant impact on execution time, even more so when they end up stealing main memory bandwidth from long stretches of read requests, and even more so when the writes are to persistent memory, and (3) They are architecturally serializing, which makes the instruction on the critical path.
Intel has a patent that proposes an optimization for the last one, where the core optimistically assumes that there is no lock contention and issues a speculative normal load to the target line. If the line is not present in any other physical core, the line will be in an exclusive state in the requesting core. Then when the locked instruction executes and issues the lock-read request, the line would hopefully still be in the exclusive state, in which case the total latency of the locked instruction would be reduced. I don't know if any processor implements this optimization. If it's implemented, the number of L2_TRANS.RFO
events would be much smaller than the number of lines locked.
lock cmpxchg
. At least historically (for externally-visible effects), felixcloutier.com/x86/cmpxchg says "The processor never produces a locked read without also producing a locked write." But that doesn't rule out optimization of a cache-lock for cacheable memory in modern CPUs. – Sunriselock cmpxchg
, and that's where the cost comes from when spinning on it instead of spinning read-only until it looks like the lock is available. A lock variable is already normally going to be dirty (not in sync with DRAM) – Sunriselock bts
. This is clearly better because it leaves the line in S state, not E, and is (or should be) a well known fact among lock and other spin-loop implementers (on par with usingpause
in the spin-retry part). e.g. Locks around memory manipulation via inline assembly is an example. – Sunriselock cmpxchg
failure dirtying a cache line. I found atomic operation cost which makes the interesting point that pure-load + CAS might cause 2 cache misses: one to get Shared state for the load, another to get Exclusive. I'm still pretty sure that spinning read-only withpause
after seeing it locked is a good idea, but I'm not totally sure that a pure load as the first operation is a good idea. To make the fast case faster, it might be best to start with lock CAS – Sunrisetry_lock
in the post you linked.if(load() == already_locked) goto read-only-spin-loop
before attempting the first xchg or CAS. Locks around memory manipulation via inline assembly which I linked earlier is written that way (NASM syntax). – Sunrisexchg
first, then relaxed load. – Hoagland