Multiple threads retrying a CAS loop is lock-free (but not wait-free): every time they all attempt the CAS with the same old value, at least one of them succeeds and makes progress (see https://en.wikipedia.org/wiki/Non-blocking_algorithm).
(Whether multiple threads all read the same old value, or whether some see the result of another thread's CAS, depends on timing, and is basically what determines how much contention there is.)
This is unlike a normal busy-wait loop which is just waiting for some unknown-length operation and could be stuck indefinitely if the thread holding a lock is descheduled. In that case, you definitely want to back off if your CAS fails to acquire a lock, because you definitely have to wait for another thread to do something before you can succeed.
Usually lockless algorithms are used in low-contention situations where sophisticated exponential backoff isn't really needed. That's what the linked SO answer is saying.
This is a key difference from the situation mentioned in the Wikipedia article, where many threads constantly update some particular shared variable. That's a high-contention situation, so it's probably better to let one thread do a bunch of updates in a row and keep the line hot in its L1d cache. (Assuming you're using CAS to implement an atomic operation the hardware doesn't support directly, like an atomic double-precision FP add where you do shared.CAS(old, old+1.0) or something, or as part of a lockless queue.)
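For concreteness, a minimal sketch of that kind of CAS-retry primitive in C++ (the function name is made up, and I believe C++20's std::atomic<double> has a real fetch_add built in; this just illustrates the pattern):

#include <atomic>

// Build an atomic double-precision add out of CAS (illustrative sketch).
double fetch_add_double(std::atomic<double>& shared, double delta) {
    double old = shared.load(std::memory_order_relaxed);
    // Lock-free but not wait-free: if several threads race, at least one
    // CAS succeeds; losers get `old` reloaded with the current value.
    while (!shared.compare_exchange_weak(old, old + delta)) {
        // on failure, compare_exchange_weak updates `old` for us; just retry
    }
    return old;   // value before our add, like fetch_add
}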
If you were using a CAS loop that was highly contended in practice, it might help total throughput somewhat to back off, e.g. run an x86 pause instruction on failure before trying again, so fewer cores are hammering on the cache line. Or for a lockless queue, if you find it was full or empty, then that's basically a waiting-for-another-thread situation, so you should definitely back off.
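A pause-on-failure variant of the same hypothetical loop might look like this (x86-specific; other ISAs have their own spin-wait hints):

#include <atomic>
#include <immintrin.h>   // _mm_pause

double fetch_add_double_backoff(std::atomic<double>& shared, double delta) {
    double old = shared.load(std::memory_order_relaxed);
    while (!shared.compare_exchange_weak(old, old + delta)) {
        _mm_pause();   // spin-wait hint: ease off the contended cache line
    }
    return old;
}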
Most architectures other than x86 have LL/SC as their atomic RMW primitive, not a direct hardware CAS. A CAS built out of LL/SC can spuriously fail if other threads are even reading the cache line during the attempt, so it's not necessarily guaranteed that at least one thread succeeds. Hopefully hardware designers make their LL/SC implementations resistant to spurious failure from contention, but I don't know the details. In that case, backoff might help to avoid potential livelock.
(On hardware where CAS can't spuriously fail from contention, livelock is impossible for something like while(!shared.CAS(old, old<<1)){}.)
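Incidentally, this is exactly why C++ distinguishes compare_exchange_weak (allowed to fail spuriously, so it can compile to a single LL/SC attempt) from compare_exchange_strong (which has to loop internally on LL/SC machines). In a retry loop you'd normally use the weak form; a sketch mirroring the shift example above:

#include <atomic>

void atomic_shift_left(std::atomic<unsigned>& shared) {
    unsigned old = shared.load(std::memory_order_relaxed);
    // The weak form may fail spuriously (e.g. a store-conditional that lost
    // the cache line), which is harmless here since we're retrying anyway.
    while (!shared.compare_exchange_weak(old, old << 1)) {
        // `old` now holds the freshly-observed value; just retry
    }
}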
Intel's optimization manual has an example of waiting for a lock to become free, where they pause in a loop 1 << retry_count times (up to some max backoff factor). Note that this is not a normal CAS loop that's part of a lockless algorithm; this is for implementing a lock.
The backoff is waiting for the lock to become free, not just for contention for access to the cache line containing the lock itself.
/// Intel's optimization manual
/// Example 2-2. Contended Locks with Increasing Back-off Example
/// from section 2.2.4 Pause Latency in Skylake Microarchitecture
/// (~140 cycles, up from ~10 in Broadwell, thus max backoff should be shorter)
/*******************/
/*Baseline Version */
/*******************/
// atomic {if (lock == free) then change lock state to busy}
while (cmpxchg(lock, free, busy) == fail)
{
    while (lock == busy)
        _mm_pause();
}
/*******************/
/*Improved Version */
/*******************/
int mask = 1;
int const max = 64; // MAX_BACKOFF
while (cmpxchg(lock, free, busy) == fail)
{
    while (lock == busy)
    {
        for (int i = mask; i; --i) {
            _mm_pause();
        }
        mask = mask < max ? mask << 1 : max; // mask <<= 1 up to a max
    }
}
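Intel's listing is pseudo-code (cmpxchg, free, busy, and fail aren't defined there). Here's one way it could look as compilable C++ with std::atomic; the bool encoding (false = free) and the function names are my assumptions, not Intel's:

#include <atomic>
#include <immintrin.h>   // _mm_pause

std::atomic<bool> lock_flag{false};          // assumed encoding: false = free, true = busy

void lock_with_backoff() {
    int mask = 1;
    const int max = 64;                      // MAX_BACKOFF
    bool expected = false;
    // atomically change the lock from free to busy, as in Intel's outer loop
    while (!lock_flag.compare_exchange_weak(expected, true,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
        // read-only spin with increasing backoff while the lock looks busy
        while (lock_flag.load(std::memory_order_relaxed)) {
            for (int i = mask; i; --i)
                _mm_pause();
            mask = mask < max ? mask << 1 : max;
        }
        expected = false;                    // CAS overwrote it with the observed value
    }
}

void unlock() {
    lock_flag.store(false, std::memory_order_release);
}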
I thought that normally, when waiting for a lock, you want to spin read-only instead of retrying cmpxchg (which this example does do, in its inner while (lock == busy) loop). I think this example from Intel is only demonstrating backoff, not the other parts of how to optimize a lock to avoid delaying the unlocking thread.
Anyway, remember that example is not like what we're talking about with a lockless queue or a CAS-retry implementation of an atomic add or other primitive. It's waiting for another thread to release a lock, not just retrying because a new value appeared between reading the old value and attempting to CAS in a new one.
The usual advice for Random in a multi-threaded application is to not use Random, but use ThreadLocalRandom instead, as documented: "Instances of java.util.Random are threadsafe. However, the concurrent use of the same java.util.Random instance across threads may encounter contention and consequent poor performance. Consider instead using ThreadLocalRandom in multithreaded designs". ;-) – Argolis