Why isn't CAS (Compare And Swap) equivalent to busy wait loops?

Reading a bit about lock-free programming over the past few days, I came across the java.util.Random class creating its bits using the following routine:

protected int next(int bits) {
    long oldseed, nextseed;
    AtomicLong seed = this.seed;    // local copy of the (final) field
    do {
        oldseed = seed.get();
        nextseed = (oldseed * multiplier + addend) & mask;   // next LCG state
    } while (!seed.compareAndSet(oldseed, nextseed));        // retry if another thread got there first
    return (int)(nextseed >>> (48 - bits));
}

According to this answer to the question "Spinlock vs Busy wait":

So-called lock-free algorithms tend to use tight busy-waiting with a CAS instruction, but the contention is in ordinary situations so low that the CPU usually have to iterate only a few times.

And the Wikipedia article "Compare-and-swap":

Instead of immediately retrying after a CAS operation fails, researchers have found that total system performance can be improved in multiprocessor systems—where many threads constantly update some particular shared variable—if threads that see their CAS fail use exponential backoff—in other words, wait a little before retrying the CAS.[4]

Should the Wikipedia article be understood as saying this has been found to help but is not yet employed, or is it common practice for CAS loops to artificially back off after they have failed? Is this the reason such a loop is not considered dangerous regarding CPU usage, or is it because the CAS isn't constantly contended?

Second question: is there any specific reason a local reference to seed is created, or could we simply use the field from class scope directly?

Arlie answered 14/9, 2018 at 19:3 Comment(3)
Second question is a duplicate of Why assign a final field variable to a local copy before calling its method (Best Practice or for a specific reason)? – Argolis
This is rarely if ever contended; it is highly unlikely to loop, and even less likely to go around more than once. – Scudder
@PeterLawrey Especially since the recommendation for using Random in a multi-threaded application is to not use Random, but to use ThreadLocalRandom instead, as documented: "Instances of java.util.Random are threadsafe. However, the concurrent use of the same java.util.Random instance across threads may encounter contention and consequent poor performance. Consider instead using ThreadLocalRandom in multithreaded designs". ;-) – Argolis

Multiple threads trying to CAS something is lock-free (but not wait-free): every time they all try with the same old value, exactly one of them succeeds, so some thread always makes progress. https://en.wikipedia.org/wiki/Non-blocking_algorithm

(Whether multiple threads all read the same old value or whether some see the result of another thread's CAS depends on timing, and is basically what determines how much contention there is.)
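
For illustration, here's a minimal Java sketch of that kind of CAS-retry loop (the class is hypothetical; AtomicLong already provides incrementAndGet, this just exposes the shape of the loop):

import java.util.concurrent.atomic.AtomicLong;

// Lock-free counter: if several threads read the same old value, exactly one
// CAS succeeds; the losers retry with the updated value, so the system as a
// whole always makes progress even though no single thread is wait-free.
class CasCounter {
    private final AtomicLong value = new AtomicLong();

    long increment() {
        long old, next;
        do {
            old = value.get();
            next = old + 1;
        } while (!value.compareAndSet(old, next)); // fails only because another thread just succeeded
        return next;
    }
}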

This is unlike a normal busy-wait loop, which is just waiting for some unknown-length operation and could be stuck indefinitely if the thread holding a lock is descheduled. In that case, you definitely want to back off when your CAS fails to acquire a lock, because you definitely have to wait for another thread to do something before you can succeed.


Usually lockless algorithms are used in low-contention situations where sophisticated exponential-backoff isn't really needed. That's what the linked SO answer is saying.

This is a key difference from the situation mentioned in the Wikipedia article, where many threads constantly update some particular shared variable. That's a high-contention situation, so it's probably better to let one thread do a bunch of updates in a row and keep the line hot in its L1d cache. (This assumes you're using CAS to implement an atomic operation the hardware doesn't support directly, like an atomic double-precision FP add where you do something like shared.CAS(old, old + 1.0), or as part of a lockless queue.)
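
As a concrete sketch of that kind of CAS-built primitive (class and method names are mine; Guava's AtomicDouble uses the same bit-reinterpreting trick), an atomic double-precision add in Java could look like this:

import java.util.concurrent.atomic.AtomicLong;

// Atomic FP add built from CAS, since the hardware has no atomic double add.
// The double is stored as its raw long bits inside an AtomicLong.
class AtomicDoubleAdd {
    private final AtomicLong bits = new AtomicLong(Double.doubleToRawLongBits(0.0));

    double addAndGet(double delta) {
        long oldBits, newBits;
        double newValue;
        do {
            oldBits = bits.get();
            newValue = Double.longBitsToDouble(oldBits) + delta;
            newBits = Double.doubleToRawLongBits(newValue);
        } while (!bits.compareAndSet(oldBits, newBits)); // lost a race: recompute and retry
        return newValue;
    }
}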

If you were using a CAS loop that was highly contended in practice, it might help total throughput somewhat to back off and e.g. run an x86 pause instruction on failure before trying again, so fewer cores are hammering on the cache line. Or for a lockless queue, if you find it full or empty, then you're basically in a waiting-for-another-thread situation, so you should definitely back off.
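
In Java, a sketch of such a backoff might use Thread.onSpinWait() (Java 9+, which JIT-compiles to pause on x86); the constants and the randomization are purely illustrative, not tuned values:

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

// CAS retry with randomized exponential backoff: after a failed CAS, pause
// for a randomly chosen number of spins (doubling the bound each time, up
// to a cap) before hammering the cache line again.
class BackoffCas {
    private static final int MAX_SPINS = 64; // illustrative cap

    static long doubleIt(AtomicLong shared) {
        int bound = 1;
        for (;;) {
            long old = shared.get();
            long next = old << 1;
            if (shared.compareAndSet(old, next))
                return next;
            int pauses = ThreadLocalRandom.current().nextInt(bound) + 1; // randomized to avoid stampedes
            for (int i = 0; i < pauses; i++)
                Thread.onSpinWait(); // pause on x86
            bound = Math.min(bound << 1, MAX_SPINS); // 1, 2, 4, ... capped
        }
    }
}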


Most architectures other than x86 have LL/SC as their atomic RMW primitive rather than a direct hardware CAS. A CAS built out of LL/SC can spuriously fail if other threads are even reading the cache line during the CAS attempt, so it might not be guaranteed that at least one thread succeeds.

Hopefully hardware designers build CPUs whose LL/SC resists spurious failure under contention, but I don't know the details. In that case, backoff might help to avoid potential livelock.

(On hardware where CAS can't spuriously fail from contention, livelock is impossible for something like while(!shared.CAS(old, old<<1)){}.)
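
Java's API actually exposes this distinction: AtomicLong.weakCompareAndSet is documented as possibly failing spuriously (so it can map directly onto a single LL/SC pair on such hardware), which is why it must always be called in a retry loop. Memory-ordering caveats aside, a sketch matching the shift example above:

import java.util.concurrent.atomic.AtomicLong;

// weakCompareAndSet may fail spuriously even when the value still matches,
// e.g. when implemented with LL/SC, so the loop is mandatory; a strong
// compareAndSet only fails if the value actually changed.
static long shiftLeft(AtomicLong shared) {
    long old, next;
    do {
        old = shared.get();
        next = old << 1;
    } while (!shared.weakCompareAndSet(old, next));
    return next;
}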


Intel's optimization manual has an example of waiting for a lock to become free, where they loop 1 << retry_count times (up to some max backoff factor). Note that this is not a normal CAS loop that's part of a lockless algorithm; this is for implementing a lock.

The backoff is waiting for the lock to become free, not just for contention for access to the cache line containing the lock itself.

/// Intel's optimization manual
/// Example 2-2. Contended Locks with Increasing Back-off Example
///
/// from section 2.2.4 Pause Latency in Skylake Microarchitecture
/// (~140 cycles, up from ~10 in Broadwell, thus max backoff should be shorter)
/*******************/
/*Baseline Version */
/*******************/
// atomic {if (lock == free) then change lock state to busy}
while (cmpxchg(lock, free, busy) == fail)
{
   while (lock == busy)
     _mm_pause();
}


/*******************/
/*Improved Version */
/*******************/
int mask = 1;
int const max = 64; //MAX_BACKOFF
while (cmpxchg(lock, free, busy) == fail)
{
   while (lock == busy)
   {
      for (int i=mask; i; --i){
         _mm_pause();
      }
      mask = mask < max ? mask<<1 : max;    // mask <<= 1  up to a max
   }
}

I thought that normally, when waiting for a lock, you'd want to spin read-only instead of retrying with cmpxchg. I think this example from Intel is only demonstrating backoff, not the other parts of how to optimize a lock to avoid delaying the unlocking thread.
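
For reference, a minimal test-and-test-and-set sketch in Java (class and method names are mine): spin on a plain read until the lock looks free, and only then attempt the atomic RMW, so waiters share the cache line instead of fighting for exclusive ownership of it.

import java.util.concurrent.atomic.AtomicBoolean;

// Test-and-test-and-set spinlock: the inner read-only loop lets all waiters
// share the cache line; only when the lock looks free does a thread try the
// atomic getAndSet, which needs exclusive ownership of the line.
class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        for (;;) {
            while (locked.get())
                Thread.onSpinWait();      // read-only spin
            if (!locked.getAndSet(true))  // single RMW attempt
                return;
        }
    }

    void unlock() {
        locked.set(false);
    }
}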

Anyway, remember that Intel's example is not like what we're talking about with a lockless queue or a CAS-retry implementation of an atomic add or other primitive. It's waiting for another thread to release a lock, not just retrying because a new value appeared between reading the old value and attempting to CAS in a new one.

Frayda answered 14/9, 2018 at 19:42 Comment(4)
Interesting: how (based on what) were the initial and maximum loop counts for the backoff selected, or is this just based on experiments? In Windows, the private API RtlBackoff uses much larger values, starting from 64-128 loops (randomly selected) and going up to 8-16k. – Gerrygerrymander
@RbMm: This is just an example, I think. And it's specifically for Skylake, where pause delays for ~140 cycles instead of ~5 or 10 in earlier uarches. I should maybe take this out of the example, because it's about using CAS to spin waiting for another thread, instead of building an atomic RMW operation like float ++. – Frayda
One interesting thing used in RtlBackoff: it randomizes the loop count (the current implementation uses __rdtsc). I think the idea is that if several threads concurrently try to update the value (say 3+ threads try at once: one updates, the other 2+ begin backoff), they use different pause loop counts so they don't all retry the update at the same time again, if I understand the idea of the __rdtsc use correctly. – Gerrygerrymander
@RbMm: Yes, randomized exponential backoff is very good if you're going to back off at all, in case multiple threads are likely to start their backoff sequence at the same time. Otherwise it's quite possible they all stampede and step on each other after delaying the same amount of time, and keep doing so. – Frayda
