Anything in std::atomic is wait-free?
If T is a C++ fundamental type, and if std::atomic<T>::is_lock_free() returns true, then is there anything in std::atomic<T> that is wait-free (not just lock-free)? Like, load, store, fetch_add, fetch_sub, compare_exchange_weak, and compare_exchange_strong.

Can you also answer based on what is specified in the C++ Standard, and what is implemented in Clang and/or GCC (your version of choice).

My favorite definitions for lock-free and wait-free are taken from C++ Concurrency in Action (available for free). An algorithm is lock-free if it satisfies the first condition below, and it is wait-free if it satisfies both conditions below:

  1. If one of the threads accessing the data structure is suspended by the scheduler midway through its operation, the other threads must still be able to complete their operations without waiting for the suspended thread.
  2. Every thread accessing the data structure can complete its operation within a bounded number of steps, regardless of the behavior of other threads.
Deedradeeds answered 17/5, 2020 at 9:58 Comment(11)
Wait-free is not the same as lock-free.Sulfapyridine
I think wait-free being different from lock-free is obvious from the question! The question explicitly says "... wait-free (not just lock-free)?"Deedradeeds
There is no requirement that the lock-free operations be wait-free. Most RISC architectures lack an atomic load-modify-store operation.Adjacency
@Koosha: If lock-free isn't wait-free... what does it matter? I'm being serious; if the C++ standard library implementation can't give you a wait-free load/modify/store for some type, do you think that you will be able to construct one yourself? How are you going to implement your algorithm differently?Fieldwork
@RaymondChen: But even if you do have atomic RMW in hardware (not LL/SC), is x86 lock add dword [rdi], 1 wait-free? Depends what you mean by a "step". One instruction, but HW arbitration for access to the cache line means that only one core can own the cache line at once, so if "step = clock cycle" it's still more like lock-free than wait-free. This is similar to Are X86 atomic RMW instructions wait free. And BTW, load() is wait-free on everything, except maybe 16-byte loads on x86 implemented with lock cmpxchg16b which old GCC called lock-freePistachio
@NicolBolas: I don't think language-lawyer is a good fit for this question. The standard doesn't have much to say about what is_lock_free() means in terms of being wait-free vs. lock-free in the computer science / en.wikipedia.org/wiki/Non-blocking_algorithm sense. I think C++ is just using it in the sense of "lockless", no hidden mutex. Anyway, I tagged this computer-science because it's about terminology and putting a label on what real implementations actually do (the pure ISO C++ part is unanswerable). cpu-architecture could fit instead...Pistachio
As an example, with gcc on x86-64, std::atomic<double>::fetch_add() is lock-free but not wait-free; it does a compare-and-swap loop, so if another thread is modifying the variable very frequently, we may have to loop arbitrarily many times. See on godboltSandpaper
@PeterCordes: the OP asked, "Can you also answer based on what is specified in the C++ Standard". If the standard doesn't say, then it doesn't say, and that's a valid answer.Fieldwork
@NicolBolas: Sure, but the c++ and stdatomic tags capture that sufficiently. Since the pure standard part of the question doesn't result in a very useful answer, I think that aiming the question in that direction by tagging it language lawyer isn't ideal. A more interesting answer would say that and then go on to talk about real implementations on LL/SC machines like ARM pre ARMv8.1 vs. atomic RMW machines like x86 and ARMv8.1. I'm still deciding what I want to say about cycles or time vs. instructions as far as "steps".Pistachio
@PeterCordes: "Since the pure standard part of the question doesn't result in a very useful answer" I guess that depends on your reading of the standard, but it is what the OP asked for.Fieldwork
@PeterCordes and @NicolBolas: Thanks for the comments. Based on what I read here, it seems there is no hope for a wait-free guarantee from the C++ Standard. However, the question (right after where you quoted from) says "... and what is implemented in Clang and/or GCC (your version of choice)." So if the C++ Standard does not ask for it, what is the situation on GCC and Clang (over the architectures that you are more familiar with)?Deedradeeds
There exist universally accepted definitions of lock-freedom and wait-freedom, and the definitions provided in your question are consistent with those. I would assume that the C++ standard committee sticks to the definitions that are universally accepted in the scientific community.

In general, publications on lock-free/wait-free algorithms assume that CPU instructions are wait-free; the arguments about progress guarantees instead focus on the software algorithm.

Based on this assumption I would argue that any std::atomic method that can be translated to a single atomic instruction for some architecture is wait-free on that specific architecture. Whether such a translation is possible sometimes depends on how the method is used though. Take for example fetch_or. On x86 this can be translated to lock or, but only if you do not use its return value, because this instruction does not provide the original value. If you use the return value, then the compiler will create a CAS-loop, which is lock-free, but not wait-free. (And the same goes for fetch_and/fetch_xor.)

So which methods are actually wait-free depends not only on the compiler, but mostly on the target architecture.

Whether it is technically correct to assume that a single instruction is actually wait-free is rather a philosophical question IMHO. True, it might not be guaranteed that an instruction finishes execution within a bounded number of "steps" (whatever such a step might be), but the machine instruction is still the smallest unit on the lowest level that we can see and control. In fact, if you cannot assume that a single instruction is wait-free, then strictly speaking it is not possible to run any real-time code on that architecture, because real-time also requires strict bounds on time/the number of steps.


This is what the C++17 standard states in [intro.progress]:

Executions of atomic functions that are either defined to be lock-free (32.8) or indicated as lock-free (32.5) are lock-free executions.

  • If there is only one thread that is not blocked (3.6) in a standard library function, a lock-free execution in that thread shall complete. [ Note: Concurrently executing threads may prevent progress of a lock-free execution. For example, this situation can occur with load-locked store-conditional implementations. This property is sometimes termed obstruction-free. — end note ]
  • When one or more lock-free executions run concurrently, at least one should complete. [ Note: It is difficult for some implementations to provide absolute guarantees to this effect, since repeated and particularly inopportune interference from other threads may prevent forward progress, e.g., by repeatedly stealing a cache line for unrelated purposes between load-locked and store-conditional instructions. Implementations should ensure that such effects cannot indefinitely delay progress under expected operating conditions, and that such anomalies can therefore safely be ignored by programmers. Outside this document, this property is sometimes termed lock-free. — end note ]

The other answer correctly pointed out that my original answer was a bit imprecise, since there exist two stronger subtypes of wait-freedom.

  • wait-free - A method is wait-free if it guarantees that every call finishes its execution in a finite number of steps, i.e., it may not be possible to determine an upper bound, but the number of steps must still be guaranteed to be finite.
  • wait-free bounded - A method is wait-free bounded if it guarantees that every call finishes its execution in a bounded number of steps, where this bound may depend on the number of threads.
  • wait-free population oblivious - A method is wait-free population oblivious if it guarantees that every call finishes its execution in a bounded number of steps, and this bound does not depend on the number of threads.

So strictly speaking, the definition in the question is consistent with the definition of wait-free bounded.

In practice, most wait-free algorithms are actually wait-free bounded or even wait-free population oblivious, i.e., it is possible to determine an upper bound on the number of steps.

Ingratiating answered 18/5, 2020 at 8:15 Comment(11)
I think the C++ definition of is_lock_free() is more aligned with "lockless" (i.e. not using a mutex so it can't ever be stuck waiting for another thread to change a value in memory), although it does imply being at least lock-free in abstract CS theory terminology.Pistachio
A useful definition of lock-free vs. wait-free is that lock-free only implies that at least 1 thread can make forward progress per "step" that they all execute. (e.g. a CAS retry loop), while wait-free implies that all threads will make forward progress without any starvation. This may allow a sensible extension of wait-freedom from software "steps" (instructions) to hardware steps (clock cycles), considering HW arbitration for exclusive ownership of cache lines. Any real HW is finite in number of cores, though, so really the useful distinction is that RMW / store contends, reads don't.Pistachio
@PeterCordes I have updated my answer with a quote from the C++ standard. It seems to depend on the target architecture whether some operations are lock-free or obstruction-free. "lockless" usually means something different as the progress of one thread depends on another thread making progress.Ingratiating
I think the problem is that there are many definitions of wait-freedom. @PeterCordes's definition seems different from the one in the question. Another definition is similar to the one in the question, but replaces 'bounded' with 'finite', which means 'non-infinite'.Selda
If we take the definition in the question, why are CAS loops not wait-free? Isn't the upper bound on loop retries proportional to the number of threads contending? By this other definition of wait-free, I would say CAS loops are wait-free bounded.Selda
Can you confirm that the universally accepted definition of wait-freedom is what this person calls wait-free population-oblivious? That is, the bound is constant and may not depend on the number of threads.Selda
@FrancescoMenzani Your referenced blog posting from concurrencyfreaks already contains the universally accepted definition of wait-freedom: A method is Wait-Free if it guarantees that every call finishes its execution in a finite number of steps. Wait-Free Bounded and Wait-Free Population Oblivious are stronger subsets of Wait-Free. A CAS loop is usually not wait-free, because usually the number of iterations is not bounded. However, a CAS operation would be Wait-Free Population Oblivious.Ingratiating
You wrote that the definitions provided in the question are consistent with universally accepted definitions of lock-freedom and wait-freedom. In the question the word bounded is used, while in the blog post I referenced the word finite is used. These two words have different meanings, haven't they? So what is the universally accepted definition? If the number of iterations of a CAS loop is bounded by the number of threads, can it be said to be wait-free (and in particular wait-free bounded) by the definitions in the blog post I referenced?Selda
@FrancescoMenzani You are right, that was a bit imprecise. Wait-Free requires the operation to terminate within a finite number of steps. Wait-Free Bounded requires the operation to terminate within a fixed and pre-determined number of steps. The problem with CAS loops is what do you do if the CAS fails? Usually you keep retrying until it succeeds, but this is obviously not wait-free. But if the number of iterations of a CAS loop is bounded by the number of threads, then yes, this would be wait-free bounded.Ingratiating
@Ingratiating By "pre-determined", do you mean "constant", or may it depend on the number of threads? Per my understanding, the former case is wait-free population-oblivious, and the latter is wait-free bounded. Why is retrying CAS until it succeeds not wait-free as per the definition in my answer? By "bounded by the number of threads", I mean that the time it takes for all threads to exit the CAS loop is proportional to the contention level (that is the number of threads).Selda
@FrancescoMenzani This was a quote from the paper referenced in the blog post, but I also find this wording a bit confusing. Wait-free bounded means that you can give an upper bound on the number of steps, where this bound depends on the number of threads. I have updated my answer to provide more details on the wait-free definitions. Regarding CAS loops - it is usually not possible to give an upper bound on the number of retries, since in most algorithms a competing thread can intervene over and over again. So theoretically a thread can keep retrying forever, which is obviously not wait-free.Ingratiating
Since there are many definitions of wait-freedom[1] and people choose different ones, I think that a precise definition is paramount, and a distinction between its specializations is necessary and useful.

These are the universally accepted definitions of wait-freedom and its specializations:

  • wait-free: All threads will make progress in a finite number of steps.

  • wait-free bounded: All threads will make progress in a bounded number of steps, which may depend on the number of threads.

  • wait-free population-oblivious[3]: All threads will make progress in a fixed number of steps that does not depend on the number of threads.

Overall, the C++ standard makes no distinction between lock-free and wait-free (see this other answer). It always gives guarantees no stronger than lock-free.

When std::atomic<T>::is_lock_free() returns true, the implementation uses atomic instructions instead of mutexes, possibly wrapped in CAS loops or LL/SC loops.
Atomic instructions are wait-free; CAS and LL/SC loops are only lock-free.

How a method is implemented depends on many factors, including its usage, the compiler, the target architecture and its version. For example:

  • As someone says, on x86 gcc, fetch_add() for std::atomic<double> uses a CAS loop (lock cmpxchg), while for std::atomic<int> it uses lock add or lock xadd.
  • As someone else says, on architectures featuring LL/SC instructions, fetch_add() uses an LL/SC loop if no better instruction is available. For example, this is not the case on ARMv8.1 and above, where ldaddal is used for non-relaxed std::atomic<int> and ldadd is used if relaxed.
  • As stated in this other answer, on x86 gcc fetch_or() uses lock or if the return value is not used, otherwise it uses a CAS loop (lock cmpxchg).

As explained in this answer of mine:

  • The store() method and lock add, lock xadd, lock or instructions are wait-free population-oblivious, while their "algorithm", that is the work performed by the hardware to lock the cache line, is wait-free bounded.
  • The load() method is always wait-free population-oblivious.

[1] For example:

  • all threads will make progress in a finite number of steps (source)

  • all threads will make progress in a bounded number of steps[2]

  • per "step" that they all execute, all threads will make forward progress without any starvation (source)

[2] It is not clear whether the bound is constant, or whether it may depend on the number of threads.

[3] A strange name and not good for an acronym, so maybe another one should be chosen.

Selda answered 22/7, 2020 at 17:58 Comment(10)
CAS retry loops are lock-free but not wait-free in the usual en.wikipedia.org/wiki/Non-blocking_algorithm sense. If other threads are also modifying an object, it can take an unbounded number of retries before your attempt to x.CAS(old, old+1) succeeds. But it only fails on changes, so at least one thread makes progress every time they all try, thus lock-free. A single CAS itself is wait-free, though.Pistachio
On LL/SC machines, stuff like fetch_add is typically implemented as a retry loop, with LL/SC instead of CAS, so again, lock-free on some machines, only wait-free on machines with a dedicated instruction (such as x86) that always eventually succeed. But even then, that's just number of "steps" (instructions), not time (see comments on Are X86 atomic RMW instructions wait free). The more contention there is, the longer you'll have to wait on average. i.e. aggregate throughput of atomic x++ does not increase with more threads.Pistachio
You argue that a CAS loop is wait-free bounded because it will finish in a finite number of steps. This is not the case: if one thread is unlucky, it could in theory keep failing its CAS attempt while other threads keep storing new values (e.g. with store or CAS). Maybe you're thinking of a case where there are only e.g. 8 total modifications to an object to be done, by separate threads. Then yes, at most 7 failures are possible. But if other threads keep starting new operations (new CAS loops or stores) after their previous one succeeded, starvation of unlucky thread(s) can be indefinite.Pistachio
Also note, unlike Are X86 atomic RMW instructions wait free, this is not an x86 question. It's about abstract ISO C++. Answering for some common implementation is fine, but make sure to state that your conclusion only applies to a certain implementation, or to common CPUs. e.g. ARM is a lot different from x86, with single-instruction atomic RMW (which scales better than LL/SC retry loops) only available on ARMv8.1.Pistachio
I agree with @PeterCordes in all points - and that is also the reason why is_lock_free really means lock-free and not wait-free.Ingratiating
@PeterCordes Yes, I'm thinking of a case where there are a finite number of parallel executions of the algorithm. So when describing the flavor of an algorithm, should the number of executions be always assumed infinite when determining whether it is wait-free or not?Selda
Your analysis would find that everything is wait-free unless livelock is possible, because you exclude normal contention retries. (Or unless it's not non-blocking at all, and fails some other criterion like blocking other threads indefinitely if it sleeps at a certain point.) It's widely agreed that that is not the case, so clearly your assumptions aren't normal. mpoeter described the normal interpretation of "regardless of behaviour of other threads".Pistachio
@Ingratiating "If under this assumption you can still provide an upper bound on the iterations, then the algorithm is wait-free." - You mean wait-free bounded. For it to be wait-free, it is enough that the iterations are finite.Selda
@PeterCordes Then why did you agree that cache line lock acquisition algorithm time is proportional to how much contention the cache line experiences? Doesn't the assumption that a thread in a CAS loop could be delayed indefinitely holds true for the hardware continuously trying to lock a contended cache line, too?Selda
The first sentence of your answer isn't accurate. Definition of wait-free isn't the problem, it's that ISO C++ can't guarantee anything stronger than lock-free for RMWs if it wants implementation on LL/SC machines to be possible.Pistachio
