Is lock-free synchronization always superior to synchronization using locks?

In C++, there is a single atomic type, std::atomic<T>. It may or may not be lock-free, depending on the type T and on the target platform. If a lock-free implementation is available for a type T on a platform, most compilers will provide a lock-free atomic<T>. In that case, even if I want a non-lock-free atomic<T>, I can't have it.

The C++ standards committee decided to keep only one std::atomic<T>, instead of both a std::atomic<T> and a std::lock_free<T> (implemented only for specific types). Does this imply that 'there is no case where using a non-lock-free atomic type would be a better choice over using a lock-free atomic type when the latter is available'? (Mainly in terms of performance rather than ease of use.)
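
For reference, one can check which implementation one got: std::atomic<T>::is_always_lock_free (C++17) is a compile-time constant for the type, and is_lock_free() can be queried on an object at run time. A minimal sketch (Big is just an illustrative oversized type):

#include <atomic>
#include <cstdio>

struct Big { char data[64]; };  // likely too large to be lock-free on most targets

int main() {
    // Compile-time property of the type:
    std::printf("atomic<int> always lock-free: %d\n",
                std::atomic<int>::is_always_lock_free);
    // Run-time property of this particular object:
    std::atomic<Big> big{};
    std::printf("atomic<Big> lock-free: %d\n", big.is_lock_free());
}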

Ietta answered 28/2, 2023 at 21:47 Comment(6)
This Q&A is kinda misleading. The question body asks specifically about atomic<T>, but not the title. Some of the answers ignore it too. – Poacher
@Poacher My question is 'is locking superior to lock-free', and my rationale for asking it is std::atomic<T>. If you have any suggestions for improving the question, please tell me. – Ietta
Lock-free is not always superior to locking. But if it's important to lock then you can and should do so explicitly. If you use std::atomic<T> you're saying that you want lock-free if it's available for type T. If only non-lock-free is acceptable then std::atomic<T> is, semantically, the wrong tool for the job. – Nevis
"suggestions for improving the question" You're getting answers about lock-free containers, not atomic<T>, and you accepted one. It might be too late now, but I'd explicitly say in the title that you're asking only about atomic<T>. – Poacher
What do you mean by "superior"? If you're talking specifically about performance, then say "faster", etc. – Collaborate
@KarlKnechtel As I'm not a native English speaker, my word choice often goes wrong. Sorry for that. If you are sure that 'faster' is a better fit, please update the question. – Ietta

Does this imply that 'there is no case where using a non-lock-free atomic type would be a better choice over using a lock-free atomic type when the latter is available'? (Mainly in terms of performance rather than ease of use.)

No. And that is, in general, not true.

Suppose you have two cores and three threads that are ready to run. Assume threads A and B access the same collection and will contend significantly, while thread C accesses totally different data and will experience almost no contention.

If threads A and B use locks, one of those threads will rapidly wind up being de-scheduled and thread C will run on one core. This will allow whichever thread gets scheduled, A or B, to run with nearly no contention at all.

By contrast, with a lock-free collection, the scheduler never gets a chance to deschedule thread A or B. It is entirely possible that threads A and B will run concurrently through their entire timeslice, ping-ponging the same cache lines between their L2 caches the whole time.

In general, locks are more efficient than lock-free code. That's why locks are used so much more often in threaded code. However, std::atomic types are generally not used in contexts like this. It would likely be a mistake to use a std::atomic type in a context where you have reason to think a lock would be more efficient.

Amoeba answered 28/2, 2023 at 22:14 Comment(14)
So, is there some rule for when I should use locks and when not? Reading @Jerry's answer, it seems that if a critical section is short, I can use lock-free code. – Ietta
@SouravKannanthaB: There's only one rule: when in doubt, use a lock. A locking algorithm has a relatively predictable performance curve, and won't lead to CPU meltdown or thread starvation. A lock-free algorithm, badly implemented or badly used, can lead to terrible performance; it's experts' territory, and even experts regularly get it wrong. If you don't feel you're an expert in the domain, don't implement one in production. – Matthieu
This answer seems more about the distinction between preemptive and cooperative scheduling than about locking versus lock-free synchronization. They are related, and often preemptive is not lock-free, but they are distinct concepts. – Involution
@Involution Synchronization that uses locks creates defined points at which threads can be descheduled. Lock-free synchronization does not. – Amoeba
@DavidSchwartz When using locks I have to deschedule threads, because other threads are blocked and cannot make progress. In lock-free code at least one thread is always guaranteed to make progress. As has been shown by Alistarh et al., in practice lock-free algorithms are basically wait-free, so all threads can make progress. Why would I want to deschedule a thread when it can actually make progress? Why would I want to burn cycles on unnecessary context switches? – Christiansand
@Christiansand Because that thread can only make progress by contending with another thread that is currently running, causing both threads to run very slowly and possibly even slowing down unrelated threads and processes by congesting shared resources such as the connections between cores. It is much better to schedule a thread that will not contend and let all the threads run at full speed. Lock-free algorithms increase contention by allowing threads that access the same data to be scheduled concurrently. Locks encourage non-contending threads to run concurrently. – Amoeba
@Christiansand The only time a lock forces a context switch is when two contending threads are running concurrently and there's another thread that will not contend that can be scheduled instead. That's a huge win if the current contention scenario would have been likely to repeat thousands of times if not for the context switch to schedule some non-contending thread. This, of course, requires there to be other work for the system to do. But if there isn't anything else to do, you're likely not under heavy load and performance isn't that important anyway. – Amoeba
@Christiansand "In lock-free code at least one thread is always guaranteed to make progress." 1) How difficult is it to write lock-free code? a) How many hours do you spend proving that your code is lock-free? b) What is your confidence in your informally validated lock-free code? Do you trust the programmers and reviewers? (The baseline is validation of mutex-based code.) 2) How many threads tried to make progress while one thread made progress? How much CPU time was wasted on computations that failed CAS? Lock-free is fine for simple code that very rarely fails CAS. – Advertise
@DavidSchwartz I disagree, but we already had this same discussion in a different thread, so I don't think we need to go down this road again. – Christiansand
@Advertise I did not comment on how easy or difficult it is to write lock-free code. I just strongly disagree with the conveyed message that "lock-free is bad" and "locks are always better". Yes, there are cases where locks might be better, but as always the world is not just black and white. – Christiansand
@Christiansand My point was mostly that lock-free is horribly oversold, with "arguments" like "lock-free code doesn't deadlock". Duh! Neither does simple lock-based code that doesn't use multiple locks or call random functions. These "arguments" really don't mean anything, as they amount to "correct code behaves correctly". – Advertise
@Advertise The argument that lock-free code cannot deadlock may sound dumb, but it simply means that you cannot make a mistake that results in a deadlock. But of course lock-free algorithms are no silver bullet. The more complex the data structure, the more difficult it becomes to write a correct and efficient lock-free version. That's where approaches like flat combining shine. However, in the >10 years that I have been working with and on lock-free code, I have seen a lot of cases where, under high contention, a lock-free version outperformed a lock-based one by orders of magnitude. – Christiansand
@Christiansand For interested readers, could you link to the other discussion you had with David Schwartz? (The one for which you said "go down this road again".) Thanks! – Provocative
@CigaretteSmokingMan I tried to find it a while ago, but failed. It was one of those discussions that was spliced from the comments section into a chat, but it seems that the room is gone. Maybe the question was removed, I don't know. – Christiansand

Further to David Schwartz's excellent answer, I'd note that a great deal can depend upon what's scarce in your system overall.

If you have more threads that are ready to run than cores to run them, then what you generally want to do is detect as quickly as possible that there's contention over some resource, and put all but one of those contending threads to sleep, so you can schedule other threads onto those cores.

Lock-free tends to work out better in more or less the opposite situation: you have more hardware available at any given time than you have threads to run. In this case, a busy-wait with lock-free code can react very quickly when a resource becomes free, make its modification, and keep moving forward.

The second question is how long contention is likely to last when it does happen. If you have lots of threads constantly "fighting" over a few resources, you're almost certainly better off putting most of them to sleep, letting a few (often only one) make progress as fast as possible, then switch to another and repeat.

But putting one thread to sleep and scheduling another means a trip to kernel mode and the scheduler. If the contention is expected to be short lived, constant switching between threads can add a lot of overhead so the system as a whole slows down a lot.
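
To illustrate the busy-retry pattern that lock-free code relies on, here is a minimal sketch (a Treiber-stack push; the names are illustrative): when the compare-exchange fails, the thread just retries immediately, with no trip to kernel mode.

#include <atomic>

struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

void push(int v) {
    Node* n = new Node{v, head.load(std::memory_order_relaxed)};
    // On failure, compare_exchange_weak reloads the current head into
    // n->next, so the loop simply retries -- no scheduler involvement.
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
    }
}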

Gallopade answered 28/2, 2023 at 22:37 Comment(0)

Does this imply that 'there is no case where using a non-lock-free atomic type would be a better choice over using a lock-free atomic type when the latter is available'?

Yes, in this specific case.

The reason that lock-free implementations of std::atomic<T> are always preferable to locking implementations is simply that the operations are natively supported by the hardware.

That is, std::atomic_uint32_t::load(std::memory_order::relaxed) will on x86_64 boil down to:

mov     eax, DWORD PTR [rsp-4]

Which is just a regular memory read, since x86 already has a strong memory model by default.

And that is, of course, unbeatable.
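
For concreteness, a sketch of the kind of source that yields that single instruction (assuming x86_64 and an optimizing compiler; the exact addressing depends on where the variable lives):

#include <atomic>
#include <cstdint>

std::uint32_t load_relaxed(const std::atomic<std::uint32_t>& x) {
    return x.load(std::memory_order_relaxed);  // a plain 32-bit mov on x86_64
}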

Thus it was unnecessary to have both a locking std::locking<std::uint32_t> and a lock-free std::lock_free<std::uint32_t>: there was no situation where std::locking<std::uint32_t> would ever have been preferable; it would always have been a performance trap.

However, do not take this as an endorsement that lock-free algorithms are necessarily preferable. std::atomic's lock-free advantage comes from mapping directly to hardware instructions, which is a fairly special case. As explained by @David Schwartz and @Jerry Coffin, when more complex data structures and more complex algorithms are involved -- especially multi-instruction algorithms -- whether lock-free or locking is better becomes much more nuanced.

Matthieu answered 1/3, 2023 at 9:2 Comment(8)
Fair enough if the architecture provides atomic operations. But what about architectures where lock-free has to be implemented using LL/SC pairs? – Ietta
@SouravKannanthaB: How do you think mutexes are implemented? There's typically no magic mutex instruction in the hardware, so mutexes are built from atomic operations and appropriate memory barriers (sometimes combined). No matter what the atomic operations are, it'll always be cheaper to execute one atomic operation than to lock (>= 1 atomic operation), read/write/increment/decrement, and unlock (>= 1 atomic operation). So, yes, atomic operations are always faster than mutexes by construction. – Matthieu
Worth noting that std::atomic<T> is not truly lock-free if T is a large object, and a more explicit locking approach will likely get better performance in that case. – Involution
@rtpax: The OP already noted that detail in the question, so I don't think it's worth repeating. As for explicit locking or not... I'm afraid this'll get into the weeds really quickly. It's back to the core of this answer: "it's complicated". – Matthieu
There's another point to consider here as well. Depending on the ordering you specify, even if an atomic load is a single, simple instruction, atomic operations can constrain the compiler's and processor's ability to reorder instructions, which can lead to significant performance differences that aren't immediately obvious from examining one instruction in isolation. – Gallopade
@JerryCoffin You are correct that memory orderings matter, but on the other hand mutexes need memory orderings too, so it doesn't affect the relative performance of a lock-free atomic operation versus a locking atomic operation. – Matthieu
@MatthieuM.: It can -- mostly in ways that favor atomics. In particular, a mutex is roughly equivalent to sequential ordering, but with atomics you can specify more relaxed ordering (acquire, release, or relaxed), which can sometimes allow better code generation. – Gallopade
@JerryCoffin: ISO C++ specifies that locking a mutex is equivalent to an acquire operation on the mutex object, e.g. like m.exchange(1, std::memory_order_acquire). So that's as weak as it's allowed to be, but it has to be an RMW, which means on x86 it's no cheaper than seq_cst. Releasing a lock can be as cheap as a release pure store, but more sophisticated mutexes may need an atomic RMW if they use the lock to record the fact that there are other waiters that might need to be woken up with a system call. So in practice, yes, on x86 you have a full barrier on taking and releasing a mutex. – Sapajou

Lock-free data structures often perform very well when there is no contention, and will perform correctly even in the presence of contention, but contention may degrade their performance severely. If 100 threads all try to update a lock-free data structure simultaneously, one would succeed on its first try, at least two would succeed within two tries, at least three would succeed within three tries, and so on, but the worst-case total number of update attempts might be as high as 100 + 99 + ... + 1 = 5,050. By contrast, if 100 threads all try to update a lock-based data structure, one would be allowed to perform the update immediately while the other 99 would be blocked, but each of those 99 threads would only be awakened when it could access the data structure without further delay.

The downside to lock-based data structures is that a thread which is waylaid while it holds a lock will block all other threads until it finishes whatever it needs to do and releases the lock. By contrast, when using a lock-free data structure, competing threads that get waylaid impede other threads less than they otherwise would.

Appellate answered 2/3, 2023 at 0:10 Comment(0)

A point that may be obvious but hasn't been mentioned in other answers: lock-free atomic operations are normally implemented by hardware instructions and therefore relatively efficient; but they're typically still a lot more expensive than their non-atomic equivalents, and not as well optimized. Acquiring a mutex means you can safely write non-atomic code, and that may compensate for the cost of the mutex.

Imagine, as a trivial example, you have a large array of counters that need to be incremented. Your code does this a lot, so performance is important, but usually not from more than one thread at a time, so contention is low. It's also not important that the counter values remain consistent with each other during the run of the program. However, it may occasionally be called concurrently, so it still has to be thread-safe.

Your options are:

// lock-free code
#include <atomic>

std::atomic<int> counters[1000000];

void increment_counters() {
    for (int i = 0; i < 1000000; i++)
        counters[i].fetch_add(1, std::memory_order_relaxed);
}

This needs one million expensive atomic read-modify-write instructions. Versus:

// locking code
#include <mutex>

int counters[1000000];
std::mutex counter_lock;

void increment_counters() {
    std::scoped_lock lk(counter_lock);
    for (int i = 0; i < 1000000; i++)
        counters[i]++;
}

This uses plain ordinary memory operations, and may even be vectorized for more speed. The savings could well outweigh the cost of locking and unlocking the mutex (which is usually very cheap when there's no contention), and it could even outweigh the cost of occasionally needing to block in the rare instance that two threads call increment_counters() concurrently.

The difference would be even more extreme if instead of a simple increment, we needed to apply some more complex transformation to the array elements; then the atomic version would need a compare-exchange loop on every element.
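
For concreteness, a sketch of that compare-exchange loop, reusing the atomic counters array from the first snippet (f is a hypothetical transformation that no single hardware RMW instruction performs):

int f(int x) { return x * 3 + 1; }  // hypothetical transformation

void transform_counters() {
    for (int i = 0; i < 1000000; i++) {
        int old = counters[i].load(std::memory_order_relaxed);
        // On failure, compare_exchange_weak refreshes 'old' with the current
        // value, and the next iteration recomputes f(old) and tries again.
        while (!counters[i].compare_exchange_weak(old, f(old),
                                                  std::memory_order_relaxed))
            ;
    }
}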

Pentateuch answered 15/3, 2023 at 15:56 Comment(0)

A locking atomic will generally not use a mutex, but a smaller atomic that tells whether someone is currently accessing the larger object.

Accesses to std::atomic<T> are still guaranteed to finish quickly, unlike a critical section, so if your thread finds that it cannot access the variable right now, it is sufficient to spin for a few cycles and then try again; there is no point in suspending the current thread.
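
As a minimal sketch of that idea (illustrative only, not how any particular standard library spells it), a non-lock-free atomic can be built from a spinning flag guarding a plain value:

#include <atomic>

template <typename T>
class SpinLockedAtomic {
    mutable std::atomic_flag busy = ATOMIC_FLAG_INIT;
    T value{};

public:
    T load() const {
        while (busy.test_and_set(std::memory_order_acquire)) {
            // spin: the holder is guaranteed to be quick, so don't sleep
        }
        T result = value;
        busy.clear(std::memory_order_release);
        return result;
    }

    void store(T desired) {
        while (busy.test_and_set(std::memory_order_acquire)) {
        }
        value = desired;
        busy.clear(std::memory_order_release);
    }
};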

To be useful, atomic access does not just need reads and writes to be atomic, that is easy -- any access that can be performed in a single bus cycle will be atomic -- but also requires a way to either update a value and read back the previous value in one operation, or to update a value only if our information about the previous value is still correct.

The latter is what we have special hardware support for.
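
A minimal illustration of those two flavors using std::atomic: exchange updates and reads back the previous value in one operation, while compare_exchange updates only if our information about the previous value is still correct.

#include <atomic>

std::atomic<int> v{0};

void demo() {
    int prev = v.exchange(42);                // set to 42, atomically read back the old value
    int expected = 42;
    v.compare_exchange_strong(expected, 43);  // set to 43 only if v is still 42
}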

A mutex is then built on top of atomic access -- generally there will be a simple lock-free path that just updates an atomic variable for locking and unlocking if there is no contention, and a slower path where a thread that cannot take the mutex will register itself inside the mutex as waiting.

This registration must happen inside a lock-free structure, which is implemented by either using only atomic accesses, or an atomic lock around a data structure that cannot be accessed atomically.
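
A rough sketch of that fast-path/slow-path split (a classic three-state design; assumptions: C++20 atomic wait/notify stands in for a futex-style kernel wait, and real mutexes do considerably more):

#include <atomic>

class TinyMutex {
    std::atomic<int> state{0};  // 0 = free, 1 = locked, 2 = locked, maybe waiters

public:
    void lock() {
        int expected = 0;
        // Fast path: an uncontended lock is a single atomic operation.
        if (state.compare_exchange_strong(expected, 1,
                                          std::memory_order_acquire))
            return;
        // Slow path: register as a (possible) waiter and block until notified.
        while (state.exchange(2, std::memory_order_acquire) != 0)
            state.wait(2);  // blocks while state == 2
    }

    void unlock() {
        // If anyone may be waiting, wake one of them.
        if (state.exchange(0, std::memory_order_release) == 2)
            state.notify_one();
    }
};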

Tureen answered 1/3, 2023 at 17:14 Comment(6)
"does not just need reads and writes to be atomic, that is easy" -- Not easy for objects wider than 8 bytes on many ISAs. Some, like x86-64, have supported 16-byte atomic RMW (lock cmpxchg16b) for a lot longer than they've had any guaranteed way to do a 16-byte atomic pure load (the AVX feature bit implies aligned 16-byte load/store are atomic, at least on Intel, since last year or so). Similarly, ARMv8 has 16-byte atomicity as part of ldxp / stxp (load-exclusive pair of 8-byte general-purpose registers), but only ARMv8.4 guaranteed that simple ldp / stp were atomic if aligned. – Sapajou
You can do a 16-byte load as lock cmpxchg16b (attempting to CAS (0, 0), for example), but that needs exclusive ownership of the cache line, so you don't get read-side scalability to many readers reading the same thing in parallel, all hitting in their own private L1d caches. That's why GCC 7 downgraded std::atomic<16-byte struct> or __int128 from lock-free to report itself as non-lock-free. gcc.gnu.org/legacy-ml/gcc-patches/2017-01/msg02344.html / gcc.gnu.org/bugzilla/show_bug.cgi?id=84563 . A 16-byte store can also, of course, be done with a CAS or LL/SC retry loop. – Sapajou
Yes, my point though is that it is not beneficial to force a locking atomic implementation for a type that can be accessed atomically by itself, simply because the lock would be implemented by an atomic, i.e. you get an additional indirection layer that uses the same mechanism that you could have used for your data, and none of the contention resolution of a proper mutex. – Tureen
Yes, that part of your answer is true. I just wanted to correct some of the details you used to support it. Wide load/store atomicity isn't "easy" for types wider than one integer register, and/or wider than a cache-transfer granule for the interconnect between cores or sockets. (AMD K10 is an interesting example: in practice you don't get tearing on 16-byte SSE aligned loads or stores between cores in the same socket (despite the lack of a documented guarantee), but you do for cores in different sockets, due to HyperTransport.) – Sapajou
"Accesses to std::atomic<T> are still guaranteed to finish quickly" -- how so? Is that a strict, mathematical guarantee, or a "guaranteed (not a guarantee)" as in advertisements, where something is often true but not always? – Advertise
@curiousguy Somewhere in between. There is no API to lock an atomic variable indefinitely, like there is for a mutex; the only four generic operations are load, store, exchange, and compare_exchange, so the length of the actual access to the shared variable is limited to the time it takes to transfer the bits in and out of memory. What can take longer than that is failure handling inside a particular thread, for example when a thread got preempted in the middle of an access, but that is of no concern to the other threads. – Tureen

No, it is not always better. But we can say lock-free synchronization is faster and more scalable.

Tintinnabulation answered 11/3, 2023 at 13:22 Comment(0)
