When are lock free data structures less performant than mutual exclusion (mutexes)?

I read somewhere (can't find the page anymore) that lock-free data structures are more efficient "for certain workloads", which seems to imply that sometimes they're actually slower or that the gain from them can be zero in some situations. Taking the ~100-cycle hit of a lock instruction to do an atomic op sounds plenty faster to me than going to sleep and waiting for the scheduler to wake the process back up, so it's not obvious to me under what circumstances a lock-free data structure would be less preferable than old-fashioned mutexes. If the lock is available 99% of the time and the process doesn't have to go to sleep, is a mutex then faster? Is there a good rule of thumb for knowing which way to go, assuming a suitable lock-free data structure is available?

Veda answered 18/10, 2009 at 19:37 Comment(4)
Yes, the rule of thumb is to benchmark both approaches.Dulia
While certainly the most authoritative answer will be given by a benchmark, there isn't always time to implement both approaches. At some point knowing what's likely is valuable information -- you wouldn't want to have to run benchmarks every time you had to decide between using an array or a linked list for example.Veda
"For certain workloads" could also be interpreted as "for those workloads that can be synchronized with a lock free data structure". In other words they are always faster, but cannot be always applied.Feucht
"faster to me than going to sleep and waiting for the scheduler to wake the process back up," going to sleep allows useful work to be done; spinning on a failed compare-and-swap does notTangle

A common approach to implementing a lock-free data structure is to have a mutable reference to an immutable object, and have anything that wants to change the structure grab the reference, produce a new version of the object with suitable changes applied, and then CompareExchange the reference to point to the new object. If the CompareExchange works, great. If not, ditch the new object, re-grab the reference, and start over.
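
A minimal sketch of that pattern in C++ (names are illustrative; safe reclamation of the replaced snapshot, e.g. via hazard pointers or RCU, is deliberately left out):

    #include <atomic>

    // Immutable snapshot; every update builds a whole new one.
    struct Config {
        int value;
    };

    std::atomic<const Config*> current{new Config{0}};

    void update(int delta) {
        const Config* snap = current.load(std::memory_order_acquire);
        for (;;) {
            // Produce a new version based on the snapshot we read.
            const Config* next = new Config{snap->value + delta};
            // Try to publish it; on failure, snap is refreshed and we retry.
            if (current.compare_exchange_weak(snap, next,
                                              std::memory_order_release,
                                              std::memory_order_acquire))
                return;  // success (reclaiming the old snapshot is omitted here)
            delete next; // someone else won the race; rebuild from the new snapshot
        }
    }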

This can work well if producing the new object is cheap and the level of contention is low enough that the CompareExchange will usually work. If there is considerable contention, and if producing the new object is slow, simultaneous attempted updates by N threads may take N^2 time to complete. As an extreme example, suppose 100 threads are running on a CPU, an update takes 100ms of CPU time (just over a time-slice), and all 100 threads attempt to update an object at once. During the first ten seconds, each thread will produce a new object based on the original one. One of the threads will successfully do the CompareExchange, while the others will all fail. Then during the next 9.9 seconds, 99 threads will generate new versions of the object, after which one will successfully post its update and 98 will fail. The net effect will be that the lock-free method will take 505 seconds' worth of CPU time to perform 100 updates, when a locking system could have done them in about 10 seconds.

Thema answered 28/10, 2010 at 15:48 Comment(27)
Thanks. There are times when the best performance can be had using a combination of locking and lock-free methods: keep an interlocked total of how many create-and-try-swap attempts have been made by threads that haven't yet succeeded; if that number gets high enough, everyone wanting to change the resource should hold off until they acquire a lock. The lock isn't necessary to prevent corruption, but will serialize access and avoid contention. This approach is tricky, and it loses some of the advantages of lock-free coding (e.g. surviving if a thread evaporates) but it can be useful anyway.Thema
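
A rough sketch of that hybrid scheme (the threshold and names are invented, and the "attempts by threads that haven't yet succeeded" counter is approximated here by a count of threads currently inside the optimistic path):

    #include <atomic>
    #include <mutex>

    std::atomic<int> in_flight{0};           // threads currently attempting the update
    std::mutex fallback;                     // serializes access under heavy contention
    constexpr int kContentionThreshold = 4;  // assumed tuning value

    template <typename Update>
    void apply(Update&& update) {
        int n = in_flight.fetch_add(1, std::memory_order_relaxed) + 1;
        if (n > kContentionThreshold) {
            // Too many racers: take the lock so CAS attempts stop wasting work.
            std::lock_guard<std::mutex> g(fallback);
            update();   // still CAS-based internally, but now effectively uncontended
        } else {
            update();   // low contention: plain optimistic lock-free path
        }
        in_flight.fetch_sub(1, std::memory_order_relaxed);
    }
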
Interesting answer. I was thinking about how it might generalize to more CPUs. Am I right in thinking that mutexes being faster in the described scenario will only be true if you have fewer CPUs than threads? It seems that if you have the same number, there's no penalty to letting those threads try the compare and exchange over and over. If you sum up the number of seconds each CPU would spend working it looks worse, but the wall clock time from start to finish would still be shorter.Veda
@Joseph Garvin: On a single-CPU system, if the time between acquiring the old value and doing the compare-exchange is less than one time-slice quantum, then any given attempt should fail at most once. In a multi-CPU system, things are more complicated. If the time to process elements is not uniform, it would be possible for a thread which was trying to process a 'slow' element to get blocked indefinitely while one or more threads on the other CPU continuously process 'fast' elements. From a 'total work done' perspective, this is not a particular problem, but...Thema
...if e.g. different tasks are associated with different users, some users might get annoyed if the system never completes their task, no matter how effectively the system might be processing other tasks. Compare-and-try-swap, in the absence of locking or prioritization methods, is really only good when resource utilization is low. Fortunately, there are many situations where that is the case, but if utilization could even reach 10% one should consider locking. Otherwise, repeated retries could create something of a 'logjam' effect.Thema
It's worth noting that the phenomenon described in this answer only occurs for swap-based (especially cas-based) lock-free operations. Naturally "certain workloads" require them, but if you have operations like atomic inc/dec/add at your disposal, many simpler workloads can be handled with just these, and no chance of pathological performance.Deanadeanda
@R.: Hardware interlocks allow some operations like increment/decrement to be performed without risk of live-lock; when a useful task can be performed with such operations, it's often good to use them. On the other hand, even using a CAS to perform such operations doesn't impose much risk in anything other than a massively-overloaded system, since the window during which another operation can necessitate a retry is very small.Thema
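
For illustration, a counter bumped with an atomic add needs no compare-and-swap retry loop at all, so it cannot live-lock (a trivial sketch):

    #include <atomic>

    std::atomic<long> hits{0};

    void record_hit() {
        hits.fetch_add(1, std::memory_order_relaxed);  // single hardware RMW, always succeeds
    }
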
Exponential backoff helps with this problem, to a certain extent, but getting the basic time unit of backoff correct is a bugger (and sometimes isn't possible - multiple different operations occurring on the same memory destination, each of which takes a different period of time).Filipe
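
A rough sketch of bounded exponential backoff around a CAS retry loop (the base delay, cap, and retry limit are assumed tuning values, which is exactly the hard part the comment mentions):

    #include <algorithm>
    #include <atomic>
    #include <chrono>
    #include <thread>

    bool cas_with_backoff(std::atomic<int>& target, int desired) {
        using namespace std::chrono;
        auto delay = nanoseconds(100);                 // assumed base time unit
        for (int attempt = 0; attempt < 16; ++attempt) {
            int expected = target.load(std::memory_order_relaxed);
            if (target.compare_exchange_weak(expected, desired))
                return true;
            std::this_thread::sleep_for(delay);        // back off before retrying
            delay = std::min(delay * 2, nanoseconds(100000));
        }
        return false;                                  // give up, e.g. fall back to a lock
    }
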
@Blank Xavier: I could certainly see how randomized exponential back-off might reduce the total amount of CPU time expended on CAS-based tasks, but only if threads backed off so much as to leave a lot of time when nobody was doing any work toward the CAS-based task. In a CSMA-based Ethernet, when two or more entities try to transmit simultaneously, neither one succeeds. Also, there's no synchronization mechanism available other than attempted transmission.Thema
@Blank Xavier: In most scenarios involving CAS-based updates, there are "parallel data channels" that can be used by the tasks wanting to perform updates. One could, for example, use atomic counters to give each access attempt a number, and use those to schedule access attempts. If the last successful attempt to update a data item was update #19952, my ticket number is #19958, and updates take about 5-20ms to perform, that would suggest that I should come back in about 30ms and see how things are going. Things are a bit tricky if one wants to allow for...Thema
...out of sequence execution of tasks in case a thread gets waylaid, but using a "ticket-number dispenser" should allow for much better performance than an exponential back-off.Thema
BTW, I don't see that the backoff period has to be long. The basic time unit (that you are using multiples of as you exponentiate) is optimally the length of time required for the operation being performed, e.g. say a freelist push. A problem here is that the length of time for that operation will vary depending on whether the competing threads are on the same core as, or a different core from, the thread which most recently succeeded; same-core threads will be reading from cache, other-core threads will need to read from memory (due to cache invalidation).Filipe
Indeed, the basic unit of backoff time will be the same with exponential backoff or a ticket based backoff.Filipe
In Ethernet, when two parties transmit simultaneously, both fail. This is not the case in x86/x64 CAS - no matter how many parties CAS, one will always succeed. CAS is different to Ethernet - I only realised this after a week or two of thinking about it, because I had the Ethernet assumption in the back of my mind.Filipe
@Blank Xavier: I wasn't suggesting that the tasks should spin with nops; I was suggesting that they should let the cores find something else useful to do. Depends on the time-scale I suppose. If a CAS-based update only takes a few microseconds, task-switching may not be appropriate, but CAS-based updates can, in the absence of contention, be perfectly reasonable for things that can take milliseconds. BTW, I wonder why "CAS" is called "Compare-And-Swap" rather than "Compare-And-Store", since unlike "Compare-Exchange", it doesn't seem to be "swapping" anything.Thema
Ah, well, with a lock-free data structure, typically there's nothing else useful for you to do when you want to CAS and you backoff. I mean, the whole notion of lock-free data structures as they are now generally doesn't scale; they're just lock-free versions of single threaded data structures, e.g. typically they have one or two critical pointers and a lot of contention for those pointers. Scalability requires fundamentally different - distributed - approaches. Backoff is basically a hack - it helps make something which doesn't scale, scale a bit at least.Filipe
@Blank Xavier: If the operation between a load and a CAS is fast enough to fit within a scheduling quantum, then from a purely wall-time perspective, except in cases involving hundreds of thousands of updates per second, not a whole lot is lost if threads whose CAS fails simply retry immediately. The normal pattern is to copy an object to a privately-allocated buffer, modify that buffer, and then CAS the reference. The old object in this context is read-only, so there should be no cache contention there; the privately-allocated buffer is private until the CAS succeeds, so no conflict there.Thema
@Blank Xavier: The only point of cache contention is going to be on the CAS'ed reference itself, and attempted updates to that are generally going to represent a small part of execution time. The main situation where I would think back-off would help would be if the operation between the load and the store is a bit more complex. Something like event subscription using delegates, for example, could be done using either locks or CompareExchange. If an event has thousands of subscribers and multiple entities want to subscribe or unsubscribe, CompareExchange could get a little ugly.Thema
I think when a CAS fails, it's bad news. Consider that in a lock-free data structure, usually, when you CAS, you need that CAS to succeed. So imagine you have say eight cores, all of which have read in a variable, done a bit of work and are now trying to CAS on that variable. One will succeed, the other seven will fail. The one who succeeds will write - and by doing so, invalidate the cache line on the other seven cores. For their next CAS attempt, the first will RFO, the others will need to copy from that first core's cache. That's a big performance problem.Filipe
BTW, in the kinds of lock-free data structures I work with, there's no read-copy-update. All the threads are directly working with the pointers / data structure.Filipe
@Blank Xavier: I would think read-munge-writeback would be the normal use for CAS, which can be implemented via either CompareExchange or load-store-exclusive. I like LX/SX conceptually (though I've never actually used it in a multi-core machine) because LX could be used to give other tasks a "hint" that a variable is in the middle of a read-munge-writeback cycle and should best not be disturbed. My preference would be to have "dominant" and "submissive" versions of the LX instruction (a submissive attempt to grab a cache line that was already LX'ed would fail and set a flag).Thema
@Blank Xavier: Something like that could avoid cache contention, and avoid having tasks invalidating each others' work or doing work that would get invalidated. BTW, do you know whether CompareExchange implementations generally acquire a cache line for write permission only when they look like they're going to work, or do they do so regardless?Thema
CAS via load-store-exclusive can't know at the LX stage that the CAS will fail. This is because load-store-exclusive is implemented via the MESI protocol; the store fails when the cache line is in the invalid state (e.g. someone else wrote). You can't tell at LX time from the cache line state if you'll fail when you SX.Filipe
CAS on x86/x64 stalls until the cache lock is obtained, so the concept of working/not-working doesn't exist. You just wait till you can go.Filipe
BTW, a strong/weak LX would avoid the cache contention of failed CAS, but it wouldn't avoid the cache contention of successful CAS, which still kills you. If you have say two threads, one per physical core with no shared cache and they both sit there CASing (e.g. cache invalidation for the other core on every write), well, on x86/x64 you lose more than 50% of your performance.Filipe
In retrospect this example is unrealistic. 100ms isn't "just over a time slice" on modern OSes, it's more like 10 time slices. And if you take over a time slice to complete, chances are that at some point during the thread's work the scheduler is going to put the thread to sleep, and each thread will sleep for different amounts of time, scattering the threads' attempts to do the CAS and likely preventing them from contending simultaneously. Also in real world programs it's unlikely each thread will uniformly take the same amount of time -- they likely want to write different data to the object.Veda
@JosephGarvin: It doesn't really matter whether 100ms represents one time slice or a thousand; if all operations start simultaneously, the behavior would be identical. As for the fact that operations aren't likely to all start and repeat simultaneously, that is true, but from a practical perspective it doesn't matter much. What does matter is that most of the time there will be many threads executing code somewhere between the 'grab' and the 'CompareExchange'; if there are N such threads, the operations being attempted by N-1 will represent wasted effort, yielding efficiency below 1/N.Thema
@JosephGarvin: Incidentally, if a program like the above were run on an OS whose time slice was long enough to encompass the grab/edit/CompareExchange sequence, efficiency could be pretty good since most such attempts would fail at most once with a single CPU, or on average K times with K CPUs. The problem scenario is the one where the grab/CompareExchange sequence takes longer than a time-slice; 100ms would certainly qualify on many systems.Thema

Lockless data structures will, one way or another, use atomic semantics from your architecture to perform their core operations. When you do this, you can be using the machine's entire internal exclusion mechanisms to ensure correct ordering or fencing of data. A mutex or critical section also does this, but it only does it once for a single flag. Where the mutex or critical section is slow is when lock acquisition fails (there is contention). In this case, the OS also invokes the scheduler to suspend the thread until the exclusion object has been released.

So it seems logical that whenever your lock-less data structure needs multiple atomic operations per core method, where a single lock shielding a critical section could provide the same semantics, and there tends in practice to be very little contention for the data structure in question, then it makes more sense to use an OS-provided locking mechanism than to try to build your own.
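
As an illustrative (not benchmarked) sketch of that trade-off: the lock-less version below needs several separate atomic operations per call and still cannot update its fields as one consistent snapshot, while the locked version does the same job under a single, usually uncontended, mutex:

    #include <atomic>
    #include <mutex>

    struct StatsAtomic {                  // lock-less: several atomic ops per call
        std::atomic<long> count{0}, sum{0}, max{0};
        void record(long v) {
            count.fetch_add(1, std::memory_order_relaxed);
            sum.fetch_add(v, std::memory_order_relaxed);
            long m = max.load(std::memory_order_relaxed);
            while (v > m && !max.compare_exchange_weak(m, v)) {}
            // note: the three fields never form one consistent snapshot
        }
    };

    struct StatsLocked {                  // one short critical section, consistent state
        std::mutex mu;
        long count = 0, sum = 0, max = 0;
        void record(long v) {
            std::lock_guard<std::mutex> g(mu);
            ++count; sum += v; if (v > max) max = v;
        }
    };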

Merganser answered 18/10, 2009 at 21:41 Comment(0)

I don't know about making it slower, but it certainly makes it harder to get right. In the many cases where the two approaches are virtually identical in performance (or when it simply doesn't matter if it takes 500 picoseconds rather than 100 picoseconds), pick the simplest approach - generally a lock.

There are very few cases when that extra bit of performance is key; and if it is, I suspect you'd do well to use the pre-rolled pattern implementations from established libraries. Getting lock-free code working properly (and proving that it works properly in all conditions) is often very hard.

Note also that some environments offer a level of locking above the OS-provided mutex; mutex behaviour, but without some of the overheads (for example, Monitor in .NET).

Kerbela answered 18/10, 2009 at 19:43 Comment(1)
Did you mean "virtually identical"?Intromission

I would like to add one point to this part of the answer: "Where the mutex or critical section is slow is when lock acquisition fails (there is contention). In this case, the OS also invokes the scheduler to suspend the thread until the exclusion object has been released."

It seems that different operating systems can have different approaches as to what to do when lock acquisition fails. For example, HP-UX, which I use, has a more sophisticated approach to locking mutexes. Here is its description:

... On the other hand, changing context is an expensive process. If the wait is going to be a short one, we'd rather not do the context switch. To balance out these requirements, when we try to get a semaphore and find it locked, the first thing we do is a short spin wait. The routine psema_spin_1() is called to spin for up to 50,000 clock cycles trying to get the lock. If we fail to get the lock after 50,000 cycles, we then call psema_switch_1() to give up the processor and let another process take over.
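
Not HP-UX code, but a crude sketch of the same spin-then-block idea expressed against std::mutex (the spin count is a stand-in for the 50,000-cycle budget; a real implementation would also pause between probes rather than hammer the lock):

    #include <mutex>

    void acquire_adaptively(std::mutex& m) {
        for (int i = 0; i < 50000; ++i) {  // short spin wait: no context switch yet
            if (m.try_lock())
                return;                    // got the lock while spinning
        }
        m.lock();                          // give up the processor and block until woken
    }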

Rover answered 19/10, 2009 at 13:23 Comment(0)

Keep in mind that a mutex may well be implemented as a lock-free data structure, in the sense that it uses one or a few atomic objects to represent its state. It's a false dichotomy.

Better is to consider whether you need to allow multiple threads to wait for access to some set of operations or to block until signaled. Each requires a queue of waiting threads. The former queues threads waiting for access to the synchronized area, while the latter queues threads waiting for a signal. The Java classes AbstractQueuedSynchronizer and AbstractQueuedLongSynchronizer provide such a queue—in particular, a CLH Queue—upon which one can build mutexes, conditions, and other queue-based primitives.

If your requirements favor instead only one thread taking on an exclusive set of work while other threads remain free to carry on with other work, as opposed to waiting until they too can do that same work themselves, then using lock-free techniques is possible. Whether doing so will grant faster run times falls to benchmarking, being subject to how often and how many threads will contend over these synchronization controls, and whether there's other work for threads to perform independently.
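
A small sketch of that "one thread does the exclusive work, the others just carry on" pattern, using a single atomic flag rather than a blocking lock (names are illustrative):

    #include <atomic>

    std::atomic<bool> busy{false};

    // Returns true if this thread performed the work, false if another thread
    // already owned the job and we simply moved on to something else.
    template <typename Work>
    bool try_do_exclusive(Work&& work) {
        bool expected = false;
        if (!busy.compare_exchange_strong(expected, true))
            return false;                          // someone else is doing it; don't wait
        work();
        busy.store(false, std::memory_order_release);
        return true;
    }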

Chaise answered 7/11, 2009 at 22:49 Comment(4)
A mutex to me implies that when a lock can't be acquired the OS puts the thread to sleep and is responsible for only waking it once the mutex is available, at which point I'd say the dichotomy makes sense. Lock-free operations never yield to the OS (though the thread can still be preempted involuntarily by the OS). But I imagine if you had a two thread system with a minimal (or no) OS it could implement 'mutexes' by spinning.Veda
Yes, I agree with you that it would be an oddball approach to implement a mutex without parking a queued thread, but I stand by my assertion that it's common to implement a mutex in terms of lock-free structures, because one can't build locks out of locks (or turtles) without bottoming out. In hindsight, my answer was motivated by wanting the OP to consider whether he needs blocking behavior or not in order to control exclusive access.Chaise
Your terminology in the first paragraph is wrong (this doesn't invalidate the rest of the answer). "lock-free" doesn't just mean that it uses atomic operations. It means that at least one thread can always make progress. A mutex isn't lock-free even if you roll your own with atomics, because the process holding the lock can be de-scheduled or blocked on a page-fault or whatever. See en.wikipedia.org/wiki/Non-blocking_algorithm for definitions of lock-free (at least one thread can make progress) vs. wait-free (all threads can always make progress, so no cmpxchg retry loops or similar).Desperado
One of the main disadvantages of locking is that a writer has to wait for all readers to finish, and lock out all readers during the update. With many readers and an occasional writer, RCU (read-copy-update) is very, very good. It can make the readers very, very cheap, even wait-free.Desperado

Efficiency depends on the metric. Lock- or wait-free algorithms are important in systems where preemption can introduce deadlock or affect scheduling deadlines. In those cases, raw processing speed is less important than correctness.

The OP considers lock-free data structures as an alternative to mutexes; some algorithms require neither in order to access a shared data structure. In these cases, both producer and consumer can access the same data structure concurrently without regard for the other. One example is a shared queue that permits a single reader and a single writer to act simultaneously on a shared instance. This meets the common need of a device driver writing data that a consumer process can access on demand.
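
A minimal sketch of such a single-reader/single-writer ring buffer (names, sizes and memory orders are illustrative): because each side writes only its own index, neither a lock nor a CAS retry loop is needed.

    #include <atomic>
    #include <cstddef>

    template <typename T, std::size_t N>   // N should be a power of two
    class SpscQueue {
        T buf_[N];
        std::atomic<std::size_t> head_{0}; // advanced only by the consumer
        std::atomic<std::size_t> tail_{0}; // advanced only by the producer
    public:
        bool push(const T& v) {            // producer side
            std::size_t t = tail_.load(std::memory_order_relaxed);
            if (t - head_.load(std::memory_order_acquire) == N)
                return false;              // full
            buf_[t % N] = v;
            tail_.store(t + 1, std::memory_order_release);
            return true;
        }
        bool pop(T& out) {                 // consumer side
            std::size_t h = head_.load(std::memory_order_relaxed);
            if (h == tail_.load(std::memory_order_acquire))
                return false;              // empty
            out = buf_[h % N];
            head_.store(h + 1, std::memory_order_release);
            return true;
        }
    };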

More complex relationships between processes can be permitted (see Herlihy (1991) for an analysis) with varying levels of hardware support. He concludes that "wait-free synchronization represents a qualitative break with the traditional locking-based techniques for implementing concurrent objects."

What it means is that there remains a trade-off, but it is not simply a choice between mutexes and spinlocks.

A rule of thumb remains to focus on correctness rather than performance. Performance can usually be achieved by throwing money at the problem, while meeting requirements is usually more difficult.

Turenne answered 26/7, 2016 at 0:19 Comment(2)
both producer and consumer can access the same data structure concurrently without regard for the other That is misleading/wrong. They must not access the same datum concurrently, but they may access different data (of the same data structure). However, this was not the question.Bickford
@Bickford Absolutely - sometimes worrying about the individual datum is unnecessary when a sequence is being processed. I realise my answer did not consider the performance of either mutexes or lock-free structures. The point was precisely that jumping between the horns of the dilemma provides a third, and often credible, alternative.Turenne
