Do lock-free algorithms really perform better than their lock-full counterparts?
Asked Answered

11

56

Raymond Chen has been doing a huge series on lock-free algorithms. Beyond the simple cases of the InterlockedXxx functions, it seems like the prevailing pattern with all of these is that they implement their own locks. Sure, they are not processor locks, but the concept of looping over and over on each CPU to ensure consistency is very much like a spinlock. And being a spinlock, they are going to be less efficient than the general locks that come with the operating system, because they do not yield control of their quanta while waiting for other threads. Therefore, whenever someone comes to me and says "but my algorithm is lock-free", my general response is "so?"

I'm curious -- are there benchmarks available which show lock free algorithms to have an edge over their lock-full counterparts?

Esdraelon answered 15/4, 2011 at 18:28 Comment(3)
I've only seen some graphs on this topic in Joe Duffy's Concurrency book, though they're not exhaustive. Also see his blog bluebytesoftware for some additional articles.Myongmyopia
More flexible, scalable locking in JDK 5.0 has some benchmarks.Doradorado
Benchmarks: liblfds.org/wordpress/?p=89 (note the freelist scales better now that exponential backoff has been added; later benchmarks show 0.4 scaling for two threads, and I'm not sure yet if the backoff delay period is optimal).Ennead
35

Beyond the simple cases of the InterlockedXxx functions, it seems like the prevailing pattern with all of these is that they implement their own locks.

None of the answers here really seem to get to the heart of the difference between a "lock-free" CAS loop and a mutex or spin-lock.

The important difference is that lock-free algorithms are guaranteed to make progress on their own - without the assistance of other threads. With a lock or spin lock, any poor thread that can't acquire a lock is entirely at the mercy of the thread that owns the lock. The poor thread that can't acquire the lock can do nothing except wait (either via a busy wait or an OS-assisted sleep).

With lock-free algorithms that loop on a CAS, each thread is guaranteed to make progress regardless of what other contending threads are doing. Each thread is, essentially, in control of its own fate. Yes, it still may have to loop many times, but the number of times it loops is limited by the number of contending threads. For the most part, it cannot loop infinitely. (In practice, livelock can occur due to, e.g., an LL/SC loop that keeps failing because of false sharing - but again, measures can be taken by the thread itself to deal with this; it is not at the mercy of another thread holding a lock.)
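A minimal sketch of such a CAS retry loop in C++ (the names `lock_free_add` and `demo` are illustrative, not from any library): a thread's compare-exchange can only fail because some other thread's succeeded, which is exactly the system-wide progress property described above.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Lock-free increment via a CAS retry loop: a failed CAS implies some
// other thread's CAS succeeded, so the system as a whole always makes
// progress (lock-free, though not wait-free).
int lock_free_add(std::atomic<int>& counter, int delta) {
    int expected = counter.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `expected` with the
    // current value, so the next attempt uses fresh data.
    while (!counter.compare_exchange_weak(expected, expected + delta,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        // retry
    }
    return expected + delta;  // the value we installed
}

// Hammer the counter from several threads; every retry loop terminates
// because each failure corresponds to another thread's success.
int demo(int n_threads, int iters) {
    std::atomic<int> counter{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t)
        threads.emplace_back([&] {
            for (int i = 0; i < iters; ++i) lock_free_add(counter, 1);
        });
    for (auto& th : threads) th.join();
    return counter.load();
}
```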

As for performance, it depends. I've seen flagrant examples of lock-free algorithms being totally out-performed by their locking counterparts, even under high thread contention. On an x86-64 machine running Debian 7, I compared the performance of the C++ Boost.Lockfree queue (based on the Michael/Scott algorithm) with a plain old std::queue surrounded by an std::mutex. Under high thread contention, the lock-free version was almost twice as slow.
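For reference, the locking baseline in that comparison, a plain std::queue behind a single std::mutex, looks roughly like this (a sketch of the idea, not the actual benchmark code):

```cpp
#include <mutex>
#include <optional>
#include <queue>

// Coarse-grained locking queue: every operation takes the one mutex,
// so threads serialize - but an uncontended lock/unlock is very cheap,
// which is why this can beat a lock-free queue in practice.
template <typename T>
class LockedQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> guard(mutex_);
        queue_.push(std::move(value));
    }
    std::optional<T> pop() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (queue_.empty()) return std::nullopt;
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }
private:
    std::mutex mutex_;
    std::queue<T> queue_;
};
```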

So why is that? Well, the performance of lock-free algorithms ultimately comes down to implementation details. How does the algorithm avoid ABA? How does it accomplish safe memory reclamation? There are so many variants: tagged pointers, epoch-based reclamation, RCU/quiescent state, hazard pointers, general process-wide garbage collection, etc. All these strategies have performance implications, and some also place restrictions on how your application as a whole can be designed. In my experience, reference-counting approaches (or tagged-pointer approaches) tend to perform poorly. But the alternatives can be much more complex to implement, and require a lot more memory-reclamation infrastructure based around thread-local storage or generalized garbage collection.

Play answered 7/5, 2016 at 19:36 Comment(0)
30

In general, lock-free algorithms are less efficient per thread: as you mentioned, you're doing more work to implement a lock-free algorithm than a simple lock.

However, they do tend to dramatically improve the overall throughput of the algorithm as a whole in the face of contention. Thread-switching latency and context switches, which are fast individually, slow down the throughput of your application dramatically when multiplied across many threads. Lock-free algorithms are effectively implementing their own "locks," but they do so in a manner that prevents or reduces the number of context switches, which is why they tend to outperform their locking counterparts.

That being said - most of this depends on the algorithm (and implementation) in question. For example, I've got some routines that I managed to switch to .NET 4's new concurrent collections instead of using the previous locking mechanisms, and have measured improvements of nearly 30% in my total algorithm speed. That being said, there are many benchmarks you can find that show reduced performance using some of these same collections when compared to a basic lock. As with all performance optimizations - you really don't know until you measure.

Bernadine answered 15/4, 2011 at 18:39 Comment(2)
+1 - Of course it's going to depend on the specific algorithms and one should benchmark to see -- I'm just trying to challenge the perception that "lockfree" is equated to "better" in all cases. :) I think everyone would agree that a really good algorithm using locks will probably outperform a really bad one which is lockfree, just as a really good lockfree one will outperform a really bad one using locks.Esdraelon
@Billy: Yes, but a "good" algorithm may be better locking or lock free - it really depends on what you're having to lock, how frequently it's being locked, etc. Typically, lock-free tends to be better in the face of high levels of concurrency (the more concurrency, the more it tends to help)...Bernadine
13

Lock-free isn't necessarily any faster, but it can eliminate the possibility of deadlock or livelock, so you can guarantee that your program will always make progress toward finishing. With locks, it's difficult to make any such guarantee -- it's all too easy to miss some possible execution sequence that results in a deadlock.
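To make the "missed execution sequence" point concrete: two threads acquiring two mutexes in opposite orders can deadlock, and it's easy not to notice in review. A C++ sketch (the names and balances are illustrative) of the hazard and the standard fix, std::scoped_lock, which acquires multiple mutexes using a deadlock-avoidance algorithm:

```cpp
#include <mutex>
#include <thread>

std::mutex m1, m2;
int balance_a = 100, balance_b = 0;

// Taking m1 then m2 in one thread while another takes m2 then m1 would
// deadlock with plain lock_guards. std::scoped_lock with multiple
// mutexes uses std::lock's deadlock-avoidance algorithm internally,
// so either textual order is safe.
void transfer_ab(int amount) {
    std::scoped_lock lock(m1, m2);
    balance_a -= amount;
    balance_b += amount;
}

void transfer_ba(int amount) {
    std::scoped_lock lock(m2, m1);  // opposite order: still safe
    balance_b -= amount;
    balance_a += amount;
}

// Run both directions concurrently; the total is conserved and,
// crucially, the program terminates instead of deadlocking.
int run() {
    std::thread t1([] { for (int i = 0; i < 1000; ++i) transfer_ab(1); });
    std::thread t2([] { for (int i = 0; i < 1000; ++i) transfer_ba(1); });
    t1.join();
    t2.join();
    return balance_a + balance_b;
}
```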

Past that, it all depends. At least in my experience, differences in speed tend to depend more on the skill level deployed in the implementation than whether it uses locks or not.

Alfeus answered 15/4, 2011 at 19:28 Comment(6)
Hmm.. I'm not sure I buy that argument. It's entirely possible to have deadlock with one of these algorithms -- you're implementing things which work in a locking manner yourself! But even if this eliminated all deadlock and livelock bugs, those bugs are easier to deal with, debug, and eliminate than the potential data consistency bugs you can get from buggy lockfree algorithms. -- But +1 for the last paragraph.Esdraelon
@Billy: look carefully: I didn't say it does guarantee anything, only that it can. With locks, it's much more difficult to guarantee anything, even under the best of circumstances.Alfeus
@Jerry: Hmm.... I would argue the opposite. It's easier to argue about deadlock or livelock but it's much harder to argue about data consistency (which IMHO is the more important constraint).Esdraelon
+1 Billy. Reasoning about and proving validity for lock-free algorithms, whether with the lowest (obstruction-free) or highest (wait-free) guarantees, tends to be much more difficult than for a simple locking structure.Insure
@Billy: Oh, don't get me wrong: I'm not saying it's easier to accomplish much in general, or anything like that. I'm just saying there are a couple of specific types of things that are easier to prove when you go lock-free.Alfeus
The official definition of "lock-free" isn't just using atomic operations manually instead of mutexes. It actually means that forward progress is always possible for at least some threads. With a mutex, if the one thread holding the lock blocks, no other threads can make progress. "Wait-free" is a stronger guarantee: every operation requires at most a finite number of steps. I think this rules out cmpxchg loops (which is what @BillyONeal was worried about with spin-like loops, I think).Fulvi
4

Under Windows on x64, a straightforward (no combining array in front of the freelist) lock-free freelist is about an order of magnitude faster than a mutex based freelist.

On my laptop (Core i5), for a single thread, I get about 31 million freelist operations per second lock-free, vs. about 2.3 million operations per second with a mutex.

For two threads (on separate physical cores), with lock-free I get about 12.4 million freelist operations per second per thread. With a mutex, I get about 80 THOUSAND operations per second.
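liblfds' actual freelist is considerably more elaborate (counted pointers for ABA protection, exponential backoff). Purely to illustrate the shape of such a structure, a naive Treiber-stack freelist looks like this; note the comment marking where the ABA hazard lives in this simplified version:

```cpp
#include <atomic>

// Minimal Treiber-stack freelist: push and pop are each a single CAS
// loop on the head pointer. Real implementations (e.g. liblfds) add
// ABA protection and backoff; this sketch deliberately omits both.
struct Node {
    Node* next = nullptr;
};

class Freelist {
public:
    void push(Node* node) {
        Node* old_head = head_.load(std::memory_order_relaxed);
        do {
            node->next = old_head;
        } while (!head_.compare_exchange_weak(old_head, node,
                                              std::memory_order_release,
                                              std::memory_order_relaxed));
    }
    Node* pop() {
        Node* old_head = head_.load(std::memory_order_acquire);
        while (old_head &&
               !head_.compare_exchange_weak(old_head, old_head->next,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
            // old_head was reloaded; retry. (The ABA hazard lives here:
            // between the load and the CAS, old_head may be popped,
            // freed, and pushed back with a different ->next.)
        }
        return old_head;  // nullptr if the freelist was empty
    }
private:
    std::atomic<Node*> head_{nullptr};
};
```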

Ennead answered 18/4, 2011 at 8:31 Comment(7)
Mutex is the wrong tool to use here; mutexes are for cross process communication only. If you're doing this kind of thing you should probably be using a critical section instead...Esdraelon
I note though that lock-free doesn't care if you're across processes; its performance will be unchanged. If a CS is faster than a mutex because it's one-process-only, then lock-free has a capability a CS does not have.Ennead
@Blank: Yes and no. Yes, the cross process algorithm need not be modified, but it's going to require a shared memory segment, which isn't free in terms of access. (i.e. so the typical lockfree polling loop will perform more slowly)Esdraelon
I'm under the impression shared memory is no slower to access than non-shared memory. What overheads do you have in mind? The code for the benchmark hasn't yet been published - it's part of release 7 of liblfds, which will be out in a month or two. Currently release 6 is available. liblfds.orgEnnead
The benchmark is very simple; it's a pop and then a push, from/to a freelist, in a while loop, where those (and a counter) are the only operations in the loop. This runs for ten seconds. Various combinations of logical cores are used.Ennead
@Blank: Scratch that -- you appear to be correct. Most of the overhead of shared memory has to do with cache locality, which isn't going to matter for a benchmark like this because all the data's probably going to be cached the whole time. As for why it'd be nice to see code: we can't objectively look at the benchmark without seeing the relevant code, to show that the two implementations are equivalent (and that there aren't unnecessarily slow things done, e.g. by typos, in either algorithm).Esdraelon
Here are the results for critical sections; liblfds.org/wordpress/?p=203Ennead
2

The primary advantage of genuinely lock-free algorithms is that they are robust even if a task gets waylaid (note that lock-free is a tougher condition than "not using locks"(*)). While there are performance advantages to avoiding unnecessary locking, the best-performing data structures are often those which can operate without locking in many cases, but which can use locks to minimize thrashing.

(*) I've seen some attempts at "lock free" multi-producer queues where a producer that got waylaid at the wrong time would prevent consumers from seeing any new items until it completed its work; such data structures shouldn't really be called "lock free". One producer that gets blocked won't block other producers from making progress, but may arbitrarily block consumers.

Teocalli answered 8/5, 2011 at 1:11 Comment(0)
1

Lock-free algorithms can absolutely be faster than their blocking counterparts. But of course the inverse is true as well. Assuming the implementation performs better than the locking counterpart, the only limiting factor is contention.

Take two Java classes, ConcurrentLinkedQueue and LinkedBlockingQueue. Under moderate real-world contention, the CLQ outperforms the LBQ by a good amount. Under heavy contention, the use of suspending threads allows the LBQ to perform better.

I disagree with user237815. The synchronized keyword doesn't require as much overhead as it once did, but relative to a lock-free algorithm it does have a good amount of overhead associated with it, compared to a single CAS.

Insure answered 15/4, 2011 at 20:52 Comment(0)
1

Recently at JavaOne Russia, an Oracle employee (who specializes in Java performance and benchmarks) showed some measurements of operations per second under parallel access to a simple int counter, using CAS (lock-free; in effect a high-level spinlock) and classic locks (java.util.concurrent.locks.ReentrantLock).

According to this, spin-locks have better performance only while a small number of threads is contending for the monitor.

Petta answered 15/4, 2011 at 22:58 Comment(2)
Hmm... this is very specific to Java, though, and really doesn't say anything about lock-free or not. And there's no such thing as a lock-free spinlock -- a spinlock is a lock.Esdraelon
An int counter doesn't scale no matter what you do, because you have lots of threads accessing the same memory location. You need a counting funnel.Ennead
0

In Java, at least, locking by itself can be very quick. The synchronized keyword doesn't add a lot of overhead. You can benchmark it yourself just by calling a synchronized method in a loop.
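The same micro-benchmark idea expressed in C++ rather than Java (illustrative only; the function name is made up, and the ballpark figure in the comment is a general expectation, not a measurement from this answer):

```cpp
#include <chrono>
#include <mutex>

// Time N uncontended lock/unlock pairs. On modern hardware an
// uncontended mutex acquisition typically costs on the order of tens
// of nanoseconds - locking only gets expensive under contention.
double ns_per_uncontended_lock(long iterations) {
    std::mutex m;
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> guard(m);
        // empty critical section: measures pure locking overhead
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration<double, std::nano>(elapsed).count()
           / static_cast<double>(iterations);
}
```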

Locking only gets slow when there is contention, and the process being locked isn't instantaneous.

Apologize answered 15/4, 2011 at 18:32 Comment(2)
+1 - True -- but I think we can assume there's contention going on with both algorithms (because if it was not a contentious point in your problem it's unlikely you'd take the time to write a lock-free implementation of it).Esdraelon
The question was about benchmarks, and not about the theory of locks and lock-free algorithms. My partial answer suggested one way of getting a benchmark.Apologize
0

Lock-free also has the advantage that it does not sleep. There are places in kernels where you are not permitted to sleep - the Windows kernel has a bunch of them - and that painfully restricts your ability to use data structures.

Ennead answered 18/4, 2011 at 8:27 Comment(1)
Spinlocks don't sleep either.Esdraelon
0

The following pic (src) may give you a better understanding of Charles Salvia's answer:

Lock-free: Each thread is guaranteed to make progress regardless of what other contending threads are doing

Lock-based: The poor thread that can't acquire the lock can do nothing except wait


Decibel answered 7/2, 2023 at 10:38 Comment(1)
Each thread is guaranteed to make progress regardless of what other contending threads are doing describes a "wait free" algorithm. It's stronger than "lock free", which only guarantees that at least one thread will make progress. en.wikipedia.org/wiki/Non-blocking_algorithm CAS retry loops on the same object are lock-free but not wait-free; only one thread can actually succeed at a CAS at once, not all of them in parallel. The other n-1 threads will have their CAS fail and have to retry. (That's assuming they ran in lockstep, all able to try at once, which isn't actually the case.)Fulvi
-1

Yes, lock-freedom ensures progress, but unless you manually abort threads (which is possible on some platforms), or allocate inside a critical section and get an out-of-memory exception, or do anything similarly stupid, you don't need that. A properly implemented spinlock almost always beats lockless approaches, if it doesn't perform equally, because with a lockless approach you'll typically be doing more work the first time, or redoing work after unsuccessful attempts. If you keep the spin loop tight and overwhelm the CPUs with compare-exchange instructions, and/or don't back off after some period by giving your thread's timeslice to other threads (which gives an out-scheduled thread the opportunity to wake up and release the lock), then lock-free code can perform better. Other than that, I don't think it's possible. I'm not interested in, nor was I into, complex data types where a spinlock doesn't fit, but I still sense that properly designed lock-based algorithms will almost always be better. I may be wrong, though.
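For what it's worth, a "properly implemented spinlock" in the sense described here (spin on a cheap read, then yield the timeslice so a preempted lock holder can run) might be sketched like this; the class name and spin threshold are illustrative:

```cpp
#include <atomic>
#include <thread>

// Test-and-test-and-set spinlock with a yield fallback: spin on a
// plain load (cheap, cache-friendly), attempt the atomic exchange
// only when the lock looks free, and yield after a bounded number of
// spins so a preempted lock holder can be scheduled and release it.
class Spinlock {
public:
    void lock() {
        int spins = 0;
        for (;;) {
            if (!flag_.exchange(true, std::memory_order_acquire))
                return;  // got the lock
            while (flag_.load(std::memory_order_relaxed)) {
                if (++spins >= 1000) {          // back off: give the
                    std::this_thread::yield();  // timeslice away
                    spins = 0;
                }
            }
        }
    }
    void unlock() { flag_.store(false, std::memory_order_release); }
private:
    std::atomic<bool> flag_{false};
};
```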

Keeney answered 1/2, 2019 at 16:11 Comment(2)
I think the question meant to ask about lockless code, whether or not it's wait-free / lock-free / obstruction-free. (en.wikipedia.org/wiki/Non-blocking_algorithm). As you say, actually implementing those properties is usually not worth it. But lockless code that can block (but almost never does in practice, and not for long) can beat taking locks. e.g. a lockless queue can support simultaneous read/write. e.g. Lock-free Progress Guarantees is a good analysis of a good lockless queue which can block in rare corner cases.Fulvi
Are you nuts? He is asking to compare lockless vs. locked code. And in your link it may be a single cmpxchg in the uncontended case, but so it is with a spinlock. So what is bad about waiting to be able to write into synchronized memory, instead of doing a lot of work and then checking whether no one got there first? Again, I'm assuming you are not foolishly killing your threads in unexpected ways or greedily spinning.Keeney

© 2022 - 2024 — McMap. All rights reserved.