In what circumstances are lock-free data structures faster than lock-based ones?

I'm currently reading the book C++ Concurrency in Action by Anthony Williams, which presents several lock-free data structure implementations. In the introduction to the chapter on lock-free data structures, Anthony writes:

This brings us to another downside of lock-free and wait-free code: although it can increase the potential for concurrency of operations on a data structure and reduce the time an individual thread spends waiting, it may well decrease overall performance.

And indeed, I tested all the lock-free stack implementations described in the book against the lock-based implementation from one of the earlier chapters, and the lock-free code consistently performed worse than the lock-based stack.

In what circumstances are lock-free data structures faster, and when should they be preferred?
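For reference, here is roughly the shape of the two push operations I'm comparing; this is a minimal sketch of my own, not the book's exact code:

    #include <atomic>
    #include <mutex>
    #include <stack>

    template <typename T>
    class lock_based_stack {
        std::stack<T> data;
        std::mutex m;
    public:
        void push(T value) {
            std::lock_guard<std::mutex> lock(m);  // one thread at a time
            data.push(std::move(value));
        }
    };

    template <typename T>
    class lock_free_stack {
        struct node { T value; node* next; };
        std::atomic<node*> head{nullptr};
    public:
        void push(T value) {
            node* n = new node{std::move(value), head.load()};
            // Retry until no other thread has moved head in between.
            while (!head.compare_exchange_weak(n->next, n)) {}
        }
        // Note: pop and safe memory reclamation (the hard parts) omitted.
    };

Note that even without contention, the lock-free push pays for a heap allocation and an atomic read-modify-write on every call.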

Ibidem answered 27/10, 2016 at 13:45 Comment(5)
With how much data? How long were the locking versions spending locked? – Mast
"must be preferred?" Must is a very strong word. – Ganef
I like lock-free data structures because they help avoid mutex deadlocks. They will be slower when there is lots of contention, just like with spin locks. – Remanent
In short: lock-based is good as long as no thread actually fails to acquire the lock. When that happens, it is pretty much a guaranteed context switch, which is bad. If it is likely that multiple threads will actually access the same critical section at the same time (a lot of contention), lock-free will outperform. – Henninger
Besides other use cases, they can be great for read-mostly data. With something like RCU, the readers never have to wait, or even do an atomic operation at all. The burden of synchronization can be placed entirely on the writers. – Stumpage

One benefit of lock-free structures is that they do not require a context switch. However, on modern systems, uncontended locks are also context-switch free. To benefit (performance-wise) from a lock-free algorithm, several conditions have to be met:

  • Contention has to be high
  • There should be enough CPU cores so that a spinning thread can run uninterrupted (ideally, pinned to its own core; see the sketch below)
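As a sketch of that second point (Linux-specific; the helper name is mine, not a standard API):

    #include <pthread.h>  // pthread_setaffinity_np (glibc, _GNU_SOURCE)
    #include <sched.h>    // cpu_set_t, CPU_ZERO, CPU_SET
    #include <thread>

    // Pin a std::thread to one core so a spinning thread is not
    // migrated and competes with fewer neighbours.
    void pin_to_core(std::thread& t, int core) {
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(core, &cpuset);
        // Best effort: returns nonzero if e.g. the core does not exist.
        pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
    }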
Boarer answered 27/10, 2016 at 13:54 Comment(3)
I think those two conditions cancel each other out. If we have enough CPU cores, we would have less contention. – V
@AnkushG not really, why? – Boarer
Contention isn't caused by a lack of CPU cores; rather, it arises when multiple cores need exclusive access to the same data at the same time. – Thisbe

I did a performance study years ago. When the number of threads is small, lock-free and lock-based data structures are comparable. But as the number of threads rises, at some point lock-based data structures exhibit a sharp performance drop, while lock-free data structures scale up to thousands of threads.
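A minimal harness sketch for reproducing such a curve (my own illustration; lock_based_stack and lock_free_stack stand in for any two stack implementations, e.g. the ones sketched in the question):

    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    template <typename Stack>
    double seconds_taken(int nthreads, int pushes_per_thread) {
        Stack s;
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (int t = 0; t < nthreads; ++t)
            workers.emplace_back([&] {
                for (int i = 0; i < pushes_per_thread; ++i)
                    s.push(i);  // contended operation being measured
            });
        for (auto& w : workers)
            w.join();
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - start).count();
    }

    int main() {
        for (int t : {1, 2, 4, 8, 16, 32, 64})
            std::printf("%2d threads: lock-based %.3fs  lock-free %.3fs\n", t,
                        seconds_taken<lock_based_stack<int>>(t, 100000),
                        seconds_taken<lock_free_stack<int>>(t, 100000));
    }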

Velarium answered 27/10, 2016 at 14:1 Comment(1)
@Donghui Zhang You are right, this heavily depends on the number of threads and the size of the data stored in each node. I tested with 1000 reader and writer threads, and the results then favor the lock-free design. – Ibidem

It depends on the probability of a collision.

If a collision is very likely, then a mutex is the optimal solution. For example: two threads are constantly pushing data to the end of a container. With lock-freedom, only one thread will succeed; the other will need to retry. In this scenario, blocking and waiting would be better.

But if you have a large container and the two threads access it in different areas, it is very likely that there will be no collision. For example: one thread modifies the first element of a container and the other thread the last element. In this case, the probability of a retry is very small, hence lock-freedom would be better here.

Other problems with lock-freedom are spinning (which burns CPU and memory bandwidth), the overall cost of atomic variables, and certain constraints on variables.

For example, if you have the constraint x == y which needs to hold, you cannot use separate atomic variables for x and y, because you cannot change both variables at once, while a lock() would preserve the constraint.
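A minimal sketch of that constraint problem (variable names are mine):

    #include <atomic>
    #include <mutex>

    // With two independent atomics, a reader can observe the invariant
    // x == y broken between the two updates.
    std::atomic<int> x{0}, y{0};

    void lock_free_update() {
        x.fetch_add(1);
        // A reader running right here sees x == y + 1: invariant broken.
        y.fetch_add(1);
    }

    int gx = 0, gy = 0;
    std::mutex m;

    void locked_update() {
        std::lock_guard<std::mutex> lock(m);
        ++gx;
        ++gy;  // readers that also take m can never observe gx != gy
    }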

Sclerous answered 27/10, 2016 at 14:27 Comment(0)

The only way to know which is better is to profile each. The result will change drastically from use case to use case, from one CPU to another, from one architecture to another, from one year to another. What is best today might not be best tomorrow. So always measure, and keep measuring.

That said, let me give you some of my private thoughts on this:

First: if there is no contention, it shouldn't matter what you do. The no-collision case should always be fast. If it's not, then you need a different implementation tuned to the no-contention case. One implementation might use fewer or faster machine instructions than the other and win, but the difference should be minimal. Test, but I expect near-identical results.

Next, let's look at cases with (high) contention:

Again, you might need an implementation tuned to the contention case. One lock mechanism isn't like another, and the same goes for lock-free methods.

  1. threads <= cores

It's reasonable to assume all threads will be running and doing work. There might be small pauses where a thread gets interrupted, but those are the exception. This obviously only holds if yours is the only application doing this; the threads of all CPU-heavy applications add up for this scenario.

Now, with a lock, one thread will get the lock and work. Other threads can wait for the lock to become available and react the instant the lock becomes free. They can busy-loop or, for longer durations, sleep; it doesn't matter much. The lock limits you to one thread doing work, and you get that with barely any CPU cycles wasted when handing over the lock.

On the other hand, lock-free data structures all fall into some try-and-repeat loop. They will do work, and at some crucial point they will try to commit that work; if there was contention, they will fail and try again, often repeating a lot of expensive operations. The more contention there is, the more work is wasted. Worse, all that traffic on the caches and memory will slow down the thread that actually manages to commit its work in the end. So not only are you not getting ahead faster, you are slowing down progress.

I would always go with locks for any workload whose cost exceeds the difference between the lock instruction and the CAS (or similar) instruction a lock-free implementation needs. It really doesn't take much work to get there, leaving only trivial cases for the lock-free approach. The built-in atomic types are such a case: CPUs often have opcodes that perform those atomic operations lock-free in hardware in a single, (relatively) fast instruction. In the end, a lock will use such an instruction itself and can never be faster than such a trivial case.
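To make the trivial case concrete (my sketch; the x86 mapping assumes an integral fetch_add):

    #include <atomic>

    std::atomic<long> counter{0};

    // Trivial case: a dedicated hardware opcode (lock xadd on x86),
    // no retry loop at all.
    void increment() {
        counter.fetch_add(1);
    }

    // No dedicated opcode for this, so it becomes a CAS retry loop:
    // exactly the try-and-repeat pattern described above.
    void multiply_by(long factor) {
        long old = counter.load();
        while (!counter.compare_exchange_weak(old, old * factor)) {
            // on failure, old is reloaded with the current value; retry
        }
    }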

  2. threads >> cores

If you have many more threads than cores, then only a fraction of them can run at any one time, and it is likely that a descheduled thread holds a lock. All other threads needing the lock will then also have to go to sleep until the lock-holding thread wakes up again. This is probably the worst case for locking data structures: nobody gets to do any work.

Now, there are lock implementations (with help from the operating system) where a thread trying to acquire a lock will cause the lock-holding thread to take over until it releases the lock. In such systems, the waste is reduced to context switching between the threads.

There is also a problem with locks called the thundering-herd problem. If you have 100 threads waiting on a lock and the lock gets freed, then, depending on your lock implementation and OS support, all 100 threads will wake up. One will get the lock, and 99 will waste time trying to acquire it, fail, and go back to sleep. You really don't want a lock implementation that suffers from thundering herds.
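One place the thundering herd shows up even in portable C++ (my illustration, not tied to any particular lock implementation): waking every waiter on a condition variable when only one can make progress.

    #include <condition_variable>
    #include <mutex>
    #include <queue>

    std::mutex m;
    std::condition_variable cv;
    std::queue<int> work;

    void produce(int item) {
        {
            std::lock_guard<std::mutex> lock(m);
            work.push(item);
        }
        cv.notify_one();  // notify_all() here would wake every waiting
                          // consumer; all but one would recheck the
                          // predicate, fail, and go back to sleep
    }

    int consume() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !work.empty(); });
        int item = work.front();
        work.pop();
        return item;
    }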

Lock-free data structures begin to shine here. If one thread is descheduled, the other threads will continue their work and succeed in committing their results. The descheduled thread will wake up at some point, fail to commit its work, and retry. The waste is limited to the work the one descheduled thread did.

  3. cores < threads < 2 * cores

There is a grey zone when the number of threads is near the number of cores. The chance that the blocking thread is running remains high, but this is a very chaotic region, and results about which method is better are rather random there. My conclusion: if you don't have tons of threads, try really hard to stay <= core count.

Some more thoughts:

Sometimes the work, once started, needs to be done in a specific order; if one thread is descheduled, you can't just skip it. You see this in some data structures where the code detects a conflict and one thread actually finishes the work a different thread started before it can commit its own results. This is really great if the other thread was descheduled, but if it's actually running, it's just wasteful to do the work twice. So data structures with this scheme really aim at scenario 2 above.

With the amount of mobile computing done today, it becomes more and more important to consider the power usage of your code. There are many ways to optimize your code for power, but really the only way for your code to use less power is to sleep. Something you hear more and more is "race to sleep": if you can make your code run faster so it can sleep earlier, then you save power. But the emphasis here is on sleeping earlier, or maybe I should say sleeping more. If you have 2 threads each running 75% of the time, they might solve your problem in 75 seconds. But if you can solve the same problem with 2 threads running 50% of the time, alternating via a lock, then they take 100 seconds. The first approach also uses 150% CPU power, for a shorter time, true, but 75 s * 150% = 112.5 CPU-seconds > 100 s * 100% = 100 CPU-seconds. Power-wise, the slower solution wins. Locks let you sleep, while lock-free trades power for speed.

Keep that in mind and balance your need for speed against the need to recharge your phone or laptop.

Massotherapy answered 20/2, 2022 at 14:46 Comment(0)

The mutex design will very rarely, if ever, outperform the lockless one. So the follow-up question is: why would anyone ever use a mutex rather than a lockless design?

The problem is that lockless designs are hard to get right and require a significant amount of design work to be reliable, while a mutex is quite trivial in comparison; lockless code can also be much harder to debug. For this reason, people generally prefer to use mutexes first, and then migrate to lock-free later, once the contention has been proven to be a bottleneck.

Ganef answered 27/10, 2016 at 13:59 Comment(3)
"The mutex design will very rarely, if ever out perform the lockless one." I do not think this is quite true. Looks like purely based on "theory" rather than experience. First of all "lockless" is not really lock free, lock is done on hardware level. Second "lockless" does not mean "delayless"Premise
If making a data-structure lock-free means that normally-simple operations now require a sequence of atomic ops, a lightly-contended data structure might actually benefit from using locks.Stumpage
I saw it a lot - academic papers with lockless data structures (e.g. for priority queue) are just huge waste of time comparing to implementation based light-weight fine-grained user-level locks.Acquittance

I think one thing missing from these answers is the locking period. If your locking period is very short, i.e. after acquiring the lock you perform only a very brief task (like incrementing a variable), then a lock-based data structure brings in unnecessary context switching, CPU scheduling, and so on. In this case, lock-free is a good option, as a thread would only spin for a very short time.
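A minimal sketch of that case (names are mine):

    #include <atomic>
    #include <mutex>

    std::atomic<long> hits{0};

    void count_hit_lock_free() {
        // A single atomic instruction; no lock to acquire, nothing to
        // spin on, and no scheduler involvement.
        hits.fetch_add(1, std::memory_order_relaxed);
    }

    long guarded_hits = 0;
    std::mutex m;

    void count_hit_locked() {
        std::lock_guard<std::mutex> lock(m);  // overhead dwarfs the work
        ++guarded_hits;
    }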

V answered 28/1, 2021 at 17:49 Comment(1)
A good mutex doesn't involve any context switching or system calls in the uncontended case (preshing.com/20111124/always-use-a-lightweight-mutex). Of course, if you really do just want to increment a single scalar counter, then yes, lock-free atomics are certainly the obvious choice; on some machines, that means no spinning at all. But in more common cases, a CAS retry loop may have to retry a few times in rare cases. – Stumpage
