Is synchronizing with `std::mutex` slower than with `std::atomic(memory_order_seq_cst)`?

The main reason for using atomics over mutexes is that mutexes are expensive. But with the default memory order for atomics being memory_order_seq_cst, isn't this just as expensive?

Question: Can a concurrent program using locks be as fast as a concurrent lock-free program?

If so, it may not be worth the effort unless I want to use memory_order_acq_rel for atomics.


Edit: I may be missing something, but lock-based can't be faster than lock-free, because each lock will have to be a full memory barrier too. With lock-free, however, it's possible to use techniques that are less restrictive than memory barriers.

So back to my question: is lock-free any faster than lock-based in the new C++11 standard with the default memory model?

Is "lock-free >= lock-based when measured in performance" true? Let's assume 2 hardware threads.


Edit 2: My question is not about progress guarantees, and maybe I'm using "lock-free" out of context.

Basically, when you have 2 threads with shared memory, and the only guarantee you need is that if one thread is writing then the other thread can't read or write, my assumption is that a simple atomic compare_and_swap operation would be much faster than locking a mutex.

Because if one thread never even touches the shared memory, you will end up locking and unlocking over and over for no reason, whereas with atomic operations you only use 1 CPU cycle each time.

Regarding the comments: a spin-lock and a mutex-lock are very different when there is very little contention.
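A minimal sketch of the two approaches the question compares, assuming a single writer-exclusion flag; the names (busy, shared_value, write_with_*) are illustrative and all atomics use the default seq_cst ordering:

```cpp
#include <atomic>
#include <mutex>

std::mutex m;
std::atomic<bool> busy{false};   // hypothetical CAS-based guard
int shared_value = 0;            // the shared memory in question

void write_with_mutex(int v) {
    std::lock_guard<std::mutex> lock(m);  // may involve a syscall under contention
    shared_value = v;
}

void write_with_cas(int v) {
    bool expected = false;
    // Spin until the flag is won; compare_exchange_weak writes the current
    // value back into 'expected' on failure, so it must be reset each try.
    while (!busy.compare_exchange_weak(expected, true)) {
        expected = false;
    }
    shared_value = v;
    busy.store(false);  // default seq_cst ordering throughout
}
```

Note that write_with_cas is itself a spin lock, which is relevant to the comments below.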

Issuable answered 30/4, 2013 at 20:20 Comment(13)
Well, there are different progress guarantees between locking, lock-free, and wait-free code.Radiolocation
Mandatory reading.Admission
Watch: youtube.com/watch?v=DCdGlxBbKU4Zipper
Atomic operations (whether they are used for lock-free stuff or to implement locks) take MUCH longer than "1 CPU cycle". They used to be in the triple digits (!) on Pentium 4; now it's down to low double digits -- but still much more than 1.Catholicize
@PaulGroke Under what hypothesis? Cached, not cached?Dunlap
@Dunlap What do you mean hypothesis? That's not a hypothesis, it's a fact. Atomic reads or writes can be as fast as "normal" reads/writes depending on the platform and required memory ordering, but there is no single modern platform where an atomic Compare-And-Swap is as fast as the non-atomic equivalent. Whether the value is cached or not doesn't really change which one (atomic vs. non-atomic) is faster -- assuming of course you compare apples to apples and oranges to oranges.Catholicize
@Dunlap ... If you're interested in low-level stuff like what operations take roughly what amount of time, I found this article to be very interesting: ithare.com/infographics-operation-costs-in-cpu-clock-cyclesCatholicize
@PaulGroke If X is always (and always will be) faster than Y, that doesn't mean that the ratio time(Y)/time(X) is always large: in some cases both can be slow and the ratio would be low; in other cases X will be trivial and the ratio would be extremely high. That's what was meant by "hypothesis". (I did not insinuate that some set of hypotheses would make X slower than Y.)Dunlap
@PaulGroke "I found this article" Thank you. It illustrates what I said: CAS in the slow hypothesis is in the same time bracket as a read! So the relative cost depends heavily on the exact case, and the "much faster" normal-modification claim depends on your hypothesis.Dunlap
@Dunlap Yes, the ratio changes. So what? Please go back and read my original comment again. The OP implied that atomic CAS takes one CPU cycle. So I corrected him, writing that it's a lot more than one CPU cycle. And now you come and act as if I had written that atomic CAS is "much" slower than a memory access, assuming non-cached, so that the words you put into my mouth no longer make sense. Forgive me, but I really don't see the point. If that wasn't your intention I apologize, but this is my perception.Catholicize
@PaulGroke You wrote "used to be in the triple-digits (!)" and I replied "Under what hypothesis?" You have confirmed that the cached/uncached memory access difference is at least as important as the write/CAS difference. Also, the cost of an operation that needs to wait until the store buffer is empty depends on how many items are in the buffer and whether they need cache coherency operations.Dunlap
@Dunlap Ah. I think I understand now - at least somewhat. In that case the answer is "cached". I was correcting the OP's claim, and for that I assumed optimal conditions for his claim, i.e. the fastest atomic CAS that you can get. Which on Prescott was about 100 cycles, and on more modern x86 CPUs is in the 15-20 cycle range. PS: I still find the use of the word "hypothesis" confusing. I would have used "conditions".Catholicize
@PaulGroke I probably should have used "condition". And while CAS may be slow compared to a normal operation, for cached memory it's also so fast that some compilers emit a meaningless (non-changing) CAS on the stack to implement a fence.Dunlap

Lock-free programming is about progress guarantees: from strongest to weakest, those are wait-free, lock-free, obstruction-free, and blocking.

Each guarantee comes at a price: the more guarantees you want, the more you pay. Generally, a blocking algorithm or data structure (with a mutex, say) has the greatest liberties, and thus is potentially the fastest. A wait-free algorithm at the other extreme must use atomic operations at every step, which may be much slower.

Obtaining a lock is actually rather cheap, so you should never worry about that without a deep understanding of the subject. Moreover, blocking algorithms with mutexes are much easier to read, write and reason about. By contrast, even the simplest lock-free data structures are the result of long, focused research, each of them worth one or more PhDs.

In a nutshell, lock-free or wait-free algorithms trade worst-case latency for mean latency and throughput. Everything is slower, but nothing is ever very slow. This is a very special characteristic that is only useful in very specific situations (like real-time systems).
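To make the contrast concrete, here is a minimal sketch (not from the answer itself) of a blocking counter next to a counter that is wait-free on typical hardware; on LL/SC architectures fetch_add may only be lock-free:

```cpp
#include <atomic>
#include <mutex>

std::mutex m;
long blocking_counter = 0;                 // protected by m
std::atomic<long> wait_free_counter{0};

// Blocking: if the holder is preempted inside the critical section,
// every other thread stalls -- but the section itself is plain, cheap code.
void increment_blocking() {
    std::lock_guard<std::mutex> lock(m);
    ++blocking_counter;
}

// Wait-free (on typical hardware): each call completes in a bounded
// number of steps, but that step is an atomic RMW on every call.
void increment_wait_free() {
    wait_free_counter.fetch_add(1);
}
```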

Admission answered 30/4, 2013 at 20:32 Comment(14)
OK - but with 2 threads and shared data... don't you agree that a spin-lock is faster than a blocking lock?Issuable
For example, a trading application that listens to ticks from 2 exchanges on 2 threads. Wouldn't a spin-lock be much faster than a mutex?Issuable
@jaybny: No, why. And a spinlock is also a lock. It doesn't change the fundamental guarantees. And please take a look at the implementation of the pthread mutex for a pleasant surprise.Admission
Hmm. OK, but I don't consider a spin-lock a lock at all. It's just an atomic operation. Will do more research. Is the pthread mutex a spin lock for short durations?Issuable
Assuming you still don't believe Kerrek (great summary btw), try it. Code it both ways. Lock-free algorithms usually don't show their virtue until parallelism is taken to the extreme (hundreds or thousands of producers/consumers), or other very specific situations. THEN it can be better, but for most cases for most people, regular locks (or their derivatives, like semaphores) are best, and even perform best.Wurster
@jaybny: "i dont consider a spin-lock a lock at all" You may want to read that over and over until the irony sinks in...Coinsure
A mutex is expensive to acquire, and the thread will go to sleep. Looping on a compare-exchange is 1 atomic operation most of the time. #5870325Issuable
@Kevin - I will test it. In my apps, most of the time there is no contention, and when there is, the lock is only needed for very short periods of time.Issuable
@jaybny: The concept of "lock" is about progress. Imagine if the thread holding the lock (whichever, spin or mutex) is interrupted, held up, swapped out or terminated. Then no other thread can make progress. That's what makes a lock a lock. A lock-free algorithm does not have this property, i.e. any number of threads can be held up, and the rest of the program can still make progress.Admission
I don't agree that lock-free algorithms necessarily have worse mean latency and throughput. Sometimes they need no more interlocked operations than the lock implementation itself. Really the tradeoff is in the complexity of mental reasoning.Radiolocation
"Really the tradeoff is in the complexity of mental reasoning." @Ben VoigtIssuable
@jaybny: Ok, you repeated my sentence back to me. Was there a point to that?Radiolocation
@Ben Voigt I'm still pondering the quote.Issuable
@Kerrek SB - if a thread is in the middle of writing to shared memory and gets swapped out, then yes, no other thread can attempt to access the shared memory, because it's in a bad state.Issuable

A lock tends to require more operations than a simple atomic operation does. In the simplest cases, memory_order_seq_cst will be about twice as fast as locking, because locking tends to require, at minimum, two atomic operations in its implementation (one to lock, one to unlock). In many cases it takes even more than that. However, once you start leveraging the memory orders, it can be much faster because you are willing to accept less synchronization.
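A hedged sketch of that last point, using the standard release/acquire publication pattern (the names payload and ready are illustrative): release and acquire are cheaper than seq_cst on many architectures while still ordering the handoff correctly.

```cpp
#include <atomic>

std::atomic<int> payload{0};
std::atomic<bool> ready{false};

// Publisher: release ordering is enough to make 'payload' visible
// to a reader that acquires 'ready' -- no full seq_cst fence needed.
void publish() {
    payload.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);
}

// Consumer: the acquire load pairs with the release store above.
int consume() {
    while (!ready.load(std::memory_order_acquire)) {
        // spin (a real reader would back off or wait)
    }
    return payload.load(std::memory_order_relaxed);
}
```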

Also, you'll often see "locking algorithms are always as fast as lock-free algorithms." This is somewhat true. The basic idea is that if the fastest algorithm happens to be lock-free, then the fastest algorithm without the lock-free guarantee is ALSO the same algorithm! However, if the fastest algorithm requires locks, then those demanding lock-free guarantees have to go find a slower algorithm.

In general, you will see lock-free techniques in a few low-level algorithms, where the performance gained by leveraging specialized opcodes helps. In almost all other code, locking delivers more than satisfactory performance, and is much easier to read.

Talyah answered 3/9, 2013 at 6:00 Comment(2)
"locking algorithms are always as fast as lock free algorithms." - i just dont get this. a thread safe loop with shared memory will be much faster doing a simple compare_exchange then a lock(mutex) unlock(mutex). especially if there is never contention from another thread.Issuable
Locking a spin lock requires an RMW, but unlocking it only requires a W. An atomic W of a word-size scalar variable is a regular write on all common architectures for the standard ABI (which makes word-size scalars word-aligned). It's very fast.Dunlap
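A minimal test-and-set spin lock, sketched here to illustrate that cost breakdown (acquire via an atomic RMW, release via a plain store); a production version would add a pause/backoff in the loop:

```cpp
#include <atomic>

class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // RMW: repeatedly try to claim the flag.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // busy-wait
        }
    }
    void unlock() {
        // Plain store: releasing needs only a release write.
        locked.store(false, std::memory_order_release);
    }
};
```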

Question: Can a concurrent program using locks be as fast as a concurrent lock-free program?

It can be faster: a lock-free algorithm must keep the global state consistent at all times, and do calculations without knowing whether they will be productive, since the state might have changed by the time the calculation is done, making it irrelevant, with lost CPU cycles.

The lock-free strategy makes the serialization happen at the end of the process, when the calculation is done. In a pathological case many threads can make the effort, yet only one will be productive; the others will retry.
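A sketch of that retry pattern (apply_update and its doubling formula are purely illustrative): the new value is computed speculatively, and if another thread committed first, the work is thrown away and redone.

```cpp
#include <atomic>

std::atomic<long> total{0};

void apply_update(long x) {
    long old = total.load(std::memory_order_relaxed);
    long desired;
    do {
        desired = old * 2 + x;  // stand-in for a possibly costly calculation
        // On failure, compare_exchange_weak reloads 'old' and we recompute.
    } while (!total.compare_exchange_weak(old, desired,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed));
}
```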

Lock-free can lead to starvation of some threads, whatever their priority is, and there is no way to avoid that. (Although it's unlikely for a thread to starve retrying for very long unless there is crazy contention.)

On the other hand, "serialized calculation and series of side effects based" (a.k.a. lock-based) algorithms will not start before they know they will not be prevented by other actors from operating on that specific locked resource (the guarantee is provided by the use of a mutex). Note that they might be prevented from finishing by the need to access another resource, if multiple locks are taken, leading to possible deadlock when multiple locks are needed in a badly designed program.
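The classic form of that deadlock, sketched for illustration (the mutex names are hypothetical), along with the C++17 std::scoped_lock fix:

```cpp
#include <mutex>

std::mutex a, b;

// Lock-ordering deadlock: thread 1 takes a then b, thread 2 takes b
// then a -- each can end up waiting forever on the other's lock.
void thread1() {
    std::lock_guard<std::mutex> la(a);
    std::lock_guard<std::mutex> lb(b);  // may block forever if thread2 holds b
}

void thread2() {
    std::lock_guard<std::mutex> lb(b);
    std::lock_guard<std::mutex> la(a);  // may block forever if thread1 holds a
}

// One fix: acquire both with a deadlock-avoidance algorithm.
void thread_safe() {
    std::scoped_lock both(a, b);  // C++17; locks both without deadlocking
}
```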

Note that this deadlock issue isn't in the scope of lock-free code, which can't even act on multiple entities: it usually can't do an atomic commit based on two unrelated objects (1).

So the absence of any chance of deadlock in lock-free code is a sign of its weakness: not being able to deadlock is a limit of your tool. A system that can only hold one lock at a time also wouldn't be able to deadlock.

The scope of lock-free algorithms is minuscule compared to the scope of lock-based algorithms. For a lot of problems, lock-free doesn't even make sense.

A lock-based algorithm is polite: the threads have to wait in line before doing what they need to do, which is maximally efficient in terms of computation steps per thread. But queuing threads in a wait list is itself inefficient: they often can't use the end of their time slice. It's like someone trying to do serious work while being interrupted by the phone all the time: their concentration is gone, and they can never reach maximum efficiency because their work time is cut into small pieces.

(1) You would need at least a double CAS for that, that is, an operation atomic on two arbitrary addresses (not a double-word CAS, which is just a CAS on more bits and can trivially be implemented up to the natural CPU memory-access arbitration unit, the cache line).

Dunlap answered 8/12, 2019 at 21:27 Comment(0)
