Lock-free synchronization
My question is about lock-free synchronization in multithreaded code. I wanted to know the following:

  1. What are general approaches to achieve this? I read somewhere about lock-free primitives like compare-and-swap (CAS) or double compare-and-swap (DCAS), but no explanation was given for them. Are there any approaches to minimize the use of locks?

  2. How do Java and .NET implement their concurrent containers? Do they use locks or lock-free synchronization?

Thanks in advance.

Interfertile answered 1/2, 2012 at 21:40 Comment(0)

Here are some general approaches that can minimize the use of locks, assuming your algorithm has some particular exploitable features:

  1. When updating a single numeric variable, you can use non-blocking primitives such as CAS, atomic increment, etc. They are usually much faster than a classic blocking critical section (lock, mutex); a sketch follows this list.

  2. When a data structure is read by multiple threads but only written by one or a few threads, an obvious solution is a read-write lock instead of a full lock.

  3. Try to exploit fine-grained locking. For example, instead of locking an entire data structure with a single lock, see if you can use multiple different locks to protect distinct sections of the data structure.

  4. If you're relying on the implicit memory fence effect of locks to ensure visibility of a single variable across threads, just use volatile1, if available.

  5. Sometimes, using a condition variable (and its associated lock) is too slow in practice. In this case, a volatile busy spin is much more efficient.
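
As an illustration of point 1, here is a minimal Java sketch of the CAS retry loop, built on java.util.concurrent.atomic.AtomicInteger (the surrounding class is illustrative; in practice AtomicInteger.incrementAndGet() already gives you an atomic increment):

    import java.util.concurrent.atomic.AtomicInteger;

    public class CasCounter {
        private final AtomicInteger value = new AtomicInteger(0);

        // Lock-free increment: retry the CAS until no other thread
        // has modified the value between our read and our write.
        public int increment() {
            for (;;) {
                int current = value.get();
                int next = current + 1;
                if (value.compareAndSet(current, next)) {
                    return next;
                }
                // CAS failed: another thread won the race; retry.
            }
        }
    }

No thread ever blocks here: a failed compareAndSet just means another thread made progress, so the loop re-reads and tries again.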

More good advice on this topic here: http://software.intel.com/en-us/articles/intel-guide-for-developing-multithreaded-applications/

A nice read in another SO question: Lock-free multi-threading is for real threading experts (don't be scared by the title).

And a recently discussed lock-free Java implementation of atomic_decrement: Starvation in non-blocking approaches


1 The use of volatile here applies to languages such as Java where volatile has defined semantics in the memory model, but not to C or C++ where volatile preceded the introduction of the cross-thread memory model and doesn't integrate with it. Similar constructs are available in those languages, such as the various std::memory_order specifiers in C++.

Disease answered 1/2, 2012 at 21:49 Comment(28)
You have some good points, but #4 is wrong. volatile doesn't do what you expect it to do. It may prevent the compiler from doing some optimizations, but it doesn't prevent the processor from reordering reads, and therefore can't replace a barrier.Concentre
@Concentre That is incorrect in a Java context. A volatile read/store deploys the necessary memory barriers.Expand
@JohnVint, I admit I don't know much about Java, and was thinking about C (where volatile is the most misunderstood keyword). Since the question isn't tagged "Java", this should at least be made clear.Concentre
@Concentre You're right, it isn't tagged Java. If Tudor was talking about Java when referencing volatile, he should include that.Expand
Well, in the original question he was asking about lock-free collections in Java/C#, so my answer applies (mainly) to these languages.Disease
Re: previous comments: C++11 atomic<T> / C11 _Atomic is called volatile in Java. Thanks Java for using a C keyword with related but different semantics. (volatile in C/C++ does not have all the things required to make num++ atomic.)Harcourt
Re: 5. you should pretty much always use light-weight locking that only makes system calls if there's contention. (preshing.com/20111124/always-use-a-lightweight-mutex). You get very fast behaviour in the uncontended or low contention case (hopefully common), but you also get OS-assisted waiting that yields the CPU on contention, with OS-assisted wakeup of one waiter when you unlock.Harcourt
@PeterCordes the analogy between Java volatile and atomic<T> in C++ is not very exact. Syntactically, volatile in C/C++ and Java are pretty much the same: both are qualifiers that don't change the syntax of accessing the underlying value. Semantically (memory-model wise) volatile is a bit like a restricted form of atomic where the only things you can do are loads with memory_order::acquire and stores with memory_order::release. The various Atomic* classes in Java are basically the direct equivalent of atomic<T> (with the usual caveat that Java is inherently reference based).Dyarchy
I don't think you can criticize Java much on the memory-model front compared to C++. They at least made an effort to have an (informal) memory model almost from the start, and formalized it well in Java 5 (or 4?) - quite ahead of C and C++. The semantics they gave volatile were pretty reasonable on the face of it - in fact people used volatile for cross-thread communication "since forever" in C/C++, long before they had a formal memory model (since it was the only semi-portable way to accomplish that), and at least MSVC formalized this with the same release/acquire semantics as Java.Dyarchy
Of course the reasons for C/C++ not having a memory model from the start are also kind of obvious: their introduction (at least in the case of C) preceded the widespread use of in-process threads - but there was still a long period of stagnation in general from the 1990s to C++03 or C++11 (opinion varies), and since the explosion of threading happened then it kind of got left out. Final note: even my description of volatile in Java isn't complete: there is a "total access order" component to volatile that isn't present in basic acquire/release semantics (this makes stores expensive).Dyarchy
@BeeOnRope: Thanks for the correction; I must have been misremembering, or saw a mis-statement that volatile in Java was called atomic in C++ (in herbsutter.com/2013/02/11/…). So in Java you get seq_cst loads/stores, but not atomic RMW. I take back my criticism of the naming; that's annoying but pretty obviously makes sense, especially given the (ab)use of C volatile in practice at the time for lack of anything better. (But there were libraries that wrapped volatile unsigned with synchronization semantics, weren't there?)Harcourt
@PeterCordes - well I thought Java volatile is something less than seq_cst in C++ but something more than seq_acq for loads and seq_rel for stores. It is usually thought of as acquire/release since that's how it interacts with other (plain) loads and stores, but then it has this additional total order requirement which only applies to the set of volatile accesses, which makes things like Dekker possible (and it's what makes volatile stores slow). Does that make it strong enough to be equivalent to seq_cst in C++? I don't know enough to say.Dyarchy
@BeeOnRope: I think seq_acq / seq_rel + a single total order is the same thing C++ seq_cst gives you, when you leave out RMW operations. But possibly it's not, so I probably shouldn't assume without looking at Java carefully.Harcourt
My impression was that, for example, a seq_cst store in C++ would not be reordered with a following plain (or relaxed) store. Such a re-ordering is allowed in Java since the volatile store only has release semantics with respect to non-volatile accesses. @Peter Do you know what C++ allows?Dyarchy
@BeeOnRope: wait, I thought you said Java volatile gave seq_release semantics, not just regular release. If not, then it's not like seq_cst. C++ works the way you suggest, and gcc demonstrates this (godbolt.org/g/mRggrj) by hoisting a non-atomic load before a mo_release store, but not before a mo_seq_cst store.Harcourt
Sorry anywhere I wrote seq_rel or seq_acq I meant release and acquire respectively. I don't think seq_rel/acq exist or make sense.Dyarchy
@Bee: Herb Sutter describes AArch64's stlr as having sequential-release semantics, in part 2 of herbsutter.com/2013/02/11/…. Some CPUs need to use expensive barriers that go far beyond what C++11 seq_cst requires, but the newly-designed AArch64 provides just what the standard requires and not much more, so the C++11 default seq_cst is as cheap as possible. (gcc bug? It hoists in both cases godbolt.org/g/iTdQFS on AArch64.) I hadn't heard the term before that, so IDK if he made it up and your typo just happened to match it.Harcourt
@Bee: actually wait, for AArch64 gcc isn't hoisting the 2nd store before the release, it's sinking the first store past the release in both cases, which is clearly illegal.Harcourt
@PeterCordes I think "sequential release" is just longhand for "seq_cst" when talking about stores (sequential implies release for a store). AFAIK if you do everything seq_cst pretty much zero reordering is allowed, so it's not surprising the heaviest barriers are needed there. The mismatch is more usually when there is a mix of other operations, like seq_acq and seq_rel, where some platforms still have to use the same heavy barriers. Reading more of the C++ memory model I now agree that Java volatile is similar to seq_cst: the guarantees of the latter were weaker than I thought.Dyarchy
@PeterCordes - I don't see anything illegal with gcc's code: it has simply elided the first store na = 1 in favor of the later store na = 3, which kills it unconditionally. The way acquire-release works, if you see a = 2 it means you'll see at least the na = 1 operation, but it could be any later operation in the modification order too, and in this case gcc compiles it so it is always a later operation. In fact this code is UB because there is no happens-before relationship between the various cross-thread modifications of na.Dyarchy
@BeeOnRope: The problem is: if you see a == 2, you aren't allowed to see na == 0 or whatever other old value, only na == 1 or na == 3. Seeing the original value is possible when there's no store to na before doing a release-store. Hmm, na isn't atomic at all, so possibly gcc decided that it's UB for another thread to read it while this thread is writing it, because it's still written after the release-store?Harcourt
@BeeOnRope: Ah yes, looks like that was it. A relaxed atomic store doesn't sink past a release or seq_cst store (godbolt.org/g/snS97v), presumably because it's not UB for another thread that Synchronizes With the seq_cst store to read weak. IDK how to create conditions that would "tempt" gcc to sink a non-atomic store past a release without writing the non-atomic object after the release, too.Harcourt
@BeeOnRope: oh wait a minute, that didn't prove anything because gcc never merges atomic stores in the first place. Derp. It's not hoisting the weak.store(3, mo_relaxed) above sync.store(2, mo_release) either, of course. (And BTW, I'm imagining a use-case where only one thread ever calls these functions to publish, and another thread reads sync and then maybe na. So not cross-modification, but yeah there was UB.)Harcourt
@PeterCordes - you can convince gcc (and most compilers) to swap the store order usually just by putting the earlier store at the end of a longer dependency chain than the later store. See here for example. Note that you need __restrict__ for this to work at least for the non-atomic case since otherwise it assumes the int values may alias. There the store order is swapped only for the two-non-atomic-stores case, but not in any of the cases where an atomic is involved, even when allowed (relaxed). You see the x86 advantage in the release case.Dyarchy
I think it is likely that this is a result of the essentially-zero optimization around atomics we've discussed before: almost none of the low-hanging fruit has been picked for almost any cases involving atomic values (except that clang recognizes totally local atomic vars and essentially removes their atomic nature), so it's not really surprising they also don't reorder around them in the cases where it would be allowed, since that's complicated. The "killed stores" optimization is different and I think it's just working as it always has (removing a store shouldn't break things).Dyarchy
@BeeOnRope: I think the x86 advantage you're talking about is the ability to do a plain release store, not sequential-release. Yeah, seems odd that AArch64 didn't include a weaker release-store for unlocking spinlocks and so on, unless stlr is a lot cheaper than mov + mfence, or xchg. Anyway, I am surprised that gcc doesn't reorder even relaxed stores. clang and ICC don't either. (And don't even reorder the non-atomic case.)Harcourt
@BeeOnRope: submitted gcc.gnu.org/bugzilla/show_bug.cgi?id=83285Harcourt
This conversation is so awesome. :)Disease

Compare-and-swap is useful, but there is an even simpler (so-called 'lock-free') technique that works in certain producer/consumer use cases, so I will mention it.

Imagine you have a function doWork() that writes to a buffer.

  1. Thread A initializes a volatile boolean flag to false and creates a buffer object (a global, etc.) that doWork() will write its output to; both are accessible to Thread A and Thread B.
  2. Thread A creates thread B, which calls doWork().
  3. Thread B's doWork() begins creating/writing to the buffer.
  4. When B's doWork() is finished, it issues a write barrier so that all of the buffer writes are committed, then sets the boolean to true.
  5. Thread A should issue a full barrier and then poll the global boolean flag, which starts out false. When it turns true, A can access the data in the buffer object, assured that it is finished. Between polls, Thread A can do other work (for example, it polls once in an update call and does not block waiting for a true value). This doesn't take error handling into account, but errors can also be signaled through the buffer. (A Java sketch of these steps follows.)
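
A minimal Java sketch of steps 1-5 (class and field names are illustrative). Under the Java memory model the volatile write/read pair supplies the barriers described in steps 4 and 5, so no explicit fence calls are needed:

    public class Handoff {
        // Step 1: the flag starts out false; Thread A creates the buffer.
        private volatile boolean done = false;
        private final int[] buffer = new int[1024];

        // Steps 3-4, run on thread B.
        void doWork() {
            for (int i = 0; i < buffer.length; i++) {
                buffer[i] = i * i;  // create/write the buffer contents
            }
            done = true;            // volatile store: acts as the write barrier
        }

        // Step 5, polled on thread A between other work.
        boolean poll() {
            if (done) {             // volatile load: acts as the read barrier
                // All of B's writes to buffer are guaranteed visible here.
                return true;
            }
            return false;
        }
    }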

There are multiple ways for A to wait for doWork() to finish, depending on how long you expect it to take (a hybrid sketch follows this list):

  • If you expect it to be a few nanoseconds, use thread affinities to schedule A and B as hyperthreads on the same CPU core, and use a busy-wait buffered by CPU-specific wait/pause/relax instructions to help the CPU's pipeline digest the busy-wait more efficiently.

  • If you expect a few hundred nanoseconds to a few microseconds, test once, then skip the busy-wait and go straight to a futex syscall, which will perform its own highly optimized in-kernel cooperative busy-wait for a few microseconds.

  • If you expect more than a few dozen microseconds, use pthread condition variables, which have high upfront overhead but take dozens of peripheral circumstances and possibilities into account, ensuring the best approach for longer waits.
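
As a rough Java illustration combining the first two ideas: spin briefly, then fall back to parking the thread. Java doesn't expose futexes directly; LockSupport.park/unpark is the closest analogue, and the spin threshold below is made up for the example, not tuned:

    import java.util.concurrent.locks.LockSupport;

    public class HybridWait {
        private volatile boolean ready = false;
        private volatile Thread waiter;

        // Consumer: busy-wait for a bounded number of spins, then park.
        void awaitReady() {
            waiter = Thread.currentThread();
            int spins = 0;
            while (!ready) {
                if (spins++ < 1_000) {
                    Thread.onSpinWait();    // pause/relax hint (Java 9+)
                } else {
                    LockSupport.park(this); // sleep until unparked; loop re-checks
                }
            }
            waiter = null;
        }

        // Producer: publish the result, then wake the waiter if it parked.
        void signalReady() {
            ready = true;
            Thread w = waiter;
            if (w != null) {
                LockSupport.unpark(w);
            }
        }
    }

Because both fields are volatile, the producer either sees the registered waiter and unparks it, or the consumer sees ready == true before parking; the while loop also absorbs spurious wakeups from park.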

This only works because A only reads and B only writes, but this use case is fairly common for 'background worker' threads. It is only sure to work in Java or C#, where volatile comes with the necessary guarantees.

Possing answered 1/2, 2012 at 22:2 Comment(8)
You can't use a simple boolean for such synchronization. The compiler and/or processor may reorder your reads, leading to bugs. You think you read the boolean first (and see that it's true), and only then read the buffer. But in reality the reads may be reversed, and you get "half-baked" data. You need a semaphore.Concentre
I believe there will never be a case where thread A reads and uses the buffer first, because the buffer is read within a conditional dependent on the value of the boolean read, and the compiler will not switch this stuff around. This is why you can write a conditional that checks for a null pointer and then use the pointer after that.Possing
I'm not against believing in principle, but in multi-threaded programming, it's a big mistake. The compiler may or may not switch stuff around, but the CPU may, when different addresses are involved. Intel CPUs do quite a lot of reordering, and I've already seen amazing bugs due to it.Concentre
Again, compiler reordering obeys certain rules, and this technique abides by them. Besides belief, I explained the conditional case, which compilers respect.Possing
You're ignoring out-of-order execution. Suppose the compiler did all you expect, and you have machine code such as mov var1,%eax; test %eax,%eax; jz somewhere; mov var2,%eax. Reading var2 is done after var1 was read and verified to be non-zero. But the CPU can still reorder, and read var2 first, var1 later (maybe because var2 was in a nearer cache). If var1 ends up 0, the var2 value is discarded. But if it ends up non-zero, the old value will be used.Concentre
I see your point now, and thank you for illustrating it well. But it is still possible, because the volatile keyword (which I should have mentioned) actually guarantees not to do this kind of reordering in some languages (Java 1.5 and later, and C#, but not C). In C you will need some kind of memory barrier.Possing
In C11, use _Atomic int. In C++11, use atomic<int>. That gives you what volatile does in Java / C# for loads/stores, and also supports atomic RMW.Harcourt
Update: added info about barriers and polling and how to correctly do thisDrifter

There are some useful ways to use lock-free synchronization (such as those @Tudor mentions). But I want to warn about one thing - lock-free synchronization doesn't compose.

You may have, for example, an integer maintained by compare-and-swap, and it's OK. You may also have a queue maintained by a lock-free algorithm (it's a bit tricky, but there are good algorithms for it), and the queue is also OK.
But if you try to use the counter to count the elements in the queue, you'll get wrong answers. There will be times when an element has been added but the counter doesn't yet reflect it (or vice versa), and you can get bugs if you trust it (e.g. you may try to add to a full queue).

In short - each structure can be consistent with itself, but the structures are not consistent with each other.
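
A small Java sketch of the trap (the class is illustrative): the queue and the counter are each thread-safe on their own, yet a reader can observe them out of sync because the two updates are not one atomic step:

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicInteger;

    public class CountedQueue<T> {
        private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();
        private final AtomicInteger count = new AtomicInteger(0);

        public void add(T item) {
            queue.add(item);         // the element is visible now...
            // A thread calling size() right here sees the new element
            // but the old count: the pair is momentarily inconsistent.
            count.incrementAndGet(); // ...the count catches up later.
        }

        public int size() {
            return count.get();      // may lag behind the real queue size
        }
    }

Making the pair consistent requires either a lock around both updates or a combined lock-free design, which is exactly the composition problem described above.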

Concentre answered 1/2, 2012 at 22:21 Comment(0)
