How do "acquire" and "consume" memory orders differ, and when is "consume" preferable?

Asked 26/10, 2013 at 17:54 Answered 11/10, 2015 at 4:36

The C++11 standard defines a memory model (1.7, 1.10) which contains memory orderings, which are, roughly, "sequentially-consistent", "acquire", "consume", "release", and "relaxed". Equally roughly, a program is correct only if it is race-free, which happens if all actions can be put in some order in which one action happens-before another one. The way that an action X happens-before an action Y is that either X is sequenced before Y (within one thread), or X inter-thread-happens-before Y. The latter condition is given, among others, when

X synchronizes with Y, or
X is dependency-ordered before Y.

Synchronizing-with happens when X is an atomic store with "release" ordering on some atomic variable, and Y is an atomic load with "acquire" ordering on the same variable. Being dependency-ordered-before happens for the analogous situation where Y is load with "consume" ordering (and a suitable memory access). The notion of synchronizes-with extends the happens-before relationship transitively across actions being sequenced-before one another within a thread, but being dependency-ordered-before is extended transitively only through a strict subset of sequenced-before called carries-dependency, which follows a largish set of rules, and notably can be interrupted with std::kill_dependency.

Now then, what is the purpose of the notion of "dependency ordering"? What advantage does it provide over the simpler sequenced-before / synchronizes-with ordering? Since the rules for it are stricter, I assume that can be implemented more efficiently.

Can you give an example of a program where switching from release/acquire to release/consume is both correct and provides a non-trivial advantage? And when would std::kill_dependency provide an improvement? High-level arguments would be nice, but bonus points for hardware-specific differences.

Ender answered 26/10, 2013 at 17:54 Comment(24)

Disclaimer: I just watched Herb Sutter's atomic<> Weapons talks, and he said that he won't discuss "consume" because "nobody understands it". – Ender 26/10, 2013 at 17:57

@Anthony Williams: Would you happen to have some insights into this? :-) – Ender 27/10, 2013 at 16:16

"And when would std::kill_dependency provide an improvement?" Related: https://mcmap.net/q/25248/-c-memory_order_consume-kill_dependency-dependency-ordered-before-synchronizes-with/420683 and https://mcmap.net/q/20749/-what-does-std-kill_dependency-do-and-why-would-i-want-to-use-it/420683 ; also note cppreference claims "On all mainstream CPUs other than DEC Alpha, dependency ordering is automatic, no additional CPU instructions are issued for this synchronization mode[...]" whereas this doesn't hold for release-acquire ordering (I think an example is ARM). – Forwent 28/10, 2013 at 0:8

Possible real use case - single-producer/single-consumer queue. During push operation you just attach new node to head with memory_order_release. During pop you want to have both tail and tail->value, where load of tail carries-a-dependency-to tail->value, but you don't care about anything else - so can use memory_order_consume instead of memory_order_acquire. – Entire 28/10, 2013 at 4:27

@EvgenyPanasyuk: Very interesting, though as far as I can tell, everything before the release in that code carries a dependency on the synchronization variable (the pointer), so it looks essentially identical to if you had used "acquire" for the load, non? – Ender 28/10, 2013 at 8:36

@DyP: Thanks - the cppreference page is actually very nice. When you say "this doesn't hold for release/acquire ordering", the ordering in question is just general sequenced-before ordering, which is indeed not automatic on modern CPUs (and not just on ARM), and in fact, being able to reorder is what makes modern CPUs fast. – Ender 28/10, 2013 at 8:42

@KerrekSB Main aim of this example is to show implementation of SPSC queue. Users of that queue may have their own stuff which is not required to be synchronized during synchronization of queue internals. So memory_order_acquire - can be superfluous. – Entire 28/10, 2013 at 8:50

@KerrekSB: Sutter does not understand that consume is the same as acquire, except that there are no happens-before guarantees on unrelated, non-atomic shared data? Seriously? – Gramicidin 30/10, 2013 at 13:17

@Damon: No, he said that nobody understands what it means and how to use it. It's one thing to have an abstract description, and another to have an intimate understanding of how it's used correctly and effectively. Would you agree that there are very few people who understand how to write lock-free code properly? And that's a much simpler problem. – Ender 30/10, 2013 at 13:42

@KerrekSB: Well, true. Though the bigger problem in my opinion is not understanding what it means and being intimately familiar with using it correctly (I daresay most people do get that part right, it's not that difficult), but actually finding a useful not-totally-contrieved example where you really need it. :-) You usually either need that very happens-before guarantee so you use acquire, or you're not interested in any guarantees (just need one value to be atomic) and use relaxed. Similarly, it's tough to come up with a no-bullshit example where you truly can't do without seq_cst. – Gramicidin 30/10, 2013 at 14:22

@Damon: That's why I thought I'd ask this question :-) – Ender 30/10, 2013 at 14:49

'the heck is voting this down... :'( – Corfu 30/10, 2013 at 20:26

I think that this thread on SO and the proposal N2664 contain answers to your question. – Rademacher 1/11, 2013 at 10:48

For those reading here, one key detail is that consume is not transitive, meaning if T2 consumes T1's changes, and T3 consumes T2's changes, T3 MAY not see all of T1's changes! With acquire/release, this transitive behavior does work, and T3 would see T1's changes. For most developers, this is much more intuitive than consume. However, on a few VERY large computers (1024+ cores), the cost of synchronizing more memory than needed could be very great. Consume did a good job of matching what was needed in those cases. – Ivette 10/9, 2014 at 21:22

@CortAmmon "cost of synchronizing more memory" What do you mean? What is synchronization? – Nucleoplasm 14/6, 2018 at 0:27

@Nucleoplasm I'm referring to the actions specified by memory_order. For example, all side effects which happen before an atomic operation with memory_order_release must be visible to any thread which does an atomic operation on the same atomic variable with memory_order_acquire. This can require the CPU to do things like flushing caches, though typically CPUs designers try to find more efficient solutions than that. For large supercomputers, it can be worth paying attention to the bandwidth of the connections used in this process. – Ivette 14/6, 2018 at 2:18

@CortAmmon Which CPU flushed a cache for a release operation? – Nucleoplasm 14/6, 2018 at 4:44

@Nucleoplasm That's a CPU architecture question, but I believe the general answer is that the cache flush is done by the CPU that acquires, if needed. I know a lot of modern processors also do some clever tricks to peek into the cache of cores on the same processor to avoid the cache flush (since that can be an expensive operation) – Ivette 14/6, 2018 at 5:46

@CortAmmon AFAIK, any load on x86 is an acquire operation. – Nucleoplasm 14/6, 2018 at 6:40

@Nucleoplasm Almost. You still need to use the lock prefix to support read-modify-update patterns such as atomic<int>::operator++ which involve both a read and write. Simple reads or writes are indeed atomic (under reasonable circumstances). Here is someone else's answer going deeper into that. The behavior of architectures other than x86 is, obviously, a different question entirely. – Ivette 14/6, 2018 at 15:11

@CortAmmon I mean that a simple word load is enough on x86 to provide ordering. (An aligned word load or store is atomic. Of course, combined operations aren't atomic by default.) – Nucleoplasm 15/6, 2018 at 13:11

@Nucleoplasm Yes, if you limit yourself to simple loads and simple stores on x86, and the compiler does not do any optimizations that would get in the way, the processor will provide the atomic guarantee. If I recall, GCC's implementation of static variables in a function uses this guarantee. – Ivette 15/6, 2018 at 14:30

@CortAmmon What kinds of compiler optimizations would break atomicity? – Nucleoplasm 15/6, 2018 at 14:48

@Nucleoplasm I would phrase it differently. Instead of saying some optimizations break atomicity, I would say that atomicity is something that is not guaranteed unless you intentionally use compiler features which guarantee the particular guarantees you need for that particular operation (volatile being one such feature). This blog post is my personal favorite way of explaining just how much the compiler can do unless you explicitly use these features to limit it's optimizations. – Ivette 15/6, 2018 at 15:32

Load-consume is much like load-acquire, except that it induces happens-before relationships only to expression evaluations that are data-dependent on the load-consume. Wrapping an expression with kill_dependency results in a value that no longer carries a dependency from the load-consume.

The key use case is for the writer to construct a data structure sequentially, then swing a shared pointer to the new structure (using a release or acq_rel atomic). The reader uses load-consume to read the pointer, and dereferences it to get to the data structure. The dereference creates a data dependency, so the reader is guaranteed to see the initialized data.

std::atomic<int *> foo {nullptr};
std::atomic<int> bar;

void thread1()
{
    bar = 7;
    int * x = new int {51};
    foo.store(x, std::memory_order_release);
}

void thread2()
{
    int *y = foo.load(std::memory_order_consume)
    if (y)
    {
        assert(*y == 51); //succeeds
        // assert(bar == 7); //undefined behavior - could race with the store to bar 
        // assert(kill_dependency(*y) + bar == 58) // undefined behavior (same reason)
        assert(*y + bar == 58); // succeeds - evaluation of bar pulled into the dependency 
    }
}

There are two reasons for providing load-consume. The primary reason is that ARM and Power loads are guaranteed to consume, but require additional fencing to turn them into acquires. (On x86, all loads are acquires, so consume provides no direct performance advantage under naive compilation.) The secondary reason is that the compiler can move later operations without data dependence up to before the consume, which it can't do for an acquire. (Enabling such optimizations is the big reason for building all of this memory ordering into the language.)

Wrapping a value with kill_dependency allows computation of an expression that depends on the value to be moved to before the load-consume. This is useful e.g. when the value is an index into an array that was previously read.

Note that the use of consume results in a happens-before relation that is no longer transitive (though it is still guaranteed to be acyclic). For example, the store to bar happens before the store to foo, which happens before the dereference of y, which happens before the read of bar (in the commented-out assert), but the store to bar doesn't happen before the read of bar. This leads to a rather more complicated definition of happens-before, but you can imagine how it works (start with sequenced-before, then propagate through any number of release-consume-dataDependency or release-acquire-sequencedBefore links)

Losse answered 5/11, 2013 at 10:58 Comment(3)

Interesting. So the compiler spots the (*y + bar) and pulls in bar from the cache? Why doesn't the writing thread need to signal that bar is also being released? – Alisaalisan 4/8, 2014 at 16:15

"when the value is an index into an array that was previously read" I don't get it – Nucleoplasm 14/6, 2018 at 0:31

"The secondary reason is that the compiler can move later operations without data dependence up to before the consume, which it can't do for an acquire." Is there any practical case where it is done and useful? – Nucleoplasm 23/6, 2018 at 1:44

Data dependency ordering was introduced by N2492 with the following rationale:

There are two significant use cases where the current working draft (N2461) does not support scalability near that possible on some existing hardware.

read access to rarely written concurrent data structures

Rarely written concurrent data structures are quite common, both in operating-system kernels and in server-style applications. Examples include data structures representing outside state (such as routing tables), software configuration (modules currently loaded), hardware configuration (storage device currently in use), and security policies (access control permissions, firewall rules). Read-to-write ratios well in excess of a billion to one are quite common.

publish-subscribe semantics for pointer-mediated publication

Much communication between threads is pointer-mediated, in which the producer publishes a pointer through which the consumer can access information. Access to that data is possible without full acquire semantics.

In such cases, use of inter-thread data-dependency ordering has resulted in order-of-magnitude speedups and similar improvements in scalability on machines that support inter-thread data-dependency ordering. Such speedups are possible because such machines can avoid the expensive lock acquisitions, atomic instructions, or memory fences that are otherwise required.

emphasis mine

the motivating use case presented there is rcu_dereference() from the Linux kernel

Constringent answered 31/10, 2013 at 18:29 Comment(0)

Jeff Preshing has a great blog post answering this question. I can't add anything myself, but think anyone wondering about consume vs. acquire should read his post:

http://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/

He shows a specific C++ example with corresponding benchmarked assembly code across three different architectures. Compared to memory_order_acquire, memory_order_consume potentially offers a 3x speedup on PowerPC, 1.6x speedup on ARM, and negligible speedup on x86 which has strong consistency anyway. The catch is that as of when he wrote it, only GCC actually treated consume semantics any differently from acquire, and probably because of a bug. Nonetheless, it demonstrates that a speedup is available if the compiler writers can figure out how to take advantage of it.

Freestone answered 11/10, 2015 at 4:36 Comment(1)

Another link if you like: open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0098r0.pdf – Dairyman 11/10, 2015 at 7:44

I'd like to record a partial finding, even though it's not a real answer and doesn't mean that there won't be a big bounty for a proper answer.

After staring at 1.10 for a while, and in particular the very helpful note in paragraph 11, I think this isn't actually so hard. The big difference between synchronizes-with (henceforth: s/w) and dependency-ordered-before (dob) is that a happens-before relationship can be established by concatenating sequenced-before (s/b) and s/w arbitrarily, but not so for dob. Note one of the definitions for inter-thread happens before:

A synchronizes-with X and X is sequenced before B

But the analogous statement for ~~A is dependency-ordered before X~~ is missing!

So with release/acquire (i.e. s/w) we can order arbitrary events:

A1    s/b    B1                                            Thread 1
                   s/w
                          C1    s/b    D1                  Thread 2

But now consider an arbitrary sequence of events like this:

A2    s/b    B2                                            Thread 1
                   dob
                          C2    s/b    D2                  Thread 2

In this sequenece, it is still true that A2 happens-before C2 (because A2 is s/b B2 and B2 inter-thread happens before C2 on account of dob; but we could argue that you can never actually tell!). However, it is not true that A2 happens-before D2. The events A2 and D2 are not ordered with respect to one another, unless it actually holds that C2 carries dependency to D2. This is a stricter requirement, and absent that requirement, A2-to-D2 cannot be ordered "across" the release/consume pair.

In other words, a release/consume pair only propagates an ordering of actions which carry a dependency from one to the next. Everything that's not dependent is not ordered across the release/consume pair.

Furthermore, note that the ordering is restored if we append a final, stronger release/acquire pair:

A2    s/b    B2                                                         Th 1
                   dob
                          C2    s/b    D2                               Th 2
                                             s/w
                                                    E2    s/b    F2     Th 3

Now, by the quoted rule, D2 inter-thread happens before F2, and therefore so do C2 and B2, and so A2 happens-before F2. But note that there is still no ordering between A2 and D2 — the ordering is only between A2 and later events.

In summary and in closing, dependency carrying is a strict subset of general sequencing, and release/consume pairs provide an ordering only among actions that carry dependency. As long as no stronger ordering is required (e.g. by passing through a release/acquire pair), there is theoretically a potential for additional optimization, since everything that is not in the dependency chain may be reordered freely.

Maybe here is an example that makes sense?

std::atomic<int> foo(0);

int x = 0;

void thread1()
{
    x = 51;
    foo.store(10, std::memory_order_release);
}

void thread2()
{
    if (foo.load(std::memory_order_acquire) == 10)
    {
        assert(x == 51);
    }
}

As written, the code is race-free and the assertion will hold, because the release/acquire pair orderes the store x = 51 before the load in the assertion. However, by changing "acquire" into "consume", this would no longer be true and the program would have a data race on x, since x = 51 carries no dependency into the store to foo. The optimization point is that this store can be reordered freely without concern to what foo is doing, because there is no dependency.

Ender answered 27/10, 2013 at 23:18 Comment(6)

Perhaps you should add to your example something that shows where consume actually works. Maybe something like foo.load(memory_order_consume)->value. – Entire 28/10, 2013 at 2:24

@EvgenyPanasyuk: I hadn't been able to think of a useful example. An atomic pointer seems to be the obvious case, though. – Ender 28/10, 2013 at 8:26

In your second example (release/consume) it should read "unless it actually holds that C2 carries dependency to D2". Because then B2 is dependency-ordered before D2, hence inter-thread happens before D2 and A2 is sequenced before B2, which implies A2 inter-thread happens before D2. – Rademacher 31/10, 2013 at 10:54

As far as I know, there is no difference in the implementation of "release/acquire" and "release/consume" on x86 architectures. But on Power for example, load-acquire is implemented by ld; cmp; bc; isync;, while load-consume only by ld; and the (weak) memory model guarantees that certain dependencies are respected. – Rademacher 31/10, 2013 at 11:21

Note that your definition of happens before as given in the question is wrong, since it would always imply that A2 happens before D2. – Rademacher 31/10, 2013 at 12:14

@MWid: Thanks for all that, I've updated both the question and this post! – Ender 31/10, 2013 at 18:15

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags