What does memory_order_consume really do?

From the link: What is the difference between load/store relaxed atomic and normal variable?

I was deeply impressed by this answer:

Using an atomic variable solves the problem - by using atomics all threads are guaranteed to read the latest written value even if the memory order is relaxed.

Today, I read the link below: https://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/

atomic<int*> Guard(nullptr);
int Payload = 0;

thread1:

    Payload = 42;
    Guard.store(&Payload, memory_order_release);

thread2:

    g = Guard.load(memory_order_consume);
    if (g != nullptr)
        p = *g;
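For concreteness, here is a runnable sketch of that example with includes and thread setup added (the spin loop and the `run_example` wrapper are mine, not from the article; current compilers treat consume as acquire anyway):

```cpp
#include <atomic>
#include <thread>

std::atomic<int*> Guard(nullptr);
int Payload = 0;

void writer() {
    Payload = 42;                                     // plain non-atomic write
    Guard.store(&Payload, std::memory_order_release); // publish the pointer
}

int run_example() {
    std::thread t1(writer);
    int* g;
    // spin until the writer publishes the pointer
    while ((g = Guard.load(std::memory_order_consume)) == nullptr) {
    }
    int p = *g;  // dependent load: ordered after the load of g
    t1.join();
    return p;    // once the pointer is non-null, the payload is visible
}
```

Once the consume (or acquire) load sees the non-null pointer, dereferencing it is guaranteed to yield 42.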


QUESTION: I learned that data dependency prevents related instructions from being reordered. But I think that is obviously required for the correctness of execution results anyway, whether or not consume/release semantics exist. So I wonder what consume/release really does. Or maybe it uses data dependencies to prevent reordering of instructions while also ensuring the visibility of Payload?

So

Is it possible to get the same correct result using memory_order_relaxed if I (1) prevent instructions from being reordered and (2) ensure the visibility of the non-atomic variable Payload:

atomic<int*> Guard(nullptr);
volatile int Payload = 0;   // 1. Payload is volatile now

// 2. the Payload assignment and Guard.store stay in order because of the data dependency
Payload = 42;
Guard.store(&Payload, memory_order_release);

// 3. the data dependency keeps the write/read of g and p in order
g = Guard.load(memory_order_relaxed);
if (g != nullptr)
    p = *g;      // 4. Per 1, 2, 3 there is no reordering, and here the volatile Payload makes the value 42 visible.

Additional content (because of Sneftel's answer):

1. Payload = 42; volatile makes reads/writes of Payload go to/from main memory rather than to/from cache, so 42 will be written to memory.

2. Guard.store(&Payload, any MO flag usable for writing); Guard is non-volatile, as you said, but it is atomic.

Using an atomic variable solves the problem - by using atomics all threads are guaranteed to read the latest written value even if the memory order is relaxed.

In fact, atomics are always thread-safe, regardless of the memory order! The memory order is not for the atomics; it's for non-atomic data.

So after Guard.store is performed, Guard.load (with any MO flag usable for reading) can get the address of Payload correctly, and then get the 42 from memory correctly.

Above code:

1. no reordering, because of the data dependency.

2. no cache effects, because Payload is volatile.

3. no thread-safety problem, because Guard is atomic.

Can I get the correct value, 42?

Back to the main question

When you use consume semantics, you’re basically trying to make the compiler exploit data dependencies on all those processor families. That’s why, in general, it’s not enough to simply change memory_order_acquire to memory_order_consume. You must also make sure there are data dependency chains at the C++ source code level.


" You must also make sure there are data dependency chains at the C++ source code level."

I think data dependency chains at the C++ source code level prevent instructions from being reordered naturally. So what does memory_order_consume really do?

And can I use memory_order_relaxed to achieve the same result as the above code?

Additional content end

Etter answered 17/12, 2020 at 7:38 Comment(8)
“But i think that is obvious for ensure the correctness of execution results.” How so? What correctness issues in a single-threaded program do you think might arise if, for instance, the writes in the code a=1; b=2 were reordered?Axe
It doesn't matter if a=1; b=2 are reordered. But reordering a=1; b=a would be wrong. That is what I wanted to express with "that is obvious for ensuring the correctness of execution results".Etter
That’s not what your code is doing (either version). It’s not clear to me why you think volatile+relaxed is equivalent to release. The volatile qualifier doesn’t constrain reads/writes to non-volatile objects (such as your atomic). It has nothing to do with writing conformant multithreaded code in C++.Axe
Glad you were impressed by my answer! I would largely advise forgetting about memory order consume and simply replacing it with acquire in all cases.Giraldo
@Axe thank you for your reply. I added some content for the question.Etter
In practice, it's treated exactly like acquire by current compilers, because it proved to be too hard to safely and efficiently implement the ISO C++ spec in a way that takes advantage of asm dependency-ordering guarantees. If you want that efficiency, you have to hack it with mo_relaxed and cross your fingers (with code that would make it hard for a compiler to break the data dependency, e.g. by branching on a value or removing it if it can prove there's only one possible value.) See C++11: the difference between relaxed and consumeReliant
volatile make the W/R of Payload to/from main memory but not to/from cache - no. It makes sure the store is done at all, rather than keeping the value in registers until later. Registers are not cache; many people are confused by phrasing like having "a value cached in registers". That's one way for software to use a register to hold the value of a var that isn't being modified, but actual CPU cache is different (and is coherent). When to use volatile with multi threading? - never, but it does have some effects in practice.Reliant
Also meant to link Myths Programmers Believe about CPU CachesReliant

First of all, memory_order_consume is temporarily discouraged by the ISO C++ committee until they come up with something compilers can actually implement. For a few years now, compilers have treated consume as a synonym for acquire. See the section at the bottom of this answer.

Hardware still provides the data dependency, so it's interesting to talk about that, despite not having any safely portable ISO C++ ways to take advantage currently. (Only hacks with mo_relaxed or hand-rolled atomics, and careful coding based on understanding of compiler optimizations and asm, kind of like you're trying to do with relaxed. But you don't need volatile.)

Oh, maybe it uses data dependencies to prevent reordering of instructions while ensuring the visibility of Payload?

Not exactly "reordering of instructions", but memory reordering. As you say, sanity and causality are enough in this case if the hardware provides dependency ordering. C++ is portable to machines that don't (e.g. DEC Alpha).

The normal way to get visibility for Payload is via release-store in the writer, acquire load in the reader which sees the value from that release-store. https://preshing.com/20120913/acquire-and-release-semantics/. (So of course repeatedly storing the same value to a "ready_flag" or pointer doesn't let the reader figure out whether it's seeing a new or old store.)

Release / acquire creates a happens-before synchronization relationship between the threads, which guarantees visibility of everything the writer did before the release-store. (consume doesn't, that's why only the dependent loads are ordered.)
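That release/acquire pattern can be sketched as a self-contained example (the `ready` flag and the `run_release_acquire` wrapper are illustrative names, not from the question):

```cpp
#include <atomic>
#include <thread>

int data_a = 0, data_b = 0;          // non-atomic payload
std::atomic<bool> ready(false);

int run_release_acquire() {
    std::thread writer([] {
        data_a = 1;
        data_b = 2;
        // release-store: publishes everything written above it
        ready.store(true, std::memory_order_release);
    });
    // an acquire-load that sees the release-store synchronizes-with it
    while (!ready.load(std::memory_order_acquire)) {
    }
    int sum = data_a + data_b;       // guaranteed to see 1 and 2
    writer.join();
    return sum;
}
```

Because the acquire-load that returns true synchronizes-with the release-store, the reader is guaranteed to see both non-atomic writes; with relaxed on the load side that guarantee disappears on paper (there is no data dependency here to fall back on).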

(consume is an optimization on this: avoiding a memory barrier in the reader by letting the compiler take advantage of hardware guarantees as long as you follow some dependency rules.)


You have some misconceptions about what CPU cache is, and about what volatile does, which I commented about under the question. A release-store makes sure earlier non-atomic assignments are visible in memory.

(Also, cache is coherent; it provides all CPUs with a shared view of memory that they can agree on. Registers are thread-private and not coherent, that's what people mean when they say a value is "cached". Registers are not CPU cache, but software can use them to hold a copy of something from memory. When to use volatile with multi threading? - never, but it does have some effects in real CPUs because they have coherent cache. It's a bad way to roll your own mo_relaxed. See also https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/)

In practice on real CPUs, memory reordering happens locally within each core; cache itself is coherent and never gets "out of sync". (Other copies are invalidated before a store can become globally visible.) So release just has to make sure the local CPU's stores become globally visible (commit to L1d cache) in the right order. ISO C++ doesn't specify any of that level of detail, and an implementation that worked very differently is hypothetically possible.

Making the writer's store volatile is irrelevant in practice because a non-atomic assignment followed by a release-store already has to make everything visible to other threads that might do an acquire-load and sync with that release store. It's irrelevant on paper in pure ISO C++ because it doesn't avoid data-race UB.

(Of course, it's theoretically possible for whole-program optimization to see that there are no acquire or consume loads that would ever load this store, and optimize away the release property. But compilers currently don't optimize atomics in general even locally, and never try to do that kind of whole-program analysis. So code-gen for writer functions will assume that there might be a reader that syncs with any given store of release or seq_cst ordering.)


What does memory_order_consume really do?

One thing mo_consume does is to make sure the compiler uses a barrier instruction on implementations where the underlying hardware doesn't provide dependency ordering naturally / for free. In practice that means only on DEC Alpha. Dependent loads reordering in CPU / Memory order consume usage in C11

Your question is a near duplicate of C++11: the difference between memory_order_relaxed and memory_order_consume - see the answers there for the body of your question about misguided attempts to do stuff with volatile and relaxed. (I'm mostly answering because of the title question.)

It also ensures that the compiler uses a barrier at some point before execution passes into code that doesn't know about the data dependency this value carries. (i.e. no [[carries_dependency]] tag on the function arg in the declaration). Such code might replace x-x with a constant 0 and optimize away, losing the data dependency. But code that knows about the dependency would have to use something like a sub r1, r1, r1 instruction to get a zero with a data dependency.

That can't happen for your use-case (where relaxed will work in practice on ISAs other than Alpha), but the on-paper design of mo_consume allowed all kinds of stuff that would require different code-gen from what compilers would normally do. This is part of what made it so hard to implement efficiently that compilers just promote it to mo_acquire.

The other part of the problem is that it requires code to be littered with kill_dependency and/or [[carries_dependency]] all over the place, or you'll end up with a barrier at function boundaries anyway. These problems led the ISO C++ committee to temporarily discourage consume.
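As a sketch of how the on-paper design was meant to be used (the names `slot` and `read_entry` are hypothetical; on current compilers consume is just acquire, so kill_dependency changes nothing in practice):

```cpp
#include <atomic>

std::atomic<int*> slot(nullptr);
int table[2] = {10, 20};

// On-paper consume semantics track the dependency chain through every use
// of the loaded value; std::kill_dependency ends the chain, so code after
// it need not preserve dependency ordering (otherwise passing the value
// into a function without [[carries_dependency]] could force a barrier).
int read_entry() {
    int* g = slot.load(std::memory_order_consume);
    if (g == nullptr) return -1;
    int idx = *g & 1;                         // carries a dependency from g
    return table[std::kill_dependency(idx)];  // dependency chain ends here
}
```

Here the ordering that matters (load of g before load of *g) is preserved, while the table indexing afterward is explicitly released from the dependency-tracking rules.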


And BTW:

The example code is safe with release + consume regardless of volatile. It's safe on most compilers and most ISAs in practice with release store + relaxed load, although of course ISO C++ has nothing to say about the correctness of that code. But with the current state of compilers, that's a hack that some code makes (like the Linux kernel's RCU).

If you need that level of read-side scaling, you'll have to work outside of what ISO C++ guarantees. That means your code will have to make assumptions about how compilers work (and that you're running on a "normal" ISA that isn't DEC Alpha), which means you need to support some set of compilers (and maybe ISAs, although there aren't many multi-core ISAs around). The Linux kernel only cares about a few compilers (mostly recent GCC, also clang I think), and the ISAs that they have kernel code for.

Reliant answered 18/12, 2020 at 8:3 Comment(7)
Really, thanks for your reply. May I simply think of it like this: some code or hardware implementations do not guarantee that data dependence can be used to solve the reordering problem. Therefore, the original purpose of designing consume was to guide the compiler to generate correct non-reordered code through data dependence, without a fence, although that turned out to be too difficult to achieve.Etter
@breaker00: No, the existence of HW without dependency-ordering guarantees is not why we need consume. You would still need it without that, to control code-gen and make sure the compiler doesn't optimize away a dependency. (C++ rules need to be formal and exact; something like "as long as you don't do something the compiler can optimize" is not specific enough).Reliant
it's also critical to understand that the C++ memory model is separate from the hardware memory model. Even when compiling stuff for x86 for example (where even acquire is free, not just consume), optimizations are based on the C++ memory ordering rules, not the hardware. A compiler targeting x86 can still reorder .load(mo_relaxed) at compile time, even though the hardware must maintain the illusion of running them in order.Reliant
Similarly, optimizing a data dependency into a branch is allowed for relaxed, e.g. for something like int idx = x.load(relaxed); int *p = table[idx]; q = *p; with a 2 element table: the compiler could just branch on 0 vs. 1 and pick one, losing the dependency. So ISO C++ needs some way to forbid compilers from doing that, while still allowing full flexibility of optimization for code that doesn't rely on data dependency ordering. So mo_consume is necessary in some form as part of the formal language spec to avoid having everything carry a dependency and disallowing branching.Reliant
I corrected my thoughts about original purpose of consume. Is it close to right? If there is consume, it means that i want the compiler to ensure the correctness of the data dependency. Do not optimize to remove the data dependency that I need. And even on a platform like DEC Alpha, please add a fence to ensure the same correct data dependency relationship. For other cases without consume, it is equivalent to telling the compiler that I don't care about the correctness of the data dependency so much, to do the optimization you think is correct.Etter
@breaker00: Yeah, that's correct. ISO C++ has to define rules about what a data dependency is and isn't in C++, and that's what compilers have to respect when generating asm using a result that carries a dependency (directly or indirectly from a consume load).Reliant
Thanks to people like you who can’t go to bed. So that people who get lost can go to bed earlier.Etter
  1. volatile has nothing to do with multi-threading in C/C++; its sequential visibility side effect applies only within a single-threaded program, and it is usually used only to tell the compiler not to optimize out accesses to the value. It is DIFFERENT from Java/C#.

  2. release/consume is all about data dependency, and it may build a dependency chain (which can be broken by kill_dependency to avoid unnecessary barriers later).

  3. release/acquire forms a pair-wise synchronize-with/inter-thread happens-before relationship.

For your case, release/acquire would form the expected happens-before relationship. release/consume will also work because *g is dependent on g.

But note that with current compilers, consume is treated as a synonym for acquire, because it proved too hard to implement efficiently. see another answer

Deprecatory answered 17/12, 2020 at 10:8 Comment(3)
Yes, all of those bullet points are true, but release/consume is safe here (regardless of volatile), even on old compilers that don't just promote consume to acquire. (consume is temporarily deprecated until the C++ committee comes up with a better consume that can be practically implemented fully safely but still efficiently, and without infecting everyone's code with [[carries_dependency]] tags.) g is the result of a consume that saw the release-store, so *g is ordered after the load of g.Reliant
(release/relaxed is not safe on paper, but on most ISAs will "happen" to work because the compiler will make asm that has a data dependency, and all(?) ISAs except Alpha guarantee dependency ordering. C++11: the difference between memory_order_relaxed and memory_order_consume - only do this in production code if you understand the situation, and only care about a limited set of compilers. e.g. the Linux kernel does this for RCU with their hand-rolled asm atomics / barriers (or lack thereof), but only care about gcc / clang, not ISO C)Reliant
Yes, you're right, consume is valid here. Please feel free to edit my answer, thx.Deprecatory

The thing is, the answer is not entirely correct, as there are a couple of nuances.

Using an atomic variable solves the problem - by using atomics all threads are guarantees to read the latest writen-value even if the memory order is relaxed.

They do read the "latest written value", but with memory order "relaxed" the order of instructions can be rearranged.

So, if you write, say, DoSomething(); x = y.load(relaxed); then post-compilation the relaxed load might be sequenced prior to DoSomething();. And assuming that the routine took quite a while, x's value can be quite far off from y's latest value.

With memory order "consume", that instruction rearrangement is forbidden, so such an issue will not occur.

Quay answered 17/12, 2020 at 8:36 Comment(9)
I don't think you read my entire answer then, because I covered the subject of ooo execution.Giraldo
@DavidHaim well, this is the only purpose of memory order "consume". I checked the article you referenced; I cannot say for certain, but the use of "consume" they propose sounds very wrong. Reading the pointer g via consume doesn't guarantee by any means that accessing the pointer's data will be done properly. Even if the CPU can somehow ensure the dependent load - which I doubt - the compiler can still screw it up by making assumptions. Perhaps the reason consume works de facto is that compiler writers were confused by this instruction and implemented it as acquire.Quay
I'm not the OP. I'm the one you said his answer is not entirely correct (implying that I did not mention the reordering stuff, but I did)Giraldo
@DavidHaim no, indeed. I just read OP's quote. You have a long answer... my suspicion is that OP didn't properly understand it and the referenced link he posted is wrong.Quay
@ALX23z: yes the compiler can break stuff if you use release + relaxed. But if the asm has a dependency of the 2nd address on the first load result, it's actually really hard for the CPU to violate causality and somehow know where to load from before it has an address for that load. All(?) modern ISAs except DEC Alpha guarantee dependency ordering on paper so it's not optional or luck that it works in hardware, it's guaranteed by CPU vendors. (But like I said, it's basically free for them to implement in a standard OoO exec machine; only independent work can be reordered.)Reliant
The actual mechanism by which a few models of DEC Alpha CPUs could somehow violate causality / dependency ordering is very obscure, but is a good example of why it's usually not safe to assume "no CPU design could ever actually violate this assumption I want to make". For details, see Dependent loads reordering in CPU, and for other commentary see a quote from Linus Torvalds about Alphas: Memory order consume usage in C11. Value-prediction for loads could break it, but no real CPUs do that (yet?).Reliant
@PeterCordes "yes the compiler can break stuff if you use release + relaxed." I believe it can break stuff even with "release + consume". Say, you write x=3 and do not mofify then later read atomic y with consume - at this point I believe the compiler can assume that x==3 so no loading even needs to be scheduled.Quay
@PeterCordes what about cached data? With "consume" the CPU doesn't need to synchronize the cache. What if another thread/core modified data that was previously cached on this thread/core? Since no appropriate memory fence was triggered, you get incorrect values from the cache, no?Quay
@Quay The cache is always clean/up to date. It has to be in a reasonable arch designed for MT.Slipnoose
