How to achieve a StoreLoad barrier in C++11?

Asked 4/2, 2020 at 9:10 Answered 13/2, 2020 at 20:50

Solved c++ language-lawyer atomic memory-barriers stdatomic

I want to write portable code (Intel, ARM, PowerPC...) which solves a variant of a classic problem:

Initially: X=Y=0

Thread A:
  X=1
  if(!Y){ do something }
Thread B:
  Y=1
  if(!X){ do something }

in which the goal is to avoid a situation in which both threads are doing something. (It's fine if neither thing runs; this isn't a run-exactly-once mechanism.) Please correct me if you see some flaws in my reasoning below.

I am aware that I can achieve the goal with memory_order_seq_cst atomic stores and loads as follows:

std::atomic<int> x{0},y{0};
void thread_a(){
  x.store(1);
  if(!y.load()) foo();
}
void thread_b(){
  y.store(1);
  if(!x.load()) bar();
}

which achieves the goal, because there must be some single total order on the
{x.store(1), y.store(1), y.load(), x.load()} events, which must agree with program order "edges":

x.store(1) "in TO is before" y.load()
y.store(1) "in TO is before" x.load()

and if foo() was called, then we have additional edge:

y.load() "reads value before" y.store(1)

and if bar() was called, then we have additional edge:

x.load() "reads value before" x.store(1)

and all these edges combined together would form a cycle:

x.store(1) "in TO is before" y.load() "reads value before" y.store(1) "in TO is before" x.load() "reads value before" x.store(1)

which violates the fact that orders have no cycles.
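The argument above can be exercised with a small stress harness. This is only a sketch: run_trials, foo_ran and bar_ran are hypothetical names that do not appear in the question, and a clean run on one machine is evidence rather than proof. The actual guarantee comes from the seq_cst total order, not from the test.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical harness: returns the number of trials in which both
// branches ran. With default (seq_cst) stores and loads it must be 0.
int run_trials(int trials) {
    assert(trials > 0);
    int both = 0;
    for (int i = 0; i < trials; ++i) {
        std::atomic<int> x{0}, y{0};
        bool foo_ran = false, bar_ran = false;  // each written by one thread only

        std::thread a([&] {
            x.store(1);                     // seq_cst store
            if (!y.load()) foo_ran = true;  // stands in for foo()
        });
        std::thread b([&] {
            y.store(1);                     // seq_cst store
            if (!x.load()) bar_ran = true;  // stands in for bar()
        });
        a.join();
        b.join();  // join() synchronizes, so reading the flags below is race-free

        if (foo_ran && bar_ran) ++both;
    }
    return both;
}
```

Calling run_trials(1000) from main and checking that the result is 0 is the intended use. Trials in which neither flag is set are expected and are not counted as failures.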

I intentionally use the non-standard terms "in TO is before" and "reads value before", as opposed to standard terms like happens-before, because I want to solicit feedback about the correctness of my assumption that these edges indeed imply a happens-before relation, can be combined together in a single graph, and that a cycle in such a combined graph is forbidden. I am not sure about that. What I know is that this code produces correct barriers on Intel gcc & clang and on ARM gcc.

Now, my real problem is a bit more complicated, because I have no control over "X" - it's hidden behind some macros, templates etc. and might be weaker than seq_cst.

I don't even know if "X" is a single variable, or some other concept (e.g. a light-weight semaphore or mutex). All I know is that I have two macros set() and check() such that check() returns true "after" another thread has called set(). (It is also known that set and check are thread-safe and can't create data-race UB.)

So conceptually set() is somewhat like "X=1" and check() is like "X", but I have no direct access to atomics involved, if any.

void thread_a(){
  set();
  if(!y.load()) foo();
}
void thread_b(){
  y.store(1);
  if(!check()) bar();
}

I'm worried, that set() might be internally implemented as x.store(1,std::memory_order_release) and/or check() might be x.load(std::memory_order_acquire). Or hypothetically a std::mutex that one thread is unlocking and another is try_locking; in the ISO standard std::mutex is only guaranteed to have acquire and release ordering, not seq_cst.

If this is the case, then check()'s if body can be "reordered" before y.store(1) (see Alex's answer, where they demonstrate that this happens on PowerPC).
This would be really bad, as now this sequence of events is possible:

thread_b() first loads the old value of x (0)
thread_a() executes everything including foo()
thread_b() executes everything including bar()

So, both foo() and bar() got called, which I had to avoid. What are my options to prevent that?
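To make the worry concrete, here is a minimal sketch of what the opaque macros might look like. Everything in the hidden namespace is an assumption made for illustration (the real set() and check() are unknown), and foo_calls / bar_calls are hypothetical counters standing in for foo() and bar().

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-ins for the opaque macros; the real ones are unknown.
namespace hidden {
std::atomic<int> x{0};
inline void set()   { x.store(1, std::memory_order_release); }
inline bool check() { return x.load(std::memory_order_acquire) != 0; }
}  // namespace hidden

std::atomic<int> y{0};
int foo_calls = 0, bar_calls = 0;  // stand-ins for foo() and bar()

void thread_a() {
    hidden::set();
    // The release store above is not a seq_cst operation, so the
    // seq_cst total order says nothing about it relative to this load.
    if (!y.load()) ++foo_calls;
}

void thread_b() {
    y.store(1);
    // Likewise the acquire load inside check() is outside the total order.
    if (!hidden::check()) ++bar_calls;
}

// Run on a single thread, one after the other, the stand-ins behave as
// expected; the problem described above can only show up under concurrency.
void sequential_demo() {
    thread_a();
    thread_b();
    assert(foo_calls == 1);
    assert(bar_calls == 0);
}
```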

Option A

Try to force a StoreLoad barrier. This, in practice, can be achieved by std::atomic_thread_fence(std::memory_order_seq_cst); - as explained by Alex in a different answer, all tested compilers emitted a full fence:

x86_64: MFENCE

PowerPC: hwsync

Itanium: mf

ARMv7 / ARMv8: dmb ish

MIPS64: sync

The problem with this approach is that I could not find any guarantee in the C++ rules that std::atomic_thread_fence(std::memory_order_seq_cst) must translate to a full memory barrier. Actually, the concept of atomic_thread_fences in C++ seems to be at a different level of abstraction than the assembly concept of memory barriers and deals more with stuff like "what atomic operation synchronizes with what". Is there any theoretical proof that the implementation below achieves the goal?

void thread_a(){
  set();
  std::atomic_thread_fence(std::memory_order_seq_cst);
  if(!y.load()) foo();
}
void thread_b(){
  y.store(true);
  std::atomic_thread_fence(std::memory_order_seq_cst);
  if(!check()) bar();
}

Option B

Use control we have over Y to achieve synchronization, by using read-modify-write memory_order_acq_rel operations on Y:

void thread_a(){
  set();
  if(!y.fetch_add(0,std::memory_order_acq_rel)) foo();
}
void thread_b(){
  y.exchange(1,std::memory_order_acq_rel);
  if(!check()) bar();
}

The idea here is that accesses to a single atomic (y) must form a single order on which all observers agree, so either fetch_add is before exchange or vice-versa.

If fetch_add is before exchange then the "release" part of fetch_add synchronizes with the "acquire" part of exchange and thus all side effects of set() have to be visible to code executing check(), so bar() will not be called.

Otherwise, exchange is before fetch_add, then the fetch_add will see 1 and not call foo(). So, it is impossible to call both foo() and bar(). Is this reasoning correct?
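A sketch of a harness for this reasoning follows. The names run_option_b and hidden_x are hypothetical: hidden_x stands in for whatever set() and check() touch, and it uses memory_order_relaxed on purpose to model the weakest plausible case, since the argument relies only on the acq_rel read-modify-write operations on y.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical harness for Option B: returns the number of trials in
// which both branches ran. The reasoning above says this must be 0.
int run_option_b(int trials) {
    assert(trials > 0);
    int both = 0;
    for (int i = 0; i < trials; ++i) {
        std::atomic<int> hidden_x{0}, y{0};
        bool foo_ran = false, bar_ran = false;

        std::thread a([&] {
            hidden_x.store(1, std::memory_order_relaxed);  // models set()
            // RMW that leaves y unchanged; its release half publishes set().
            if (!y.fetch_add(0, std::memory_order_acq_rel)) foo_ran = true;
        });
        std::thread b([&] {
            // RMW whose acquire half pairs with fetch_add if it comes later
            // in y's modification order.
            y.exchange(1, std::memory_order_acq_rel);
            if (!hidden_x.load(std::memory_order_relaxed)) bar_ran = true;  // models !check()
        });
        a.join();
        b.join();

        if (foo_ran && bar_ran) ++both;
    }
    return both;
}
```

The cost of this approach is that both threads now write to y's cache line, which is the contention trade-off to weigh against a fence-based solution.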

Option C

Use dummy atomics, to introduce "edges" which prevent disaster. Consider following approach:

void thread_a(){
  std::atomic<int> dummy1{};
  set();
  dummy1.store(13);
  if(!y.load()) foo();
}
void thread_b(){
  std::atomic<int> dummy2{};
  y.store(1);
  dummy2.load();
  if(!check()) bar();
}

If you think the problem here is that the atomics are local, then imagine moving them to global scope; in the following reasoning it does not appear to matter. I intentionally wrote the code this way to expose how funny it is that dummy1 and dummy2 are completely separate.

Why on Earth might this work? Well, there must be some single total order of {dummy1.store(13), y.load(), y.store(1), dummy2.load()} which has to be consistent with program order "edges":

dummy1.store(13) "in TO is before" y.load()
y.store(1) "in TO is before" dummy2.load()

(A seq_cst store + load hopefully form the C++ equivalent of a full memory barrier including StoreLoad, like they do in asm on real ISAs including even AArch64 where no separate barrier instructions are required.)

Now, we have two cases to consider: either y.store(1) is before y.load() or after in the total order.

If y.store(1) is before y.load() then foo() will not be called and we are safe.

If y.load() is before y.store(1), then combining it with the two edges we already have in program order, we deduce that:

dummy1.store(13) "in TO is before" dummy2.load()

Now, the dummy1.store(13) is a release operation, which releases effects of set(), and dummy2.load() is an acquire operation, so check() should see the effects of set() and thus bar() will not be called and we are safe.

Is it correct here to think that check() will see the results of set()? Can I combine the "edges" of various kinds ("program order" aka Sequenced Before, "total order", "before release", "after acquire") like that? I have serious doubts about this: C++ rules seem to talk about "synchronizes-with" relations between store and load on same location - here there is no such situation.

Note that we're only worried about the case where dummy1.store is known (via other reasoning) to be before dummy2.load in the seq_cst total order. So if they had been accessing the same variable, the load would have seen the stored value and synchronized with it.

(The memory-barrier / reordering reasoning for implementations where atomic loads and stores compile to at least 1-way memory barriers (and seq_cst operations can't reorder: e.g. a seq_cst store can't pass a seq_cst load) is that any loads/stores after dummy2.load definitely become visible to other threads after y.store. And similarly for the other thread, ... before y.load.)

You can play with my implementation of Options A,B,C at https://godbolt.org/z/u3dTa8

Preemption answered 4/2, 2020 at 9:10 Comment(14)

The C++ memory model doesn't have any concept of StoreLoad reordering, only Synchronizes-with and happens-before. (And UB on data races on non-atomic objects, unlike asm for real hardware.) On all real implementations I'm aware of, std::atomic_thread_fence(std::memory_order_seq_cst) does compile to a full barrier, but since the entire concept is an implementation detail you won't find any mention of it in the standard. (CPU memory models usually are defined in terms of what reorderings are allowed relative to sequential consistency. e.g. x86 is seq-cst + a store buffer w/ forwarding) – Guadalupeguadeloupe 4/2, 2020 at 9:30

@PeterCordes thanks, I might have been not clear in my writing. I wanted to convey what you wrote in the section "Option A". I know the title of my question uses word "StoreLoad", and that "StoreLoad" is a concept from a completely different world. My problem is how to map this concept into C++. Or if it can not be mapped directly, then how to achieve the goal I've posed: prevent foo() and bar() from both being called. – Preemption 4/2, 2020 at 9:37

In your examples, I'm wondering why do you use std::atomic<int> instead of std::atomic<bool> ? Or even much better: std::atomic_flag which seems to exactly fit your use-case. – Clichy 4/2, 2020 at 10:2

I hadn't read the full question yet because it's long, just replying to the title. Looks like there is a real question here despite the title, but there's a potential showstopper as far as avoiding UB and depending only on the wording of the standard: Are you sure that check() and set() even use an atomic object at all? If it's a plain int or something, you can have data-race UB. (Note that even volatile doesn't avoid data-race UB in ISO C++; it's only usable as a roll-your-own mo_relaxed (or stronger with barriers) on real implementations.) – Guadalupeguadeloupe 4/2, 2020 at 10:3

@Clichy thanks, originally I've used bool, because, as you've said, it's more natural. Once I've got to spelling out "Option B", I've realized I know of no way to "read-modify-write" std::atomic<bool> in a way which does not really change its value. So, I tediously went back and reworked everything to int, as for atomic<int> I have plenty to choose from (fetch_add(0), fetch_or(0), fetch_sub(0),..). Surprisingly to me, std::atomic<bool> has no fetch_or(false) implemented. Let me know if there is a way to "just read, but in a way which is both acquire and release" for bool :) – Preemption 4/2, 2020 at 10:23

If I'm seeing this right, it's possible that neither foo nor bar runs for at least the first version (with seq-cst x and y). You know that, right? – Guadalupeguadeloupe 4/2, 2020 at 10:35

@Preemption I understand. Perhaps you would be interested by std::atomic_flag instead and its test_and_set function. (I know my comments are not directly related to your issue, but only for information :) ) – Clichy 4/2, 2020 at 10:35

You can use compare_exchange_* to perform an RMW operation on an atomic bool without changing its value (simply set expected and new to the same value). – Capacitate 4/2, 2020 at 10:37

@Clichy and qbolec: atomic<bool> has exchange and compare_exchange_weak. The latter can be used to do a dummy RMW by (attempting to) CAS(true, true) or false,false. It either fails or atomically replaces the value with itself. (In x86-64 asm, that trick with lock cmpxchg16b is how you do guaranteed-atomic 16-byte loads; inefficient but less bad than taking a separate lock.) – Guadalupeguadeloupe 4/2, 2020 at 10:38

@mpoeter: heh, didn't see your comment until I posted mine, a few seconds later. Possible caveat: The failure side of a CAS only counts as a load in C++ I think, not a real RMW. On LL/SC machines it really is just a load and a conditional branch; the store-conditional doesn't execute. – Guadalupeguadeloupe 4/2, 2020 at 10:41
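The compare-exchange trick from the comments above can be sketched as follows. rmw_read and rmw_read_demo are hypothetical names. As the caveat notes, a failed compare-exchange counts only as a load, so the helper takes a guess: the operation is a real read-modify-write only when the flag currently equals that guess.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: reads an atomic<bool> without changing it.
// If the flag equals `guess`, the CAS succeeds and replaces the value with
// itself (a real RMW with acq_rel ordering). Otherwise the CAS fails and
// acts only as an acquire load, writing the observed value into `expected`.
bool rmw_read(std::atomic<bool>& flag, bool guess) {
    bool expected = guess;
    flag.compare_exchange_strong(expected, guess,
                                 std::memory_order_acq_rel,   // success ordering
                                 std::memory_order_acquire);  // failure ordering
    return expected;  // in both cases this is the value that was observed
}

void rmw_read_demo() {
    std::atomic<bool> flag{false};
    assert(rmw_read(flag, false) == false);  // succeeds: false replaced by false
    assert(rmw_read(flag, true) == false);   // fails: plain acquire load
    assert(flag.load() == false);            // value never changed
    flag.store(true);
    assert(rmw_read(flag, true) == true);    // succeeds: true replaced by true
    assert(flag.load() == true);
}
```

For a use like Option B, where the reading thread needs the release half exactly when it observes false, the guess would have to be false; with a guess of true that case degrades to a plain load.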

@PeterCordes yes I know it can happen that neither foo() nor bar() will be called. I didn't want to bring too many "real world" elements of the code, to avoid "you think you have problem X but you have problem Y" kind of responses. But, if one really needs to know what is the background story: set() is really some_mutex_exit(), check() is try_enter_some_mutex(), y is "there are some waiters", foo() is "exit without waking up anyone", bar() is "wait for wakeup"... But, I refuse to discuss this design here - I can't change it really. – Preemption 4/2, 2020 at 10:50

That's fine, I just wanted to cover the bases of that case being possible for this synchronization algorithm. It sounds like set and check are known to be thread safe; you could add that to the question. – Guadalupeguadeloupe 4/2, 2020 at 10:59

qbolec and @mpoeter: I updated the question to better explain why one might think that Option C could work, by thinking of C++ atomics in terms of memory reordering and fences. (i.e. the way they're actually implemented by real-world implementations). This is of course flawed reasoning whether or not it happens to give the right answer; the way I stated it hopefully makes that clear. Remove it if you think it's too much of a distraction or isn't at all what you were thinking. There's still the reasoning based on total order and Sequenced before / after – Guadalupeguadeloupe 5/2, 2020 at 13:59

Related: Does atomic_thread_fence(memory_order_seq_cst) have the semantics of a full memory barrier? – Guadalupeguadeloupe 5/4, 2022 at 3:12

Options A and B are valid solutions.

Option A: it doesn't really matter what a seq-cst fence translates to, the C++ standard clearly defines what guarantees it provides. I have laid them out in this post: When is a memory_order_seq_cst fence useful?
Option B: yes, your reasoning is correct. All modifications on some object have a single total order (the modification order), so you can use that to synchronize the threads and ensure visibility of all side-effects.

However, Option C is not valid! A synchronizes-with relation can only be established by acquire/release operations on the same object. In your case you have two completely different and independent objects, dummy1 and dummy2, and these cannot be used to establish a happens-before relation. In fact, since the atomic variables are purely local (i.e., they are only ever touched by one thread), the compiler is free to remove them based on the as-if rule.

Update

Option A:
I assume set() and check() do operate on some atomic value. Then we have the following situation (-> denotes sequenced-before):

set()-> fence1(seq_cst) -> y.load()
y.store(true) -> fence2(seq_cst) -> check()

So we can apply the following rule:

For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there are memory_order_seq_cst fences X and Y such that A is sequenced before X, Y is sequenced before B, and X precedes Y in S, then B observes either the effects of A or a later modification of M in its modification order.

I.e., either check() sees the value stored in set(), or y.load() sees the value written by y.store() (the operations on y can even use memory_order_relaxed).
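A sketch of the fence rule in action, under the assumption stated above that set() and check() operate on some atomic. The names run_option_a and hidden_x are hypothetical; hidden_x models the opaque pair as a release store and an acquire load, and the operations on y are deliberately relaxed to show that the two seq_cst fences carry the argument.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical harness for Option A: returns the number of trials in
// which both branches ran. The quoted fence rule says this must be 0.
int run_option_a(int trials) {
    assert(trials > 0);
    int both = 0;
    for (int i = 0; i < trials; ++i) {
        std::atomic<int> hidden_x{0}, y{0};
        bool foo_ran = false, bar_ran = false;

        std::thread a([&] {
            hidden_x.store(1, std::memory_order_release);         // models set()
            std::atomic_thread_fence(std::memory_order_seq_cst);  // fence1
            if (!y.load(std::memory_order_relaxed)) foo_ran = true;
        });
        std::thread b([&] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);  // fence2
            if (!hidden_x.load(std::memory_order_acquire)) bar_ran = true;  // models !check()
        });
        a.join();
        b.join();

        if (foo_ran && bar_ran) ++both;
    }
    return both;
}
```

Whichever fence comes first in S, the rule applies to the store sequenced before it and the load sequenced after the other fence: fence1 first means check() observes set(), fence2 first means y.load() observes y.store(1).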

Option C:
The C++17 standard states [32.4.3, p1347]:

There shall be a single total order S on all memory_order_seq_cst operations, consistent with the "happens before" order and modification orders for all affected locations [...]

The important word here is "consistent". It implies that if an operation A happens-before an operation B, then A must precede B in S. However, logical implication is a one-way-street, so we cannot infer the inverse: just because some operation C precedes an operation D in S does not imply that C happens before D.

In particular, two seq-cst operations on two separate objects cannot be used to establish a happens before relation, even though the operations are totally ordered in S. If you want to order operations on separate objects, you have to refer to seq-cst-fences (see Option A).

Capacitate answered 4/2, 2020 at 10:6 Comment(19)

It's not obvious that Option C is invalid. seq-cst operations even on private objects can still order other operations to some degree. Agreed there's no synchronizes-with, but we don't care which of foo or bar runs (or apparently neither), just that they don't both run. The sequenced-before relationship and the total order of seq-cst operations (which must exist) does I think give us that. – Guadalupeguadeloupe 4/2, 2020 at 10:34

Thank you @mpoeter. Could you please elaborate about Option A. Which of the three bullets in your answer apply here? IIUC if y.load() does not see effect of y.store(1), then we can prove from the rules that in S, atomic_thread_fence of thread_a is before atomic_thread_fence of thread_b. What I don't see is how to get from this to conclusion that set() side effects are visible to check(). – Preemption 4/2, 2020 at 10:36

@qbolec: I have updated my answer with more details about option A. – Capacitate 4/2, 2020 at 10:50

Would you agree that "Option B" seems safer than "Option A", as it does not require the additional assumption that set() and check() use (the same) atomic variable inside? I'd like to have a code which makes the minimum assumptions about set() and check() internals. – Preemption 4/2, 2020 at 10:54

@Peter Cordes: I tend to disagree. What ordering guarantees would a local atomic variable introduce? Even for atomics, the compiler is free to apply various optimizations based on the as-if rule. See open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html – Capacitate 4/2, 2020 at 11:0

If check() and set() do not use atomics, you could end up with undefined behaviour anyway. Couldn't you just use a lock to serialize the function calls, or is this performance critical? – Capacitate 4/2, 2020 at 11:5

Possibly I'm making the mistake of mapping C++ operations to asm "too early" and thinking of them in terms of store buffer and loadstore / storeload reordering. Hmm, I think I was thinking of a previous debate about what the standard allows for optimization of atomics (which current compilers don't do) where the subject of optimizing away an atomic decrement / increment of a shared var came up (e.g. unlock/re-lock). That still has to separate relaxed operations before from after (so they can't escape critical sections) Can num++ be atomic for 'int num'? – Guadalupeguadeloupe 4/2, 2020 at 11:8

But operations on a seq-cst local are still part of the global total order of all seq-cst operations. Doesn't that interact with sequenced-before in a way that lets us exclude the possibility of both foo and bar executing, even if it doesn't create a synchronizes-with? (Getting sleepy, I might take another look at this tomorrow.) – Guadalupeguadeloupe 4/2, 2020 at 11:11

@Capacitate yes, it is performance critical, it's actually inside the implementation of our synchronization primitives themselves. I think we must assume that set() and check() are safe to be executed in parallel, as they internally have some synchronization primitives, but we should not assume what are these (they are maintained as separate "module" and have their own "issues"). – Preemption 4/2, 2020 at 11:25

Yes, a local seq-cst operation would still be part of the single total order S on all seq-cst operations. But S is "only" consistent with the happens-before order and modification orders, i.e., if A happens-before B, then A must precede B in S. But the inverse is not guaranteed, i.e., just because A precedes B in S, we cannot deduce, that A happens-before B. – Capacitate 4/2, 2020 at 11:28

Well, assuming that set and check can safely be executed in parallel, I would probably go with Option A, especially if this is performance critical, since it avoids contention on the shared variable y. – Capacitate 4/2, 2020 at 11:32

Don't forget to @ notify people so they see your comments. After looking more closely at @qbolec's reasoning for C, yes it totally falls apart and does appear to rely on a non-existent synchronizes-with. It's certainly safe in practice on an implementation on a multi-copy-atomic asm memory model (stores become visible to all threads at once, no IRIW reordering; POWER violates this with store-forwarding between logical cores of a physical core). Given a multi-copy-atomic model like many ISAs have (e.g. ARMv8), any seq-cst store then load form a full memory barrier. But maybe not otherwise – Guadalupeguadeloupe 4/2, 2020 at 12:0

And in terms of the ISO C++ standard, I'm not sure I see any required ordering that would be inconsistent that the OP didn't mention. Good point that only seq-cst operations are part of the global total order that must exist, so we can't go from that to ordering wrt. non-seq-cst operations. I think that's where it breaks down. Going into that level of detail in your answer would be good, IMO; that's one of the more interesting parts of this question. – Guadalupeguadeloupe 4/2, 2020 at 12:7

@Peter Cordes sorry, I wasn't aware that you would only get a notification if you are mentioned via @. Yes, there is a good chance that it would work in practice on many architectures. But since we are working in C++ we have to play by the rules the standard dictates, and since Option C contains nothing that establishes a happens-before relation, there is no guarantee that we will see the latest value. I will update my answer to provide more details on that. – Capacitate 4/2, 2020 at 12:28

No worries, it can be non-obvious because SO sends notifications to the user who owns the post you're commenting under, as well as for @user, so I don't need to @ you, only vice versa. Anyway, I brought up real ISAs because if you can find a case where a real ISA would break something the way compilers normally compile it, that's pretty definitive. And also figuring out what ISA rules make it safe on some ISAs can help us see exactly what it relies on that C++ doesn't guarantee. (Often it comes down to different threads being allowed to disagree about the order of non-seq-cst events.) – Guadalupeguadeloupe 4/2, 2020 at 12:40

@Peter Cordes I have updated my answer - please let me know if you think it still misses important details. – Capacitate 4/2, 2020 at 12:48

@Preemption and mpoeter: I think the real sticking point that Option C depends on is that some hypothetical observer could synchronize-with y and the dummy operations. Then it would definitely have to observe the effects of check() after the effects of y.store(1). And similar for the other thread. This is where the reasoning based on "combining edges" comes from, I think. (And it doesn't work because C++ doesn't guarantee anything for non-seq_cst unless there actually is an observer, and then only for that observer. Until then it's tree falls in the forest / Schroedinger's cat territory.) – Guadalupeguadeloupe 5/2, 2020 at 14:14

So yes, perhaps a compiler that can see that dummy2 doesn't escape the function could actually remove that seq_cst load. If compiling for AArch64, that would allow an earlier seq_cst store to reorder in practice with later relaxed operations, which wouldn't have been possible with a seq_cst store + load draining the store buffer before any later loads could execute. (Of course current compilers don't optimize atomics at all, even though they're allowed to; that's an unsolved problem.) This is allowed I think because the C++ memory model isn't multi-copy atomic; there's no implicit observer. – Guadalupeguadeloupe 5/2, 2020 at 14:18

mpoeter and @qbolec: Posted an answer with my thoughts on exactly why Option C isn't formally guaranteed. I had been going to just expand those last 2 comments into an answer, but it's not really IRIW / threads not having to agree on ordering that's the issue. It's simply the total lack of any ordering requirement on non-seq_cst ops outside of a happens-before. This answer (that I'm commenting under) at first seemed too simplistic (it's not ordered because there's no happens-before), so I wanted to expand on exactly why that is. – Guadalupeguadeloupe 5/2, 2020 at 16:10

@mpoeter explained why Options A and B are safe.

In practice on real implementations, I think Option A only needs std::atomic_thread_fence(std::memory_order_seq_cst) in Thread A, not B.

seq-cst stores in practice include a full memory barrier, or on AArch64 at least can't reorder with later acquire or seq_cst loads (stlr sequential-release has to drain from the store buffer before ldar can read from cache).

C++ -> asm mappings have a choice of putting the cost of draining the store buffer on atomic stores or atomic loads. The sane choice for real implementations is to make atomic loads cheap, so seq_cst stores include a full barrier (including StoreLoad), while seq_cst loads are the same as acquire loads on most ISAs.

(But not POWER; there even loads need heavy-weight sync = full barrier to stop store-forwarding from other SMT threads on the same core which could lead to IRIW reordering, because seq_cst requires all threads to be able to agree on the order of all seq_cst ops. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?)

(Of course for a formal guarantee of safety, we do need a fence in both to promote acquire/release set() -> check() into a seq_cst synchronizes-with. Would also work for a relaxed set, I think, but a relaxed check could reorder with bar from the POV of other threads.)

I think the real problem with Option C is that it depends on some hypothetical observer that could synchronize-with y and the dummy operations. And thus we expect the compiler to preserve that ordering when making asm for a barrier-based ISA, where there is a single coherent shared memory state and barriers order this core/thread's access to that shared state. See also C11 Standalone memory barriers LoadLoad StoreStore LoadStore StoreLoad for more about this model vs. the stdatomic synchronizes-with ordering model for barriers weaker than seq_cst.

This is going to be true in practice on real ISAs; both threads include a full barrier or equivalent and compilers don't (yet) optimize atomics. But of course "compiling to a barrier-based ISA" isn't part of the ISO C++ standard. Coherent shared cache is the hypothetical observer that exists for asm reasoning but not for ISO C++ reasoning.

For Option C to work, we need an ordering like dummy1.store(13); / y.load() / set(); (as seen by Thread B) to violate some ISO C++ rule.

The thread running these statements has to behave as if set() executed first (because of Sequenced Before). That's fine, runtime memory ordering and/or compile time reordering of operations could still do that.

The two seq_cst ops d1=13 and y are consistent with the Sequenced Before (program order). set() doesn't participate in the required-to-exist global order for seq_cst ops because it's not seq_cst.

Thread B doesn't synchronize-with dummy1.store so no happens-before requirement on set relative to d1=13 applies, even though that assignment is a release operation.

I don't see any other possible rule violations; I can't find anything here that is required to be consistent with the set Sequenced-Before d1=13.

The "dummy1.store releases set()" reasoning is the flaw. That ordering only applies for a real observer that synchronizes-with it, or in asm. As @mpoeter answered, the existence of the seq_cst total order doesn't create or imply happens-before relationships, and that's the only thing that formally guarantees ordering outside of seq_cst.

Any kind of "normal" CPU with coherent shared cache where this reordering could really happen at runtime doesn't seem plausible. (But if a compiler could remove dummy1 and dummy2 then clearly we'd have a problem, and I think that's allowed by the standard.)

But since the C++ memory model isn't defined in terms of a store buffer, shared coherent cache, or litmus tests of allowed reordering, things required by sanity are not formally required by C++ rules. This is perhaps intentional to allow optimizing away even seq_cst variables that turn out to be thread private. (Current compilers don't do that, of course, or any other optimization of atomic objects.)

An implementation where one thread really could see set() last while another could see set() first sounds implausible. Not even POWER could do that; both seq_cst load and store include full barriers for POWER. (I had suggested in comments that IRIW reordering might be relevant here; C++'s acq/rel rules are weak enough to accommodate that, but the total lack of guarantees outside of synchronizes-with or other happens-before situations is much weaker than any HW.)

C++ doesn't guarantee anything for non-seq_cst unless there actually is an observer, and then only for that observer. Without one we're in Schroedinger's cat territory. Or, if two trees fall in the forest, did one fall before the other? (If it's a big forest, general relativity says it depends on the observer and that there's no universal concept of simultaneity.)

@mpoeter suggested a compiler could even remove the dummy load and store operations, even on seq_cst objects.

I think that may be correct when they can prove that nothing can synchronize with an operation. e.g. a compiler that can see that dummy2 doesn't escape the function can probably remove that seq_cst load.

This has at least one real-world consequence: if compiling for AArch64, that would allow an earlier seq_cst store to reorder in practice with later relaxed operations, which wouldn't have been possible with a seq_cst store + load draining the store buffer before any later loads could execute.

Of course current compilers don't optimize atomics at all, even though ISO C++ doesn't forbid it; that's an unsolved problem for the standards committee.

This is allowed I think because the C++ memory model doesn't have an implicit observer or a requirement that all threads agree on ordering. It does provide some guarantees based on coherent caches, but it doesn't require visibility to all threads to be simultaneous.

Guadalupeguadeloupe answered 5/2, 2020 at 16:4 Comment(2)

Nice summary! I agree that in practice it would probably suffice if only thread A had a seq-cst fence. However, based on the C++ standard we would not have the necessary guarantee that we see the latest value from set(), so I would still use the fence in thread B as well. I suppose a relaxed-store with a seq-cst fence would generate almost the same code as a seq-cst-store anyway. – Capacitate 6/2, 2020 at 15:6

@mpoeter: yup, I was only talking about in practice, not formally. Added a note at the end of that section. And yes, in practice on most ISAs I think a seq_cst store is usually just plain store (relaxed) + a barrier. Or not; on POWER a seq-cst store does a (heavy-weight) sync before the store, nothing after. godbolt.org/z/mAr72P But seq-cst loads need some barriers on both sides. – Guadalupeguadeloupe 7/2, 2020 at 2:36

In the first example, y.load() reading 0 does not imply that y.load() happens before y.store(1).

It does imply however that it is earlier in the single total order thanks to the rule that a seq_cst load returns either the value of the last seq_cst store in the total order, or the value of some non-seq_cst store that doesn't happen before it (which in this case doesn't exist). So if y.store(1) was earlier than y.load() in the total order, y.load() would have returned 1.

The proof is still correct because the single total order doesn't have a cycle.

How about this solution?

std::atomic<int> x2{0},y{0};

void thread_a(){
  set();
  x2.store(1);
  if(!y.load()) foo();
}

void thread_b(){
  y.store(1);
  if(!x2.load()) bar();
}

Sarsaparilla answered 5/2, 2020 at 12:45 Comment(4)

The OP's problem is that I have no control over "X" - it's behind wrapper macros or something and might not be seq-cst store / load. I updated the question to highlight that better. – Guadalupeguadeloupe 5/2, 2020 at 14:57

@PeterCordes The idea was to create another "x" that he does have control over. I'll rename it to "x2" in my answer to make it clearer. I'm sure I'm missing some requirement, but if the only requirement is to make sure that foo() and bar() are not both called, then this satisfies that. – Sarsaparilla 5/2, 2020 at 15:0

So would if(false) foo(); but I think the OP doesn't want that either :P Interesting point but I think the OP does want the conditional calls to be based on the conditions they specify! – Guadalupeguadeloupe 5/2, 2020 at 15:3

Hi @TomekCzajka, thanks for taking time to propose new solution. It wouldn't work in my particular case, as it omits important side-effects of check() (see my comment to my question for real-world meaning of set,check,foo,bar). I think it could work with if(!x2.load()){ if(check())x2.store(0); else bar(); } instead. – Preemption 7/2, 2020 at 9:15

in the ISO standard std::mutex is only guaranteed to have acquire and release ordering, not seq_cst.

But nothing is guaranteed to have "seq_cst ordering", as seq_cst is not a property of any operation.

seq_cst is a guarantee over all operations of a given implementation of std::atomic or an alternative atomic class. As such, your question is unsound.

Heptane answered 13/2, 2020 at 20:50 Comment(0)
