I want to write portable code (Intel, ARM, PowerPC...) which solves a variant of a classic problem:
Initially: X=Y=0
Thread A:
X=1
if(!Y){ do something }
Thread B:
Y=1
if(!X){ do something }
in which the goal is to avoid a situation in which both threads are doing something
. (It's fine if neither thing runs; this isn't a run-exactly-once mechanism.)
Please correct me if you see some flaws in my reasoning below.
I am aware, that I can achieve the goal with memory_order_seq_cst
atomic store
s and load
s as follows:
std::atomic<int> x{0},y{0};
void thread_a(){
x.store(1);
if(!y.load()) foo();
}
void thread_b(){
y.store(1);
if(!x.load()) bar();
}
which achieves the goal, because there must be some single total order on the
{x.store(1), y.store(1), y.load(), x.load()}
events, which must agree with program order "edges":
x.store(1)
"in TO is before"y.load()
y.store(1)
"in TO is before"x.load()
and if foo()
was called, then we have additional edge:
y.load()
"reads value before"y.store(1)
and if bar()
was called, then we have additional edge:
x.load()
"reads value before"x.store(1)
and all these edges combined together would form a cycle:
x.store(1)
"in TO is before" y.load()
"reads value before " y.store(1)
"in TO is before" x.load()
"reads value before" x.store(true)
which violates the fact that orders have no cycles.
I intentionally use non-standard terms "in TO is before" and "reads value before" as opposed to standard terms like happens-before
, because I want to solicit feedback about correctness of my assumption that these edges indeed imply happens-before
relation, can be combined together in single graph, and the cycle in such combined graph is forbidden. I am not sure about that. What I know is this code produces correct barriers on Intel gcc & clang and on ARM gcc
Now, my real problem is a bit more complicated, because I have no control over "X" - it's hidden behind some macros, templates etc. and might be weaker than seq_cst
I don't even know if "X" is a single variable, or some other concept (e.g. a light-weight semaphore or mutex). All I know is that I have two macros set()
and check()
such that check()
returns true
"after" another thread has called set()
. (It is also known that set
and check
are thread-safe and can't create data-race UB.)
So conceptually set()
is somewhat like "X=1" and check()
is like "X", but I have no direct access to atomics involved, if any.
void thread_a(){
set();
if(!y.load()) foo();
}
void thread_b(){
y.store(1);
if(!check()) bar();
}
I'm worried, that set()
might be internally implemented as x.store(1,std::memory_order_release)
and/or check()
might be x.load(std::memory_order_acquire)
. Or hypothetically a std::mutex
that one thread is unlocking and another is try_lock
ing; in the ISO standard std::mutex
is only guaranteed to have acquire and release ordering, not seq_cst.
If this is the case, then check()
's if body can be "reordered" before y.store(true)
(See Alex's answer where they demonstrate that this happens on PowerPC).
This would be really bad, as now this sequence of events is possible:
thread_b()
first loads the old value ofx
(0
)thread_a()
executes everything includingfoo()
thread_b()
executes everything includingbar()
So, both foo()
and bar()
got called, which I had to avoid. What are my options to prevent that?
Option A
Try to force Store-Load barrier. This, in practice, can be achieved by std::atomic_thread_fence(std::memory_order_seq_cst);
- as explained by Alex in a different answer all tested compilers emitted a full fence:
- x86_64: MFENCE
- PowerPC: hwsync
- Itanuim: mf
- ARMv7 / ARMv8: dmb ish
- MIPS64: sync
The problem with this approach is, that I could not find any guarantee in C++ rules, that std::atomic_thread_fence(std::memory_order_seq_cst)
must translate to full memory barrier. Actually, the concept of atomic_thread_fence
s in C++ seems to be at a different level of abstraction than the assembly concept of memory barriers and deals more with stuff like "what atomic operation synchronizes with what". Is there any theoretical proof that below implementation achieves the goal?
void thread_a(){
set();
std::atomic_thread_fence(std::memory_order_seq_cst)
if(!y.load()) foo();
}
void thread_b(){
y.store(true);
std::atomic_thread_fence(std::memory_order_seq_cst)
if(!check()) bar();
}
Option B
Use control we have over Y to achieve synchronization, by using read-modify-write memory_order_acq_rel operations on Y:
void thread_a(){
set();
if(!y.fetch_add(0,std::memory_order_acq_rel)) foo();
}
void thread_b(){
y.exchange(1,std::memory_order_acq_rel);
if(!check()) bar();
}
The idea here is that accesses to a single atomic (y
) must be form a single order on which all observers agree, so either fetch_add
is before exchange
or vice-versa.
If fetch_add
is before exchange
then the "release" part of fetch_add
synchronizes with the "acquire" part of exchange
and thus all side effects of set()
have to be visible to code executing check()
, so bar()
will not be called.
Otherwise, exchange
is before fetch_add
, then the fetch_add
will see 1
and not call foo()
.
So, it is impossible to call both foo()
and bar()
. Is this reasoning correct?
Option C
Use dummy atomics, to introduce "edges" which prevent disaster. Consider following approach:
void thread_a(){
std::atomic<int> dummy1{};
set();
dummy1.store(13);
if(!y.load()) foo();
}
void thread_b(){
std::atomic<int> dummy2{};
y.store(1);
dummy2.load();
if(!check()) bar();
}
If you think the problem here is atomic
s are local, then imagine moving them to global scope, in the following reasoning it does not appear to matter to me, and I intentionally wrote the code in such a way to expose how funny it is that dummy1 and dummy2 are completely separate.
Why on Earth this might work? Well, there must be some single total order of {dummy1.store(13), y.load(), y.store(1), dummy2.load()}
which has to be consistent with program order "edges":
dummy1.store(13)
"in TO is before"y.load()
y.store(1)
"in TO is before"dummy2.load()
(A seq_cst store + load hopefully form the C++ equivalent of a full memory barrier including StoreLoad, like they do in asm on real ISAs including even AArch64 where no separate barrier instructions are required.)
Now, we have two cases to consider: either y.store(1)
is before y.load()
or after in the total order.
If y.store(1)
is before y.load()
then foo()
will not be called and we are safe.
If y.load()
is before y.store(1)
, then combining it with the two edges we already have in program order, we deduce that:
dummy1.store(13)
"in TO is before"dummy2.load()
Now, the dummy1.store(13)
is a release operation, which releases effects of set()
, and dummy2.load()
is an acquire operation, so check()
should see the effects of set()
and thus bar()
will not be called and we are safe.
Is it correct here to think that check()
will see the results of set()
? Can I combine the "edges" of various kinds ("program order" aka Sequenced Before, "total order", "before release", "after acquire") like that? I have serious doubts about this: C++ rules seem to talk about "synchronizes-with" relations between store and load on same location - here there is no such situation.
Note that we're only worried about the case where dumm1.store
is known (via other reasoning) to be before dummy2.load
in the seq_cst total order. So if they had been accessing the same variable, the load would have seen the stored value and synchronized with it.
(The memory-barrier / reordering reasoning for implementations where atomic loads and stores compile to at least 1-way memory barriers (and seq_cst operations can't reorder: e.g. a seq_cst store can't pass a seq_cst load) is that any loads/stores after dummy2.load
definitely become visible to other threads after y.store
. And similarly for the other thread, ... before y.load
.)
You can play with my implementation of Options A,B,C at https://godbolt.org/z/u3dTa8
std::atomic_thread_fence(std::memory_order_seq_cst)
does compile to a full barrier, but since the entire concept is an implementation detail you won't find any mention of it in the standard. (CPU memory models usually are defined in terms of what reorerings are allowed relative to sequential consistency. e.g. x86 is seq-cst + a store buffer w/ forwarding) – Guadalupeguadeloupefoo()
andbar()
from both being called. – Preemptionstd::atomic<int>
instead ofstd::atomic<bool>
? Or even much better:std::atomic_flag
which seems to exactly fit your use-case. – Clichycheck()
andset()
even use anatomic
object at all? If it's a plainint
or something, you can have data-race UB. (Note that evenvolatile
doesn't avoid data-race UB in ISO C++; it's only usable as a roll-your-own mo_relaxed (or stronger with barriers) on real implementations.) – Guadalupeguadeloupebool
, because, as you've said, it's more natural. Once I've got to spelling out "Option B", I've realized I know of no way to "read-modify-write"std::atomic<bool>
in a way which does not really change its value. So, I tideously went back and reworked everything toint
, as foratomic<int>
I have plenty to choose from (fetch_add(0)
,fetch_or(0)
,fetch_sub(0)
,..). Surprisingly to mestd::atomic<bool>
has nofetch_or(false)
implemented. Let me know if there is a way to "just read, but in a way which is both acquire and release" forbool
:) – Preemptionstd::atomic_flag
instead and itstest_and_set
function. (I know my comments are not directly related to your issue, but only for information :) ) – Clichycompare_exchange_*
to perform an RMW operation on an atomic bool without changing its value (simply set expected and new to the same value). – Capacitateatomic<bool>
hasexchange
andcompare_exchange_weak
. The latter can be used to do a dummy RMW by (attempting to) CAS(true, true) or false,false. It either fails or atomically replaces the value with itself. (In x86-64 asm, that trick withlock cmpxchg16b
is how you do guaranteed-atomic 16-byte loads; inefficient but less bad than taking a separate lock.) – Guadalupeguadeloupefoo()
norbar()
will be called. I didn't want to bring to many "real world" elements of the code, to avoid "you think you have problem X but you have problem Y" kind of responses. But, if one really needs to know what is the background storey:set()
is reallysome_mutex_exit()
,check()
istry_enter_some_mutex()
,y
is "there are some waiters",foo()
is "exit without waking up anyone",bar()
is "wait for wakup"... But, I refuse to discuss this design here - I can't change it really. – Preemptionset
and check` are known to be thread safe; you could add that to the question. – Guadalupeguadeloupe