Here is a simple example of acquire-release semantics used for data synchronization across threads.
// thread 1
data = 100;
flag.store(true, std::memory_order_release);

// thread 2
while (!flag.load(std::memory_order_acquire));
assert(data == 100);
As I understand it, this accurately shows the use of acquire-release memory ordering, and the program will work as intended.
But what happens if I use standalone fences instead?
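For concreteness, the first example can be fleshed out into a complete program (the scaffolding and the name `run_acquire_release` are mine, not part of the question):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Shared state for the demo.
std::atomic<bool> flag{false};
int data = 0;

// Runs the two threads from the first example once and returns the value
// thread 2 observes in `data` after it sees flag == true.
int run_acquire_release() {
    flag.store(false, std::memory_order_relaxed);
    data = 0;
    int seen = -1;
    std::thread t1([] {
        data = 100;                                   // A: plain (non-atomic) store
        flag.store(true, std::memory_order_release);  // B: release store "publishes" A
    });
    std::thread t2([&] {
        while (!flag.load(std::memory_order_acquire)) {}  // C: spins until it reads B
        seen = data;  // B synchronizes-with C, so A happens-before this read
    });
    t1.join();
    t2.join();
    return seen;  // always 100, never 0
}
```

The release store pairs with the acquire load on the same atomic object, which is what makes the plain read of `data` race-free.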
// thread 1
data = 100;
std::atomic_thread_fence(std::memory_order_release);
flag.store(true, std::memory_order_relaxed);

// thread 2
while (!flag.load(std::memory_order_relaxed));
std::atomic_thread_fence(std::memory_order_acquire); // fence after the final, successful load
assert(data == 100);
I always thought that this was exactly equivalent to the first example.
But today I watched a CppCon-style talk by Herb Sutter (C++ and Beyond 2012: Herb Sutter - atomic Weapons). At 1:07:10 in the video, he gives an example to show that standalone fences are suboptimal. After seeing that, I am confused.
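Before getting to Herb's example, here is the fence version written out as a complete program the same way (again, the scaffolding is mine). Note that the acquire fence has to run after the load that finally reads true, so it goes after the spin loop, not as the loop body:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready{false};
int payload = 0;

// Same hand-off as before, but with standalone fences and relaxed atomics.
int run_fences() {
    ready.store(false, std::memory_order_relaxed);
    payload = 0;
    int seen = -1;
    std::thread t1([] {
        payload = 100;
        std::atomic_thread_fence(std::memory_order_release);  // orders the store above...
        ready.store(true, std::memory_order_relaxed);         // ...before this relaxed store
    });
    std::thread t2([&] {
        while (!ready.load(std::memory_order_relaxed)) {}     // spin with relaxed loads
        std::atomic_thread_fence(std::memory_order_acquire);  // fence AFTER the successful load
        seen = payload;
    });
    t1.join();
    t2.join();
    return seen;
}
```

Here the synchronization is fence-to-fence: the release fence before the relaxed store pairs with the acquire fence after the relaxed load that observed it.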
The example is this:
// thread 1
widget *temp = new widget();
XX mb(); XXXXXXXXXXXXXXXXXXXXX // a
global = temp;

// thread 2
temp2 = global;
XX mb(); XXXXXXXXXXXXXXXXXXXXX // b
temp2->do_something();

temp2 = global;
XX mb(); XXXXXXXXXXXXXXXXXXXXX
temp2->do_something_else();
He says that, at a and b, you need a full barrier, not just release and acquire, since those are not associated with any particular store or load. Furthermore, he says that standalone acquire and release barriers don't make any sense. Is this correct? (For simplicity, it is assumed that reads and writes to global are indivisible, i.e. global never contains a torn value.)
Why does this not work?
// thread 1
widget *temp = new widget();
XX release(); XXXXXXXXXXXXXX // a
global = temp;

// thread 2
temp2 = global;
XX acquire(); XXXXXXXXXXXXXX // b
temp2->do_something();

temp2 = global;
XX acquire(); XXXXXXXXXXXXXX
temp2->do_something_else();
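Herb's `mb()` / `release()` / `acquire()` markers are pseudo-code. As a point of comparison, here is a sketch of the same widget hand-off written with real C++11 standalone fences. The question's assumption that `global` never tears is modeled here by making it a `std::atomic<widget*>` accessed with relaxed operations; whether this is optimal is exactly what is being asked, the sketch only shows what the fence version looks like in standard C++:

```cpp
#include <atomic>
#include <thread>

struct widget {
    int value = 0;
    void do_something()      { value += 1; }
    void do_something_else() { value += 2; }
};

// Assumption: modeling "global never tears" with a relaxed atomic pointer.
std::atomic<widget*> global{nullptr};

void thread1() {
    widget *temp = new widget();
    std::atomic_thread_fence(std::memory_order_release);  // a: orders construction...
    global.store(temp, std::memory_order_relaxed);        // ...before the publish
}

void thread2() {
    widget *temp2;
    while ((temp2 = global.load(std::memory_order_relaxed)) == nullptr) {}
    std::atomic_thread_fence(std::memory_order_acquire);  // b: orders publish before use
    temp2->do_something();

    temp2 = global.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    temp2->do_something_else();
}

// Demo driver (mine): runs both threads and returns the widget's final value.
int run_widget_demo() {
    global.store(nullptr, std::memory_order_relaxed);
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    int v = global.load()->value;  // 1 + 2 after both member calls
    delete global.load();
    return v;
}
```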
Comments:

- [...] as an acquire load, but a separate barrier compiles to a separate barrier instruction. For 32-bit mode, it's `dmb ish` (a full barrier). AArch64 has an acquire-fence instruction, so that's less bad. – Synder
- [...] relaxed operations into acquire [...] wrt. things after the barrier, unlike an acquire operation, which isn't a 2-way barrier. So it wouldn't be a valid optimization; the barrier form is stronger. (But yes, acquire_op -> (relaxed + acquire barrier) is legal as far as my understanding goes, with the barrier version being stronger. On some ISAs, that's how the asm has to look, although there can still be a difference in terms of compile-time reordering allowed. On other ISAs, where there are acquire-load instructions, it's slower, so compilers don't.) – Synder
- `load(acquire)` can be replaced by `load(relaxed) + fence(acquire)` in every case, but you might need to wrap that in a function or statement-expression if you want to use it as a loop condition, because both have to run every time; otherwise one of the loads is only relaxed. – Synder
- `r1 = a.load(relaxed); r2 = a.load(relaxed);` can be collapsed into `r1 = a.load(relaxed); r2 = r1;` as per the standard. So I guess `while(a.load(relaxed));` can turn into `r1 = a.load(relaxed); while(r1);`. Correct me if I am wrong. This optimization is not possible if there is a barrier after the load. (Apparently, currently no compiler is doing these things; still, it is allowed.) – Anabolism
- [...] `while(!x);` and thread 2 concurrently stores `x = 1;`, still there is no data race since x can never have a torn value, isn't it? But compilers still hoist x out of the loop! – Anabolism
- [...] `x = 1;` is atomic and free of UB, and so when you compile `x = 1;` on an x86 machine, the compiler does not have to compile it into a single aligned store instruction. [...] – Amphibolite