Is there any compiler barrier which is equal to asm("" ::: "memory") in C++11?

My test code is below. I found that only memory_order_seq_cst prevented the compiler from reordering.

#include <atomic>

using namespace std;

int A, B = 1;

void func(void) {
    A = B + 1;
    atomic_thread_fence(memory_order_seq_cst);
    B = 0;
}

Other choices, such as memory_order_release and memory_order_acq_rel, did not generate any compiler barrier at all.

I think they only work together with an atomic variable, as below.

#include <atomic>

using namespace std;

atomic<int> A(0);
int B = 1;

void func(void) {
    A.store(B+1, memory_order_release);
    B = 0;
}

But I do not want to use an atomic variable. At the same time, I think asm("" ::: "memory") is too low-level.

Is there any better choice?

Expunge answered 13/11, 2016 at 22:7 Comment(2)
Sufficient for what? atomic_thread_fence does whatever is necessary to stop reordering at compile time and at run time. atomic_signal_fence only stops reordering at compile time, so other threads can observe reordering, but signal handlers that run asynchronously inside this thread won't. (Because out-of-order execution and memory reordering always preserve the behaviour of a single thread.) - Allometry
Please clarify your question, but I think the answer is: atomic_signal_fence is the compiler barrier you're looking for. It never compiles to any instructions, even with mo_seq_cst (which is equivalent to asm volatile("" ::: "memory"); in GNU C). It's confusing because you say "memory_order_acq_rel did not generate any compiler barrier at all!", but you don't show any evidence of how you checked. Of course it doesn't compile to any extra instructions on x86: x86 is strongly ordered, so every load already has acquire semantics and every store already has release semantics for free. - Allometry
A
10

re: your edit:

But I do not want to use an atomic variable.

Why not? If it's for performance reasons, use them with memory_order_relaxed and atomic_signal_fence(mo_whatever) to block compiler reordering without any runtime overhead other than the compiler barrier potentially blocking some compile-time optimizations, depending on the surrounding code.
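As a rough sketch of that suggestion (not the asker's actual code), the question's func() could be written with atomic<int> objects, relaxed accesses, and atomic_signal_fence. The demo() helper is hypothetical and only checks the single-threaded result; it says nothing about what asm a given compiler emits.

```cpp
#include <atomic>
#include <cassert>

// Same shape as the question's func(), but with atomic<int> objects so that
// concurrent access is not a data race. Every access uses memory_order_relaxed.
std::atomic<int> A{0};
std::atomic<int> B{1};

void func() {
    int b = B.load(std::memory_order_relaxed);
    A.store(b + 1, std::memory_order_relaxed);
    // Compiler-only barrier: it restricts compile-time reordering and,
    // per the answer, does not compile to any instruction.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    B.store(0, std::memory_order_relaxed);
}

// Hypothetical single-threaded check of the visible result.
void demo() {
    func();
    assert(A.load(std::memory_order_relaxed) == 2);
    assert(B.load(std::memory_order_relaxed) == 0);
}
```

To see what the fence costs on a particular target, compile func() with optimization enabled and compare the asm with and without the fence line.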

If it's for some other reason, then maybe atomic_signal_fence will give you code that happens to work on your target platform. I suspect that most implementations of it do order non-atomic<> loads and stores in practice, at least as an implementation detail, and that this is probably effectively required if there are accesses to atomic<> variables. So it might help in practice to avoid some actual consequences of the data-race Undefined Behaviour, which would still exist. (e.g. as part of a SeqLock implementation where, for efficiency, you want to use non-atomic reads / writes of the shared data so the compiler can use SIMD vector copies.)

See Who's afraid of a big bad optimizing compiler? on LWN for some details about the badness you can run into (like invented loads) if you only use compiler barriers to force reloads of non-atomic variables, instead of using something with read-exactly-once semantics. (In that article, they're talking about Linux kernel code so they're using volatile for hand-rolled load/store atomics. But in general don't do that: When to use volatile with multi threading? - pretty much never)


Sufficient for what?

Regardless of any barriers, if two threads run this function at the same time, your program has Undefined Behaviour because of concurrent access to non-atomic<> variables. So the only way this code can be useful is if you're talking about synchronizing with a signal handler that runs in the same thread.

That would also be consistent with asking for a "compiler barrier", to only prevent reordering at compile time, because out-of-order execution and memory reordering always preserve the behaviour of a single thread. So you never need extra barrier instructions to make sure you see your own operations in program order, you just need to stop the compiler reordering stuff at compile time. See Jeff Preshing's post: Memory Ordering at Compile Time

This is what atomic_signal_fence is for. You can use it with any std::memory_order, just like thread_fence, to get different strengths of barrier and only prevent the optimizations you need to prevent.
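Here is a minimal sketch of that signal-handler use case, assuming atomic<bool> and atomic<int> are lock-free on the target. The names payload, ready, seen, on_signal and publish_and_raise are made up for illustration, and SIGINT is used only because it is one of the signals standard C++ guarantees to exist.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

int payload = 0;                  // plain (non-atomic) data written by normal code
std::atomic<bool> ready{false};   // flag; assumed lock-free on the target
std::atomic<int> seen{-1};        // what the handler observed; assumed lock-free

extern "C" void on_signal(int) {
    if (ready.load(std::memory_order_relaxed)) {
        // Keep the compiler from hoisting the payload read above the flag check.
        std::atomic_signal_fence(std::memory_order_acquire);
        seen.store(payload, std::memory_order_relaxed);
    }
}

int publish_and_raise(int value) {
    auto previous = std::signal(SIGINT, on_signal);
    assert(previous != SIG_ERR);
    (void)previous;

    payload = value;
    // Keep the compiler from sinking the payload store below the flag store.
    std::atomic_signal_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);

    std::raise(SIGINT);           // handler runs in this same thread
    return seen.load(std::memory_order_relaxed);
}
```

Calling publish_and_raise(42) should return 42. The fences matter for a signal that arrives asynchronously between the two stores; raise() is used here only to make the sketch deterministic.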


... atomic_thread_fence(memory_order_acq_rel) did not generate any compiler barrier at all!

Totally wrong, in several ways.

atomic_thread_fence is a compiler barrier plus whatever run-time barriers are necessary to restrict the order in which our loads/stores become visible to other threads.

I'm guessing you mean it didn't emit any barrier instructions when you looked at the asm output for x86. Instructions like x86's MFENCE are not "compiler barriers", they're run-time memory barriers and prevent even StoreLoad reordering at run-time. (That's the only reordering that x86 allows. SFENCE and LFENCE are only needed when using weakly-ordered (NT) stores, like MOVNTPS (_mm_stream_ps).)

On a weakly-ordered ISA like ARM, thread_fence(mo_acq_rel) isn't free, and compiles to an instruction. gcc5.4 uses dmb ish. (See it on the Godbolt compiler explorer).

A compiler barrier just prevents reordering at compile time, without necessarily preventing run-time reordering. So even on ARM, atomic_signal_fence(mo_seq_cst) compiles to no instructions.
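A small pair of functions for comparing the two fences in a compiler explorer might look like the following sketch. X, Y, and the function names are invented for illustration. Going by the description above, an ARM build would be expected to show a barrier instruction such as dmb ish only in with_thread_fence(), but check the asm for your own compiler and target rather than trusting this.

```cpp
#include <atomic>
#include <cassert>

int X, Y;   // plain globals, only ever touched from one thread here

void with_thread_fence() {
    X = 1;
    // Compiler barrier plus whatever run-time barrier the target needs.
    std::atomic_thread_fence(std::memory_order_acq_rel);
    Y = 1;
}

void with_signal_fence() {
    X = 2;
    // Compile-time ordering only: no barrier instruction is needed.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    Y = 2;
}

// Hypothetical single-threaded check; it cannot observe reordering,
// only that both functions perform their stores.
void demo() {
    with_thread_fence();
    assert(X == 1 && Y == 1);
    with_signal_fence();
    assert(X == 2 && Y == 2);
}
```

The interesting part is the generated asm, not the run-time result: within a single thread both functions behave identically.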

A weak enough barrier allows the compiler to do the store to B ahead of the store to A if it wants, but gcc happens to decide to still do them in source order even with thread_fence(mo_acquire) (which shouldn't order stores with other stores).

So this example doesn't really test whether something is a compiler barrier or not.


Strange behaviour from gcc, for an example where a compiler barrier does make a difference:

See this source+asm on Godbolt.

#include <atomic>
using namespace std;
int A,B;

void foo() {
  A = 0;
  atomic_thread_fence(memory_order_release);
  B = 1;
  //asm volatile(""::: "memory");
  //atomic_signal_fence(memory_order_release);
  atomic_thread_fence(memory_order_release);
  A = 2;
}

This compiles with clang the way you'd expect: the thread_fence is a StoreStore barrier, so the A=0 has to happen before B=1, and can't be merged with the A=2.

    # clang3.9 -O3
    mov     dword ptr [rip + A], 0
    mov     dword ptr [rip + B], 1
    mov     dword ptr [rip + A], 2
    ret

But with gcc, the barrier has no effect, and only the final store to A is present in the asm output.

    # gcc6.2 -O3
    mov     DWORD PTR B[rip], 1
    mov     DWORD PTR A[rip], 2
    ret

But with atomic_signal_fence(memory_order_release), gcc's output matches clang. So atomic_signal_fence(mo_release) is having the barrier effect we expect, but atomic_thread_fence with anything weaker than seq_cst isn't acting as a compiler barrier at all.

One theory here is that gcc knows that it's officially Undefined Behaviour for multiple threads to write to non-atomic<> variables. This doesn't hold much water, because atomic_thread_fence should still work if used to synchronize with a signal handler, it's just stronger than necessary.

BTW, with atomic_thread_fence(memory_order_seq_cst), we get the expected output:

    # gcc6.2 -O3, with a mo_seq_cst barrier
    mov     DWORD PTR A[rip], 0
    mov     DWORD PTR B[rip], 1
    mfence
    mov     DWORD PTR A[rip], 2
    ret

We get this even with only one barrier. A single barrier would still allow the A=0 and A=2 stores to happen one after the other, so the compiler is allowed to merge them across a barrier. (Observers failing to see separate A=0 and A=2 values is a possible ordering, so the compiler can decide that's what always happens.) Current compilers don't usually do this kind of optimization, though. See the discussion at the end of my answer on Can num++ be atomic for 'int num'?.

Allometry answered 13/11, 2016 at 22:47 Comment(2)
Working on an update for this: atomic_thread_fence doesn't stop reordering of operations on non-atomic objects. With gcc on x86, atomic_signal_fence does. I'm not sure if this is required by the standard, or an implementation artifact. So atomic_signal_fence is not a strict subset of atomic_thread_fence. - Allometry
update 2: IIRC, that last comment was a GCC bug that has since been fixed. - Allometry
