When is a compiler-only memory barrier (such as std::atomic_signal_fence) useful?

Asked 26/8, 2013 at 17:7 Answered 5/11, 2013 at 9:19

Solved c++ c++11 atomic memory-barriers memory-fences

The notion of a compiler fence often comes up when I'm reading about memory models, barriers, ordering, atomics, etc., but normally it's in the context of also being paired with a CPU fence, as one would expect.

Occasionally, however, I read about fence constructs which only apply to the compiler. An example of this is the C++11 std::atomic_signal_fence function, which states at cppreference.com:

std::atomic_signal_fence is equivalent to std::atomic_thread_fence, except no CPU instructions for memory ordering are issued. Only reordering of the instructions by the compiler is suppressed as order instructs.

I have five questions related to this topic:

1) As implied by the name std::atomic_signal_fence, is an asynchronous interrupt (such as a thread being preempted by the kernel to execute a signal handler) the only case in which a compiler-only fence is useful?
2) Does its usefulness apply to all architectures, including strongly-ordered ones such as x86?
3) Can a specific example be provided to demonstrate the usefulness of a compiler-only fence?
4) When using std::atomic_signal_fence, is there any difference between using acq_rel and seq_cst ordering? (I would expect it to make no difference.)
5) This question might be covered by the first question, but I'm curious enough to ask specifically about it anyway: Is it ever necessary to use fences with thread_local accesses? (If it ever would be, I would expect compiler-only fences such as atomic_signal_fence to be the tool of choice.)

Thank you.

Pechora asked 26/8, 2013 at 17:7 Comment(6)

Have you checked? preshing.com/20120625/memory-ordering-at-compile-time. – Uremia 26/8, 2013 at 19:31

Quoting preshing.com: "As I mentioned, compiler barriers are sufficient to prevent memory reordering on a single-processor system. But it’s 2012, and these days, multicore computing is the norm. If we want to ensure our interactions happen in the desired order in a multiprocessor environment, and on any CPU architecture, then a compiler barrier is not enough. [...]" – Uremia 26/8, 2013 at 19:36

@chico: Good point: if the programmer knows the application will only run on non-SMP systems (i.e., single CPU with single core or SMP disabled in the kernel for some reason), which is something the compiler couldn't possibly know or assume, then atomic_signal_fence (or some other compiler-only fence construct) could be used as a potential optimization. As the article states, the Linux kernel has functions smp_rmb and smp_wmb which are implemented this way. However, I'm still interested in hearing answer(s) -- if any exist -- that are not restricted to such an assumption. – Pechora 26/8, 2013 at 20:14

I think it could also be useful in an application architected to take advantage of processor affinity, where multiple instances run independently in parallel, each on its own core. In that case, compiler-only barriers can be an optimization, since they are all that is necessary. – Uremia 26/8, 2013 at 20:23

@chico: Also a good point regarding processor affinity, but that is essentially the same assumption as before as it reduces an SMP environment to non-SMP (for the application) if it is strictly bound to a single core. – Pechora 26/8, 2013 at 20:26

To avoid responses centered on the theme I've focused on, you could edit your question to make clear that you're not interested in this view of the problem, which would give it better focus. – Uremia 26/8, 2013 at 20:30

To answer all 5 questions:

1) A compiler fence (by itself, without a CPU fence) is only useful in two situations:

To enforce memory order constraints between a single thread and an asynchronous interrupt handler bound to that same thread (such as a signal handler).
To enforce memory order constraints between multiple threads when it is guaranteed that every thread will execute on the same CPU core. In other words, the application will only run on single core systems, or the application takes special measures (through processor affinity) to ensure that every thread which shares the data is bound to the same core.

2) The memory model of the underlying architecture, whether it's strongly- or weakly-ordered, has no bearing on whether a compiler fence is needed in a given situation. The compiler may reorder memory accesses regardless of what the hardware guarantees.

3) Here is pseudo-code which demonstrates the use of a compiler fence, by itself, to sufficiently synchronize memory access between a thread and an async signal handler bound to the same thread:

void async_signal_handler()
{
    if ( is_shared_data_initialized )
    {
        compiler_only_memory_barrier(memory_order::acquire);
        ... use shared_data ...
    }
}

int main()
{
    // initialize shared_data ...
    shared_data->foo = ...
    shared_data->bar = ...
    shared_data->baz = ...
    // shared_data is now fully initialized and ready to use
    compiler_only_memory_barrier(memory_order::release);
    is_shared_data_initialized = true;
}

Important Note: This example assumes that async_signal_handler is bound to the same thread that initializes shared_data and sets the is_shared_data_initialized flag, which means the application is single-threaded, or it sets thread signal masks accordingly. Otherwise, the compiler fence would be insufficient, and a CPU fence would also be needed.
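For readers who want something compilable, the pseudo-code above could be sketched in C++11 roughly as follows. This is only an illustration under stated assumptions: the SharedData struct, the sum_seen_by_handler variable and the initialize_then_signal function are hypothetical names, SIGINT is chosen arbitrarily, the flag is a std::atomic<bool> that is assumed to be lock-free on the target, and the signal is delivered with std::raise so that the handler runs on the same thread that initialized the data.

```cpp
#include <atomic>
#include <csignal>

// Hypothetical payload; the field names mirror the pseudo-code above.
struct SharedData {
    int foo;
    int bar;
    int baz;
};

static SharedData shared_data;  // plain, non-atomic data
// Assumption: std::atomic<bool> is lock-free here, so it is usable in a handler.
static std::atomic<bool> is_shared_data_initialized{false};
static volatile std::sig_atomic_t sum_seen_by_handler = -1;

extern "C" void async_signal_handler(int /*signal_number*/)
{
    if (is_shared_data_initialized.load(std::memory_order_relaxed)) {
        // Compiler-only fence: keeps the reads below from moving above the flag load.
        std::atomic_signal_fence(std::memory_order_acquire);
        sum_seen_by_handler = shared_data.foo + shared_data.bar + shared_data.baz;
    }
}

// Initializes the data, publishes the flag, then raises the signal on this thread.
int initialize_then_signal()
{
    std::signal(SIGINT, async_signal_handler);

    shared_data.foo = 1;
    shared_data.bar = 2;
    shared_data.baz = 3;
    // Compiler-only fence: keeps the writes above from moving below the flag store.
    std::atomic_signal_fence(std::memory_order_release);
    is_shared_data_initialized.store(true, std::memory_order_relaxed);

    std::raise(SIGINT);              // handler runs on this thread before raise returns
    std::signal(SIGINT, SIG_DFL);    // restore the default disposition
    return sum_seen_by_handler;
}
```

Calling initialize_then_signal() from main should return the sum computed inside the handler. The same caveat as in the note above applies: if the handler could run on a different thread, std::atomic_thread_fence would be needed instead.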

4) They should be the same. acq_rel and seq_cst should both result in a full (bidirectional) compiler fence, with no fence-related CPU instructions emitted. The concept of "sequential consistency" only comes into play when multiple cores and threads are involved, and atomic_signal_fence only pertains to one thread of execution.

5) No. (Unless, of course, the thread-local data is accessed from an asynchronous signal handler, in which case a compiler fence might be necessary.) Otherwise, fences should never be needed with thread-local data since the compiler (and CPU) are only allowed to reorder memory accesses in ways that do not change the observable behavior of the program with respect to its sequence points from a single-threaded perspective. And one can logically think of thread-local statics in a multi-threaded program as being the same as global statics in a single-threaded program. In both cases, the data is only accessible from a single thread, which prevents a data race from occurring.

Schist answered 27/8, 2013 at 0:4 Comment(4)

Informative, but inaccurate. There are other cases in which compiler-only fences are useful in pre-C11 code for particular processors. For example, if you are on an x86 and are satisfied with acquire-release, but want to allow the compiler to reorder memory operations to different addresses within a block, compiler fencing around the block (but leaving the memory accesses nonvolatile) is the only way to achieve this. – Gracegraceful 24/2, 2017 at 1:47

@Gracegraceful But atomic_signal_fence applies to atomics only, not to ordinary objects, so you have to change all your datatypes. – Hypnotherapy 25/5, 2019 at 14:27

@MikeTusar, do you need to mark shared_data/is_initialized as volatile as well? What if the sighandler is blocked (pthread_sigmask) while the modifications are performed? My current thinking is that you need volatile for the former, but not for the latter. – Anglofrench 11/4 at 18:10

@curiousguy: in the new C++20, it's possible to use atomic_ref over ordinary objects to perform the atomic operation (so atomic_signal_fence should work here w/o changing any datatype). Besides, C++ fences only need 1 atomic access to synchronize against the acquire/release semantics, so changing all datatypes was never necessary. In fact, the purpose of fences is to synchronize all non-atomic (and relaxed atomic) memory accesses against some atomic access. You may also find this interesting: youtube.com/watch?v=KeLBd2EJLOU&t=5439s – Anglofrench 19/4 at 12:55

There are actually some nonportable but useful C programming idioms where compiler fences are useful, even in multicore code (particularly in pre-C11 code). The typical situation is where the program is doing some accesses that would normally be made volatile (because they are to shared variables), but you want the compiler to be able to move the accesses around. If you know that the accesses are atomic on the target platform (and you take some other precautions), you can leave the accesses nonvolatile, but contain code movement using compiler barriers.

Thankfully, most programming like this is made obsolete with C11/C++11 relaxed atomics.
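As a rough illustration of the portable replacement (not of the old nonportable idiom itself), here is a minimal sketch in which the accesses inside a block are relaxed atomics, so the compiler may reorder them relative to each other, while a release store and an acquire load bound the block. The Stats struct and the publish and try_read functions are hypothetical names chosen for this example.

```cpp
#include <atomic>

// Hypothetical shared record; all names are illustrative.
struct Stats {
    std::atomic<int> requests{0};
    std::atomic<int> errors{0};
    std::atomic<bool> published{false};
};

// Writer: the two relaxed stores may be reordered with respect to each other,
// but neither may move below the release store that ends the block.
void publish(Stats& s, int requests, int errors)
{
    s.requests.store(requests, std::memory_order_relaxed);
    s.errors.store(errors, std::memory_order_relaxed);
    s.published.store(true, std::memory_order_release);
}

// Reader: the acquire load starts the block; the relaxed loads after it
// may be reordered with respect to each other, but not above the acquire.
bool try_read(const Stats& s, int& requests, int& errors)
{
    if (!s.published.load(std::memory_order_acquire)) {
        return false;
    }
    requests = s.requests.load(std::memory_order_relaxed);
    errors = s.errors.load(std::memory_order_relaxed);
    return true;
}
```

A reader thread that sees try_read return true is guaranteed to observe the values written before the matching publish call, with no volatile accesses or hand-written compiler barriers.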

Gracegraceful answered 5/11, 2013 at 9:19 Comment(0)
