C11 memory fence usage
Even for a simple two-thread communication example, I have difficulty expressing this in the C11 atomic and memory-fence style to obtain proper memory ordering:

shared data:

volatile int flag, bucket;

producer thread:

while (true) {
   int value = producer_work();
   while (atomic_load_explicit(&flag, memory_order_acquire))
      ; // busy wait
   bucket = value;
   atomic_store_explicit(&flag, 1, memory_order_release);
}

consumer thread:

while (true) {
   while (!atomic_load_explicit(&flag, memory_order_acquire))
      ; // busy wait
   int data = bucket;
   atomic_thread_fence(/* memory_order ??? */);
   atomic_store_explicit(&flag, 0, memory_order_release);
   consumer_work(data);
}

As far as I understand, the above code properly orders store-to-bucket -> flag-store -> flag-load -> load-from-bucket. However, I think a race condition remains between the load from bucket and the subsequent overwrite of bucket with new data. To force an ordering after the bucket read, I guess I would need an explicit atomic_thread_fence() between the bucket read and the following atomic_store. Unfortunately, there seems to be no memory_order argument that enforces anything on preceding loads, not even memory_order_seq_cst.

A really dirty solution could be to re-assign bucket in the consumer thread with a dummy value, but that contradicts the notion of a read-only consumer.

In the older C99/GCC world I could use the traditional __sync_synchronize() which I believe would be strong enough.

What would be a nicer C11-style solution to synchronize this so-called anti-dependency?

(Of course I am aware that I would be better off avoiding such low-level coding and using the available higher-level constructs, but I would like to understand...)

Townie answered 30/10, 2013 at 17:29 Comment(6)
I'm not a C++ programmer, but (conceptually) I'm not sure the atomic_thread_fence() call is necessary. The flag update has release semantics, preventing any preceding store instructions from being reordered across it (e.g., the store to data). The store to data has a dependency on the read from bucket, so that read cannot be reordered past the flag release either. If the full fence is necessary, I'd love to hear why. – Morion
Not an answer, thus just a comment: it seems that you are re-inventing C11's atomic_flag data type, which implements exactly these semantics but may have a more direct implementation in hardware. atomic_flag is the only atomic data type that is guaranteed to be lock-free, so it is always to be preferred over more complex operations. And it definitely wouldn't need an extra fence to ensure consistency. – Mortality
Mike S, your reply seems attractive to me, but... I thought that memory fences ensure things about the memory subsystem, affecting load/store ops. In the above example, data would probably become a register variable, so its assignment does not create a store op. That would leave only the load from bucket for memory synchronization (for which there is no subsequent C11 memory order?). – Townie
@JosvE This is interesting: "Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order" (preshing.com/20120913/acquire-and-release-semantics). That would suggest the read from bucket could not be reordered past the write to flag, regardless of where the read value goes. Also, from Herb Sutter: "A write-release executes after all reads and writes by the same thread that precede it in program order." – Morion
@MikeStrobel, thanks! That expresses what I was looking for. To me it seems that neither Stroustrup's C++ (4th edition) nor Williams' 'C++ concurrency' describes this clearly, and that the information on en.cppreference.com/w/c/atomic/memory_order is then apparently incorrect. – Townie
Great! I consolidated my various comments and some additional bits into an answer. – Morion

To force an order following the bucket-read, I guess I would need an explicit atomic_thread_fence() between the bucket read and the following atomic_store.

I do not believe the atomic_thread_fence() call is necessary: the flag update has release semantics, preventing any preceding load or store operations from being reordered across it. See the formal definition by Herb Sutter:

A write-release executes after all reads and writes by the same thread that precede it in program order.

This should prevent the read of bucket from being reordered to occur after the flag update, regardless of where the compiler chooses to store data.

That brings me to your comment about another answer:

The volatile ensures that there are ld/st operations generated, which can subsequently be ordered with fences. However, data is a local variable, not volatile. The compiler will probably put it in register, avoiding a store operation. That leaves the load from bucket to be ordered with the subsequent reset of flag.

It would seem that this is not an issue if the bucket read cannot be reordered past the flag write-release, so volatile should not be necessary (though it probably doesn't hurt to have it, either). It's also unnecessary because most function calls (in this case, atomic_store_explicit(&flag)) serve as compile-time memory barriers: the compiler would not reorder the read of a global variable past a non-inlined function call, because that function could modify the same variable.

I would also agree with @MaximYegorushkin that you could improve your busy-waiting with pause instructions when targeting compatible architectures. GCC and ICC both appear to have _mm_pause(void) intrinsics (probably equivalent to __asm__ ("pause;")).

Morion answered 31/10, 2013 at 13:45 Comment(4)
Thanks Mike for the correct write-up. So you confirm that today's description of memory_order_release on cppreference.com is incorrect (claiming an effect on preceding stores only). Regarding volatile for bucket: if the compiler had knowledge of atomic_store_explicit and inlined the call, could a non-volatile bucket be mapped to a register, skipping some of the intended loads and inter-thread data exchanges? – Townie
It may not be incorrect per se; technically, Sutter was describing "write-release" semantics, which I would take to mean "a write with release semantics", e.g., atomic_store_explicit(). That may simply be a stronger guarantee than that provided by a lone release barrier, e.g., atomic_thread_fence(), in which case the documentation for memory_order_release may simply be describing the minimal guarantees. Unfortunately, getting authoritative answers to this sort of thing has become exceedingly difficult, but every definition of "release semantics" I've seen agrees with Sutter's. – Morion
Accepting this answer implies that cppreference.com is wrong. @Mike, your above link to preshing.com also indicates so. I just found stackoverflow.com/questions/16179938 where the answer also states that cppreference.com is wrong: both say that a store-release should be ordered against preceding loads. However, the C++11 Jan 2012 draft says: "Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads". Do these 'side effects' exclude reads? That weakens the case for changing 'memory_order_release' on cppreference.com :-( – Townie
The quote from the C++11 Jan 2012 draft refers to visibility guarantees, not ordering guarantees. Having had a chance to read a bit more, it does indeed seem as though the cppreference.com documentation is wrong, or at least incomplete. – Morion

I agree with what @MikeStrobel says in his comment.

You don't need atomic_thread_fence() here because your critical sections start with acquire and end with release semantics. Hence, reads within your critical sections cannot be reordered before the acquire, nor writes after the release. And this is why volatile is unnecessary here as well.

In addition, I don't see a reason why a (pthread) spinlock is not used here instead. A spinlock does a similar busy spin for you, but it also uses the pause instruction:

The pause intrinsic is used in spin-wait loops with the processors implementing dynamic execution (especially out-of-order execution). In the spin-wait loop, the pause intrinsic improves the speed at which the code detects the release of the lock and provides especially significant performance gain. The execution of the next instruction is delayed for an implementation-specific amount of time. The PAUSE instruction does not modify the architectural state. For dynamic scheduling, the PAUSE instruction reduces the penalty of exiting from the spin-loop.

Unicuspid answered 30/10, 2013 at 20:30 Comment(3)
If bucket were not declared volatile, I would doubt whether the loads from bucket always return the proper value: the volatile seems needed to prevent the compiler from making a local register copy of bucket. I don't know whether volatile is still needed for flag; the atomic intrinsics might cover that. Note, however, that the C11 prototypes of these functions specify a volatile pointer argument. – Townie
@JosvE Nope, volatile is unnecessary because you have memory barriers before and after accessing it; that is all that matters. See channel9.msdn.com/Shows/Going+Deep/… and the second part for a detailed treatment of your question and of volatile as well. – Unicuspid
OK, I went deep through Sutter's presentation... You are correct that the compiler must take care of ordering with the atomic fences, without needing further volatile declarations. Thanks. – Townie

Direct answer:

That your store is a memory_order_release operation means that your compiler has to emit a store memory fence before the store to flag. This is required to ensure that other processors see the final state of the released data before they start interpreting it. So, no, you don't need to add a second fence.


Long answer:

As noted above, what happens is that the compiler transforms your atomic_... calls into combinations of fences and memory accesses; the fundamental abstraction is not the atomic load, it is the memory fence. That is how things work, even though the new C++ abstractions entice you to think differently. And I personally find memory fences much easier to think about than the contrived abstractions in C++.

From a hardware perspective, what you need to ensure is the relative order of your loads and stores, i.e., that the write to bucket completes before flag is written in the producer, and that, in the consumer, the load of flag is performed before the load of bucket.

That said, what you actually need is this:

//producer
while(true) {
    int value = producer_work();
    while (flag) ; // busy wait
    atomic_thread_fence(memory_order_acquire);  //ensure that value is not assigned to bucket before the flag is lowered
    bucket = value;
    atomic_thread_fence(memory_order_release);  //ensure bucket is written before flag is
    flag = true;
}

//consumer
while(true) {
    while(!flag) ; // busy wait
    atomic_thread_fence(memory_order_acquire);  //ensure the value read from bucket is not older than the last value read from flag
    int data = bucket;
    atomic_thread_fence(memory_order_release);  //ensure data is loaded from bucket before the flag is lowered again
    flag = false;
    consumer_work(data);
}

Note that the labels "producer" and "consumer" are misleading here, because we have two processes playing ping-pong, each becoming producer and consumer in turn; it's just that one thread produces useful values, while the other produces "holes" to write useful values into...

atomic_thread_fence() is all you need, and since it translates directly to the assembler instructions underlying the atomic_... abstractions, it is guaranteed to be the fastest approach.

Mert answered 30/10, 2013 at 21:24 Comment(8)
Without interlocked (atomic) reads of flag, I would not expect the latest flag value to be read if the producer and consumer are running on separate CPUs. The updates may be ordered, but the new results may not be seen. I would expect he also needs an interlocked store when updating the flag, for similar reasons. At the very least, the release fences should follow the flag assignments. – Morion
@MikeStrobel No, the releases must come before the flag assignments; they are the store fences: any store before them must become effective before any store after them does. Regarding atomicity: yes, assignments to flag must be atomic, but show me the architecture on which a store to a properly aligned int is not atomic. We don't need any language support for that. – Mert
Whoops, yes, disregard what I said about moving the release fence after the store. Regarding the atomics, however, my point wasn't that the reads and writes to flag are not atomic, but that they are not interlocked (and presumably the atomic read/write functions in this API are). In your version, how is it ensured that other CPUs do not attempt to read a stale version of flag from their local cache? – Morion
@MikeStrobel Well, that is already ensured by flag being volatile, which ensures that the compiler cannot optimize away any reads from flag. The semantics of volatile are actually much stronger than what is needed here; the only thing that needs to be ensured is that the compiler does not optimize the spin loop away. I am fully aware that this is probably not the best (= most performant) way to write a spin loop, but the question is not about spin loops, it is about race conditions in inter-thread communication, and that is perfectly addressed by the memory fences. – Mert
@MikeStrobel If you want to avoid volatile and be sure about fresh values, you can always do while(!flag) atomic_thread_fence(memory_order_acquire);, which will just as effectively stop the compiler from doing bad things, but I believe it's just as much overkill as volatile. – Mert
I wasn't concerned about compiler optimizations; I was concerned about cache coherency (a runtime issue, not a compile-time issue). From what little I remember of C programming, volatile does nothing to address this. It only prevents certain compiler optimizations, like you said. I can see how your version might ensure proper ordering; I do not see how it guarantees immediate visibility of the latest flag value across CPUs. – Morion
I would like to raise a doubt similar to the one on the first answer... The volatile ensures that load/store operations are generated, which can subsequently be ordered with fences. However, data is a local variable, not volatile. The compiler will probably put it in a register, avoiding a store operation. That leaves the load from bucket to be ordered with the subsequent reset of flag. Unfortunately, the C11 definition of memory_order_release which I am looking at does not say anything about preceding loads... – Townie
The example code is misleading. Acquiring and releasing the lock here (flag) has acquire and release semantics, which are memory barriers. No atomic_thread_fence is necessary here, regardless of whether you use a mutex, spinlock, or your own busy spin. – Unicuspid