What is the difference between using explicit fences and std::atomic?
Assuming that aligned pointer loads and stores are naturally atomic on the target platform, what is the difference between this:

// Case 1: Dumb pointer, manual fence
int* ptr;
// ...
std::atomic_thread_fence(std::memory_order_release);
ptr = new int(-4);

this:

// Case 2: atomic var, automatic fence
std::atomic<int*> ptr;
// ...
ptr.store(new int(-4), std::memory_order_release);

and this:

// Case 3: atomic var, manual fence
std::atomic<int*> ptr;
// ...
std::atomic_thread_fence(std::memory_order_release);
ptr.store(new int(-4), std::memory_order_relaxed);

I was under the impression that they were all equivalent, however Relacy detects a data race in the first case (only):

struct test_relacy_behaviour : public rl::test_suite<test_relacy_behaviour, 2>
{
    rl::var<std::string*> ptr;
    rl::var<int> data;

    void before()
    {
        ptr($) = nullptr;
        rl::atomic_thread_fence(rl::memory_order_seq_cst);
    }

    void thread(unsigned int id)
    {
        if (id == 0) {
            std::string* p  = new std::string("Hello");
            data($) = 42;
            rl::atomic_thread_fence(rl::memory_order_release);
            ptr($) = p;
        }
        else {
            std::string* p2 = ptr($);        // <-- Test fails here after the first thread completely finishes executing (no contention)
            rl::atomic_thread_fence(rl::memory_order_acquire);

            RL_ASSERT(!p2 || *p2 == "Hello" && data($) == 42);
        }
    }

    void after()
    {
        delete ptr($);
    }
};

I contacted the author of Relacy to find out if this was expected behaviour; he says that there is indeed a data race in my test case. However, I'm having trouble spotting it; can someone point out to me what the race is? Most importantly, what are the differences between these three cases?

Update: It's occurred to me that Relacy may simply be complaining about the atomicity (or lack thereof, rather) of the variable being accessed across threads... after all, it doesn't know that I intend only to use this code on platforms where aligned integer/pointer access is naturally atomic.

Another update: Jeff Preshing has written an excellent blog post explaining the difference between explicit fences and the built-in ones ("fences" vs "operations"). Cases 2 and 3 are apparently not equivalent! (In certain subtle circumstances, anyway.)

Bernardina answered 5/1, 2013 at 1:58 Comment(11)
Surely you intend for the release to go after the store?Pogrom
Just use std::atomic. Using the relaxed model might be a bit faster on some architectures, but is rarely worth the effort. See bartoszmilewski.com/2008/12/01/c-atomics-and-memory-orderingUmber
@GMan: Actually, no. If the release goes before the store, then all other stores done before that one are guaranteed to be visible if the store itself is visible (assuming it's loaded after an acquire). If the release goes after the store, then the reader of the variable (using acquire semantics) has no guarantee that previous stores have completed even if it can see that store (because the store could become visible before the release executes; also, the compiler or CPU could simply re-order the stores).Bernardina
@Axel: Thanks, but actually I've already put in the effort to get things working with the relaxed model ;-) I just want to figure out why my relacy test was failing with a plain var (and manual fences), vs with a relaxed std::atomic var (and the same manual fences).Bernardina
On which platform are you running? You realize, that if it's x86 there will be no benefit?Umber
@Axel: x86/x64 for now. Yes, I realize that these fences should be no-ops on those processors; however, they still prevent compiler re-ordering, and I might one day like to use my code (I wrote a lock-free queue) on ARM or PowerPC without having to modify the source.Bernardina
Just using std::atomic would have been good enough for that as well.Umber
@Axel: Right, but the default is to enforce sequential consistency, which is overkill (e.g. it's a full sync instead of a lightweight sync on PPC). Also, I wanted my code to work with VS2010, which doesn't have std::atomic, but does have memory barrier primitives, so I used manual fences (this way I also need less than one per variable access). I guess my example code is kinda trivial, but I really do want to find out what the differences are between the samples, since this is exactly the kind of code that may turn out to fail only on one arch. Thanks for taking an interest by the way :-)Bernardina
@Cameron: Sorry, you are correct.Pogrom
As a complement to Preshing's article, see the diagrams in modernescpp.com/index.php/fences-as-memory-barriers. It's easiest to remember the fences as having the same semantics as the corresponding atomic op on one side (e.g. an acquire fence not allowing anything to be moved "above" it), but also imposing restrictions on the other side (e.g. an acquire fence doesn't allow any loads to move below it).Far
(or another way to think about it is that the release fence has not just the one-way "release semantic" barrier but additionally every subsequent store is treated as a release store with regard to reordering against the fence.)Far

I believe the code has a race. Case 1 and case 2 are not equivalent.

29.8 [atomics.fences]

-2- A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.

In case 1 your release fence does not synchronize with your acquire fence because ptr is not an atomic object and the store and load on ptr are not atomic operations.

Case 2 and case 3 are equivalent (actually, not quite, see LWimsey's comments and answer), because ptr is an atomic object and the store is an atomic operation. (Paragraphs 3 and 4 of [atomic.fences] describe how a fence synchronizes with an atomic operation and vice versa.)

The semantics of fences are defined only with respect to atomic objects and atomic operations. Whether your target platform and your implementation offer stronger guarantees (such as treating any pointer type as an atomic object) is implementation-defined at best.

N.B. for both of case 2 and case 3 the acquire operation on ptr could happen before the store, and so would read garbage from the uninitialized atomic<int*>. Simply using acquire and release operations (or fences) doesn't ensure that the store happens before the load, it only ensures that if the load reads the stored value then the code is correctly synchronized.

Loquat answered 6/1, 2013 at 14:26 Comment(10)
Thank you. If you know, would you tell what purpose C++ fences serve, then? (I understand the purpose of the x86 SFENCE, LFENCE and MFENCE instructions, though I am unfamiliar with similar instructions on other architectures. However, I believe that SFENCE and LFENCE would prevent the race described, whereas you seem to be right: the C++ standard seems to allow the race. If so, then what is the purpose of C++ fences, if they don't issue instructions like SFENCE and LFENCE?)Kirghiz
Not all platforms have such instructions. On a platform that does, a C++ fence probably maps to those instructions & your code might work, but the standard is defined in more abstract terms. C++ fences can be used to add synchronization to a sequence of several relaxed atomic ops e.g. you could do five relaxed stores to five different atomic objects and use only a single release fence, and do five relaxed loads and only have a single acquire fence. That could be cheaper than five seqcst stores and five seqcst loads. In your code, with a single atomic object, I'd just use atomic<string*>Loquat
Jonathan: Aha, thanks for this answer. It fills in a gap in my understanding :-) As far as I know, all modern processors (like x86, x86-64, PowerPC, and ARM) treat aligned int and pointer loads/stores atomically -- but as you say, this is implementation-defined, and not guaranteed by the C++ standard. @thb: I believe acquire and release fences are no-ops on x86 (all loads and stores intrinsically have acquire and release semantics, respectively).Bernardina
Just to add, for the next poor soul who reads my previous comment, that even if aligned pointer and integer loads/stores are atomic on a platform, that does not mean you can get away without using std::atomic. What it means is "if you don't use std::atomic, your code might work, but no guarantees" -- in particular, the optimizations of the compiler may suddenly (subtly) break code that was previously working. See software.intel.com/en-us/blogs/2013/01/06/…Bernardina
@JonathanWakely I doubt whether case 2 and case 3 are equivalent. Case 2 seems correct, but in case 3, the allocation of integer memory (new) is sequenced after the release fence which means that even if another thread correctly issues an acquire fence after loading ptr, it may then still be pointing at garbage since that memory allocation is not correctly synchronized.Emblazon
@LWimsey,that's a good point, but doesn't the problem of reading garbage exist for both case 2 and case 3? If an acquire operation on ptr doesn't read the value stored then you get garbage either way. If the acquire operation reads the stored (non-garbage) value, then it synchronizes with the store (for case 3 this is defined by [atomics.fences] p2 "if Y reads the value written by X"). To avoid reading garbage ptr could be initialized to nullptr and then if the consumer thread does while (!(p2 = ptr.load())) I think the code is correctly synchronized, for either of case 2 or case 3.Loquat
@JonathanWakely In both cases, an acquire operation will read the correct ptr value (since it is available at the store/release operation), but not the value ptr is pointing at. An acquire synchronizing with a release applies to memory operations that happen before the release and that is where the problem lies. In case 2, new is an argument and therefore technically sequenced before the release operation (as it should). However, in case 3 it is sequenced after the (standalone) release fence and therefore it fails to satisfy the inter-thread 'happens before' relationship.Emblazon
@JonathanWakely I cannot agree with case 2 and 3 being equivalent (-1). I added my own answer in an attempt to explain thisEmblazon
@LWimsey, "In both cases, an acquire operation will read the correct ptr value (since it is available at the store/release operation)" Unless it isn't available, in which case a garbage value will be read because the atomic<int> was default-constructed. That's what I thought you meant. Thanks for clarifying. I agree that the acquire operation is not synchronized with the initialization of the int and so case 3 has a data race.Loquat
@JonathanWakely thank you - yeah, it is sometimes challenging (for me) to express thoughts in only a few linesEmblazon

Although various answers cover bits and pieces of what the potential problem is and/or provide useful information, no answer correctly describes the potential issues for all three cases.

In order to synchronize memory operations between threads, release and acquire barriers are used to specify ordering.
In the diagram, memory operations A in thread 1 cannot move down across the (one-way) release barrier (regardless whether that is a release operation on an atomic store, or a standalone release fence followed by a relaxed atomic store). Hence memory operations A are guaranteed to happen before the atomic store. Same goes for memory operations B in thread 2 which cannot move up across the acquire barrier; hence the atomic load happens before memory operations B.

[Diagram: in thread 1, memory operations A sit above a one-way release barrier followed by the atomic store to ptr; in thread 2, the atomic load of ptr is followed by a one-way acquire barrier, below which sit memory operations B]

The atomic ptr itself provides inter-thread ordering based on the guarantee that it has a single modification order. As soon as thread 2 sees a value for ptr, it is guaranteed that the store (and thus memory operations A) happened before the load. Because the load is guaranteed to happen before memory operations B, the rules for transitivity say that memory operations A happen before B and synchronization is complete.

With that, let's look at your 3 cases.

Case 1 is broken because ptr, a non-atomic type, is modified in different threads. That is a classical example of a data race and it causes undefined behavior.

Case 2 is correct. Because the allocation appears as an argument to the store, the integer allocation with new is sequenced before the release operation. This is equivalent to:

// Case 2: atomic var, automatic fence
std::atomic<int*> ptr;
// ...
int *tmp = new int(-4);
ptr.store(tmp, std::memory_order_release);

Case 3 is broken, albeit in a subtle way. The problem is that even though the ptr assignment is correctly sequenced after the standalone fence, the integer allocation (new) is also sequenced after the fence, causing a data race on the integer memory location.

The code is equivalent to:

// Case 3: atomic var, manual fence
std::atomic<int*> ptr;
// ...
std::atomic_thread_fence(std::memory_order_release);

int *tmp = new int(-4);
ptr.store(tmp, std::memory_order_relaxed);

If you map that to the diagram above, the new operator is supposed to be part of memory operations A. Being sequenced below the release fence, ordering guarantees no longer hold and the integer allocation may actually be reordered with memory operations B in thread 2. Therefore, a load() in thread 2 may return garbage or cause other undefined behavior.

Emblazon answered 15/4, 2017 at 17:38 Comment(10)
In case 3, how can the allocation possibly be reordered with any of the operations in thread B? The allocation had to finish before anything was assigned to ptr (we have to know what value to assign), right? This is already a happens-before relationship. The only thing I can imagine is that some stores within the new() (for example setting the allocated memory to -4) will not be visible to B.Teyde
@Teyde Since the integer allocation is sequenced after the release fence, there is no inter-thread ordering between this operation and ptr.store(mo_relaxed). The allocation finishes before it is assigned to ptr (happens-before indeed), but that guarantee only hold within the same thread. Without ordering enforced by the release fence, among other effects, thread B may observe these operations out of order (technically undefined behavior).Emblazon
So, in this particular case, by "operations" you mean any operations done within the allocator (like setting some metadata, etc.)? Let's forget about initializing the memory to -4. In that case, if thread B reads a non-null value from ptr then this value must be a valid pointer to some memory region, right? (even if thread B does not see the updated allocator metadata). Or is this assumption wrong and if we read/write to that memory something bad can happen?Teyde
@Teyde With 'operations', I meant int *tmp = ... and ptr.store(tmp, mo_relaxed). Your assumption is incorrect; once thread B observes the new value for ptr, it may still not see -4 at that memory location (uninitialized), or the memory location itself may even be invalid. That's the thing with undefined behavior, you cannot really reason about it; But it's all well-defined if tmp = new ... is sequenced before the fence.Emblazon
@Emblazon Great answer and great job pointing out that only the address has been fixed at this point, all other allocation work may be unfinished (and is not guaranteed to be visible in another thread anyway. as you have also explained). I urgently recommend C++ programmers to never assume any type of synchronization that the standard does not explicitly mandate.Clairclairaudience
In the example with memory operations A/B, you mentioned that the release fence is one-way, i.e. prevents previous read-writes to be reordered ahead of it but not the other way around. So, can't a relaxed atomic read/write move from ahead of the release fence to behind it? Won't that atomic read/write (ptr.load()/ptr.store) potentially not see everything behind the fence (if it moves behind all A operations let's say)? @EmblazonTrajectory
@User10482 This is not possible because the standalone fence only works when paired with an atomic operation (usually relaxed). It's the combination of a release fence followed by an atomic store that creates a one-way barrier (for operations sequenced before the release fence). Same for load acquire followed by a standalone acquire fence which creates a one-way barrier for operations sequenced after the acquire fence. The C++ standard describes this behavior in [31.11, atomic.fences]Emblazon
If I understand correctly, atomic operations after release fence will not cross it despite being relaxed. However, same guarantee does not apply to non-atomic operations. Right? @EmblazonTrajectory
@User10482 The guarantee is that the relaxed atomic store after the release fence cannot be re-ordered with (non-atomic) operations before the fence. So yes, you could say that operations before the release fence and the relaxed atomic store after the fence cannot cross the fence itself The store after the release fence and the load before the acquire fence must be atomic operations, or you would have a data race which is undefined behavior. For more details, search for 'data race' in the standards document.Emblazon
Appreciate the help :) @EmblazonTrajectory

Several pertinent references:

Some of the above may interest you and other readers.

Kirghiz answered 6/1, 2013 at 18:25 Comment(3)
Thanks for the links! I read the Linux kernel memory barrier notes a few weeks back, and they were particularly helpful. This overview of memory barriers from a hardware perspective was also useful.Bernardina
The overview you mention looks good. I have added it to the list. Of curiosity, are you in the same position I am in? I had done a very little, grossly concurrent programming over the years using task forks and/or lockfiles, though nothing more sophisticated. Then along comes the new C++11 standard with its headache-inducing sect 1.10 on concurrency, so naturally I want to start to learn what this C++ concurrency is all about. The list of links comes of my present effort to learn. Do you also stand so, or do you approach from another perspective?Kirghiz
I don't have much experience in general, but I recently got interested in audio programming, which tends to be very performance critical, otherwise the audio could glitch; since the audio data is generally requested via a callback on another thread, this leads to the desire for fast synchronization -- and lock-free queues are ideal for this. So, I read a bit about lock-free programming, which lead me to memory barriers, which let me implement a lock free queue. Slightly different perspective :-)Bernardina

The memory backing an atomic variable can only ever be used for the contents of the atomic. A plain variable like ptr in case 1 is a different story, though: once the compiler has the right to write to it, it can write anything to it, even a temporary value, when it runs out of registers.

Remember, your example is pathologically clean. Given a slightly more complex example:

std::string* p  = new std::string("Hello");
data($) = 42;
rl::atomic_thread_fence(rl::memory_order_release);
std::string* p2 = new std::string("Bye");
ptr($) = p;

it is totally legal for the compiler to choose to reuse your pointer

std::string* p  = new std::string("Hello");
data($) = 42;
rl::atomic_thread_fence(rl::memory_order_release);
ptr($) = new std::string("Bye");
std::string* p2 = ptr($);
ptr($) = p;

Why would it do so? I don't know; perhaps some exotic trick to keep a cache line or something. The point is that, since ptr is not atomic in case 1, there is a data race between the write on line 'ptr($) = p' and the read on 'std::string* p2 = ptr($)', yielding undefined behavior. In this simple test case, the compiler may not choose to exercise this right, and it may be safe, but in more complicated cases the compiler has the right to abuse ptr however it pleases, and Relacy catches this.

My favorite article on the topic: http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong

Cogen answered 3/9, 2013 at 5:18 Comment(0)

The race in the first example is between the publication of the pointer, and the stuff that it points to. The reason is, that you have the creation and initialization of the pointer after the fence (= on the same side as the publication of the pointer):

int* ptr;    //noop
std::atomic_thread_fence(std::memory_order_release);    //fence between noop and interesting stuff
ptr = new int(-4);    //object creation, initialization, and publication

If we assume that CPU accesses to properly aligned pointers are atomic, the code can be corrected by writing this:

int* ptr;    //noop
int* newPtr = new int(-4);    //object creation & initialization
std::atomic_thread_fence(std::memory_order_release);    //fence between initialization and publication
ptr = newPtr;    //publication

Note that even though this may work fine on many machines, there is absolutely no guarantee within the C++ standard on the atomicity of the last line. So better use atomic<> variables in the first place.

Visitant answered 13/4, 2017 at 11:22 Comment(0)
