What are the C++11 memory ordering guarantees in this corner case?

I'm writing some lock-free code, and I came up with an interesting pattern, but I'm not sure if it will behave as expected under relaxed memory ordering.

The simplest way to explain it is using an example:

std::atomic<int> a, b, c;

auto a_local = a.load(std::memory_order_relaxed);
auto b_local = b.load(std::memory_order_relaxed);
if (a_local < b_local) {
    auto c_local = c.fetch_add(1, std::memory_order_relaxed);
}

Note that all operations use std::memory_order_relaxed.

Obviously, on the thread that this is executed on, the loads for a and b must be done before the if condition is evaluated.

Similarly, the read-modify-write (RMW) operation on c must be done after the condition is evaluated (because it's conditional on that... condition).

What I want to know is: does this code guarantee that the value of c_local is at least as up-to-date as the values of a_local and b_local? If so, how is this possible given the relaxed memory ordering? Is the control dependency, together with the RMW operation, acting as some sort of acquire fence? (Note that there's not even a corresponding release anywhere.)

If the above holds true, I believe this example should also work (assuming no overflow) -- am I right?

std::atomic<int> a(0), b(0);

// Thread 1
while (true) {
    auto a_local = a.fetch_add(1, std::memory_order_relaxed);
    if (a_local >= 0) {    // Always true at runtime
        b.fetch_add(1, std::memory_order_relaxed);
    }
}

// Thread 2
auto b_local = b.load(std::memory_order_relaxed);
if (b_local < 777) {
    // Note that fetch_add returns the pre-incrementation value
    auto a_local = a.fetch_add(1, std::memory_order_relaxed);
    assert(b_local <= a_local);    // Is this guaranteed?
}

On thread 1, there is a control dependency which I suspect guarantees that a is always incremented before b is incremented (but they each keep being incremented neck-and-neck). On thread 2, there is another control dependency which I suspect guarantees that b is loaded into b_local before a is incremented. I also think that the value returned from fetch_add will be at least as recent as any observed value in b_local, and the assert should therefore hold. But I'm not sure, since this departs significantly from the usual memory-ordering examples, and my understanding of the C++11 memory model is not perfect (I have trouble reasoning about these memory ordering effects with any degree of certainty). Any insights would be appreciated!


Update: As bames53 has helpfully pointed out in the comments, given a sufficiently smart compiler, it's possible that an if could be optimised out entirely under the right circumstances, in which case the relaxed loads could be reordered to occur after the RMW, causing their values to be more up-to-date than the fetch_add return value (the assert could fire in my second example). However, what if instead of an if, an atomic_signal_fence (not atomic_thread_fence) is inserted? That certainly can't be ignored by the compiler no matter what optimizations are done, but does it ensure that the code behaves as expected? Is the CPU allowed to do any re-ordering in such a case?

The second example then becomes:

std::atomic<int> a(0), b(0);

// Thread 1
while (true) {
    auto a_local = a.fetch_add(1, std::memory_order_relaxed);
    std::atomic_signal_fence(std::memory_order_acq_rel);
    b.fetch_add(1, std::memory_order_relaxed);
}

// Thread 2
auto b_local = b.load(std::memory_order_relaxed);
std::atomic_signal_fence(std::memory_order_acq_rel);
// Note that fetch_add returns the pre-incrementation value
auto a_local = a.fetch_add(1, std::memory_order_relaxed);
assert(b_local <= a_local);    // Is this guaranteed?

Another update: After reading all the responses so far and combing through the standard myself, I don't think it can be shown that the code is correct using only the standard. So, can anyone come up with a counter-example of a theoretical system that complies with the standard and also fires the assert?

Fania answered 14/8, 2013 at 4:22 Comment(14)
I think that it must be the nature of the fetch_add operation, and not the control dependency that causes the assertion to be true. I can't find anything that would indicate that a control dependency would cause any additional synchronization beyond what a sequenced-before relationship would.Postliminy
@Vaughn: That makes sense, though still seems non-intuitive to me. Without the control dependency, though, the relaxed ordering could cause the loads to happen after the fetch_add -- so they're synchronizing, I just can't figure out quite in what capacity. Perhaps all sequenced-before relationships would have the same effect here?Fania
It's a really interesting situation. I want to go through the rules carefully and see what the possibilities are. I'm thinking that the if is limiting what kind of reordering the compiler can do in practice, although maybe not in theory. On top of that, it would seem that the fetch_add has to be limiting what values can be seen from the modification order. I think that when you fetch_add, the value that you get back can't be before any modification that happened-before the fetch_add. Which wouldn't be the case with a regular load.Postliminy
For example, I don't think behavior described in the relaxed ordering example given in en.cppreference.com/w/cpp/atomic/memory_order, would be possible if thread 2's load/store was atomic.Postliminy
The more I think about this, the more I believe that the behaviour is correct. But I can't prove it :-) It's certainly contingent on the if not being optimized away, though -- so it may end up being implementation defined (though I doubt any compilers will actually be able to optimize away an if that uses an atomic variable since the value could be changed by any thread -- it would have to analyze the behaviour of the entire program to determine the possible values of the atomic var).Fania
"there is a control dependency which I suspect guarantees that a is always incremented before b is incremented" C++ doesn't guarantee that. I don't see anything that creates any dependencies between the values of a and b such that a compiler smart enough to prove a_local>=0 is always true would be prevented from re-ordering the fetch-add on b to before the fetch-add on a, for example.Packthread
@bames: You're right of course! Thank you. In that case, what if a std::atomic_signal_fence(std::memory_order_seq_cst); was inserted at the top of each if? This would stop the compiler from reordering even if it could optimize the if away, but it would not tell the CPU any additional information -- but can the CPU reorder anything in an unexpected way in that case? Could the assert fire in that case?Fania
@Fania adding the fence would introduce synchronization and I believe would force the assert to hold.Packthread
@bames: I've started a bounty. If you could flesh out your comment into a (standard-backed) answer, I'd be grateful (even if it turns out it's not safe, of course!).Fania
Your phrase "at least as up to date as" is misleading. Since there is no ordering on your atomic operations, there is no sensible notion of "as least as up to date". The operations simply do not have a mutual ordering.Sanctus
@Kerrek: After reading the other responses, and combing through the standard myself, I'm forced to conclude that you are indeed correct. The atomic variables have independent modification orders. However, due to the nature of read-modify-write operations (always operating on the latest value), it seems natural to me that a mutual ordering could emerge -- and so I believe the question is still valid, though I no longer think my code can be shown correct using the standard. Perhaps a better question would be to find a theoretical system that both complies with the standard and fires the assert!Fania
In general, any reordering a compiler might do, so might a CPU. Specifically, if there is no data dependency between the two increments, and no memory barriers, then a CPU might choose to perform the increments in either order. (More realistically, the increments might become visible to other CPUs in either order depending on how the caches happen to behave. "Relaxed" means just that: You do not know in what order the operations will become visible to other threads.) Seems academic though... If the standard does not guarantee your program works, then your program is wrong, period.Priggery
I think you are looking for things that simply 1) aren't there 2) can't even exist in the current setting. The std simply doesn't do "control dependency", "data dependency" or even "write after conditional". They simply have no special semantic, you only have the semantic of the memory order; and relaxed has almost empty semantics so you got nothing although compilers in practice respect logic and progress of computations.Postwar
@Priggery "If the standard does not guarantee your program works, then your program is wrong, period" That's your subjective opinion.Postwar

This example gets at a variation of reads-from-thin-air behavior. The relevant discussion in the spec is in section 29.3p9-11. Since the current version of the C11 standard doesn't guarantee that dependencies are respected, the memory model has to allow the assertion to fire. The most likely situation is that the compiler optimizes away the check that a_local >= 0. But even if you replace that check with a signal fence, CPUs would be free to reorder those instructions. You can test such code examples under the C/C++11 memory models using the open-source CDSChecker tool. The interesting issue with your example is that for an execution to violate the assertion, there has to be a cycle of dependencies. More concretely:

The b.fetch_add in thread 1 depends on the a.fetch_add in the same loop iteration, due to the if condition. The a.fetch_add in thread 2 depends on b.load. For an assertion violation, T2's b.load has to read from a b.fetch_add in a later loop iteration than T2's a.fetch_add. Now consider the b.fetch_add that the b.load reads from, and call it # for future reference. We know that b.load depends on #, as it takes its value from #.

We know that # must depend on T2's a.fetch_add, as T2's a.fetch_add atomically reads and updates a prior a.fetch_add from T1 in the same loop iteration as #. So we know that # depends on the a.fetch_add in thread 2. That gives us a cycle of dependencies, which is plain weird, but allowed by the C/C++ memory model. The most likely way of actually producing that cycle is that the compiler figures out that a_local is always non-negative, breaking the dependence; it can then unroll the loop and reorder T1's fetch_adds however it wants.

Pytlik answered 2/9, 2013 at 5:18 Comment(5)
Thanks for the reference to CDSChecker, I'll check it out. But with regards to your statement that "CPUs would be free to reorder those instructions", don't they have to follow the as-if rule? (Or, on further thought, does the "relaxed" ordering give them that permission?)Fania
The signal fence just constrains single-thread behavior, so the compiler can't reorder the instructions, but the processor is free to do so. The relevant restrictions on optimizations for multi-threaded code are given by the C/C++11 memory model.Pytlik
@Pytlik Which CPU can do this type of reordering?Postwar
"the memory model should allow the assertion to be fired" The model is unsound and doesn't allow one to prove correct any program that destroys a variable. You have the same issue for all non-atomic actions -- unless you somehow assume a different ordering semantics for atomic and non-atomic actions, which isn't just without foundation in the standard's text, but also gratuitous. Why would non-atomic memory operations be safer than atomic operations?Postwar
Non-atomics have different ordering semantics in C/C++11. In the absence of a data race on some SC execution for non-atomics, non-atomics have SC semantics. That is very different from atomics semantics.Pytlik

Signal fences don't provide the necessary guarantees (well, not unless 'thread 2' is a signal handler that actually runs on 'thread 1').

To guarantee correct behavior we need synchronization between threads, and the fence that does that is std::atomic_thread_fence.


Let's label the statements so we can diagram various executions (with thread fences replacing signal fences, as required):

while (true) {
    auto a_local = a.fetch_add(1, std::memory_order_relaxed); // A
    std::atomic_thread_fence(std::memory_order_acq_rel);      // B
    b.fetch_add(1, std::memory_order_relaxed);                // C
}


auto b_local = b.load(std::memory_order_relaxed);             // X
std::atomic_thread_fence(std::memory_order_acq_rel);          // Y
auto a_local = a.fetch_add(1, std::memory_order_relaxed);     // Z


So first let's assume that X loads a value written by C. The following paragraph specifies that in that case the fences synchronize and a happens-before relationship is established.

29.8/2:

A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.

And here's a possible execution order where the arrows are happens-before relations.

Thread 1: A₁ → B₁ → C₁ → A₂ → B₂ → C₂ → ...
                ↘ (B₁ synchronizes with Y)
Thread 2:    X → Y → Z

If a side effect X on an atomic object M happens before a value computation B of M, then the evaluation B shall take its value from X or from a side effect Y that follows X in the modification order of M. — [C++11 1.10/18]

So the read performed by the RMW at Z must take its value from A₁ or from a modification later in a's modification order. The assert therefore holds, because the value written by A₁ and by every later modification is greater than or equal to the value written by C₁ (and read by X).


Now let's look at the case where the fences do not synchronize. This happens when the load of b does not read a value written by thread 1, but instead reads the value that b was initialized with. There's still synchronization where the thread starts, though:

30.3.1.2/5

Synchronization: The completion of the invocation of the constructor synchronizes with the beginning of the invocation of the copy of f.

This specifies the behavior of std::thread's constructor. So (assuming the thread creation is correctly sequenced after the initialization of a) the value read by Z must come from the initialization of a or from one of the subsequent modifications on thread 1, which means that the assertion still holds.

Packthread answered 25/8, 2013 at 15:45 Comment(6)
Ah, brilliant. The happens-before diagram helps a lot. Are you sure that the fences can still synchronize following the definition in the standard when they're mere signal fences and not thread fences, though? Because I thought that in this case all they did was enforce sequencing for each thread separately without influencing inter-thread guarantees.Fania
@Fania Oh, I didn't notice before that they were signal fences. Signal fences aren't sufficient to get the guarantee you want under the standard.Packthread
On hardware that enforces a relatively strong memory model, signal fences might generate code that is guaranteed to behave as desired, but the spec is written so that hardware which provides a relatively weak memory model is not forced to bear the expense of stronger guarantees when it's not necessary.Packthread
That is to say, if your implementation does work with these weaker fences due to the hardware's memory model, then the code will be just as slow as if you wrote portable code using the stronger fences. The benefits of relaxed memory ordering, signal fences, all the 'weaker' commands, all only apply to hardware that's capable of enforcing those weaker requirements with less overhead than they can enforce stricter requirements like fully sequentially consistent memory ordering.Packthread
I agree (regarding portable code and efficiency). I've also come to the conclusion that, as you say, the asserted condition is not guaranteed according purely to the standard's notes on the memory model. However, I still cannot construct a case where the assert would fail. So, I'm going to award you the bounty, but until I or someone else can show definitively that the assert can fail under a specific scenario, or that the standard guarantees the code's correctness after all (unlikely, I admit), I'm going to leave the question without an officially "chosen" answer. Thanks for your help! :-)Fania
@Fania Even under the execute-recompile model where you start the execution, propagate variable values, optimize expression, simplify always taken conditionals, simplify redundant code, do some other execution step, etc. (that is constant on the fly optimization based on the current execution) this program is probably guaranteed to work OK based on the weakest requirements for sane executions.Postwar

After reading all the responses so far and combing through the standard myself, I don't think it can be shown that the code is correct using only the standard.

And unless you admit that non-atomic operations are magically safer and more ordered than relaxed atomic operations (which is silly), and that there is one semantics for C++ without atomics (and try_lock and shared_ptr::count) and another semantics for those features that don't execute sequentially, you also have to admit that no program at all can be proven correct, as non-atomic operations don't have an "ordering" and they are needed to construct and destroy variables.

Or, you stop taking the standard text as the only word on the language and use some common sense, which is always recommended.

Postwar answered 26/11, 2019 at 23:18 Comment(0)