Memory order consume usage in C11

I read about the "carries a dependency" relation and the "dependency-ordered before" relation, which uses it in its definition, 5.1.2.4(p16):

An evaluation A is dependency-ordered before an evaluation B if:

— A performs a release operation on an atomic object M, and, in another thread, B performs a consume operation on M and reads a value written by any side effect in the release sequence headed by A, or

— for some evaluation X, A is dependency-ordered before X and X carries a dependency to B.

So I tried to craft an example where it might be useful. Here it is:

#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

static _Atomic int i;

void *produce(void *ptr){
    int int_value = *((int *) ptr);
    atomic_store_explicit(&i, int_value, memory_order_release);
    return NULL;
}

void *consume(void *ignored){
    int int_value = atomic_load_explicit(&i, memory_order_consume);
    int new_int_value = int_value + 42;
    printf("Consumed = %d\n", new_int_value);
    return NULL;
}

int main(int argc, const char *argv[]){
    int int_value = 123123;
    pthread_t t2;
    pthread_create(&t2, NULL, &produce, &int_value);

    pthread_t t1;
    pthread_create(&t1, NULL, &consume, NULL);

    sleep(1000);
}

In the function void *consume(void *), int_value carries a dependency to new_int_value, so if atomic_load_explicit(&i, memory_order_consume); reads a value written by some atomic_store_explicit(&i, int_value, memory_order_release);, then that release store is dependency-ordered before the computation of new_int_value.

But what useful things can the dependency-ordered-before give us?

I currently think that memory_order_consume may well be replaced with memory_order_acquire without causing any data race...

Syzygy answered 18/4, 2019 at 7:30 Comment(0)

consume is cheaper than acquire. All CPUs (except DEC Alpha AXP's famously weak memory model¹) do it for free, unlike acquire. (The exception for acquire is x86 and SPARC-TSO, where the hardware gives acq/rel memory ordering without extra barriers or special instructions.)

On ARM/AArch64/PowerPC/MIPS/etc weakly-ordered ISAs, consume and relaxed are the only orderings that don't require any extra barriers, just ordinary cheap load instructions. i.e. all asm load instructions are (at least) consume loads, except on Alpha. acquire requires LoadStore and LoadLoad ordering, which is a cheaper barrier instruction than a full-barrier for seq_cst, but still more expensive than nothing.

mo_consume is like acquire only for loads with a data dependency on the consume load. e.g. float *array = atomic_ld(&shared, mo_consume);, then access to any array[i] is safe if the producer stored the buffer and then used a mo_release store to write the pointer to the shared variable. But independent loads/stores don't have to wait for the consume load to complete, and can happen before it even if they appear later in program order. So consume only orders the bare minimum, not affecting other loads or stores.
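For concreteness, here is a minimal C11 sketch of that pointer-publication pattern (the names shared_buf, publish and use are made up for illustration; it's the idea, not a drop-in recipe):

#include <stdatomic.h>
#include <stdlib.h>
#include <stdio.h>

struct data { int payload[16]; };

static struct data *_Atomic shared_buf;    /* hypothetical shared variable */

/* Producer: fill the buffer, then publish the pointer with a release store. */
void publish(void){
    struct data *d = malloc(sizeof *d);
    for (int k = 0; k < 16; k++)
        d->payload[k] = k;
    atomic_store_explicit(&shared_buf, d, memory_order_release);
}

/* Consumer: d->payload[3] carries a data dependency on d, so a consume
 * load is enough to see the producer's initialization of the buffer.
 * Unrelated loads/stores in this function are not ordered by it. */
void use(void){
    struct data *d = atomic_load_explicit(&shared_buf, memory_order_consume);
    if (d)    /* may not be published yet */
        printf("%d\n", d->payload[3]);
}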


(It's basically free to implement support for consume semantics in hardware for most CPU designs, because OoO exec can't break true dependencies, and a load has a data dependency on the pointer, so loading a pointer and then dereferencing it inherently orders those 2 loads just by the nature of causality. Unless CPUs do value prediction or something crazy. Value prediction is like branch prediction, but it guesses what value is going to be loaded instead of which way a branch is going to go.

Alpha had to do some crazy stuff to make CPUs that could actually load data from before the pointer value was truly loaded, when the stores were done in order with sufficient barriers.

Unlike for stores, where the store buffer can introduce reordering between store execution and commit to L1d cache, loads become "visible" by taking data from L1d cache when they execute, not when they retire and eventually commit. So ordering 2 loads wrt. each other really does just mean executing those 2 loads in order. With a data dependency of one on the other, causality requires that on CPUs without value prediction, and on most architectures the ISA rules do specifically require that. So you don't have to use a barrier between loading + using a pointer in asm, e.g. for traversing a linked list.)

See also Dependent loads reordering in CPU


But current compilers just give up and strengthen consume to acquire

... instead of trying to map C dependencies to asm data dependencies (without accidentally weakening a data dependency into only a control dependency, which branch prediction + speculative execution could bypass). Apparently it's a hard problem for compilers to keep track of dependencies and make the mapping safe.

It's non-trivial to map C to asm, because if the dependency is only in the form of a conditional branch, the asm rules don't apply. So it's hard to define C rules for mo_consume that propagate dependencies only in ways that line up with what "carries a dependency" means in terms of asm ISA rules. An illustration of the data-vs-control distinction is sketched below.
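
A hedged illustration (hypothetical names shared_ptr and known, not from the question or the answer's examples):

#include <stdatomic.h>

extern int known[4];                  /* some unrelated array (assumed) */
extern int *_Atomic shared_ptr;       /* pointer published by another thread (assumed) */

int via_data_dependency(void){
    /* assumes the pointer has already been published (non-NULL) */
    int *p = atomic_load_explicit(&shared_ptr, memory_order_consume);
    /* The address of this load is computed from p, so it carries a
     * dependency and is ordered after the consume load. */
    return p[0];
}

int via_control_dependency(void){
    int *p = atomic_load_explicit(&shared_ptr, memory_order_consume);
    /* Only a branch depends on p; the load of known[0] has no data
     * dependency, so the compiler and speculative execution are free
     * to perform it before the consume load completes. */
    if (p != NULL)
        return known[0];
    return -1;
}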

So yes, you're correct that consume can be safely replaced with acquire, but you're totally missing the point.


ISAs with weak memory-ordering rules do have rules about which instructions carry a dependency. So even an instruction like ARM eor r0, r0, r0, which unconditionally zeroes r0, is architecturally required to still carry a data dependency on the old value, unlike x86 where the xor eax,eax idiom is specially recognized as dependency-breaking².

See also http://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/

I also mentioned mo_consume in an answer on Atomic operations, std::atomic<> and ordering of writes.


Footnote 1: The few Alpha models that actually could in theory "violate causality" didn't do value prediction; there was a different mechanism with their banked cache. I think I've seen a more detailed explanation of how it was possible, but Linus's comments about how rare it actually was are interesting.

Linus Torvalds (Linux lead developer), in a RealWorldTech forum thread

I wonder, did you see non-causality on Alpha by yourself or just in the manual?

I never saw it myself, and I don't think any of the models I ever had access to actually did it. Which actually made the (slow) RMB instruction extra annoying, because it was just pure downside.

Even on CPU's that actually could re-order the loads, it was apparently basically impossible to hit in practice. Which is actually pretty nasty. It result in "oops, I forgot a barrier, but everything worked fine for a decade, with three odd reports of 'that can't happen' bugs from the field" kinds of things. Figuring out what's going on is just painful as hell.

Which models actually had it? And how exactly they got here?

I think it was the 21264, and I have this dim memory of it being due to a partitioned cache: even if the originating CPU did two writes in order (with a wmb in between), the reading CPU might end up having the first write delayed (because the cache partition that it went into was busy with other updates), and would read the second write first. If that second write was the address to the first one, it could then follow that pointer, and without a read barrier to synchronize the cache partitions, it could see the old stale value.

But note the "dim memory". I may have confused it with something else. I haven't actually used an alpha in closer to two decades by now. You can get very similar effects from value prediction, but I don't think any alpha microarchitecture ever did that.

Anyway, there definitely were versions of the alpha that could do this, and it wasn't just purely theoretical.

(RMB = Read Memory Barrier asm instruction, and/or the name of Linux kernel function rmb() that wraps whatever inline asm is necessary to make that happen. e.g. on x86, just a barrier to compile-time reordering, asm("":::"memory"). I think modern Linux manages to avoid an acquire barrier when only a data dependency is needed, unlike C11/C++11, but I forget. Linux is only portable to a few compilers, and those compilers do take care to support what Linux depends on, so they have an easier time than the ISO C11 standard in cooking up something that works in practice on real ISAs.)

See also https://lkml.org/lkml/2012/2/1/521 re: Linux's smp_read_barrier_depends() which is necessary in Linux only because of Alpha. (But a reply from Hans Boehm points out that "compilers can, and sometimes do, remove dependencies", which is why C11 memory_order_consume support needs to be so elaborate to avoid risk of breakage. Thus smp_read_barrier_depends is potentially brittle.)
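
A rough sketch of the pattern being discussed, not actual kernel code (the variable published is hypothetical, and the READ_ONCE-style access is approximated with a volatile cast):

struct node { int value; };

/* set by a writer after initializing *node, with wmb/release ordering */
extern struct node *published;

int read_dependent(void){
    /* volatile read stops the compiler from caching or reordering the pointer load */
    struct node *p = *(struct node * volatile *)&published;
    if (!p)
        return -1;
    /* On Alpha, a read barrier (what Linux calls smp_read_barrier_depends())
     * would be needed here; elsewhere the data dependency on p is enough to
     * order the next load after the pointer load. */
    return p->value;               /* dependent load */
}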


Footnote 2: x86 orders all loads whether they carry a data dependency on the pointer or not, so it doesn't need to preserve "false" dependencies, and with a variable-length instruction set it actually saves code size to use xor eax,eax (2 bytes) instead of mov eax,0 (5 bytes).

So xor reg,reg became the standard idiom since early 8086 days, and now it's recognized and actually handled like mov, with no dependency on the old value or RAX. (And in fact more efficiently than mov reg,0 beyond just code-size: What is the best way to set a register to zero in x86 assembly: xor, mov or and?)

But this is impossible on ARM and most other weakly ordered ISAs; as I said, they're literally not allowed to break the dependency this way.

ldr r3, [something]       ; load r3 = mem
eor r0, r3, r3            ; r0 = r3^r3 = 0
ldr r4, [r1, r0]          ; load r4 = mem[r1+r0].  Ordered after the other load

is required to inject a dependency on r0 and order the load of r4 after the load of r3, even though the load address r1+r0 is always just r1 because r3^r3 = 0. But only that load, not all other later loads; it's not an acquire barrier or an acquire load.

Pith answered 18/4, 2019 at 8:22 Comment(11)
So to summarize, mo_consume guarantees us that 2 subsequent dependent loads are not reordered. On x86 (unlike Alpha) we have it for free as specified at 8.2.3.2 Neither Loads Nor Stores Are Reordered with Like Operations.Syzygy
@SomeName: On x86 we have acquire for free, consume is not interesting or special. But on ARM/AArch64/PowerPC/MIPS/etc, we only have consume (and relaxed) for free, anything more requires at least cheap barriers. (Not as expensive as the StoreLoad full barriers required for seq_cst, though). x86's TSO memory model is seq_cst + a store buffer (with store forwarding), so stores can only get later, and everything else is ordered.Pith
@SomeName: Also, not sure what you mean by "2 subsequent dependent loads". Any number of loads with a data dependency on the consume load will respect causality. e.g. float *array = atomic_load(&shared, mo_consume); lets you loop over array[i] and see data that another thread stored before storing the pointer to shared. Or a struct pointer that you use with ptr->a, ptr->b, etc. But if you mean chaining loads (like a linked list), then no, every step in the chain has to be a consume load, not every other. (I haven't studied the wording in the C standard carefully for consume.)Pith
So the crucial part of mo_consume is that it does not guarantee visibility of ordered actions performed by the same thread, 5.1.2.4(p15): A performs a release operation on an atomic object M, and, in another thread, B performs a consume operation on M and reads a value written by any side effect in the release sequence headed by A.Syzygy
@SomeName: right, consume is like acquire only for loads with a data dependency on the consume load. Independent loads/stores can still happen before it, without waiting for the consume load to complete.Pith
I re-read some parts of the 5.1.2.4 and I see the following formal explanation of the difference between release - acquire/release - consume: release synchronizes with acquire, but release does not synchronize with consume. As per the first bullet of the inter-thread happens before definition, A synchronizes with X and X is sequenced before B, we can claim that all actions sequenced before the release in the releasing thread (even to unrelated memory locations) are visible to all actions after the acquire, which cannot be applied to consume due to the absence of synchronizes with.Syzygy
@SomeName: I haven't actually read (recently) the language in the standard that defines consume, I just know what it's supposed to be for in terms of exposing that useful asm behaviour. But yes, that makes sense, synchronize-with is technical language that implies everything after this point is after everything before the release, and the whole point of consume is not stalling/blocking other operations to wait for that to happen.Pith
"which is why C11 memory_order_consume support needs to be so elaborate" so "elaborate" that you can reliably get desired effect by generously using volatile unlike unreliable consume implementation? volatile int index = consume_get(); T *volatile p = addr; T r = p[index-index]; gets you a real dependent zero on real compilers, because handling of simple cases of volatile values is stable unlike brand new consume semantics which tends to be lost in intermediate language and candidate for arithmetic optimization.Marquismarquisate
@SomeName "as it does not guarantee visibility of ordered actions performed by the same thread" what would "guarantee visibility" of the current thread action even mean?Marquismarquisate
@curiousguy: Possibly, but introducing 2 stores and 3 reloads via volatile could be worse than acquire, especially on AArch64 which has acquire-loads in hardware (with a special load instruction, not a barrier). And obviously much worse on x86 where strengthening consume to acquire is free. Also, I'm not 100% sure that dependency-ordering rules on typical RISC ISAs follow store/reload through memory. I haven't looked into it, maybe it's still fine just by the laws of causality as long as the ISA doesn't allow implementations that do value-prediction for loads.Pith
Or if you only care about real-world implementations of those ISAs, which at this point don't do value-prediction.Pith

memory_order_consume is currently underspecified, and there is some ongoing work to fix it. Currently AFAIK all implementations implicitly promote it to memory_order_acquire.

Morbific answered 18/4, 2019 at 8:05 Comment(3)
It would be interesting to know motivation behind introducing memory_order_consume...Syzygy
I think it was mainly for implementing RCU, see work by Paul McKenney.Morbific
"Underspecified", or fully specified in a useless way?Marquismarquisate
