The strong-ness of x86 store instruction wrt. SC-DRF?
Asked Answered
F

1

2

I read about Herb's atomic<> Weapons talk and had a question about page 42:

He mentioned that (50:00 in the video):

(x86) stores are much stronger than they need to be...

What I don't understand is: if the x86 "S" on the chart is a plain store, i.e. mov, I don't think it's stronger than SC-DRF because it's only a release store plus total store order (and that's why you need an xchg for a SC store). But if it means an SC store, i.e. xchg, it should fall on the "fully SC" bar because it's effectively a full barrier. How should I take this x86 "S"'s strong-ness on the chart?

(SC-DRF is a guarantee of Sequentially Consistent execution for Data Race Free programs, as long as they don't use any atomics with orders weaker than std::memory_order_seq_cst. ISO C++ and Java, and other languages, provide this.)

Frambesia answered 6/12, 2021 at 17:43 Comment(2)
What is SC-DRF?Stagg
@ThomasMatthews it stands for "Sequential Consistency for Data-Race-Free program". You can watch Herb's "atomic<> Weapons" talk for more information.Frambesia
A
3

Yes, he's showing xchg there (full barrier and an RMW operation), not just a mov store - a plain mov would be below the SC-DRF bar because it doesn't provide sequential consistency on its own without mfence or other barrier.

Compare ARM64 stlr / ldar - they can't reorder with each other (not even StoreLoad), but stlr can reorder with other later operations, except of course other release-store operations, or some fences. (Like I mentioned in answer to your previous question). See also Does STLR(B) provide sequential consistency on ARM64? re: interaction with ldar for SC vs. ldapr for just acquire / release or acq_rel. Also Possible orderings with memory_order_seq_cst and memory_order_release for another example of how AArch64 compiles (without ARMv8.3 LDAPR).


But x86 seq_cst stores drain the store buffer on the spot, even if there is no later seq_cst load, store, or RMW in the same thread. This lack of reordering with later non-SC or non-atomic loads/stores is what makes it stronger (and more expensive) than necessary.

Herb Sutter explained this earlier in the video, at around 36:00. He points out xchg is stronger than necessary, not just an SC-release that can one-way reorder with later non-SC operations. "So what we have here, is overkill. Much stronger than is necessary" at 36:30

(Side note: right around 36:00, he mis-spoke: he said "we're not going to use these first 3 guarantees" (that x86 doesn't reorder loads with loads or stores with stores, or stores with older loads). But those guarantees are why SC load can be just a plain mov. Same for acq/rel being just plain mov for both load and store. That's why as he says, lfence and sfence are irrelevant for std::atomic.)


So anyway, ARM64 can hit the sweet spot with no extra barrier instructions, being exactly strong enough for seq_cst but no stronger. (ARMv8.3 with ldapr is slightly stronger than acq_rel requires, e.g. ARM64 still forbids IRIW reordering, but only a few machines can do that in practice, notably POWER)

Other ISAs with both L and S below the bar need extra barriers as part of their seq_cst load and seq_cst store recipes (https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html).

Abirritant answered 6/12, 2021 at 21:22 Comment(3)
Thanks for the explanation. I think I get the "they are stronger than they need to be" part. My last confusion is that if we were talking about xchg, wouldn't it be on the very top, i.e., fully SC, of the chart?Frambesia
@zanmato: Yeah, I don't know why x86 xchg wouldn't right up there with the fence at the top, fully SC. Good point. It's exactly as strong as what most C++ implementations use for atomic_thread_fence(seq_cst), a lock addl $0, (%rsp). (xchg is equivalent to mfence for everything except weakly-ordered movntdqa loads from WC memory on some microarchitectures. But std:atomic leaves it up to the programmer to manually sfence or mfence after using weakly-ordered stores, and many implementations do use a locked op, not mfence, for atomic_thread_fence).Abirritant
Thank you Peter. Great answer and follow-up, as always!Frambesia

© 2022 - 2025 — McMap. All rights reserved.