Questioning validity of PowerPC barriers in GCC-generated atomics

Asked 14/9, 2018 at 2:57 Answered 14/9, 2018 at 2:57

GCC implements __sync_val_compare_and_swap on PowerPC[64] as:

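    sync                # full barrier before the operation
1:  lwarx 9,0,3         # load word at address r3 into r9 and set a reservation
    cmpw 0,9,4          # compare the loaded value with the expected value (r4)
    bne 0,2f            # mismatch: skip the store
    stwcx. 5,0,3        # store the new value (r5) if the reservation still holds
    bne 0,1b            # reservation lost: retry
2:  isync               # instruction synchronize (the barrier in question)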
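For reference, a minimal C sketch of a call that would be expected to compile to a sequence like the one above. The wrapper name `cas_word` is hypothetical, and the mapping of arguments to r3, r4 and r5 assumes the usual PowerPC calling convention.

```c
#include <stdint.h>

/* Hypothetical wrapper: if *p == expected, atomically store newval into *p.
   Returns the value that *p held before the operation, whether or not
   the store took place. */
int32_t cas_word(int32_t *p, int32_t expected, int32_t newval)
{
    return __sync_val_compare_and_swap(p, expected, newval);
}
```

The swap succeeded exactly when the returned value equals `expected`.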

GCC documents for the __sync_* builtins:

In most cases, these builtins are considered a full barrier. That is, no memory operand will be moved across the operation, either forward or backward. Further, instructions will be issued as necessary to prevent the processor from speculating loads across the operation and from queuing stores after the operation.

However, the use of isync rather than sync at the end bothers me. Is this actually a full barrier? Specifically:

Could loads performed after the __sync_val_compare_and_swap fail to see stores performed before the store that produced the value __sync_val_compare_and_swap loaded?
Could stores performed after the __sync_val_compare_and_swap be seen by other threads before they see the value stored by the __sync_val_compare_and_swap?
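If the trailing isync is the worry, one conservative option is to follow the CAS with an explicit fence. This is only a sketch: the function name `cas_then_full_fence` is hypothetical, and whether the extra fence is actually required is exactly what the question asks. `__sync_synchronize` is documented by GCC as issuing a full memory barrier.

```c
#include <stdint.h>

/* Hypothetical helper: CAS followed by an explicit full barrier, so that
   ordering of later loads and stores does not rely on the barrier GCC
   places after the CAS itself. */
int32_t cas_then_full_fence(int32_t *p, int32_t expected, int32_t newval)
{
    int32_t old = __sync_val_compare_and_swap(p, expected, newval);
    __sync_synchronize();  /* explicit full memory barrier */
    return old;
}
```

The cost is an additional heavyweight barrier on every call, so this trades performance for not having to reason about isync.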

Hectorhecuba answered 14/9, 2018 at 2:57 Comment(27)

If using GCC >= 4.7, the __atomic_* builtins are preferred, as they let you choose the C11/C++11 memory order (consume, acquire, release, acquire-release, or sequentially consistent) – Gastrostomy 14/9, 2018 at 3:2
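A minimal sketch of the `__atomic_*` form of the same operation, with the memory orders spelled out. The function name `cas_acq_rel` and the particular orders chosen are illustrative assumptions, not something taken from the question.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wrapper around the GCC __atomic CAS builtin.
   On failure, *expected is overwritten with the value currently in *p. */
bool cas_acq_rel(int32_t *p, int32_t *expected, int32_t newval)
{
    return __atomic_compare_exchange_n(p, expected, newval,
                                       false,              /* strong CAS */
                                       __ATOMIC_ACQ_REL,   /* order on success */
                                       __ATOMIC_ACQUIRE);  /* order on failure */
}
```

Replacing both orders with `__ATOMIC_SEQ_CST` gives the variant discussed in the following comments.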

@minmaxavg: I'm asking specifically about the __sync one where I want the full-barrier property that's stronger than the C11 memory model and I'm not clear that GCC is actually providing it. – Hectorhecuba 14/9, 2018 at 3:3

The __ATOMIC_SEQ_CST does provide the full barrier property you want. Besides, I'm also looking for the answer to this question since I'm curious about this one too. – Gastrostomy 14/9, 2018 at 3:6

@minmaxavg: It does not produce any difference from the __sync version. – Hectorhecuba 14/9, 2018 at 3:17

Related: Does `isync` prevent Store-Load reordering on CPU PowerPC?. I haven't read it fully. If __ATOMIC_SEQ_CST produces the same asm, then presumably there's some reason. I think seq-cst requires that later loads/stores can't become visible before the store part of the CAS. – Seine 14/9, 2018 at 3:17

@PeterCordes: No, it's the same. I saw that other question but was unsure if it answers mine, since maybe the stwcx. is doing some magic that makes it work. – Hectorhecuba 14/9, 2018 at 3:18

stwcx. is the write part of the read-modify-write primitive on PPC. – Calutron 14/9, 2018 at 3:22

My apologies, I somehow misread your question as simply assuming that __sync provides a full barrier. Yes, it is the same as the __sync* version (and also kind of acts as a fallback). – Gastrostomy 14/9, 2018 at 3:22

@A.Wilcox: yes, but does it have any ordering semantics at all, stronger than relaxed? – Seine 14/9, 2018 at 3:22

it's the set of barriers PowerPC uses for a Seq/Cst read-modify-write. isync prevents speculative execution from accessing earlier operations (acquire), lwsync is used for 'release' guarantees and is replaced by sync in case of a seq/cst operation. – Polyvinyl 14/9, 2018 at 3:24

@LWimsey: But to be a real seq_cst atomic, barriers are generally needed on both sides of it, not just before it. I just tested GCC's __atomic_store with seq_cst for ppc64 and it's totally wrong -- it's only a release barrier (sync;stw). – Hectorhecuba 14/9, 2018 at 3:27

@R.. I think the mistake here is the belief that seq/cst atomic operations act as a full barrier; they do not. The guarantee for an SC atomic store is that it has release semantics, an SC atomic load has acquire semantics and, in addition, SC operations follow a global order wrt each other, but in isolation, SC operations are not full barriers. – Polyvinyl 14/9, 2018 at 3:33

@LWimsey: What part of the spec allows them to only be acquire or release? I thought they had to be ordered with respect to other relaxed-order atomics? – Hectorhecuba 14/9, 2018 at 3:37

FYI, __atomic_store with __ATOMIC_SEQ_CST and __ATOMIC_RELEASE seems to use sync and lwsync, respectively. Sequentially consistent ordering only guarantees a total order of memory operations wrt other __ATOMIC_SEQ_CST operations, not relaxed ones. I think GCC's documentation for __sync is indeed a bit misleading. So was I, who probably needs a cup of coffee after skipping a night's sleep :/ – Gastrostomy 14/9, 2018 at 3:40
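To reproduce the comparison described above, a pair of one-line functions can be compiled for ppc64 (for example with `-O2 -S`) and the emitted barriers inspected. The function names are hypothetical; the sync and lwsync results are what the comments report, not something this sketch verifies.

```c
#include <stdint.h>

/* Hypothetical probe functions: compile each and inspect the asm. */

/* Reported above to emit sync before the store on ppc64. */
void store_seq_cst(int32_t *p, int32_t v)
{
    __atomic_store_n(p, v, __ATOMIC_SEQ_CST);
}

/* Reported above to emit lwsync before the store on ppc64. */
void store_release(int32_t *p, int32_t v)
{
    __atomic_store_n(p, v, __ATOMIC_RELEASE);
}
```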

@R.. My comment was about seq/cst atomic loads (acquire) and stores (release). A seq/cst read-modify-write operation has both acquire and release semantics and therefore, a relaxed operation sequenced before (or after) a seq/cst RMW must be observed by other threads in the same order. The PowerPC barriers in your question enforce that behavior. – Polyvinyl 14/9, 2018 at 4:13
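The ordering claim above can be sketched as a message-passing pattern in which a relaxed store is published by a seq_cst read-modify-write. All names here (`payload`, `flag`, `publish`, `try_consume`) are hypothetical, and the sketch assumes `publish` and `try_consume` run on different threads in real use.

```c
#include <stdint.h>

static int32_t payload;  /* written and read with relaxed atomics */
static int32_t flag;     /* only touched by seq_cst read-modify-writes */

/* Writer: the relaxed store is sequenced before the seq_cst RMW, whose
   release semantics keep the store from becoming visible after it. */
void publish(int32_t v)
{
    __atomic_store_n(&payload, v, __ATOMIC_RELAXED);
    __atomic_fetch_add(&flag, 1, __ATOMIC_SEQ_CST);
}

/* Reader: a seq_cst RMW that adds 0 reads the flag with acquire semantics,
   so once it observes the increment, the payload store is visible too.
   Returns 1 and fills *out if something was published, otherwise 0. */
int try_consume(int32_t *out)
{
    if (__atomic_fetch_add(&flag, 0, __ATOMIC_SEQ_CST) == 0)
        return 0;
    *out = __atomic_load_n(&payload, __ATOMIC_RELAXED);
    return 1;
}
```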

In an unrelated answer, I included some references to the C++ standard. You've used the C tag, but it's my understanding that the memory models for both languages are similar (if not equivalent). – Polyvinyl 14/9, 2018 at 4:13

@LWimsey: Thanks, that's very helpful. Unless I'm misunderstanding something though it looks like the atomic CAS here lacks acquire semantics too, which is a big problem if it's being used to implement a lock where access to non-atomic objects should be synchronized by it, no? – Hectorhecuba 14/9, 2018 at 14:23

(Maybe the release barrier at the previous seq_cst atomic that unlocked it guarantees the acquire semantics here, but if so I don't understand how that works in the hardware memory model.) – Hectorhecuba 14/9, 2018 at 14:43
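The lock scenario raised in the two comments above might look like the following sketch, where the CAS acquires the lock and a seq_cst store releases it. The type and function names are hypothetical, and the sketch takes no position on whether the PowerPC code GCC emits for the CAS provides the acquire ordering the lock needs.

```c
#include <stdint.h>

/* Hypothetical CAS-based spinlock: 0 means unlocked, 1 means locked. */
typedef struct { int32_t locked; } cas_lock;

/* Spin until the CAS changes the word from 0 to 1. For the lock to be
   correct, accesses to protected non-atomic data must not be reordered
   before this CAS (acquire semantics). */
void cas_lock_acquire(cas_lock *l)
{
    while (__sync_val_compare_and_swap(&l->locked, 0, 1) != 0)
        ;  /* busy-wait */
}

/* Unlock with a seq_cst store, i.e. at least release semantics, so
   accesses made inside the critical section are ordered before it. */
void cas_lock_release(cas_lock *l)
{
    __atomic_store_n(&l->locked, 0, __ATOMIC_SEQ_CST);
}
```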

Example POWER Implementation for C/C++ Memory Model. isync is purely an instruction re-ordering prevention barrier and not a memory barrier at all. – Epagoge 21/9, 2018 at 7:41

@Epagoge "isync is purely an instruction re-ordering prevention barrier and not a memory barrier at all" What's the difference? – Piperidine 28/11, 2019 at 0:0

@R.. "But to be a real seq_cst atomic, barriers are generally needed on both sides of it, not just before it." But you only showed one atomic operation. What happens if you put more operations in sequence? Do barriers appear on both side? – Piperidine 28/11, 2019 at 2:0

@curiousguy: You're missing the point that the other operations happen on other cores, not inline with this one or visible to the compiler emitting this asm. What it means for isync to be purely an instruction reordering barrier, not a memory barrier, is that it has no influence on synchronization of memory between cores. – Hectorhecuba 28/11, 2019 at 2:17

@R.. I never realized that memory was explicitly synchronized between cores. – Piperidine 28/11, 2019 at 3:11

@curiousguy: It's necessary whenever you have cache, which is necessary for a computer not to be something like 1000x slower. See en.wikipedia.org/wiki/Cache_coherence – Hectorhecuba 28/11, 2019 at 3:24

@R.. Without a cache, we would program by explicitly addressing many regions of memory, some faster than others. It would be a lot more complicated but not 1000x slower. – Piperidine 28/11, 2019 at 20:47

@curiousguy: Nobody did that back in the days when cache was new, and it's not a viable programming model. It is potentially viable to do it via the MMU (essentially, treat SRAM as the only native memory the MMU maps and DRAM as a block device, using the OS page cache layer in place of hardware cache) but the idea is the same. In any case the 1000x figure is pretty close to accurate. Booting modern Windows with cache disabled takes something like half a day. – Hectorhecuba 28/11, 2019 at 20:52

Let us continue this discussion in chat. – Piperidine 28/11, 2019 at 20:55
