Questioning validity of PowerPC barriers in GCC-generated atomics

GCC implements __sync_val_compare_and_swap on PowerPC[64] as:

    sync
1:  lwarx 9,0,3
    cmpw 0,9,4
    bne 0,2f
    stwcx. 5,0,3
    bne 0,1b
2:  isync
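
For reference, a minimal sketch of C source that would be expected to produce a sequence like the one above when built for a PowerPC target. The wrapper name `cas32` is hypothetical. Under the usual PowerPC calling convention the arguments would arrive in r3 (pointer), r4 (expected value) and r5 (new value), which is consistent with the register numbers in the listing.

```c
#include <stdint.h>

/* Hypothetical wrapper around the builtin from the question.
 * Returns the value that was in *p before the operation; the store of
 * `desired` happens only if that value equals `expected`. */
uint32_t cas32(uint32_t *p, uint32_t expected, uint32_t desired)
{
    return __sync_val_compare_and_swap(p, expected, desired);
}
```

Compiling this with something like `gcc -O2 -S` on a ppc64 cross toolchain should make it possible to inspect which barriers a given GCC version emits around the lwarx/stwcx. loop.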

GCC documents for the __sync_* builtins:

In most cases, these builtins are considered a full barrier. That is, no memory operand will be moved across the operation, either forward or backward. Further, instructions will be issued as necessary to prevent the processor from speculating loads across the operation and from queuing stores after the operation.

However the use of isync rather than sync at the end is bothering me. Is this actually a full barrier? Or:

  1. Could loads performed after the __sync_val_compare_and_swap fail to see stores performed before the store that produced the value __sync_val_compare_and_swap loaded?

  2. Could stores performed after the __sync_val_compare_and_swap be seen by other threads before they see the value stored by the __sync_val_compare_and_swap?
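
Scenario 2 can be phrased as a small litmus test. The sketch below is only an illustration: the names `flag`, `data`, `writer` and `reader_saw_later_store_first` are made up, and the two functions are meant to run on different threads. If the CAS is a full barrier, the reader should never report seeing the later store without also seeing the CAS's store.

```c
#include <stdint.h>

static uint32_t flag;   /* word updated by the CAS */
static uint32_t data;   /* word stored to after the CAS */

/* Thread A: perform the CAS, then a later relaxed store. */
void writer(void)
{
    __sync_val_compare_and_swap(&flag, 0u, 1u);
    __atomic_store_n(&data, 1u, __ATOMIC_RELAXED);
}

/* Thread B: returns 1 if it observed the later store (data == 1)
 * without observing the value stored by the CAS (flag still 0). */
int reader_saw_later_store_first(void)
{
    uint32_t d = __atomic_load_n(&data, __ATOMIC_RELAXED);
    __sync_synchronize();   /* keep the reader's own two loads in order */
    uint32_t f = __atomic_load_n(&flag, __ATOMIC_RELAXED);
    return d == 1u && f == 0u;
}
```

A real test would run both functions concurrently many times on actual POWER hardware; a single-threaded run cannot exhibit the reordering.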

Hectorhecuba answered 14/9, 2018 at 2:57 Comment(27)
If using GCC >= 4.7, the __atomic_* builtins are preferred, as they let you choose a C11/C++11 memory order (consume, acquire, release, acquire-release, or sequentially consistent).Gastrostomy
@minmaxavg: I'm asking specifically about the __sync one where I want the full-barrier property that's stronger than the C11 memory model and I'm not clear that GCC is actually providing it.Hectorhecuba
The __ATOMIC_SEQ_CST does provide the full barrier property you want. Besides, I'm also looking for the answer to this question since I'm curious about this one too.Gastrostomy
@minmaxavg: It does not produce any difference from the __sync version.Hectorhecuba
Related: Does `isync` prevent Store-Load reordering on CPU PowerPC?. I haven't read it fully. If __ATOMIC_SEQ_CST produces the same asm, then presumably there's some reason. I think seq-cst requires that later loads/stores can't become visible before the store part of the CAS.Seine
@PeterCordes: No, it's the same. I saw that other question but was unsure if it answers mine, since maybe the stwcx. is doing some magic that makes it work.Hectorhecuba
stwcx. is the write part of the read-modify-write primitive on PPC.Calutron
My apologies, I've somehow mistaken that you assumed __sync to provide a full barrier. Yes, it is the same as the __sync* version (and also kinda acts as a fallback). *editGastrostomy
@A.Wilcox: yes, but does it have any ordering semantics at all, stronger than relaxed?Seine
it's the set of barriers PowerPC uses for a Seq/Cst read-modify-write. isync prevents speculative execution from accessing earlier operations (acquire), lwsync is used for 'release' guarantees and is replaced by sync in case of a seq/cst operation.Polyvinyl
@LWimsey: But to be a real seq_cst atomic, barriers are generally needed on both sides of it, not just before it. I just tested GCC's __atomic_store with seq_cst for ppc64 and it's totally wrong -- it's only a release barrier (sync;stw).Hectorhecuba
@R.. I think the mistake here is the belief that seq/cst atomic operations act as a full barrier; they do not. The guarantee for an SC atomic store is that it has release semantics, an SC atomic load has acquire semantics and, in addition, SC operations follow a global order wrt each other, but in isolation, SC operations are not full barriers.Polyvinyl
@LWimsey: What part of the spec allows them to only be acquire or release? I thought they had to be ordered with respect to other relaxed-order atomics?Hectorhecuba
FYI __atomic_store with __ATOMIC_SEQ_CST and __ATOMIC_RELEASE seems to use sync and lwsync, respectively. Sequentially consistent ordering only guarantees a total order of memory operations wrt other __ATOMIC_SEQ_CST operations, not relaxed ones. I think GCC's documentation for __sync is indeed a bit misleading. So was I, who probably does need a cup of coffee after skipping a night's sleep :/Gastrostomy
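The store variants discussed in these comments can be compared side by side. This is a rough sketch with hypothetical function names; per the comments, the release and seq_cst stores are reported to use lwsync and sync respectively before the stw on ppc64. `store_fenced` is not something GCC emits for a seq_cst store; it is an assumed way to ask explicitly for a barrier on both sides of the store.

```c
/* Release store: reported in the thread to compile to lwsync; stw. */
void store_release(int *p, int v)
{
    __atomic_store_n(p, v, __ATOMIC_RELEASE);
}

/* Sequentially consistent store: reported to compile to sync; stw. */
void store_seq_cst(int *p, int v)
{
    __atomic_store_n(p, v, __ATOMIC_SEQ_CST);
}

/* Assumed alternative: a relaxed store with a seq_cst fence before and
 * after it, for code that wants a barrier on both sides of the store. */
void store_fenced(int *p, int v)
{
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
    __atomic_store_n(p, v, __ATOMIC_RELAXED);
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
}
```

Inspecting the assembly of all three for the target in question is the only way to confirm what a particular compiler version actually generates.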
@R.. My comment was about seq/cst atomic loads (acquire) and stores (release). A seq/cst read-modify-write operation has both acquire and release semantics and therefore, a relaxed operation sequenced before (or after) a seq/cst RMW must be observed by other threads in the same order. The PowerPC barriers in your question enforce that behavior.Polyvinyl
In an unrelated answer, I included some references to the C++ standard. You've used the C tag, but it's my understanding that the memory models for both languages are similar (if not equivalent).Polyvinyl
@LWimsey: Thanks, that's very helpful. Unless I'm misunderstanding something though it looks like the atomic CAS here lacks acquire semantics too, which is a big problem if it's being used to implement a lock where access to non-atomic objects should be synchronized by it, no?Hectorhecuba
(Maybe the release barrier at the previous seq_cst atomic that unlocked it guarantees the acquire semantics here, but if so I don't understand how that works in the hardware memory model.)Hectorhecuba
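The lock use case raised here might look like the following minimal sketch. All names (`lock_word`, `protected_counter`, `spin_trylock`, and so on) are hypothetical. The point is that the CAS in `spin_trylock` has to act as an acquire operation for the non-atomic `protected_counter` to be safely accessed inside the critical section.

```c
static int lock_word;           /* 0 = unlocked, 1 = locked */
static int protected_counter;   /* non-atomic data guarded by the lock */

/* Returns 1 if the lock was taken, 0 if it was already held. */
int spin_trylock(void)
{
    return __sync_val_compare_and_swap(&lock_word, 0, 1) == 0;
}

/* Spin until the CAS succeeds. */
void spin_lock(void)
{
    while (!spin_trylock()) {
        /* busy-wait */
    }
}

/* __sync_lock_release stores 0 with release semantics. */
void spin_unlock(void)
{
    __sync_lock_release(&lock_word);
}

/* Critical section: relies on the CAS ordering later accesses after it. */
int locked_increment(void)
{
    spin_lock();
    int v = ++protected_counter;
    spin_unlock();
    return v;
}
```

If the trailing isync after the CAS did not provide acquire ordering, the load of `protected_counter` could in principle be satisfied before the lock was actually obtained, which is the concern in the comment above.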
Example POWER Implementation for C/C++ Memory Model. isync is purely an instruction re-ordering prevention barrier and not a memory barrier at all.Epagoge
@Epagoge "isync is purely an instruction re-ordering prevention barrier and not a memory barrier at all" What's the difference?Piperidine
@R.. "But to be a real seq_cst atomic, barriers are generally needed on both sides of it, not just before it." But you only showed one atomic operation. What happens if you put more operations in sequence? Do barriers appear on both sides?Piperidine
@curiousguy: You're missing the point that the other operations happen on other cores, not inline with this one or visible to the compiler emitting this asm. What isync being purely an instruction reordering barrier, not a memory barrier, means is that it has no influence on synchronization of memory between cores.Hectorhecuba
@R.. I never realized that memory was explicitly synchronized between cores.Piperidine
@curiousguy: It's necessary whenever you have cache, which is necessary for a computer not to be something like 1000x slower. See en.wikipedia.org/wiki/Cache_coherenceHectorhecuba
@R.. Without a cache, we would program by explicitly addressing many regions of memory, some faster than others. It would be a lot more complicated but not 1000x slower.Piperidine
@curiousguy: Nobody did that back in the days when cache was new, and it's not a viable programming model. It is potentially viable to do it via the MMU (essentially, treat SRAM as the only native memory the MMU maps and DRAM as a block device, using the OS page cache layer in place of hardware cache) but the idea is the same. In any case the 1000x figure is pretty close to accurate. Booting modern Windows with cache disabled takes something like half a day.Hectorhecuba
