Can x86 reorder a narrow store with a wider load that fully contains it?
Intel® 64 and IA-32 Architectures Software Developer’s Manual says:

8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations
The Intel-64 memory-ordering model allows a load to be reordered with an earlier store to a different location. However, loads are not reordered with stores to the same location.

What about loads that partially or fully overlap previous stores, but don't have the same start address? (See the end of this post for a specific case)


Suppose the following C-like code:

// lock - pointer to an aligned int64 variable
// threadNum - integer in the range 0..7
// volatiles here just to show direct r/w of the memory as it was suggested in the comments
int TryLock(volatile INT64* lock, INT64 threadNum)
{
    if (0 != *lock)
        return 0;                           // another thread already had the lock

    ((volatile INT8*)lock)[threadNum] = 1;  // take the lock by setting our byte

    if (1LL << 8*threadNum != *lock)
    {   // another thread set its byte between our 1st and 2nd check.   unset ours
        ((volatile INT8*)lock)[threadNum] = 0;
        return 0;
    }

    return 1;
}

Or its x64 asm equivalent:

; rcx - address of an aligned int64 variable
; rdx - integer in the range 0..7
TryLock PROC
cmp qword ptr [rcx], 0
jne @fail

mov r8, rdx
mov rax, 8
mul rdx

mov byte ptr [rcx+r8], 1

bts rdx, rax
cmp qword ptr [rcx], rdx
jz  @success

mov byte ptr [rcx+r8], 0

@fail:
mov rax, 0
ret

@success:
mov rax, 1
ret

Then suppose that TryLock is concurrently executed in two threads:

INT64 lock = 0;

void Thread_1() {  TryLock(&lock, 1);  }
void Thread_5() {  TryLock(&lock, 5);  }

The question:

The ((INT8*)lock)[1] = 1; and ((INT8*)lock)[5] = 1; stores aren't to the same location as the 64-bit load of lock. However, they are each fully contained by that load, so does that "count" as the same location? It seems implausible that a CPU would reorder the load with a store whose data the load fully contains.

What about ((INT8*)lock)[0] = 1? The address of the store is then the same as the address of the following load. Are these operations "to the same location", even if the earlier case wasn't?

P.S. Please note that the question isn't about the C/asm code; it's about the behaviour of x86 CPUs.

Noguchi answered 6/3, 2016 at 18:18 Comment(8)
That mul is an amusing way to compile 1LL << 8*threadNum. You could have used imul eax, edx, 8 / xor-zero / bts. Or better, what gcc does: lea ecx, [0+rdx*8] / mov edx, 1 / shl rdx, clMorass
Thanks but it's irrelevant to the question.Noguchi
I'm inclined to think that it will not work properly because of 8.2.3.5 Intra-Processor Forwarding Is Allowed: "The memory-ordering model allows concurrent stores by two processors to be seen in different orders by those two processors; specifically, each processor may perceive its own store occurring before that of the other."Noguchi
Looking at the C code, lock is not volatile. AAR, the compiler could load the value into a register, and use the same value for the first 3 lines, completely ignoring what "other threads" might be doing to the memory. As for the asm, it doesn't reset the memory to 0 on failure. These are just my first thoughts, there might be more issues. I'd be seriously thinking about something that does interlocked compare exchange here.Alphonso
Fixed, thanks. Please consider these examples as pseudocode, it's the explanation of the algorithm, it's not a ready to use solution. The question is about x86 concurrency and memory ordering in general.Noguchi
I guess I'm not prepared to answer the question about memory ordering (which is why I'm using comments instead of answer). If you just want this to work, I'd think about something more like: xor r8, r8 ; shl rdx, 3 ; bts r8, rdx ; xor rax, rax ; lock cmpxchg [rcx], r8 ; setz al ; movzx eax, al ; ret. The movzx is needed if you are returning an int. If you can make your return type a byte, it can be omitted.Alphonso
@DavidWohlferd: In some assemblers, xor rax,rax actually assembles to a 3-byte insn with a REX prefix, instead of optimizing to xor eax,eax. You should generally write xor-zeroing with 32bit registers.Morass
1. Everything stored by one thread can be seen by that same thread immediately. 2. You already know that your own flag is set by the logic of the program, so you should not have to check it; you could use: if ( (~(1LL << 8*threadNum) & *lock) == 0 ) 3. You should flush the store buffer after the store ((volatile INT8*)lock)[threadNum] = 1; with MFENCE, because your example requires sequential consistency.Farflung

Can x86 reorder a narrow store with a wider load that fully contains it?

Yes, x86 can reorder a narrow store with a wider load that fully contains it.

That is why your lock algorithm is broken - shared_value doesn't end up equal to 800000:

  1. GCC 6.1.0 x86_64 - link to assembler code: https://godbolt.org/g/ZK9Wql

  2. Clang 3.8.0 x86_64 - link to assembler code: https://godbolt.org/g/qn7XuJ

See the correct example below.


The question:

The ((INT8*)lock)[ 1 ] = 1; and ((INT8*)lock)[ 5 ] = 1; stores aren't to the same location as the 64bit load of lock. However, they are each fully contained by that load, so does that "count" as the same location?

No, it does not count as the same location.

Intel® 64 and IA-32 Architectures Software Developer’s Manual says:

8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations The Intel-64 memory-ordering model allows a load to be reordered with an earlier store to a different location. However, loads are not reordered with stores to the same location.

This is a simplified rule for the case where the STORE and the LOAD are the same size.

But the general rule is that the write to memory is delayed for a time: the STORE (address + value) is queued in the Store Buffer and waits for the cache line to be in the exclusive state (E), i.e. for that line to be invalidated (I) in the caches of the other CPU cores. You can use the MFENCE instruction (or any instruction with a LOCK prefix) to force a wait until the write is done; only after the Store Buffer has been drained and the STORE is visible to all CPU cores can the following instructions proceed.

About the reordering of these two lines:

((volatile INT8*)lock)[threadNum] = 1;  // STORE
if (1LL << 8*threadNum != *lock)        // LOAD
  • If the STORE and LOAD sizes are equal, then the loading CPU core does a store-forwarding lookup in the Store Buffer and sees all the required data - it can get the up-to-date data immediately, before the STORE has completed.

  • If the STORE and LOAD sizes are not equal - STORE (1 byte) and LOAD (8 bytes) - then even if the loading CPU core looks in the Store Buffer it sees only 1/8 of the required data, so it cannot get all the up-to-date data before the STORE has completed. Here the CPU could act in one of two ways:

    1. case-1: the CPU core loads the rest of the data from the cache line, which is in the shared state (S), and overlays the 1 byte from the Store Buffer on top of it, while the STORE still sits in the Store Buffer waiting for the cache line to arrive in the exclusive state (E) so it can be modified - i.e. the CPU core reads data before the STORE has completed. In your example this is a data race (an error): STORE-LOAD is reordered to LOAD-STORE in the globally visible order. - This is exactly what happens on x86_64.

    2. case-2: the CPU core waits until the Store Buffer has been flushed - the STORE has obtained the cache line in the exclusive state (E) and has completed - and then loads all the required data from the cache line. STORE-LOAD isn't reordered in the globally visible order. But this is the same as if you had used MFENCE.

Conclusion: you must use MFENCE after the STORE in either case (see the asm sketch below):

  1. It completely solves the problem in case-1.
  2. It has no effect on behaviour in case-2, and only a small cost in performance: an explicit MFENCE with an empty Store Buffer finishes in a few cycles.
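
In terms of the OP's original MASM-style asm, that minimal change is a single barrier between the narrow store and the wide reload (a sketch, not taken from the Godbolt links; the full compiler-generated equivalent is shown further below):

mov byte ptr [rcx+r8], 1    ; narrow STORE is queued in the Store Buffer
mfence                      ; drain the Store Buffer: our byte is now globally visible
                            ; (an implicitly LOCKed instruction such as xchg would also work)
bts rdx, rax
cmp qword ptr [rcx], rdx    ; wide LOAD now also observes bytes stored by other cores
jz  @success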

The correct example in C and x86_64 asm:

We force the CPU core to act as in case-2 by using MFENCE; consequently there is no StoreLoad reordering.

Note: xchgb always has an implicit LOCK prefix, so it usually is not written in the asm or is only indicated in brackets.

Other compilers and target architectures can be selected manually on the Godbolt links above: PowerPC, ARM, ARM64, MIPS, MIPS64, AVR.

C code - it uses sequential consistency (seq_cst) for the first STORE and the following LOAD:

#ifdef __cplusplus
#include <atomic>
#include <cstdint>
using namespace std;
#else
#include <stdatomic.h>
#include <stdint.h>   // uint64_t, uint8_t
#endif

// lock - pointer to an aligned int64 variable
// threadNum - integer in the range 0..7
// volatiles here just to show direct r/w of the memory as it was suggested in the comments
int TryLock(volatile uint64_t* lock, uint64_t threadNum)
{
  //if (0 != *lock)
  if (0 != atomic_load_explicit((atomic_uint_least64_t*)lock, memory_order_acquire)) 
    return 0;                           // another thread already had the lock

  //((volatile uint8_t*)lock)[threadNum] = 1;  // take the lock by setting our byte
  uint8_t* current_lock = ((uint8_t*)lock) + threadNum;
  atomic_store_explicit((atomic_uint_least8_t*)current_lock, (uint8_t)1, memory_order_seq_cst);

  //if (1LL << 8*threadNum != *lock)
  // You already know that this flag is set and should not have to check it.
  if ( 0 != ( (~(1LL << 8*threadNum)) & 
    atomic_load_explicit((atomic_uint_least64_t*)lock, memory_order_seq_cst) )) 
  {   // another thread set its byte between our 1st and 2nd check.   unset ours

    //((volatile uint8_t*)lock)[threadNum] = 0;
    atomic_store_explicit((atomic_uint_least8_t*)current_lock, (uint8_t)0, memory_order_release);
    return 0;
  }

  return 1;
}

GCC 6.1.0 x86_64 asm code - it uses MFENCE after the first STORE:

TryLock(unsigned long volatile*, unsigned long):
        movq    (%rdi), %rdx
        xorl    %eax, %eax
        testq   %rdx, %rdx
        je      .L7
.L1:
        rep ret
.L7:
        leaq    (%rdi,%rsi), %r8
        leaq    0(,%rsi,8), %rcx
        movq    $-2, %rax
        movb    $1, (%r8)
        rolq    %cl, %rax
        mfence
        movq    (%rdi), %rdi
        movq    %rax, %rdx
        movl    $1, %eax
        testq   %rdi, %rdx
        je      .L1
        movb    $0, (%r8)
        xorl    %eax, %eax
        ret

Full example of how it works: http://coliru.stacked-crooked.com/a/65e3002909d8beae

shared_value = 800000
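
For reference, a minimal harness in the spirit of that Coliru test might look like the sketch below, appended after the TryLock code above (so it reuses those includes and the TryLock definition). The thread count, iteration count, and the Unlock helper are my assumptions, chosen so the expected total is 800000; the actual test at the link may differ.

// Hypothetical test harness for the TryLock above (a sketch, not the Coliru code).
#include <pthread.h>
#include <stdio.h>

static volatile uint64_t lock_word = 0;
static uint64_t shared_value = 0;               // protected by the byte-lock, not atomic

static void Unlock(volatile uint64_t* lock, uint64_t threadNum)
{   // release our byte so other threads can take the lock
    uint8_t* current_lock = ((uint8_t*)lock) + threadNum;
    atomic_store_explicit((atomic_uint_least8_t*)current_lock, (uint8_t)0, memory_order_release);
}

static void* worker(void* arg)
{
    uint64_t threadNum = (uint64_t)(uintptr_t)arg;
    for (int i = 0; i < 400000; ++i) {          // 2 threads * 400000 = 800000
        while (!TryLock(&lock_word, threadNum)) { /* spin until we get the lock */ }
        ++shared_value;                          // critical section
        Unlock(&lock_word, threadNum);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (uint64_t n = 0; n < 2; ++n)
        pthread_create(&t[n], NULL, worker, (void*)(uintptr_t)n);
    for (uint64_t n = 0; n < 2; ++n)
        pthread_join(t[n], NULL);
    // If the locking is correct, no increments are lost and this prints 800000.
    printf("shared_value = %llu\n", (unsigned long long)shared_value);
    return 0;
}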

What happens if you do not use MFENCE - data races

There is StoreLoad reordering as described above in case-1 (i.e. if you don't use sequential consistency for the STORE) - asm: https://godbolt.org/g/p3j9fR

Changing the memory order of the STORE from memory_order_seq_cst to memory_order_release removes the MFENCE - and now there are data races: shared_value is no longer equal to 800000.
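
Per that Godbolt link, the store/load sequence is essentially the one shown above with the mfence removed - roughly (mask computation omitted):

        movb    $1, (%r8)        # narrow store: sits in the store buffer
        movq    (%rdi), %rdi     # wide load: can complete before our byte is globally visible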


Farflung answered 17/8, 2016 at 23:57 Comment(14)
Your case 1 alone isn't StoreLoad reordering. The data being read is the new data, with the newly-stored data merged in. You're right that this lets the load execute before the store commits, but reordering can only be detected if a store from another thread to the same location becomes globally visible between the load executing and the store committing to L1 cache. This is probably why the only hardware that implements this kind of narrow-store to wide-load forwarding is in-order Atom.Morass
It would be possible to speculatively do that kind of store-forwarding but roll back if the cache line is invalidated before the store can commit. So I don't think your case 1 proves anything. The reasoning in this answer is flawed, and isn't sufficient proof that this kind of reordering is possible on any real hardware, or even intended to be allowed by the ISA for all future implementations.Morass
Good point that an MFENCE will make the OP's idea safe, without ever doing an atomic RMW to the contended cache line. I'd be interested to know what the performance is like compared to using xchg or something to do the store, since atomic RMW operations are expecting contention from other cores and won't mis-speculate. (i.e. don't need pause)Morass
You also claim that in case 2, MFENCE won't have any effect on performance. That's incorrect: it takes several uops, and time to execute, even when no memory uops are in flight. It also forces the load to wait for all in-flight stores, not just the overlapping one. It's highly unlikely for the byte store and qword load to be the only things the CPU is doing.Morass
@Peter Cordes 2. Yes, if we talk quite strictly, it adds a few cycles (less than 10). But this is nothing compared with the flushing Store Buffer, which is waiting for when the cache line arrives with an exclusive state from another CPU-Core, it takes about 500 cycles.Farflung
@Peter Cordes 1. "It would be possible to speculatively do that kind of store-forwarding but roll back if the cache line is invalidated before the store can commit. So I don't think your case 1 proves anything" : - this is exactly for what is required LL/SC, which isn't in x86_64. I think it is impossible to get this behavior with the same performance on x86_64, otherwise, it would be used to solve the ABA-problem instead of: tagged-pointers, RCU, Hazard Pointers, intermediate nodes.Farflung
@Peter Cordes "Your case 1 alone isn't StoreLoad reordering. ... but reordering can only be detected if a store from another thread to the same location becomes globally visible between the load executing and the store committing to L1 cache." - any reordering make sense only for visible from other thread/Core, and I've added an example which differs only by the absence MFENCE, and it contains the Data-Race due to the reordering of STORE and LOAD. So I think there is a StoreLoad reordering.Farflung
@Peter Cordes "So I don't think your case 1 proves anything. The reasoning in this answer is flawed, and isn't sufficient proof that this kind of reordering is possible on any real hardware," coliru.stacked-crooked.com/a/438989514d44593a This example shows the case-1, that if we remove MFENCE (change seq_cst -> release for STORE) then will be Data-Races due to the StoreLoad reordering - i.e. we can see old data from cache line. Also this never fire: if(and_val == 0) assert( loaded_val != 0 ); - i.e. value from Store-Buffer overlaped with data from cache-line.Farflung
re: speculative rollback: you're mixing up what's architecturally visible (no LL/SC in the instruction set) with how it could be implemented under the hood. Intel's microarchitectures do speculate on memory ordering for access to different cache lines at least. Part of the benefit of the pause instruction is avoiding this speculation so you don't get a detect/rollback when exiting a spin loop.Morass
@Peter Cordes Can you provide any example with this TryLock() (in spin loop or without) that shows speculative rollback (i.e. faster than with MFENCE) and hasn't Data-Races?Farflung
Interesting experimental test. I didn't expect we'd see reordering on real hardware, but it looks like we are, assuming the locking algorithm isn't broken. re: memory-ordering mis-speculation: This Intel article shows the performance benefit of pause, from avoiding mis-speculation when exiting a spin loop. That's speculation on ordering between two cache lines, though, not within one cache line like we're talking about here.Morass
I think your experimental testing shows that reordering is happening, with the narrow stores becoming globally visible after the wide loads. However, I still need to verify that the locking scheme is valid, and correctly provides acquire and release semantics for the critical section. If the locking scheme is broken, then adding an MFENCE might be making it work for a different reason even if it's not needed for the locking/unlocking instructions to become globally visible in program order. I started writing it down on paper, but I haven't finished yet.Morass
@Peter Cordes To be fully strict I also added memory_order_seq_cst to the second LOAD. It doesn't change asm code on x86_64 on most C/C++-compilers (change only for MIPS/PowerPC), but memory_order_seq_cst on both STORE & LOAD is required to be correct on any potentially possible C/C++-compilers. Even if SeqCst alternatively mapped to asm_x86_64 as: STORE(MOV) - LOAD(MFENCE, MOV): cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.htmlFarflung
Any idea why I can't reproduce this with gcc-7.1.0/6.2.0/5.4.1, clang-3.6.0 on Intel(R) Xeon(R) CPU E3-1226 v3 @ 3.30GHz? I consistently get shared_value = 800000, and it gets proportionally higher as I increase iteration count. (I copy-pasted the code from Coliru, where the value is different.)Gerald

Can mov byte [rcx+r8], 1 reorder with the cmp qword [rcx], rdx load that follows it? This is the lock[threadNum]=1 store and the following load to make sure nobody else wrote a byte.

The load must return data that includes the store, because the executing thread always observes its own actions to happen in program order. (This is true even on weakly-ordered ISAs).


It turns out this exact locking idea has been proposed before (for the Linux kernel), and Linus Torvalds explained that x86 really does allow this kind of reordering.

Despite the term "store-forwarding failure or stall", it doesn't mean the data has to commit to cache before the load can read it. It actually can be read from the store buffer while the cache line is still in S state (MESI). (And on in-order Atom cores, you don't even get a store-forwarding stall at all.)

Real hardware does work this way (as Alex's tests show): the CPU will merge data from L1D with data from the store buffer, without committing the store to L1D.

This by itself isn't reordering yet1 (the load sees the store's data, and they're adjacent in the global order), but it leaves the door open for reordering. The cache line can be invalidated by another core after the load, but before the store commits. A store from another core can become globally visible after our load, but before our store.

So the load includes data from our own store, but not from the other store from another CPU. The other CPU can see the same effect for its load, and thus both threads enter the critical section.


1 (This is the point I was making in comments on Alex's answer. If x86 didn't allow this reordering, CPUs could still do the store-forwarding speculatively before the store becomes globally visible, and shoot it down if another CPU invalidated the cache line before the store committed. That part of Alex's answer didn't prove that x86 worked the way it does. Only experimental testing and careful reasoning about the locking algo gave us that.)

If x86 did disallow this reordering, a store/partially-overlapping-reload pair would work like an MFENCE: the store would have to become globally visible before the load. Since x86 already keeps stores ordered with earlier stores and loads ordered with later loads, nothing could cross the pair in either direction: earlier stores couldn't be delayed past the load, and later loads and stores couldn't become globally visible before the store.

Given this reasoning, it's not totally obvious why a perfectly-overlapping store/reload pair isn't equivalent to an MFENCE as well. Perhaps it actually is, and x86 only manages to make spill/reload or arg-passing on the stack fast with speculative execution!
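
To see what that would cost (my illustration, not from the original answer): every stack spill/reload is an exact-overlap store/load pair to the same address, and it forwards from the store buffer in roughly 4-5 cycles rather than waiting for the store to become globally visible:

mov qword ptr [rsp+8], rax   ; spill: the store is queued in the store buffer
mov rax, qword ptr [rsp+8]   ; reload: store-forwarded within a few cycles,
                             ; long before the store commits to L1D / becomes globally visible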


The locking scheme:

It looks like TryLock can fail for both/all callers: They all see it initially zero, they all write their byte, then they all see at least two non-zero bytes each. This is not ideal for heavily-contended locks, compared to using a locked instruction. There is a hardware arbitration mechanism to handle conflicting locked insns. (TODO: find the Intel forum post where an Intel engineer posted this in response to another software retry loop vs. locked instruction topic, IIRC.)
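
For comparison, a TryLock built around a single locked RMW (a sketch of the lock cmpxchg sequence David Wohlferd posted in the comments on the question, with 32-bit xor-zeroing; the TryLock_cmpxchg label is just for illustration) claims the whole qword in one atomic step and needs no separate barrier:

; rcx - address of the aligned int64 lock, rdx - integer in the range 0..7
TryLock_cmpxchg PROC
xor  r8d, r8d                       ; r8 = 0 (zeroing the full 64-bit register)
shl  rdx, 3                         ; rdx = 8*threadNum
bts  r8, rdx                        ; r8 = 1 << 8*threadNum  (our byte = 1)
xor  eax, eax                       ; expected old value: 0 = lock free
lock cmpxchg qword ptr [rcx], r8    ; if [rcx]==0, atomically store r8 and set ZF
setz al                             ; al = 1 on success, 0 on failure
movzx eax, al                       ; return int
ret

On failure the lock word is left unchanged, so there is no "unset ours" path, and the hardware arbitration mentioned above handles contention between cores.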

The narrow-write / wide-read will always trigger a store-forwarding stall on modern x86 hardware. I think this just means the load result isn't ready for several cycles, not that execution of other instructions stalls (at least not in an OOO design).

In a lightly-contended lock that's used frequently, the branch will be correctly predicted to take the no-conflict path. Speculative execution down that path until the load finally completes and the branch can retire shouldn't stall, because store-forwarding stalls are not quite long enough to fill up the ROB.

  • SnB: ~12 cycles longer than when store-forwarding works (~5c)
  • HSW: ~10c longer
  • SKL: ~11c longer than when store-forwarding works (4c for 32 and 64bit operands, which is 1c less than previous CPUs)
  • AMD K8/K10: Agner Fog doesn't give a number.
  • AMD Bulldozer-family: 25-26c (Steamroller)

  • Atom: "Unlike most other processors, the Atom can do store forwarding even if the read operand is larger than the preceding write operand or differently aligned", and there is only 1c latency. Only fails when crossing a cache-line boundary.

  • Silvermont: ~5c extra (base: 7c)
  • AMD Bobcat/Jaguar: 4-11c extra (base: 8c/3c)

So if the whole locking scheme works, it might do well for lightly-contended locks.

I think you could turn it into a multiple-readers/single-writer lock by using bit 1 in each byte for readers and bit 2 for writers. TryLock_reader would ignore the reader bits in other bytes. TryLock_writer would work like the original, requiring a zero in all bits in other bytes.
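
A hedged C sketch of that idea, reusing the seq_cst store + barriered load pattern from the other answer (the bit values, masks, and function names here are my own illustration, not something from the original post):

#include <stdatomic.h>
#include <stdint.h>

#define READER_BIT 0x01                     // the "I'm reading" bit in a thread's byte
#define WRITER_BIT 0x02                     // the "I'm writing" bit in a thread's byte
#define ALL_WRITERS 0x0202020202020202ULL   // the writer bit in every byte

static int TryLock_reader(volatile uint64_t* lock, uint64_t threadNum)
{
    atomic_uint_least8_t* my_byte = (atomic_uint_least8_t*)((uint8_t*)lock + threadNum);

    // Announce ourselves; seq_cst so the byte store is globally visible
    // (MFENCE or an implicitly LOCKed store on x86) before the load below.
    atomic_store_explicit(my_byte, (uint8_t)READER_BIT, memory_order_seq_cst);

    // Only writer bits in *other* bytes matter; other readers are fine.
    uint64_t other_writers = ALL_WRITERS & ~(0xFFULL << (8 * threadNum));
    if (atomic_load_explicit((atomic_uint_least64_t*)lock, memory_order_seq_cst) & other_writers)
    {
        atomic_store_explicit(my_byte, (uint8_t)0, memory_order_release);  // back off
        return 0;
    }
    return 1;
}

static int TryLock_writer(volatile uint64_t* lock, uint64_t threadNum)
{
    atomic_uint_least8_t* my_byte = (atomic_uint_least8_t*)((uint8_t*)lock + threadNum);

    atomic_store_explicit(my_byte, (uint8_t)WRITER_BIT, memory_order_seq_cst);

    // A writer needs every bit in every *other* byte (readers and writers) to be clear.
    uint64_t other_bytes = ~(0xFFULL << (8 * threadNum));
    if (atomic_load_explicit((atomic_uint_least64_t*)lock, memory_order_seq_cst) & other_bytes)
    {
        atomic_store_explicit(my_byte, (uint8_t)0, memory_order_release);  // back off
        return 0;
    }
    return 1;
}

As with the original scheme, all contending callers can fail at once and have to retry.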


BTW, for memory ordering stuff in general, Jeff Preshing's blog is excellent.

Morass answered 10/3, 2016 at 7:27 Comment(27)
Could Performance Monitor Counters be used to gather any empirical evidence? There is a LD_BLOCKS.STORE_FORWARD which reads "loads blocked by overlapping with store buffer that cannot be forwarded". Is that in any way relevant?Bangka
@MargaretBloom: I don't think so, because that's not the part I'm unsure of. The store-forwarding rules are clear and fairly simple (on all CPUs except Atom, stores can't forward to wider loads). So that counter should just confirm that. The part I'm not sure about is global visibility after a store-forwarding failure. I'm not sure that even a store-forwarding failure forces the store to become globally visible before the load (on current hardware). If you have any clever idea for using perf counters that I'm not thinking of, that would be cool, though :)Morass
@PeterCordes - if I'm following this correctly, it seems that per Alex's results above, CPUs other than Atom can forward from a narrower store to a wider read: effectively forwarding some bytes from the store buffer and the remaining bytes from the L1. At least that's what the coliru results seem to indicate, right? Of course from a performance perhaps it is still slow - so counts as a "failed forwarding" in cycle measurements - perhaps because hardware exists to get bytes from the "most recent" overlapping store, but not merge any other overlapping bytes in the SB.Staford
... so the hardware simply waits until the entire store buffer (or, as an optimization, the entries older than the one that was a partial forward?) are drained and then merges the earlier captured bytes from "store forwarding" with whatever it reads from L1. Seems weird, why would they bother with the whole capture logic when they could just do the whole read from L1?Staford
@BeeOnRope: As I understand it, it's a store-forwarding delay, not a total failure that makes it wait until the store commits to L1. We could test this by creating a store-forwarding failure where the store touches a hot and cold cache line, but the load only touches the hot cache line (or maybe avoids touching any not-just-written bytes of the cold cache line). So if a delayed store-forwarding has to wait for the data to commit to L1, it will be much slower than the advertised constant 10c penalty above the standard store-forwarding latency on Haswell. Evict with movnt, clflush or a thread.Morass
I see. So let's say this test showed that you still get only a ~10c penalty, then the working theory could be that the store-forwarding can still work in this case, merging the bytes, but it has to do more work (including checking the rest of the store buffer for any other overlapping bytes, using hardware which is optimized for finding nearest which could be were a lot of the time is being spent).Staford
@BeeOnRope: Also, draining the store buffer would take a variable number of cycles, depending on how full it was, and especially if a cache-miss stores was queued. (x86 is strongly-ordered, so it can't make stores globally visible out-of-order). IDK if I've seen it discussed outside of Agner Fog's guide, but he talks about store-forwarding stalls as a constant latency penalty. Probably the hardware is optimized for finding successful store-forwarding, and if that fails it falls back to a slow scan of the store-buffer to assemble the correct bytes.Morass
And you didn't need to say "other than Atom", because Atom can also forward from a narrow store to a wide load. The difference is that it doesn't even stall while doing so. :)Morass
Sure, but I always took the various apparently fixed penalties as something like a result of benchmarking them in a specific case where the store buffer is mostly empty, or least everything is hitting in L1. Perhaps they really are fixed though.Staford
BTW, about your "Given this reasoning" paragraph: I think Intel is just (still) being unclear about their ordering model. ISTM that the whole "Loads Are not Reordered with Older Stores to the Same Location" thing is just stating the very obvious principle that single-threaded RAW semantics are preserved (i.e., you obviously read the value you wrote on the same thread). Their Example 8-4 in 8.2.3.4 in Vol 3 for this principle just gives a single-threaded example of how writing a 1 then reading it back will give you 1 (duh).Staford
Later on, in 8.2.3.5, they give a very confusing example. They are talking about store-forwarding, but the condition they note, that r2 == r4 == 0, doesn't involve the forwarded read on either CPU (which goes to r1 and r3). However, it does show (and perhaps is meant to show?) that the "not reordering to same location" rule doesn't prevent this reordering and hence doesn't give you MFENCE-like behavior: the 2nd read (of r2 and r4) is able to move above the store, which implies that the 1st read must also (since read-read reordering isn't allowed).Staford
@BeeOnRope: Nice detective work. I'm trying to figure out whether it would be safe for gcc to optimize foo.load().m to a load of just the member, instead of an atomic load of the whole struct and then extracting that member. (Huge gain for an 8B member of a 16B struct, like this pointer+counter, since x86's only 16B atomic load is cmpxchg16b!). I think it's fine with mo_relaxed, but I'm still working through the implications of this reordering rule for mo_acquire for non-first members.Morass
I can't see the problem if the remainder (non-m part) of the struct is thrown away as in load().m - no additional fence/locked op is needed in general on x86 for mo_acquire, and I don't see how making it a struct can change that? If it didn't work in this case it wouldn't work for the plain uint64_t case, but we know it does.Staford
@BeeOnRope: An acquire load still has to Synchronize With release-stores to the whole object (different address). I think this works on x86 because synchronization between cores is based on cache-lines, not individual objects (and for user-space threading, OOO always preserves the illusion of in-order execution for a single HW thread). Also, until you found 8.2.3.5, I had been thinking that in the global-visibility order, a load couldn't reorder with a preceding store to the same address (first member), but could for later members. But yeah, I don't see any problems except 8B-aligned 16B.Morass
Sure, but the whole object thing doesn't really apply in x86 as all the behaviors are global. So the wording about the acquire/release pair provides some optimization opportunities for the compiler (e.g., in cases where it can prove, for example, that an object doesn't escape and so can't participate in a synchronization), and matters on other platforms with weaker primitives that may apply to a particular memory location(s) - but on x86 it's global. I'd ignore that part about "no store-load reordering if same address": even if it meant something I don't think it buys anything for aq/rel.Staford
@BeeOnRope: submitted gcc.gnu.org/bugzilla/show_bug.cgi?id=80835. Do you have anything to add to that bug thread re: non-global synchro primitives on non-x86, and how it might affect this? I don't know anything about that.Morass
@PeterCordes - I will take a look at the bug, but I probably can't add much authoritative about the weaker primitives. I know it existed because I worked on platforms (e.g., POWER) that had them, but we mostly just mapped everything to the x86 strong-mode so we didn't make use of them directly. They are useful, for example, for things like reference counting, where you want an atomic manipulation of the lock variable, and you want manipulations of said variable to occur in a total order and a few other guarantees, but you don't need that to fence all other accesses (weak CAS or whatever).Staford
FWIW, I forgot to link this above. It's about as close to a comprehensive study on the memory model with reference to the Intel docs as you'll get, since it explicitly examines the language in the manual and points out the flaw and proposes a corrected model.Staford
@BeeOnRope: Good point, thanks. I'm aware of that (and linked it from the x86 tag wiki a while ago), but it's not really skimable due to the notation used. It might have taken less time to just read it instead of head-scratching and hand-waving, though. But that's still only for x86, and might not help me reason about whether it's valid for other ISAs.Morass
@PeterCordes could you get the lock to work by introducing an arbitrary 4k-aliasing store between the store to the lock and the load of the lock? AFAIK the load would need to wait on the 4k store's retirement, which implies retirement of the store to the lock. I.e. store lock; store (lock + 4096); load lock. With this I'm seeing it "work", but with parallelism I could just be getting lucky.Tapioca
@Noah: Interesting idea. I definitely wouldn't want to count on that as future-proof / portable since there's no architectural guarantee of that microarchitectural effect. (e.g. dynamic translation to run on an AArch64 like M1 would have totally different microarchitectural effects, and some future x86 hardware could also be different). But as a means for investigating the behaviour of current microarchitectures, it's interesting. IDK if it would be worth it for performance on a specific current uarch vs. other locking options; might well not be.Morass
@PeterCordes don't expect to ever implement it, just was thinking 4K aliasing can be used as a somewhat targeted memory fence on supported systems.Tapioca
@PeterCordes so it doesn't seem to work. Not able to reproduce on my machine, but parallelism is like that. The counter value is a lot closer. What mechanism is allowing the load/store to reorder? Possibly the address of the intermediate store is not ready yet when the load is issued, so it doesn't block the store-forwarding lookup?Tapioca
That can't be right. The address of the byte move to the lock is dependent on the computation of the address of the 4k alias, so something else must be going on.Tapioca
@Noah: Maybe a different microarchitecture for coliru vs. your local Ice Lake makes the difference? Or even a multi-socket Xeon vs. client interconnect difference affecting something? IDK, I haven't really tried to wrap my head around it yet.Morass
@PeterCordes Might make an SO question if I'm able to reproduce it on my local machine. I also wrote a version that used inline assembly to explicitly guarantee that the lock-store address is dependent on the 4k address and the load address on the lock-store address. Possibly AGU scheduling is not guaranteed even with no calculation necessary. The thing is, the counter value is getting very close to correct (off by 1%), so I think it's likely some "small" detail that allows it.Tapioca
@Noah: Sounds like a good idea if you want to take the time; even if you don't get an answer, it's a better place than this comment thread to write up your findings so far. Even if we did find the cause, I don't think editing a section about it into this answer would be good.Morass
