Why does MSVC generate nop instructions for atomic loads on x64?

If you compile code such as

#include <atomic>

int load(std::atomic<int> *p) {
    return p->load(std::memory_order_acquire) + p->load(std::memory_order_acquire);
}

you see that MSVC generates NOP padding after each memory load:

int load(std::atomic<int> *) PROC
        mov     edx, DWORD PTR [rcx]
        npad    1
        mov     eax, DWORD PTR [rcx]
        npad    1
        add     eax, edx
        ret     0

Why is this? Is there any way to avoid it without relaxing the memory order (which would affect the correctness of the code)?

Baribaric answered 30/12, 2022 at 0:40 Comment(1)
Related, maybe answers this question too: #44854997 – Facer

Under the hood, MSVC's implementation of p->load() may end up using the _ReadWriteBarrier compiler intrinsic.

According to this report on the Visual Studio Developer Community: https://developercommunity.visualstudio.com/t/-readwritebarrier-intrinsic-emits-unnecessary-code/1538997

the nops get inserted because of the flag /volatileMetadata, which is now on by default. You can return to the old behavior by adding /volatileMetadata-, but doing so will result in worse performance if your code is ever run emulated (i.e. as x64 code under emulation on ARM64 Windows). It'll still be emulated correctly, but the emulator will have to pessimistically assume that every load/store needs a barrier.

And compiling with /volatileMetadata- does indeed remove the npad.
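
For what it's worth, a quick way to check this locally (the exact command line below is my guess at a minimal repro, not something taken from the linked report) is to request an assembly listing with something like cl /O2 /FA /c /volatileMetadata- test.cpp. The listing should then be the same as the one above, just without the npad directives:

int load(std::atomic<int> *) PROC
        mov     edx, DWORD PTR [rcx]
        mov     eax, DWORD PTR [rcx]
        add     eax, edx
        ret     0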

Lanfranc answered 30/12, 2022 at 3:40 Comment(3)
So there can be metadata that means a 1-byte NOP after a memory access should be treated as some kind of memory barrier when binary-translating to a weakly-ordered ISA? _ReadWriteBarrier only blocks compile-time reordering, but on x86(-64), blocking compile-time reordering after a load is sufficient for an acquire operation, so I guess a translator could recognize that as an acquire load (AArch64 ldar)? It would need a way to signal a full std::atomic_thread_fence(std::memory_order_acquire) 2-way barrier (not a 1-way ordered operation), so maybe every 1-byte NOP is treated as a fence? – Knesset
Anyway, yeah, that explains it: the NOP has some kind of meaning. Maybe they went with in-band NOPs so that x86-64 JIT engines could also be made aware of this ARM64EC thing? Otherwise, pure out-of-band metadata with the addresses of atomic operations and barriers could have avoided wasting front-end bandwidth and uop-cache footprint when running on actual x86-64, at a cost in binary size, but it would also have given room for more specific info about what kind of memory order is required. – Knesset
@PeterCordes I asked about this in the MSVC STL Discord; someone (not from MS) assumed that there's no out-of-band metadata, to save space, possibly at the cost of spurious barriers for true NOPs inserted for alignment. This assumption aligns with my experiments, which show no signs of extra metadata. – Vampire
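
To make the discussion in the comments a bit more concrete, here is a small illustrative sketch (mine, not from the thread) of how the same source-level acquire load ends up on the two architectures; the exact instruction selection will of course depend on the compiler:

#include <atomic>

int acquire_load(const std::atomic<int> &x) {
    // x86-64: the hardware memory model already gives loads acquire
    // semantics, so this compiles to a plain load such as
    //     mov eax, DWORD PTR [rcx]
    // and the compiler only has to suppress its own reordering
    // (which is what the _ReadWriteBarrier / npad marker is about).
    //
    // AArch64 (or an x64-to-ARM64 translator that recognizes the
    // marker): a load-acquire instruction is needed instead, e.g.
    //     ldar w0, [x0]
    return x.load(std::memory_order_acquire);
}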
