I'm checking how the compiler emits instructions for multi-core memory barriers on x86_64. The below code is the one I'm testing using gcc_x86_64_8.3
.
std::atomic<bool> flag {false};
int any_value {0};
void set()
{
any_value = 10;
flag.store(true, std::memory_order_release);
}
void get()
{
while (!flag.load(std::memory_order_acquire));
assert(any_value == 10);
}
int main()
{
std::thread a {set};
get();
a.join();
}
When I use std::memory_order_seq_cst
, I can see the MFENCE
instruction is used with any optimization -O1, -O2, -O3
. This instruction makes sure the store buffers are flushed, therefore updating their data in L1D cache (and using MESI protocol to make sure other threads can see effect).
However when I use std::memory_order_release/acquire
with no optimizations MFENCE
instruction is also used, but the instruction is omitted using -O1, -O2, -O3
optimizations, and not seeing other instructions that flush the buffers.
In the case where MFENCE
is not used, what makes sure the store buffer data is committed to cache memory to ensure the memory order semantics?
Below is the assembly code for the get/set functions with -O3
, like what we get on the Godbolt compiler explorer:
set():
mov DWORD PTR any_value[rip], 10
mov BYTE PTR flag[rip], 1
ret
.LC0:
.string "/tmp/compiler-explorer-compiler119218-62-hw8j86.n2ft/example.cpp"
.LC1:
.string "any_value == 10"
get():
.L8:
movzx eax, BYTE PTR flag[rip]
test al, al
je .L8
cmp DWORD PTR any_value[rip], 10
jne .L15
ret
.L15:
push rax
mov ecx, OFFSET FLAT:get()::__PRETTY_FUNCTION__
mov edx, 17
mov esi, OFFSET FLAT:.LC0
mov edi, OFFSET FLAT:.LC1
call __assert_fail
std::memory_order_seq_cst
by mistake, therefore I deleted my answer. So for x86 as long as the instruction is atomic, any release acquire memory order will work. – Proffer