Does the MOV x86 instruction implement a C++11 memory_order_release atomic store?

According to this https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html, a release store is implemented as MOV (into memory) on x86 (including x86-64).

According to this http://en.cppreference.com/w/cpp/atomic/memory_order

memory_order_release:

A store operation with this memory order performs the release operation: no memory accesses in the current thread can be reordered after this store. This ensures that all writes in the current thread are visible in other threads that acquire the same atomic variable and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic.

I understand that when memory_order_release is used, all memory stores done previously should finish before this one.

int a;
a = 10;
std::atomic<int> b;
b.store(50, std::memory_order_release); // I can be sure that 'a' is already 10, so processor can't reorder the stores to 'a' and 'b'

QUESTION: how is it possible that a bare MOV instruction (without an explicit memory fence) is sufficient for this behaviour? How does MOV tell the processor to finish all previous stores?

Overkill answered 28/4, 2015 at 14:56 Comment(7)
You forgot to mention "on x86" - Serrato
@cubbi: right, it is important, done - Overkill
Because it's a dynamically scheduled ISA, the chip always assumes the worst case. - Crake
x86 doesn't have separate release and acquire barriers. - Allurement
@BenVoigt: why? Will it never try to reorder the stores in my example? - Overkill
The bottom of that cppreference page has a link to the x86-TSO paper, which gets into more detail than you'll ever need - Serrato
"I can be sure that 'a' is already 10, so processor can't reorder the stores to 'a' and 'b'" For clarity, the standard has no global notion that "'a' is already 10", so more accurately: "I can be sure that another thread that loads the 50 stored here in 'b' with a memory order of at least memory_order_acquire will also observe 'a' to be 10." It is a popular pitfall to believe that the release makes previous writes magically visible in other threads. The standard merely states that writes from one thread should become visible in other threads "within a reasonable amount of time". - Mariann

That does appear to be the mapping, at least in code compiled with the Intel compiler, where I see:

0000000000401100 <_Z5storeRSt6atomicIiE>:
  401100:       48 89 fa                mov    %rdi,%rdx
  401103:       b8 32 00 00 00          mov    $0x32,%eax
  401108:       89 02                   mov    %eax,(%rdx)
  40110a:       c3                      retq
  40110b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

0000000000401110 <_Z4loadRSt6atomicIiE>:
  401110:       48 89 f8                mov    %rdi,%rax
  401113:       8b 00                   mov    (%rax),%eax
  401115:       c3                      retq
  401116:       0f 1f 00                nopl   (%rax)
  401119:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

for the code:

#include <atomic>
#include <stdio.h>

void store( std::atomic<int> & b ) ;

int load( std::atomic<int> & b ) ;

int main()
{
   std::atomic<int> b ;

   store( b ) ;

   printf("%d\n", load( b ) ) ;

   return 0 ;
}

void store( std::atomic<int> & b )
{
   b.store(50, std::memory_order_release ) ;
}

int load( std::atomic<int> & b )
{
   int v = b.load( std::memory_order_acquire ) ;

   return v ;
}

The current Intel architecture documents, specifically Volume 3 (System Programming Guide), do a nice job of explaining this. See:

8.2.2 Memory Ordering in P6 and More Recent Processor Families

  • Reads are not reordered with other reads.
  • Writes are not reordered with older reads.
  • Writes to memory are not reordered with other writes, with the following exceptions: ...

The full memory model is explained there. I'd assume that Intel and the C++ standard folks worked together in detail to nail down the best mapping for each memory order operation that conforms to the memory model described in Volume 3, and that plain stores and loads were determined to be sufficient in those cases.
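
To make the guarantee concrete, here is a minimal sketch of the release/acquire pairing from the question, extended to two threads. The names payload, ready, producer, consumer and run_once are my own and purely illustrative. The comments about MOV reflect the mapping table linked in the question, not something this code verifies.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for illustration; not from the original post.
int payload = 0;               // plain, non-atomic data
std::atomic<int> ready{0};     // flag published with a release store

void producer()
{
   payload = 10;                                 // ordinary store
   ready.store(50, std::memory_order_release);   // on x86: expected to be a plain MOV
}

int consumer()
{
   // Spin until the release store is observed.
   // On x86 this acquire load is also expected to be a plain MOV.
   while (ready.load(std::memory_order_acquire) != 50) {
      std::this_thread::yield();
   }
   // The acquire load synchronized with the release store,
   // so the earlier write to payload is visible here.
   return payload;
}

// Runs one producer/consumer round and returns what the consumer saw.
int run_once()
{
   payload = 0;
   ready.store(0, std::memory_order_relaxed);
   int seen = -1;
   std::thread reader([&] { seen = consumer(); });
   std::thread writer(producer);
   reader.join();
   writer.join();
   return seen;
}
```

Calling run_once() should return 10 on any conforming implementation. What changes between x86 and weaker architectures is only which instructions the compiler has to emit to keep that promise.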

Note that just because no special instructions are required for this ordered store on x86-64, that doesn't mean it will be universally true. For PowerPC I'd expect to see something like an lwsync instruction along with the store, and on HP-UX (ia64) the compiler should be using a st4.rel instruction.

Profluent answered 28/4, 2015 at 15:48 Comment(2)
Just before what you quoted the Intel doc states "In a single-processor system", where "processor" can mean "core". Surely we're discussing re-ordering in a multi-core system? The doc goes on to say "In a multiple-processor system, the following ordering principles apply:" - Nauseate
Unless it's this part "Individual processors use the same ordering principles as in a single-processor system." ?? - Nauseate

There's memory reordering at run-time (done by the CPU) and there's memory reordering at compile-time. Please read Jeff Preshing's article on compile-time reordering (and the many other good ones on that blog) for further information.

memory_order_release prevents the compiler from reordering access to data, and also makes it emit any necessary fencing or special instructions. In x86 asm, ordinary loads and stores already have acquire / release semantics, so blocking compile-time reordering is sufficient for acq_rel, but not for seq_cst.
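
As a rough illustration of that difference, the sketch below puts a release store and a seq_cst store side by side. The helper names are hypothetical. The instruction choices in the comments follow the mapping table linked in the question and are what mainstream x86 compilers typically produce, so treat them as an expectation to check in your own disassembly rather than a guarantee.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper names, used only for this illustration.
void store_release(std::atomic<int>& x, int v)
{
   // The compiler may not move earlier loads or stores below this store.
   // On x86 this is typically a plain MOV, since the hardware already
   // keeps stores in order with respect to older loads and stores.
   x.store(v, std::memory_order_release);
}

void store_seq_cst(std::atomic<int>& x, int v)
{
   // Must additionally not be reordered with later seq_cst loads (StoreLoad),
   // which plain x86 stores do not guarantee.
   // On x86 this is typically XCHG, or MOV followed by MFENCE.
   x.store(v, std::memory_order_seq_cst);
}

int load_acquire(const std::atomic<int>& x)
{
   // On x86 this is typically a plain MOV load.
   return x.load(std::memory_order_acquire);
}
```

Compiling these three functions with optimizations on and looking at the generated assembly is a quick way to see the mapping for a particular compiler and target.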

Deathblow answered 28/4, 2015 at 15:57 Comment(2)
"memory_order_release prevents the compiler from reordering access to data" by definition that includes the CPU behavior - Athalee
@curiousguy: yeah, this answer explained it badly before. But instead of downvoting, I decided to fix it. - Cia
