Why doesn't GCC use LOAD (without fence) and STORE+SFENCE for Sequential Consistency?

Here are four approaches to achieving sequential consistency on x86/x86_64:

  1. LOAD(without fence) and STORE+MFENCE
  2. LOAD(without fence) and LOCK XCHG
  3. MFENCE+LOAD and STORE(without fence)
  4. LOCK XADD(0) and STORE(without fence)

As written here: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

C/C++11 operation: x86 implementation

  • Load Seq_Cst: MOV (from memory)
  • Store Seq Cst: (LOCK) XCHG // alternative: MOV (into memory),MFENCE

Note: there is an alternative mapping of C/C++11 to x86, which instead of locking (or fencing) the Seq Cst store locks/fences the Seq Cst load:

  • Load Seq_Cst: LOCK XADD(0) // alternative: MFENCE,MOV (from memory)
  • Store Seq Cst: MOV (into memory)

GCC 4.8.2 (disassembled with GDB on x86_64) uses the first (1) approach for C++11 std::memory_order_seq_cst, i.e. LOAD (without fence) and STORE+MFENCE:

std::atomic<int> a;
int temp = 0;
a.store(temp, std::memory_order_seq_cst);
0x4613e8  <+0x0058>         mov    0x38(%rsp),%eax
0x4613ec  <+0x005c>         mov    %eax,0x20(%rsp)
0x4613f0  <+0x0060>         mfence

As we know, MFENCE = LFENCE+SFENCE. This code can then be rewritten as: LOAD (without fence) and STORE+LFENCE+SFENCE

Questions:

  1. Why do we not need LFENCE before the LOAD here, but do need LFENCE after the STORE (even though LFENCE only makes sense before a LOAD!)?
  2. Why doesn't GCC use the approach LOAD (without fence) and STORE+SFENCE for std::memory_order_seq_cst?
Reclaim asked 27/9, 2013 at 9:29 Comment(4)
What do you mean by LFENCE before LOAD? In your source code you assign a zero value to a, which is a store, not a load, and then it makes no difference whether lfence is called before or after the mov instruction. – Archer
@Archer I mean precisely that LFENCE only makes sense before a LOAD, and LFENCE makes no sense after a STORE in any case. – Reclaim
std::memory_order_seq_cst implies lfence+sfence. This triggers synchronization of all other variables that are not declared atomic, so not issuing lfence+sfence (or mfence) where the standard says so would change semantics. If you have a variable "int b;" and another thread has assigned b=1 and then called sfence, this will become visible to this thread once this thread calls lfence (which could be done by storing a new value into the atomic variable a). – Archer
@Archer and Alex: sfence + lfence is still not a StoreLoad barrier (preshing.com/20120710/… explains how StoreLoad barriers are special). x86 has a strong memory model; LFENCE and SFENCE only exist for use with movnt loads/stores, which are weakly ordered as well as bypassing the cache. See stackoverflow.com/questions/32705169/…. – Scheers

The only reordering x86 does (for normal memory accesses) is that it can potentially reorder a load that follows a store.

SFENCE guarantees that all stores before the fence complete before all stores after the fence. LFENCE guarantees that all loads before the fence complete before all loads after the fence. For normal memory accesses, the ordering guarantees of individual SFENCE or LFENCE operations are already provided by default. Basically, LFENCE and SFENCE by themselves are only useful for the weaker memory access modes of x86.

Neither LFENCE, SFENCE, nor LFENCE + SFENCE prevents a store followed by a load from being reordered. MFENCE does.
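
For example, the classic store-buffer litmus test makes this concrete. The sketch below is illustration code of mine, not from the original post; it uses relaxed atomics so each access compiles to a plain MOV on x86, and the outcome r1 == r2 == 0 is then observable. Inserting SFENCE or LFENCE between the store and the load would not change that; only a StoreLoad barrier such as MFENCE (which seq_cst provides) rules it out.

#include <atomic>
#include <thread>

// Store-buffer litmus test (illustrative). Each thread stores to one
// location and then loads the other. SFENCE (StoreStore) or LFENCE
// (LoadLoad/LoadStore) between the two would change nothing: only a
// StoreLoad barrier (MFENCE, implied by seq_cst) forbids r1 == r2 == 0.
std::atomic<int> x{0}, y{0};
int r1, r2;

void t1() {
    x.store(1, std::memory_order_relaxed);  // plain MOV store
    r1 = y.load(std::memory_order_relaxed); // may effectively pass the store
}

void t2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(t1), b(t2);
    a.join(); b.join();
    // With memory_order_seq_cst on all four accesses, r1 == r2 == 0
    // would be impossible; with relaxed (or acquire/release) it is not.
}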

The relevant reference is the Intel x86 architectural manual.

Grout answered 12/10, 2013 at 11:30 Comment(0)

Consider the following code:

#include <atomic>
#include <cstring>

std::atomic<int> a;
char b[64];

void seq() {
  /*
    movl    $0, a(%rip)
    mfence
  */
  int temp = 0;
  a.store(temp, std::memory_order_seq_cst);
}

void rel() {
  /*
    movl    $0, a(%rip)
   */
  int temp = 0;
  a.store(temp, std::memory_order_relaxed);
}

With respect to the atomic variable "a", seq() and rel() are both ordered and atomic on the x86 architecture because:

  1. mov is an atomic instruction
  2. mov is a legacy instruction and Intel promises ordered memory semantics for legacy instructions to be compatible with old processors that always used ordered memory semantics.

No fence is required to store a constant value into an atomic variable. The fences are there because std::memory_order_seq_cst implies that all memory is synchronized, not only the memory that holds the atomic variable.

The effect can be demonstrated by the following set and get functions:

void set(const char *s) {
  strcpy(b, s);
  int temp = 0;
  a.store(temp, std::memory_order_seq_cst);
}

const char *get() {
  int temp = 0;
  a.store(temp, std::memory_order_seq_cst);
  return b;
}

strcpy is a library function that might use newer SSE instructions if they are available at runtime. Since SSE instructions were not available on old processors, there is no backwards-compatibility requirement for them, and their memory ordering is undefined. Thus the result of a strcpy in one thread might not be directly visible in other threads.

The set and get functions above use an atomic value to enforce memory synchronization so that the result of strcpy becomes visible in other threads. Now the fences matter, but their order inside the call to atomic::store is not significant, since the fences are not needed internally in atomic::store.

Archer answered 30/9, 2013 at 7:12 Comment(11)
Thanks! Can you give a link to where it is said that "sse instructions ... is no requirement on backwards compatibility and memory order is undefined"? – Reclaim
That was my own summary of a quite comprehensive document... Intel® 64 and IA-32 Architectures Software Developer's Manual Combined Volumes 3A, 3B, and 3C: System Programming Guide. Section 8.2 is about memory ordering. Note that x86_64 is IA-32e in the Intel world, while IA-64 is the Itanium. – Archer
std::memory_order_seq_cst must not provide SC (sequential consistency) for strcpy(b, s); because SC provides it only for atomic operations that are tagged std::memory_order_seq_cst. "Atomic operations tagged std::memory_order_seq_cst not only order memory the same way as release/consume ordering (everything that happened-before a store in one thread becomes a visible side effect in the thread that did a load), but, in addition, establish a single total modification order of all atomic operations that are so tagged." en.cppreference.com/w/cpp/atomic/memory_order – Reclaim
SC is not provided between atomic operations. Agreed. But if thread X calls strcpy at time t0 and makes an atomic store at time t1, and thread Y makes an atomic load at time t2, the result of the strcpy should be visible to thread Y at time t3 (provided t0<t1<t2<t3 and memory_order_seq_cst). This is why the fences are required. – Archer
And your arguments are valid and identical for the acquire-release semantics, but then another question: why does GCC not put fences for std::memory_order_acq_rel? In the line std::string* p = new std::string("Hello"); from the Release-Acquire ordering example (en.cppreference.com/w/cpp/atomic/memory_order), class std::string can use SSE instructions, so will it not work in GCC for acquire-release? – Reclaim
Which atomic operation did you use with std::memory_order_acq_rel? – Archer
Whichever one needs it. Or what do you mean? – Reclaim
Where didn't gcc put fences with memory_order_acq_rel? – Archer
Everywhere, in all cases. Look at this link for the orders std::memory_order_release and std::memory_order_acquire: https://mcmap.net/q/17751/-does-the-semantics-of-std-memory_order_acquire-requires-processor-instructions-on-x86-x86_64 For memory_order_acq_rel you can try disassembling yourself with GCC+GDB: ideone.com/i8wBag Or read this: "On strongly-ordered systems (x86, SPARC, IBM mainframe), release-acquire ordering is automatic for the majority of operations. No additional CPU instructions are issued for this synchronization mode, only certain compiler optimizations are affected" – en.cppreference.com/w/cpp/atomic/memory_order – Reclaim
The example as written is wrong and has a data race. Two seq_cst stores don't synchronize and thus don't force the get function to see the effects of the strcpy in set. If you make the seq_cst store into a load, and check that it sees the 0, then this will be okay. – Grout
If strcpy or memcpy uses NT stores, it must end with an sfence inside the library function. This is required for a valid C11 implementation, because C11 / C++11 require an atomic release store to order non-atomic operations. Existing implementations use pure mov without fences for mo_release stores. (Also, as Grout points out, the seq_cst store in the reader is pointless. You need an acquire operation, and you need to check the value to make sure strcpy actually finished in the producer thread, i.e. check a data_ready flag.) – Scheers
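
Putting the last two comments together, a corrected producer/consumer pair might look like the following sketch (data_ready is the hypothetical flag name from the comment above; the essential changes are that the reader performs a load, not a store, and checks the value it reads):

#include <atomic>
#include <cstring>

std::atomic<bool> data_ready{false};
char b[64];

void set(const char *s) {
    strcpy(b, s);                                       // non-atomic write
    data_ready.store(true, std::memory_order_seq_cst);  // release would suffice
}

const char *get() {
    // Only a load that actually observes true synchronizes with the
    // store in set(), making the strcpy result visible to this thread.
    while (!data_ready.load(std::memory_order_seq_cst)) // acquire would suffice
        ;                                               // spin until published
    return b;
}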

SFENCE + LFENCE is not a StoreLoad barrier (MFENCE), so the premise of the question is incorrect. (See also my answer on another version of this same question from the same user Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?)


  • SFENCE can pass (appear before) earlier loads. (It's just a StoreStore barrier).
  • LFENCE can pass earlier stores. (Loads can't cross it in either direction: LoadLoad barrier).
  • Loads can pass SFENCE (but stores can't pass LFENCE, so LFENCE is a LoadStore barrier as well as a LoadLoad barrier).

LFENCE+SFENCE doesn't include anything that stops a store from being buffered until after a later load. MFENCE does prevent this.

Preshing's blog post explains in more detail and with diagrams how StoreLoad barriers are special, and has a practical example of working code that demonstrates reordering without MFENCE. Anyone that's confused about memory ordering should start with that blog.

x86 has a strong memory model where every normal store has release semantics, and every normal load has acquire semantics. This post has the details.
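
For illustration (a sketch of mine, not from the linked post), this is why release stores and acquire loads cost nothing extra on x86: both compile to plain MOVs, and only compiler-level reordering is restricted.

#include <atomic>

std::atomic<int> g;

void store_release(int v) {
    g.store(v, std::memory_order_release);    // x86: plain mov to memory
}

int load_acquire() {
    return g.load(std::memory_order_acquire); // x86: plain mov from memory
}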

LFENCE and SFENCE only exist for use with movnt stores, which are weakly ordered as well as bypassing the cache. (Later, SSE4.1 movntdqa loads from WC memory are also weakly ordered, but movntdqa from WB memory is not, so it won't bypass the cache for cacheable memory.) LFENCE is in practice primarily useful as a barrier to out-of-order execution of non-memory instructions, like rdtsc. (And, long after it was introduced, for blocking speculative execution in some cases as a Spectre mitigation.)
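
As a sketch of where SFENCE genuinely matters (using standard SSE2 intrinsics; the payload and ready names are my own illustrative choices): after a run of non-temporal stores, an SFENCE is needed before the plain release store that publishes the data.

#include <immintrin.h>
#include <atomic>

int payload[16];
std::atomic<bool> ready{false};

void publish() {
    for (int i = 0; i < 16; ++i)
        _mm_stream_si32(&payload[i], i); // MOVNTI: weakly ordered NT store
    _mm_sfence();                        // drain NT stores before publishing
    ready.store(true, std::memory_order_release); // plain mov on x86
}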


In case those links ever die, there's even more info in my answer on another similar question.

Scheers answered 22/9, 2015 at 2:9 Comment(0)

std::atomic<int>::store is mapped to the compiler intrinsic __atomic_store_n. (This and other atomic-operation intrinsics are documented here: Built-in functions for memory model aware atomic operations.) The _n suffix makes it type-generic; the back-end actually implements variants for specific sizes in bytes. int on x86 is AFAIK always 32 bits long, so that means we're looking for the definition of __atomic_store_4. The internals manual for this version of GCC says that the __atomic_store operations correspond to machine description patterns named atomic_store<mode>; the mode corresponding to a 4-byte integer is "SI" (that's documented here), so we are looking for something called "atomic_storesi" in the x86 machine description. And that brings us to config/i386/sync.md, specifically this bit:

(define_expand "atomic_store<mode>"
  [(set (match_operand:ATOMIC 0 "memory_operand")
        (unspec:ATOMIC [(match_operand:ATOMIC 1 "register_operand")
                        (match_operand:SI 2 "const_int_operand")]
                       UNSPEC_MOVA))]
  ""
{
  enum memmodel model = (enum memmodel) (INTVAL (operands[2]) & MEMMODEL_MASK);

  if (<MODE>mode == DImode && !TARGET_64BIT)
    {
      /* For DImode on 32-bit, we can use the FPU to perform the store.  */
      /* Note that while we could perform a cmpxchg8b loop, that turns
         out to be significantly larger than this plus a barrier.  */
      emit_insn (gen_atomic_storedi_fpu
                 (operands[0], operands[1],
                  assign_386_stack_local (DImode, SLOT_TEMP)));
    }
  else
    {
      /* For seq-cst stores, when we lack MFENCE, use XCHG.  */
      if (model == MEMMODEL_SEQ_CST && !(TARGET_64BIT || TARGET_SSE2))
        {
          emit_insn (gen_atomic_exchange<mode> (gen_reg_rtx (<MODE>mode),
                                                operands[0], operands[1],
                                                operands[2]));
          DONE;
        }

      /* Otherwise use a store.  */
      emit_insn (gen_atomic_store<mode>_1 (operands[0], operands[1],
                                           operands[2]));
    }
  /* ... followed by an MFENCE, if required.  */
  if (model == MEMMODEL_SEQ_CST)
    emit_insn (gen_mem_thread_fence (operands[2]));
  DONE;
})

Without going into a great deal of detail, the bulk of this is a C function body that will be called to generate the low-level "RTL" intermediate representation of the atomic store operation. When it's invoked by your example code, <MODE>mode != DImode, model == MEMMODEL_SEQ_CST, and TARGET_SSE2 is true, so it will call gen_atomic_store<mode>_1 and then gen_mem_thread_fence. The latter function always generates mfence. (There is code in this file to produce sfence, but I believe it is only used for explicitly-coded _mm_sfence (from <xmmintrin.h>).)
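
As a sketch (assuming the GCC 4.8-era code generation discussed in the question), the same path can be exercised directly through the builtin:

int a;

void store_seq_cst(int v) {
    __atomic_store_n(&a, v, __ATOMIC_SEQ_CST);
    /* expands through atomic_storesi above and emits:
           mov  %edi, a(%rip)
           mfence                                       */
}

void store_release(int v) {
    __atomic_store_n(&a, v, __ATOMIC_RELEASE);
    /* model != MEMMODEL_SEQ_CST, so no fence is emitted:
           mov  %edi, a(%rip)                            */
}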

The comments suggest that someone thought MFENCE was required in this case. I conclude that either you are mistaken to think a load fence is not required, or this is a missed optimization bug in GCC. It is not, for instance, an error in how you are using the compiler.

Chilung answered 29/9, 2013 at 19:16 Comment(3)
Thank you! But what is the logic of using LFENCE after the STORE? (STORE+MFENCE = STORE+LFENCE+SFENCE) – Reclaim
I can't help you with that part of the question, sorry. You would maybe have better luck asking this on the mailing list [email protected]. – Chilung
MFENCE is required for sequential consistency; see my answer. – Scheers
