Make previous NT stores visible to subsequent memory loads in other threads
I want to store data in a large array with _mm256_stream_si256() called in a loop. As I understand it, a memory fence is then needed to make these changes visible to other threads. The description of _mm_sfence() says:

Perform a serializing operation on all store-to-memory instructions that were issued prior to this instruction. Guarantees that every store instruction that precedes, in program order, is globally visible before any store instruction which follows the fence in program order.

But will my recent stores of the current thread be visible to subsequent load instructions too (in the other threads)? Or do I have to call _mm_mfence()? (The latter seems to be slow)

UPDATE: I saw this question earlier: when should I use _mm_sfence _mm_lfence and _mm_mfence. The answers there focus on when to use a fence in general. My question is more specific, and the answers to that question are unlikely to address it (and currently don't).

UPDATE2: following the comments/answers, let's define "subsequent loads" as the loads in a thread that subsequently takes the lock which the current thread currently holds.

Monzon answered 1/7, 2017 at 18:14 Comment(16)
Possible duplicate of when should I use _mm_sfence _mm_lfence and _mm_mfence. — Attraction
Accessing recently stored data defeats the whole purpose of _mm256_stream_si256, which is to write into memory bypassing the cache when you know you won't access the recently stored data. — Ronni
@VTT, usually it's not accessed immediately. But this may occasionally happen, and I want the program to be correct in that case. — Monzon
Well, if it happens only occasionally then there is no point in worrying about the performance impact of fencing, since you will most likely have to deal with a cache miss as well, and neither will happen often enough to impact performance considerably. — Ronni
@VTT, the fence is called systematically; that's why I want it to be fast. — Monzon
That's not about the C or C++ language but about an architecture/compiler-specific extension and machine code. — Morganatic
re: your edit: did you mean to ask "will my recent stores be globally visible before subsequent load instructions too"? i.e. whether sfence stops StoreLoad reordering in the order in which your thread's stores and loads become globally visible? (It doesn't; only mfence prevents StoreLoad reordering.) The two sentences where you bolded things are talking about completely different things. I'm still not sure whether you were just wondering about visibility in the thread that did the store. — Situated
@PeterCordes, such a subtle topic :) . I meant to ask "will my recent stores from the current thread be visible to subsequent loads in the other threads?" (after sfence) — Monzon
What do you mean by "subsequent", then? Are you talking about a flag variable creating a "synchronizes-with" relationship with something in the other thread? It sounds like you're confused about something fundamental, or else I'm just not understanding your questions, but I'm not sure which. sfence doesn't make stores instantly visible; it just limits reordering of the current thread's stores. With or without it, all stores will eventually become globally visible! — Situated
Err, at least you were confused. You're all sorted out now, right? — Situated
@PeterCordes, by "subsequent" I mean happening later in time. "sfence doesn't make stores instantly visible, it just limits reordering of the current thread's stores" - this is close. There will eventually be some synchronization, at least a spin lock with memory_order_acquire on entry and memory_order_release on exit. So perhaps I can just omit sfence. However, I wrote the current question because I saw a suggestion to use _mm_sfence() after _mm256_stream_si256(), e.g. here: https://mcmap.net/q/15076/-what-is-the-meaning-of-quot-non-temporal-quot-memory-accesses-in-x86 — Monzon
Ok, then yes, you need sfence after NT stores to make sure they don't violate the release semantics of a later store that you're using for synchronization (since unlocking a spinlock is usually just a release store, not a lock xadd or something). The part of my answer that explains how normal synchronization stuff doesn't try to deal with weakly-ordered stores applies to normal spinlock library functions as well as to lock-free std::atomic stuff. — Situated
by "subsequent" I mean happening later in time. There's no way to make this happen unless you limit when those loads can be executed, by using something that synchronizes the producer thread with the consumer. As worded, you're asking for sfence to make NT stores globally visible the instant it executes, so that loads on other cores that execute 1 clock cycle after sfence will see the stores. A sane definition of "subsequent" would be "in the next thread that takes the lock this thread currently holds". — Situated
@PeterCordes, it seems clear to me now, thanks! — Monzon
finally :) I did a quick update of my answer to insert some of that, especially to make a point of mentioning normal locking instead of a lock-free producer-consumer model. — Situated
It's better not to even talk about time when discussing memory models and concurrency, since there just isn't any "global clock" that can be used to determine some global ordering. Not only does such a clock not exist in an engineering or architectural sense; the concept of global time isn't even well-founded in a deep physical sense. That's exactly why, Peter, machine memory models and language memory models are usually expressed in terms of reorderings, happens-before, and other relative concepts. — Montane

But will my recent stores be visible to subsequent load instructions too?

This sentence makes little sense. Loads are the only way any thread can see the contents of memory. Not sure why you say "too", since there's nothing else. (Other than DMA reads by non-CPU system devices.)

The definition of a store becoming globally visible is that loads in any other thread will get the data from it. It means that the store has left the CPU's private store-buffer and is part of the coherency domain that includes the data caches of all CPUs. (https://en.wikipedia.org/wiki/Cache_coherence).

CPUs always try to commit stores from their store buffer to the globally visible cache/memory state as quickly as possible. All you can do with barriers is make this thread wait until that happens before doing later operations. That can certainly be necessary in multithreaded programs with streaming stores, and it looks like that's what you're actually asking about. But I think it's important to understand that NT stores do reliably become visible to other threads very quickly even with no synchronization.

A mutex unlock on x86 is sometimes a lock add, in which case that's a full fence for NT stores already. But if you can't rule out a mutex implementation using a simple mov store then you need at least sfence at some point after NT stores, before unlock.


Normal x86 stores have release memory-ordering semantics (C++11 std::memory_order_release). MOVNT streaming stores have relaxed ordering, but mutex / spinlock functions, and compiler support for C++11 std::atomic, basically ignore them. For multi-threaded code, you have to fence them yourself to avoid breaking the synchronization behaviour of mutex / locking library functions, because those only synchronize normal x86 strongly-ordered loads and stores.

Loads in the thread that executed the stores will still always see the most recently stored value, even from movnt stores. You never need fences in a single-threaded program. The cardinal rule of out-of-order execution and memory reordering is that it never breaks the illusion of running in program order within a single thread. The same goes for compile-time reordering: since concurrent read/write access to shared data is C++ Undefined Behaviour, compilers only have to preserve single-threaded behaviour unless you use fences to limit compile-time reordering.


MOVNT + SFENCE is useful in cases like producer-consumer multi-threading, or with normal locking where the unlock of a spinlock is just a release-store.

A producer thread writes a big buffer with streaming stores, then stores "true" (or the address of the buffer, or whatever) into a shared flag variable. (Jeff Preshing calls this a payload + guard variable).

A consumer thread is spinning on that synchronization variable, and starts reading the buffer after seeing it become true.

The producer must use sfence after writing the buffer, but before writing the flag, to make sure all the stores into the buffer are globally visible before the flag. (But remember, NT stores are still always locally visible right away to the current thread.)

(With a locking library function, the flag being stored to is the lock. Other threads trying to acquire the lock are using acquire-loads.)

std::atomic<bool> buffer_ready;

void producer(__m256i *buf, size_t len, __m256i data) {
    for (size_t i = 0; i < len; i++) {
        _mm256_stream_si256(&buf[i], data);   // NT stores into the buffer
    }
    _mm_sfence();

    buffer_ready.store(true, std::memory_order_release);
}

The asm would be something like

 vmovntdq  [buf], ymm0
 ...
 sfence
 mov  byte [buffer_ready], 1

Without sfence, some of the movnt stores could be delayed until after the flag store, violating the release semantics of the normal non-NT store.

If you know what hardware you're running on, and you know the buffer is always large, you might get away with skipping the sfence, provided the consumer always reads the buffer front to back (in the same order it was written). In that case it's probably not possible for stores to the end of the buffer to still be in flight in the store buffer of the core running the producer thread by the time the consumer thread reaches the end of the buffer. But don't rely on that.


(in comments) by "subsequent" I mean happening later in time.

There's no way to make this happen unless you limit when those loads can be executed, by using something that synchronizes the producer thread with the consumer. As worded, you're asking for sfence to make NT stores globally visible the instant it executes, so that loads on other cores that execute 1 clock cycle after sfence will see the stores. A sane definition of "subsequent" would be "in the next thread that takes the lock this thread currently holds".


Fences stronger than sfence work, too:

Any atomic read-modify-write operation on x86 needs a lock prefix, which is a full memory barrier (like mfence).

So if you, for example, increment an atomic counter after your streaming stores, you don't also need sfence. Unfortunately, in C++, std::atomic and _mm_sfence() don't know about each other, and compilers are allowed to optimize atomics following the as-if rule. So it's hard to be sure that a locked RMW instruction will end up in exactly the place you need it in the resulting asm.

(Basically, if a certain ordering is possible in the C++ abstract machine, the compiler can emit asm that makes it always happen that way. e.g. fold two successive increments into one +=2 so that no thread can ever observe the counter being an odd number.)

Still, the default mo_seq_cst prevents a lot of compile-time reordering, and there's not much downside to using it for a read-modify-write operation when you're only targeting x86. sfence is quite cheap, though, so it's probably not worth the effort trying to avoid it between some streaming stores and a locked operation.

Related: pthreads v. SSE weak memory ordering. The asker of that question thought that unlocking a lock would always do a locked operation, thus making sfence redundant.


C++ compilers don't try to insert sfence for you after streaming stores, even when there are std::atomic operations with ordering stronger than relaxed. It would be too hard for compilers to reliably get this right without being very conservative (e.g. sfence at the end of every function with an NT store, in case the caller uses atomics).

The Intel intrinsics predate C11 stdatomic and C++11 std::atomic. The implementation of std::atomic pretends that weakly-ordered stores don't exist, so you have to fence them yourself with intrinsics.

This seems like a good design choice, since you only want to use movnt stores in special cases, because of their cache-evicting behaviour. You don't want the compiler ever inserting sfence where it wasn't needed, or using movnti for std::memory_order_relaxed.

Situated answered 2/7, 2017 at 1:7 Comment(0)

But will my recent stores of the current thread be visible to subsequent load instructions too (in the other threads)? Or do I have to call _mm_mfence()? (The latter seems to be slow)

The answer is no. You are not guaranteed to see previous stores from one thread without making any synchronization attempt in the other thread. Why is that?

  1. Your compiler could reorder instructions
  2. Your processor can reorder instructions (on some platforms)

In C++, the compiler is required to emit sequentially consistent code, but only for single-threaded execution. So consider the following code:

int x = 5;
int y = 7;
int z = x;

In this program the compiler can choose to put x = 5 after y = 7, but no later, as that would be inconsistent.
If you then consider the following code in another thread:

int a = y;
int b = x;

The same instruction reordering can happen here, as a and b are independent of each other. What will the result of running those threads be?

a    b
7    5
7    ? - whatever was stored in x before the assignment of 5
...

We can get this result even if we put a memory barrier between x = 5 and y = 7, because without also putting a barrier between a = y and b = x you never know in which order they will be read.

This is just a rough presentation of what you can read in Jeff Preshing's blog post Memory Ordering at Compile Time.

Weft answered 3/7, 2017 at 18:34 Comment(8)
In this program compiler can chose to put x = 5 after y = 7 but no later as it will be inconsistent. No: as long as the compiler's asm output loads the old value of x before the x=5 store, it can delay the x=5 store as long as it wants (e.g. sink it out of a loop and keep the value of x live in a register, or as an immediate operand like mov dword [x],5 if it's really a compile-time constant, only storing the final value of x before returning). — Situated
required to emit sequentially consistent code (for single-threaded execution) is not a good way of describing things. The values in memory when a function returns have to match what the source code says (after inlining and inter-procedural optimizations, like optimizing away static variables whose address doesn't escape the compilation unit). The asm that achieves that result doesn't have to bear any resemblance to the order in which the C++ source does things. — Situated
e.g. loop inversion optimization could write an array in row-major order even if the source says column-major. The compiler has to prove this is safe (e.g. any non-inline function calls that could have a pointer to the memory in question have to see the right values, as well as not changing the results of the function itself), but loop inversion is how some compilers "defeated" some of the benchmarks in SPECint or SPECfp (I forget which), making them trivial and meaningless. — Situated
Also note that x = 5; is a C++ assignment. Whether or not it compiles to an asm store instruction anywhere in your function depends on the surrounding code. Local variables with automatic storage can often stay in registers, or be optimized away entirely. — Situated
You are wrong; the compiler can't put int x = 5; after int z = x;. It wouldn't be consistent. And regarding the rest of your comment: sequential consistency [Leslie Lamport, 1979] — the result of any execution is the same as if 1. the operations of all threads are executed in some sequential order, and 2. the operations of each thread appear in this sequence in the order specified by their program. So for a single thread you can reorder as long as you maintain consistency with the original code. More detailed information can be found in §1.10 of the C++ standard. — Weft
I'm talking about where the compiler stores to memory in its assembly output. Of course it still has to set z = 5 if x = 5 appeared first in the source. But it doesn't have to touch the memory for x until later (if x has a memory location at all), because nothing else is allowed to observe the memory locations while this thread is running, because z and x are int, not std::atomic<int>. Reading x and z from another thread while they're being written is Undefined Behaviour in C++, which is what allows the compiler to store to memory in whatever order it wants. — Situated
I can prove it with a simple example: void foo() { x = 5; z = x; x = 7; } compiles to only two stores. The x=5 store never appears in the asm output, because the compiler delays it until it can collapse with the x=7 store. See gcc and clang asm output for x86-64 here: godbolt.org/g/H3aTKr . The store to z of course stores 5, because that's the current value of x at that point in program order. The compiler doesn't reorder the source itself; it reorders the asm that implements a function that behaves the same as the source would, for a single thread. — Situated
@PeterCordes If you don't like my post, that's fine. But please stop making a fool of yourself. With each post it is getting worse. Visibility of data in another thread has nothing to do with it being declared std::atomic. You have proved nothing other than the fact that a literal assignment can be optimized out. If you missed it, the code I presented was to demonstrate the need for memory fences in both threads, not only in one as the OP assumed. In fact it wasn't real code at all: as written it wouldn't even be visible in another thread, since all the variables are local automatics. Enjoy your day. — Weft
