Acquire/release semantics with non-temporal stores on x64

I have something like:

if (Foo* f = acquire_load()) {
   ... use Foo
}

and:

auto f = new Foo();
release_store(f);

You could easily imagine an implementation of acquire_load and release_store that uses atomic with load(memory_order_acquire) and store(memory_order_release). But now what if release_store is implemented with _mm_stream_si64, a non-temporal write, which is not ordered with respect to other stores on x64? How to get the same semantics?
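For reference, the plain-atomic version I'm contrasting against is just a sketch like:

atomic<Foo*> gFoo;

Foo* acquire_load() {
    return gFoo.load(memory_order_acquire);
}

void release_store(Foo* f) {
    gFoo.store(f, memory_order_release);
}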

I think the following is the minimum required:

atomic<Foo*> gFoo;

Foo* acquire_load() {
    return gFoo.load(memory_order_relaxed);
}

void release_store(Foo* f) {
   _mm_stream_si64(*(Foo**)&gFoo, f);
}

And use it as so:

// thread 1
if (Foo* f = acquire_load()) {
   _mm_lfence(); 
   ... use Foo
}

and:

// thread 2
auto f = new Foo();
_mm_sfence(); // ensures Foo is constructed by the time f is published to gFoo
release_store(f);

Is that correct? I'm pretty sure the sfence is absolutely required here. But what about the lfence? Is it required or would a simple compiler barrier be enough for x64? e.g. asm volatile("": : :"memory"). According to the x86 memory model, loads are not re-ordered with other loads. So to my understanding, acquire_load() must happen before any load inside the if statement, as long as there's a compiler barrier.

Nanice answered 19/2, 2016 at 23:21 Comment(17)
Btw, none of the SIMD load/stores (even when aligned) guarantee atomicity.Forceful
@Forceful _mm_stream_si64 generates the movnti instruction, which while being sse2, is a 64bit store. There's a 32bit one as well. They must guarantee atomicity - there's almost no sane CPU architecture for a 64bit CPU where they wouldn't.Nanice
The 32/64-bit load/stores are atomic on x86/x64 respectively (when aligned). But loads/stores on entire SIMD registers are not. That's because many processors don't have full width load/stores. (i.e. Early AMD x64 chips only had 64-bit widths. And Sandy Bridge only has 128-bit wide load/store while supporting 256-bit SIMD.)Forceful
Which is true, but irrelevant since these are not SIMD instructions.Nanice
Oh, you're right. My bad, when I saw the _mm_stream stuff, I automatically assumed they were full width SIMD stores. Sorry.Forceful
You're asking about lfence but the code has sfence - which one did you mean?Whiteley
@Whiteley it has both fencesNanice
If you're worried about visibility of f after construction, you probably want a fence, or barrier, between ctor and assignment to f. Putting the barrier after the store is like tying the horses behind the carriage.Narcotism
@BitWhistler: The concern is for the visibility of *f, given that f is visible.Manella
To my understanding a barrier would be enough. Unrelated: Have you ever seen a return value assigned before stores in the function called? (even if inlined)Narcotism
@BitWhistler right, that's where the sfence is, between ctor and assignment to gFoo.Nanice
Thanks @EOF. Maybe I'm reading ensures Foo is constructed by the time f is visible incorrectly. Either way, a barrier should be enough, IMO. AFAIK fences are needed only for cross-cpu effects. Writes from a single CPU will keep order.Narcotism
@BitWhistler: The question, as I understand it, is not whether an explicit (release/acquire/full)-barrier is sufficient (it is), but rather whether the normal x86 TSO is sufficient (in which case acquire-release semantics require no fences at all) for non-temporal loads/stores (it is not; however, the next question is how to know which functions will need explicit fences, given that you don't know which functions use non-temporal loads/stores).Manella
Ahhh... you're storing to global... not seeing right at that hour... gnight:)Narcotism
I believe the lfence is required only if your "use f..." is using nt loads. @EOF, I believe that to be sane, library functions that use nt stores wash them down with sfence. You can verify with the impl of a modern memsetNarcotism
@BitWhistler: The NT load instruction apparently doesn't change memory ordering semantics.Quebec
@jrh: C11 has an equivalent stdatomic with the same memory_order_acquire and so on. The actual syntax isn't relevant to what the OP is asking about, which is how x86 NT stores interact with C11 / C++11 memory ordering semantics. You could write equivalent code in C, just with different syntax. Still, there are only room for 5 tags, and stdatomic is probably more important to have than c.Quebec

I might be wrong about some things in this answer (proof-reading welcome from people that know this stuff!). It's based on reading the docs and Jeff Preshing's blog, not actual recent experience or testing.

Linus Torvalds strongly recommends against trying to invent your own locking, because it's so easy to get it wrong. It's more of an issue when writing portable code for the Linux kernel, rather than something that's x86-only, so I feel brave enough to try to sort things out for x86.


The normal way to use NT stores is to do a bunch of them in a row, like as part of a memset or memcpy, then an SFENCE, then a normal release store to a shared flag variable: done_flag.store(1, std::memory_order_release).
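A minimal sketch of that usual pattern (buf, done_flag, producer and consumer are just placeholder names for this example; _mm_stream_si64 assumes x86-64):

#include <immintrin.h>
#include <atomic>

alignas(64) long long buf[1024];
std::atomic<int> done_flag{0};

void producer() {
    for (int i = 0; i < 1024; i++)
        _mm_stream_si64(&buf[i], i);                // a run of NT stores filling the buffer
    _mm_sfence();                                    // make the NT stores globally visible before the flag
    done_flag.store(1, std::memory_order_release);   // normal release-store publishes the data
}

void consumer() {
    while (done_flag.load(std::memory_order_acquire) == 0) { /* spin */ }
    // safe to read buf[] here
}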

Using a movnti store to the synchronization variable will hurt performance. You might want to use NT stores into the Foo it points to, but evicting the pointer itself from cache is perverse. (movnt stores evict the cache line if it was in cache to start with; see vol1 ch 10.4.6.2 Caching of Temporal vs. Non-Temporal Data).

The whole point of NT stores is for use with Non-Temporal data, which won't be used again (by any thread) for a long time if ever. The locks that control access to shared buffers, or the flags that producers/consumers use to mark data as read, are expected to be read by other cores.

Your function names also don't really reflect what you're doing.

x86 hardware is extremely heavily optimized for doing normal (not NT) release-stores, because every normal store is a release-store. The hardware has to be good at it for x86 to run fast.

Using normal stores/loads only requires a trip to L3 cache, not to DRAM, for communication between threads on Intel CPUs. Intel's large inclusive L3 cache works as a backstop for cache-coherency traffic. Probing the L3 tags on a miss from one core will detect the fact that another core has the cache line in the Modified or Exclusive state. NT stores would require synchronization variables to go all the way out to DRAM and back for another core to see it.


Memory ordering for NT streaming stores

movnt stores can be reordered with other stores, but not with older reads.

Intel's x86 manual vol3, chapter 8.2.2 (Memory Ordering in P6 and More Recent Processor Families):

  • Reads are not reordered with other reads.
  • Writes are not reordered with older reads. (note the lack of exceptions).
  • Writes to memory are not reordered with other writes, with the following exceptions:
  • ... streaming stores (writes) executed with the non-temporal move instructions (movnti, movntq, movntdq, movntps, and movntpd), plus stuff about clflushopt and the fence instructions

update: There's also a note (in 8.1.2.2 Software Controlled Bus Locking) that says:

Do not implement semaphores using the WC memory type. Do not perform non-temporal stores to a cache line containing a location used to implement a semaphore.

This may just be a performance suggestion; they don't explain whether it can cause a correctness problem. Note that NT stores are not cache-coherent, though (data can sit in the line fill buffer even if conflicting data for the same line is present somewhere else in the system, or in memory). Maybe you could safely use NT stores as a release-store that synchronizes with regular loads, but would run into problems with atomic RMW ops like lock add dword [mem], 1.


Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order.

To block reordering with earlier stores, we need an SFENCE instruction, which is a StoreStore barrier even for NT stores. (And is also a barrier to some kinds of compile-time reordering, but I'm not sure if it blocks earlier loads from crossing the barrier.) Normal stores don't need any kind of barrier instruction to be release-stores, so you only need SFENCE when using NT stores.

For loads: The x86 memory model for WB (write-back, i.e. "normal") memory already prevents LoadStore reordering even for weakly-ordered stores, so we don't need an LFENCE for its LoadStore barrier effect, only a LoadStore compiler barrier before the NT store. In gcc's implementation at least, std::atomic_signal_fence(std::memory_order_release) is a compiler-barrier even for non-atomic loads/stores, but atomic_thread_fence is only a barrier for atomic<> loads/stores (including mo_relaxed). Using an atomic_thread_fence still allows the compiler more freedom to reorder loads/stores to non-shared variables. See this Q&A for more.

// The function can't be called release_store unless it actually is one (i.e. includes all necessary barriers)
// Your original function should be called relaxed_store
void NT_release_store(const Foo* f) {
   // _mm_lfence();  // make sure all reads from the locked region are already globally visible.  Not needed: this is already guaranteed
   std::atomic_thread_fence(std::memory_order_release);  // no insns emitted on x86 (since it assumes no NT stores), but still a compiler barrier for earlier atomic<> ops
   _mm_sfence();  // make sure all writes to the locked region are already globally visible, and don't reorder with the NT store
   _mm_stream_si64((long long int*)&gFoo, (int64_t)f);
}

This stores to the atomic variable (note the lack of dereferencing &gFoo). Your function stores to the Foo it points to, which is super weird; IDK what the point of that was. Also note that it compiles as valid C++11 code.

When thinking about what a release-store means, think about it as the store that releases the lock on a shared data structure. In your case, when the release-store becomes globally visible, any thread that sees it should be able to safely dereference it.


To do an acquire-load, just tell the compiler you want one.

x86 doesn't need any barrier instructions, but specifying mo_acquire instead of mo_relaxed gives you the necessary compiler-barrier. As a bonus, this function is portable: you'll get any and all necessary barriers on other architectures:

Foo* acquire_load() {
    return gFoo.load(std::memory_order_acquire);
}
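
Putting the two together, a usage sketch only (reusing NT_release_store and acquire_load from above; the publisher/reader names are illustrative):

// thread 2: construct, then publish with the NT release-store
void publisher() {
    Foo* f = new Foo();
    NT_release_store(f);          // sfence + movnti, as defined above
}

// thread 1: acquire-load the pointer; once it's non-null, *f is safe to use
void reader() {
    if (Foo* f = acquire_load()) {
        // ... use *f; no lfence needed for WB memory on x86
    }
}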

You didn't say anything about storing gFoo in weakly-ordered WC (uncacheable write-combining) memory. It's probably really hard to arrange for your program's data segment to be mapped into WC memory... It would be a lot easier for gFoo to simply point to WC memory, after you mmap some WC video RAM or something. But if you want acquire-loads from WC memory, you probably do need LFENCE. IDK. Ask another question about that, because this answer mostly assumes you're using WB memory.

Note that using a pointer instead of a flag creates a data dependency. I think you should be able to use gFoo.load(std::memory_order_consume), which doesn't require barriers even on weakly-ordered CPUs (other than Alpha). Once compilers are sufficiently advanced to make sure they don't break the data dependency, they can actually make better code (instead of promoting mo_consume to mo_acquire). Read up on this before using mo_consume in production code, and esp. be careful to note that testing it properly is impossible because future compilers are expected to give weaker guarantees than current compilers in practice do.
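If you do decide to experiment with that (after reading the caveats above), the consume variant is only a one-line change; this is just a sketch:

Foo* consume_load() {
    // The later dereference of the returned pointer carries a data dependency on
    // this load. Current compilers typically just promote mo_consume to mo_acquire,
    // which is still correct (and costs nothing extra on x86).
    return gFoo.load(std::memory_order_consume);
}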


Initially I was thinking that we did need LFENCE to get a LoadStore barrier. ("Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions". This in turn prevents them from passing (becoming globally visible before) reads that are before the LFENCE).

Note that LFENCE + SFENCE is still weaker than a full MFENCE, because it's not a StoreLoad barrier. SFENCE's own documentation says it's ordered wrt. LFENCE, but that table of the x86 memory model from Intel manual vol3 doesn't mention that. If SFENCE can't execute until after an LFENCE, then sfence / lfence might actually be a slower equivalent to mfence, but lfence / sfence / movnti would give release semantics without a full barrier. Note that the NT store could become globally visible after some following loads/stores, unlike a normal strongly-ordered x86 store.


Related: NT loads

In x86, every load has acquire semantics, except for loads from WC memory. SSE4.1 MOVNTDQA is the only non-temporal load instruction, and it isn't weakly ordered when used on normal (WriteBack) memory. So it's an acquire-load, too (when used on WB memory).

Note that movntdq only has a store form, while movntdqa only has a load form. But apparently Intel couldn't just call them storentdqa and loadntdqa. They both have a 16B or 32B alignment requirement, so leaving off the a (for "aligned") doesn't make a lot of sense to me. I guess SSE1 and SSE2 had already introduced some NT stores using the mov... mnemonic (like movntps), but no NT loads until years later, in SSE4.1. (2nd-gen Core2: 45nm Penryn).

The docs say MOVNTDQA doesn't change the ordering semantics for the memory type it's used on.

... An implementation may also make use of the non-temporal hint associated with this instruction if the memory source is WB (write back) memory type.

A processor’s implementation of the non-temporal hint does not override the effective memory type semantics, but the implementation of the hint is processor dependent. For example, a processor implementation may choose to ignore the hint and process the instruction as a normal MOVDQA for any memory type.

In practice, current Intel mainstream CPUs (Haswell, Skylake) seem to ignore the hint for PREFETCHNTA and MOVNTDQA loads from WB memory. See Do current x86 architectures support non-temporal loads (from "normal" memory)?, and also Non-temporal loads and the hardware prefetcher, do they work together? for more details.


Also, if you are using it on WC memory (e.g. copying from video RAM, like in this Intel guide):

Because the WC protocol uses a weakly-ordered memory consistency model, an MFENCE or locked instruction should be used in conjunction with MOVNTDQA instructions if multiple processors might reference the same WC memory locations or in order to synchronize reads of a processor with writes by other agents in the system.

That doesn't spell out how it should be used, though. And I'm not sure why they say MFENCE rather than LFENCE for reading. Maybe they're talking about a write-to-device-memory, read-from-device-memory situation where stores have to be ordered with respect to loads (StoreLoad barrier), not just with each other (StoreStore barrier).

I searched in Vol3 for movntdqa, and didn't get any hits (in the whole pdf). 3 hits for movntdq: All the discussion of weak ordering and memory types only talks about stores. Note that LFENCE was introduced long before SSE4.1. Presumably it's useful for something, but IDK what. For load ordering, probably only with WC memory, but I haven't read up on when that would be useful.


LFENCE appears to be more than just a LoadLoad barrier for weakly-ordered loads: it orders other instructions too. (Not the global-visibility of stores, though, just their local execution).

From Intel's insn ref manual:

Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.
...
Instructions following an LFENCE may be fetched from memory before the LFENCE, but they will not execute until the LFENCE completes.

The entry for rdtsc suggests using LFENCE;RDTSC to prevent it from executing ahead of previous instructions, when RDTSCP isn't available (and the weaker ordering guarantee is ok: rdtscp doesn't stop following instructions from executing ahead of it). (CPUID is a common suggestion for serializing the instruction stream around rdtsc).
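
For illustration, the lfence + rdtsc pattern with intrinsics could look like this sketch (header names vary: <x86intrin.h> on GCC/Clang, <intrin.h> on MSVC):

#include <x86intrin.h>
#include <cstdint>

uint64_t fenced_rdtsc() {
    _mm_lfence();        // don't read the TSC until earlier instructions have completed locally
    return __rdtsc();
}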

Quebec answered 23/2, 2016 at 7:33 Comment(12)
Thanks for a really detailed and well-researched answer. As for when it might ever make sense to use a NT store to a single pointer - when you don't expect the pointer to be in cache and you don't expect it to be accessed anytime soon. For a large set of pointers that have a long-tail access pattern this could (you have to measure!) be an effective optimization. Note that I don't think NT stores actually evict anything from cache - either they update cache or they go to memory directly. If I'm wrong about that it could make this technique a lot less useful.Nanice
As for using atomic_var.load(std::memory_order_acquire) to get acquire semantics without a corresponding release store - I wonder if that's safe or if it's UB. If acquire/release semantics are only defined together, then one or the other alone could be optimized away by the compiler, or worse treated as UB. It really depends on the wording of the spec. It might work on current compilers and break in the future. For this reason I prefer to insert the compiler barrier manually in this case.Nanice
@Eloff: I forget what I've read about the behaviour of an NT store that hits in cache. The non-temporal hint is literally telling the processor "I won't need this again soon", so it easily could work that way on some CPUs. Also, they don't go "directly" to memory. They go into a fill buffer, which does a partial-line write if it's evicted before it's full. (This helps do write-combining to non WB areas.)Quebec
Release and Acquire are both well-defined on their own. That blog post from Jeff Preshing that I linked explain this. The docs for release and acquire don't say anything about needing to pair them. There's no reason whatsoever to define that as UB. By using sfence, you are giving your stores release semantics. Data races are UB, IIRC, but you don't have one because you're manually generating release-stores. Since the store and load functions could be compiled separately, there's no way the compiler can care how it happens. It's better to use load(mo_acquire).Quebec
yes, you're right. Although I don't see why an implementation would choose to eagerly evict a cache-line when there's no demand for replacing it currently - but that doesn't mean it's not possible. I guess what matters is measuring the change with some real-life workloads and seeing if it currently makes things better or not. Which like many optimizations of this sort might work on only some CPUs and stop working on newer ones. Like the fiasco with prefetching on Ivy Bridge.Nanice
So to summarize your answer, an LFENCE is not required in this case on x86, just a compiler barrier. Which is the conclusion I came to when researching this as well. An SFENCE with a compiler barrier is needed, not just an SFENCE. AFAIK this is how the linux kernel defines wmb() for x86. It does use an LFENCE for rmb(), but those barriers are meant to work with WC memory as well, and the LFENCE really is needed in that case.Nanice
@Eloff: An implementation would evict because you told it to, so that a valuable cache line doesn't get evicted next time one has to be allocated. Cache replacement policies aren't really strict LRU, because nobody builds the hardware to perfectly track usage. AFAIK, there's nowhere to store the NT hint for future evictions, so it's now or never. And yes, tuning for one uarch can hurt another uarch, esp. between AMD and Intel.Quebec
The insn ref manual entry for movntps says to see Vol1, chapter 10. That doc says (when an NT store is used on cacheable memory): If the memory location being written to is present in the cache hierarchy, the data in the caches is evicted. They warn that older CPUs like Pentium M might update in place instead of flushing, so a fencing operation wouldn't flush the new data. This is a problem if you're writing to device memory. Device drivers on multi-core CPUs are probably one reason it works this way.Quebec
that probably makes writing to a single pointer like this worthless. It saves an unnecessary cache line load plus the cache pollution, but it also would evict the hot part of the data - so it could even end up hurting performance. Thanks for looking that up.Nanice
@Eloff: yeah, most of the time, caches work! I've read that trying to statically decide how to manage cache is usually a bad plan, and at best is a brittle way to tune for a specific uarch / cache size. I was sceptical about using cache-bypassing stores for pointers even before checking on that. (Why do you need so many pointers? They have to point to something which can also miss in cache, and having a chain of cache misses really sucks.) NT stores are called "streaming stores" for a reason: they're usually only good for streaming bulk data into buffers too big for cache.Quebec
Your answer uses the terms WC and WB memory, but I couldn't find a good description for them on the internet. So I've asked here: stackoverflow.com/questions/45623007/…Hanleigh
@SergeRogatch: A quick search for x86 wb memory has en.wikipedia.org/wiki/Memory_type_range_register as the first hit. Nevertheless, I'm updating this answer to define terms the first time I use them.Quebec
