What is the difference between MOVDQA and MOVNTDQA, and VMOVDQA and VMOVNTDQ for WB/WC marked region?

Asked 26/9, 2013 at 18:16 Answered 29/3, 2022 at 19:47

What is the main difference between these instructions when they are used on memory marked as WB (write-back) versus WC (write-combining)? Specifically, what is the difference between MOVDQA and MOVNTDQA, and between VMOVDQA and VMOVNTDQ?

Is it right that for memory marked as WC, instructions with [NT] are no different from the usual ones (without [NT]), and that for memory marked as WB, instructions with [NT] treat it as if it were WC memory?

Immoderacy answered 26/9, 2013 at 18:16 Comment(0)

Note: This answer primarily discusses NT stores. Peter's answer is more comprehensive.

You would typically use the NT (non-temporal) instructions when writing to memory-mapped IO (e.g. a GPU) where the memory is strictly uncacheable and is always accessed directly.

With regular reads and writes the CPU will try to cache and write out larger blocks to main memory when it needs to. With uncacheable regions (such as MMIO) the writes have to go directly to memory and the CPU will not try to cache them. Using an NT instruction hints to the CPU that you are probably streaming a large amount of data (e.g. to a frame buffer) and it will try to combine those writes when it can fill an entire cache line.

The "non-temporal" part means that you're telling the CPU that you don't intend for the write to happen immediately but that it can be delayed, within reason, until enough NT instructions have been issued to fill the cache line.

As far as I understand, you can also use the NT instructions with regular write-back memory; the CPU will not attempt to cache those writes but will still try to stream them out when it can fill a line. In the case of writing to WB memory I'd say the application would be pretty specialized and you would need to know that you could do a better job than the CPU at managing its cache. Also, the write is not going to happen immediately, so another core or device reading that memory afterwards could see stale data until the combined write is executed. You need to manage this with SFENCE instructions if you need to flush any outstanding combined writes.

Kilauea answered 26/9, 2013 at 18:33 Comment(8)

Thanks. But what about a WC (write-combining) memory region, which is already "Uncacheable Write Combining (USWC) memory" even without [NT], as written in the article you linked? Do I need to use [NT] for WC memory in this case, and what for? – Immoderacy 26/9, 2013 at 18:45

@Immoderacy - you don't have to write combine to USWC, but if you don't then the writes can take much longer because the CPU isn't writing to cache but it has to write it all the way out to main memory before executing the next instruction. If you're writing a large block in sequence the NT instructions allow you to save time by giving the CPU a hint that you're going to be giving it more writes and to hold off on transferring out to main memory until it can do whole lines in one go. – Kilauea 26/9, 2013 at 21:16

@Immoderacy - you can think of it as a sort of optional "false cache" for uncacheable memory. I say "false" cache because in between the NT instruction and the combined read/write actually executing the real memory becomes stale (whereas with a real cache the CPU knows which value is current and can access it immediately). – Kilauea 26/9, 2013 at 21:20

SSE4.1 MOVNTDQA is an NT load. It doesn't override the ordering semantics, so it only does anything special on WC memory. Most of your answer seems to be talking only about NT stores, like SSE1 MOVNTDQ / AVX1 VMOVNTDQ. The link at the bottom is about using NT loads to copy back from video RAM to the CPU, totally separate from what the rest of your answer discusses. – Antique 29/3, 2022 at 19:37

@PeterCordes Wow, yeah, quite right. I wrote this almost a decade ago and haven't really looked back. +1 to your new answer. If you want to edit this one, by all means. – Kilauea 29/3, 2022 at 20:37

I think the best thing might be a disclaimer at the top of yours that it's about NT stores only, and maybe a link to my answer if you think people should read that. I'd be more comfortable if you made that edit, though :P What I posted is I think the right way to answer the question, and that would be a big change if I edited your answer to be more like that. – Antique 29/3, 2022 at 20:40

@PeterCordes Sure, just busy up to the eyeballs right now and I'd need to get my head back into this to edit it effectively. The x86 world is not as crisp in my mind as I'm sure it is in yours! – Kilauea 29/3, 2022 at 20:50

No worries, that little banner at the top is more than fine. What your answer says isn't wrong, it just missed discussing MOVNTDQA which the question asked about. – Antique 29/3, 2022 at 20:58

NT stores are useful on large blocks of WB memory

NT stores movntps / movntdq / etc (and their AVX forms vmovntps etc.) work well on WB memory, treating it like WC memory, overriding the memory-ordering semantics of the region and bypassing cache, building up a full 64-byte chunk of data in an LFB to send to memory when it's fully written. (But still maintaining cache-coherency with other cores.) And yes, normal stores on WC memory work like that, too.

If evicted early, before the LFB has a full line of writes, it has to do a partial update of a DDR SDRAM block when the write request reaches a memory controller. The DRAM burst size is 64 bytes, same as the cache line size; not a coincidence.
(SSE2 maskmovdqu has an NT hint (unlike AVX vmaskmovps and so on), and causes the same problem; maybe it was efficient on early single-core CPUs and could get the memory controller to use byte-masking for writes, but it's just slow now.)
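
As a rough illustration of writing whole lines, here is a minimal sketch of a streaming fill that always issues four 16-byte NT stores per 64-byte line, back to back, so a line is never left partially written. The helper name `nt_fill_lines` is made up for this example, and the 64-byte line size and 64-byte-aligned destination are assumptions.

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si128 (movntdq), _mm_set1_epi8 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill `lines` whole 64-byte cache lines with `value`
   using NT stores. Assumes `dst` is 64-byte aligned. */
static void nt_fill_lines(void *dst, uint8_t value, size_t lines)
{
    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;

    for (size_t i = 0; i < lines; i++, p += 4) {
        /* Four 16-byte NT stores cover one full 64-byte line, so the
           write-combining buffer for that line is completely written. */
        _mm_stream_si128(p + 0, v);
        _mm_stream_si128(p + 1, v);
        _mm_stream_si128(p + 2, v);
        _mm_stream_si128(p + 3, v);
    }
    /* No fence here: the calling thread still sees its own NT stores in
       program order. Ordering for other threads is a separate concern. */
}
```

A tail that is not a multiple of 64 bytes is probably better handled with ordinary stores than with a partial line of NT stores.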

If you want NT stores ordered wrt. normal stores, use sfence (_mm_sfence) after you're done with streaming (NT) stores to a big buffer, before a normal store of a flag or pointer that other cores might read. If you don't care about the order other cores see your NT stores in (because your code is single-threaded), that's unnecessary; the current core always sees its own stores in program order, even NT stores. And they will eventually make it to a memory-mapped file or whatever.
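
A minimal sketch of that publish pattern, assuming a made-up `mailbox` struct with a payload buffer and a `ready` flag (none of these names come from a real API). The explicit `_mm_sfence()` is there because, as far as I know, compilers do not account for NT stores when they implement a C11 release store on x86.

```c
#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_set1_epi8 */
#include <xmmintrin.h>   /* SSE: _mm_sfence */
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical shared state: a payload buffer plus a "ready" flag. */
typedef struct {
    _Alignas(64) unsigned char payload[4096];
    atomic_int ready;
} mailbox;

/* Producer: stream the payload with NT stores, then publish the flag. */
static void publish(mailbox *m, unsigned char value)
{
    __m128i v = _mm_set1_epi8((char)value);
    for (size_t off = 0; off < sizeof m->payload; off += 16)
        _mm_stream_si128((__m128i *)(m->payload + off), v);

    /* Order the NT stores before the normal store of the flag. */
    _mm_sfence();
    atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/* Consumer: returns -1 if nothing is published yet, else the first byte. */
static int consume_first_byte(mailbox *m)
{
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return -1;
    return m->payload[0];
}
```

In single-threaded code the `_mm_sfence()` could be dropped, since the core always sees its own stores in program order.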

NT loads are quite different

The SSE4.1 NT load instruction, movntdqa, is only special on WC memory. On WB memory on existing CPUs, it's the same as movdqa, just a 16-byte alignment-required load, but costing an extra uop. (Same goes for the vmovntdqa AVX form for 16 or 32-byte operations.) The NT load hint is ignored on current CPUs, and the instruction is not architecturally allowed to override the memory-ordering semantics; WB memory is strongly ordered, only WC is weakly ordered allowing load-load reordering.

Perhaps the hint is ignored because loads without HW prefetching would normally be disastrous, but HW prefetch only knows how to do normal prefetches, not NT prefetches like prefetchnta. Those minimize cache pollution by bypassing L3 if possible, or, on CPUs with an inclusive L3 cache (client CPUs, and Xeon before SKX), by using only a single "way" in each set. They also bypass L2 while prefetching into L1d, unless you're actually prefetching from WC memory. From WC memory, NT prefetch can actually prefetch into an LFB, IIRC. (NT loads from WC memory load into an LFB rather than cache, and later loads from the same line can pull data from that LFB, if I'm remembering correctly.) See Difference between PREFETCH and PREFETCHNTA instructions for more details about SW prefetches.
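
For reference, a software NT prefetch looks roughly like the sketch below. The function name `sum_bytes_nta` and the distance `PREFETCH_AHEAD` are assumptions made for illustration; the right distance depends on the CPU and the loop, and has to be measured.

```c
#include <xmmintrin.h>  /* SSE: _mm_prefetch, _MM_HINT_NTA (prefetchnta) */
#include <stddef.h>
#include <stdint.h>

/* Assumed prefetch distance in bytes; a placeholder that needs tuning. */
enum { PREFETCH_AHEAD = 512 };

/* Hypothetical example: sum a buffer that is read once, hinting with
   prefetchnta that the data should not pollute the caches. */
static uint64_t sum_bytes_nta(const unsigned char *data, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Issue one prefetch per 64 bytes, PREFETCH_AHEAD bytes ahead. */
        if ((i % 64) == 0 && i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)(data + i + PREFETCH_AHEAD),
                         _MM_HINT_NTA);
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same with or without it; only the cache behaviour can differ.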

Intel's whitepaper about copying from video RAM to main memory has some examples and details: https://web.archive.org/web/20120918010837/http://software.intel.com/en-us/articles/increasing-memory-throughput-with-intel-streaming-simd-extensions-4-intel-sse4-streaming-load/
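
This is not the whitepaper's code, just a minimal sketch of the general shape of such a copy loop: movntdqa loads (`_mm_stream_load_si128`) from the source and ordinary aligned stores to the destination. The helper name `copy_with_nt_loads` is made up, and the `target` attribute assumes GCC or Clang. Run on ordinary WB memory it should behave like a plain movdqa copy, as described above; the NT behaviour only matters when the source is WC memory.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 (movntdqa) */
#include <emmintrin.h>  /* SSE2: _mm_store_si128 */
#include <stddef.h>

/* Hypothetical helper: copy `bytes` (assumed to be a multiple of 16) from a
   16-byte-aligned source to a 16-byte-aligned destination. */
#if defined(__GNUC__)
__attribute__((target("sse4.1")))  /* enable SSE4.1 for this function only */
#endif
static void copy_with_nt_loads(void *dst, const void *src, size_t bytes)
{
    /* Some headers declare the intrinsic with a non-const pointer. */
    __m128i *s = (__m128i *)src;
    __m128i *d = (__m128i *)dst;

    for (size_t i = 0; i < bytes / 16; i++) {
        __m128i v = _mm_stream_load_si128(s + i);  /* NT load */
        _mm_store_si128(d + i, v);                 /* normal store */
    }
}
```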

Regular loads from WC memory (like movdqu / movdqa or plain integer mov) do in theory allow load speculation, but Dr. McCalpin reports that on Sandybridge at least, you don't actually get much if any memory-level parallelism.

Antique answered 29/3, 2022 at 19:47 Comment(3)

For the WC memory type load speculation is allowed, but the loaded data are not cached. My testing (back in the Sandy Bridge generation) showed that it is difficult to get the processor to generate effective speculation to WC memory -- regular loads are almost completely serialized. MOVNTDQA is essentially a special case that allows a modest amount of speculative concurrency for loads to WC memory. – Forefoot 29/3, 2022 at 20:35

Excellent answer. But as NT loads are not supported, what alternative approach should we take to load data while keeping the cache clean? – Bodine 14/3, 2023 at 11:42

@grayxu: NT prefetch can work if you get the prefetch distance right, but it's rather brittle to tune. (Too far ahead and the data will have been evicted, so the actual loads will miss in cache all the way to L3, or on Skylake Xeon and later, all the way to DRAM since L3 isn't inclusive anymore.) – Antique 14/3, 2023 at 11:48

Beware of processor errata when using the non-temporal instructions though, if you need them to be ordered against memory barriers (e.g. LOCK ADD, MFENCE).

Errata HSD162, BDM116 and SKL079 apply; please refer to the Haswell/Broadwell/Skylake specification updates. Basically, a non-temporal MOVNTDQA load from WC memory can pass an earlier LOCKed instruction on Haswell/Broadwell, and you must use MFENCE to fix that. On Skylake it is broken the other way: a non-temporal MOVNTDQA load from WC memory can pass an earlier MFENCE, and the fix is to update the Skylake microcode.

Abridge answered 3/8, 2016 at 11:27 Comment(1)

On SKL with updated microcode, mfence is a barrier for out-of-order execution of everything, including ALU instructions. So it's like lfence. This sucks for performance :( Are loads and stores the only instructions that gets reordered? – Antique 29/6, 2018 at 23:46
