NT stores are useful on large blocks of WB memory
NT stores movntps / movntdq / etc. (and their AVX forms vmovntps etc.) work well on WB memory, treating it like WC memory, overriding the memory-ordering semantics of the region and bypassing cache, building up a full 64-byte chunk of data in an LFB (line fill buffer) to send to memory when it's fully written. (But still maintaining cache-coherency with other cores.) And yes, normal stores on WC memory work like that, too.
If an LFB is evicted early, before it holds a full line of writes, the result is a partial update of a DDR SDRAM block when the write request reaches a memory controller. The DRAM burst size is 64 bytes, same as the cache line size; not a coincidence.
(SSE2 maskmovdqu has an NT hint (unlike AVX vmaskmovps and so on), and causes the same problem; maybe it was efficient on early single-core CPUs and could get the memory controller to use byte-masking for writes, but it's just slow now.)
If you want NT stores ordered wrt. normal stores, use sfence (_mm_sfence) after you're done with streaming (NT) stores to a big buffer, before a normal store of a flag or pointer that other cores might read. If you don't care about the order other cores see your NT stores in (because your code is single-threaded), that's unnecessary; the current core always sees its own stores in program order, even NT stores. And they will eventually make it to a memory-mapped file or whatever.
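The fill-then-publish pattern described above could be sketched in C with intrinsics roughly as follows. This is a minimal sketch, not a tuned implementation: fill_and_publish and the ready flag are hypothetical names, and it assumes a 16-byte-aligned destination whose size is a multiple of 64 bytes so every line is written in full.

```c
#include <assert.h>
#include <emmintrin.h>   // SSE2: _mm_stream_si128 (also pulls in _mm_sfence)
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: fill dst with NT stores, then publish a flag.
// Assumes dst is 16-byte aligned and bytes is a multiple of 64.
void fill_and_publish(uint8_t *dst, size_t bytes, uint8_t value,
                      atomic_int *ready)
{
    assert(((uintptr_t)dst & 15) == 0 && bytes % 64 == 0);
    const __m128i v = _mm_set1_epi8((char)value);

    for (size_t i = 0; i < bytes; i += 64) {
        // Write all four 16-byte pieces of a 64-byte block back to back,
        // so the write-combining buffer can fill before it is flushed.
        _mm_stream_si128((__m128i *)(dst + i +  0), v);
        _mm_stream_si128((__m128i *)(dst + i + 16), v);
        _mm_stream_si128((__m128i *)(dst + i + 32), v);
        _mm_stream_si128((__m128i *)(dst + i + 48), v);
    }

    // Order the weakly-ordered NT stores before the normal flag store.
    _mm_sfence();
    atomic_store_explicit(ready, 1, memory_order_release);
}
```

A reader on another core would load the flag with acquire semantics and only then touch the buffer. In single-threaded code the _mm_sfence() and the flag could both be dropped, per the paragraph above.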
NT loads are quite different
The SSE4.1 NT load instruction, movntdqa, is only special on WC memory. On WB memory on existing CPUs, it's the same as movdqa, just a 16-byte alignment-required load, but costing an extra uop. (Same goes for the vmovntdqa AVX form for 16 or 32-byte operations.) The NT load hint is ignored on current CPUs, and the instruction is not architecturally allowed to override the memory-ordering semantics; WB memory is strongly ordered, only WC is weakly ordered allowing load-load reordering.
The hint is perhaps ignored because loads without HW prefetching would normally be disastrous, and HW prefetch only knows how to do normal prefetches, not NT prefetches like prefetchnta. Those minimize cache pollution by bypassing L3 if possible, or on CPUs with inclusive L3 cache (client CPUs, and Xeon before SKX), using only a single "way" in each set. They also bypass L2 while prefetching into L1d, unless you're actually prefetching from WC memory. From WC memory, NT prefetch can actually prefetch into an LFB, IIRC. (NT loads from WC memory load into an LFB, not cache, and later loads from the same line can pull data from there, if I'm remembering correctly.) See "Difference between PREFETCH and PREFETCHNTA instructions" for more details about SW prefetches.
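For WB memory, a software NT prefetch is the usual way to ask for reduced cache pollution on a read-once pass. The sketch below shows only the shape of such a loop: sum_bytes_nta is a hypothetical name, and the 512-byte prefetch distance is an untuned assumption that would need measuring on the target CPU. HW prefetchers often handle a simple sequential scan like this well on their own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   // SSE: _mm_prefetch, _MM_HINT_NTA

// Assumed, untuned prefetch distance in bytes.
enum { PREFETCH_AHEAD_BYTES = 512 };

// Hypothetical helper: sum a buffer that is read only once,
// issuing one prefetchnta per 64-byte line ahead of the reads.
uint64_t sum_bytes_nta(const uint8_t *src, size_t bytes)
{
    assert(src != NULL || bytes == 0);
    uint64_t sum = 0;

    for (size_t i = 0; i < bytes; i++) {
        // One prefetch per 64 bytes scanned; stay inside the buffer.
        if (i % 64 == 0 && i + PREFETCH_AHEAD_BYTES < bytes)
            _mm_prefetch((const char *)(src + i + PREFETCH_AHEAD_BYTES),
                         _MM_HINT_NTA);
        sum += src[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the function returns the same result with or without it; any benefit shows up in cache behaviour, not in the output.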
Intel's whitepaper about copying from video RAM to main memory has some examples and details: https://web.archive.org/web/20120918010837/http://software.intel.com/en-us/articles/increasing-memory-throughput-with-intel-streaming-simd-extensions-4-intel-sse4-streaming-load/
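As a rough illustration of what a streaming-load copy loop looks like (not code taken from the whitepaper), here is a sketch using the _mm_stream_load_si128 intrinsic for movntdqa. copy_with_nt_loads is a hypothetical name, the target attribute is a GCC/Clang-specific assumption used so the snippet builds without -msse4.1, and both pointers are assumed 16-byte aligned with a size that is a multiple of 64. On ordinary WB memory this behaves like a plain aligned copy; the NT load only matters if src really is WC memory such as a mapped video RAM region.

```c
#include <assert.h>
#include <smmintrin.h>   // SSE4.1: _mm_stream_load_si128 (movntdqa)
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: copy 64 bytes per iteration using NT loads.
// Assumes 16-byte-aligned pointers and bytes % 64 == 0.
__attribute__((target("sse4.1")))   // GCC/Clang-specific
void copy_with_nt_loads(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)src & 15) == 0 && ((uintptr_t)dst & 15) == 0);
    assert(bytes % 64 == 0);

    __m128i *d = (__m128i *)dst;
    __m128i *s = (__m128i *)src;   // some headers declare a non-const parameter

    for (size_t i = 0; i < bytes / 16; i += 4) {
        // Issue all four loads of one 64-byte block before storing.
        __m128i a = _mm_stream_load_si128(s + i + 0);
        __m128i b = _mm_stream_load_si128(s + i + 1);
        __m128i c = _mm_stream_load_si128(s + i + 2);
        __m128i e = _mm_stream_load_si128(s + i + 3);
        _mm_store_si128(d + i + 0, a);
        _mm_store_si128(d + i + 1, b);
        _mm_store_si128(d + i + 2, c);
        _mm_store_si128(d + i + 3, e);
    }
}
```

The destination here is written with normal stores on the assumption that it is a small cacheable buffer that will be read again soon; for a large destination that won't be reused, NT stores would be the alternative.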
Regular loads from WC memory (like movdqu / movdqa or plain integer mov) do in theory allow load speculation, but Dr. McCalpin reports that on Sandybridge at least, you don't actually get much if any memory-level parallelism.