I notice that Intel Tremont has 64-byte store instructions with MOVDIRI and MOVDIR64B.
Those guarantee an atomic write to memory, but don't guarantee load atomicity. Moreover, the write is weakly ordered, so a fence may be needed immediately afterwards.
I find no MOVDIRx in Ice Lake. Why doesn't Ice Lake need instructions like MOVDIRx?
(At the bottom of page 15)
Intel® Architecture Instruction Set Extensions and Future Features Programming Reference
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf#page=15
Ice Lake has AVX512, which gives us 64-byte loads + stores, but no guarantee of 64-byte store atomicity.
We do get 64-byte NT stores with movntps [mem], zmm / movntdq [mem], zmm. Interestingly, NT stores don't support merge-masking to leave some bytes unwritten. That would basically defeat the purpose of NT stores by creating partial-line writes, though.
Probably Ice Lake Pentium / Celeron CPUs still won't have AVX1/2, let alone AVX512 (probably so Intel can sell chips with defects in the upper 128 bits of the FMA units and/or register file on at least one core), so only rep movsb will be able to internally use 64-byte loads/stores on those CPUs. (Ice Lake will have the "fast short rep" feature, which may make it useful even for small 64-byte copies; that's useful in kernel code that can't use vector regs.)
(Update: Ice Lake and later Pentium / Celeron finally have AVX1/2 and BMI1/2, but not AVX-512. And Alder Lake and later don't have AVX-512 even in their high-end models, probably for market segmentation since Intel won't support it even for CPUs with no E cores.)
Possibly Intel can't (or doesn't want to) provide that atomicity guarantee on their mainstream CPUs, only on low-power chips that don't support multiple sockets, but I haven't heard any reports of tearing actually existing within a cache line on Intel CPUs. In practice, I think cached loads/stores that don't cross a cache-line boundary on current Intel CPUs are always atomic.
(Unlike on AMD K10 where HyperTransport did create tearing on 8B boundaries between sockets, while no tearing could be seen between cores on a single socket. SSE instructions: which CPUs can do atomic 16B memory operations?)
In any case, there's no way to detect this with CPUID, and it's not documented, so it's basically impossible to safely take advantage. It would be nice if there was a CPUID leaf that told you the atomicity width for the system and for within a single socket, so implementations that split 512-bit AVX512 ops into 256-bit halves would still be allowed....
Anyway, rather than introducing a special instruction with guaranteed store atomicity, I think it would be more likely for CPU vendors to start documenting and providing CPUID detection of wider store atomicity for either all power-of-2-size stores, or for only NT stores, or something.
(Update: Intel recently documented that the AVX feature bit implies 128-bit atomicity for aligned loads/stores like movaps. This retroactively documents / guarantees something that's been true for a long time: https://rigtorp.se/isatomic/)
Making some part of AVX512 require 64-byte atomicity would make it much harder for AMD to support, if they follow their current strategy of half-width vector implementation. (Zen2 has 256-bit vector ALUs, making AVX1/AVX2 instructions mostly single-uop, but no AVX512 until Zen 4, unfortunately. AVX512 is a very nice ISA even if you only use it at 256-bit width, filling more gaps in what can be done conveniently / efficiently, e.g. unsigned int<->FP and [u]int64<->double.)
So IDK if maybe Intel agreed not to do that, or chose not to for their own reasons.
Use case for 64B write atomicity:
I suspect the main use-case is reliably creating 64-byte PCIe transactions, not actually "atomicity" per se, and not for observation by another core.
If you cared about reading from other cores, normally you'd want L3 cache to backstop the data, not bypass it to DRAM. A seqlock is probably a faster way to emulate 64-byte atomicity between CPU cores, even if movdir64B is available.
Skylake already has 12 write-combining buffers (up from 10 in Haswell), so it's (maybe?) not too hard to use regular NT stores to create a full-size PCIe transaction, avoiding early flushes. But maybe low-power CPUs have fewer buffers and maybe it's a challenge to reliably create 64B transactions to a NIC buffer or something.
lock cmpxchg16b to do a 16-byte atomic load when we really only need the low half. (We do still need a DWCAS for updating in that case though; not just a pure store.) – Budweis

Why doesn't Ice Lake need instructions like MOVDIRx?
I would not try to answer this from the perspective of need, but rather as a consequence of the practical realities of how instruction set architecture features and Intel products are developed.
From the previous answer:
Possibly Intel can't (or doesn't want to) provide that atomicity guarantee on their mainstream CPUs,
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf says in Table 1-1 that these instructions will be supported in a range of microarchitectures:
"Direct stores: MOVDIRI, MOVDIR64B Tremont, Tiger Lake, Sapphire Rapids"
Tiger Lake was announced as "the newest Intel® Core™ mobile processors" on https://newsroom.intel.com/news-releases/intel-ces-2020/.
Sapphire Rapids is described as "10nm-based Intel® Xeon® Scalable processors" on https://newsroom.intel.com/news-releases/intel-unveils-new-gpu-architecture-optimized-for-hpc-ai-oneapi/. See also https://s21.q4cdn.com/600692695/files/doc_presentations/2019/05/2019-Intel-Investor-Meeting-Shenoy.pdf.
Disclaimer: I work for Intel and will only cite and discuss official sources.
std::atomic<16B_struct> load or store. (GCC7 and later compile those accesses to a call to a libatomic helper function, so dynamic linker symbol resolution could check once whether the target function can be one using SSE, or lock cmpxchg16b.) – Budweis
ptr_t atomicity (which is already 8B in 64b :-o) is IMO all you should need practically. If one needs more, I guess there should be some reasonably simple way to transform that design into something with only a flag/ptr_t requirement (this is more a question than a claim). So maybe that's another reason why there's not much push to introduce such instructions... – Mulholland