(Persistence) ordering of Intel non-temporal stores to the same cache line
Do non-temporal stores (such as movnti), to the same cache line, issued by the same thread, reach the memory in program order?

In other words, on a system with NVRAM (such as an Intel Cascade Lake processor with Intel 3D XPoint NVRAM), would the absence of reordering guarantee that, after a crash, a prefix of the writes to the same cache line survives?

Hayfork answered 2/4, 2021 at 10:46 Comment(1)

Assuming that the resolved memory type of the non-temporal stores is WC (or WC+), which is what I think you're asking about, the answer is mostly no on Intel and AMD processors.

For Intel processors, certain statements from Section 11.3.1 of the Intel SDM Volume 3 specify the behavior of write-combining writes on microarchitectures with at least one WC buffer:

"The protocol for evicting the WC buffers is implementation dependent and should not be relied on by software for system memory coherency."

This is a general statement that says that the causes of WC evictions and transactions performed for evicting a WC buffer are implementation-dependent. But there are specific statements in different places in the manual.

"Likewise [like on P6], for more recent processors starting with those based on Intel NetBurst microarchitectures, a full WC buffer will always be propagated as a single burst transaction, using any chunk order within a transaction."

If all the bytes in the same WC buffer are valid, meaning that each byte was written to at least once since the buffer was allocated, then when the buffer is evicted for any reason, the entire cache line in the buffer is evicted using a single transaction. If the target of the buffer is a memory controller, which is the first unit in the persistence domain on CLX, either all the bytes of the transaction are persisted or none of them are. This implies that the program order of the write instructions that have written into that line is maintained. The ordering between these particular writes and other writes will be discussed later.

The "using any chunk order within a transaction" part in this context is not important from the perspective of software when the target of the transaction is a memory controller, but is important for other targets.

Intel specifies the chunk size as an aligned 8 bytes on all microarchitectures. This chunk size applies only on the core and uncore interconnects, not beyond them, where other protocols are implemented. With respect to writes targeting an IMC, however, persist atomicity is guaranteed at the granularity of a transaction, which may contain anywhere from 1 to 64 bytes (the size of a WC buffer on all modern Intel and AMD processors is 64 bytes), depending on the distribution of valid bytes within the same WC buffer at the time the buffer is evicted and on the exact eviction protocol. On Intel processors, the transaction is guaranteed to contain all of the 64 valid bytes in the case of a full WC buffer eviction.
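As a rough illustration of the full-buffer case described above, the following C sketch writes all 64 bytes of one aligned line with eight MOVNTI stores and then issues SFENCE. The helper name nt_store_line is hypothetical, and the non-x86 fallback exists only so the sketch compiles elsewhere. Filling every byte makes a single-transaction eviction possible on Intel, but it does not guarantee one, because the buffer may still be evicted before it becomes full.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <immintrin.h>   /* _mm_stream_si64 (MOVNTI), _mm_sfence */
#endif

#define LINE_BYTES 64    /* WC buffer / cache line size assumed here */

/* Hypothetical helper: copy one 64-byte line to dst using eight
 * 8-byte non-temporal stores, so that every byte of the line is
 * written, then fence. This does NOT make the line persist
 * atomically; a partial eviction can still happen at any point. */
static void nt_store_line(void *dst, const void *src)
{
    /* dst must be line-aligned so all stores target a single line. */
    assert(((uintptr_t)dst % LINE_BYTES) == 0);
#if defined(__x86_64__)
    long long *d = (long long *)dst;
    for (int i = 0; i < LINE_BYTES / 8; i++) {
        long long v;
        memcpy(&v, (const char *)src + 8 * i, sizeof v);
        _mm_stream_si64(&d[i], v);   /* MOVNTI m64, r64 */
    }
    _mm_sfence();   /* order the NT stores before any later stores */
#else
    memcpy(dst, src, LINE_BYTES);    /* fallback: no NT semantics */
#endif
}
```

A caller would pass a 64-byte-aligned destination, for example a buffer declared with _Alignas(64).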

The AMD manual only says that a full WC buffer eviction can be performed as a single transaction.

The following quote specifies ordering guarantees in the case of partial WC buffer evictions (where not all bytes are marked as valid in the buffer) and ordering between writes in different WC buffers. It applies to Intel and AMD processors.

"Once the eviction of a WC buffer has started, the data is subject to the weak ordering semantics of its definition."

The rest of the paragraph proceeds to elaborate. A partial WC buffer can be evicted using one or more transactions, and there are no ordering guarantees between these transactions. Once a write instruction is committed to a WC buffer, its location in program order is completely lost. If the target of these transactions is an IMC, persist atomicity is only provided at the granularity of a single transaction. That's how a write with an effective memory type of WC can persist without persisting an earlier WC write. If different write instructions partially overlap within the same WC buffer, a write instruction can become partially persistent out of order with respect to other writes in the same WC buffer. A write operation in a WC buffer that crosses a chunk boundary is not architecturally guaranteed to be atomic, unless the buffer is entirely full after combining the write (on Intel processors).
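The partial-eviction reasoning above can be sketched as a small toy model in C. It is not a description of any real eviction protocol: the type wc_model, the function names, and the worst-case rule "one transaction per aligned 8-byte chunk that holds valid bytes" are all assumptions made for illustration. The model tracks only which bytes are valid, which mirrors the point that program order is lost once a write is combined into the buffer.

```c
#include <assert.h>
#include <stdint.h>

#define WC_BYTES    64   /* WC buffer size assumed by this model */
#define CHUNK_BYTES 8    /* aligned chunk size mentioned in the text */

/* Toy model only: a WC buffer reduced to its per-byte valid mask.
 * No program-order information is kept, on purpose. */
typedef struct {
    uint64_t valid;      /* bit i set => byte i written since allocation */
} wc_model;

/* Record a store of len bytes at byte offset off within the line. */
static void wc_write(wc_model *b, unsigned off, unsigned len)
{
    assert(len >= 1 && off + len <= WC_BYTES);
    uint64_t mask = (len == 64) ? ~UINT64_C(0)
                                : ((UINT64_C(1) << len) - 1);
    b->valid |= mask << off;
}

/* A buffer is full when every byte has been written at least once. */
static int wc_is_full(const wc_model *b)
{
    return b->valid == ~UINT64_C(0);
}

/* Assumed worst case: a full buffer leaves as one transaction (the
 * Intel guarantee quoted in the text); otherwise each chunk holding
 * valid bytes may leave as its own, unordered transaction. */
static unsigned wc_max_transactions(const wc_model *b)
{
    if (wc_is_full(b))
        return 1;
    unsigned n = 0;
    for (unsigned c = 0; c < WC_BYTES / CHUNK_BYTES; c++)
        if ((b->valid >> (c * CHUNK_BYTES)) & 0xFF)
            n++;
    return n;
}
```

In this model, an 8-byte write at offset 0 followed by an 8-byte write at offset 12 touches three chunks, so a crash could in principle persist any subset of those three pieces.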

WC buffers can be evicted in an order that is different from the buffer allocation order. Fence instructions cannot be used to selectively flush WC buffers. However, a write of any type other than WC where there is an overlapping allocated WC buffer causes that buffer in particular to be evicted before performing the write. A load that hits in a WCB may not cause the buffer to be evicted.

The transactions that occur to flush a single WC buffer are not necessarily ordered with respect to the transactions that occur to flush another WC buffer in the same physical core. Even if WC eviction logic is implemented such that WC buffers are evicted serially, which is likely, there is no guarantee that transactions from different WC buffers won't end up being interleaved outside the physical core domain.

This all means that persist ordering is not guaranteed between different chunks of the same WC buffer and of different WC buffers, even in the same physical core.

The events that cause a WC buffer to be evicted may differ between vendors and between processors from the same vendor. Some events are architectural (documented in the developer manuals) while others are implementation-specific (documented in the datasheets). Store serializing instructions are an example of a synchronous event that does guarantee flushing all WC buffers on the same logical core. A hardware interrupt delivered to a logical core is an example of an asynchronous event that also causes all of its WC buffers to be evicted. Moreover, the number of WC buffers per physical or logical core is implementation-dependent and could be zero. The size of a WC buffer is also implementation-dependent and could be, architecturally speaking, larger or smaller than the size of an L1D cache line. Also, WC buffers could be used for multiple purposes other than combining WC writes, depending on the microarchitecture.

Therefore, even if you're only writing full WC buffers, it's impossible to ensure that a WC buffer is only evicted when it becomes full for the purpose of persist atomicity, even on Intel processors where a full WC eviction is performed using a single transaction.

Instead of performing multiple WC write instructions, you can use MOVDIR64B, which guarantees atomicity. MOVDIR64B doesn't allocate a WC buffer and goes directly to the destination, but it may be combined with an already allocated WC buffer, in which case the buffer is evicted immediately after combining its existing contents with the MOVDIR64B data. In any case, the write operation of MOVDIR64B is always performed as a single transaction. Note that the destination memory operand of MOVDIR64B is required to be aligned on a 64-byte boundary. Similar to a traditional WC store, MOVDIR64B is weakly ordered with any other store, except UC. MOVDIR64B is supported on Tremont (TNT), Tiger Lake (TGL), and Sapphire Rapids (SPR).
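A minimal sketch of how MOVDIR64B might be used from C follows, assuming GCC or Clang on x86-64 with a toolchain recent enough to provide the _movdir64b intrinsic. The function names cpu_has_movdir64b, store64_direct, and store_line are hypothetical, and the memcpy fallback offers no atomicity at all; it is there only so the sketch runs on processors without the instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>       /* __get_cpuid_count */
#include <immintrin.h>   /* _movdir64b, _mm_sfence */
#define HAVE_X86_PATH 1
#endif

#ifdef HAVE_X86_PATH
/* CPUID.(EAX=07H,ECX=0):ECX bit 28 reports MOVDIR64B support. */
static int cpu_has_movdir64b(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ecx >> 28) & 1);
}

/* Compiled with the movdir64b target so no global compiler flag
 * is needed; only call this after the CPUID check succeeds. */
static void __attribute__((target("movdir64b")))
store64_direct(void *dst, const void *src)
{
    _movdir64b(dst, src);   /* one 64-byte direct store */
}
#endif

/* Hypothetical wrapper: returns 1 if MOVDIR64B was used, 0 if the
 * non-atomic memcpy fallback was taken. dst must be 64-byte aligned. */
static int store_line(void *dst, const void *src)
{
    assert(((uintptr_t)dst % 64) == 0);
#ifdef HAVE_X86_PATH
    if (cpu_has_movdir64b()) {
        store64_direct(dst, src);
        _mm_sfence();   /* MOVDIR64B is weakly ordered; fence it */
        return 1;
    }
#endif
    memcpy(dst, src, 64);   /* fallback: no atomicity guarantee */
    return 0;
}
```

Checking the return value lets a caller refuse to proceed, rather than silently fall back, when 64-byte atomicity is actually required.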

A WC/WC+ write is not ordered with respect to other writes of any memory type except UC on Intel and AMD processors. In addition, a single write instruction (or an instruction that writes to the physical memory address space) of any memory type that crosses an aligned 8-byte boundary is itself not guaranteed to be atomic at a granularity beyond an aligned 8 bytes. This includes persist atomicity. The only exceptions are MOVDIR64B, ENQCMD, and ENQCMDS. The last two are relevant when doing MMIO writes. Aligned 64-byte AVX-512 stores are likely to be persistently atomic, but this is not architecturally guaranteed and should not be relied upon.
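To make the 8-byte rule concrete, here is a small C check for whether a store touches more than one aligned 8-byte chunk. The function name crosses_chunk is hypothetical; the check only restates the arithmetic in the paragraph above and says nothing about what a given processor actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if a store of `size` bytes starting at `addr` touches more
 * than one aligned 8-byte chunk, i.e. it falls outside the 8-byte
 * atomicity granularity described in the text; returns 0 otherwise. */
static int crosses_chunk(uintptr_t addr, size_t size)
{
    assert(size > 0);
    return (addr / 8) != ((addr + size - 1) / 8);
}
```

For example, an 8-byte store at an address ending in 0x4 crosses a boundary, while a 2-byte store at an address ending in 0x6 does not.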

Bisulcate answered 2/4, 2021 at 15:14 Comment(1)
On Intel processors supporting AVX-512, there are indications that naturally aligned 512-bit (64-byte, i.e., full cache line) non-temporal stores always fill the WC buffer in a single internal transaction and are therefore never split for transmission to memory. (This solves fewer problems than one might imagine...) Related note: Table 11-1 and the text in Section 11.3.1 of Volume 3 of the Intel SDM do provide specific information about the width and number of WC buffers on several generations of processors. – Aubert
