Generating a 64-byte read PCIe TLP from an x86 CPU

When writing data to a PCIe device, it is possible to use a write-combining mapping to hint to the CPU that it should generate 64-byte TLPs towards the device.

Is it possible to do something similar for reads? Somehow hint the CPU to read an entire cache line or a larger buffer instead of reading one word at a time?

Searcy answered 19/8, 2018 at 14:44 Comment(0)

Intel has a white-paper on copying from video RAM to main memory; this should be similar but a lot simpler (because the data fits in 2 or 4 vector registers).

It says that NT loads will pull a whole cache-line of data from WC memory into a LFB:

Ordinary load instructions pull data from USWC memory in units of the same size the instruction requests. By contrast, a streaming load instruction such as MOVNTDQA will commonly pull a full cache line of data to a special "fill buffer" in the CPU. Subsequent streaming loads would read from that fill buffer, incurring much less delay.

Use AVX2 _mm256_stream_load_si256(), or the SSE4.1/AVX1 128-bit version, _mm_stream_load_si128().

Fill-buffers are a limited resource, so you definitely want the compiler to generate asm that does the two aligned loads of a 64-byte cache-line back to back, then store to regular memory.

If you're doing more than one 64-byte block at a time, see Intel's white-paper for a suggestion on using a small bounce buffer that stays hot in L1d to avoid mixing stores to DRAM with NT loads. (L1d evictions to DRAM, like NT stores, also require line-fill buffers, LFBs).
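
For a single block, that boils down to something like the following sketch (my example, not from the white-paper; `src` is assumed to be a 64-byte-aligned address in the WC mapping, `dst` ordinary write-back memory, compiled with AVX2 enabled):

    #include <immintrin.h>

    /* Read one 64-byte block from a WC (USWC) device mapping into normal memory.
       src must be 64-byte aligned; dst is ordinary write-back memory. */
    static inline void read_pcie_64B(void *dst, const void *src)
    {
        /* Two back-to-back NT loads: the first pulls the whole line into a
           fill buffer, the second reads the upper 32 bytes from that buffer. */
        __m256i lo = _mm256_stream_load_si256((const __m256i *)src);
        __m256i hi = _mm256_stream_load_si256((const __m256i *)src + 1);

        /* Plain stores to regular (write-back) memory. */
        _mm256_storeu_si256((__m256i *)dst,     lo);
        _mm256_storeu_si256((__m256i *)dst + 1, hi);
    }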


Note that _mm256_stream_load_si256() is not useful at all on memory types other than WC. The NT hint is ignored on current hardware, but it costs an extra ALU uop anyway vs. a regular load. There is prefetchnta, but that's a totally different beast.

Mihalco answered 19/8, 2018 at 17:52 Comment(1)
NT loads may use buffers that are different from the LFBs. See: Non-temporal loads and the hardware prefetcher, do they work together? – Hydromedusa

Intel posted a white paper on how to do 64B PCIe transfers: How to Implement a 64B PCIe* Burst Transfer on Intel® Architecture.

The principles are:

  1. Map the region as WC.
  2. Use the following code to write 64B

    _mm256_store_si256((__m256i *)pcie_memory_address,        ymm0);
    _mm256_store_si256((__m256i *)(pcie_memory_address + 32), ymm1);
    _mm_mfence();
    

Here _mm256_store_si256 is the intrinsic for (v)movdqa, and the mfence orders the stores with later ones and flushes the WC buffer.
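
Putting the pieces together, a minimal sketch (my own, not from the paper) could look like this, assuming `wc_dst` is a 64-byte-aligned pointer into the WC mapping of the device BAR and `src` holds the 64 bytes to send:

    #include <immintrin.h>

    /* Write one 64-byte block to a WC-mapped BAR, aiming for a single 64B TLP.
       wc_dst must be 64-byte aligned and mapped as WC; src is normal memory. */
    static inline void write_pcie_64B(void *wc_dst, const void *src)
    {
        __m256i ymm0 = _mm256_loadu_si256((const __m256i *)src);
        __m256i ymm1 = _mm256_loadu_si256((const __m256i *)src + 1);

        /* Two back-to-back 32B stores fill one WC buffer (one cache line). */
        _mm256_store_si256((__m256i *)wc_dst,     ymm0);
        _mm256_store_si256((__m256i *)wc_dst + 1, ymm1);

        /* Order the WC stores with later ones and flush the WC buffer
           (sfence may be sufficient, see the comments below). */
        _mm_mfence();
    }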


As far as my limited understanding of the WC part of the cache subsystem goes, there are a number of assumptions:

  1. The CPU writes out a WC buffer as a single burst transaction only if the WC buffer is full:

    The only elements of WC propagation to the system bus that are guaranteed are those provided by transaction atomicity. For example, with a P6 family processor, a completely full WC buffer will always be propagated as a single 32-byte burst transaction using any chunk order. In a WC buffer eviction where data will be evicted as partials, all data contained in the same chunk (0 mod 8 aligned) will be propagated simultaneously.

    So one must be sure to start with an empty WC buffer, otherwise smaller (e.g. 32B) transactions may be issued and, even worse, the upper chunk may be written before the lower one.
    There is a practical experiment on Intel's forums using an FPGA where the WC buffer is sometimes flushed prematurely.

  2. The WC cache type ensures the core emits a burst transaction, but the uncore must also be able to handle this transaction as a whole.
    In particular, after subtractive decoding, the Root Complex must be able to process it as a single 64B transaction.
    From the same forum post as above, it seems that the uncore is able to coalesce sequential WC writes into a single TLP, but playing with the write ordering (e.g. swapping the two _mm256_store_si256 calls or leaving a hole by writing less than 64B) may exceed the Root Complex's capabilities. A short sketch applying these precautions follows below.
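
To make those precautions concrete, here is a hedged sketch (mine, not from the forum post): drain any earlier WC data, then fill the buffer completely with two consecutive, ascending stores, and flush it:

    #include <immintrin.h>

    /* Sketch: try to start from an empty WC buffer, fill it completely with
       two consecutive, ascending 32B stores, then evict it as one burst. */
    static inline void write_pcie_64B_careful(void *wc_dst,
                                              __m256i lower, __m256i upper)
    {
        _mm_sfence();                                       /* drain earlier WC stores  */
        _mm256_store_si256((__m256i *)wc_dst,     lower);   /* bytes  0..31 first       */
        _mm256_store_si256((__m256i *)wc_dst + 1, upper);   /* bytes 32..63, no hole    */
        _mm_sfence();                                       /* evict the full WC buffer */
    }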

Knowable answered 19/8, 2018 at 17:39 Comment(6)
The OP says they already know how to do burst writes. This does look like a useful collection of gotchas to avoid, though. But are you sure you really need mfence? It's quite slow; shouldn't sfence be sufficient to order stores to WC memory with regular stores? You'd only need mfence (or a locked operation, which doesn't block OoO exec, just memory reordering) to make sure the store happened before reading from the device. But see Does lock xchg have the same behavior as mfence?: NT loads from WC might be one case where there's a difference. – Mihalco
@PeterCordes Yes, I believe sfence will suffice. I'm not sure why Intel uses mfence. BTW, what I consider more valuable in my answer is the linked paper. Having the core perform a burst write doesn't mean that the Root Complex will emit a 64B TLP. Aaaand... you are right, I forgot about the read part :D I'll complete this answer as soon as I have some spare time. – Knowable
mfence would make sure that later loads aren't competing for fill buffers. It probably depends on the surrounding code whether mfence is helpful or harmful. – Mihalco
What if it's kernel module code where intrinsics can't be used? – Boyla
@Alexis_FR_JP Probably the kernel has its own functions or some inline assembly guarded by the relevant definition. Try searching for the name of the instruction in the kernel source tree; something should pop up. – Knowable
Roger, for kernel space: https://mcmap.net/q/23022/-how-to-load-a-avx-512-zmm-register-from-a-ioremap-address/4748326 – Boyla
