I'd like to thank Peter Cordes, John D McCalpin, Neel Natu, Christian Ludloff, and David Mazières for their help with figuring this out!
In order to prefetch, you need to be able to store MMIO reads in the CPU cache hierarchy. When you use UC or WC page table entries, you cannot do this. However, you can use the cache hierarchy if you use WT page table entries.
The only caveat is that when you use WT page table entries, stale data from previous MMIO reads can linger in the cache. You must implement a coherence protocol in software to flush the stale cache lines from the cache and fetch the latest data via a fresh MMIO read. This is alright in my case because I control what happens on the PCIe device, so I know when to flush. You may not know when to flush in all scenarios, though, which could make this approach unhelpful to you.
Here is how I set up my system:
Mark the page table entries that map to the PCIe BAR as WT. You can use ioremap_wt() for this (or ioremap_change_attr() if the BAR has already been mapped into the kernel).
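For example, in a PCI driver this step could look like the sketch below (a minimal illustration, not the exact code from my system; pdev and the BAR number are placeholders):

#include <linux/pci.h>
#include <linux/io.h>

/* Sketch: map BAR 0 of a PCIe device with the WT memory type.
 * pdev and the BAR number are placeholders for your own device. */
static void __iomem *map_bar_wt(struct pci_dev *pdev)
{
	resource_size_t start = pci_resource_start(pdev, 0);
	resource_size_t len = pci_resource_len(pdev, 0);

	return ioremap_wt(start, len);
}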
According to https://sandpile.org/x86/coherent.htm, the PAT type and the MTRR type interact: the MTRR type for the PCIe BAR must also be set to WT, otherwise the PAT WT type is ignored. You can set it with the command below. Be sure to update the command with the PCIe BAR address (which you can see with lspci -vv) and the PCIe BAR size. The size is a hexadecimal value in units of bytes.
echo "base=$ADDRESS size=$SIZE type=write-through" >| /proc/mtrr
As a quick check at this point, you may want to issue a large number of MMIO reads in a loop to the same cache line in the BAR. You should see the cost per MMIO read go down substantially after the first MMIO read. The first MMIO read will still be expensive because you need to fetch the value from the PCIe device, but the subsequent reads should be much cheaper because they all read from the cache hierarchy.
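A rough sketch of such a check, assuming bar points at the WT mapping from the earlier step (rdtsc() is not a serializing instruction, so treat the cycle counts as approximate):

#include <linux/io.h>
#include <linux/printk.h>
#include <asm/msr.h>

/* Sketch: read the same cache line repeatedly and print the cycle cost
 * of each read; only the first read should be expensive. */
static void time_mmio_reads(void __iomem *bar)
{
	u64 start, end;
	u32 val;
	int i;

	for (i = 0; i < 8; i++) {
		start = rdtsc();
		val = readl(bar);	/* same cache line every time */
		end = rdtsc();
		pr_info("read %d: value=0x%x, cycles=%llu\n", i, val, end - start);
	}
}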
You can now issue a prefetch to an address in the PCIe BAR and have the prefetched cache line stored in the cache hierarchy. Linux has the prefetch() function to help with issuing a prefetch.
You must implement a simple coherence protocol in software to ensure that stale cache lines backed by the PCIe BAR are flushed from the cache. You can use clflush to flush a stale cache line; Linux has the clflush() function to help with this.
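Here is a minimal sketch of that coherence step, assuming you know (via some device-specific signal) that the device has written new data to the line. The ordering between clflush and a subsequent prefetch needs care, as discussed below; this sketch uses mb() (mfence on x86), which orders the clflush against the later read:

#include <linux/io.h>
#include <asm/special_insns.h>
#include <asm/barrier.h>

/* Sketch: drop the stale cached copy of a line, then re-read it from
 * the device over MMIO so the fresh data lands in the cache. */
static u32 read_fresh(void __iomem *line)
{
	clflush((void __force *)line);	/* invalidate the stale cached copy */
	mb();				/* keep the read from passing the flush */
	return readl(line);		/* MMIO read fetches the latest data */
}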
A note about clflush in this scenario: since the memory type is WT, each store goes to both the cache line in the cache and the MMIO. Thus, from the CPU's perspective, the contents of the cache line in the cache always match the contents of the MMIO. Therefore, clflush will just invalidate the cache line in the cache -- it will not also write the cache line back to the MMIO.
Note that in my system, I immediately issue a prefetch after the clflush. However, the code below is incorrect:
clflush(address);
prefetch(address);
This code is incorrect because, according to https://c9x.me/x86/html/file_module_x86_id_252.html, the prefetch could be reordered before the clflush. If the prefetch is issued first, the prefetched line would presumably be invalidated when the clflush occurs.
To fix this, according to the link, you should issue cpuid in between the clflush and the prefetch:
unsigned int eax, ebx, ecx, edx;
clflush(address);
cpuid(0, &eax, &ebx, &ecx, &edx);
prefetch(address);
Peter Cordes said it is sufficient to issue an lfence instead of cpuid above.
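As a sketch of that variant: in the kernel, rmb() expands to an lfence on x86, so it can stand in for a raw lfence (refresh_line is a hypothetical helper name; this variant is untested here):

#include <linux/io.h>
#include <linux/prefetch.h>
#include <asm/special_insns.h>
#include <asm/barrier.h>

/* Sketch: flush the stale line, fence with lfence, then start
 * prefetching the fresh data. rmb() is an lfence on x86. */
static void refresh_line(void __iomem *line)
{
	clflush((void __force *)line);		/* drop the stale cached copy */
	rmb();					/* lfence: keep the prefetch after the clflush */
	prefetch((const void __force *)line);	/* begin fetching fresh data */
}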
Two related comments from that discussion:

"prefetchnta isn't ignored on WC memory; it would probably bring data into LFBs on Intel CPUs (like SSE4.1 movntdqa loads from WC - not into any level of cache). sfence just orders operations; it doesn't forcibly evict anything. (Except maybe on AMD, where it might wait for the store buffer to drain; I think AMD's sfence has much stronger ordering semantics than Intel's, even guaranteed on paper.) sfence might also evict dirty LFBs, or whatever the AMD equivalent is, to make past NT stores visible before later stores. But I wouldn't expect it to touch clean LFBs from loads." – Lona

"mfence or maybe even lfence might evict data in LFBs that got there from movntdqa loads, to make sure later movntdqa loads get "fresh" data. Related: Non-temporal loads and the hardware prefetcher, do they work together? and the Intel whitepaper it links about SSE4.1 movntdqa to copy from WC video RAM back to WB DRAM." – Lona