I'd like to thank Peter Cordes, John D McCalpin, Neel Natu, Christian Ludloff, and David Mazières for their help with figuring this out!
In order to prefetch, you need to be able to store MMIO reads in the CPU cache hierarchy. When you use UC or WC page table entries, you cannot do this. However, you can use the cache hierarchy if you use WT page table entries.
The only caveat is that when you use WT page table entries, stale data from previous MMIO reads can linger in the cache. You must implement a coherence protocol in software to flush the stale cache lines from the cache and fetch the latest data via a fresh MMIO read. This is alright in my case because I control what happens on the PCIe device, so I know when to flush. You may not know when to flush in all scenarios, though, which could make this approach unhelpful to you.
Here is how I set up my system:
Mark the page table entries that map to the PCIe BAR as WT. You can use ioremap_wt() for this (or ioremap_change_attr() if the BAR has already been mapped into the kernel).
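For example, in a PCI driver this step could look like the sketch below (a minimal illustration, not the exact code from my system; pdev and the BAR number are placeholders):

#include <linux/pci.h>
#include <linux/io.h>

/* Sketch: map BAR 0 of a PCIe device with the WT memory type.
 * pdev and the BAR number are placeholders for your own device. */
static void __iomem *map_bar_wt(struct pci_dev *pdev)
{
	resource_size_t start = pci_resource_start(pdev, 0);
	resource_size_t len = pci_resource_len(pdev, 0);

	return ioremap_wt(start, len);
}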
According to https://sandpile.org/x86/coherent.htm, the PAT type and the MTRR type interact: the MTRR type for the PCIe BAR must also be set to WT, otherwise the PAT WT type is ignored. You can set it with the command below. Be sure to update the command with the PCIe BAR address (which you can see with lspci -vv) and the PCIe BAR size. The size is a hexadecimal value in units of bytes.
echo "base=$ADDRESS size=$SIZE type=write-through" >| /proc/mtrr
As a quick check at this point, you may want to issue a large number of MMIO reads in a loop to the same cache line in the BAR. You should see the cost per MMIO read go down substantially after the first MMIO read. The first MMIO read will still be expensive because you need to fetch the value from the PCIe device, but the subsequent reads should be much cheaper because they all read from the cache hierarchy.
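A rough sketch of such a check, assuming bar points at the WT mapping from the earlier step (rdtsc() is not a serializing instruction, so treat the cycle counts as approximate):

#include <linux/io.h>
#include <linux/printk.h>
#include <asm/msr.h>

/* Sketch: read the same cache line repeatedly and print the cycle cost
 * of each read; only the first read should be expensive. */
static void time_mmio_reads(void __iomem *bar)
{
	u64 start, end;
	u32 val;
	int i;

	for (i = 0; i < 8; i++) {
		start = rdtsc();
		val = readl(bar);	/* same cache line every time */
		end = rdtsc();
		pr_info("read %d: value=0x%x, cycles=%llu\n", i, val, end - start);
	}
}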
You can now issue a prefetch to an address in the PCIe BAR and have the prefetched cache line stored in the cache hierarchy. Linux has the prefetch() function to help with issuing a prefetch.
You must implement a simple coherence protocol in software to ensure that stale cache lines backed by the PCIe BAR are flushed from the cache. You can use clflush to flush a stale cache line; Linux has the clflush() function to help with this.
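Here is a minimal sketch of that coherence step, assuming you know (via some device-specific signal) that the device has written new data to the line. The ordering between clflush and a subsequent prefetch needs care, as discussed below; this sketch uses mb() (mfence on x86), which orders the clflush against the later read:

#include <linux/io.h>
#include <asm/special_insns.h>
#include <asm/barrier.h>

/* Sketch: drop the stale cached copy of a line, then re-read it from
 * the device over MMIO so the fresh data lands in the cache. */
static u32 read_fresh(void __iomem *line)
{
	clflush((void __force *)line);	/* invalidate the stale cached copy */
	mb();				/* keep the read from passing the flush */
	return readl(line);		/* MMIO read fetches the latest data */
}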
A note about clflush in this scenario: since the memory type is WT, each store goes to both the cache line in the cache and the MMIO. Thus, from the CPU's perspective, the contents of the cache line in the cache always match the contents of the MMIO. Therefore, clflush will just invalidate the cache line in the cache -- it will not also write the cache line back to the MMIO.
Note that in my system, I immediately issue a prefetch after the clflush. However, the code below is incorrect:
clflush(address);
prefetch(address);
This code is incorrect because, according to https://c9x.me/x86/html/file_module_x86_id_252.html, the prefetch could be reordered before the clflush. If the prefetch is issued first, the prefetched line would presumably be invalidated when the clflush occurs.
To fix this, according to the link, you should issue cpuid in between the clflush and the prefetch:
unsigned int eax, ebx, ecx, edx;
clflush(address);
cpuid(0, &eax, &ebx, &ecx, &edx);
prefetch(address);
Peter Cordes said it is sufficient to issue an lfence instead of cpuid above.
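As a sketch of that variant: in the kernel, rmb() expands to an lfence on x86, so it can stand in for a raw lfence (refresh_line is a hypothetical helper name; this variant is untested here):

#include <linux/io.h>
#include <linux/prefetch.h>
#include <asm/special_insns.h>
#include <asm/barrier.h>

/* Sketch: flush the stale line, fence with lfence, then start
 * prefetching the fresh data. rmb() is an lfence on x86. */
static void refresh_line(void __iomem *line)
{
	clflush((void __force *)line);		/* drop the stale cached copy */
	rmb();					/* lfence: keep the prefetch after the clflush */
	prefetch((const void __force *)line);	/* begin fetching fresh data */
}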
Two related comments from that discussion:

"prefetchnta isn't ignored on WC memory; it would probably bring data into LFBs on Intel CPUs (like SSE4.1 movntdqa loads from WC - not into any level of cache). sfence just orders operations; it doesn't forcibly evict anything. (Except maybe on AMD, where it might wait for the store buffer to drain; I think AMD's sfence has much stronger ordering semantics than Intel's, even guaranteed on paper.) sfence might also evict dirty LFBs, or whatever the AMD equivalent is, to make past NT stores visible before later stores. But I wouldn't expect it to touch clean LFBs from loads." – Lona

"mfence or maybe even lfence might evict data in LFBs that got there from movntdqa loads, to make sure later movntdqa loads get "fresh" data. Related: Non-temporal loads and the hardware prefetcher, do they work together? and the Intel whitepaper it links about SSE4.1 movntdqa to copy from WC video RAM back to WB DRAM." – Lona