how to do mmap for cacheable PCIe BAR
I am trying to write a driver with a custom mmap() function for a PCIe BAR, with the goal of making this BAR cacheable in the processor cache. I am aware this is not the best way to achieve the highest bandwidth, and that the order of writes is unpredictable (neither is an issue in this case).

This is similar to what is described in How would one prevent MMAP from caching values?

The processor is a Sandy Bridge i7; the PCIe device is an Altera Stratix IV development board.

First, I tried to do it on CentOS 5 (2.6.18). I changed the MTRR settings to make sure the BAR is not covered by an uncacheable MTRR, and used io_remap_pfn_range() with the _PAGE_PCD and _PAGE_PWT bits cleared. Reads worked as expected: they returned correct values, and a second read to the same address did not necessarily go out to PCIe (a read counter in the FPGA confirmed this). However, writes caused the system to freeze and then reboot, without any messages in the logs or on the screen.
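
A minimal sketch of that first approach (2.6-era kernel API; bar_phys is a placeholder for the BAR base obtained from pci_resource_start(), not from the original post):

    /* Hypothetical mmap() handler: clear PCD/PWT so the PTEs request a
     * cacheable (WB) mapping of the BAR. */
    static int my_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;

        /* Clear the "page cache disable" and "page write through" bits. */
        pgprot_val(vma->vm_page_prot) &= ~(_PAGE_PCD | _PAGE_PWT);

        return io_remap_pfn_range(vma, vma->vm_start,
                                  bar_phys >> PAGE_SHIFT,
                                  size, vma->vm_page_prot);
    }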

Second, I tried to do it on CentOS 6 (2.6.32), which has PAT support. The result was the same: reads worked correctly, writes caused a system freeze and reboot. Interestingly, non-temporal/write-combining full-cache-line writes (AVX/SSE) worked as expected, i.e. they always reached the FPGA, the FPGA observed full cache line writes, and reads returned correct values afterwards. However, simple 64-bit writes still caused a system freeze/reboot.
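
For reference, a full-cache-line non-temporal write of the kind described above might look like this (a sketch, assuming an AVX-capable CPU and a 64-byte-aligned bar pointer):

    #include <immintrin.h>

    /* Two 32-byte AVX streaming stores cover one 64-byte cache line;
     * the FPGA should observe this as a single full-line write. */
    static inline void nt_write_line(void *bar, const __m256i *src)
    {
        _mm256_stream_si256((__m256i *)bar, src[0]);
        _mm256_stream_si256((__m256i *)((char *)bar + 32), src[1]);
        _mm_sfence(); /* flush/order the write-combining buffer */
    }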

I also tried ioremap_cache() followed by iowrite32() inside the driver code. The result was the same.
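
Roughly, i.e. (bar_phys, bar_len and REG_OFF are placeholders):

    void __iomem *regs = ioremap_cache(bar_phys, bar_len); /* WB mapping */
    u32 v = ioread32(regs + REG_OFF);  /* reads behaved correctly */
    iowrite32(v + 1, regs + REG_OFF);  /* this kind of write caused the freeze */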

I suspect it is a hardware issue, but I would appreciate it if somebody could share any ideas about what's going on.

EDIT: I was able to capture an MCE message on CentOS 6: Machine Check Exception: 5 Bank 5: be2000000003110a.

I also tried the same code on a 2-socket Sandy Bridge (Romley) system: read and non-temporal write behavior is the same, and simple writes do not cause an MCE/crash, but they have no effect on system state, i.e. the value in memory does not change.

I also tried the same code on an older 2-socket Nehalem system: simple writes also cause an MCE there, although the codes are different.

Clubbable answered 28/6, 2012 at 22:52 Comment(0)

I am not aware of any x86 hardware that supports the WriteBack (WB) memory type for MMIO addresses, and you are almost certainly seeing a result of that incompatibility. I have posted a discussion of this topic on my blog at http://blogs.utexas.edu/jdm4372/2013/05/29/ and http://blogs.utexas.edu/jdm4372/2013/05/30/

In those postings, I discuss a method that works on some processors -- map the MMIO range twice: once for store operations from the processor to the FPGA using the Write-Combining (WC) memory type, and once for reads by the processor from the FPGA using the Write-Protect (WP) or Write-Through (WT) type. You will need to maintain coherence manually by using CLFLUSH on cache lines in the "read only" region when you write to the alias of that line in the "write only" region. You will also need to maintain coherence manually with respect to changes in the values in the FPGA memory, since IO devices cannot generate cache invalidation transactions for MMIO addresses.
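
A sketch of that manual-coherence step, assuming wr_base is the WC-mapped alias and rd_base the WP/WT-mapped alias of the same BAR region (names are hypothetical, not from the post):

    #include <stdint.h>
    #include <emmintrin.h> /* _mm_clflush, _mm_mfence */

    static inline void fpga_write64(volatile uint64_t *wr_base,
                                    volatile uint64_t *rd_base,
                                    size_t idx, uint64_t val)
    {
        wr_base[idx] = val;   /* store through the "write only" WC alias */
        _mm_mfence();         /* drain the WC buffer to the device */
        _mm_clflush((const void *)&rd_base[idx]); /* drop the stale, clean line */
        _mm_mfence();         /* order the flush before the next load */
    }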

My team did this a few years ago when I was at AMD, and I am now trying to figure out how to do it with newer Linux kernels and with Intel processors. Linux does not directly support the WP or WT memory types with its pre-defined mapping functions, so some hacking is required. It is fairly easy to override the MTRR for a region, but I am having more trouble finding the correct place(s) in the descendants of the remap_pfn_range() function that I need to change in order to get the WP or WT attribute set in the PAT entries for the range.
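
Overriding the MTRR from a driver might look roughly like this (a sketch; on x86 the effective memory type is the more restrictive combination of the MTRR and PAT types, so a WT MTRR over a default WB PAT mapping should yield WT):

    #include <asm/mtrr.h>

    /* Request a Write-Through MTRR covering the BAR. bar_phys/bar_len
     * are placeholders for the BAR base address and size. */
    int reg = mtrr_add(bar_phys, bar_len, MTRR_TYPE_WRTHROUGH, true);
    if (reg < 0)
        printk(KERN_WARNING "mtrr_add failed: %d\n", reg);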

This method is probably better suited for FPGAs than for other (pre-defined) types of IO devices, since the programmability of the FPGA allows the flexibility to define the PCI BARs to operate in this double-mapped mode and to cooperate with the processor-side driver in maintaining cache coherence.

Cohlier answered 6/6, 2013 at 17:52 Comment(33)
Thanks for this answer John. Would it also be correct to use CLFLUSH to invalidate the cache line when you believe that the FPGA has changed the value of the memory and you want to load in the latest value from the PCIe BAR? I am not sure what to do in that case because it seems like CLFLUSH will write the value in the cache to the PCIe BAR, which is not what I want -- I want to just clear the cache line from the cache and load in the latest cache line value from the PCIe BAR.Kalin
BTW, Jack also posted Prefetch from MMIO? if you have any insight into that.Remote
@JackHumphries: If the cache line is clean (not in the MESI "modified" state), it can and will just be dropped without a write-back first. That should hopefully always be the case with a WT region, since writes go through, letting the cache stay clean. But I've never played around with IO myself, just performance stuff on WB memory, and wouldn't want to bet that there isn't some timing race condition with a recent store -- though only with a store that's still in flight, which you want to happen anyway.Remote
@PeterCordes Thanks Peter, that's a good point about WT. Even if that ends up not being true, I can reorganize my struct such that the part I want to read from the host is on a separate cache line from the part I want to write from the host. It sounds like CLFLUSH is what I will use to clear the cache line from the cache.Kalin
@JackHumphries: (I edited the previous comment significantly). Continuing: Speculative load is possible at any time in a cacheable region, but clflushopt and some kind of barrier, or maybe just clflush, might be enough on a WT region. If a cache line ever could be dirty after one write of it has already happened, AFAIK the only way to invalidate without write-back is INVD which affects all lines on all cores, totally unusable. x86 has cache-coherent DMA (backwards compat... 8086 had no cache), so until CLFLUSH there weren't cache-control instructions at all, except prefetch.Remote
@PeterCordes Thanks Peter, your comments make sense to me. I think I am still going to do what I wrote in my comment. Does the direction I propose make sense to you?Kalin
@JackHumphries: Yeah, definitely worth a try, I'd expect it to work to clflush / mfence / load to guarantee a cache miss and a PCIe request. Possibly just lfence, if retiring clflush guarantees that the flush has completed. The more I think about it, the less likely it seems that a corner case could make clflush write anything from WT memory if it works the way I expect. Except if there actually was a pending store that hadn't made it out of the store buffer yet, but presumably there's enough synchronization with the device that you aren't actively storing at this time.Remote
@JackHumphries In the double-mapped case, the "read-only" region will always be clean, so the CLFLUSH instruction should just cause a silent invalidation. I think I just use LFENCE after each CLFLUSH. For the FPGA case, I planned to have the FPGA provide a bit-map to tell me which lines needed to be flushed. For the "write-only" region, nothing will be cached, so it just requires fencing after the write-combining stores.Cohlier
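
A sketch of the flush-then-load sequence being discussed (assuming p points into the WT-mapped "read only" alias):

    #include <stdint.h>
    #include <emmintrin.h> /* _mm_clflush, _mm_mfence */

    static inline uint64_t mmio_read_fresh(const volatile uint64_t *p)
    {
        _mm_clflush((const void *)p); /* line is clean under WT, so this just invalidates */
        _mm_mfence();                 /* ensure the flush completes before the load */
        return *p;                    /* guaranteed miss -> goes out over PCIe */
    }
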
@PeterCordes Thanks both. I marked the PTEs as WT with ioremap_wt(). As a quick test, I tried reading a 64-bit int (READ_ONCE() in each iteration) in the MMIO region 50 times in a loop. I took a measurement with rdtsc_ordered(), and it seems that reading the int 50 times is 50x more expensive than reading the int once (the overhead is high and is clearly from crossing the bus 50 times), even though I would expect the int to be cached after the first read. I am looking into this, though please let me know if I missed an obvious step. Thanks.Kalin
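
Roughly the litmus test described above, in kernel context (mmio stands in for the WT-mapped BAR pointer; the loop count of 50 matches the experiment):

    /* Time 50 back-to-back reads of the same 64-bit MMIO word. If WT
     * caching works, every read after the first should hit in L1. */
    volatile u64 *p = (volatile u64 *)mmio;
    u64 t0, t1, sink;
    int i;

    t0 = rdtsc_ordered();
    for (i = 0; i < 50; i++)
        sink = READ_ONCE(*p);
    t1 = rdtsc_ordered();
    pr_info("50 MMIO reads: %llu cycles (sink=%llu)\n", t1 - t0, sink);
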
This is on AMD Zen 3 by the way.Kalin
Just to clarify, the double mapping is not necessary in order for the int to be cached, correct? Just a single WT mapping is sufficient?Kalin
@JackHumphries: Sorry, I pretty much just know the theory when it comes to I/O and cacheability settings, never tried any of this stuff myself or poked around at kernel code. But yes, READ_ONCE() just compiles to a plain mov load, and I'd have expected that to hit in cache if WT actually worked to make MMIO cacheable.Remote
@PeterCordes No worries, thanks Peter. I will look at this more closely and see if I find any issues with what I've done.Kalin
@JackHumphries: It might also be possible that a CPU treats WT as UC for non-DRAM. Your test procedure sounds like a correct litmus test to check if it's actually working. Especially if you do a READ_ONCE outside the timed loop (maybe somewhere else in the same page) to rule out a TLB miss after modifying the page table in one case but not another.Remote
@PeterCordes I just found an issue where the same physical address range is also being mapped somewhere else as UC. I tried changing that to WT as well. I will let you know if that fixes it once I can reboot the machine.Kalin
@PeterCordes Even when I remove the second mapping altogether -- so there is only one WT mapping to the PCIe BAR -- I still see the same overhead for 50 MMIO reads in a row. I suppose it may not be possible to cache the reads or prefetch unfortunately. Let me try one more time with a WB PTE just to see what happens.Kalin
@PeterCordes I tried with ioremap_cache() (which is for WB) and ioremap_wp() (which is for WP) and still saw the same overhead of 50x vs. a single MMIO read. I suppose caching (and therefore prefetching) is not possible, at least on my AMD machine.Kalin
@PeterCordes As a last resort, is there any possible value in just doing a normal load of the int (rather than a prefetch) in the MMIO region well in advance of when I need it? If there are no subsequent dependencies on the load in the pipeline (for at least a few thousand cycles), perhaps the CPU could overlap the MMIO load with the instructions that come after. But I am not sure how long the CPU will let the MMIO load sit in the pipeline as it moves onto the next instructions before stalling.Kalin
@JackHumphries: IDK if AMD has forums where you could ask and maybe get a response from an AMD engineer who'd be able to confirm this is expected behaviour for their CPUs, or if cacheable MMIO is possible somehow. Re: early loads independent of later instructions for a while: yes that would probably help. x86 CPUs AFAIK don't let loads retire from the ROB until data actually arrives, unlike ARM where LoadStore reordering is allowed so it can probably retire a load once it's known to be non-faulting, separately tracking that the reg isn't ready...Remote
@JackHumphries: ... So on x86, the ROB size will limit how much load latency you can hide. If all later instructions are independent of that load, they can all execute, up to the limit of the ROB capacity, so as soon as it does arrive and retire, the pipeline can start filling the RS with new work, and the ROB can retire the completed work after the load at its max rate, like 4 to 6 per clock (per logical core on Skylake, even though issue is only 4 per physical core). vs. if you had later work that was mostly dependent on the MMIO load, the CPU would have to wait to exec it.Remote
@JackHumphries: So yes, software pipelining like early loads can help the hardware get more work done in the shadow of a slow load, even one that eventually stalls. Especially good if you can put it right before some other low-IPC code that's known to have cache misses and/or branch mispredicts, or latency bottlenecks, not like 4 instruction per clock highly tuned code that will fill the ROB and complete that work in a minimal number of cycles. See also blog.stuffedcow.net/2013/05/measuring-rob-capacityRemote
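
The early-load idea from these comments, as a sketch (do_independent_work() and consume() are hypothetical stand-ins):

    /* Issue the slow MMIO load early; the out-of-order core can execute
     * the independent work in the load's shadow, up to ROB capacity. */
    uint64_t v = *(const volatile uint64_t *)mmio_ptr; /* load starts here */
    do_independent_work();  /* no dependency on v */
    consume(v);             /* first real use of the loaded value */
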
@PeterCordes Sounds good. And I suppose I didn't see that effect in my 50 loop experiment above because I used rdtsc_ordered() before and after the loop to measure the start and end timestamps, respectively? So rdtsc_ordered() forced the loads to retire before taking the end timestamp?Kalin
@JackHumphries: Right. And if it's treating the memory as strongly-ordered UC, it might not even pipeline the requests with each other, not starting the next one until the first one arrives maybe? Otherwise you'd expect (if you're only timing outside the loop with lfence;rdtsc or rdtscp) that the MMIO reads could be pipelined, so they'd take 50 / max_concurrency times longer than barrier; load ; barrier. Where max_concurrency is probably same as DRAM, 10 or 12. Unless them all being to the same location is introducing extra ordering, not separate dwords or cache lines.Remote
@PeterCordes According to the table at the bottom right of sandpile.org/x86/coherent.htm, both the MTRR and the PTE need to be set to WT in order for the memory to be WT. cat /proc/mtrr indicates that the MTRR is not WT. Let me figure out how to set this...Kalin
@JackHumphries: Ah, I wondered briefly if MTRR would override PAT, but I'd assumed Linux's kernel function would handle that, or that you'd already checked that it wasn't necessary. I guess I should have mentioned that idea earlier, but glad you found it. You can try WT on a page of DRAM and do performance experiments to verify that stores are still slow but reads are fast, to make sure your mapping code is working.Remote
@PeterCordes I added an MTRR with type WT for the PCIe BAR and the overhead for 50 MMIO reads to the same int is now only 1.44x more expensive (rather than 50x more expensive) than one MMIO read. It looks like prefetching may be possible after all... I will let you know. Thanks!Kalin
@JackHumphries Glad that you found the PAT/MTRR combination tables -- it is definitely not made easy by the Linux kernel. What sort of absolute latencies are you seeing for the single-read case? The double-mapping is just needed to get (not terrible) performance for both reads and writes to a single FPGA MMIO-mapped region.Cohlier
@PeterCordes Hi Peter and John, thanks for your help! I was able to get the prefetching to work today. I now observe very low overheads for MMIO reads that were prefetched. I will write up an answer on my other post and give both of you credit, though you are welcome to write an answer yourself if you prefer. By the way, I do clflush, then cpuid, then I prefetch. The point of the cpuid is to ensure that the prefetch happens after the clflush. Not sure if there is a way to do this with lower overhead than cpuid, but not sure that it matters too much.Kalin
@JackHumphries: I'd expect that clflush / lfence would work to make sure the clflush has retired before later instructions can exec. (With Spectre mitigation enabled, lfence on AMD CPUs is an execution barrier like on Intel, so you can pretty much assume that these days, especially in Linux). Intel's manual says clflush is ordered wrt. fences, so that would include lfence. Prefetch isn't ordered wrt. fences in general, but a prefetch after an lfence execution barrier won't even be seen by the back-end until after the ROB drains. (Although that's an implementation detail.)Remote
As far as I know, in practice at least on current Intel/AMD CPUs, the sequence mfence; lfence is as strong as a serializing instruction (such as cpuid) as far as ordering anything that could matter, draining the store buffer and ROB. You don't need mfence because clflush is ordered wrt. fences including lfence. In kernel code, wrmsr might possibly be cheaper than cpuid if you know a safe MSR number that doesn't do much. Still slow, though, and probably still a VM exit if virtualizing.Remote
@PeterCordes Interesting, let me try an lfence. Thanks!Kalin
@JackHumphries: Indeed. A prefetch before an lfence or mfence isn't guaranteed to be complete after the fence. But knowing how lfence works in practice (stopping the front-end from issuing instructions into the back-end until the ROB drains), the execution units simply can't see the prefetch until after the lfence lets execution of later instructions begin. This is documented for Intel CPUs: felixcloutier.com/x86/lfence - LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completesRemote
@JackHumphries: Re: AMD processors, Is LFENCE serializing on AMD processors? has the details on what their manuals say when MSR_F10H_DECFG_LFENCE_SERIALIZE_BIT is set. A prefetch instruction can't do anything until dispatched to a load execution unit, so in that sense lfence can order prefetches in one direction. Also re: lfence, see Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengthsRemote
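
Putting the comment thread together, the lower-overhead flush-then-prefetch sequence might look like this (a sketch; per the comments, clflush is ordered before lfence, and in practice the execution units can't see the prefetch until lfence completes):

    #include <immintrin.h> /* _mm_clflush, _mm_lfence, _mm_prefetch */

    static inline void mmio_refresh_early(const volatile void *p)
    {
        _mm_clflush((const void *)p);  /* drop the possibly-stale line */
        _mm_lfence();                  /* flush completes before later instructions execute */
        _mm_prefetch((const char *)p, _MM_HINT_T0); /* start the PCIe read well ahead of use */
    }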