I have a PCIe device with a userspace driver. I'm writing commands to the device through a BAR. The commands are latency-sensitive and the amount of data is small (~64 bytes), so I don't want to use DMA.
If I remap the physical address of the BAR in the kernel using ioremap_wc and then write 64 bytes to the BAR inside the kernel, I can see that the 64 bytes are written as a single TLP over PCIe. If I allow my userspace program to mmap the region with the MAP_SHARED flag and then write 64 bytes, I see multiple TLPs on the PCIe bus rather than a single transaction.
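For illustration, a minimal sketch of what such a userspace write could look like. The helper names map_bar and write_cmd64 and the device path passed to map_bar are hypothetical, not taken from the actual driver. It assumes the command buffer is 64-byte aligned and that the eight stores are issued back to back; whether the CPU merges them into one TLP is up to the write-combining hardware and the instructions the compiler emits.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes of a device node with MAP_SHARED.
 * `path` is whatever node the driver exposes (hypothetical here). */
static void *map_bar(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return MAP_FAILED;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    return p;
}

/* Copy one 64-byte command as eight consecutive 64-bit stores to a
 * 64-byte-aligned destination, then issue a store fence on x86. */
static void write_cmd64(volatile void *dst, const void *src)
{
    volatile uint64_t *d = dst;

    assert(((uintptr_t)dst & 63) == 0);
    for (int i = 0; i < 8; i++) {
        uint64_t w;
        memcpy(&w, (const unsigned char *)src + 8 * i, sizeof w);
        d[i] = w;
    }
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("sfence" ::: "memory");
#endif
}
```

Usage would be along the lines of `void *bar = map_bar(path, len);` followed by `write_cmd64(bar, cmd);`, with `path`, `len` and `cmd` supplied by the caller.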
According to the kernel PAT documentation I should be able to export write-combined pages through to userspace:
Drivers wanting to export some pages to userspace do it by using mmap interface and a combination of
1) pgprot_noncached()
2) io_remap_pfn_range() or remap_pfn_range() or vm_insert_pfn()
With PAT support, a new API pgprot_writecombine is being added. So, drivers can continue to use the above sequence, with either pgprot_noncached() or pgprot_writecombine() in step 1, followed by step 2.
Based on this documentation, the relevant kernel code from my mmap handler looks like this:
    vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
    return io_remap_pfn_range(vma,
                              vma->vm_start,
                              info->mem[vma->vm_pgoff].addr >> PAGE_SHIFT,
                              vma->vm_end - vma->vm_start,
                              vma->vm_page_prot);
My PCIe device shows up in lspci with the BARs marked as prefetchable as expected:
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 11
    Region 0: Memory at d8000000 (64-bit, prefetchable) [size=32M]
    Region 2: Memory at d4000000 (64-bit, prefetchable) [size=64M]
When I call mmap from userspace I see a log message (having set the debugpat kernel boot parameter):
reserve_memtype added [mem 0xd4000000-0xd7ffffff], track write-combining, req write-combining, ret write-combining
I can also see in /sys/kernel/debug/x86/pat_memtype_list that the PAT entry looks correct and there are no overlapping regions:
write-combining @ 0xd4000000-0xd8000000
uncached-minus @ 0xd8000000-0xda000000
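The overlap check on these entries can be sketched as a small parser. This is an illustration only: the struct and function names are hypothetical, and it assumes each entry has the form shown above with an exclusive end address (which matches the 64M BAR at d4000000 ending at d8000000).

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct memtype_range {
    char type[32];   /* e.g. "write-combining" */
    uint64_t start;  /* inclusive */
    uint64_t end;    /* exclusive (assumed from the listing) */
};

/* Parse one line of the form "<type> @ 0x<start>-0x<end>".
 * Returns true only if all three fields were read. */
static bool parse_memtype_line(const char *line, struct memtype_range *r)
{
    return sscanf(line, "%31s @ %" SCNx64 "-%" SCNx64,
                  r->type, &r->start, &r->end) == 3;
}

/* Two half-open ranges overlap when each starts before the other ends. */
static bool ranges_overlap(const struct memtype_range *a,
                           const struct memtype_range *b)
{
    return a->start < b->end && b->start < a->end;
}
```

Feeding it the two lines above yields two adjacent ranges that share the boundary 0xd8000000 but do not overlap.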
I have also checked that there are no MTRR entries that would conflict with the PAT configuration. As far as I can see, everything is set up correctly for write-combining to occur in userspace. However, when I use a PCIe analyser to observe the transactions on the PCIe bus, the userspace access pattern is completely different from that of the same write performed from the kernel after an ioremap_wc call.
Why is write-combining not working as expected from userspace?
What can I do to debug further?
I'm currently running on a single socket 6-core i7-3930K.