Linux PCIe DMA Driver (Xilinx XDMA)

I am currently working with the Xilinx XDMA driver (see here for source code: XDMA Source), and am attempting to get it to run (before you ask: I have contacted my technical support point of contact and the Xilinx forum is riddled with people having the same issue). However, I may have found a snag in Xilinx's code that might be a deal breaker for me. I am hoping there is something that I'm not considering.

First off, there are two primary modes of the driver, AXI-Memory Mapped (AXI-MM) and AXI-Streaming (AXI-ST). For my particular application, I require AXI-ST, since data will continuously be flowing from the device.

The driver is written to take advantage of scatter-gather lists. In AXI-MM mode this works well, because reads are fairly random events (i.e., there isn't a continuous flow of data out of the device; the userspace application simply requests data when it needs it). As such, the DMA transfer is built up, the data is transferred, and the transfer is then torn down. This is done with a combination of get_user_pages(), pci_map_sg(), and pci_unmap_sg().
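
For reference, a rough sketch of that build/transfer/teardown sequence (not the actual XDMA code; get_user_pages_fast() stands in for the driver's get_user_pages() call, signatures vary across kernel versions, and error handling is trimmed):

    #include <linux/mm.h>
    #include <linux/pci.h>
    #include <linux/scatterlist.h>
    #include <linux/slab.h>

    /* Illustrative AXI-MM style read: pin the user pages, map them for DMA,
     * run the transfer, then unmap and release everything again. */
    static int mm_read_transfer(struct pci_dev *pdev, void __user *ubuf, size_t len)
    {
        int nr_pages = DIV_ROUND_UP(offset_in_page(ubuf) + len, PAGE_SIZE);
        struct scatterlist *sgl;
        struct page **pages;
        int pinned, nents, i;

        pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
        sgl   = kcalloc(nr_pages, sizeof(*sgl), GFP_KERNEL);
        if (!pages || !sgl)
            goto out;

        /* Pin the userspace pages so they stay resident for the transfer. */
        pinned = get_user_pages_fast((unsigned long)ubuf, nr_pages, 1, pages);
        if (pinned <= 0)
            goto out;

        sg_init_table(sgl, pinned);
        for (i = 0; i < pinned; i++)
            sg_set_page(&sgl[i], pages[i], PAGE_SIZE, 0); /* offsets simplified */

        /* Build up the DMA mapping; the device will write into these pages. */
        nents = pci_map_sg(pdev, sgl, pinned, DMA_FROM_DEVICE);

        /* ... program the DMA engine with the nents mapped entries and wait ... */

        /* Tear the transfer back down. */
        pci_unmap_sg(pdev, sgl, pinned, DMA_FROM_DEVICE);
        for (i = 0; i < pinned; i++)
            put_page(pages[i]);
    out:
        kfree(sgl);
        kfree(pages);
        return 0;
    }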

For AXI-ST, things get weird, and the source code is far from orthodox. The driver allocates a circular buffer into which the data is meant to flow continuously. This buffer is generally sized to be somewhat large (mine is on the order of 32MB), since you want to be able to ride out transient periods where the userspace application isn't reading and then has to catch up on the buffered data.

Here's where things get wonky... the circular buffer is allocated using vmalloc32(), and the pages from that allocation are mapped in the same way as the userspace buffer is in AXI-MM mode (i.e., using the pci_map_sg() interface). Because the circular buffer is shared between the device and the CPU, every read() call requires me to call pci_dma_sync_sg_for_cpu() and pci_dma_sync_sg_for_device(), and since those calls operate on the entire buffer, they absolutely destroy my performance (I cannot keep up with the device!). Funnily enough, Xilinx never included these sync calls in their code, so I first knew I had a problem when I edited their test script to attempt more than one DMA transfer before exiting and the resulting data buffer was corrupted.
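
To make the cost concrete, every read() currently ends up doing something like the following around its copy_to_user() (illustration only; struct ring_ctx and its fields are hypothetical stand-ins for the driver's bookkeeping, not real XDMA types):

    #include <linux/pci.h>
    #include <linux/scatterlist.h>

    struct ring_ctx {                    /* hypothetical bookkeeping */
        struct pci_dev     *pdev;
        struct scatterlist *sgl;
        int                 nents;       /* covers the entire 32MB ring */
    };

    /* The whole ring is synced no matter how few bytes the caller consumes. */
    static void ring_sync_around_copy(struct ring_ctx *ring)
    {
        /* Give the CPU a coherent view of the *entire* mapping... */
        pci_dma_sync_sg_for_cpu(ring->pdev, ring->sgl, ring->nents,
                                DMA_FROM_DEVICE);

        /* ... copy_to_user() only the chunk that was actually requested ... */

        /* ... then hand the whole mapping back to the device again. */
        pci_dma_sync_sg_for_device(ring->pdev, ring->sgl, ring->nents,
                                   DMA_FROM_DEVICE);
    }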

As a result, I'm wondering how I can fix this. I've considered rewriting the code to build my own buffer allocated with pci_alloc_consistent()/dma_alloc_coherent(), but this is easier said than done. Namely, the code is architected around scatter-gather lists everywhere (there appears to be a strange, proprietary mapping between the scatter-gather list and the memory descriptors that the FPGA understands).

Are there any other API calls I should be aware of? Can I use the "single" variants (i.e., pci_dma_sync_single_for_cpu()) via some translation mechanism to avoid syncing the entire buffer? Alternatively, is there some function that can make the circular buffer allocated with vmalloc() coherent?

Niveous answered 16/2, 2018 at 8:58 Comment(2)
In April 2018, Xilinx released an updated driver. You can find it in the link you have in the original question.Doradorado
@Doradorado and Xilinx didn't bother fixing any of the issues brought up in this question. They still don't deal with any syncing or with using coherent memory; they have simply added a note saying that the driver is x86 only (i.e., the hardware handles coherency).Stumper

Alright, I figured it out.

Basically, my understanding of the kernel's DMA sync API (and of its documentation) was incorrect. Two corrections fixed my performance:

  1. If the buffer is never written to by the CPU, you don't need to sync for the device. Removing the pci_dma_sync_sg_for_device() call doubled my read() throughput.
  2. You don't need to sync the entire scatterlist. Instead, in my read() call I now figure out which pages are going to be affected by the copy_to_user() call (i.e., what is going to be copied out of the circular buffer) and only sync the pages that I care about. Basically, I can call something like pci_dma_sync_sg_for_cpu(lro->pci_dev, &transfer->sgm->sgl[sgl_index], pages_to_sync, DMA_FROM_DEVICE), where sgl_index is the scatterlist entry where the copy starts and pages_to_sync is the size of the data in pages.

With the above two changes my code now meets my throughput requirements.
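
For reference, the partial sync boils down to something like this (a simplified sketch rather than the exact driver code: in the real driver the pci_dev and scatterlist come from lro->pci_dev and transfer->sgm->sgl, and this assumes one page per scatterlist entry, i.e. that pci_map_sg() did not coalesce anything; see the comment below about chained scatterlists):

    #include <linux/pci.h>
    #include <linux/scatterlist.h>

    /* Sync only the pages of the ring that the upcoming copy_to_user() touches. */
    static void sync_read_window(struct pci_dev *pdev, struct scatterlist *sgl,
                                 size_t ring_offset, size_t bytes)
    {
        int sgl_index     = ring_offset >> PAGE_SHIFT;           /* first page hit */
        int pages_to_sync = DIV_ROUND_UP((ring_offset & ~PAGE_MASK) + bytes,
                                         PAGE_SIZE);             /* pages touched  */

        pci_dma_sync_sg_for_cpu(pdev, &sgl[sgl_index], pages_to_sync,
                                DMA_FROM_DEVICE);
    }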

Niveous answered 19/2, 2018 at 23:51 Comment(2)
I'm still not sure I understand the kernel documentation regarding the sync API. Do you know if I need to call dma_sync_single_for_cpu() if I want my driver to read data from the device, or should I use it only when my driver changes the contents of the data? From my understanding (not sure if that's correct), if the driver doesn't call dma_sync_single_for_cpu() before reading data from the device, it may read stale data.Arsenical
Are you sure that &transfer->sgm->sgl[sgl_index] is a safe solution? Please note that the scatterlist may be a linked list (see elixir.bootlin.com/linux/v5.16.4/source/include/linux/…). That's why the special sg_next() helper is provided to iterate over the scatterlist.Dubois

I think XDMA was originally written for x86, in which case the sync functions do nothing.

It does not seem likely that you can use the single sync variants unless you modify the circular buffer. Replacing the circular buffer with a list of buffers to send seems like a good idea to me. You pre-allocate a number of such buffers and have a list of buffers to send and a free list for your app to reuse.
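
A minimal sketch of that pool idea, assuming coherent buffers from dma_alloc_coherent() so there is no per-transfer map/unmap/sync at all (names, counts, and sizes are invented; locking and teardown are omitted):

    #include <linux/dma-mapping.h>
    #include <linux/list.h>
    #include <linux/slab.h>

    #define POOL_NBUFS    16
    #define POOL_BUF_SIZE (2 * 1024 * 1024)

    struct pool_buf {
        struct list_head node;
        void             *cpu;   /* kernel virtual address */
        dma_addr_t        dma;   /* bus address given to the DMA engine */
    };

    static LIST_HEAD(free_bufs);    /* empty buffers the device may fill next */
    static LIST_HEAD(ready_bufs);   /* filled buffers waiting for the app     */

    static int pool_init(struct device *dev)
    {
        int i;

        for (i = 0; i < POOL_NBUFS; i++) {
            struct pool_buf *b = kzalloc(sizeof(*b), GFP_KERNEL);

            if (!b)
                return -ENOMEM;
            /* Coherent memory: no map/unmap/sync dance on either side. */
            b->cpu = dma_alloc_coherent(dev, POOL_BUF_SIZE, &b->dma, GFP_KERNEL);
            if (!b->cpu) {
                kfree(b);
                return -ENOMEM;
            }
            list_add_tail(&b->node, &free_bufs);
        }
        return 0;
    }

The DMA engine works its way through free_bufs, the driver moves completed buffers onto ready_bufs for the application, and the application recycles them back onto free_bufs once it has consumed them.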

If you're using a Zynq FPGA, you could connect the DMA engine to the ACP port so that FPGA memory access will be coherent. Alternatively, you can map the memory regions as uncached/buffered instead of cached.

Finally, in my FPGA applications, I map the control registers and buffers into the application process and only implement mmap() and poll() in the driver, to give apps more flexibility in how they do DMA. I generally implement my own DMA engines.
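
A hedged sketch of what the mmap() half of that can look like when the buffer comes from dma_alloc_coherent() (struct mydrv and its fields are hypothetical, not from any particular driver; poll() would simply wake readers when the engine reports new data):

    #include <linux/dma-mapping.h>
    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/pci.h>

    struct mydrv {                  /* hypothetical per-device state */
        struct pci_dev *pdev;
        void           *buf_cpu;    /* from dma_alloc_coherent() */
        dma_addr_t      buf_dma;
        size_t          buf_size;
    };

    /* Expose the DMA buffer directly to the application: the device fills it,
     * the app reads it in place, and no read()/copy_to_user() path is needed. */
    static int mydrv_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        struct mydrv *drv = filp->private_data;
        size_t len = vma->vm_end - vma->vm_start;

        if (len > drv->buf_size)
            return -EINVAL;

        return dma_mmap_coherent(&drv->pdev->dev, vma, drv->buf_cpu,
                                 drv->buf_dma, len);
    }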

Irresponsible answered 16/2, 2018 at 13:53 Comment(4)
Are you sure the sync functions don't do anything on X86? I walked the kernel code and it looks like there is no arch specific code for either X86 or ARM (my target platform).Stumper
Also, if I'm going to redo the circular buffer, wouldn't it make sense to have a pool of coherent buffers, so I can skip the mapping/unmapping/syncing?Stumper
@It'sPete: on x86, I/O is cache coherent so there should be nothing to do in the sync functions. On older ARM processors, I/O was not cache coherent. On newer ARM chips, some I/O is cache coherent. For example, the Zynq board provides both a coherent ACP port and multiple non-snooped ports to DRAM from the programmable logic.Irresponsible
@It'sPete: take a look at arch/arm/mm/dma-mapping.c for some of the ARM specific DMA functions.Irresponsible

Pete, I am the original developer of the driver code (before the X of XDMA came into place).

The ringbuffer was always an unorthodox thing, indeed meant for cache-coherent systems, and it is disabled by default. Its initial purpose was to get rid of the DMA (re)start latency; even with full asynchronous I/O support (and even with zero-latency descriptor chaining in some cases), we had use cases where this could not be guaranteed, and where a true hardware ringbuffer/cyclic/loop mode was required.

There is no equivalent to a ringbuffer API in Linux, so it's open-coded a bit.

I am happy to re-think the IP/driver design.

Can you share your fix?

Topless answered 12/3, 2018 at 13:42 Comment(4)
Hi Leon! Thanks for reaching out. Unfortunately, the code is not mine to share publicly, although that might very well still happen in the future. That said, I was able to cobble together a working solution based on a few main changes. First, I ignore EOP and take into account the buffer size passed in from userspace. Second, I use pci_dma_sync_sg_for_cpu() on only the part of the sgl that matters (see my answer). Finally, I remove any assumptions about RX_BUF_BLOCK size, since pci_map_sg() is free to combine descriptors if it wants.Stumper
Also, to be frank, my solution is still far from ideal and relies on the hardware copying to kernel space and then another copy to userspace. However, after I had gotten my code working, a colleague of mine pointed me to this repo: gitlab.com/WZab/Artix-DMA1 and associated paper: koral.ise.pw.edu.pl/~wzab/artykuly/….Stumper
Finally, I have to say, the driver is honestly pretty well written. Had I needed AXI-MM mode, it would have worked out of the box. I don't want to speak out of turn, but from what I gather from looking at the different code styles in the source from various authors, my guess is that the "features" that caused my issues might have been added after you had delivered your working driver to an entity I will only refer to as "X".Stumper
Just to note, recent Linux kernels have the io_uring ring-buffer construct. I have not yet reached the point of studying whether it could be used to pump DMA transfers through faster.Titanium
