DMA PCIe read transfer from PC to FPGA

I'm trying to get DMA transfer working between an FPGA and an x86_64 Linux machine.

On the PC side I'm doing this initialization:

//driver probe
... 
pci_set_master(dev); //enable bus mastering for the endpoint
result = pci_set_dma_mask(dev, DMA_BIT_MASK(64)); //device is 64-bit DMA capable
...

//read
pagePointer = __get_free_page(__GFP_HIGHMEM); //get 1 page
temp_addr = dma_map_page(&myPCIDev->dev,pagePointer,0,PAGE_SIZE,DMA_TO_DEVICE);
printk(KERN_WARNING "[%s]Page address: 0x%lx Bus address: 0x%lx\n",DEVICE_NAME,pagePointer,temp_addr);
writeq(cpu_to_be64(temp_addr),bar0Addr); //send address to FPGA
wmb();
writeq(cpu_to_be64(1),bar1Addr); //start transfer
wmb();

The bus address is a 64-bit address. On the FPGA side, this is the TLP I'm sending out for the read of 1 DW:

Fmt: "001"
Type: "00000"
R|TC|R|Attr|R|TH : "00000000"
TD|EP|Attr|AT : "000000"
Length : "0000000001"
Requester ID
Tag : "00000000"
Byte Enable : "00001111"
Address : (address from dma map page)

The completion that I get back from the PC is :

Fmt: "000"
Type: "01010"
R|TC|R|Attr|R|TH : "00000000"
TD|EP|Attr|AT : "000000"
Length : "0000000000"
Completer ID
Compl Status|BCM : "0010"
Byte Count : "0000000000"
Requester ID
Tag : "00000000"
R|Lower address : "00000000"

So basically it is a completion without data and with the status Unsupported Request. I don't think there is anything wrong with the construction of the TLP, but I cannot see any problem on the driver side either. The kernel I'm using has PCIe error reporting enabled, but I see nothing in the dmesg output. What's wrong? Or, is there a way to find out why I get that Unsupported Request completion?

Marco

Ovida answered 2/6, 2015 at 22:24 Comment(1)
You could compare your code to other open PCIe drivers like RIFFA 2.x or Xillybus to see how to use the kernel functions for DMA. – Obnubilate

This is an extract from one of my designs (one that works!). It's VHDL and slightly different, but hopefully it will help you:

-- First dword of TLP Header
tlp_header_0(31 downto 30) <= "01";             -- Format = MemWr
tlp_header_0(29)           <= '0' when pcie_addr(63 downto 32) = 0 else '1'; -- 3DW header or 4DW header
tlp_header_0(28 downto 24) <= "00000";          -- Type
tlp_header_0(23)           <= '0';              -- Reserved
tlp_header_0(22 downto 20) <= "000";            -- Default traffic class
tlp_header_0(19)           <= '0';              -- Reserved
tlp_header_0(18)           <= '0';              -- No ID-based ordering
tlp_header_0(17)           <= '0';              -- Reserved
tlp_header_0(16)           <= '0';              -- No TLP processing hint
tlp_header_0(15)           <= '0';              -- No TLP Digest
tlp_header_0(14)           <= '0';              -- Not poisoned
tlp_header_0(13 downto 12) <= "00";             -- No relaxed ordering, no snooping
tlp_header_0(11 downto 10) <= "00";             -- No address translation
tlp_header_0( 9 downto  0) <= "00" & X"20";     -- Length = 32 dwords

-- Second dword of TLP Header
-- Bits 31 downto 16 are the Requester ID, set by the hardware PCIe core
tlp_header_1(15 downto 8)  <= X"00";            -- Tag, it may have to increment
tlp_header_1( 7 downto 4)  <= "1111";           -- Last dword byte enable
tlp_header_1( 3 downto 0)  <= "1111";           -- First dword byte enable

-- Third and fourth dwords of TLP Header; the fourth is *not* sent when pcie_addr fits in 32 bits
tlp_header_2 <= std_logic_vector(pcie_addr(31 downto 0)) when pcie_addr(63 downto 32) = 0
                else std_logic_vector(pcie_addr(63 downto 32));
tlp_header_3 <= std_logic_vector(pcie_addr(31 downto 0));

Let's ignore the obvious difference that I was performing a MemWr of 32 dwords instead of reading one dword. The other difference, which caused me trouble the first time I did this, is that you have to use a 3DW header if the address is below 4 GB.

That means you have to check the address you get from the host and determine whether you need the 3DW header (with only the lower 32 bits of the address) or the full 4DW header format.

Unless you need to transfer an ungodly amount of data, you can set the DMA address mask to 32 bits so that you are always in the 3DW case; Linux should have plenty of memory available below 4 GB by default.
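
For illustration only, here is a minimal sketch of how that mask could be negotiated in the probe function. It reuses the dev pointer and the 2015-era pci_set_dma_mask()/pci_set_consistent_dma_mask() calls from the question, and falls back to 64 bits if the platform refuses the 32-bit mask:

//sketch only: prefer a 32-bit mask so every bus address fits in a 3DW header
if (pci_set_dma_mask(dev, DMA_BIT_MASK(32)) == 0) {
    pci_set_consistent_dma_mask(dev, DMA_BIT_MASK(32));
    //all dma_map_* results are now below 4GB, 3DW headers are enough
} else if (pci_set_dma_mask(dev, DMA_BIT_MASK(64)) == 0) {
    pci_set_consistent_dma_mask(dev, DMA_BIT_MASK(64));
    //addresses may be above 4GB, the FPGA must also handle 4DW headers
} else {
    dev_err(&dev->dev, "no usable DMA mask\n");
    return -EIO;
}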

Jute answered 3/6, 2015 at 3:18 Comment(7)
From dma_map_page I always receive a 64-bit address, which is why I'm using a 4DW header. If I set the DMA mask to 32 bits, the kernel crashes when calling dma_map_page. BTW, I use VHDL. – Ovida
You can't be sure of that: what if you run your code on a system with less than 4 GB of physical memory? Can you use dma_map_single or dma_alloc_coherent instead of dma_map_page, at least to test? It probably crashes because the __get_free_page is unrelated to DMA, so it may give you a page above 4 GB while you request a mapping below 4 GB... The kernel also has a GFP_DMA memory region, but I'm not sure that's relevant on modern systems where every region should be available for DMA. – Jute
You are right that I cannot be sure every time; it's just that in my case I print the bus address and it is always 64 bits. I saw GFP_DMA32 mentioned somewhere, but it is not well documented. The problem is that in the future I will probably need to map a huge amount of memory, even > 4 GB. Have you ever tried to transfer a page in the high memory (> 4 GB) region using a 4DW TLP? Did it work? – Ovida
So I tried with dma_map_single and it works, with a page in the highmem region. The bus address is totally different: with dma_map_page I get something like 0xFFFFFFF110A80000, while with dma_map_single I get something like 0x0000000110A80000. There must be something wrong in my implementation. Thanks. – Ovida
OK, I just found the huge mistake: dma_map_single wants an address as its second parameter, while dma_map_page expects a struct page*! This is the culprit of the malfunction. – Ovida
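
For completeness, a minimal sketch of the two correct pairings, reusing myPCIDev, DEVICE_NAME and DMA_TO_DEVICE from the question; alloc_page() and dma_map_single() are the standard kernel calls, and error handling is reduced to a single check:

//variant 1: allocate a struct page and map it with dma_map_page()
struct page *pg = alloc_page(GFP_KERNEL);
dma_addr_t busAddr = dma_map_page(&myPCIDev->dev, pg, 0, PAGE_SIZE, DMA_TO_DEVICE);

//variant 2: allocate a kernel virtual address and map it with dma_map_single()
void *buf = (void *)__get_free_page(GFP_KERNEL);
dma_addr_t busAddr2 = dma_map_single(&myPCIDev->dev, buf, PAGE_SIZE, DMA_TO_DEVICE);

//either result should be checked before handing the address to the FPGA
if (dma_mapping_error(&myPCIDev->dev, busAddr) || dma_mapping_error(&myPCIDev->dev, busAddr2))
    printk(KERN_ERR "[%s]DMA mapping failed\n", DEVICE_NAME);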
Yes, the 0xFF... address would be virtual memory, while the FPGA needs a physical address (unless you use VT-d), which runs from 0 to size-of-memory. I don't think high memory is relevant on a 64-bit architecture. DMA memory from dma_map_* is contiguous and will likely fail if you ask for 4 GB. In that case, you should implement scatter-gather and use dma_map_sg, which makes things more complicated. – Jute
High memory is relevant if you have more than 4 GB and your device can address more than 32 bits. I'm already using scatter-gather, but in a custom way. – Ovida
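
As a rough illustration of the scatter-gather path mentioned above, here is a sketch of dma_map_sg() usage; pages and npages are hypothetical names, myPCIDev and DEVICE_NAME come from the question, and error handling is left out:

//build a scatterlist from an array of struct page* and map it in one call
struct scatterlist *sgl = kmalloc_array(npages, sizeof(*sgl), GFP_KERNEL);
int i, nents;

sg_init_table(sgl, npages);
for (i = 0; i < npages; i++)
    sg_set_page(&sgl[i], pages[i], PAGE_SIZE, 0);

nents = dma_map_sg(&myPCIDev->dev, sgl, npages, DMA_TO_DEVICE);
for (i = 0; i < nents; i++) {
    //each mapped segment is one bus address/length pair for an FPGA descriptor
    dma_addr_t addr = sg_dma_address(&sgl[i]);
    unsigned int len = sg_dma_len(&sgl[i]);
    printk(KERN_INFO "[%s]segment %d: addr 0x%llx len %u\n", DEVICE_NAME, i, (unsigned long long)addr, len);
}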
