Flush cache to DRAM

Asked 19/9, 2013 at 14:1 Answered 12/9, 2014 at 15:6

I'm using a Xilinx Zynq platform with a region of memory shared between the programmable HW and the ARM processor.

I've reserved this memory using memmap on the kernel command line and then exposed it to userspace via mmap/io_remap_pfn_range calls in my driver.

The problem I'm having is that it takes some time for the writes to show up in DRAM, and I presume they're stuck in the dcache. There's a bunch of flush_cache_* calls defined but none of them are exported, which is a clue to me that I'm barking up the wrong tree...

As a trial I locally exported flush_cache_mm just to see what would happen, and no joy.

In short, how can I be sure that any writes to this mmap'd region have been committed to DRAM?

Thanks.

Dysfunction answered 19/9, 2013 at 14:1 Comment(2)

memory.txt shows the types of memory available from Linux kernel space. These memory options might serve you better than the memmap command line. You can handle command line arguments in a driver. – Blouson 20/9, 2013 at 14:16

Possible duplicate of How to flush the CPU cache for a region of address space in Linux? – Cambell 24/8, 2017 at 6:42

ARM processors typically have both I/D caches and a write buffer. The idea of a write buffer is to gang sequential writes together (great for synchronous DRAM) and to avoid stalling the CPU while it waits for a write to complete.

To be generic, you can flush the D-cache and the write buffer. The following is some inline ARM assembler which should work for many architectures and memory configurations.

 static inline void dcache_clean(void)
 {
     const int zero = 0;
     /* Test and clean the D cache: loop until no dirty lines remain,
        pushing them to external memory. */
     __asm volatile ("1: mrc p15, 0, r15, c7, c10, 3\n"
                     " bne 1b\n" ::: "cc", "memory");
     /* Drain the write buffer. */
     __asm volatile ("mcr p15, 0, %0, c7, c10, 4" :: "r" (zero) : "memory");
 }

You may need more if you have an L2 cache.

To answer in a Linux context, there are different CPU variants and different routines depending on memory/MMU configurations and even CPU errata. See for instance,

These routines are either called directly or looked up in a CPU info structure that holds function pointers to the appropriate routine for the detected CPU and configuration. Which one applies depends on whether the kernel is built for a single CPU or is multi-purpose, like an Ubuntu distribution kernel.

To answer the question specifically for your situation, we need to know whether there is an L2 cache, whether the memory is write buffered, and the CPU architecture specifics, possibly including silicon revisions for errata. Another tactic is to avoid this completely by using the dma_alloc_XXX() routines, which mark memory as un-cacheable and un-bufferable so that CPU writes are pushed externally immediately. Depending on your memory access pattern, either solution is valid. You may wish to cache if the memory only needs to be synchronized at some checkpoint (vsync/hsync for video, etc).

Blouson answered 19/9, 2013 at 14:22 Comment(4)

Clean entire D-cache does not work on ARMv7 (and I think the Zynq is a Cortex-A9). Also, it is a privileged operation, so would undef from userspace. – Robet 19/9, 2013 at 22:25

cache-v7.S for the github Xilinx tree might be more appropriate; although cursorily, it looks the same as the mainline. My intent was to be more generic and hope that the OP would understand something instead of spoon feeding. Mainly, it is almost always better not to use cacheable or bufferable memory in these cases. – Blouson 20/9, 2013 at 13:57

@Robet All of those routines are meant to run in supervisor mode. I just meant to show the two steps; v7_flush_dcache_all() in cache-v7.S might be more appropriate for a Cortex-A9. The driver would have to provide a mechanism to flush from user space, like an ioctl() command, or proc or sysfs file entries. Thanks, I thought this was obvious and it probably isn't. – Blouson 20/9, 2013 at 14:18

Yes, I've already implemented the 'flush' function as an ioctl in my driver. Thus far I've not seen any improvement in behavior. Lots of good info in here though. Right now I've been attempting to use flush_cache_all(). I see v7_flush_dcache_all declared in cache-v7.S but it's not clear to me how to call this asm function from my driver... – Dysfunction 21/9, 2013 at 15:56
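For readers following the ioctl approach discussed in these comments, here is a minimal sketch of what the user-space side could look like. The request struct, the command number SHM_IOC_FLUSH and the helper name are all hypothetical; a real driver defines its own, and the kernel side still has to perform the actual cache maintenance.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Hypothetical request layout; the real driver defines its own. */
struct shm_flush_req {
    uint64_t offset;  /* byte offset into the mmap'd region */
    uint64_t length;  /* number of bytes to clean */
};

/* Hypothetical command number, not taken from any real driver. */
#define SHM_IOC_FLUSH _IOW('z', 1, struct shm_flush_req)

/* Ask the driver to clean [offset, offset + length).
 * Returns 0 on success or a negative errno value on failure. */
static int shm_flush(int fd, uint64_t offset, uint64_t length)
{
    struct shm_flush_req req = { .offset = offset, .length = length };

    if (ioctl(fd, SHM_IOC_FLUSH, &req) < 0)
        return -errno;
    return 0;
}
```

The caller would invoke shm_flush() on the driver's file descriptor after writing to the mapping and before telling the hardware to read it.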

I hit the exact same problem, on zynq. Finally got L2 flushed/invalidated with:

#include <asm/outercache.h>
outer_cache.flush_range(start,size);
outer_cache.inv_range(start,size);

start is a kernel virtual space pointer. You also need to flush L1 to L2:

__cpuc_flush_dcache_area(start,size);

I'm not sure if invalidating L1 is needed before reading, and I haven't found the function to do this. I assume it would need to be, and I've thus far only been lucky...
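One detail worth checking when using range-based maintenance like the calls above: the operations act on whole cache lines, so a buffer that shares a line with unrelated data can have that data written back or discarded along with it. The helper below is a small sketch, runnable in user space, that rounds a (start, size) pair out to line boundaries. The 32-byte line size is an assumption that matches the Cortex-A9 L1 and the PL310 L2, and the names are made up for illustration. Also, in the mainline kernels I have looked at, the outer_cache range hooks take a physical start and end address rather than a pointer and a size, so verify the argument convention against your own kernel tree.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes (Cortex-A9 L1 and PL310 L2). */
#define CACHE_LINE 32u

/* Half-open address range [start, end). */
struct line_range {
    uintptr_t start;
    uintptr_t end;
};

/* Round [addr, addr + size) outwards to whole cache lines. */
static struct line_range cover_lines(uintptr_t addr, size_t size)
{
    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE - 1);
    struct line_range r;

    r.start = addr & mask;                           /* round down */
    r.end = (addr + size + CACHE_LINE - 1) & mask;   /* round up */
    return r;
}
```

Keeping the shared buffer itself aligned to, and a multiple of, the line size avoids the problem entirely, because no unrelated data then lives in the lines being cleaned or invalidated.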

It seems that the suggestions I found on the net assume the device to be "inside" the L2 cache coherency domain, so they did not work when the AXI-HP ports were used. With the AXI-ACP port, L2 flushing was not needed. (For those not familiar with Zynq: the HP ports access the DRAM controller directly, bypassing any cache/MMU implemented on the ARM side.)

Gilkey answered 12/9, 2014 at 15:6 Comment(1)

ACE versus AXI is exactly the same sort of paradigm as AXI-HP (normal AXI) and AXI-ACP (ACE). – Blouson 19/1, 2022 at 16:21

I'm not familiar with Zynq, but you essentially have two options that really work:

either include your other logic on the FPGA in the same coherency domain (if Zynq has an ACP port, for example)
or mark the memory you map as device memory (or other non-cacheable if you don't care about gather, reorder and early write acknowledge) and use a DSB after any write that should be seen.
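A minimal sketch of the second option, seen from the writer's side. It assumes the driver has already mapped the region as device or non-cacheable memory (for example by applying pgprot_noncached() to the VMA in its mmap handler); without that, the barrier alone does not push data out of the caches. The function names and the data/flag layout are hypothetical, and the non-ARM branch is only a stand-in so the sketch builds on other hosts.

```c
#include <stdint.h>

/* Full barrier: DSB on ARMv7 and AArch64, compiler builtin elsewhere. */
static inline void write_barrier(void)
{
#if defined(__aarch64__) || (defined(__ARM_ARCH) && __ARM_ARCH >= 7)
    __asm__ volatile ("dsb sy" ::: "memory");
#else
    __sync_synchronize();  /* stand-in for non-ARM build hosts */
#endif
}

/* Write a payload word, then raise a flag the other observer polls.
 * Both pointers are assumed to point into the non-cacheable mapping. */
static void publish(volatile uint32_t *buf, uint32_t value,
                    volatile uint32_t *flag)
{
    *buf = value;
    write_barrier();   /* payload must complete before the flag */
    *flag = 1;
    write_barrier();   /* flag must complete before we continue */
}
```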

If the memory is marked as cacheable and your other observer is not in the same coherency domain, you are asking for trouble: when you clean the D-cache with a DCCISW or similar op and you have an L2 cache, that's where it will all end up.

Teleplay answered 25/9, 2013 at 7:7 Comment(0)
