Can mmap's performance be improved for shared memory?
I'm writing a small program for Wayland that uses software rendering and wl_shm for display. This requires that I pass a file descriptor for my screen buffer to the Wayland server, which then calls mmap() on it, i.e. the screen buffer must be shareable between processes.

In this program, startup latency is key. Currently, there is only one remaining bottleneck: the initial draw to the screen buffer, where the entire buffer is painted over. The code below shows a simplified version of this:

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main()
{
    /* Fullscreen buffers are around 10-30 MiB for common resolutions. */
    const size_t size = 2880 * 1800 * 4;
    int fd = memfd_create("shm", 0);
    ftruncate(fd, size);
    void *pool = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* Ideally, we could just malloc, but this memory needs to be shared. */
    //void *pool = malloc(size);

    /* In reality this is a cairo_paint() call. */
    memset(pool, 0xCF, size);

    /* Subsequent paints (or memsets) after the first take negligible time. */
}

On my laptop, the memset() above takes around 21-28 ms. Switching to malloc()'ed memory drops this to 12 ms, but the problem is that the memory needs to be shared between processes. The behaviour is similar on my desktop: 7 ms for mmap(), 3 ms for malloc().

My question is: Is there something I'm missing that can improve the performance of shared memory on Linux? I've tried madvise() with MADV_WILLNEED and MADV_SEQUENTIAL, and using mlock(), but none of those made a difference. I've also wondered whether 2 MiB huge pages would help given my buffer sizes of around 10-30 MiB, but explicitly reserved huge pages (hugetlbfs) aren't usually available on end-user systems.

Edit: I've tried mmap() with MAP_ANONYMOUS | MAP_SHARED, which is just as slow as before. MAP_ANONYMOUS | MAP_PRIVATE is as fast as malloc(), but a private mapping defeats the purpose, since it can't be shared with the server.
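
For reference, roughly what those attempts look like in code (a sketch; error handling omitted):

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

int main()
{
    const size_t size = 2880 * 1800 * 4;

    /* Anonymous shared mapping: just as slow as the memfd version. */
    void *shared = mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_SHARED, -1, 0);

    /* Anonymous private mapping: as fast as malloc(), but not shareable. */
    void *priv = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    /* Hints that made no measurable difference here; note that mlock()
       can fail against RLIMIT_MEMLOCK for buffers this large. */
    madvise(shared, size, MADV_WILLNEED);
    madvise(shared, size, MADV_SEQUENTIAL);
    mlock(shared, size);

    memset(shared, 0xCF, size);   /* slow path */
    memset(priv, 0xCF, size);     /* fast path */
}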

Molecule answered 8/8, 2022 at 13:25 Comment(18)
Is it 21-28 ms the first time or every time? malloc gets its memory from the same place mmap does, so it's surprising that there's a difference. If you map anonymous memory, is it the same speed as malloc? – Worser
Only the first; a subsequent memset() takes about 2 ms. I'll edit the code with a clarification. – Molecule
Good point on anonymous memory; I've added a brief edit to the end of the question. – Molecule
Does MAP_POPULATE shift the timings, so mmap takes longer and memset takes less time? – Worser
Indeed; the total time to completion of the first memset remains the same, however. – Molecule
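
(For reference, the MAP_POPULATE variant under discussion, with fd and size as defined in the question; a sketch:)

/* Pre-fault every page at map time: mmap() itself gets slower, the
   first memset() gets faster, and the total time to first paint
   stays the same. */
void *pool = mmap(NULL, size, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_POPULATE, fd, 0);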
elixir.bootlin.com/linux/latest/source/mm/mmap.c#L1529 I'm looking at what is different when the VM_SHARED | VM_MAYSHARE flags are set, and I think the main one is that this balance_dirty_pages function sometimes gets called on page faults: elixir.bootlin.com/linux/latest/source/mm/… which may decide to slow down your process. I wonder, if you write to each page individually (4096-byte increments), do some pages take more time than others? Some info found here: lwn.net/Articles/456904 but this hypothesis is not confirmed. – Worser
Thanks for the investigation! I've done some quick trials memsetting a 20 MiB (5120-page) buffer in 4096-byte chunks, just printing process time at each 512th chunk, here: hastebin.com/raw/lenuxemixu (timings are all quicker than before due to being on AC). Looks like using MAP_SHARED without MAP_POPULATE makes it fast to start with, but somewhere around 512-1024 pages it slows down (I tried plotting every memset here: i.imgur.com/49qySqw.png, but I think the time taken to print each one is affecting things). – Molecule
Is that the amount of time per page, or the total amount of time for that many pages? – Worser
Ah, that's cumulative; this is per page: i.imgur.com/lXsMdyY.png. It does look like there's a pattern of slow accesses every 50-100ish pages. – Molecule
For comparison, here's the same thing for malloc'ed memory: i.imgur.com/OmjR4B7.png. Looks like it's doing something in much larger chunks (2 MiB?), but I'm not very well versed in kernel/libc internals. – Molecule
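
(Roughly the measurement loop behind those plots; a sketch using CLOCK_MONOTONIC over the 20 MiB buffer in 4096-byte pages. As noted above, printing inside the loop perturbs the timings:)

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main()
{
    const size_t page = 4096, pages = 5120;   /* 20 MiB total */
    int fd = memfd_create("shm", 0);
    ftruncate(fd, page * pages);
    char *pool = mmap(NULL, page * pages, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);

    for (size_t i = 0; i < pages; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memset(pool + i * page, 0xCF, page);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        /* Nanoseconds to fault in and fill this page. */
        printf("%zu %ld\n", i, (long)((t1.tv_sec - t0.tv_sec) * 1000000000L
                                      + (t1.tv_nsec - t0.tv_nsec)));
    }
}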
I think the perf tool can measure what's happening inside the kernel and show where time is spent, but I don't know how to use it. You might consider trying it. I'm also not sure whether it will just work, or whether you'll need to set up a virtual machine with a custom-configured kernel (which is not terribly difficult). – Worser
Even just running it under time will count the number of minor page faults, which is most likely where the time is going. If you time a few runs with different numbers of pages, you can verify that pretty easily. – Sundry
For perf, read the manpages. They're pretty detailed. You probably want to record some combination of page faults, (data) cache misses, TLB misses, reloads and flushes. Run perf list to see what's available. – Sundry
Good point on time. For 20 MiB with malloc, I get (0major+586minor)pagefaults, compared to (5120major+63minor)pagefaults with mmap, or (0major+5181minor)pagefaults when using MAP_POPULATE. – Molecule
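
(The same counters are also readable in-process via getrusage(); a hypothetical helper that could be called before and after the first memset():)

#include <stdio.h>
#include <sys/resource.h>

/* Print the process's cumulative minor/major page-fault counts. */
static void print_faults(const char *label)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("%s: %ld minor, %ld major\n", label, ru.ru_minflt, ru.ru_majflt);
}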
@Sundry Not just page faults; I'm suggesting finding out what the kernel is doing in the page fault handler. It is surprising that mmap generates more page faults than malloc because, again, they get their memory from the exact same place! malloc probably just calls mmap to allocate 20 MiB (but it's private). – Worser
Perhaps for private mappings the kernel automatically loads the next 8-9 pages, whereas it does not do that for shared mappings. – Worser
Pretty sure you're getting zero pages mapped in there COW, and madvise only advises the reading behaviour?! – Ogburn
Ah, I think I've found the answer for the difference. The man page for madvise() actually states that "currently, Transparent Huge Pages work only with private anonymous pages". It turns out this is configurable under /sys/kernel/mm/transparent_hugepage/shmem_enabled, which is set to "never" on my system (up-to-date Arch Linux). Setting this to "always" causes mmap() with MAP_SHARED to be just as fast as malloc(). It'd be good if there's another way to solve this, but I doubt it. – Molecule

The difference in performance between malloc() and mmap() seems to be due to the differing application of Transparent Hugepages.

By default on x86_64, the page size is 4 KiB and the huge page size is 2 MiB. Transparent Hugepages let programs that know nothing about hugepages still use them, cutting the number of page faults needed to populate a large buffer (one per 2 MiB rather than one per 4 KiB). However, this is only enabled by default for private, anonymous memory. glibc's malloc() obtains large allocations from an anonymous private mmap(), so both it and an explicit mmap() with MAP_ANONYMOUS | MAP_PRIVATE benefit, which explains why their performance is identical. Shared memory mappings don't get hugepages by default, so the 10-30 MiB buffers I need incur far more page-handling overhead, causing the slowdown.

Hugepages can be enabled for shared memory mappings, as explained in the kernel docs, via the /sys/kernel/mm/transparent_hugepage/shmem_enabled knob. This defaults to never, but setting it to always (or to advise, and adding the corresponding madvise(..., MADV_HUGEPAGE) call) allows memory mapped with MAP_SHARED to use hugepages, after which the performance matches that of malloc()'ed memory.
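
In code, the change is a single madvise() call after the mmap() in the question (a sketch; it only takes effect for shared memory if shmem_enabled is at least advise):

void *pool = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

/* Request transparent hugepages for this mapping. For shared memory
   this is honoured when shmem_enabled is "advise"; with "always",
   hugepages are used even without this call. */
madvise(pool, size, MADV_HUGEPAGE);

memset(pool, 0xCF, size);   /* now roughly as fast as the malloc() case */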

I'm unsure why the default is never for shared memory. While not very satisfactory, for now the only solution seems to be calling madvise(..., MADV_HUGEPAGE), which improves performance on any system that happens to have shmem_enabled set to at least advise (or if hugepages for shared memory become enabled by default in future).

Molecule answered 8/8, 2022 at 21:01 Comment(3)
Since you have such a tight target, where 28 ms isn't okay, I assumed you had a lot of control over the system where your program was running. What if the user has some other programs running and that causes your startup to take 28 ms or longer; is that a problem? – Worser
It's not really a problem, as this isn't performance-critical; it's only a program launcher à la rofi/dmenu. I just wanted to make sure I wasn't leaving performance on the table. I've been trying to get it to start as quickly as possible, as it's very satisfying to have it open on the same frame you press a keyboard shortcut, and I don't want to daemonize it and hog memory in the background. I've put some brief benchmarks of hugepages on the project page. – Molecule
Always good to see someone writing efficient software. Modern software wastes so much potential performance: "What Andy giveth, Bill taketh away." – Worser
