I'm writing a small program for Wayland that uses software rendering and wl_shm for display. This requires that I pass a file descriptor for my screen buffer to the Wayland server, which then calls mmap() on it, i.e. the screen buffer must be shareable between processes.
In this program, startup latency is key. Currently, there is only one remaining bottleneck: the initial draw to the screen buffer, where the entire buffer is painted over. The code below shows a simplified version of this:
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(void)
{
    /* Fullscreen buffers are around 10-30 MiB for common resolutions. */
    const size_t size = 2880 * 1800 * 4;
    int fd = memfd_create("shm", 0);
    if (fd < 0 || ftruncate(fd, size) < 0)
        return 1;
    void *pool = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pool == MAP_FAILED)
        return 1;
    /* Ideally, we could just malloc, but this memory needs to be shared. */
    //void *pool = malloc(size);
    /* In reality this is a cairo_paint() call. */
    memset(pool, 0xCF, size);
    /* Subsequent paints (or memsets) after the first take negligible time. */
    return 0;
}
On my laptop, the memset() above takes around 21-28 ms. Switching to malloc()'ed memory drops this to 12 ms, but the problem is that the memory needs to be shared between processes. The behaviour is similar on my desktop: 7 ms for mmap(), 3 ms for malloc().
My question is: Is there something I'm missing that can improve the performance of shared memory on Linux? I've tried madvise() with MADV_WILLNEED and MADV_SEQUENTIAL, and using mlock(), but none of those made a difference. I've also thought about whether 2 MiB huge pages would help given the buffer sizes of around 10-30 MiB, but that's not usually available.
Edit: I've tried mmap() with MAP_ANONYMOUS | MAP_SHARED, which is just as slow as before. MAP_ANONYMOUS | MAP_PRIVATE results in the same speed as malloc(), however that defeats the purpose.
Comments:

memset() takes about 2 ms. I'll edit the code with a clarification. – Molecule

MAP_SHARED without MAP_POPULATE makes it fast to start with, but somewhere around 512-1024 pages it slows down (I tried plotting every memset here: i.imgur.com/49qySqw.png, but I think the time taken to print each one is affecting things). – Molecule

malloc'ed memory: i.imgur.com/OmjR4B7.png. Looks like it's doing something in much larger chunks (2 MiB?), but I'm not very well versed in kernel / libc internals. – Molecule

The perf tool can measure what's happening inside the kernel and show where time is spent, but I don't know how to use it. You might consider trying it. I'm also not sure whether it will just work, or whether you'll need to set up a virtual machine with a custom-configured kernel (which is not terribly difficult). – Worser

time will count the number of minor page faults, which is most likely where the time is going. If you time a few runs with different numbers of pages, you can verify that pretty easily. – Sundry

For perf, read the manpages. They're pretty detailed. You probably want to record some combination of page faults, (data) cache misses, TLB misses, reloads and flushes. Run perf list to see what's available. – Sundry

time: For 20 MiB with malloc, I get (0major+586minor)pagefaults, compared to (5120major+63minor)pagefaults with mmap, or (0major+5181minor)pagefaults when using MAP_POPULATE. – Molecule

The man page for madvise() actually states that "currently, Transparent Huge Pages work only with private anonymous pages". It turns out this is configurable under /sys/kernel/mm/transparent_hugepage/shmem_enabled, which is set to "never" on my system (up-to-date Arch Linux). Setting this to "always" causes mmap() with MAP_SHARED to be just as fast as malloc(). It'd be good if there's another way to solve this, but I doubt it. – Molecule