Can I ask the kernel to populate (fault in) a range of anonymous pages?

In Linux, using C, if I ask for a large amount of memory via malloc or a similar dynamic allocation mechanism, it is likely that most of the pages backing the returned region won't actually be mapped into the address space of my process.

Instead, a page fault is incurred each time I access one of the allocated pages for the first time, and then the kernel maps in an "anonymous" page (consisting entirely of zeros) and returns to user space.

For a large region (say 1 GiB) this is a large number of page faults (2^30 / 2^12 = 262,144 for 4 KiB pages), and each fault incurs a user-to-kernel-to-user transition, which is especially slow on kernels with Spectre and Meltdown mitigations. For some uses, this page-faulting time might dominate the actual work being done on the buffer.

If I know I'm going to use the entire buffer, is there some way to ask the kernel to populate (fault in) an already allocated region ahead of time?

If I were allocating my own memory using mmap, the way to do this would be MAP_POPULATE - but that doesn't work for regions received from malloc or new.
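For reference, this is roughly what the mmap route looks like (a sketch only; alloc_populated is an illustrative name, not an existing function):

#include <stddef.h>
#include <sys/mman.h>

/* Sketch only: allocate anonymous memory and ask the kernel to pre-fault
 * the whole range up front via MAP_POPULATE. */
static void *alloc_populated(size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}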

There is the madvise call, but the options there seem mostly to apply to file-backed regions. For example, the madvise(..., MADV_WILLNEED) call seems promising - from the man page:

MADV_WILLNEED

Expect access in the near future. (Hence, it might be a good idea to read some pages ahead.)

The obvious implication is that if the region is file-backed, this call might trigger an asynchronous file readahead, or perhaps additional synchronous readahead on subsequent faults. From the description, it isn't clear whether it does anything for anonymous pages, and based on my testing, it doesn't.
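For what it's worth, this is roughly the shape of what I tried (a sketch only; the allocation is page aligned via posix_memalign because madvise requires a page-aligned address):

#include <stdlib.h>
#include <sys/mman.h>

/* Sketch only: allocate 'size' bytes and hint that the whole range will be
 * needed soon. In my testing this did not pre-fault anonymous pages. */
static void *alloc_willneed(size_t size) {
    void *p = NULL;
    if (posix_memalign(&p, 4096, size) != 0)
        return NULL;
    madvise(p, size, MADV_WILLNEED);
    return p;
}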

Kumamoto answered 1/6, 2019 at 23:33 Comment(6)
Since you're already relying on OS-specific behaviors, why not use mmap? Especially considering the large allocations you're talking about.Upbraiding
@JonathonReinhart - because I need to share memory with other users of malloc, and in some cases the malloc call and the code that knows it is time to populate everything are in separate components; mmap only helps at the allocation site, not later. I'm also interested in this problem in other languages, where leaving the standard allocation routines and using mmap is even less feasible.Kumamoto
madvise(addr, len, MADV_WILLNEED)...Shayla
@ChrisDodd - it doesn't work, at least for me. I get the same number of page faults (allocated region / 4096 + a few more) with or without that call. If you read between the lines of the man page, it seems oriented towards readahead of file-backed regions, and it doesn't say that it will populate pages in anonymous regions. Here's the test code I used.Kumamoto
Don't forget that if you insist on attaching physical pages, something else might have to be pushed out, which is also expensive.Heptateuch
@Heptateuch - yes, but in the scenario I'm considering, the pages are all about to be accessed anyway, so they'll all get physical pages soon either way. BTW I don't really insist on physical pages, I just want to take the 244,000 page faults all at once, without 244,000 user-kernel transitions. If for some reason there isn't enough physical memory to accommodate that, it's fine if some pages aren't populated.Kumamoto

It's a bit of a dirty hack, and works best for privileged processes or on systems with a high RLIMIT_MEMLOCK, but... an mlock and munlock pair will achieve the effect you are looking for.

For example, given the following test program:

// compile with (for example): cc -O1 -Wall pagefaults.c -o pagefaults

#include <stdlib.h>
#include <stdio.h>
#include <err.h>
#include <sys/mman.h>

#define DEFAULT_SIZE        (40 * 1024 * 1024)
#define PG_SIZE     4096

void failcheck(int ret, const char* what) {
    if (ret) {
        err(EXIT_FAILURE, "%s failed", what);
    } else {
        printf("%s OK\n", what);
    }
}

int main(int argc, char **argv) {
    size_t size = (argc == 2 ? atol(argv[1]) : DEFAULT_SIZE);
    char *mem = malloc(size);

    if (!mem) {
        err(EXIT_FAILURE, "malloc of %zu bytes failed", size);
    }

    if (getenv("DO_MADVISE")) {
        failcheck(madvise(mem, size, MADV_WILLNEED), "madvise");
    }

    if (getenv("DO_MLOCK")) {
        failcheck(mlock(mem, size), "mlock");
        failcheck(munlock(mem, size), "munlock");
    }

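    /* Touch one byte in every page so each page is faulted in. */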
    for (volatile char *p = mem; p < mem + size; p += PG_SIZE) {
        *p = 'z';
    }
    printf("size: %6.2f MiB, pages touched: %zu\npoitner value : %p\n",
            size / 1024. / 1024., size / PG_SIZE, mem);
}

Running it as root for a 1 GB region and counting page faults with perf results in:

$ perf stat ./pagefaults 1000000000
size: 953.67 MiB, pages touched: 244140
pointer value : 0x7f2fc2584010

 Performance counter stats for './pagefaults 1000000000':

        352.474676      task-clock (msec)         #    0.999 CPUs utilized          
                 2      context-switches          #    0.006 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
           244,189      page-faults               #    0.693 M/sec                  
       914,276,474      cycles                    #    2.594 GHz                    
       703,359,688      instructions              #    0.77  insn per cycle         
       117,710,381      branches                  #  333.954 M/sec                  
           447,022      branch-misses             #    0.38% of all branches        

       0.352814087 seconds time elapsed

However, if you run prefixed with DO_MLOCK=1, you get:

$ sudo DO_MLOCK=1 perf stat ./pagefaults 1000000000
mlock OK
munlock OK
size: 953.67 MiB, pages touched: 244140
pointer value : 0x7f8047f6b010

 Performance counter stats for './pagefaults 1000000000':

        240.236189      task-clock (msec)         #    0.999 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                49      page-faults               #    0.204 K/sec                  
       623,152,764      cycles                    #    2.594 GHz                    
       959,640,219      instructions              #    1.54  insn per cycle         
       150,713,144      branches                  #  627.354 M/sec                  
           484,400      branch-misses             #    0.32% of all branches        

       0.240538327 seconds time elapsed

Note that the number of page faults has dropped from 244,189 to 49, and there is a 1.46x speedup. The overwhelming majority of the time is still spent in the kernel, so this could probably be a lot faster if it weren't necessary to invoke both mlock and munlock, and also because the semantics of mlock (pinning the pages) are stronger than what is actually required here.

For non-privileged processes, you'll probably hit the RLIMIT_MEMLOCK limit if you try to do a large region all at once (on my Ubuntu system it's set to 64 KiB), but you can loop over the region calling mlock(); munlock() on one chunk at a time, as sketched below.
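A rough sketch of what that chunked loop might look like (prefault_region is a hypothetical helper, not an existing API; the chunk size is read from RLIMIT_MEMLOCK with a 64 KiB fallback):

#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Sketch only: fault in a malloc'd region by locking and immediately
 * unlocking it in chunks small enough to stay under RLIMIT_MEMLOCK.
 * Linux rounds the mlock address down to a page boundary, so an
 * unaligned malloc'd pointer is fine. */
static void prefault_region(char *mem, size_t size) {
    struct rlimit rl;
    size_t chunk = 64 * 1024;   /* fallback if the limit can't be read */
    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0 &&
            rl.rlim_cur != RLIM_INFINITY && rl.rlim_cur > 0)
        chunk = rl.rlim_cur;

    for (size_t off = 0; off < size; off += chunk) {
        size_t len = size - off < chunk ? size - off : chunk;
        if (mlock(mem + off, len) == 0)   /* faults the pages in */
            munlock(mem + off, len);      /* then drops the lock */
    }
}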

Kumamoto answered 5/6, 2019 at 23:31 Comment(8)
You could lock / unlock in smaller chunks, at the cost of making even more system calls. Possibly useful for soft realtime purposes (to prep something ahead of time for when it matters), but bad for overall throughput if the system-call cost is part of the time that matters.Herzel
@PeterCordes - yes, I meant to imply that, but I agree it is unclear. On my system the limit is 64 KiB, so you could get at least a 16x reduction in syscalls.Kumamoto
I suspect a lot of that kernel time is in zeroing pages so it's "real work". Maybe also in constructing the HW page table data structures? Although if it can use 2M hugepages that should be fast. Or even a 1G hugepage.Herzel
@PeterCordes - yes, there is definitely real work here, but what's your estimate for zeroing 4096 bytes, with rep stos?Kumamoto
So you managed to do the (un)dirty work before the counting starts.Heptateuch
@Heptateuch - you mean the counting done by perf stat?Kumamoto
I mean: what about the other processes. [I'm from Europe, you know]Heptateuch
@Heptateuch - sorry, I'm still not following you. No, I didn't know you were from Europe. Congratulations, I guess?Kumamoto
