Can I ask the kernel to populate (fault in) a range of anonymous pages?

In Linux, using C, if I ask for a large amount of memory via malloc or a similar dynamic allocation mechanism, it is likely that most of the pages backing the returned region won't actually be mapped into the address space of my process.

Instead, a page fault is incurred each time I access one of the allocated pages for the first time, and then the kernel maps in an "anonymous" page (consisting entirely of zeros) and returns to user space.

For a large region (say 1 GiB) this is a large number of page faults (2^30 / 2^12 = 262,144 for 4 KiB pages), and each fault incurs a user-to-kernel-to-user transition, which is especially slow on kernels with Spectre and Meltdown mitigations. For some uses, this page-faulting time might dominate the actual work being done on the buffer.

If I know I'm going to use the entire buffer, is there some way to ask the kernel to populate (fault in) an already allocated region ahead of time?

If I were allocating my own memory using mmap, the way to do this would be MAP_POPULATE - but that doesn't work for regions received from malloc or new.
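For reference, this is roughly what the mmap route looks like (a sketch only; alloc_populated is an illustrative name, not an existing function):

#include <stddef.h>
#include <sys/mman.h>

/* Sketch only: allocate anonymous memory and ask the kernel to pre-fault
 * the whole range up front via MAP_POPULATE. */
static void *alloc_populated(size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}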

There is the madvise call, but the options there seem mostly to apply to file-backed regions. For example, the madvise(..., MADV_WILLNEED) call seems promising - from the man page:

MADV_WILLNEED

Expect access in the near future. (Hence, it might be a good idea to read some pages ahead.)

The obvious implication is that if the region is file-backed, this call might trigger an asynchronous file readahead, or perhaps additional synchronous readahead on subsequent faults. From the description, it isn't clear whether it does anything for anonymous pages, and based on my testing, it doesn't.
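For what it's worth, this is roughly the shape of what I tried (a sketch only; the allocation is page aligned via posix_memalign because madvise requires a page-aligned address):

#include <stdlib.h>
#include <sys/mman.h>

/* Sketch only: allocate 'size' bytes and hint that the whole range will be
 * needed soon. In my testing this did not pre-fault anonymous pages. */
static void *alloc_willneed(size_t size) {
    void *p = NULL;
    if (posix_memalign(&p, 4096, size) != 0)
        return NULL;
    madvise(p, size, MADV_WILLNEED);
    return p;
}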

Kumamoto answered 1/6, 2019 at 23:33 Comment(6)
Since you're already relying on OS-specific behaviors, why not use mmap? Especially considering the large allocations you're talking about.Upbraiding
@JonathonReinhart - because I need to share memory with other users of malloc, and in some cases the malloc call and the code that knows it is time to populate everything are in separate components; mmap only helps at the allocation site, not later. I'm also interested in this problem in other languages, where leaving the standard allocation routines and using mmap is even less feasible.Kumamoto
madvise(addr, len, MADV_WILLNEED)...Shayla
@ChrisDodd - it doesn't work, at least for me. I get the same number of page faults (allocated region / 4096 + a few more) with or without that call. If you read between the lines of the man page, it seems oriented towards readahead of file-backed regions, and it doesn't say that it will populate pages in anonymous regions. Here's the test code I used.Kumamoto
Don't forget that if you insist on attaching physical pages, something else might have to be pushed out, which is also expensive.Heptateuch
@Heptateuch - yes, but in the scenario I'm considering, the pages are all about to be accessed anyway, so they'll all get physical pages soon either way. BTW I don't really insist on physical pages, I just want to take the 244,000 page faults all at once, without 244,000 user-kernel transitions. If for some reason there isn't enough physical memory to accommodate that, it's fine if some pages aren't populated.Kumamoto

It's a bit of a dirty hack, and works best for privileged processes or on systems with a high RLIMIT_MEMLOCK, but... an mlock and munlock pair will achieve the effect you are looking for.

For example, given the following test program:

// compile with (for example): cc -O1 -Wall pagefaults.c -o pagefaults

#include <stdlib.h>
#include <stdio.h>
#include <err.h>
#include <sys/mman.h>

#define DEFAULT_SIZE        (40 * 1024 * 1024)
#define PG_SIZE     4096

void failcheck(int ret, const char* what) {
    if (ret) {
        err(EXIT_FAILURE, "%s failed", what);
    } else {
        printf("%s OK\n", what);
    }
}

int main(int argc, char **argv) {
    size_t size = (argc == 2 ? atol(argv[1]) : DEFAULT_SIZE);
    char *mem = malloc(size);

    if (!mem) {
        err(EXIT_FAILURE, "malloc of %zu bytes failed", size);
    }

    if (getenv("DO_MADVISE")) {
        failcheck(madvise(mem, size, MADV_WILLNEED), "madvise");
    }

    if (getenv("DO_MLOCK")) {
        failcheck(mlock(mem, size), "mlock");
        failcheck(munlock(mem, size), "munlock");
    }

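    /* Touch one byte in every page so each page is faulted in. */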
    for (volatile char *p = mem; p < mem + size; p += PG_SIZE) {
        *p = 'z';
    }
    printf("size: %6.2f MiB, pages touched: %zu\npoitner value : %p\n",
            size / 1024. / 1024., size / PG_SIZE, mem);
}

Running it as root for a 1 GB region and counting page faults with perf results in:

$ perf stat ./pagefaults 1000000000
size: 953.67 MiB, pages touched: 244140
pointer value : 0x7f2fc2584010

 Performance counter stats for './pagefaults 1000000000':

        352.474676      task-clock (msec)         #    0.999 CPUs utilized          
                 2      context-switches          #    0.006 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
           244,189      page-faults               #    0.693 M/sec                  
       914,276,474      cycles                    #    2.594 GHz                    
       703,359,688      instructions              #    0.77  insn per cycle         
       117,710,381      branches                  #  333.954 M/sec                  
           447,022      branch-misses             #    0.38% of all branches        

       0.352814087 seconds time elapsed

However, if you run prefixed with DO_MLOCK=1, you get:

$ sudo DO_MLOCK=1 perf stat ./pagefaults 1000000000
mlock OK
munlock OK
size: 953.67 MiB, pages touched: 244140
pointer value : 0x7f8047f6b010

 Performance counter stats for './pagefaults 1000000000':

        240.236189      task-clock (msec)         #    0.999 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                49      page-faults               #    0.204 K/sec                  
       623,152,764      cycles                    #    2.594 GHz                    
       959,640,219      instructions              #    1.54  insn per cycle         
       150,713,144      branches                  #  627.354 M/sec                  
           484,400      branch-misses             #    0.32% of all branches        

       0.240538327 seconds time elapsed

Note that the number of page faults has dropped from 244,189 to 49, and there is a 1.46x speedup. The overwhelming majority of the time is still spent in the kernel, so this could probably be a lot faster if it weren't necessary to invoke both mlock and munlock, and also because the semantics of mlock (pinning the pages) are stronger than what is actually required here.

For non-privileged processes, you'll probably hit the RLIMIT_MEMLOCK limit if you try to do a large region all at once (on my Ubuntu system it's set to 64 KiB), but you can loop over the region calling mlock(); munlock() on one chunk at a time, as sketched below.
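A rough sketch of what that chunked loop might look like (prefault_region is a hypothetical helper, not an existing API; the chunk size is read from RLIMIT_MEMLOCK with a 64 KiB fallback):

#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Sketch only: fault in a malloc'd region by locking and immediately
 * unlocking it in chunks small enough to stay under RLIMIT_MEMLOCK.
 * Linux rounds the mlock address down to a page boundary, so an
 * unaligned malloc'd pointer is fine. */
static void prefault_region(char *mem, size_t size) {
    struct rlimit rl;
    size_t chunk = 64 * 1024;   /* fallback if the limit can't be read */
    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0 &&
            rl.rlim_cur != RLIM_INFINITY && rl.rlim_cur > 0)
        chunk = rl.rlim_cur;

    for (size_t off = 0; off < size; off += chunk) {
        size_t len = size - off < chunk ? size - off : chunk;
        if (mlock(mem + off, len) == 0)   /* faults the pages in */
            munlock(mem + off, len);      /* then drops the lock */
    }
}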

Kumamoto answered 5/6, 2019 at 23:31 Comment(8)
You could lock / unlock in smaller chunks, at the cost of making even more system calls. Possibly useful for soft realtime purposes (to prep something ahead of time for when it matters), but bad for overall throughput if the system-call cost is part of the time that matters.Herzel
@PeterCordes - yes, I meant to imply that, but I agree it is unclear. On my system the limit is 64 KiB, so you could get at least a 16x reduction in syscalls.Kumamoto
I suspect a lot of that kernel time is in zeroing pages so it's "real work". Maybe also in constructing the HW page table data structures? Although if it can use 2M hugepages that should be fast. Or even a 1G hugepage.Herzel
@PeterCordes - yes, there is definitely real work here, but what's your estimate for zeroing 4096 bytes, with rep stos?Kumamoto
So you managed to do the (un)dirty work before the counting starts.Heptateuch
@Heptateuch - you mean the counting done by perf stat?Kumamoto
I mean: what about the other processes. [I'm from Europe, you know]Heptateuch
@Heptateuch - sorry, I'm still not following you. No, I didn't know you were from Europe. Congratulations, I guess?Kumamoto
