munmap() failure with ENOMEM with private anonymous mapping
I have recently discovered that Linux does not guarantee that memory allocated with mmap can be freed with munmap if doing so would cause the number of VMA (Virtual Memory Area) structures to exceed vm.max_map_count. The manpage states this (almost) clearly:

 ENOMEM The process's maximum number of mappings would have been exceeded.
 This error can also occur for munmap(), when unmapping a region
 in the middle of an existing mapping, since this results in two
 smaller mappings on either side of the region being unmapped.

The problem is that the Linux kernel always tries to merge adjacent VMA structures when possible, which makes munmap fail even for separately created mappings. I was able to write a small program that confirms this behavior:

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

#include <sys/mman.h>

// value of vm.max_map_count
#define VM_MAX_MAP_COUNT        (65530)

// number of vma for the empty process linked against libc - /proc/<id>/maps
#define VMA_PREMAPPED           (15)

#define VMA_SIZE                (4096)
#define VMA_COUNT               ((VM_MAX_MAP_COUNT - VMA_PREMAPPED) * 2)

int main(void)
{
    static void *vma[VMA_COUNT];

    for (int i = 0; i < VMA_COUNT; i++) {
        vma[i] = mmap(0, VMA_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

        if (vma[i] == MAP_FAILED) {
            printf("mmap() failed at %d\n", i);
            return 1;
        }
    }

    for (int i = 0; i < VMA_COUNT; i += 2) {
        if (munmap(vma[i], VMA_SIZE) != 0) {
            printf("munmap() failed at %d (%p): %m\n", i, vma[i]);
        }
    }
}

It allocates a large number of pages (twice the default allowed maximum) using mmap, then munmaps every second page, creating a separate VMA structure for each remaining page. On my machine the last munmap call always fails with ENOMEM.

Initially I thought that munmap never fails if used with the same address and size that were used to create the mapping. Apparently this is not the case on Linux, and I was not able to find information about similar behavior on other systems.

At the same time, in my opinion partial unmapping applied to the middle of a mapped region can be expected to fail on any OS under any sane implementation, but I haven't found any documentation that says such a failure is possible.

I would usually consider this a bug in the kernel, but knowing how Linux deals with memory overcommit and OOM, I am almost sure this is a "feature" that exists to improve performance and decrease memory consumption.

Other information I was able to find:

  • Similar APIs on Windows do not have this "feature" due to their design (see MapViewOfFile, UnmapViewOfFile, VirtualAlloc, VirtualFree): they simply do not support partial unmapping.
  • glibc's malloc implementation does not create more than 65535 mappings, falling back to sbrk when this limit is reached: https://code.woboq.org/userspace/glibc/malloc/malloc.c.html. This looks like a workaround for this issue, but it still makes it possible for free to silently leak memory.
  • jemalloc had trouble with this and tried to avoid using mmap/munmap because of this issue (I don't know how that turned out for them).

Do other OS's really guarantee deallocation of memory mappings? I know Windows does this, but what about other Unix-like operating systems? FreeBSD? QNX?


EDIT: I am adding an example that shows how glibc's free can leak memory when the internal munmap call fails with ENOMEM. Run it under strace to see that munmap fails:

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

#include <sys/mman.h>

// value of vm.max_map_count
#define VM_MAX_MAP_COUNT        (65530)

#define VMA_MMAP_SIZE           (4096)
#define VMA_MMAP_COUNT          (VM_MAX_MAP_COUNT)

// glibc's malloc default mmap_threshold is 128 KiB
#define VMA_MALLOC_SIZE         (128 * 1024)
#define VMA_MALLOC_COUNT        (VM_MAX_MAP_COUNT)

int main(void)
{
    static void *mmap_vma[VMA_MMAP_COUNT];

    for (int i = 0; i < VMA_MMAP_COUNT; i++) {
        mmap_vma[i] = mmap(0, VMA_MMAP_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

        if (mmap_vma[i] == MAP_FAILED) {
            printf("mmap() failed at %d\n", i);
            return 1;
        }
    }

    for (int i = 0; i < VMA_MMAP_COUNT; i += 2) {
        if (munmap(mmap_vma[i], VMA_MMAP_SIZE) != 0) {
            printf("munmap() failed at %d (%p): %m\n", i, mmap_vma[i]);
            return 1;
        }
    }

    static void *malloc_vma[VMA_MALLOC_COUNT];

    for (int i = 0; i < VMA_MALLOC_COUNT; i++) {
        malloc_vma[i] = malloc(VMA_MALLOC_SIZE);

        if (malloc_vma[i] == NULL) {
            printf("malloc() failed at %d\n", i);
            return 1;
        }
    }

    for (int i = 0; i < VMA_MALLOC_COUNT; i += 2) {
        free(malloc_vma[i]);
    }
}
Inartificial answered 2/5, 2017 at 17:6 Comment(14)
First observe that (you think) you are mapping twice as many VMAs as allowed. But this is probably optimized into a single VMA. So when you start punching holes in it, you increase the number of VMAs until you exceed the limit. Said another way: because of this optimization, what you think is a "non-partial unmapping" is likely a partial unmapping. – Contusion
Yes, this is exactly what is happening. I mentioned this optimization in the second paragraph. The issue is that code does not know which VMAs have been merged. Imagine two libraries working in the same process. Each allocates a chunk of memory, and both chunks get merged into a single VMA. The only way to reliably free this memory is to remove the whole VMA with a single munmap() call. Obviously this is not possible for two separate libraries, as they are not aware of each other. I would expect the kernel to guarantee that a separate VMA is allocated for every mmap() call to avoid such failures. – Inartificial
OK, so then I don't quite know where you are trying to go with this. I don't think you could call this a bug in the kernel, but I agree it is difficult for an application to understand how many VMAs are in use. But fragmentation is an issue with any memory allocation scheme. To avoid failure you can count the number of mmapped segments that are in use. As long as you don't exceed the VMA limit you are fine. It may be optimized to use fewer. Your example fails because you allocated twice the limit, and now there is no way to guarantee that you won't get into trouble. – Contusion
You cannot count the number of mapped segments. You would need a single per-process counter for this. glibc's malloc() already tries to do this, but if your program allocates memory using mmap() directly, this counter cannot guarantee anything. I will post an example later that shows how glibc's free() is unable to free memory in such a situation, silently leaking it instead. – Inartificial
The problem here is that this is just an optimistic optimization made by the kernel that makes user code leak memory in some situations. I am almost sure the assumption was that 99.99% of programs will never hit this limit, and even if some program does, the user can just increase vm.max_map_count manually to prevent this in the future. This is why I also called it a "feature". It might not be that bad, but from a computer science point of view this is a bug. The problem is that at least glibc and jemalloc try to work around it, so the probability of this happening is not that low. – Inartificial
And it is possible for the kernel to avoid such situations. As I mentioned in my post, at least Windows does not do such an optimization, so you can always free memory there. – Inartificial
OK, I understand your point now. I'll have to look at the malloc code, but if it just leaks memory on getting ENOMEM from munmap, that is plausibly a bug, and maybe it should keep track of the segments it could not unmap and try again later, after other maps and unmaps have reduced fragmentation. But do keep in mind that in Linux, just because you succeeded in mmapping a segment doesn't mean you can use all the pages. There are a lot of different opinions on this, but I think the bigger issue is that you can successfully mmap a segment and then get an OOM on trying to use it. – Contusion
Yes, and this optimization looks just like a continuation of the OOM killer. If you don't like the default OOM behavior in Linux, you have to tune it using vm.overcommit_memory, oom_score_adj, etc., otherwise your program can be killed. Same thing here: you have to tune vm.max_map_count or your program might fail. The default Linux configuration looks very desktop oriented with its "anything can die at any moment and that's okay" attitude, but I suspect that most servers run the same configuration. As a result, people try to work around these issues (glibc and jemalloc are examples of this). – Inartificial
This is just another reason why building rock-solid systems is so hard. Any implementation of malloc on any OS will make some claim on system resources. In this case we happen to be focusing on VMAs. But if you want a guarantee that you will be able to malloc and use a total of 2 GB, no OS I am aware of will make any guarantee on the worst-case usage of those system resources, as it depends on things like worst-case fragmentation. It would certainly be nice to have such an OS. – Contusion
The OS can at least allow the application to fail gracefully. And sometimes an application can proceed even after a memory allocation failure, for example by simply dropping the connection it couldn't handle; other connections can then be kept intact. Linux already provides graceful handling of a similar situation: accept() drops the connection it couldn't accept (possibly because the maximum number of file descriptors has been reached). – Inartificial
Well, returning ENOMEM and letting the app decide how to deal with it, and allowing for the configuration of the OOM killer, does allow for graceful failure. As much as some people dislike the OOM killer, a kernel panic is not graceful degradation. – Contusion
@PSkocik It could. It could also remember mappings that failed to unmap and retry unmapping them later, but I haven't seen any allocator implementation that bothers doing this. – Inartificial
The number of mapped areas is equal to the number of lines in the file /proc/self/maps. And the assumption is that 100% of programs will never reach the 65530 limit on the number of mapped areas. If you are writing critical software, you don't use malloc() and free(); you manage the memory yourself. (I can't write an answer, because you are not really asking a question, you are just saying that it is lame...) – Constanta
@Constanta I was asking whether the other listed OSes do the same optimization. The assumption is also wrong, because there are other types of mappings besides private anonymous ones. If you search for this failure, you can see that some programs do reach that limit (which is much easier for a large program that does a lot of file mappings). I guess this is just something that has to be kept in mind if you are mapping lots of memory. – Inartificial

One way to work around this problem on Linux is to mmap more than one page at once (e.g. 1 MB at a time) and also map a separator page after it. So you actually call mmap on 257 pages of memory, then change the protection of the last page to PROT_NONE so that it cannot be accessed. This should defeat the VMA merging optimization in the kernel. Since you are allocating many pages at once, you should not run into the max mapping limit. The downside is that you have to manually manage how you want to slice up the large mmap.

As to your questions:

  1. System calls can fail on any system for a variety of reasons. Documentation is not always complete.

  2. You are allowed to munmap a part of an mmapped region as long as the address passed in lies on a page boundary and the length argument is rounded up to the next multiple of the page size.

Mauro answered 17/7, 2017 at 22:40 Comment(6)
There is still a chance that two threads mmap two regions of memory that end up as part of the same VMA structure; in that case both the following mprotect and munmap calls will fail. As for syscall failures: some things should never be allowed to fail for correctly working programs, as it might be impossible to handle such a failure. – Inartificial
@Ivan: Your program can fail for unexpected reasons. The most obvious one is an external signal, like a kill -9. Less obvious ones are the disk being full, or faults due to Linux's overcommit memory allocation strategy. Nobody really calls these bugs; they are just limitations of the system. I don't understand your thread problem. My solution is to reduce the number of mapped pages by having the software (not the OS) manage the individual pages, and asking mmap to return many pages at once. – Mauro
An explicitly terminated process is a separate thing: those are not unpredictable. Both memory overcommit and VMA merging can definitely be called bugs in some circumstances and "features" in others. That is why not all kernels do such optimizations: Windows NT never merges VMAs. As for splitting VMAs: in a multithreaded environment there is still a race that can trigger an mprotect failure followed by a munmap failure, and the process still has to deal with the VMA count limit. I can understand why they do this, and I understand it is rarely an issue, especially since it also improves kernel performance. – Inartificial
@Ivan: Again, the point of the answer is to suggest process management of individual pages, with few mmap calls, each covering a cluster of pages. Thus the OS sees that the process has a low VMA count matching its mmap calls. – Mauro
Linux lets you raise the map count limit with sysctl up to about 2^31; that's 8 TiB of RAM with 4 KiB pages, with no possibility of an ENOMEM failure in munmap. Seems like a simpler solution. – Nel
@PSkocik: Nice, vm.max_map_count, but it would be nicer if there were a per-process limit that could be set... – Mauro
