I recently discovered that Linux does not guarantee that memory allocated with mmap can be freed with munmap if doing so would make the number of VMA (Virtual Memory Area) structures exceed vm.max_map_count. The man page states this (almost) clearly:
ENOMEM The process's maximum number of mappings would have been exceeded.
This error can also occur for munmap(), when unmapping a region
in the middle of an existing mapping, since this results in two
smaller mappings on either side of the region being unmapped.
The problem is that the Linux kernel always tries to merge adjacent VMA structures when possible, so munmap can fail even for mappings that were created separately. I was able to write a small program to confirm this behavior:
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/mman.h>

// value of vm.max_map_count
#define VM_MAX_MAP_COUNT (65530)

// number of VMAs for the empty process linked against libc - /proc/<id>/maps
#define VMA_PREMAPPED (15)

#define VMA_SIZE (4096)
#define VMA_COUNT ((VM_MAX_MAP_COUNT - VMA_PREMAPPED) * 2)

int main(void)
{
    static void *vma[VMA_COUNT];

    for (int i = 0; i < VMA_COUNT; i++) {
        vma[i] = mmap(0, VMA_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (vma[i] == MAP_FAILED) {
            printf("mmap() failed at %d\n", i);
            return 1;
        }
    }

    for (int i = 0; i < VMA_COUNT; i += 2) {
        if (munmap(vma[i], VMA_SIZE) != 0) {
            printf("munmap() failed at %d (%p): %m\n", i, vma[i]);
        }
    }
}
It allocates a large number of pages (twice the default allowed maximum) using mmap, then munmaps every second page to create a separate VMA structure for each remaining page. On my machine the last munmap call always fails with ENOMEM.
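As a side note, the limit does not have to be hard-coded as it is in VM_MAX_MAP_COUNT above. Here is a minimal sketch of reading the current value at runtime, assuming the standard procfs file /proc/sys/vm/max_map_count is available:

#include <stdio.h>

// Read the current vm.max_map_count instead of hard-coding it.
// Falls back to the documented default (65530) if the file cannot be read.
static long read_max_map_count(void)
{
    long value = 65530;
    FILE *f = fopen("/proc/sys/vm/max_map_count", "r");

    if (f != NULL) {
        if (fscanf(f, "%ld", &value) != 1)
            value = 65530;
        fclose(f);
    }
    return value;
}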
Initially I thought that munmap never fails when called with the same address and size that were used to create the mapping. Apparently this is not the case on Linux, and I was not able to find information about similar behavior on other systems.
At the same time, in my opinion, partial unmapping applied to the middle of a mapped region can be expected to fail on any OS with a sane implementation, but I haven't found any documentation that says such a failure is possible.
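To make the "two smaller mappings" case from the man page concrete, here is a minimal sketch (not part of the test program above) that maps three pages as one region and then unmaps only the middle page, splitting the single VMA into two; the split is visible in /proc/self/maps:

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);

    // One mapping spanning three pages -> a single VMA.
    char *p = mmap(0, 3 * page, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    // Unmapping only the middle page leaves two one-page mappings,
    // so the kernel now needs two VMAs where it previously needed one.
    if (munmap(p + page, page) != 0)
        perror("munmap");
}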
I would usually consider this a bug in the kernel, but knowing how Linux deals with memory overcommit and OOM, I am almost sure this is a "feature" that exists to improve performance and decrease memory consumption.
Other information I was able to find:

- Similar APIs on Windows do not have this "feature" due to their design (see MapViewOfFile, UnmapViewOfFile, VirtualAlloc, VirtualFree): they simply do not support partial unmapping.
- The glibc malloc implementation does not create more than 65535 mappings, backing off to sbrk when this limit is reached: https://code.woboq.org/userspace/glibc/malloc/malloc.c.html. This looks like a workaround for this issue, but it is still possible to make free silently leak memory (see the sketch after this list for keeping malloc away from mmap entirely).
- jemalloc had trouble with this and tried to avoid using mmap/munmap because of this issue (I don't know how it ended for them).
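If the goal is simply to keep glibc's malloc away from the mmap/munmap path entirely, one possible workaround (a sketch, not something the glibc documentation prescribes for this specific problem) is to disable mmap-backed allocations with mallopt, at the cost of large freed blocks not being returned to the OS promptly:

#include <malloc.h>

int main(void)
{
    // M_MMAP_MAX = 0 tells glibc's malloc to never satisfy requests with
    // mmap(), so large allocations come from the brk()-grown heap and
    // free() never has to call munmap() on them.
    mallopt(M_MMAP_MAX, 0);

    // ... rest of the program ...
}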
Do other OSes really guarantee deallocation of memory mappings? I know Windows does, but what about other Unix-like operating systems? FreeBSD? QNX?
EDIT: I am adding an example that shows how glibc's free can leak memory when its internal munmap call fails with ENOMEM. Use strace to see that munmap fails:
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/mman.h>

// value of vm.max_map_count
#define VM_MAX_MAP_COUNT (65530)

#define VMA_MMAP_SIZE (4096)
#define VMA_MMAP_COUNT (VM_MAX_MAP_COUNT)

// glibc's malloc default mmap_threshold is 128 KiB
#define VMA_MALLOC_SIZE (128 * 1024)
#define VMA_MALLOC_COUNT (VM_MAX_MAP_COUNT)

int main(void)
{
    static void *mmap_vma[VMA_MMAP_COUNT];

    for (int i = 0; i < VMA_MMAP_COUNT; i++) {
        mmap_vma[i] = mmap(0, VMA_MMAP_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (mmap_vma[i] == MAP_FAILED) {
            printf("mmap() failed at %d\n", i);
            return 1;
        }
    }

    for (int i = 0; i < VMA_MMAP_COUNT; i += 2) {
        if (munmap(mmap_vma[i], VMA_MMAP_SIZE) != 0) {
            printf("munmap() failed at %d (%p): %m\n", i, mmap_vma[i]);
            return 1;
        }
    }

    static void *malloc_vma[VMA_MALLOC_COUNT];

    for (int i = 0; i < VMA_MALLOC_COUNT; i++) {
        malloc_vma[i] = malloc(VMA_MALLOC_SIZE);

        if (malloc_vma[i] == NULL) {
            printf("malloc() failed at %d\n", i);
            return 1;
        }
    }

    for (int i = 0; i < VMA_MALLOC_COUNT; i += 2) {
        free(malloc_vma[i]);
    }
}
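If strace is not convenient, the effect can also be observed from inside the process by counting the entries in /proc/self/maps before and after the free() loop. A rough sketch (it assumes one line per mapping, which matches the usual format of that file):

#include <stdio.h>

// Count current mappings by counting lines in /proc/self/maps.
// Returns -1 if the file cannot be opened.
static long count_mappings(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL)
        return -1;

    long count = 0;
    int c;
    while ((c = fgetc(f)) != EOF) {
        if (c == '\n')
            count++;
    }
    fclose(f);
    return count;
}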
Comments:

ENOMEM from unmap, that is plausibly a bug, and maybe it should keep track of the segments that it could not unmap and try again later after other maps and unmaps have reduced fragmentation. But do keep in mind that in Linux, just because you succeeded in mmapping a segment doesn't mean you can use all the pages. There are a lot of different opinions on this, but I think the bigger issue is that you can successfully mmap a segment and then get an OOM on trying to use it. – Contusion

ENOMEM, and letting the app decide how to deal with it, and allowing for the configuration of the OOM killer, does allow for graceful failure. As much as some people don't like the OOM killer, a CPU panic is not graceful degradation. – Contusion

/proc/self/maps. And the assumption is that 100% of programs will never reach the 65530 limit on the number of mapped areas. If you are writing critical software, you don't use malloc() and free(); you manage the memory yourself. (I can't write an answer, because you are not really asking a question, you are just saying that it is lame ...) – Constanta