Clarification
First, some clarification. You asked: ... my program a.out terminates already, there is no other process that is maintaining this memory cache - or, is there one?
Everything we are talking about is within the lifetime of a single process: the process always returns all allocated memory when it exits. There is no cache that outlives the process1. The memory is returned even without any help from the runtime allocator: the OS simply "takes it back" when the process is terminated. So there is no system-wide leak possible from terminated applications with normal allocations.
Now what Valgrind is reporting is memory that is in use at the moment the process terminated, but before the OS cleans everything up. It works at the runtime library level, and not at the OS level. So it's saying "Hey, when the program finished, there were 72,000 bytes that hadn't been returned to the runtime" but an unstated implication is that "these allocations will be cleaned up shortly by the OS".
The Underlying Questions
The code and Valgrind output shown doesn't really correlate well with the titular question, so let's break them apart. First we'll just try to answer the questions you asked about allocators: why they exist and why they don't generally don't immediately return freed memory to the OS, ignoring the example.
You asked:
1) Why keep them in an internal cache? If it is for speed, how is it
faster? Yes, the OS needs to maintain a data structure to keep track
of memory allocation, but this the maintainer of this cache also needs
to do so.
This is sort of two questions in one: one is why bother having a userland runtime allocator at all, and then the other one is (perhaps?) why don't these allocators immediately return memory to the OS when it is freed. They are related, but let's tackle them one at a time.
Why Runtime Allocators Exist
Why not just rely on the OS memory allocation routines?
Many operating systems, including most Linux and other Unix-like operating systems, simply don't have an OS system call to allocate and free arbitrary blocks of memory. Unix-alikes offer brk
which only grows or shrinks one contiguous block of memory - you have no way to "free" arbitrary earlier allocations. They also offer mmap
which allows you to independently allocate and free chunks of memory, but these allocate on a PAGE_SIZE
granularity, which on Linux is 4096 bytes. So if you want request 32 bytes, you'll have to waste 4096 - 32 == 4064
bytes if you don't have your own allocator. On these operating systems you practically need a separate memory allocation runtime which turns these coarse-grained tools into something capable of efficiently allocating small blocks.
Windows is a bit different. It has the HeapAlloc
call, which is part of the "OS" and does offer malloc
-like capabilities of allocating and freeing arbitrarily sized chunks of memory. With some compilers then, malloc
is just implemented as a thin wrapper around HeapAlloc
(the performance of this call has improved greatly in recent Windows versions, making this feasible). Still, while HeapAlloc
is part of the OS it isn't implemented in the kernel - it is also mostly implemented in a user-mode library, managing a list of free and used blocks, with occasional kernel calls to get chunks of memory from the kernel. So it is mostly malloc
in another disguise and any memory it is holding on to is also not available to any other processes.
- Performance! Even if there were appropriate kernel-level calls to allocate arbitrary blocks of memory, the simple overhead roundtrip to the kernel is usually hundreds of nanoseconds or more. A well-tuned
malloc
allocation or free, on other hand, is often only a dozen instructions and may complete in 10 ns or less. On top of that, system calls can't "trust their input" and so must carefully validate parameters passed from user-space. In the case of free
this means that it much check that the user passed a pointer which is valid! Most runtime free
implements simply crash or silently corrupt memory since there is no responsibility to protect a process from itself.
- Closer link to the rest of the language runtime. The functions you use to allocate memory in C++, namely
new
, malloc
and friends, are part of an defined by the language. It is then entirely natural to implement them as part of the runtime that implements the rest of the language, rather than the OS which is for the most part language-agnostic. For example, the language may have specific alignment requirements for various objects, which can best be handled by language aware allocators. Changes to the language or compiler might also imply necessary changes to the allocation routines, and it would be a tough call to hope for the kernel to be updated to accommodate your language features!
Why Not Return Memory to the OS
Your example doesn't show it, but you asked and if you wrote a different test you would probably find that after allocating and then freeing a bunch of memory, your processes resident set size and/or virtual size as reported by the OS might not decrease after the free. That is, it seems like the process holds on to the memory even though you have freed it. This is in fact true of many malloc
implementations. First, note that this is not a leak per se - the unreturned memory is still available to the process that allocated it, even if not to other processes.
Why do they do that? Here are some reasons:
The kernel API makes it hard. For the old-school brk
and sbrk
system calls, it simply isn't feasible to return freed memory unless it happens to be at the end of very last block allocated from brk
or sbrk
. That's because the abstraction offered by these calls is a single large contiguous region that you can only extend from one end. You can't hand back memory from the middle of it. Rather than trying to support the unusual case where all the freed memory happens to be at the end of brk
region, most allocators don't even bother.
The mmap
call is more flexible (and this discussion generally applies also to Windows where VirtualAlloc
is the mmap
equivalent), allowing you to at least return memory at a page granularity - but even that is hard! You can't return a page until all allocations that are part of that page are freed. Depending on the size and allocation/free pattern of the application that may be common or uncommon. A case where it works well is for large allocations - greater than a page. Here you're guaranteed to be able to free most of the allocation if it was done via mmap
and indeed some modern allocators satisfy large allocations directly from mmap
and free them back to the OS with munmap
. For glibc
(and by extension the C++ allocation operators), you can even control this threshold:
M_MMAP_THRESHOLD
For allocations greater than or equal to the limit specified
(in bytes) by M_MMAP_THRESHOLD that can't be satisfied from
the free list, the memory-allocation functions employ mmap(2)
instead of increasing the program break using sbrk(2).
Allocating memory using mmap(2) has the significant advantage
that the allocated memory blocks can always be independently
released back to the system. (By contrast, the heap can be
trimmed only if memory is freed at the top end.) On the other
hand, there are some disadvantages to the use of mmap(2):
deallocated space is not placed on the free list for reuse by
later allocations; memory may be wasted because mmap(2)
allocations must be page-aligned; and the kernel must perform
the expensive task of zeroing out memory allocated via
mmap(2). Balancing these factors leads to a default setting
of 128*1024 for the M_MMAP_THRESHOLD parameter.
So by default allocations of 128K or more will be allocated by the runtime directly from the OS and freed back to the OS on free. So sometimes you will see the behavior you might have expected is always the case.
- Performance! Every kernel call is expensive, as described in the other list above. Memory that is freed by a process will be needed shortly later to satisfy another allocation. Rather than trying to return it to the OS, a relatively heavyweight operation, why not just keep it around on a free list to satisfy future allocations? As pointed out in the man page entry, this also avoids the overhead of zeroing out all the memory returned by the kernel. It also gives the best chance of good cache behavior since the process is continually re-using the same region of the address space. Finally, it avoids TLB flushes which would be imposed by
munmap
(and possibly by shrinking via brk
).
- The "problem" of not returning memory is the worst for long-lived processes that allocate a bunch of memory at some point, free it and then never allocate that much again. I.e., processes whose allocation high-water mark is larger than their long term typical allocation amount. Most processes just don't follow that pattern, however. Processes often free a lot of memory, but allocate at a rate such that their overall memory use is constant or perhaps increasing. Applications that do have the "big then small" live size pattern could perhaps force the issue with
malloc_trim
.
Virtual memory helps mitigate the issue. So far I've been throwing around terms like "allocated memory" without really defining what it means. If a program allocates and then frees 2 GB of memory and then sits around doing nothing, is it wasting 2 GB of actual DRAM plugged into your motherboard somewhere? Probably not. It is using 2 GB of virtual address space in your process, sure, but virtual address space is per-process, so that doesn't directly take anything away from other processes. If the process actually wrote to the memory at some point, it would be allocated physical memory (yes, DRAM) - after freeing it, you are - by definition - no longer using it. At this point the OS may reclaim those physical pages by use for someone else.
Now this still requires you have swap to absorb the dirty not-used pages, but some allocators are smart: they can issue a madvise(..., MADV_DONTNEED)
call which tells the OS "this range doesn't have anything useful, you don't have to preserve its contents in swap". It still leaves the virtual address space mapped in the process and usable later (zero filled) and so it's more efficient than munmap
and a subsequent mmap
, but it avoid pointlessly swapping freed memory regions to swap.2
The Demonstrated Code
As pointed out in this answer your test with vector<int>
isn't really testing anything because an empty, unused std::vector<int> v
won't even create the vector object as long as you are using some minimal level of optimization. Even without optimization, no allocation is likely to occur because most vector
implementations allocate on first insertion, and not in the constructor. Finally, even if you are using some unusual compiler or library that does an allocation, it will be for a handful of bytes, not the ~72,000 bytes Valgrind is reporting.
You should do something like this to actually see the impact of a vector allocation:
#include <vector>
volatile vector<int> *sink;
int main() {
std::vector<int> v(12345678);
sink = &v;
}
That results in actual allocation and de-allocation. It isn't going to change the Valgrind output, however, since the vector allocation is correctly freed before the program exits, so there is no issue as far as Valgrind is concerned.
At a high level, Valgrind basically categorizes things into "definite leaks" and "not freed at exit". The former occur when the program no longer has a reference to a pointer to memory that it allocated. It cannot free such memory and so has leaked it. Memory which hasn't been freed at exit may be a "leak" - i.e., objects that should have been freed, but it may also simply be memory that the developer knew would live the length of the program and so doesn't need to be explicitly freed (because of order-of-destruction issues for globals, especially when shared libraries are involved, it may be very hard to reliably free memory associated with global or static objects even if you wanted to).
1 In some cases some deliberately special allocations may outlive the process, such as shared memory and memory mapped files, but that doesn't relate to plain C++ allocations and you can ignore it for the purposes of this discussion.
2 Recent Linux kernels also have the Linux-specific MADV_FREE
which seems to have similar semantics to MADV_DONTNEED
.
vector
or justnew char[1234567]
) and observe that the OS-reported memory use goes up by a similar amount, then free it observe that it may not go down (or it may, it depends!). – Tramroad