Memory pools can be more efficient than generalised memory allocation, but usually only because you have extra information about the allocation patterns. Perhaps their most important property is that they are deterministic in their run time, which is especially important in, for example, real-time operating systems.
As an example, I once coded up an embedded system where I knew the largest allocation needed was 128 bytes (called a chunk hereafter). To that end, I maintained a contiguous set of chunks and used a map for deciding whether one was free or not.
It was originally a bitmap, but we ended up getting more performance simply by storing each used/unused flag in a separate byte. That's eight times as much memory for the map but, since the pool size was known and reasonably limited (a thousand or so), it wasn't too bad. And it gave us even more speed by avoiding the bit twiddling needed for pool management.
And there were other optimisations we added, like storing the first free block so we could find it quickly. It was easy to maintain since:
- freeing a block lower than the current lowest would simply update the lowest; and
- allocating the lowest block would simply increment the lowest - while that didn't guarantee it pointed at a free block, it would still make the search faster and avoid a possibly unnecessary search on allocation (if, for example, you had first freed a block lower than the one you just allocated).
Then, if you asked for more than the chunk size, it returned NULL (this never happened in that system but, being paranoid, I coded for it just in case). If you asked for something that would fit in a chunk, you got a full chunk regardless (but, of course, you should still only use the memory you asked for, just in case I wanted to later change the chunk size or allocate from separate pools with differing chunk sizes).
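Here's a minimal sketch in C of that scheme; the names and sizes are illustrative, not the original code:

```c
/*
 * Fixed-size chunk pool: one byte per chunk in the used[] map,
 * plus a "lowest" hint to shorten the search for a free chunk.
 */
#include <stddef.h>

#define CHUNK_SIZE  128
#define POOL_CHUNKS 1024

static unsigned char pool[POOL_CHUNKS][CHUNK_SIZE];
static unsigned char used[POOL_CHUNKS];  /* one byte per chunk: 0 = free */
static size_t lowest = 0;                /* hint for where to start searching */

void *pool_alloc(size_t size)
{
    if (size > CHUNK_SIZE)               /* bigger than a chunk: refuse */
        return NULL;

    for (size_t i = lowest; i < POOL_CHUNKS; i++) {
        if (!used[i]) {
            used[i] = 1;
            lowest = i + 1;              /* may not be free, but narrows the search */
            return pool[i];
        }
    }
    return NULL;                         /* pool exhausted */
}

void pool_free(void *p)
{
    size_t i = (size_t)((unsigned char *)p - &pool[0][0]) / CHUNK_SIZE;
    used[i] = 0;
    if (i < lowest)                      /* freeing below the hint pulls it down */
        lowest = i;
}
```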
This turned out to be much faster than the generalised allocators that were around at the time, since they had to handle differing request sizes and worry about things like coalescing contiguous free blocks when freeing memory.
But it required that extra knowledge, the fact that no allocation would ever be more than the chunk size.
Another model is to have a pool for requests below a certain size but to revert to general allocation if either:
- you requested a block beyond the chunk size; or
- the pool was currently exhausted.
That gives you the extra efficiency in most cases (depending on your allocation patterns, of course) while still allowing for allocations beyond the chunk size. It introduces a little extra effort on every allocation, in that you need to check the request size and pool exhaustion, but it may still out-perform the general case.
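A sketch of that hybrid model, building on the pool_alloc/pool_free sketch above (the ownership test assumes the pool is one contiguous range):

```c
#include <stdlib.h>

void *hybrid_alloc(size_t size)
{
    if (size <= CHUNK_SIZE) {
        void *p = pool_alloc(size);
        if (p != NULL)
            return p;                    /* fast path: served from the pool */
    }
    return malloc(size);                 /* oversized or exhausted: general allocator */
}

void hybrid_free(void *p)
{
    unsigned char *q = (unsigned char *)p;
    unsigned char *base = &pool[0][0];

    if (q >= base && q < base + sizeof(pool))
        pool_free(p);                    /* it came from the pool */
    else
        free(p);                         /* it came from malloc */
}
```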
As an aside, I recall something similar in Java strings (not sure if this is still the case, I haven't used Java for quite a while). The string object allocation had a buffer inside it for storing small strings but could also use that space to store a pointer for a separately allocated chunk of characters (if it was bigger than the internal buffer). This reduced fragmentation (and dereferencing) for what was possibly a large number of small strings, but still allowed for larger strings if needed.
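The idea is essentially a small-buffer optimisation. Here's a C-flavoured sketch of it; the real Java internals differed, and every name here is made up for illustration:

```c
#include <stdlib.h>
#include <string.h>

#define SSO_CAP 15                       /* longest string stored inline */

typedef struct {
    size_t len;
    union {
        char  buf[SSO_CAP + 1];          /* small strings live inside the object */
        char *ptr;                       /* larger strings are allocated separately */
    } u;
} small_string;

void small_string_init(small_string *s, const char *src)
{
    s->len = strlen(src);
    if (s->len <= SSO_CAP) {
        memcpy(s->u.buf, src, s->len + 1);   /* no separate allocation */
    } else {
        s->u.ptr = malloc(s->len + 1);       /* error handling omitted */
        memcpy(s->u.ptr, src, s->len + 1);
    }
}

const char *small_string_cstr(const small_string *s)
{
    return s->len <= SSO_CAP ? s->u.buf : s->u.ptr;
}
```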
Interestingly, I once tried an experiment in the CPython source code to see if a memory pool could improve performance, especially given the amount of memory allocation that goes on in there. It used a strategy similar to the one given above, allocating from the pool in preference but reverting to the original strategy if the requested size was beyond the chunk size or the pool was exhausted.
Again, it had the optimisations discussed above, and then some. For example, the last block freed was cached so it could be handed out immediately, without any searching of the pool, in an attempt to speed up the common free-then-allocate pattern.
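Something like this, reusing the names from the first sketch (illustrative only):

```c
/* Freeing stashes one block; the next fitting allocation takes it
   back without touching the used[] map or searching at all. */
static void *last_freed = NULL;

void *cached_alloc(size_t size)
{
    if (size <= CHUNK_SIZE && last_freed != NULL) {
        void *p = last_freed;            /* free-then-allocate fast path */
        last_freed = NULL;
        return p;
    }
    return pool_alloc(size);             /* fall back to the normal search */
}

void cached_free(void *p)
{
    if (last_freed == NULL)
        last_freed = p;                  /* keep one block aside */
    else
        pool_free(p);
}
```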
However, even with a variety of optimisations and different pool and chunk sizes, it appeared to make no substantial difference to the performance of some test code I wrote to exercise it, leading me to believe that the actual allocator used in CPython is already pretty good.
And, just having finished reading the excellent book I purchased a couple of weeks ago (a), I now know why I didn't make any headway.
Turns out CPython is already heavily optimised, including the use of memory pools. The "Memory Management" chapter goes into a lot more detail, but it basically only uses the normal allocator (the raw domain) for large blocks (bigger than 512 bytes, the largest pooled block size) or specific non-object-related memory.
All objects, and Python is pretty much all objects :-), come from the object domain (other than a few legacy things).
For this domain, it maintains its own heap, allocating 256 KB arenas using `mmap` where supported to reduce fragmentation. All the used arenas are held in a doubly-linked list; the empty ones are maintained on a singly-linked free list.
Within each arena, 4 KB pools are created, matching the typical system page size (so 64 per arena), and a pool can only serve one size of allocation, locked in when the first allocation is requested from that pool. For example, requests for 1-16 bytes will get a 16-byte block from a pool serving 16-byte blocks, and 33-48 byte requests will come from a pool serving 48-byte blocks.
Note this is for a 64-bit system, where the block sizes are {16, 32, 48, ..., 512}. 32-bit systems have a slightly different set of block sizes, {8, 16, 24, 32, ..., 512}.
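The rounding is just alignment to the next multiple of 16 on 64-bit builds (8 on 32-bit). A sketch, with constant names that echo CPython's obmalloc but should be treated as illustrative here:

```c
#include <stddef.h>

#define ALIGNMENT               16    /* 8 on 32-bit builds */
#define SMALL_REQUEST_THRESHOLD 512

size_t block_size_for(size_t request)
{
    if (request == 0 || request > SMALL_REQUEST_THRESHOLD)
        return 0;                     /* not served from a pool */
    return (request + ALIGNMENT - 1) & ~(size_t)(ALIGNMENT - 1);
}

/* block_size_for(1) == 16, block_size_for(33) == 48, block_size_for(512) == 512 */
```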
Pools within an arena are in one of three states:
- partially used, in which case they live on a doubly-linked list in the arena, based on their block size; the arena maintains one such list per block size.
- unused, in which case they live on a free list of pools, able to serve any request size (though, once locked in, that's the block size they're limited to).
- full, in which case they are made inaccessible, except for de-allocation.
Keep in mind that transitions between those three states all result in pools moving from list to list.
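For illustration, a tiny helper capturing when each transition happens (a sketch, not the real obmalloc bookkeeping):

```c
#include <stddef.h>

typedef enum { POOL_UNUSED, POOL_PARTIAL, POOL_FULL } pool_state;

/* The state a pool lands in after an allocation or a free. */
pool_state state_after(size_t used_blocks, size_t total_blocks)
{
    if (used_blocks == 0)
        return POOL_UNUSED;   /* back on the arena's free list of pools */
    if (used_blocks == total_blocks)
        return POOL_FULL;     /* unlinked until one of its blocks is freed */
    return POOL_PARTIAL;      /* on the doubly-linked list for its block size */
}
```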
I won't go into any more detail since your head may explode, like mine almost did :-)
In short, CPython object allocations are always for a specific block size, the smallest one greater than or equal to what you need. These come from pools that serve a single block size (once locked in). These pools exist in arenas heavily optimised for preventing fragmentation. And arenas can be created as needed.
Suffice to say, this is why my little experiment didn't improve CPython: it already does memory pooling in a quite complex, yet efficient, way and my attempt to intercept `malloc` did no good at all.
And my opening statement that pooled memory can be more efficient "but usually only because you have extra information about the allocation patterns" is supported by a comment from that book:
Most of the memory allocation requests are small and of a fixed size, because `PyObject` is 16 bytes, `PyASCIIObject` is 42 bytes, `PyCompactUnicodeObject` is 72 bytes, and `PyLongObject` is 32 bytes.
(a) CPython Internals, if you're interested. I have no affiliation other than the fact that I like good tech books about how things work.