If memory pools are faster than malloc, why doesn't malloc use them under the covers?

I keep hearing that memory pools can significantly improve performance when allocating memory, so why aren't they used in some way by traditional malloc implementations?

I know part of it is that memory pools use fixed-size blocks of memory, but it seems like some don't, and the only thing they require is grabbing a little extra memory in advance. Is there a way they could be sufficiently generalized for such purposes?

Monumental answered 8/5, 2020 at 20:34 Comment(9)
Custom memory pools are not faster in general. They can be faster in specific circumstances. malloc and similar need to be good in general, and they use strategies for that (including memory pools).Herrick
If you know your application's memory allocation patterns you can almost always do a better job than a general-purpose memory manager. If you don't, you can't.Sexton
For what scenarios should memory pools be used?Monumental
Memory pools are great for systems with confined or limited memory (such as embedded systems). In most embedded systems, the theme is to avoid dynamic memory allocation unless absolutely necessary. For example, using a circular array instead of dynamically allocating the array.Yellowthroat
I would have thought that's exactly when you don't want them, when memory is limited, as they take up more room?Monumental
@hughmanwho it's not that they take up more room, so much as they take up a predictable amount of room. i.e. if you allocate 5MB up-front for your memory pool, then your program will either fail at startup (because that 5MB couldn't be allocated), or you will have 5MB guaranteed available for the lifetime of the program. Contrast that with a program that directly calls malloc() at runtime -- it might run fine for 6 weeks, and then fail due to memory exhaustion because some other process grabbed almost all the RAM; not what embedded programmers like to deal withRaspberry
In safety critical programs, memory pools are often used (there are other options) because predictability of behaviour is an essential design requirement, and arbitrary allocation/deallocation of memory gives unpredictability (e.g. of the time for allocation, or potential failed allocation). Typically, pools are ONLY allocated at startup and, if allocation fails, startup is aborted, since allocating memory after startup represents a failure to meet an essential system requirement.Ciliary
I’m voting to close this question because it is based on a fallacy.Towny
@user207421: I'm not convinced that's a good reason to close the question since a rather large chunk of other useful questions here on SO are based on people's misunderstanding or lack of knowledge. Rather I'd see this as an opportunity to explain why the basis may be a fallacy. In any case, as explained in my answer, memory pools are faster in certain circumstances so it's not a fallacy, it just has to be put in context.Anthology

Memory pools can be more efficient than generalised memory allocation, but usually only because you have extra information about the allocation patterns. Perhaps their most important property is that they are deterministic in their run time, especially important in, for example, real-time operating systems.

As an example, I once coded up an embedded system where I knew the largest allocation needed was 128 bytes (called a chunk hereafter). To that end, I maintained a contiguous set of chunks and used a map for deciding whether one was free or not.

It was originally a bitmap but we ended up getting more performance simply by storing each used/unused flag in a separate byte. Eight times as much memory usage for the map but, since the pool size was known and reasonably limited (a thousand or so), that wasn't too bad. And it gave us even more speed by not having to do bit twiddling to do pool management.

And there were other optimisations we added like storing the first free block so we could quickly find it. It was easy to maintain since:

  • freeing a block lower than the current lowest would simply update the lowest; and
  • allocating the lowest block would simply increment the lowest - while that didn't guarantee it pointed at a free block, it would still make the search faster, and avoided a possibly unnecessary search on allocation (if, for example, you first freed a block lower than the one you just allocated).

Then, if you asked for more than the chunk size, it returned NULL (this never happened in that system but, being paranoid, I coded for it just in case). If you asked for something that would fit in a chunk, you got a full chunk regardless (but, of course, you should still only use the memory you asked for, just in case I wanted to later change the chunk size or allocate from separate pools with differing chunk sizes).
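
A minimal sketch of that kind of fixed-chunk pool, with the byte-per-chunk map and the lowest-free hint described above (names and sizes are illustrative, not the original system's code):

    #include <stddef.h>

    #define CHUNK_SIZE 128
    #define NUM_CHUNKS 1024

    static unsigned char pool[NUM_CHUNKS][CHUNK_SIZE];
    static unsigned char used[NUM_CHUNKS];   /* one byte per chunk: 0 = free, 1 = used */
    static size_t lowest_free;               /* every chunk below this index is in use */

    void *chunk_alloc(size_t n)
    {
        if (n > CHUNK_SIZE)
            return NULL;                     /* request too big for this pool */

        for (size_t i = lowest_free; i < NUM_CHUNKS; i++) {
            if (!used[i]) {
                used[i] = 1;
                lowest_free = i + 1;         /* may not be free, but narrows the next search */
                return pool[i];              /* always hand out a full chunk */
            }
        }
        return NULL;                         /* pool exhausted */
    }

    void chunk_free(void *p)
    {
        size_t i = ((unsigned char *)p - (unsigned char *)pool) / CHUNK_SIZE;
        used[i] = 0;
        if (i < lowest_free)
            lowest_free = i;                 /* freeing below the hint moves the hint down */
    }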

This turned out to be much faster than the generalised allocators that were around at the time, since they had to handle differing request sizes and worry about things like coalescing contiguous free blocks when freeing memory.

But it required that extra knowledge, the fact that no allocation would ever be more than the chunk size.


Another model is to have a pool for requests below a certain size but to revert to general allocation if either:

  • you requested a block beyond the chunk size; or
  • the pool was currently exhausted.

That gives you the extra efficiency in most cases (depending on your allocation patterns, of course) while still allowing for allocations beyond the chunk size. It introduces a little extra effort in every allocation, in that you need to evaluate request size and pool exhaustion, but it still may out-perform the general case.
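
A rough sketch of that fallback model, reusing the hypothetical chunk_alloc()/chunk_free() and pool array from the sketch above; the deallocation side decides ownership by checking whether the pointer lies inside the pool's address range:

    #include <stdlib.h>

    void *hybrid_alloc(size_t n)
    {
        if (n <= CHUNK_SIZE) {
            void *p = chunk_alloc(n);   /* fast path: fixed-size chunk from the pool */
            if (p)
                return p;
        }
        return malloc(n);               /* request too big, or pool exhausted */
    }

    void hybrid_free(void *p)
    {
        unsigned char *q = p;
        /* Pool memory goes back to the pool; everything else goes back to malloc's heap. */
        if (q >= (unsigned char *)pool && q < (unsigned char *)pool + sizeof pool)
            chunk_free(p);
        else
            free(p);
    }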

As an aside, I recall something similar in Java strings (not sure if this is still the case, I haven't used Java for quite a while). The string object allocation had a buffer inside it for storing small strings but could also use that space to store a pointer for a separately allocated chunk of characters (if it was bigger than the internal buffer). This reduced fragmentation (and dereferencing) for what was possibly a large number of small strings, but still allowed for larger strings if needed.
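
The same idea is often called a small-string (or small-buffer) optimisation. A toy C sketch of the layout, purely illustrative and not how any particular Java or C++ runtime actually does it:

    #include <stdlib.h>
    #include <string.h>

    #define SSO_CAP 15   /* strings up to 15 characters (plus NUL) live inline */

    typedef struct {
        size_t len;
        union {
            char  small[SSO_CAP + 1];  /* used when len <= SSO_CAP: no separate allocation */
            char *big;                 /* separately allocated buffer for longer strings */
        } u;
    } my_string;

    void my_string_init(my_string *s, const char *src)
    {
        s->len = strlen(src);
        if (s->len <= SSO_CAP) {
            memcpy(s->u.small, src, s->len + 1);     /* small: stored inside the object */
        } else {
            s->u.big = malloc(s->len + 1);           /* big: falls back to the heap */
            if (s->u.big)
                memcpy(s->u.big, src, s->len + 1);   /* (real code would handle failure) */
        }
    }

    const char *my_string_cstr(const my_string *s)
    {
        return s->len <= SSO_CAP ? s->u.small : s->u.big;
    }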


Interestingly, I once tried an experiment in the CPython source code to see if a memory pool could improve performance, especially given the quantity of memory allocations that goes on in there. It used a strategy similar to the one given above, allocating from the pool in preference, but reverting to the original strategy if the requested size was beyond the chunk size or if the pool was exhausted.

Again, it had optimisations as discussed, and then some. For example, the last block freed was cached so it could be handed out immediately without any searching of the pool, in an attempt to speed up the common single-free-then-allocate pattern.

However, even with a variety of optimisations, pool and chunk sizes, it appeared to make no substantial difference to the performance of some test code I wrote to exercise it, leading me to believe that the actual allocator used in CPython is already pretty good.


And, just having finished reading the excellent book I purchased a couple of weeks ago(a), I now know why I didn't make any headway.

Turns out CPython is already heavily optimised, including the use of memory pools. The "Memory Management" chapter goes into a lot more detail but it basically only uses the normal allocator (raw domain) for getting large blocks (> 256K) or specific non-object-related memory.

All objects, and Python is pretty much all objects :-), come from the object domain (other than a few legacy things).

For this domain, it maintains its own heap and allocates arenas sized to match the system page size, using mmap if supported to reduce fragmentation. All the used arenas are held in a doubly-linked list; the empty ones are maintained in a singly-linked free list.

Within each arena, 4K pools are created (so 64 per arena) and a pool can only serve one size of allocation, locked in when the first allocation is requested from that pool. For example, requests for 1-16 bytes will get a 16-byte block from a pool serving 16-byte blocks, 33-48 byte requests will come from a pool serving 48-byte blocks.

Note this is for a 64-bit system where the block sizes are {16, 32, 48, ..., 512}. 32-bit systems have a slightly different set of block sizes, {8, 16, 24, 32, ..., 512}.
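
To make the size-class idea concrete, here is roughly how a request would map to a block size on a 64-bit build with 16-byte steps. This is a simplified sketch of the rounding rule just described, not CPython's actual code:

    #include <stddef.h>

    #define ALIGNMENT          16     /* 64-bit builds: block sizes go up in 16-byte steps */
    #define SMALL_REQUEST_MAX 512     /* larger requests bypass the pools entirely */

    /* Round a request up to its block size: 1-16 -> 16, 33-48 -> 48, 500 -> 512. */
    static size_t block_size_for(size_t n)
    {
        if (n == 0 || n > SMALL_REQUEST_MAX)
            return 0;                             /* handled by the raw allocator instead */
        size_t size_class = (n - 1) / ALIGNMENT;  /* which kind of pool serves this request */
        return (size_class + 1) * ALIGNMENT;
    }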

For pools within an arena, they are either:

  • partially used, in which case they live on a doubly-linked list in the arena, based on their block size. The arena maintains free lists for the pools, one per block size.
  • unused, in which case they live on a free list of pools, able to serve any request size (though, once locked in, that's the block size they're limited to).
  • full, in which case they are made inaccessible, except for de-allocation.

Keep in mind that transitions between those three states all result in pools moving from list to list.

I won't go into any more detail since your head may explode, like mine almost did :-)

In short, CPython object allocations are always for a specific block size, the smallest one larger than or equal to what you need. These come from pools that serve a single block size (once locked in). These pools exist in arenas heavily optimised for preventing fragmentation. And arenas can be created as needed.

Suffice to say, this is why my little experiment didn't improve CPython: it already does memory pooling in a quite complex, yet efficient, way and my attempt to intercept malloc did no good at all.

And my opening statement that pooled memory can be more efficient "but usually only because you have extra information about the allocation patterns" is supported by a comment from that book:

Most of the memory allocation requests are small and of a fixed size, because PyObject is 16 bytes, PyASCIIObject is 42 bytes, PyCompactUnicodeObject is 72 bytes, and PyLongObject is 32 bytes.


(a) CPython Internals, if you're interested. I have no affiliation other than the fact that I like good tech books about how things work.

Anthology answered 21/5, 2021 at 1:55 Comment(3)
Hah, nice memories... 'The string object allocation had a buffer inside it for storing small strings but could also use that space to store a pointer for a separately allocated chunk of characters': I invented that myself, too, in 1997 :D I needed it for parsing documents (spreadsheet definitions, which contained thousands of references to a few functions).Helium
What you describe here about CPython is basically how any malloc implementation that ships with modern systems work. Show me only one malloc implementation that does not work that way.Squib
@Mecki, I didn't state that this concept was not used elsewhere, just that it's useful in cases where you have extra information unavailable to the normal allocation code. CPython uses extra knowledge to make its memory usage more efficient. I once wrote an embedded system where every allocation would be for 128 or 256 bytes so could bypass the normal "linked list" strategy used in allocators with a much faster bitmap one. Pretty much every general algorithm can be improved if there's extra info on how it's used.Anthology

I've written memory pools, with multiple approaches and tradeoffs. I believe malloc() doesn't use them under the covers (if that's true) because:

  1. Memory pools waste memory since they may (and frequently do) use discrete (fixed) block sizes in order to be faster.
    1. Ex: if you ask for 12 bytes, you may secretly get 64 bytes (assuming that's the nearest block size >= 12 bytes, with proper alignment padding), depending on the memory pool implementation. malloc(), however, might give you 16 bytes, which is the nearest alignment requirement alone, and therefore wastes fewer bytes.
    2. Notice the memory pool allocates to the nearest aligned block size (meaning it must be both A) aligned and B) one of the valid block sizes you allow to be allocated) >= the n bytes requested, whereas malloc() just allocates to the nearest alignment requirement (generally alignas(max_align_t), which is usually 8- or 16-byte alignment, depending on architecture) >= the n bytes requested.
  2. Memory pools waste memory since they may (depending on implementation) have large mapping arrays or hash tables (there are multiple approaches and tradeoffs here) to map from the n bytes you want allocated to a free-list linked list (for the next block size >= n bytes) that you can pull from.

In other words, as is the case with most things, there are tradeoffs. I suspect malloc() has chosen to be slower and nondeterministic in order to:

  1. make the most of your RAM, rather than wasting it by giving you a 64-byte block (for instance) every time you ask for 12 bytes,
  2. and to NOT have massive (dozens of kilobytes in size) mapping arrays, or strange limitations on the maximum block size you can allocate, which memory pools may enforce.

Memory pools are frequently customized in speed, RAM usage, block size, and max number of blocks for your specific requirement and application at hand. malloc(), on the other hand, must be universally applicable and functional for every request size possible within your given amount of RAM. It has much different constraints and requirements.

Having said all that, I'm considering writing some drop-in replacements called fast_malloc() and fast_free() for general-purpose usage. They would either have O(1) allocate and free time complexity by using a huge mapping array to map from n bytes to a block size, or I could choose an option which uses less program space and/or RAM but uses binary search to map from n bytes to a block size, and therefore has O(log m) time complexity, where m is the number of possible block sizes you can allocate. I could even make it use malloc() under the hood to expand the memory pool at run-time whenever the memory pool runs out, if desired. That shouldn't be done on microcontrollers or in real-time, safety-critical, deterministic applications, though, in which case I'd disable that feature and only allocate statically at compile-time, or once at initialization during run-time.

Speed note:

  1. One tunable-block-size memory pool I wrote, with O(log m) allocation time but O(1) free time, performed 1x to 3x as fast as the system malloc() (i.e., my implementation took ~33% to ~100% as much time), depending on the block sizes allowed and the number of bytes being allocated. See my comment below this answer for details.
  2. For maximum speed, a large 1:1 mapping array can be used to map directly, in O(1) time, from n bytes (when calling fast_malloc(n)) to an index into a mapping array containing structs of {block_size and ptr_to_free_list}. This 1:1 mapping array, O(N_MAX) in size and used to index into another O(m)-sized mapping array, would perform much faster, but at the cost of much more memory usage in program space/flash, and possibly RAM, depending on the hardware you're running on: microcontroller vs. PC.

Anyway, it is indeed possible to write a fast_malloc() implementation which does use memory pools under the hood, has O(1) allocate and free time complexity, and reverts to regular malloc() for block sizes too big to allocate in O(1) time (i.e., for allocating n bytes where n > N_MAX), in which case it just passes the call through to regular malloc().
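
A sketch of the O(log m) mapping step mentioned above: a binary search over a sorted table of allowed block sizes finds the smallest one that fits. The table and sizes here are illustrative placeholders, not values from any real fast_malloc():

    #include <stddef.h>

    /* Hypothetical set of allowed block sizes, sorted ascending. */
    static const size_t block_sizes[] = {8, 16, 32, 64, 128, 256, 512, 1024};
    #define NUM_BLOCK_SIZES (sizeof block_sizes / sizeof block_sizes[0])

    /* Return the index of the smallest block size >= n, or -1 if n is too big
       (a hybrid allocator would then fall back to regular malloc()). */
    static int block_index_for(size_t n)
    {
        int lo = 0, hi = (int)NUM_BLOCK_SIZES - 1, ans = -1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (block_sizes[mid] >= n) {
                ans = mid;         /* candidate; keep looking for a smaller fit */
                hi = mid - 1;
            } else {
                lo = mid + 1;
            }
        }
        return ans;                /* O(log m), where m = NUM_BLOCK_SIZES */
    }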

Additional reading:

  1. *****Very helpful to learn some of the basics of memory pool allocation strategies: http://dmitrysoshnikov.com/compilers/writing-a-pool-allocator/
  2. See also these Google searches:
    1. *****memory pool
    2. tunable memory pool
    3. tunable block size memory pool

Preliminary fast_malloc() speed test results:

  1. Preliminary results indicate my single-threaded version of fast_malloc() is ~36 clock_cycles/14 clock_cycles = 2.6x faster than glibc's malloc().

    1. From my LinkedIn post here.
  2. My multi-threaded implementation of fast_malloc(), however, currently is waaaay slower. Apparently mutexes are insanely slow. But, I have some ideas to fix it. I have a lot of work to do, and I am still convinced I can make a faster implementation than anything currently out there, including for multi-threaded applications. It has tradeoffs, of course, though--namely: less-efficient usage of memory space.

Related

  1. [my Q&A]: In gcc is there any way to dynamically add a function call to the start of main()? - the question also shows speed test results of various malloc() implementations
Wardieu answered 1/6, 2021 at 6:27 Comment(2)
"as system malloc()": what system were you using/comparing to (Windows, glibc, musl)?Prune
@KamilCuk, I'm on 64-bit Linux Ubuntu 20.04. My build options were -Wall -Wextra -Werror -O3 -std=c11. ldd --version shows ldd (Ubuntu GLIBC 2.31-0ubuntu9.2) 2.31, so it looks like I'm using glibc 2.31. Source where I learned the ldd --version cmd. I also ran Profile-Guided Optimization (PGO) and saw no real improvement with it. Using -O3, however, was very important over -O0, of course.Wardieu

As usual, it all depends.
In this case it depends mainly on what you mean by performance.

As others already said, memory pools are usually faster than standard malloc and free implementations, but speed is not the only factor one must consider in the general case. A general allocator should not allocate too much memory in advance of it being needed (which pools usually do), and it should allocate blocks of arbitrary size (which pools usually do not).

Memory pools are more efficient at allocating large numbers of small blocks, especially of the same size, because the blocks can be allocated as an array, so a bunch of blocks shares a single common header instead of needing a separate header for each block.
On the other hand, usually not the whole bunch will be used up, which may be seen as a loss of memory efficiency.

Pools can be faster in malloc/new because they can almost instantly give you a block of data from a pre-allocated array of blocks of a fitting size, instead of searching for a fitting block to slice from. However, you can't have a pool for every possible size, so usually you get blocks a bit longer than needed, which again is some memory loss.

They can also be faster in free/delete, because they just mark a block in a pool as freed and don't need to look for adjacent blocks to see if they're free too and 'glue' the newly freed block to them.

Helium answered 1/6, 2021 at 6:58 Comment(0)

Pretty much every malloc implementation today uses memory pools, and has done for ages. Only when you request a large amount of memory will malloc directly request virtual memory from the kernel and return that memory back to the system once freed.

In all other cases, malloc requests memory from the system in larger blocks. E.g. if you request 4 bytes of memory, malloc may request 4 KB from the system and then only assign the first 4 bytes of that block to you. The reason malloc does that is that requesting memory from the system is expensive, and that way the next time you request 4 bytes, malloc doesn't have to request memory from the system at all but can just assign you some more memory from that block.

This also means that if you free an allocation, malloc cannot return that block to the system unless all allocations from that block have been freed. And even then malloc will always keep some spare blocks around. Also note that you cannot request arbitrary amounts of virtual memory from the system, usually only a multiple of the page size, so without an internal memory pool every allocation would in fact consume 4 KB of RAM, no matter how small the allocation is. So malloc will always request one or more pages, split those up into chunks, and allocations are in fact taken from these chunks.
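
As a rough illustration of that "request a big block, hand out small pieces" idea, here is a toy bump allocator that grabs a single page with mmap() and carves aligned pieces out of it (POSIX only, and purely illustrative: a real malloc tracks many such blocks, reuses freed space, and grows on demand):

    #include <stddef.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    static unsigned char *page;    /* one page obtained from the kernel */
    static size_t         offset;  /* how much of it has been handed out so far */

    void *toy_alloc(size_t n)
    {
        n = (n + 15) & ~(size_t)15;          /* round up to 16-byte alignment */

        if (page == NULL) {
            /* One expensive system call serves many cheap allocations afterwards. */
            page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (page == MAP_FAILED) {
                page = NULL;
                return NULL;
            }
            offset = 0;
        }
        if (offset + n > PAGE_SIZE)
            return NULL;                     /* a real allocator would map another page */

        void *p = page + offset;
        offset += n;
        return p;
    }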

Here's a pretty good article on how that works: https://levelup.gitconnected.com/understand-heap-memory-allocation-a-hands-on-approach-775151caf2ea

Those blocks basically form a memory pool, though not a specialized one. malloc cannot know in advance how often you will require blocks or of what size; it has to deal with arbitrary memory allocations in arbitrary order. There is also some management overhead: finding a free spot in the allocated blocks for your allocation and remembering that this spot is now in use, as well as marking it free again once it isn't in use anymore.

When people say a memory pool is faster, they mean a specialized memory pool. E.g. one where all blocks have the same size (e.g. matching the size of the memory allocations you will need) and are stored in a singly-linked list, so finding a free block is just removing the first block from the list (all have the same size, so any block will do) and freeing it again is just adding it back to the list; nothing could be simpler and faster than that. And as you never return the memory to the system, you never have to request it again from the system, which is the slowest part of memory management.
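
A minimal sketch of such a specialized same-size pool, where the free list threads through the unused blocks themselves, so both operations are just a couple of pointer moves (names and sizes are illustrative):

    #include <stddef.h>

    #define BLOCK_SIZE 64
    #define NUM_BLOCKS 256

    /* A free block stores the "next" pointer inside itself, so the free list
       costs no extra memory. */
    typedef union block {
        union block  *next;
        unsigned char data[BLOCK_SIZE];
    } block;

    static block  storage[NUM_BLOCKS];
    static block *free_list;

    void pool_init(void)                 /* thread all blocks onto the free list */
    {
        for (size_t i = 0; i + 1 < NUM_BLOCKS; i++)
            storage[i].next = &storage[i + 1];
        storage[NUM_BLOCKS - 1].next = NULL;
        free_list = &storage[0];
    }

    void *pool_get(void)                 /* O(1): pop the head of the free list */
    {
        block *b = free_list;
        if (b)
            free_list = b->next;
        return b;                        /* NULL if the pool is exhausted */
    }

    void pool_put(void *p)               /* O(1): push the block back onto the list */
    {
        block *b = p;
        b->next = free_list;
        free_list = b;
    }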

The downside of such an approach is that your code constantly reserves more memory than it may currently need (memory that is not available to other processes and may cause the system to swap), and all allocations have a fixed size, which wastes a lot of memory for nothing. Memory pools are specialized; malloc must be general and conservative with resources. So it won't use a pool approach for large allocations, and even for the small ones where it does use a pool, it cannot compete with a simple specialized pool tailored to your memory needs.

Squib answered 14/7, 2023 at 16:27 Comment(0)
