Which memory allocation algorithm is best suited for performance- and time-critical C++ applications?

I ask this question to determine which memory allocation algorithm gives the best results for performance-critical applications, like game engines or embedded applications. The results actually depend on the percentage of memory fragmented and on the time-determinism of each memory request.

There are several algorithms in the textbooks (e.g. buddy memory allocation), but there are also others such as TLSF. So, of the memory allocation algorithms available, which one is the fastest and causes the least fragmentation? By the way, garbage collectors should not be included.

Please also note that this question is not about profiling; it simply aims to find the optimum algorithm for the given requirements.

Tongue answered 7/2, 2011 at 8:13 Comment(6)
Are there other requirements, like how long you're going to be holding onto items, how often you allocate/deallocate, and the size of allocations? I don't know enough about different algorithms, but I can imagine scenarios where those could be deciding factors between using one allocation method vs another, due to the size of bookkeeping structures, the time overhead involved with the allocation/deallocation algorithm, etc.Vendee
@Merlyn Morgan-Graham minimal execution time and memory fragmentation are the most important requirements. Others, at this point, can be discarded.Tongue
Not making this an answer because it's such a cliche, but: use the default. If (if) performance becomes an issue, profile. Only when that proves a bottleneck in memory allocation would one tinker with memory allocation.Greenlaw
One thing to keep in mind is that while video games have real-time interactivity, that doesn't mean that they have the same requirements as real-time applications (by the overall industry's definition). For example, in an actual real-time program, it may be acceptable to allow a higher average execution time if a guarantee can be made that the allocation will complete under a maximum number of processor cycles. Video game optimizations tend to involve either minimizing or avoiding allocation altogether, or simply reducing average runtime, rather than trying to give a cycle-count guarantee.Vendee
The reason I mention this is that some people are mentioning "real time" in their answers. I don't actually see "real time" in your question...Vendee
There was a time when the question asked about real time performance and fragmentation.Bathulda

It all depends on the application. Server applications which can clear out all memory relating to a particular request at defined moments will have a different memory access pattern than video games, for instance.

If there was one memory allocation algorithm that was always best for performance and fragmentation, wouldn't the people implementing malloc and new always choose that algorithm?

Nowadays, it's usually best to assume that the people who wrote your operating system and runtime libraries weren't brain dead; and unless you have some unusual memory access pattern don't try to beat them.

Instead, try to reduce the number of allocations (or reallocations) you make. For instance, I often use a std::vector, but if I know ahead of time how many elements it will have, I can reserve that all in one go. This is much more efficient than letting it grow "naturally" through several calls to push_back().
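The reserve-up-front pattern described above can be sketched as follows (a minimal illustration; the function name is made up for the example):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reserving up front performs one allocation instead of the series of
// reallocations (and element copies/moves) that repeated push_back
// calls would trigger as the vector grows.
std::vector<int> make_squares(std::size_t n) {
    std::vector<int> v;
    v.reserve(n);  // single allocation covering all n elements
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(static_cast<int>(i * i));
    return v;
}
```

After `reserve(n)`, every `push_back` up to `n` elements is guaranteed not to reallocate, so no iterators or pointers into the vector are invalidated along the way.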

Many people coming from languages where new just means "gimme an object" will allocate things for no good reason. If you don't have to put it on the heap, don't call new.

As for fragmentation: it still depends. Unfortunately I can't find the link now, but I remember a blog post from somebody at Microsoft who had worked on a C++ server application that suffered from memory fragmentation. The team solved the problem by allocating memory from two regions. Memory for all requests would come from region A until it was full (requests would free memory as normal). When region A was full, all memory would be allocated from region B. By the time region B was full, region A was completely empty again. This solved their fragmentation problem.
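The original post isn't available, so the details are guesswork, but the two-region idea can be sketched roughly like this (class name and sizes are illustrative; a real version would have to verify that the inactive region is fully drained before flipping, and handle alignment and oversized requests):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of the two-region scheme: bump-allocate from the
// active region; when it fills up, flip to the other region, which by
// then is assumed empty because all its requests have completed.
class TwoRegionArena {
public:
    explicit TwoRegionArena(std::size_t regionSize)
        : size_(regionSize),
          buf_{new std::uint8_t[regionSize], new std::uint8_t[regionSize]} {}
    ~TwoRegionArena() { delete[] buf_[0]; delete[] buf_[1]; }

    void* alloc(std::size_t n) {
        if (used_ + n > size_) {  // active region full: flip regions
            active_ ^= 1;
            used_ = 0;
        }
        void* p = buf_[active_] + used_;
        used_ += n;
        return p;
    }

private:
    std::size_t size_;
    std::uint8_t* buf_[2];
    int active_ = 0;
    std::size_t used_ = 0;
};
```

Because each region is always emptied as a whole, free space never ends up interleaved with live allocations, which is what keeps fragmentation at bay.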

Will it solve yours? I have no idea. Are you working on a project which services several independent requests? Are you working on a game?

As for determinism: it still depends. What is your deadline? What happens when you miss the deadline (astronauts lost in space? the music being played back starts to sound like garbage?)? There are real time allocators, but remember: "real time" means "makes a promise about meeting a deadline," not necessarily "fast."

I did just come across a post describing various things Facebook has done to both speed up and reduce fragmentation in jemalloc. You may find that discussion interesting.

Bathulda answered 7/2, 2011 at 8:23 Comment(2)
but what about time-determinism and fragmentation? Usage of heap memory will eventually lead to fragmentation and loss of determinism. BTW, thanks for the link.Tongue
Fragmentation, like all other parameters, depends on the application. For instance, it's trivial to prove that if all allocations are the same size (eg 16 bytes), then no fragmentation can occur (given a halfway-sane allocator).Greenlaw

Barış:

Your question is very general, but here's my answer/guidance:

I don't know about game engines, but for embedded and real-time applications, the general goals of an allocation algorithm are:

1- Bounded execution time: You have to know in advance the worst case allocation time so you can plan your real time tasks accordingly.

2- Fast execution: Well, the faster the better, obviously

3- Always allocate: Especially for real-time, security critical applications, all requests must be satisfied. If you request some memory space and get a null pointer: trouble!

4- Reduce fragmentation: Although this depends on the algorithm used, generally, less fragmented allocations provide better performance, due to a number of reasons, including caching effects.

In most critical systems, you are not allowed to dynamically allocate any memory to begin with. You analyze your requirements, determine your maximum memory use, and allocate a large chunk of memory as soon as your application starts. If you can't, the application does not even start; if it does start, no new memory blocks are allocated during execution.

If speed is a concern, I'd recommend following a similar approach. You can implement a memory pool which manages your memory. The pool could initialize a "sufficient" block of memory at the start of your application and serve your memory requests from this block. If you require more memory, the pool can do another (probably large) allocation, in anticipation of more memory requests, and your application can start using this newly allocated memory. There are various memory pooling schemes around as well, and managing these pools is another whole topic.
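A minimal sketch of that pooling idea might look like this (class name, chunk size, and the simple 8-byte alignment policy are all illustrative assumptions; oversized requests and freeing individual blocks are deliberately not handled):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Serve small allocations by bumping a pointer inside a pre-allocated
// chunk; go back to the system for another large chunk only when the
// current one is exhausted.
class GrowingPool {
public:
    explicit GrowingPool(std::size_t chunkSize = 64 * 1024)
        : chunkSize_(chunkSize) { addChunk(); }
    ~GrowingPool() { for (auto* c : chunks_) delete[] c; }

    void* alloc(std::size_t n) {
        n = (n + 7) & ~std::size_t(7);           // keep 8-byte alignment
        if (used_ + n > chunkSize_) addChunk();  // rare big allocation
        void* p = chunks_.back() + used_;
        used_ += n;
        return p;
    }

    std::size_t chunkCount() const { return chunks_.size(); }

private:
    void addChunk() {
        chunks_.push_back(new std::uint8_t[chunkSize_]);
        used_ = 0;
    }
    std::size_t chunkSize_, used_ = 0;
    std::vector<std::uint8_t*> chunks_;
};
```

The per-request cost is a bounds check and a pointer bump, which also satisfies the bounded-execution-time goal above for every request that stays inside the current chunk.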

As for some examples: VxWorks RTOS used to employ a first-fit allocation algorithm where the algorithm analyzed a linked list to find a big enough free block. In VxWorks 6, they're using a best-fit algorithm, where the free space is kept in a tree and allocations traverse the tree for a big enough free block. There's a white paper titled Memory Allocation in VxWorks 6.0, by Zoltan Laszlo, which you can find by Googling, that has more detail.
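The first-fit search mentioned above can be illustrated in a few lines (this is a sketch of the general technique, not of the actual VxWorks implementation; a real allocator would also coalesce adjacent free blocks on free):

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// A free block of heap space, described by its offset and size.
struct FreeBlock { std::size_t offset, size; };

// First fit: walk the free list and take the first block that is large
// enough. Returns the chosen offset, or size_t(-1) on failure.
std::size_t first_fit(std::list<FreeBlock>& freeList, std::size_t n) {
    for (auto it = freeList.begin(); it != freeList.end(); ++it) {
        if (it->size >= n) {
            std::size_t off = it->offset;
            it->offset += n;  // split: shrink the block in place
            it->size -= n;
            if (it->size == 0) freeList.erase(it);
            return off;
        }
    }
    return static_cast<std::size_t>(-1);
}
```

A best-fit allocator like the one described for VxWorks 6 would instead keep the free blocks ordered by size (e.g. in a tree) and pick the smallest block that still satisfies the request, trading a cheaper search for less wasted space per allocation.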

Going back to your question about speed/fragmentation: It really depends on your application. Things to consider are:

  • Are you going to make lots of very small allocations, or relatively larger ones?

  • Will the allocations come in bursts, or spread equally throughout the application?

  • What is the lifetime of the allocations?

If you're asking this question because you're going to implement your own allocator, you should probably design it in such a way that you can change the underlying allocation/deallocation algorithm, because if the speed/fragmentation is really that critical in your application, you're going to want to experiment with different allocators. If I were to recommend something without knowing any of your requirements, I'd start with TLSF, since it has good overall characteristics.
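One simple way to keep the underlying algorithm swappable, as suggested above, is to code against a minimal interface and plug concrete strategies in behind it (the class names here are illustrative, not from any real library):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Minimal allocator interface: experiments swap the concrete strategy
// without touching calling code.
struct Allocator {
    virtual ~Allocator() = default;
    virtual void* allocate(std::size_t n) = 0;
    virtual void deallocate(void* p) = 0;
};

// Trivial stand-in strategy; a TLSF- or buddy-based implementation
// would slot in behind the same interface.
struct MallocAllocator : Allocator {
    void* allocate(std::size_t n) override { return std::malloc(n); }
    void deallocate(void* p) override { std::free(p); }
};
```

The virtual call adds a small, constant overhead per request; once the winning strategy is chosen, it can be devirtualized (e.g. by making the allocator a template parameter) for production builds.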

Aleasealeatory answered 7/2, 2011 at 9:42 Comment(0)

The best practice is to use whatever gets the thing done in time (in your case, the default allocator). If the whole thing is very complex, write tests and samples that emulate parts of it. Then run performance tests and benchmarks to find bottlenecks (they will probably have nothing to do with memory allocation :). From that point you will see what exactly slows down your code and why. Only with such precise knowledge can you optimize anything and choose one algorithm over another. Without tests it's just a waste of time, since you can't even measure how much your optimization will speed up your app (in fact, such "premature" optimizations can really slow it down).

Memory allocation is a very complex thing and it really depends on many factors. For example, the following allocator is simple and extremely fast, but it can be used only in a limited number of situations:

#include <cstddef>

char pool[MAX_MEMORY_REQUIRED_TO_RENDER_FRAME];
char *poolHead = pool;

// Bump allocator: no bounds check, no alignment, and individual blocks
// can never be freed -- the whole pool is reset once per frame.
void *alloc(std::size_t sz) { char *p = poolHead; poolHead += sz; return p; }
void resetPool() { poolHead = pool; }  // renamed: `free` clashes with std free()

So there is no "the best algorithm ever".

Nyx answered 7/2, 2011 at 8:42 Comment(4)
thanks for the reply, but this question is not related to profiling; it is about the best memory allocation algorithm for the given requirementsTongue
Ok, then consider my pool-based implementation as the fastest one, since it fits your requirements :)Nyx
@baris_a: the question may not intend to be about profiling. As you should infer from the answers so far, profiling is essential because there is no "one size fits all" best solution. It's like buying clothes: if you don't measure, you're bound to end up with something that doesn't fit you.Greenlaw
@baris_a: A concrete example for games is that you'd want a different allocation strategy for model/image data than you would for individual particles in a particle system. In a particle system, all your objects are the same size, and minuscule. Whereas model/image data is often variably sized (maybe not textures, but it depends on the game), and always much larger than individual particles. Model/image data tends to be loaded once and kept around for a while, whereas particles appear and disappear constantly. You might have different sub-allocator algorithms for each.Vendee

As others have already written, there is no "optimum algorithm" for every possible application. It has been proven that for any possible algorithm you can find an allocation sequence which will cause fragmentation.

Below are a few hints from my game development experience:

Avoid allocations if you can

A common practice in the game development field was (and to a certain extent still is) to solve dynamic memory allocation performance issues by avoiding memory allocation like the plague. It is quite often possible to use stack-based memory instead: even for dynamic arrays, you can often come up with an estimate which will cover 99% of cases for you, and you need to allocate only when you go over this boundary. Another commonly used approach is "preallocation": estimate how much memory you will need in some function or for some object, create a kind of small and simplistic "local heap" allocated up front, and perform the individual allocations from this heap only.
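The "estimate covers 99% of cases" idea can be sketched like this (a minimal illustration; the class name is made up, copy/move semantics are deliberately omitted, and production code would use something like `llvm::SmallVector` or a `std::pmr` small-buffer resource instead):

```cpp
#include <cassert>
#include <cstddef>

// Keep an in-object stack buffer of N elements; fall back to the heap
// only in the rare case where the estimate N is exceeded.
template <typename T, std::size_t N>
class SmallBuffer {
public:
    explicit SmallBuffer(std::size_t n)
        : size_(n), data_(n <= N ? stack_ : new T[n]) {}
    ~SmallBuffer() { if (data_ != stack_) delete[] data_; }

    T* data() { return data_; }
    std::size_t size() const { return size_; }
    bool onStack() const { return data_ == stack_; }

private:
    T stack_[N];  // no heap allocation in the common case
    std::size_t size_;
    T* data_;
};
```

In the common case this performs zero heap allocations, so both the cost and the fragmentation impact of the container disappear entirely.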

Memory allocator libraries

Another option is to use one of the memory allocation libraries. They are usually created by experts in the field to fit some special requirements, and if you have similar requirements, they may fit yours.

Multithreading

There is one particular case in which you will find the "default" OS/CRT allocator performs badly, and that is multithreading. If you are targeting Windows, be aware that both the OS and CRT allocators provided by Microsoft (including the otherwise excellent Low Fragmentation Heap) are currently blocking. If you want to perform significant threading, you need either to reduce allocation as much as possible, or to use one of the alternatives. See Can multithreading speed up memory allocation?

Solus answered 7/2, 2011 at 9:8 Comment(0)

One constraint that's worth mentioning, which has not been mentioned yet, is multi-threading: Standard allocators must be implemented to support several threads, all allocating/deallocating concurrently, and passing objects from one thread to another so that it gets deallocated by a different thread.

As you may have guessed from that description, implementing an allocator that handles all of this well is a tricky task. And it does cost performance, as it is impossible to satisfy all these constraints without inter-thread communication (i.e. atomic variables and locks), which is quite costly.

As such, if you can avoid concurrency in your allocations, you stand a good chance to implement your own allocator that significantly outperforms the standard allocators: I once did this myself, and it saved me roughly 250 CPU cycles per allocation with a fairly simple allocator based on a number of fixed-size memory pools for small objects, stacking free objects with an intrusive linked list.
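A minimal sketch of that kind of pool, assuming a single thread and the intrusive free-list trick described above (the link pointer lives inside the free slot itself, so there is no per-slot header; class and method names are illustrative):

```cpp
#include <cassert>
#include <cstddef>

// One pool of fixed-size slots. Freed slots are stacked on an intrusive
// singly linked list, so both alloc and release are O(1).
template <std::size_t SlotSize, std::size_t SlotCount>
class FixedPool {
    static_assert(SlotSize >= sizeof(void*), "slot must hold a link pointer");
public:
    FixedPool() {
        // Thread every slot onto the free list up front.
        for (std::size_t i = 0; i < SlotCount; ++i)
            push(storage_ + i * SlotSize);
    }

    void* alloc() {  // no locks: single-threaded use only
        if (!head_) return nullptr;
        void* p = head_;
        head_ = *static_cast<void**>(head_);  // pop the free list
        return p;
    }

    void release(void* p) { push(p); }

private:
    void push(void* p) { *static_cast<void**>(p) = head_; head_ = p; }

    alignas(std::max_align_t) unsigned char storage_[SlotSize * SlotCount];
    void* head_ = nullptr;
};
```

Because the free list behaves as a stack, the most recently freed slot is reused first, which also tends to be friendly to the data cache.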

Of course, avoiding concurrency is likely a no-go for you, but if you don't use it anyway, exploiting that fact might be something worth thinking about.

Chenab answered 9/10, 2017 at 19:46 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.