Avoid memory fragmentation when memory pools are a bad idea
I am developing a C++ application where the program runs endlessly, allocating and freeing millions of strings (char*) over time, and RAM usage is a serious consideration. The result is that RAM usage gets higher and higher over time. I think the problem is heap fragmentation, and I really need to find a solution.

[Figure 1: RAM usage report]

You can see in the image that, after millions of allocations and frees, usage just keeps increasing. And from the way I am testing it, I know for a fact that the amount of data the program stores is not increasing. You will probably ask, "How are you sure of that? How are you sure it's not just a memory leak?" Well:

[Figure 2: RAM usage over a longer test run]

This test ran much longer. I call malloc_trim(0) whenever possible in my program, and it seems the application can finally return the unused memory to the OS: usage drops almost to zero (the actual size of the data my program currently holds). This implies the problem is not a memory leak. But I can't rely on this behavior; the allocation and freeing pattern of my program is random. What if it never releases the memory?

  • I said in the title that memory pools are a bad idea for this project. Of course I don't have absolute knowledge, but the strings I am allocating can be anything between 30 and 4000 bytes, which makes many optimizations and clever ideas much harder. Memory pools are one of them.
  • I am using GCC 11 / G++ 11 as the compiler, so if some old versions had bad allocators, I shouldn't have that problem.
  • How am I measuring memory usage? The Python psutil module: proc.memory_full_info()[0], which gives me RSS.
  • Of course, you don't know the details of my program. It is still a valid question whether this is indeed heap fragmentation. What I can say is that I keep up-to-date counts of how many allocations and frees took place, and I know the element counts of every container in my program. But if you still have ideas about the cause of the problem, I am open to suggestions.
  • I can't just allocate, say, 4096 bytes for all the strings so it becomes easier to optimize. That's the opposite of what I am trying to do.

So my question is: what do programmers do (what should I do) in an application where millions of allocs and frees take place over time, and they are of different sizes, so memory pools are hard to use efficiently? I can't change what the program does; I can only change implementation details.

Bounty Edit: When trying to utilize memory pools, isn't it possible to make multiple pools, to the extent that there is a pool for every possible byte count? For example, my strings can be anywhere between 30 and 4000 bytes, so couldn't somebody make 4000 - 30 + 1 = 3971 memory pools, one for each possible allocation size of the program? Isn't this applicable? All pools could start small (so as not to lose much memory), then grow, balancing performance against memory. I am not trying to make use of a memory pool's ability to reserve big spaces beforehand. I am just trying to effectively reuse freed space, given the frequent allocs and frees.

Last edit: It turns out that the memory growth appearing in the graphs actually came from an HTTP request queue in my program. I failed to see that the hundreds of thousands of tests I ran bloated this queue (something like a webhook). And the reasonable explanation of Figure 2 is that I finally got DDoS-banned from the server (or couldn't open a connection anymore for some reason), the queue emptied, and the RAM issue resolved itself. So for anyone reading this question later: consider every possibility. It would never have crossed my mind that it was something like this. Not a memory leak, but an implementation detail. Still, I think @Hajo Kirchhoff deserves the bounty; his answer was really enlightening.

Sauerbraten answered 6/1, 2022 at 3:15 Comment(19)
@Passerby Thanks for your input. My maximum available RAM changes with the computer/server my program runs on. The largest number of live strings I have seen is 200M. But the maximum-RAM problem is something I can deal with; continuously increasing RAM usage is hard to manage, because it is hard to plan or take precautions for it. I am considering defragmenting, but that is not yet possible with the default glibc allocator (I still need to decide what to use instead). It is still better than restarting my program, which is expensive on both the CPU and the disk.Sauerbraten
You can still pool. We allocate fixed size blocks (next power of two) and pool those allocations. It wastes some memory but solves the fragmentation problem since there's never a chunk in the pool that isn't the same size. We also split based on lifetime. Things that are expected to live longer go elsewhere, the pools are just for constant churn.Symbolize
@RetiredNinja I really want to do this. But there are some issues. I have zero knowledge about the lifetime of objects: a string I allocated may live 10 seconds or 2 weeks, and this goes for all of them. I was considering splitting pools based on sizes, but it is still hard to do, because I also don't know the frequency with which each string size will appear. I may end up sacrificing much more memory than I wish: all of them could be 4000 bytes, or all of them could be 30 bytes. It may be possible if I created many pools with a small starting block count and allowed them to build up.Sauerbraten
The first graph looks exactly like the second graph, except that you did not wait as long. Seems like you need more information on allocation/deallocation patterns and sizes.Philoprogenitive
I find it strange that fragmentation alone would cause a nearly perfect linear increase of RAM usage over time. I'd put a diagnostic inside the program counting the currently allocated bytes in order to estimate the fragmentation ratio. Are you dealing directly with arrays of char and new/delete? Have you tried to get more info with tools such as valgrind?Panek
@MartinYork Greetings, I already mentioned in the question, second graph is from a test which run longer. I added the first one to point out the memory increase, and the uncertainty of actual memory release taking place. Based on randomness, the second graph's test could have also just kept increasing.Sauerbraten
@Panek The test code I am using causes fast but consistent allocs and frees. Consistent in the sense of time (similar intervals). With this in mind, the heap could be getting fragmented in a pattern, like blocks of _ _ # _ # # # _. I am indeed trying to diagnose my program. And I am using char* and malloc/free.Sauerbraten
If you cannot name an upper bound on how much memory the application will use in the worst case, there is nothing you can do, as statistically at one point or another you will run out of memory. If you have an upper bound which is larger than what your computer has to offer, you will need to refactor such that your working set stays bounded at all times. If the program has spikes of working-set size, you could consider using custom heaps on Windows.Mycorrhiza
@Mycorrhiza Yes, unfortunately the worst case is known but not applicable. But it is clear that the worst case will never happen (most of my strings won't be 4000 bytes). Still, I am trying to make this work with an average of 200-300 bytes in mind, and to stop the memory usage increasing over time, which is not because of the worst case but because the program is not returning memory. I am using Ubuntu for the project.Sauerbraten
The test with malloc_trim doesn’t prove that the heap is fragmented; it could also mean that memory usage in the program is poorly understood (no leaks but objects live longer than expected). Use tools that can examine the heap, e.g. #37988864Aplanatic
@NickZavaritsky Thanks for your input. I was of course considering the case you're mentioning. The thing is, I am keeping counts of the allocations that happened. It is nearly equivalent to overloading malloc and free to keep counts: I know exactly how many allocs and frees took place. This is the reason I am working towards the thing I asked in the question. Also, there is no mechanism in my program that would cause the sudden drop in memory which you can see in Figure 2.Sauerbraten
@MaxPaython Could you please repeat the test with MALLOC_ARENA_MAX environment variable set to 1? Rationale: additional heaps are created internally to decrease contention, could lead to memory ballooning.Aplanatic
@NickZavaritsky I did make some experiments with mallopt; as I remember, the parameters I played around with were M_MXFAST and M_TRIM_THRESHOLD. I will try M_ARENA_MAX, but setting it to 1 seems to decrease multithreaded performance, and using a single heap may increase fragmentation rather than decrease it.Sauerbraten
How about tcmalloc? I think this solves what you need, and there should be no change to existing code. It's easy to get on Ubuntu.Socialistic
@LouisGo I was of course considering it (and jemalloc); I mentioned them in one of the comments. But given the allocation pattern that I have, it seems unlikely that any general-purpose memory manager will succeed in avoiding fragmentation.Sauerbraten
Just read your explanation of what really happened (Last edit). I'm glad you found the actual, underlying problem. This is what I alluded to in my first sentence. I've been in your position many times: you think you know what is going on, only to find out it's actually a completely different scenario :)Hurds
@HajoKirchhoff Fragmentation still seems to be happening, just over a longer period of time I guess. I am doing more tests to confirm. Thanks for your help :)Sauerbraten
BTW, if your strings are mostly similar (HTTP requests) except for small snippets such as time/date, you might want to use the "flyweight" design pattern. There is even a boost library for it.Hurds
@HajoKirchhoff I wish. I would be implementing a trie in an instant. Unfortunately, they are very different.Sauerbraten

If everything really is/works as you say it does and there is no bug you have not yet found, then try this:

malloc and other memory allocators usually work in chunks of 16 bytes anyway, even if the actual requested size is smaller than 16 bytes. So you only need about 4000/16 ≈ 250 different memory pools.

const int chunk_size = 16;

memory_pools pool[250]; // 250 memory pools; pool[idx] manages chunks of (idx+1)*chunk_size bytes

char* reserve_mem(size_t sz)
{
    size_t pool_idx_to_use = (sz - 1) / chunk_size; // sz 1..16 -> pool[0], 17..32 -> pool[1], ...
    char* rc = pool[pool_idx_to_use].allocate();
    return rc;
}

IOW, you have 250 memory pools. pool[0] allocates and manages chunks with a length of 16 bytes, pool[99] manages chunks of 1600 bytes, etc...

If you know the length distribution of your strings in advance, you can reserve initial memory for the pools based on this knowledge. Otherwise I'd probably just reserve memory for the pools in 4096-byte increments.

Because while the malloc C heap usually allocates memory in multiples of 16 bytes, it will (at least under Windows, but I am guessing Linux is similar here) ask the OS for memory, which usually works with 4K pages. IOW, the "outer" memory heap managed by the operating system reserves and frees 4096-byte pages.

So growing your own internal memory pools in 4096-byte steps means no fragmentation in the OS app heap. This 4096 page size (or a multiple of it) comes from the processor architecture: Intel processors have a built-in page size of 4K (or multiples of it). I don't know about other processors, but I suspect similar architectures there.

So, to sum it up:

Use chunks in multiples of 16 bytes for your strings per memory pool. Use chunks in multiples of 4K bytes to grow your memory pools.

That will align the memory use of your application with the memory management of the OS and avoid fragmentation as much as possible.

From the OS point of view, your application will only increment memory in 4K chunks. That's very easy to allocate and release. And there is no fragmentation.

From the internal (lib) C heap management point of view, your application will use memory pools and waste at most 15 bytes per string. Also, all similar-length allocations will be grouped together, so again no fragmentation.

Hurds answered 9/1, 2022 at 12:25 Comment(19)
Thanks for your answer. On occasion I deal with more than 200M strings. If I grow the pool by, say, 4K bytes every time, the pool could grow thousands of times. This would create many pool blocks (a linked list). Would this decrease malloc/free performance? Or is it the same as allocating it in one go? I am not talking about the block alloc, but about the millions of frees and allocs I will do inside the pool afterwards.Sauerbraten
200M strings are not really a problem. In fact, using memory pools should be way faster than relying on the heap. That's one of the primary reasons to use them. Example: If you allocate/free 100K strings of 32 to 47 bytes length, you'll actually just use blocks from the 48 bytes memory pool. And since each block is exactly 48 bytes long, allocating means simply using the first free 48 bytes block and freeing means simply putting the 48 block back to the pool. A simple linked list can do that within nanoseconds.Hurds
Only if your 48-byte memory pool is exhausted will an actual (4K) memory allocation, with all the overhead that requires, be necessary. So: a lot faster. And on top of this, even though this specific pool handles all strings of length 32 to 47, it will alloc only 4K pages.Hurds
The increase for the pools for the longer strings should be in multiples of 4K. E.g. grow your 1600...1615-size pool in 16K or 32K chunks.Hurds
Since the ratio between the string length and the pool page size is so large (>100:1), alloc/free performance should not really matter any more. If you still find it does, increase the ratio even more. IOW the memory pool "page size" could be 256 times the string length. (for 64...79 bytes size use 5 times 4K = 20480 bytes).Hurds
What I meant was: imagine I am extending the pool by 32K bytes each time and eventually using ~25GB of memory. This would create tens of thousands of noncontiguous 32K memory areas, like []->[]->[]->[]->[]->..., which I guess, for example, Boost.Pool keeps in a linked list. Would this decrease malloc/free performance inside the pool later on? It seems outside standard pool usage.Sauerbraten
No, not at all. Remember you have one pool per size. Allocating is a simple matter of choosing the first free block. All blocks inside the pool have the exact same size. Allocating one does not change in any way whether the entire 25GB is contiguous or not. It's just a linked list. The fact that it is noncontiguous is completely irrelevant to the performance inside the pool.Hurds
I should also guess that I would have to keep all the pointers that I took from these pools inside a map/hashmap, given that, when freeing, I don't want to iterate over all pools to find which one a pointer belongs to (remember I have char* and not std::string), because I only have the pointer and not the size. It's either that, or strlen every time.Sauerbraten
In that case I wouldn't use a hashmap. Too big a performance penalty, probably (I'm not sure). I'd probably store the pool index at ptr[-1] instead, although that may cause a different kind of performance penalty, since the start of the string is no longer word aligned. But basically you'd allocate one more byte than the size of the string, store the pool index (size/16) in the first allocated byte and return a pointer to the second allocated byte. This is how heap allocation systems generally work: they create an internal data struct, and the returned pointer points just beyond it.Hurds
That way, when you free your string and you need to find the pool you need to return the pointer to, all you have to do is uint_8t pool_index=uint8_t(ptr[-1]);Hurds
Most of the things you mention are applicable. But there is this problem: I am allocating strings with multiple threads, as many threads as I can use, to be clear. malloc has problems, but it did give me separate heaps for each thread to use in parallel. Even if I create separate mutexes for every 16-byte-increment pool, there are cases in my program where millions of same-sized strings are allocated in parallel. Wouldn't this destroy these benefits?Sauerbraten
Quite the contrary. With fixed size blocks within a memory pool you can actually have an extremely fast thread-safe implementation, probably even a lock-free one (except for the case where you need to increase the size of the pool). There are en.wikipedia.org/wiki/Non-blocking_linked_list linked list implementations that don't use a mutex at all. It doesn't get any faster than that.Hurds
Dig around different mempool libraries. It is quite possible that boost memory pools use lock-free implementations. You could even write this yourself; it is not really that difficult. But I'd probably prefer using a library, if there is one.Hurds
That said, lock-free isn't the only way for a fast implementation and boost-mutex itself is very fast unless many different lock attempts happen at the same time. Also you might want to read up "wait-free", "lock-free" and "obstruction-free". See "boost.lockfree" for a good explanation.Hurds
Oh, and do use a performance profiler to find out what the bottlenecks are. Intel VTune is very good and it's free (I think). There's no reason to poke around in the dark, when it comes to performance issues.Hurds
To be fair, you might waste even more memory with pools: imagine that there used to be a lot of objects of size1, causing the growth of the corresponding pool. All but a few objects have been freed, but the survivors are keeping the memory chunks in the pool alive since there’s at least one live object in each chunk.Aplanatic
@NickZavaritsky Even if there is one live object and the memory block is not releasing itself, the empty chunks would be used for new allocations. Which is what I want actually.Sauerbraten
@NickZavaritsky There is a case that a single pool keeps growing, and others are not releasing because of single chunks are alive inside blocks. It is possible. Then the only solution would be implementing a defragmentation.Sauerbraten
@NickZavaritsky And there is always the consideration of jemalloc and tcmalloc.Sauerbraten

Here is a solution which might help if your application has the following traits:

  • Runs on Windows
  • Has episodes where the working set is large, but all of that data is released at around the same point in time (if the data is processed in some kind of batch mode, the work is done when it's done, and then the related data is freed).

This approach uses a (unique?) Windows feature, called custom heaps. Possibly, you can create yourself something similar on other OS.

The functions you need are in a header called <heapapi.h>.

Here is how it would look:

  1. At the start of a memory-intensive phase of your program, use HeapCreate() to create a heap where all your data will go.
  2. Perform the memory intensive tasks.
  3. At the end of the memory-intensive phase, free your data and call HeapDestroy().

Depending on the detailed behavior of your application (e.g. whether this memory-intensive computation runs in one or multiple threads), you can configure the heap accordingly and possibly even gain a little speed (if only one thread uses the data, you can give HeapCreate() the HEAP_NO_SERIALIZE flag, so it will not take a lock). You can also give an upper bound on how much the heap will allow to be stored.

And once your computation is complete, destroying the heap also prevents long-term fragmentation, because when the time comes for the next computation phase, you start with a fresh heap.

Here is the documentation of the heap api.

In fact, I found this feature so useful that I replicated it on FreeBSD and Linux for an application I ported, using virtual memory functions and my own heap implementation. That was a few years back, and I do not have the rights to that piece of code, so I cannot show or share it.

You could also combine this approach with fixed-element-size pools, using one heap per dedicated block size, and then expect less/no fragmentation within each of those heaps (because all blocks have the same size).

struct EcoString {
  size_t heapIndex;
  size_t capacity;
  char* data;
  char* get() { return data; }
};

struct FixedSizeHeap {
  const size_t SIZE;
  HANDLE m_heap;
  explicit FixedSizeHeap(size_t size)
    : SIZE(size)
    , m_heap(HeapCreate(0, 0, 0))
  {
  }
  ~FixedSizeHeap() {
     HeapDestroy(m_heap);
  }
  bool allocString(size_t capacity, EcoString& str) {
    assert(capacity <= SIZE);
    str.capacity = SIZE; // we alloc SIZE bytes anyway...
    str.data = (char*)HeapAlloc(m_heap, 0, SIZE);
    if (nullptr != str.data)
      return true;
    str.capacity = 0;
    return false;
  }
  void freeString(EcoString& str) {
     HeapFree(m_heap, 0, str.data);
     str.data = nullptr;
     str.capacity = 0;
  }
};

struct BucketHeap {
  using Buckets = std::vector<FixedSizeHeap>; // or std::array
  Buckets buckets;
/*
(loop for i from 0
           for size = 40 then (+ size (ash size -1))
           while (< size 80000)
           collecting (cons i size))
((0 . 40) (1 . 60) (2 . 90) (3 . 135) (4 . 202) (5 . 303) (6 . 454) (7 . 681)
 (8 . 1021) (9 . 1531) (10 . 2296) (11 . 3444) (12 . 5166) (13 . 7749)
 (14 . 11623) (15 . 17434) (16 . 26151) (17 . 39226) (18 . 58839))
  Init Buckets with index (first item) with SIZE (rest item) 
in the above item list.
*/
 // allocate(nbytes) looks for the correct bucket (linear or binary 
 // search) and calls allocString() on that bucket. And stores the 
 // buckets index in the EcoString.heapIndex field (so free has an 
 // easier time).
 // free(EcoString& str) uses heapIndex to find the right bucket and
 // then calls free on that bucket.
};
Mycorrhiza answered 9/1, 2022 at 12:46 Comment(2)
Thanks for your answer. I am indeed working in an Ubuntu environment, and I never destroy my heap. It's just that the elements inside get recycled/replaced/deleted/added randomly over random intervals of time.Sauerbraten
I recently wrote some code which actually might be useful for you. But... it is not yet feature-complete, and it inherently has some "inner fragmentation", i.e. per allocated item you have some memory overhead. It basically renders a piece of memory movable, such that the page it resides in can be defragmented, independent of allocation patterns. But it is too much code for this site... It's a kind of garbage collector without garbage...Mycorrhiza

What you are trying to do is called a slab allocator, and it is a very well-studied problem with lots of research papers.

You don't need every possible size. Usually slabs come in power-of-2 sizes.

Don't reinvent the wheel; there are plenty of open-source implementations of a slab allocator. The Linux kernel uses one.

Start by reading the Wikipedia page on Slab Allocation.

Neaten answered 10/1, 2022 at 13:27 Comment(3)
It wasn't what I was trying to do in the beginning. I was trying to learn every possible approach worth mentioning, and depending on the suggestions and answers, it is going in that direction.Sauerbraten
Memory allocation is a problem that has been very very extensively researched since the very early days of computer science - don't expect to make a sudden breakthrough in it. It is considered a closed problem.Neaten
Thanks for your input. I was never considering implementing any memory management system in the first place. I am just trying to learn about the existing ones and to fine-tune one for the situation I explained (a strategy). When I did research on my own, I got confused, so I included my knowledge and my problem in the question :)Sauerbraten

I don't know the details of your program. There are many patterns for working with memory, each of which leads to different results and problems. However, you can take a look at the .NET garbage collector. It uses several generations of objects (similar to pools of memory). Each object in a generation is movable (its address in memory changes when the generation is compacted). This trick allows the runtime to "compress/compact" memory and reduce fragmentation. You could implement a memory manager with similar semantics; it's not necessary to implement a full garbage collector at all.

Willowwillowy answered 10/1, 2022 at 13:54 Comment(2)
Thanks for your answer. Given that I am working with raw C pointers, I was thinking of a mechanism for storing double pointers inside my containers to allow defragmentation (moving objects during their lifetime). But for that, I need a pool library that gives me access to its contents, unless I want to implement it myself. And there are many things to be considered.Sauerbraten
@MaxPaython It is not a new paradigm. I guess there are implementations of it; at least I heard about it at a conference in 2017.Willowwillowy

© 2022 - 2024 — McMap. All rights reserved.