Efficiently allocating many short-lived small objects
I've got a small class (16 bytes on a 32-bit system) which I need to dynamically allocate. In most cases the lifetime of any given instance is very short. Some instances may also be passed across thread boundaries.

Having done some profiling, I found that my program appears to be spending more time allocating and deallocating these objects than it actually spends using them, so I want to replace the default new and delete with something a little more efficient.

For a large object (db connections, as it happens, which are expensive to construct rather than to allocate), I'm already using a pooling system. However, that involves a list for storing the "free" objects and also a mutex for thread safety. Between the mutex and the list, it actually performs worse than plain new/delete does for the small objects.
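
Roughly, that pool looks like the following (a simplified sketch, C++11; the real code uses boost::mutex, shown here as std::mutex, and LockingPool is a made-up name). Every acquire and release pays for a lock round-trip, which for a 16-byte object dominates the cost:

```cpp
#include <mutex>
#include <vector>

template <typename T>
class LockingPool {
public:
    ~LockingPool() {
        for (T* p : free_) delete p;
    }

    T* acquire() {
        std::lock_guard<std::mutex> lock(m_);
        if (free_.empty()) return new T();   // pool empty: fall back to the heap
        T* obj = free_.back();
        free_.pop_back();
        return obj;
    }

    void release(T* obj) {
        std::lock_guard<std::mutex> lock(m_);
        free_.push_back(obj);                // keep the object for reuse
    }

private:
    std::mutex m_;
    std::vector<T*> free_;
};
```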

I found a number of small-object allocators via Google, but they all seem to use a global/static pool that isn't thread-safe, making them unsuitable for my use :(

What other options have I got for efficient memory management of such small objects?

Trammell answered 28/1, 2010 at 18:5 Comment(5)
If your object is very small, why not just pass by value? Also, are your many small objects all the same, or are they all different? If the former, look at boost.flyweightPyne
There is little evidence in your question to suggest that you can actually do better. What makes you think you can?Finial
When you write mutex, do you mean a mutex in general or a Win32 Mutex in particular? If the latter, a CriticalSection will perform better as long as you aren't crossing process boundaries.Brade
The one I'm using is boost::mutex, which IIRC is implemented via the Interlocked APIs on Windows. I don't think there's any major difference in performance compared to a CriticalSection.Trammell
Two instances are very unlikely to be the same. They're dynamically allocated because it's common for several objects to reference them, and they all need to see changes. Given that the single-threaded pool solutions have led to major speed increases, I'm hoping there's a thread-safe solution that's almost as fast. Even a heap allocator optimised for fixed-size objects could well be faster than a general-purpose allocator, I'd expect, and could avoid the fragmentation issue entirely; however, I have no idea how I'd write any heap allocator to start with :(Trammell
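
For the record, the fixed-size heap allocator mentioned in the last comment doesn't have to be large. A minimal single-threaded sketch (FixedPool and its member names are invented; alignment works out because ::operator new returns maximally aligned chunks and the cells are a fixed 16 bytes):

```cpp
#include <cstddef>
#include <new>
#include <vector>

class FixedPool {
public:
    explicit FixedPool(std::size_t cellSize, std::size_t cellsPerChunk = 256)
        : cellSize_(cellSize < sizeof(Node) ? sizeof(Node) : cellSize),
          cellsPerChunk_(cellsPerChunk),
          free_(0) {}

    ~FixedPool() {
        for (std::size_t i = 0; i < chunks_.size(); ++i)
            ::operator delete(chunks_[i]);
    }

    void* allocate() {
        if (!free_) grow();                  // refill from a fresh chunk
        Node* n = free_;
        free_ = free_->next;
        return n;
    }

    void deallocate(void* p) {
        Node* n = static_cast<Node*>(p);     // reuse the cell itself as a link
        n->next = free_;
        free_ = n;
    }

private:
    struct Node { Node* next; };

    void grow() {
        // One big allocation carved into fixed-size cells; this is also
        // what avoids per-object fragmentation.
        char* chunk = static_cast<char*>(::operator new(cellSize_ * cellsPerChunk_));
        chunks_.push_back(chunk);
        for (std::size_t i = 0; i < cellsPerChunk_; ++i)
            deallocate(chunk + i * cellSize_);
    }

    std::size_t cellSize_;
    std::size_t cellsPerChunk_;
    Node* free_;                             // intrusive free list: no per-object overhead
    std::vector<char*> chunks_;
};
```

A class can route its operator new/operator delete through such a pool; the answers below are largely about making this cheap across threads.
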
Maybe try using Google's tcmalloc? It is optimized for fast allocation/deallocation in a threaded program, and has low overhead for small objects.

Minority answered 28/1, 2010 at 19:0 Comment(0)
Some instances may also be passed across thread boundaries

Only "some"? So perhaps you can afford to pay extra for these, if it makes the ones that don't get passed to other threads cheaper. There are various ways I can think of to get to one allocator per thread and avoid the need to lock when allocating or freeing in the thread to which the allocator belongs. I don't know which might be possible in your program:

  • Copy things across the thread boundary, instead of passing them.

  • Arrange that if they're passed to another thread for any reason, they're passed back to the original thread to be freed. This doesn't necessarily have to happen very often: you could queue up a few in the receiving thread and pass them all back in a single message later. This assumes, of course, that the thread which owns the allocator is going to stick around.

  • Have two free lists per allocator: one synchronised (to which objects are added when they're freed from another thread) and one unsynchronised. Only when the unsynchronised list is empty and you're allocating (and hence running in the thread which owns the allocator) do you need to lock the synchronised free list and move all of its current contents to the unsynchronised list. If objects are only rarely passed to other threads, this essentially eliminates contention on the mutex and massively reduces the number of times it's taken at all; see the sketch just after this list.

  • If all the above fails, having one allocator per thread might still allow you to get rid of the mutex and use a lock-free queue for the free list (multiple writers freeing, single reader allocating), which could improve performance a bit. Implementing a lock-free queue is platform-specific.
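
A minimal sketch of the two-free-list idea from the third bullet (C++11; TwoListPool, freeLocal, and freeRemote are invented names). It deliberately leaves out how a freeing thread finds the pool that owns a cell; a header word per object is one common answer:

```cpp
#include <cstddef>
#include <mutex>
#include <new>

struct FreeNode { FreeNode* next; };

class TwoListPool {
public:
    // Owner thread only: no lock on the fast path.
    void* allocate() {
        if (!local_) drainInbox();           // take the lock only when we run dry
        if (!local_) return ::operator new(kCellSize);
        FreeNode* n = local_;
        local_ = local_->next;
        return n;
    }

    // Owner thread only: no lock.
    void freeLocal(void* p) {
        FreeNode* n = static_cast<FreeNode*>(p);
        n->next = local_;
        local_ = n;
    }

    // Any other thread: short critical section on the inbox.
    void freeRemote(void* p) {
        FreeNode* n = static_cast<FreeNode*>(p);
        std::lock_guard<std::mutex> lock(inboxMutex_);
        n->next = inbox_;
        inbox_ = n;
    }

private:
    void drainInbox() {
        // Called only when local_ is empty: grab the whole inbox in one move.
        std::lock_guard<std::mutex> lock(inboxMutex_);
        local_ = inbox_;
        inbox_ = nullptr;
    }

    static const std::size_t kCellSize = 16;
    FreeNode* local_ = nullptr;              // touched only by the owner thread
    std::mutex inboxMutex_;
    FreeNode* inbox_ = nullptr;              // guarded by inboxMutex_
};
```

Holding one such pool per thread (for instance in thread-local storage) gives a lock-free fast path for same-thread allocate/free, while cross-thread frees take the mutex only briefly and the owner takes it only when its private list is exhausted.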

Taking a step further back: does your app frequently hit a state in which you know that all cells allocated after a certain point (perhaps a little in the past) are no longer in use? If so, and assuming the destructor of your small objects doesn't do anything terribly urgent, then don't bother freeing cells at all: at the "certain point", create a new allocator and mark the old one as no longer in use for new allocations. When you hit the state, free the whole allocator and its underlying buffer. If the "certain point" and the "state" are simultaneous, all the easier: just reset your allocator.
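
A minimal sketch of the reset idea (Arena is an invented name): a bump allocator whose reset() reclaims every cell at once:

```cpp
#include <cstddef>
#include <new>

class Arena {
public:
    explicit Arena(std::size_t bytes)
        : buf_(static_cast<char*>(::operator new(bytes))),
          end_(buf_ + bytes),
          cur_(buf_) {}

    ~Arena() { ::operator delete(buf_); }

    // Bump allocation: a pointer increment in the common case. Assumes the
    // request sizes keep cur_ suitably aligned (true for 16-byte cells).
    void* allocate(std::size_t n) {
        if (cur_ + n > end_) return nullptr; // full (real code would chain buffers)
        void* p = cur_;
        cur_ += n;
        return p;
    }

    // O(1) "free everything": valid only if no destructors need to run
    // and nothing allocated from this arena is still live.
    void reset() { cur_ = buf_; }

private:
    char* buf_;
    char* end_;
    char* cur_;
};
```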

Clericals answered 28/1, 2010 at 22:49 Comment(3)
"after a certain point (perhaps a little in the past), are no longer in use?" unfortunately no, they are used throughout the lifetime of the program. Following on the one-allocator per thread. What if I made a pool allocator that had it's free-list in some kind of TLS. Then id just need a occasional synchronised process to combat the possibility of objects migrating from one-thread to another causing one to free more than it allocates?Trammell
Yep, that sounds plausible. The slower the drift from one thread to another, the less often you have to synchronise and the lower the overhead.Clericals
You could also take a look at this: goog-perftools.sourceforge.net/doc/tcmalloc.htmlClericals
You might make sure that you are using the Low Fragmentation Heap. It is on by default in Vista and later, but I do not think that is so for earlier OSes. It can make a big difference in allocation speed for small objects.
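
On XP it can be enabled per heap with HeapSetInformation; a minimal sketch for the process heap (note that the MSVC CRT may use its own heap, whose handle _get_heap_handle() returns):

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    ULONG lfh = 2;  // 2 == LFH; the documented value for HeapCompatibilityInformation
    if (HeapSetInformation(GetProcessHeap(),
                           HeapCompatibilityInformation,
                           &lfh, sizeof(lfh))) {
        std::printf("Low Fragmentation Heap enabled\n");
    } else {
        std::printf("HeapSetInformation failed: %lu\n", GetLastError());
    }
    return 0;
}
```

The call can fail (for instance, the OS won't enable the LFH while certain heap-debugging options are active), so it's worth checking the return value.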

Grating answered 28/1, 2010 at 18:22 Comment(2)
Well I'm testing on Vista so in that case it's already on. Worth knowing though since I am supporting XP.Trammell
It's an interesting issue (more so since I don't have to solve it :). I am curious to see the final solution.Grating
