Can multithreading speed up memory allocation?
I'm working with an 8 core processor, and am using Boost threads to run a large program. Logically, the program can be split into groups, where each group is run by a thread. Inside each group, some classes invoke the 'new' operator a total of 10000 times. Rational Quantify shows that the 'new' memory allocation is taking up the maximum processing time when the program runs, and is slowing down the entire program.

One way I can speed up the system could be to use threads inside each 'group', so that the 10000 memory allocations can happen in parallel.

I'm unclear of how the memory allocation will be managed here. Will the OS scheduler really be able to allocate memory in parallel?

Immigration answered 1/2, 2011 at 5:30 Comment(2)
Thank you for profiling your application.Westley
@Everyone: Okay, so "Heap Contention" is the right phrase to look for in this regard. Apparently glibc v2 onward handles malloc's in parallel citi.umich.edu/projects/linux-scalability/reports/malloc.html but the contention with free() will (probably) be handled only from version 2.2.4 onward bozemanpass.com/info/linux/malloc/Linux_Heap_Contention.html. I wonder if that means that libraries like Hoard would become redundant.Immigration
Dynamic memory allocation uses the heap of the application/module/process, not of the thread. The heap can only handle one allocation request at a time. If you try to allocate memory from "parallel" threads, the requests will be handled by the heap in due order. You will not get behaviour like: one thread waiting for its memory while another asks for some and a third is receiving some. The threads will have to line up in a queue to get their chunks of memory.

What you would need is a pool of heaps. Use whichever heap is not busy at the moment to allocate the memory. But then you have to watch out, throughout the life of such a variable, that it does not get deallocated on another heap (that would cause a crash).

I know that the Win32 API has functions such as GetProcessHeap(), HeapCreate(), HeapAlloc() and HeapFree(), which allow you to create a new heap and allocate/deallocate memory from a specific heap HANDLE. I don't know of an equivalent in other operating systems (I have looked for them, but to no avail).

You should, of course, try to avoid doing frequent dynamic allocations. But if you can't, you might consider (for portability) creating your own "heap" class (it doesn't have to be a heap per se, just a very efficient allocator) that manages a large chunk of memory, plus a smart-pointer class that holds a reference to the heap it came from. This would enable you to use multiple heaps (make sure they are thread-safe).
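A minimal sketch of such a home-brewed "heap" class, under the assumption that per-thread instances are acceptable (all names here are illustrative, not from any library): it hands out memory from one pre-reserved block, so each thread can own a private arena and avoid contention entirely.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Illustrative "heap" class as described above: it carves allocations
// out of one large pre-reserved block, so each thread can own its own
// Arena and never contend with the others for a shared lock.
class Arena {
public:
    explicit Arena(std::size_t bytes) : buffer_(bytes), offset_(0) {}

    void* allocate(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
        std::size_t p = (offset_ + align - 1) & ~(align - 1); // round up to alignment
        if (p + n > buffer_.size()) throw std::bad_alloc();
        offset_ = p + n;
        return buffer_.data() + p;
    }

    // Arena-style deallocation: everything is released at once.
    void reset() { offset_ = 0; }

private:
    std::vector<std::uint8_t> buffer_;
    std::size_t offset_;
};
```

A real version would also need the smart pointer mentioned above to route each deallocation back to the arena it came from; this sketch sidesteps that by freeing everything in one `reset()`.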

Greenlaw answered 1/2, 2011 at 6:6 Comment(3)
Question: By heap pool, did you mean this: en.wikipedia.org/wiki/Memory_pool ? (I was wondering if it was memory pool you were talking about, then I could use TBB scalable allocators. But custom allocators have come under fire by people like Scott Meyers en.wikipedia.org/wiki/Allocator_%28C%2B%2B%29#Custom_allocators)Immigration
By heap pool, I just meant having a list of heaps that you use (either OS-native heaps, or home-brewed, or from a library like boost), and you allocate from whichever is not busy at one particular time (i.e. a priority queue based on busy-ness, available memory, and fragmentation). And sure, custom allocators are not recommended unless you do it carefully and very well. All in all, I would suggest you go with some of the off-the-shelf stuff suggested by others here (HOARD or TBB seem pretty reliable at first glance).Greenlaw
Mikael, your statement isn't correct. Modern heap implementations use techniques like thread caches to speed up parallel allocations. That means you can do significantly more allocations with multiple concurrent threads than with just one thread.Tricycle
Standard CRT

While with older versions of Visual Studio the default CRT allocator was blocking, this is no longer true, at least for Visual Studio 2010 and newer, which call the corresponding OS functions directly. The Windows heap manager was blocking up to Windows XP; in XP the optional Low Fragmentation Heap is non-blocking while the default heap is, and newer OSes (Vista/Win7) use the LFH by default. The performance of recent (Windows 7) allocators is very good, comparable to the scalable replacements listed below (you still might prefer them if targeting older platforms or when you need some other features they provide). There exist several "scalable allocators", with different licenses and different drawbacks. I think on Linux the default runtime library already uses a scalable allocator (some variant of PTMalloc).

Scalable replacements

I know about:

You might want to check Scalable memory allocator experiences for my experiences with trying to use some of them in a Windows project.

In practice most of them work by having a per-thread cache and per-thread pre-allocated regions for allocations, which means that small allocations most often happen within the context of a single thread, and OS services are called only infrequently.
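As a toy illustration of that per-thread-cache idea (not the actual code of any of these allocators, which are far more sophisticated), a `thread_local` free list lets most allocate/free pairs bypass the shared allocator entirely:

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Toy per-thread cache: each thread keeps a private free list of
// fixed-size blocks, so most alloc/free pairs never touch a shared
// lock; only the slow path falls back to the shared allocator.
constexpr std::size_t kBlockSize = 64;

thread_local std::vector<void*> tl_free_list;

void* cached_alloc() {
    if (!tl_free_list.empty()) {          // fast path: no locks at all
        void* p = tl_free_list.back();
        tl_free_list.pop_back();
        return p;
    }
    return std::malloc(kBlockSize);       // slow path: shared allocator
}

void cached_free(void* p) {
    tl_free_list.push_back(p);            // return block to this thread's cache
}
```

Real implementations also handle multiple size classes and blocks freed by a different thread than allocated them, which this sketch ignores.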

Furriery answered 1/2, 2011 at 6:21 Comment(3)
Hey thanks! Just to add to the list, Intel Threading Building Blocks also has scalable_malloc, scalable_free, scalable_realloc, scalable_calloc, scalable_allocator and cache_aligned_allocator.Immigration
Suma, this isn't correct either. All modern MSVC versions use the OS heap functions by default (unless told not to do so). And the OS heap functions will perform rather well if the low-fragmentation heap is enabled, which it is by default since Windows Vista (on Windows XP it can be enabled by the application with a simple call to HeapSetInformation()). And with the LFH enabled, the performance of the Windows heap is comparable to the fastest available other allocators - I personally did a benchmark against NedMalloc, and the differenct was negligible.Tricycle
@PaulGroke You are correct, I have tried to update the answer.Furriery
There are 2 scalable drop-in replacements for malloc that I know of:

- Hoard
- jemalloc

I don't have any experience with Hoard (which performed poorly in the study), but Emery Berger lurks on this site and was astonished by the results. He said he would have a look and I surmise there might have been some specifics to either the test or implementation that "trapped" Hoard as the general feedback is usually good.

One word of caution with jemalloc, it can waste a bit of space when you rapidly create then discard threads (as it creates a new pool for each thread you allocate from). If your threads are stable, there should not be any issue with this.

Unbounded answered 1/2, 2011 at 7:43 Comment(0)
I believe the short answer to your question is : yes, probably. And as already pointed out by several people here there are ways to achieve this.

Aside from your question and the answers already posted here, it would be good to start with your expectations for improvement, because that will largely determine which path to take. Maybe you need to be 100x faster. Do you see yourself making further speed improvements in the near future, or is there a level that will be good enough? Not knowing your application or problem domain, it's difficult to advise you specifically. Are you, for instance, in a problem domain where speed continuously has to be improved?

A good place to start when making performance improvements is to ask whether you need to do things the way you currently do them. In this case: can you pre-allocate objects? Is there a maximum number of X objects in the system? Could you re-use objects? Any of these is better, because you don't necessarily need to do allocations on the critical path. E.g. if you can re-use objects, a custom allocator with pre-allocated objects would work well. Also, what OS are you on?
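For instance, if a maximum object count is known, the re-use idea can look like this hypothetical object pool (names are illustrative): everything is constructed up front, and acquire/release just hand pointers back and forth, so the hot path performs no heap allocation at all.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical object pool illustrating "pre-allocate and re-use":
// all objects are constructed up front; acquire/release only move
// pointers, so no heap allocation happens on the hot path.
template <typename T>
class ObjectPool {
public:
    explicit ObjectPool(std::size_t n) : objects_(n) {
        free_.reserve(n);
        for (std::size_t i = 0; i < n; ++i) free_.push_back(&objects_[i]);
    }

    T* acquire() {                 // returns nullptr when the pool is exhausted
        if (free_.empty()) return nullptr;
        T* p = free_.back();
        free_.pop_back();
        return p;
    }

    void release(T* p) { free_.push_back(p); }

private:
    std::vector<T> objects_;       // storage allocated once, up front
    std::vector<T*> free_;
};
```

A thread-safe version would guard `free_` with a mutex or keep one pool per thread; this sketch shows only the single-threaded core.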

If you don't have concrete expectations or a required level of performance, just start experimenting with any of the advice here and you'll find out more.

Good luck!

Pinguid answered 1/2, 2011 at 8:38 Comment(1)
Pre-allocation was something I considered, but the program requires dynamic instantiation of classes (using virtual), so I can't pre-instantiate these classes. Can't re-use objects either. I guess the use of a scalable memory allocator is the only option now. Thanks :)Immigration
Roll your own non-multi-threaded memory allocator, and give each thread its own distinct copy.

(You can override new and delete.)

Each copy allocates in large chunks that it works through, and needs no locking, since each chunk is owned by a single thread.
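A rough sketch of what that per-thread, lock-free scheme could look like for one class (hypothetical code, not a production allocator; deallocation is deliberately a no-op, so it only suits objects whose memory can be left to the chunk until exit):

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Sketch of the idea above: the class overrides operator new/delete
// to carve objects out of a thread-local chunk, so no lock is taken
// as long as each thread allocates from its own chunk. Chunks are
// never recycled here (delete is a no-op), which is only acceptable
// for short-lived programs or fixed-lifetime objects.
class Particle {
public:
    static void* operator new(std::size_t n) {
        thread_local char* chunk = nullptr;
        thread_local std::size_t left = 0;
        if (left < n) {                       // grab a fresh 64 KiB chunk
            chunk = static_cast<char*>(std::malloc(64 * 1024));
            if (!chunk) throw std::bad_alloc();
            left = 64 * 1024;
        }
        void* p = chunk;                      // bump-pointer allocation
        chunk += n;
        left -= n;
        return p;
    }
    static void operator delete(void*) {}     // leak-until-exit by design

    double x = 0, y = 0, z = 0;
};
```

Successive `new Particle` calls on one thread are then just pointer bumps, with `malloc` (and its lock) hit only once per chunk.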

Limit your threads to the number of cores you have available.

Halter answered 1/2, 2011 at 5:34 Comment(1)
OK maybe that's the typical problem, but it doesn't answer the question.Faction
new is pretty much blocking: it has to find the next free bit of memory, which is tricky to do if you have lots of threads all asking for it at once.

Memory allocation is slow. If you are doing it more than a few times, especially on lots of threads, then you need a redesign. Can you pre-allocate enough space at the start? Can you just allocate a big chunk with 'new' and then partition it out yourself?

Strawboard answered 1/2, 2011 at 5:42 Comment(5)
Nope. Am using virtual functions and copying a lot of objects which have boost matrices inside them. So the memory allocation has to be done dynamically. I guess 'redesign' is the only option then.Immigration
"Memory allocation is slow" this highly depends on platform. Using standard Visual Studio CRT I was used to this, but recently I have started using scalable allocators, and to my surprise their performance is excellent - most of them reduce cost for memory allocation significantly even for single threaded use, and have excellent scalability with multiple cores. See my answer below.Furriery
@Suma: slow compared to stack or pre-allocation.Pinguid
@Furriery - and slow compared to not doing it ;-)Strawboard
I just wanted to point out that some of the modern scalable allocators often come close to "allocate a big chunk with 'new' and then partition it out yourself" unless they hit some pattern pathological for them, and using them gives you almost the same performance with the elegance of native and natural language support.Furriery
You need to check your compiler documentation to see whether it makes the allocator thread-safe or not. If it does not, then you will need to overload your new operator to make it thread-safe. Otherwise it will result in either a segfault or UB.

Freshet answered 1/2, 2011 at 5:45 Comment(2)
Well, this thread says that new is 'generally' thread safe on gcc: #796599Immigration
@Immigration : What I believe is "new" operator is re-entrant but its thread safety is implementation dependent . I would be happy to see any standard documentation on the same if you could post any.Freshet
On some platforms like Windows, access to the global heap is serialized by the OS. Having a thread-separate heap could substantially improve allocation times.

Of course, in this case, it might be worth questioning whether or not you genuinely need heap allocation as opposed to some other form of dynamic allocation.

Pilot answered 1/2, 2011 at 5:56 Comment(2)
What is 'thread-separate heap'? Heap allocation IS dynamic allocation, right? What other form of dynamic allocation is available? en.wikipedia.org/wiki/Dynamic_memory_allocationImmigration
@Nav: Some OSes can create several heaps. You can allocate one for each thread. And there are different forms of dynamic allocation - for example, object pools. If you have a known pattern of object allocation, then you can likely write a custom allocator that's much more efficient at it. The existing heap allocation subroutines are designed to have maximum flexibility in their performance.Pilot
You may want to take a look at The Hoard Memory Allocator, which "is a drop-in replacement for malloc() that can dramatically improve application performance, especially for multithreaded programs running on multiprocessors."

Describe answered 1/2, 2011 at 6:10 Comment(0)
  1. The best you can hope for is ~8 memory allocations happening in parallel (since you have 8 physical cores), not 10000 as you wrote

  2. Standard malloc uses a mutex, and the standard STL allocator does the same. Therefore it will not speed up automatically when you introduce threading. Nevertheless, you can use another malloc library (google for e.g. "ptmalloc") which does not use global locking. If you allocate using the STL (e.g. strings, vectors) you have to write your own allocator.
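Writing your own STL allocator mostly means providing `allocate`/`deallocate`. A minimal C++11-style sketch showing where a non-locking malloc replacement would be plugged in (the scalable function itself is assumed; here it is stubbed with std::malloc, and the names are illustrative):

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Minimal C++11 allocator: forwards to std::malloc/std::free here,
// but a real version would call whatever scalable malloc/free your
// chosen library provides (e.g. a ptmalloc/TBB-style replacement).
template <typename T>
struct ScalableAllocator {
    using value_type = T;

    ScalableAllocator() = default;
    template <typename U>
    ScalableAllocator(const ScalableAllocator<U>&) {}

    T* allocate(std::size_t n) {
        void* p = std::malloc(n * sizeof(T));   // <- swap in scalable malloc here
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { std::free(p); }  // <- and scalable free here
};

template <typename T, typename U>
bool operator==(const ScalableAllocator<T>&, const ScalableAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const ScalableAllocator<T>&, const ScalableAllocator<U>&) { return false; }

// Containers then take the allocator as a template argument:
using FastIntVector = std::vector<int, ScalableAllocator<int>>;
```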

Rather interesting article: http://developers.sun.com/solaris/articles/multiproc/multiproc.html

Earreach answered 3/2, 2011 at 9:13 Comment(4)
Now the mention of mutex was so very very very helpful! I wanted to know whether it happened serially. Eight allocations is a bit disappointing. Don't you think it could happen faster with the heap-pool that others have mentioned?Immigration
@Nav: Well: there is no magic - you have 8 cores, so this is a parallelism you can reach.Earreach
sorry, sent comment too early. I guess heap pool is what ptmalloc does internally. I do not think there is any reason to implement a heap pool by yourself. PS: added a link to an article to my answerEarreach
On the other hand, if you reduce the number of real heap allocations by allocating in blocks, it can help. This can help anyway, since malloc is a rather expensive operation.Earreach

© 2022 - 2024 — McMap. All rights reserved.