C++ memory allocation mechanism performance comparison (tcmalloc vs. jemalloc)

I have an application which allocates lots of memory and I am considering using a better memory allocation mechanism than malloc.

My main options are jemalloc and tcmalloc. Are there any benefits to using one of them over the other?

There is a good comparison of some mechanisms (including the author's proprietary mechanism, lockless) at http://locklessinc.com/benchmarks.shtml, and it mentions some pros and cons of each.

Given that both mechanisms are active and constantly improving, does anyone have any insight or experience with the relative performance of these two?

Weft answered 21/10, 2011 at 16:57 Comment(8)
why are you using malloc in C++?Dimpledimwit
@JohnDibling PerformanceWeft
I guess the next natural question is, why are you using C++?Dimpledimwit
There is a discussion about malloc vs. new here: #185037. I am using malloc just for allocating blobs of data. There is no benefit in using new. (See the comments of the best answer)Weft
@JohnDibling: I would note that common implementations of new rely on malloc to get memory anyway...Centric
@Matthieu: I understand that.Dimpledimwit
You can also get improved performance by simply not allocating as much. Object pools are helpful here. Can get a bit trickier to program, but if the allocation scheme is causing a performance problem then you're at the point where this should be considered.Towland
About tcmalloc and ptmalloc you have some graphs here: goog-perftools.sourceforge.net/doc/tcmalloc.html, but no example program.Wendeline

If I remember correctly, the main difference was with multi-threaded projects.

Both libraries try to reduce contention when acquiring memory by having threads pick memory from different caches, but they have different strategies:

  • jemalloc (used by Facebook) maintains a cache per thread
  • tcmalloc (from Google) maintains a pool of caches, and threads develop a "natural" affinity for a cache, but may switch

This led, once again if I remember correctly, to an important difference in terms of thread management.

  • jemalloc is faster if threads are static, for example using pools
  • tcmalloc is faster when threads are created/destroyed

There is also the problem that since jemalloc spins up new caches to accommodate new thread IDs, a sudden spike of threads will leave you with (mostly) empty caches in the subsequent calm phase.

As a result, I would recommend tcmalloc in the general case, and reserve jemalloc for very specific usages (low variation in the number of threads over the lifetime of the application).
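
If you want to check which camp your workload falls into, a micro-benchmark along these lines is a cheap way to compare allocators. This is only a minimal sketch: the thread count, iteration count, and object size are arbitrary assumptions, and the library names and link flags vary by platform.

    // Minimal multithreaded alloc/free micro-benchmark (sketch).
    // Build once per allocator under test, e.g.:
    //   g++ -O2 bench.cpp -pthread                 # default malloc
    //   g++ -O2 bench.cpp -pthread -ltcmalloc      # tcmalloc
    //   g++ -O2 bench.cpp -pthread -ljemalloc      # jemalloc
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <thread>
    #include <vector>

    constexpr int kThreads = 8;          // arbitrary; tune for your machine
    constexpr int kIterations = 1000000;
    constexpr std::size_t kSize = 64;    // small objects stress the thread caches

    int main() {
        auto start = std::chrono::steady_clock::now();

        std::vector<std::thread> workers;
        for (int t = 0; t < kThreads; ++t) {
            workers.emplace_back([] {
                for (int i = 0; i < kIterations; ++i) {
                    void* p = std::malloc(kSize);
                    *static_cast<volatile char*>(p) = 1;  // touch it so the work isn't optimized away
                    std::free(p);
                }
            });
        }
        for (auto& w : workers) w.join();

        double secs = std::chrono::duration<double>(
                          std::chrono::steady_clock::now() - start).count();
        std::printf("%d threads x %d alloc/free pairs: %.3f s\n",
                    kThreads, kIterations, secs);
    }

Running the same binary against the default malloc, tcmalloc, and jemalloc (linked in, or swapped via LD_PRELOAD on Linux), once with long-lived threads and once with threads repeatedly created and joined, should surface the affinity differences described above.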

Centric answered 21/10, 2011 at 17:30 Comment(0)

I have recently considered tcmalloc for a project at work. This is what I observed:

  • Greatly improved performance for heavy use of malloc in a multithreaded setting. I used it with a tool at work and the performance improved almost twofold. The reason is that this tool had a few threads performing allocations of small objects in a critical loop. With glibc, performance suffers because of, I think, lock contention between malloc/free calls in different threads.

  • Unfortunately, tcmalloc increases the memory footprint. The tool I mentioned above would consume two to three times more memory (as measured by the maximum resident set size). The increased footprint is a no-go for us, since we are actually looking for ways to reduce the memory footprint.

In the end I decided not to use tcmalloc and instead optimized the application code directly: that meant removing the allocations from the inner loops to avoid the malloc/free lock contention. (For the curious, we used a form of data compression rather than memory pools.)

The lesson for you would be that you should carefully measure your application with typical workloads. If you can afford the additional memory usage, tcmalloc could be great for you. If not, tcmalloc is still useful to see what you would gain by avoiding the frequent calls to memory allocation across threads.
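
For illustration, here is a hypothetical sketch of the kind of change described above: hoisting the allocation out of the hot loop so each thread reuses one buffer instead of hitting malloc/free on every iteration. The function and data names are invented, not from the actual tool.

    #include <string>
    #include <vector>

    // Before: a fresh allocation on every iteration; under glibc malloc,
    // several threads doing this contend on the allocator's locks.
    void process_all_slow(const std::vector<std::string>& items) {
        for (const std::string& item : items) {
            std::vector<char> scratch(item.size());  // malloc inside the hot loop
            // ... fill and use scratch ...
        }
    }

    // After: the buffer is hoisted out of the loop and reused; its capacity
    // only grows, so the steady state does no allocation at all.
    void process_all_fast(const std::vector<std::string>& items) {
        std::vector<char> scratch;
        for (const std::string& item : items) {
            scratch.resize(item.size());             // reuses existing capacity
            // ... fill and use scratch ...
        }
    }

    int main() {
        std::vector<std::string> items(1000, std::string(128, 'x'));
        process_all_slow(items);
        process_all_fast(items);
    }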

Municipal answered 4/6, 2012 at 11:21 Comment(1)
Pure guess, but this could be due to the library over-allocating to make allocations fast. If there is a "shrink" function that could be used after all these tiny allocations happen, it should help reduce memory.Rozalin

Be aware that according to the 'nedmalloc' homepage, modern OS allocators are actually pretty fast now:

"Windows 7, Linux 3.x, FreeBSD 8, Mac OS X 10.6 all contain state-of-the-art allocators and no third party allocator is likely to significantly improve on them in real world results"

http://www.nedprod.com/programs/portable/nedmalloc

So you might be able to get away with just recommending that your users upgrade, or something like that :)

Despicable answered 13/8, 2013 at 22:56 Comment(8)
That's also my observation, and the observation of the CRT devs, because the standard malloc today is just a wrapper over VirtualAlloc on Win32 and mmap on Linux.Triplicity
@v.oddou: all malloc implementations end up calling mmap at some point -- as that's the only way to get memory from the system (almost). The problem with mmap is that it is slow no matter what (it's a system call that involves a context switch) and it can allocate memory only with a page granularity (4K or more). The goal of malloc implementations is to get big chunks of memory from the system using mmap, and then allocate smaller objects in those memory regions from within the user process.Vonvona
@rogerdpack: note that jemalloc is the FreeBSD's allocator.Vonvona
@YakovGalka Without explicit research on my end to be sure, I'll emit a disclaimer of uncertainty. But to my current knowledge your information is wrong or outdated. Some old mallocs used sbrk to get blocks from the system, and mmap was a fallback for large independent blocks. Today it is called directly, not "at some point"; malloc implementations are empty nowadays, they literally wrap mmap with no decorum. Also, system calls have fast traps.Triplicity
@Triplicity I sincerely recommend you to look at the sources of any of those "empty modern malloc implementations" and figure out why wrapping "mmap" involves thousands of lines of code.Vonvona
@YakovGalka Yeah well, it's how I said. github.com/Chuyu-Team/VC-LTL/blob/master/src/ucrt/heap/… Just a wrapper to HeapAlloc, and free is a wrapper to HeapFree. There aren't thousands of lines, just minimalist boilerplate for errors and comments.Triplicity
@v.oddou: ... except that it's not what you said. You said that "standard malloc today is just a wrapper over VirtualAlloc in win32 and mmap in linux". But as you just showed, that's not the case with MSVC CRT where it calls HeapAlloc instead. Now, you may want to learn the difference between MSVC CRT, WinAPI, and Windows NT kernel API. While VirtualAlloc is more or less a direct syscall and is WinAPI's mmap analogue, HeapAlloc is not a kernel API at all.Vonvona
It is a user-space heap allocator, just like glibc malloc, jemalloc, tcmalloc, and a dozen others are. And guess what? They all have thousands of lines of code sitting between your malloc call and the kernel API that acquires memory (VirtualAlloc/mmap).Vonvona

You could also consider using the Boehm conservative garbage collector. Basically, you replace every malloc in your source code with GC_malloc (etc.), and you don't bother calling free. Boehm's GC doesn't allocate memory more quickly than malloc (it is about the same, or can be 30% slower), but it has the advantage of dealing with useless memory zones automatically, which might improve your program (and it certainly eases coding, since you no longer have to care about free). Boehm's GC can also be used as a C++ allocator.
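
As a rough illustration, here is a minimal sketch of that drop-in usage with the libgc API, assuming the collector is installed (the header is typically gc.h and you link with -lgc, though locations vary by platform):

    // Sketch: swapping malloc for the Boehm collector's GC_MALLOC.
    // Build with e.g.:  g++ demo.cpp -lgc
    #include <gc.h>      // Boehm GC (bdwgc); header location may vary
    #include <cstdio>
    #include <cstring>

    int main() {
        GC_INIT();                                            // initialize the collector
        for (int i = 0; i < 100000; ++i) {
            char* buf = static_cast<char*>(GC_MALLOC(256));   // instead of malloc(256)
            std::strcpy(buf, "never freed explicitly");
            // no free(buf): unreachable blocks are reclaimed automatically
        }
        std::printf("GC heap size: %zu bytes\n", GC_get_heap_size());
    }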

If you really think that malloc is too slow (but you should benchmark; most mallocs take less than a microsecond), and if you fully understand the allocating behavior of your program, you might replace some mallocs with your own special-purpose allocator (which could, for instance, get memory from the kernel in big chunks using mmap and manage it yourself). But I believe doing that is a pain. In C++ you have the allocator concept and std::allocator_traits, with most standard container templates accepting such an allocator (see also std::allocator), e.g. via the optional second template argument to std::vector, etc.
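
For illustration, a minimal sketch of such an allocator: it satisfies the C++11 minimal allocator requirements (std::allocator_traits supplies the rest) and simply forwards to malloc/free, which is exactly where an arena or pool would slot in instead. The name MallocAllocator is invented for the example.

    #include <cstddef>
    #include <cstdlib>
    #include <new>
    #include <vector>

    // Minimal C++11 allocator: value_type, allocate, deallocate, a converting
    // constructor, and equality; std::allocator_traits supplies everything else.
    template <typename T>
    struct MallocAllocator {
        using value_type = T;

        MallocAllocator() = default;
        template <typename U>
        MallocAllocator(const MallocAllocator<U>&) {}  // needed for rebinding

        T* allocate(std::size_t n) {
            if (void* p = std::malloc(n * sizeof(T)))
                return static_cast<T*>(p);
            throw std::bad_alloc();
        }
        void deallocate(T* p, std::size_t) { std::free(p); }
    };

    // All instances are interchangeable, so they always compare equal.
    template <typename T, typename U>
    bool operator==(const MallocAllocator<T>&, const MallocAllocator<U>&) { return true; }
    template <typename T, typename U>
    bool operator!=(const MallocAllocator<T>&, const MallocAllocator<U>&) { return false; }

    int main() {
        // The optional second template argument of std::vector mentioned above.
        std::vector<int, MallocAllocator<int>> v{1, 2, 3};
        return v.size() == 3 ? 0 : 1;
    }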

As others suggested, if you believe malloc is a bottleneck, you could allocate data in chunks (or using arenas), or just in an array.

Sometimes, implementing a specialized copying garbage collector (for some of your data) could help. Consider perhaps MPS.

But don't forget that premature optimization is evil, so please benchmark & profile your application to understand exactly where time is lost.

Enrobe answered 26/10, 2011 at 21:29 Comment(2)
I would speculate that migrating to a garbage collector is not as simple as just changing malloc to GC_malloc. You also need to change the types of your pointers to some kind of opaque handle typeTriplicity
Not with Boehm's GC. Its GC_malloc is designed as a drop-in replacement for malloc, and also gives a void*. It is a conservative GC.Enrobe

There's a pretty good discussion about allocators here:

http://www.reddit.com/r/programming/comments/7o8d9/tcmalloca_faster_malloc_than_glibcs_open_sourced/

Eisenhower answered 21/10, 2011 at 17:6 Comment(2)
Thanks. I'll read it. But as I said, I think it is outdated by now.Weft
yet another thread reddit.com/r/programming/comments/7o8d9/…Pileus

Your post does not mention threading, but before considering mixing C and C++ allocation methods, I would investigate the concept of a memory pool. Boost has a good one.
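
For illustration, a minimal sketch using Boost.Pool's object_pool (the Node type and the loop count are invented for the example):

    #include <boost/pool/object_pool.hpp>
    #include <cstdio>

    struct Node {
        explicit Node(int v) : value(v), next(nullptr) {}
        int value;
        Node* next;
    };

    int main() {
        boost::object_pool<Node> pool;    // grabs memory from the system in big chunks

        Node* head = nullptr;
        for (int i = 0; i < 100000; ++i) {
            Node* n = pool.construct(i);  // constructs in the pool: no per-object malloc
            n->next = head;
            head = n;
        }
        std::printf("top of list: %d\n", head->value);

        // pool.destroy(n) returns a single node early; otherwise every Node is
        // destroyed and its memory released when `pool` goes out of scope.
    }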

Ilocano answered 21/10, 2011 at 17:13 Comment(1)
Thanks. First, that seems good; I'll look at it. Second, I am in the profile-and-optimize phase, optimizing the bottlenecks (here, memory allocation). Third, is there any problem with mixing C/C++ allocation methods (other than making the code dirty/non-standard)?Weft
