Is there a custom memory allocator design pattern that does not store metadata in its allocations? [closed]

Basically, I need a memory pool for fast allocation of small objects. Ideally, I'd like to replace allocations both on the host and for memory allocated on GPUs with cudaMalloc. I can write my own, and I will if I have to, but I wouldn't mind swapping in one of the solid open-source implementations.

The only issue is that, with cudaMalloc, the memory pool cannot touch the allocated memory. My understanding is that many (all?) of the common memory allocators store a small amount of metadata inside the allocation itself; they therefore wouldn't work here.

Does anyone know of a memory allocator for which this is not the case?
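
For context, here is a toy sketch (illustrative names, not any real allocator's code) of the in-band metadata pattern being ruled out: a typical malloc-style allocator writes a small header into the allocation itself, and that host-side write is exactly what is impossible for memory obtained with cudaMalloc, which the CPU cannot dereference.

```cpp
#include <cstddef>
#include <cstdlib>

struct Header { std::size_t size; };   // in-band metadata, as in many mallocs

void* toy_alloc(std::size_t n) {
    auto* h = static_cast<Header*>(std::malloc(sizeof(Header) + n));
    if (!h) return nullptr;
    h->size = n;       // the host-side write that breaks on device memory
    return h + 1;      // hand the caller the bytes just past the header
}

void toy_free(void* p) {
    std::free(static_cast<Header*>(p) - 1);  // recover the header to free it
}
```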

Kemberlykemble answered 26/1, 2015 at 15:40 Comment(14)
I think the internals of malloc are explained pretty clearly here, so you can write your own based on the information provided: #3479830Choate
Well you need to store metadata somewhere - if you're planning on keeping the metadata on the host and only the allocated data on the CUDA GPU then things could get pretty ugly and inefficient.Fibrinous
There doesn't need to be metadata with a compile time solutionByte
Thanks for the comment, Paul R! But I don't think that's really true. If you are storing (say) the allocated sizes and pointers on the CPU, I don't see the inefficiency.Kemberlykemble
Michael: Interesting, can you point me to an example?Kemberlykemble
Development still in progress, but very close to completion...Byte
I may be going out on a limb here, but if you use cudaMalloc, why do you want to layer another memory allocator on top of it?Conductor
@PaulR, CUDA has been "keeping metadata on the host and only the allocated data on the CUDA GPU" since forever. Since the GPU is poor at running scalar, divergent code like memory allocators, and the CPU cannot access GPU memory without jumping through flaming hoops, that's always been a comfortable tradeoff.Aeronaut
@ArchaeaSoftware: sure, but presumably the CUDA libraries have access to driver internals for handling this kind of thing. I can't imagine that rolling your own cudaMalloc will be a rewarding exercise, but I'm happy to be proved wrong. [I just read your profile - I guess you know what you're talking about when it comes to implementing cudaMalloc!]Fibrinous
Oh yes, I agree with you there! The CUDA memory allocator buckets free lists using a variety of fixed-size allocations, so I suspect it is already a good fit for the requirements. Wanting to replace malloc() is a rite of passage for new-ish software engineers, who usually grow out of it after being asked to concretely demonstrate the need.Aeronaut
@Aeronaut The need comes from rigorous benchmarks: cudaMalloc is taking hundreds of microseconds in a place where that's unacceptable. I dropped in a simple caching allocator with a hashmap earlier today (a rough sketch of the idea appears after the comments), which shaved off the time (although my caching allocator sucks). Definitely appreciate you dropping in, but hope you're not calling me a newish software engineer.Kemberlykemble
@Aeronaut (and others): We frequently allocate small objects on the device. They don't stick around long. Launching kernels that initialize their data is actually quite quick (a few microseconds), but allocating them with cudaMalloc was taking 100 times longer.Kemberlykemble
You could try one from github.com/Iwan-Zotow/FixedBlockAllocator, but in the end there is still some overheadCommensal
Well, you've met the "concretely demonstrate the need" criterion, but I think you're going to have more luck hand-rolling something (fixed length allocator that uses bit vectors?) than trying to repurpose open source allocators designed to work on CPUs.Aeronaut
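
A minimal sketch of the kind of size-bucketed caching wrapper described in the comments above, assuming the CUDA runtime; the class and member names are illustrative, not the asker's actual code. All metadata (the size and free maps) lives on the host, and the device memory is never dereferenced:

```cpp
#include <cuda_runtime.h>   // cudaMalloc / cudaFree
#include <cstddef>
#include <unordered_map>

class CachingDeviceAllocator {
public:
    void* allocate(std::size_t bytes) {
        auto it = free_blocks_.find(bytes);
        if (it != free_blocks_.end()) {       // reuse a cached block of this size
            void* p = it->second;
            free_blocks_.erase(it);
            return p;
        }
        void* p = nullptr;                    // cache miss: fall back to cudaMalloc
        if (cudaMalloc(&p, bytes) != cudaSuccess) return nullptr;
        sizes_[p] = bytes;                    // remember the size for deallocate()
        return p;
    }

    void deallocate(void* p) {                // cache the block instead of freeing it
        free_blocks_.emplace(sizes_.at(p), p);
    }

    ~CachingDeviceAllocator() {               // hand everything back to the driver
        for (auto& kv : sizes_) cudaFree(kv.first);
    }

private:
    std::unordered_multimap<std::size_t, void*> free_blocks_;  // size -> cached blocks
    std::unordered_map<void*, std::size_t> sizes_;             // block -> its size
};
```

After warm-up, repeated same-size requests are served from the host-side map without touching the driver at all, which is where the speedup described in the comments comes from.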
If all your small allocations are the same size or have a reasonable upper bound, then a fixed-size pool allocator is a good pattern.

The idea is that the allocator grabs one big block via the underlying system call and then manages its own free list of fixed-size blocks within it. Allocation is as easy as taking the block at the head of the free list. Deallocation is a little more complicated, but it can be implemented in different ways depending on your requirements.

It is simple enough to write your own, or, if you google "C++ fixed size allocator", you can find a number of good implementations, including boost::pool
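
A minimal sketch of the pattern, adapted to the GPU case in the question by keeping the free list on the host rather than inside the big block (the class name and interface are illustrative). It only does pointer arithmetic on the pooled memory, never dereferences it, so the same code works whether the block came from malloc or cudaMalloc:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class FixedPool {
public:
    // 'base' is one big block obtained elsewhere (malloc, cudaMalloc, ...).
    FixedPool(void* base, std::size_t block_size, std::size_t block_count)
        : base_(static_cast<std::uint8_t*>(base)), block_size_(block_size) {
        free_list_.reserve(block_count);
        for (std::size_t i = block_count; i > 0; --i)  // all blocks start free
            free_list_.push_back(i - 1);
    }

    void* allocate() {           // O(1): pop an index off the host-side free list
        if (free_list_.empty()) return nullptr;
        std::size_t i = free_list_.back();
        free_list_.pop_back();
        return base_ + i * block_size_;   // address computed, never dereferenced
    }

    void deallocate(void* p) {   // O(1): push the block's index back on the list
        free_list_.push_back(
            (static_cast<std::uint8_t*>(p) - base_) / block_size_);
    }

private:
    std::uint8_t* base_;
    std::size_t block_size_;
    std::vector<std::size_t> free_list_;  // lives on the host, outside the pool
};
```

As far as I know, boost::pool threads its free list through the pooled blocks themselves, so for cudaMalloc memory you would need an external-free-list variant like the one above.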

Kinshasa answered 27/1, 2015 at 0:36 Comment(0)

Any allocator needs to store some metadata somewhere. The simpler the allocation requirements, the less metadata is needed.

If I understand your problem correctly, though, a normal fixed-size allocator will still give you trouble: as far as I can see, you have a rather special hardware constraint.

You could of course use a fixed pool allocator that does not allow freeing individual allocations, only releasing the whole pool. That would eliminate the need to store metadata inside the allocated memory.

You can always implement an allocator that stores its metadata outside the allocated area, in a separate memory region. Most libraries do store the metadata inside the allocated area, though, because that is most convenient on normal architectures.

So your best bet would be a fixed pool allocator that either does not offer the ability to free single allocations, or lets you simply not use that feature (and thus stores no per-allocation metadata). This is only an option if it is acceptable to always release whole memory pools instead of single allocations (which, by the way, is a good precaution against memory leaks where it is applicable).

The other alternative, of course, would be to implement your own allocator, perhaps starting from a simple design that keeps its metadata as minimal as possible.
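
A minimal sketch of that first option, assuming power-of-two alignment and release-everything-at-once semantics (a bump/arena allocator; the names are illustrative). Its only metadata is a single offset kept on the host:

```cpp
#include <cstddef>
#include <cstdint>

class BumpPool {
public:
    BumpPool(void* base, std::size_t capacity)
        : base_(static_cast<std::uint8_t*>(base)), capacity_(capacity) {}

    // 'align' must be a power of two.
    void* allocate(std::size_t bytes,
                   std::size_t align = alignof(std::max_align_t)) {
        std::size_t start = (offset_ + align - 1) & ~(align - 1);  // round up
        if (start + bytes > capacity_) return nullptr;             // exhausted
        offset_ = start + bytes;
        return base_ + start;    // address computed on the host, never touched
    }

    void release_all() { offset_ = 0; }   // frees every allocation in one step

private:
    std::uint8_t* base_;
    std::size_t capacity_;
    std::size_t offset_ = 0;   // the sole piece of metadata, outside the pool
};
```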

Ebby answered 19/2, 2015 at 0:34 Comment(0)
