I'm building an AVL tree class which will have a fixed maximum number of items. So I thought instead of allocating each item by itself, I'd just allocate the entire chunk at once and use a bitmap to assign new memory when needed.
My allocation / deallocation code:
avltree::avltree(UINT64 numitems)
{
    root = NULL;
    if (!numitems)
        buffer = NULL;
    else {
        UINT64 memsize = sizeof(avlnode) * numitems + bitlist::storagesize(numitems);
        buffer = (avlnode *) malloc(memsize);
        memmap.init(numitems, buffer + numitems);
        memmap.clear_all();
        freeaddr = 0;
    }
}
avlnode *avltree::newnode(keytype key)
{
    if (!buffer)
        return new avlnode(key);
    else {
        UINT64 pos;
        if (freeaddr < memmap.size_bits)
            pos = freeaddr++;
        else
            pos = memmap.get_first_unset();
        memmap.set_bit(pos);
        return new (&buffer[pos]) avlnode(key);
    }
}
void avltree::deletenode(avlnode *node)
{
    if (!buffer)
        delete node;
    else
        memmap.clear_bit(node - buffer);
}
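For context, the bitlist class itself isn't shown in the question; a minimal sketch consistent with the calls used above (storagesize, init, clear_all, set_bit, clear_bit, get_first_unset, size_bits) could look like this — everything beyond that interface is an assumption, not the asker's actual implementation:

```cpp
#include <cstdint>
#include <cstring>

typedef uint64_t BITINT;
typedef uint64_t UINT64;

// Hypothetical bitlist matching the interface used in the question.
struct bitlist {
    BITINT *data;        // backing storage, provided by the caller
    UINT64  size_bits;   // number of usable bits
    UINT64  size_words;  // number of BITINT words backing them

    // Bytes needed to track 'bits' items, rounded up to whole words.
    static UINT64 storagesize(UINT64 bits) {
        return ((bits + 63) / 64) * sizeof(BITINT);
    }
    void init(UINT64 bits, void *storage) {
        data = (BITINT *) storage;
        size_bits = bits;
        size_words = (bits + 63) / 64;
    }
    void clear_all() { memset(data, 0, size_words * sizeof(BITINT)); }

    void set_bit(UINT64 pos)   { data[pos / 64] |=  (1ull << (pos % 64)); }
    void clear_bit(UINT64 pos) { data[pos / 64] &= ~(1ull << (pos % 64)); }

    // Linear scan for the first zero bit: O(n), which is exactly why
    // this path hurts once the buffer fills up.
    UINT64 get_first_unset() {
        for (UINT64 w = 0; w < size_words; w++)
            if (data[w] != ~0ull)
                for (UINT64 b = 0; b < 64; b++)
                    if (!(data[w] & (1ull << b)))
                        return w * 64 + b;
        return size_bits; // no free slot
    }
};
```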
In order to use standard new / delete, I have to construct the tree with numitems == 0. In order to use my own allocator, I just pass number of items. All functions are inlined for maximum performance.
This is all fine and dandy, but my own allocator is about 20% slower than new / delete. Now, I know how complex memory allocators are; there's no way that code should run faster than an array lookup plus one bit set, yet that is exactly what happens here. What's worse: my deallocator is slower even if I remove all code from it?!?
When I check the assembly output, my allocator's code path is riddled with QWORD PTR instructions dealing with the bitmap, avltree or avlnode. It doesn't seem to be much different for the new / delete path.
For example, assembly output of avltree::newnode:
;avltree::newnode, COMDAT
mov QWORD PTR [rsp+8], rbx
push rdi
sub rsp, 32
;if (!buffer)
cmp QWORD PTR [rcx+8], 0
mov edi, edx
mov rbx, rcx
jne SHORT $LN4@newnode
; return new avlnode(key);
mov ecx, 24
call ??2@YAPEAX_K@Z ; operator new
jmp SHORT $LN27@newnode
;$LN4@newnode:
;else {
; UINT64 pos;
; if (freeaddr < memmap.size_bits)
mov r9, QWORD PTR [rcx+40]
cmp r9, QWORD PTR [rcx+32]
jae SHORT $LN2@newnode
; pos = freeaddr++;
lea rax, QWORD PTR [r9+1]
mov QWORD PTR [rcx+40], rax
; else
jmp SHORT $LN1@newnode
$LN2@newnode:
; pos = memmap.get_first_unset();
add rcx, 16
call ?get_first_unset@bitlist@@QEAA_KXZ ; bitlist::get_first_unset
mov r9, rax
$LN1@newnode:
; memmap.set_bit(pos);
mov rcx, QWORD PTR [rbx+16] ;data[bindex(pos)] |= bmask(pos);
mov rdx, r9 ;return pos / (sizeof(BITINT) * 8);
shr rdx, 6
lea r8, QWORD PTR [rcx+rdx*8] ;data[bindex(pos)] |= bmask(pos);
movzx ecx, r9b ;return 1ull << (pos % (sizeof(BITINT) * 8));
mov edx, 1
and cl, 63
shl rdx, cl
; return new (&buffer[pos]) avlnode(key);
lea rcx, QWORD PTR [r9+r9*2]
; File c:\projects\vvd\vvd\util\bitlist.h
or QWORD PTR [r8], rdx ;data[bindex(pos)] |= bmask(pos)
; 195 : return new (&buffer[pos]) avlnode(key);
mov rax, QWORD PTR [rbx+8]
lea rax, QWORD PTR [rax+rcx*8]
; $LN27@newnode:
test rax, rax
je SHORT $LN9@newnode
; avlnode constructor;
mov BYTE PTR [rax+4], 1
mov QWORD PTR [rax+8], 0
mov QWORD PTR [rax+16], 0
mov DWORD PTR [rax], edi
; 196 : }
; 197 : }
; $LN9@newnode:
mov rbx, QWORD PTR [rsp+48]
add rsp, 32 ; 00000020H
pop rdi
ret 0
?newnode@avltree@@QEAAPEAUavlnode@@H@Z ENDP ; avltree::newnode
_TEXT ENDS
I have checked the compilation output multiple times when constructing my avltree with the default / custom allocator, and it remains the same in this particular region of code. I have tried removing / replacing all relevant parts to no significant effect.
To be honest, I expected the compiler to inline all of this, since there are very few variables. I was hoping for everything except the avlnode objects themselves to be kept in registers, but that doesn't seem to be the case.
Yet the speed difference is clearly measurable. I'm not calling 3 seconds per 10 million nodes inserted slow, but I expected my code to be faster, not slower, than the generic allocator (2.5 seconds). That goes especially for the deallocator, which is slower even when all code is stripped from it.
Why is it slower?
Edit: Thank you all for the excellent thoughts on this. But I would like to stress again that the issue is not so much my method of allocation as it is the suboptimal way the variables are used: the entire avltree class contains just 4 UINT64 variables, and bitlist only has 3.
However, despite that, the compiler doesn't optimise these into registers. It insists on QWORD PTR instructions that are orders of magnitude slower. Is this because I'm using classes? Should I move to C / plain variables? Scratch that. Stupid me. I have all the avltree code in there as well, so things can't stay in registers.
Also, I am at a total loss as to why my deallocator would still be slower even if I remove ALL code from it. Yet QueryPerformanceCounter tells me just that. It's insane to even think it: the same deallocator also gets called on the new / delete code path, where it has to delete the node, while for my custom allocator (with the code stripped) it doesn't have to do anything at all.
Edit2: I have now completely removed the bitlist and implemented free space tracking through a singly-linked list. The avltree::newnode function is now much more compact (21 instructions for my custom allocator path, 7 of those are QWORD PTR ops dealing with avltree and 4 are used for constructor of avlnode). The end result (time) decreased from ~3 seconds to ~2.95 seconds for 10 million allocations.
Edit3: I also rewrote the entire code such that now everything is handled by the singly linked list. Now the avltree class only has two relevant members: root and first_free. The speed difference remains.
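The free-list variant described in Edit 3 could look something like the sketch below — a hypothetical reconstruction, since the rewritten code isn't shown here. It overlays the "next free" pointer on the unused node slots themselves, as Marmite's comment suggests, so allocation and deallocation are both O(1):

```cpp
#include <cstdint>
#include <cstdlib>
#include <new>

typedef uint64_t UINT64;

// Hypothetical node layout; the real avlnode is not shown in the question.
struct avlnode {
    int      key;
    bool     flag;
    avlnode *left;
    avlnode *right;
    avlnode(int k) : key(k), flag(true), left(NULL), right(NULL) {}
};

struct avltree {
    avlnode *root;
    avlnode *first_free;  // head of the free-slot list
    avlnode *buffer;

    avltree(UINT64 numitems) : root(NULL), first_free(NULL) {
        buffer = (avlnode *) malloc(sizeof(avlnode) * numitems);
        // Thread the free list through the raw slots: each unused slot's
        // first bytes hold a pointer to the next free slot, so no extra
        // memory is needed for bookkeeping.
        for (UINT64 i = numitems; i > 0; i--) {
            *(avlnode **) &buffer[i - 1] = first_free;
            first_free = &buffer[i - 1];
        }
    }
    ~avltree() { free(buffer); }

    avlnode *newnode(int key) {
        if (!first_free)
            return NULL;                  // pool exhausted (no heap fallback here)
        avlnode *slot = first_free;
        first_free = *(avlnode **) slot;  // pop: O(1), no bit scanning
        return new (slot) avlnode(key);   // placement-new, as in the original
    }
    void deletenode(avlnode *node) {
        *(avlnode **) node = first_free;  // push the slot back: also O(1)
        first_free = node;
    }
};
```

Note this sketch drops the new / delete fallback path for brevity; a freed slot is reused in LIFO order, which also tends to keep recently touched memory hot in cache.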
Edit4: Rearranging code and looking at performance figures, these things are what helped the most:
- As pointed out by all contributors, having a bitmap in there was just plain bad. Removed in favour of singly-linked free slot list.
- Code locality: moving the dependent functions (the AVL tree handling ones) into a function-local class instead of declaring them globally improved speed by some 15% (3 secs --> 2.5 secs)
- avlnode struct size: just adding
#pragma pack(1)
before the struct declaration decreased execution time by a further 20% (2.5 secs --> 2 secs)
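To illustrate that last point with a self-contained sketch (the field names here are guesses based on the assembly above, not the actual avlnode layout):

```cpp
#include <cstdint>

// Default alignment: the compiler pads the byte field so the pointers
// stay 8-byte aligned, giving a 24-byte struct on a typical 64-bit target.
struct node_default {
    int32_t       key;
    uint8_t       flag;
    node_default *left;
    node_default *right;
};

// Packed: no padding, so 4 + 1 + 8 + 8 = 21 bytes. Smaller nodes mean
// more of them fit per cache line, which is the likely source of the
// speedup; the trade-off is unaligned pointer loads inside each node.
#pragma pack(push, 1)
struct node_packed {
    int32_t      key;
    uint8_t      flag;
    node_packed *left;
    node_packed *right;
};
#pragma pack(pop)
```

(`#pragma pack(push, 1)` / `#pragma pack(pop)` is the scoped form understood by MSVC, GCC and Clang; a bare `#pragma pack(1)` as in the edit changes packing for every struct declared after it.)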
Edit 5:
Since this question seems to have been quite popular, I have posted the final complete code as an answer below. I'm quite satisfied with its performance.
get_first_unset — once the buffer has been filled, this will hurt performance. The traditional data structure choice for this case would be a singly-linked free list with the next pointer overlaying the allocated data, thereby avoiding the search. The standard allocator likely has a couple of such specialized pool allocators for fixed-length objects, though it ought to still suffer some additional overhead to handle the general case (object size inference, locking, etc.). – Marmite