Custom C++ allocator far too slow in GCC debug builds. Is there a fix for this?

I am struggling with the performance of my custom allocator. My question is regarding a debug build.

Normally I don't mind a small performance drop, but currently the application runs at 4 fps with the custom allocator, while without it it sits at 60 fps (and could probably go faster). This makes it harder to develop the software.

I nailed it down all the way to... basically inheriting the standard allocator

Please see the following results from 'quick-bench.com' https://quick-bench.com/q/ep3uyYNK6rh_6f8AGAP0zIAflAA

Here is a picture of the benchmark results:

The blue bar is simply:

#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t numBytes = 1'000'000; // some large buffer size; the exact value does not matter here

int main() {
    std::vector<uint8_t, std::vector<uint8_t>::allocator_type> buffer;
    buffer.reserve(numBytes);
    buffer.resize(numBytes);
    return 0;
}

The yellow bar:

// includes and numBytes as above

template<typename T>
class CustomAllocatorType : public std::vector<uint8_t>::allocator_type {};

int main() {
    std::vector<uint8_t, CustomAllocatorType<uint8_t>> buffer;
    buffer.reserve(numBytes);
    buffer.resize(numBytes);
    return 0;
}

Enveloping the custom allocator with:

#pragma GCC push_options
#pragma GCC optimize ("-O3")
// ....
#pragma GCC pop_options

did not have any effect. I suppose I would need to do this for the vector instance itself, but I don't want to go that far...

Does anyone know a solution for this?

Chew asked 29/4, 2022 at 16:51 Comment(10)
It looks like the two pieces of code used to produce identical assembly with C++17, but since C++20 additional code is generated for the allocator: godbolt.org/z/1P3qTeW46 Bravar
"Does anyone know a solution for this ?" Maybe fall back to -std=c++17? quick-bench.com/q/LUk0HQxpY4Tqk1dTEU3pzJAxEPMSochor
If you're allocating memory during every frame, you're probably doing something wrong. It would be preferable to preallocate as much memory as needed and then reuse this memory.Ramachandra
It is common for unoptimized g++ to be 10x or 100x slower than optimized code. You can try using -Og to optimize while maintaining debuggability, but that doesn't always work (it can be hard to sanely debug the result).Sari
@ChrisDodd The question isn't about debug being slower. It's why std::vector<uint8_t, CustomAllocatorType<uint8_t>> is slower than std::vector<uint8_t> when CustomAllocatorType is apparently std::allocator.Bravar
@Ramachandra agreed. Actually, I was planning on using some kind of memory pool inside the allocator. The example here is just for demonstration purposes, to highlight where I am getting stuck. Thanks so far for the replies. I didn't realize this issue doesn't occur in C++17, but I would still like to use C++20.Chew
This is extremely weird. My thought was that the STL is specializing/overloading some stuff for std::allocator to improve performance. This would not necessarily be done for CustomAllocatorType. And indeed, there is some stuff that gets optimized specifically. But: I then tried to copy & paste gcc's implementation of std::allocator + the specializations/overloads of various stuff, just adapting the name of the allocator to the custom one. In the end, I get exactly the same assembly output between the custom allocator and the std one. But the custom one still runs slower. ...Confirm
... See on godbolt (the custom allocator is named allocator_c), and QuickBench. If you copy & paste the assembly into a text editor and replace allocator_c with allocator, you see that they are identical. I have no idea what is going on here. It also does not seem to be a quirk of quick bench, because exchanging the custom and the std::allocator exchanges results, too. It also shouldn't be any optimization magic that detects std::allocator because optimizations are turned off.Confirm
If I try it locally, I do not get the issue that the assembly code is the same but the performance differs. I have absolutely no idea what is going on on quick bench here.Confirm
Ah, I get it now. It does not work on quick bench because quick bench force-includes benchmark/benchmark.h before any of my code, and which in turn includes <vector>. godbolt does not do this. Also compare my answer below.Confirm

Reason for the performance decrease

gcc's libstdc++ uses certain performance improvements if the allocator is std::allocator. Your CustomAllocatorType is a different type than std::allocator, meaning that the optimizations are disabled. Note that I am not talking about compiler optimizations but rather that gcc's implementation of the C++ standard library implements overloads or specializations specifically for std::allocator. To name an example relevant to your example code, std::vector::resize() internally calls __uninitialized_default_n_a() which has a special overload for std::allocator. The special overload bypasses the allocator entirely. If you use CustomAllocatorType, the generic version is used which calls the allocator for every single element. This costs a lot of time. Another function with a special definition and which is relevant to your simple code example is _Destroy().

Put another way, gcc's implementation of the C++ standard library has some measures implemented to ensure that optimal code is generated in cases where it is known that it is safe. This works regardless of compiler optimizations. If the non-optimized code paths are taken and you enable compiler optimizations (e.g. -O3), the compiler is often able to recognize patterns in the non-optimal code (such as initializing successive trivial elements) and can optimize everything away so that you end up with the same instructions (more or less).
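
To illustrate, here is a minimal sketch of the two code paths (simplified for this answer, not the actual libstdc++ code): the generic path goes through allocator_traits::construct() once per element, while the path the library can take for std::allocator and trivial value types initializes the whole range in one bulk operation.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <type_traits>

// Generic path: one allocator_traits::construct() call per element.
template <typename Alloc, typename T>
void fill_default_generic(Alloc& alloc, T* first, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        std::allocator_traits<Alloc>::construct(alloc, first + i); // per-element call
}

// "Known std::allocator + trivial T" path: the allocator is bypassed entirely
// and the range is value-initialized in one bulk operation.
template <typename T>
void fill_default_bulk(T* first, std::size_t n) {
    static_assert(std::is_trivially_default_constructible_v<T>);
    std::fill_n(first, n, T{}); // single bulk fill, trivially optimizable
}

int main() {
    constexpr std::size_t n = 1024;
    std::allocator<std::uint8_t> alloc;
    std::uint8_t* p = alloc.allocate(n);
    fill_default_generic(alloc, p, n); // roughly what the generic overload boils down to
    fill_default_bulk(p, n);           // roughly what the std::allocator overload boils down to
    alloc.deallocate(p, n);
}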

C++20 vs C++17 and why your CustomAllocatorType is broken

As noted in the comments, the performance decrease when using CustomAllocatorType only occurs in C++20 but not in C++17. To understand why, note that gcc's std::vector implementation does not directly use the Allocator from the declaration std::vector<T, Allocator> (in your case CustomAllocatorType) as its internal allocator. Rather, it uses std::allocator_traits<Allocator>::rebind_alloc<T> (see here and here). Also see e.g. this post about rebind for some more information.

Since you did not define a specialization of std::allocator_traits for CustomAllocatorType, the generic one is used. The standard says:

rebind_alloc<T>: Alloc::rebind<T>::other if present, otherwise Alloc<T, Args> if this Alloc is Alloc<U, Args>

I.e. the generic one attempts to delegate to your allocator, if possible. Now, your allocator CustomAllocatorType inherits from std::allocator. And here comes the important difference between C++17 and C++20: std::allocator::rebind was removed in C++20. Hence:

  • C++17: CustomAllocatorType::rebind is inherited from std::allocator and thus defined, with rebind<U>::other being std::allocator<U>. Therefore, std::allocator_traits<CustomAllocatorType<T>>::rebind_alloc<T> is std::allocator<T>, meaning that std::vector ends up actually using std::allocator instead of CustomAllocatorType. If you pass a CustomAllocatorType instance to the std::vector constructor, you end up with object slicing.
  • C++20: CustomAllocatorType::rebind is not defined. Thus, std::allocator_traits<CustomAllocatorType<T>>::rebind_alloc<T> is CustomAllocatorType<T> and std::vector ends up using CustomAllocatorType.

So the C++17 version uses std::allocator and thus enjoys the library based optimizations described above, while the C++20 version does not.
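
You can make this difference visible with a static_assert (a minimal sketch using the CustomAllocatorType definition from the question; compile it once with -std=c++17 and once with -std=c++20):

#include <cstdint>
#include <memory>
#include <type_traits>
#include <vector>

template<typename T>
class CustomAllocatorType : public std::vector<std::uint8_t>::allocator_type {};

using Rebound =
    std::allocator_traits<CustomAllocatorType<std::uint8_t>>::rebind_alloc<std::uint8_t>;

int main() {
#if __cplusplus >= 202002L
    // C++20: std::allocator::rebind is gone, so the generic fallback Alloc<T, Args...>
    // applies and std::vector really rebinds to CustomAllocatorType.
    static_assert(std::is_same_v<Rebound, CustomAllocatorType<std::uint8_t>>);
#else
    // C++17: the inherited std::allocator::rebind is found, so rebinding silently
    // yields std::allocator and the custom allocator is never used.
    static_assert(std::is_same_v<Rebound, std::allocator<std::uint8_t>>);
#endif
}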

Your code is simply incorrect, or at least the C++17 version is: std::vector does not use your allocator at all in C++17. You can also see this if you attempt to call buffer.get_allocator() in your example: it will fail to compile in C++17 because it tries to convert the internally used std::allocator to CustomAllocatorType.

I think the correct way to fix the issue is to define CustomAllocatorType::rebind instead of specializing std::allocator_traits (see here and here), like so:

template<typename T>
class CustomAllocatorType: public std::allocator<T> 
{
public: // rebind must be accessible, otherwise std::allocator_traits cannot use it
  template< class U > struct rebind {
    typedef CustomAllocatorType<U> other;
  };
};

Of course, doing so means that the C++17 version will also be slow in debug, but it will actually work.
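
As a quick check of the fixed version (a minimal sketch; it should compile with both -std=c++17 and -std=c++20):

#include <cstdint>
#include <memory>
#include <type_traits>
#include <vector>

template<typename T>
class CustomAllocatorType : public std::allocator<T>
{
public:
  template<class U> struct rebind {
    typedef CustomAllocatorType<U> other;
  };
};

int main() {
    // Rebinding now yields the custom allocator under both standards ...
    static_assert(std::is_same_v<
        std::allocator_traits<CustomAllocatorType<std::uint8_t>>::rebind_alloc<std::uint8_t>,
        CustomAllocatorType<std::uint8_t>>);

    // ... and std::vector actually stores and returns it.
    std::vector<std::uint8_t, CustomAllocatorType<std::uint8_t>> buffer;
    buffer.resize(1024);
    CustomAllocatorType<std::uint8_t> a = buffer.get_allocator(); // compiles in C++17 now, too
    (void)a;
    return 0;
}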

I think this also shows once more the general rule that inheriting from C++ standard library types is usually a bad idea. If CustomAllocatorType did not inherit from std::allocator, the problem would not have appeared in the first place (also because you would then have had to think about which members, such as rebind, to define yourself).

Improving performance

Assuming the allocator was fixed for C++17 or you use C++20, you get the bad performance in debug because the library implementation uses the generic versions of the above mentioned functions to fill and destroy data. Unfortunately, all of this is an implementation detail of the library, meaning that there is no nice standard way to enforce the generation of good code.

Hacky solution

A hack that works in your trivial example (and probably only there!) would be to define custom overloads of the functions in question, e.g.:

#include <memory> // ensures the internal bits below have everything they need
#include <bits/stl_uninitialized.h>
#include <cstdint>
#include <cstdlib>

// CustomAllocatorType must at least be declared at this point.
template<typename T> class CustomAllocatorType;

// Must be defined BEFORE including <vector>!
namespace std{
  template<typename _ForwardIterator, typename _Size, typename _Tp>
  inline _ForwardIterator
  __uninitialized_default_n_a(_ForwardIterator __first, _Size __n, CustomAllocatorType<_Tp>&)
  { return std::__uninitialized_default_n(__first, __n); }


  template<typename _ForwardIterator, typename _Tp>
  _GLIBCXX20_CONSTEXPR inline void
  _Destroy(_ForwardIterator __first, _ForwardIterator __last, CustomAllocatorType<_Tp>&) {
    _Destroy(__first, __last);
  }
}

These are copied and pasted from gcc's std::allocator overloads (here and here), but adapted for CustomAllocatorType. More special overloads would be required in real applications (e.g. for is_copy_constructible and is_move_constructible or __relocate_a_1, no idea how many more). Defining the above two functions before the include of <vector> leads to decent performance in debug for your minimal example. At least it does so for me locally using gcc 11.2. It does not work on quick bench because quick bench force-includes benchmark/benchmark.h before any of your code, which in turn includes <vector> (also compare the second bullet point coming next).

This hack is awful on multiple levels:

  • It is absolutely non-standard. It only works with libstdc++ and might break at any up- or downgrade of the library version.
  • You also need to ensure that the overloads are defined before the <vector> header is included, because otherwise they will not be picked up. The reason is that the calls to std::__uninitialized_default_n_a() are qualified, i.e. are std::__uninitialized_default_n_a(arguments) rather than __uninitialized_default_n_a(arguments), meaning that overloads after the definition of std::vector are not found (cf. e.g. this post or this one). As already explained above, this is the reason why the hack fails on quick bench. Also, if you mess this up in some places, you might violate the one-definition-rule (which will probably lead to more weirdness).
  • The example hack assumes that initializing and destroying the elements does not require the use of CustomAllocatorType, just as with std::allocator. I highly doubt that this holds for your true CustomAllocatorType implementation. But maybe you could actually implement e.g. __uninitialized_default_n_a() properly and more efficiently for your CustomAllocatorType by calling an appropriate function on your allocator (a sketch of this idea follows below).

I do not recommend doing this. But depending on the use case, it might be a viable solution.
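
To illustrate the last bullet point: instead of bypassing the allocator, the overload could forward to a bulk routine provided by the allocator itself. The sketch below is hypothetical; construct_default_n() is a made-up member name, and here it merely delegates to the generic bulk initialization, but a real pool allocator could do something smarter (e.g. hand out pre-initialized memory):

#include <bits/stl_uninitialized.h>
#include <cstddef>
#include <cstdint>
#include <memory>

template<typename T>
class CustomAllocatorType : public std::allocator<T> {
public:
  template<class U> struct rebind { typedef CustomAllocatorType<U> other; };

  // Hypothetical bulk hook: initialize [first, first + n) in one go.
  T* construct_default_n(T* first, std::size_t n) {
    return std::__uninitialized_default_n(first, n);
  }
};

// As before, this must be defined BEFORE <vector> is included.
namespace std {
  template<typename _ForwardIterator, typename _Size, typename _Tp>
  inline _ForwardIterator
  __uninitialized_default_n_a(_ForwardIterator __first, _Size __n,
                              CustomAllocatorType<_Tp>& __alloc)
  { return __alloc.construct_default_n(__first, __n); }
}

#include <vector>

int main() {
  std::vector<std::uint8_t, CustomAllocatorType<std::uint8_t>> buffer;
  buffer.resize(1 << 20); // resize() now goes through the bulk hook instead of per-element calls
  return 0;
}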

Enabling -Og

I do get notably better performance with gcc when compiling everything with -Og. It attempts to perform some optimizations without interfering with the debugging experience too much. In your trivial example the performance is improved from 160x slower to 5x slower compared to the std::allocator version. So if you cannot change the compiler, I think that might be the best way to go.

Using clang

Switching to clang (without any optimization flags) seems to improve the performance somewhat. With libstdc++, the custom allocator version is "only" 90x slower. Surprisingly, with libc++ quick bench reports roughly the same performance. Unfortunately, I cannot reproduce this locally: libc++ is also taking ages. No idea why the result differs locally and on quick bench.

But I can reproduce that clang is optimizing with -Og much better than gcc and gives roughly the same performance with the custom allocator. This holds both with libstdc++ and libc++.

So my suggestion is to use clang, possibly with libc++, and use -Og.

Alternative ideas

Enabling optimizations locally (#pragma GCC optimize ("-O3") etc) is rather unreliable. It did not work for me. The most likely reason is that the optimization flag is not propagated to the instantiation of std::vector because its definition is somewhere else entirely. You'd probably need to compile the C++ standard library headers themselves with optimizations.

Another idea would be to use a different container library. For example, boost has a vector class. But I have not checked if its debug performance would be better.
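
For completeness, a minimal sketch of what that would look like with Boost.Container (assuming Boost is available; as said, I have not measured whether its unoptimized code paths are any faster with a custom allocator):

#include <boost/container/vector.hpp>
#include <cstdint>
#include <memory>

template<typename T>
class CustomAllocatorType : public std::allocator<T> {
public:
  template<class U> struct rebind { typedef CustomAllocatorType<U> other; };
};

int main() {
  // Same usage pattern as with std::vector.
  boost::container::vector<std::uint8_t, CustomAllocatorType<std::uint8_t>> buffer;
  buffer.reserve(1'000'000);
  buffer.resize(1'000'000);
  return 0;
}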

Confirm answered 30/4, 2022 at 13:54 Comment(3)
@FrançoisAndrieux Thank you! To my shame, I have to admit that I was not aware of the problems with the term "STL". I always thought it was synonymous with the C++ standard library, but have never really thought about it. My bad. Also, yes, I meant C++17, not C++20, that was a typo. Moreover, I now also understand why the hack fails on quick bench: Quick bench force-includes <vector>. I have updated my answer accordingly.Confirm
I also added a more viable solution: -Og has a great impact, as does switching to clang. So I think the best solution would be to use clang with -Og.Confirm
It blows my mind how deep you managed to dive into this. Thank you very much for the details! I'll see if I can maybe just use clang for now. Seems to be the nicest solution in my case.Chew
