Custom C++ allocator far too slow in GCC debug builds. Is there a fix for this?

I am struggling with the performance of my custom allocator. My question is regarding a debug build.

Normally I don't mind a small performance drop, but currently the application runs at 4 fps with the custom allocator, while without it it sits at 60 fps (and could probably go faster). This makes it harder to develop the software.

I nailed it down all the way to... basically inheriting the standard allocator

Please see the following results from 'quick-bench.com' https://quick-bench.com/q/ep3uyYNK6rh_6f8AGAP0zIAflAA

Here is a picture of the benchmark results:

The blue bar is simply:

#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t numBytes = 1'000'000; // some large buffer size; the exact value does not matter here

int main() {
    std::vector<uint8_t, std::vector<uint8_t>::allocator_type> buffer;
    buffer.reserve(numBytes);
    buffer.resize(numBytes);
    return 0;
}

The yellow bar:

// includes and numBytes as above

template<typename T>
class CustomAllocatorType : public std::vector<uint8_t>::allocator_type {};

int main() {
    std::vector<uint8_t, CustomAllocatorType<uint8_t>> buffer;
    buffer.reserve(numBytes);
    buffer.resize(numBytes);
    return 0;
}

Enveloping the custom allocator with:

#pragma GCC push_options
#pragma GCC optimize ("-O3")
// ....
#pragma GCC pop_options

did not have any effect. I suppose I would need to do this for the vector instance itself, but I don't want to go that far...

Does anyone know a solution for this?

Chew asked 29/4, 2022 at 16:51 Comment(10)
It looks like the two pieces of code used to produce identical assembly with C++17, but since C++20 additional code is generated for the allocator: godbolt.org/z/1P3qTeW46 Bravar
"Does anyone know a solution for this ?" Maybe fall back to -std=c++17? quick-bench.com/q/LUk0HQxpY4Tqk1dTEU3pzJAxEPMSochor
If you're allocating memory during every frame, you're probably doing something wrong. It would be preferable to preallocate as much memory as needed and then reuse this memory.Ramachandra
It is common for unoptimized g++ to be 10x or 100x slower than optimized code. You can try using -Og to optimize while maintaining debuggability, but that doesn't always work (it can be hard to sanely debug the result).Sari
@ChrisDodd The question isn't about debug being slower. It's why std::vector<uint8_t, CustomAllocatorType<uint8_t>> is slower than std::vector<uint8_t> when CustomAllocatorType is apparently std::allocator.Bravar
@Ramachandra agreed. Actually, I was planning on using some kind of memory pool inside the allocator. The example here is just for demonstration purposes, to highlight where I am getting stuck. Thanks so far for the replies. I didn't realize this issue doesn't occur in C++17, but I would still like to use C++20.Chew
This is extremely weird. My thought was that the STL is specializing/overloading some stuff for std::allocator to improve performance. This would not necessarily be done for CustomAllocatorType. And indeed, there is some stuff that gets optimized specifically. But: I then tried to copy & paste gcc's implementation of std::allocator + the specializations/overloads of various stuff, just adapting the name of the allocator to the custom one. In the end, I get exactly the same assembly output between the custom allocator and the std one. But the custom one still runs slower. ...Confirm
... See on godbolt (the custom allocator is named allocator_c), and QuickBench. If you copy & paste the assembly into a text editor and replace allocator_c with allocator, you see that they are identical. I have no idea what is going on here. It also does not seem to be a quirk of quick bench, because exchanging the custom and the std::allocator exchanges results, too. It also shouldn't be any optimization magic that detects std::allocator because optimizations are turned off.Confirm
If I try it locally, I do not get the issue that the assembly code is the same but the performance differs. I have absolutely no idea what is going on on quick bench here.Confirm
Ah, I get it now. It does not work on quick bench because quick bench force-includes benchmark/benchmark.h before any of my code, and which in turn includes <vector>. godbolt does not do this. Also compare my answer below.Confirm

Reason for the performance decrease

gcc's libstdc++ uses certain performance improvements if the allocator is std::allocator. Your CustomAllocatorType is a different type than std::allocator, meaning that the optimizations are disabled. Note that I am not talking about compiler optimizations but rather that gcc's implementation of the C++ standard library implements overloads or specializations specifically for std::allocator. To name an example relevant to your example code, std::vector::resize() internally calls __uninitialized_default_n_a() which has a special overload for std::allocator. The special overload bypasses the allocator entirely. If you use CustomAllocatorType, the generic version is used which calls the allocator for every single element. This costs a lot of time. Another function with a special definition and which is relevant to your simple code example is _Destroy().

Put another way, gcc's implementation of the C++ standard library has some measures implemented to ensure that optimal code is generated in cases where it is known that it is safe. This works regardless of compiler optimizations. If the non-optimized code paths are taken and you enable compiler optimizations (e.g. -O3), the compiler is often able to recognize patterns in the non-optimal code (such as initializing successive trivial elements) and can optimize everything away so that you end up with the same instructions (more or less).
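
To illustrate, here is a minimal sketch of the two code paths (simplified for this answer, not the actual libstdc++ code): the generic path goes through allocator_traits::construct() once per element, while the path the library can take for std::allocator and trivial value types initializes the whole range in one bulk operation.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <type_traits>

// Generic path: one allocator_traits::construct() call per element.
template <typename Alloc, typename T>
void fill_default_generic(Alloc& alloc, T* first, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        std::allocator_traits<Alloc>::construct(alloc, first + i); // per-element call
}

// "Known std::allocator + trivial T" path: the allocator is bypassed entirely
// and the range is value-initialized in one bulk operation.
template <typename T>
void fill_default_bulk(T* first, std::size_t n) {
    static_assert(std::is_trivially_default_constructible_v<T>);
    std::fill_n(first, n, T{}); // single bulk fill, trivially optimizable
}

int main() {
    constexpr std::size_t n = 1024;
    std::allocator<std::uint8_t> alloc;
    std::uint8_t* p = alloc.allocate(n);
    fill_default_generic(alloc, p, n); // roughly what the generic overload boils down to
    fill_default_bulk(p, n);           // roughly what the std::allocator overload boils down to
    alloc.deallocate(p, n);
}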

C++20 vs C++17 and why your CustomAllocatorType is broken

As noted in the comments, the performance decrease when using CustomAllocatorType only occurs in C++20 but not in C++17. To understand why, note that gcc's std::vector implementation does not directly use the Allocator from the declaration std::vector<T, Allocator> (in your case CustomAllocatorType) as its internal allocator. Rather, it uses std::allocator_traits<Allocator>::rebind_alloc<T> (see here and here). Also see e.g. this post about rebind for some more information.

Since you did not define a specialization of std::allocator_traits for CustomAllocatorType, the generic one is used. The standard says:

rebind_alloc<T>: Alloc::rebind<T>::other if present, otherwise Alloc<T, Args> if this Alloc is Alloc<U, Args>

I.e. the generic one attempts to delegate to your allocator, if possible. Now, your allocator CustomAllocatorType inherits from std::allocator. And here comes the important difference between C++17 and C++20: std::allocator::rebind was removed in C++20. Hence:

  • C++17: CustomAllocatorType::rebind is inherited from std::allocator and thus defined, with rebind<U>::other being std::allocator<U>. Therefore, std::allocator_traits<CustomAllocatorType<T>>::rebind_alloc<T> is std::allocator<T>, meaning that std::vector ends up actually using std::allocator instead of CustomAllocatorType. If you pass a CustomAllocatorType instance to the std::vector constructor, you end up with object slicing.
  • C++20: CustomAllocatorType::rebind is not defined. Thus, std::allocator_traits<CustomAllocatorType<T>>::rebind_alloc<T> is CustomAllocatorType<T> and std::vector ends up using CustomAllocatorType.

So the C++17 version uses std::allocator and thus enjoys the library based optimizations described above, while the C++20 version does not.
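
You can make this difference visible with a static_assert (a minimal sketch using the CustomAllocatorType definition from the question; compile it once with -std=c++17 and once with -std=c++20):

#include <cstdint>
#include <memory>
#include <type_traits>
#include <vector>

template<typename T>
class CustomAllocatorType : public std::vector<std::uint8_t>::allocator_type {};

using Rebound =
    std::allocator_traits<CustomAllocatorType<std::uint8_t>>::rebind_alloc<std::uint8_t>;

int main() {
#if __cplusplus >= 202002L
    // C++20: std::allocator::rebind is gone, so the generic fallback Alloc<T, Args...>
    // applies and std::vector really rebinds to CustomAllocatorType.
    static_assert(std::is_same_v<Rebound, CustomAllocatorType<std::uint8_t>>);
#else
    // C++17: the inherited std::allocator::rebind is found, so rebinding silently
    // yields std::allocator and the custom allocator is never used.
    static_assert(std::is_same_v<Rebound, std::allocator<std::uint8_t>>);
#endif
}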

Your code is simply incorrect, or at least the C++17 version is: std::vector does not use your allocator at all in C++17. You can also see this if you attempt to call buffer.get_allocator() in your example: it will fail to compile in C++17 because it tries to convert the internally used std::allocator to CustomAllocatorType.

I think the correct way to fix the issue is to define CustomAllocatorType::rebind instead of specializing std::allocator_traits (see here and here), like so:

template<typename T>
class CustomAllocatorType: public std::allocator<T> 
{
public: // rebind must be accessible, otherwise std::allocator_traits cannot use it
  template< class U > struct rebind {
    typedef CustomAllocatorType<U> other;
  };
};

Of course, doing so means that the C++17 version will also be slow in debug, but it will actually work.
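
As a quick check of the fixed version (a minimal sketch; it should compile with both -std=c++17 and -std=c++20):

#include <cstdint>
#include <memory>
#include <type_traits>
#include <vector>

template<typename T>
class CustomAllocatorType : public std::allocator<T>
{
public:
  template<class U> struct rebind {
    typedef CustomAllocatorType<U> other;
  };
};

int main() {
    // Rebinding now yields the custom allocator under both standards ...
    static_assert(std::is_same_v<
        std::allocator_traits<CustomAllocatorType<std::uint8_t>>::rebind_alloc<std::uint8_t>,
        CustomAllocatorType<std::uint8_t>>);

    // ... and std::vector actually stores and returns it.
    std::vector<std::uint8_t, CustomAllocatorType<std::uint8_t>> buffer;
    buffer.resize(1024);
    CustomAllocatorType<std::uint8_t> a = buffer.get_allocator(); // compiles in C++17 now, too
    (void)a;
    return 0;
}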

I think this also shows once more the general rule that inheriting from C++ standard library types is usually a bad idea. If CustomAllocatorType did not inherit from std::allocator, the problem would not have appeared in the first place (also because you would then have had to think about which members, such as rebind, to define yourself).

Improving performance

Assuming the allocator was fixed for C++17 or you use C++20, you get the bad performance in debug because the library implementation uses the generic versions of the above mentioned functions to fill and destroy data. Unfortunately, all of this is an implementation detail of the library, meaning that there is no nice standard way to enforce the generation of good code.

Hacky solution

A hack that works in your trivial example (and probably only there!) would be to define custom overloads of the functions in question, e.g.:

#include <memory> // ensures the internal bits below have everything they need
#include <bits/stl_uninitialized.h>
#include <cstdint>
#include <cstdlib>

// CustomAllocatorType must at least be declared at this point.
template<typename T> class CustomAllocatorType;

// Must be defined BEFORE including <vector>!
namespace std{
  template<typename _ForwardIterator, typename _Size, typename _Tp>
  inline _ForwardIterator
  __uninitialized_default_n_a(_ForwardIterator __first, _Size __n, CustomAllocatorType<_Tp>&)
  { return std::__uninitialized_default_n(__first, __n); }


  template<typename _ForwardIterator, typename _Tp>
  _GLIBCXX20_CONSTEXPR inline void
  _Destroy(_ForwardIterator __first, _ForwardIterator __last, CustomAllocatorType<_Tp>&) {
    _Destroy(__first, __last);
  }
}

These are copied and pasted from gcc's std::allocator overloads (here and here), but adapted for CustomAllocatorType. More special overloads would be required in real applications (e.g. for is_copy_constructible and is_move_constructible or __relocate_a_1, no idea how many more). Defining the above two functions before the include of <vector> leads to decent performance in debug for your minimal example. At least it does so for me locally using gcc 11.2. It does not work on quick bench because quick bench force-includes benchmark/benchmark.h before any of your code, which in turn includes <vector> (also compare the second bullet point coming next).

This hack is awful on multiple levels:

  • It is absolutely non-standard. It only works with libstdc++ and might break at any up- or downgrade of the library version.
  • You also need to ensure that the overloads are defined before the <vector> header is included, because otherwise they will not be picked up. The reason is that the calls to std::__uninitialized_default_n_a() are qualified, i.e. are std::__uninitialized_default_n_a(arguments) rather than __uninitialized_default_n_a(arguments), meaning that overloads after the definition of std::vector are not found (cf. e.g. this post or this one). As already explained above, this is the reason why the hack fails on quick bench. Also, if you mess this up in some places, you might violate the one-definition-rule (which will probably lead to more weirdness).
  • The example hack assumes that initializing and destroying the elements does not require the use of CustomAllocatorType, just as with std::allocator. I highly doubt that this holds for your true CustomAllocatorType implementation. But maybe you could actually implement e.g. __uninitialized_default_n_a() properly and more efficiently for your CustomAllocatorType by calling an appropriate function on your allocator (a sketch of this idea follows below).

I do not recommend doing this. But depending on the use case, it might be a viable solution.
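
To illustrate the last bullet point: instead of bypassing the allocator, the overload could forward to a bulk routine provided by the allocator itself. The sketch below is hypothetical; construct_default_n() is a made-up member name, and here it merely delegates to the generic bulk initialization, but a real pool allocator could do something smarter (e.g. hand out pre-initialized memory):

#include <bits/stl_uninitialized.h>
#include <cstddef>
#include <cstdint>
#include <memory>

template<typename T>
class CustomAllocatorType : public std::allocator<T> {
public:
  template<class U> struct rebind { typedef CustomAllocatorType<U> other; };

  // Hypothetical bulk hook: initialize [first, first + n) in one go.
  T* construct_default_n(T* first, std::size_t n) {
    return std::__uninitialized_default_n(first, n);
  }
};

// As before, this must be defined BEFORE <vector> is included.
namespace std {
  template<typename _ForwardIterator, typename _Size, typename _Tp>
  inline _ForwardIterator
  __uninitialized_default_n_a(_ForwardIterator __first, _Size __n,
                              CustomAllocatorType<_Tp>& __alloc)
  { return __alloc.construct_default_n(__first, __n); }
}

#include <vector>

int main() {
  std::vector<std::uint8_t, CustomAllocatorType<std::uint8_t>> buffer;
  buffer.resize(1 << 20); // resize() now goes through the bulk hook instead of per-element calls
  return 0;
}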

Enabling -Og

I do get notably better performance with gcc when compiling everything with -Og. It attempts to perform some optimizations without interfering with the debugging experience too much. In your trivial example the performance is improved from 160x slower to 5x slower compared to the std::allocator version. So if you cannot change the compiler, I think that might be the best way to go.

Using clang

Switching to clang (without any optimization flags) seems to improve the performance somewhat. With libstdc++, the custom allocator version is "only" 90x slower. Surprisingly, with libc++ quick bench reports roughly the same performance. Unfortunately, I cannot reproduce this locally: libc++ is also taking ages. No idea why the result differs locally and on quick bench.

But I can reproduce that clang is optimizing with -Og much better than gcc and gives roughly the same performance with the custom allocator. This holds both with libstdc++ and libc++.

So my suggestion is to use clang, possibly with libc++, and use -Og.

Alternative ideas

Enabling optimizations locally (#pragma GCC optimize ("-O3") etc) is rather unreliable. It did not work for me. The most likely reason is that the optimization flag is not propagated to the instantiation of std::vector because its definition is somewhere else entirely. You'd probably need to compile the C++ standard library headers themselves with optimizations.

Another idea would be to use a different container library. For example, boost has a vector class. But I have not checked if its debug performance would be better.
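
For completeness, a minimal sketch of what that would look like with Boost.Container (assuming Boost is available; as said, I have not measured whether its unoptimized code paths are any faster with a custom allocator):

#include <boost/container/vector.hpp>
#include <cstdint>
#include <memory>

template<typename T>
class CustomAllocatorType : public std::allocator<T> {
public:
  template<class U> struct rebind { typedef CustomAllocatorType<U> other; };
};

int main() {
  // Same usage pattern as with std::vector.
  boost::container::vector<std::uint8_t, CustomAllocatorType<std::uint8_t>> buffer;
  buffer.reserve(1'000'000);
  buffer.resize(1'000'000);
  return 0;
}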

Confirm answered 30/4, 2022 at 13:54 Comment(3)
@FrançoisAndrieux Thank you! To my shame, I have to admit that I was not aware of the problems with the term "STL". I always thought it was synonymous with the C++ standard library, but have never really thought about it. My bad. Also, yes, I meant C++17, not C++20, that was a typo. Moreover, I now also understand why the hack fails on quick bench: Quick bench force-includes <vector>. I have updated my answer accordingly.Confirm
I also added a more viable solution: -Og has a great impact, as does switching to clang. So I think the best solution would be to use clang with -Og.Confirm
It blows my mind how deep you managed to dive into this. Thank you very much for the details! I'll see if I can maybe just use clang for now. Seems to be the nicest solution in my case.Chew
