Reason for the performance decrease
gcc's libstdc++ uses certain performance improvements if the allocator is `std::allocator`. Your `CustomAllocatorType` is a different type than `std::allocator`, meaning that the optimizations are disabled. Note that I am not talking about compiler optimizations, but rather that gcc's implementation of the C++ standard library implements overloads or specializations specifically for `std::allocator`.
To name an example relevant to your example code, `std::vector::resize()` internally calls `__uninitialized_default_n_a()`, which has a special overload for `std::allocator`. The special overload bypasses the allocator entirely. If you use `CustomAllocatorType`, the generic version is used, which calls the allocator for every single element. This costs a lot of time. Another function with a special definition that is relevant to your simple code example is `_Destroy()`.
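To illustrate the dispatch, here is a simplified sketch of what the two paths look like. This is illustrative only, not the exact libstdc++ source (the real functions live in `<bits/stl_uninitialized.h>` and carry exception-safety machinery omitted here):

```cpp
#include <memory>

// Generic path: one allocator_traits::construct() call per element.
template<typename It, typename Size, typename Alloc>
It uninitialized_default_n_a(It first, Size n, Alloc& alloc)
{
    It cur = first;
    for (; n > 0; --n, ++cur)
        std::allocator_traits<Alloc>::construct(alloc, std::addressof(*cur));
    return cur;
}

// Overload for std::allocator: bypasses the allocator entirely and falls
// through to bulk value-initialization, which the library can turn into a
// single memset() for trivial element types.
template<typename It, typename Size, typename T>
It uninitialized_default_n_a(It first, Size n, std::allocator<T>&)
{
    return std::uninitialized_value_construct_n(first, n);
}
```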
Put another way, gcc's implementation of the C++ standard library implements measures to ensure that optimal code is generated in cases where it is known to be safe. This works regardless of compiler optimizations.
If the non-optimized code paths are taken and you enable compiler optimizations (e.g. `-O3`), the compiler is often able to recognize patterns in the non-optimal code (such as initializing successive trivial elements) and can optimize everything away, so that you end up with (more or less) the same instructions.
C++20 vs C++17 and why your `CustomAllocatorType` is broken
As noted in the comments, the performance decrease when using `CustomAllocatorType` only occurs in C++20 but not in C++17.
To understand why, note that gcc's `std::vector` implementation does not use the `Allocator` from the declaration `std::vector<T, Allocator>` directly as its allocator, i.e. in your case `CustomAllocatorType`. Rather, it uses `std::allocator_traits<Allocator>::rebind_alloc<T>` (see here and here). Also see e.g. this post about rebind for some more information.
Since you did not define a specialization `std::allocator_traits<CustomAllocatorType>`, it uses the generic one. The standard says:

> `rebind_alloc<T>`: `Alloc::rebind<T>::other` if present, otherwise `Alloc<T, Args>` if this `Alloc` is `Alloc<U, Args>`

I.e. the generic one attempts to delegate to your allocator, if possible. Now, your allocator `CustomAllocatorType` inherits from `std::allocator`. And here comes the important difference between C++17 and C++20: `std::allocator::rebind` was removed in C++20. Hence:
- C++17: `CustomAllocatorType::rebind` is inherited from `std::allocator` and thus defined, and its `other` type is `std::allocator`. Therefore, `std::allocator_traits<CustomAllocatorType>::rebind_alloc` is `std::allocator`, meaning that `std::vector` ends up actually using `std::allocator` instead of `CustomAllocatorType`. If you pass a `CustomAllocatorType` instance to the `std::vector` constructor, you end up with slicing.
- C++20: `CustomAllocatorType::rebind` is not defined. Thus, `std::allocator_traits<CustomAllocatorType>::rebind_alloc` is `CustomAllocatorType`, and `std::vector` ends up using `CustomAllocatorType`.
So the C++17 version uses `std::allocator` and thus enjoys the library-based optimizations described above, while the C++20 version does not.
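You can verify this behavior with a small sketch (assuming a `CustomAllocatorType` that, as in the question, merely inherits from `std::allocator`):

```cpp
#include <cstdint>
#include <memory>
#include <type_traits>

template<typename T>
class CustomAllocatorType : public std::allocator<T> {};

using Rebound = std::allocator_traits<
    CustomAllocatorType<std::uint8_t>>::rebind_alloc<std::uint8_t>;

#if __cplusplus > 201703L
// C++20: std::allocator::rebind is gone, so the generic fallback kicks in.
static_assert(std::is_same_v<Rebound, CustomAllocatorType<std::uint8_t>>);
#else
// C++17: rebind is inherited from std::allocator, so the custom type is lost.
static_assert(std::is_same_v<Rebound, std::allocator<std::uint8_t>>);
#endif
```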
Your code is simply incorrect, or at least the C++17 version is: `std::vector` does not use your allocator at all in C++17. You can also notice this if you attempt to call `buffer.get_allocator()` in your example, which will fail to compile in C++17 because it tries to convert `std::allocator` (as used internally) to `CustomAllocatorType`.
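For example (assuming a `buffer` declared as in the question):

```cpp
std::vector<std::uint8_t, CustomAllocatorType<std::uint8_t>> buffer;
// C++17: fails to compile, because the internally stored std::allocator
// cannot be converted to the declared allocator_type CustomAllocatorType.
auto alloc = buffer.get_allocator();
```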
I think the correct way to fix the issue is to define `CustomAllocatorType::rebind` instead of specializing `std::allocator_traits` (see here and here), like so:
```cpp
template<typename T>
class CustomAllocatorType : public std::allocator<T>
{
public: // rebind must be accessible to std::allocator_traits
    template<class U> struct rebind {
        typedef CustomAllocatorType<U> other;
    };
};
```
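With `rebind` defined, both standards resolve to the custom type, which you can check with the same technique as above (reusing the headers from that sketch):

```cpp
static_assert(std::is_same_v<
    std::allocator_traits<CustomAllocatorType<std::uint8_t>>
        ::rebind_alloc<std::uint8_t>,
    CustomAllocatorType<std::uint8_t>>);
```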
Of course, doing so means that the C++17 version will be slow in debug builds, but it will actually work.
I think this also shows again the general rule that inheriting from C++ standard library types is usually a bad idea. If `CustomAllocatorType` did not inherit from `std::allocator`, the problem would not appear in the first place (also because you'd have needed to think about how to define the allocator's members correctly from the start); see the sketch below.
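A minimal sketch of such a standalone allocator (the member set is the C++17/C++20 minimum; `std::allocator_traits` fills in the rest):

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

template<typename T>
struct CustomAllocatorType {
    using value_type = T;

    CustomAllocatorType() noexcept = default;
    template<typename U>
    CustomAllocatorType(const CustomAllocatorType<U>&) noexcept {}

    T* allocate(std::size_t n) {
        if (auto p = static_cast<T*>(std::malloc(n * sizeof(T))))
            return p;
        throw std::bad_alloc{};
    }
    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

// All instances are interchangeable, so they compare equal.
template<typename T, typename U>
bool operator==(const CustomAllocatorType<T>&, const CustomAllocatorType<U>&) noexcept { return true; }
template<typename T, typename U>
bool operator!=(const CustomAllocatorType<T>&, const CustomAllocatorType<U>&) noexcept { return false; }
```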
Improving performance
Assuming the allocator was fixed for C++17 or you use C++20, you get the bad performance in debug builds because the library implementation uses the generic versions of the above-mentioned functions to fill and destroy data. Unfortunately, all of this is an implementation detail of the library, meaning that there is no nice standard way to enforce the generation of good code.
Hacky solution
A hack that works in your trivial example (and probably only there!) would be to define custom overloads of the functions in question, e.g.:

```cpp
#include <bits/stl_uninitialized.h>
#include <cstdint>
#include <cstdlib>

// Must be defined BEFORE including <vector>!
namespace std {
    template<typename _ForwardIterator, typename _Size, typename _Tp>
    inline _ForwardIterator
    __uninitialized_default_n_a(_ForwardIterator __first, _Size __n,
                                CustomAllocatorType<_Tp>&)
    { return std::__uninitialized_default_n(__first, __n); }

    template<typename _ForwardIterator, typename _Tp>
    _GLIBCXX20_CONSTEXPR inline void
    _Destroy(_ForwardIterator __first, _ForwardIterator __last,
             CustomAllocatorType<_Tp>&)
    { _Destroy(__first, __last); }
}
```
These are copy & paste from gcc's `std::allocator` overloads (here and here), but overloaded for `CustomAllocatorType`. More special overloads would be required in real applications (e.g. for `is_copy_constructible` and `is_move_constructible` or `__relocate_a_1`; no idea how many more). Defining the above two functions before the include of `<vector>` leads to decent performance in debug for your minimal example. At least it does so for me locally using gcc 11.2. It does not work on quick bench, because quick bench force-includes `benchmark/benchmark.h` before any of your code, which in turn includes `<vector>` (also compare the second bullet point coming next).
This hack is awful on multiple levels:
- It is absolutely non-standard. It only works with libstdc++ and might break at any up- or downgrade of the library version.
- You also need to ensure that the overloads are defined before the `<vector>` header is included, because otherwise they will not be picked up. The reason is that the calls to `std::__uninitialized_default_n_a()` are qualified, i.e. they are `std::__uninitialized_default_n_a(arguments)` rather than `__uninitialized_default_n_a(arguments)`, meaning that overloads declared after the definition of `std::vector` are not found (cf. e.g. this post or this one). As already explained above, this is the reason why the hack fails on quick bench. Also, if you mess this up in some places, you might violate the one definition rule (which will probably lead to more weirdness).
- The example hack assumes that allocating and freeing memory does not require the use of `CustomAllocatorType`, just like `std::allocator`. I highly doubt that this holds for your true `CustomAllocatorType` implementation. But maybe you could actually implement e.g. `__uninitialized_default_n_a()` properly and more efficiently for your `CustomAllocatorType` by calling an appropriate function on your allocator (see the sketch after this list).
I do not recommend doing this. But depending on the use case, it might be a viable solution.
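For illustration, such an overload could look like this; `default_construct_n()` is a purely hypothetical member of `CustomAllocatorType`, not an existing libstdc++ or standard API:

```cpp
namespace std {
    template<typename _ForwardIterator, typename _Size, typename _Tp>
    inline _ForwardIterator
    __uninitialized_default_n_a(_ForwardIterator __first, _Size __n,
                                CustomAllocatorType<_Tp>& __alloc)
    {
        // Forward to an assumed bulk-initialization facility of the
        // allocator instead of constructing element by element.
        return __alloc.default_construct_n(__first, __n);
    }
}
```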
Enabling `-Og`
I do get notably better performance with gcc when compiling everything with `-Og`. It attempts to perform some optimizations without interfering with the debugging experience too much. In your trivial example the performance is improved from 160x slower to 5x slower compared to the `std::allocator` version. So if you cannot change the compiler, I think that might be the best way to go.
Using clang
Switching to clang (without any optimization flags) seems to improve the performance somewhat. With libstdc++, the custom allocator version is "only" 90x slower.
Surprisingly, with libc++, quick bench reports roughly the same performance. Unfortunately, I cannot reproduce this locally: libc++ also takes ages. No idea why the results differ locally and on quick bench.
But I can reproduce that clang optimizes with `-Og` much better than gcc and gives roughly the same performance with the custom allocator. This holds both with libstdc++ and libc++.
So my suggestion is to use clang, possibly with libc++, and use `-Og`.
Alternative ideas
Enabling optimizations locally (`#pragma GCC optimize ("-O3")` etc.) is rather unreliable. It did not work for me. The most likely reason is that the optimization flag is not propagated to the instantiation of `std::vector` because its definition is somewhere else entirely. You'd probably need to compile the C++ standard library headers themselves with optimizations.
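What I tried looks roughly like this (a sketch, assuming `CustomAllocatorType` is already defined; as said, it did not help in my tests):

```cpp
#include <cstdint>
#include <vector>

#pragma GCC push_options
#pragma GCC optimize("-O3")

void fill_buffer()
{
    std::vector<std::uint8_t, CustomAllocatorType<std::uint8_t>> buffer;
    // The vector internals were declared before the pragma and may be
    // instantiated without it, so the pragma often has no effect here.
    buffer.resize(1'000'000);
}

#pragma GCC pop_options
```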
Another idea would be to use a different container library. For example, Boost has a `vector` class. But I have not checked whether its debug performance would be better.
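For example (a drop-in sketch, assuming Boost.Container is available):

```cpp
#include <boost/container/vector.hpp>
#include <cstdint>

int main()
{
    // Same interface as std::vector, but a different implementation,
    // which may or may not generate better debug code.
    boost::container::vector<std::uint8_t,
                             CustomAllocatorType<std::uint8_t>> buffer;
    buffer.resize(1000);
}
```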