The claim that unique_ptr
performs as well as a raw pointer after optimization mostly applies only to the basic operations on a single pointer, such as creation, dereferencing, assignment of a single pointer and deletion. Those operations are defined simply enough that an optimizing compiler can usually make the required transformations such that the resulting code is equivalent (or nearly so) in performance to the raw version0.
One place this falls apart is especially higher level language-based optimizations on array-based containers such as std::vector
, as you have noted with your test. These containers usually use source level optimizations which depend on type traits to determine at compile time if a type can safely be copied using a byte-wise copy such as memcpy
, and delegate to such a method if so, or otherwise fall back to an element-wise copy loop.
To be safely copyable with memcpy
an object must be trivially copyable. Now std::unique_ptr
is not trivially copyable since indeed it fails several of the requirements such as having only trivial or deleted copy and move constructors. The exact mechanism depends on the standard library involved, but in general a quality std::vector
implementation will end up calling a specialized form of something like std::uninitialized_copy
for trivially-copyable types that just delegates to memmove
.
The typical implementation details are quite tortured, but for libstc++
(used by gcc
) you can see the high-level divergence in std::uninitialized_copy
:
template<typename _InputIterator, typename _ForwardIterator>
inline _ForwardIterator
uninitialized_copy(_InputIterator __first, _InputIterator __last,
_ForwardIterator __result)
{
...
return std::__uninitialized_copy<__is_trivial(_ValueType1)
&& __is_trivial(_ValueType2)
&& __assignable>::
__uninit_copy(__first, __last, __result);
}
From there you can take my word that many of the std::vector
"movement" methods end up here, and that __uninitialized_copy<true>::__uinit_copy(...)
ultimately calls memmove
while the <false>
version doesn't - or you can trace through the code yourself (but you already saw the result in your benchmark).
Ultimately then, you end up with a several loops that perform the required copy steps for non-trivial objects, such as calling the move constructor of the destination object, and subsequently calling the destructor of all the source objects. These are separate loops and even modern compilers will pretty much not be able to reason about something like "OK, in the first loop I moved all the destination objects so their ptr
member will be null, so the second loop is a no-op". Finally, to equal the speed of raw pointers, not only would compilers need to optimize across these two loops, they would need to have a transformation which recognizes that the whole thing can be replaced by memcpy
or memmove
2.
So one answer to your question is that compilers just aren't smart enough to do this optimization, but it's largely because the "raw" version has a lot of compile-time help to skip the need for this optimization entirely.
Loop Fusion
As mentioned the existing vector
implementations implement a resize-type operation in two separate loops (in addition to non-loop work such as allocating the new storage and freeing the old storage):
- Copying the source objects into the newly allocated destination array (conceptually using something like placement new calling the move constructor).
- Destroying the source objects in the old region.
Conceptually you could imagine an alternative way: doing this all in one loop, copying each element and them immediately destroying it. It possible that a compiler could even notice that the two loops iterate over the same set of values and fuse the two loops into one. [Apparently], howevever, (https://gcc.gnu.org/ml/gcc/2015-04/msg00291.html) gcc
doesn't do any loop fusion today, and nor do clang
or icc
if you believe this test.
So then we are left trying to put the loops together explicitly at the source level.
Now the two-loop implementation helps preserve the exception safety contract of the operation by not destroying any source objects until we know the construction part of the copy has completed, but it also helps to optimize the copy and destruction when we have trivially-copyable and trivially-destructible objects, respectively. In particular, with simple-traits based selection we can replace the copy with a memmove
and the destruction loop can be elided entirely3.
So the two-loop approach helps when those optimizations apply, but it actually hurts in the general case of objects which are neither trivially copyable or destructible. It means you need two passes over the objects and you lose the opportunity to optimize and eliminate code between the copy of the object and it's subsequent destruction. In the unique_ptr
case you lose the ability for the compiler to propagate the knowledge that the source unique_ptr
will have a NULL
internal ptr
member and hence skip the if (ptr) delete ptr
check entirely4.
Trivially Movable
Now one might ask whether we could apply the same type-traits compile-time optimization to the unique_ptr
case. For example, one might look at the trivially copyable requirements and see that they are perhaps too strict for the common move operations in std::vector
. Sure, a unique_ptr
is evidently not trivially copyable since a bit-wise copy would leave both the source and destination object owing the same pointer (and result in double-deletion), but it seems that it should be bit-wise movable: if you move a unique_ptr
from one area of memory to another, such that you no longer consider the source as a live object (and hence won't call its destructor) it should "just work", for the typical unique_ptr
implementation.
Unfortunately, no such "trivial move" concept exists, although you could try to roll your own. There seems to be an open debate about whether this is UB or not for objects that can be byte-wise copied and do not depend on their constructor or destructor behavior in the move scenario.
You could always implement your own trivially movable concept, which would be something like (a) the object has a trivial move constructor and (b) when used as the source argument of the move constructor the object is left in a state where it's destructor has no effect. Note that such a definition is currently mostly useless, since "trivial move constructor" (basically element-wise copy and nothing else) is not consistent with any modification of the source object. So for example, a trivial move constructor cannot set the ptr
member of the source unique_ptr
to zero. So you'd need to jump though some more hoops such as introducing the concept of a destructive move operation which leaves the source object destroyed, rather than in a valid-but-unspecified state.
You can find some more detailed discussion of this "trivially movable" on this thread on the ISO C++ usenet discussion group. In the particular, in the linked reply, the exact issue of vectors of unique_ptr
is addressed:
It turns out many smart pointers (unique_ptr and shared_ptr included)
fall into all three of those categories and by applying them you can
have vectors of smart pointers with essentially zero overhead over raw
pointers even in non-optimized debug builds.
See also the relocator proposal.
0 Although the non-vector examples at the end of your question show that this isn't always the case. Here it is due to possible aliasing as zneak explains in his answer. Raw pointers will avoid many of these aliasing issues since they lack the indirection that unique_ptr
has (e.g, you pass a raw pointer by value, rather than a structure with a pointer by reference) and can often omit the if (ptr) delete ptr
check entirely.
2 This is actually harder than you might think, because memmove
, for example, has subtly different semantics than an object copy loop, when the source and destination overlap. Of course the high level type traits code that works for raw points knows (by contract) that there is no overlap, or the behavior of memmove
is consistent even if there is overlap, but proving the same thing at some later arbitrary optimization pass may be much harder.
3 It is important to note that these optimizations are more or less independent. For example, many objects are trivially destructible that at are not trivially copyable.
4 Although in my test neither gcc
nor clang
were able to suppress the check, even with __restrict__
applied, apparently due to insufficiently powerful aliasing analysis, or perhaps because std::move
strips the "restrict" qualifier somehow.
std::unique_ptr
probably results in an almost NOP. – Hilariusback
asvolatile
, the compiler won't optimize it away and you won't need that exception trick. – Picassoshrink_to_fit()
is not required to actually do anything. It is possible it does stuff in theunique_ptr
case but not the raw pointer case. – Sharlasharleenperf report
that it applies in both versions. It's seemed like the simplest way to isolate the resizing. – Fastenstd::unique_ptr
can be destructively nothrow bit-blit moved with a source zero, and that vectors can use destructive nothrow bit-blit moves, than teach it that the state of 100000 elements pointed to by the following pointers can be proven by previous lines of code. I did see a nothrow move-and-destroy trait in one standard proposal a bit ago, I don't know its status. – Deniceshrink_to_fit()
but never in a tight loop. I wonder what the timings would be like when moving a large number of pointers to a parallel vector before shrinking and replacing. – Impossibilitystd::unique_ptr<>::pop_back
doesn't delete when the pointer isnullptr
- as it is when it's been moved from. – Fastenshrink_to_fit
here. If you remove the call, the unique_ptr case ends up being faster. Both versions speed up considerably, though. coliru.stacked-crooked.com/a/937e1c8218b8f191 – Ludewigunique_ptr
case has to move all the elements by calling the move constructors, whereas the compiler was able to move theT*
s by simply callingmemcpy
– Ludewigshrink_to_fit
is the problem here because it is the one "multi-element" resize type operation. Such operations benefit from the type-traits based dispatching tomemmove
that works for raw pointers, primitive types and trivially copyable types, but not forunique_ptr
- details in my answer. – Courtier