Performance of resizing std::vector<std::unique_ptr<T>>

Asked 13/7, 2017 at 19:1 Answered 13/7, 2017 at 20:55

Solved c++c++11 vector compiler-optimization unique-ptr

The general conception seems to be that std::unique_ptr has no time overhead compared to properly used owning raw pointers, given sufficient optimization.

But what about using std::unique_ptr in compound data structures, in particular std::vector<std::unique_ptr<T>>? For instance, resizing the underlying data of a vector, which can happen during push_back. To isolate the performance, I loop around pop_back, shrink_to_fit, emplace_back:

#include <chrono>
#include <vector>
#include <memory>
#include <iostream>

constexpr size_t size = 1000000;
constexpr size_t repeat = 1000;
using my_clock = std::chrono::high_resolution_clock;

template<class T>
auto test(std::vector<T>& v) {
    v.reserve(size);
    for (size_t i = 0; i < size; i++) {
        v.emplace_back(new int());
    }
    auto t0 = my_clock::now();
    for (int i = 0; i < repeat; i++) {
        auto back = std::move(v.back());
        v.pop_back();
        v.shrink_to_fit();
        if (back == nullptr) throw "don't optimize me away";
        v.emplace_back(std::move(back));
    }
    return my_clock::now() - t0;
}

int main() {
    std::vector<std::unique_ptr<int>> v_u;
    std::vector<int*> v_p;

    auto millis_p = std::chrono::duration_cast<std::chrono::milliseconds>(test(v_p));
    auto millis_u = std::chrono::duration_cast<std::chrono::milliseconds>(test(v_u));
    std::cout << "raw pointer: " << millis_p.count() << " ms, unique_ptr: " << millis_u.count() << " ms\n";
    for (auto p : v_p) delete p; // I don't like memory leaks ;-)
}

Compiling the code with -O3 -o -march=native -std=c++14 -g with gcc 7.1.0, clang 3.8.0, and 17.0.4 on Linux on a Intel Xeon E5-2690 v3 @ 2.6 GHz (no turbo):

raw pointer: 2746 ms, unique_ptr: 5140 ms  (gcc)
raw pointer: 2667 ms, unique_ptr: 5529 ms  (clang)
raw pointer: 1448 ms, unique_ptr: 5374 ms  (intel)

The raw pointer version spends all it's time in an optimized memmove (intel seems to have a much better one than clang and gcc). The unique_ptr code seems to first copy over the vector data from one memory block to the other and assign the original one with zero - all in a horribly un-optimized loop. And then it loops over the original block of data again to see if any of those that were just zero'd are nonzero and need to be deleted. The full gory detail can be seen on godbolt. The question is not how the compiled code differs, that is pretty clear. The question is why the compiler fails to optimize what is generally regarded as a no-extra-overhead abstraction.

Trying to understand how the compilers reason about handling std::unique_ptr, I was looking a bit more at isolated code. For instance:

void foo(std::unique_ptr<int>& a, std::unique_ptr<int>& b) {
  a.release();
  a = std::move(b);
}

or the similar

a.release();
a.reset(b.release());

none of the x86 compilers seem to be able to optimize away the senseless if (ptr) delete ptr;. The Intel compiler even gives the delete a 28 % chance. Surprisingly, the delete check is consistently omitted for:

auto tmp = b.release();
a.release();
a.reset(tmp);

These bits are not the main aspect of this question, but all of this makes me feel that I am missing something.

Why do various compilers fail to optimize reallocation within std::vector<std::unique_ptr<int>>? Is there anything in the standard that prevents generating code as efficient as with raw pointers? Is this an issue with the standard library implementation? Or are the compilers just not sufficiently clever enough (yet)?

What can one do to avoid performance impact compared to using raw pointers?

Note: Assume that T is polymorphic and expensive to move, so std::vector<T> is not an option.

Fasten answered 13/7, 2017 at 19:1 Comment(12)

A default constructed std::unique_ptr probably results in an almost NOP. – Hilarius 13/7, 2017 at 19:3

I'm pretty sure that if you declare back as volatile, the compiler won't optimize it away and you won't need that exception trick. – Picasso 13/7, 2017 at 19:4

Might not be anything but do note that shrink_to_fit() is not required to actually do anything. It is possible it does stuff in the unique_ptr case but not the raw pointer case. – Sharlasharleen 13/7, 2017 at 19:8

@Sharlasharleen I'm aware, but very confident from timing / and perf report that it applies in both versions. It's seemed like the simplest way to isolate the resizing. – Fasten 13/7, 2017 at 19:17

Reasoning about huge buffers and their unform state is hard? It would be easier to teach the compiler somehow that std::unique_ptr can be destructively nothrow bit-blit moved with a source zero, and that vectors can use destructive nothrow bit-blit moves, than teach it that the state of 100000 elements pointed to by the following pointers can be proven by previous lines of code. I did see a nothrow move-and-destroy trait in one standard proposal a bit ago, I don't know its status. – Denice 13/7, 2017 at 19:24

@Fasten Right, my bad. Comment removed. – Tman 13/7, 2017 at 19:25

I do call shrink_to_fit() but never in a tight loop. I wonder what the timings would be like when moving a large number of pointers to a parallel vector before shrinking and replacing. – Impossibility 13/7, 2017 at 19:28

godbolt is useful for exploring these kinds of questions (if you are assembler-literate). – Odd 13/7, 2017 at 19:29

@Justin no memory is leaked in either version. std::unique_ptr<>::pop_back doesn't delete when the pointer is nullptr - as it is when it's been moved from. – Fasten 13/7, 2017 at 19:29

The culprit seems to be shrink_to_fit here. If you remove the call, the unique_ptr case ends up being faster. Both versions speed up considerably, though. coliru.stacked-crooked.com/a/937e1c8218b8f191 – Ludewig 13/7, 2017 at 19:34

It's possible that the unique_ptr case has to move all the elements by calling the move constructors, whereas the compiler was able to move the T*s by simply calling memcpy – Ludewig 13/7, 2017 at 19:35

@Ludewig - yup, shrink_to_fit is the problem here because it is the one "multi-element" resize type operation. Such operations benefit from the type-traits based dispatching to memmove that works for raw pointers, primitive types and trivially copyable types, but not for unique_ptr - details in my answer. – Courtier 13/7, 2017 at 21:32

The claim that unique_ptr performs as well as a raw pointer after optimization mostly applies only to the basic operations on a single pointer, such as creation, dereferencing, assignment of a single pointer and deletion. Those operations are defined simply enough that an optimizing compiler can usually make the required transformations such that the resulting code is equivalent (or nearly so) in performance to the raw version⁰.

One place this falls apart is especially higher level language-based optimizations on array-based containers such as std::vector, as you have noted with your test. These containers usually use source level optimizations which depend on type traits to determine at compile time if a type can safely be copied using a byte-wise copy such as memcpy, and delegate to such a method if so, or otherwise fall back to an element-wise copy loop.

To be safely copyable with memcpy an object must be trivially copyable. Now std::unique_ptr is not trivially copyable since indeed it fails several of the requirements such as having only trivial or deleted copy and move constructors. The exact mechanism depends on the standard library involved, but in general a quality std::vector implementation will end up calling a specialized form of something like std::uninitialized_copy for trivially-copyable types that just delegates to memmove.

The typical implementation details are quite tortured, but for libstc++ (used by gcc) you can see the high-level divergence in std::uninitialized_copy:

 template<typename _InputIterator, typename _ForwardIterator>
 inline _ForwardIterator
 uninitialized_copy(_InputIterator __first, _InputIterator __last,
                    _ForwardIterator __result)
 {
        ...
   return std::__uninitialized_copy<__is_trivial(_ValueType1)
                                    && __is_trivial(_ValueType2)
                                    && __assignable>::
     __uninit_copy(__first, __last, __result);
 }

From there you can take my word that many of the std::vector "movement" methods end up here, and that __uninitialized_copy<true>::__uinit_copy(...) ultimately calls memmove while the <false> version doesn't - or you can trace through the code yourself (but you already saw the result in your benchmark).

Ultimately then, you end up with a several loops that perform the required copy steps for non-trivial objects, such as calling the move constructor of the destination object, and subsequently calling the destructor of all the source objects. These are separate loops and even modern compilers will pretty much not be able to reason about something like "OK, in the first loop I moved all the destination objects so their ptr member will be null, so the second loop is a no-op". Finally, to equal the speed of raw pointers, not only would compilers need to optimize across these two loops, they would need to have a transformation which recognizes that the whole thing can be replaced by memcpy or memmove².

So one answer to your question is that compilers just aren't smart enough to do this optimization, but it's largely because the "raw" version has a lot of compile-time help to skip the need for this optimization entirely.

Loop Fusion

As mentioned the existing vector implementations implement a resize-type operation in two separate loops (in addition to non-loop work such as allocating the new storage and freeing the old storage):

Copying the source objects into the newly allocated destination array (conceptually using something like placement new calling the move constructor).
Destroying the source objects in the old region.

Conceptually you could imagine an alternative way: doing this all in one loop, copying each element and them immediately destroying it. It possible that a compiler could even notice that the two loops iterate over the same set of values and fuse the two loops into one. [Apparently], howevever, (https://gcc.gnu.org/ml/gcc/2015-04/msg00291.html) gcc doesn't do any loop fusion today, and nor do clang or icc if you believe this test.

So then we are left trying to put the loops together explicitly at the source level.

Now the two-loop implementation helps preserve the exception safety contract of the operation by not destroying any source objects until we know the construction part of the copy has completed, but it also helps to optimize the copy and destruction when we have trivially-copyable and trivially-destructible objects, respectively. In particular, with simple-traits based selection we can replace the copy with a memmove and the destruction loop can be elided entirely³.

So the two-loop approach helps when those optimizations apply, but it actually hurts in the general case of objects which are neither trivially copyable or destructible. It means you need two passes over the objects and you lose the opportunity to optimize and eliminate code between the copy of the object and it's subsequent destruction. In the unique_ptr case you lose the ability for the compiler to propagate the knowledge that the source unique_ptr will have a NULL internal ptr member and hence skip the if (ptr) delete ptr check entirely⁴.

Trivially Movable

Now one might ask whether we could apply the same type-traits compile-time optimization to the unique_ptr case. For example, one might look at the trivially copyable requirements and see that they are perhaps too strict for the common move operations in std::vector. Sure, a unique_ptr is evidently not trivially copyable since a bit-wise copy would leave both the source and destination object owing the same pointer (and result in double-deletion), but it seems that it should be bit-wise movable: if you move a unique_ptr from one area of memory to another, such that you no longer consider the source as a live object (and hence won't call its destructor) it should "just work", for the typical unique_ptr implementation.

Unfortunately, no such "trivial move" concept exists, although you could try to roll your own. There seems to be an open debate about whether this is UB or not for objects that can be byte-wise copied and do not depend on their constructor or destructor behavior in the move scenario.

You could always implement your own trivially movable concept, which would be something like (a) the object has a trivial move constructor and (b) when used as the source argument of the move constructor the object is left in a state where it's destructor has no effect. Note that such a definition is currently mostly useless, since "trivial move constructor" (basically element-wise copy and nothing else) is not consistent with any modification of the source object. So for example, a trivial move constructor cannot set the ptr member of the source unique_ptr to zero. So you'd need to jump though some more hoops such as introducing the concept of a destructive move operation which leaves the source object destroyed, rather than in a valid-but-unspecified state.

You can find some more detailed discussion of this "trivially movable" on this thread on the ISO C++ usenet discussion group. In the particular, in the linked reply, the exact issue of vectors of unique_ptr is addressed:

It turns out many smart pointers (unique_ptr and shared_ptr included) fall into all three of those categories and by applying them you can have vectors of smart pointers with essentially zero overhead over raw pointers even in non-optimized debug builds.

See also the relocator proposal.

⁰ Although the non-vector examples at the end of your question show that this isn't always the case. Here it is due to possible aliasing as zneak explains in his answer. Raw pointers will avoid many of these aliasing issues since they lack the indirection that unique_ptr has (e.g, you pass a raw pointer by value, rather than a structure with a pointer by reference) and can often omit the if (ptr) delete ptr check entirely.

² This is actually harder than you might think, because memmove, for example, has subtly different semantics than an object copy loop, when the source and destination overlap. Of course the high level type traits code that works for raw points knows (by contract) that there is no overlap, or the behavior of memmove is consistent even if there is overlap, but proving the same thing at some later arbitrary optimization pass may be much harder.

³ It is important to note that these optimizations are more or less independent. For example, many objects are trivially destructible that at are not trivially copyable.

⁴ Although in my test neither gcc nor clang were able to suppress the check, even with __restrict__ applied, apparently due to insufficiently powerful aliasing analysis, or perhaps because std::move strips the "restrict" qualifier somehow.

Courtier answered 13/7, 2017 at 20:2 Comment(7)

Great answer. Thank you. – Toothless 13/7, 2017 at 20:32

You make some very good points. Do you think that this could be improved by specializing certain template around vector functions for unique_ptr - compensating for it's lack of being trivial copyable by exploiting knowledge about the implementation of unique_ptr within libstdc++? – Fasten 13/7, 2017 at 20:56

@Fasten - possibly. I added a bit more to the section about trivially movable just now, starting with "You could always implement your own..." - but as I note there it would require a change to the language and as some issues. You could always try to specialize the vector<> functions (or the various std::... helper functions they call) on unique_ptr, and just force them to the do the "simple" thing, and I suspect it would work, but this would still be technically undefined behavior because the standard doesn't guarantee that memcpy leaves the source object in a usable state. – Courtier 13/7, 2017 at 21:5

And this is the gist of my beef with C++ move semantics; by requiring that the source be left in a destructible/assignable state, it flushed performance down the drain :( – Digger 14/7, 2017 at 7:8

@MatthieuM. - yes, but move semantics would generally be less useful for a lot of scenarios (e.g., implementing swap, etc). Also, in general, you still want to call the destructor on many types of movable objects, so you'd still want that option, at least. You'd also want to avoid the case where compiler had to remember whether to destroy an object with automatic duration or not depending on the flow though the function. It would get messy... – Courtier 14/7, 2017 at 21:51

@BeeOnRope: Rust manages, and the implementation is rather conceptually simple. For each item on the stack that may or may not be moved, keep a single bit on/off. At the end of the scope, check the bit to know whether to execute the destructor. – Digger 15/7, 2017 at 9:55

This paper about noop constructors / destructors with the specific use-case of destructive move seems to be the latest effort for improving on that. Unfortunately there doesn't seem to be a lot of activity around that currently. The previous effort was not well received because it created some object lifetime vagueness. Seems like it's not getting the attention it deserves :-(. – Fasten 15/7, 2017 at 11:14

I don't have a precise answer for what is biting you in the back with vectors; looks like BeeOnRope might already have one for you.

Luckily, I can tell you what's biting you in the back for your micro-example involving different ways to reset pointers: alias analysis. Specifically, the compilers are unable to prove (or unwilling to infer) that the two unique_ptr references don't overlap. They force themselves to reload the unique_ptr value in case the write to the first one has modified the second one. baz doesn't suffer from it because the compiler can prove that neither parameter, in a well-formed program, could possibly alias with tmp, which has function-local automatic storage.

You can verify this by adding the __restrict__ keyword (which, as the double underscore somewhat implies, is not standard C++) to either unique_ptr reference parameter. That keyword informs the compiler that the reference is the only reference through which that memory can possibly be accessed, and therefore there is no risk that anything else can alias with it. When you do it, all three versions of your function compile to the same machine code and don't bother checking if the unique_ptr needs to be deleted.

Rorke answered 13/7, 2017 at 20:55 Comment(2)

Great addition, I was suspecting something with aliasing, but __restrict__ really shows it. Funny though that if the compiler knows for sure it's an alias - optimization does work. – Fasten 13/7, 2017 at 21:2

I'm not familiar with GCC, but LLVM's alias queries have three possible results: "not aliasing", "may alias", and "always alias". You can't really do anything with "may alias" (which is unfortunately the result of the majority of alias queries, because the problem is extremely hard to solve), but both "not aliasing" and "always aliasing" offer different optimization possibilities. – Rorke 13/7, 2017 at 21:6

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Loop Fusion

Trivially Movable

Recommended topics

Hot tags