Why would I std::move an std::shared_ptr?

8

224

I have been looking through the Clang source code and I found this snippet:

void CompilerInstance::setInvocation(
    std::shared_ptr<CompilerInvocation> Value) {
  Invocation = std::move(Value);
}

Why would I want to std::move an std::shared_ptr?

Is there any point transferring ownership on a shared resource?

Why wouldn't I just do this instead?

void CompilerInstance::setInvocation(
    std::shared_ptr<CompilerInvocation> Value) {
  Invocation = Value;
}
Leroi answered 26/1, 2017 at 10:11 Comment(0)
237

I think that the one thing the other answers did not emphasize enough is the point of speed.

The std::shared_ptr reference count is atomic, so increasing or decreasing it requires an atomic increment or decrement. An atomic increment/decrement can be far slower than its non-atomic counterpart (hundreds of cycles under contention rather than a few). Worse, if we increment and then immediately decrement the same counter, we end up with exactly the count we started with, having wasted time and resources in the process.

By moving the shared_ptr instead of copying it, we "steal" the already-counted reference and nullify the source shared_ptr. "Stealing" the reference requires no atomic operations, so it is much faster than copying the shared_ptr (which triggers an atomic reference increment, and later a matching decrement).

Do note that this technique is used purely as an optimization: copying (as you suggested) is just as correct functionality-wise.

Calutron answered 26/1, 2017 at 13:49 Comment(11)
Is it really hundred times faster? Do you have benchmarks for this?Pigeonhole
@Pigeonhole The assignment requires an atomic increment followed by an atomic decrement when Value goes out of scope. Atomic operations can take hundreds of clock cycles. So yes, it really is that much slower.Gonococcus
@Gonococcus that's the first I've heard the fetch and add operation (en.wikipedia.org/wiki/Fetch-and-add) could take hundreds of cycles more than a basic increment. Do you have a reference for that?Pigeonhole
@Pigeonhole : https://mcmap.net/q/14868/-what-is-the-cost-of-atomic-operations With register operations being a few cycles, 100's (100-300) of cycles for atomic fits the bill. Although metrics are from 2013, this still seems to be true especially for multi-socket NUMA systems.Xavler
I think that when there's no threading in the code, shared pointer won't have to use atomic operations (at least on GCC). (reddit.com/r/cpp/comments/9byrhy/how_fast_is_stdshared_ptr/…)Bradshaw
Sometimes you think there's no threading in your code... but then some darn library comes along and ruins it for u. Better to use const references and std::move... if it's clear and obvious that you can.... than rely on pointer reference counts.Malda
Made a small benchmark and found that it's about three times faster. Cost of copy vs move std::shared_ptrLithomarge
@Adisak, If intent is to steal, isn't it better to have unique pointer as argument rather than having a shared pointerCentring
@Centring - no... you need to assign a shared pointer from a shared pointer to properly maintain the lifetime management of the object. You cannot assign a shared pointer to a unique pointer argument and then assign that to a shared pointer. Shared pointer and unique pointer are mutually exclusive methods in object lifetime management.Gonococcus
See herbsutter.com/2013/06/05/… for a discussion of perf implications of passing a shared_ptr by value.Melancholy
I think: if you can use std::move, why not use a unique_ptr? I also think the caller of the ctor should move their resource into the ctor. The intended "atomic optimization" may already be eaten up by the copy on the caller's side. Another way (IMO) would be to use const shared_ptr& as the signature in the ctor.Mushroom
146

By using move you avoid increasing, and then immediately decreasing, the number of shares. That might save you some expensive atomic operations on the use count.

Dichotomy answered 26/1, 2017 at 10:14 Comment(4)
Is it not premature optimization?Pendulous
@Pendulous not if whoever put it there actually tested it.Submit
@Pendulous Premature optimisation is evil if it makes code harder to read or maintain. This one does neither, at least IMO.Durance
Indeed. This is not a premature optimisation. It is instead the sensible way to write this function.Zamir
83

Move operations (like the move constructor) for std::shared_ptr are cheap, as they basically "steal pointers": the raw object pointer and the pointer to the control block (which holds the reference count information) are transferred from source to destination, and the source is left empty.

Copy operations on std::shared_ptr, instead, invoke an atomic reference count increment (i.e. not just ++RefCount on an integer RefCount data member, but e.g. a call to InterlockedIncrement on Windows), which is more expensive than just stealing pointers/state.

So, analyzing the ref count dynamics of this case in detail:

// shared_ptr<CompilerInvocation> sp;
compilerInstance.setInvocation(sp);

If you pass sp by value and then take a copy inside the CompilerInstance::setInvocation method, you have:

  1. When entering the method, the shared_ptr parameter is copy constructed: ref count atomic increment.
  2. Inside the method's body, you copy the shared_ptr parameter into the data member: ref count atomic increment.
  3. When exiting the method, the shared_ptr parameter is destructed: ref count atomic decrement.

You have two atomic increments and one atomic decrement, for a total of three atomic operations.

Instead, if you pass the shared_ptr parameter by value and then std::move inside the method (as properly done in Clang's code), you have:

  1. When entering the method, the shared_ptr parameter is copy constructed: ref count atomic increment.
  2. Inside the method's body, you std::move the shared_ptr parameter into the data member: ref count does not change! You are just stealing pointers/state: no expensive atomic ref count operations are involved.
  3. When exiting the method, the shared_ptr parameter is destructed; but since you moved in step 2, there's nothing to destruct, as the shared_ptr parameter is not pointing to anything anymore. Again, no atomic decrement happens in this case.

Bottom line: in this case you get just one ref count atomic increment, i.e. just one atomic operation.
As you can see, this is much better than two atomic increments plus one atomic decrement (for a total of three atomic operations) for the copy case.

Kristy answered 26/1, 2017 at 11:7 Comment(4)
Also worth noting: why don't they just pass by const reference, and avoid the whole std::move stuff? Because pass-by-value also lets you pass in a raw pointer directly and there will just be one shared_ptr created.Cerebration
@JosephIreland Because you cannot move a const referenceValaria
@JosephIreland because if you call it as compilerInstance.setInvocation(std::move(sp)); then there will be no increment. You can get the same behavior by adding a overload that takes a shared_ptr<>&& but why duplicate when you don't need to.Whack
@BrunoFerreira I was answering my own question. You wouldn't need to move it because it's a reference, just copy it. Still only one copy instead of two. The reason they don't do that is because it would unnecessarily copy newly constructed shared_ptrs, e.g. from setInvocation(new CompilerInvocation), or as ratchet mentioned, setInvocation(std::move(sp)). Sorry if my first comment was unclear, I actually posted it by accident, before I had finished writing, and I decided to just leave itCerebration
27

There are two reasons for using std::move in this situation. Most responses addressed the issue of speed, but ignored the equally important issue of showing the code's intent clearly.

For a std::shared_ptr, std::move unambiguously denotes a transfer of ownership of the pointee, while a simple copy operation adds an additional owner. Of course, if the original owner subsequently relinquishes their ownership (such as by allowing their std::shared_ptr to be destroyed), then a transfer of ownership has been accomplished.

When you transfer ownership with std::move, it's obvious what is happening. If you use a normal copy, it isn't obvious that the intended operation is a transfer until you verify that the original owner immediately relinquishes ownership. As a bonus, a more efficient implementation is possible, since an atomic transfer of ownership can avoid the temporary state where the number of owners has increased by one (and the attendant changes in reference counts).

Multiplicity answered 26/1, 2017 at 18:53 Comment(2)
Exactly what I'm looking for. Surprised how other answers ignore this important semantic difference. the smart pointers is all about ownership.Weathered
I think ownership is especially crucial in lambda notation. Shared ptr capture by reference may not contribute to its reference counter and after the code exits and ptr destroyed you would have lambda with dangling pointer.Purulent
26

Copying a shared_ptr involves copying its two internal pointers (to the owned object and to the control block) and atomically updating the reference count. Moving it only transfers those pointers to the destination and nulls them in the source, with no reference count update, so it's faster.

Cruelty answered 26/1, 2017 at 10:15 Comment(0)
9

Since none of these answers offered an actual benchmark, I thought I'd try to provide one. However, I think I've left myself more confused than when I started. I tried to come up with a test that would measure passing a shared_ptr<int> by value, by reference, and using std::move, performing an add operation on that value, and returning the result. I did this several times (one million) over two sets of tests. The first set added a constant value to the shared_ptr<int>; the other added a random value in the [0, 10] range. I figured the constant-value addition would be a candidate for heavy optimization, whereas the random-value test would not. That is more or less what I saw, but the extreme differences in execution time lead me to believe that other factors/problems with this test program are the contributing factors to the execution time differences, not the move semantics.

tl;dr

For no optimizations (-O0), constant addition

  • std::move was ~4x faster than pass-by-value
  • std::move was marginally slower than pass-by-reference

For high optimizations (-O3), constant addition

  • std::move was 70-90 thousand times faster than pass-by-value
  • std::move was marginally faster than pass-by-reference (anywhere from 1-1.4 times)

For no optimizations (-O0), random addition

  • std::move was 1-2 times faster than pass-by-value
  • std::move was marginally slower than pass-by-reference

For high optimizations (-O3), random addition

  • std::move was 1-1.3 times faster than pass-by-value (marginally worse than no optimizations)
  • std::move was essentially the same as pass-by-reference

Finally, the test

#include <memory>
#include <iostream>
#include <chrono>
#include <ctime>
#include <random>

constexpr auto MAX_NUM_ITS = 1000000;

// using random values to try to cut down on massive compiler optimizations
static std::random_device RAND_DEV;
static std::mt19937 RNG(RAND_DEV());
static std::uniform_int_distribution<std::mt19937::result_type> DIST11(0,10);

void CopyPtr(std::shared_ptr<int> myInt)
{
    // demonstrates that use_count increases with each copy
    std::cout << "In CopyPtr: ref count = " << myInt.use_count() << std::endl;
    std::shared_ptr<int> myCopyInt(myInt);
    std::cout << "In CopyPtr: ref count = " << myCopyInt.use_count() << std::endl;
}

void ReferencePtr(std::shared_ptr<int>& myInt)
{
    // reference count stays the same until a copy is made
    std::cout << "In ReferencePtr: ref count = " << myInt.use_count() << std::endl;
    std::shared_ptr<int> myCopyInt(myInt);
    std::cout << "In ReferencePtr: ref count = " << myCopyInt.use_count() << std::endl;
}

void MovePtr(std::shared_ptr<int>&& myInt)
{
    // demonstrates that use_count remains constant with each move
    std::cout << "In MovePtr: ref count = " << myInt.use_count() << std::endl;
    std::shared_ptr<int> myMovedInt(std::move(myInt));
    std::cout << "In MovePtr: ref count = " << myMovedInt.use_count() << std::endl;
}

int CopyPtrFastConst(std::shared_ptr<int> myInt)
{
    return 5 + *myInt;
}

int ReferencePtrFastConst(std::shared_ptr<int>& myInt)
{
    return 5 + *myInt;
}

int MovePtrFastConst(std::shared_ptr<int>&& myInt)
{
    return 5 + *myInt;
}

int CopyPtrFastRand(std::shared_ptr<int> myInt)
{
    return DIST11(RNG) + *myInt;
}

int ReferencePtrFastRand(std::shared_ptr<int>& myInt)
{
    return DIST11(RNG) + *myInt;
}

int MovePtrFastRand(std::shared_ptr<int>&& myInt)
{
    return DIST11(RNG) + *myInt;
}

void RunConstantFunctions(std::shared_ptr<int> myInt)
{
    std::cout << "\nIn constant funcs, ref count = " << myInt.use_count() << std::endl;
    // demonstrates speed of each function
    int sum = 0;

    // Copy pointer
    auto start = std::chrono::steady_clock::now();
    for (auto i=0; i<MAX_NUM_ITS; i++)
    {
        sum += CopyPtrFastConst(myInt);
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> copyElapsed = end - start;
    std::cout << "CopyPtrConst sum = " << sum << ", took " << copyElapsed.count() << " seconds.\n";

    // pass pointer by reference
    sum = 0;
    start = std::chrono::steady_clock::now();
    for (auto i=0; i<MAX_NUM_ITS; i++)
    {
        sum += ReferencePtrFastConst(myInt);
    }
    end = std::chrono::steady_clock::now();
    std::chrono::duration<double> refElapsed = end - start;
    std::cout << "ReferencePtrConst sum = " << sum << ", took " << refElapsed.count() << " seconds.\n";

    // pass pointer using std::move
    sum = 0;
    start = std::chrono::steady_clock::now();
    for (auto i=0; i<MAX_NUM_ITS; i++)
    {
        sum += MovePtrFastConst(std::move(myInt));
    }
    end = std::chrono::steady_clock::now();
    std::chrono::duration<double> moveElapsed = end - start;
    std::cout << "MovePtrConst sum = " << sum << ", took " << moveElapsed.count() <<
        " seconds.\n";

    std::cout << "std::move vs pass by value: " << copyElapsed / moveElapsed << " times faster.\n";
    std::cout << "std::move vs pass by ref:   " << refElapsed / moveElapsed << " times faster.\n";
}

void RunRandomFunctions(std::shared_ptr<int> myInt)
{
    std::cout << "\nIn random funcs, ref count = " << myInt.use_count() << std::endl;
    // demonstrates speed of each function
    int sum = 0;

    // Copy pointer
    auto start = std::chrono::steady_clock::now();
    for (auto i=0; i<MAX_NUM_ITS; i++)
    {
        sum += CopyPtrFastRand(myInt);
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> copyElapsed = end - start;
    std::cout << "CopyPtrRand sum = " << sum << ", took " << copyElapsed.count() << " seconds.\n";

    // pass pointer by reference
    sum = 0;
    start = std::chrono::steady_clock::now();
    for (auto i=0; i<MAX_NUM_ITS; i++)
    {
        sum += ReferencePtrFastRand(myInt);
    }
    end = std::chrono::steady_clock::now();
    std::chrono::duration<double> refElapsed = end - start;
    std::cout << "ReferencePtrRand sum = " << sum << ", took " << refElapsed.count() << " seconds.\n";

    // pass pointer using std::move
    sum = 0;
    start = std::chrono::steady_clock::now();
    for (auto i=0; i<MAX_NUM_ITS; i++)
    {
        sum += MovePtrFastRand(std::move(myInt));
    }
    end = std::chrono::steady_clock::now();
    std::chrono::duration<double> moveElapsed = end - start;
    std::cout << "MovePtrRand sum = " << sum << ", took " << moveElapsed.count() <<
        " seconds.\n";

    std::cout << "std::move vs pass by value: " << copyElapsed / moveElapsed << " times faster.\n";
    std::cout << "std::move vs pass by ref:   " << refElapsed / moveElapsed << " times faster.\n";
}

int main()
{
    // demonstrates how use counts are affected between copy and move
    std::shared_ptr<int> myInt = std::make_shared<int>(5);
    std::cout << "In main: ref count = " << myInt.use_count() << std::endl;
    CopyPtr(myInt);
    std::cout << "In main: ref count = " << myInt.use_count() << std::endl;
    ReferencePtr(myInt);
    std::cout << "In main: ref count = " << myInt.use_count() << std::endl;
    MovePtr(std::move(myInt));
    std::cout << "In main: ref count = " << myInt.use_count() << std::endl;

    // since myInt was moved to MovePtr and fell out of scope on return (was destroyed),
    // we have to reinitialize myInt
    myInt.reset();
    myInt = std::make_shared<int>(5);

    RunConstantFunctions(myInt);
    RunRandomFunctions(myInt);

    return 0;
}

live version here

I noticed that the constant functions compiled to the same assembly under both -O0 and -O3, both relatively short blocks. This makes me think a majority of the optimization comes from the calling code, but I'm not really able to see that with my amateur assembly knowledge.

The random functions compiled to quite a bit of assembly, even for -O3, so the random part must be dominating that routine.

So in the end, not really sure what to make of this. Please throw darts at it, tell me what I did wrong, offer some explanations.

Bugbee answered 27/7, 2021 at 19:11 Comment(0)
1

At least with libstdc++ you should get the same performance with move and assignment because operator= calls std::move on the incoming pointer. See: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/shared_ptr.h#L384

Menstruate answered 27/4, 2020 at 20:56 Comment(1)
Only if it is a move assignment (RHS is an rvalue), not if it is a copy assignment.Melancholy
1

Unfortunately I did not read @yano's answer, so I did my own benchmark. It's sad that nobody tried to verify the hypotheses around here. My results were similar to yano's, in the sense that the improvement is far from hundreds of times.

On my MacBook Air, move is three times faster (g++ as well as clang++, with -std=c++17 -O3 -DNDEBUG). Let me know if you see problems with the benchmark.

#include <chrono>
#include <iostream>
#include <vector>
#include <memory>
using namespace std;
using namespace std::chrono;


int COUNT = 50'000'000;

struct TimeIt
{
    system_clock::time_point start;
    TimeIt() {
        start = system_clock::now();
    }
    ~TimeIt() {
        auto runtime = duration_cast<milliseconds>(system_clock::now()-start).count();
        cout << runtime << " ms" << endl;
    }

};

void benchmark_copy(const vector<shared_ptr<int>> &vec_src)
{
    cout << "benchmark_copy" << endl;
    vector<shared_ptr<int>> vec_dst;
    vec_dst.reserve(COUNT);
    TimeIt ti;
    for(auto &sp : vec_src)
        vec_dst.emplace_back(sp);
}

void benchmark_move(vector<shared_ptr<int>> &&vec_src)
{
    cout << "benchmark_move" << endl;
    vector<shared_ptr<int>> vec_dst;
    vec_dst.reserve(COUNT);
    TimeIt ti;
    for(auto &sp : vec_src)
        vec_dst.emplace_back(move(sp));

}

int main (int argc, char **argv){

    vector<shared_ptr<int>> vec;
    for (int i = 0; i < COUNT; ++i)
        vec.emplace_back(new int);

    benchmark_copy(vec);
    benchmark_move(move(vec));

}
Lithomarge answered 21/11, 2021 at 14:59 Comment(1)
What this benchmark is missing (that's why you're not seeing a big difference) is multithreaded contention. Try having 16 threads each moving or copying the same shared_ptr (not a vector of distinct ones) many times, in parallel. For example, make a shared_ptr<vector<int>>, put 16 zeroes in the vector, then have the thread action be, in a tail-recursive function called COUNT times, increment the vector at position [thread index]. Making that recursive function copy or move the shared_ptr as appropriate for each benchmark. Uncontended atomic operations aren't so slow.Rung

© 2022 - 2024 — McMap. All rights reserved.