The suggestion that Tomasz Lewowski has brought up in his comment and I have expanded upon is pretty simple and based on the following observation: A push_back on a std::vector potentially needs to re-allocate the backing store and copy (or, preferably, move) the elements. This constitutes a critical section that needs to be synchronized.
For the next examples, assume we want to have a vector filled with the first 12 primes but we don't care about their ordering. (I have just hard-coded the numbers here but assume they are obtained via some expensive computation that makes sense to do in parallel.) There is a dangerous race condition in the following scenario.
std::vector<int> numbers {}; // an empty vector
// thread A // thread B // thread C
numbers.push_back( 2); numbers.push_back(11); numbers.push_back(23);
numbers.push_back( 3);        numbers.push_back(13);        numbers.push_back(29);
numbers.push_back( 5);        numbers.push_back(17);        numbers.push_back(31);
numbers.push_back( 7);        numbers.push_back(19);        numbers.push_back(37);
There is also another problem with push_back: if two threads call it simultaneously, they will both attempt to construct an object at the same index, with potentially disastrous consequences. So the problem is not solved by calling reserve(n) before forking the threads.
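For reference, the straightforward fix is to protect each push_back with a mutex. A minimal sketch:

#include <mutex>
#include <vector>

std::vector<int> numbers {};
std::mutex numbers_mutex {};

// Serializes the critical section (the possible re-allocation and the
// construction of the new element).
void
append(const int value)
{
    const std::lock_guard<std::mutex> guard {numbers_mutex};
    numbers.push_back(value);
}

This is correct but serializes every append, so the threads spend their time waiting on each other rather than computing.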
However, since you know the number of elements in advance, you can simply assign them to specific locations inside a std::vector without changing its size. If you don't change the size, there is no critical section, and therefore there is no race in the following scenario.
std::vector<int> numbers(12); // 12 elements initialized with 0
// thread A // thread B // thread C
numbers[ 0] = 2; numbers[ 1] = 3; numbers[ 2] = 5;
numbers[ 3] = 7; numbers[ 4] = 11; numbers[ 5] = 13;
numbers[ 6] = 17; numbers[ 7] = 19; numbers[ 8] = 23;
numbers[ 9] = 29; numbers[10] = 31; numbers[11] = 37;
Of course, if two threads attempt to write to the same index, the race is back. Fortunately, protecting against this is not difficult in practice. If your vector has n elements and you have p threads, thread i writes only to the elements in the half-open range [i n / p, (i + 1) n / p). Note that this is preferable to having thread i write to element j only if j mod p = i: with the interleaved pattern, elements written by different threads are adjacent in memory and therefore share cache lines, which leads to more cache invalidations (false sharing). So the access pattern in the above example is sub-optimal and would better look like this (a runnable sketch of the partitioning follows the example).
std::vector<int> numbers(12); // 12 elements initialized with 0
// thread A // thread B // thread C
numbers[ 0] = 2; numbers[ 4] = 11; numbers[ 8] = 23;
numbers[ 1] = 3; numbers[ 5] = 13; numbers[ 9] = 29;
numbers[ 2] = 5; numbers[ 6] = 17; numbers[10] = 31;
numbers[ 3] = 7; numbers[ 7] = 19; numbers[11] = 37;
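Putting this partitioning into actual threaded code, a minimal sketch might look as follows. The helper expensive_computation is hypothetical and merely stands in for whatever per-element work you actually do.

#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical placeholder for the expensive per-element work.
int expensive_computation(const std::size_t i)
{
    return static_cast<int>(i);
}

void
fill_in_parallel(std::vector<int>& numbers, const std::size_t p)
{
    const std::size_t n = numbers.size();
    std::vector<std::thread> workers {};
    for (std::size_t i = 0; i < p; ++i)
    {
        // Thread i writes only to the contiguous range [i n / p, (i + 1) n / p).
        workers.emplace_back([&numbers, n, p, i]() {
            for (std::size_t j = i * n / p; j < (i + 1) * n / p; ++j)
                numbers[j] = expensive_computation(j);
        });
    }
    for (auto& worker : workers)
        worker.join();
}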
So far so good. But what if you don't have a std::vector<int> but a std::vector<Foo>? If Foo does not have a default constructor, then

std::vector<Foo> numbers(10);

will be invalid. And even if it has one, it would be outrageous to create many expensive default-constructed objects only to re-assign them soon afterwards without ever retrieving the value.
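For concreteness, here is a hypothetical Foo for which this fails to compile:

struct Foo
{
    explicit Foo(int x) : x_ {x} {}  // the only constructor; no default
    int x_;
};

std::vector<Foo> numbers(10);  // error: Foo is not default-constructible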
Of course, most well-designed classes should have a very cheap default constructor. For example, a std::string is default-constructed to an empty string, which requires no memory allocation. A good implementation will reduce the cost of default-constructing a string to just

std::memset(this, 0, sizeof(std::string));

And if the compiler is smart enough to figure out that we are allocating and initializing an entire std::vector<std::string>(n), it might be able to optimize this further into a single call to

std::calloc(n, sizeof(std::string));
So if there is any chance you can make Foo cheaply default-constructible and assignable, you are done. However, if this turns out to be difficult, you can avoid the problem by moving it to the heap. A smart pointer is cheaply default-constructible, so

std::vector<std::unique_ptr<Foo>> foos(n);

will eventually reduce to a

std::calloc(n, sizeof(std::unique_ptr<Foo>));

without you doing anything to Foo. Of course, this convenience comes at the price of a dynamic memory allocation for each element.
std::vector<std::unique_ptr<Foo>> foos(n);
// thread A // thread B // thread C
foos[0].reset(new Foo {...}); foos[n / 3 + 0].reset(new Foo {...}); foos[2 * n / 3 + 0].reset(new Foo {...});
foos[1].reset(new Foo {...}); foos[n / 3 + 1].reset(new Foo {...}); foos[2 * n / 3 + 1].reset(new Foo {...});
foos[2].reset(new Foo {...}); foos[n / 3 + 2].reset(new Foo {...}); foos[2 * n / 3 + 2].reset(new Foo {...});
... ... ...
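As a runnable sketch of this pattern (using std::make_unique from C++14 rather than reset; the Foo shown is again hypothetical):

#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

struct Foo
{
    explicit Foo(int x) : x_ {x} {}  // stand-in for an expensive type without a default constructor
    int x_;
};

void
fill_in_parallel(std::vector<std::unique_ptr<Foo>>& foos, const std::size_t p)
{
    const std::size_t n = foos.size();
    std::vector<std::thread> workers {};
    for (std::size_t i = 0; i < p; ++i)
        workers.emplace_back([&foos, n, p, i]() {
            // Each thread constructs its own elements on the heap and
            // assigns the owning pointers into its private index range.
            for (std::size_t j = i * n / p; j < (i + 1) * n / p; ++j)
                foos[j] = std::make_unique<Foo>(static_cast<int>(j));
        });
    for (auto& worker : workers)
        worker.join();
}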
This might not be as bad as you think: while dynamic memory allocations are not free, the sizeof a std::unique_ptr is very small, so if sizeof(Foo) is large, you get the bonus of a more compact vector that is faster to iterate. It all depends, of course, on how you intend to use your data.
If you don't know the exact number of elements in advance or are afraid you'll mess up the indexing, there is yet another way to do it: Have each thread fill its own vector and merge them at the end. Continuing the primes example, we get this.
std::vector<int> numbersA {}; // private store for thread A
std::vector<int> numbersB {}; // private store for thread B
std::vector<int> numbersC {}; // private store for thread C
// thread A // thread B // thread C
numbersA.push_back( 2); numbersB.push_back(11); numbersC.push_back(23);
numbersA.push_back( 3);       numbersB.push_back(13);       numbersC.push_back(29);
numbersA.push_back( 5);       numbersB.push_back(17);       numbersC.push_back(31);
numbersA.push_back( 7);       numbersB.push_back(19);       numbersC.push_back(37);
// Back on the main thread after A, B and C are joined:
std::vector<int> numbers(
numbersA.size() + numbersB.size() + numbersC.size());
auto pos = numbers.begin();
pos = std::move(numbersA.begin(), numbersA.end(), pos);
pos = std::move(numbersB.begin(), numbersB.end(), pos);
pos = std::move(numbersC.begin(), numbersC.end(), pos);
assert(pos == numbers.end());
// Now dispose of numbersA, numbersB and numbersC as soon as possible
// in order to release their no longer needed memory.
(The std::move used in the above code is the three-argument overload from the algorithms library in <algorithm>, not the single-argument cast from <utility>.)
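The same merge generalizes to any number of threads. A minimal sketch, assuming each thread has filled its own std::vector<int>:

#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<int>
merge_results(std::vector<std::vector<int>>& partial)
{
    std::size_t total = 0;
    for (const auto& chunk : partial)
        total += chunk.size();
    std::vector<int> merged(total);
    auto pos = merged.begin();
    for (auto& chunk : partial)
        pos = std::move(chunk.begin(), chunk.end(), pos);  // the <algorithm> overload
    partial.clear();  // release the no longer needed buffers
    return merged;
}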
This approach has the most desirable memory access pattern of all because numbersA, numbersB and numbersC are completely independently allocated, so the threads never contend for the same cache lines. Of course, we have to do the additional sequential work of joining the intermediate results. Note that the efficiency relies heavily on the fact that the cost of move-assigning an element is negligible compared to the cost of finding / creating it. At least as written above, the code also assumes that your type has a cheap default constructor. If this is not the case for your type, you can again use smart pointers.
I hope this provided you with enough ideas to optimize your problem.
If you have never used smart pointers before, have a look at “RAII and smart pointers in C++” and check out the standard library's dynamic memory management facilities. The techniques shown above would of course also work with a std::vector<Foo *>, but we no longer use resource-owning raw pointers like this in modern C++.