Synchronise push_back and std::thread

My code:

void build(std::vector<RKD<DivisionSpace> >& roots, ...) {
  try {
    // using a local lock_guard to lock mtx guarantees unlocking on destruction / exception:
    std::lock_guard<std::mutex> lck (mtx);
    roots.push_back(RKD<DivisionSpace>(...));
  }
  catch (const std::bad_alloc&) {
    std::cout << "[exception caught when constructing tree]\n";
    return;
  }
}

As it stands, though, the actual work is done serially, not in parallel: the lock is held while each RKD is constructed.

The constructor of RKD can run in parallel with other RKD constructors. However, pushing the objects back into the std::vector is a critical section, right?

The number of objects I am going to build is known. In practice it will be in the range [2, 16]; in theory it could be any positive number.

Also, I am not interested in the order in which they will be inserted into the container.

So I could do something like:

RKD tree = RKD(...);                    // construct outside the lock
std::lock_guard<std::mutex> lck(mtx);   // then lock the mutex
roots.push_back(tree);                  // and insert

However, this would imply copying, wouldn't it?

What should I do to make my code parallel?


I decided to use a std::lock_guard (instead of locking the mutex manually) because of this answer.

Innate answered 11/1, 2015 at 13:30 Comment(13)
You're updating a shared resource (your std::vector), which is not thread-safe, so locking is mandated. Depending on the complexity of RKD::RKD you may also squeeze some performance out of RKD rkd = RKD(...); outside the lock section and roots.emplace_back(std::move(rkd)) inside your lock section, but only if your class is coded to utilize move semantics (worth researching if RKD::RKD really is as expensive as you seem to indicate).Formulate
So with the lock, I am OK for my critical section. Why do you say RKD::RKD, @WhozCraig? You mean the constructor? I was talking about the class; shouldn't that be the criterion, since the copy constructor is going to be called as I have it?Innate
I see what you mean with emplace_back(), and it seems like the way to go. Are you going to answer so that I can accept it?Innate
I didn't really answer the question. True parallelism is only plausible with a lockless container (and it is only worth investigating if you have seriously confirmed this lock is a problem, as it is not a trivial subject and very easy to get wrong). I was just mentioning the fact that you can avoid copy-ctor execution if you set up your object to be move-friendly (and btw, make sure your move-ctor is noexcept).Formulate
"and btw, make sure your move-ctor is noexcept". What does that mean @WhozCraig? If I move the construction before the lock, then the construction procedure of the objects is going to be executed in parallel and only the insertion in the container will be executed as a critical section. Do I see it wrong? Sorry if I am seeing silly, it's the first time I actually use std::thread.Innate
If you know from the beginning how many RKDs you need, RKD is trivially constructible and its construction has no side effects, you may allocate the vector earlier and then run either in-place construction or even a copy to the specified memory location - parallel access to the vector's elements does not require locking; a lock is needed only in case of resizing (or if you write to the same memory from more than one thread)Flowering
@TomaszLewowski yes, I know the number of the RKDs. Resizing the vector means that some default constructor is going to be called for every object though. However, it sounds interesting. Maybe you can analyse further?Innate
I think @TomaszLewowski's suggestion is the way to go. And if your objects don't fulfill said requirements, you might consider using a std::vector<std::unique_ptr<RKD>>. But if you are the author of RKD, it might be better to give move semantics to it.Lolland
I am the author @5gon12eder. Your comment looks smart, but I haven't used unique_ptr before and I am not sure how to give move semantics to my class. If you could expand on your comment, in an answer maybe, that would be fantastic.Innate
Perhaps stating the obvious, but if constructing an RKD is relatively fast, and there are a relatively small number of them, then a multithreaded solution will be slower because of the overhead of thread creation/join and of locking, if you decide you need locking. If multithreading makes sense in your case, and sizeof(RKD) is small relative to a cache line, then consider having each thread construct sequential elements of the vector instead of interleaving them, to avoid false sharing en.wikipedia.org/wiki/False_sharing.Impuissant
@Impuissant this is what 5gon12eder says in his answer, right?Innate
@G.Samaras yes, 5gon12eder's answer specifically shows an interleaved solution and states that it is suboptimal because of cache invalidation (without using the term false sharing), and then shows a sequentially partitioned solution... I had missed that in the answer at first glance. The question of whether your code would benefit from parallel construction is still for you to answer. Will it be faster? Does it matter? Is the extra complexity warranted? Only you can answer that. If nothing else it's a great opportunity to learn, and there's a lot to be said for that.Impuissant
@Impuissant yes, theoretically there is a lot to gain through parallel execution, but in practice, since my objects are not many, I am afraid not - still, it is worth a try!Innate

The suggestion that Tomasz Lewowski has brought up in his comment and I have expanded upon is pretty simple and based upon the following observation: A push_back on a std::vector potentially needs to re-allocate the backing store and copy (or, preferably, move) the elements. This constitutes a critical section that needs to be synchronized.
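
(For reference, here is a minimal sketch of that synchronized push_back, with the expensive construction hoisted out of the locked region. Widget, make_one and worker are made-up stand-ins, not names from the question.)

#include <mutex>
#include <thread>
#include <vector>

struct Widget { int value; };

std::mutex mtx;                  // protects `results`
std::vector<Widget> results;

Widget make_one(int i) {         // stand-in for an expensive computation
    return Widget{i * i};
}

void worker(int i) {
    Widget w = make_one(i);                // outside the lock: runs concurrently
    std::lock_guard<std::mutex> lock(mtx);
    results.push_back(std::move(w));       // inside the lock: serialized
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker, i);
    for (auto& t : threads)
        t.join();
}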

For the next examples, assume we want to have a vector filled with the first 12 primes but we don't care about their ordering. (I have just hard-coded the numbers here but assume they are obtained via some expensive computation that makes sense to do in parallel.) There is a dangerous race condition in the following scenario.

std::vector<int> numbers {};  // an empty vector

// thread A             // thread B             // thread C

numbers.push_back( 2);  numbers.push_back(11);  numbers.push_back(23);
numbers.push_back( 3);  numbers.push_back(13);  numbers.push_back(29);
numbers.push_back( 5);  numbers.push_back(17);  numbers.push_back(31);
numbers.push_back( 7);  numbers.push_back(19);  numbers.push_back(37);

There is also another problem with push_back even when no re-allocation happens: it has to read and update the vector's size, so if two threads call it simultaneously, they may both read the same old size and attempt to construct an object at the same index, with potentially disastrous consequences. So the problem is not solved by calling reserve(n) before forking the threads.

However, since you know the number of elements in advance, you can simply assign them to a specific location inside a std::vector without changing its size. If you don't change the size, there is no critical section. Therefore, there is no race in the following scenario.

std::vector<int> numbers(12);  // 12 elements initialized with 0

// thread A          // thread B          // thread C

numbers[ 0] =  2;    numbers[ 1] =  3;    numbers[ 2] =  5;
numbers[ 3] =  7;    numbers[ 4] = 11;    numbers[ 5] = 13;
numbers[ 6] = 17;    numbers[ 7] = 19;    numbers[ 8] = 23;
numbers[ 9] = 29;    numbers[10] = 31;    numbers[11] = 37;

Of course, if both threads attempt to write to the same index, the race will be there again. Fortunately, protecting against this is not difficult in practice. If your vector has n elements and you have p threads, thread i writes only to the elements in the range [i·n/p, (i+1)·n/p). Note that this is preferable to having thread i write to element j only if j mod p = i, because contiguous ranges lead to fewer cache invalidations (false sharing). By that measure, the access pattern in the above example is sub-optimal; it would have been better like this:

std::vector<int> numbers(12);  // 12 elements initialized with 0

// thread A          // thread B          // thread C

numbers[ 0] =  2;    numbers[ 4] = 11;    numbers[ 8] = 23;
numbers[ 1] =  3;    numbers[ 5] = 13;    numbers[ 9] = 29;
numbers[ 2] =  5;    numbers[ 6] = 17;    numbers[10] = 31;
numbers[ 3] =  7;    numbers[ 7] = 19;    numbers[11] = 37;
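
Generalizing this partitioning to arbitrary n and p with real std::thread objects could look like the following sketch (compute is a placeholder for whatever expensive work produces element i; n and p are chosen arbitrarily here):

#include <thread>
#include <vector>

int compute(int i) { return 2 * i + 1; }  // placeholder for the expensive per-element work

int main() {
    const int n = 12;  // number of elements, known up front
    const int p = 3;   // number of threads
    std::vector<int> numbers(n);
    std::vector<std::thread> threads;
    for (int t = 0; t < p; ++t)
        threads.emplace_back([&numbers, n, p, t] {
            // thread t writes only the contiguous range [t*n/p, (t+1)*n/p)
            for (int i = t * n / p; i < (t + 1) * n / p; ++i)
                numbers[i] = compute(i);
        });
    for (auto& th : threads)
        th.join();
}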

So far so good. But what if you don't have a std::vector<int> but a std::vector<Foo>? If Foo does not have a default constructor, then

std::vector<Foo> numbers(10);

will not compile. And even if it has one, it would be outrageous to create many expensive default-constructed objects just to re-assign them soon afterwards without ever reading their values.

Of course, most well-designed classes should have a very cheap default constructor. For example, a std::string is default constructed to an empty string that requires no memory allocation. A good implementation will reduce the cost of default-constructing a string to just

std::memset(this, 0, sizeof(std::string));

And if the compiler is smart enough to figure out that we are allocating and initializing an entire std::vector<std::string>(n), it might be able to optimize this further to a single call to

std::calloc(n, sizeof(std::string));

So if there is any chance you can make Foo be cheaply default-constructible and assignable, you are done. However, if this turns out to be difficult, you can avoid the problem by moving it to the heap. A smart pointer is cheaply default-constructible, so

std::vector<std::unique_ptr<Foo>> foos(n);

will eventually reduce to a

std::calloc(n, sizeof(std::unique_ptr<Foo>));

without you doing anything to Foo. Of course, this convenience comes at the price of a dynamic memory allocation for each element.

std::vector<std::unique_ptr<Foo>> foos(n);

// thread A                    // thread B                           // thread C

foos[0].reset(new Foo {...});  foos[n / 3 + 0].reset(new Foo {...});  foos[2 * n / 3 + 0].reset(new Foo {...});
foos[1].reset(new Foo {...});  foos[n / 3 + 1].reset(new Foo {...});  foos[2 * n / 3 + 1].reset(new Foo {...});
foos[2].reset(new Foo {...});  foos[n / 3 + 2].reset(new Foo {...});  foos[2 * n / 3 + 2].reset(new Foo {...});
...                            ...                                    ...

This might not be as bad as you might think, because while dynamic memory allocations are not free, the sizeof a std::unique_ptr is very small, so if sizeof(Foo) is large, you get the bonus of a more compact vector that is faster to iterate. It all depends, of course, on how you intend to use your data.
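
As a side note, if C++14 is available, the reset(new Foo {...}) calls can also be written as assignments from std::make_unique. A minimal single-threaded sketch (this Foo with an int constructor is made up for illustration; each thread would fill its own index range exactly as above):

#include <memory>
#include <vector>

struct Foo {
    explicit Foo(int x) : x_(x) {}  // note: no default constructor required
    int x_;
};

int main() {
    const int n = 12;
    std::vector<std::unique_ptr<Foo>> foos(n);  // n null pointers; cheap to create
    for (int i = 0; i < n; ++i)
        foos[i] = std::make_unique<Foo>(i);     // requires C++14
}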

If you don't know the exact number of elements in advance or are afraid you'll mess up the indexing, there is yet another way to do it: Have each thread fill its own vector and merge them at the end. Continuing the primes example, we get this.

std::vector<int> numbersA {};  // private store for thread A
std::vector<int> numbersB {};  // private store for thread B
std::vector<int> numbersC {};  // private store for thread C

// thread A              // thread B              // thread C

numbersA.push_back( 2);  numbersB.push_back(11);  numbersC.push_back(23);
numbersA.push_back( 3);  numbersB.push_back(13);  numbersC.push_back(29);
numbersA.push_back( 5);  numbersB.push_back(17);  numbersC.push_back(31);
numbersA.push_back( 7);  numbersB.push_back(19);  numbersC.push_back(37);

// Back on the main thread after A, B and C are joined:

std::vector<int> numbers(
    numbersA.size() + numbersB.size() + numbersC.size());
auto pos = numbers.begin();
pos = std::move(numbersA.begin(), numbersA.end(), pos);
pos = std::move(numbersB.begin(), numbersB.end(), pos);
pos = std::move(numbersC.begin(), numbersC.end(), pos);
assert(pos == numbers.end());

// Now dispose of numbersA, numbersB and numbersC as soon as possible
// in order to release their no longer needed memory.

(The std::move used in the above code is the one from the algorithms library.)

This approach has the most desirable memory access pattern of all because numbersA, numbersB and numbersC write to completely independently allocated memory. Of course, we have to do the additional sequential work of joining the intermediate results. Note that the efficiency relies heavily on the fact that the cost of move-assigning an element is negligible compared to the cost of finding / creating it. At least as written above, the code also assumes that your type has a cheap default constructor. Of course, if this is not the case for your type, you can again use smart pointers.
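
For completeness, here is a self-contained sketch of this merge approach with real threads. The names expensive_compute, p and per_thread are made up, and it uses reserve() followed by insert() with move iterators instead of a pre-sized vector and the std::move algorithm, which sidesteps the cheap-default-constructor assumption just mentioned:

#include <cassert>
#include <cstddef>
#include <iterator>
#include <thread>
#include <vector>

int expensive_compute(int i) { return 2 * i + 1; }  // placeholder for the real work

int main() {
    const int p = 3;                           // number of threads
    const int per_thread = 4;                  // elements produced per thread
    std::vector<std::vector<int>> partial(p);  // one private vector per thread

    std::vector<std::thread> threads;
    for (int t = 0; t < p; ++t)
        threads.emplace_back([&partial, t, per_thread] {
            // Each thread only touches its own vector, so no locking is needed.
            for (int k = 0; k < per_thread; ++k)
                partial[t].push_back(expensive_compute(t * per_thread + k));
        });
    for (auto& th : threads)
        th.join();

    // Merge on the main thread after all workers are joined.
    std::size_t total = 0;
    for (const auto& v : partial)
        total += v.size();
    std::vector<int> numbers;
    numbers.reserve(total);
    for (auto& v : partial)
        numbers.insert(numbers.end(),
                       std::make_move_iterator(v.begin()),
                       std::make_move_iterator(v.end()));
    assert(numbers.size() == total);
}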

I hope this provided you with enough ideas to optimize your problem.

If you have never used smart pointers before, have a look at “RAII and smart pointers in C++” and check out the standard library's dynamic memory management library. The techniques shown above would of course also work with a std::vector<Foo *>, but we don't use resource-owning raw pointers like this in modern C++ any more.

Lolland answered 11/1, 2015 at 19:33 Comment(10)
Nice answer, will upvote it for sure (limit reached). What do you think about sehe's answer in comparison to yours?Innate
Well, it's more low-level, so you'll be more likely to shoot yourself in the foot, but it might give you some extra bits of performance. Anyway, the differences are small, so you can give the simpler std::vector approach a try and, if it doesn't meet your expectations on performance, switch to raw memory. You won't do much extra work this way because the logic of which thread inserts into which position doesn't change. Be sure to measure the performance.Lolland
I see. About the parameters for the index, aren't p and n the same in value?Innate
That's your decision; p = n would be a special case. As written, I've just used n as the number of objects you want to create and p as the number of threads you want to use to do so. So pick your p ranging from 1 to n. It's an engineering decision you'll have to make. If in doubt, maybe use p = min(n, P) where P is the number of CPUs you have (std::thread::hardware_concurrency), or let the user select it via an option.Lolland
Just a word of caution - be aware of what false sharing is, because you may be very badly surprised. Just because it is safe to assign to vec[1] and vec[2] from different threads, it doesn't mean it is efficient to do so or that no locking is going on. Think about how two cores writing to the same cache line ensure coherent results... (I now see this was mentioned in an answer below; sorry for the noise)Earache
@Lolland thanks for the tip. However, I am having a problem with indexing. Can you explain how it goes in your example? Also, notice the edit in my question about the number of objects I am going to construct. Does that affect your answer somehow? Stefan no need to apologize, if you have something to say, you should say it!;)Innate
@G.Samaras I'm not exactly sure what is unclear to you, but I have revised my examples to be hopefully clearer now. I have also added yet another technique (merging individual vectors) that might be interesting for you. I could add complete compilable examples if you wish, but I've assumed the given examples map pretty straightforwardly to real code and are easier to understand because they have less clutter.Lolland
Indexing is clear now. I like the added method; for sure I will compare them both. However, that method too should use unique_ptr in order to avoid initialization, right? If so, the final vector will end up holding those same unique pointers, right? So what should we do there? Reserve enough space and then copy from the sub-vectors, or do it the way you have it in your answer?Innate
If your default constructor does not exist, involves a lot of work, or your type is not movable, then yes, you should use smart pointers (or change your type). std::unique_ptr is movable, so you can directly use the method shown above. A moved-from std::unique_ptr is like a default-constructed one: it manages no object.Lolland
I accepted your answer, but the other one is good too. I made a follow-up question here: #27974106. You might find it interesting.Innate

The problem appears to be that your constructor is doing a lot of work, and this breaks all kinds of library conventions around construction and container insertion.

Just fix it by decoupling the insertion from the creation.

The below code is very similar to the code suggested by @5gon12eder except that it doesn't "force" you to change object locality.

In my little demo

  • we use a raw region of memory that's truly uninitialized (this is not possible with vector, where insertion implies initialization), so instead of the "canonical"

    std::array<RKD, 500> rkd_buffer; // OR
    std::vector<RKD> rkd_buffer(500); // OR even
    std::unique_ptr<RKD[]> rkd_buffer(new RKD[500]);
    

    we're going to use a custom combination:

    std::unique_ptr<RKD[N], decltype(&::free)> rkd_buffer(
        static_cast<RKD(*)[N]>(::malloc(sizeof(RKD) * N)),
        ::free
    );
    
  • we then create a few threads (5 in the sample) to construct all the elements. The items are just constructed in place; since we took responsibility for construction, we are also responsible for invoking their destructors manually (see the loop at the end of main)

  • it is, therefore, crucial that all items have been fully initialized before rkd_buffer goes out of scope (the join ensures this here).
  • the threads could synchronize by different means: constructions could e.g. be dispatched via a work queue to a thread pool, where either condition variables, promises, thread barriers (from boost) or even just atomic shared counters could be used for the coordination.

    All these choices are in essence unrelated to the task of getting construction to run in parallel, so I'll leave that to your imagination (or other SO answers)
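
For a flavour of one of those options, coordination via an atomic shared counter might look like this sketch (names and element type are made up; each worker simply claims the next free index until all slots are done):

#include <atomic>
#include <functional>
#include <thread>
#include <vector>

const int N = 500;
std::atomic<int> next_index{0};  // the shared counter handing out slots

void worker(std::vector<int>& slots) {
    // Claim indices one at a time until all N slots have been handed out.
    for (int i; (i = next_index.fetch_add(1)) < N; )
        slots[i] = i * i;        // stand-in for constructing element i in place
}

int main() {
    std::vector<int> slots(N);
    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t)
        pool.emplace_back(worker, std::ref(slots));
    for (auto& t : pool)
        t.join();
}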

Live On Coliru

#include <chrono>
#include <cstdlib>   // malloc, free, rand
#include <memory>
#include <new>       // placement new
#include <thread>
#include <vector>

using namespace std;

struct RKD {
    RKD() { this_thread::sleep_for(chrono::milliseconds(rand() % 100)); } // expensive
};

int main() {
    static const int N         = 500;
    static const int ChunkSize = 100;
    std::unique_ptr<RKD[N], decltype(&::free)> rkd_buffer(static_cast<RKD(*)[N]>(::malloc(sizeof(RKD) * N)), ::free);

    vector<thread> group;
    for (int chunk = 0; chunk < N/ChunkSize; ++chunk)   // one thread per chunk of 100
        group.emplace_back([&rkd_buffer, chunk] {       // capture chunk by value!
            for (int i = chunk * ChunkSize; i < (chunk + 1) * ChunkSize; ++i)
                new (*rkd_buffer + i) RKD;              // placement-new into the raw buffer
        });

    for (auto& t : group) if (t.joinable()) t.join();

    // we are responsible for destructing, since we also took responsibility for construction
    for (RKD& v : *rkd_buffer)
        v.~RKD();
}

You can see that there are 5 threads dividing the 500 constructions among them. Each construction takes ~50ms on average, so each thread performs 100 of them and the total time taken should be about 100 * 50ms ~= 5s. This is in fact precisely what happens:

real    0m5.193s
user    0m0.004s
sys 0m0.000s
Hysterical answered 11/1, 2015 at 21:2 Comment(11)
Is your answer (which I thank you for taking the time to compose; I will upvote when I can) in any way related to what amdn posted as a comment under my question (the last comment)? Also, could you please explain what we do in std::unique_ptr<RKD[N], dec...? I also see malloc and free, instead of the C++ features (which actually call the C ones under the hood..). I could just copy the code, but I would like to learn too. I mean, because of this question I learned about move semantics, the copy-and-swap idiom and many more things! Can I get greedy and learn one (or two) more? :DInnate
It's creating a smart pointer with unique ownership semantics and a custom deleter. That's covered in other answers.Hysterical
And no I didn't specifically refer to false sharing, although the loop chunking or scheduling was done with such performance concerns in mindHysterical
@G.Samaras, I agree that this design avoids false sharing and have upvoted.Impuissant
@G.Samaras Indeed, I forgot to re-add destruction yesterday. Fixed that now. Disclaimer: this code makes sense as a performance optimization only. Consider reorganizing your code, as dealing with uninitialized buffers is error prone. Perhaps, instead, just use std::vector<boost::optional<RKD> >. Profile before you optimize!Hysterical
"Profile before you optimize!" -> a must! Can you please link me to some of these answers you say that cover the dense line of code. I do not want to use boost. Did the edit of my question regarding the number of objects affect your answer?Innate
#8274951, codereview.stackexchange.com/questions/4679/…, #19053851, oh well. Lots of options via google. And there's c++-faq tag (Which kind of smart pointer do I use when?)Hysterical
@G.Samaras No, the number doesn't affect my answer. YES, it affects your profiling results. Unless your constructors are completely unwieldy (can you just fix that?) you will see no benefit from making construction parallel.Hysterical
Thanks for the links, I will study them. By unwieldy you mean that my class lacks a move constructor?Innate
@G.Samaras No, I mean that apparently the constructor is doing so much work that standard library patterns don't scale well (this is the source of your issues!). I'd say use boost::optional, a wrapper that lazily constructs or whatever to untie that knot.Hysterical
I did not accept your answer, but I upvoted it. The other one is more introductory and easier for a beginner to digest. However, in your comments there was the most clever thing mentioned in this post ;) . I made a follow-up question here: #27974106. You might find it interesting.Innate
