I am trying to use tbb to multi-thread an existing recursive algorithm. The single-thread version uses tail-call recursion, structurally it looks something like this:
void my_func() {
my_recusive_func (0);
}
bool doSomeWork (int i, int& a, int& b, int& c) {
// do some work
}
void my_recusive_func (int i) {
int a, b, c;
bool notDone = doSomeWork (i, a, b, c);
if (notDone) {
my_recusive_func (a);
my_recusive_func (b);
my_recusive_func (c);
}
}
I am a tbb novice so my first attempt used the parallel_invoke function:
void my_recusive_func (int i) {
int a, b, c;
bool notDone = doSomeWork (i, a, b, c);
if (notDone) {
tbb::parallel_invoke (
[a]{my_recusive_func (a);},
[b]{my_recusive_func (b);},
[c]{my_recusive_func (c);});
}
}
This does work and it runs faster than the single-threaded version but it doesn't seem to scale well with number of cores. The machine I'm targeting has 16 cores (32 hyper-threads) so scalability is very important for this project, but this version only gets about 8 times speedup at best on that machine and many cores seem idle while the algorithm is running.
My theory is that tbb is waiting for the child tasks to complete after the parallel_invoke so there may be many tasks sitting around idle waiting unnecessarily? Would this explain the idle cores? Is there any way to get the parent task to return without waiting for the children? I was thinking perhaps something like this but I don't know enough about the scheduler yet to know if this is OK or not:
void my_func()
{
tbb::task_group g;
my_recusive_func (0, g);
g.wait();
}
void my_recusive_func (int i, tbb::task_group& g) {
int a, b, c;
bool notDone = doSomeWork (i, a, b, c);
if (notDone) {
g.run([a,&g]{my_recusive_func(a, g);});
g.run([b,&g]{my_recusive_func(b, g);});
my_recusive_func (c, g);
}
}
My first question is is tbb::task_group::run()
thread-safe? I couldn't figure that out from the documentation. Also, is there better way to go about this? Perhaps I should be using the low-level scheduler calls instead?
(I typed this code without compiling so please forgive typos.)
my_recusive_func(i+1); my_recusive_func(i+2); my_recusive_func(i+3);
The parallel version callsmy_recusive_func(i+1)
three times since it capturesi
by value, and shouldn't even compile because you can't modify captured values in the lambda body unless the lambda is declaredmutable
. – Ioves